Optimization for Learning and Control
Anders Hansson
Linköping University
Linköping
Sweden
Martin Andersen
Technical University of Denmark
Kongens Lyngby
Denmark
Copyright © 2023 by John Wiley & Sons, Inc. All rights reserved.
No part of this publication may be reproduced, stored in a retrieval system, or transmitted in any form or by any
means, electronic, mechanical, photocopying, recording, scanning, or otherwise, except as permitted under Section
107 or 108 of the 1976 United States Copyright Act, without either the prior written permission of the Publisher, or
authorization through payment of the appropriate per-copy fee to the Copyright Clearance Center, Inc.,
222 Rosewood Drive, Danvers, MA 01923, (978) 750-8400, fax (978) 750-4470, or on the web at www.copyright.com.
Requests to the Publisher for permission should be addressed to the Permissions Department, John Wiley & Sons,
Inc., 111 River Street, Hoboken, NJ 07030, (201) 748-6011, fax (201) 748-6008, or online at http://www.wiley.com/
go/permission.
Trademarks: Wiley and the Wiley logo are trademarks or registered trademarks of John Wiley & Sons, Inc. and/or
its affiliates in the United States and other countries and may not be used without written permission. All other
trademarks are the property of their respective owners. John Wiley & Sons, Inc. is not associated with any product
or vendor mentioned in this book.
Limit of Liability/Disclaimer of Warranty: While the publisher and author have used their best efforts in preparing
this book, they make no representations or warranties with respect to the accuracy or completeness of the contents
of this book and specifically disclaim any implied warranties of merchantability or fitness for a particular purpose.
No warranty may be created or extended by sales representatives or written sales materials. The advice and strategies
contained herein may not be suitable for your situation. You should consult with a professional where appropriate.
Further, readers should be aware that websites listed in this work may have changed or disappeared between when
this work was written and when it is read. Neither the publisher nor authors shall be liable for any loss of profit or
any other commercial damages, including but not limited to special, incidental, consequential, or other damages.
For general information on our other products and services or for technical support, please contact our Customer
Care Department within the United States at (800) 762-2974, outside the United States at (317) 572-3993 or fax (317)
572-4002.
Wiley also publishes its books in a variety of electronic formats. Some content that appears in print may not be
available in electronic formats. For more information about Wiley products, visit our web site at www.wiley.com.
Contents
Preface xvii
Acknowledgments xix
Glossary xxi
Acronyms xxv
About the Companion Website xxvii
1 Introduction 3
1.1 Optimization 3
1.2 Unsupervised Learning 3
1.3 Supervised Learning 4
1.4 System Identification 4
1.5 Control 5
1.6 Reinforcement Learning 5
1.7 Outline 5
2 Linear Algebra 7
2.1 Vectors and Matrices 7
2.2 Linear Maps and Subspaces 10
2.2.1 Four Fundamental Subspaces 10
2.2.2 Square Matrices 12
2.2.3 Affine Sets 13
2.3 Norms 13
2.4 Algorithm Complexity 15
2.5 Matrices with Structure 16
2.5.1 Diagonal Matrices 16
2.5.2 Orthogonal Matrices 17
2.5.3 Triangular Matrices 18
2.5.4 Symmetric and Skew-Symmetric Matrices 19
2.5.5 Toeplitz and Hankel Matrices 19
2.5.6 Sparse Matrices 20
3 Probability Theory 40
3.1 Probability Spaces 40
3.1.1 Probability Measure 41
3.1.2 Probability Function 41
3.1.3 Probability Density Function 42
3.2 Conditional Probability 42
3.3 Independence 44
3.4 Random Variables 44
3.4.1 Vector-Valued Random Variable 45
3.4.2 Marginal Distribution 45
3.4.3 Independence of Random Variables 46
3.4.4 Function of Random Variable 46
3.5 Conditional Distributions 47
3.5.1 Conditional Probability Function 47
3.5.2 Conditional Probability Density Function 47
3.6 Expectations 48
3.6.1 Moments 49
3.6.2 Expected Value of Function of Random Variable 49
3.6.3 Covariance 50
3.7 Conditional Expectations 50
3.8 Convergence of Random Variables 51
3.9 Random Processes 51
3.10 Markov Processes 53
Part II Optimization 61
4 Optimization Theory 63
4.1 Basic Concepts and Terminology 63
4.1.1 Optimization Problems 64
4.1.2 Equivalent Problems 65
4.2 Convex Sets 66
4.2.1 Convexity-Preserving Operations 67
4.2.2 Examples of Convex Sets 68
4.2.3 Generalized Inequalities 71
4.3 Convex Functions 72
4.3.1 First- and Second-Order Conditions for Convexity 73
4.3.2 Convexity-Preserving Operations 75
4.3.3 Examples of Convex Functions 78
4.3.4 Conjugation 78
4.3.5 Dual Norms 79
4.4 Subdifferentiability 80
4.4.1 Subdifferential Calculus 82
4.5 Convex Optimization Problems 84
4.5.1 Optimality Condition 84
4.5.2 Equality Constrained Convex Problems 85
4.6 Duality 86
4.6.1 Lagrangian Duality 86
4.6.2 Lagrange Dual Problem 87
4.6.3 Fenchel Duality 88
4.7 Optimality Conditions 90
4.7.1 Convex Optimization Problems 90
4.7.2 Nonconvex Optimization Problems 91
Exercises 91
5 Optimization Problems 94
5.1 Least-Squares Problems 94
5.2 Quadratic Programs 96
5.3 Conic Optimization 97
5.3.1 Conic Duality 99
5.3.2 Epigraphical Cones 100
5.4 Rank Optimization 103
5.5 Partial Separability 106
5.5.1 Minimization of Partially Separable Functions 106
5.5.2 Principle of Optimality 108
Appendix A 373
A.1 Notation and Basic Definitions 373
A.2 Software 374
A.2.1 Modeling Software 374
References 379
Index 387
Preface
This is a book about optimization for learning and control. The literature and the techniques for
learning are vast, and we do not attempt to cover all possible learning methods here. Instead, we
discuss a selection of them, especially the ones that result in optimization problems. We also discuss
which optimization methods are relevant for these optimization problems. The book is primarily
intended for graduate students with a background in science or engineering who want to learn
more about what optimization methods are suitable for learning problems. It is also useful for
those who want to study optimal control. Very limited background knowledge of optimization,
control, or learning is needed. The book is accompanied by a large number of exercises, many of
which involve computer tools so that students can obtain hands-on experience.
The topics in learning span a wide range from classical statistical learning problems like
regression and maximum likelihood estimation to more recent problems like deep learning using,
e.g. recurrent neural networks. Regarding optimization methods, we cover methods from simple
gradient methods to more advanced interior-point methods for conic optimization. A special
emphasis is on stochastic methods applicable to the training of neural networks. We also put a
special emphasis on nondifferentiable problems for which we discuss subgradient methods and
proximal methods. We cover second-order methods, variable-metric methods, and augmented
Lagrangian methods. Regarding applications to system identification, we discuss identification
both for input–output models as well as for state-space models. Recurrent neural networks
and temporal convolutional networks are naturally introduced as ways of modeling nonlinear
dynamical systems. We also cover calculus of variations and dynamic programming in detail, as
well as the generalization of the latter to reinforcement learning.
The book can be used to teach several different courses. One could be an introductory
course in optimization based on Chapters 4–6. Another course could be on optimal control
covering Chapters 7–8, and possibly also Chapter 11. Another course could be on learning
covering Chapters 9–10 and perhaps Chapter 12. There is of course also the possibility to combine
more chapters, and a course that has been taught at Linköping University for PhD students covers
all but the material for optimal control.
Acknowledgments
We would like to thank Andrea Garulli at University of Siena who invited Anders Hansson to
give a course on optimization for learning in the spring of 2019. The experience from teaching
that course provided most valuable inspiration for writing this book. Daniel Cederberg, Markus
Fritzsche, and Magnus Malmström are gratefully acknowledged for having proofread some of the
chapters.
Glossary
Sets
ℕ set of natural numbers
ℕk set {1, 2, … , k}
ℤ set of integers
ℤk set {0, 1, … , k}
ℤ+ set of nonnegative integer numbers
ℝ set of real numbers
ℝ+ set of nonnegative real numbers
ℝ++ set of positive real numbers
ℝ̄ set of extended real numbers
ℝ̄+ set of nonnegative extended real numbers
ℝ̄++ set of positive extended real numbers
ℂ set of complex numbers
𝕊n set of symmetric real-valued matrices of order n
𝕊n+ set of positive semidefinite real-valued matrices of order n
𝕊n++ set of positive definite real-valued matrices of order n
ℚn quadratic cone of dimension n
Δn probability simplex of dimension n − 1
∅ empty set
Elementary Functions
exp natural exponential function
log logarithm function
ln natural logarithm function
log2 logarithm function, base 2
Probability Spaces
𝔼 expectation functional
𝒩 normal probability density function
ℙ probability measure
Var variance functional
Symbols
∑ summation
∏ product
∫ integral
∮ contour integral
∞ infinity
∈ belongs to
∉ does not belong to
⊂ proper subset of
⊆ subset of
⊃ proper superset of
⊇ superset of
⊄ not proper subset of
⊈ not subset of
⊅ not proper superset of
⊉ not superset of
∪ set union
∩ set intersection
∖ set difference
+ plus
− minus
± plus or minus
× multiplied by
⊗ Kronecker product
⚬ Hadamard product or composition of functions
= is equal to
< is less than
≤ is less than or equal to
> is greater than
≥ is greater than or equal to
≠ is not equal to
≮ is not less than
≰ is neither less than nor equal to
≯ is not greater than
≱ is neither greater than nor equal to
≪ much smaller than
≫ much greater than
≈ is approximately equal to
∼ asymptotically equivalent to
∝ proportional to
≺ precedes
⪯ precedes or equals
≻ succeeds
⪰ succeeds or equals
⊀ does not precede
⋠ neither precedes nor equals
⊁ does not succeed
⋡ neither succeeds nor equals
∃ there exists
∄ there is no
∀ for all
¬ logical not
∧ logical and
∨ logical or
⟹ implies
⟸ is implied by
⟺ is equivalent to
→ to or tends toward
↔ corresponds to
↘ tends toward from above
↦ maps to
⟂ is perpendicular to
| such that or given
: such that
Acronyms
MA moving average
MAP maximum a posteriori
MDP Markov decision process
ML maximum likelihood
MPC model predictive control
m.s. mean square
MSE mean square error
NP nondeterministic polynomial time
ODE ordinary differential equation
OE output error
PCA principal component analysis
pdf probability density function
pf probability function
PI policy iteration
PMP Pontryagin maximum principle
QP quadratic program
RBM restricted Boltzmann machine
RMSprop root mean square propagation
RNN recurrent neural network
ReLU rectified linear unit
SA stochastic approximation
SAA stochastic average approximation
SARSA state-action-reward-state-action
SG stochastic gradient
SGD stochastic gradient descent
SMW Sherman–Morrison–Woodbury
SNR signal-to-noise ratio
SR1 symmetric rank-1
SVD singular value decomposition
SVM support vector machine
SVRG stochastic variance-reduced gradient
TCM temporal convolutional network
TPBVP two-point boundary value problem
VI value iteration
w.p.1 with probability one
About the Companion Website
www.wiley.com/go/opt4lc
Part I
Introductory Part
1 Introduction
This book will take you on a journey through the fascinating field of optimization, where we explore
techniques for designing algorithms that can learn and adapt to complex systems. Whether you are
an engineer, a scientist, or simply curious about the world of optimization, this book is for you. We
will start with the basics of optimization and gradually build up to advanced techniques for learning
and control. By the end of this book, you will have a solid foundation in optimization theory and
practical tools to apply to real-world problems. In this opening, we informally introduce problems
and concepts, and we will explain their close interplay with simple formulations and examples.
Chapters 2–13 will explore the topic with more rigor, and we end this chapter with an outline of
the remaining content of the book.
1.1 Optimization
Optimization is about choosing a best option from a set of available alternatives based on a specific
criterion. This concept applies to a range of disciplines, including computer science, engineer-
ing, operations research, and economics, and has a long history of conceptual and methodological
development. One of the most common optimization problems is of the form
minimize ∑_{k=1}^{m} f_k(x)²,    (1.1)
with variable x ∈ ℝ n . This is called a nonlinear least-squares problem, since we are minimizing the
squares of the possibly nonlinear functions fk . We will see that the least-squares problem and its
generalizations have many applications to learning and control. It is also the backbone of several
optimization methods for solving more complicated optimization problems. In Chapter 4, we set
the foundations for optimization theory, in Chapter 5, we cover different classes of optimization
problems that are relevant for learning and control, and in Chapter 6, we discuss different methods
for solving optimization problems numerically.
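As a concrete illustration (not taken from the book), a problem of the form (1.1) can be solved numerically once the functions f_k are specified. The sketch below fits the parameters of a hypothetical exponential model to synthetic data with scipy.optimize.least_squares; the model, the data, and all variable names are illustrative assumptions.

```python
# A minimal sketch of solving a nonlinear least-squares problem of the form (1.1),
# assuming f_k(x) = model(t_k, x) - y_k for synthetic data (t_k, y_k).
import numpy as np
from scipy.optimize import least_squares

rng = np.random.default_rng(0)
t = np.linspace(0.0, 4.0, 50)
x_true = np.array([2.0, 0.7])                      # hypothetical "true" parameters
y = x_true[0] * np.exp(-x_true[1] * t) + 0.05 * rng.standard_normal(t.size)

def residuals(x):
    # f_k(x) = x0 * exp(-x1 * t_k) - y_k, k = 1, ..., m
    return x[0] * np.exp(-x[1] * t) - y

sol = least_squares(residuals, x0=np.array([1.0, 1.0]))
print(sol.x)   # estimate of (x0, x1), close to x_true
```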
with variable x. This is an example of a least-squares problem for which fk (x) = aTk x − yk is an affine
function. Hence, this is often called a linear least-squares problem. Typically, m is much larger
than n, and therefore, the optimal solution x⋆ is of lower dimension than the measurements. If we
later get a new value of a, we may predict the value of the corresponding measurement as aT x⋆
without performing the measurement. For our application, this means that we can estimate the
distance to the city by just checking how long we have been traveling. We do not have to wait for a
new sign to appear. This is a so-called supervised learning problem, since for each ak , we know the
corresponding yk . For learning the length of the piece of wood the data did not come in pairs, but
instead, we just had one stream of data yk . We learned the mean of the data. That is the reason for
the name unsupervised learning for such a learning problem. We will discuss supervised learning
in more detail in Chapter 10.
studying dynamical systems, the pairs (uk , yk ) are called stimuli and response or input and output,
respectively. Sometimes the word signal is added at the end for each of the four words. Often,
the above equation does not hold exactly due to measurement errors ek , and hence, it is more
appropriate to consider
yk+1 = ayk + buk + ek , k = 1, … , m.
When we do not know the parameters (a, b) ∈ ℝ × ℝ, we can use supervised learning to learn the
values assuming that we are given pairs of data (uk , yk ) for 1 ≤ k ≤ m + 1. The following linear
least-squares problem
minimize ∑_{k=1}^{m} (y_{k+1} − a y_k − b u_k)²,
with variables (a, b) will provide an estimate of the parameters. Learning for dynamical systems is
called system identification, and it will be discussed in more detail in Chapter 12.
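To make this concrete, the following sketch (not from the book) simulates data from a first-order system with assumed parameter values and recovers (a, b) with numpy.linalg.lstsq; all numerical values are illustrative.

```python
# Estimate (a, b) in y_{k+1} = a*y_k + b*u_k + e_k from data, as a linear least-squares problem.
import numpy as np

rng = np.random.default_rng(1)
a_true, b_true, m = 0.9, 0.5, 200                  # hypothetical true parameters
u = rng.standard_normal(m)
y = np.zeros(m + 1)
for k in range(m):
    y[k + 1] = a_true * y[k] + b_true * u[k] + 0.01 * rng.standard_normal()

# Regressor matrix with rows (y_k, u_k); target is y_{k+1}.
Phi = np.column_stack([y[:m], u])
theta, *_ = np.linalg.lstsq(Phi, y[1:], rcond=None)
print(theta)   # approximately (a_true, b_true)
```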
1.5 Control
Control is about making dynamical systems behave in a way we find desirable. Let us again consider
the dynamical system in (1.2), where we are going to influence the behavior by manipulating the
input signal uk . In the context of control, we also often call it the control signal. We assume that the
initial value y0 and the parameters (a, b) are known, and our objective is to make yk for 1 ≤ k ≤ m
small. We can make y1 equal to zero by taking u0 = −ay0 ∕b, and then we can make all future values
of yk equal to zero by taking all future values of uk equal to zero. However, it could be that the value
of u0 is large, and in applications, this could be costly. Hence, we are interested in finding a trade-off
between how large the values of uk are in comparison to the values of yk . This can be accomplished
by solving
minimize ∑_{k=1}^{m} (y_k² + 𝜌 u_k²)
subject to y_{k+1} = a y_k + b u_k,  k = 1, … , m − 1,    (1.3)
with variables (u1, y2, … , um, ym). This is an equality-constrained linear least-squares problem. The
parameter 𝜌 > 0 can be used to trade off how small |yk| should be compared to |uk|. We will cover
control of dynamical systems in Chapter 7 for continuous time, and in Chapter 8 for discrete time.
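One way to solve a problem like (1.3) numerically is to eliminate the equality constraints by expressing each y_k as a linear function of the inputs and then solving an unconstrained least-squares problem. The sketch below (not from the book) does this for assumed values of a, b, 𝜌, and y0; the indexing convention (inputs u_0, …, u_{m−1}) is a choice made for the illustration.

```python
# Sketch: solve min sum_k y_k^2 + rho*u_k^2 s.t. y_{k+1} = a*y_k + b*u_k
# by writing y = G u + h and solving the normal equations (G^T G + rho*I) u = -G^T h.
import numpy as np

a, b, rho, y0, m = 0.9, 0.5, 0.1, 1.0, 20          # illustrative parameter values

# y_k = a^k y0 + sum_{j<k} a^(k-1-j) b u_j for k = 1, ..., m
G = np.zeros((m, m))
for k in range(1, m + 1):
    for j in range(k):
        G[k - 1, j] = a ** (k - 1 - j) * b
h = np.array([a ** k * y0 for k in range(1, m + 1)])

u = np.linalg.solve(G.T @ G + rho * np.eye(m), -G.T @ h)
y = G @ u + h
print(u[:3], y[:3])   # trade-off between small inputs and small outputs
```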
1.7 Outline
The book is organized as follows: first, we give a background on linear algebra and probabilities
in Chapters 2 and 3. Background on optimization is given in Chapter 4. We will cover both convex
and nonconvex optimization. Chapter 5 introduces different classes of optimization problems that
we will encounter in the learning chapters later on. In Chapter 6, we discuss different optimization
methods that are suitable for solving learning problems. After this, we discuss calculus of variations
in Chapter 7 and dynamic programming in Chapter 8. We then cover unsupervised learning in
Chapter 9, supervised learning in Chapter 10, and reinforcement learning in Chapter 11. Finally,
we discuss system identification in Chapter 12. For information about notation, basic definitions,
and software useful for optimization and for the applications we consider, see the Appendix.
2 Linear Algebra
Linear algebra is the study of vector spaces and linear maps on such spaces. It constitutes a funda-
mental building block in optimization and is used extensively for theoretical analysis and deriva-
tions as well as numerical computations.
A procedure for solving systems of simultaneous linear equations appeared already in an
ancient Chinese mathematical text. Systems of linear equations were introduced in Europe
in the seventeenth century by René Descartes in order to represent lines and planes by linear
equations and to compute their intersections. Gauss developed the method of elimination. Further
important developments were done by Gottfried Wilhelm von Leibniz, Gabriel Cramer, Hermann
Grassmann, and James Joseph Sylvester, the latter introducing the term “matrix.”
The purpose of this chapter is to review key concepts from linear algebra and calculus in
finite-dimensional vector spaces as well as a number of useful identities that will be used
throughout the book. We also discuss some computational aspects, including a number of matrix
factorizations and their application to solving systems of linear equations.
A matrix A of size m-by-n, also written as m × n, is an ordered rectangular array that consists of
mn elements arranged in m rows and n columns, i.e.
A = [a11 a12 ⋯ a1n ; a21 a22 ⋯ a2n ; ⋮ ⋮ ⋱ ⋮ ; am1 am2 ⋯ amn],
where aij denotes the element of A in its ith row and jth column. The set of m-by-n matrices with
real-valued elements is denoted by ℝm×n . The transpose of A is the n × m matrix defined as
AT = [a11 a21 ⋯ am1 ; a12 a22 ⋯ am2 ; ⋮ ⋮ ⋱ ⋮ ; a1n a2n ⋯ amn],
i.e. the (i, j)th element of AT is the (j, i)th element of A.
It is often convenient to think of a vector as a matrix with a single row or column. For example,
when interpreted as a matrix of size 1 × n, the vector a ∈ ℝn is referred to as a row vector, and
similarly, when interpreted as a matrix of size n × 1, a is referred to as a column vector. In this
book, we use the convention that all vectors are column vectors. Thus, a vector x ∈ ℝn is always
interpreted as the column vector
x = [x1 ; x2 ; ⋮ ; xn],
and hence, xT is interpreted as the row vector [x1 x2 ⋯ xn]. Similarly, to refer to the columns of a
matrix A ∈ ℝm×n , we will sometimes use the notation
A = [a1 a2 ⋯ an],
where a1 , a2 , … , an ∈ ℝm . When referring to the rows of A, we will define
AT = [a1 a2 ⋯ am],
where a1 , a2 , … , am ∈ ℝn such that A is the matrix with rows aT1 , aT2 , … , aTm . The notation is some-
what ambiguous because ai may refer to the ith element of a vector a or the ith column of either A or
AT , but the meaning will be clear from the context and follows from our convention that vectors
are column vectors.
Given two vectors x, y ∈ ℝn , the inner product ⟨x, y⟩ can also be expressed as the product
xT y = [x1 x2 ⋯ xn][y1 ; y2 ; ⋮ ; yn] = ∑_{i=1}^{n} xi yi .
In contrast, the outer product of two vectors u ∈ ℝm and 𝑣 ∈ ℝn , not necessarily of the same length,
is defined as the m × n matrix
u𝑣T = [u1𝑣1 u1𝑣2 … u1𝑣n ; u2𝑣1 u2𝑣2 … u2𝑣n ; ⋮ ⋮ ⋱ ⋮ ; um𝑣1 um𝑣2 … um𝑣n].
The product of a matrix A ∈ ℝm×n , with columns a1 , … , an ∈ ℝm , and a vector x ∈ ℝn is the linear
combination
y = a1 x1 + a2 x2 + · · · + an xn .
Equivalently, the ith element of the vector y = Ax is the inner product of x and the ith row of A.
The vector inner and outer products and matrix–vector multiplication are special cases of matrix
multiplication. Two matrices A and B are said to be conformable for multiplication if the number of
columns in A is equal to the number of rows in B. Given two such matrices A ∈ ℝm×p and B ∈ ℝp×n ,
the product C = AB is the m × n matrix with elements
cij = ∑_{k=1}^{p} aik bkj ,  i ∈ ℕm , j ∈ ℕn .
Note that cij is the inner product of the ith row of A and the jth column of B. As a result, C = AB
may be expressed as
C = [a1T ; a2T ; ⋮ ; amT][b1 b2 ⋯ bn] = [a1T b1 a1T b2 … a1T bn ; a2T b1 a2T b2 … a2T bn ; ⋮ ⋮ ⋱ ⋮ ; amT b1 amT b2 … amT bn],
where aT1 , … , aTm are the rows of A and b1 , … , bn are the columns of B. Equivalently, by expressing
A in terms of its columns and B in terms of its rows, C may also be written as the sum of p outer
products
C = [a1 a2 ⋯ ap][b1T ; b2T ; ⋮ ; bpT] = ∑_{i=1}^{p} ai biT ,
where a1 , … , ap are the columns of A and bT1 , … , bTp are the rows of B.
It is important to note that matrix multiplication is associative, but unlike scalar multiplication,
it is not commutative. In other words, the associative property (AB)C = A(BC) holds, provided that
A, B, and C are conformable for multiplication, but the identity AB = BA does not hold in general.
The Frobenius inner product of two matrices A, B ∈ ℝm×n is defined as
⟨A, B⟩ = ∑_{i=1}^{m} ∑_{j=1}^{n} aij bij .    (2.2)
The inner product ⟨A, B⟩ may also be written as vec(A)T vec(B), where vec(A) maps a matrix
A ∈ ℝm×n to a vector of length mn by stacking the columns of A, i.e.
vec(A) = vec([a1 ⋯ an]) = [a1 ; ⋮ ; an] ∈ ℝmn .
The span of n vectors 𝑣1 , … , 𝑣n ∈ ℝm is the set of all linear combinations of these vectors, i.e.
span(𝑣1 , … , 𝑣n ) = {𝛼1 𝑣1 + · · · + 𝛼n 𝑣n ∣ 𝛼1 , … , 𝛼n ∈ ℝ}.
The set 𝑣1 , … , 𝑣n is said to be linearly independent if
𝛼1 𝑣1 + · · · + 𝛼n 𝑣n = 0 ⟺ 𝛼1 = · · · = 𝛼n = 0,
and otherwise, the set is said to be linearly dependent. Equivalently, the set 𝑣1 , … , 𝑣n is linearly
independent if and only if all vectors in span(𝑣1 , … , 𝑣n ) have a unique representation as a linear
combination of 𝑣1 , … , 𝑣n . The span of a set of k linearly independent vectors 𝑣1 , … , 𝑣k ∈ ℝm is
a k-dimensional subspace of ℝm , and 𝑣1 , … , 𝑣k is a basis for the subspace. In other words, the
dimension of a subspace is equal to the number of linearly independent vectors that span the sub-
space. The vectors 𝑣1 , … , 𝑣n are said to be orthonormal if they are mutually orthogonal and of unit
length, i.e. 𝑣Ti 𝑣j = 0 if i ≠ j and 𝑣Ti 𝑣i = 1 for i ∈ ℕn . The standard basis or natural basis for ℝm is the
orthonormal basis e1 , … , em , where ei ∈ ℝm is the unit vector whose ith element is equal to 1, and
the rest are zero.
The range of a matrix A ∈ ℝm×n , denoted ℛ(A), is the span of its columns. This is also referred
to as the column space of A, whereas ℛ(AT ) is referred to as the row space of A. The dimension of
ℛ(A) is called the rank of A, denoted rank(A). The null space of A ∈ ℝm×n , denoted 𝒩(A), consists
of all vectors 𝑣 such that A𝑣 = 0, i.e.
𝒩(A) = {𝑣 ∈ ℝn ∣ A𝑣 = 0}.
The dimension of 𝒩(A) is called the nullity of A and is denoted nullity(A). The null space of A is
said to be trivial if 𝒩(A) = {0}, in which case nullity(A) = 0.
To see that ℛ(A) = 𝒩(AT )⟂ , note that for every y ∈ 𝒩(AT ), we have that
(AT y)T x = yT Ax = 0, ∀x ∈ ℝn ,
or equivalently, if we let u = Ax, we see that yT u = 0 for all u ∈ ℛ(A). This shows that ℛ(A)
and 𝒩(AT ), which are both subspaces of ℝm , are orthogonal complements. Similarly, for every
x ∈ 𝒩(A), it immediately follows that yT Ax = 0 for all y ∈ ℝm , and hence, ℛ(AT ) is the orthogonal
complement of 𝒩(A).
The result (2.3) can be used to derive the so-called rank-nullity theorem, which states that
rank(A) + nullity(A) = n.
To see this, note that the identity ℛ(AT ) = 𝒩(A)⟂ combined with the fact that the dimensions of a
subspace of ℝn and its orthogonal complement sum to n implies that rank(AT ) = n − nullity(A).
The rank-nullity theorem follows from the identity
rank(A) = rank(AT ),
Figure 2.1 The four subspaces associated with an m × n matrix A with rank r.
which we will now derive. First, suppose that rank(AT ) = r, and let 𝑣1 , … , 𝑣r ∈ ℝn be a linearly
independent set of vectors that span ℛ(AT ). It follows that the set of vectors A𝑣1 , … , A𝑣r is linearly
independent since
𝛼1 A𝑣1 + · · · + 𝛼r A𝑣r = A(𝛼1 𝑣1 + · · · + 𝛼r 𝑣r ) = 0
implies that 𝛼1 𝑣1 + · · · + 𝛼r 𝑣r ∈ 𝒩(A) = ℛ(AT )⟂ , which is only possible if 𝛼1 = · · · = 𝛼r = 0.
As a result, we have that rank(A) ≥ rank(AT ). Applying the same inequality to B = AT implies that
rank(A) = rank(AT ). A direct consequence of this identity is that rank(A) ≤ min (m, n). We say that
A has full rank if rank(A) = min (m, n), and we will use the term full row rank when r = m and the
term full column rank when r = n. The four subspaces ℛ(A), ℛ(AT ), 𝒩(AT ), and 𝒩(A) and their
dimensions are illustrated in Figure 2.1.
If two matrices A and B are conformable for multiplication, then
rank(AB) ≤ min (rank(A), rank(B)).
This follows from the fact that ℛ(AB) ⊆ ℛ(A) and ℛ(BT AT ) ⊆ ℛ(BT ), which implies that
rank(AB) ≤ rank(A) and rank(BT AT ) ≤ rank(BT ) = rank(B). Furthermore, we have that ℛ(AB) =
ℛ(A) when B has full row rank, in which case rank(AB) = rank(A).
If A and B are conformable for addition, then
rank(A + B) ≤ rank(A) + rank(B).
This means that rank is subadditive, and it follows from the fact that ℛ(A + B) ⊆ ℛ(A) + ℛ(B),
where
ℛ(A) + ℛ(B) = {u + 𝑣 ∣ u ∈ ℛ(A), 𝑣 ∈ ℛ(B)}
is the Minkowski sum of ℛ(A) and ℛ(B). Subadditivity implies that a rank k matrix can be decomposed
as the sum of k or more rank 1 matrices. Thus, a rank k matrix A ∈ ℝm×n can be factorized as
A = CRT = ∑_{i=1}^{k} ci riT ,
where c1 , … , ck and r1 , … , rk are the columns of C ∈ ℝm×k and R ∈ ℝn×k , respectively, and ℛ(C) =
ℛ(A) and ℛ(R) = ℛ(AT ).
The determinant of a square matrix A ∈ ℝn×n can be expressed as
det (A) = ∑_{i=1}^{n} (−1)^{i+j} aij Mij ,    (2.8)
where Mij denotes the minor of aij , which is the determinant of the (n − 1) × (n − 1) matrix that
is obtained by removing the ith row and jth column of A. This expression is a so-called Laplace
expansion of the determinant along the jth column of A, and it holds for every j ∈ ℕn . As a special
case, the determinant of a 2 × 2 matrix A may be expressed as
det (A) = a11 a22 − a12 a21 ,
and its absolute value may be interpreted as the area of a parallelogram defined by the columns
of A, as illustrated in Figure 2.2. More generally, the absolute value of the determinant of an n × n
matrix A is the volume of the parallelotope defined by the columns of A, i.e. the set
{Ax ∈ ℝn | 0 ≤ xi ≤ 1 ∀i ∈ ℕn }.
As a result, det (A) ≠ 0 if and only if A has full rank.
The term (−1)i+j Mij is called the cofactor of aij . The n × n matrix composed of all the cofactors,
i.e. the matrix C with elements
cij = (−1)i+j Mij , i, j ∈ ℕn ,
is called the cofactor matrix. Expressed in terms of the cofactors, the Laplace expansion (2.8) may
be written as the inner product of the jth column of A and C, i.e.
det (A) = ∑_{k=1}^{n} akj ckj .
Furthermore, since the Laplace expansion is valid for any j ∈ ℕn , the diagonal elements of the
matrix CT A are all equal to det (A). In fact, it can be shown that
CT A = det (A)I,
where 𝜃1 + · · · + 𝜃k = 1. A set 𝒜 ⊆ ℝn is an affine set if and only if it contains all affine combinations
of its points. Equivalently, 𝒜 is affine if for every pair of points x, y ∈ 𝒜,
{𝜃x + (1 − 𝜃)y | 𝜃 ∈ ℝ} ⊆ 𝒜 .
An example of an affine set is the set of solutions to the system of equations Ax = b with A ∈ ℝm×n
and b ∈ ℝm , i.e.
𝒜 = {x ∈ ℝn | Ax = b}.
In fact, every affine subset of ℝn may be expressed in this form. The affine hull of a set 𝒞 ⊆ ℝn ,
which we denote aff 𝒞, is the smallest possible affine set that contains 𝒞.
2.3 Norms
Equipped with the Euclidean inner product (2.1), the set ℝn is a Euclidean vector space of dimen-
sion n with the induced norm
||a||2 = √⟨a, a⟩ = (a1² + a2² + · · · + an²)^{1∕2} .    (2.10)
More generally, a norm of a vector x ∈ ℝn , denoted ||x||, is a function with the following defining
properties:
1. Subadditive:
||u + 𝑣|| ≤ ||u|| + ||𝑣||, ∀u, 𝑣 ∈ ℝn .
2. Absolutely homogeneous:
||𝛼x|| = |𝛼|||x||, ∀x ∈ ℝn , ∀𝛼 ∈ ℝ.
3. Positive definite:
||x|| = 0 if and only if x = 0.
4. Nonnegative:
||x|| ≥ 0, ∀x ∈ ℝn .
Such a function is not unique, so to distinguish between different norms, a subscript is typically
used to denote specific norms. For example, the 2-norm or Euclidean norm of a vector x, defined
in (2.10), is denoted ||x||2 . Other examples are the 1-norm and the infinity norm, defined as
||x||1 = ∑_{i=1}^{n} |xi |  and  ||x||∞ = max {|x1 |, … , |xn |}.
This is sometimes referred to as an entrywise norm. In contrast, given both vector norms on ℝm and
ℝn , the operator norm or induced norm on ℝm×n may be defined as
||A|| = inf_{t≥0} {t | ||Ax|| ≤ t||x|| ∀x ∈ ℝn }.
It follows directly from this definition that ||Ax|| ≤ ||A||||x|| for all x ∈ ℝn and that ||I|| = 1. The
induced norm may also be expressed as
||A|| = sup_{x≠0} { ||Ax|| ∕ ||x|| },    (2.14)
from which the submultiplicative property ||AB|| ≤ ||A|| ||B|| follows:
||AB|| = sup_{x≠0} { ||ABx|| ∕ ||x|| } = sup_{Bx≠0} { ||ABx|| ∕ ||x|| } = sup_{Bx≠0} { (||ABx|| ∕ ||Bx||)(||Bx|| ∕ ||x||) } ≤ sup_{y≠0} { ||Ay|| ∕ ||y|| } sup_{x≠0} { ||Bx|| ∕ ||x|| } = ||A|| ||B||.
The matrix p-norm of A is the norm induced by the vector p-norm, and it is denoted ||A||p .
The next example considers the matrix 1-norm, and the matrix infinity norm is treated in
Exercise 2.3.
Example 2.1 The matrix 1-norm on ℝm×n is the induced norm defined as
||A||1 = sup_{x≠0} { ||Ax||1 ∕ ||x||1 } = sup_{||x||1 =1} { ||Ax||1 } = max_{i∈ℕn} { ||ai ||1 },
where a1 , … , an denote the n columns of A. To verify the third equality, first note that
sup_{||x||1 =1} { ||Ax||1 } ≥ ||Aej ||1 = ||aj ||1 , j ∈ ℕn ,
where e1 , … , en are the columns of the identity matrix of order n. Moreover, subadditivity, i.e. the
triangle inequality, implies that
||A||1 = sup_{||x||1 =1} ‖ ∑_{j=1}^{n} xj aj ‖1 ≤ sup_{||x||1 =1} ∑_{j=1}^{n} |xj | ||aj ||1 = max_{j∈ℕn} ||aj ||1 .
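A quick numerical check of Example 2.1 (an illustration, not from the book): numpy's matrix 1-norm should coincide with the largest column sum of absolute values.

```python
# Verify that ||A||_1 equals the maximum column sum of absolute values.
import numpy as np

rng = np.random.default_rng(2)
A = rng.standard_normal((4, 6))

max_col_sum = np.max(np.abs(A).sum(axis=0))             # max_j ||a_j||_1
print(np.isclose(np.linalg.norm(A, 1), max_col_sum))    # True
```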
Table 2.1 FLOP counts for basic matrix and vector operations: 𝛼 denotes a scalar, x and y are n-vectors, and A ∈ ℝm×n and B ∈ ℝn×p are matrices.
Operation | FLOPs
Scaling 𝛼x | n
Vector addition/subtraction x ± y | n
Inner product xT y | 2n − 1
Matrix–vector multiplication Ax | m(2n − 1)
Matrix–matrix multiplication AB | mp(2n − 1)
The function g provides an upper bound on the growth rate of f as the parameters m
and/or n increase. For example, the FLOP count for the matrix–vector product Ax is O(mn)
since m(2n − 1) ≤ 2mn for all m ≥ 1, n ≥ 1. Similarly, the function f (m, n) = m² + 10m + n³
satisfies f (m, n) = O(m² + n³) since m² + 10m + n³ ≤ 2(m² + n³) for m ≥ 10, n ≥ 1.
Example 2.2 Recall that matrix multiplication is associative, i.e. given three matrices A, B, and C
that are conformable for multiplication, we have that A(BC) = (AB)C. This implies that the prod-
uct of three or more matrices A1 A2 … Ak can be evaluated in several ways, each of which may have
a different FLOP count. For example, the product ABC with A ∈ ℝm×n , B ∈ ℝn×p , and C ∈ ℝp×q
may be evaluated left-to-right by first computing L = AB and then LC, or right-to-left by comput-
ing R = BC and then AR. The first approach requires O(mp(n + q)) FLOPs, whereas the second
approach requires O(nq(m + p)) FLOPs. A special case is the product abT x with a ∈ ℝm , b ∈ ℝp ,
and x ∈ ℝp which requires O(mp) FLOPs when evaluated left-to-right, but only O(m + p) FLOPs
when evaluated right-to-left. More generally, the product A1 A2 … Ak of k matrices requires k − 1
matrix–matrix multiplications which can be carried out in any order since matrix multiplication is
associative. The problem of finding the order that yields the lowest FLOP count is a combinatorial
optimization problem, which is known as the matrix chain ordering problem, see Exercise 5.7. This
problem can be solved by means of dynamic programming, which we introduce in Section 5.5 and
discuss in more detail in Chapter 8.
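The following small experiment (illustrative, not from the book) shows the effect of evaluation order for the product a bᵀ x discussed in Example 2.2: evaluating right-to-left avoids forming the m × p outer product.

```python
# Compare left-to-right and right-to-left evaluation of a b^T x (a: m, b: p, x: p).
import numpy as np
import time

rng = np.random.default_rng(3)
m, p = 2000, 2000
a, b, x = rng.standard_normal(m), rng.standard_normal(p), rng.standard_normal(p)

t0 = time.perf_counter()
y1 = np.outer(a, b) @ x            # O(m*p) memory and FLOPs: forms the outer product
t1 = time.perf_counter()
y2 = a * (b @ x)                   # O(m + p) FLOPs: inner product first, then scaling
t2 = time.perf_counter()

print(np.allclose(y1, y2), t1 - t0, t2 - t1)
```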
Given a vector d ∈ ℝn , we write diag(d) when referring to the diagonal matrix with the elements of d on its diagonal. Diagonal matrices
are also attractive from a computational point of view. For example, a matrix–vector product of the
form diag(d)x = (d1 x1 , … , dn xn ) involves only n FLOPs, whereas a matrix–vector product Ax with
a general square matrix A requires O(n2 ) FLOPs. Similarly, the inverse of a nonsingular diagonal
matrix diag(d) is also diagonal, i.e. diag(d)−1 = diag(1∕d1 , … , 1∕dn ).
Yet another family of orthogonal matrices is the permutation matrices. A permutation
of the elements of a vector x ∈ ℝn is a reordering of the elements. This can be expressed as a map
of the form
(x1 , … , xn ) → (x𝜋(1) , … , x𝜋(n) ), (2.19)
where the function 𝜋 ∶ ℕn → ℕn defines the permutation and satisfies 𝜋(i) ≠ 𝜋(j) ⟺ i ≠ j. This
map can be expressed as a matrix–vector product Px, where P ∈ ℝn×n is the permutation matrix
defined as
P = [e_{𝜋(1)}T ; ⋮ ; e_{𝜋(n)}T].    (2.20)
It is easy to verify that PPT = I since 𝜋(i) ≠ 𝜋(j) whenever i ≠ j, which implies that P is orthogonal.
It follows that the map (2.19) is invertible, and P−1 = PT is the permutation matrix that corresponds
to the inverse permutation 𝜋 −1 . The special case where 𝜋(i) = n + 1 − i corresponds to reversing the
order of n elements and leads to the permutation matrix
J = [enT ; ⋮ ; e1T], i.e. the matrix with ones on its anti-diagonal and zeros elsewhere.
The determinant of a triangular matrix T of order n is the product of its diagonal entries, i.e.
det (T) = t11 t22 · · · tnn , which follows from the Laplace expansion (2.8). A direct consequence is that T is nonsingular if
and only if all of the diagonal entries of T are nonzero. The inverse of a nonsingular lower (upper)
triangular matrix T is another lower (upper) triangular matrix. This follows from the identity (2.9)
by noting that the cofactor matrix associated with T is itself triangular. To see this, first recall that
the minor of tij is the determinant of the (n − 1) × (n − 1) matrix obtained by deleting the ith row
and the jth column of T. This implies that if, say, T is lower triangular and i > j, then the effect of
deleting the ith row and the jth column of T is a triangular matrix of order n − 1. This matrix will
have at least one diagonal entry that is equal to zero, and hence, it is singular, which implies that
2.5 Matrices with Structure 19
the minor of tij is zero. We note that the product of two lower (upper) triangular matrices is another
lower (upper) triangular matrix.
Given a nonsingular triangular matrix T, it is possible to compute matrix–vector products of the
form y = T −1 x and y = T −T x without computing T −1 first. For example, if T is lower triangular,
then y = T −1 x can be computed by means of forward substitution, i.e.
y1 ← T11^{−1} x1 ,  yk ← Tkk^{−1} ( xk − ∑_{i=1}^{k−1} Tki yi ),  k = 2, … , n.
This requires approximately n² FLOPs. Similarly, if T is an upper triangular matrix, then y = T −1 x
can be computed using backward substitution, i.e.
yn ← Tnn^{−1} xn ,  yk ← Tkk^{−1} ( xk − ∑_{i=k+1}^{n} Tki yi ),  k = n − 1, … , 1.
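A direct implementation of forward substitution (a sketch, not taken from the book) and a comparison against scipy.linalg.solve_triangular:

```python
# Forward substitution for a nonsingular lower triangular T: solve T y = x.
import numpy as np
from scipy.linalg import solve_triangular

def forward_substitution(T, x):
    n = x.size
    y = np.zeros(n)
    for k in range(n):
        y[k] = (x[k] - T[k, :k] @ y[:k]) / T[k, k]
    return y

rng = np.random.default_rng(4)
T = np.tril(rng.standard_normal((5, 5))) + 5.0 * np.eye(5)   # well-conditioned lower triangular
x = rng.standard_normal(5)
print(np.allclose(forward_substitution(T, x), solve_triangular(T, x, lower=True)))  # True
```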
i.e. svec stacks the lower triangular part of the columns of A and scales the off-diagonal entries by
√2. It is straightforward to verify that this definition implies that for all X, Y ∈ 𝕊n ,
x = svec(X), y = svec(Y ) ⟹ xT y = tr(X T Y ),
which means that the svec transformation preserves inner products.
A lower triangular Toeplitz matrix whose first column is given by a0 , … , an−1 can be expressed as
L = ∑_{k=0}^{n−1} ak S^k .    (2.26)
It then follows from the fact that det (L) = a0^n that L is nonsingular if and only if a0 ≠ 0. Moreover,
a matrix of the form (2.26) commutes with S, which follows from the fact that S commutes with
Sk , i.e. SSk = Sk S. It can also be shown that the inverse of a nonsingular lower triangular Toeplitz
matrix is another lower triangular Toeplitz matrix, see Exercise 2.6.
Toeplitz matrices are closely related to another class of matrices called Hankel matrices. Like a
Toeplitz matrix, a Hankel matrix is a square matrix that is uniquely determined by its first row
and column. However, unlike a Toeplitz matrix, a Hankel matrix has constant entries along the
anti-diagonal, the anti-subdiagonals, and anti-superdiagonals. As a consequence, the exchange
matrix J maps a Toeplitz matrix to a Hankel matrix and vice versa, i.e. if T is a Toeplitz matrix,
then JT and TJ are both Hankel matrices. We note that Hankel matrices are symmetric, whereas
Toeplitz matrices are persymmetric, i.e. symmetric about the anti-diagonal. Finally, we note
that the notion of Toeplitz (Hankel) matrices can be extended to include nonsquare matrices
with diagonal-constant (anti-diagonal-constant) structure, and such matrices may be viewed as
submatrices of square Toeplitz (Hankel) matrices.
A band matrix B is a square matrix whose nonzero entries are confined to a band around the main diagonal, i.e. entries sufficiently far from the diagonal are zero. The lower bandwidth of B is a nonnegative integer l such that bij = 0 if j < i − l, and similarly,
the upper bandwidth of B is a nonnegative integer u such that bij = 0 if j > i + u. A bandwidth of
0 corresponds to a diagonal matrix, whereas a bandwidth of 1 is a tridiagonal matrix or, if l = 0 or
u = 0, an upper or a lower bidiagonal matrix.
For any A ∈ ℝn×n and x ∈ ℝn , the quadratic form xT Ax satisfies
xT Ax = xT AT x = (1∕2)xT (A + AT )x,
which implies that only the symmetric part of A contributes to the value of xT Ax. We therefore limit
our attention to the case where A ∈ 𝕊n .
A matrix A ∈ 𝕊n is positive semidefinite if and only if xT Ax ≥ 0 for all x ∈ ℝn , and it is positive
definite if and only if xT Ax > 0 for all x ≠ 0. Similarly, A is negative (semi)definite if −A is positive
(semi)definite, and it is indefinite if it is neither positive semidefinite nor negative semidefinite. We
will use the notation 𝕊n+ for the set of positive semidefinite matrices in 𝕊n , the interior of which is
the set of positive definite matrices, denoted by 𝕊n++ .
Given two matrices A, B ∈ 𝕊n , the generalized inequality A ⪰_{𝕊n+} B, which is a partial ordering on
𝕊n , is defined as
A ⪰_{𝕊n+} B ⟺ A − B ∈ 𝕊n+ .
To simplify notation, we will omit the subscript 𝕊n+ and simply write A ⪰ B and A ≻ B when there
is no danger of ambiguity. We return to generalized inequalities in Section 4.2.
We end this section by deriving some useful properties of positive semidefinite matrices. To this
end, we consider a matrix X ∈ 𝕊n+ , which we partition as
X = [A B ; BT C],    (2.29)
where A ∈ 𝕊n1 , B ∈ ℝn1 ×n2 , and C ∈ 𝕊n2 with n1 + n2 = n. Positive semidefiniteness implies that
zT Xz ≥ 0 for all z ∈ ℝn or, equivalently, for all z = (u, 𝑣) ∈ ℝn1 × ℝn2 ,
f (u, 𝑣) = uT Au + 𝑣T C𝑣 + 2uT B𝑣 ≥ 0.
Thus, f (u, 0) = uT Au ≥ 0 for all u and f (0, 𝑣) = 𝑣T C𝑣 ≥ 0 for all 𝑣, so both A and C must be positive
semidefinite. This holds for any partition of the form (2.29) and any symmetric permutation of X
which, in turn, implies that every principal submatrix of X must be positive semidefinite.
Positive semidefiniteness of X also implies that
ℛ(B) ⊆ ℛ(A),  ℛ(BT ) ⊆ ℛ(C),    (2.30)
which is easily proven by contradiction: if ℛ(B) ⊈ ℛ(A), then there exists a vector 𝑣 such that B𝑣 ≠ 0
and B𝑣 ∉ ℛ(A), and hence, f (−tB𝑣, 𝑣) = 𝑣T C𝑣 − 2t||B𝑣||2² tends to −∞ as t → ∞. This is a contradiction
since X is positive semidefinite. The condition ℛ(BT ) ⊆ ℛ(C) can be proven in a similar
manner. An immediate consequence of the range conditions (2.30) is that the ith row and column
of X must be zero if Xii = 0.
Every symmetric matrix A ∈ 𝕊n has a spectral decomposition
A = QΛQT ,
where Q is an orthogonal matrix whose columns are eigenvectors of A, and Λ = diag(λ1 , … , λn ) contains the corresponding eigenvalues. The matrix A is positive semidefinite if and only if λmin (A) ≥ 0.
Analogously, A is positive definite if and only if λmin (A) > 0, which implies that A has full rank, and
it is indefinite if it has both positive and negative eigenvalues. A positive definite matrix A of order
n defines a weighted inner product ⟨y, x⟩A = ⟨y, Ax⟩ = yT Ax, which induces the quadratic norm
||x||A = √⟨x, x⟩A = √(xT Ax) .
The symmetric square root of a matrix A ∈ 𝕊n+ is the unique symmetric positive semidefinite
matrix A1∕2 that satisfies A = A1∕2 A1∕2 . Given a spectral decomposition A = QΛQT , the symmetric
square root of A may be expressed as
A^{1∕2} = QΛ^{1∕2} QT = Q diag(λ1^{1∕2} , … , λn^{1∕2} ) QT .
This implies that transformations of the form A → F T AF with F ∈ ℝn×k preserve positive semidefiniteness, i.e. we have that
A ∈ 𝕊n+ ⟹ F T AF ∈ 𝕊k+ .
Moreover, if A is positive definite and rank(F) = k, then F T AF is also positive definite. We note that
A and B = F T AF are said to be congruent if F is square and nonsingular.
The eigenvalues and eigenvectors of a symmetric matrix are related to the so-called Rayleigh
quotient which, for a given matrix A ∈ 𝕊n and a nonzero vector x ∈ ℝn , is defined as
RA (x) = (xT Ax) ∕ (xT x),  x ≠ 0.    (2.31)
A stationary point of RA must satisfy
∇RA (x) = 0 ⟺ Ax = RA (x) x,
i.e. the stationary points of RA are eigenvectors of A, and the corresponding stationary values are eigenvalues of A. In particular,
λmax (A) = max_{x≠0} {RA (x)},  λmin (A) = min_{x≠0} {RA (x)}.
A singular value decomposition (SVD) of a matrix A ∈ ℝm×n is a factorization of the form A = UΣV T ,
where U ∈ ℝm×m and V ∈ ℝn×n are orthogonal matrices, and Σ ∈ ℝm×n is a matrix with the singular
values of A on its main diagonal and zeros elsewhere, i.e. the diagonal entries of Σ are Σii = 𝜎i ,
i ∈ ℕmin (m,n) where we use the convention that 𝜎1 ≥ 𝜎2 ≥ · · · ≥ 0. If we let r denote the rank of A,
then 𝜎i = 0 for i > r, and hence, we can partition an SVD of A as
A = [U1 U2] [S 0 ; 0 0] [V1T ; V2T] = U1 SV1T ,    (2.32)
where U1 ∈ ℝm×r , V1 ∈ ℝn×r , and S = diag(𝜎1 , … , 𝜎r ) is the square matrix with the nonzero singu-
lar values of A on its diagonal. This shows that an SVD is a so-called rank-revealing factorization,
and A = U1 SV1T is commonly referred to as a thin or reduced SVD of A. We note that the largest
integer k such that 𝜎k > 𝜖 for a given 𝜖 > 0 is referred to as the numerical rank or the 𝜖-rank of A.
The 𝜖-rank of A may also be defined as
min { rank(B) ∣ B ∈ ℝm×n , ||A − B||2 ≤ 𝜖 },
which allows us to interpret the numerical rank of A as the smallest attainable rank for matrices
within a neighborhood of A.
The partition (2.32) can be linked to the four subspaces introduced in Section 2.2 and illustrated
in Figure 2.1. Specifically, we have that
ℛ(A) = ℛ(U1 ),  𝒩(AT ) = ℛ(U2 ),  ℛ(AT ) = ℛ(V1 ),  𝒩(A) = ℛ(V2 ).
The matrix P = U1 U1T is a projection matrix, and it is also an idempotent matrix since P² = P. Moreover,
I − P is also a projection matrix, which follows from the fact that I − P = U2 U2T , and it defines
a projection onto 𝒩(AT ) = ℛ(A)⟂ .
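The sketch below (illustrative, not from the book) uses numpy's SVD to compute the rank of a matrix, orthonormal bases for ℛ(A) and ℛ(AT), and the projection matrix P = U1 U1ᵀ onto ℛ(A).

```python
# Rank, bases of R(A) and R(A^T), and the projector onto R(A) from an SVD.
import numpy as np

rng = np.random.default_rng(5)
A = rng.standard_normal((6, 3)) @ rng.standard_normal((3, 5))   # rank 3 by construction

U, s, Vt = np.linalg.svd(A)
r = np.sum(s > 1e-10 * s[0])          # numerical rank (epsilon-rank with a relative tolerance)
U1, V1 = U[:, :r], Vt[:r, :].T        # orthonormal bases for R(A) and R(A^T)

P = U1 @ U1.T                          # projection onto R(A)
print(r, np.allclose(P @ P, P), np.allclose(P @ A, A))   # 3 True True
```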
An SVD can also be used to construct useful upper and lower bounds on the trace inner product
of two matrices. A notable example is von Neumann’s trace inequality, which states that
|tr(AT B)| ≤ tr(ΣT Γ) = ∑_{i=1}^{min (m,n)} Σii Γii ,    (2.33)
where A, B ∈ ℝm×n are matrices with SVDs A = UΣV T and B = PΓQT , see, e.g. [76] for a proof.
The singular values of an m × n matrix A can be used to define a family of matrix norms on ℝm×n
that are known as Schatten norms. For p ∈ [1, ∞), the Schatten p-norm is defined as
||A||_(p) = ( ∑_{i=1}^{min (m,n)} 𝜎i (A)^p )^{1∕p} ,    (2.34)
i.e. it may be viewed as the p-norm of a vector that contains the singular values of A. The parentheses
in the subscript are not standard notation, but we include them here to avoid confusion with the
induced matrix p-norm defined in (2.14). The Schatten 1-norm, which we will denote by ||A||∗ , is
also known as the nuclear norm or the trace norm. It is straightforward to verify that the matrix
norms ||A||F and ||A||2 are both special cases of the Schatten p-norm, see Exercise 2.5.
The Moore–Penrose pseudoinverse provides a convenient way to express projections onto the
four subspaces (A), (AT ), (AT ), and (A). This follows from (2.32) and (2.36) by noting that
AA† and A† A are projection matrices, i.e.
AA† = U1 U1T , A† A = V1 V1T .
Thus, projections onto the four subspaces can be expressed in terms of the projection matrices
included in Table 2.2.
Table 2.2 The four subspaces and the corresponding projection matrices.
Subspace | Projection matrix
ℛ(A) | AA† = U1 U1T
𝒩(AT ) | I − AA†
ℛ(AT ) | A† A = V1 V1T
𝒩(A) | I − A† A
2.11.1 LU Factorization
A by-product of Gaussian elimination is a factorization of the form
A = PLU,
where A is nonsingular and square, P is a permutation matrix, L is unit lower triangular, and U
is upper triangular. This is known as a PLU factorization or simply an LU factorization of A. The
factorization requires roughly (2∕3)n3 FLOPs, and it only requires additional storage for a permuta-
tion vector if A is overwritten by the factors L and U. The factorization allows us to find the solution
to Ax = b by solving the three simpler systems of equations,
Pz = b, Ly = z, Ux = y.
The first system, Pz = b, has the solution z = PT b, which is simply a permutation of b, and the
two triangular systems can be solved by means of forward and backward substitution in roughly
2n2 FLOPs. Thus, the total cost of the factorization step and the solve step is roughly (2∕3)n3 +
2n2 FLOPs. Note that the solve step is significantly cheaper than the factorization step, so it is
generally advantageous to reuse the factorization of A if several systems with this coefficient matrix
must be solved. We note that A−1 can be computed by solving the matrix equation AX = I, which
is equivalent to solving the n systems Axi = ei , i ∈ ℕn . This costs approximately (2∕3)n3 + 2nn2 =
(8∕3)n3 FLOPs if a single PLU factorization is computed and reused. Thus, the cost of solving Ax = b
by explicitly computing A−1 followed by the matrix–vector product A−1 b is several times higher
than that of the factor-solve approach.
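The factor-solve approach can be sketched with scipy's LU routines (an illustration, not from the book); the factorization is computed once and reused for several right-hand sides.

```python
# Factor once with a PLU factorization, then solve for several right-hand sides.
import numpy as np
from scipy.linalg import lu_factor, lu_solve

rng = np.random.default_rng(6)
n = 200
A = rng.standard_normal((n, n))
B = rng.standard_normal((n, 3))        # three right-hand sides

lu, piv = lu_factor(A)                 # O(n^3) factorization, done once
X = lu_solve((lu, piv), B)             # O(n^2) per right-hand side
print(np.allclose(A @ X, B))           # True
```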
A positive definite matrix A ∈ 𝕊n++ can be factorized as A = LLT , where L is lower triangular with positive diagonal entries. This is known as a Cholesky factorization, and it may equivalently be written as A = L̄DL̄T ,
where D = diag(L11² , … , Lnn² ) and L̄ = LD^{−1∕2} is unit lower triangular. The cost of computing the
Cholesky factorization is roughly (1∕3)n³ FLOPs.
is invariant under symmetric permutations. One approach to overcoming this issue, which is due to
[25], is to allow 1-by-1 and 2-by-2 pivot blocks. The resulting factorization is of the form
PAPT = LDLT , (2.46)
where D is a block diagonal matrix that contains the pivot blocks, and the cost is roughly (1∕3)n3
FLOPs.
2.11.4 QR Factorization
A matrix A ∈ ℝm×n with linearly independent columns can be decomposed into the product of an
orthogonal matrix Q ∈ ℝm×m and a matrix R ∈ ℝm×n with zeros below its diagonal, i.e.
A = QR = [Q1 Q2] [R1 ; 0] = Q1 R1 ,    (2.47)
where Q1 ∈ ℝm×n is the first n columns of Q, and R1 ∈ ℝn×n is upper triangular with nonzero diago-
nal elements. Such a factorization is called a QR factorization and can be computed in several ways,
e.g. by applying a sequence of n − 1 Householder transformations to A. The cost is approximately
2mn2 − (2∕3)n3 FLOPs or simply O(mn2 ), and it requires very little additional storage if A is over-
written by R and the vectors that define the n − 1 Householder transformations. Another benefit
of storing Q implicitly in this way is that matrix–vector products with Q and QT can be computed
in O(mn) FLOPs rather than O(m2 ) FLOPs if Q is formed and stored explicitly. We note that Q1 R1
is referred to as a thin or reduced QR factorization of A, and the matrix R1 is unique if we require
the diagonal of R1 to be positive.
A factorization of the form (2.47) yields an orthogonal basis for both the range of A and the
nullspace of AT . Specifically, the columns of Q1 form an orthogonal basis for the range of A,
whereas the columns of Q2 form an orthogonal basis for the nullspace of AT . This observation
allows us to characterize the solution set to a system of underdetermined equations. Specifically,
if A ∈ ℝm×n with rank(A) = m, then the set of solutions to Ax = b may be expressed in terms of a
QR factorization AT = QR as
{Q1 R1^{−T} b + Q2 z ∣ z ∈ ℝ^{n−m} }.
This follows directly from (2.40) by noting that
A† = AT (AAT )^{−1} = Q1 R1 (R1T R1 )^{−1} = Q1 R1^{−T} .
The QR factorization can also be used to compute a Cholesky factorization of a matrix of the form
AT A, where A ∈ ℝm×n and rank(A) = n. Indeed, if A = Q1 R1 , then AT A = RT1 R1 is the Cholesky
factorization of AT A, provided that R1 is chosen such that its diagonal is positive.
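As a sketch (not from the book), a reduced QR factorization can be used to solve a full-column-rank least-squares problem minimize ||Ax − b||₂ by back substitution on R1 x = Q1ᵀ b:

```python
# Solve a least-squares problem with a reduced QR factorization A = Q1 R1.
import numpy as np
from scipy.linalg import solve_triangular

rng = np.random.default_rng(7)
A = rng.standard_normal((50, 4))
b = rng.standard_normal(50)

Q1, R1 = np.linalg.qr(A, mode='reduced')          # Q1: 50x4 with orthonormal columns, R1: 4x4
x = solve_triangular(R1, Q1.T @ b, lower=False)   # back substitution on R1 x = Q1^T b

print(np.allclose(x, np.linalg.lstsq(A, b, rcond=None)[0]))   # True
```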
Finally, by combining the QR factorization with column pivoting, it can be applied to general
matrices A ∈ ℝm×n with linearly dependent columns. The result is a factorization of the form
A = QRP,
where Q ∈ ℝm×m is orthogonal, R ∈ ℝm×n is a matrix with zeros below the diagonal, and P is a per-
mutation matrix, which is commonly chosen so that the diagonal elements of R are nonincreasing.
The QR factorization with column pivoting is sometimes called a QRP factorization, and it is usu-
ally the first step of a so-called rank-revealing QR factorization, which can be used to compute the
numerical rank of a matrix.
where the permutation matrices P1 and P2 affect both the stability of the factorization and the spar-
sity of the triangular matrices L and U. To illustrate the basic principle, we consider the somewhat
simpler case where A is symmetric and positive definite. This implies that we can use a sparse
Cholesky factorization
PT AP = LLT , (2.48)
where P is a permutation matrix that determines the elimination order and affects the sparsity of
L. Since A ∈ 𝕊n++ is strongly factorizable, P can be chosen without taking numerical stability into
account, and hence, it can be chosen based on the sparsity pattern of A. Figure 2.5 shows an example
of a sparsity pattern and the corresponding sparsity graph, which is a graph with n nodes and an
edge for every off-diagonal nonzero element.
The sparsity pattern of L can be determined by means of a symbolic factorization of the sparsity
pattern of PT AP. Only the location of the nonzero entries of PT AP are needed for this step. Nonzero
entries in L that are not present in PT AP are referred to as fill-in. The example in Figure 2.6 illus-
trates this for two different symmetric permutations of an “arrow” sparsity pattern. It is clear from
the figure that the elimination order has a significant effect on the amount of fill-in. A large amount
of fill-in is undesirable, since additional nonzero entries in L lead to additional FLOPs. Unfortu-
nately, the problem of computing the minimum fill-in is NP-complete, but several fill-in reducing
heuristics exist that often work well in practice. We note that there exists a zero fill-in elimination
order if and only if the sparsity graph is a so-called chordal graph [106]. An elimination order with
zero fill-in is referred to as a perfect elimination order, and the corresponding symmetric permuta-
tion PT AP has the same sparsity pattern as that of L + LT .
Figure 2.6 Symbolic Cholesky factorizations of two different symmetric permutations of A. The entries in L
that are marked by ⊠ are fill-in.
Consider a block-partitioned matrix M = [A B ; C D], where A ∈ ℝ^{n1×n1} and D ∈ ℝ^{n2×n2} are square. If A is nonsingular, then M may be expressed as
M = [I 0 ; CA^{−1} I] [A 0 ; 0 D − CA^{−1}B] [I A^{−1}B ; 0 I].    (2.49)
This is a block LDU factorization of M, i.e. M is expressed as a product of a block unit lower
triangular matrix, a block diagonal matrix, and a block unit upper triangular matrix. Similarly, if
D is nonsingular, then M may be expressed as
M = [A B ; C D] = [I BD^{−1} ; 0 I] [A − BD^{−1}C 0 ; 0 D] [I 0 ; D^{−1}C I],    (2.50)
which is a block UDL factorization.
The determinant of M can be expressed in terms of the block factorizations (2.49) and (2.50) if A
or D is nonsingular. Using the fact that the determinant of a triangular matrix with a unit diagonal
is equal to 1 and the fact that det (bdiag(A1 , A2 )) = det (A1 ) det (A2 ) if A1 and A2 are square matrices,
we have that
A nonsingular ⟹ det (M) = det (A) det (D − CA−1 B), (2.51a)
D nonsingular ⟹ det (M) = det (D) det (A − BD−1 C). (2.51b)
It follows directly from (2.51a) that if A is nonsingular, then D − CA−1 B is nonsingular if and
only if M is nonsingular. Similarly, if D is nonsingular, then the Schur complement of D in M,
A − BD−1 C, is nonsingular if and only if M is nonsingular. These observations can be used to derive
the Weinstein–Aronszajn identity, also known as Sylvester’s determinant identity, which states that
det (I + BC) = det (I + CB). (2.52)
Indeed, this identity follows directly from (2.51) by letting A = I and D = −I such that
det (M) = det (A) det (D − CA^{−1} B) = (−1)^{n2} det (I + CB)
and
det (M) = det (D) det (A − BD^{−1} C) = (−1)^{n2} det (I + BC).
The identity is particularly useful if n1 ≫ n2 or n2 ≫ n1 so that either I + CB or I + BC is much
smaller than the other. For example, in the special case, where BC = u𝑣T is a rank-1 matrix, the
identity reduces to det (I + u𝑣T ) = 1 + 𝑣T u.
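A quick numerical sanity check of (2.52) for random matrices (illustrative only):

```python
# Verify the Weinstein-Aronszajn identity det(I + BC) = det(I + CB) numerically.
import numpy as np

rng = np.random.default_rng(8)
n1, n2 = 100, 3
B = rng.standard_normal((n1, n2))
C = rng.standard_normal((n2, n1))

lhs = np.linalg.det(np.eye(n1) + B @ C)           # determinant of a 100 x 100 matrix
rhs = np.linalg.det(np.eye(n2) + C @ B)           # determinant of a 3 x 3 matrix
print(np.isclose(lhs, rhs))                       # True
```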
The block factorizations (2.49) and (2.50) can also be used to derive explicit expressions for the
blocks of the inverse of M. It follows from (2.49) that if M and A are nonsingular, then M −1 is
given by
[A B ; C D]^{−1} = [I A^{−1}B ; 0 I]^{−1} [A 0 ; 0 D − CA^{−1}B]^{−1} [I 0 ; CA^{−1} I]^{−1}
= [I −A^{−1}B ; 0 I] [A^{−1} 0 ; 0 (D − CA^{−1}B)^{−1}] [I 0 ; −CA^{−1} I]
= [A^{−1} + A^{−1}B(D − CA^{−1}B)^{−1}CA^{−1}   −A^{−1}B(D − CA^{−1}B)^{−1} ; −(D − CA^{−1}B)^{−1}CA^{−1}   (D − CA^{−1}B)^{−1}].    (2.53)
Similarly, if M and D are nonsingular, then (2.50) implies that M −1 can be expressed as
[A B ; C D]^{−1} = [I 0 ; D^{−1}C I]^{−1} [A − BD^{−1}C 0 ; 0 D]^{−1} [I BD^{−1} ; 0 I]^{−1}
= [I 0 ; −D^{−1}C I] [(A − BD^{−1}C)^{−1} 0 ; 0 D^{−1}] [I −BD^{−1} ; 0 I]
= [(A − BD^{−1}C)^{−1}   −(A − BD^{−1}C)^{−1}BD^{−1} ; −D^{−1}C(A − BD^{−1}C)^{−1}   D^{−1} + D^{−1}C(A − BD^{−1}C)^{−1}BD^{−1}].    (2.54)
Now, if both A and D are nonsingular, then the (1,1) block of (2.53) and that of (2.54) must be equal,
i.e.
(A − BD−1 C)−1 = A−1 + A−1 B(D − CA−1 B)−1 CA−1 . (2.55)
Substituting (W, −U, V T , Z^{−1} ) for (A, B, C, D), where W and Z are nonsingular, yields the so-called
Sherman–Morrison–Woodbury (SMW) identity. This is also known as the matrix inversion lemma:
(W + UZV T )−1 = W −1 − W −1 U(Z −1 + V T W −1 U)−1 V T W −1 . (2.56)
This identity is often useful when solving a system of equations of the form (2.44), where the
coefficient matrix is a Schur complement. For example, using (2.56), the solution to the system
(A − BD−1 C)x = r can be expressed as
x = A−1 r + A−1 B(D − CA−1 B)−1 CA−1 r,
and can be computed as follows:
1. Compute u and Y by solving Au = r and AY = B.
2. Form S = D − CY and compute 𝑣 by solving S𝑣 = Cu.
3. Compute x = u + Y 𝑣.
This approach is particularly advantageous when the order of S is much smaller than that of A and
A is simple or cheap to factorize, e.g. a diagonal matrix.
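The three-step procedure above can be sketched in a few lines (an illustration, not from the book), here with a diagonal A so that solves with A are cheap:

```python
# Solve (A - B D^{-1} C) x = r via the Sherman-Morrison-Woodbury identity (2.55),
# factorizing only A (diagonal here) and the small matrix S = D - C A^{-1} B.
import numpy as np

rng = np.random.default_rng(9)
n, p = 500, 5
d = 1.0 + rng.random(n)                           # A = diag(d), cheap to "invert"
B = rng.standard_normal((n, p))
C = rng.standard_normal((p, n))
D = 10.0 * np.eye(p)
r = rng.standard_normal(n)

u = r / d                                         # step 1: A u = r
Y = B / d[:, None]                                #         A Y = B
S = D - C @ Y                                     # step 2: small p x p Schur complement
v = np.linalg.solve(S, C @ u)
x = u + Y @ v                                     # step 3

print(np.allclose((np.diag(d) - B @ np.linalg.solve(D, C)) @ x, r))   # True
```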
The approach that we have just outlined relies on the assumption that H is positive definite, and
hence, another approach is needed if H is only positive semidefinite.
Now, given a matrix-valued function F ∶ ℝ → ℝm×n that is differentiable, we define the matrix
of first-order derivatives
dF(x)∕dx = dF∕dx = [dF11∕dx ⋯ dF1n∕dx ; ⋮ ⋱ ⋮ ; dFm1∕dx ⋯ dFmn∕dx],    (2.67)
where the function Fij (x) is the (i, j)th element of F(x). Similarly, given a differentiable function
f ∶ ℝm×n → ℝ, we define the matrix of first-order partial derivatives
𝜕f (X)∕𝜕X = 𝜕f∕𝜕X = [𝜕f∕𝜕X11 ⋯ 𝜕f∕𝜕X1n ; ⋮ ⋱ ⋮ ; 𝜕f∕𝜕Xm1 ⋯ 𝜕f∕𝜕Xmn].    (2.68)
The partial derivatives of a composite function f = g ∘ h, where g ∶ ℝp → ℝm and h ∶ ℝn → ℝp
are differentiable functions, can be expressed in terms of the chain rule as
𝜕f∕𝜕xT = 𝜕g(h(x))∕𝜕xT = (𝜕g(y)∕𝜕yT)|_{y=h(x)} 𝜕h(x)∕𝜕xT .    (2.69)
For differentiable functions f ∶ ℝn → ℝ and g ∶ ℝn → ℝ, the product rule takes the form
𝜕(f (x)g(x))∕𝜕x = (𝜕f∕𝜕x) g(x) + f (x) (𝜕g∕𝜕x) = ∇f (x)g(x) + f (x)∇g(x),    (2.70)
whereas for f ∶ ℝn → ℝm and g ∶ ℝn → ℝ, we have that
𝜕(f (x)g(x))∕𝜕xT = (𝜕f∕𝜕xT) g(x) + f (x) (𝜕g∕𝜕xT).    (2.71)
Example 2.3 We now illustrate the use of the chain rule for the special case where f = g ∘ h is
a composition of a differentiable function g ∶ ℝp → ℝm and a linear function h(x) = Ax, where
A ∈ ℝp×n . We have that 𝜕h∕𝜕xT = A, and hence, the chain rule yields
𝜕f∕𝜕xT = (𝜕g(y)∕𝜕yT)|_{y=Ax} A.
If f is real-valued, i.e. m = 1, then this is equivalent to the identity ∇f (x) = AT ∇g(Ax), and if g is
also twice differentiable, then the Hessian of f is
∇2 f (x) = AT ∇2 g(Ax)A.
Example 2.4 Consider the function f ∶ ℝn → ℝ defined as f (x) = ln(e^{x1} + · · · + e^{xn} ),
which is known as the log-sum-exp function. To derive the partial derivatives of f using the chain
rule, we start by expressing f as f = g ∘ h, where g(y) = ln(𝟙T y) and h(x) = (e^{x1} , … , e^{xn} ). We have that
𝜕g∕𝜕yT = (1∕(𝟙T y)) 𝟙T and 𝜕h∕𝜕xT = diag(h(x)), and hence, it follows that
𝜕f∕𝜕xT = (1∕(𝟙T h(x))) 𝟙T diag(h(x)) = (1∕(𝟙T h(x))) h(x)T .
The Hessian of f now follows from the product rule (2.71), i.e.
∇²f (x) = (1∕(𝟙T h(x))) diag(h(x)) − (1∕(𝟙T h(x))²) h(x)h(x)T = diag(∇f (x)) − ∇f (x)∇f (x)T .
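The expressions in Example 2.4 can be checked numerically (an illustration, not from the book) by comparing the closed-form gradient with a finite-difference approximation:

```python
# Gradient and Hessian of the log-sum-exp function, checked by finite differences.
import numpy as np

def f(x):
    return np.log(np.sum(np.exp(x)))

def grad(x):
    h = np.exp(x)
    return h / np.sum(h)                          # grad f(x) = h(x) / (1^T h(x))

def hess(x):
    g = grad(x)
    return np.diag(g) - np.outer(g, g)            # diag(grad f) - grad f grad f^T

x = np.array([0.3, -1.2, 0.7])
eps = 1e-6
fd_grad = np.array([(f(x + eps * e) - f(x - eps * e)) / (2 * eps) for e in np.eye(3)])
print(np.allclose(grad(x), fd_grad, atol=1e-6), np.allclose(hess(x), hess(x).T))
```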
Example 2.5 Recall the Laplace expansion (2.8) of the determinant of a square matrix A ∈ ℝn×n
along the jth column, i.e.
\[
\det(A) = \sum_{i=1}^{n} c_{ij}\, a_{ij},
\]
where cij = (−1)i+j Mij is the cofactor of the (i, j)th entry of A. None of the cofactors c1j , … , cnj are a
function of aij , and hence, it follows that
\[
\frac{\partial}{\partial a_{ij}}\det(A) = \frac{\partial}{\partial a_{ij}}\left(\sum_{k=1}^{n} c_{kj}\, a_{kj}\right) = c_{ij}.
\]
This result can be used to derive an expression for the derivative of det (A(t)), where A ∶ ℝ → ℝn×n ,
i.e.
\[
\frac{d}{dt}\det(A(t)) = \sum_{i=1}^{n}\sum_{j=1}^{n} \frac{\partial \det(A(t))}{\partial a_{ij}(t)}\, \frac{d a_{ij}(t)}{dt} = \sum_{i=1}^{n}\sum_{j=1}^{n} c_{ij}\, \frac{d a_{ij}(t)}{dt}.
\]
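As a hedged numerical illustration, the sketch below checks this derivative formula on a made-up matrix function $A(t) = A_0 + tA_1$, using that the cofactor matrix of a nonsingular matrix equals $\det(A)$ times the inverse transpose:

```python
import numpy as np

# Sketch: check d/dt det(A(t)) = sum_ij c_ij * da_ij/dt for A(t) = A0 + t*A1.
A0 = np.array([[2.0, 1.0], [0.5, 3.0]])
A1 = np.array([[0.1, -0.3], [0.7, 0.2]])
A = lambda t: A0 + t * A1                        # so dA/dt = A1

t = 0.4
At = A(t)
cof = np.linalg.det(At) * np.linalg.inv(At).T    # cofactor matrix of A(t)
analytic = np.sum(cof * A1)

eps = 1e-6
numeric = (np.linalg.det(A(t + eps)) - np.linalg.det(A(t - eps))) / (2 * eps)
print(np.isclose(analytic, numeric))             # True
```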
Exercises
2.2 Show that the operator norm and the Frobenius norm on ℝm×n are orthogonally invariant,
i.e. if Q1 ∈ ℝm×m and Q2 ∈ ℝn×n are orthogonal matrices, then it holds that
2.3 Show that the infinity norm of a matrix A ∈ ℝm×n may be expressed as
2.4 Let x ∈ ℝn be a nonzero vector. Find a vector 𝑣 ∈ ℝn such that the Householder matrix
\[
H = I - 2\,\frac{vv^T}{v^Tv}
\]
maps x to ||x||2 e1 , i.e.
\[
Hx = x - 2\,\frac{v^Tx}{v^Tv}\, v = \|x\|_2 e_1.
\]
2.6 Show that the inverse of a nonsingular, lower-triangular Toeplitz matrix whose first column
is given by a0 , … , an−1 is another lower-triangular Toeplitz matrix.
2.7 Show that a lower-triangular Toeplitz matrix T of order n commutes with the lower shift
matrix S of order n, i.e. ST = TS.
2.8 Show that the trace of a symmetric matrix A ∈ 𝕊n is equal to the sum of its eigenvalues, i.e.
\[
\mathrm{tr}(A) = \sum_{i=1}^{n} \lambda_i,
\]
where $\lambda_1, \dots, \lambda_n$ are the eigenvalues of A.
2.9 Show that the smallest eigenvalue of X ∈ 𝕊n is greater than or equal to t ∈ ℝ if and only if
X − tI ∈ 𝕊n+ .
2.11 Show that a matrix Z ∈ ℝm×n has rank at most r if and only if there exist matrices X ∈ 𝕊m
and Y ∈ 𝕊n such that
\[
\mathrm{rank}\, X + \mathrm{rank}\, Y \le 2r, \qquad \begin{bmatrix} X & Z \\ Z^T & Y \end{bmatrix} \succeq 0.
\]
(a) Show that the product can be expressed as the Hankel matrix
\[
\begin{bmatrix}
h_1 & h_2 & h_3 & \dots & h_{N-1} & h_N \\
h_2 & h_3 & h_4 & \dots & h_N & h_{N+1} \\
h_3 & h_4 & h_5 & \dots & h_{N+1} & h_{N+2} \\
\vdots & \vdots & \vdots & \ddots & \vdots & \vdots \\
h_{N-1} & h_N & h_{N+1} & \dots & h_{2N-3} & h_{2N-2} \\
h_N & h_{N+1} & h_{N+2} & \dots & h_{2N-2} & h_{2N-1}
\end{bmatrix},
\]
where hk = CAk−1 B for k ∈ ℕ2N−1 are the first 2N − 1 so-called impulse response coeffi-
cients or Markov-parameters of the dynamical system.
(b) Show that the Markov parameters are invariant under a state-transformation as in
Exercise 2.12
(c) Suppose has full column rank, in which case, the dynamical system is said to be
observable. Moreover, assume that has full row rank, which means that the dynamical
system is controllable. What is then the rank of ?
(d) Suppose that instead of the matrices (A, B, C), we are given the Markov parameters
(h1 , h2 , … , h2N−1 ) for some N > n. Describe a numerical procedure for finding matri-
ces (A, B, C) such that hk = CAk−1 B for k ∈ ℕ2N−1 . How can you find (A, B, C) such that
the corresponding dynamical system is both observable and controllable?
Y = X + U,
Y Π⟂ = XΠ⟂ .
[ ]
X
(b) Assume that has full row rank and that has full column rank. Show that
U
( )
rank Y Π⟂ = n,
( )
Y Π⟂ = .
where L11 ∈ ℝq×q , L22 ∈ ℝp×p , and Q1 Q2 T = 0. Such a factorization can be obtained from
[ ]
a QR-factorization of U T Y T . Show that Y Π⟂ = L22 Q2 .
2.18 Let X ∈ ℝm×m and assume that det X > 0. Show that
\[
\frac{\partial}{\partial X} \ln \det X = X^{-T}.
\]
Hint: Use the chain rule and Jacobi’s formula.
2.22 Given a matrix A ∈ ℝm×n , the projection of a point x ∈ ℝm onto (AT ) can be expressed as
Px, where P = I − AA† is a projection matrix. If we assume that rank(A) = n, then P can be expressed as
\[
P = I - A(A^TA)^{-1}A^T.
\]
Now, suppose that A is a function of a scalar parameter t.
(a) Show that
\[
\frac{dP}{dt} = -P\frac{dA}{dt}A^{\dagger} - (A^{\dagger})^T\frac{dA^T}{dt}P.
\]
(b) Suppose rank(A) = n and let A = Q1 R1 be a reduced QR decomposition of A, i.e.
Q1 ∈ ℝm×n and R1 ∈ ℝn×n . Show that the projection Px and the derivative dP∕dt can be
evaluated efficiently using such a QR decomposition of A without explicitly computing
A† = (AT A)−1 AT .
3 Probability Theory
In this chapter, we will discuss the basics of probability theory. It is a branch of mathematics where
uncertain events are given a number between zero and one to describe how likely they are. Loosely
speaking this number should be close to the relative frequency with which the event occurs when
it is repeated many times. As an example, we may consider throwing a fair dice one hundred
times and recording how many times a one occurs. If it occurs 18 times, the relative frequency
is 18∕100 = 0.18. This is close to the theoretical value of the probability which is 1/6. The reason
we know it should be 1/6 is that all the possible six outcomes of the experiment should have the
same probability if the dice is fair. In case a probability is one, we are almost sure the event will
occur, and if it is zero, we are almost sure it will not occur.
The roots of probability theory go back to the Arab mathematician Al-Khalil who studied
cryptography. Initially, probability theory only considered combinatorial problems. The theory is
much easier for this case as compared to the case when the number of events is not countable.
Mathematicians struggled for many years to provide a solid foundation, and it was not until in
1933 when Andrey Nikolaevich Kolmogorov made an axiomatic definition of probabilities that
the problem was resolved, and modern probability theory was born. We are however not going
to provide the details of the measure theoretic foundations of probability theory in this chapter;
the interested reader is referred to, e.g. [98]. The presentation given here is more in line with [48].
Probability theory is the foundation for statistics and learning and used in many other branches of
science.
A probability space is defined by a triplet (Ω, , ℙ). Here Ω is called the sample space, and it is
a set that contains all possible outcomes of an experiment. When throwing a dice, we can take
Ω = {1, 2, 3, 4, 5, 6}. However, if we are only interested in if the number is odd or even, we could
instead take the sample space to be Ω = {odd, even}. For other experiments, it may be more appro-
priate to have Ω = ℝ. An example of this is when the experiment is the error with which we measure
something. The sample space could also contain vectors. If we throw two dice, it is appropri-
ate to consider Ω = {1, 2, 3, 4, 5, 6} × {1, 2, 3, 4, 5, 6}. We may sometimes have infinite-dimensional
vectors, e.g. Ω = ℝℤ , i.e. the set of real-valued functions defined on the integers, cf . the nota-
tion section. Here the dimensions are countable, and ℤ might be interpreted as a set of discrete
time indices. It could also be the case that Ω = ℝℝ , in which case, the sample space contains all
real-valued functions of a real variable. Sample spaces containing functions have applications in
control and signal processing. We also encounter examples where $\Omega = \mathbb{R}^{\mathbb{R}^n}$, i.e. the sample space
contains all real-valued functions defined on ℝn , which has applications in so-called “Gaussian
processes” which we will discuss in more detail in Chapter 9.
The second element of a probability space should contain all events that we are interested in
assigning probabilities to. This is a set of subsets of Ω, and it should be a so-called 𝜎-algebra, i.e. the
following properties must hold
1. Ω ∈
2. Ω∖A ∈ , ∀A ∈
3. ∀A1 , A2 , … ∈ ⇒ $\bigcup_{i=1}^{\infty} A_i$ ∈ .
The latter two conditions say that is closed under complement and under countable unions.
The difference between algebra and 𝜎-algebra is that for an algebra, only finite unions are consid-
ered in the last condition. For finite sample spaces, there is no difference since is then also a finite
set. It can then at most contain all subsets of Ω. The smallest possible 𝜎-algebra is = {Ω, ∅}.
Example 3.1 Let us again consider the example of throwing a fair dice, and we take
Ω = {1, 2, 3, 4, 5, 6}. We are then interested in the events odd and even. Hence, we define
the sets Aodd = {1, 3, 5}, and Aeven = {2, 4, 6}, which are the sets containing the odd and even num-
bers, respectively. We may then take the 𝜎-algebra to be = {Aodd , Aeven , Ω, ∅}. It is straightforward
to verify that this is a 𝜎-algebra.
When we carry out an experiment, like throwing a fair dice, we say that we observe an outcome
of the experiment, and this will be an element of the 𝜎-algebra, e.g. the number is either odd or even
as in the example above. We then say that this is the outcome of the experiment or the observation
of the experiment.
The first condition is that ℙ is normalized, and the second condition is that ℙ is what is called
𝜎-additive. Well-known properties such as ℙ[A ∪ B ] = ℙ[A] + ℙ[B ] − ℙ[A ∩ B ] all follow from
the above axioms, see Exercise 3.1. Notice that when Ω contains uncountably many elements, say
Ω = ℝ, it is not possible to take to contain all subsets of Ω and define ℙ to satisfy the second
condition above. It means that it is not possible to consider any subsets of Ω as the events of the
experiment in a meaningful way. This puzzled the mathematicians for many years. One usually
restricts oneself to the so-called Borel 𝜎-algebra, which is the smallest 𝜎-algebra that contains all
intervals of ℝ. All sets in the Borel 𝜎-algebra can be formed from open sets, or equivalently, from
closed sets, through the operations of countable union, countable intersection, and relative com-
plement. It then follows that it is possible to satisfy the second condition in the definition of the
probability measure. We will not discuss how to generalize this to more complicated sample spaces
like Ω = ℝℝ .
Example 3.2 We consider the case of Ω = ℕn , and we define the probability function p ∶ ℕn →
[0, 1] as
\[
p(k) = \begin{cases} \dfrac{e^{z_k}}{1 + \sum_{l=1}^{n-1} e^{z_l}}, & k \in \mathbb{N}_{n-1}, \\[2ex] \dfrac{1}{1 + \sum_{l=1}^{n-1} e^{z_l}}, & k = n, \end{cases}
\]
for given zk ∈ ℝ, k ∈ ℕn−1 , which is called the categorical probability function. We will see that this
is used in what is called logistic regression analysis. It is straightforward to verify that this is a valid
probability function for any values of zk .
The categorical probability function is an example of a finite probability function since the set ℕn
is finite. When p is a function of an integer, we often write pk instead of p(k). Then we may also use
a vector p = (p1 , … , pn ) ∈ [0, 1]n instead of a function p to describe the probability function.
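A short Python sketch of this probability function, with made-up parameter values z, could be:

```python
import numpy as np

# Sketch of the categorical probability function from Example 3.2: the first
# n-1 probabilities are parameterized by z_1, ..., z_{n-1}, and p_n takes the rest.
def categorical_probs(z):
    denom = 1.0 + np.sum(np.exp(z))
    return np.append(np.exp(z), 1.0) / denom

p = categorical_probs(np.array([0.5, -1.0, 2.0]))   # here n = 4
print(p, p.sum())                                    # entries in [0, 1], summing to 1
```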
\[
\mathbb{P}[A] = \int_A dF(\omega).
\]
When F is differentiable,¹ the derivative f ∶ ℝ → ℝ+ of F exists, and we may write
\[
\mathbb{P}[A] = \int_A f(\omega)\, d\omega.
\]
The function f is called the probability density function (pdf). It is straightforward to generalize the
results to Ω = ℝn .
We defer giving examples of pdfs until we have introduced random variables.
We are now interested in the probability of an event A if we know that another event B has
happened. As an example, we may be interested in the probability that we obtain a one when
we throw a fair dice if we already know that the outcome of the experiment was an odd number,
i.e. either one, three, or five. In this example, A = {1} and B = {1, 3, 5}. The whole sample space is
Ω = {1, 2, 3, 4, 5, 6}. Clearly, we can look at a smaller sample space defined by B, i.e. what we know
1 It is actually enough to assume that F is absolutely continuous, and then the derivative exists almost everywhere.
has happened, and then we investigate how frequent A is in this sample space, and we obtain the
probability 1∕3, i.e. A contains one of the three elements in B, which are all equally likely.
Another way of computing this value is to go back to the original sample space Ω and compute
\[
\frac{\mathbb{P}[A \cap B]}{\mathbb{P}[B]} = \frac{1/6}{3/6} = \frac{1/6}{1/2} = \frac{1}{3},
\]
i.e. we normalize the probability of both events occurring with the probability of the event that we
know has occurred. If ℙ[B ] > 0, we define the conditional probability that A occurs given that B
has occurred as
\[
\mathbb{P}[A \mid B] = \frac{\mathbb{P}[A \cap B]}{\mathbb{P}[B]}.
\]
From this it immediately follows that
\[
\mathbb{P}[A \cap B] = \mathbb{P}[A \mid B]\,\mathbb{P}[B] = \mathbb{P}[B \mid A]\,\mathbb{P}[A], \tag{3.1}
\]
and hence, it follows that
\[
\mathbb{P}[A \mid B] = \frac{\mathbb{P}[A]\,\mathbb{P}[B \mid A]}{\mathbb{P}[B]}, \tag{3.2}
\]
for any A, B ∈ such that ℙ[A] > 0 and ℙ[B ] > 0. This is called Bayes’ theorem.
Let Ai ∈ , i ∈ ℕn be pairwise disjoint sets such that Ω = ∪ni=1 Ai . Then for any X ∈ , it holds
that
\[
\mathbb{P}[X] = \sum_{i=1}^{n} \mathbb{P}[A_i]\,\mathbb{P}[X \mid A_i], \tag{3.3}
\]
which is called the formula of total probability. This is proven in Exercise 3.3.
Moreover, by (3.1), we have $\mathbb{P}[A_i \cap X] = \mathbb{P}[A_i]\,\mathbb{P}[X \mid A_i]$. From what has been said above and
from (3.2) it follows that
\[
\mathbb{P}[A_i \mid X] = \frac{\mathbb{P}[A_i]\,\mathbb{P}[X \mid A_i]}{\sum_{j=1}^{n} \mathbb{P}[A_j]\,\mathbb{P}[X \mid A_j]}, \quad i \in \mathbb{N}_n.
\]
Here we have tacitly assumed that all involved events have nonzero probability.
Example 3.3 In a factory, the same items are manufactured at three different machines with a
proportion given by 15% for machine 1, 45% for machine 2, and 40% for machine 3. The different machines produce defective items with probabilities 0.05, 0.04, and 0.03, respectively. Customers
obtain a perfect mix of items from the different machines. We denote by A1 , A2 , and A3 the events
that an item is manufactured by machine 1, 2, and 3, respectively, and we denote by X the event
that an item in the mix sent to customers is defective. Then, by the formula of total probability, we have that
\[
\mathbb{P}[X] = \sum_{i=1}^{3} \mathbb{P}[A_i]\,\mathbb{P}[X \mid A_i] = 0.15 \times 0.05 + 0.45 \times 0.04 + 0.40 \times 0.03 = 0.0375.
\]
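This computation is easy to reproduce numerically; the last line of the sketch below also anticipates the posterior probabilities given by Bayes' theorem (cf. Exercise 3.4):

```python
import numpy as np

# The numbers from Example 3.3: machine proportions P[A_i] and defect rates P[X | A_i].
prior = np.array([0.15, 0.45, 0.40])
p_defect = np.array([0.05, 0.04, 0.03])

p_x = np.sum(prior * p_defect)          # formula of total probability
posterior = prior * p_defect / p_x      # P[A_i | X] by Bayes' theorem
print(p_x)                              # 0.0375
print(posterior)                        # [0.2, 0.48, 0.32]
```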
3.3 Independence
If it had not been for the concept of independence, probability theory would not have been a separate
branch of mathematics, but instead just an example of measure theory. We have that the occurrence
of the event B ∈ changes the probability of the event A ∈ to occur from ℙ[A] to ℙ[A ∣ B ].
However, if ℙ[A ∣ B ] = ℙ[A], then this is not the case. An equivalent condition to this is by (3.1)
that
ℙ[A ∩ B ] = ℙ[A] ℙ[B ] ,
and this is what we take as definition of independence of A and B. Notice that this also implies that
ℙ[B ∣ A] = ℙ[B ]. The definition of independence is valid also when ℙ[A] and/or ℙ[B ] are zero.
The relation to conditional probabilities requires that they are positive.
We can also consider a family of events, i.e. = {A1 , … , An } ⊆ . We say that this family is
independent if
\[
\mathbb{P}\left[\bigcap_{i \in J} A_i\right] = \prod_{i \in J} \mathbb{P}[A_i],
\]
for all J ⊆ ℕn . If the family has the property that
\[
\mathbb{P}[A_i \cap A_j] = \mathbb{P}[A_i]\,\mathbb{P}[A_j], \quad \forall\, i \ne j,
\]
we say that the family is pairwise independent, or that the events in the family are pairwise
independent. Independence of the family implies pairwise independence, but the converse is not
necessarily true.
For a family of independent events, the probability that at least one of them happens is given by
\[
\mathbb{P}\left[\bigcup_{i=1}^{n} A_i\right] = 1 - \prod_{i=1}^{n}\left(1 - \mathbb{P}[A_i]\right), \tag{3.4}
\]
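A small Monte Carlo sketch, with made-up probabilities, illustrates (3.4):

```python
import numpy as np

# Sketch: probability that at least one of n independent events occurs, cf. (3.4).
p = np.array([0.1, 0.2, 0.05, 0.3])
exact = 1.0 - np.prod(1.0 - p)

rng = np.random.default_rng(1)
occurred = rng.random((200_000, p.size)) < p     # independent Bernoulli draws
mc = np.mean(occurred.any(axis=1))
print(exact, mc)                                 # the estimate is close to 0.5212
```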
as the distribution function associated with (Ω, , ℙ). Since the sample space can be countable, we
realize that we have now actually defined distribution functions for countable sample spaces as
well. Notice that we never did this explicitly when we talked about countable sample spaces above
in this chapter. The reason was that we had not made any assumption on partial ordering of Ω,2
cf . (4.23a).
For the special case, where = Ω and X(𝜔) = 𝜔, it is in some sense unnecessary to consider X
to be a function, and then we will often just write X ∈ . We will then most often not make any
specific reference to the 𝜎-algebra or the probability measure ℙ either. Instead, we just specify
the distribution function F, or the probability function p or probability density function f defined
on . We tacitly assume that there is a well defined underlying probability measure and 𝜎-algebra.
This will in most cases be sufficient for our purposes.
Example 3.4 We say that a random variable X ∶ ℝn → ℝn defined as X(𝜔) = 𝜔 with pdf
f ∶ ℝn → ℝ+ given by
\[
f(x) = \frac{1}{\sqrt{(2\pi)^n \det(\Sigma)}}\, \exp\left(-\frac{1}{2}(x - \mu)^T\Sigma^{-1}(x - \mu)\right), \tag{3.5}
\]
with 𝜇 ∈ ℝn and Σ ∈ 𝕊n++ has a Gaussian or normal distribution.3 When we want to empha-
size the dependence on the parameters 𝜇 and Σ we use ∶ ℝn × ℝn × 𝕊n++ → ℝ+ defined as
(x, 𝜇, Σ) = f (x).
where F is the distribution function for X = (X1 , X2 ), sometimes called the joint distribution func-
tion. This trivially generalizes to higher dimensions. When F is differentiable, it can be shown that
the marginal probability density functions satisfy
2 For any countable sample space, we can always introduce a partial ordering by associating each element of Ω with
an element in ℤn+ for some n.
3 We remark that it is the random variable that has a Gaussian distribution and that f is not a Gaussian distribution
but a Gaussian pdf.
where f is the joint probability density function. For discrete-valued random variables, similar
formulas hold, but then involving summations instead of integrals. For n-dimensional random
variables, it is possible to look at marginal pdfs of dimension n1 < n by integrating or summing
over the remaining n2 = n − n1 variables.
Example 3.5 Consider a Gaussian random variable Z = (X, Y ) with pdf (z, 𝜇, Σ) for which
\[
\mu = \begin{bmatrix} \mu_x \\ \mu_y \end{bmatrix} \in \mathbb{R}^{m+n}, \qquad
\Sigma = \begin{bmatrix} \Sigma_x & \Sigma_{xy} \\ \Sigma_{xy}^T & \Sigma_y \end{bmatrix} \in \mathbb{S}_{++}^{m+n}.
\]
It is straightforward to show by integration that X and Y are also Gaussian random variables with
marginal pdfs (x, 𝜇x , Σx ) and (y, 𝜇y , Σy ), respectively.
Here F is the joint distribution function for (X, Y ), and FX and FY the marginal distribution
functions. Similarly, f is the joint pdf, and fX and fY are the marginal pdfs. This result also
trivially generalizes to discrete random variables. When we discuss independence of n > 2 random
variables, we realize that we define this as independence of n events, and that the criteria in terms
of distribution functions and probability density functions are that we can factorize them in n
factors, where these factors are the marginals.
When g is not invertible, obtaining the distribution function for g(X) is more cumbersome.
For continuous random variables, e.g. when = = ℝ, it holds that
\[
F_Y(y) = \mathbb{P}\left[g(X) \le y\right] = \int_{\{x \,\mid\, g(x) \le y\}} f(x)\, dx,
\]
for a small dx > 0, where fX,Y ∶ × → ℝ+ is the joint pdf for (X, Y ) and where fX ∶ → ℝ+ is
the marginal pdf for X. As dx goes to zero, we obtain ℙ[Y ≤ y ∣ X = x ], and hence, we define the
conditional distribution function FY |X ∶ → [0, 1] as
\[
F_{Y|X}(y \mid x) = \int_{-\infty}^{y} \frac{f_{X,Y}(x, v)}{f_X(x)}\, dv.
\]
The conditional probability density function fY |X ∶ → ℝ+ is given by
\[
f_{Y|X}(y \mid x) = \frac{f_{X,Y}(x, y)}{f_X(x)}. \tag{3.7}
\]
Hence, we also obtain the same formula for continuous random variables.
Example 3.6 Consider the case when (X, Y ) is jointly normal, i.e. the pdf is given by
\[
f_{X,Y}(z) = \frac{1}{\sqrt{(2\pi)^{m+n}\det(\Sigma)}}\, \exp\left(-\frac{1}{2}(z - \mu)^T\Sigma^{-1}(z - \mu)\right),
\]
where z = (x, y), 𝜇 = (𝜇x , 𝜇y ), and where
\[
\Sigma = \begin{bmatrix} \Sigma_x & \Sigma_{xy} \\ \Sigma_{xy}^T & \Sigma_y \end{bmatrix}.
\]
From (2.49) we have that
\[
\begin{bmatrix} \Sigma_x & \Sigma_{xy} \\ \Sigma_{xy}^T & \Sigma_y \end{bmatrix} =
\begin{bmatrix} I & 0 \\ \Sigma_{xy}^T \Sigma_x^{-1} & I \end{bmatrix}
\begin{bmatrix} \Sigma_x & 0 \\ 0 & \Sigma_y - \Sigma_{xy}^T \Sigma_x^{-1}\Sigma_{xy} \end{bmatrix}
\begin{bmatrix} I & \Sigma_x^{-1}\Sigma_{xy} \\ 0 & I \end{bmatrix}.
\]
Notice that
\[
\begin{bmatrix} I & \Sigma_x^{-1}\Sigma_{xy} \\ 0 & I \end{bmatrix}^{-1} =
\begin{bmatrix} I & -\Sigma_x^{-1}\Sigma_{xy} \\ 0 & I \end{bmatrix}.
\]
This factorizes the above pdf as fX,Y (z) = fX (x)fY |X (y|x), where
\[
f_X(x) = \frac{1}{\sqrt{(2\pi)^m \det \Sigma_x}}\, e^{-\frac{1}{2}(x - \mu_x)^T\Sigma_x^{-1}(x - \mu_x)},
\]
and where
\[
f_{Y|X}(y \mid x) = \frac{1}{\sqrt{(2\pi)^n \det \Sigma_{y|x}}}\, e^{-\frac{1}{2}(y - \mu_{y|x})^T\Sigma_{y|x}^{-1}(y - \mu_{y|x})},
\]
where
\[
\mu_{y|x} = \mu_y + \Sigma_{xy}^T\Sigma_x^{-1}(x - \mu_x), \qquad \Sigma_{y|x} = \Sigma_y - \Sigma_{xy}^T\Sigma_x^{-1}\Sigma_{xy}.
\]
From Example 3.5, we see that fX is the marginal pdf for X. Hence, it holds by (3.7) that fY |X (y|x) is
the conditional pdf for Y , given X = x. Notice that Σy|x is the Schur complement of Σx in Σ.
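The conditioning formulas translate into a few lines of NumPy; the covariance blocks below are arbitrary stand-ins:

```python
import numpy as np

# Sketch: conditional mean and covariance of Y given X = x_obs for a jointly
# Gaussian (X, Y), following Example 3.6.
mu_x, mu_y = np.array([1.0, 0.0]), np.array([2.0])
Sigma_x = np.array([[2.0, 0.3], [0.3, 1.0]])
Sigma_xy = np.array([[0.5], [-0.2]])
Sigma_y = np.array([[1.5]])

x_obs = np.array([1.5, -0.5])
mu_cond = mu_y + Sigma_xy.T @ np.linalg.solve(Sigma_x, x_obs - mu_x)
Sigma_cond = Sigma_y - Sigma_xy.T @ np.linalg.solve(Sigma_x, Sigma_xy)  # Schur complement
print(mu_cond, Sigma_cond)
```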
3.6 Expectations
Let us assume that we are interested in estimating how frequent a certain event A ⊆ Ω is when
the experiment is repeated. Hence, it is enough to define = {Ω, ∅, A, Ac }, where Ac = Ω∖A. Let
us assume that ℙ[A] = p and that ℙ[Ac ] = 1 − p. We then define the random variable X ∶ Ω →
{0, 1} as X(𝜔) = 1 when 𝜔 ∈ A and X(𝜔) = 0 when 𝜔 ∉ A. If we repeat the experiment N times,
it is reasonable to estimate the relative frequency of A with the sample average approximation
(SAA)
\[
\frac{1}{N}\sum_{i=1}^{N} X(\omega_i), \tag{3.8}
\]
N i=1
where 𝜔i is the outcome of the ith experiment. We realize that this quantity is very close to
[ ]
0 × ℙ[{𝜔 ∶ X(𝜔) = 0}] + 1 × ℙ[{𝜔 ∶ X(𝜔) = 1}] = 0 × ℙ Ac + 1 × ℙ[A] = p.
Inspired by this, we define the expected value of any discrete random variable X ∶ Ω → ⊆ ℝ as
\[
\mathbb{E}[X] = \sum_{x} x\, p(x),
\]
where p(x) = ℙ[{𝜔 ∶ X(𝜔) = x}].4 The expected value is a functional that maps random variables defined on Ω to the real numbers.
We understand that the expected value is close to the sample average of the random variable for
large values of N. For this reason, we sometimes call the expected value of a random variable the
mean of the random variable. Sometimes, we write 𝔼X instead of 𝔼[X ] to ease the notation.
For continuous random variables X ∶ Ω → ℝ, we define the expected value as
\[
\mathbb{E}[X] = \int_{\mathbb{R}} x f(x)\, dx,
\]
where f is the pdf of the random variable. The generalization to vector-valued random variables is
straightforward. We remark that expected values might be infinite.
3.6.1 Moments
For any scalar-valued random variable X, we define the kth moment of X as
\[
m_k = \mathbb{E}\left[X^k\right],
\]
and the kth central moment of X as $\mu_k = \mathbb{E}\left[(X - m_1)^k\right]$. The moment m1 is the expectation, also called the mean, of X, and 𝜇2 is called the variance of X.
The variance measures the amount by which X tends to deviate from its average. It is often also
denoted by 𝜎 2 or Var[X]. We sometimes write VarX to ease the notation. The positive square root
of the variance, 𝜎, is called the standard deviation. It is straightforward to show that 𝜇2 = m2 − m21 .
4 We need that the sum in the definition of the expectation is absolutely convergent, since we do not want the result
to depend on the order in which we carry out the summation.
For two scalar-valued random variables X and Y with joint pdf fX,Y , we may consider the function
g ∶ ℝ2 → ℝ defined as g(x, y) = x. It then holds that
\[
\mathbb{E}[g(X, Y)] = \int_{\mathbb{R}^2} x f_{X,Y}(x, y)\, dx\, dy = \int_{\mathbb{R}} x f_X(x)\, dx = \mathbb{E}[X],
\]
with fX being the marginal pdf for X. The result trivially generalizes to several random variables.
This shows that we do not need to define the marginal pdf to carry out the expectation computation.
This also means that we never have to distinguish between different definitions of the expectation
functional. It is sufficient to consider the joint pdf for all relevant random variables involved when
defining the expectation functional, independent of how many random variables there are.
3.6.3 Covariance
For two scalar-valued random variables X and Y , the product XY is a special case of a function g of
the two-dimensional random variable (X, Y ). Hence, the expected value of XY is given by
Example 3.7 For a random vector with a Gaussian pdf as in Example 3.4, it holds that the
expected value is 𝜇 and that the covariance is Σ. This result is shown in Exercise 3.8.
\[
\mathbb{E}[\Psi(X)] = \sum_{x}\left(\sum_{y} y\, p_{Y|X}(y \mid x)\right) p_X(x) = \sum_{y} y \sum_{x} p_{X,Y}(x, y) = \sum_{y} y\, p_Y(y) = \mathbb{E}[Y],
\]
where pX,Y ∶ × → [0, 1] is the joint probability function for (X, Y ), and where pX ∶ → [0, 1]
is the marginal probability function for X. The same result holds for continuous random variables.
We often write Ψ(X) = 𝔼[Y | X], and hence, the result can be summarized as
𝔼[𝔼[Y | X]] = 𝔼[Y ] .
This formula is sometimes useful, when 𝔼[Y ] is difficult to compute directly.
Given a probability space (Ω, , ℙ), we consider random variables X1 , X2 , … and X defined on this
probability space and are interested in investigating convergence of Xk to X as k → ∞. To this end,
we will define four different modes of convergence:
(a) $X_k \to X$ almost surely ($X_k \xrightarrow{\text{a.s.}} X$) if $\mathbb{P}\left[\omega \in \Omega \mid X_k(\omega) \to X(\omega),\ k \to \infty\right] = 1$.⁵
(b) $X_k \to X$ in $r$th mean ($X_k \xrightarrow{r} X$) if $\mathbb{E}\left[|X_k - X|^r\right] \to 0$ as $k \to \infty$.⁶
(c) $X_k \to X$ in probability ($X_k \xrightarrow{P} X$) if $\mathbb{P}\left[|X_k - X| > \epsilon\right] \to 0$ as $k \to \infty$ for all $\epsilon > 0$.
(d) $X_k \to X$ in distribution ($X_k \xrightarrow{D} X$) if $\mathbb{P}\left[X_k \le x\right] \to \mathbb{P}[X \le x]$ as $k \to \infty$ for all points $x$ at which $F_X(x) = \mathbb{P}[X \le x]$ is continuous.
The following implications hold for the different modes of convergence:
1. (a) ⇒ (c)
2. (b) ⇒ (c)
3. (c) ⇒ (d)
4. If $r > s \ge 1$, then $X_k \xrightarrow{r} X \Rightarrow X_k \xrightarrow{s} X$.
Without any further assumptions, no other implications hold.
Let X1 , X2 , … be independent identically distributed (i.i.d.) random variables with mean m. Then
\[
\frac{1}{n}\sum_{i=1}^{n} X_i \xrightarrow{\text{a.s.}} m, \quad n \to \infty,
\]
if and only if $\mathbb{E}\left[|X_1|\right] < \infty$. This result is known as the strong law of large numbers. If we assume that
$\mathbb{E}\left[X_1^2\right] < \infty$, then convergence holds both almost surely and in mean square. This assumption is a
sufficient condition for the strong law of large numbers. There is also a weak law of large numbers,
which is related to convergence in probability; we refer the reader to [48] for details.
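A quick simulation sketch of the strong law of large numbers, with an arbitrary i.i.d. distribution, can be helpful to build intuition:

```python
import numpy as np

# Sketch: the running sample average of i.i.d. variables approaches their mean.
rng = np.random.default_rng(2)
x = rng.exponential(scale=2.0, size=100_000)          # i.i.d. with mean 2
running_mean = np.cumsum(x) / np.arange(1, x.size + 1)
print(running_mean[[99, 9_999, 99_999]])              # drifts toward 2 as n grows
```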
5 This mode of convergence has two other notations, which are $X_k \to X$ almost everywhere ($X_k \xrightarrow{\text{a.e.}} X$) and $X_k \to X$ with probability one (w.p. 1).
6 When $r = 1$, we say that $X_k \to X$ in mean, and when $r = 2$, we say that $X_k \to X$ in mean square ($X_k \xrightarrow{\text{m.s.}} X$).
X = (X(𝜔, 0), X(𝜔, 1), …) and use the notation Xk (𝜔) = X(k, 𝜔) to ease the notation. We may inter-
pret the random variable X as an infinite sequence of random variables Xk ∶ Ω → , k ∈ ℤ+ , and
we may interpret k ∈ ℤ+ as a discrete time-index. Such a random variable X is often called a
discrete-time random process. Sometimes one says stochastic process instead of random process.
The set could be a finite set like ℕn , a countable set like ℤ or an uncountable set like ℝ. It may
also be vector-valued.
We may observe the evolution of a random process in two different ways. For each fixed outcome
𝜔 ∈ Ω, we obtain a realization or sample path X(𝜔) of X at 𝜔. We can study the properties of this
sample path. Another way of viewing the random process is to investigate a finite subset of compo-
nents of the infinite-dimensional vector X, say K = {k1 , k2 , … , kn } ⊂ ℤ+ . We then look at the joint
distribution function FK ∶ n → [0, 1] defined as
\[
F_K(x) = \mathbb{P}\left[X_{k_1} \le x_1, \dots, X_{k_n} \le x_n\right].
\]
The collection {FK } where K ranges over all finite-dimensional K ⊂ ℤ+ is called the collection of
finite-dimensional distributions (fdds) of X or the name of X. This contains all the information that
is available about X from finitely many components Xk . We mention that knowing the fdds does
not in general provide a complete information about the sample paths. However, we will be content
by studying only properties of the sample path that can be deduced from the fdds.
If we define $X : \Omega \to \mathbb{R}^{\mathbb{R}_+}$, we obtain a continuous-time random process. We then often write X(t)
when it is convenient to make the dependence on t ∈ ℝ+ explicit. The fdds are often denoted FT ,
where T is now a finite subset of ℝ+ . An even more general concept is a random field, which is
obtained when $X : \Omega \to \mathbb{R}^{\mathbb{R}^n}$.
We say that a discrete-time random process is strongly stationary if {Xk1 , … , Xkn } and
{Xk1 +l , … , Xkn +l } have the same joint distribution for all k1 , … , kn and l > 0. We say that a
discrete-time random process is weakly stationary if $\mathbb{E}[X_{k_1}] = \mathbb{E}[X_{k_2}]$ and $\mathrm{Cov}[X_{k_1}, X_{k_2}] =$
Cov[Xk1 +l , Xk2 +l ] for all k1 , k2 and l > 0. In other words, a random process is weakly stationary if
and only if it has a constant mean and the autocovariance function c ∶ ℤ2+ → ℝ given by
c(k, k + l) = Cov[Xk , Xk+l ],
satisfies
c(k, k + l) = c(0, l),
for all k and l ≥ 0. Thus, for weakly stationary processes, we may define the autocovariance function
as a function of only l and write c ∶ ℤ+ → ℝ.
Strong stationarity implies weak stationarity, but the converse is not true in general. One example
where strong stationarity is equivalent to weak stationarity is when the fdds are all Gaussian.
The definitions of weak and strong stationarity for a continuous-time random process are similar
as for a discrete-time random process, and this also goes for a random field.
We will now discuss a generalization of the law of large numbers where the sequence of random
variables Xk are a stationary process, not necessarily i.i.d. If Xk , k ≥ 1, is a strongly stationary process
with 𝔼|X1 | < ∞, then there exists a random variable Y with the same mean as X1 such that
\[
\frac{1}{n}\sum_{k=1}^{n} X_k \to Y \quad \text{a.s. and in mean.}
\]
If instead Xk , k ≥ 1, is a weakly stationary process with 𝔼|X1 | < ∞, then there exists a random
variable Y with the same mean as X1 such that
\[
\frac{1}{n}\sum_{k=1}^{n} X_k \to Y \quad \text{in mean square.}
\]
These results are called the strong ergodic theorem and the weak ergodic theorem, respectively.
a filtration problem or state estimation problem. We will do the derivation for = ℝn and = ℝp .
The derivation for finite sets is similar and obtained by considering probability functions instead
of pdfs and by replacing integrals with summations.
Let X̄ k = (X0 , … , Xk ) and let pX̄ k ,Ȳ k ∶ k+1 × k+1 → ℝ+ be the joint pdf for (X̄ k , Ȳ k ). We also need
the conditional pdf for Ȳ k given X̄ k : pȲ k |X̄ k ∶ k+1 × k+1 → ℝ+ . Using the conditional indepen-
dence assumption, this can be expressed as
\[
p_{\bar{Y}_k \mid \bar{X}_k}(\bar{y} \mid \bar{x}) = \prod_{i=0}^{k} p_{Y_i \mid X_i}(y_i \mid x_i),
\]
where pYi |Xi ∶ × → ℝ+ are the conditional pdfs for Yk given Xk . We also define the marginal
pdf for X̄ as pX̄ ∶ N+1 → ℝ+ .
We start by obtaining an expression for pX0 |Y0 (x0 |y0 ), i.e.
\[
p_{X_0 \mid Y_0}(x_0 \mid y_0) = \frac{p_{Y_0 \mid X_0}(y_0 \mid x_0)\, p_{X_0}(x_0)}{p_{Y_0}(y_0)},
\]
where
\[
p_{Y_0}(y_0) = \int p_{X_0, Y_0}(x_0, y_0)\, dx_0 = \int p_{Y_0 \mid X_0}(y_0 \mid x_0)\, p_{X_0}(x_0)\, dx_0.
\]
In this section, we do not write out the set over which we integrate when it is the whole domain of the functions involved. Now, assume that we know pXk−1 |Ȳ k−1 (xk−1 |̄yk−1 ), where ȳ k = (y0 , … , yk ).
This assumption is true for k = 1. It then follows that
pXk , Xk−1 |Ȳ k−1 (xk , xk−1 |̄yk−1 ) = pXk |Xk−1 ,Ȳ k−1 (xk |xk−1 , ȳ k−1 )pXk−1 |Ȳ k−1 (xk−1 |̄yk−1 ),
= pXk |Xk−1 (xk |xk−1 )pXk−1 |Ȳ k−1 (xk−1 |̄yk−1 ),
where we have made use of the conditional independence property
pXk |Xk−1 ,Ȳ k−1 (xk |xk−1 , ȳ k−1 ) = pXk |Xk−1 (xk |xk−1 ).
This is proven in Exercise 3.16. Integrating over xk−1 results in the following Chapman–Kolmogorov
equation:
\[
p_{Y_k \mid \bar{Y}_{k-1}}(y_k \mid \bar{y}_{k-1}) = \int p_{Y_k, X_k \mid \bar{Y}_{k-1}}(y_k, x_k \mid \bar{y}_{k-1})\, dx_k = \int \frac{p_{\bar{Y}_k, X_k}(\bar{y}_k, x_k)}{p_{\bar{Y}_{k-1}}(\bar{y}_{k-1})}\, dx_k,
\]
\[
= \int p_{Y_k \mid X_k, \bar{Y}_{k-1}}(y_k \mid x_k, \bar{y}_{k-1})\, p_{X_k \mid \bar{Y}_{k-1}}(x_k \mid \bar{y}_{k-1})\, dx_k = \int p_{Y_k \mid X_k}(y_k \mid x_k)\, p_{X_k \mid \bar{Y}_{k-1}}(x_k \mid \bar{y}_{k-1})\, dx_k.
\]
The last equality follows from the conditional independence property. Thus, we may summarize the optimal filtering equations as
\[
p_{X_k \mid \bar{Y}_{k-1}}(x_k \mid \bar{y}_{k-1}) = \int p_{X_k \mid X_{k-1}}(x_k \mid \xi)\, p_{X_{k-1} \mid \bar{Y}_{k-1}}(\xi \mid \bar{y}_{k-1})\, d\xi, \tag{3.11}
\]
where $\bar{H}_k = R_1 + A\Sigma_k A^T$ and $\bar{K}_k = \Sigma_k A^T \bar{H}_k^{-1}$. From the last equality, we obtain
\[
\int p_{X_k \mid X_{k-1}}(x_k \mid x_{k-1})\, p_{X_{k-1} \mid \bar{Y}_{k-1}}(x_{k-1} \mid \bar{y}_{k-1})\, dx_{k-1} = (x_k,\, A x^a_{k-1},\, \bar{H}_{k-1}),
\]
and we obtain from (3.13) and (3.11), the update formula
\[
x^f_k = A x^a_{k-1}, \qquad \Sigma^f_k = \bar{H}_{k-1} = R_1 + A\Sigma_{k-1}A^T. \tag{3.15}
\]
We can, if we like, eliminate $x_k$ and $\Sigma_k$ from (3.14) and (3.15) to obtain
\[
x^f_{k+1} = (A - \tilde{K}_k C)\, x^f_k + \tilde{K}_k y_k,
\]
\[
\Sigma^f_{k+1} = A\Sigma^f_k A^T + R_1 - A\Sigma^f_k C^T\left(R_2 + C\Sigma^f_k C^T\right)^{-1} C\Sigma^f_k A^T,
\]
where $\tilde{K}_k = A\Sigma^f_k C^T\left(R_2 + C\Sigma^f_k C^T\right)^{-1}$. The initial values are $x^f_0 = \bar{x}_0$ and $\Sigma^f_0 = R_0$. The Kalman filter
is summarized in Algorithm 3.1. We see that the algorithm consists of two main steps. In the first
one, the measurement $y_k$, the matrix C, and the covariance $R_2$ are used to update the old predicted estimate $x^f_k$ and its covariance. In the second step, the matrix A and the covariance matrix $R_1$ are used to predict the state and its covariance for the next value of k.
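A minimal Python sketch of the predicted-state recursion above follows; the scalar model at the end is a made-up stand-in for A, C, R1, R2, x̄0, and R0:

```python
import numpy as np

# Sketch of the one-step-ahead Kalman recursion derived above.
def kalman_predict(A, C, R1, R2, x0bar, R0, ys):
    x, Sigma = x0bar.copy(), R0.copy()
    xs = []
    for y in ys:
        K = A @ Sigma @ C.T @ np.linalg.inv(R2 + C @ Sigma @ C.T)
        x = (A - K @ C) @ x + K @ y                          # predicted state update
        Sigma = A @ Sigma @ A.T + R1 - K @ C @ Sigma @ A.T   # predicted covariance update
        xs.append(x)
    return np.array(xs)

# Example usage: a scalar random walk observed in noise.
rng = np.random.default_rng(3)
A, C = np.array([[1.0]]), np.array([[1.0]])
R1, R2 = np.array([[0.01]]), np.array([[1.0]])
x_true, ys = 0.0, []
for _ in range(200):
    x_true += 0.1 * rng.standard_normal()
    ys.append(np.array([x_true]) + rng.standard_normal(1))
x_pred = kalman_predict(A, C, R1, R2, np.zeros(1), np.eye(1), ys)
```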
A prime example of a Gaussian process is the Wiener process W ∶ Ω → ℝℝ+ with W(0) = 0,
m(T) = 0, and Σ(T) having entries Σi,j (T) = 𝜎 2 min (ti , tj ), for some 𝜎 2 > 0.
A Gaussian process is weakly and strongly stationary if and only if 𝔼[X(t)] is constant for all t
and Σ(T + h) = Σ(T), where T + h = {t1 + h, … , tn + h}, for all T and h > 0. A Gaussian process is
a Markov process if and only if
[ ] [ ]
𝔼 X(tn ) | X(t1 ), … , X(tn−1 ) = 𝔼 X(tn ) | X(tn−1 ) ,
for all t1 < · · · < tn . An example of a stationary Gaussian Markov process is the Ornstein–Uhlenbeck
process which has zero mean and autocovariance function c(t) = c(0)e−𝛼|t| for some 𝛼 > 0 and
c(0) > 0. It is also possible to define Gaussian processes $X : \Omega \to \mathbb{R}^{\mathbb{R}^n}$, and we will return to them in Chapter 9.
Exercises
3.1 We are given a probability space (Ω, , ℙ).
(a) Show that ℙ[Ac ] = 1 − ℙ[A] for any A ∈ , where Ac = Ω∖A.
(b) Show that ℙ[∅] = 0.
(c) Show that ℙ[A ∪ B ] = ℙ[A] + ℙ[B ] − ℙ[A ∩ B ] for any A, B ∈ .
3.2 In a collection of 100 items delivered by a company, there are five defective items. We first pick one item at random, and then, out of the remaining 99, we pick another item at random. What is the probability that both items are defective?
3.4 Consider Example 3.3. What is the probability that, if a customer has found a defective item, it was manufactured at machine 1?
3.6 A person is crossing the street 300 000 times in his/her lifetime. The probability of being hit
by a car is 1∕300 000. We consider the different crossings to be independent events. What is
the probability of being hit by a car at least once in a lifetime?
3.7 Consider throwing a fair dice repeatedly many times. We define a random variable
Xi ∶ ℕ6 → ℕ6 with value equal to the value of the dice for the ith experiment. Since the dice
is fair, we have that the probability function pXi ∶ ℕ6 → [0, 1] is defined as pXi (k) = 1∕6.
Define the random variable X ∶ ℕN6 → , where = {1, 1 + 1∕N, 1 + 2∕N, … , 6} via
\[
X = \frac{1}{N}\sum_{i=1}^{N} X_i.
\]
(a) Compute the expected value and the variance of the random variable Xi .
(b) Compute the probability function for (X1 , X2 ).
(c) Compute the probability function for X when N = 2.
(d) Compute the expected value of the random variable X for N = 2 directly using the prob-
ability function above and indirectly by using the probability function for (X1 , X2 ) and
the formula for expected values of functions of random variables in Section 3.6.
(e) Compute the variance of X when N = 2.
3.8 Show that for a random vector X ∶ ℝn → ℝn with Gaussian distribution given by the pdf
\[
f(x) = \frac{1}{\sqrt{(2\pi)^n \det \Sigma}}\, e^{-\frac{1}{2}(x-\mu)^T\Sigma^{-1}(x-\mu)},
\]
it holds that the expected value is 𝜇 and that the covariance is Σ.
3.9 Consider a random variable X with Gaussian distribution with zero mean and variance
I ∈ 𝕊n++ . Let the random variable Y be defined as
Y = AX + b.
Show that this random variable is also Gaussian with mean b and variance AAT .
3.10 Consider a scalar-valued random variable X with a Gaussian distribution with zero mean
and variance 𝜎 2 . Show that
\[
\mathbb{E}\left[\exp\left(\frac{p}{2}(X + m)^2\right)\right] = \frac{1}{\sigma\sqrt{\beta}}\,\exp\left(\frac{\gamma m^2}{2}\right),
\]
3.11 Consider two random variables X ∶ Ω → ℝ and Y ∶ Ω → ℝ. Let Z = (X, Y ). We know the
expected value and the variance of Z, i.e.
\[
\mathbb{E}[Z] = \begin{bmatrix} \mu_x \\ \mu_y \end{bmatrix}, \qquad \mathrm{Var}[Z] = \begin{bmatrix} \sigma_x^2 & \sigma_{xy} \\ \sigma_{xy} & \sigma_y^2 \end{bmatrix}
\]
are known. Consider the random variable defined as M = X + c(Y − 𝜇y ) for an arbitrary
c ∈ ℝ.
(a) Show that 𝔼[M ] = 𝜇x .
(b) Show that Var[M] = Var[X] + c2 Var[Y ] + 2cCov[X, Y ] = 𝜎x2 + c2 𝜎y2 + 2c𝜎xy .
(c) Show that Var[M] is minimized for c⋆ = −𝜎xy ∕𝜎y2 and that the minimal value is given
by (1 − 𝜌2 )𝜎x2 , where
\[
\rho = \frac{\sigma_{xy}}{\sigma_x \sigma_y}.
\]
3.12 Consider two scalar-valued random variables (X, Y ) with joint pdf
\[
f_{X,Y}(x, y) = \begin{cases} 2, & \text{if } 0 < y \le x < 1, \\ 0, & \text{otherwise.} \end{cases}
\]
Compute the conditional expectation for X given Y = y.
3.13 Let X be a random variable taking the values 0 and 1 with equal probability 1∕2. Also, define
the random variables Xk = X for all k ∈ ℤ+ . Clearly, $X_k \xrightarrow{D} X$, since all random variables
defined have the same distribution. Also define the random variable Y = 1 − X.
(a) Show that $X_k \xrightarrow{D} Y$.
(b) Show that Xk cannot converge to Y in any other mode.
3.14 Consider the Markov process defined in Example 3.8. Assume that Ek are zero mean with Var(Ek ) = Σ. Furthermore, assume that X0 has mean m0 and variance P0 . Show that mk = 𝔼Xk and Pk = VarXk for k ≥ 1 satisfy the recursions
\[
m_{k+1} = A m_k, \qquad P_{k+1} = A^T P_k A + \Sigma.
\]
(y, Xx, R) (x, 𝜇, P) = (y, X𝜇, H) (x, 𝜇 + Y (y − X𝜇), P − XHX T ).
where is defined as in Example 3.4.
Hint: The formula in Exercise 2.10 is useful.
3.16 Consider a HMM (X, Y ) as defined in Section 3.11. Show that the conditional independence
assumption
\[
p_{\bar{Y}_k \mid \bar{X}_k}(\bar{y} \mid \bar{x}) = \prod_{i=0}^{k} p_{Y_i \mid X_i}(y_i \mid x_i),
\]
implies that
(a)
pXk |Xk−1 ,Ȳ k−1 (xk |xk−1 , ȳ k−1 ) = pXk |Xk−1 (xk |xk−1 ),
(b)
pYk |Xk ,Ȳ k−1 (yk |xk , ȳ k−1 ) = pYk |Xk (yk |xk ).
Part II
Optimization
4 Optimization Theory
Mathematical optimization is an indispensable tool in learning and control. Its history goes back
to the early seventeenth century with the work by Pierre de Fermat who obtained calculus-based
formulae for identifying optima. This work was later developed further by Joseph-Louis Lagrange.
In this chapter, we will present the foundations of optimization theory. We will start by defining
what constitutes an optimization problem and introduce some basic concepts and terminology.
We will also introduce the notion of convexity, which allows us to distinguish between convex and
general nonlinear optimization problems. The motivation behind this distinction is that convex
problems are, roughly speaking, easier to solve than general nonlinear ones. We will pay special
attention to properties that are useful for recognizing convexity. Finally, we will also discuss
the concept of duality and see how it is used to derive optimality conditions for optimization
problems. In Chapter 6, we will see how duality also plays an important role in some optimization
methods.
and the feasible set or feasible region is the subset of points in that satisfy all constraints, i.e.
= {x ∈ | f (x) ⪯ 0, h(x) = 0}. (4.6)
A point x is called feasible if it belongs to the feasible set , and x is strictly feasible if it is both
feasible and all inequality constraints are inactive at x, i.e. if fi (x) < 0, i ∈ ℕm . The optimization
problem is said to be feasible if ≠ ∅, and otherwise, the problem is infeasible.
The optimal value or the minimum value of the optimization problem (4.4) is defined as
p⋆ = inf f0 (x). (4.7)
x∈
otherwise, the optimal value is unattained. A point x⋆ is called an optimal point, a minimizer, or
a solution if it is feasible and p⋆ = f0 (x⋆ ), and the set of all optimal points is called the optimal set.
A feasible point x that satisfies f0 (x) ≤ p⋆ + 𝜖 for some 𝜖 > 0 is called 𝜖-suboptimal.
A feasible point x is said to be locally optimal if there exists a constant r > 0 such that
\[
f_0(x) = \inf_z \left\{ f_0(z) \mid z \in \cap B_2(x, r) \right\}, \tag{4.8}
\]
where $B_2(c, r) = \{x \in \mathbb{R}^n \mid \|x - c\|_2 \le r\}$ is the Euclidean ball with center c and radius r. Thus, an optimal point is also locally optimal, but
the converse is not true in general. To emphasize the difference between optimal points and locally
optimal points, we sometimes say that an optimal point is globally optimal. Similarly, we refer to
f0 (x) as a local minimum value if x is locally optimal. The notion of local and global optimality is
illustrated in Figure 4.2 for an unconstrained problem with a continuously differentiable objec-
tive function f0 ∶ ℝ → ℝ. The local extrema of such a function are stationary points, but not all
stationary points are local extrema.
The special case of the optimization problem (4.4), where f0 (x) = 0 for all x ∈ is called a
feasibility problem since solving such a problem amounts to finding any feasible point. The optimal
value is p⋆ = 0 if the problem is feasible, and otherwise, p⋆ = ∞.
A point 𝜃x + (1 − 𝜃)y for some 𝜃 ∈ [0, 1] is called a convex combination of x ∈ ℝn and y ∈ ℝn , and
the set of all convex combinations of x and y is the line segment between x and y. In other words,
the definition of convexity requires that the line segment between every x, y ∈ C is contained in C,
as illustrated in Figure 4.3. We note that a linear combination of the form
\[
\theta_1 x_1 + \dots + \theta_k x_k, \quad k \ge 2, \tag{4.11}
\]
is called
● an affine combination if 𝟙T 𝜃 = 1,
● a conic combination if 𝜃 ∈ ℝk+ ,
● a convex combination if 𝜃 ∈ ℝk+ and 𝟙T 𝜃 = 1.
which is the standard simplex in ℝk . Figure 4.4 illustrates the set of conic and convex combinations
of two points in ℝ2 . A direct consequence of (4.10) is that all convex combinations of any k points
Figure 4.3 A set is convex if the line segment between two points x and y is contained in the set for every
x and y in the set.
Figure 4.4 Convex and conic combinations of two points x1 and x2 . (a) Conic combinations: 𝜃 ∈ ℝ2+ .
(b) Convex combinations: 𝜃 ∈ Δ2 .
where we define
\[
z = \sum_{i=1}^{k} \tilde{\theta}_i x_i, \qquad \tilde{\theta}_i = \frac{\bar{\theta}_i}{\sum_{i=1}^{k} \bar{\theta}_i} = \frac{\bar{\theta}_i}{1 - \bar{\theta}_{k+1}}, \quad i \in \mathbb{N}_k.
\]
This shows that z is a convex combination of x1 , … , xk , and hence, z belongs to C by assumption.
It follows that y ∈ C since it is a convex combination of z and xk+1 .
The dimension of a convex set C ⊆ ℝn is the dimension of its affine hull, i.e.
dim C = dim (aff C). (4.13)
The relative interior of C is the interior of C within the affine hull of C,
relint C = {x ∈ C | ∃ 𝜖 > 0 such that B2 (x, 𝜖) ∩ aff C ⊆ C}, (4.14)
where B2 (x, 𝜖) is the Euclidean ball centered at x and with radius 𝜖.
The convex hull of a set C ⊆ ℝn is the smallest convex set that contains C. Equivalently, it is the
intersection of all convex sets that contain C, i.e.
conv C = ∩{D ⊆ ℝn | D is convex and C ⊆ D}. (4.15)
We note that it can be difficult to identify the convex hull of a set by using this definition.
Carathéodory’s theorem provides a characterization that is often more useful in practice.
It states that every point in conv C can be expressed as a convex combination of at most n + 1
points in C, i.e.
\[
\mathrm{conv}\, C = \left\{ \sum_{i=1}^{n+1} \theta_i x_i \;\middle|\; x_1, \dots, x_{n+1} \in C,\ \theta \in \Delta_{n+1} \right\}. \tag{4.16}
\]
4.2.1.1 Intersection
A fundamental property of convex sets is that the intersection of any number of convex sets is itself
a convex set, i.e. if C𝜏 is a convex set for every 𝜏 ∈ T, then
\[
C = \bigcap_{\tau \in T} C_\tau
\]
is convex. This follows directly from the definition of a convex set by noting that any two points x
and y in C also belong to C𝜏 for all 𝜏 ∈ T, and moreover,
{𝜃x + (1 − 𝜃)y | 𝜃 ∈ [0, 1]} ∈ C𝜏 ,
since C𝜏 is convex.
Figure 4.5 A hyperplane and a halfspace in ℝ2 : (a) hyperplane and (b) halfspace.
Figure 4.7 Examples of norm balls in ℝ2 . The 1-norm ball and the ∞-norm ball are polyhedral sets,
whereas the 2-norm ball and the quadratic norm ball are ellipsoids.
B2 (0, 1) = {x ∈ ℝn | ||x||2 ≤ 1}, but it is only a norm ball if it is an injective linear transformation
of B2 (0, 1). To see this, suppose f ∶ ℝn → ℝn is an injective affine transformation, i.e. f (x) = Cx + d
for some nonsingular C ∈ ℝn×n and d ∈ ℝn . We then have that
f (B2 (0, 1)) = {Cx + d | ||x||2 ≤ 1} = {y | ||C−1 (y − d)||2 ≤ 1}
= {y | ||y − d||A ≤ 1},
which shows that f (B2 (0, 1)) is a norm ball induced by the quadratic norm || ⋅ ||A with A = C−T C−1 .
Figure 4.7 shows some examples of norm balls in ℝ2 .
names in the literature, e.g. the Lorentz cone, the quadratic cone, and the more casual name, the
ice-cream cone. Figure 4.9 shows the second-order cone in ℝ3 .
The cone of symmetric, positive semidefinite matrices of order n is the set
S+n = {X ∈ 𝕊n | uT Xu ≥ 0, ∀ u ∈ ℝn }. (4.21)
It is easy to verify that it is indeed a cone, and convexity follows by noting that for a given u ∈ ℝn ,
the set
{X ∈ 𝕊n | uT Xu ≥ 0}
is a closed halfspace, and hence, S+n can be expressed as the intersection of infinitely many
halfspaces.
The dual cone of a convex cone K ⊆ ℝn is the set
K ∗ = {y ∈ ℝn | xT y ≥ 0, ∀ x ∈ K}. (4.22)
In other words, K ∗ is the set of vectors that form a nonnegative inner product with all vectors in K.
This is illustrated in Figure 4.10. We note that it can be shown that the dual of the dual cone K ∗ is
the closure of K, i.e. K ∗∗ = cl K. Thus, we have that K ∗∗ = K if K is a proper convex cone. A convex
cone K is called self-dual if K = K ∗ . The nonnegative orthant, the second-order cone, and the cone
of positive semidefinite matrices are examples of self-dual cones. Finally, we note that −(K ∗ ) is
called the polar cone of K.
x ≻K y ⟺ x − y ∈ int K. (4.23b)
We note that the notion of convexity can be generalized to vector-valued function by replacing
the scalar inequality in (4.24) by a generalized inequality. Specifically, we say that f ∶ ℝn → ℝm is
convex with respect to a proper convex cone K ⊂ ℝm , or simply K-convex, if dom f is a convex set
and for all x, y ∈ dom f ,
f (𝜃x + (1 − 𝜃)y) ⪯K 𝜃f (x) + (1 − 𝜃)f (y), ∀ 𝜃 ∈ [0, 1]. (4.26)
The inequality (4.24) can also be expressed in terms of the epigraph of f . Indeed, using the defi-
nition of the epigraph, we may express (4.24) as
\[
\theta \begin{bmatrix} x \\ f(x) \end{bmatrix} + (1 - \theta) \begin{bmatrix} y \\ f(y) \end{bmatrix} \in \mathrm{epi}\, f, \quad \forall\, \theta \in [0, 1].
\]
A direct consequence of this is that the function f is convex if and only if its epigraph is a convex
set, as illustrated in Figure 4.12. This implies that all sublevel sets of a convex function are convex,
but the converse is not true in general.
f (y) ≥ f (x) for all y ∈ dom f whenever ∇f (x) = 0. Moreover, if f is strictly convex, then f has at
most one stationary point since ∇f (x) = 0 implies that f (y) > f (x) for all y ≠ x.
To prove (4.27), we start by rewriting (4.24) as
𝜃f (y) ≥ 𝜃f (x) + f (x + 𝜃(y − x)) − f (x).
The inequality (4.27) then follows by dividing both sides by 𝜃 ≠ 0, i.e.
\[
f(y) \ge f(x) + \frac{f(x + \theta(y - x)) - f(x)}{\theta},
\]
and taking the limit as 𝜃 goes to 0, which yields the directional derivative
\[
\lim_{\theta \to 0} \frac{f(x + \theta(y - x)) - f(x)}{\theta} = \nabla f(x)^T(y - x).
\]
The function f is 𝜇-strongly convex if and only if
\[
f(y) \ge f(x) + \nabla f(x)^T(y - x) + \frac{\mu}{2}\|y - x\|_2^2, \quad \forall\, x, y \in \mathrm{dom}\, f, \tag{4.28}
\]
for some 𝜇 > 0. To see this, recall that f is 𝜇-strongly convex if and only if g(x) = f (x) − (𝜇∕2)||x||22 is
convex, and note that (4.28) is equivalent to g(y) ≥ g(x) + ∇g(x)T (y − x) for all x, y ∈ dom f , which is
the convexity condition (4.27). The condition (4.28) implies that the sublevel sets of f are bounded,
which follows by noting that
St = {y | f (y) ≤ t} ⊆ {y | f (x) + ∇f (x)T (y − x) + (𝜇∕2)||y − x||22 ≤ t}, ∀ x ∈ dom f ,
where the set on the right-hand side of the inclusion operator is either a Euclidean ball or the empty
set. Strong convexity also allows us to bound the distance from any x to the optimal point x⋆ , if it
exists, in terms of ∇f (x),
\[
\|x - x^\star\|_2 \le \frac{2}{\mu}\,\|\nabla f(x)\|_2. \tag{4.29}
\]
This inequality can be derived from (4.28) by substituting x⋆ for y, i.e.
\[
f(x^\star) \ge f(x) + \nabla f(x)^T(x^\star - x) + \frac{\mu}{2}\|x^\star - x\|_2^2 \ge f(x) - \|\nabla f(x)\|_2\,\|x^\star - x\|_2 + \frac{\mu}{2}\|x^\star - x\|_2^2,
\]
and since f (x⋆ ) ≤ f (x) for all x, we see that
\[
0 \ge -\|\nabla f(x)\|_2\,\|x^\star - x\|_2 + \frac{\mu}{2}\|x^\star - x\|_2^2.
\]
Moreover, substituting x⋆ for y on the left-hand side of (4.28) and minimizing the right-hand side
with respect to y, we find that for all x ∈ dom f ,
\[
f(x) - p^\star \le \frac{1}{2\mu}\,\|\nabla f(x)\|_2^2. \tag{4.30}
\]
which shows that epi f is a convex set, and hence, f is convex. This is illustrated in Figure 4.14,
which shows the pointwise maximum of four affine functions and its epigraph. The result can
be extended to the pointwise supremum of uncountably many convex functions. Specifically, if
f ∶ ℝn × ℝp → (−∞, +∞] and f (x, y) is convex in x for every y ∈ Y ⊆ ℝp , then h ∶ ℝn → (−∞, +∞]
defined as
h(x) = sup f (x, y) (4.36)
y∈Y
is a convex function.
This is a convex function if f is a convex function. This result follows by noting that the strict
epigraph of h, defined as
{(x, t) | h(x) < t},
is the image of the strict epigraph of f
{(x, y, t) | f (x, y) < t}
under the linear transformation (x, y, t) → (x, t).
where A − BC−1 BT
is the Schur complement of C in X. Convexity of h implies that this Schur
complement is positive semidefinite.
4.3.3.1 Norms
All norms are convex functions, which is an immediate consequence of the triangle inequality and (4.24). Moreover, the epigraph of a norm is a norm cone, which is a convex set.
We note that this is always a convex function since it is the supremum of affine functions.
4.3.4 Conjugation
The Legendre–Fenchel transformation or conjugate of a function $f : \mathbb{R}^n \to \bar{\mathbb{R}}$ is the function $f^* : \mathbb{R}^n \to \bar{\mathbb{R}}$ defined as
\[
f^*(y) = \sup_x \left\{ y^T x - f(x) \right\}. \tag{4.42}
\]
The conjugate function f ∗ is always a convex function, which follows by noting that it is the point-
wise supremum of affine functions. Furthermore, epi f ∗ is the intersection of closed halfspaces,
and hence, it is a closed set, i.e. f ∗ is closed. Evaluating f ∗ at 0 yields
\[
f^*(0) = \sup_x \{-f(x)\} = -\inf_x f(x),
\]
and, generally speaking, this means that evaluating the conjugate function is as hard as finding a
global minimum of f . For a fixed y, the conjugate function f ∗ (y) may be interpreted graphically as
the largest signed vertical distance from f to the linear function yT x. This is illustrated in Figure 4.15
for a univariate function.
The definition of the conjugate function implies that if f ∶ ℝn → (−∞, +∞] is a proper function,
then for every x, y ∈ ℝn ,
f (x) + f ∗ (y) ≥ xT y.
This is known as the Fenchel–Young inequality. Equality holds if the supremum of yT x − f (x) is
attained at x.
The conjugate of f ∗ is called the biconjugate of f and is denoted f ∗∗ = (f ∗ )∗ . An immediate con-
sequence of the Fenchel–Young inequality is that f ∗∗ ≤ f , i.e.
Moreover, the Fenchel–Moreau theorem states that if f is proper, then f ∗∗ = f if and only if f is
convex and closed. More generally, the biconjugate f ∗∗ is the lower convex envelope of f , which is
the supremum of all closed, convex functions that lie below f . In other words, epi f ∗∗ is the closed convex hull of epi f . To see this, we first note that if f and g are functions from $\mathbb{R}^n$ to $\bar{\mathbb{R}}$ and f ≤ g, then
\[
f^*(y) = \sup_x \{y^T x - f(x)\} \ge \sup_x \{y^T x - g(x)\} = g^*(y),
\]
which implies that f ∗ ≥ g∗ . Using the same argument, we also see that f ∗∗ ≤ g∗∗ . Thus, if f is closed
and convex, then f = f ∗∗ ≤ g∗∗ ≤ g. Taking the supremum of all closed, convex functions f ≤ g,
we conclude that g∗∗ is the lower convex envelope of g.
i.e. it is the support function of the set C. In the special case, where C is a nonempty convex cone
K ⊂ ℝn , the conjugate function is
where B = {x ∈ ℝn | ||x|| ≤ 1} denotes the unit norm ball for the norm || ⋅ ||. Thus, the dual norm || ⋅
||∗ is the conjugate function of the indicator function of the norm ball B. We invite the reader to ver-
ify that the dual norm is indeed a norm, see Exercise 4.11. Since IB is convex and closed, we conclude
that the conjugate of the dual norm || ⋅ ||∗ is IB∗∗ = IB . The definition of the dual norm can also be
expressed as
\[
\|y\|_* = \sup_{x \ne 0} \frac{x^T y}{\|x\|}.
\]
This readily implies that for all x, y ∈ ℝn ,
\[
x^T y \le \|x\|\,\|y\|_*,
\]
which may be viewed as a generalization of the Cauchy–Schwarz inequality. We note that the dual
norm of || ⋅ ||∗ is the norm || ⋅ ||∗∗ = || ⋅ ||.
which follows from the fact that x = y∕||y||2 achieves the supremum if y ≠ 0.
Similarly, the dual norm of || ⋅ ||1 is || ⋅ ||∞ since || ⋅ ||∗∗ = || ⋅ ||. More generally, the dual norm of
|| ⋅ ||p with p ≥ 1 is || ⋅ ||q , where 1∕p + 1∕q = 1.
Example 4.5 The dual norm of the matrix 2-norm on ℝm×n may be expressed as
\[
\mathrm{tr}(Y^T X) \le \sum_{i=1}^{k} \sigma_i = \mathrm{tr}(S).
\]
This upper bound is attained at X = U1 V1T if Y ≠ 0, and hence, the dual norm of || ⋅ ||2 is the nuclear
norm.
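As a hedged numerical illustration of this example, the sketch below checks that $\mathrm{tr}(Y^TX)$ never exceeds the nuclear norm of Y over feasible X, and that equality is attained at $X = U_1V_1^T$ (the test matrix Y is made up):

```python
import numpy as np

# Sketch: the dual of the matrix 2-norm is the nuclear norm (sum of singular values).
rng = np.random.default_rng(4)
Y = rng.standard_normal((5, 3))
U, s, Vt = np.linalg.svd(Y, full_matrices=False)
nuclear = s.sum()

X_star = U @ Vt                                  # ||X_star||_2 = 1, attains the bound
print(np.isclose(np.trace(Y.T @ X_star), nuclear))

for _ in range(1000):                            # random feasible X stay below the bound
    X = rng.standard_normal((5, 3))
    X /= max(np.linalg.norm(X, 2), 1.0)          # enforce ||X||_2 <= 1
    assert np.trace(Y.T @ X) <= nuclear + 1e-9
```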
4.4 Subdifferentiability
Such a vector g is called a subgradient of f at x, and the set of all subgradients at x, i.e.
is the subdifferential of f at x. Note that the subdifferential is a set-valued map, and it may be an
empty set. We use the convention that 𝜕f (x) = ∅ if x ∉ dom f , and hence, its effective domain is
the set
which shows that 𝜕f (x) is the intersection of closed halfspaces, and hence, 𝜕f (x) is closed and con-
vex. Figure 4.16 illustrates the definition of the subdifferential. We note that global information is
necessary to determine if a nonconvex function is subdifferentiable at a point x. In other words, the
plot in Figure 4.16b alone is insufficient to determine where f is subdifferentiable. For example,
if the function approaches some constant as |x| tends to infinity, then f is only subdifferentiable at
its minima.
The set of global minimizers of a proper function f ∶ → (−∞, +∞] may be characterized in
terms of the subdifferential of f using Fermat’s rule, which states that
This property is easily verified using the definition of the subdifferential (4.44), i.e.
However, this property is primarily useful when f is a convex function, because it is generally
difficult to characterize the subdifferential of a nonconvex function. We note that if f is convex
and continuously differentiable at x ∈ dom f , then 𝜕f (x) = {∇f (x)}. Conversely, if f is convex and
𝜕f (x) is the singleton {g}, then f is differentiable at x and ∇f (x) = g. This is not true for nonconvex
functions, as is evident from the example in Figure 4.16b.
4.4.1.2 Summation
Given two closed proper convex functions f1 ∶ ℝn → (−∞, +∞] and f2 ∶ ℝn → (−∞, +∞], it holds
that
𝜕f (x) ⊇ 𝜕f1 (x) + 𝜕f2 (x) = {u + 𝑣 | u ∈ 𝜕f1 (x), 𝑣 ∈ 𝜕f2 (x)}. (4.47)
The right-hand side of the inclusion is the Minkowski sum of 𝜕f1 (x) and 𝜕f2 (x). It is easy to verify
(4.48) by noting that if u ∈ 𝜕f1 (x) and 𝑣 ∈ 𝜕f2 (x), then
f1 (y) + f2 (y) ≥ f1 (x) + f2 (x) + (u + 𝑣)T (y − x), ∀ y ∈ dom f ,
which implies that u + 𝑣 ∈ 𝜕f (x). Moreover, if f1 and f2 satisfy
relint dom f1 ∩ relint dom f2 ≠ ∅,
then it can be shown that
𝜕f (x) = 𝜕f1 (x) + 𝜕f2 (x), ∀ x ∈ ℝn . (4.48)
The result is readily extended to sums of more than two functions by means of induction. We
note that the Minkowski sum of a set and the empty set is itself empty, which implies that dom
𝜕f = dom 𝜕f1 ∩ dom 𝜕f2 .
where (x) = {i ∈ ℕk | fi (x) = f (x)} is the set of active indices. We note that this can also be gener-
alized to the supremum of uncountably many proper convex functions.
Now, suppose that f is also closed. This means that f ∗∗ = f , and hence,
f ∗∗ (x) + f ∗ (y) = xT y, x ∈ 𝜕f ∗ (y).
Thus, we may conclude that
x ∈ 𝜕f ∗ (y) ⟺ y ∈ 𝜕f (x) (4.51)
if f is proper convex and closed.
Example 4.6 The indicator function of a nonempty closed convex set C ⊆ ℝn is a closed proper convex function. This means that
y ∈ 𝜕IC (x) ⟺ x ∈ 𝜕SC (y),
where SC = IC∗ is the support function of C. We note that the subdifferential 𝜕IC (x) is also referred
to as the normal cone of C at x, and it may also be expressed as
NC (x) = 𝜕IC (x) = {g ∈ ℝn | 0 ≥ gT (y − x), ∀ y ∈ C}. (4.52)
It is easy to verify that NC (x) = {0} if x ∈ int C, and furthermore, if g ∈ NC (x) and g ≠ 0, then C is
contained in the halfspace {y | gT y ≤ gT x}.
If x is an optimal point in the interior of the feasible set, there exists a ball B2 (x, 𝜖) ⊆ with
radius 𝜖 > 0. The optimality condition (4.54) requires that
∇f0 (x)T (y − x) ≥ 0, ∀ y ∈ B2 (x, r),
which can also be expressed as ∇f0 (x)T u ≥ 0 for all u such that ||u||2 ≤ 𝜖. Clearly, this is only possible
if ∇f0 (x) = 0, which means that x is a stationary point of f0 .
Next, we consider the more general case, where f0 is convex but not necessarily continuously
differentiable. First, we note that (4.53) is equivalent to the problem
minimize f0 (x) + I (x), (4.55)
where I is the indicator function for the feasible set . Fermat’s rule then implies that x is optimal
if and only if
0 ∈ 𝜕f0 (x) + N (x),
where N = 𝜕I is the normal cone of . In other words, x is optimal if and only if there exists a
subgradient g ∈ 𝜕f0 (x) such that
−g ∈ N (x) ⟺ gT (y − x) ≥ 0, ∀ y ∈ . (4.56)
Note that if f0 is continuously differentiable at x, then 𝜕f0 (x) = {∇f0 (x)} and the optimality condition
(4.56) reduces to (4.54).
This optimality condition can also be expressed in terms of the so-called Lagrangian L ∶ ℝn × ℝp → ℝ defined as
∇x L(x, 𝜇) = 0 (4.58a)
h(x) = 0. (4.58b)
As we will see in Section 4.7, it turns out that these conditions are necessary conditions for opti-
mality even when h is not an affine function.
4.6 Duality
We now return to the general optimization problem (4.4) without making any assumptions about
convexity, and we are interested in computing nontrivial lower bounds on the optimal value p⋆ .
We will do this by constructing a so-called dual problem, and such a problem can be constructed
in several ways. We start out with Lagrangian duality, which emerged in the 1940s but is based on
techniques pioneered by Lagrange in the 1780s. We then look at Fenchel duality, which is closely
related to Lagrangian duality but offers a different perspective.
with dom L = × ℝm × ℝp . The auxiliary variables λ and 𝜇 are called Lagrange multipliers or dual
variables. The variable λi is associated with the inequality constraint fi (x) ⪯ 0, and 𝜇i is associated
with the equality constraint hi (x) = 0.
The Lagrangian function can be used to construct lower bounds on p⋆ by noting that if x is feasible
and λ ∈ ℝm + , then
L(x, λ, 𝜇) ≤ f0 (x).
Now, taking the infimum over x ∈ on the left-hand side of the inequality and the infimum over
x ∈ ⊆ on the right-hand side, we see that
inf L(x, λ, 𝜇) ≤ p⋆ , ∀ λ ⪰ 0.
x∈
The left-hand side is called the Lagrange dual function or simply the dual function. It is a function g : ℝm × ℝp → ℝ̄ of λ and 𝜇, i.e.
g(λ, 𝜇) = inf_{x∈𝒟} L(x, λ, 𝜇).
Now, noting that L is an affine function of λ and 𝜇, we see that the dual function g is the pointwise
infimum of a family of affine functions, and hence, g is a concave function. This is true regardless
of whether or not (4.4) is a convex optimization problem. We remark that if L is unbounded from
below for some (λ, 𝜇), then g(λ, 𝜇) = −∞. We will say that (λ, 𝜇) is a dual feasible point if λ ⪰ 0 and
(λ, 𝜇) ∈ dom(−g), where
dom(−g) = {(λ, 𝜇) | g(λ, 𝜇) > −∞}
is the effective domain of −g.
Example 4.7 We now derive the Lagrange dual function for the special case of (4.4), where both
the inequality and equality constraints are affine, i.e. the problem takes the form
minimize f0 (x)
subject to Ax ⪯ b (4.61)
Cx = d
where A ∈ ℝm×n , b ∈ ℝm , C ∈ ℝp×n , and d ∈ ℝp are given. Using the definition of the Lagrangian,
we find that
L(x, λ, 𝜇) = f0 (x) + λT (Ax − b) + 𝜇 T (Cx − d),
and the dual function may be expressed as
g(λ, 𝜇) = inf_{x∈𝒟} L(x, λ, 𝜇)
= −λᵀb − 𝜇ᵀd + inf_{x∈𝒟} { f0(x) + (Aᵀλ + Cᵀ𝜇)ᵀx }
= −λᵀb − 𝜇ᵀd − sup_{x∈𝒟} { (−Aᵀλ − Cᵀ𝜇)ᵀx − f0(x) }
= −λᵀb − 𝜇ᵀd − f0∗(−Aᵀλ − Cᵀ𝜇),
where the last step follows from the definition of the conjugate function. Recalling that conjugate
functions are convex by construction, we immediately see that g is indeed a concave function, and
moreover, dom(−g) = {(λ, 𝜇) | − AT λ − CT 𝜇 ∈ dom f0∗ }.
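To make this lower bound concrete, the following Python sketch evaluates g(λ, 𝜇) for the simple choice f0(x) = (1∕2)||x||₂², whose conjugate is f0∗(y) = (1∕2)||y||₂², and checks weak duality against a feasible point. The specific data, the choice of f0, and the random multipliers are assumptions made purely for illustration.

import numpy as np

rng = np.random.default_rng(0)
n, m, p = 5, 4, 2
A = rng.standard_normal((m, n))
C = rng.standard_normal((p, n))
x0 = rng.standard_normal(n)          # a point we force to be feasible
b = A @ x0 + rng.random(m)           # A x0 <= b with positive slack
d = C @ x0                           # C x0 = d

f0 = lambda x: 0.5 * x @ x           # f0(x) = (1/2)||x||^2
f0_conj = lambda y: 0.5 * y @ y      # its conjugate f0*(y) = (1/2)||y||^2

lam = rng.random(m)                  # lambda >= 0
mu = rng.standard_normal(p)

# Dual function for affine constraints: g = -b'lam - d'mu - f0*(-A'lam - C'mu)
g = -b @ lam - d @ mu - f0_conj(-A.T @ lam - C.T @ mu)
assert g <= f0(x0)                   # weak duality: g(lam, mu) <= f0(x) for feasible x
print(g, f0(x0))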
Moreover, for any dual feasible (λ, 𝜇) and primal feasible x, it holds that g(λ, 𝜇) ≤ d⋆ ≤ p⋆ ≤ f0 (x),
and as a consequence,
p⋆ − d⋆ ≤ f0 (x) − g(λ, 𝜇).
The difference f0 (x) − g(λ, 𝜇) is called the duality gap at (x, λ, 𝜇), whereas p⋆ − d⋆ is the optimal
duality gap.
We say that strong duality holds if d⋆ = p⋆ , i.e. the optimal duality gap is zero. Strong duality
does not hold in general, but it can be shown to hold for a convex optimization problem of the form
(4.53) if it satisfies certain constraint qualifications. One example is Slater’s constraint qualification
or Slater’s condition, which states that strong duality holds if there exists a point x ∈ relint 𝒟 that
is strictly feasible in the sense that
fi (x) < 0, i ∈ ℕm , Ax = b.
Slater’s condition also implies that the dual optimal value is attained whenever it is finite, i.e. there
exists a dual point (λ⋆ , 𝜇 ⋆ ) such that g(λ⋆ , 𝜇 ⋆ ) = d⋆ . We note that the duality gap can be used as a
stopping criterion for optimization methods when strong duality holds. Indeed, if (x, λ, 𝜇) is primal
and dual feasible, then
f0 (x) − p⋆ ≤ f0 (x) − g(λ, 𝜇), d⋆ − g(λ, 𝜇) ≤ f0 (x) − g(λ, 𝜇). (4.64)
Thus, if the duality gap is 𝜖 = f0 (x) − g(λ, 𝜇), then x is 𝜖-suboptimal for the primal problem and
(λ, 𝜇) is 𝜖-suboptimal for the dual problem.
we immediately see that strong duality holds if 𝜈 is a closed convex function since this implies that
𝜈 ∗∗ = 𝜈. The conjugate of 𝜈(y) = inf x h(x, y) can also be expressed in terms of h∗ by noting that
𝜈∗(z) = sup_y { zᵀy − inf_x h(x, y) } = sup_{x,y} { zᵀy − h(x, y) } = h∗(0, z).
Thus, the dual problem (4.66) depends on the choice of perturbation function.
Example 4.8 Let 𝜙 in (4.65) be defined as 𝜙(x) = f0(x) + I_{ℝ₊ᵐ}(b − Ax) with A ∈ ℝm×n and b ∈ ℝm. This corresponds to an optimization problem of the form (4.4) with the objective function f0(x) and affine inequality constraints Ax ⪯ b. Now, suppose that we define the perturbation function h : ℝn × ℝm → ℝ̄ as
h(x, y) = f0(x) + I_{ℝ₊ᵐ}(b − Ax − y). (4.67)
The conjugate of h is then
h∗(𝑣, z) = sup_{x,y} { 𝑣ᵀx + zᵀy − f0(x) − I_{ℝ₊ᵐ}(b − Ax − y) }
We end this section by outlining Fenchel’s duality theorem. Specifically, we will consider the opti-
mization problem
minimize f (x) + g(Ax) (4.69)
where f ∶ ℝn → ℝ ̄ and g ∶ ℝm → ℝ ̄ are proper convex functions, and A ∈ ℝm×n is given. The con-
jugate of the perturbation function h(x, y) = f (x) + g(Ax + y) can be expressed as
h∗(𝑣, z) = sup_{x,y} { 𝑣ᵀx + zᵀy − f(x) − g(Ax + y) }
= sup_{x,ỹ} { 𝑣ᵀx + zᵀ(ỹ − Ax) − f(x) − g(ỹ) }
= sup_x { (𝑣 − Aᵀz)ᵀx − f(x) } + sup_{ỹ} { zᵀỹ − g(ỹ) }
= f∗(𝑣 − Aᵀz) + g∗(z),
and hence, the dual problem is
maximize −f ∗ (−AT z) − g∗ (z) (4.70)
with variable z ∈ ℝm . We note that this is equivalent to the Lagrange dual problem associated with
the problem
minimize f (x) + g(y)
subject to Ax = y
with variables x ∈ ℝn and y ∈ ℝm and which is equivalent to (4.69). From weak duality, we have
that
p⋆ = inf_x { f(x) + g(Ax) } ≥ sup_z { −f∗(−Aᵀz) − g∗(z) } = d⋆,
and it can be shown that strong duality holds if f and g satisfy certain conditions. For example, the
condition relint (dom f ) ∩ relint (dom g) ≠ ∅ is a sufficient condition for strong duality.
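The following Python sketch verifies Fenchel duality numerically for one particular instance in which both conjugates are available in closed form, namely f(x) = (1∕2)||x − a||₂² and g(y) = (1∕2)||y − b||₂²; the quadratic choices and the random data are assumptions made for this illustration only, and both the primal and dual optima then have closed forms.

import numpy as np

rng = np.random.default_rng(1)
m, n = 4, 3
A = rng.standard_normal((m, n))
a, b = rng.standard_normal(n), rng.standard_normal(m)

# Primal: minimize (1/2)||x - a||^2 + (1/2)||Ax - b||^2
x_opt = np.linalg.solve(np.eye(n) + A.T @ A, a + A.T @ b)
p_star = 0.5 * np.sum((x_opt - a) ** 2) + 0.5 * np.sum((A @ x_opt - b) ** 2)

# Dual: maximize -f*(-A'z) - g*(z), with f*(u) = (1/2)||u||^2 + a'u and
# g*(z) = (1/2)||z||^2 + b'z; the maximizer solves (I + AA')z = Aa - b.
z_opt = np.linalg.solve(np.eye(m) + A @ A.T, A @ a - b)
d_star = -(0.5 * np.sum((A.T @ z_opt) ** 2) - a @ (A.T @ z_opt)) \
         - (0.5 * np.sum(z_opt ** 2) + b @ z_opt)

print(p_star, d_star)                 # strong duality: the two values agree
assert np.isclose(p_star, d_star)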
We note that for convex optimization problems, Slater’s condition implies the existence of primal
and dual optimal points x⋆ and (λ⋆ , 𝜇 ⋆ ) with zero duality gap.
Exercises
4.1 Consider the optimization problem
minimize f (x, y)
where f ∶ ℝm × ℝn → ℝ. Show that the minimal value of f can be obtained by first mini-
mizing over x and then minimizing over y. Specifically, let g ∶ ℝn → ℝ be defined as
g(y) = min f (x, y)
x
and let x̄ (y) ∈ argminx f (x, y) denote a minimizer of f (x, y) for a given y. Similarly, we define
ȳ ∈ argminy g(y). Show that (̄x(̄y), ȳ ) is a minimizer of f .
4.5 Let g ∶ ℝ → ℝ be defined as g(t) = f (x + 𝑣t), where f ∶ ℝn → ℝ and 𝑣 ∈ ℝn are given, and
dom g = {t | x + t𝑣 ∈ dom f }.
(a) Show that f is a convex function if and only if g is a convex function for all x ∈ dom f
and 𝑣 ∈ ℝn .
(b) Show that f ∶ 𝕊n → ℝ defined as f (X) = − ln det X with dom f = 𝕊n++ is a convex
function.
Hint: Show that g(t) = − ln det (X + Vt) is convex for all X ∈ 𝕊n++ and V ∈ 𝕊n .
4.6 Let f ∶ ℝn → (−∞, +∞] be a convex function. Show that the perspective function Pf ∶ ℝn ×
ℝ → (−∞, +∞] defined as
Pf (x, t) = { t f(x∕t), t > 0; ∞, otherwise }
is a convex function.
4.9 Let f ∶ ℝm × ℝn → ℝ be a convex function, and let C ⊆ ℝn be a convex set. Show that the
function g ∶ ℝm → ℝ defined as
g(x) = inf f (x, y)
y∈C
is a convex function.
4.11 Let || ⋅ || be a norm on ℝn . Show that the dual norm, which is defined as
||y||∗ = sup {yT x | ||x|| ≤ 1},
x
is indeed a norm.
4.12 Derive the dual norm of the quadratic norm || ⋅ ||P : ℝn → ℝ+ defined as ||x||P = √(xᵀPx) for some P ∈ 𝕊n++.
4.14 [22, Exercise 2.31] Let C∗ be the dual cone of a convex cone C ⊆ ℝn . Prove the following
statements:
(a) C∗ is a convex cone.
(b) C∗∗ is closed.
(c) C1 ⊆ C2 implies C2∗ ⊆ C1∗ .
(d) The interior of C∗ is given by int C∗ = {y | yT x > 0 ∀ x ∈ C}.
(e) If C has a nonempty interior, then C⋆ is pointed.
(f) C∗∗ is the closure of C.
(g) If the closure of C is pointed, then C∗ has a nonempty interior.
Optimization Problems
Applications in learning and control give rise to a wide range of optimization problems, and we
will now discuss different classes of such optimization problems. Our starting point will be the
classes of linear and nonlinear least-squares problems, instances of which occur frequently in, e.g. supervised learning. The method of least squares was studied already by Carl Friedrich Gauss in the eighteenth century in order to calculate the orbits of celestial bodies. We then discuss quadratic programs, which are often
encountered as local surrogate models of general optimization problems and are a component of
many optimization methods. Another important class of problems is the class of conic optimization
problems, and we will see that any convex optimization problem can, in principle, be cast as a conic
optimization problem.
Problems that involve the rank of some matrix variable as part of the objective function or a
constraint function are called rank optimization problems. We will see that some special cases of
these can be solved to global optimality using techniques from linear algebra. However, in general,
rank optimization problems are difficult nonlinear optimization problems, and we will introduce
some heuristics that often produce good approximate solutions. We will also discuss partially sep-
arable optimization problems, which are problems with a special kind of structure that can be
exploited computationally. Several examples of such problems appear later in the book, e.g. optimal
control problems as well as hidden Markov processes. We also consider multiparametric optimiza-
tion, which is about finding a parametric solution to a family of parameterized optimization prob-
lems, and finally, we discuss stochastic optimization problems, which arise, e.g. when the objective
function involves a random variable in addition to the optimization variable.
5.1 Least-Squares Problems

A nonlinear least-squares (LS) problem is an optimization problem of the form

minimize (1∕2) Σ_{k=1}^{m} fk(x)², (5.1)

with variable x ∈ ℝn and where fk : ℝn → ℝ, k ∈ ℕm. It may also be written more compactly as

minimize (1∕2)||f(x)||₂²,
where we define f (x) = (f1 (x), … , fm (x)). If f is continuously differentiable, then the necessary
optimality condition associated with the nonlinear LS problem may be expressed as
𝜕∕𝜕x [ (1∕2) Σ_{k=1}^{m} fk(x)² ] = Σ_{k=1}^{m} fk(x)∇fk(x) = 0. (5.2)
This is a set of n nonlinear equations in x, which are not easy to solve in general. We discuss
optimization methods for this type of problem in Chapter 6.
A linear LS problem is the special case where f is an affine function, i.e. f (x) = Ax − b for some
A ∈ ℝm×n and b ∈ ℝm . The resulting LS problem is a convex optimization problem. This follows
by noting that the Hessian of (1∕2)||Ax − b||22 is AT A, which is positive semidefinite. The necessary
optimality condition is therefore also sufficient, and it can be expressed as
AT Ax = AT b. (5.3)
This system of equations is called the normal equations. We will encounter several LS problems,
e.g. linear regression, see Section 10.1.
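As a small illustration, the following Python snippet solves a randomly generated linear LS problem via the normal equations (5.3) and compares the result with a library solver; the random data are purely illustrative, and in practice QR- or SVD-based solvers are usually preferred since forming AᵀA squares the condition number.

import numpy as np

rng = np.random.default_rng(2)
m, n = 20, 5
A = rng.standard_normal((m, n))
b = rng.standard_normal(m)

x_normal = np.linalg.solve(A.T @ A, A.T @ b)       # normal equations (5.3)
x_lstsq, *_ = np.linalg.lstsq(A, b, rcond=None)    # SVD-based solver
print(np.linalg.norm(x_normal - x_lstsq))          # should be tiny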
It is sometimes useful to augment the LS problem by adding constraints. For example, if we add
affine equality constraints to a linear LS problem, we obtain a problem of the form
1
minimize ||Ax − b||22
2 (5.4)
subject to Cx = d,
with C ∈ ℝp×n and d ∈ ℝp . Like the linear LS problem, this is clearly also a convex optimization
problem. The Karush–Kuhn–Tucker (KKT) conditions associated with (5.4) may be expressed as
[ AᵀA  Cᵀ ] [ x ]   [ Aᵀb ]
[  C   0  ] [ 𝜇 ] = [  d  ] ,   (5.5)
where 𝜇 ∈ ℝp is the vector of Lagrange multipliers associated with the equality constraints. This
is an indefinite system of equations, and it is sometimes referred to as the KKT equations since
it represents the KKT conditions. We note that the linear independence constraint qualification
(LICQ) is independent of x and holds if rank(C) = p. Moreover, as we saw in Section 2.11, the system
has a unique solution if and only if rank(C) = p and AT A is positive definite on the nullspace of
C, i.e.
rank(C) = p, 𝒩(A) ∩ 𝒩(C) = {0}.
Interested readers may find more information about LS problems in [21], which is a comprehensive
reference on the topic.
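As an illustration of the KKT equations (5.5), the following Python sketch assembles the indefinite system and solves it for a randomly generated instance; the data are assumptions made for illustration, and the rank and nullspace conditions above are taken to hold.

import numpy as np

rng = np.random.default_rng(3)
m, n, p = 15, 6, 2
A = rng.standard_normal((m, n))
b = rng.standard_normal(m)
C = rng.standard_normal((p, n))
d = rng.standard_normal(p)

K = np.block([[A.T @ A, C.T],
              [C, np.zeros((p, p))]])     # KKT matrix of (5.5)
rhs = np.concatenate([A.T @ b, d])
sol = np.linalg.solve(K, rhs)
x, mu = sol[:n], sol[n:]
print(np.linalg.norm(C @ x - d))          # feasibility of the equality constraints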
with variables xi ∈ ℝD , i
∈ ℕn . If the measurement errors are independent and normally distributed
with zero mean and the same variance, then the above problem is equivalent to a maximum like-
lihood problem for estimating the positions; cf . Section 10.1.
The linear LS problem and its linearly constrained variant are instances of a more general class of
optimization problems, namely quadratic programs (QP). We define a QP as a problem of the form
minimize (1∕2)xᵀPx + qᵀx
subject to Ax ⪯ b (5.6)
           Cx = d,
with variable x ∈ ℝn and problem data P ∈ 𝕊n , q ∈ ℝn , A ∈ ℝm×n , b ∈ ℝm , C ∈ ℝp×n , and d ∈ ℝp .
This is a convex optimization problem according to our definition if and only if P is positive
semidefinite. We note that the problem is equivalent to a convex optimization problem in the
event that P is indefinite, but positive semidefinite on the nullspace of C. The special case of (5.6)
where P = 0 is called a linear program (LP). We will see an example of an LP in Section 10.5.
QPs of the form (5.6) generally do not have a closed-form solution. The KKT conditions may be
expressed as
Px + AT λ + CT 𝜇 = −q
Ax ⪯ b, Cx = d
λ⪰0
diag(λ)(Ax − b) = 0,
where λ ∈ ℝm and 𝜇 ∈ ℝp are the Lagrange multipliers associated with the inequality and equality constraints, respectively.
To derive the Lagrange dual of (5.6), we introduce the Lagrangian
1 T
L(x, λ, 𝜇) = x Px + qT x + λT (Ax − b) + 𝜇 T (Cx − d).
2
We immediately see that L is unbounded from below if P ⋡ 0, in which case, the dual function is
g(λ, 𝜇) = −∞, and hence, d⋆ = −∞. A more useful lower bound can be obtained if P ⪰ 0, in which
case, we find that
g(λ, 𝜇) = { −bᵀλ − dᵀ𝜇 − 𝜓(λ, 𝜇),   q + Aᵀλ + Cᵀ𝜇 ∈ ℛ(P),
          { −∞,                      otherwise,                (5.7)
where
𝜓(λ, 𝜇) = (1∕2)(q + Aᵀλ + Cᵀ𝜇)ᵀP†(q + Aᵀλ + Cᵀ𝜇).
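The dual bound (5.7) is easy to evaluate numerically. The following Python sketch does so for a convex QP and checks weak duality against a primal feasible point; P is taken positive definite (an assumption made here so that the range condition holds trivially and P† = P⁻¹), and the remaining data are random and purely illustrative.

import numpy as np

rng = np.random.default_rng(4)
n, m, p = 5, 4, 2
M = rng.standard_normal((n, n))
P = M.T @ M + np.eye(n)              # positive definite
q = rng.standard_normal(n)
A = rng.standard_normal((m, n))
C = rng.standard_normal((p, n))
x0 = rng.standard_normal(n)
b = A @ x0 + rng.random(m)           # x0 is feasible: A x0 <= b, C x0 = d
d = C @ x0

lam = rng.random(m)                  # lambda >= 0
mu = rng.standard_normal(p)
v = q + A.T @ lam + C.T @ mu
g = -b @ lam - d @ mu - 0.5 * v @ np.linalg.solve(P, v)   # dual function (5.7)
f0 = 0.5 * x0 @ P @ x0 + q @ x0
assert g <= f0                        # weak duality
print(g, f0)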
We end this section by noting that the QP in (5.6) can be generalized by allowing quadratic
constraints. The resulting optimization problem is a so-called quadratically constrained quadratic
program (QCQP), which can be expressed as
minimize (1∕2)xᵀPx + qᵀx
subject to (1∕2)xᵀAix + biᵀx + ci ≤ 0, i ∈ ℕm,          (5.11)
with variable x ∈ ℝn and where P ∈ 𝕊n , q ∈ ℝn , and (Ai , bi , ci ) ∈ 𝕊n × ℝn × ℝ, i ∈ ℕm . This is a
convex problem if and only if P and A1 , … , Am are positive semidefinite.
which follows by squaring both sides of (5.15). Thus, to express (5.14) as (5.15), we need to find A,
b, c, and d such that
P = AᵀA − ccᵀ ⪰ 0, q = 2(Aᵀb − cd), s = bᵀb − d².
We start by considering the case where q ∈ ℛ(P). This implies that there exists a full-rank matrix A such that P = AᵀA and q ∈ ℛ(Aᵀ), and hence, we can choose c = 0 and solve for b and d, i.e.

b = (1∕2)(A†)ᵀq,   d = |(1∕4)bᵀb − s|^{1∕2}.
Now, suppose q ∉ ℛ(P) and let P = BᵀB, where B ∈ ℝr×n has rank r < n. We may then decompose q as q = q̄ + c, where q̄ = B†Bq ∈ ℛ(Bᵀ) and c = (I − B†B)q ∈ 𝒩(B), and hence, we have that P = AᵀA − ccᵀ ⪰ 0 if we let Aᵀ = [ Bᵀ  c ]. Moreover, q = 2(Aᵀb − cd) is satisfied if we take

b = (1∕2) [ (B†)ᵀq̄ ; 1 + 2d ],

and finally, we find d by solving

s = bᵀb − d² ⟺ d = s − (1∕4)( ||(B†)ᵀq̄||₂² − 1 ).
Following the approach in Section 4.7, we may derive the KKT conditions for optimality, which
can be expressed as
Ax = b,        x ∈ K,
Aᵀ𝜇 + λ = c,   λ ∈ K∗,
diag(λ)x = 0.
These conditions are both necessary and sufficient for optimality if either the primal problem or
the dual problem is strictly feasible, i.e. if there exists a vector x ∈ int K such that Ax = b, or if there
exists a dual point (λ, 𝜇) such that λ ∈ int K ∗ and AT 𝜇 + λ = c.
Figure 5.2 The exponential cone Kexp and examples of the power cone K𝛼 : (a) exponential cone, (b) power
cone: 𝛼 = 2∕3, and (c) power cone: 𝛼 = 4∕5.
Yet another example of a cone of the form (5.18) is a so-called power cone. Letting h(x) = |x|1∕𝛼
for some 𝛼 ∈ (0,1), we obtain the three-dimensional cone
K𝛼 = cl{(x, y, z) ∈ ℝ3 | y > 0, y|x∕y|1∕𝛼 ≤ z}
= {(x, y, z) ∈ ℝ3 | y ≥ 0, z ≥ 0, |x| ≤ y1−𝛼 z𝛼 }. (5.21)
The reader may verify that this is a proper cone. The dual cone may be expressed as
K𝛼∗ = { (u, 𝑣, 𝑤) ∈ ℝ³ | 𝑣 ≥ 0, 𝑤 ≥ 0, |u| ≤ (𝑣∕(1 − 𝛼))^{1−𝛼} (𝑤∕𝛼)^{𝛼} },   (5.22)
which is the image of K𝛼 under the linear transformation (x, y, z) → (x, (1 − 𝛼)y, 𝛼z). It is also possi-
ble to define a power cone in ℝn+1 . Specifically, given 𝛼 ∈ int Δn , we define the (n + 1)-dimensional
power cone as
K𝛼 = { (x, y) ∈ ℝ × ℝⁿ₊ | |x| ≤ y1^{𝛼1} · · · yn^{𝛼n} },
and its dual may be expressed as
K𝛼∗ = { (u, 𝑣) ∈ ℝ × ℝⁿ₊ | |u| ≤ (𝑣1∕𝛼1)^{𝛼1} · · · (𝑣n∕𝛼n)^{𝛼n} }.
Figure 5.2 shows the exponential cone and examples of the power cone in ℝ3 .
We end this section with some examples of constraints that can be expressed as conic constraints.
Example 5.4 From the definition of the exponential cone Kexp and the power cone K𝛼 , we imme-
diately see that
ex ≤ t ⟺ x ≤ ln(t) ⟺ (x, 1, t) ∈ Kexp ,
and for p ≥ 1,
|x|p ≤ t ⟺ |x| ≤ t1∕p ⟺ (x, 1, t) ∈ K1∕p .
This observation allows us to transform a number of constraints that involve exponential functions,
logarithms, and/or powers into conic constraints. For example, the epigraph of the log-sum-exp
function f (x) = ln(ex1 + · · · + exn ) can be expressed as (x, t) such that
ln(ex1 + · · · + exn ) ≤ t ⟺ ex1 + · · · + exn ≤ et
⟺ ex1 −t + · · · + exn −t ≤ 1
⟺ u1 + · · · + un ≤ 1, exi −t ≤ ui , i ∈ ℕn
⟺ 𝟙T u ≤ 1, (xi − t, 1, ui ) ∈ Kexp , i ∈ ℕn .
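The equivalence above can be checked numerically. In the following Python sketch (the particular vector x and offsets are assumptions made only for illustration), the natural certificate ui = exp(xi − t) is the smallest ui allowed by the exponential-cone constraints, so the conic reformulation is feasible exactly when the original constraint holds.

import numpy as np

def lse(x):
    return np.log(np.sum(np.exp(x)))

x = np.array([0.3, -1.2, 0.5])
for t in (lse(x) + 0.1, lse(x) - 0.1):        # one feasible, one infeasible t
    u = np.exp(x - t)                          # smallest u_i allowed by K_exp
    conic_feasible = np.sum(u) <= 1.0
    print(lse(x) <= t, conic_feasible)         # the two tests always agree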
Another example of a function whose epigraph can be expressed in terms of the exponential cone
is the negative entropy function, i.e. f (x) = x ln(x) with dom f = ℝ+ , where we use the convention
that f (0) = 0. Specifically, the constraint f (x) ≤ t can be expressed as (−t, x, 1) ∈ Kexp . Indeed, for
x > 0, it holds that
f (x) ≤ t ⟺ x ≤ et∕x ⟺ (−t, x, 1) ∈ Kexp ,
and for x = 0, we have that 0 ≤ t ⟺ (−t, 0,1) ∈ Kexp .
The geometric mean of x1 , … , xn is f (x) = (x1 · · · xn )1∕n with dom f = ℝn++ . This is a concave
function, and the constraint |t| ≤ f (x) can be expressed in terms the (n + 1)-dimensional power
cone defined by 𝛼 = (1∕n)𝟙, i.e.
|t| ≤ (x1 x2 · · · xn )1∕n ⟺ (t, x) ∈ K𝛼 , 𝛼 = (1∕n)𝟙.
It can also be expressed in terms of n − 1 power cones in ℝ3 . To see this, first note that |t| ≤ f (x)
can be expressed as
|t| ≤ xn^{1∕n} u_{n−1}^{(n−1)∕n},   |u_{n−1}| ≤ (x1 · · · x_{n−1})^{1∕(n−1)},   x ⪰ 0,  u_{n−1} ≥ 0.
Using this observation recursively, we see that the constraint |t| ≤ f (x) is equivalent to
t = un , x1 = u1 , (ui , ui−1 , xi ) ∈ K1∕i , i ∈ {2, … , n}.
Example 5.5 Let (Ω, ℱ, P) be a probability space, and suppose that X : Ω → ℝ is a random variable with moments mk = 𝔼[Xᵏ], k ∈ ℕ2n, cf. Section 3.6. We define the zeroth moment as m0 = 1 for convenience, and we let H : ℝ^{2n+1} → 𝕊^{n+1} be the function that maps the moment sequence to the Hankel matrix
the Hankel matrix
H(m0, … , m2n) =
⎡ m0      m1      m2      …  m_{n−1}  m_n      ⎤
⎢ m1      m2      m3      …  m_n      m_{n+1}  ⎥
⎢ m2      m3      m4      …  m_{n+1}  m_{n+2}  ⎥
⎢ ⋮       ⋮       ⋮       ⋱  ⋮        ⋮        ⎥
⎢ m_{n−1} m_n     m_{n+1} …  m_{2n−2} m_{2n−1} ⎥
⎣ m_n     m_{n+1} m_{n+2} …  m_{2n−1} m_{2n}   ⎦ ,
i.e. the (n + 1) × (n + 1) matrix whose (i, j) entry is m_{i+j−2}.
The matrix H(m0, … , m2n) is positive semidefinite since for all x ∈ ℝ^{n+1},

xᵀH(m0, … , m2n)x = Σ_{i=0}^{n} Σ_{j=0}^{n} xi xj 𝔼[X^{i+j}] = 𝔼[ ( Σ_{k=0}^{n} xk Xᵏ )² ] ≥ 0.
There is a partial converse of these results which states that if m0 = 1 and (m1 , … , m2n ) are such
that H(m0 , … , m2n ) ≻ 0, then there exists a probability space and a random variable X such that
mk = 𝔼[X k ] for k ∈ ℕ2n ; see [22, Section 4.6.3 and Exercise 2.37]. More generally, if m0 = 1 and
(m1 , … , m2n ) are such that H(m0 , … , m2n ) ⪰ 0, then there is a sequence of random variables that
converges to a random variable with the given moments. This allows us to pose certain moment
constraints as conic constraints involving the positive semidefinite cone. For example, suppose
that we are given upper and lower bounds on the moments of a random variable X, i.e. we know that
lk ≤ 𝔼[X k ] ≤ uk , k ∈ ℕ2n .
The problem of finding a random variable X that satisfies these constraints and minimizes 𝔼[p(X)], where p : ℝ → ℝ is a polynomial defined as p(x) = Σ_{k=0}^{2n} ck xᵏ, can then be expressed as the problem

minimize   Σ_{k=0}^{2n} ck mk
subject to lk ≤ mk ≤ uk, k ∈ ℕ2n
           H(1, m1, … , m2n) ∈ 𝕊₊^{n+1},
with variables m1 , … , m2n . Note that the pdf of the random variable does not appear in this
problem formulation.
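The moment matrix itself is straightforward to build. The following Python sketch constructs H(m0, … , m2n) for the moments of a uniform random variable on [0, 1] (an assumption made only to have explicit moments, mk = 1∕(k + 1)) and confirms that all eigenvalues are nonnegative.

import numpy as np

n = 3
m = np.array([1.0 / (k + 1) for k in range(2 * n + 1)])     # m_0, ..., m_{2n}
H = np.array([[m[i + j] for j in range(n + 1)] for i in range(n + 1)])
print(np.linalg.eigvalsh(H))        # all eigenvalues are nonnegative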
is a minimizer of (5.23) for both the Frobenius norm and the 2-norm. This follows from the Eckart–Young–Mirsky theorem, which states that if k < min(m, n), then

||A − Ak||_F = min_Z { ||A − Z||_F | rank(Z) ≤ k } = ( Σ_{i=k+1}^{min(m,n)} 𝜎i² )^{1∕2}, (5.24a)

||A − Ak||_2 = min_Z { ||A − Z||_2 | rank(Z) ≤ k } = 𝜎_{k+1}. (5.24b)
To prove (5.24a), suppose rank(Z) ≤ k and let Z = PΓQᵀ be an SVD of Z, where Γ is the matrix with the singular values of Z on its diagonal in decreasing order. Moreover, we decompose Σ as Σ = Σk + Σ̃, where Σk is the matrix with the singular values of Ak. It then follows from von Neumann's trace inequality in (2.33) that |tr(AᵀZ)| ≤ tr(Σkᵀ Γ). As a result, we have that

||A − Z||²_F = ||A||²_F + ||Z||²_F − 2tr(AᵀZ)
 ≥ ||Σ||²_F + ||Γ||²_F − 2tr(Σkᵀ Γ)
 = ||Σk||²_F + ||Σ̃||²_F + ||Γ||²_F − 2tr(Σkᵀ Γ)
 = ||Σ̃||²_F + ||Σk − Γ||²_F
 ≥ ||Σ̃||²_F = Σ_{i=k+1}^{min(m,n)} 𝜎i²,
which proves (5.24a). To verify (5.24b), we first assume that (5.24b) is false, i.e. there exists a matrix Z such that ||A − Z||₂ < 𝜎_{k+1} and rank(Z) ≤ k. An immediate consequence is that

u ∈ 𝒩(Z) ⟹ ||Au||₂ = ||(A − Z)u||₂ ≤ ||A − Z||₂||u||₂ < 𝜎_{k+1}||u||₂.

On the other hand, if 𝑣1, … , 𝑣_{k+1} are the leading k + 1 right-singular vectors of A and 𝑣 ∈ span(𝑣1, … , 𝑣_{k+1}), then ||A𝑣||₂ ≥ 𝜎_{k+1}||𝑣||₂. Noting that rank(Z) ≤ k ⟺ dim 𝒩(Z) ≥ n − k, we see that 𝒩(Z) ∩ span(𝑣1, … , 𝑣_{k+1}) ≠ {0}, and hence, we have a contradiction.
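The best rank-k approximation is easy to compute from an SVD. The following Python sketch checks both parts of the theorem on a random matrix; the matrix size and rank are assumptions made for illustration.

import numpy as np

rng = np.random.default_rng(5)
A = rng.standard_normal((10, 5))
k = 2
U, s, Vt = np.linalg.svd(A, full_matrices=False)
Ak = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]      # rank-k truncation

err_F = np.linalg.norm(A - Ak, 'fro')
err_2 = np.linalg.norm(A - Ak, 2)
print(np.isclose(err_F, np.sqrt(np.sum(s[k:] ** 2))))   # (5.24a)
print(np.isclose(err_2, s[k]))                            # (5.24b): sigma_{k+1}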
Figure 5.3 The optimal value associated with the rank-constrained problem (5.23) and the
rank-minimization problem (5.25) for the 2-norm and a matrix A ∈ ℝ10×5 with the singular values
(6,5, 2,1, 0). (a) Rank-constrained minimization and (b) rank minimization.
Example 5.6 Given a partial impulse response h ∈ ℝn , the matrices (A, B, C) ∈ ℝr×r × ℝr×1 ×
ℝ1×r are said to be a minimal state-space realization of h if r is the smallest possible natural number
such that hk = CAk−1 B, k ∈ {1, … , n}. A well-known property of linear systems is that r is given by
r = min_{h̃} { rank H(h̃) | h̃i = hi, i ∈ ℕn },
We end this section by briefly considering a more general formulation of the rank-constrained
problem (5.23), namely, the problem
minimize f (Z)
(5.31)
subject to rank(Z) ≤ k,
with variable Z ∈ ℝm×n, and where f : ℝm×n → ℝ̄. This problem can be reformulated by introducing two new variables U ∈ ℝm×k and V ∈ ℝn×k and eliminating the rank constraint by substituting UVᵀ for Z. However, the reformulated problem is generally still a nonconvex one, but it is often a convenient form for the use of local optimization methods.
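A minimal Python sketch of this factorized approach is given below for the specific choice f(Z) = (1∕2)||Z − A||²_F, which is an assumption made here for the sake of the example (for this f the global optimum is known from the Eckart–Young–Mirsky theorem, which gives a convenient reference value). Substituting Z = UVᵀ and taking gradient steps in (U, V) is a simple local method.

import numpy as np

rng = np.random.default_rng(6)
m, n, k = 10, 5, 2
A = rng.standard_normal((m, n))
U = 0.1 * rng.standard_normal((m, k))
V = 0.1 * rng.standard_normal((n, k))
t = 0.05                                     # step size (chosen heuristically)
for _ in range(2000):
    R = U @ V.T - A                          # residual Z - A
    U, V = U - t * R @ V, V - t * R.T @ U    # gradient steps for U and V
print(0.5 * np.linalg.norm(U @ V.T - A, 'fro') ** 2)   # should approach ...
s = np.linalg.svd(A, compute_uv=False)
print(0.5 * np.sum(s[k:] ** 2))                          # ... the truncated-SVD error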
where Ak ∈ ℝnk ×n , k ∈ ℕm , are matrices with a high-dimensional nullspace. The notion of partial
separability can also be extended to functions with more general domains, e.g. a discrete set. Partial
separability arises frequently in both estimation and control because of the fact that descriptions of
dynamical systems often involve difference equations that only introduce coupling between states
that are adjacent in time.
We will focus on the special case where the range of ATk is spanned by a small number of standard
basis vectors. This implies that each function fk (Ak x) only depends on a small subset of the entries
of x. The partial separability structure can be expressed in terms of index sets 𝛽k ⊂ ℕn , k ∈ ℕm ,
such that fk depends on xi if and only if i ∈ 𝛽k . We will assume that the index sets are maximal,
i.e. 𝛽i ⊈ 𝛽j if i ≠ j, and their union is ∪m 𝛽 = ℕn . Moreover, we define Ak to be a matrix with rows
k=1 k
{eTi | i ∈ 𝛽k }. We note that f is said to be separable in the special case, where 𝛽i ∩ 𝛽j = ∅ for i ≠ j.
∑n
For example, this is the case if f is of the form f (x) = i=1 fi (xi ).
The partial separability structure of f can also be characterized in terms of an undirected graph
with vertex set ℕn and an edge between nodes i and j, i ≠ j, if and only if {i, j} ∈ 𝛽k for some k ∈ ℕm .
We will refer to this graph as the interaction graph associated with (5.32). The interaction graph is
closely related to the sparsity graph associated with the Hessian matrix ∇2 f (x) provided that f is
twice continuously differentiable at x. This follows by noting that
∇²f(x) = Σ_{k=1}^{m} Akᵀ∇²fk(Ak x)Ak,
and hence, eTi ∇2 f (x)ej = 0 if Ak ei = 0 or Ak ej = 0 for all k ∈ ℕm . Each index set 𝛽k is a so-called
clique of the interaction graph, i.e. a set of pairwise adjacent vertices. The coupling between the
functions f1 , … , fm can be described in terms of a clique intersection graph, which is an undirected
graph with vertex set {𝛽1 , … , 𝛽m } and an edge between vertices 𝛽i and 𝛽j if and only if i ≠ j and
𝛽i ∩ 𝛽j ≠ ∅.
Figure 5.4 The interaction and clique intersection graphs associated with the partially separable function
(5.33).
corresponding to the four index sets 𝛽1 = {1}, 𝛽2 = {1, 2}, 𝛽3 = {2, 3} and 𝛽4 = {1, 4, 5}. Figure 5.4
shows the interaction and the clique intersection graphs associated with (5.33). Partial separability
allows us to compute inf_x f(x) recursively by noting that

inf_x f(x) = inf_{x1} { f1(x1) + inf_{x2} { f2(x1, x2) + inf_{x3} f3(x2, x3) } + inf_{x4,x5} f4(x1, x4, x5) }.
we may compute V3 and V4 in parallel. Once V3 has been obtained, we can compute
V2(x1) = inf_{x2} { f2(x1, x2) + V3(x2) },
Assuming that the infimum is attained, we can also compute an optimal point x⋆ as follows:
x1⋆ ∈ argmin_{x1} { f1(x1) + V2(x1) + V4(x1) }
The computations can be represented by a tree with four nodes as illustrated in Figure 5.5. This is a
so-called spanning tree of the clique intersection graph, and it is often referred to as an elimination
tree because of its connection to the order in which variables are eliminated in Gauss elimination
for sparse linear systems of equations; see, e.g. [106]. The computations start at the leaves of the
tree, possibly in parallel, and then proceed up the tree by adding the functions Vk to parent nodes.
At each node, partial minimization is carried out with respect to the variables that are not shared
with the parent node. Once the optimization problem at the root of the tree has been solved, an
optimal point x⋆ can be computed using the functions 𝜇k by passing the optimal value downward
from the root of the tree toward the leaves. We note that it is also possible to define any other node
Figure 5.5 A computational tree associated with the optimization problem (5.33).
of the tree to be the root, which results in a different organization of the computations. In this
example, computations cannot be carried out in parallel if one of the leaves of the tree in Figure 5.5
is used as the root.
Given a minimization problem with a partially separable objective function, it is generally diffi-
cult to find an elimination order that minimizes some notion of computational cost. However, in
practice, it is possible to use heuristics like the nested dissection algorithm to obtain a good elimi-
nation tree; see, e.g. [60]. Optimization over a tree is sometimes called message passing since the
functions Vk can be thought of as messages from a child node to its parent node, while the functions
𝜇k are used to pass information about optimizers in the opposite direction.
Another difficulty with the recursive approach to minimizing a partially separable function is
that it is generally hard or even impossible to obtain analytical expressions for the functions Vk .
A notable exception is when dom f is a finite set, and the values of the functions f1 , … , fm can be
tabulated. Another exception is when f is a quadratic function, in which case, the functions Vk
are also quadratic functions. The optimality condition ∇f (x) = 0 corresponds to a system of linear
equations, which has a unique solution if f is strongly convex. In this special case, the elimination
tree characterizes a partial order in which a sparse Cholesky factorization is computed and the
solution is found.
The optimization over the tree is also often called dynamic programming over trees. The motiva-
tion for this name comes from the fact that dynamical systems with difference equations couple
time-adjacent states. In these types of applications, it turns out that the computational tree is often
a chain, i.e. it is possible to choose the root of the tree in such a way that no node of the tree has
more than one child. This is also the case for the function (5.33), but it is not true in general.
where Vk ∶ ℝn → ℝ ̄ is defined as
Vk(xk) = min_{x_{k+1},…,x_m} { Σ_{i=k}^{m−1} fi(xi, x_{i+1}) + fm(xm) }, k ∈ ℕ_{m−1}, (5.36)
and Vm(xm) = fm(xm). This is often called the principle of optimality, which refers to the fact that an optimal point (x_{k+1}⋆, … , x_m⋆) in (5.36) is a subvector of an optimal point x⋆ in (5.34). It is a direct consequence of partial separability and the fact that for any function F : ℝp × ℝq → ℝ̄, it holds that

inf_{z1,z2} F(z1, z2) = inf_{z1} V(z1), where V(z1) = inf_{z2} F(z1, z2).
Applied to (5.36), this identity yields the recursion Vk(xk) = min_{x_{k+1}} { fk(xk, x_{k+1}) + V_{k+1}(x_{k+1}) }, k ∈ ℕ_{m−1}. This recursive definition of the functions Vk is often called the dynamic programming recursion, and the functions Vk are called value functions. The recursion starts with Vm(xm) = fm(xm) and proceeds with V_{m−1}(x_{m−1}), V_{m−2}(x_{m−2}), and so on. We see from (5.36) that p⋆ = min_{x1} V1(x1) and

x1⋆ ∈ argmin_{x1} V1(x1).
Moreover, we may define functions 𝜇k(x_{k−1}) ∈ argmin_{xk} { f_{k−1}(x_{k−1}, xk) + Vk(xk) }, 2 ≤ k ≤ m, such that an optimal point x⋆ can be computed using the recursion xk⋆ = 𝜇k(x_{k−1}⋆) for 2 ≤ k ≤ m.
The approach can readily be extended to the more general case, where xk is a vector instead of a
scalar. Moreover, dynamic programming can also be done over general trees, and although this
is a straightforward generalization, the proof is somewhat messy from a notational point of view,
and hence, we omit it. We will apply the results presented in this section to optimal control prob-
lems in Chapter 8. Dynamic programming over trees also has applications to probabilistic graphical
models, and we will see how it can be used for maximum likelihood estimation for hidden Markov
processes in Chapter 9.
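A minimal Python sketch of the dynamic programming recursion (5.36) over a chain with a finite domain is given below; the stage costs are random tables, purely for illustration, and a brute-force enumeration is included only to confirm the result on this small instance.

import numpy as np

rng = np.random.default_rng(7)
m, K = 5, 4                                            # m variables, each in {0,...,K-1}
f_pair = [rng.random((K, K)) for _ in range(m - 1)]    # f_k(x_k, x_{k+1})
f_last = rng.random(K)                                 # f_m(x_m)

# Backward pass: V_m = f_m, V_k(x_k) = min_{x_{k+1}} f_k(x_k, x_{k+1}) + V_{k+1}(x_{k+1})
V = [None] * m
policy = [None] * m                                    # mu_k maps x_{k-1} to x_k
V[m - 1] = f_last
for k in range(m - 2, -1, -1):
    Q = f_pair[k] + V[k + 1][None, :]                  # Q[x_k, x_{k+1}]
    V[k] = Q.min(axis=1)
    policy[k + 1] = Q.argmin(axis=1)

# Forward pass: recover an optimal point from the tabulated policies.
x = np.zeros(m, dtype=int)
x[0] = int(np.argmin(V[0]))
for k in range(1, m):
    x[k] = int(policy[k][x[k - 1]])

p_star = V[0].min()
brute = min(sum(f_pair[k][z[k], z[k + 1]] for k in range(m - 1)) + f_last[z[-1]]
            for z in np.ndindex(*([K] * m)))           # exhaustive check
print(np.isclose(p_star, brute), x)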
are convex in x for every fixed value of 𝜃. We will use f (x, 𝜃) as shorthand for (f1 (x, 𝜃), … , fm (x, 𝜃)).
The KKT conditions may then be expressed as follows: there exist x(𝜃) ∈ ℝn and λ(𝜃) ∈ ℝm such
that
∇x f0(x(𝜃), 𝜃) + Σ_{i=1}^{m} λi(𝜃)∇x fi(x(𝜃), 𝜃) = 0 (5.38a)
f(x(𝜃), 𝜃) ⪯ 0 (5.38b)
λ(𝜃) ⪰ 0 (5.38c)
λi(𝜃)fi(x(𝜃), 𝜃) = 0, i ∈ ℕm. (5.38d)
Assuming that Slater’s condition is satisfied for all 𝜃 ∈ Θ, the KKT conditions are necessary and
sufficient conditions for optimality.
As an example of a multiparametric program, we now consider a multiparametric quadratic
program (mpQP), where we let f0(x, 𝜃) = (1∕2)xᵀHx with H ∈ 𝕊n++ and f(x, 𝜃) = Gx − 𝑤 − S𝜃 with
G ∈ ℝm×n and S ∈ ℝm×r . The resulting KKT conditions are
Hx + GT λ = 0 (5.39a)
Gx − 𝑤 − S𝜃 ⪯ 0 (5.39b)
λ⪰0 (5.39c)
diag(λ)(Gx − 𝑤 − S𝜃) = 0. (5.39d)
Now, suppose that x⋆ (𝜃)̄ and λ⋆ (𝜃)̄ satisfy the KKT conditions for a given parameter vector 𝜃̄ ∈ Θ.
̄
We will drop the argument 𝜃 when it is obvious from the context and simply write x⋆ and λ⋆ .
In order to express the KKT conditions in terms of the active and inactive constraints, we introduce
the index sets
𝒜(𝜃̄) = { i ∈ ℕm | fi(x⋆, 𝜃̄) = 0 },   ℐ(𝜃̄) = ℕm ∖ 𝒜(𝜃̄),

and define G_𝒜 to be the matrix with the rows of G that correspond to the active constraints, whereas G_ℐ contains the rows that correspond to inactive constraints. We use the same notation for S, 𝑤, and λ⋆. It then follows from (5.39d) that λ_ℐ⋆ = 0, and from (5.39a) we find that
x⋆ = −H⁻¹G_𝒜ᵀλ_𝒜⋆. (5.40)

Moreover, from the definition of active constraints, we see that

G_𝒜 x⋆ − 𝑤_𝒜 − S_𝒜 𝜃̄ = 0, (5.41)

and combining (5.40) and (5.41), we conclude that

λ_𝒜⋆ = −( G_𝒜 H⁻¹G_𝒜ᵀ )⁻¹ ( 𝑤_𝒜 + S_𝒜 𝜃̄ ) (5.42a)
x⋆ = H⁻¹G_𝒜ᵀ( G_𝒜 H⁻¹G_𝒜ᵀ )⁻¹ ( 𝑤_𝒜 + S_𝒜 𝜃̄ ). (5.42b)
Here, we have tacitly assumed that G_𝒜 has full row rank. If this is not the case, we may redefine 𝒜 by removing some of its elements such that G_𝒜 has full row rank and G_𝒜ᵀ spans the same subspace. Notice that x⋆ and λ⋆ are not only optimal for 𝜃 = 𝜃̄; they are optimal for all 𝜃 that satisfy the KKT conditions (5.39), i.e. all 𝜃 such that

G H⁻¹G_𝒜ᵀ( G_𝒜 H⁻¹G_𝒜ᵀ )⁻¹ ( 𝑤_𝒜 + S_𝒜 𝜃 ) − 𝑤 − S𝜃 ⪯ 0
−( G_𝒜 H⁻¹G_𝒜ᵀ )⁻¹ ( 𝑤_𝒜 + S_𝒜 𝜃 ) ⪰ 0.
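The explicit formulas (5.42) are easy to evaluate once an active set has been guessed. The following Python sketch does so for random data and a hypothetical active set (both are assumptions made for illustration) and then checks whether the resulting point satisfies primal and dual feasibility, i.e. whether 𝜃 belongs to the corresponding critical region.

import numpy as np

rng = np.random.default_rng(8)
n, m, r = 3, 6, 2
M = rng.standard_normal((n, n))
H = M @ M.T + np.eye(n)
G = rng.standard_normal((m, n))
S = rng.standard_normal((m, r))
w = rng.random(m)
theta = rng.standard_normal(r)

active = np.array([0, 2])                       # a guessed active set (hypothetical)
Ga, wa, Sa = G[active], w[active], S[active]
K = Ga @ np.linalg.solve(H, Ga.T)               # G_A H^{-1} G_A'
lam_a = -np.linalg.solve(K, wa + Sa @ theta)    # (5.42a)
x = -np.linalg.solve(H, Ga.T @ lam_a)           # x = -H^{-1} G_A' lam_A, cf. (5.40)

primal_ok = np.all(G @ x - w - S @ theta <= 1e-9)
dual_ok = np.all(lam_a >= -1e-9)
print(primal_ok and dual_ok)   # True only if theta lies in this active set's region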
p⋆(𝜃) = f(x⋆(𝜃), 𝜃) = (1∕2)𝜃ᵀ( H22 − H21 H11⁻¹ H21ᵀ )𝜃.
2
Note that p⋆ (𝜃) is a quadratic function of 𝜃, and ∇2 p⋆ (𝜃) is the Schur complement of H11 in the
block matrix in (5.43).
𝔼[F(x, 𝜉)] ≈ (1∕m) Σ_{i=1}^{m} F(x, ai) (5.45)
where a1 , … , am are m independent observations of the random variable 𝜉.1 This is motivated by the
close connection between expected values and averages as discussed in Section 3.6. The resulting
problem is deterministic and of the form
minimize (1∕m) Σ_{i=1}^{m} fi(x), (5.46)
where fi ∶ ℝn → ℝ ̄ are defined as fi (x) = F(x, ai ). Problems of this form arise naturally in applica-
tions where a finite set of training examples is available. An example is empirical risk minimization
where the functions f1 , … , fm often take the form
fi (x) = l(aTi x, bi )
where l ∶ ℝ2 → ℝ is a loss function and (ai , bi ) ∈ ℝn × ℝ is one of m observations. We will see a
number of applications that involve optimization problems of this form in Chapter 10, including
linear regression, logistic regression, support vector machines, and artificial neural networks.
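As a simple illustration of the sample average approximation (5.45), the following Python sketch uses the squared loss l(aᵀx, b) = (1∕2)(aᵀx − b)², for which the sample average objective is a linear LS problem; the data-generating model and sample sizes are assumptions made only to show how the estimate improves with m.

import numpy as np

rng = np.random.default_rng(9)
n, x_true = 4, np.array([1.0, -2.0, 0.5, 3.0])
for m in (10, 100, 10000):
    A = rng.standard_normal((m, n))                  # observations a_1, ..., a_m
    b = A @ x_true + 0.1 * rng.standard_normal(m)    # noisy labels b_i
    x_hat, *_ = np.linalg.lstsq(A, b, rcond=None)    # minimizes (1/m) sum_i f_i(x)
    print(m, np.linalg.norm(x_hat - x_true))         # error shrinks as m grows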
Multistage stochastic optimization differs from the single-stage problem in that not all variables
are included for optimization at the same time; the optimization is performed in a sequential man-
ner in so-called stages, see, e.g. [34]. The partial optimization that is carried out at a stage is called a
decision. Usually, the decision and random outcomes at the current stage affect the value of future
decisions. To illustrate the basic principle, we now consider a random process 𝜉 ∶ Ω → ℤN , where
we assume that the set is finite,2 and we partition 𝜉 as 𝜉 = (𝜉0 , … , 𝜉N ), where 𝜉k is a random
variable associated with stage k. The decision variable xk at stage k is actually a function of the
random process and defined as xk ∶ ℤk → with (𝜉0 , … , 𝜉k ) → xk (𝜉0 , … , 𝜉k ), and where is also
a finite set. We let x = (x0 , … , xN ). We now realize that x is also a random process, and the way it
is defined it is said to be adapted to the random process 𝜉. We then define F ∶ ℤN × ℤN → ℝ and
the optimization problem
minimize 𝔼[F(x, 𝜉)],
where 𝔼[•] denotes expectation with respect to the random process 𝜉. This problem looks similar
to the single-stage stochastic optimization problem. However, because of the constraints imposed
on the decision variables, it is also possible to state the problem as
𝔼_{𝜉0}[ min_{x0(𝜉0)} 𝔼_{𝜉1}[ min_{x1(𝜉0,𝜉1)} 𝔼_{𝜉2}[ · · · 𝔼_{𝜉N}[ min_{xN(𝜉0,…,𝜉N)} F(x, 𝜉) ] · · · ]]] (5.47)

where 𝔼_{𝜉k}[•] denotes conditional expectation with respect to the probability distribution for 𝜉k given (𝜉0, … , 𝜉_{k−1}) for k ∈ ℕN, and where 𝔼_{𝜉0} is expectation with respect to 𝜉0. Here, we have made use of
the multiplication theorem discussed in Section 3.2. We realize that we may just as well consider 𝜉0
to be known and remove the expectation with respect to 𝜉0 , and this is often the way the problem
is stated. Also, notice that the innermost minimization has to be carried out parametrically with
respect to all variables except xN , and then expectation is taken with respect to 𝜉N conditioned on
given values of the random variables (𝜉0 , … , 𝜉N−1 ). One technicality is that the resulting random
variable after we carry out the parametric optimization might not be measurable in case we consider
nonfinite sets, in which case the expectation is not well defined. Another problem can be that the
parametric minimum does not exist. However, even for finite sets, the parametric optimization can
be very cumbersome to carry out, but we will see that there are applications of multistage stochas-
tic optimization to so-called “Markov decision processes,” for which the computational burden is
manageable; see Section 8.9
1 To be more precise, we let 𝜉1 , … , 𝜉m be independent identically distributed random variables, and we let
a1 , … , am be observations of each of these random variables, i.e. we repeat the same experiment m times.
2 The reason we restrict ourselves to a finite set is that the minima and random variables we implicitly define below
are not necessarily well-defined otherwise.
Exercises
5.3 Write a MATLAB script that uses YALMIP to solve the realization problem in Example 5.6
using the nuclear norm heuristic in (5.27). You can directly use norm(H,’nuclear’),
where H is the Hankel matrix, as your objective function in YALMIP, in which case you do
not need to include any constraints.
(a) Try your code on some problems generated with drss using a system order of
three. Neglect the direct term, and discard the system if it is close to being uncon-
trollable or unobservable. You can check this by computing the eigenvalues of the
controllability and observability Gramians of the system, i.e., using the commands
eig(dlyap(A,B*B’)) and eig(dlyap(A’,C’*C)). If the eigenvalues are close
to zero, then discard the system.
(b) Next, suppose that you know the first five Markov parameters, which you can easily
compute. You can then compute the 𝜖-rank of the optimal Hankel matrix by computing
its singular values. In this context, a reasonable threshold for considering a singular
value to be negligible is 𝜖 = 10−4 . Does the 𝜖-rank agree with the true order? Check at
least ten examples to get a fair statistic before you draw any conclusions.
(c) Use the result of Exercise 2.13 to obtain matrices (A, B, C). Do these matrices agree with
the ones you generated randomly, and if not, why? Do the Markov parameters agree?
(b) Show that p(x) is nonnegative for all x ∈ ℝ if and only if there exists a matrix X ∈ 𝕊+m+1
such that
ak = Σ_{0≤i,j≤m, i+j=k} X_{i+1,j+1}, 0 ≤ k ≤ 2m.
5.6 Recall that the nuclear norm of a matrix Z ∈ ℝm×n can be expressed as
||Z||∗ = sup_{W∈ℝm×n} { tr(WᵀZ) | ||W||2 ≤ 1 },
5.7 [14, Exercise 1.16] Suppose that we would like to compute the product of N matrices
M1 M2 · · · MN
where Mk ∈ ℝnk ×nk+1 , k ∈ ℕN , are given. Recall that matrix multiplication is associative, and
hence, there are (N − 1)! ways of carrying out the N − 1 matrix–matrix multiplications.
As a simple example, note that the product M1 M2 M3 may be computed as either (M1 M2 )M3
or M1 (M2 M3 ), where the parentheses indicate the order of operations. The result is always
the same in exact arithmetic, but the number of FLOPs is generally different.
Now, suppose that we would like to find a multiplication order that minimizes the total
number of FLOPs, which is known as the so-called matrix chain multiplication problem.
Derive a dynamic programming recursion using the following value functions or messages
V ∶ ℕN × ℕN → ℕ, where V(j, k) is the minimum number of FLOPs required to compute the
product Mj Mj+1 · · · Mk , where j ≤ k. Apply the resulting dynamic programming recursion
to the case, where N = 3 and (n1 , n2 , n3 , n4 ) = (2,10, 5,1).
5.8 Let A ∈ ℝ15×3 , b ∈ ℝ15 , and E ∈ ℝ15×2 , and consider the following multiparametric
quadratic program
minimize ||x − 𝟙||22
subject to Ax ⪯ b + E𝜃,
with variable x ∈ ℝ3 and parameter 𝜃 ∈ [−1, 1] × [−1, 1]. Solve an instance of this problem
using YALMIP and the multiparametric toolbox MPT3. The problem data (A, b, E) may be
generated randomly.
s1 = a1 x1
s2 = s1 + a2 x2
⋮
sm−1 = sm−2 + am−1 xm−1
b = sm−1 + am xm .
(b) Can the problem still be reformulated as a partially separable problem if some elements of a are equal to zero?
with variables xi ∈ ℝD, i ∈ ℕn. Let dij = ||xi − xj||2 and 𝛿ij = dij² for (i, j) ∈ ℰ and, similarly, let eik = ||xi − ak||2 and 𝜖ik = eik² for (i, k) ∈ ℰa, where ℰ and ℰa denote the sets of sensor–sensor and sensor–anchor pairs with distance measurements. Moreover, let X = [ x1 · · · xn ] ∈ ℝD×n and Y = XᵀX ∈ 𝕊n+.
(a) Show that the localization problem is equivalent to the optimization problem
minimize   Σ_{(i,j)∈ℰ} ( dij − rij )² + Σ_{(i,k)∈ℰa} ( eik − 𝑣ik )²
subject to Yii + Yjj − 2Yij = 𝛿ij
           𝛿ij = dij², dij ≥ 0, (i, j) ∈ ℰ
           Yii + ||ak||2² − 2xiᵀak = 𝜖ik
           𝜖ik = eik², eik ≥ 0, (i, k) ∈ ℰa
           Y = XᵀX,
with variables X, Y , dij and 𝛿ij for (i, j) ∈ , and eik and 𝜖ik for (i, k) ∈ a . Notice that the
objective function is convex, but many of the constraints are not.
where x1⋆, … , xn⋆ are the positions you obtain from the regularized relaxation and x1, … , xn are the positions you obtain using the Levenberg–Marquardt algorithm.
(e) Compute and report the mean square errors with respect to the true positions for
each of the three investigated algorithms (Levenberg–Marquardt, convex relaxation
with/without regularization). Moreover, report the value of the original objective
function for all three methods. To do this you have to compute the true distances
between the estimated positions, i.e., ||xi⋆ − xj⋆ ||2 and ||xi⋆ − ak ||2 , where (x1⋆ , … xn⋆ ) is
a solution obtained with one of the three methods. Notice that these distances are not
equal to dij and eij when you use the relaxed formulations. What method performs the
best in terms of the mean square error criterion, and what method performs the best in
terms of the original objective function?
Optimization Methods
We now turn our attention to numerical methods that can be used to solve different classes of
optimization problems. The methods are mostly iterative: given some initial point x0 , they gen-
erate a sequence of points xk , k ∈ ℕ, that converges to a local or global minimum. Such methods
were proposed already by Isaac Newton and Carl Friedrich Gauss. We will start by reviewing some
basic principles and properties that we will make use of throughout this chapter. We then discuss
first-order methods for unconstrained optimization, which are methods that make use of first-order
derivatives of the objective function. Second-order methods require that the Hessian of the objec-
tive function exists and is available, and we will see that the use of second-order information can
dramatically reduce the number of iterations required to find a solution. However, this typically
comes at the expense of more costly iterations, and we will explore the trade-off between the cost
per iteration and the number of iterations through the lens of variable metric methods, which use
first-order derivatives to approximate second-order derivatives. We will also consider methods for
nonlinear least-squares problem and methods for optimization problems that involve nonsmooth
functions and/or different types of constraints.
Many large-scale learning problems involve an objective function that is a sum of terms, and for
such problems, it is often very costly to compute the full gradient at each step. To overcome this
obstacle, we will discuss stochastic optimization methods that replace full gradients with stochastic
gradients that are cheap to compute. Moreover, many large-scale problems in learning are also
partially separable, and we will demonstrate how this can be utilized.
6.1.1 Smoothness
A function f ∶ ℝn → ℝ ̄ is said to be smooth if it is continuously differentiable on dom f , which is an
open set. Moreover, f is L-smooth on dom f if ∇f is Lipschitz continuous on dom f with Lipschitz
constant L > 0, i.e. ∇f satisfies
||∇f (y) − ∇f (x)||2 ≤ L||y − x||2 , ∀ x, y ∈ dom f . (6.1)
where B denotes the unit norm ball {p ∈ ℝⁿ | ||p|| ≤ 1}. Moreover, the directional derivative in
For the Euclidean norm, −∇f (x)∕||∇f (x)||2 is the unique normalized steepest descent direction
provided that ∇f (x) ≠ 0.
Before we turn our attention to any specific method, we will outline two different approaches to
the problem of finding a suitable step size and/or a descent direction, namely line search methods
and surrogation methods.
where 𝜙 ∶ ℝ → ℝ defined as 𝜙(t) = f (x + tΔx) is the restriction of f to the line defined by x and
the search direction Δx. This is a so-called exact line search, and although it is appealing to make
the most out of a descent direction, the minimization (6.6) can be expensive if a minimizer even
exists. We note that the exact line search is also called Cauchy’s rule in the special case, where
Δx = −∇f (x). An alternative to Cauchy’s rule is Curry’s rule, which can be stated as
t = min_{𝜏≥0} { 𝜏 | 𝜙′(𝜏) = 0 }, (6.7)
i.e. the step size is the smallest nonnegative t such that x + tΔx is a stationary point of f . This is a
root-finding problem, which can also be expensive to solve.
A more practical approach to the problem of finding a suitable step size is to choose t such that it
satisfies a sufficient descent condition known as the Armijo condition. This condition requires that
t > 0 satisfies
𝜙(t) ≤ 𝜙(0) + 𝛼1 t𝜙′ (0), (6.8)
where 𝛼1 ∈ (0, 1∕2) is a parameter. In other words, the reduction in the objective value must be
proportional to t𝜙′ (0), which is negative since Δx is assumed to be a descent direction. The Armijo
condition is illustrated in Figure 6.1. Note that it is always satisfied for some sufficiently small t > 0
since 𝜙(t) ≈ 𝜙(0) + 𝜙′ (0)t when t is close to zero. Thus, to ensure that the step makes a reason-
able amount of progress, we need to avoid short steps. One way to do this is to impose a so-called
curvature condition of the form
𝛼2 𝜙′ (0) ≤ 𝜙′ (t), (6.9)
where 𝛼2 ∈ (𝛼1 , 1) is a parameter. Roughly speaking, this means that a step size t > 0 is inadmissible
if the slope of 𝜙 at t is still downward and relatively steep. A step size t > 0 is said to satisfy the
Wolfe conditions if it satisfies the Armijo condition (6.8) as well as the curvature condition (6.9).
Figure 6.2 Illustration of the curvature conditions (6.9) and (6.10). The hatched regions correspond to step
sizes that violate the strong curvature condition, whereas step size in the gray regions violate both the
weak and strong curvature conditions.
Notice that the condition (6.9) does not rule out that 𝜙′ (t) may be large and positive. Thus, if we
would like t to be close to a stationary point of 𝜙, we may modify the curvature condition (6.9) to
include an upper bound on 𝜙′ (t), i.e.
𝛼2 𝜙′ (0) ≤ 𝜙′ (t) ≤ −𝛼2 𝜙′ (0). (6.10)
This may also be expressed as |𝜙′ (t)| ≤ 𝛼2 |𝜙′ (0)| and is often referred to as the strong curvature
condition. Figure 6.2 illustrates the curvature condition (6.9) and the stronger version (6.10).
Collectively, the Armijo condition and the strong curvature condition (6.10) comprise the strong
Wolfe conditions.
We end our discussion of line search methods by outlining two practical algorithms. The first
one is based on the Armijo condition, whereas the second one is based on the Wolfe conditions.
and the algorithm is summarized in Algorithm 6.1. The parameter 𝛽 controls how aggressively the
step size is reduced, whereas 𝛼1 controls the sufficient descent condition. To avoid unnecessarily
short steps, we require that 𝛼1 ∈ (0, 1∕2]. This bound can be motivated by considering the case
where 𝜙(t) is a quadratic function of t, i.e.
t2 ′′
𝜙(t) = 𝜙(0) + t𝜙′ (0) + 𝜙 (0).
2
The exact minimizer of 𝜙(t) is then t⋆ = −𝜙′ (0)∕𝜙′′ (0), and the Armijo condition (6.8) reduces to
t ≤ 2(1 − 𝛼1 )t⋆ . It follows that t = t⋆ does not satisfy the descent condition if 𝛼1 > 1∕2.
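A minimal Python sketch of a backtracking line search in the spirit of Algorithm 6.1 is given below; the parameter values and the quadratic test function are assumptions made for illustration, and the details of the algorithm in the text may differ.

import numpy as np

def backtracking(f, grad_f, x, dx, alpha1=0.25, beta=0.5):
    t = 1.0
    fx, slope = f(x), grad_f(x) @ dx            # phi(0) and phi'(0)
    while f(x + t * dx) > fx + alpha1 * t * slope:   # Armijo condition (6.8)
        t *= beta
    return t

# Example: one gradient descent step on f(x) = (1/2)||x||^2.
f = lambda x: 0.5 * x @ x
grad_f = lambda x: x
x = np.array([2.0, -1.0])
t = backtracking(f, grad_f, x, -grad_f(x))
print(t, f(x - t * grad_f(x)) < f(x))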
and the trivial lower bound l = 0 and upper bound u = +∞. The Armijo condition is checked at
the beginning of each loop iteration. If it is violated, then the upper bound u is reduced to t and
the midpoint of the interval [l, u] is used as a new candidate step size. Otherwise, the curvature
condition is checked, and if it is violated, then the lower bound is increased to t, and a new candidate
step size is then the midpoint of [l, u] if u is finite and otherwise, 2t. Note that once the upper bound
u is finite, the width of the interval [l, u] is reduced by a factor of two in each loop iteration. This
observation can be used to show that the algorithm either terminates after finitely many iterations,
or alternatively, the upper bound u remains infinite and t doubles in every loop iteration, which
implies that 𝜙(t) → ∞ as t → ∞.
some order. We will introduce two kinds of surrogation methods, namely trust-region methods and
majorization minimization methods.
This implies that the sequence converges sublinearly. The sequence xk = 2−k , k ∈ ℕ, also converges
to x∗ = 0, and we find that
lim_{k→∞} |x_{k+1} − x∗|∕|x_k − x∗| = lim_{k→∞} 2^{−k−1}∕2^{−k} = 1∕2,
which shows that this sequence converges linearly. Finally, the sequence defined recursively as
xk+1 = 𝛼xk2 , k ∈ ℕ, with x1 = 1∕2 and |𝛼| ∈ (0, 2) also converges to x∗ = 0. It satisfies
lim_{k→∞} |x_{k+1} − x∗|∕|x_k − x∗|² = lim_{k→∞} |𝛼x_k²|∕|x_k|² = |𝛼|,
and hence, the sequence converges quadratically. We note that the sequence diverges if |𝛼| > 2,
and xk = 1∕2 for all k ∈ ℕ if |𝛼| = 2.
where tk > 0 is the step size at iteration k. The step size can be chosen in several ways, e.g. using
some form of line search. We will analyze the behavior of the iterates generated by (6.13) with
different assumptions on f and the step size sequence.
see Exercise 6.1. However, without additional assumptions on f, there is no guarantee that the gradient descent method (6.14) converges to either a global or a local minimum.
where we have used the fact that ||xk − x⋆ ||2 is a nonincreasing function of k if tk ∈ (0, 2∕L)
for all k. Assuming that x0 ≠ x⋆ , we may rewrite this inequality as
( f(xk) − p⋆ ) ∕ ||x0 − x⋆||2 ≤ ||∇f(xk)||2.
Combining this with (6.17), we have that
𝛿_{k+1} ≤ 𝛿k − tk( 1 − (L∕2)tk ) 𝛿k²∕R²,
where 𝛿k = f (xk ) − p⋆ and R = ||x0 − x⋆ ||2 . Dividing by 𝛿k 𝛿k+1 on both sides and rearranging the
terms, we arrive at
1∕𝛿_{k+1} ≥ 1∕𝛿k + tk( 1 − (L∕2)tk ) (𝛿k∕𝛿_{k+1}) (1∕R²)
≥ 1∕𝛿k + tk( 1 − (L∕2)tk ) (1∕R²)
≥ 1∕𝛿0 + (1∕R²) Σ_{i=0}^{k} ti( 1 − (L∕2)ti ),
where the last inequality follows by recursive application of the previous inequality. Choosing a
constant step size ti = 𝜌∕L with 𝜌 ∈ (0, 2) leads to the bound
𝛿_{k+1} ≤ 𝛿0 LR² ∕ ( LR² + (k + 1)𝛿0 𝜌(1 − 𝜌∕2) ),
where 𝜌 = 1 minimizes the right-hand side. Thus, it follows that when f is L-smooth and convex,
then the gradient descent iteration as outlined in Algorithm 6.4 satisfies
f (xk ) − p⋆ = O(1∕k),
which means that the worst-case rate of convergence is sublinear. However, note that this does not
guarantee that xk converges unless p⋆ is attained.
Example 6.2 Consider the function f ∶ ℝ2 → ℝ defined as f (x) = g(Ax + b), where g ∶ ℝ3 → ℝ
is the log-sum-exp function
g(z) = ln (ez1 + ez2 + ez3 ) ,
and
A = ⎡  2   1 ⎤        b = ⎡ −2 ⎤
    ⎢ −1   1 ⎥ ,           ⎢ −1 ⎥ .
    ⎣ −1  −2 ⎦             ⎣  0 ⎦
Figure 6.3 Examples of the gradient descent iteration with three different starting points. The starting
points and the first ten iterations are shown.
||∇2 f (x)||2 = ||AT ∇2 g(Ax + b)A||2 ≤ ||A||22 ||∇2 g(Ax + b)||2 ≤ ||A||22 ∕2.
The last inequality follows from the fact that ||∇2 g(z)||2 ≤ 1∕2, which can be shown using the
so-called “Gershgorin circle theorem”. Figure 6.3 shows the level curves of the function f along
with the iterates of the gradient descent iterations for three different starting points and the con-
stant step size tk = 2∕||A||22 . Observe that the starting point has a significant effect on the practical
performance but not the asymptotic behavior.
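A short Python sketch of the iteration used in this example is given below; the starting point and iteration count are arbitrary choices made here, and the gradient of f is ∇f(x) = Aᵀ softmax(Ax + b).

import numpy as np

A = np.array([[2.0, 1.0], [-1.0, 1.0], [-1.0, -2.0]])
b = np.array([-2.0, -1.0, 0.0])

def f(x):
    z = A @ x + b
    return np.log(np.sum(np.exp(z)))

def grad_f(x):
    z = A @ x + b
    w = np.exp(z - np.max(z))                 # numerically safe softmax
    return A.T @ (w / np.sum(w))

t = 2.0 / np.linalg.norm(A, 2) ** 2           # constant step size from the text
x = np.array([0.0, 1.0])                      # one of several possible starting points
for k in range(50):
    x = x - t * grad_f(x)
print(x, f(x), np.linalg.norm(grad_f(x)))     # gradient norm decreases slowly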
where x⋆ is the unique minimizer, and hence, ∇f (x⋆ ) = 0. Using (6.4), it follows that
Using this inequality recursively and employing a constant step size tk = t ∈ (0, 2∕(𝜇 + L)], we con-
clude that
( )k
2t𝜇L
||xk − x⋆ ||22 ≤ 1 − ||x0 − x⋆ ||22 . (6.19)
𝜇+L
The best upper bound is obtained with t = 2∕(𝜇 + L), in which case
1 − 2t𝜇L∕(𝜇 + L) = ( (L − 𝜇)∕(L + 𝜇) )² = ( (𝜅 − 1)∕(𝜅 + 1) )²,
where 𝜅 = L∕𝜇 may be viewed as a condition number. Indeed, if f is twice continuously differen-
tiable, 𝜇-strongly convex, and L-smooth, then the eigenvalues of ∇2 f (x) belong to the interval [𝜇, L]
for all x, or equivalently, 𝜇I ⪯ ∇2 f (x) ⪯ LI. Thus, the ratio 𝜅 = L∕𝜇 may be viewed as an upper
bound on the condition number of the Hessian.
With additional assumptions on f , a pure Newton step can be shown to yield a descent if xk is
sufficiently close to a stationary point of f . However, a full step does not necessarily yield a descent
if xk is far away from a stationary point, in which case a damped Newton step should be used.
whenever ||∇f (xk )||2 ≤ 𝛿. The Lipschitz condition (6.25) implies that
f(y) ≤ f(x) + ∇f(x)ᵀ(y − x) + (1∕2)(y − x)ᵀ∇²f(x)(y − x) + (L2∕6)||y − x||2³,
and substituting xk for x and xk+1 = xk + Δxnt for y, we find that
f(x_{k+1}) ≤ f(xk) + ∇f(xk)ᵀΔxnt + (1∕2)Δxntᵀ∇²f(xk)Δxnt + (L2∕6)||Δxnt||2³
≤ f(xk) − λ(xk)² + (1∕2)λ(xk)² + (L2∕(6𝜇^{3∕2}))λ(xk)³,
where we have used the result that 𝜇||Δxnt ||22 ≤ λ(xk )2 , which follows from (6.24). Rewriting this
inequality as
f(x_{k+1}) ≤ f(xk) − ( 1∕2 − (L2∕(6𝜇^{3∕2}))λ(xk) ) λ(xk)²,
we see that the pure Newton step satisfies the backtracking line search condition (6.28) if
1∕2 − (L2∕(6𝜇^{3∕2}))λ(xk) ≥ 𝛼,

or, equivalently, if

λ(xk) ≤ (3𝜇^{3∕2}∕L2)(1 − 2𝛼).
Combining this inequality with the bound λ(xk ) ≤ 𝜇 −1∕2 ||∇f (xk )||2 , which is readily obtained from
(6.23), we conclude that a sufficient condition for the line search condition to be satisfied is that
||∇f(xk)||2 ≤ 𝛿 ≤ (3𝜇²∕L2)(1 − 2𝛼). (6.29)
Next, to derive the bound (6.26), we start by showing that the pure Newton update
xk+1 = xk + Δxnt satisfies
||∇f(x_{k+1})||2 ≤ (L2∕(2𝜇²))||∇f(xk)||2². (6.30)
This follows from the Lipschitz condition (6.25) by noting that
||∇f(x_{k+1})||2 = || ∇f(xk) + ∫₀¹ ∇²f(xk + 𝜏Δxnt)Δxnt d𝜏 ||2
= || ∫₀¹ [ ∇²f(xk + 𝜏Δxnt) − ∇²f(xk) ] Δxnt d𝜏 ||2
≤ (L2∕2)||Δxnt||2²,
and using the result that ||Δxnt ||22 = ||∇2 f (xk )−1 ∇f (xk )||22 ≤ 𝜇 −2 ||∇f (xk )||22 , we arrive at the inequality
(6.30). Applying (6.30) recursively, we find that after l pure Newton steps,
||∇f(x_{k+l})||2 ≤ ( (L2∕(2𝜇²))||∇f(xk)||2 )^{2^l − 1} ||∇f(xk)||2. (6.31)
Thus, to satisfy both (6.29) and (L2∕(2𝜇²))||∇f(xk)||2 ≤ 1∕2, we take

𝛿 = (𝜇²∕L2) min ( 1, 3(1 − 2𝛼) ), (6.32)

which results in the bound

||∇f(x_{k+l})||2 ≤ (1∕2)^{2^l − 1} 𝛿, l ∈ ℤ+.
This means that the sequence ||∇f (xk+l )||2 , l ∈ ℤ+ , converges at least quadratically to 0.
The strong convexity assumption can also be used to derive upper bounds on ||xk − x⋆||2 and the suboptimality f(xk) − p⋆. Indeed, using (6.26) and the fact that ||x − x⋆||2 ≤ (2∕𝜇)||∇f(x)||2 for all x ∈ S when f is 𝜇-strongly convex on S, we see that

||x_{k+l} − x⋆||2 ≤ (2∕𝜇)||∇f(x_{k+l})||2 ≤ 𝜇⁻¹ (1∕2)^{2^l − 2} ||∇f(xk)||2, l ∈ ℤ+.

Moreover, using the fact that f(x) − p⋆ ≤ (1∕(2𝜇))||∇f(x)||2² for all x ∈ S, cf. (4.30), we find that

f(x_{k+l}) − p⋆ ≤ 𝜇⁻¹ (1∕2)^{2^{l+1} − 1} ||∇f(xk)||2², l ∈ ℤ+.
where we have used the lower bound λ(x)2 ≥ 𝜇||Δxnt ||22 and the fact that ∇f (xk )T Δxnt = −λ(xk )2 .
Substituting 𝜇∕L for tk , which minimizes the right-hand side, yields the bound
f(x_{k+1}) ≤ f(xk) − (𝜇∕(2L))λ(xk)²
< f(xk) − 𝛼1(𝜇∕L)λ(xk)²,
since 𝛼1 ∈ (0, 1∕2). As a result, the step size 𝜇∕L satisfies the Armijo condition (6.8), and hence, the backtracking line search must terminate with a step size that satisfies tk ≥ 𝛽𝜇∕L. Now, using the bound λ(xk)² ≥ L⁻¹||∇f(xk)||2², which follows from (6.23), and the assumption that ||∇f(xk)||2 ≥ 𝛿, we arrive at
f(x_{k+1}) − f(xk) < −𝛼1𝛽(𝜇∕L)λ(xk)²
≤ −𝛼1𝛽(𝜇∕L²)||∇f(xk)||2²
≤ −𝛼1𝛽𝛿²𝜇∕L².
We conclude that (6.27) is satisfied with 𝜎 = 𝛼1 𝛽𝛿 2 𝜇∕L2 .
Example 6.3 To illustrate the typical behavior of Newton's method, we now revisit the smooth function f ∶ ℝ2 → ℝ from Example 6.2, i.e.
Figure 6.4 shows the iterates obtained using Newton’s method with a backtracking line search and
for three different starting points. Unlike the gradient method, Newton’s method clearly converges
very rapidly in the vicinity of the minimizer.
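A short Python sketch of a damped Newton iteration on the same function is given below; the starting point, the line search parameters, and the iteration count are assumptions made for illustration. The Hessian of the log-sum-exp function g at z is diag(p) − ppᵀ with p = softmax(z), so ∇²f(x) = Aᵀ(diag(p) − ppᵀ)A.

import numpy as np

A = np.array([[2.0, 1.0], [-1.0, 1.0], [-1.0, -2.0]])
b = np.array([-2.0, -1.0, 0.0])

def fgh(x):
    z = A @ x + b
    w = np.exp(z - np.max(z))
    p = w / np.sum(w)
    f = np.log(np.sum(w)) + np.max(z)              # f(x)
    g = A.T @ p                                    # gradient
    H = A.T @ (np.diag(p) - np.outer(p, p)) @ A    # Hessian
    return f, g, H

x = np.array([0.0, 1.0])
for k in range(6):
    f, g, H = fgh(x)
    dx = -np.linalg.solve(H, g)                    # Newton direction
    t = 1.0
    while fgh(x + t * dx)[0] > f + 0.25 * t * (g @ dx):   # Armijo backtracking
        t *= 0.5
    x = x + t * dx
    print(k, np.linalg.norm(g))                    # rapid decrease near the minimizer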
As an alternative to the second-order Taylor approximation in (6.20) that we used to derive the
Newton direction, we now consider a more general convex quadratic approximation mk ∶ ℝn → ℝ
of f at xk of the form
mk(y) = f(xk) + ∇f(xk)ᵀ(y − xk) + (1∕2)(y − xk)ᵀBk(y − xk), (6.36)
where Bk ≻ 0. The function mk can be viewed as a local surrogate of the function f at xk , and it is
easy to check that it satisfies
mk (xk ) = f (xk ), ∇mk (xk ) = ∇f (xk ).
The approximation mk may be used to define a search direction, i.e.
Δx = argmin_p mk(xk + p) = −Bk⁻¹∇f(xk),
which is the steepest descent direction in the quadratic norm || ⋅ ||Bk . This motivates an iteration of
the form
x_{k+1} = xk − tk Bk⁻¹∇f(xk), k ∈ ℤ+, (6.37)
where the step size is chosen using some form of line search. Note that the update corresponds to
the gradient descent method in (6.13) if we let Bk = I, and the choice Bk = ∇2 f (xk ) corresponds
to Newton’s method.
Generally speaking, the Newton direction is a better search direction than the negative gradient,
but it is also more expensive to compute. Variable metric methods can be viewed as a compromise
between the two methods in that they maintain an approximation of the Hessian or its inverse and
update the approximation without computing second-order derivatives. The main condition that
is used to update Bk or Hk = Bk^{-1} is the so-called secant equation
Bk+1 (xk+1 − xk) = ∇f(xk+1) − ∇f(xk). (6.38)
Recall that the definition of mk implies that ∇mk+1 (xk+1 ) = ∇f (xk+1 ), and the secant equation is
simply the additional condition that ∇mk+1 (xk ) = ∇f (xk ). We will define yk = ∇f (xk+1 ) − ∇f (xk ) and
sk = xk+1 − xk so that the secant equation can be expressed Bk+1 sk = yk . We will henceforth drop the
iteration index k from yk and sk to simplify the notation.
makes the method impractical for problems with large n. The limited-memory BFGS method,
which is also known as L-BFGS, addresses this issue by storing a limited history of, say, m BFGS
update pairs (sk−l , yk−l ), l ∈ ℕm , that are used to implicitly define Hk for k ≥ m. Specifically, Hk is
defined as a sequence of m BFGS updates starting with an initial approximate inverse Hessian Hk0 ,
which is typically chosen as a diagonal matrix or a scaled identity matrix such as Hk0 = 𝛾k I, where
𝛾k = yTk−1 sk−1 ∕||yk−1 ||22 . Note that rather than explicitly applying the m BFGS updates to form
Hk , the m update pairs are used to recursively compute matrix-vector products with Hk without
forming it. We note that the L-BFGS method requires O(n) memory for a fixed memory parameter m,
which is typically between 10 and 50, and hence, L-BFGS requires significantly less memory than
BFGS when n is large.
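To make the implicit computation of products with Hk concrete, the following sketch shows the standard two-loop recursion, assuming the m most recent pairs are stored in cell arrays S and Y (oldest to newest) and that the scaled identity Hk0 = 𝛾I is used; the function name and interface are illustrative and not taken from the text.

function d = lbfgs_direction(g, S, Y, gamma)
% Two-loop recursion: returns d = Hk*g without forming Hk explicitly.
% S{i}, Y{i} are the stored update pairs, ordered from oldest to newest.
m = numel(S);
alpha = zeros(m, 1);
rho = zeros(m, 1);
q = g;
for i = m:-1:1
    rho(i) = 1 / (Y{i}' * S{i});
    alpha(i) = rho(i) * (S{i}' * q);
    q = q - alpha(i) * Y{i};
end
d = gamma * q;                         % apply the initial approximation Hk0 = gamma*I
for i = 1:m
    beta = rho(i) * (Y{i}' * d);
    d = d + (alpha(i) - beta) * S{i};
end
end

The search direction in (6.37) would then be obtained as -lbfgs_direction(gk, S, Y, gamma) with gk = ∇f(xk).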
where r = y − Bk s. Taking the inner product with s on both sides of this equation yields 𝜎(𝑣T s)2 =
r T s. Combining this with (6.42), we find that
vv^T = rr^T/(𝜎 v^T s)^2 = rr^T/(𝜎 r^T s),
and hence,
Bk+1 = Bk + rr^T/(r^T s). (6.43)
Note that if r = 0 or r T s = 0, we simply skip the update and take Bk+1 = Bk . Unlike the BFGS and
DFP updates, the SR1 update does not guarantee that Bk+1 is positive definite, and hence, it is often
used in combination with a trust-region method.
The resulting method is known as the proximal gradient (PG) method, and it can also be expressed
as the iteration
xk+1 = prox_{tk h}( xk − tk ∇g(xk) ), k ∈ ℤ+, (6.46)
where tk = 1/L and prox_h : ℝ^n → ℝ^n is the so-called proximal operator associated with the function h, defined as
prox_h(x) = argmin_y { h(y) + (1/2)||y − x||2^2 }. (6.47)
From the definition of the proximal operator, we see that
u = proxh (x) ⟺ x − u ∈ 𝜕h(u),
and hence, the PG update in (6.46) can be expressed as
xk+1 = xk − tk ∇g(xk ) − tk 𝑣k+1 ,
for some 𝑣k+1 ∈ 𝜕h(xk+1). Note that evaluating the proximal operator amounts to solving an optimization problem, and hence, a single PG iteration is expensive unless the proximal operator is cheap to evaluate. The PG method is outlined in Algorithm 6.7. It can also be combined
with a line search as an alternative to the constant step size tk = 1∕L.
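As an illustration, the following sketch applies the PG iteration (6.46) to an ℓ1-regularized least-squares problem with g(x) = (1/2)||Ax − b||2^2 and h(x) = 𝛾||x||1, for which the proximal operator is elementwise soft thresholding; the data A, b and the parameter gamma are assumed given, and the constant step size 1/L with L = ||A||2^2 is used.

L = norm(A)^2;                        % Lipschitz constant of the gradient of g
t = 1 / L;
x = zeros(size(A, 2), 1);
for k = 1:500
    grad = A' * (A * x - b);          % gradient of g at the current iterate
    z = x - t * grad;                 % gradient step
    x = sign(z) .* max(abs(z) - t * gamma, 0);   % prox of t*gamma*||.||_1 (soft thresholding)
end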
When g is nonconvex, the PG method can be shown to converge to a stationary point under some
mild conditions. The first-order necessary optimality condition for (6.44) implies that a stationary
point x⋆ must satisfy
−∇g(x⋆ ) ∈ 𝜕h(x⋆ ). (6.48)
If we compare this condition to the stationarity condition for the majorization in (6.45) at x⋆ , i.e.
−∇g(x⋆ ) − L(y − x⋆ ) ∈ 𝜕h(y),
we see that the conditions coincide when y = x⋆ . This implies that a fixed-point of the iteration in
(6.46) is a stationary point of f = g + h. In the convex case, i.e. when both g and h are convex, the
PG iteration can be shown to satisfy
f (xk ) − p⋆ = O(1∕k).
An overview of other variants of the PG method, including a detailed analysis of both the convex
and the nonconvex case, can be found in [10]. We end this section by mentioning a few methods
that are closely related to the PG method.
with mass, e.g. a heavy ball, under friction in a potential field. The constants 𝜇 and m are friction and mass constants, respectively. Employing a finite-difference method leads to an iteration of the form
xk+1 = xk − 𝛾∇f(xk) + 𝜂(xk − xk−1),
where 𝛾 and 𝜂 are constants and xk = x(tk) is the state at time tk. This is an example of a so-called "multistep method." Notice that it reduces to the gradient method if 𝜂 = 0, and hence, it is the last term that introduces momentum.
Inspired by the heavy ball method, Nesterov [80] proposed an accelerated gradient method that
satisfies the improved worst-case bound
f (xk ) − p⋆ = O(1∕k2 ),
when f is convex and L-smooth. An extension of this method is the so-called accelerated proximal
gradient (APG) method, which is due to [11] and outlined in Algorithm 6.8. Unlike the PG method,
the APG method shown here is not a descent method. The next example illustrates the effect of the
acceleration.
[Figure 6.5: Suboptimality f(xk) − p⋆ versus iteration number k for the PG and APG methods.]
Figure 6.5 shows a numerical example for a problem instance with m = 200 and n = 800. The
plot clearly shows that the PG method is a descent method, whereas the APG method is not. Both
methods exhibit a sublinear convergence rate, but the effect of acceleration makes a significant
difference.
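A common accelerated variant, in the spirit of Algorithm 6.8 (whose exact form in the text may differ), adds an extrapolation step to the PG iteration. The sketch below again uses the ℓ1-regularized least-squares example with data A, b, parameter gamma, and step size 1/L.

L = norm(A)^2;
t = 1 / L;
x = zeros(size(A, 2), 1);
x_prev = x;
for k = 1:500
    theta = (k - 1) / (k + 2);            % extrapolation parameter
    y = x + theta * (x - x_prev);         % momentum/extrapolation step
    z = y - t * (A' * (A * y - b));       % gradient step at the extrapolated point
    x_prev = x;
    x = sign(z) .* max(abs(z) - t * gamma, 0);   % proximal (soft-thresholding) step
end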
We now consider optimization problems with an objective function that can be expressed as the difference g(x) − h(x) of two convex functions, where g : ℝ^n → ℝ̄ and h : ℝ^n → ℝ̄ are both proper, closed, and convex functions. Such problems
are often referred to as difference convex optimization problems. The difference g − h is generally
not a convex function, but a convex majorization of g − h at a point x can easily be constructed if
h is continuously differentiable or subdifferentiable at x. Indeed, since h is convex, we can use an
affine lower bound on h
for all y, x ∈ ℝ^n. The assumption that h is nondecreasing implies that h′(|xi|) ≥ 0, and hence, the left-hand side is a convex function of y, and it is a majorization of the right-hand side at x. This
leads to the majorization minimization iteration
where 𝑤k ⪰ 0 is the vector with elements 𝑤k,i = h′ (|xk,i |) for i ∈ ℕn . For example, the function
h(t) = ln(t + 𝛿) with 𝛿 > 0 is an increasing concave function on (−𝛿, ∞), which leads to the weight
update 𝑤k,i = (|xk,i | + 𝛿)−1 . We end this example by noting that the approach is also known as
iteratively reweighted 𝓁1 -regularization since each iteration involves a weighted 𝓁1 -regularized
optimization problem with new weights.
Recall the general nonlinear LS problem (5.1), which is an unconstrained optimization problem of
the form
minimize (1/2)||f(x)||2^2,
where f ∶ ℝn → ℝm is defined as f (x) = (f1 (x), … , fm (x)) and fi ∶ ℝn → ℝ, i ∈ ℕm . This is generally
a difficult nonlinear optimization problem. Several local optimization methods exist that are tai-
lored to the specific structure of the problem. One such method is the Gauss–Newton (GN) method,
which is applicable when the functions f1 , … , fm are continuously differentiable. The basic idea is
to replace f by its first-order Taylor approximation around the current iterate xk, i.e. f̃ : ℝ^n → ℝ^m defined as
f̃(x; xk) = f(xk) + (𝜕f(xk)/𝜕x^T)(x − xk).
The GN method can then be expressed as the iteration
xk+1 ∈ argmin_x (1/2)||f̃(x; xk)||2^2, k ≥ 0. (6.57)
Each iteration involves solving a linear least-squares problem since f̃ (x; xk ) is an affine function of x,
and the update is unique if the Jacobian matrix 𝜕f (xk )∕𝜕xT has full rank. Noting that (1∕2)||f̃ (x; xk )||22
is a quadratic approximation of (1∕2)||f (x)||22 , we see that the GN method can also be viewed as a
variable metric method.
If f is twice continuously differentiable, then the Hessian of f0 (x) = (1∕2)||f (x)||22 is given by
∇^2 f0(x) = ∑_{i=1}^{m} ( fi(x)∇^2 fi(x) + ∇fi(x)∇fi(x)^T ).
This suggests that the GN update (6.57) mimics a pure Newton step if the terms involving ∇^2 fi(x) on the right-hand side are negligible, in which case we can expect super-linear convergence. However, although the
GN method often works well in practice, it may fail to converge. This issue can be addressed by
combining the GN method with a line search provided that the sublevel set {x | f0 (x) ≤ f0 (x0 )} is
bounded and that 𝜕f (xk )∕𝜕xT has full rank for all k.
we see that if 𝜇k is sufficiently large, then the LM method will essentially behave like gradient descent with a step size that is roughly equal to 1/𝜇k. On the other hand, if 𝜇k is small, then the step is very close to a Gauss–Newton step. Algorithm 6.9 adjusts 𝜇k in an adaptive manner and is
one of many different variants of the LM algorithm. For example, there are trust-region variants of
the GN method, which are also sometimes referred to as LM algorithms; see, e.g. [78].
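One simple way to adjust 𝜇k adaptively is sketched below (the details of Algorithm 6.9 in the text may differ): the damped step solves (J^T J + 𝜇I)Δx = −J^T f, 𝜇 is decreased after a successful step and increased otherwise, and the method approaches a pure Gauss–Newton step as 𝜇 becomes small. The function handles resfun and jacfun, returning f(x) and the Jacobian 𝜕f(x)/𝜕x^T, and the starting point x are assumptions of this illustration.

mu = 1e-2;
for k = 1:100
    f = resfun(x);
    J = jacfun(x);
    dx = -(J' * J + mu * eye(numel(x))) \ (J' * f);   % damped Gauss-Newton step
    if 0.5 * norm(resfun(x + dx))^2 < 0.5 * norm(f)^2
        x = x + dx;  mu = mu / 2;     % accept the step and reduce the damping
    else
        mu = 2 * mu;                  % reject the step and increase the damping
    end
end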
where g ∶ ℝp → ℝm is defined as
g(𝛼) = f (x⋆ (𝛼), 𝛼) = −(I − A(𝛼)A(𝛼)† )b(𝛼).
Recall that P(𝛼) = I − A(𝛼)A(𝛼)† is a projection matrix, and hence g(𝛼) is the projection of −b(𝛼)
onto the nullspace of A(𝛼)T . We will henceforth assume that rank(A(𝛼)) = n, which implies that
A† (𝛼) = (A(𝛼)T A(𝛼))−1 A(𝛼)T , and we will omit 𝛼 from A(𝛼), b(𝛼), and P(𝛼) for notational conve-
nience.
The method we will derive is often called the variable projection method, since it is based on
minimizing the variable projection functional (6.60). In principle, one could simply apply the GN
method or the LM algorithm directly to this problem, but it turns out that there is a better way to
proceed. The idea is to use only an approximate gradient [61]. We will follow the derivation in [86].
The gradient of g(𝛼) is the transpose of 𝜕g/𝜕𝛼^T, which may be expressed as
𝜕g/𝜕𝛼^T = −(𝜕P/𝜕𝛼)b − P(𝜕b/𝜕𝛼^T),
where we define (𝜕P/𝜕𝛼)b to be the matrix with columns (𝜕P/𝜕𝛼_k)b. Now, using the fact that
In the context of the stochastic problem in (6.62), the nonlinear equation of interest is the sta-
tionarity condition ∇f (x) = 0, where f (x) = 𝔼[ F(x, 𝜉) ]. The resulting algorithm is an iteration of
the form
xk+1 = xk − tk gk , k ∈ ℤ+ , (6.63)
where x0 is an initial guess, tk > 0 is the step size at iteration k, and gk ∈ ℝ^n is a realization of an estimator Gk ∈ ℝ^n of ∇f(xk). It should be stressed that Gk is a random variable for each k. Hence,
the iteration (6.63) is a realization of a stochastic process
Xk+1 = Xk − tk Gk , k ∈ ℤ+ , (6.64)
where the random variable X0 is the initial state of the process. We consider an unbiased estimator
Gk of ∇f (xk ) with bounded variance, i.e.
𝔼[ Gk | Xk = xk ] = ∇f(xk), 𝔼[ ||Gk − ∇f(xk)||2^2 | Xk = xk ] ≤ c^2, (6.65)
for all k and for some scalar c ≥ 0. Note that in the special case where c = 0, the iteration in (6.63)
is essentially the gradient descent method (6.13). We note that the gradient of F(x, 𝜉) with respect
to x, which we will denote ∇F(x, 𝜉), is a random variable, and it is an unbiased estimator of ∇f (x) if
𝜕𝔼[ F(x, 𝜉) ]/𝜕xi = 𝔼[ 𝜕F(x, 𝜉)/𝜕xi ], i ∈ ℕn.
When this condition is satisfied, it is often natural to choose Gk = ∇F(Xk, 𝜉k), where 𝜉k has the same
distribution as 𝜉 for all k ≥ 0, and where 𝜉k and 𝜉j are independent for j ≠ k.
The iteration (6.63) is often referred to as a stochastic gradient (SG) method or a stochastic
gradient descent (SGD) method. However, it is important to note that it is not a descent method in
the deterministic sense, i.e. the search direction −gk is not necessarily a descent direction, so the
objective value may increase in some iteration. We note that the step size tk is often referred to as
the learning rate in machine learning literature.
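As a minimal illustration of (6.63) with a decaying step size of the form tk = t/(k + 1 + 𝜁)^𝛿 (cf. (6.70) below), consider the following sketch, where sgrad(x) is assumed to return a realization gk of the estimator Gk at x and x0 is an initial guess.

t0 = 0.1;  zeta = 0;  delta = 0.7;     % step-size parameters
x = x0;
for k = 0:999
    tk = t0 / (k + 1 + zeta)^delta;    % decaying step size
    x = x - tk * sgrad(x);             % stochastic gradient step
end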
To better understand the influence of the step size sequence tk , we will now analyze the process
under different assumptions on f .
where p⋆ = inf x f (x), and combining this inequality with (6.67) yields
∑_{j=0}^{k−1} tj (1 − Ltj/2) · min_{j=0,…,k−1} 𝔼[ ||∇f(Xj)||2^2 ] ≤ 𝔼[ f(X0) ] − p⋆ + (Lc^2/2) ∑_{j=0}^{k−1} tj^2.
Equivalently, if we divide by the sum ∑_{j=0}^{k−1} tj (1 − Ltj/2) on both sides, and assuming that it is positive, we see that
min_{j=0,…,k−1} 𝔼[ ||∇f(Xj)||2^2 ] ≤ (𝔼[ f(X0) ] − p⋆) / (∑_{j=0}^{k−1} tj (1 − Ltj/2)) + (Lc^2/2)(∑_{j=0}^{k−1} tj^2) / (∑_{j=0}^{k−1} tj (1 − Ltj/2)). (6.68)
It follows that a sufficient condition for the right-hand side to vanish as k → ∞ is that
lim_{k→∞} ∑_{j=0}^{k−1} tj (1 − Ltj/2) = ∞, lim_{k→∞} (∑_{j=0}^{k−1} tj (1 − Ltj/2)) / (∑_{j=0}^{k−1} tj^2) = ∞,
or equivalently,
lim_{k→∞} ∑_{j=0}^{k−1} tj = ∞, lim_{k→∞} (∑_{j=0}^{k−1} tj) / (∑_{j=0}^{k−1} tj^2) = ∞. (6.69)
Examples of step size sequences that satisfy these conditions are sequences of the form
tk = t/(k + 1 + 𝜁)^𝛿, k ∈ ℤ+, (6.70)
where t > 0, 𝜁 ≥ 0, and 𝛿 ∈ (0, 1] are fixed parameters. The parameter t scales the sequence,
𝛿 controls the asymptotic rate of decay, and 𝜁 may be used to reduce the rate of decay in early
iterations. The value of t has no effect on the asymptotic behavior, but it typically has a strong
effect on the nonasymptotic behavior. To see this, first note that step-size sequences of the form in
(6.70) satisfy
1/(∑_{j=0}^{k−1} tj) ∝ 1/t, (∑_{j=0}^{k−1} tj^2)/(∑_{j=0}^{k−1} tj) ∝ t.
Comparing with the right-hand side of (6.68), we see that the choice of t presents a trade-off between
the two terms: increasing t reduces the first term, but increases the second and vice versa.
We now analyze the worst-case bound in (6.68) for different step-size sequences. We start by
noting that if c = 0, which corresponds to the ordinary gradient method, then it suffices to choose
Figure 6.6 Construction of upper and lower bounds on sk = ∑_{j=1}^{k} (j + 𝜁)^{−p}, illustrated for k = 7. The gray area is equal to s7.
a constant step-size sequence tk = t ∈ (0, 2/L) such that ∑_{j=0}^{k−1} tj (1 − Ltj/2) = O(k). Indeed, this
implies that the right-hand side of (6.68) decays as O(1∕k). However, in the stochastic setting,
where c > 0, a constant step-size sequence does not make the right-hand side of (6.68) vanish
as k → ∞. In this case, we will instead consider the decreasing step-size sequence in (6.70) for
different values of the decay rate parameter 𝛿. The sum of the first k step sizes and the sum of their squares can be expressed as
∑_{j=0}^{k−1} tj = t ∑_{j=1}^{k} (j + 𝜁)^{−𝛿}, ∑_{j=0}^{k−1} tj^2 = t^2 ∑_{j=1}^{k} (j + 𝜁)^{−2𝛿},
both of which are proportional to a sum of the form sk = ∑_{j=1}^{k} (j + 𝜁)^{−p} with p = 𝛿 or p = 2𝛿, respectively.
To expose the asymptotic behavior of this sum, we now bound sk from above and below in terms of the definite integral ∫_1^k (𝜏 + 𝜁)^{−p} d𝜏, as illustrated in Figure 6.6. This leads to the inequalities
∫_1^k (𝜏 + 𝜁)^{−p} d𝜏 ≤ sk ≤ (1 + 𝜁)^{−p} + ∫_1^k (𝜏 + 𝜁)^{−p} d𝜏,
and using the result that
lim_{k→∞} ∫_1^k (𝜏 + 𝜁)^{−p} d𝜏 = (1 + 𝜁)^{1−p}/(p − 1) for p > 1, and = ∞ for p ∈ [0, 1],
we can conclude that sk converges when p > 1 and diverges otherwise. In the latter case, the upper
and lower bounds on sk allow us to establish the asymptotic equivalence
sk ∼ k^{1−p} for p ∈ [0, 1), and sk ∼ ln(k) for p = 1.
This result may be used to derive upper bounds on the right-hand side of (6.68) for different values
of 𝛿, which are summarized in Table 6.1. Note that asymptotically, the upper bound decays the
fastest when 𝛿 = 1∕2. However, this choice is not necessarily the best one in practice.
[Table 6.1: ∑_{j=0}^{k−1} tj, ∑_{j=0}^{k−1} tj^2, and the resulting upper bound in (6.68) for different values of the parameter 𝛿.]
where ik ∈ ℕm is chosen according to some index selection rule. Two of the most common rules
are the cyclic rule ik = (k mod m) + 1 and the random rule, where each ik is chosen uniformly at
random from ℕm . With the random index selection rule, the incremental gradient method may
be viewed as an SG method applied to (5.44) if the random variable 𝜉 is discrete with m equiprobable
outcomes. We note that more general incremental proximal gradient (IPG) methods that can handle
simple nondifferentiable functions akin to the proximal gradient method (6.46) have been proposed
and analyzed in [16].
We now return to the problem in (5.46), where the objective function consists of a sum of m
functions, e.g. corresponding to m observations. Applying the gradient method to this problem
requires the full gradient, i.e.
∇f(xk) = (1/m) ∑_{i=1}^{m} ∇fi(xk).
In other words, the gradient of all the functions must be computed in order to obtain the search
direction, and as a result, the gradient method is sometimes referred to as a batch gradient method.
In contrast to the gradient method, the incremental gradient method uses as search direction the
negative gradient of a single function, i.e. −∇fik (xk ), and as a consequence, an incremental gradient
iteration can be much cheaper than a gradient iteration when m is large. However, the gradient
of a single function may be viewed as a noisy approximation of the full gradient, and hence, the
search direction is not necessarily a descent direction. A compromise between the gradient and
incremental gradient methods may be obtained by using a subset of p functions at each iteration, i.e.
xk+1 = xk − (tk/p) ∑_{i∈ℐk} ∇fi(xk),
where ℐk ⊂ ℕm and |ℐk| = p. This approach, which is known as a mini-batch method, provides a
way to reduce the variance of the search direction at the expense of increased computation cost per
iteration.
minimize (𝛾/2)||x||2^2 + (1/m) ∑_{i=1}^{m} ln(1 + exp(−yi ai^T x)),
with variable x ∈ ℝn , regularization parameter 𝛾 > 0, and problem data y ∈ {−1, 1}m and
A ∈ ℝm×n , where aTi is the ith row of A. To apply the stochastic gradient method to this problem,
we define a realization of a stochastic gradient of the objective function at xk as
gk = 𝛾xk − (1/p) ∑_{i∈ℐk} ( yi exp(−yi ai^T xk) / (1 + exp(−yi ai^T xk)) ) ai,
where ℐk is a sample of p elements from ℕm drawn at random without replacement. Note that
gk = ∇f(xk) in the special case where p = m.
Figure 6.7 The relative suboptimality for the stochastic gradient method using step-size sequences of the form (6.70) with initial step size t and decay parameter 𝛿.
Figure 6.7 illustrates the typical behavior of the stochastic gradient method based on a data set with m = 32561 and n = 123 using the mini-batch size p = 326, which corresponds to roughly 1% of the data. The plots show the relative suboptimal-
ity |f (xk ) − f (x⋆ )|∕|f (x⋆ )| obtained with different step-size sequences of the form (6.70) with 𝜁 = 0
and different values of t and the decay parameter 𝛿. The primary axis shows the number of epochs,
which is the number of iterations scaled by p∕m. The figure clearly demonstrates that progress
can be very slow if the initial step size is too small or too large.
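For reference, the mini-batch stochastic gradient gk used in this example can be computed as in the following sketch, where A, y, gamma, the current iterate xk, the mini-batch size p, and the step size tk are assumed given; sampling without replacement is done with randperm.

[m, n] = size(A);
idx = randperm(m, p);                 % sample p indices without replacement
Ak = A(idx, :);
yk = y(idx);
u = exp(-yk .* (Ak * xk));            % elementwise exp(-y_i * a_i' * xk)
gk = gamma * xk - (1/p) * Ak' * (yk .* (u ./ (1 + u)));   % stochastic gradient
xk = xk - tk * gk;                    % stochastic gradient update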
replace a simple gradient estimator with a more sophisticated unbiased estimator with lower vari-
ance. For example, in the incremental setting where the objective function is a finite sum of m
functions, the so-called stochastic variance-reduced gradient (SVRG) method uses an update of the
form
xk+1 = xk − tk ( ∇f_{ik}(xk) − ∇f_{ik}(x̃) + 𝜇̃ ),
where ik ∈ ℕm is selected uniformly at random. The vector x̃ is an approximation of xk and is
updated only every M iterations, and 𝜇̃ = ∇f (̃x) is the full gradient at x̃ . In the smooth and strongly
convex setting, this can be shown to converge linearly, in expectation, with a suitable constant
step size. We note that the technique is closely related to the method of control variates, which is
illustrated in Exercise 3.11. Several other variance reduction techniques exist; see, e.g. [46] for an
overview.
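A sketch of one outer SVRG iteration for an objective of the form f(x) = (1/m)∑ fi(x) is shown below; gradi(x, i) is assumed to return ∇fi(x), fullgrad(x) the full gradient ∇f(x), and t a suitable constant step size.

xtilde = x;                        % snapshot point
mutilde = fullgrad(xtilde);        % full gradient at the snapshot
for j = 1:M
    i = randi(m);                                     % uniformly random index
    g = gradi(x, i) - gradi(xtilde, i) + mutilde;     % variance-reduced gradient estimate
    x = x - t * g;                                    % update with constant step size
end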
We end this section by outlining some examples of popular stochastic methods with adaptive
step-size strategies.
6.8.4.1 AdaGrad
The adaptive gradient method, or AdaGrad, is a method for stochastic problems of the form (5.44),
where the function F(x, 𝜉) is assumed to be of the form F(x, 𝜉) = G(x, 𝜉) + h(x) with G(⋅, 𝜉) and h
closed and convex. The method uses either a diagonal matrix or a full matrix to adaptively scale
stochastic gradients. Specifically, if gk ∈ 𝜕G(xk , 𝜉k ), where 𝜉k denotes a realization of 𝜉 at iteration
k and
Ĝk = ∑_{i=0}^{k} gi gi^T,
𝑣̃k = (1/(k + 1)) ∑_{i=0}^{k} gi ∘ gi.
The AdaGrad update may be expressed as
xk+1 = prox_h^{Bk}( xk − tk g̃k ),
with tk = 1/(𝛾√(k + 1)) and g̃k = diag(𝑣̃k + (𝜀/(k + 1)) 𝟙)^{−1/2} gk, which shows that AdaGrad implicitly employs a diminishing step-size sequence. In the convex setting, AdaGrad can be shown to satisfy the worst-case bound
𝔼[ f(Xk) ] − p⋆ ≤ O(1/√k).
We refer the reader to [35] for further analysis of AdaGrad and details regarding convergence.
6.8.4.2 RMSprop
The root mean square propagation (RMSprop) method is in many ways similar to AdaGrad. It is a
stochastic gradient iteration of the form
xk+1 = xk − Bk^{-1} gk,
where gk = ∇F(xk , 𝜉k ) is a gradient estimate, and Bk is a diagonal matrix that is chosen proportion-
ally to the elementwise root mean square of previous gradient estimates. Unlike AdaGrad that uses
all gradient estimates to compute an adaptive scaling, RMSprop emphasizes more recent stochastic
gradients through the use of an exponential moving average, i.e.
𝑣k = 𝛽𝑣k−1 + (1 − 𝛽)(gk ∘ gk)
Bk = 𝛾 diag(𝑣k + 𝜀𝟙)^{1/2}
xk+1 = xk − Bk^{-1} gk,
where the parameter 𝛽 ∈ (0, 1) controls the adaptiveness of the estimate. RMSprop was proposed in
a lecture note [103] along with the suggested parameter value 𝛽 = 0.9. Note that unlike AdaGrad,
RMSprop does not implicitly result in a diminishing step-size sequence, and in fact, the method
need not converge, as shown in [95].
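A minimal sketch of the RMSprop update with the suggested 𝛽 = 0.9 is given below; sgrad(x) is assumed to return a stochastic gradient estimate, and gam (𝛾) and epsv (𝜀) are fixed parameters.

beta = 0.9;
v = zeros(size(x));
for k = 1:maxit
    g = sgrad(x);
    v = beta * v + (1 - beta) * (g .* g);      % exponential moving average of squared gradients
    x = x - g ./ (gam * sqrt(v + epsv));       % elementwise scaled step, Bk = gam*diag(v + epsv)^(1/2)
end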
6.8.4.3 Adam
The adaptive moment estimation method, which is known as Adam, combines the adaptive scaling
approach of RMSprop with gradient aggregation or momentum. It employs an exponential moving
average of the form
𝜇k = 𝛽1 𝜇k−1 + (1 − 𝛽1 )gk ,
to compute a weighted average of previous gradient estimates. The vector 𝜇k can be viewed as an
estimate of the first raw sample moment of the weighted sequence of gradient estimates, and this
is used instead of gk to compute a search direction at iteration k, i.e.
𝜇k = 𝛽1 𝜇k−1 + (1 − 𝛽1)gk
𝑣k = 𝛽2 𝑣k−1 + (1 − 𝛽2)(gk ∘ gk)
Bk = 𝛾(1 − 𝛽1)( (1 − 𝛽2)^{−1/2} diag(𝑣k)^{1/2} + 𝜀I )
xk+1 = xk − Bk^{-1} 𝜇k.
The method was proposed in [63] with the recommended parameter values 𝛽1 = 0.9 and 𝛽2 = 0.999.
Like RMSprop, the method need not converge, as shown in [95]. A variant of Adam, known as
AdaMax, can be obtained by replacing 𝑣k and Bk by quantities based on an elementwise running maximum of past gradient magnitudes rather than an exponential moving average of their squares.
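As a rough illustration, the Adam update stated above can be implemented as in the following sketch (without the bias-correction factors that appear in some presentations of the method); sgrad, gam (𝛾), epsv (𝜀), and maxit are assumptions of this illustration.

beta1 = 0.9;  beta2 = 0.999;
mu = zeros(size(x));
v = zeros(size(x));
for k = 1:maxit
    g = sgrad(x);
    mu = beta1 * mu + (1 - beta1) * g;           % first-moment (momentum) estimate
    v = beta2 * v + (1 - beta2) * (g .* g);      % second-moment estimate
    Bdiag = gam * (1 - beta1) * (sqrt(v) / sqrt(1 - beta2) + epsv);   % diagonal of Bk
    x = x - mu ./ Bdiag;                         % xk+1 = xk - Bk^{-1} * mu_k
end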
We now consider a class of methods for solving unconstrained optimization problems of the form
(6.5) using coordinatewise updates. The prototypical coordinate descent iteration is of the form
xk+1 = xk − tk [∇f (xk )]ik eik , (6.73)
where ik ∈ ℕn is a coordinate index and tk > 0 is a step size. Common index selection strategies
include the cyclic order ik = (k mod n) + 1, a randomized cyclic order where the order is reshuffled
every n iterations, and a fully randomized order where the indices are selected uniformly at random.
The step size can be chosen in a similar way to the gradient method, e.g. using some form of line
search. With an exact line search, the method performs coordinatewise minimization, which can
be expressed as
tk ∈ argmin_t f(xk + t e_{ik}),
xk+1 = xk + tk e_{ik}. (6.74)
Note that only the ik th element of x is updated at iteration k. This is a descent method by construc-
tion. However, without additional assumptions on f , the iteration in (6.74) does not necessarily
converge: one can construct an example, where (6.74) with the cyclic update order will enter a cycle
for some initializations, as demonstrated by Powell [90]. Moreover, even if the iteration (6.74) does
converge, it may not be to a stationary point if f is nonsmooth. Figure 6.8 shows an example of a
nonsmooth convex function where this can happen.
Next, we consider the case where f is smooth and convex, and we assume that the set
{x | f (x) ≤ f (x0 )} is nonempty and compact, which ensures that f attains its minimum. Using a
suitable step-size sequence, the coordinate descent iteration (6.73) can then be shown to converge
to a minimizer of f . The first-order condition for convexity (4.27) implies that
f(x + t ei) ≥ f(x) + (𝜕f(x)/𝜕xi) t, i ∈ ℕn,
for all x ∈ ℝn , and hence, we have that
∇f (x) = 0 ⟺ f (x + tei ) ≥ f (x) for all t ∈ ℝ, i ∈ ℕn .
In other words, x is a minimizer of f if and only if x minimizes f along all n coordinate directions.
Now, using the exact line search (6.74), iteration k yields a descent unless the ik th element of ∇f (xk )
is equal to zero, and hence, a cycle through all n coordinates yields a descent unless x is a minimum.
With additional assumptions on f and the step sizes, the iteration (6.74) can be shown to converge
to a minimizer of f . An overview of more sophisticated variants of the coordinate descent method
and detailed analyses can be found in, e.g. [12, 116].
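As an illustration, the cyclic variant of (6.73) with a fixed step size can be sketched as follows; gradfun is assumed to return ∇f(x), and in practice one would of course evaluate only the single partial derivative that is needed in each iteration.

x = x0;
n = numel(x0);
t = 0.1;                              % fixed step size (assumed suitable for f)
for k = 0:maxit-1
    ik = mod(k, n) + 1;               % cyclic index selection
    g = gradfun(x);
    x(ik) = x(ik) - t * g(ik);        % update only the ik-th coordinate
end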
Figure 6.8 Contour plot of the function f(x) = max(|2x1 − x2|, |2x2 − x1|), which is convex but nonsmooth. The function is nondifferentiable whenever x1 = x2 or x1 = −x2, and none of the coordinate directions is a descent direction when x1 = x2.
Example 6.7 Coordinate descent methods typically work well when the coupling between
variables is weak. To illustrate this, we now consider convex quadratic functions of the form
f (x) = xT Qx with a fixed condition number 𝜅(Q) = 20. Specifically, we take Q = Udiag(20, 1)U T
for different choices of U ∈ ℝ2×2 such that U T U = I. Figure 6.9 shows the coordinate descent
method in action for different choices of U corresponding to different orientations of the coordi-
nate system. The problem of minimizing f (x) is separable in the special case where Q is diagonal,
and in this case, the minimum point is reached in two iterations (one cycle). In contrast, progress
is slow when x1 and x2 are maximally coupled, which is the case when
U = ±(1/√2) [ 1, ±1; ∓1, 1 ].
Coordinate descent methods are often useful for regularized risk minimization problems of the
form
minimize (1/m) ∑_{i=1}^{m} g(ai^T x) + 𝜆 h(x), (6.75)
where a1 , … , am ∈ ℝn are problem data, and where the regularizer h(x) is a separable function
(e.g. ||x||1 or ||x||2^2). Notice that (6.75) is of the form (5.46) with fi(x) = (1/m)( g(ai^T x) + 𝜆 h(x) ). The dual
problem may be expressed as
maximize −(1/m) ∑_{i=1}^{m} g^*(−m zi) − 𝜆 h^*(𝜆^{-1} A^T z), (6.76)
Figure 6.9 Coordinate descent with exact line search applied to convex quadratic functions of the form
f (x) = xT Qx with Q = U diag(20, 1)UT , where U is an orthogonal matrix. Each plot includes 20 iterations in
addition to the initial guess x0 = (−1, −1). The minimum is reached in two iterations, i.e. one cycle through
all coordinate directions, when Q is diagonal (upper left).
with variable z ∈ ℝm , and where A ∈ ℝm×n is the matrix with rows aT1 , … , aTm . The sum is a separa-
ble function, whereas the last term involves all the dual variables. To apply coordinate ascent to the
dual problem, we restrict the dual objective to one of its coordinate directions. Letting z = z̄ + tei ,
the dual function reduces to
−(1/m) g^*(−m(z̄i + t)) − 𝜆 h^*(𝜆^{-1}(A^T z̄ + t ai)) + const.,
The methods that we have discussed so far are not suitable for problems that involve anything but simple inequality constraints. For example, the gradient projection method is inefficient unless projections onto the feasible set are cheap, and Newton's method cannot be applied directly to problems with inequality constraints. We will now look at a conceptually simple technique for handling inequality constraints in a general convex optimization problem of the form (4.52), i.e.
minimize f0 (x)
subject to fi (x) ≤ 0, i ∈ ℕm (6.77)
Ax = b
where we assume that f0 ∶ ℝn → ℝ and fi ∶ ℝn → ℝ, i ∈ ℕm , are twice continuously differentiable
convex functions and where A ∈ ℝp×n and b ∈ ℝp . We will assume that Slater’s condition holds
and that rank(A) = p. We remind the reader that the Lagrangian L ∶ ℝn × ℝm × ℝp → ℝ is
L(x, 𝜆, 𝜇) = f0(x) + ∑_{i=1}^{m} 𝜆i fi(x) + 𝜇^T (Ax − b),
minimize f0(x) + (1/t) ∑_{i=1}^{m} 𝜙(−fi(x))
subject to Ax = b (6.78)
as an approximation to (6.77), and where t > 0 controls the accuracy of the approximation. It is
natural to expect that a solution to (6.78) approaches a solution to (6.77) as t → ∞. We will soon
formalize this intuition. The problem (6.78) is an equality constrained convex optimization prob-
lem, which follows by noting that 𝜙(−fi (x)) is a convex function. Moreover, the domain of the barrier
function is ℝ++ , and this implies that the domain of the barrier problem (6.78) is the relative inte-
rior of the feasible set of the original problem (6.77). This gives rise to the name interior-point (IP)
method for a method that solves the barrier problem (6.78).
We will now compare the optimality conditions associated with the original problem (6.77) and
the barrier problem (6.78). The KKT conditions for (6.77) may be expressed as
∇f0(x) + ∑_{i=1}^{m} 𝜆i ∇fi(x) + A^T 𝜇 = 0 (6.79a)
fi (x) ≤ 0, i ∈ ℕm (6.79b)
Ax = b (6.79c)
λ ⪰ 0 (6.79d)
λi fi (x) = 0, i ∈ ℕm , (6.79e)
and the KKT conditions for the barrier problem are
∇f0(x) + ∑_{i=1}^{m} (1/(−t fi(x))) ∇fi(x) + A^T 𝜇 = 0 (6.80a)
Ax = b, (6.80b)
with the additional implicit constraints that fi (x) < 0, i ∈ ℕm , which are imposed by the domain
of the barrier function. Now, suppose that x⋆ (t) and 𝜇 ⋆ (t) satisfy the optimality conditions for the
barrier problem and let λ⋆ (t) be the vector with elements λ⋆i (t) = −1∕(tfi (x⋆ (t))), i ∈ ℕm . It then
follows that x⋆ (t), λ⋆ (t), and 𝜇 ⋆ (t) satisfy (6.79a)–(6.79d) but not the complementarity condition
(6.79e) since λ⋆i (t)fi (x⋆ (t)) = −1∕t. Moreover, x⋆ (t) minimizes the Lagrangian L(x, λ⋆ (t), 𝜇 ⋆ (t)), and
hence, λ⋆ (t) and 𝜇 ⋆ (t) are dual feasible and
p⋆ ≥ g(𝜆⋆(t), 𝜇⋆(t)) = L(x⋆(t), 𝜆⋆(t), 𝜇⋆(t))
= f0(x⋆(t)) + ∑_{i=1}^{m} ( −1/(t fi(x⋆(t))) ) fi(x⋆(t)) + (𝜇⋆(t))^T (Ax⋆(t) − b)
= f0(x⋆(t)) − m/t.
This shows that the duality gap at (x⋆ (t), λ⋆ (t), 𝜇 ⋆ (t)) is m∕t. Thus, the solution to the barrier prob-
lem x⋆ (t) defines a trajectory of strictly feasible (m∕t)-suboptimal points, which is known as the
central path.
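The basic path-following (barrier) method alternates between solving the barrier problem for a fixed t and increasing t. A conceptual sketch is given below, assuming a strictly feasible starting point x0 and a routine newton_eq(x, t) that solves the equality constrained barrier problem (6.78) for the current t starting from x; the details of Algorithm 6.10 in the text may differ.

t = 0.5;
gam = 20;                        % factor by which t is increased
epsilon = 1e-3;
x = x0;                          % strictly feasible starting point
while m / t > epsilon            % the duality gap of x*(t) is bounded by m/t
    x = newton_eq(x, t);         % centering step: solve (6.78) for the current t
    t = gam * t;                 % increase the barrier parameter
end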
Example 6.8 To illustrate the basic principle behind the path-following method, we will now
apply it to find an 𝜖-suboptimal solution to the convex optimization problem
minimize x1 + x2/5
subject to exp(x1) + x2 − 3 ≤ 0
x1^2 − x2 ≤ 0
with variable x ∈ ℝ^2. The corresponding barrier problem can be expressed as
minimize x1 + x2/5 + (1/t)( −ln(3 − x2 − exp(x1)) − ln(x2 − x1^2) ),
which is an unconstrained convex optimization problem. Figure 6.11 shows the iterates generated
using Algorithm 6.10 for two different values of the parameter 𝛾. In both cases, we used the point
x0 = (0, 0.5) as a strictly feasible starting point and the parameters t0 = 0.5 and 𝜖 = 10−3 .
Algorithm 6.10 requires a strictly feasible initial point x0 , and it is not always easy to find such a
point. One approach is to solve a so-called phase I problem, which can be expressed as
minimize s
subject to fi (x) ≤ s, i ∈ ℕm (6.81)
Ax = b,
with variables x ∈ ℝn and s ∈ ℝ. If the optimal value is attained at (x⋆ , s⋆ ), then x⋆ is clearly a
strictly feasible point for the original problem (6.77) if s⋆ < 0. On the other hand, if s⋆ > 0, the
original problem is infeasible. Note that it is straightforward to find a strictly feasible point for the
phase I problem, and hence, it can be solved using Algorithm 6.10.
Figure 6.11 The path-following method converges to an 𝜖-suboptimal point x by solving a sequence of
barrier problems. The two plots show the iterates obtained with (a) 𝛾 = 2 and (b) 𝛾 = 20, respectively. With
𝛾 = 2, 13 barrier problems were solved using a total of 48 Newton iterations, whereas with 𝛾 = 20, 4 barrier
problems were solved using a total of 21 Newton iterations.
with variable x ∈ ℝn and problem data A ∈ ℝp×n , b ∈ ℝp , and c ∈ ℝn , and where K ⊂ ℝn is a proper
convex cone. The Lagrangian L ∶ ℝn × ℝn × ℝp → ℝ ̄ is defined as
where 𝜙i : ℝ^{ni} → ℝ̄ with dom 𝜙i = int Ki is a barrier for Ki, i ∈ ℕm. Moreover, it is easy to verify that 𝜙 is logarithmically homogeneous with constant 𝜃 = ∑_{i=1}^{m} 𝜃i if, for all i ∈ ℕm, 𝜙i is logarithmically homogeneous with constant 𝜃i. Table 6.2 lists logarithmically homogeneous barrier functions
for some elementary proper convex cones.
a1 , … , ap .
is positive definite if and only if A1 , … , Ap are linearly independent, in which case the Cholesky
factorization H = LLT can be used to solve HΔ𝜇 = −g.
We end this section by mentioning that there are much more advanced IP methods than the basic path-following scheme that we have presented in this section. These methods generally maintain
both primal and dual variables, and rather than solving a sequence of barrier subproblems to high
accuracy, they stay inside some neighborhood of a primal–dual central path and update the param-
eter t adaptively; see, e.g. [81, 115]. We also note that there are IP methods for local optimization
of more general nonlinear optimization problems; see, e.g. [111].
We will now consider methods for equality constrained optimization problems of the form
minimize f0 (x)
(6.86)
subject to h(x) = 0,
with variable x ∈ ℝ^n, and where f0 : ℝ^n → ℝ̄ and h : ℝ^n → ℝ^p. We define the Lagrangian L : ℝ^n × ℝ^p → ℝ̄ as L(x, 𝜇) = f0(x) + 𝜇^T h(x) and the dual function g : ℝ^p → ℝ̄ as g(𝜇) = inf_x L(x, 𝜇).
A conceptually simple approach to finding an approximate local minimizer is to consider an
unconstrained problem as a proxy for the problem in (6.86). For example, we may consider a
so-called penalty problem
minimize f0(x) + (𝜌/2)||h(x)||2^2,
where 𝜌 > 0 is a penalty parameter. Roughly speaking, we can expect that the constraint viola-
tion will be small when 𝜌 is large, and this observation is the motivation behind penalty methods
that solve a sequence of penalty problems with increasing values of 𝜌. Unfortunately, the penalty
problem typically becomes very ill-conditioned when 𝜌 is large, which makes it difficult to solve it
reliably and accurately.
An alternative to the penalty approach is to consider the optimization problem
minimize f0(x) + (𝜌/2)||h(x)||2^2
subject to h(x) = 0, (6.87)
with penalty parameter 𝜌 > 0. This problem is equivalent to (6.86), which follows immediately by
noting that ||h(x)||22 = 0 whenever x is feasible. The Lagrangian for the problem in (6.87), which is
the so-called augmented Lagrangian for the problem in (6.86), is the function L𝜌 ∶ ℝn × ℝp → ℝ ̄
defined as
L𝜌(x, 𝜇) = f0(x) + 𝜇^T h(x) + (𝜌/2)||h(x)||2^2
= f0(x) + (𝜌/2)||h(x) + 𝜌^{-1}𝜇||2^2 − (1/(2𝜌))||𝜇||2^2.
The corresponding dual function g𝜌 : ℝ^p → ℝ̄ is given by
g𝜌(𝜇) = inf_x L𝜌(x, 𝜇) = inf_x { f0(x) + (𝜌/2)||h(x) + 𝜌^{-1}𝜇||2^2 } − (1/(2𝜌))||𝜇||2^2.
Notice the similarity with the penalty problem: the penalty term in the definition of g𝜌 includes a
shift that is determined by the dual variable 𝜇.
Example 6.10 Constrained nonlinear LS problems are an example of a class of problems that can be solved efficiently to local optimality with the augmented Lagrangian method, i.e. problems of the form
minimize (1/2)||f(x)||2^2
subject to h(x) = 0
with variable x ∈ ℝ^n, and where f : ℝ^n → ℝ^m and h : ℝ^n → ℝ^p are nonlinear functions. The augmented Lagrangian L𝜌 : ℝ^n × ℝ^p → ℝ can be expressed as
L𝜌(x, 𝜇) = (1/2)||f(x)||2^2 + 𝜇^T h(x) + (𝜌/2)||h(x)||2^2
= (1/2)||f(x)||2^2 + (𝜌/2)||h(x) + 𝜇/𝜌||2^2 − (1/(2𝜌))||𝜇||2^2
= (1/2)|| [ f(x); √𝜌 h(x) + 𝜇/√𝜌 ] ||2^2 − (1/(2𝜌))||𝜇||2^2.
Thus, for fixed value of 𝜇, the problem of minimizing L𝜌 with respect to x is an unconstrained
nonlinear LS problem, which can be minimized locally using, e.g. the LM algorithm.
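A sketch of the resulting method of multipliers for this example is given below; minLrho(x, mu, rho) is assumed to (locally) minimize L𝜌(·, 𝜇) over x, e.g. with the LM algorithm applied to the stacked nonlinear LS problem above, hfun(x) returns h(x), and the multiplier update 𝜇 ← 𝜇 + 𝜌h(x) is the standard dual step.

mu = zeros(p, 1);
rho = 10;
for k = 1:50
    x = minLrho(x, mu, rho);               % inner minimization of the augmented Lagrangian
    mu = mu + rho * hfun(x);               % multiplier (dual) update
    if norm(hfun(x)) < 1e-8, break; end    % stop when the constraint violation is small
end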
for k ∈ ℤ+ and where y0 ∈ ℝm and 𝜇0 ∈ ℝp are initial values. If L0 has a saddle-point, then the
assumption that f and g are proper, closed, and convex ensures that the updates (6.89a) and (6.89b)
are well-defined, i.e. a minimizer exists, but it is not necessarily unique. If we let uk = 𝜇k ∕𝜌, then the
ADMM updates can be expressed in a more convenient form as shown in Algorithm 6.12, which is
a basic implementation of ADMM. It can be advantageous to adaptively update the penalty param-
eter 𝜌 and/or make use of preconditioning techniques, which is often done in more sophisticated
variants.
The ADMM updates are often cheaper to compute compared to the cost of the joint minimization
over x and y in the method of multipliers. In the special case where A = I, the update of x can be
expressed as
xk+1 = prox𝜌−1 f (c − uk − Byk )
where prox𝜌−1 f is the proximal operator associated with 𝜌−1 f . Similarly, if B = I, then
yk+1 = prox𝜌−1 g (c − uk − Axk+1 ).
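A generic sketch of the scaled ADMM iteration for the constraint Ax + By = c is shown below; argminx and argminy are assumed to carry out the two partial minimizations, which reduce to the proximal operators above in the special cases A = I or B = I.

u = zeros(size(c));
y = y0;
for k = 1:maxit
    x = argminx(y, u);            % minimize f(x) + (rho/2)*||A*x + B*y - c + u||^2 over x
    y = argminy(x, u);            % minimize g(y) + (rho/2)*||A*x + B*y - c + u||^2 over y
    u = u + A * x + B * y - c;    % scaled dual update
end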
minimize ∑_{i=1}^{N} fi(xi) + g(y)
subject to xi − y = 0, i ∈ ℕN, (6.92)
with variables y ∈ ℝ^n and x = (x1, … , xN) with xi ∈ ℝ^n, i ∈ ℕN. Applying the ADMM to this prob-
lem, it is straightforward to see that the ADMM update of x becomes separable. In other words,
x1 , … , xN can be updated in parallel; see Exercise 6.14.
Exercises
6.1 Consider the gradient descent iteration (6.14), i.e.
xk+1 = xk − (1/L)∇f(xk), k ∈ ℤ+,
and assume that f is bounded below and L-smooth. Show that limk→∞ ||∇f (xk )||22 = 0.
6.2 Recall that a function f ∶ ℝn → ℝ is L-smooth if there exists a finite constant L > 0 such
that
||∇f (y) − ∇f (x)||2 ≤ L||y − x||2 , ∀x, y ∈ ℝn .
6.4 Many learning problems involve objective functions of the form f (x) = g(x) + h(x), where g
is convex and Lg-smooth and where h is a 𝜇-strongly convex and Lh-smooth regularization
function. Show that this implies that f is (Lg + Lh )-smooth and 𝜇-strongly convex.
6.5 Let g ∶ ℝn → ℝ and let f (x) = λg(x∕λ) for some λ > 0. Show that
proxf (x) = λproxλ−1 g (x∕λ).
6.6 (a) Show that if f : ℝ^n → ℝ is of the form f(x) = ∑_{i=1}^{n} fi(xi), where fi : ℝ → ℝ, then
prox_f(x) = ( prox_{f1}(x1), … , prox_{fn}(xn) ).
(b) Show that the proximal operator associated with f (x) = 𝛾||x||1 with 𝛾 > 0 may be
expressed as
prox_f(x) = ( S𝛾(x1), … , S𝛾(xn) ),
where S𝛾 : ℝ → ℝ is defined as
S𝛾(t) = 0 if |t| ≤ 𝛾, S𝛾(t) = t − 𝛾 if t > 𝛾, and S𝛾(t) = t + 𝛾 if t < −𝛾,
or, equivalently,
S𝛾 (t) = sgn(t) max (0, |t| − 𝛾).
i.e. the nuclear norm is the dual norm of the operator norm || ⋅ ||2 on ℝm×n . Let X = UΣV T
be an singular value decomposition (SVD) of X ∈ ℝm×n , where U ∈ ℝm×m and V ∈ ℝn×n
are orthogonal matrices, and Σ ∈ ℝm×n has the singular values of X on its main diagonal in
descending order and zeros elsewhere. Now, letting r = rank(X), a thin SVD of X may be
expressed as
X = U1 SV1T ,
where U1 ∈ ℝm×r and V1 ∈ ℝn×r are the first r columns of U and V, respectively, and
S = diag(𝜎1 , … , 𝜎r ) is a submatrix of Σ.
(a) Show that the nuclear norm of X is the sum of its singular values, i.e.
||X||_* = ∑_{i=1}^{r} 𝜎i(X).
f(x) = (𝛾/2)||x||2^2 + (1/m) ∑_{i=1}^{m} ln(1 + exp(−yi ai^T x)), (6.96)
with variable x ∈ ℝn , problem data y ∈ {−1, 1}m and A ∈ ℝm×n where aTi denotes the ith
row of A, and where 𝛾 > 0 is a given parameter.
(a) Show that f is 𝛾-strongly convex and L-smooth with L = (1/(4m))||A||2^2 + 𝛾.
(b) Implement Newton’s method for minimizing (6.96), and use the condition
(1∕2)𝜆(x)2 ≤ 10−6 as a stopping criterion. Test your implementation using the problem
data contained in the file classification_small.mat and with 𝛾 = 10−6 .
(c) Implement a quasi-Newton method with BFGS updates of the inverse Hessian approxi-
mation. Test your implementation and compare with your implementation of Newton’s
method.
(d) Implement the gradient method for minimizing (6.96) with the option to use a constant
step size or a backtracking line search.
(e) Implement a stochastic gradient method for minimizing (6.96). The expression
gk = 𝛾xk − (1/|ℐk|) ∑_{i∈ℐk} ( yi exp(−yi ai^T xk) / (1 + exp(−yi ai^T xk)) ) ai,
i.e.
tk = t/(k + 1 + 𝜁)^𝛿, k ∈ ℤ+.
Test your implementation with p/m ≈ 0.05 and compare with your implementation of
Newton’s method and the gradient method. Plot the objective value versus the number
of epochs for realizations with different values of the parameters 𝛿 ∈ (0, 1] and t > 0.
6.10 In this exercise, we will derive a method for solving the QP in (5.6) based on ADMM. We
will here state the QP slightly differently as
minimize (1/2)x^T Qx + r^T x
subject to Ax ∈ 𝒞
with variable x ∈ ℝ^n. The problem data are Q ∈ 𝕊^n_+, r ∈ ℝ^n, and A ∈ ℝ^{m×n}. We will assume that 𝒞 ⊆ ℝ^m is a nonempty, closed, and convex set of the form
𝒞 = { z ∈ ℝ^m | l ⪯ z ⪯ u }
with l, u ∈ ℝm . Notice that the constraints Bx = c and Cx ⪯ d in (5.6) can be cast in the
above format by defining A, l, and u appropriately, e.g. an equality constraint is obtained by
defining li = ui .
(a) Consider the equivalent optimization problem
minimize (1/2)x̄^T Qx̄ + r^T x̄ + I_𝒜(x̄, ȳ) + I_𝒞(y)
subject to x̄ − x = 0
ȳ − y = 0
with variables x, x̄ ∈ ℝ^n and y, ȳ ∈ ℝ^m, and where 𝒜 = { (x, y) ∈ ℝ^n × ℝ^m | Ax = y }.
Let (λ, 𝜇) be Lagrange multipliers associated with the equality constraints, and let
L𝜌 (x, y, x̄ , ȳ , λ, 𝜇) be the augmented Lagrangian. Show that the three ADMM updates
are equivalent to the following problems:
1. Minimize L𝜌(x, y, x̄, ȳ, 𝜆, 𝜇) with respect to x̄ and ȳ:
minimize (1/2)x̄^T Qx̄ + r^T x̄ + 𝜆k^T x̄ + 𝜇k^T ȳ + (𝜌/2)||x̄ − xk||2^2 + (𝜌/2)||ȳ − yk||2^2
subject to Ax̄ = ȳ.
2. Minimize L𝜌(x, y, x̄, ȳ, 𝜆, 𝜇) with respect to x and y:
minimize (𝜌/2)||x̄k+1 − x||2^2 + (𝜌/2)||ȳk+1 − y||2^2 − 𝜆k^T x − 𝜇k^T y
subject to y ∈ 𝒞.
3. Update Lagrange multipliers:
λk+1 = λk + 𝜌(̄xk+1 − xk+1 )
𝜇k+1 = 𝜇k + 𝜌(̄yk+1 − yk+1 ).
(b) Write down the optimality conditions for the first update with 𝜈 as Lagrange multiplier
for the equality constraint, and show that the optimality conditions are equivalent to
[ Q + 𝜌I, A^T; A, −𝜌^{-1}I ] [ x̄k+1; 𝜈 ] = [ −r + 𝜌xk − 𝜆k; yk − 𝜌^{-1}𝜇k ]
together with
ȳk+1 = 𝜌^{-1}(𝜈 − 𝜇k) + yk,
and that yk+1 is the projection of ȳk+1 + 𝜌^{-1}𝜇k onto the set 𝒞. Also, show that it follows that 𝜆k = 0 for all k, and hence, the update for 𝜆k can be omitted.
(d) Define
rk^p = Axk − yk,
rk^d = Qxk + r + A^T 𝜈,
which are primal and dual residuals for the optimization problem. Let
where 𝜖 a > 0 and 𝜖 r > 0 are some absolute and relative tolerances, respectively. The
termination criteria for the algorithm are
||rk^p||∞ ≤ 𝜖^p, ||rk^d||∞ ≤ 𝜖^d.
Write a MATLAB code that implements the ADMM algorithm you have derived above.
Make sure to use sparse linear algebra routines in MATLAB.
(e) Consider the following so-called “support vector machine” problem
minimize (1/2)x^T x + 𝟙^T t
subject to diag(b)Ax ≥ 𝟙 − t
t ≥ 0
with variables x ∈ ℝn and t ∈ ℝm and where b ∈ {−1, 1}m is a label vector and the
rows of A ∈ ℝm×n are m feature vectors of length n. We will return to this problem in
Section 10.5. Generate problem instances for given values of m and n as
bi = 1 if i ≤ m/2, and bi = −1 if i > m/2,
and let the elements Aij of the matrix A come from a normal probability distribution
with standard deviation 1∕n and mean given by
1/n if i ≤ m/2, and −1/n if i > m/2,
with 15% nonzeros in each row. This means that 85% of the entries of a row of the matrix
A should be zero. Which entries are zero should be chosen at random for each row. Notice that the matrix A of this subproblem is not the same as the matrix A we defined before. The same goes for the variable x. Solve several instances of the problem for different values of m and n. For how large values
of m and n does your code perform well? Make plots of the solution time versus the
total number of nonzero elements in the matrices A and Q. You may use 𝜖 a = 10−3 and
𝜖 r = 10−3 .
6.11 Let 𝒜 : ℝ^n → ℝ^{p×q} be a linear function, and let H ∈ 𝕊^n_+, A0 ∈ ℝ^{p×q}, and a ∈ ℝ^n. Consider the optimization problem
minimize ||𝒜(x) + A0||_* + (1/2)(x − a)^T H(x − a),
which is equivalent to
minimize ||X||_* + (1/2)(x − a)^T H(x − a)
subject to 𝒜(x) + A0 = X
with variables x ∈ ℝn and X ∈ ℝp×q .
Show that an ADMM algorithm for the above optimization problem can be expressed as the
following updates:
1. Compute xk+1 by solving
(H + 𝜌M)xk+1 = adj_𝒜( 𝜌Xk + A0 − Zk ) + Ha,
where the matrix M ∈ 𝕊^n is defined as the matrix that satisfies adj_𝒜(𝒜(z)) = Mz for all z ∈ ℝ^n, and where adj_𝒜 : ℝ^{p×q} → ℝ^n is the adjoint of 𝒜.
2. Compute
Xk+1 = argmin_X ( ||X||_* + (𝜌/2)|| (1/𝜌)Zk + 𝒜(xk) + A0 − X ||_F^2 )
= ∑_{i=1}^{min(p,q)} max(0, 𝜎i − 1/𝜌) ui vi^T,
where ui, vi, and 𝜎i are given by an SVD
(1/𝜌)Zk + 𝒜(xk) + A0 = ∑_{i=1}^{min(p,q)} 𝜎i ui vi^T.
3. Compute
Zk+1 = Zk + 𝜌( 𝒜(xk) + A0 − Xk ).
Hint: When defining the augmented Lagrangian, the appropriate inner product ⟨·, ·⟩ : ℝ^{p×q} × ℝ^{p×q} → ℝ is ⟨X, Y⟩ = tr(X^T Y), and the appropriate norm is the Frobenius norm ||X||_F = √⟨X, X⟩.
6.12 Show that if 𝜓 ∶ int K → ℝ is a generalized logarithm for a cone K ⊆ ℝn , then it holds that
∇𝜓(x)T x = 𝜃, x ∈ int K,
where 𝜃 is the degree of 𝜓.
6.13 Show that 𝜓 : int ℚ^n → ℝ defined as 𝜓(x) = ln( xn^2 − x1^2 − · · · − x_{n−1}^2 ) is a generalized logarithm for ℚ^n with degree 𝜃 = 2.
where g : ℝ^n → ℝ̄ and fi : ℝ^n → ℝ̄, i ∈ ℕN, are proper, closed, and convex functions, and the variables are y ∈ ℝ^n and x = (x1, … , xN) with xi ∈ ℝ^n, i ∈ ℕN. Show that the scaled form of the ADMM updates for this problem can be expressed as
xi^{(k+1)} = prox_{𝜌^{-1} fi}( y^{(k)} − ui^{(k)} ), i ∈ ℕN
x̃^{(k+1)} = (1/N) ∑_{i=1}^{N} xi^{(k+1)}
ũ^{(k)} = (1/N) ∑_{i=1}^{N} ui^{(k)}
y^{(k+1)} = prox_{𝜌^{-1} g}( x̃^{(k+1)} + ũ^{(k)} )
ui^{(k+1)} = ui^{(k)} + xi^{(k+1)} − y^{(k+1)}, i ∈ ℕN,
where we have used a superscript for the iteration index to avoid ambiguity.
Part III
Optimal Control
Calculus of Variations
So far we have discussed optimization when the variables belong to finite-dimensional vector
spaces. However, it is often of interest to also optimize over infinite-dimensional vector spaces.
The simplest case is when the variable is a real-valued function of a real variable. This
has applications in optimal control in continuous time, where the optimal control signal is
a real-valued function of time. Another important application is the derivation of probability
density functions from the principle of maximizing the entropy subject to moment constraints.
In this case, the variable is the probability density function. For probability density functions, the
argument is often a vector. How to solve these types of optimization problems is called calculus
of variations. The origin of this theory goes back to Newton’s minimal resistance problem. Major
contributions were made by Euler and Lagrange. The generalizations to optimal control were
made by Pontryagin. We will present the theory in this general form, but we will not be able to
prove all the results. The interested reader is referred to the vast literature on optimal control for
most of the proofs, especially for the most general results. However, we will provide the proofs for
some special cases to build intuition.
Example 7.1 The functional in (7.1) is differentiable if f is a differentiable function in its last two
arguments. This follows from a Taylor series expansion:
ΔJ[𝛿y] = ∫_a^b ( f(x, y(x) + 𝛿y(x), y′(x) + 𝛿y′(x)) − f(x, y(x), y′(x)) ) dx
= ∫_a^b ( (𝜕f(x, y(x), y′(x))/𝜕y) 𝛿y(x) + (𝜕f(x, y(x), y′(x))/𝜕y′) 𝛿y′(x) + h(y(x), y′(x)) ||(𝛿y(x), 𝛿y′(x))||2 ) dx,
where h ∶ ℝ × ℝ → ℝ is a function that goes to zero as (𝛿y(x), 𝛿y′ (x)) → 0. The latter is implied by
||𝛿y|| → 0. Hence, we may define 𝜖 as
𝜖 = ( ∫_a^b h(y(x), y′(x)) ||(𝛿y(x), 𝛿y′(x))||2 dx ) / ||𝛿y||
≤ ( ∫_a^b h(y(x), y′(x)) ||𝛿y|| dx ) / ||𝛿y|| = ∫_a^b h(y(x), y′(x)) dx,
which converges to zero as h goes to zero. Hence, this functional is differentiable with first variation
𝛿J[𝛿y] = ∫_a^b ( (𝜕f(x, y(x), y′(x))/𝜕y) 𝛿y(x) + (𝜕f(x, y(x), y′(x))/𝜕y′) 𝛿y′(x) ) dx. (7.2)
It is left as an exercise to show that this functional is linear, see Exercise 7.2. Here, we under-
stand why the norm we use in the definition of (a, b) has to include the derivative. However,
in case we can assure that bounded norm of y implies bounded norm of y′ , we may use another
definition.
1 The reason to use the word "weak" is that there is also a definition of strong extremum that is based on another norm for the linear function space, which is defined as ||y|| = max_{x∈𝒳} ||y(x)||. Clearly, strong extremum implies weak
extremum. Necessary conditions for weak extremum are, hence, also necessary conditions for strong extremum.
2 In case there are constraints on y for x = a or x = b, 𝛿y should be constrained to be zero at those values of x. This is
often stated as 𝛿y(x) is admissible. We will tacitly assume that we only consider such admissible 𝛿y.
where 𝜖 → 0 as ||𝛿y|| → 0. Hence for sufficiently small ||𝛿y||, the sign of ΔJ[𝛿y] will be the same as
the sign of 𝛿J[𝛿y]. Now assume that 𝛿J[𝛿y0 ] ≠ 0 for some 𝛿y0 . Then for any 𝛼 > 0, we have
𝛿J[−𝛼𝛿y0 ] = −𝛿J[𝛼𝛿y0 ],
since 𝛿J is linear. Hence, the increment can be made to have either sign for arbitrarily small 𝛿y, contradicting that J has an extremum.
Example 7.2 The functional in (7.1) is twice differentiable if f is a twice differentiable function
in its last two arguments. This follows from a Taylor series expansion similar to what was done in
the previous example. The second variation is given by
𝛿^2 J[𝛿y] = (1/2) ∫_a^b [𝛿y(x), 𝛿y′(x)] [ 𝜕^2 f/𝜕y^2, 𝜕^2 f/𝜕y𝜕y′; 𝜕^2 f/𝜕y𝜕y′, 𝜕^2 f/𝜕y′^2 ] [𝛿y(x); 𝛿y′(x)] dx,
where the second-order partial derivatives of f are evaluated at (x, y(x), y′(x)).
It can now, with similar techniques as used above, be proven that a necessary condition for y⋆ to
be a minimum for J is that 𝛿 2 J[𝛿y] ≥ 0 for all 𝛿y; see, e.g. [41], This is not a sufficient condition. We
say that the second variation is strongly positive if there exists a constant k > 0 such that 𝛿 2 J[𝛿y] ≥
k||𝛿y||2 for all y and 𝛿y. A sufficient condition for y⋆ to be optimal for J is that its first variation
vanishes and that its second variation is strongly positive. This is also straightforward to prove; see,
e.g. [41].
It then follows that ȳ is optimal for (7.3). To see this, we realize that ȳ by the first condition above minimizes L[y, 𝜇̄], since L is twice differentiable with a strongly positive second variation. Now, assume that ȳ is not optimal for the above optimization problem, but that ỹ is optimal. Then J[ỹ] < J[ȳ] and K[ỹ] = 0 implies that
L[ỹ, 𝜇̄] = J[ỹ] < J[ȳ],
which contradicts that ȳ minimizes L[y, 𝜇̄].
We will now show that strong duality holds. Let g : ℝ^p → ℝ be the Lagrange dual function defined via
g(𝜇) = min_{y∈𝒟(a,b)} L[y, 𝜇].
We have
g(𝜇) = min_{y∈𝒟(a,b)} L[y, 𝜇] ≤ min_{y∈𝒟(a,b), K[y]=0} L[y, 𝜇] = min_{y∈𝒟(a,b), K[y]=0} J[y] = p⋆,
and hence strong duality holds. The extension to inequality constraints is straightforward and sim-
ilar to what was done in Chapter 4. Necessary conditions for optimality of (7.3) are a bit tricky.
For finite-dimensional optimization problems, we saw in Chapter 4 that constraint qualifications,
such as Slater’s conditions, are needed in order to guarantee strong duality, on which the proof of
necessity was based.
for some constant k. We have f ′ (y) = log y + 1 and f ′′ (y) = 1∕y and hence, from (7.2) and from
Example 7.2, the variations of J are
𝛿J[𝛿y] = ∫_a^b ( log y(x) + 1 ) 𝛿y(x) dx,
𝛿^2 J[𝛿y] = (1/2) ∫_a^b (1/y(x)) 𝛿y(x)^2 dx.
The first variation of K is 𝛿K[𝛿y] = ∫_a^b x 𝛿y(x) dx, and the second variation of K is zero. We define the Lagrangian as L[y, 𝜇] = J[y] + 𝜇K[y], and we see that its first variation is
𝛿L[𝛿y] = ∫_a^b ( log y(x) + 1 + 𝜇x ) 𝛿y(x) dx.
By the du-Bois–Reymond lemma, it holds that if the first variation is zero, then
log y(x) + 1 + 𝜇x = 0,
and hence, we have that y(x) = exp(−1 − 𝜇x). The constraint K[y] = 0 can be used to determine
𝜇 in terms of a, b, and k. We should try to verify that the second variation of L is strictly positive
in order to show that y constitutes a minimum. This is however not so easy, or even possible, and
we will in Section 9.2 prove optimality in another way for similar problems. Understand that strict
positivity is just a sufficient condition that might be too strong in some cases.
7.1.5 Generalizations
Most of what has been discussed in this section generalizes to domains 𝒳 ⊆ ℝ^n. This will be important when we revisit the example above and extensions of it in Section 9.2. We will also consider y(x) to be vector valued, i.e. y : 𝒳 → ℝ^n. To this end, we just define the norm as
||y|| = sup_{x∈𝒳} ||y(x)||2 + sup_{x∈𝒳} ||y′(x)||2,
where || · ||2 as usual is the Euclidean vector norm. From now on, 𝒟^n(a, b) is the normed linear space of differentiable functions y : 𝒳 → ℝ^n with the above norm, where 𝒳 = [a, b]. We will be less formal in our derivation of results in the remaining part of this chapter.
We assume that the optimal u, which we denote by u⋆ , is continuous.3 We also assume that the
corresponding solution x⋆ to the differential equation is unique, and that in case u⋆ is perturbed
with a small amount, the corresponding perturbation of x⋆ is also small. We refrain from giving
detailed conditions when this is satisfied, but just mention that it is related to what is called
Lipschitz continuity of F.
We define the Lagrangian functional L : 𝒟^m(0, T) → ℝ as
L[u] = 𝜙(x(T)) + ∫_0^T ( f(t, x(t), u(t)) + 𝜆(t)^T ( F(t, x(t), u(t)) − ẋ(t) ) ) dt
= 𝜙(x(T)) + ∫_0^T ( H(t, x(t), u(t), 𝜆(t)) − 𝜆(t)^T ẋ(t) ) dt,
where H : ℝ × ℝ^n × ℝ^m × ℝ^n → ℝ is the Hamiltonian defined as
H(t, x, u, 𝜆) = f(t, x, u) + 𝜆^T F(t, x, u).
We could have defined the Lagrangian functional to also depend explicitly on x and λ, but we will
not need this in our derivations, and hence, we refrain from doing so. Similarly, as with the multi-
plier rule of Lagrange in (4.57a) and (4.57b) we expect to get a necessary condition for optimality
by letting the first variation of L be zero. We make a perturbation u = u⋆ + 𝛿u of u⋆ , where 𝛿u is
small, i.e. ||𝛿u|| < 𝜖. See Section 7.1 for the definition of the norm. The corresponding perturbed
trajectory x, which is the solution of (7.4) for u differs from the original solution x⋆ with the quan-
tity 𝛿x = x − x⋆ , which by our assumptions is small, i.e. ||𝛿x|| can be made as small as we like by
taking 𝜖 sufficiently small. We have that the increment of L is given by
ΔL[𝛿u] = L[u⋆ + 𝛿u] − L[u⋆] = 𝜙(x⋆(T) + 𝛿x(T)) − 𝜙(x⋆(T))
+ ∫_0^T ( H(t, x⋆(t) + 𝛿x(t), u⋆(t) + 𝛿u(t), 𝜆(t)) − H(t, x⋆(t), u⋆(t), 𝜆(t)) ) dt
− ∫_0^T 𝜆(t)^T (d/dt)( x⋆(t) + 𝛿x(t) ) dt + ∫_0^T 𝜆(t)^T (d/dt) x⋆(t) dt.
We now make a Taylor series expansion and obtain the first variation
𝛿L = (𝜕𝜙/𝜕x^T) 𝛿x(T) + ∫_0^T ( (𝜕H/𝜕x^T) 𝛿x(t) + (𝜕H/𝜕u^T) 𝛿u(t) − 𝜆(t)^T (d𝛿x(t)/dt) ) dt.
Here, we have not written out the arguments of the partial derivatives. This is clearly a linear
functional. The assumption on small perturbations of 𝛿u resulting in small perturbations of 𝛿x
is necessary. Otherwise, the remainder term does not converge to zero as ||𝛿u|| → 0. By integration by parts, it follows that
𝛿L = (𝜕𝜙/𝜕x^T) 𝛿x(T) + ∫_0^T ( (𝜕H/𝜕x^T) 𝛿x(t) + (𝜕H/𝜕u^T) 𝛿u(t) + (d𝜆(t)^T/dt) 𝛿x(t) ) dt − [ 𝜆(t)^T 𝛿x(t) ]_0^T.
Since the initial value x(0) is given, it follows that 𝛿x(0) = 0, and hence, we obtain
𝛿L = ( 𝜕𝜙/𝜕x^T − 𝜆(T)^T ) 𝛿x(T) + ∫_0^T ( (𝜕H/𝜕x^T) 𝛿x(t) + (𝜕H/𝜕u^T) 𝛿u(t) + (d𝜆(t)^T/dt) 𝛿x(t) ) dt.
We have so far made no assumptions on λ. We assume that it satisfies the adjoint equations defined
as
𝜆̇(t) = −𝜕H(t, x⋆(t), u⋆(t), 𝜆(t))/𝜕x, 𝜆(T) = 𝜕𝜙(x⋆(T))/𝜕x.
This is a linear time-varying differential equation for λ, and hence, it has a solution [97, Chapter 3],
under mild conditions on H.4 For this λ, it follows that
𝛿L = ∫_0^T (𝜕H/𝜕u^T) 𝛿u(t) dt,
where all functions are evaluated for (x⋆ , u⋆ ). It is possible to show that the result holds also for
piecewise continuous u⋆ . In case H does not explicitly depend on t, which is called an autonomous
optimal control problem, we realize that H is a constant independent of t.
We have now proven the following necessary conditions of Pontryagin, also called the Pontryagin
maximum principle (PMP). Given optimal u⋆ and x⋆ for (7.5), there exists an adjoint variable λ such
that
𝜆̇(t) = −𝜕H(t, x⋆(t), u⋆(t), 𝜆(t))/𝜕x, 𝜆(T) = 𝜕𝜙(x⋆(T))/𝜕x, (7.6a)
𝜕H(t, x⋆(t), u⋆(t), 𝜆(t))/𝜕u^T = 0, (7.6b)
dH(t, x⋆(t), u⋆(t), 𝜆(t))/dt = 𝜕H(t, x⋆(t), u⋆(t), 𝜆(t))/𝜕t. (7.6c)
We remark that the necessary conditions do not distinguish between maximum or minimum. They
hold for any extremum, and this is the reason why the conditions are called a maximum principle.
They could just as well have been called a minimum principle.
We are also interested in the sufficient conditions for a locally optimal solution of (7.5). We
will not provide details of the derivation. The condition is based on the second variation of the
Lagrangian L. Let (ū, x̄) and 𝜆 satisfy
4 The system matrix and the input signal should be bounded for the existence of a solution. This holds if 𝜕f(t, x⋆(t), u⋆(t))/𝜕x and 𝜕F(t, x⋆(t), u⋆(t))/𝜕x are bounded functions of t on [0, T].
with variables x and u for given initial value x(0) = x0 . The Hamiltonian is given by
H = (1/2)( x^T Qx + u^T Ru ) + 𝜆^T (Ax + Bu).
We realize that the adjoint equations are
𝜆̇ = −𝜕H/𝜕x = −Qx − A^T 𝜆, 𝜆(T) = Q0 x(T).
From the PMP, we have that
𝜕H/𝜕u = Ru + B^T 𝜆 = 0.
If we assume that R ≻ 0, i.e. positive definite, we have that u = −R−1 BT λ. If we insert this into the
differential equations for x and λ, we obtain
[ẋ; λ̇] = [A, −BR⁻¹Bᵀ; −Q, −Aᵀ] [x; λ],   x(0) = x0,   λ(T) = Q0 x(T).
This is a two-point boundary value problem. Such problems are in general not easy to solve. The
reason is that we do not know the initial value for λ. However, for the problem above, we may define
Φ(t, s) = [Φ11(t, s), Φ12(t, s); Φ21(t, s), Φ22(t, s)] = exp( (t − s) [A, −BR⁻¹Bᵀ; −Q, −Aᵀ] ),
where the partitions are conformable. The matrix Φ is called the transition matrix, and it can be
shown that it is always invertible, and that Φ(t, t) = I for any t, see [97]. It then follows that
[x0; λ(0)] = [Φ11(0, T), Φ12(0, T); Φ21(0, T), Φ22(0, T)] [x(T); Q0 x(T)] = [Φ11(0, T) + Φ12(0, T)Q0; Φ21(0, T) + Φ22(0, T)Q0] x(T).
We may now use these equations to express λ(0) in terms of x0 , i.e. we obtain
λ(0) = ( Φ21(0, T) + Φ22(0, T)Q0 )( Φ11(0, T) + Φ12(0, T)Q0 )⁻¹ x0,
assuming the inverse exists. It actually does; see, e.g. [20]. We are now in a position to solve the
linear differential equation, which will then give us the optimal control signal. More generally, it holds that
λ(t) = P(t)x(t),
where
P(t) = ( Φ21(t, T) + Φ22(t, T)Q0 )( Φ11(t, T) + Φ12(t, T)Q0 )⁻¹.
Example 7.4 In mechanical systems, the state x contains positions and angles, and ẋ is called the
generalized velocity. We define the potential energy of the system as V(x) and the kinetic energy as
Example 7.5 We will now derive the shape of a chain hanging from two given points by min-
imizing the total potential energy of the chain. A differential piece of the chain of length ds has
mass dm = 𝜌 ds, where 𝜌 is the mass density of the chain. Here, ds = √(1 + ẋ²) dt, where x(t) is the
height of the differential segment at horizontal position t. The height x(t) of the segment multiplied
by its mass and the gravitational constant g is the potential energy of the segment, i.e. g𝜌x(t) ds. The total
potential energy is therefore given by
∫_0^T f(x, ẋ) dt,
where f(x, ẋ) = g𝜌x√(1 + ẋ²). This integral should be minimized subject to the constraints x(0) = x0
and x(T) = xT, which defines where the endpoints of the chain are located. We see that for this
example we do not have a constraint on λ = −𝜕f/𝜕ẋ at t = T. Instead, we use a constraint on x(T).
Beltrami’s identity gives
g𝜌x√(1 + ẋ²) − g𝜌x ẋ²/√(1 + ẋ²) = C.
7.4 Extensions
We will now discuss different extensions of the PMP. These are for the cases when there are con-
straints on the control signal, when the final time T is optimized, and for the cases when the
initial value of the state x(0) and/or the final value of the state x(T) are constrained to belong to
manifolds. We will not give the proof for these cases, but we will instead state the corresponding
version of the PMP and solve the examples. We will from now on only discuss the autonomous
case.
Consider the optimal control problem

minimize   𝜙(x(T)) + ∫_0^T f(x(t), u(t)) dt,
subject to ẋ(t) = F(x(t), u(t)),
           x(0) ∈ S0,  x(T) ∈ ST,                    (7.8)
           u(t) ∈ U ⊂ ℝᵐ,
           T ≥ 0,
with variables x, u, and T, where f ∶ ℝⁿ × ℝᵐ → ℝ, F ∶ ℝⁿ × ℝᵐ → ℝⁿ, and 𝜙 ∶ ℝⁿ → ℝ are con-
tinuously differentiable. The sets S0 and ST are subsets of ℝⁿ and manifolds. We assume that S0 can
be described as
S0 = { x ∈ ℝⁿ ∶ G0(x) = 0 },
where G0 ∶ ℝn → ℝp with p ≤ n is differentiable with a full rank Jacobian for x in a neighborhood
of the optimal solution point on S0 . We assume a similar description of ST with a function GT .
Notice that we in this formulation also optimize the final time T.
Define the Hamiltonian H̃ ∶ ℝⁿ × ℝᵐ × ℝⁿ⁺¹ → ℝ as
H̃(x, u, λ̃) = λ0 f(x, u) + λᵀF(x, u),
where λ̃ = (λ0 , λ). Assume that (x⋆ , u⋆ , T ⋆ ) are optimal for the optimal control problem above. Then
there exists a nonzero adjoint function λ̃ ∶ [0, T] → ℝn+1 such that
(i) λ̇(t) = − 𝜕H̃(x⋆(t), u⋆(t), λ̃(t))/𝜕x,   λ0 = c ≥ 0, where c ∈ ℝ is a constant,
(ii) H̃(x⋆(t), u⋆(t), λ̃(t)) = min_{𝑣∈U} H̃(x⋆(t), 𝑣, λ̃(t)) = 0 for all t ∈ [0, T⋆],
(iii) λ(0) ⟂ S0,
(iv) λ(T) − 𝜕𝜙(x⋆(T⋆))/𝜕x ⟂ ST.
Above we have used the notation λ(0) ⟂ S0 to mean that λ(0)ᵀ𝑣 = 0 for all 𝑣 such that (𝜕G0(x(0))/𝜕xᵀ)𝑣 = 0,
where G0 is the function defining the manifold S0. In case the final time T is not optimized, condition
(ii) is replaced by the requirement that the Hamiltonian is constant, but not necessarily zero, along
the optimal solution. For many problems, it turns out that λ0 > 0, and then, since H̃ is homogeneous
in λ̃, there is no loss of generality in taking λ0 = 1. We then obtain the same definition of the
Hamiltonian as we had before. It is typically pathological cases for which λ0 = 0, and hence, we will
often in the examples we investigate assume that λ0 = 1.
When using the above conditions to try to find a solution to the optimal control problem, the
first step is to define the Hamiltonian. Then one should try to minimize this with respect to u.
This should be done parametrically with respect to λ̃ and x. This means that we obtain a function
𝜇 ∶ ℝⁿ × ℝⁿ⁺¹ → ℝᵐ such that u⋆ = 𝜇(x, λ̃). This function is then substituted into the dynamical
equations for x and λ̃, i.e.
ẋ = F(x, 𝜇(x, λ̃)),   x(0) ∈ S0,  x(T) ∈ ST,
λ̇ = − 𝜕H̃(x, 𝜇(x, λ̃), λ̃)/𝜕x,   λ(0) ⟂ S0,  λ(T) − 𝜕𝜙(x(T))/𝜕x ⟂ ST.      (7.9)
This is a so-called two-point boundary value problem (TPBVP). We should also use (ii) when solving
the above equations. The equations are by no means easy to solve in general. Carrying out the
parametric optimization of the Hamiltonian can be very difficult, especially when u is constrained by the set U.
We should also remember that the PMP provides only necessary conditions for optimality. Hence, they
may not provide enough information to uniquely determine the optimal control. Moreover, they do
not guarantee optimality, but only stationarity. Further investigations are necessary to prove that a
candidate solution obtained from the PMP is indeed optimal.
We will now investigate some examples that can be solved analytically. In the first example, we
illustrate how to carry out the parametric optimization of H for a constrained problem.
Such a control signal that only takes values on the boundary of the set of allowed values for the
control signal is often called bang–bang. The reason for this is that if implemented using a mechan-
ical actuator, a bang will often be heard when switching from one of the values to the other. The
Hamiltonian is given by
H(t, x, u, λ) = λu.
Pointwise minimization yields
𝜇(t, x) = argmin_{|u|≤1} {λu} = { 1, λ < 0;  −1, λ > 0;  ũ, λ = 0 },
where ũ ∈ [−1,1] is arbitrary. The adjoint equation is given by
λ̇(t) = − 𝜕H(t, x, u, λ)/𝜕x = 0,   λ(T) = 𝜕𝜙(x(T))/𝜕x = x(T),
which has the solution λ(t) = x(T). We now have two cases:
● x(T) ≠ 0: In this case, λ(t) ≠ 0 for all t, and we can write
𝜇(t, x) = −sgn(λ) = −sgn(x(T)) = −sgn(x(t)).
The last equality holds since x will have the same sign as x(T) during the whole time interval.
● x(T) = 0: In this case, λ = 0 for all t and we may use any control signal ũ ∈ [−1,1], which obeys
the constraint x(T) = 0. One such control signal is
𝜇(t, x) = −sgn(x(t)),
since this will drive x to zero and stay there.
Consequently, one optimal control is
𝜇 ∗ (t, x) = −sgn(x(t)).
Example 7.7 We are interested in finding the path with the shortest distance from a given point
x0 ∈ ℝ² to a manifold ST = { x ∈ ℝ² ∶ G(x) = 0 }, where G ∶ ℝ² → ℝ is a differentiable function.
This can be done by finding the shortest time T it takes to “drive” with constant speed of one from
the given point to the manifold. This driving is described by the differential equations:
ẋ1(t) = cos 𝜃(t),
ẋ2(t) = sin 𝜃(t),
where 𝜃(t) ∈ [0, 2𝜋) is the heading angle. We express the time as T = ∫_0^T dt. The corresponding
optimal control problem is hence
minimize   ∫_0^T dt,
subject to ẋ1(t) = cos 𝜃(t),
           ẋ2(t) = sin 𝜃(t),
           x(0) = x0,  x(T) ∈ ST,
           𝜃(t) ∈ [0, 2𝜋),
with variable 𝜃 and T, where the final time T should be optimized. The Hamiltonian is for this
example given by
H(x, 𝜃, λ) = 1 + λ1 cos 𝜃 + λ2 sin 𝜃.
The adjoint equations are
λ̇(t) = 0,
since the Hamiltonian does not depend on x. Hence, the adjoint variables are constants, i.e. λ(t) = λ0
for some constant λ0 ∈ ℝ2 . Since the Hamiltonian has to be zero along the optimal solution, we
have that
1 + √(λ1² + λ2²) sin(𝜃(t) + 𝛿) = 0,
where 𝛿 satisfies sin 𝛿 = λ1∕√(λ1² + λ2²) and cos 𝛿 = λ2∕√(λ1² + λ2²). This follows from the formula for
the sine of the sum of two angles. Since λ is a constant, this implies that 𝜃 also has to be
a constant. Hence, the shortest path is a straight line. Moreover, from the fact that the optimal 𝜃
should minimize the Hamiltonian, we obtain the necessary condition that
dH/d𝜃 = −λ1 sin 𝜃 + λ2 cos 𝜃 = 0,
from which it follows that
cos 𝜃 ∕ sin 𝜃 = λ1 ∕ λ2,
if 𝜃 ∈ (0,2𝜋). We will now make use of the condition λ(T) ⟂ ST , which is equivalent to
λ(T) = 𝛼 𝜕G(x(T))/𝜕x,
for some 𝛼 ∈ ℝ. From the differential equations, we have that the slope of the straight line is given
by
dx1∕dx2 = cos 𝜃∕sin 𝜃 = λ1∕λ2.
We hence conclude that the shortest path has to be perpendicular to the manifold ST . The case
𝜃 = 0 can be taken care of by investigating dx2 ∕dx1 instead.
7.5 Numerical Solutions
Except for very simple optimal control problems, it is not possible to obtain analytical solutions.
Hence, we must often resort to numerical solutions. All methods use ideas from ordinary
differential equation (ODE) solvers to either integrate or approximate differential equations. There
are essentially two types of methods:
1. Indirect methods
2. Direct methods
The indirect methods aim at solving the necessary conditions of the PMP by integrating the differ-
ential equations in a recursive manner. The direct methods aim at solving a discretization of the
optimal control problem in (7.8) directly.
Regarding the indirect methods, there are two main ideas that are used and which can be sum-
marized in the two methods
1. Shooting method
2. Gradient method
In the shooting method, the TPBVP is solved by integrating the dynamical equations in (7.9) for-
ward in time using an ODE solver. The challenge here is that normally no initial value is known
for the adjoint variable λ, and one has to guess an initial value and then modify this guess based on
what value is obtained for λ(T). The method is conceptually simple, and the control constraints are
easily accommodated. However, it is crucial to find a good initial guess for λ, and the method can be
numerically unstable due to the fact that the adjoint equations are often unstable when integrated
forward in time. A nonlinear equation solver is needed in order to find the correct initial value for
λ. This method was successfully used for launching satellites in the 1950s.
In the gradient method the control signal is used as an optimization variable, and the dynami-
cal equation relating the control signal and the state is integrated forward in time, and the adjoint
equation is integrated backward in time. This is done iteratively using an ODE solver until conver-
gence of the gradient of the objective function to zero. The gradient of the objective function is the
partial derivative of the Hamiltonian with respect to the control signal. The gradient method has
the advantage that both differential equations are integrated in their stable direction. Control sig-
nal constraints can be taken care of by projection on the feasible set of control signals. The method
has rapid convergence for the first iterates, but then tends to be slow. It was used successfully in
the 1960s for a large number of aeronautical problems. To speed up the convergence, second-order
derivatives may be used, however, with the drawback of making each iteration much more com-
putationally expensive.
Regarding the direct methods, there are three main ideas which can be summarized in the three
methods:
1. Discretization method
2. Collocation method
3. Multiple shooting method
The methods are all based on approximating the control signal as, e.g. piecewise constant or as a
polynomial. They basically differ in how the state trajectory is approximated. The first one uses a
very simple Euler forward difference. The latter two use ideas from ODE solvers. Collocation meth-
ods explicitly make use of the polynomial approximations used in ODE solvers to approximate the
state trajectory, whereas multiple shooting methods explicitly make use of an ODE solver with the
advantage of using adaptive step-lengths. This ODE solver has to be able to deliver derivatives with
respect to the control signal parameters. All methods rely on an efficient nonlinear programming
solver at the top level to optimize the control signal parameters.
In-between the indirect methods and the direct methods is the method of consistent approximation,
which borrows ideas from both. This method approximates the control signal as, e.g. piecewise
constant, as a polynomial, or using orthogonal functions. It also uses a nonlinear programming
solver at the top level. However, ideas from ODE solvers are not used directly. Instead, differential
equations for adjoint variables are first derived. Then an ODE solver is used to integrate first the
differential equation relating the state to the control signal forward in time, and then the adjoint
equation backward in time. This explicitly provides gradients for the nonlinear programming solver.
Hence, the ODE solver does not have to be able to deliver derivatives with respect to the control
signal parameters. More details regarding consistent approximations are given in, e.g. [75].
We will now in detail explain several of the different algorithms. We will assume that we have
numerical algorithms for integrating ODEs, for finding roots of systems of nonlinear equations, and
for solving finite-dimensional optimization problems. General purpose algorithms in MATLAB for
this are, e.g. ode45, fsolve, and fmincon, respectively.
In case one does not want to write one’s own code for solving optimal control problems, there
are dedicated solvers for optimal control problems such as ACADO [58], and CasADi [5].
1. Guess an initial value for the control signal u(t), t ∈ [0, T].
2. Solve the state equation ẋ = F(x, u) with its initial condition forward in time.
3. Solve the adjoint equation with its terminal condition backward in time.
4. Update the control signal as u(t) ∶= u(t) − 𝛼 𝜕H∕𝜕u(t).
5. Repeat from Step 2 until 𝜕H∕𝜕u is sufficiently close to zero.
Here, care has to be taken to integrate the adjoint equation backward in time. In case there are
constraints such that u(t) ∈ U for all t ∈ [0, T], this can be taken care of by projecting the new
value of u(t) in Step 4 onto U. The parameter 𝛼 > 0 is a step size that has to be chosen with care.
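As an illustration of these steps, the following is a minimal sketch (not the book's implementation) of the gradient method for an unconstrained problem with a quadratic terminal cost and a control-energy cost; the problem data, the Euler discretization of the forward and backward integrations, and the fixed step size are assumptions made for this example.

% Sketch of the gradient method for
%   minimize c*||x(T)-xf||^2 + int_0^T u(t)^2 dt,  xdot = A*x + B*u, x(0) = x0,
% using simple Euler integration; all data below are assumed.
A = [0 1; 0 0]; B = [0; 1];
x0 = [1; 1]; xf = [0; 0]; T = 2; c = 10;
N = 200; h = T/N; alpha = 0.005;           % grid and (conservative) step size
u = zeros(1, N);                           % Step 1: initial guess
for iter = 1:5000
    x = zeros(2, N+1); x(:,1) = x0;        % Step 2: state equation forward
    for k = 1:N
        x(:,k+1) = x(:,k) + h*(A*x(:,k) + B*u(k));
    end
    lam = zeros(2, N+1);                   % Step 3: adjoint equation backward,
    lam(:,N+1) = 2*c*(x(:,N+1) - xf);      % lamdot = -A'*lam, lam(T) = dphi/dx
    for k = N:-1:1
        lam(:,k) = lam(:,k+1) + h*(A'*lam(:,k+1));
    end
    grad = 2*u + B'*lam(:,1:N);            % dH/du = 2*u + B'*lam
    u = u - alpha*grad;                    % Step 4: gradient step
    if norm(grad)*sqrt(h) < 1e-4, break; end
end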
We assume that we can summarize what we know about the initial values in
G0(x(0), λ(0)) = 0,
where G0 ∶ ℝn × ℝn → ℝp . Similarly, we assume that we can summarize what we know about the
final values in
GT (x(T), λ(T)) = 0,
where GT ∶ ℝn × ℝn → ℝ2n−p . We also define a function G ∶ ℝn × ℝn → ℝn × ℝn that takes as
input the initial values (x(0), λ(0)), integrates the differential equations, and outputs the final
values (x(T), λ(T)). This can be implemented with an ODE solver, e.g. ode45 in MATLAB. Then
we need to solve the system of equations given by
G0 (x(0), λ(0)) = 0,
GT (G(x(0), λ(0))) = 0.
In MATLAB, this can be done with fsolve. One of the challenges is as mentioned above to choose
a good enough initial guess for λ(0). Another challenge is to carry out the minimization of the
Hamiltonian explicitly in order to obtain the function 𝜇 used in (7.9). In case this cannot be done
analytically, we need to resort to numerical solutions. In case the minimum of the Hamiltonian is
obtained when its partial derivative with respect to u is zero, then this equation can be added to the
differential equations in (7.9), i.e. we consider
ẋ = F(x, u),   x(0) ∈ S0,  x(T) ∈ ST,
λ̇ = − 𝜕H̃(x, u, λ̃)/𝜕x,   λ(0) ⟂ S0,  λ(T) − 𝜕𝜙(x(T))/𝜕x ⟂ ST,      (7.11)
𝜕H̃(x, u, λ̃)/𝜕u = 0,
instead of (7.9) when we define the function G. The equation above is not an ODE, but a differential
algebraic equation (DAE), which in MATLAB can be solved with, e.g. ode15i for given initial
conditions. In case the minimum of the Hamiltonian could be on the boundary of the domain U,
then we would need to carry out the minimization of the Hamiltonian at each step of the DAE
solver, and this would require us to write special purpose code.
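The following is a minimal single shooting sketch for the LQ two-point boundary value problem derived earlier in this chapter, using ode45 and fsolve as mentioned above; the problem data and the zero initial guess for λ(0) are assumptions made for the example.

% Sketch of single shooting for the LQ TPBVP; all data are assumed examples.
A = [0 1; 0 0]; B = [0; 1]; Q = eye(2); R = 1; Q0 = eye(2);
T = 2; x0 = [1; 1];
M = [A, -B*(R\B'); -Q, -A'];                 % combined dynamics for (x, lambda)
res  = @(lam0) shootResidual(lam0, M, x0, Q0, T);
lam0 = fsolve(res, zeros(2, 1));             % find lambda(0) such that GT = 0
[~, y] = ode45(@(t, y) M*y, [0 T], [x0; lam0]);
u = -(R\(B'*y(:, 3:4)'))';                   % optimal control u = -R^{-1}*B'*lambda

function r = shootResidual(lam0, M, x0, Q0, T)
    % integrate the combined dynamics forward and return lambda(T) - Q0*x(T)
    [~, y] = ode45(@(t, y) M*y, [0 T], [x0; lam0]);
    yT = y(end, :)';
    r = yT(3:4) - Q0*yT(1:2);
end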
We then realize that the whole optimal control problem in (7.12) may be approximated with the
discrete time optimal control problem
minimize   𝜙(xN) + Σ_{i=0}^{N−1} hi f(ti, xi, ui),
subject to xi+1 = xi + hi F(ti, xi, ui),  i ∈ ℤN−1,
           x0 ∈ S0,  xN ∈ ST,
           ui ∈ U ⊂ ℝᵐ,
with variables x = (x0 , … , xN ), and u = (u0 , … , uN−1 ). How this optimal control problem can be
solved as a finite-dimensional optimization problem is explained for a very similar problem in
Section 8.1. We will anyway give the details, since they are useful later on. We define f0 ∶ ℝ^{(N+1)n} × ℝ^{Nm} → ℝ as
f0(x, u) = 𝜙(xN) + Σ_{i=0}^{N−1} hi f(ti, xi, ui).
with variables x and u. It is straightforward to generalize to the case when there are also inequality
constraints related to the states x. The resulting problem is a finite-dimensional optimization
problem as discussed already in Chapter 4. It is often of very high dimension, but it has a lot of
structure. The objective function is what is called separable, and the constraint functions are what
is called partially separable, see Section 5.5. This can be utilized to solve the problem efficiently.
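As a small illustration of the discretization approach (not a listing from the book), the following sketch sets up the discretized problem for an assumed double integrator with a quadratic control cost and solves it with fmincon; since the dynamics are linear, the dynamic and boundary constraints can be collected in one linear equality constraint.

% Sketch of a direct discretization (transcription) method; data are assumed.
A = [0 1; 0 0]; B = [0; 1]; x0 = [1; 1]; xf = [0; 0];
T = 2; N = 40; h = T/N; n = 2;
Ad = eye(n) + h*A; Bd = h*B;                   % Euler-discretized dynamics
nx = (N+1)*n; nu = N;                          % y = [x_0; ...; x_N; u_0; ...; u_{N-1}]
obj = @(y) h*sum(y(nx+1:end).^2);              % separable objective
Aeq = zeros(N*n + 2*n, nx + nu); beq = zeros(N*n + 2*n, 1);
for k = 1:N                                    % x_k - Ad*x_{k-1} - Bd*u_{k-1} = 0
    ri = (k-1)*n + (1:n);
    Aeq(ri, (k-1)*n + (1:n)) = -Ad;
    Aeq(ri, k*n + (1:n))     = eye(n);
    Aeq(ri, nx + k)          = -Bd;
end
Aeq(N*n + (1:n), 1:n) = eye(n);            beq(N*n + (1:n)) = x0;    % x_0 = x0
Aeq(N*n + n + (1:n), nx-n+1:nx) = eye(n);  beq(N*n + n + (1:n)) = xf; % x_N = xf
y = fmincon(obj, zeros(nx + nu, 1), [], [], Aeq, beq);
u = y(nx+1:end); X = reshape(y(1:nx), n, N+1); % optimal controls and states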
where f0 ∶ ℝ^{(N+1)n} × ℝ^k → ℝ. To take the constraint u(t) ∈ U for all t ∈ [0, T] into account, one
possibility is to sample the constraint at each time ti. This means that for all i, we add the constraint
u(ti) ∈ U. We assume that this can equivalently be expressed as gi(ai) ⪯ 0 for some functions gi ∶
ℝ^{ki} → ℝ^q. All inequality constraints can then be expressed in terms of g ∶ ℝ^k → ℝ^{Nq}, where
g(a) = [ g0(a0); ⋮; gN−1(aN−1) ].      (7.13)
As in the previous section, we assume that the functions G0 and GT can be used to describe the
constraints on the initial state and final state values. All equality constraints can be described using
with variables s and a. It is straightforward to generalize to the case when there are also inequality
constraints related to the states x. The above optimization problem is also a finite-dimensional opti-
mization problem. Here, we are not able to compute analytical derivatives of the functions defining
the optimization problem. However, often ODE solvers can also deliver derivatives of the solutions
with respect to (a, s), which can be used to compute derivatives of all the involved functions. The
fact that the differential equations can be solved in parallel can be used to speed up the solver.
A good reference to multiple shooting methods is [32]. The ACADO toolkit which implements
multiple shooting methods is described in [58].
for t ∈ [ti , ti+1 ], where cik ∈ ℝn are the coefficients of the vector-valued polynomial. The polynomial
has to agree with the true state x(t) and its derivative ẋ(t) at the endpoints of the interval. This results
in the following four equations for the coefficients cik :
xa(ti) = xi,
xa(ti+1) = xi+1,
ẋa(ti) = F(ti, xi, ui),
ẋa(ti+1) = F(ti+1, xi+1, ui+1),
where
ẋa(t) = Σ_{k=1}^{3} (k cik ∕ hi) ((t − ti)∕hi)^{k−1}.
Since the equations are linear in cik , it is straightforward to obtain the solution:
ci0 = xi ,
ci1 = hi Fi ,
𝜑(ti , ai ) ∈ U ⊂ ℝm ,
with variables x = (x0, … , xN) and a = (a0, … , aN−1), where we as before assume that the con-
straint involving the control signal can be written as inequalities involving a function as in (7.13).
Note that the variables x and a are implicitly present in xa and ẋa. An advantage as compared
to the multiple shooting method is that analytical derivatives with respect to (x, a) of the functions
defining the optimization problem are possible to compute. However, the cubic approximation of
the state might be less accurate as compared to the numerical integration of the state performed in
the multiple shooting method. A good reference to the collocation method presented here is [53].
More general collocation methods involving orthogonal polynomials are discussed in, e.g. [93].
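To make the construction of the interpolating cubic concrete, the following sketch (not from the book) solves the four interpolation conditions numerically for one interval; the numerical values of the states and of the corresponding function values are assumed.

% Sketch: solving the four conditions for the cubic coefficients on one interval.
hi  = 0.1;
xi  = [1; 0];       Fi  = [0; -1];      % assumed x_i and F(t_i, x_i, u_i)
xi1 = [0.99; -0.1]; Fi1 = [-0.1; -1];   % assumed x_{i+1} and F(t_{i+1}, x_{i+1}, u_{i+1})
% With s = (t - t_i)/h_i, x_a = c0 + c1*s + c2*s^2 + c3*s^3 and
% xdot_a = (c1 + 2*c2*s + 3*c3*s^2)/h_i, so the four conditions become:
M = [1 0 0 0;       % x_a(t_i)              = x_i
     1 1 1 1;       % x_a(t_{i+1})          = x_{i+1}
     0 1 0 0;       % h_i * xdot_a(t_i)     = h_i * F_i
     0 1 2 3];      % h_i * xdot_a(t_{i+1}) = h_i * F_{i+1}
rhs = [xi'; xi1'; hi*Fi'; hi*Fi1'];
C = M\rhs;          % row k+1 holds c_{ik}'; note C(1,:) = xi' and C(2,:) = hi*Fi'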
Exercises
7.6 We are interested in computing optimal transportation routes in a circular city. The cost for
transportation per unit length is given by a function g(r) that only depends on the radial
distance r to the city center. This means that the total cost for transportation from a point P1
to a point P2 is given by
∫_{P1}^{P2} g(r) ds,
where s represents the arc length along the path of integration. In polar coordinates (𝜃, r),
the total cost reads
∫_{P1}^{P2} g(r) √(1 + (r𝜃̇)²) dr,
where 𝜃 = 𝜃(r), and 𝜃̇ = d𝜃∕dr.
(a) Formulate the problem of computing an optimal path as an optimal control problem.
(b) For the case of g(r) = 𝛼∕r for some positive 𝛼, show that any optimal path satisfies the
equation 𝜃 = a ln r + b for some constants a and b.
(c) Show that if the initial point and the final point are at the same distance from the origin,
then the optimal path is a circle segment. You may use the claim in (b).
7.8 Consider the motion of a robotic manipulator with joint angles q ∈ ℝn , which may be
described as a function of the applied joint torques 𝜏 ∈ ℝn as
𝜏 = M(q)q̈ + C(q, q̇)q̇ + G(q),      (7.14)
where M(q) ∈ ℝ^{n×n} is a positive definite mass matrix and C(q, q̇) ∈ ℝ^{n×n} is a matrix account-
ing for Coriolis and centrifugal effects, which is linear in the joint velocities q̇ = dq∕dt, and
where G(q) ∈ ℝⁿ is a vector accounting for gravity and other joint angle-dependent torques.
Consider a path q(s) as a function of a scalar path coordinate s. The path coordinate deter-
mines the spatial geometry of the path, whereas the trajectory’s time dependency follows
from the relation s(t) between the path coordinate s and time t.
(a) Show that (7.14) can be expressed in terms of s as
𝜏(s) = m(s)s̈ + c(s)ṡ² + g(s),
where
m(s) = M(q(s)) q′(s),
c(s) = M(q(s)) q′′(s) + C(q(s), q′(s)) q′(s),
g(s) = G(q(s)).
(b) Consider the time-optimal path-tracking problem
minimize   T,
subject to 𝜏(s) = m(s)s̈ + c(s)ṡ² + g(s),
           s(0) = 0,  s(T) = 1,
           ṡ(0) = 0,  ṡ(T) = 0,
           𝜏̲(s(t)) ≤ 𝜏(s(t)) ≤ 𝜏̄(s(t)),
with variables s and T, where the torque lower bounds 𝜏̲ and upper bounds 𝜏̄ may depend
on s. Using the fact that dt = (dt∕ds) ds, show that
T = ∫_0^1 (1∕ṡ) ds,
and that for the change of variables
a(s) = s̈,
b(s) = ṡ²,

q(s) = s,
M(q) = l²m = 1,
C(q, q̇) = 0,
G(q) = mlg cos(s) = cos(s),
𝜏̲(s) = −2,  𝜏̄(s) = 2.
7.9 Consider the problem of taking the system
ẋ1 = x2,
ẋ2 = x3,
ẋ3 = u,   |u| ≤ 1,
from a certain arbitrary initial condition x(0) to a final value x(T) = 0 in minimum time.
Show that the necessary conditions for optimality are satisfied by a control of the form
u(t) = sgn(p(t)),
where p(t) is a polynomial. What is the maximum degree of the polynomial? How many
times can u change sign? It is not necessary to compute the values of the coefficients
of p(t).
7.10 A community living around a lake wants to maximize the yield of fish taken out of the lake.
The amount of fish at a certain time is denoted as x. The growth rate of the fish is kx and
fish is caught at a rate of ux, where u is the control variable, which is assumed to satisfy
0 ≤ u ≤ umax . The dynamics of the fish population is then given by
ẋ = (k − u)x, x(0) = x0 .
Here, k > 0 and x0 > 0. The total amount of fish obtained during a time period T is
J = ∫_0^T u x dt.
(a) Derive the necessary conditions given by the PMP for the problem of maximizing J.
(b) Show that the necessary conditions are satisfied by a bang–bang control. How many
switching times are there?
(c) Determine an equation for calculating the switching time(s).
7.11 Consider a motion model of a particle with position z(t) and speed 𝑣(t). Define the state
vector x = (z, 𝑣) and the continuous time model
ẋ(t) = F(x(t), u(t)) = [0 1; 0 0] x(t) + [0; 1] u(t).      (7.15)
The problem is to go from the state xi = x(0) = (1, 1) to xf = x(tf) = (0, 0), where tf = 2,
and such that the control input energy ∫_0^{tf} u²(t) dt is minimized. Thus, the optimization
problem is
problem is
minimize   ∫_0^{tf} f(x, u) dt,
subject to ẋ(t) = F(x(t), u(t)),                     (7.16)
           x(0) = xi,
           x(tf) = xf,
Note that uN is superfluous, but it is included to make the presentation and the code
more convenient. Furthermore, define
f0(y) = h Σ_{k=0}^{N−1} u_k²      (7.19)
and
h(y) = [ h1(y); h2(y); ⋮; hN+2(y) ] = [ x1 − F̄(x0, u0); x2 − F̄(x1, u1); ⋮; xN − F̄(xN−1, uN−1); x0 − xi; xN − xf ].      (7.20)
Show that the optimization problem in (7.18) can be expressed as the constrained problem
minimize   f0(y),
subject to h(y) = 0,
Aeq = [ −F  E   0   0  …   0  0;
         0  −F  E   0  …   0  0;
         0   0  −F  E  …   0  0;
         ⋮               ⋱    ⋮;
         0   0   0   0  … −F  E;
         E   0   0   0  …  0  0;
         0   0   0   0  …  0  E ],      (7.21)
where F and E are 2 × 3 matrices.
(c) Now, suppose that the constraints are nonlinear. This case can be handled by passing
a function that computes h(y) and its Jacobian. Let the initial and terminal state con-
straints, hN+1(y) = 0 and hN+2(y) = 0, be handled as above and complete the function
secOrderSysNonlcon.m by computing
7.12 In this exercise, we will also use the discrete time model in (7.17). However, the terminal
constraints are removed by approximating them with a penalty term in the objective func-
tion. If the terminal constraints are not fulfilled, they are penalized in the objective function.
Now, the optimization problem can be defined as
minimize   c‖xN − xf‖₂² + h Σ_{k=0}^{N−1} u_k²,
subject to xk+1 = F̄(xk, uk),                     (7.24)
           x0 = xi,
with variables x and u, where F̄ is given in (7.17), N = tf ∕h and c is a predefined constant.
This problem can be rewritten as an unconstrained quadratic optimization problem:
minimize   J(u0, u1, … , uN−1),      (7.25)
with variable u. The optimization problem will be solved by using the MATLAB function
fminunc.
(a) Write down an explicit algorithm for the evaluation of J.
(b) Complete the file secOrderSysCostUnc.m with the cost function . Solve the prob-
lem by running the script mainDiscUncSecOrderSys.m.
(c) Compare the method in this exercise with the one in the previous exercise. What are the
advantages and disadvantages? Which method handles constraints best? Suppose the
task was also to implement the optimization algorithm, which method would be easiest
to implement, assuming that the state dynamics would be nonlinear?
(d) Implement an unconstrained gradient method that can replace the fminunc function
in mainDiscUncSecOrderSys.m.
7.13 We consider the same optimal control problem as in the previous exercise. We will investi-
gate the gradient method based on the PMP method. The optimal control problem is
minimize   J = c‖x(tf) − xf‖₂² + ∫_0^{tf} u²(t) dt,
subject to ẋ(t) = F(x(t), u(t)),
           x(0) = xi,
with variables x and u.
(a) Write down the Hamiltonian and show that the Hamiltonian partial derivatives with
respect to x and u are
𝜕H∕𝜕x = [0; λ1],
𝜕H∕𝜕u = 2u(t) + λ2,      (7.26)
respectively.
(b) What are the adjoint equation and its terminal constraint?
(c) Complete the files secOrderSysEq.m with the system model, the file secOrder-
SysAdjointEq.m with the adjoint equations, the file secOrderSysFinal-
Lambda.m with the terminal values of λ, and the file secOrderSysGradient.m
with the control signal gradient. Finally, complete the script mainGradientSec-
OrderSys.m and solve the problem.
(d) Try some different values of the penalty constant c. What happens if c is “small?” What
happens if c is “large?”
7.14 In this exercise, we will solve the problem in (7.16) using a shooting method as discussed in
the subsection on shooting methods in Section 7.5. Such methods are based on successive
improvements of the unspecified initial conditions of the TPBVP.
Complete the files secOrderSysEqAndAdjointEq.m with the combined system and
adjoint equation which implements the function G in the abovementioned section. Also,
complete the file theta.m with the final constraints, which implement the function
GT . We do not need to specify the function G0 , since the initial value for the state is fully
known. The main script is mainShootingSecOrderSys.m, and it solves the equation
GT (G(x(0), λ(0))) = 0 with respect to λ(0) for a given value of x(0).
7.15 (a) What are the advantages and disadvantages of discretization methods?
(b) Discuss the advantages and disadvantages of the problem formulation in Exercise 7.11
compared to the formulation in Exercise 7.12.
(c) What are the advantages and disadvantages of gradient methods?
(d) Compare the methods in Exercises 7.11–7.13 in terms of accuracy and complexity/speed.
Also, compare the results for different algorithms used by fmincon in Exercises 7.11.
Can all algorithms handle the optimization problem?
(e) Assume that there are constraints on the control signal, and you can use either the
discretization method in Exercise 7.11 or the gradient method in Exercise 7.13. Which
method would you use?
(f) What are the advantages and disadvantages of shooting methods?
7.16 Consider the problem of finding the curve with the minimum length from a point (0,0) to
(xf , yf ). The solution is of course obvious, but we will use this example as a starting point for
introducing CasADi, see https://web.casadi.org, to solve optimal control problems
using a collocation method. We will use a control signal that is constant in each discretiza-
tion interval.
The problem can be formulated by using an expression of the length of the curve from (0,0)
to (xf , yf ):
s = ∫_0^{xf} √(1 + y′(x)²) dx.      (7.27)
Note that x is the “time” variable. The optimal control problem is solved in minCurve-
LengthCol.m by using MATLAB and CasADi. Notice that (x, y) are represented as (t, x) in
the file.
(a) Derive the expression of the length of the curve in (7.27).
(b) Write the problem on standard optimal control form, i.e. determine 𝜙, f , F, and so on.
(c) Use the PMP to show that the solution is a straight line.
(d) Examine and run the script minCurveLengthCol.m and compare the solution with
what the theory says.
7.17 Consider the minimum length curve problem again. The problem can be reformulated by
using a constant speed model where the control signal is the heading angle. This problem is
solved in the CasADi/MATLAB file minCurveLengthHeadingCtrlMS.m.
(a) Examine the CasADi/MATLAB file and write down the optimal control problem that is
solved in this example, i.e. what are F, f , and 𝜙 in the standard optimal control formu-
lation.
(b) Run the script and compare with the result from the previous exercise.
7.18 In this exercise, we will investigate the so-called “Brachistochrone problem,” which was
posed by Johann Bernoulli in Acta Eruditorum in 1696. The history of this problem involves
several of the greatest scientists ever, such as Galileo, Pascal, Fermat, Newton, Lagrange, and
Euler. It is about finding the curve between two points, A and B, that is covered in the least
time by a body that starts in A with zero speed and is constrained to move along the curve to
point B, under the action of gravity only and assuming no friction, see Figure 7.1. The word
“brachistochrone” comes from the Greek language: brachistos – the shortest, chronos – time.
Let the motion of the particle, under the influence of gravity g, be defined by ż = F(z, 𝜃),
where the state vector is defined as z = (x, y, 𝑣) and (x, y) is the Cartesian position of the
particle in a vertical plane and 𝑣 is the speed, i.e.
ẋ = 𝑣 sin(𝜃),
ẏ = −𝑣 cos(𝜃). (7.28)
The motion of the particle is constrained by a path that is defined by the angle 𝜃(t).
(a) Give an explicit expression for F(z, 𝜃). Only the expression for 𝑣̇ is missing.
(b) Define the Brachistochrone problem as an optimal control problem based on this
state-space model. Assume that the initial position of the particle is at the origin and that
the initial speed is zero. The final position of the particle is (x(tf ), y(tf )) = (xf , yf ) = (10,3).
(c) Modify the script minCurveLengthHeadingCtrlMS.m of the minimum curve
length example in Exercise 7.17 above and solve the Brachistochrone problem with
CasADi.
7.19 The time it takes for a particle to travel on a curve between the points p0 = (0,0) and
pf = (xf , yf ) is
tf = ∫_{p0}^{pf} (1∕𝑣) ds,      (7.29)
Figure 7.1 The Brachistochrone problem: a particle starts at A = (0, 0) and moves under the influence of gravity g along a curve to B = (xf, yf).
with solution
x = (C∕2)(𝜙 − sin 𝜙),
y = (C∕2)(1 − cos 𝜙),      (7.32)
where 𝜙 parameterizes the curve.
7.20 An alternative formulation of the Brachistochrone problem, c.f. Exercise 7.18, can be
obtained by considering the “law of conservation of energy,” which is derived from the
principle of least action in Example 7.4. Consider the position (x, y) of the particle and its
velocity (ẋ, ẏ).
(a) Write the kinetic energy T and the potential energy V as functions of x, y, ẋ, ẏ, the mass
m, and the gravity constant g.
(b) Define the Brachistochrone problem as an optimal control problem based on the law of
conservation of energy.
Hint: You should introduce u = ẋ and 𝑣 = ẏ as control signals. Notice that the problem
will contain an algebraic constraint that is not present in a standard optimal control
problem as we have defined it. This means that the state evolution is not described by
an ODE but by a differential algebraic equation.
(c) Solve this optimal control problem with CasADi by modifying the file brachis-
tochroneHeadingCtrlMS.m. Assume that the mass of the particle is m = 1.
Hint: You need to use a value for N of at least 40.
7.21 When solving the Brachistochrone problem, we have used three different problem formula-
tions in the three previous exercises. We will discuss here the pros and cons of the different
formulations.
(a) Discuss why not only the choice of optimization algorithm is important but also the
problem formulation, when deciding how to solve an optimal control problem numeri-
cally.
(b) Compare the different approaches to the Brachistochrone problem in the three previous
exercises. Try to explain the advantages and disadvantages of the different formulations
of the Brachistochrone problem.
7.22 We will solve here the so-called “Zermelo” problem. From the point (0,0) on the bank of a
wide river, a boat starts with relative speed to the water equal to 𝜈. The stream of the river
becomes faster as it departs from the bank, and the speed is g(y) parallel to the bank. The
movement of the boat is described by
ẋ(t) = 𝜈 cos(𝜙(t)) + g(y(t)),
ẏ(t) = 𝜈 sin(𝜙(t)),
where 𝜙 is the angle between the boat direction and bank. We want to determine the move-
ment angle 𝜙(t) so that x(T) is maximized, where T is a fixed time. We will investigate the
case when g(y) = y, 𝜈 = 1, and T = 1. Use CasADi to solve this optimal control problem by
modifying the file minCurveLengthCol.m.
7.23 Solve the problem in Exercise 7.11 using CasADi and both the multiple shooting method
and the collocation method. How many iterations do the methods need to converge to a
solution?
Dynamic Programming
In Chapter 7, we discussed how to solve optimal control problems over a finite time interval.
We specifically considered the continuous time case, since for discrete time dynamics, it is fairly
straightforward how to solve the optimal control problem. The solutions we obtained are so-called
open-loop solutions, i.e. the value of the control signal at a certain time depends only on the
initial value of the state and not on the current value of the state. For the linear quadratic (LQ)
control problem, we were able to restate the solution as a feedback policy, i.e. to explicitly write the
control signal as u(t) = 𝜇(t, x(t)) for a feedback function 𝜇. This is very desirable since it is known
that feedback solutions are more robust to unmodeled dynamics and disturbances. We will in this
chapter look in more detail into the problem of obtaining feedback solutions. We will only treat
the discrete time case, since it is easier from a mathematical point of view. However, many of the
ideas can be extended to the continuous time case. We will first consider a finite time interval and
then we will discuss the case of an infinite time interval. Optimal feedback control goes back to
the work by Richard Bellman who in 1953 introduced what is known as dynamic programming to
solve these types of problems. We will see that this is a special case of message passing as discussed
in Section 5.5. The case of infinite time interval is of practical importance in many applications,
since under certain conditions stability can be proven. It is unfortunately often very difficult to
compute the solution, and hence, we will introduce what is known as model predictive control
(MPC) as a remedy. This is today commonly used in industry. We will at the end of the chapter
discuss how to treat uncertainty by considering a stochastic setting of the control problem. This
treatment is based on the stochastic multistage decision problem introduced in Section 5.7.
8.1 Finite Horizon Optimal Control
final state xN called the terminal cost or final cost 𝜙 ∶ 𝒳 → ℝ. The optimal control problem is then
the problem of solving
minimize 𝜙(xN) + Σ_{k=0}^{N−1} fk(xk, uk),
Here, the minimization over u should be carried out for all possible values of x and hence, the
optimal u is a function of x, i.e. we have a multiparametric optimization problem as discussed in
Section 5.6. If it is possible to carry out the minimizations above, then the optimal control signal for
(8.2) is given by the minimizing argument in the dynamic programming recursion, i.e. u⋆k = 𝜇k(x),
where 𝜇k ∶ 𝒳 → 𝒰 for k ∈ ℤN−1 is given by
𝜇k(x) = argmin_{u∈Uk(x), Fk(x,u)∈Xk+1} Qk(x, u),
Example 8.1 Consider the problem in (8.2) for the case when the dynamic equation is linear, i.e.
Fk(x, u) = Ak x + Bk u, where Ak ∈ ℝ^{n×n} and Bk ∈ ℝ^{n×m}, and where 𝒳 = ℝⁿ and 𝒰 = ℝᵐ. We assume that the
incremental costs and the final cost are quadratic functions given by fk(x, u) = xᵀSk x + uᵀRk u and
𝜙(x) = xᵀSN x, respectively, where Rk ∈ 𝕊ᵐ₊₊ and Sk ∈ 𝕊ⁿ₊. We also assume that there are no state
constraints or control signal constraints. This problem is called the linear quadratic (LQ) control
problem. Application of the dynamic programming recursion in (8.4) gives VN (x) = xT SN x and
Vk(x) = min_u { xᵀSk x + uᵀRk u + Vk+1(Ak x + Bk u) }.
We will make the guess that Vk (x) = xT Pk x for some Pk ∈ 𝕊n . This is clearly true for k = N with
PN = SN . We now assume that it is true for k + 1. Then the right-hand side of the above equation
reads
Q(x, u) = xᵀSk x + uᵀRk u + (Ak x + Bk u)ᵀPk+1(Ak x + Bk u)
        = [x; u]ᵀ [Sk + AkᵀPk+1Ak, AkᵀPk+1Bk; BkᵀPk+1Ak, Rk + BkᵀPk+1Bk] [x; u],
which should be minimized with respect to u. If Rk + BkᵀPk+1Bk ∈ 𝕊ᵐ₊₊, then the above optimization
problem is convex in u, and the solution is obtained similarly as in Example 5.7 as
u = −(Rk + BkᵀPk+1Bk)⁻¹BkᵀPk+1Ak x,
which defines the feedback policy 𝜇k(x). Back-substitution of this expression for u results in the
following expression for the right-hand side of the dynamic programming recursion:
xᵀ( Sk + AkᵀPk+1Ak − AkᵀPk+1Bk(Rk + BkᵀPk+1Bk)⁻¹BkᵀPk+1Ak )x.
the result follows. This is true for k = N − 1, since PN = SN ∈ 𝕊ⁿ₊. Assume that it is true for k + 1;
then the above minimization will result in a minimal value that is nonnegative. Hence, the result
also holds for k.
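The recursion in the example is easy to implement; the following MATLAB sketch (with assumed problem data and with the terminal weight assumed equal to S) computes the matrices Pk and the feedback gains backward in time.

% Sketch of the finite-horizon LQ dynamic programming (Riccati) recursion;
% the problem data A, B, S, R, the terminal weight, and N are assumed.
A = [1 0.1; 0 1]; B = [0; 0.1]; S = eye(2); R = 1; N = 20;
P = cell(N+1, 1); L = cell(N, 1);
P{N+1} = S;                                     % P_N = S_N (assumed equal to S)
for k = N:-1:1
    L{k} = (R + B'*P{k+1}*B)\(B'*P{k+1}*A);     % feedback gain at stage k-1
    P{k} = S + A'*P{k+1}*A - A'*P{k+1}*B*L{k};  % Riccati recursion
end
% With MATLAB's 1-based indexing, u_k = -L{k+1}*x_k and V_0(x_0) = x_0'*P{1}*x_0.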
Example 8.2 We will now look at the problem of finding the shortest path between two nodes in
a graph, see Figure 8.1. The nodes could represent different cities, and the numbers on the edges
could represent the distances between the cities. This is called a shortest path problem. We cast this
problem as an optimal control problem as in (8.2) with the following definitions. Let N = 5, x0 = 0,
Xk = {−1, 0, 1} for k = 1, … , N − 1 and XN = {0}. We define the control signal to take the values
−1, 0, 1, which mean to go down, stay, or go up in the graph, respectively. Hence, the control signal
constraint set is
Uk(x) = {−1, 0} if x = 1,  {−1, 0, 1} if x = 0,  {0, 1} if x = −1,
for k ∈ ℕN−2, and
UN−1(x) = {−1} if x = 1,  {0} if x = 0,  {1} if x = −1.
Figure 8.1 Graph for the shortest path problem in Example 8.2.
Moreover, we let Fk (x, u) = x + u, and hence, the next state value is the sum of the current state
value and the control signal value. The final cost is 𝜙(x) = 0. Finally, the incremental cost is
fk (x, u) = cki,j , where cki,j is the cost on the arrow from node i at stage k to node j at stage k + 1. It
should be stressed that the definitions above are not unique. We could, e.g. change the control
signal to take values −2, 0, 2 and change the function Fk to Fk (x, u) = x + 0.5u.
We now apply the dynamic programming recursion. At the final stage N = 5, we have V5 (x) = 0
for x = 0 and V5 (x) = ∞ for any other x, since X5 = {0}. Then
V4(x) = min_{u∈U4(x), F4(x,u)∈X5} { f4(x, u) + V5(F4(x, u)) }.
We realize that x and u must satisfy F4 (x, u) = 0 to obtain a minimum, since otherwise the second
term will be infinity. Hence, for Stage 4, we get
V4(x) = { c4_{1,0} = 2, x = 1;  c4_{0,0} = 3, x = 0;  c4_{−1,0} = 4, x = −1 },
where the value for x = 1 is obtained for u = −1, the value for x = 0 is obtained for u = 0, and the
value for x = −1 is obtained for u = 1. For Stage 3, we have
V3(x) = min_{u∈U3(x), F3(x,u)∈X4} { f3(x, u) + V4(F3(x, u)) }
      = { min{c3_{1,1} + V4(1), c3_{1,0} + V4(0)} = min{3, 5} = 3, x = 1;
          min{c3_{0,1} + V4(1), c3_{0,0} + V4(0), c3_{0,−1} + V4(−1)} = min{5, 7, 6} = 5, x = 0;
          min{c3_{−1,0} + V4(0), c3_{−1,−1} + V4(−1)} = min{4, 9} = 4, x = −1 },
where the value for x = 1 is obtained for u = 0, the value for x = 0 is obtained for u = 1, and the
value for x = −1 is obtained for u = 1. We can now continue for the remaining stages, and the result
is depicted in Figure 8.2. The shortest path corresponds to the thick arcs. Above each node, the value
of Vk(x) is given, and the small arrows show the optimal control signal at each node. From these
arrows, we are hence able to obtain the shortest path from any other node than the initial
one to the final node as well. Hence, the optimal cost to go from Node 1 at Stage 1 to the final node is 5,
and the optimal path is to go to Node 1 at each stage except for at Stage 4, where one has to go to
Node 0.
Figure 8.2 Graph for the shortest path problem in Example 8.2. The shortest path corresponds to the thick
arcs. Above each node, the value of Vk(x) is given, and the small arrows show the optimal control signal at
each node.
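The recursion in this example can be carried out mechanically for any stage graph; below is a small MATLAB sketch with assumed (random) stage costs, so it does not reproduce the numbers of Figure 8.1, but it illustrates the backward recursion and the recovery of the optimal decisions.

% Sketch of the dynamic programming recursion on a stage graph; the costs are
% assumed (random), and disallowed transitions could be given infinite cost.
N = 5; states = [-1 0 1]; ns = numel(states);
c = rand(ns, ns, N);                          % c(i,j,k): cost from state i to j at stage k
V = inf(ns, N+1); V(states == 0, N+1) = 0;    % only the final state 0 is allowed
mu = zeros(ns, N);                            % index of the optimal successor
for k = N:-1:1
    for i = 1:ns
        [V(i, k), mu(i, k)] = min(reshape(c(i, :, k), [], 1) + V(:, k+1));
    end
end
% V(:, 1) holds the optimal cost-to-go from each initial state; following mu
% from the initial state recovers the shortest path.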
It is clear that the dynamic programming recursion is difficult to solve analytically. Even if each
component of the state takes values in finite sets with say p elements, the task is challenging for
high dimension n of the state. The total number of possible values of a state is then pⁿ, which also
for moderate values of p and n can be a huge number. This is called the curse of dimensionality. For
problems where the state is a real-valued vector, one may approximate it with discrete values, obtaining
a piecewise constant approximation, but then very coarse approximations are necessary.
We will now discuss how the curse of dimensionality can be circumvented to some extent. We
introduce vector-valued functions 𝜑k ∶ 𝒳 → ℝᵖ called feature vectors and parameters ak ∈ ℝ^q for
k ∈ ℤN−1. We would like to approximate the functions Vk in (8.4) with Ṽk ∶ 𝒳 × ℝ^q → ℝ, where
Ṽk(x, ak) = V̂k(𝜑k(x), ak) for a suitable choice of V̂k ∶ ℝᵖ × ℝ^q → ℝ and ak ∈ ℝ^q. One possible
choice of V̂k is to use a linear regression model for which p = q, i.e.
V̂k(𝜑k(x), ak) = akᵀ𝜑k(x).
There are of course many more possibilities. The definitions made above are general enough to
model V̂ k as an artificial neural network (ANN), see Section 10.7. In case one has problem-specific
insight, more clever choices can be made. In a preprocessing step, features could be extracted from
data or from sufficient statistics, see [17]. For the sake of simplicity, we will in what follows only
discuss the linear regression case. Notice that a piecewise constant approximation of Vk can be
obtained by taking
𝜑k(x) = { 1, x ∈ Dk;  0, x ∉ Dk },
where Dk , k ∈ ℕp is a partition of ℝn . More general approximations are obtained by taking 𝜑k as
basis functions. In this section, we will not consider constraints on xk or uk .
iteration. It is based on sampling the state space 𝒳. This can be done in many ways, and how
well this is done is crucial for the success of the algorithm. A popular choice is to use some sort of
Monte Carlo technique. It is important that the states that are sampled are representative for what
states are typically visited by a close to optimal policy. We start by defining approximate Q-functions
Q̃k ∶ 𝒳 × 𝒰 × ℝ^q → ℝ as
Q̃k(x, u, a) = { fk(x, u) + Ṽk+1(Fk(x, u), a),  k ∈ ℤN−2;
                fk(x, u) + 𝜙(Fk(x, u)),  k = N − 1.
Here, we have just replaced Vk+1 with Ṽk+1 in the expression for Qk in (8.5). Notice that Q̃k does
not depend on any parameter a for k = N − 1, which is the iteration index we start with. We then
consider samples xks ∈ 𝒳, where s ∈ ℕr, and define the minimal values of the approximate
Q-functions as
𝛽ks = min_u Q̃k(xks, u, ak+1),      (8.6)
where ak+1 is a known value from the previous iterate. This is in general a nonconvex optimization
problem that can be challenging to solve. It should be stressed that the quality of the approximation
that is obtained depends critically on the choice of xks . After this, we define the following LS problem
for obtaining the next value of the parameter ak :
minimize (1∕2) Σ_{s=1}^{r} ( Ṽk(xks, a) − 𝛽ks )²,
with variable a. The iterations start with k = N − 1 and proceed backward in time to k = 0. For the
case when V̂ k is a linear regression model, the LS problem is a linear LS problem with closed-form
solution. Otherwise, we need to use some iterative optimization method as discussed in Chapter 6.
Once all the parameters ak have been computed, the approximate feedback function is given by
𝜇k(x) = argmin_u Q̃k(x, u, ak+1).      (8.7)
Example 8.3 We will in this example perform fitted-value iteration for the optimal control prob-
lem in Example 8.1. We will specifically consider the case when m = 1 and n = 2, and we let
Ak , Bk , Rk , Sk be independent of k, and we write them as A, B, R, S. Since we know that the value
function is quadratic, we will use a feature vector that is 𝜑(x) = (x1², x2², 2x1x2), where x = (x1, x2) ∈
ℝ². Notice that the indices refer to components of the vector and not to time. We let
V̂ k (𝜑(x), a) = aT 𝜑(x),
where a ∈ ℝ3 . With
P̃ = [a1, a3; a3, a2],
we may then write
Ṽk(x, a) = aᵀ𝜑(x) = xᵀP̃x.
Hence, the true value function Vk(x) = xᵀPk x and the approximate value function Ṽk(x, a) agree
if P̃ = Pk. We, moreover, have
Q̃k(x, u, a) = xᵀSx + uᵀRu + (Ax + Bu)ᵀP̃(Ax + Bu)
            = [x; u]ᵀ [S + AᵀP̃A, AᵀP̃B; BᵀP̃A, R + BᵀP̃B] [x; u].      (8.8)
For k = N − 1 down to k = 0, we then solve the minimization problem in (8.6) to obtain 𝛽ks. From
Example 5.7, we realize that
𝛽ks = (xks)ᵀ( S + AᵀP̃k+1A − AᵀP̃k+1B(R + BᵀP̃k+1B)⁻¹BᵀP̃k+1A )xks,
assuming that R + BᵀP̃k+1B is positive definite. Here, P̃N = S. We then obtain ak as the solution to
the linear LS problem
minimize (1∕2) Σ_{s=1}^{r} ( 𝜑ᵀ(xks)a − 𝛽ks )²,
with variable a. This defines P̃ k . The solution ak satisfies the normal equations, cf . (5.3),
ΦkᵀΦk ak = Φkᵀ𝛽k,
where
Φk = [𝜑ᵀ(xk1); ⋮; 𝜑ᵀ(xkr)],   𝛽k = [𝛽k1; ⋮; 𝛽kr].
It is crucial here to choose xks such that ΦkᵀΦk is invertible. We realize that we need r ≥ 3 for this to
hold. Moreover, we need to choose xks sufficiently different for ΦkᵀΦk to be well conditioned. For a
general n, we will need that r ≥ n(n + 1)∕2. From (8.7), (8.8), and Example 5.7 it follows that the
optimal control is given by
uk = −(R + BᵀP̃k+1B)⁻¹BᵀP̃k+1A xk.
In the example above, it should be possible to obtain the same solution as in Example 8.1. In
general, it is not the case that fitted value iteration will provide the exact solution to the problem in
(8.2). The reason is that in general, we cannot represent the value function exactly with the feature
vectors.
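For completeness, a minimal MATLAB sketch of the fitted value iteration above is given below; the problem data and the randomly sampled states are assumptions made for this example.

% Sketch of fitted value iteration for the LQ example; data and samples assumed.
A = [1 0.1; 0 1]; B = [0; 0.1]; S = eye(2); R = 1; N = 20; r = 10;
phi = @(x) [x(1)^2; x(2)^2; 2*x(1)*x(2)];     % feature vector
Ptil = S;                                     % \tilde{P}_N = S
for k = N-1:-1:0
    K = (R + B'*Ptil*B)\(B'*Ptil*A);          % minimizer of \tilde{Q}_k over u
    Pmin = S + A'*Ptil*A - A'*Ptil*B*K;       % so that beta_k^s = (x^s)'*Pmin*x^s
    Xs = randn(2, r);                         % sampled states x_k^s (assumed)
    Phik = zeros(r, 3); beta = zeros(r, 1);
    for s = 1:r
        Phik(s, :) = phi(Xs(:, s))';
        beta(s) = Xs(:, s)'*Pmin*Xs(:, s);
    end
    a = (Phik'*Phik)\(Phik'*beta);            % linear LS via the normal equations
    Ptil = [a(1) a(3); a(3) a(2)];            % defines \tilde{P}_k
end
% At stage k the feedback is u_k = -(R + B'*Pnext*B)\(B'*Pnext*A)*x_k, where
% Pnext denotes \tilde{P}_{k+1} from the previous pass of the loop.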
2 The point (x0 , u0 ) is a stationary point of xk+1 = F(xk , uk ) if x0 = F(x0 , u0 ), i.e. the state remains at x0 for all values
of k.
of coordinates to make this hold. We also assume that the incremental cost f is such that f (0, 0) = 0.
Hence, if the state reaches the stationary point, only zero value is added to the cost function for each
stage we remain in stationarity. If this assumption would not be made, it would not be possible to
obtain closed-loop stability, i.e. (xk , uk ) → (0, 0) as k → ∞ with finite cost for the case when 𝛾 = 1.
To simplify the presentation below, we will also restrict ourselves to the case when the incremental
cost is strictly positive definite.3
then V(x) = J⋆(x) and u⋆k = 𝜇(xk) is the optimal feedback control, where 𝜇 ∶ 𝒳 → 𝒰 is defined
as 𝜇(x) = argmin_{u∈U(x)} {f(x, u) + 𝛾V(F(x, u))}. If in addition 𝛾 is sufficiently close to one, this feed-
back results in closed-loop stability in the sense defined above. The proof of this result is given in
Section 8.10. For later reference, we define Q ∶ 𝒳 × 𝒰 → ℝ as
Q(x, u) = f(x, u) + 𝛾V(F(x, u)).
We next consider an example which is known as infinite horizon LQ control.
Example 8.4 Let us consider the case when F(x, u) = Ax + Bu for matrices A ∈ ℝ^{n×n} and B ∈
ℝ^{n×m}. We also assume that f(x, u) = xᵀSx + uᵀRu, where S ∈ 𝕊ⁿ₊ and R ∈ 𝕊ᵐ₊₊, and that U(x) = ℝᵐ
for all x. Clearly, we satisfy the assumptions on the functions f and F. We will make the guess that
V(x) = xᵀPx for some P ∈ 𝕊ⁿ₊₊. Then
Q(x, u) = [x; u]ᵀ [S + 𝛾AᵀPA, 𝛾AᵀPB; 𝛾BᵀPA, R + 𝛾BᵀPB] [x; u].
Since this expression is strictly convex in u, it follows from Example 5.7 that Q is minimized for
u = −𝛾(R + 𝛾BᵀPB)⁻¹BᵀPAx.
Back substitution of this results in
xᵀPx = xᵀ( S + 𝛾AᵀPA − 𝛾²AᵀPB(R + 𝛾BᵀPB)⁻¹BᵀPA )x.
This equation holds if P is the solution to the following discounted algebraic Riccati equation:
P = S + 𝛾AᵀPA − 𝛾²AᵀPB(R + 𝛾BᵀPB)⁻¹BᵀPA.      (8.11)
It can be shown that there is a unique solution to the above equation under our assumptions
and if (A, B) is controllable.5 This solution is such that P ∈ 𝕊n++ , and hence, the function V
is strictly positive definite and quadratically bounded. The assumptions can be relaxed, see,
e.g. [51].
3 A function V ∶ 𝒳 → ℝ is said to be strictly positive definite if V(0) = 0 and there exists 𝜖 > 0 such that V(x) ≥ 𝜖‖x‖₂².
4 A function V ∶ 𝒳 → ℝ is said to be quadratically bounded if there exists c > 0 such that V(x) ≤ c‖x‖₂².
5 It holds that (A, B) is controllable if and only if the matrix [B  AB  ⋯  A^{n−1}B] has full rank.
where f̂ ∶ 𝒳 × 𝒰 → ℝ is defined as
The equation above is called the variational form of the Bellman equation, and f̂ is called the tem-
poral difference corresponding to W. This equivalent formulation plays an important role when
solving the Bellman equation numerically.
The Bellman equation is in general a difficult equation to solve. It is possible to show that the
dynamic programming iteration in (8.4) converges to the solution V(x) of the Bellman equation
when k → −∞.
We will for convenience restate it in a format where the iteration index proceeds forward instead
of backward as
Vk+1(x) = min_{u∈U(x)} { f(x, u) + 𝛾Vk(F(x, u)) },      (8.13)
with initial value V0 (x) = 0. If one has a clever guess of an approximate solution to the Bellman
equation, this can be used as initial value instead. This will make the iterates converge much
faster. The iteration above is called value iteration (VI). The algorithm for VI is summarized in
Algorithm 8.2, where T is defined in (8.14).
Example 8.5 Here, we will consider VI for infinite horizon LQ control as in Example 8.4. To this
end, we let Vk(x) = xᵀPk x with P0 = 0. Similarly as in Example 8.1, we realize that if Pk satisfies
the recursion
Pk+1 = S + 𝛾AᵀPk A − 𝛾²AᵀPk B(R + 𝛾BᵀPk B)⁻¹BᵀPk A,
then the VI recursion is satisfied. Based on a similar argument as in Example 8.1, we also realize
that the inverse in the recursion exists and that Pk ∈ 𝕊ⁿ₊ for all k.
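A minimal MATLAB sketch of this value iteration, with assumed problem data and discount factor, is given below.

% Sketch of value iteration (the discounted Riccati recursion); data assumed.
A = [1 0.1; 0 1]; B = [0; 0.1]; S = eye(2); R = 1; gam = 0.95;
P = zeros(2);                                 % P_0 = 0
for k = 1:1000
    Pnext = S + gam*A'*P*A ...
          - gam^2*A'*P*B*((R + gam*B'*P*B)\(B'*P*A));
    if norm(Pnext - P, 'fro') < 1e-10, P = Pnext; break; end
    P = Pnext;
end
% P approximately solves (8.11), and u = -gam*((R + gam*B'*P*B)\(B'*P*A))*x.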
The proof of convergence of VI is based on the contraction property of the Bellman operator
T ∶ ℝ^𝒳 → ℝ^𝒳 defined as
T(V)(x) = min_{u∈U(x)} { f(x, u) + 𝛾V(F(x, u)) }.      (8.14)
We first show that the Bellman operator is monotone. Assume that V1 and V2 are two functions
from 𝒳 to ℝ such that V1(x) ≤ V2(x) for all x ∈ 𝒳. Then
T(V1)(x) − T(V2)(x) ≤ 𝛾V1(F(x, u⋆2)) − 𝛾V2(F(x, u⋆2)) ≤ 0,
where u⋆2 is the minimizer in the definition of T(V2)(x). Using this together with the fact that
T(V + c) = T(V) + 𝛾c for any constant c ≥ 0, it follows that the implication
V1(x) − c ≤ V2(x) ≤ V1(x) + c ⟹ T(V1)(x) − 𝛾c ≤ T(V2)(x) ≤ T(V1)(x) + 𝛾c,
holds. From this, it follows by the contraction-mapping theorem [73, p. 272] that VI converges to a
solution of the Bellman equation for the case when 𝛾 < 1. In case 𝛾 = 1, it is sometimes possible to
still prove convergence for VI. See, e.g. [19] for the LQ control case.
It turns out that the convergence of VI can be slow, and there is another approach that can be
pursued. This is called policy iteration (PI). Introduce the Bellman policy operator T𝜇 ∶ ℝ^𝒳 → ℝ^𝒳
defined as
T𝜇(V)(x) = f(x, 𝜇(x)) + 𝛾V(F(x, 𝜇(x))),
which is the same as the Bellman equation except for that we in place of the optimal u substitute a
feedback policy. The policy evaluation step amounts to solving the equation
Vk(x) = f(x, 𝜇k(x)) + 𝛾Vk(F(x, 𝜇k(x))),      (8.16)
for Vk. This is a linear system of equations for Vk. Depending on 𝒳, there could be finitely
many equations or infinitely many equations. We then obtain a new feedback policy by solving
𝜇k+1(x) = argmin_{u∈U(x)} { f(x, u) + 𝛾Vk(F(x, u)) }.      (8.17)
This is called the policy improvement step. We summarize the PI algorithm in Algorithm 8.3.
The proof that PI converges to a solution of the Bellman equation goes as follows: it holds
6 The set ℝ^𝒳 is the set of all functions from 𝒳 to ℝ, cf. the notation section.
that
Vk(x) = T𝜇k(Vk)(x) ≥ T𝜇k+1(Vk)(x),
where the equality is by definition of the policy evaluation step, and where the inequality is by def-
inition of the policy improvement step. Just like the Bellman operator, the Bellman policy operator
is a monotone operator, see Exercise 8.7. Hence, repeated application of T𝜇k+1 results in
Vk(x) ≥ T𝜇k+1(Vk)(x) ≥ (T𝜇k+1)²(Vk)(x) ≥ ⋯ ≥ lim_{n→∞} (T𝜇k+1)ⁿ(Vk)(x) = Vk+1(x),
where the equality follows from the fact that also the Bellman policy operator is a contraction
mapping when 𝛾 < 1, see Exercise 8.7, and therefore, the limit satisfies T𝜇k+1(Vk+1) = Vk+1. Hence, PI
results in an improving sequence of value functions, and in case Vk(x) = Vk+1(x) for all x ∈ 𝒳, we
realize that the first inequality above also holds with equality, which shows that Vk satisfies the
Bellman equation and hence that 𝜇k+1 is optimal. Hence, PI either results in a strict improvement of
the value function in each iteration or termination at a solution of the Bellman equation.
We remark that neither VI nor PI are normally tractable methods. Hence, approximations are in
most cases required. We now give an example where the calculations can be carried out exactly.
Example 8.6 We consider PI for the infinite horizon LQ control problem in Example 8.4. We
guess that Vk(x) = xᵀPk x for some Pk ∈ 𝕊ⁿ₊ and that 𝜇k(x) = −Lk x for some Lk ∈ ℝ^{m×n}. The policy
evaluation step is then given by finding a solution Pk of
xᵀPk x = xᵀSx + xᵀLkᵀRLk x + 𝛾xᵀ(A − BLk)ᵀPk(A − BLk)x,
for given Lk. This can be obtained by solving the algebraic Lyapunov equation
Pk − 𝛾(A − BLk)ᵀPk(A − BLk) = S + LkᵀRLk,
which has a positive definite solution Pk since the right-hand side is positive definite. This assumes
that √𝛾(A − BLk) has all its eigenvalues strictly inside the unit disk. The policy improvement step
is then
𝜇k+1(x) = argmin_u { xᵀSx + uᵀRu + 𝛾(Ax + Bu)ᵀPk(Ax + Bu) }.
It can be shown that if we start with L0 that is stabilizing, then so will all Lk be, see [51], where
it is also shown that convergence holds for the case when 𝛾 = 1. The iterations for the LQ control
problem derived above are called the Hewer iterations.
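A minimal MATLAB sketch of the Hewer iterations is given below; the problem data, the discount factor, and the initial stabilizing gain are assumed, and the Lyapunov equation is solved by vectorization.

% Sketch of policy iteration (the Hewer iterations) for infinite horizon LQ;
% data, discount factor, and the initial stabilizing gain are assumed.
A = [1 0.1; 0 1]; B = [0; 0.1]; S = eye(2); R = 1; gam = 0.95;
n = size(A, 1);
L = [1 2];                                      % assumed initial stabilizing gain
for k = 1:50
    % policy evaluation: solve P = gam*(A-B*L)'*P*(A-B*L) + S + L'*R*L
    Acl = sqrt(gam)*(A - B*L);
    Qcl = S + L'*R*L;
    P = reshape((eye(n^2) - kron(Acl', Acl'))\Qcl(:), n, n);
    P = (P + P')/2;                             % symmetrize
    % policy improvement
    Lnew = gam*((R + gam*B'*P*B)\(B'*P*A));
    if norm(Lnew - L) < 1e-10, L = Lnew; break; end
    L = Lnew;
end
% mu(x) = -L*x is the (approximately) optimal feedback, and V(x) = x'*P*x.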
8.5.1 Approximation
In general, it is not possible to carry out the computations in PI exactly. One has to resort to approx-
imations. We will here use a similar idea as in Section 8.2. It is based on defining Ṽ ∶ 𝒳 × ℝᵖ → ℝ
using an ANN or as a linear regression with p parameters. This function will be used to approximate
Vk in (8.16). Before we do that, we notice that (8.16) implies
( ) ( ( ))
Vk (x0 ) = f x0 , 𝜇k (x0 ) + 𝛾Vk F x0 , 𝜇k (x0 ) ,
( ) ( )
= f x0 , 𝜇k (x0 ) + 𝛾Vk x1 ,
( ) ( ) ( )
= f x0 , 𝜇k (x0 ) + 𝛾f x1 , 𝜇k (x1 ) + 𝛾 2 Vk x2 ,
(8.18)
⋮
∑
N−1
( ) ( )
= 𝛾 i f xi , 𝜇k (xi ) + 𝛾 N Vk xN ,
i=0
where xi+1 = F(xi , 𝜇k (xi )). In case N is large and 𝜇k is stabilizing, we have that xN is close to zero
and that also Vk (xN ) is close to zero. Hence, a way to evaluate Vk for a value x0 of the state is to just
simulate the dynamical system and add up the incremental costs. In case one has an idea about
( )
what Vk xN might be, that can also be used. This can in particular be beneficial in case xN is not
very small.
We define these sums for different initial values xs for s ∈ ℕr as
∑
N−1
( )
𝛽ks = 𝛾 i f xi , 𝜇k (xi ) ,
i=0
where xi+1 = F(xi , 𝜇k (xi )). We then find the approximation of Vk by solving
1 ∑( ̃ s
r
)2
minimize V(x , a) − 𝛽ks ,
2 s=1
with variable a. The solution is denoted ak . After this, we use the following exact policy improve-
ment step
{ ( )}
𝜇k+1 (x) = argmin f (x, u) + 𝛾 Ṽ F(x, u), ak . (8.19)
u∈U(x)
We remark that it is possible to reuse the simulated trajectory and use it to compute several costs.
This follows from the simple fact that we can use also x1 = F(xs , us ) as an initial value, for which
the simulated trajectory is obtained from the one starting at xs by omitting the first value. This
means that costs from any state on the simulated trajectory can be computed. These are just the
tail-costs of the overall cost when starting at xs . We should however stress that this might not
provide enough representative initial states to obtain a good approximation of Vk .
Example 8.7 We will in this example consider the optimal control problem in Example 8.4. We
will specifically consider the case when m = 1 and n = 2. Since we know that the value function is
quadratic, we will use a feature vector that is 𝜑(x) = (x12 , x22 , 2x1 x2 ), where x = (x1 , x2 ) ∈ ℝ2 . Notice
that the indices refer to components of the vector and not to time. We let
̃ a) = aT 𝜑(x),
V(x,
8.5 Policy Iterations 219
where a ∈ ℝ3 . With
[ ]
a a
P̃ = 1 3 ,
a3 a2
we may then write
̃ a) = xT Px.
V(x, ̃
Hence, the true value function V(x) = xT Px and the approximate value function V(x, ̃ a) agree if
P̃ = P. Here, with an abuse of notation ak ∈ ℝ , which defines P̃ k , is obtained as the solution to the
3
linear LS problem
1 ∑( T s
r
)2
minimize 𝜑 (x )a − 𝛽ks ,
2 s=1
with variable a. The solution ak satisfies the normal equations, cf . (5.3),
ΦTk Φk ak = ΦTk 𝛽k ,
where
⎡𝜑T (x1 )⎤ ⎡𝛽k1 ⎤
Φk = ⎢ ⋮ ⎥ , 𝛽k = ⎢ ⋮ ⎥ ,
⎢ T r ⎥ ⎢ r⎥
⎣𝜑 (x )⎦ ⎣𝛽k ⎦
and where
∑ (
N−1
)
𝛽ks = 𝛾 i xiT Sxi + 𝜇k (xi )T R𝜇k (xi ) ,
i=0
and where xi+1 = Axi + B𝜇k (xi ) with initial values xs , s ∈ ℕr . It is crucial here to choose xs such that
ΦTk Φk is invertible. We realize that we need r ≥ 3 for this hold. Moreover, we need to choose xs
sufficiently different for ΦTk Φk to be well conditioned. For a general n, we will need r ≥ n(n + 1)∕2.
We define
̃ k (x, u, a) = f (x, u) + 𝛾 V(Ax
Q ̃ + Bu, a),
= xT Sx + uT Ru + 𝛾(Ax + Bu)T P(Ax ̃ + Bu),
[ ]T [ ] [ ] (8.20)
x ̃
S + 𝛾AT PA 𝛾AT PB̃ x
= ̃ ̃ .
u 𝛾BT PA R + 𝛾BT PB u
From Example 5.7, we realize that the solution to (8.19) is given by
̃ k (x, u, ak ) = −𝛾(R + 𝛾BT P̃ k B)−1 BT P̃ k Ax,
𝜇k+1 (x) = argmin Q
u
assuming that R + 𝛾BT P̃ k B is positive definite. Here, P̃ k is defined from ak in the same way as P̃ is
defined from a above. We may hence write
𝜇k+1 (x) = −Lk+1 x,
where Lk+1 = 𝛾(R + 𝛾BT P̃ k B)−1 BT P̃ k A. It is a good idea to start with some L0 that is stabilizing.
We remark that it may be highly beneficial to consider the variational form of the Bellman
equation by replacing the incremental cost with the temporal difference in (8.12). Since the Bell-
man equation for the variational form looks the same as the original Bellman equation with the
only difference that the incremental cost is replaced with the temporal difference, all formulas in
this section remain the same when the temporal difference is used as incremental cost. The rea-
son that the variational form may be beneficial is that W(x) may be taken as an initial guess of the
solution of the Bellman equation, and then we only need to parameterize the difference between
the true solution and the initial guess, which might require less parameters if the initial guess
is good.
220 8 Dynamic Programming
We assume that 0 ≤ 𝛾 < 1. The LP formulation of the optimal control problem is then given by
maximize V(−1) + V(0) + V(1),
subject to V(−1) ≤ (−1)2 + (−1)2 + 𝛾V(−1),
V(−1) ≤ (−1)2 + 02 + 𝛾V(−1),
V(−1) ≤ (−1)2 + 12 + 𝛾V(0),
V(0) ≤ 02 + (−1)2 + 𝛾V(−1),
V(0) ≤ 02 + 02 + 𝛾V(0),
V(0) ≤ 02 + 12 + 𝛾V(1),
V(1) ≤ 12 + (−1)2 + 𝛾V(0),
V(1) ≤ 12 + 02 + 𝛾V(1),
V(1) ≤ (1)2 + 12 + 𝛾V(1),
over the variables (V(−1), V(0), V(1)). We immediately realize that the first constraint is implied
by the second constraint and that the last constraint is implied by the second last constraint. More-
over, the fifth constraint is equivalent to V(0) = 0, if 0 ≤ 𝛾 < 1, since V(x) ≥ 0 from the fact that
the objective function is bounded by zero from below. From this, we see that the fourth and sixth
constraint cannot be active at optimality, since they only provide lower bounds on the variables.
The remaining constraints can be summarized as V(−1), V(1) ≤ 𝑣, where 𝑣 = min {2, 1∕(1 − 𝛾)}.
8.7 Model Predictive Control 221
Hence, the optimal solution is (V(−1), V(0), V(1)) = (𝑣, 0, 𝑣), where 𝑣 ∈ (1, 2], since 0 < 𝛾 < 1.
It now remains to compute the optimal u for different values of x ∈ . We have for x = −1 that the
right-hand side of the Bellman equation is given by
⎧𝑣, u = −1,
⎪
(−1) + u + 𝛾 ⎨𝑣,
2 2
u = 0,
⎪0, u = 1.
⎩
We see that whatever value 𝑣 has it can never be that u = −1 is optimal. If 𝛾𝑣 < 1, it is optimal
to have u = 0, which can be seen to be the case when 0 < 𝛾 < 1∕2, and hence, for 1∕2 ≤ 𝛾 < 1, it
is optimal to take u = 1. A similar argument shows that when x = 1, u = 0 is optimal when 0 <
𝛾 < 1∕2 and u = −1 optimal when 1∕2 ≤ 𝛾 < 1. When x = 0 the right-hand side of the Bellman
equation is given by
⎧𝑣, u = −1,
⎪
0 + u + 𝛾 ⎨0,
2 2
u = 0,
⎪𝑣, u = 1,
⎩
and hence, u = 0 is optimal. We summarize our findings by noting that when 0 < 𝛾 < 1∕2 the opti-
mal solution is always zero. The future costs are discounted so much that it is not worth the effort
to take any action with the control signal. Otherwise, it pays off to steer away from a nonzero state
value to make it zero.
8.6.1 Approximations
The LP formulation is often not tractable in general, since there might be many variables. A remedy
to this is to approximate V(x) with, e.g. a feature-based linear regression Ṽ ∶ n × ℝp → ℝ as
̃ a) = aT 𝜑(x),
V(x,
where 𝜑 ∶ n → ℝp . The approximate optimization problem is
∑
maximize c(x)Ṽ (x, a),
x∈n
̃ a) ≤ f (x, u) + 𝛾 V(F(x,
subject to V(x, ̃ u), a), ∀(x, u) ∈ n × m such that u ∈ U(x),
with variable a. This might still be an intractable problem because of the many constraints. One way
to overcome this obstacle is to sample the constraints, i.e. omit some of them by just considering
(xs , us ) for s ∈ ℕr as above. Notice that the approach presented here can also be used to approxi-
mately evaluate a fixed policy 𝜇. Just replace u above with 𝜇(x). This might reduce the number of
constraints significantly and can then be used together with PI.
is an optimal feedback that results in closed-loop stability. The only difference as compared to our
previous presentation of the Bellman equation is that we have added a constraint on F(x, u).
For the case of f (x, u) = xT Qx + uT Ru, but when there still are constraints on the states
̂
and/or control signal, one might use V(x) = xT Px, where P solves the algebraic Riccati equation
in (8.11) for 𝛾 = 1 This solution will be optimal if x0 is such that there are no constraint
violations – otherwise not.
We remark that stability is not guaranteed for any of the approximations of this section without
further investigation.
x̃ k+N = 0,
224 8 Dynamic Programming
where x̃ k = xk is given; how will be defined below. We will call this problem the time k problem.
Denote the solution by
xk+1 = F(xk , uk ), k ∈ ℤ+ ,
with x0 given. This control strategy is often called model predictive control, since a model that pre-
dicts the values of the states are present in the time k problem. The time horizon N is for this reason
usually called the prediction horizon. Notice that only the optimal control signal corresponding to
time instant k in the solution of the time k problem is applied to the system to be controlled in the
MPC strategy. All the other computed control signals are disregarded, and the optimization is car-
ried out once again for k + 1, and so on. In this way feedback is obtained, even though the control
signal is not explicitly given as a feedback policy, i.e. we do not have an explicit function 𝜇 relating
the optimal uk to the state xk . We only have an implicit way to obtain the optimal uk once the state xk
is given. Hence, we have to resort to the on-line implementation to be carried out in real-time. This
can be infeasible for large-scale problems or short sampling times. How to overcome this obstacle
will be discussed later. We remark that the receding horizon optimal control problem can be cast
as a finite-dimensional optimization problem similarly as in (8.3). There are many different varia-
tions of the MPC strategy. One might add a penalty 𝜙(̃xk+N ) to the objective function. The constraint
x̃ k+N = 0 can be removed or relaxed to a less stringent constraint. This will usually make it possible
to use shorter prediction horizons N and still obtain feasibility of the time k problem. Stability of
MPC is investigated in Section 8.10. We will now consider an example.
Example 8.9 Consider F(x, u) = Ax + Bu with A ∈ ℝ3×3 and B ∈ ℝ3×2 , where the matrices A and
B are chosen randomly with entries drawn from a standard uniform distribution with support on
the interval [0, 1]. The incremental cost is f (x, u) = xT x + uT u, and the constraints are −𝟙 ⪯ xk ⪯ 𝟙
and −0.5 × 𝟙 ⪯ uk ⪯ 0.5 × 𝟙, where 𝟙 is a vector of ones of appropriate dimension. The initial point
[ ]T
is x0 = 0.9 −0.9 0.9 . In Figure 8.3, the cost for the infinite horizon criterion is plotted versus
3.6
3.4
3.2
0 2 4 6 8 10
Figure 8.3 The cost for the infinite horizon criterion as a function of the horizon for the finite time horizon
approximation (circles), and for MPC (triangle).
8.8 Explicit MPC 225
1 1 1
0.5 0.5 0.5
0 0 0
𝑥
Figure 8.4 The trajectories for the states (top), and the control signals (bottom), for the finite horizon
approximation with N = 3 (left), and N = 20 (middle), and for the MPC with N = 3 (right). The different
components of the vectors are shown as circles, crosses, and triangles, respectively.
different time horizons N for the finite time horizon approximation. It is seen that the cost decreases
as N becomes larger. In fact, it converges to the cost for the infinite time horizon problem. Notice
that there is no feasible solution for N ≤ 2. Also the results for MPC with different time horizons
are shown as comparison. We see that MPC is always performing better than the finite-horizon
open-loop strategy for the same N. The improvement is larger, the smaller N is.
In Figure 8.4, the trajectories for the states and the control signals for the finite horizon approx-
imation with N = 3 and N = 20 together with the MPC with N = 3 are shown. It is seen that for
the finite horizon approximation with N = 3 the control signals and the states are zero for k ≥ N.
We also see that the difference in behavior is not so large for the finite horizon approximation with
N = 20 and the MPC with N = 3.
Let and be block-diagonal matrices with Q and R, respectively, on the diagonal such that
∑
k+N−1
f (xi , ui ) = X̃ X̃ + Ũ T U.
̃
T
i=k
We assume that the sets X and U are defined via the inequalities
u Ũ ⪯ u , x X̃ ⪯ x ,
for some matrices u , u , x , and x . It is now straightforward to verify that (8.25) can be reformu-
lated as
minimize Ũ T Ũ + 2Ũ T xk + xkT ΦT Φxk ,
[ ] [ ]
u u (8.26)
subject to ̃
U⪯ ,
x Γ x − x Φx k
̃ where = ΓT Γ + and = ΓT Φ. Note that the last term in the objective
with variable U,
function can be omitted when carrying out the optimization. By completing the squares and trans-
forming the variables according to
√ ( )
z = 2 Ũ + −1 xk ,
an equivalent optimization problem is obtained, which is a convex multiparametric quadratic
program:
minimize 12 zT z,
(8.27)
subject to Gz ⪯ 𝑤 + S𝜃,
with variable z, where 𝜃 = xk , H is defined as above, and where
[ ] [ ] [ ]
1 u u −1
G= √ , S= −1 − Φ) , 𝑤= u .
2 x Γ x (Γ x
From this reformulation and the discussion in Section 5.6, we conclude that we can obtain an
explicit feedback for MPC in the case of quadratic objective function with affine constraints.
For this case, the feedback function is piecewise affine over a polyhedral partitioning of the
state-space, and it can be computed off-line. The only major on-line computational effort is related
to computing in which polyhedron the current state xk is in.
(X0 , … , Xk ) → uk (X0 , … , Xk ), i.e. u = (u0 , u1 , …) is a random process adapted to X. The initial state
X0 is assumed to have a probability distribution that is independent of the probability distribu-
tion of Wk , k ∈ ℤ+ . The sets and can be finite or infinite as in Section 8.1, but as discussed
in Section 5.7 care has to be taken in case these sets are not finite. The control signal is used to
control the dynamical system, i.e. the evolution of the states. To this end, we introduce the incre-
mental costs fk ∶ n × m → ℝ for k ∈ ℤN−1 . The incremental costs are functions of the states and
the control signal. We also introduce a cost associated with the final state XN called the final cost or
terminal cost 𝜙 ∶ n → ℝ. The finite-horizon stochastic optimal control problem is then the problem
of solving
[ N−1 ]
∑
minimize 𝔼 fk (Xk , uk ) + 𝜙(XN ) ,
k=0 (8.29)
subject to Xk+1 = Fk (Xk , uk , Wk ), k ∈ ℤN−1 ,
with variables (u0 , X1 , … , uN−1 , XN ), where u is adapted to X. This is clearly a multistage stochastic
optimization problem, and it has a special structure that we will now exploit. We will from now on
assume that the constraints have been used to eliminate the variables Xk , and hence, we only have
to consider the variables (u0 , … , uN−1 ) in the optimization problem.
where 𝔼Xk denotes conditional expectation with respect to the probability distribution for Xk given
(X0 , … , Xk−1 ) for k ∈ ℕN , and where 𝔼X0 is expectation with respect to X0 . Note that we do not have
any dependence on uN in the objective function, and hence, we do not need to minimize over uN .
We will skip the outer expectation since it does not affect the optimization problem, and we then
assume that we know the value of the initial state to be X0 = x0 . Hence, we consider
[ [ [N−1 ]]]
∑
min 𝔼X1 min 𝔼X2 · · · min 𝔼XN fk (Xk , uk ) + 𝜙(XN ) ,
u0 u1 uN−1
k=0
for all values of x0 . Once the optimal value of the objective function is known for all values of x0 ,
we can of course compute the mean over the probability distribution for X0 . We make use of the
additivity of the objective function and rewrite the minimization of the objective function as
[ [ ]]
[ ]
min 𝔼X1 f0 (X0 , u0 ) + min 𝔼X2 f1 (X1 , u1 ) · · · min 𝔼XN fN−1 (XN−1 , uN−1 ) + 𝜙(XN ) .
u0 u1 uN−1
after substitution of the dynamic equation. Since Wk is independent of Xl for l ≤ k it follows that
we equivalently have
[ ]
min 𝔼WN−1 fN−1 (xN−1 , uN−1 ) + 𝜙(FN−1 (xN−1 , uN−1 , WN−1 )) ,
uN−1
228 8 Dynamic Programming
where 𝔼Wk denotes the expectation with respect to the probability distribution for Wk .7 We now
realize a very important fact. When we carry out the optimization above with respect to uN−1 ,
this will only be a function of xN−1 and not of any previous values of xk for k < N − 1. This will
also be true when we continue to optimize over uN−2 , i.e. it will only depend on xN−2 , and so on.
Hence, the stochastic optimal control problem can be solved with the following stochastic dynamic
programming recursion. Define the value functions Vk ∶ n → ℝ as VN (x) = 𝜙(x) and
[ ]
Vk (x) = min 𝔼 fk (x, u) + Vk+1 (Fk (x, u, Wk )) , k ∈ ℤN−1 ,
u
where the optimal control is u⋆k = 𝜇k (xk ) where 𝜇k ∶ n → m for k ∈ ℤN−1 is given by
𝜇k (x) = argmin Qk (x, u),
u
Example 8.10 Consider the problem in (8.29) for the case when the dynamic equation is lin-
ear, i.e. Fk (x, u, 𝑤) = Ak x + Bk u + 𝑤, where Ak ∈ ℝn×n and Bk ∈ ℝn×m and where = = = ℝ.
We assume that W is a zero mean random process with Wi independent of Wj for i ≠ j with vari-
[ ]
ance 𝔼 Wk WkT = Σk ∈ 𝕊n+ . We assume that the incremental costs and the final cost are quadratic
functions given by fk (x, u) = xT Sk x + uT Rk u and 𝜙(x) = xT SN x, respectively, where Rk ∈ 𝕊m
++ and
Sk ∈ 𝕊n+ . Application of the stochastic dynamic programming recursion gives VN (x) = xT SN x and
Vk (x) = min Qk (x, u),
u
7 In some applications, it will be convenient to let Wk depend on Xk and then we should take expectation with
respect to the conditional probability function for Wk given Xk = xk instead.
8.9 Markov Decision Processes 229
where
[ ]
Qk (x, u) = xT Sx + uT Ru + 𝔼 Vk+1 (Ak x + Bk u + Wk ) .
We will make the guess that Vk (x) = xT Pk x + rk for some Pk ∈ 𝕊n and some rk ∈ ℝ. This is clearly
true for k = N with PN = SN and rN = 0. We now assume that it is true for k + 1. It then holds that
[ ]T [ ][ ]
x Sk + ATk Pk+1 Ak ATk Pk+1 Bk x
Qk (x, u) = + tr Pk+1 Σk + rk+1 .
u T
Bk Pk+1 Ak T
Rk + Bk Pk+1 Bk u
Generally, if the optimal policy is unaffected when a disturbance such as W is replaced with its
mean, we say that certainty equivalence holds.
Example 8.11 Consider the problem of ordering a quantity uk ∈ ℝ+ of an item at periods k rang-
ing from 0 to N − 1 such that a stochastic demand Wk ∈ ℝ is met. Denote by Xk ∈ ℝ the stock
available at the beginning of period k. Assume that Wk are independent random variables. The
stock evolves as
Xk+1 = Xk + uk − Wk .
There is a cost r ∶ ℝ → ℝ, which is a function of Xk , for keeping stock. Moreover, there is a pur-
chasing cost cuk , where c > 0. The objective is to minimize
[ ]
∑(
N−1
)
𝔼 r(XN ) + r(Xk ) + cuk ,
k=0
with respect to uk . We assume that r is a convex function that is bounded from below and such that
r(x) → ∞ as |x| → ∞. Moreover, we assume that
dr(x)
lim < −c.
x→−∞ dx
The stochastic dynamic programming recursion is
[ ]
Vk (x) = min 𝔼 r(x) + cu + Vk+1 (x + u − Wk ) .
u≥0
We notice that if Vk+1 is a convex function in x, then 𝜑k is a convex function in (x, u). This follows
since the argument z = x + u − Wk is an affine transformation and since taking expectation of a
function preserves convexity. Under the same assumption, it now follows that
r(x) + cu + 𝜑k (x, u)
is a convex function of (x, u), since it is the sum of convex functions. Moreover, it is bounded from
below if 𝜑k is bounded from below and
d𝜑k (x, u)
lim < −c.
u→−∞ du
This is implied by Vk+1 being bounded from below and
dVk+1 (x)
lim < −c.
x→−∞ dx
Under these assumptions, it follows by Exercise 4.9 that
( )
min cu + 𝜑k (x, u)
u≥0
is a convex function that it is bounded from below. Since Vk is the sum of this function and r, which
is convex and bounded from below, it follows that also Vk is convex and bounded from below. The
fact that also
dVk (x)
lim < −c
x→−∞ dx
follows from the same property for r and that Vk is the sum of r and a function that is bounded from
below. For k + 1 = N, we have that VN (x) = r(x) and hence, VN is convex, bounded from below and
satisfies
dVN (x)
lim < −c.
x→−∞ dx
By induction, it now follows that Vk is convex, bounded from below and satisfies
dVk (x)
lim < −c,
x→−∞ dx
for all k. Since cu + 𝜑k (x, u) is bounded from below, it has an unconstrained minimum that we
denote by u0k . The optimal u⋆k is obtained by projecting u0k onto the set [0, ∞). Hence, if the constraint
was not present, it would be desirable to order u0k . Let sk = xk + u0k , and we can write u0k = sk − xk .
Then we realize that
{
⋆ sk − xk , xk ≤ sk ,
uk = 𝜇k (xk ) =
0, xk > sk ,
where we can interpret sk as a target value, i.e. as long as the current stock is above the target value
we do not need to make any order. Otherwise, we should order the difference.
Example 8.12 We consider the problem of accepting offers Wk ∈ ℝ for selling an asset, where
Wk are independent random variables for k ∈ ℤN . If we accept an offer, we can invest the money at
a fixed interest rate of r > 0 for the remaining period of time. Otherwise, we wait for the next offer,
and make a new decision. This is called an optimal stopping problem. We can cast this as a stochastic
optimal control problem by letting the state space be = ℝ ∪ {t}, where the element t denotes
termination of the offers. We let the control space be = {u0 , u1 }, where u0 denotes keeping the
asset, and where u1 denotes selling the asset. We let the state evolve as
{
t, if Xk = t or Xk ≠ t and uk = u1 ,
Xk+1 = F(Xk , uk , Wk ) =
Wk , otherwise.
8.9 Markov Decision Processes 231
closed loop stability, in the sense that (Xk , uk ) is bounded in mean square, i.e. the second moments
of these random variables are bounded for all k ≥ 0. To simplify the presentation below, we will also
restrict ourselves to the case when the incremental cost is strictly positive definite. We define J ⋆ ∶
n → ℝ+ to be the minimal value for the optimization problem in (8.31) for initial value x, and we
define u⋆k to be the optimizing sequence of decisions or control signals that achieves this minimal
value.
Here, W = Wk for an arbitrary k. Then V(x) = J ⋆ (x) and with the Q-function Q ∶ n × m → ℝ
defined as
[ ]
Q(x, u) = 𝔼 f (x, u) + 𝛾V(F(x, u, W)) , (8.33)
it holds that u⋆k = 𝜇(Xk ), where 𝜇 ∶ n → m is given by
𝜇(x) = argmin Q(x, u) (8.34)
u
is an optimal feedback control. If in addition 𝛾 is sufficiently close to one, this feedback results in
closed-loop stability in the sense defined above. The proof of this result is given in Section 8.10. We
next consider an example which is called infinite horizon stochastic LQ control.
Example 8.13 Let us consider the case when F(x, u, 𝑤) = Ax + Bu + 𝑤 for matrices A ∈ ℝn×n and
B ∈ ℝn×m . We also assume that f (x, u) = xT Sx + uT Ru, where S ∈ 𝕊n++ and R ∈ 𝕊m
++ , and that Wk are
independent and define a weakly stationary random process with zero mean and with covariance
Σ ∈ 𝕊n+ . Clearly, we satisfy the assumptions on the functions f and F. We will make the guess that
V(x) = xT Px + r for some P ∈ 𝕊n++ and some r ≥ 0. Then
Q(x, u) = xT Sx + uT Ru + 𝛾(Ax + Bu)T P (Ax + Bu) + 𝛾(tr PΣ + r).
As in Example 8.4, Q is minimized for
u = −𝛾(R + 𝛾BT PB)−1 BT PAx.
Back substitution of this results in
( ( )−1 )
xT Px + r = xT S + 𝛾AT PA − 𝛾 2 AT PB R + 𝛾BT PB BT PA x
+ 𝛾(tr PΣ + r).
This equation holds if P is the solution to the discounted algebraic Riccati equation in (8.11), and if
r = tr(PΣ)∕(1 − 𝛾). We see that certainty equivalence holds also for infinite horizon stochastic LQ
control.
It is possible to extend the results on VI and PI to the stochastic setting in a straightforward man-
ner. What is needed to prove these extensions are the monotonicity and the contraction properties of
the stochastic versions of the Bellman operator and the Bellman policy operator, see Exercise 8.16.
Also, the LP formulation can be extended to the stochastic setting.
8.10 Appendix 233
8.10 Appendix
and if 𝛾 is sufficiently close to one, then V(x) = J ⋆ (x), where we recall that J ⋆ (x) is the optimal value
of the problem in (8.9) for initial value x. Moreover, u⋆k = 𝜇(xk ), where 𝜇 ∶ n → m with
𝜇(x) = argmin {f (x, u) + 𝛾V(F(x, u))}
u∈U(x)
where the last equality follows from the fact that uk is stabilizing and that V(0) = 0. Since the above
inequality holds with equality for uk = 𝜇(xk ) by the Bellman equation, we have that uk = 𝜇(xk ) is
optimal for the infinite-horizon optimal control problem.
234 8 Dynamic Programming
Vk+1 ≤ aVk + b,
where 0 < a < 1 for 𝛾 sufficiently close to one. From this, it follows that
and that
[ N−1 ] [ N−1 ]
∑ ∑ ( )
lim 𝔼 𝛾 f (Xk , uk ) ≥ lim 𝔼
k
𝛾 V(Xk ) − 𝛾V(Xk+1 ) ,
k
N→∞ N→∞
k=0 k=0
where the last equality follows from the fact that 𝔼V(XN ) ≤ c2 𝔼||XN ||22 + c2 < ∞ for all N ≥ 0 and
that 𝛾 < 1, and where the penultimate equality follows from the fact that X0 = x0 is known. Since
the above inequality holds with equality for uk = 𝜇(Xk ) by the stochastic Bellman equation, we have
that uk = 𝜇(Xk ) is optimal for the infinite horizon stochastic optimal control problem.
∑
k+N−1
= f (̃xi , ũ i ) + f (̃xk+N , ũ k+N ) − f (̃xk , ũ k ),
i=k
= Jk⋆ + f (0, 0) − f (̃x⋆k , ũ ⋆k ),
where f (0, 0) = 0 by assumption. Here, Jk⋆ denotes the optimal value of Jk . Hence,
⋆
0 ≤ Jk+1 ≤ Jk+1 = Jk⋆ − f (̃x⋆k , ũ ⋆k ).
It now follows that f (̃x⋆k , ũ ⋆k ) → 0, k → ∞, since otherwise, Jk⋆ → −∞, which is a contradiction.
Since f is strictly positive definite we have that
( )
f (̃x⋆k , ũ ⋆k ) ≥ 𝜖 ||̃x⋆k ||2 + ||ũ ⋆k ||2 ,
Exercises
8.1 Assume that we have a vessel whose maximum weight capacity is z and whose cargo is to
consist of different quantities of N different items. Let 𝑣k denote the value of the kth type of
item, and let 𝑤k denote the weight of the kth type of item.
236 8 Dynamic Programming
(a) Let xk be the used weight capacity of the vessel after the first k − 1 items have been
loaded and let the control uk be the quantity of item k to be loaded on the vessel. For-
mulate the dynamic equation:
xk+1 = Fk (xk , uk ),
describing the process.
(b) Determine the constraint set U(k, xk ) for the control signal uk .
(c) Formulate a DP recursion that solves the problem of finding the most valuable cargo
satisfying the maximal weight capacity. Observe that you do not need to solve the
problem.
8.3 A businessman operates out of a van that he sets up in one of the two locations each day.
If he operates in location i day k, where i = 1, 2, he makes a known and predictable profit
denoted rik . However, each time he moves from one location to the other, he pays a setup
cost c. The businessman wants to maximize his total profit over N days.
(a) The problem can be formulated as a shortest path problem (SPP), where node (k, i)
is representing location i at day k. Let s and e be the start node and the end node,
respectively. The costs of all edges are
– s to i1 with cost −ri1
– ik to ik+1 , i.e. no switch, with cost −rik+1 , k = 1, … , N − 1
k+1
– ik to ̄ik+1 , i.e. switch, with cost c − r k+1 , k = 1, … , N − 1
̄ik+1
– iN to e with cost 0,
where ̄i denotes the location that is not equal to i, i.e. 1̄ = 2 and 2̄ = 1. Draw a figure
to illustrate the SPP and the definitions of variables and parameters. Write down the
corresponding dynamic programming algorithm. Note that you do not have to solve the
problem.
(b) Suppose the businessman is at location i on day k − 1 and let
Rki = r̄ik − rik .
Show that if Rki ≤ 0, it is optimal to stay at location i, while if Rki ≥ 2c, it is optimal to
switch. You can use the following lemma.
Lemma: For every k = 1, 2, … , N, it holds that
|Vk (i) − Vk (̄i)| ≤ c,
where Vk (i) is the value of the optimal cost-to-go function at stage k for state i.
Exercises 237
8.5 Consider the problem of packing a knapsack with items labeled k ∈ ℕN of different value
𝑣k and weight 𝑤k . There is only one copy of each item, and the task is to pack the knapsack
such that the sum of the values of the items in the knapsack is maximized subject to the
constraint that the total weight of the items is less than or equal to the weight limit W.
The problem can be posed as a multistage decision problem in which at each stage k, it is
decided whether item k should be loaded or not. This decision can be coded in terms of the
binary variable uk ∈ {0, 1}, where uk = 1 in case item k is loaded and uk = 0, otherwise. If
xk denotes the total weight of the knapsack after the first k − 1 items have been loaded, then
the following relation holds:
xk+1 = xk + 𝑤k uk , x1 = 0,
for k ∈ ℕN . The constraint that xk+1 ≤ W can be reformulated in terms of uk as uk ≤ (W −
xk )∕𝑤k . From this, it follows that it is possible to calculate how to load the knapsack in an
optimal way using the dynamic programming recursion
{ }
Vk (x) = max 𝑣n u + Vk+1 (x + 𝑤n u) ,
u≤(W−x)∕𝑤n ,u∈{0,1}
with final value VN+1 (x) = 0. We will consider the case when W = 10 and when the values
of 𝑣k and 𝑤k are defined as in the table below, where N = 5.
k 1 2 3 4 5
𝑣k 2 8 7 1 3
𝑤k 1 5 4 3 2
(b) From the tables derived in (a) compute the optimal loading of the knapsack.
238 8 Dynamic Programming
(a) Compute an optimal feedback policy uk = 𝜇(xk ) for this problem when the control signal
constraint is neglected.
Hint: Try the value function V(x) = px2 with p > 0.
(b) Now, consider the case with constraints on the control signal. Compute an approxima-
tive solution by solving
{ }
minimize−1≤u≤1 x2 + u2 + V(x + u) ,
8.7 Show that the Bellman policy operator in (8.15) is a monotone operator and that it is a
contraction when 𝛾 < 1.
with given initial vale x0 . You may assume that f (x, u) = 𝜌x2 + u2 and F(x, u) = x + u, where
x and u are scalar-valued, and where 𝜌 > 0.
(a) Compute the optimal feedback policy using the Bellman equation by guessing that
V(x) = px2 for some p.
(b) In general, it is more tricky to solve the Bellman equation, and different iterative proce-
dures are available. Use VI, i.e. let Vk (x) be iteratively computed from
for k = 1, 2, … with V0 (x) = 0. Show that Vk (x) = pk x2 and that the minimizing argu-
ment is lk xk , where lk = −pk ∕(pk + 1) and where
(c) Now, instead use PI. In this method, one starts with an initial feedback policy 𝜇0 (x) and
repeats the following two steps iteratively for k = 0, 1, 2, …:
1. Compute Vk (x) such that
Assume that 𝜇0 (x) = l0 x and that Vk (x) = pk x2 . Show that 𝜇k+1 (x) = lk+1 x, where now
𝜌 + l2k
pk = ,
1 − (1 + lk )2
and
lk+1 = −pk ∕(pk + 1),
with l0 given.
(d) Compute the sequences pk and lk in (b) and (c) for k = 1, 2, … , 5 when 𝜌 = 0.5. Assume
that l0 = −0.1 for the method in (c). The iterates will converge to the solution in (a).
Which method converges the fastest?
8.9 In this exercise, we consider the LQ problem in Example 8.4. We will specifically consider
the case when m = 1 and n = 2, and the matrices
[ ] [ ]
0.5 1 0
A= , B= , R = 1 and S = I.
0 0.5 1
(a) Implement the Hewer iterations in Example 8.6. You may start with L0 = 0. Specifically
investigate how many iterations are needed for convergence of Lk .
(b) Implement the approach in Example 8.7. You may start with L0 = 0. How does the
choice of initial values xs and the number r of initial values affect the convergence of Lk .
How does the choice of N affect the convergence.
8.10 In this exercise, we investigate how to compute the explicit solution to the MPC problem
using the multiparametric toolbox MPT3 for MATLAB; see https://www.mpt3.org/.
Consider a second-order system
[ ] [ ]
0.8584 −0.0928 0.0928
xk+1 = xk + u ,
0.0464 0.9976 0.0024 k (8.38)
[ ]
yk = 0 1 xk .
This has been obtained from a continuous time system with transfer function
2
,
s2 + 3s + 2
by zero-order hold sampling with a sampling time of 0.05.
(a) Use MPT3 to calculate the explicit MPC for (8.38) using the weighted 2-norm for
the incremental costs defined by Q = I, R = 0.01, when N = 2 and for the following
constraints
[ ] [ ]
−10 10
−1 ≤ uk ≤ 1 and ⪯ xk ⪯ .
−10 10
How many regions are there? Present plots of the partitioning of the state-space for the
control signal, and for the time trajectory for the initial condition x0 = (1, 1).
(b) Find the number of regions in the control law for N = 1, 2, … , 13. How does the
number of regions depend on N, e.g. is the complexity polynomial or exponential?
Estimate the order of the complexity by computing
min ||𝛼N 𝛽 − nr ||22 and min ||𝛾(𝛿 N − 1) − nr ||22 ,
𝛼,𝛽 𝛾,𝛿
8.11 [14, Exercise 1.14] A farmer annually producing Xk units of a certain crop stores (1 − uk )Xk
units of his production, where uk ∈ [0, 1], and invests the remaining uk Xk units, which
increase the production for next year to a level of Xk+1 according to
Xk+1 = Xk + Wk uk Xk ,
where Wk are i.i.d. random variables that are independent of Xk and uk . Moreover, it holds
that 𝔼Wk = 𝑤̄ > 0. The total expected product stored over N years is given by
[ ]
∑
N−1
𝔼 XN + (1 − uk )Xk .
k=1
Show that the optimal solution that maximizes the expected product stored is independent
of xk and given by
⎧𝜇0 (x0 ) = · · · = 𝜇N−1 (xN−1 ) = 1, 𝑤̄ > 1,
⎪
⎨𝜇0 (x0 ) = · · · = 𝜇N−1 (xN−1 ) = 0, 0 < 𝑤̄ < 1∕N,
⎪
⎩𝜇0 (x0 ) = · · · = 𝜇N−k−1
̄ (xN−k−1 ̄ ) = 1, 𝜇N−k̄ (xn−k̄ ) = · · · = 𝜇N−1 (xN−1 ) = 0, 1∕N ≤ 𝑤̄ ≤ 1,
Xk+1 = (1 − 𝛿k )Xk + uk 𝛾k Xk ,
Yk+1 = (1 − 𝛿k )Yk + (1 − uk )𝛾k Xk ,
for k ∈ ℤN , where 𝛿k are i.i.d. random variables with values in [𝛿, 𝛿 ′ ], where 0 < 𝛿 ≤ 𝛿 ′ < 1,
where 𝛾k are i.i.d. random variables with values in [a, b], where a > 0. We assume that
0 < 𝛼 ≤ uk ≤ 𝛽 < 1. We may interpret Xk as the number of educators in a certain country at
time k and Yk as the number of research scientists. By means of incentives, a science policy
maker can determine the proportion uk of new scientists produced at time k who become
educators. The initial values of X0 and Y0 are known. We will derive the optimal policy for
maximizing 𝔼YN .
(a) Show that the value functions are given by Vk (x, y) = ck x + dk y for some ck , dk ∈ ℝ.
(b) Derive the optimal policy when 𝔼𝛾k > 𝔼𝛿k , and show that the optimal policy 𝜇k (x, y) is
independent of x and y.
Xk+1 = ak Xk + buk + Wk , k = 0, 1, … , N,
where ak , bk ∈ ℝ, and where Wk are i.i.d. random variables with a Gaussian distribution
with zero mean and variance 𝜎 2 < 1. We are interested in minimizing the objective function
[ ( )]
1 2 1∑( 2
N−1
)
𝔼 exp x + x + ruk 2
,
2 N 2 k=0 k
where r ∈ ℝ++ . Derive the optimal control law assuming that it is adapted to X.
Exercises 241
Hint: Start by showing that the following recursion for value functions Vk ∶ ℝ → ℝ,
[ ( ) ( )]
Vk (x) = min 𝔼Wk exp x2 + ru2 Vk+1 ax x + bk u + Wk ,
u
with final value VN (x) = x2 ∕2 provides the optimal solution. Also remember the results in
Exercise 3.10.
8.14 [14, Exercise 4.12] We want to run a machine to produce a certain item to meet a known
demand dk ∈ ℝ, k ∈ ℤN . The machine can be in a bad (B) state or a good (G) state. The state
of the machine evolves according to the transition probabilities
ℙ[ G ∣ G ] = λG , ℙ[ B ∣ G ] = 1 − λG , ℙ[ B ∣ B ] = λB , ℙ[ G ∣ B ] = 1 − λB ,
where λB , λG ∈ [0, 1]. Denote by Xk ∈ ℝ, k ∈ ℤN , the stock in period k. If the machine is in
̄ where ū > 0 is a known constant, and
a good state at period k, it can produce uk ∈ [0, u],
the stock evolves as
Xk+1 = Xk + uk − dk .
Otherwise, it evolves as
Xk+1 = Xk − dk .
where g ∶ ℝ → ℝ is a convex function bounded from below and such that g(x) → ∞ as
|x| → ∞. The objective function should be minimized. Show that the value functions in
the stochastic dynamic programming recursion are convex functions in x and that there
for each k is a target stock level Sk+1 such if the machine is in good state, it is optimal to
produce u⋆k ∈ [0, u]̄ that will bring Xk+1 as close as possible to Sk+1 .
Hint: Remember the footnote about when Wk depends on Xk in the derivation of the
stochastic dynamic programming recursion.
242 8 Dynamic Programming
where the assumptions on f , 𝛾, F, and W are the same as in Section 8.9. Show that if
V1 (x) ≤ V2 (x) for all x ∈ n , then
(V1 ) − (V2 ) ≤ 0.
Also show that is a contraction.
(b) Define the stochastic Bellman policy operator 𝜇 ∶ ℝ → ℝ as
n n
[ ]
𝜇 (V) = 𝔼 f (x, 𝜇(x)) + 𝛾V(F(x, 𝜇(x), W)) ,
where the assumptions on f , 𝛾, F, and W are the same as in Section 8.9. Show that if
V1 (x) ≤ V2 (x) for all x ∈ n , then
𝜇 (V1 ) − 𝜇 (V2 ) ≤ 0.
Also, show that 𝜇 is a contraction.
243
Part IV
Learning
245
Unsupervised Learning
We are now going to discuss unsupervised learning. This is about finding lower-dimensional
descriptions of a set of data {x1 , … , xN }. One simple such lower-dimensional description is the
mean of the data. Another one could be to find a probability function from which the data are the
outcome. We will see that there are many more lower-dimensional descriptions of data. We will
start the chapter by defining entropy, and we will see that many of the probability density functions
that are of interest in learning can be derived from the so-called “maximum entropy principle.”
Specifically, we will derive the categorical distribution, the Ising distribution, and the normal
distribution. There is a close relationship between the Lagrange dual function of the maximum
entropy problem and maximum likelihood (ML) estimation, which will also be investigated. Other
topics that we cover are prediction, graphical models, cross entropy, the expectation maximization
algorithm, the Boltzmann machine, principal component analysis, mutual information, and
cluster analysis. As a prelude to entropy we will start by discussing the so-called Chebyshev bounds.
maximize p(x)dx,
∫C
× ℝm+1 → ℝ be defined as
( )
∑ m
L[p, 𝜆] = p(x)dx + 𝜆i ai − p(x)fi (x)dx) .
∫C i=0
∫S
Optimization for Learning and Control, First Edition. Anders Hansson and Martin Andersen.
© 2023 John Wiley & Sons, Inc. Published 2023 by John Wiley & Sons, Inc.
Companion Website: www.wiley.com/go/opt4lc
246 9 Unsupervised Learning
if
1 − inf f (x, 𝜆) ≤ 0, − inf f (x, 𝜆) ≤ 0.
x∈C x∈S⧵C
[ ] ∑m
Since 𝔼 fi (X) = ai , we also have that these conditions imply that supp ℙ[ X ∈ C ] ≤ i=0 𝜆i ai . We
can therefore compute the smallest possible such upper bound by solving the dual problem
∑
m
minimize a i 𝜆i ,
i=0
subject to 1 − inf f (x, 𝜆) ≤ 0,
x∈C
− inf f (x, 𝜆) ≤ 0,
x∈S⧵C
with variable 𝜆. This is a convex optimization problem in 𝜆, which follows by noting that
∑
m
inf f (x, 𝜆) = inf 𝜆i fi (x)
x∈C x∈C
i=0
is the infimum over a family of linear functions of 𝜆, and hence, it is a concave function of 𝜆. The
same argument applies to the function in the second constraint.
Example 9.1 Let S = ℝ+ , C = [1, ∞), f0 (x) = 1 and f1 (x) = x. Assume that it is known that
𝔼f1 (X) = 𝔼X = 𝜇, where 0 ≤ 𝜇 ≤ 1. Then the so-called Markov bound
ℙ[ X ≥ 1 ] ≤ 𝜇
holds. The result is derived in Exercise 9.1.
9.2 Entropy
Let us consider a probability space (Ω, , ℙ). Entropy measures the amount of uncertainty in a
probability distribution. Assume that we observe the values of a random variable X ∶ Ω → ℝ for
outcomes of experiments and estimate the mean of the random variable. What is then the most
likely distribution for the random variable? The maximum entropy principle says that it is the one
that maximizes the entropy among all possible probability distributions that have the same esti-
mated mean. To formalize this, we consider a finite sample space Ω = ℕn and the set of probability
functions defined by
{ }
∑n
n = p ∈ [0, 1] ∣ pk ≥ 0, k ∈ ℕn ;
n
pk = 1 ,
k=1
Example 9.2 In this example, we consider the static ranking of web pages. To formalize this,
we let V = ℕn represent a set of n web pages and define a graph G = (V, E), where the edge set
E ⊂ V × V contains directed edges (i, j) describing that there is a link from web page i to web page j.
For an example see Figure 9.1, where V = {1, 2, 3, 4}, and E = {(1, 2), (2, 4), (4, 1), (4, 3), (3, 1)}. The
most well-known way to model the ranking is called PageRank. It uses a Markov chain model, cf .
Section 3.10, in which all outgoing links are assigned equal transition probability. The transition
probability that a user at page i jumps to page j is given by
⎧1
⎪ , (i, j) ∈ E,
pij = ⎨ di
⎪ 0, (i, j) ∉ E,
⎩
248 9 Unsupervised Learning
where di is the out-degree of node i, i.e. the number of edges (i, j) ∈ E, where
1 2
j ∈ V. The PageRank is then defined as the stationary distribution of the
Markov chain, i.e. as the solution of 𝜋 T = 𝜋 T P, where P ∈ ℝn×n is the matrix
of transition probabilities pij at position i, j. This is an eigenvalue problem,
and 𝜋 is the normalized eigenvector corresponding to the eigenvalue that is
one. For the example in Figure 9.1, we have d = (1, 1, 1, 2) and 3 4
⎡ 0 1 0 0⎤
⎢ ⎥ Figure 9.1 Graph
0 0 0 1⎥
P=⎢ , showing the links
⎢ 1 0 0 0⎥ between different
⎢1∕2 0 1∕2 ⎥
0⎦
⎣ web pages.
with eigenvector 𝜋 = (0.2857, 0.2857, 0.1429, 0.2857). We see that web page
number three has the lowest ranking.
Another way to model the transition probabilities is based on a network flow approach. Let yij be
the number of users following link (i, j) ∈ E per unit time. Assume that the web traffic is in a state
of equilibrium so that the traffic out of a node is equal to the in-coming traffic per unit time, i.e.
∑ ∑
yij = yji , i ∈ V.
j∶(i,j)∈E j∶(j,i)∈E
We now define the probabilities pij = yi,j ∕Y , which allow us to write the equilibrium condition as
∑ ∑
pij = pji , i ∈ V. (9.2)
j∶(i,j)∈E j∶(j,i)∈E
Note that these probabilities are not the transition probabilities in the PageRank model. They are
∑
obtained by normalization with j∶(i,j)∈E pij .
One solution to (9.2) is
Hi
pij = , (i, j) ∈ E,
Y di
which agrees with the PageRank model after normalization. However, there are many more solu-
tions to the equilibrium condition, which can be interpreted as moment constraints. The maximum
entropy solution under the moment constraints is obtained by solving
∑
maximize − pij ln pij ,
(i,j)∈E
∑ ∑
subject to pij = pji , i ∈ V,
j∶(i,j)∈E j∶(j,i)∈E
∑
pij = 1,
(i,j)∈E
pij ≥ 0, (i, j) ∈ E,
with variables pij , (i, j) ∈ E. We will investigate this optimization problem more in Exercise 9.6.
9.2 Entropy 249
Solving this equation with respect to 𝜆 will give us the pf. We now realize that if we do not want to
parameterize the probability function in terms of the expected value b, then we can do it in terms
of 𝜆, i.e.
e−𝜆fk
pk = ∑n . (9.3)
−𝜆fl
l=1 e
250 9 Unsupervised Learning
The parameter 𝜆 is called the natural parameter. The probability function we have derived belongs
to the family of exponential probability functions. The distribution is known under several different
names: the categorical, Gibbs, or Boltzmann distribution. Note that we may normalize such that
⎧ ezk
⎪ ∑n−1 z , k ∈ ℕn−1 ,
⎪ 1 + l=1 e l
pk = ⎨
⎪ 1
∑n−1 z , k = n,
⎪ 1 + l=1 el
⎩
where zk = 𝜆(fn − fk ), k ∈ ℕn−1 , and this is the form of the categorical probability function that is
used in logistic regression, which is discussed in Section 10.4.
Example 9.3 A construction company is ordering lumber every month. The lumber comes in
three different grades. The construction company cannot decide which quality to order, but it
can observe that the prices are different per unit. The different prices are f1 = 1 for the lowest
grade, f2 = 1.1 for the middle grade, and f3 = 1.2 for the highest grade, respectively. They have also
observed that on average the price is b = 1.05. We can then use the maximum entropy principle to
estimate what the probabilities are that low-, middle-, or high-grade lumber is delivered. Let p1 be
the probability that low-grade lumber is delivered, p2 that medium-grade lumber is delivered, and
p3 that high-grade lumber is delivered. We then have the following moment constraint:
1 × p1 + 1.1 × p2 + 1.2 × p3 = 1.05.
From the function
G(𝜆) = (1 − 1.05)e−𝜆×1 + (1.1 − 1.05)e−𝜆×1.1 + (1.2 − 1.05)e(−𝜆×1.2) ,
which is shown in Figure 9.2, we see that G(𝜆) is zero for 𝜆 = 6.9, and hence, we can compute the
probabilities from (9.3) to be
p1 = 0.5386, p2 = 0.2701, p3 = 0.1913.
⋅10−5
2
1.5
1
𝐺
0.5
6 6.5 7 7.5 8
𝜆
p(x) ≥ 0, x ∈ x ,
with variable p, and where, with abuse of notation, m is the vector of first moments, and M is
the matrix of second moments, for which we do not specify the diagonal. Ignoring the inequality
constraints, we introduce the partial Lagrangian L ∶ ℝn × ℝm × 𝕊m × ℝ → ℝ via
( )
∑
L(p, 𝜆, Λ, 𝜇) = Hn (p) + 𝜆 T
xp(x) − m
x∈x
( ( )) ( )
∑ ∑
+ tr Λ T
xx p(x) − M +𝜇 p(x) − 1 ,
x∈x x∈x
where Λ has a zero diagonal because of the unspecified diagonal in the constraints. This is a concave
function of p. It holds that
𝜕L ( )
= − ln p(x) − 1 + 𝜆T x + tr ΛxxT + 𝜇,
𝜕p(x)
from which it follows that
( )
p(x) = exp 𝜆T x + tr(ΛxxT ) − 1 + 𝜇 ,
and hence p(x) ≥ 0. The Lagrange dual function g ∶ ℝm × 𝕊m × ℝ → ℝ follows by inserting the
expression for p into the Lagrangian. Minimizing this function with respect to 𝜇 results in choosing
𝜇 such that the probabilities sum up to one. With A = 1 − 𝜇 this holds if
( )
∑ ( T ( ))
A = ln exp 𝜆 x + tr Λxx T
.
x∈x
We see that A is a function of 𝜆 and Λ, and we, therefore, have A ∶ ℝm × 𝕊m → ℝ. The resulting
distribution is called the Ising distribution,1 where 𝜆 and Λ are the natural parameters. We introduce
1 Strictly speaking, it is the random variable that is defined by the probability function that we have derived which
has the Ising distribution.
252 9 Unsupervised Learning
the so-called energy function E ∶ {0, 1}m → ℝ given by E(x) = −𝜆T x − tr(ΛxxT ). We may then write
(∑ )
the probability function as p(x) = exp (−E(x) − A(𝜆, Λ)), where A(𝜆, Λ) = ln x exp(−E(x)) .
In order to relate the natural parameters to the moments m and M, we substitute 𝜇 = 1 − A(𝜆, Λ)
into the Lagrange dual function and obtain the function h ∶ ℝm × 𝕊m → ℝ given by
We proceed to minimize this function with respect to (𝜆, Λ). The optimality conditions are
𝜕h 𝜕A
= − m = 0,
𝜕𝜆 𝜕λ
𝜕h 𝜕A
= − M = 0,
𝜕Λ 𝜕Λ
where
∑ ∑
𝜕A x∈x x exp (−E(x)) x∈ x exp (−E(x) − A(𝜆, Λ))
= ∑ = ∑ x ,
𝜕𝜆 x∈x exp (−E(x)) x∈x exp (−E(x) − A(𝜆, Λ))
∑
= xp(x),
x∈x
∑ T ∑ T
𝜕A x∈x xx exp (−E(x)) x∈x xx exp (−E(x) − A(𝜆, Λ))
= ∑ = ∑ ,
𝜕Λ x∈x exp (−E(x)) x∈x exp (−E(x) − A(𝜆, Λ))
∑
= xxT p(x).
x∈x
Here, we do not consider the equations related to the diagonal of the second equation. This is
because we have constrained the diagonal of Λ to be zero. We see that the equations say that we
should match the moments. To solve the above equations with respect to (𝜆, Λ) is not easy in gen-
eral, and especially not when the dimension of x is large.
where m ∈ ℝn and M ∈ 𝕊+ are the first and second moments of the distribution, respectively. The
maximum entropy problem is
maximize H(p),
subject to xp(x)dx = m,
∫ ℝn
p(x)dx = 1,
∫ ℝn
p(x) ≥ 0, x ∈ ℝn ,
with variable p ∈ , which is a generalization of the problem we discussed in Example 7.3. We
define the partial Lagrangian functional L ∶ × ℝ × 𝕊n × ℝ → ℝ by
( )
L[p, 𝜆, Λ, 𝜇] = − p(x) ln p(x)dx + 𝜆T m − xp(x)dx
∫ ℝn ∫ℝn
( ( ))
1
+ tr Λ M − xxT p(x)dx
2 ∫ℝn
( )
+𝜇 1− p(x)dx ,
∫ℝn
where we have ignored the constraint p(x) ≥ 0. The first variation of the Lagrangian is
( )
1
𝛿L[𝛿p] = − ln p + 1 + 𝜆T x + xT Λx + 𝜇 𝛿p dx,
∫ ℝn 2
which should be nonpositive for all 𝛿p when p is optimal. Hence, the expression in the parenthesis
must vanish by the du Bois Raymond lemma, see Section 7.1, and the optimal pdf is
T x− 1 x T Λx
p(x) = e−1−𝜇−𝜆 2 , (9.8)
which clearly is nonnegative. We will in Section 9.9 verify that p is indeed maximizing the entropy
and not merely is a stationary point. The Lagrange dual function g ∶ ℝ × 𝕊n × ℝ → ℝ is defined by
T x− 1 x T Λx 1
g(𝜆, Λ, 𝜇) = e−1−𝜇−𝜆 2 dx + 𝜆T m + tr (ΛM) + 𝜇,
∫ ℝn 2
1 T Λ−1 𝜆− 1 T
= e−1−𝜇+ 2 𝜆 2
(x+Λ−1 𝜆) Λ(x+Λ−1 𝜆) dx
∫ ℝn
1
+ 𝜆T m + tr (ΛM) + 𝜇,
2
where the second equality follows from completing the squares and assuming that Λ is invertible.
We will later on see under which assumption we have invertibility. The Lagrange dual function is
a convex function, and we determine the 𝜇 that minimizes it by setting the partial derivative of g
with respect to 𝜇 equal to zero. We find that p(x) should integrate to one, i.e.
1 T Λ−1 𝜆− 1 T
p(x)dx = e−1−𝜇+ 2 𝜆 2
(x+Λ−1 𝜆) Λ(x+𝜆−1 Λ) dx,
∫ℝn ∫ℝn
1
2n∕2 e−1−𝜇+ 2 𝜆
T Λ−1 𝜆
2 2
= √ e−̄x1 d̄x1 × · · · × e−̄xn d̄xn ,
det Λ ∫ℝ ∫ℝ
1
e−1−𝜇+ 2 𝜆 Λ 𝜆 (2𝜋)n∕2
T −1
= √ = 1.
det Λ
254 9 Unsupervised Learning
( )
Hence, 𝜇 = 12 𝜆T Λ−1 𝜆 − 1 − 12 ln det Λ
(2𝜋)n
from which, we obtain
√
det Λ − 12 (x+Λ−1 𝜆)T Λ(x+Λ−1 𝜆)
p(x) = e .
(2𝜋)n
If we insert the optimal 𝜇 into the Lagrange dual function, we can define a function h ∶ ℝn × 𝕊n →
ℝ as
( ( ))
1 T −1 1 det Λ
h(𝜆, Λ) = g 𝜆, Λ, 𝜆 Λ 𝜆 − 1 − ln ,
2 2 (2𝜋)n
( )
1 T −1 1 det Λ 1
= 𝜆 Λ 𝜆 − ln + 𝜆T m + tr (ΛM) .
2 2 (2𝜋)n 2
Note that this is also a convex function. In order to find 𝜆 and Λ, we minimize this function, which
can be done by setting the derivatives with respect to 𝜆 and Λ equal to zero, i.e.
𝜕h
= Λ−1 𝜆 + m = 0,
𝜕𝜆
𝜕h 1 1 1 1 1 1
= − Λ−1 𝜆𝜆T Λ−1 − Λ−1 + M = − mmT − Λ−1 + M = 0,
𝜕Λ 2 2 2 2 2 2
where the second equality in the second equation follows from the first equation. Noting that
Σ = M − mmT is the covariance matrix, we immediately have that Λ = Σ−1 , and hence, the invert-
ibility of Λ is equivalent to the covariance matrix belonging to 𝕊n++ . Moreover, we have 𝜆 = −Σ−1 m
and
1 1 T −1
p(x) = √ e− 2 (x−m) Σ (x−m) .
(2𝜋)n det Σ
Note that we may just as well use the natural parameter (𝜆, Λ) instead of (m, Σ). If h is expressed
in terms of (m, Σ) instead of in terms of (𝜆, Λ) it will not be a convex function. As we will see, this
makes it convenient to use the natural parameters.
9.3 Prediction
Let fX,Y ∶ ℝm × ℝn → ℝ+ be the joint pdf of two random variables X and Y with marginal pdfs
fX ∶ ℝm → ℝ+ and fY ∶ ℝn → ℝ+ , and suppose we are given an observation x of X and would
like to predict a value y for Y . The fact that X and Y are not independent will be utilized.
Clearly, we could just compute fY |X ∶ ℝm × ℝn → ℝ+ , the conditional pdf for Y given X, which is
defined as
fX,Y (x, y)
fY |X (y|x) = ,
fX (x)
cf . (3.7). This relationship is the foundation for the Bayesian approach to statistics, a topic we return
to in Section 10.3. In Section 3.11, we derived the conditional pdf for an hidden Markov model
(HMM). If one wants a single value for the prediction, one may consider the argument y of fY |X (y|x)
that maximizes it for the observation x, i.e.
ŷ = argmax{fY |X (y|x)}.
y
This is called the maximum a posteriori (MAP) estimate. The reason for the name a posteriori is
that the conditional pdf is the pdf for Y resulting after we observe that X = x.
9.3 Prediction 255
with the function g as variable. This criterion is often called the mean squared error (MSE). We will
now show that this infinite-dimensional optimization problem has a simple solution in terms of a
conditional expectation. To this end, write the above objective function as
where the second equality follows by completing the squares. Thus, the above integral is minimized
by g(x) = 𝔼[Y|X = x], i.e. the conditional expectation of Y given X = x.
In general, neither the conditional pdf nor the conditional expectation is easy to compute. One
important exception is when fX,Y is the normal pdf, i.e.
1 1 T Σ−1 (z−𝜇)
fX,Y (z) = √ e− 2 (z−𝜇) ,
(2𝜋)m+n det Σ
where z = (x, y), 𝜇 = (𝜇x , 𝜇y ) and where
[ ]
Σx Σxy
Σ= T .
Σxy Σy
From Example 3.6, we then have that the conditional pdf is given by
1 1 T Σ−1 (y−𝜇 )
fY |X (y|x) = √ e− 2 (y−𝜇y|x ) y|x y|x
,
(2𝜋)n det Σy|x
where
The conditional expectation is given by 𝜇y|x , which is an affine function of x. Note that the max-
imizing argument y of fY |X (y|x) is y = 𝜇y|x , and hence, we have shown that the conditional mean
is also the MAP estimate for the normal distribution case. As a consequence, the Kalman filter
in Section 3.11 provides the MAP estimate as well as the estimate that minimizes the MSE of the
prediction when the noise sequences are Gaussian.
Another special case is a so-called Gaussian mixture model, which has a joint pdf of the form
∑
N
fX,Y (z) = 𝛼j fXi ,Yi (z),
i=1
where
1 1 T Σ−1 (z−𝜇 )
fXi ,Yi (z) = √ e− 2 (z−𝜇i ) i i
,
(2𝜋)m+n det Σi
256 9 Unsupervised Learning
∑
N
fX (x) = 𝛼i fXi (x),
i=1
where
1 1 T −1
fXi (x) = √ e− 2 (x−𝜇x,i ) Σx,i (x−𝜇x,i ) ,
(2𝜋)m det Σx,i
and hence, the conditional pdf is given by
∑N
i=1 𝛼i fX ,Y (x, y)
fY |X (y|x) = ∑N i i .
i=1 𝛼i fXi (x)
We again make use of the fact that fXi ,Yi (x, y) = fXi (x)fYi |Xi (y|x) and obtain
∑N ∑N
i=1 𝛼i fXi (x)∫ℝn yfYi |Xi (y|x)dy i=1 𝛼i fX (x)𝜇i (x)
𝔼[Y |X = x] = ∑N = ∑N i ,
i=1 𝛼i fXi (x) i=1 𝛼i fXi (x)
where
are the linear predictors for Yi given Xi = x. We see that the overall predictor of Y given X = x is a
convex combination of these linear predictors, where the weights are functions of x, and hence, it
is a nonlinear predictor.
with 𝛼i = 1∕3, i ∈ ℕ3 . Figure 9.3 shows the level curves of the pdf together with the nonlinear pre-
dictor given by the conditional expectation.
𝑦
0
−1
−3 −2 −1 0 1 2 3
𝑥
𝜕J
= −my + Amx + b = 0,
𝜕b
𝜕J
= −DTxy + ADx − my mTx + bmTx + Amx mTx = 0.
𝜕A
Inserting the expression for b from the first equation into the second equation and simplifying
results in A = DTxy D−1 T −1
x and b = my − Dxy Dx mx . It follows that
( )
Ax + b = my + DTxy D−1
x x − mx ,
which is in agreement with the normal distribution case in (9.3.1). We have just replaced the
moments of the normal distribution with the moments of the general distribution. Thus, if we are
satisfied with the best linear predictor, we only need to know the first- and second-order moments
( )
of the distribution. The minimal value of J is tr Dy − DTxy D−1
x Dxy . This value is in general larger
that the trace of the covariance of Y conditioned on X = x. The predictor for Y has the very nice
property that
[ ] [ ]
𝔼 AX + b = my + DTxy D−1 x 𝔼 X − mx = my ,
i.e. its expected value agrees with the expected value of Y . This is what is called an unbiased pre-
dictor.
258 9 Unsupervised Learning
where the latter formula only holds when Xi and Yi are independent. A formula for the general case
of Dxy is more complicated. For the mixture in Example 9.4, it holds that mx = −1∕3, my = 2∕3,
Dx = 71∕27, and Dxy = 24∕27. Hence, the affine predictor is given by
( )
2 24 1
Ax + b = + x+ .
3 71 3
Figure 9.4 shows the affine predictor together with the nonlinear predictor and the level curves of
the pdf.
1
𝑦
−1
−3 −2 −1 0 1 2 3
𝑥
9.4 The Viterbi Algorithm 259
𝑦
0
−2
−4 −2 0 2 4
𝑥
moments of the random variables with their estimates from the observations (xi , yi ), and then the
same formulas for A and b apply. Figure 9.5 shows 100 samples from the Gaussian mixture in
Example 9.4 together with both the true affine predictor and the affine predictor estimated from
the 100 samples. We see that the two predictors are close to one another.
We consider an HMM, as defined in Section 3.11. For ease of reference, the definition is repeated.
Consider two random processes X ∶ Ω → ℤ+ and Y ∶ Ω → ℤ+ that are correlated. The sets and
will be defined later. We will assume that X is a Markov process satisfying (3.10), and that Yj given
Xj are independent of Yk given Xk for j ≠ k. In Section 3.11, we derived the filtering equations for
the conditional pdf pXk |Ȳ k , where Ȳ k = (Y0 , … , Yk ). In this section, we are interested in predicting
or estimating X̄ k = (X0 , … , Xk ) from the observation ȳ k of Ȳ k using MAP estimation, cf . Section 9.3.
Note that we are interested in estimating X̄ k rather than just Xk . This is often called a smoothing
problem. To this end, the joint probability function or pdf pX̄ k ,Ȳ k ∶ k+1 × k+1 → ℝ+ for (X̄ k , Ȳ k ) is
needed.2 We also need the conditional probability function or pdf for Ȳ k given X̄ k : pȲ k |X̄ k ∶ k+1 ×
k+1 → ℝ+ . Because of the conditional independence, it can be expressed as
∏
k
pȲ k |X̄ k (̄yk |̄xk ) = pYi |Xi (yi |xi ),
i=0
where pYi |Xi ∶ × → ℝ+ are the conditional probability functions or pdfs for Yi given Xi . We also
define the marginal probability function or pdf for X̄ k : pX̄ k ∶ k+1 → ℝ+ . From the above assump-
tions, it follows that
∏
k
pX̄ k ,Ȳ k (̄xk , ȳ k ) = pȲ k |X̄ k (̄yk |̄xk )pX̄ k (̄xk ) = pYi |Xi (yi |xi )pX̄ k (̄xk ). (9.13)
i=0
2 We notice that maximizing this joint pdf will result in the MAP estimate, since the only difference between the
conditional pdf used for MAP estimation and the joint pdf is a normalization with the marginal pdf for the
observations.
260 9 Unsupervised Learning
We now make use of the multiplication theorem, see Section 3.2, and obtain with obvious defini-
tions of the involved functions
pX̄ k (̄xk ) = pX0 (x0 )pX1 |X0 (x1 |x0 )pX2 |X0 ,X1 (x2 |x0 , x1 ) · · · pXk |X0 ,…,Xk−1 (xk |x0 , … , xk−1 ),
= pX0 (x0 )pX1 |X0 (x1 |x0 )pX2 |X1 (x2 |x1 ) · · · pXk |Xk−1 (xk |xk−1 ),
where the last equality follows from the Markov property. From (9.13) it then follows that
Therefore,
⋮
× max {pX2 |X1 (x2 |x1 )pY1 |X1 (y1 |x1 )
x1
× max {pX1 |X0 (x1 |x0 )pY0 |X0 (y0 |x0 )pX0 (x0 )}}}}}.
x0
We now introduce functions Vk ∶ → ℝ+ for k ∈ ℤN defined via V0 (x) = pY0 |X0 (y0 |x)pX0 (x) and the
recursion
and that the optimal x̄ k is such that xi−1 is the maximizing u above for i ∈ ℕk , and xk is the maximiz-
ing x above. The recursions above are summarized as the famous Viterbi algorithm in Algorithm 9.1.
We remark that it was much easier to derive the Viterbi algorithm than to derive the filtering
equations in Section 3.11. The reason for this is that we are only interested in computing the MAP
estimate and not in obtaining the conditional probability function or pdf.
There is also a logarithmic version of the Viterbi algorithm which is obtained by defining
Ji ∶ → ℝ for 0 ≤ i ≤ k as Ji (x) = − log Vi (x). It then follows that the recursion reads
Ji (x) = − log pYi |Xi (yi |x) + min {− log pXi |Xi−1 (x|u) + Ji−1 (u)},
u
9.5 Kalman Filter on Innovation Form 261
with initial value J0 (x) = − log pY0 |X0 (y0 |x) − log pX0 (x). This is known to often have better numerical
properties.
A typical example of an HMM for = ℝn and = ℝp is obtained by considering the random
processes defined by the recursion
Xk+1 = Fk (Xk , Vk ),
(9.14)
Yk = Gk (Xk , Ek ),
where Fk ∶ ℝn × ℝn → ℝn , Gk ∶ ℝn × ℝp → ℝp , and where Ek are i.i.d. p-dimensional random vec-
tors and Vk are i.i.d. n-dimensional random vectors, and where X0 ∈ ℝn is a random vector with
known distributions. We will assume that X0 is independent of Ek and Vk for all k ≥ 0. It is straight
forward to verify the Markov property and the conditional independence property.
An important special case of the HMM in (9.14) is obtained when Fk (x, 𝑣) = Ax + 𝑣 and Gk (x, e) =
Cx + e, where A ∈ ℝn×n , and C ∈ ℝp×n , and where X0 , Vk , and Ek all have Gaussian distributions
with expectations x̄ , 0, and 0, respectively, and covariances R0 , R1 , and R2 , respectively. We derived
the Kalman filter in Section 3.11 for this HMM. We will see that we can obtain similar recursions
for the MAP estimate, and we will show that they actually provide the same estimate. We have
where is defined as in Example 3.4, i.e. all the involved pdfs are Gaussian. Applying the loga-
rithmic Viterbi recursion, we find that
1 1
J0 (x) = (x − x̄ 0 )T R−1 ̄ 0 ) + (y0 − Cx)T R−1
0 (x − x 2 (y0 − Cx), (9.15)
2 2
{ }
1 1
Ji (x) = (y − Cx)T R−1
2 (yk − Cx) + min 1 (x − Au) + Ji−1 (u) ,
(x − Au)T R−1 (9.16)
2 k u 2
for i ∈ ℕk modulo constant terms. Finally, we obtain the optimal xk as the solution of
minimize Jk (x).
262 9 Unsupervised Learning
We will now find a more explicit solution to the above recursions by verifying that
1 T
Ji (x) =x Pi x + qTi x + ri ,
2
for some Pi ∈ 𝕊n+ , qi ∈ ℝn , and ri ∈ ℝ. For i = 0 this holds with
P0 = R−1 T −1
0 + C R2 C, q0 = −R−1
0 x̄ 0 − CT R−1
2 y0 . (9.17)
The actual value of the constant term r0 will be of no interest, and this is true for the whole sequence
ri . The argument of the min operator on the right-hand side of (9.16) is a strictly convex function
of u, and hence, its minimizer must satisfy the stationary condition
−AT R−1
1 (x − Au) + Pi−1 u + qi−1 = 0.
where we again omit the constant terms. By making use of the definition of Gi−1 , the expressions
can be simplified to
2 C + R1 − R1 AGi−1 A R1 ,
Pi = CT R−1 −1 −1 −1 T −1
2 yi + R1 AGi−1 qi−1 .
qi = −CT R−1 −1 −1
f
2 yi − Yk APi−1 qi−1 ,
qi = −CT R−1 −1
(9.19)
where
f ( )
−1 T −1
Yi = R1 + APi−1 A . (9.20)
This also shows that Pi is positive definite. The estimate at the final time i = k is now given by
xk = −Pk−1 qk .
The above recursions and this expression can be used to obtain the solution for any value of k, i.e.
we can obtain the solution for the problem ending at k + 1 from the solution for the problem ending
9.5 Kalman Filter on Innovation Form 263
at k with just one more step in the recursion. We summarize the Kalman filter on the innovation
form in Algorithm 9.2. It is a good idea to avoid computing the inverse of Pk . It is better to use a
Cholesky factorization. Sometimes the algorithm is presented with qk having the opposite sign. We
may also use the fact that u = xi−1 to obtain the estimates of xi for i ∈ ℤk−1 from
( T −1 )
i−1 A R1 xi − qi−1 ,
xi−1 = G−1 (9.21)
which are the so-called “smoothed estimates.” Here, we may use the SMW identity to express
G−1
i−1
as
( )
−1 T −1 −1
G−1 −1 −1 T
i−1 = Pi−1 − Pi−1 A R1 + APi−1 A Pi−1 .
The smoothed estimates will be different for different values of k, i.e. we cannot find the smoothed
solution for the problem ending at k + 1 from the solution of the problem ending at k without
re-running the backward recursion in (9.21). The more intuitive explanation for this is that the
new measurement yk+1 affects all the smoothed estimates.
f ( )−1
Yk+1 ← R1 + APk−1 AT
f
qk+1 ← −Yk+1 APk−1 qk − CT R−1 y
2 k+1
end
The form of the Kalman filter that we have derived is called the information form. It has advan-
tages when the inverses of the covariance matrices are sparse. It is also advantageous when we have
0 → 0. Then we can initialize
little information about the initial value of the state and need to let R−1
f
with Y0 = 0. However, the Kalman filter is often presented in another way, cf . Section 3.11, which
we will now derive from the innovation form summarized in Algorithm 9.2. Let us define Σk = Pk−1
( )−1
f f
and Σk = Yk . From the update formula for Pk in (9.18), we then have
( )−1
f
Σ−1
k
= C T −1
R 2 C + Σk .
f
We will now show that the recursion for qk in (9.19) is related to recursions for xka and xk ,
defined as
( )−1 ( )
f f f f
xka = xk + Σk CT CΣk CT + R2 yk − Cxk ,
f
xk+1 = Axka .
f
with x0 = x̄ 0 , and where the estimate at time k is given by xk = xka . More precisely, we will show
that with the definition qk = −Pk xka , the above recursions are the same as the recursion for qk that
we derived previously in (9.19). It holds that
( ( )−1 )
f T f T f
Pk xk = Pk I − Σk C CΣk C + R2 C xk
( )−1
f f
+ Pk Σk CT CΣk CT + R2 yk ,
where we in the first equality have used the SMW identity and where we in the second equality
f f
have added and subtracted Yk (Yk + CT R−1
2
C)−1 inside the parenthesis. Thus, we have that
( ) f
2 C xk + Pk Σk C R2 yk ,
Pk xk = Pk I − Σk CT R−1 T −1
f
2 yk ,
a
−1
= Yk APk−1 Pk−1 xk−1 + CT R−1
which agrees with the recursion for qk in (9.19). It is also straightforward to show using similar
techniques that the initial values agree. The Kalman filter on standard form is summarized in
Algorithm 3.1.
In this section, we are going to investigate decoding of a coded message. We will specifically con-
sider the so-called convolutional codes, for which a signal u ∈ {0, 1}ℤ+ is coded into y ∈ {0, 1}ℤ+
using a convolution
∑
n
yk = ci uk−i ,
i=1
where ci ∈ {0, 1} for i ∈ ℕn represents the code. The above summation is carried out as modulo
two. We assume that uk = 0 for k < 0 to make the convolution well defined. It is then possible to
introduce a state xk ∈ {0, 1}n such that
It is then straightforward, cf . Section 10.1, to see that the ML problem of estimating uk for k ∈ ℤN−1
is equivalent to solving the optimal control problem
∑
N
minimize (rk − Cxk )2 ,
k=0
subject to xk+1 = Axk + Buk , k ∈ ℤN−1 ,
with variables (u0 , x1 , … , uN−1 , xN ), where x0 = 0, see [84]. Omura used dynamic programming
as presented in Chapter 8 to solve the problem. This does, however, not result in a very practi-
cal algorithm, since a solution for N cannot be used to solve a problem where N is replaced with
N + 1. This is often of importance in decoding applications. i[109] proposed an algorithm that does
not suffer from this limitation. It is based on performing dynamic programming forward in time
instead of backward in time. This can be derived from the general approach of partially separa-
ble optimization problems as presented in Section 5.5.3 We introduce for k ∈ ℤN−1 the functions
fk ∶ {0, 1}n × {0, 1} × {0, 1}n → ℝ as
where ID ∶ {0, 1}n × {0, 1} × {0, 1}n → ℝ is the indicator function for the set
We also let 𝜙 ∶ {0, 1}n → ℝ be defined as 𝜙(x) = (yN − Cx)2 . With this notation, we may write the
optimal control problem above as
∑
N−1
minimize 𝜙(xN ) + fk (xk , uk , xk+1 ),
k=0
with variables (u0 , x1 , … , uN−1 , xN ), where x0 = 0. This is clearly a partially separable optimization
problem.
We then introduce the functions Vk ∶ {0, 1}n → ℝ defined as
{ }
Vk+1 (x+ ) = min Vk (x) + fk (x, u, x+ ) , k ∈ ℤN−1 ,
u,x
where V0 (x) = 0. We also define 𝜇k+1 ∶ {0, 1}n → {0, 1} × {0, 1}n as the minimizing argument in
the above minimization, i.e.
{ }
𝜇k+1 (x+ ) = argmin Vk (x) + fk (x, u, x+ ) .
x,u
The function VN (x) + (rN − Cx)2 is then finally minimized with respect to x to obtain the optimal xN .
After this has been done, all optimal variables can be recovered from the recursion
3 We may interpret this as a variation of the Viterbi algorithm if we take yk in the Viterbi algorithm equal to rk ,
consider the state transition probability to be degenerate, and introduce an additional variables uk to optimize over.
266 9 Unsupervised Learning
with variables (x, u). Because of the very specific structure of the constraints, we can say much
more. To avoid cluttering the notation, we look at n = 3. Then we have that the constraints are
u = x1+ ,
x1 = x2+ ,
x2 = x3+ ,
and hence, the only optimization variable is x3 . Therefore, the optimization problem is
minimize (rk − c1 x2+ − c2 x3+ − c3 x3 )2 + Vk ((x2+ , x3+ , x3 )),
where the minimizing argument x3 will be a function of x2+ and x3+ . Also, notice that Vk+1 will only
be a function of x2+ and x3+ . The minimizing argument is therefore
+
⎡ x1 ⎤
⎢ + ⎥
⎢ x2 ⎥
𝜇k (x) = ⎢ ⎥.
+
⎢ x3 ⎥
⎢ ⎥
⎣x3 (x2 , x3 )⎦
+ +
Only the last component is nontrivial to compute, and it can be done by enumerating of all possible
values of x2+ and x3+ . We now realize that we need tables for Vk and 𝜇k that have 2n−1 entries.
In the practical use of Viterbi decoding, the value of N is not fixed, but it is increasing and rep-
resents time. The decoding is done with some fixed delay d measured from N, i.e. it is uN−d that is
estimated. Thus, we need to store one table for VN and d tables for 𝜇k , where N − d + 1 ≤ k ≤ N. In
case d2n−1 is large, this could be costly. Approximations for how to circumvent this were proposed
by Viterbi; see, e.g. [110].
where
( ( ))
∑ ∑ ∑
A(𝜆, Λ) = ln exp 𝜆k xk + Λi,j xi xj .
x∈x k∈ℕn (i,j)∈E
We notice that we do not specify all the entries of neither Λ nor the matrix M of second moments. If
we let Λi,j = 0 for (i, j) ∉ E, then we may express the probability function in terms of tr(ΛxxT ). The
9.7 Graphical Models 267
Lagrange dual function will therefore look the same, and so will the optimality conditions, except
for the fact that those related to (i, j) ∉ E are omitted, i.e. we have
𝜕h 𝜕A
= − m = 0,
𝜕𝜆 𝜕𝜆
𝜕h 𝜕A
= − Mij = 0, (i, j) ∈ E.
𝜕Λij 𝜕Λij
Also, for this case, it is in general difficult to solve the optimality conditions.
𝐶
𝐴
Figure 9.6 A graph where the subset of nodes in C separates the subset of nodes in A from the ones in B.
where pA∣C ∶ ℝ|A|+|C| → ℝ+ and pB∣C ∶ ℝ|B|+|C| → ℝ+ are the conditional pdfs and where
pC ∶ ℝ|C| → ℝ+ is the marginal pdf for the variables indexed by C. This means that the random
variables indexed by A and B conditioned on the random variables indexed by C are independent,
which is called conditional independence. This is the motivation for the name Markov random field
for a distribution specified as above.
From the above definitions, it follows that Λ must have the following structure:
⎡× 0 × 0⎤
⎢ ⎥
0 × × 0⎥
Λ=⎢ .
⎢× × × ×⎥
⎢0 0 × ×⎥⎦
⎣
We realize that we with no loss of generality may consider the components indexed by D to be part
of A and/or B, or we may assume that V = A ∪ B ∪ C, which we from now on do, see Figure 9.6.
From this, it follows that Λ must have the following structure:
⎡× 0 ×⎤
Λ = ⎢ 0 × ×⎥ ,
⎢ ⎥
⎣× × ×⎦
which is called an arrow structure. To prove the conditional independence property, we partition
the covariance matrix Σ as
⎡ ΣA ΣAB ΣAC ⎤
Σ = ⎢ΣTAB ΣB ΣBC ⎥ .
⎢ T ⎥
⎣ΣAC ΣTBC ΣC ⎦
We also let
[ ] [ ] [ ] [ ]T
Σ̃ Σ̃ Σ Σ Σ ΣAC
Σ̃ = T1 12 = TA AB − AC Σ−1 ,
Σ̃ 12 Σ̃ 2 ΣAB ΣB ΣBC C ΣBC
which is the covariance matrix for the variables indexed by A and B conditioned on the variables
indexed by C. Hence, it only remains to prove that Σ̃ 12 = 0. From the formula for the inverse of a
blocked matrix in (2.54), it follows that
[ −1 ]
Σ̃ ×
−1
Σ = .
× ×
We once again apply this formula to obtain that
( )−1 ( )−1
⎡ Σ̃ 1 − Σ̃ 12 Σ̃ −1 Σ̃ T
− ̃
Σ − ̃
Σ ̃
Σ ̃
−1 T
Σ Σ̃ 12 Σ̃ 2 ⎤
−1
Σ̃ = ⎢ ⎥.
−1 2 12 1 12 2 12
( )−1
⎢−Σ̃ −1 Σ̃ T Σ̃ − Σ̃ Σ̃ −1 Σ̃ T ⎥
⎣ 2 12 1 12 2 12 × ⎦
9.8 Maximum Likelihood Estimation 269
and hence, Σ̃ 12 = 0, which is what we wanted to prove. Because of this, (9.23) holds.
There may be many more ways to partition V such that the above property holds. It is possible to
show that the pdf in general can be factorized as
∏
p(x) = fC (xC ),
C∈
for some functions fC ∶ ℝ|C| → ℝ+ , where is the set of all cliques of G, i.e. the set of all complete
subgraphs of G, see [112]. Here, C has a different meaning than before. It is possible to take to be
the set of maximal cliques of G, where a clique is maximal if it is not contained in any other clique.
We will later discuss how this structure may be utilized in more detail. The above Markov property
holds for general graphical models defined on undirected graphs, and specifically, also for the Ising
model.
We have already seen how we can estimate distributions by maximizing entropy. Now, we will
consider the problem of estimating parameters in a distribution by maximizing the so-called “like-
lihood function.” This is called maximum likelihood (ML) estimation. For a probability function
p ∈ n that depends on a parameter 𝜆 ∈ ℝ, we may define the likelihood function ∶ ℝN × ℝ →
[0, 1] based on N samples xk , i ∈ ℕN , of the random variable X ∶ ℕn → ℝ as
∏
N
(x1 , … , xN ; 𝜆) = pi ,
i=1
[ ]
where pi = ℙ X = xi . Then an estimate of 𝜆 is obtained by maximizing , or equivalently, the
logarithm of .
Note that the subindex i now refers to the ith sample and not the ith component of x. If we take
∑N ∑N
m = N1 i=1 xi and M = N1 i=1 xi xiT in Section 9.2, then minimizing h(𝜆, Λ) = A(𝜆, Λ) − 𝜆T m −
tr(ΛM) is equivalent to maximizing the likelihood function. This results also hold for the case
when we define the Ising model on a graph.
∏
N
(x1 , … , xN ; 𝜃) = p(xi , 𝜃).
i=1
As before, the estimate of 𝜃 is obtained by maximizing . Now, consider the normal distribution
discussed in Section 9.2, and suppose that 𝜃 = (𝜆, Λ) ∈ ℝn × 𝕊n . The log-likelihood function is then
1 ∑(
N
)T ( )
ln (x1 , … , xN ; 𝜃) = − xi + Λ−1 𝜆 Λ xi + Λ−1 𝜆
2 i=1
( )
N det Λ
+ ln ,
2 (2𝜋)n
( (N ))
1 ∑ ∑N
N
= − tr Λ xi xiT
− 𝜆T xi − 𝜆T Λ−1 𝜆
2 i=1 i=1
2
( )
N det Λ
+ ln . (9.24)
2 (2𝜋)n
If we take
1∑ 1∑ T
N N
m= x, M= xx
N i=1 i N i=1 i i
( )
in Section 9.2, then minimizing h(Λ, Λ) = 12 𝜆T Λ−1 𝜆 − 12 ln det Λ
(2𝜋)n
+ 𝜆T m + 12 tr (ΛM) is equivalent
to maximizing the likelihood function. Hence, the relationship between maximum entropy and
ML estimation is the same also for continuous distributions. Note that the problem of minimizing
h is a convex optimization problem. The solution has already been derived in Section 9.2, and with
Σ = M − mmT , we have Λ = Σ−1 and 𝜆 = −Σ−1 m. Thus, the solution to the ML problem is simply
the sample mean and covariance in case we use the nonnatural parameterization. We may also
consider ML estimation for the normal distribution when it is defined on a graph, and we obtain
similar results.
9.8.4 Generalizations
There are several ways in which we could generalize the ML problem. For example, if we have prior
information such as upper or lower bounds on the matrix Λ, e.g.
Bl ⪯ Λ ⪯ Bu ,
9.9 Relative Entropy and Cross Entropy 271
with Bl , Bu ∈ 𝕊n++ , then we have a convex constraint that easily can be incorporated when we min-
imize h(𝜆, Λ). Also, an upper bound 𝜅max on the condition number of Λ can be incorporated by
noting that it is equivalent to the existence of u > 0 such that uI ⪯ Λ ⪯ 𝜅max uI. We may also include
prior information by modifying m and M. For example, if we have reason to believe from previous
experience that m0 and M0 are good values, then we could take
1∑
N
m = 𝛼m0 + (1 − 𝛼) x,
N i=1 i
1∑ T
N
M = 𝛽M0 + (1 − 𝛽) xx ,
N i=1 i i
with 𝛼, 𝛽 ∈ [0, 1], where the values of 𝛼 and 𝛽 are related to how much we trust our prior informa-
tion as compared to the information in the data {x1 , … , xN } that we have collected.
Sometimes it is of interest to quantify the difference between different distributions, and this is what
relative entropy, or equivalently the Kullback–Leibler divergence does. We will give the definition for
two pdfs p and q defined on ℝn with obvious modifications for probability functions. Let
{ }
|
ℝn |
n = p ∈ ℝ | p(x)dx = 1, p(x) ≥ 0, ∀x ∈ ℝ n
.
|∫ℝn
Define the relative entropy D ∶ n × n → ℝ from q to p as
q(x)
D(p, q) = − p(x) ln dx.
∫ℝn p(x)
D(p, q) ≥ 0,
with equality if and only if p(x) = q(x) for almost all x. From this, we see that the relative entropy
measures the “difference” between two pdfs. However, it is not a metric, since in general D(p, q) ≠
D(q, p). The proof of Gibbs’ inequality is based on Jensen’s inequality that says that for a convex
function 𝜑 ∶ ℝ → ℝ, any function f ∶ ℝn → ℝ, and any pdf p, it holds that
( )
𝜑 f (x)p(x)dx ≤ 𝜑 (f (x)) p(x)dx.
∫ℝn ∫ℝn
This is just a slight modification of Jensen’s inequality in Exercise 4.8, where dx is replaced with
p(x)dx. If we let f (x) = q(x)∕p(x), and 𝜑(f ) = − ln(f ) in Jensen’s inequality, we find that
q(x)
D(p, q) ≥ − ln p(x)dx = − ln q(x)dx = 0.
∫ℝn p(x) ∫ℝn
We will now use Gibbs’ inequality to show that the pdf
( )
1
p(x) = exp −1 − 𝜇 − 𝜆T x − xT Λx
2
272 9 Unsupervised Learning
in (9.8) indeed maximizes the entropy. Suppose that q is another pdf that satisfies the same moment
constraints, i.e.
xp(x)dx = xq(x)dx = m,
∫ℝn ∫ℝn
= −D(q, p) + H(p),
and since D(q, p) ≥ 0 with equality when q = p, it follows that H(q) ≤ H(p) with equality when
q = p. Thus, p maximizes the entropy under the given moment constraints.
C(p, q) ≥ H(p),
with equality if and only if p(x) = q(x) for almost all x. We also realize that cross entropy is the
expected value of − ln q with respect to the pdf p, i.e.
[ ]
C(p, q) = −𝔼 ln q ,
with variable 𝜃, where xk , k ∈ ℕN , are the observed data. If we assume that xk are observations of
a random variable with pdf p, then the objective function is proportional to an SAA of the cross
entropy.
9.10 The Expectation Maximization Algorithm 273
with variable 𝜃. Unfortunately, it is sometimes complicated to write down , and the expression
for its gradient with respect to 𝜃. Hence, the evaluations of function and gradients might be
time-consuming and/or error-prone to implement. However, in case we had some more observa-
tions z ∈ ℝM , then sometimes the ML problem with x = (y, z) as observation happens to not suffer
from the above difficulties. This is often the case when some data are missing because of errors
in the collection of the data, so-called missing data, or in case there are data that are difficult to
measure, so-called latent variables.
We will now show how to circumvent the above problems. To this end, let fX ∶ ℝN × ℝM ×
ℝ → ℝ+ be the joint pdf for the random variable X = (Y , Z) with parameter 𝜃, of which we have
p
the partial observation Y = y.4 We also define the conditional pdf fZ|Y ∶ ℝN × ℝM × ℝp → ℝ+ via
fX (y, z; 𝜃) fX (y, z; 𝜃)
fZ|Y (z|y; 𝜃) = = .
fY (y; 𝜃) (y; 𝜃)
This follows since is the marginal pdf for the observation y. Moreover, we consider an arbitrary
pdf q ∈ M . Note that this is not the marginal pdf for the latent variable z. We now consider the
infinite-dimensional optimization problem
with variables 𝜃 and q, where D is the relative entropy. The second term in the objective function
depends on 𝜃, since fZ|Y is a function of 𝜃. We have that only the second term in the objective
function depends on q, and therefore, we may first carry out the maximization with respect to this
term over q with the trivial maximum q = fZ|Y with D(fZ|Y , fZ|Y ) = 0. Then we are left with the ML
problem originally defined, and hence, the problems are equivalent, i.e. we can trivially obtain the
solution for one of them from the other. Furthermore, it holds that
where H is the entropy and C is the cross-entropy. The above results follow from the fact that cross
entropy is the sum of entropy and relative entropy, and from the definition of the conditional pdf
for Z given Y . We now apply block-coordinate accent to the optimization problem above, i.e. we
repeat the following steps:
This is not a procedure that in general is guaranteed to converge to the optimal solution. However,
our derivation shows that we can never obtain a worse value of the objective function by iterating
4 In this section, we do not have several observations of a scalar-valued random variable. Instead, we have one
observation of a vector-valued random variable. The former case can be treated as a special case of the vector-valued
case by taking each component of the vector-valued random variable to be a random variable with the same
distribution.
274 9 Unsupervised Learning
as above. A detailed discussion of the convergence properties is given in, e.g. [117]. The first step
above is trivial, since by the original formulation of the objective function, we immediately have
that q = fZ|Y is optimal. In the second step q is fixed, and therefore, it is by the reformulation of
the objective function equivalent to maximizing −C(fZ|Y , fX ) with respect to 𝜃, since the term H(q)
only depends on the previous value of 𝜃 which we will denote 𝜃. ̄ Hence, we first need to evaluate
the function Q ∶ ℝp × ℝp → ℝ given by
[ ]
̄ = −C(fZ|Y , fX ) = 𝔼 ln(fX (Y , Z; 𝜃))|Y = y; 𝜃̄ ,
Q(𝜃, 𝜃)
i.e. we need to compute the conditional expected value of the log of the joint pdf fX . This is often
easy, and it is called the expectation step in the expectation maximization (EM) algorithm. After
this, we need to solve
̄
maximize Q(𝜃, 𝜃),
with variable 𝜃, which is often easier to solve than the original ML problem. This is called the
maximization step. We will later exemplify the claims regarding what is difficult and easy to com-
pute. The E-step is sometimes approximated using an SAA also called empirical cross-entropy,
i.e.
1∑
M
[ ]
𝔼 ln fX (Y, Z; 𝜃)|Y = y; 𝜃̄ ≈ ln fi ,
M i=1
where fi = fX (y, zi ; 𝜃) is obtained by drawing a sample zi from the conditional pdf fZ|Y . This is called
Monte Carlo EM.
where 𝛼 ∈ Δk . In other words, the random variable Y may be viewed as an overall population that
is derived from a mixture of subpopulations.
An important special case is when the k components are Gaussian random variables. The corre-
sponding mixture model is called a Gaussian mixture model (GMM), and the mixture pdf can be
expressed as
∑
k
fY (y; 𝜃) = 𝛼j (y, 𝜇j , Σj ), (9.25)
j=1
where 𝜃 represents the model parameters (𝛼j , 𝜇j , Σj ), j ∈ ℕk . The problem of computing a ML esti-
mate of the model parameters based on a given set of m independent observations, y1 , … , ym ∈ ℝd ,
9.11 Mixture Models 275
subject to 𝛼 ∈ Δ , k
Σj ⪰ 0, j ∈ ℕk ,
with variables (𝛼j , 𝜇j , Σj ), j ∈ ℕk . This problem is generally nonconvex and intractable, but local
optimization methods may be used in the pursuit of a local maximum. We note that the ML
estimation problem becomes much easier if, in addition to the m observations y1 , … , ym , we are
given labels z1 , … , zm ∈ ℕk such that zi identifies which of the k components the ith observation
originates from. However, such labels are typically not available.
We will now show how the EM algorithm can be used to derive a relatively simple iterative pro-
cedure that converges to a local maximum. To this end, we introduce a discrete random variable
Z which takes the value j with probability 𝛼j , j ∈ ℕk , i.e. Z is a latent variable that identifies one of
the k components. Moreover, we define the pdf of Y given Z = z as
∑
k
fY |Z (y|z; 𝜃) = 𝛿j (z) (y, 𝜇j , Σj ),
j=1
It is easy to check that (9.25) follows from (9.26) by marginalizing over Z. Moreover, the probability
function of Z conditioned on Y may be expressed as
∑k
fY ,Z (y, z; 𝜃) j=1 𝛿j (z)𝛼j (y, 𝜇j , Σj )
fZ|Y (z|y; 𝜃) = = ∑k . (9.27)
fY (y; 𝜃)
l=1 𝛼l (y, 𝜇l , Σl )
Now, given m observations y1 , … , ym and model parameters 𝜃, ̄ e.g. an initial guess or the param-
eters from the previous iteration of the EM algorithm, the E-step of the EM algorithm may be
expressed as
[ m ]
∑
̄ =𝔼
Q(𝜃, 𝜃) ln(fY ,Z (Yi , Zi ; 𝜃)) ∣ Y1 = y1 , … , Ym = ym ; 𝜃̄ ,
i=1
∑
m
[ ]
= 𝔼 ln(fY ,Z (Yi , Zi ; 𝜃)) ∣ Yi = yi ; 𝜃̄ ,
i=1
∑∑
m k
= ̄ ln(fY ,Z (yi , j; 𝜃)),
fZ|Y (j|yi ; 𝜃)
i=1 j=1
where (Yi , Zi ), i ∈ ℕm , are independent pairs of random variables with the same joint pdf as that of
̄ as
(Y , Z). Using (9.27), we can write Q(𝜃, 𝜃)
∑∑
m k
( )
̄ =
Q(𝜃, 𝜃) 𝑤̄ ij ln 𝛼j (yi , 𝜇j , Σj ) , (9.28)
i=1 j=1
5 Note that Y is a continuous random variable and Z is a discrete random variable, for which we define a joint pdf.
276 9 Unsupervised Learning
̄ is the probability that yi originates from the jth mixture component under
where 𝑤̄ ij = fZ|Y (j|yi , 𝜃)
the mixture model defined by the model parameters 𝜃. ̄
The M-step of the EM algorithm is the problem of maximizing Q(𝜃, 𝜃) ̄ with respect to the model
parameters 𝜃. This is a block separable optimization problem, which follows by writing Q(𝜃, 𝜃) ̄ as
( )
∑ k
∑
k
∑
m
( )
Q(𝜃, 𝜃)̄ = c̄ j ln(𝛼j ) + 𝑤̄ ij ln (yi , 𝜇j , Σj ) ,
j=1 j=1 i=1
∑m
where we define c̄ j = i=1 𝑤̄ ij . Thus, the update of 𝛼 may be expressed as
{ }
∑
k
𝛼 + = argmax c̄ j ln(𝛼j )| 𝛼 ∈ Δk ,
𝛼 j=1
which is concave in 𝜇j for a fixed Σj and concave in Σ−1 j for a fixed 𝜇j . The first-order optimality
conditions are
∑m
∑
m
( )
𝑤̄ ij Σ−1
j (y i − 𝜇 j ) = 0, 𝑤̄ ij Σj − (yi − 𝜇j )(yi − 𝜇j )T = 0,
i=1 i=1
1∑
m
𝜇j+ = 𝑤̄ y , j ∈ ℕk , (9.31)
c̄ j i=1 ij i
and the update covariance matrix is
1∑
m
Σ+j = 𝑤̄ (y − 𝜇j+ )(yi − 𝜇j+ )T , j ∈ ℕk . (9.32)
c̄ j i=1 ij i
We note that the weights 𝑤̄ ij are positive, and hence, Σ+j is nonsingular if and only if span({y1 −
𝜇j+ , … , ym − 𝜇j+ }) = ℝd . Equivalently, Σ+j is singular if and only if
dim aff({y1 , … , ym }) < d.
For example, this is the case if the number of observations m is less than or equal to d or if all
observations lie on a hyperplane.
Example 9.6 We now illustrate the use of GMMs to approximate the pdf of a random variable
based on m independent observations. Figure 9.7 shows examples in one and two dimensions. The
one-dimensional example in Figure 9.7a shows the GMM pdf for a model with k components and
with parameter estimates based on m = 1000 observations and computed using the EM algorithm.
The observations are shown as a normalized histogram. The model seemingly fits the histogram
well when we use a mixture model with three components. The two-dimensional example in
Figure 9.7b shows m = 500 observations as dots and an ellipse for each component of the GMM
obtained using the EM algorithm. Each ellipse defines the superlevel set containing 95% of the
probability mass for the corresponding component. We see that the model with four components
visually appears to be a reasonable approximation.
9.12 Gibbs Sampling 277
0 0 0
−4 −2 0 2 4 −4 −2 0 2 4 −4 −2 0 2 4
(a)
𝑘=2 𝑘=3 𝑘=4
4 4 4
2 2 2
0 0 0
−2 −2 −2
−4 −4 −4
−5 0 5 −5 0 5 −5 0 5
(b)
Figure 9.7 Observations and estimated GMM with k components based on maximum likelihood
estimation: (a) one-dimensional GMMs and (b) two-dimensional GMMs.
2 2
𝑥2
𝑥2
1 1
0 0
0 2 4 0 2 4
𝑥1 𝑥1
(a)
Direct sampling Gibbs sampling
4 4
3 3
2 2
𝑥2
𝑥2
1 1
0 0
0 1 2 3 4 0 1 2 3 4
𝑥1 𝑥1
(b)
Figure 9.8 Realizations of K = 50 samples obtained via direct sampling and Gibbs sampling; the ellipse
marks the superlevel set of fX that contains 95% of the probability mass, and o marks the starting point
x(0) = (2, 1∕2) used in the Gibbs sampler. (a) Low correlation: 𝜎 = 0.2, 𝜃 = 𝜋∕30, and 𝜌 ≈ 0.45. (b) High
correlation: 𝜎 = 0.1, 𝜃 = 𝜋∕4, and 𝜌 = 12∕13.
∑N
where Ē ∶ Nx × ℝm × 𝕊m → ℝ is given by E(x,
̄ 𝜆, Λ) = i=1 E(xi ). The conditional pdf for h given
𝑣 is s ∶ N𝑣 × Nh × ℝm × 𝕊m → ℝ given by
̄
r(𝑣, h, 𝜆, Λ) e−E(𝑣,h,𝜆,Λ)
s(𝑣, h, 𝜆, Λ) = ∑ = ∑ ̄ ,
h r(𝑣, h, 𝜆, Λ)
−E(𝑣,h,𝜆,Λ)
he
where 𝑣 and h are defined such that 𝑣 × 𝑣 = x . The summation over h above is carried
out over all h ∈ h . From now on, we tacitly assume that all summations are carried over all the
elements in the corresponding sets unless otherwise stated. We define Q ∶ ℝm × 𝕊m → ℝ via
∑ ( )
Q(𝜆, Λ) = Es ln r = ̄ h, 𝜆, Λ) − NA(𝜆, Λ) ,
s(𝑣, h, 𝜆− , Λ− ) −E(𝑣,
h
∑
̄ h, 𝜆, Λ) − NA(𝜆, Λ),
= − s(𝑣, h, 𝜆− , Λ− )E(𝑣,
h
where (𝜆− , Λ− )
are the values of the parameters from the previous iteration in the EM algorithm.
In order to maximize Q, the gradient with respect to (𝜆, Λ) is typically needed. We have
𝜕Q ∑ 𝜕 Ē 𝜕A
= − s(h, 𝑣, 𝜆− , Λ− ) −N ,
𝜕𝜆 h
𝜕𝜆 𝜕𝜆
280 9 Unsupervised Learning
∑ ∑
N
∑
= s(h, 𝑣, 𝜆− , Λ− ) xi − N 𝜉p(𝜉),
h i=1 𝜉
∑
N
∑ ∑
= si (hi , 𝑣i , 𝜆− , Λ− )xi − N 𝜉p(𝜉),
i=1 hi 𝜉
Similarly, we obtain
𝜕Q ∑∑ ∑
N
= si (hi , 𝑣i , λ− , Λ− )xi xiT − N 𝜉𝜉 T p(𝜉).
𝜕Λ i=1 h 𝜉
i
Note that above x = (𝑣, h), and that 𝑣 is known and that we sum over all possible values of the
hidden variables h. Hence, it is not trivial to compute the gradients if h and 𝜉 are high dimensional
because of the fact that we need to sum over many variables. We notice that the expressions for
the gradients of Q are differences of expectations, and therefore, it is possible to approximate them
using Monte Carlo techniques like Gibbs sampling; see, e.g. [38]. Note that the gradients can be
computed as sums of gradients for each observation i. We will later discuss what further structure
can be utilized when we have a graphical Ising model.
1∑
N
J(W, c1 , … , cN ) = ||x − W T ci ||22 .
2 i=1 i
minimize J(W, c1 , … , cN ),
subject to WW T = I,
with variables W, c1 , … , cN . The solution to this problem will give us a lower-dimensional descrip-
tion of the original data. The rows of W are called the principal components, and the analysis we
carry out is called principal component analysis (PCA).
9.14.1 Solution
This is a convex problem with respect to ci for fixed value of W. Hence, we first carry out this
minimization. The optimal values are obtained by setting the gradient equal to zero, i.e.
𝜕J
= WW T ci − Wxi = 0, i ∈ ℕN .
𝜕ci
9.14 Principal Component Analysis 281
Since WW T = I, we get that ci = Wxi . We realize that the principal components are used to com-
press the original data xi to the lower-dimensional data ci . Back substitution into J gives
1 ∑ T(
N
)2
J(W, Wx1 , … , WxN ) = x I − W T W xi .
2 i=1 i
Since I − W T W is a projection matrix, we may remove the square. From this, we see that it is equiv-
alent to maximize
1∑ T T
N
1( T ) 1 ( )
x W Wxi = trX XW T W = tr WX T XW T ,
2 i=1 i 2 2
where
⎡ x1T ⎤
⎢ ⎥
X = ⎢ ⋮ ⎥.
⎢ T⎥
⎣xN ⎦
Let X = UDV T be a singular value decomposition such that the diagonal matrix D has the elements
sorted in decreasing order. Let Y = V T W T . Then we may define the criterion above as J̃ ∶ℝn×m → ℝ+ ,
where
1 ( )
J̃ (Y ) = tr Y T D2 Y ,
2
and define the optimization problem equivalently as
maximize J̃ (Y ),
subject to Y T Y = I,
where we have made use of the fact that WW T = I if and only if Y T Y = I. This is a nonconvex
optimization problem. It can be shown that the gradient of the constraint function h ∶ ℝn×n → 𝕊n
defined by h(Y ) = I − Y T Y is full rank for all orthogonal Y , and hence, the linear independence
condition is satisfied for the necessary optimality conditions in Section 4.7. To see this notice that
it can be shown that
( )T
𝜕 svec h(Y ) 𝜕 svec h(Y )
𝜕(vec Y )T 𝜕(vec Y )T
is a diagonal matrix with positive diagonal. It is important to only consider the symmetric vector-
ization of h since otherwise, the condition does not hold. Because of this the Lagrange multiplier
has to be a symmetric matrix. Define the Lagrangian L ∶ ℝn×m × 𝕊m → ℝ as
( ( ))
L(Y , Λ) = J̃ (L) + tr Λ I − Y T Y .
Then a necessary condition for Y to be optimal is that there exist Λ such that the gradient of the
Lagrangian vanishes, i.e.
𝜕L
= D2 Y − Y Λ = 0.
𝜕Y
[ ] [ ]T [ ]
Let Z be such that Y Z is orthogonal, i.e. Y Z Y Z = I, and multiply the above equation with
[ ]T
Y Z from the left to obtain the equivalent equations
Y T D2 Y − Λ = 0,
Z T D2 Y = 0.
282 9 Unsupervised Learning
Note that the first equation always has a solution Λ for any Y , and hence, the necessary conditions
of optimality are equivalent to existence of Z such that
[ ]T [ ]
Y Z Y Z = I,
Z T D2 Y = 0.
[ ] [ ]
Clearly, one solution to these equations is Y Z = I. It is actually optimal. Moreover, Y Z =
blkdiag(X1 , X2 ), for any orthogonal X1 ∈ ℝm×m and X2 ∈ ℝ(n−m)×m are also optimal. We will show
this by noting that the objective function can be written as
1∑
N
||Dyi ||22 ,
2 i=1
[ ]
where Y = y1 · · · ym . Now we start by optimizing with respect to y1 . We have
⎡ d1 y11 ⎤
⎢ ⎥
dy
Dy1 = ⎢ 2 21 ⎥ ,
⎢ ⋮ ⎥
⎢d y ⎥
⎣ n n1 ⎦
where di are the diagonal elements of D, which we remember are ordered such that di ≥ dj when
i < j. The constraint Y T T = I implies that yT1 y1 = 1, and therefore, it is optimal to take y1 = e1 , which
is the first basis vector. All the other yi , for 2 ≤ i ≤ m has to have its first component equal to zero
in order to be orthogonal to y1 . Because of this, we by repeating the arguments above, find that
y2 = e2 , i.e. we pick out the second largest diagonal element of D. The remaining yi now have to
have the first
[ ] two components equal to zero in order [ to ]be orthogonal to y1 and y2 . We thus conclude
I X
that Y = is optimal. We then notice that Y = 1 with X1 orthogonal will result in the same
0 0
objective function value. This follows from the fact that
([ ]) ([ ]T [ ])
X 1 X1 X 1 ( ) 1 ( )
J̃ 1
= tr D2 1 = tr X1T D21 X1 = tr X1 X1T D21 ,
0 2 0 0 2 2
where D = blkdiag(D1 , D2 ). Hence, the PCA picks out the components of xi corresponding to the
m largest singular values of X T .
Example 9.8 In this example, we will perform PCA analysis on the Fisher Iris flower data set
[39]. It contains measurements of four different characteristics of three different iris species. There
are 150 rows in the data set. Each subset of 50 rows corresponds to three different iris species. Each
of the four columns corresponds to the a different characteristic. We preprocess the data by sub-
tracting the mean value of each column from all the values in that column. We also divide all values
in a column with the standard deviation of its column values. This is then the X-matrix that we use
for PCA. Hence, each row of it is a scaled observation xiT . We compute the two principal compo-
nents corresponding to the largest singular values, and then compute the compressed data ci ∈ ℝ2 .
In Figure 9.9, we have plotted the first component of each ci versus its second component. The
different species are marked differently in the plot. The PCA analysis makes it possible to visualize
high-dimensional data in low-dimensional plots and, hence, makes the data more understandable.
𝑐𝑖,2
−2
−3 −2 −1 0 1 2 3
𝑐𝑖,1
A quantity that is closely related to entropy is mutual information. For a joint pdf r ∶ ℝm × ℝn → ℝ+
of two random variables with marginal pdfs p ∶ ℝm → ℝ+ and q ∶ ℝn → ℝ+ we define the mutual
information I ∶ m+n → ℝ+ as
r(x, y)
I(r) = r(x, y) ln dxdy,
∫ℝm ×ℝn p(x)q(y)
where m+n is defined as in Section 9.9.
variables Z = WX and Y = Z + E, where W ∈ ℝm×n with m < n.6 We can interpret Z as information
that is transmitted over a channel W with additive noise E. We would like to choose W to maximize
the mutual information between Y and Z, i.e. between what is transmitted and what is received.
We realize that (Y , Z) has a zero mean normal pdf r with covariance
[ ]
WΣW T + I WΣW T
.
WΣW T WΣW T
We let p and q be the marginal pdfs for Y and Z and define J ∶ ℝm×n → ℝ+ as
J(W) = I(r).
Z̄ D2 Ȳ = 0,
T
[ ]
where Z̄ is such that Ȳ Z̄ is orthogonal. The second equation follows since I + Ȳ D2 Ȳ is invertible
T
for all Ȳ . From the formula A(I + A)−1 = I − (I + A)−1 , we may rewrite the above equations as
( )−1
I − I + Ȳ D2 Ȳ
T
− Λ = 0,
Z̄ D2 Ȳ = 0.
T
6 Note that the dimensions of X and Y are not the same as the dimensions of x and y in the definition of mutual
information in Section 9.14.
9.15 Mutual Information 285
This now shows that the first equation has a solution in terms of Λ for any Ȳ , since
( )−1
I − I + Ȳ D2 Ȳ is symmetric for any Ȳ . Hence, the optimality conditions simplify to the
T
[ ]T [ ]
Ȳ Z̄ Ȳ Z̄ = I.
These are similar to the optimality conditions for PCA if we identify Σ with X T X.7 However, the
objective functions are not the same, and therefore, we have to proceed slightly differently.
[ ] [ ]
As in Section 9.14, Ȳ Z̄ = I is a solution to the optimality conditions and so is Ȳ Z̄ =
blkdiag(X1 , X2 ) with X1 ∈ ℝm×m and X2 ∈ ℝ(n−m)×(n−m) orthonormal. It is straightforward to verify
that the objective function evaluates to
1∑ (
m
)
J̃ (Ȳ ) = ln 1 + d2k .
2 k=1
[ ]
Can we consider more general Ȳ Z̄ ? Any orthogonal matrix with determinant equal to one can
be written as a product of Givens rotations. Notice that there is no restriction in assuming the
determinant to be equal to one, since we multiply both from left and right. A Givens rotation is
defined as G ∶ ℕn × ℕn × [0, 2𝜋] → ℝn×n , where
⎡1 ··· 0 ··· 0 ··· 0⎤
⎢⋮ ⋱ ⋮ ⋮ ⋮⎥
⎢ ⎥
⎢0 ··· c · · · −s ··· 0⎥
G(i, j, 𝜃) = ⎢ ⋮ ⋮ ⋱ ⋮ ⋮⎥,
⎢ ⎥
⎢0 ··· s ··· c ··· 0⎥
⎢⋮ ⋮ ⋮ ⋱ ⋮⎥
⎢ ⎥
⎣0 ··· 0 ··· 0 ··· 1⎦
where c = cos 𝜃, s = sin 𝜃, and where c and s are positioned on the ith and jth rows and columns,
cf . (2.17). We assume that i < j. From this, it follows that
2
⎡ d1 ··· 0 ··· 0 ··· 0 ⎤
⎢⋮ ⋱ ⋮ ⋮ ⋮ ⎥
⎢ ⎥
⎢0 ··· c2 d2i + s2 d2j ··· −scd2i + scd2j ··· 0 ⎥
G(i, j, 𝜃) D G(i, j, 𝜃) = ⎢ ⋮
T 2
⋮ ⋱ ⋮ ⋮ ⎥.
⎢ ⎥
⎢0 ··· −scd2i + scd2j ··· s2 d2i + c2 d2j ··· 0 ⎥
⎢⋮ ⋮ ⋮ ⋱ ⋮ ⎥
⎢ ⎥
⎣0 ··· 0 ··· 0 · · · d2n ⎦
We understand that it is only if i ≤ m and j > m that we do not have the case G(i, j, 𝜃) =
blkdiag(X1 , X2 ) discussed above. For these values of i and j, the top left m × m-dimensional block
of the above matrix is given by
7 We will discuss this in more detail later on. There should actually be a normalization with N.
286 9 Unsupervised Learning
We realize that the Givens rotation replaces the ith diagonal element d2i with c2 d2i + s2 dj , which is a
convex combination of the values d2i and d2j . Since dj < di it follows that c2 d2i + s2 dj < d2i , and hence,
the Givens rotation has resulted in a smaller value of the objective function value. Any further rota-
tions for which i ≤ m and j > m will only further decrease the objective function value. Any [ ]other
X
rotation will not affect the objective function value. From this, we realize that all Ȳ = 1 with
0
X1T X1 = I are optimal, and hence, maximizes the mutual information under the constraint that
Ȳ Ȳ = I. There are actually several local stationary points. The reason for this is that for 𝜃 = k𝜋∕2,
T
k ∈ ℕ it holds that −scd2i + scd2j = 0, and hence, Z̄ DȲ = 0 irrespective of what values i and j have.
T
Z̄ D2 Ȳ = 0,
T
[ ] [ ]
where Z̄ is such that Ȳ Z̄ has full column rank. Notice that Ȳ Z̄ is not necessarily a square
matrix. Similarly as above, we may rewrite the above equations as
( )−1
I − I + Ȳ D2 Ȳ − Ȳ Ȳ diag(𝜆) = 0,
T T
Z̄ D2 Ȳ = 0.
T
[ ]
It is straightforward to verify that Ȳ Z̄ = I satisfy also these optimality conditions. However, any
[ ]
orthogonal Ȳ Z̄ will not satisfy them, since for orthogonal Ȳ it must hold that I + Ȳ DȲ is diago-
T
nal for there to exist 𝜆 that satisfies the equation. There are however nonorthogonal Ȳ that satisfy
[ ]T
them. Consider Ȳ = 𝟙 0 , where 𝟙 ∈ ℝm is a vector of all ones. This means that all yi = e1 , where
e1 is the first basis vector in ℝn . It is straightforward to verify that 𝜆 = d21 ∕(1 + md21 )𝟙 satisfies the
necessary optimality conditions for this Ȳ . It actually holds that we may take yi equal to any basis
vector for ℝn , and that they may be linearly dependent.
Now, the question arises if such a Ȳ where some of the columns are not linearly independent may
be optimal and not only constitute a stationary point. To investigate this, we consider the case when
[ ]T
m = 2, d1 > d2 and let Ȳ = 𝟙 0 . Then ln det (I + Ȳ D2 Ȳ ) = ln(1 + 2d21 ) ≈ 2d21 for small values
T
[ ]
I
of d21 . For Ȳ = we have ln det (I + Y T D2 Y ) = ln(1 + d21 + d22 + d21 d22 ) ≈ d21 + d22 + d21 d22 for small
0
values of d21 and d22 . It is easy to find values of d1 > d2 for which the second approximation is smaller
9.15 Mutual Information 287
2𝜋
3𝜋
2
𝜋
𝜃2
𝜋
2
0 𝜋 𝜋 3𝜋 2𝜋
2 2
𝜃1
Figure 9.10 Level curves for the objective function between 2.15 (dark gray) and 2.4 (light gray).
than the first. Hence, we realize that when the signal-to- noise ratio is small, i.e. when Σ is small
as compared to I, then it is better to only consider the signal with the largest variance. In general
there are cases in-between, where one should pick out more than one of the largest, but not all m.
Also there are cases when it optimal for the vectors yi to have an angle in-between them such that
they are neither aligned nor orthogonal [28]. [A small] example with
[ m= ] n = 2, d1 = 2, and d2 = 1 is
cos 𝜃1 cos 𝜃2
visualized in Figure 9.10, where we let y1 = and y2 = for 𝜃1 , 𝜃2 ∈ [0, 2𝜋] and plot
sin 𝜃1 sin 𝜃2
the level curves of the objective function values. It is seen that
[ ] [ ] [ ] [ ]
1 1 1 0 0 1 0 0
Ȳ = ; ; ;
0 0 0 1 1 0 1 1
correspond to saddle points. There are several optima, and one is given by
[ ]
̄ 0.83 0.83
Y= ,
−0.56 0.56
which corresponds to 𝜃1 = 5.7 and 𝜃2 = 0.6. Note that the other optima have the property[ that ] the
X
angle between y1 and y2 are all the same. If the signal-to-noise ratio is large, then Ȳ = 1
with
0
X1 having orthonormal columns satisfies the necessary optimality conditions also without orthog-
onality constraints. This follows from the fact that the first equation of the optimality conditions in
the limit when D is much larger than I is given by
I − Ȳ Ȳ diag(𝜆) = 0.
T
The problem discussed is a special case of optimization on a matrix manifold, which is discussed
in more detail in [1].
1 ̄T ̄
matrix of X with Σ = N
X X, where
⎡ x1T ⎤
⎢ T⎥
⎢x ⎥
X̄ = ⎢ 2 ⎥ .
⎢⋮⎥
⎢xT ⎥
⎣ N⎦
We then let X̄ = UDV T be a singular value decomposition of X. From this it follows that
( 2
)
1 TD
J(W) = ln det I + Ȳ Ȳ ,
2 N
where Ȳ = V T W T . The approximation of X̄ will be XW
̄ T W, where W T W is not necessarily a projec-
̄ T W will still be m. Notice that we cannot relax the condition
tion matrix. However, the rank of XW [ T]
̄ 𝟙
on orthogonality in the principal component analysis without obtaining Y = as the optimal
0
8
solution. The signal-to-noise ratio will not help in making the correct choice from an information
point of view. This is why optimizing mutual information seems to be more appropriate.
1∑ ∑ ∑
K
f (C) = d(xi , xj ),
2 k=1 i∈ j∈
k k
where k = {i ∈ ℕN ∶ C(i) = k}. Define the mean vectors for each cluster as
1 ∑
mk = x , k ∈ ℕK ,
Nk i∈ i
k
1∑ ∑ ∑ ∑ ∑
K K
f (C) = d(xi , xj ) = Nk d(xi , mk ),
2 k=1 i∈ j∈ k=1 i∈
k k k
∑
if d is the squared Euclidian norm. Notice that for any yk ∈ ℝn , it holds that i∈k d(xi , mk ) ≤
∑
i∈k d(xi , yk ) with equality for yk = mk and hence, an equivalent optimization problem is
minimize F(C, y1 , … , yK ),
8 Notice that what is called X̄ and Ȳ in this section is called X and Y in the section on PCA. The reason is that we
used X and Y for random variables in this section.
Exercises 289
𝑥2
2
0
0 1 2 3 4 5 6
𝑥1
The sets k are functions of the encoder C. The so-called K-means algorithm tries to solve the above
formulation using block coordinate descent, i.e. it iteratively solves the following two optimization
problems:
1. For fixed C, minimize F(C, y1 , … , yK ) with respect to y1 , … , yK .
2. For fixed (y1 , … , yK ) minimize F(C, y1 , … , yK ) with respect to C.
The first problem is a least-squares problem with solution equal to the average of the xi over
i ∈ k , i.e.
1 ∑
yk = x.
Nk i∈ i
k
This is the reason for the name K-means. The second problem also has an explicit solution given
by assigning observation xi to cluster k if d(xi , yk ) ≤ d(xi , yj ) for all j ≠ k. The algorithm can unfor-
tunately be trapped in local minima.
Example 9.9 We now consider a problem in two dimensions for which N = 30. We want to per-
form the clustering using K = 3. We initialize yk as the first three values of xi that we are given.
They actually come from the same cluster, showing that initialization is not extremely critical. The
result is shown in Figure 9.11, where we see that we get excellent clustering.
Exercises
9.1 Consider the special case of finding a Chebyshev bound as detailed in Section 9.1 when
S = ℝ+ , C = [1, ∞), f0 (x) = 1 and f1 (x) = x. Assume that it is known that 𝔼f1 (X) = 𝔼X = 𝜇,
where 0 ≤ 𝜇 ≤ 1. Show that this implies the so-called Markov bound
ℙ[ X ≥ 1] ≤ 𝜇.
290 9 Unsupervised Learning
9.2 Show how the function h in (9.5) for the Ising distribution can be derived by back-
substitution of 𝜇 = 1 − A(𝜆, Λ) into the Lagrange dual function.
Hint: You need to first obtain the Lagrange dual function by back substitution of p into the
Lagrangian.
9.3 Consider minimizing (9.12) with respect to (A, b). Show that the solution is the same as the
solution obtained by minimizing the criterion in (9.11) if mx , my , Dx , and Dxy are replaced
with their sample averages
1∑ 1∑
N N
̂x =
m x, ̂y=
m y,
N i=1 i N i=1 i
and
∑ N
∑ N
̂x = 1
D x̄ x̄ T , ̂ xy = 1
D x̄ ȳ T ,
N i=1 i i N i=1 i i
respectively, where x̄ i = xi − m
̂ x and ȳ i = yi − m
̂ y.
9.4 [22, Exercise 4.57] Consider two discrete random variables X and Y , where (X, Y ) ∶ ℕn ×
ℕm → ℝ. The mutual information I ∶ mn → ℝ is defined as
∑ ∑
n m
pX,Y (i, j)
I(pX,Y ) = pX,Y (i, j) log ,
i=1 j=1
pX (i)pY (j)
where pX,Y ∶ ℕn × ℕm → [0, 1] is the joint probability function, and where pX ∶ ℕn → [0, 1]
and pY ∶ ℕm → [0, 1] are the marginal probability functions for X and Y , respectively.
(a) Show that the mutual information can be expressed as
∑ ∑
n m
pY |X (j|i)
I(pX,Y ) = pX (i)pY |X (j|i) log ∑n ,
i=1 j=1 k=1 pX (k)pY |X (j|k)
subject to y = Px,
𝟙T x = 1,
x ≥ 0,
Exercises 291
with variables (x, y). Here, is it assumed that we use log2 . Let m = 2, n = 2 and consider
the channel defined by
[ ]
1−p p
P= ,
p 1−p
and show that
C = 1 + plog2 p + (1 − p)log2 (1 − p).
9.5 Let f be a pdf defined on ℝ+ and let F be the corresponding distribution function
x ∞
F(x) = ∫0 f (t)dt. The expected value is 𝜇 = ∫0 xf (x)dx. The Lorenz curve LF ∶ [0, 1] → ℝ is
defined as
u
1
LF (u) = F −1 (x)dx.
𝜇 ∫0
If F is the income distribution in a country, then L(u) represents the fraction of the total
income which is in the hands of the uth fraction of the population with the lowest income.
The Gini index G ∶ [0, 1]ℝ → [0, 1], defined as
1 ( ) 1
G(F) = 2 u − LF (u) du = 1 − 2 LF (u)du
∫0 ∫0
is twice the area between the Lorenz curve of the line of perfect equality LF0 (u) = u. The
index is zero for perfect equality and close to one if most of the income is with a very small
portion of the population. We are interested in finding the probability density function f
that maximizes the entropy for given values of the mean 𝜇 and of the Gini index 𝛾, i.e. we
want to solve
∞
minimize f (x) ln f (x)dx,
∫0
∞
subject to f (x)dx = 1,
∫0
∞
xf (x)dx = 𝜇,
∫0
G(F) = 𝛾.
Plot the probability density functions obtained from the maximum entropy solution for
the three countries.
9.6 In this exercise we consider the static ranking of web pages in Example 9.2 and the corre-
sponding maximum entropy problem.
(a) Show that the solution of the maximum entropy problem is pij = exp(−𝜆i + 𝜆j )∕Φ(𝜆),
∑
(i, j) ∈ E, where Φ(𝜆) = (i,j)∈E exp(−𝜆i + 𝜆j ), and where 𝜆 ∈ ℝn are the natural param-
eters. Notice that you get different answers depending on how you rewrite the equilib-
rium conditions as equations with a zero right-hand side.
Exercises 293
(b) Let ai = exp(−𝜆i ), i ∈ V and let A = diag(ai ). Moreover, define the sparse matrix
M ∈ ℝn×n as
{
Φ(𝜆)−1 , (i, j) ∈ E,
Mij =
0, (i, j) ∉ E.
Show that P = AMA−1 .
(c) The so-called iterative scaling or matrix balancing approach for finding the optimal solu-
tion to the maximum entropy problem is based on the expression above for P and the
∑
fact that Φ(𝜆) = (i,j)∈E ai ∕aj . It is an iterative algorithm and can be described as follows,
where superscripts are used to denote iteration index. Given a tolerance 𝜖 > 0, initialize
∑
as a(0) = 1.0, Z (0) = (i,j)∈E 1, and then repeat starting with k = 0:
i ( )
1. p(k)
ij
= a(k)
i
∕ a(k)j
Z (k) , (i, j) ∈ E
∑
2. 𝜌(k)
i
= j∶(i,j)∈E p(k)
ij
, i∈V
∑
3. 𝜎i(k) = j∶(j,i)∈E p(k)
ji
, i∈V
( )1∕2
4. 𝜂i(k) = 𝜎i(k) ∕𝜌(k)
i
, i∈V
i.e. choose 𝜖 for the iterative scaling approach to match the accuracy you get from
MOSEK with its default settings.
(f) It is also possible to solve the maximum entropy problem using an interior-point solver.
An especially good solver for this problem can be downloaded at http://web.stanford
.edu/group/SOL/software/pdco/. Compare the performance of this solver with the two
previous ones for the static ranking of web pages. Also, here try to make the comparison
fair with respect to tolerances. Check the link to the presentation at ISMP2003 on the
above web page. It seems that they are able to solve maximum entropy problems for
web traffic with roughly 50 000 nodes using MATLAB. Are you able to?
9.7 Consider a primitive clinic in a village. People in the village have the property that they are
either healthy or have a fever. They can only tell if they have a fever by asking the doctor in
the clinic. The doctor makes a diagnosis of fever by asking patients how they feel. Villagers
only answer that they feel normal, dizzy, or cold. This defines a HMM (X, Y ) with Xk ∈ =
{𝛼1 , 𝛼2 } and Yk ∈ = {𝛽1 , 𝛽2 , 𝛽3 }, where 𝛼1 =healthy, 𝛼2 =fever, 𝛽1 =normal, 𝛽2 =cold, and
𝛽3 =dizzy. Introduce the notation:
Also, let the matrices A and B be defined such that element (i, j) of the matrix is equal to aij
and bij , respectively. Consider the case when
[ ] [ ]
0.7 0.3 0.5 0.4 0.1
A= , B= ,
0.4 0.6 0.1 0.3 0.6
and assume that ℙ[X0 = 𝛼1 ] = 0.6 and ℙ[X0 = 𝛼2 ] = 0.4. The doctor has for a patient
observed the first day normal, the second day cold, and the third day dizzy. What is the
most likely value of the condition for the patient for the different days?
9.8 Show that the smoothing problem in Section 9.4 can be formulated as an optimal control
problem
∑
N
minimize 𝜙(x0 ) + fk (xk , uk ),
k=1
subject to xk−1 = Fk (xk , uk ), k ∈ ℕN ,
with variables (x0 , u1 , … , uN , xN ) for some fk and Fk . Notice that the time index is running
in reverse order as compared to a standard control problem.
Hint: Take the logarithm of the joint pdf pX̄ N ,Ȳ N .
9.9 Recall the Gaussian mixture model from Section 9.11, which involves a pdf of the form
∑
k
fY (y; 𝜃) = 𝛼j (y, 𝜇j , Σj ),
j=1
9.10 Implement the EM algorithm for estimating the model parameters of a Gaussian mix-
ture model; cf . Section 9.11. Test your implementation on a set of samples from a
one-dimensional Gaussian mixture model with three components. The following MATLAB
code illustrates how to generate m = 1000 samples y1 , … , ym from a mixture of three
univariate Gaussian distributions:
⎡ 1 𝛾 1⎤
Σ = ⎢𝛾 2 1⎥
⎢ ⎥
⎣ 1 1 3⎦
is positive semidefinite and such that its inverse has a zero in the position of the variable 𝛾.
9.12 In this exercise, we will perform PCA analysis on the Fisher Iris flower data that we investi-
gated in Example 9.8. The data set can be download in MATLAB with the command load
fisheriris.mat. It contains measurements of four different characteristics of three dif-
ferent iris species in each column. There are 150 rows in the data set. Each subset of 50 rows
corresponds to three different iris species. You should preprocess the data by subtracting
the mean value of each column from all the values in that column. Then you should divide
all values in a column with the standard deviation of its column values. This is then the
X-matrix that you should use for PCA. Hence, each row of it is an observation xiT . You should
compute the two principal components corresponding to the largest singular values, and
then compute the compressed data ci ∈ ℝ2 . Finally, plot for each ci its second component
versus its first component. Make sure to mark the different species differently in your plot.
You should obtain the same plot as in Figure 9.9. Would it have been possible to separate the
different species from one another using this plot in case you had not known from which
species the data originated?
9.13 In this exercise you are asked to implement the K-means algorithm in Section 9.16 for cluster
analysis. You should try out the algorithm on the Fisher Iris flower data set that you used in
Exercise 9.12. Preprocess the data set in the same way as you did in that exercise, i.e. make
sure that all columns have zero mean and unit standard deviation.
296 9 Unsupervised Learning
(a) Try the algorithm using two and three clusters, respectively, i.e. for the cases K = 2, 3.
How does the algorithm cope? You do not need to make any plots. It is enough to inves-
tigate the resulting encoders, and how good they perform.
(b) Instead of using the 4-dimensional data that you used in (a) instead use the
2-dimensional data ci that resulted from the PCA analysis in Exercise 9.12. Try
the cases K = 2, 3.
(c) Instead of using the 2-dimensional data that you used in (b) instead use a 3-dimensional
data ci that you obtain from PCA analysis similarly as in Exercise 9.12 by instead using
the three principal components corresponding to the three largest singular values. Try
the cases K = 2, 3 again.
(d) Relate your results in (a)–(c) to what you visually observed in Exercise 9.12.
297
10
Supervised Learning
In this chapter, we will discuss supervised learning. What distinguishes supervised learning
problems from unsupervised learning problems is that the data come in pairs, i.e. we may say
(xk , yk ) ∈ ℝ × ℝ for k ∈ ℕN and we would like to find a relationship between the pairs of data. We
will start with linear regression. This does not mean that the data pairs are related to one another
in a linear way. Instead, it is the class of functions that we consider that is parameterized in a linear
way. First, we will do this in a finite-dimensional space, and there we will also discuss statistical
interpretations and generalizations such as maximum likelihood estimation, maximum a poste-
riori estimation, and regularization. We will then also do regression in an infinite-dimensional
space, i.e. in a Hilbert space. We will see that this is equivalent to maximum a posteriori estimation
for so-called Gaussian processes. Then we will discuss classification both using linear regression,
logistic regression, support vector machines, and the restricted Boltzmann machine. The chapter
is finished off with artificial neural networks and the so-called back-propagation algorithm. We
also discuss a form of implicit regularization known as dropout.
1 ∑( T
N
)2
minimize a 𝛽(xk ) − yk (10.2)
2 k=1
Optimization for Learning and Control, First Edition. Anders Hansson and Martin Andersen.
© 2023 John Wiley & Sons, Inc. Published 2023 by John Wiley & Sons, Inc.
Companion Website: www.wiley.com/go/opt4lc
298 10 Supervised Learning
with variable a ∈ ℝm . This is a linear LS problem. We may consider more general functions than
monomials, i.e. we let 𝜑j ∶ ℝ → ℝ, j ∈ ℕm , be any functions and define
∑
m
f (x) = aj 𝜑j (x) = aT 𝛽(x),
j=1
( )
with 𝛽(x) = 𝜑1 (x), … , 𝜑m (x) . Often, one lets 𝜑1 (x) = 1. It is also possible to generalize the
regression model to x ∈ ℝn . We just need to define 𝜑j ∶ ℝn → ℝ, j ∈ ℕm , and 𝛽 ∶ ℝn → ℝm .
An important special case is when n = m and 𝛽(x) = x.
Another generalization is when f (x) is vector-valued. We consider the case when f ∶ ℝn → ℝp is
given by
f (x) = A𝛽(x),
( )
where A ∈ ℝp×m , 𝛽 ∶ ℝn → ℝm with 𝛽(x) = 𝜑1 (x), … , 𝜑m (x) and 𝜑j ∶ ℝn → ℝ, j ∈ ℕm . The LS
criterion is then the sum of the LS criteria for each row in the regression model, and the LS problem
can be written as
1∑ ‖
N
y − A𝛽(xk )‖
2
minimize
2 k=1 ‖ k ‖2 ,
with variable A. The solution to this problem is closely related to the SAA approximation of the
single-stage stochastic optimization problem for the affine predictor in Section 9.3. This is one of
the motivations for calling f (x) a predictor – it can be used to predict values of yk when only xk is
known.
Example 10.1 We are given pairs of data (xk , yk ) that happen to satisfy yk = sin xk . We are not
aware of this relationship, and instead, we want to find a linear regression model as in (10.1) that
solves (10.2) for 𝛽(x) = (1, x, x2 , x3 , x4 ). We solve the resulting normal equations as in (5.3), and then
plot the resulting polynomial and compare it with the 11 points we used for fitting, see Figure 10.1.
The fit is pretty good inside the interval of the available data. We do not expect it to be very good
outside this interval. The polynomial is only of fourth degree.
0.5
0
𝑦
−0.5
−1
0 1 2 3 4 5 6
𝑥
Figure 10.1 Plot showing the fourth degree polynomial (solid line), and the 11 data points (+), used to fit
the polynomial to a sinusoidal.
10.1 Linear Regression 299
ℒ(y_1, …, y_N; a) = ∏_{k=1}^N f_{Y_k}(y_k) = ∏_{k=1}^N f_{E_k}( y_k − a^T β(x_k) ).
pdf f_A : ℝ^m → ℝ_+. It is natural to assume that A and E_k are independent, in which case the joint
pdf f_{Y,A} : ℝ^N × ℝ^m → ℝ_+ is given by
f_{Y,A}(y, a) = f_A(a) ∏_{k=1}^N f_{E_k}( y_k − a^T β(x_k) ),
with variable a. Compared to the ML problem, the only difference is that we have added the term
− ln fA (a) to the objective function. This can be interpreted as a regularization of the LS problem.
Different pdfs fA result in different regularizations, and hence, reflect different prior knowledge.
We now consider the generalization of regression to infinite dimension, i.e. we let β(x) =
(φ_1(x), φ_2(x), …), where φ_i : ℝ^n → ℝ, i ∈ ℕ. For a = (a_1, a_2, …), where a_i ∈ ℝ, i ∈ ℕ, we say that
a ∈ ℓ_2 if ∑_{i=1}^∞ a_i² < ∞. We define the inner product ⟨·, ·⟩_{ℓ_2} : ℓ_2 × ℓ_2 → ℝ as ⟨a, b⟩_{ℓ_2} = ∑_{i=1}^∞ a_i b_i. The
corresponding norm ||·||_{ℓ_2} : ℓ_2 → ℝ_+ is defined by ||a||²_{ℓ_2} = ⟨a, a⟩_{ℓ_2}. For functions f : ℝ^n → ℝ,
we define the space of square integrable functions L_2, i.e. the set of functions such that
∫_{ℝ^n} f(x)² dx < ∞.
We define the inner product ⟨·, ·⟩_{L_2} : L_2 × L_2 → ℝ by
⟨f, g⟩_{L_2} = ∫_{ℝ^n} f(x) g(x) dx.
The corresponding norm ||·||_{L_2} : L_2 → ℝ_+ is defined by ||f||²_{L_2} = ⟨f, f⟩_{L_2}. We remark that both ℓ_2
and L_2 are Hilbert spaces. Now, suppose that φ_i ∈ L_2, i ∈ ℕ, is a family of orthonormal functions, i.e.
minimize   (1/2) ∑_{k=1}^N ( y_k − f(x_k) )² + (ν/2) ||f||²_{L_2},
minimize   (1/2) ∑_{k=1}^N e_k² + (ν/2) ||a||²_{ℓ_2}
subject to  e_k = y_k − ∑_{i=1}^∞ a_i φ_i(x_k),   k ∈ ℕ_N,
with variables (a, e), where e = (e1 , … , eN ). To ease the notation, we will write this as
minimize   (1/2) e^T e + (ν/2) a^T a
subject to  e = y − Xa,
where X is the infinite-dimensional matrix given by
X = [ β^T(x_1); ⋮ ; β^T(x_N) ].
We then introduce the Lagrangian L ∶ ℝN × 𝓁2 × ℝN → ℝ defined by
L(e, a, λ) = (1/2) e^T e + (ν/2) a^T a + λ^T (e − y + Xa).
Completing the squares in the Lagrangian results in
L(e, a, λ) = (1/2)(e + λ)^T(e + λ) + (ν/2)( a + (1/ν) X^T λ )^T ( a + (1/ν) X^T λ )
             − (1/2) λ^T λ − (1/(2ν)) λ^T X X^T λ − λ^T y.
Thus, for a given λ, the Lagrangian is minimized by taking
e = −λ,   a = −(1/ν) X^T λ,
which yields the Lagrange dual function g : ℝ^N → ℝ defined by
g(λ) = −(1/2) λ^T λ − (1/(2ν)) λ^T X X^T λ − λ^T y.
We call this function the kernel function. We then define the matrix 𝒦 ∈ 𝕊^N with elements
𝒦_{ij} = K(x_i, x_j). We may then write the dual optimization problem as
maximize   −(1/2) λ^T ( I + (1/ν) 𝒦 ) λ − λ^T y,
with variable λ, which has solution
λ = −( I + (1/ν) 𝒦 )^{−1} y.
We may also write f(x) in terms of the kernel function as
f(x) = a^T β(x) = −(1/ν) λ^T X β(x) = −(1/ν) ∑_{k=1}^N λ_k K(x_k, x).
The fact that the solution to this infinite-dimensional regression problem can be obtained from a
finite-dimensional optimization problem by introducing the kernel function is sometimes called
the kernel trick.
From the above expression for the regressor f (x), we realize that one could just as well start with a
series expansion in terms of the kernel function. A natural question that arises is for what functions
K there are orthonormal 𝜑i . The answer is that K should be a positive semidefinite kernel, i.e. for
any N ∈ ℕ, and any ci ∈ ℝ and xi ∈ ℝn , i ∈ ℕN , it should hold that
∑_{i=1}^N ∑_{j=1}^N c_i c_j K(x_i, x_j) ≥ 0.
This result is known as Mercer's theorem, [94, p. 96]. Popular choices are dth degree polynomials
given by K(x, x̄) = (1 + x^T x̄)^d and radial basis functions, i.e. functions that only depend on
||x − x̄||₂. The dth degree polynomials correspond to finitely many orthonormal φ_i. The Gaussian
radial basis kernel
K(x, x̄) = exp( −(1/(2σ²)) ||x − x̄||₂² ),
where σ ∈ ℝ_{++}, corresponds to an infinite series of orthonormal φ_i [94].
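To make the kernel trick concrete, the following MATLAB sketch performs regression with the Gaussian radial basis kernel using the dual solution derived above. It is not taken from the book; the data, the kernel width σ, and the regularization parameter ν are assumptions.

xk = linspace(0, 2*pi, 11)'; yk = sin(xk);      % assumed training pairs
nu = 1e-3; sigma = 1;                           % assumed parameters
K = exp(-(xk - xk').^2/(2*sigma^2));            % kernel matrix with entries K(x_i, x_j)
lambda = -(eye(11) + K/nu) \ yk;                % dual solution lambda
f = @(x) -(1/nu)*sum(lambda.*exp(-(xk - x).^2/(2*sigma^2)));  % regressor f(x)
f(1.0)                                          % evaluate the regressor at x = 1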
f_{A|Y}(a|y) = ( 1/√((2π)^n det Σ_{a|y}) ) exp( −(1/2)(a − μ_{a|y})^T Σ_{a|y}^{−1} (a − μ_{a|y}) ),
where 𝜇a|y = Σa X T R−1 y is the conditional mean for A given Y = y. Clearly, the conditional pdf is
maximized by a = 𝜇a|y , which proves that the MAP estimate is the conditional mean for Gaussian
distributions, as we saw in Section 9.3.
We realize that Σa only appears in terms of expressions of the form zT Σa z̄ for z ∈ ℝn and z̄ ∈ ℝn .
Because of this we could instead write the predictor in terms of the covariance function K ∶ ℝn ×
ℝ^n → ℝ_+ defined by K(z, z̄) = z^T Σ_a z̄. We then let 𝒦 ∈ 𝕊_+^N be defined by 𝒦_{ij} = K(x_i, x_j), and we let
η_k, k ∈ ℕ_N, be defined by η_k = K(x_k, x). Then the predictor can be written as
f(x) = η^T ( Σ_e + 𝒦 )^{−1} y.
This is the same predictor as we obtain when we do regression in Hilbert spaces in Section 10.2 if
we let Σe = 𝜈I and if we define the orthonormal functions 𝜑i such that the resulting kernel func-
tion is K(z, z̄ ) = zT Σa z̄ . We notice that this is the covariance between AT z and AT z̄ . Thus, we can
generalize the regression model in (10.4) by instead considering
Y_k = ℱ(x_k) + E_k,
with ℱ : ℝ^n → ℝ, where we specify the covariance between ℱ(z) and ℱ(z̄) by specifying the covari-
ance function for any z ∈ ℝ^n and z̄ ∈ ℝ^n. We still assume that the joint distribution is Gaussian.
This defines a zero mean real-valued Gaussian random process on ℝn . Such a process is some-
times called a random field, cf . Section 3.9. Notice that any of the kernel functions discussed in
Section 10.2 may be used. The Gaussian radial basis function is called the squared exponential
covariance function. Another common covariance function is
K(z, z̄) = exp( −(1/σ) ||z − z̄||₂ ),
which defines the so-called Ornstein–Uhlenbeck process, where σ ∈ ℝ_{++}. When the covariance
function only depends on z − z̄, the process is stationary, and when it only depends on ||z − z̄||₂ it is
also isotropic. Stationary and isotropic together is sometimes called homogeneous. In practice, these
properties reflect the differences, or rather the lack of them, in the behavior of the process given
the location of the observer. Actually, there are many more possibilities, see [94]. We specifically
Figure 10.2 Plot showing the Gaussian regression model (solid line), and the 11 data points (+), used to
compute the model.
Example 10.2 Here we again consider data from a sinusoidal, but this time collected from
three periods of the sinusoidal. We take Σe = 0, since we have no measurement errors of
the sinusoidal. We use the squared exponential covariance function with parameter 𝜎 = 1. In
Figure 10.2, the predicted values together with the 11 data points used to compute the predictor are
shown.
10.4 Classification
In classification problems, we are given pairs of data (xk , yk ) for k ∈ ℕN with xk ∈ ℝn , and where
yk is qualitative in the sense that it belongs to a discrete finite set with cardinality K. Without
loss of generality, we may take this set to be ℕK . We say that the data (xk , yk ) belongs to class l
if yk = l. These types of data are sometimes also called categorical or discrete as well as factors.
We are interested in finding functions fl ∶ ℝn → ℝ, l ∈ ℕK , that are such that fl (xk ) > 0 if yk = l
and f_l(x_k) < 0 if y_k ≠ l. The set {x ∣ f_l(x) = 0} then separates class l from the other classes. The left
plot of Figure 10.3 shows two classes of data that can easily be separated with, e.g. a straight line.
The right plot of the same figure shows two classes that cannot be separated with any connected line.
z_l(x) = a_l^T x + b_l,
Figure 10.3 Plots showing data points from two classes marked with + for the first class and ⚬ for the
second class. In the left plot, the data points for the two classes are well separated. In the right plot, the
data points for the two classes are mixed.
where al ∈ ℝn and bl ∈ ℝ. The objective is to choose the regression parameters (al , bl ) such that zl
is close to one if x belongs to class l and otherwise close to zero. We obtain this by solving the LS
problem
minimize   (1/2) ∑_{l=1}^K ( ∑_{k: y_k = l} ( a_l^T x_k + b_l − 1 )² + ∑_{k: y_k ≠ l} ( a_l^T x_k + b_l )² ),
with variables a = (a1 , … , aK ) and b = (b1 , … , bK ). Then we use the functions 𝛿l ∶ ℝn → ℝ defined
by 𝛿l (x) = zl (x) as discriminant functions, i.e. we classify x to belong to class l if 𝛿l (x) > 𝛿k (x) for all
k ≠ l. Hence, we get fl (x) = 𝛿l (x) − max k≠l 𝛿k (x). Unfortunately, this method is prone to give bad
results for K ≥ 3. However, there is a simple remedy which is to consider polynomial regression
models, and it might be necessary to have polynomial terms up to degree K − 1, see [54]. More gen-
eral basis functions may also be considered in this framework just as for the curve fitting problem
in Section 10.1. The LS problems will still be linear.
Example 10.3 We consider an example where K = 2 and where n = 2. There are in total 20 data
points from each class, and hence, N = 40. In Figure 10.4, the data points are shown together with
the line defined by 𝛿1 (x) = 𝛿2 (x), which separates the two classes.
where we consider pl ∶ ℝK−1 → [0, 1] for l ∈ ℕK to be functions of the scalars zl ∈ ℝ. Notice that
any value of zl will make this a valid distribution. We let zl (x) = aTl x + bl as above and define the
likelihood function ℒ : ℝ^{Nn} × ℝ^{(K−1)(n+1)} → [0, 1] by
ℒ(x_1, …, x_N; a_1, …, a_{K−1}, b_1, …, b_{K−1}) = ∏_{l=1}^K ∏_{k: y_k = l} p_l( z_1(x_k), …, z_{K−1}(x_k) ).
which is a concave function. This follows from Exercise 4.4 and the fact that concavity
is preserved under an affine transformation. We use the functions 𝛿l ∶ ℝn → ℝ defined by
𝛿l (x) = pl (aT1 x + b1 , … , aTK−1 x + bK−1 ) as discriminant functions similarly as above, i.e. we let
fl (x) = 𝛿l (x) − max k≠l 𝛿k (x).
The classifiers discussed so far work with discriminant functions, and then the function fl that
separates the classes is obtained indirectly as fl (x) = 𝛿l (x) − max k≠l 𝛿k (x). In Section 10.5, we will
discuss how to work directly with the functions fl .
It is always possible to formulate feasibility problems as optimization problems, and there are
many ways of doing this. One is to define the function g ∶ ℝ → ℝ+ by
g(y) = max(0, y),
and define the optimization problem
minimize   ∑_{k=1}^N g( 1 − y_k f(x_k; a, b) )     (10.5)
with variables (a, b). Clearly, the optimal value is zero if and only if the two classes can be separated
by a hyperplane. This formulation is the basis for Hebbian learning, cf . Exercise 10.6, where we
carry out the minimization using a subgradient algorithm. The function g is called the rectifier
function or the rectified linear unit (ReLU) function.
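A minimal MATLAB sketch of this idea is given below. It is not the code used in the book; the data are synthetic, f(x; a, b) = a^T x + b is assumed, and a fixed step size is used in the subgradient method for (10.5).

rng(0);
X = [randn(20,2)+1; randn(20,2)-1];      % two assumed point clouds in R^2
y = [ones(20,1); -ones(20,1)];           % class labels in {-1, 1}
a = zeros(2,1); b = 0; t = 0.01;         % assumed fixed step size
for iter = 1:1000
    m = y.*(X*a + b);                    % margins y_k f(x_k; a, b)
    idx = m < 1;                         % terms where g(1 - y_k f(x_k; a, b)) > 0
    ga = -X(idx,:)'*y(idx);              % a subgradient with respect to a
    gb = -sum(y(idx));                   % a subgradient with respect to b
    a = a - t*ga;  b = b - t*gb;         % subgradient step
end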
with variables (a, b, 𝜉), where 𝜈 ≥ 0 can be used to make a trade-off between the amount of mis-
fit and the distance to the hyperplane. This is the formulation that is the basis of support vector
machines (SVMs). The optimization problem is a convex optimization problem and because of this
it is also tractable. We remark that in the linear regression we may use nonlinear functions of xk
instead of xk itself without changing the type of optimization problems. However, the two classes
will then not be separated by a hyperplane but by a general surface.
= (ν/2) ||a||₂² − ∑_{k=1}^N λ_k y_k a^T β(x_k) − ∑_{k=1}^N λ_k y_k b + ∑_{k=1}^N (1 − λ_k − μ_k) ξ_k + ∑_{k=1}^N λ_k.
∂L/∂a = ν a − ∑_{k=1}^N λ_k y_k β(x_k) = 0,
i.e.
a = (1/ν) ∑_{k=1}^N λ_k y_k β(x_k).
This follows from the fact that the Lagrangian is convex in a. We know from complementary
slackness that 𝜆k = 0, if the first constraint in (10.6) is satisfied strictly at optimality. This means
that the optimal solution a⋆ only depends on xk if yk f (xk ; a⋆ , b⋆ ) = 1 − 𝜉k⋆ , and these vectors xk
are called support vectors. Back substitution into the Lagrangian gives the Lagrange dual function
g ∶ ℝN × ℝN → ℝ given by
g(λ, μ) = ∑_{k=1}^N λ_k − (1/(2ν)) ∑_{k=1}^N ∑_{l=1}^N λ_k λ_l y_k y_l β^T(x_k) β(x_l),
with dom g = {(λ, μ) ∣ 1 − λ_k − μ_k = 0, k ∈ ℕ_N, ∑_{k=1}^N λ_k y_k = 0}. Since Slater's condition is fulfilled
for (10.6), we may obtain the optimal (λ, μ) by solving the dual optimization problem
maximize   g(λ, μ)
subject to  λ_k + μ_k = 1,  k ∈ ℕ_N
            ∑_{k=1}^N λ_k y_k = 0
            λ_k ≥ 0,  μ_k ≥ 0,  k ∈ ℕ_N,
with variables (𝜆, 𝜇). This is a convex optimization problem with quadratic objective function
and simple inequality constraints, which can be solved efficiently. There are actually special
purpose algorithms that are very efficient, e.g. the so-called sequential minimal optimization
algorithm [91].
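As an illustration, the dual problem can also be solved with a general-purpose QP solver. The MATLAB sketch below is not from the book and assumes β(x) = x, synthetic data, and the Optimization Toolbox function quadprog; the variable μ has been eliminated using μ_k = 1 − λ_k, which turns the constraints into 0 ≤ λ_k ≤ 1 and ∑_k λ_k y_k = 0.

rng(0);
X = [randn(20,2)+1; randn(20,2)-1];              % assumed data, beta(x) = x
y = [ones(20,1); -ones(20,1)]; nu = 20; N = 40;
H = (1/nu)*(y*y').*(X*X');                       % quadratic term of -g(lambda, mu)
f = -ones(N,1);                                  % linear term of -g(lambda, mu)
lam = quadprog(H, f, [], [], y', 0, zeros(N,1), ones(N,1));
a = (1/nu)*X'*(lam.*y);                          % recover a from the dual solution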
Example 10.4 We consider an example where we have 20 points x1 , … , x20 ∈ ℝ2 of one class and
20 points x21 , … , x40 ∈ ℝ2 of another class. We use the parameter value 𝜈 = 20 and compute a and
b for the function f (x) = aT x + b that should separate the two classes. In Figure 10.5, we see the
points together with the line f (x) = 0, which almost separates the two classes of points. It is easy
to see that they cannot be separated with a straight line, but the SVM solution only misclassifies a
few points.
f(x; a, b) = (1/ν) ∑_{k=1}^N λ_k y_k β(x_k)^T β(x) + b,
Figure 10.5 Plot showing the data points from the two classes marked with + for the first class and ⚬ for
the second class. Also shown is the line f (x) = 0 that almost separates the two classes.
Figure 10.6 Graph showing how the visible and hidden layers in an RBM are connected.
where
Z = ∑_{h,v} e^{λ_1^T v + λ_2^T h + 2 v^T Λ_{12} h}.
The convention is that if there are no limits for the summations, the sum should be taken over
all possible values of the variables. It follows that the conditional probability function for h
given 𝑣 may be factorized. To see this, notice that the marginal probability function for 𝑣 can be
written as
∑_h p(v, h) = (e^{λ_1^T v}/Z) ∑_h e^{(λ_2 + 2Λ_{12}^T v)^T h} = (e^{λ_1^T v}/Z) ∑_h ∏_i e^{(λ_2 + 2Λ_{12}^T v)_i h_i}
            = (e^{λ_1^T v}/Z) ∏_i ∑_{h_i} e^{(λ_2 + 2Λ_{12}^T v)_i h_i} = (e^{λ_1^T v}/Z) ∏_i ( 1 + e^{(λ_2 + 2Λ_{12}^T v)_i} ).
s_j(h_j, v) = e^{(λ_2 + 2Λ_{12}^T v)_j h_j} / ∏_i ( 1 + e^{(λ_2 + 2Λ_{12}^T v)_i} ),
and the overall conditional probability function is obtained as the product of these factors. The
expression of the factors can be simplified considerably. Notice that for h_j = 0 and h_j = 1 they are
given by
s_j(0, v) = 1 / ∏_i (1 + e^{x_i}),    s_j(1, v) = e^{x_j} / ∏_i (1 + e^{x_i}),
where x_i = (λ_2 + 2Λ_{12}^T v)_i. Since s_j(0, v) + s_j(1, v) = 1, we have that ∏_i (1 + e^{x_i}) = e^{x_j} + 1, and hence
s_j(1, v) = e^{x_j}/(1 + e^{x_j}) = 1/(1 + e^{−x_j}) = σ(x_j),
where 𝜎 ∶ ℝ → ℝ is the logistic function, which is an example of a so-called sigmoid function.
∂Q/∂Λ_{12} = 2 ( ∑_h s(h, v, λ⁻, Λ⁻) v h^T − ∑_ξ p(ξ_v, ξ_h) ξ_v ξ_h^T ),
where s(h, v, λ⁻, Λ⁻) = ∏_k s_k(h_k, v), and where ξ = (ξ_v, ξ_h). We let p_v(ξ_v) = ∑_{ξ_h} p(ξ_v, ξ_h) be the
marginal probability function for the observations. We may then write the gradients as
∂Q/∂λ = ∑_h s(h, v, λ⁻, Λ⁻) [v; h] − ∑_{ξ_v} p_v(ξ_v) ∑_{ξ_h} s(ξ_h, ξ_v, λ⁻, Λ⁻) [ξ_v; ξ_h]
∂Q/∂Λ_{12} = 2 ( ∑_h s(h, v, λ⁻, Λ⁻) v h^T − ∑_{ξ_v} p_v(ξ_v) ∑_{ξ_h} s(ξ_h, ξ_v, λ⁻, Λ⁻) ξ_v ξ_h^T ).
We realize that
∑_h s(h, v, λ⁻, Λ⁻) h_l = ∑_h ∏_k s_k(h_k, v) h_l = ∑_h ∏_{k≠l} s_k(h_k, v) s_l(h_l, v) h_l
                        = ∏_{k≠l} ∑_{h_k} s_k(h_k, v) ∑_{h_l} s_l(h_l, v) h_l = s_l(1, v),
where the last equality follows from the fact that s_k(0, v) + s_k(1, v) = 1 and s_l(0, v) × 0 = 0. From
this, it follows that
∂Q/∂λ_1 = v − ∑_{ξ_v} p_v(ξ_v) ξ_v
( ∂Q/∂λ_2 )_i = s_i(1, v) − ∑_{ξ_v} p_v(ξ_v) s_i(1, ξ_v)
( ∂Q/∂Λ_{12} )_{i,j} = 2 ( s_j(1, v) v_i − ∑_{ξ_v} p_v(ξ_v) s_j(1, ξ_v) (ξ_v)_i ).
The first term in each of the expressions is cheap to evaluate. The second terms can be approximated
using Monte Carlo methods. In [38], the so-called contrastive divergence method based on Gibbs
sampling is described. We want to draw a sample 𝜉𝑣 from p𝑣 . We initialize the Gibbs sampler with
𝜉𝑣(0) = 𝑣, cf . Section 9.12. Because of the graphical structure of the RBM, we then draw a sample
h(1) from the conditional distribution of h given 𝑣, i.e. from sj (hj , 𝜉𝑣(0) ) above. We then finally draw a
sample 𝜉𝑣(1) from the conditional distribution of 𝑣 given h, where we use h = h(1) . This conditional
distribution can be obtained similarly to sj (hj , 𝑣) above. Here we have run the Gibbs sampler for
only one step, but it is, of course, possible to run more steps. However, empirical evidence suggests
that one step is enough. After the samples have been obtained, we approximate the above sums as
ξ_v^{(1)},   s_i(1, ξ_v^{(1)}),   s_j(1, ξ_v^{(1)}) (ξ_v^{(1)})_i,
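A minimal MATLAB sketch of one contrastive divergence step is given below. It is not the book's implementation; the sizes, the initialization, and the step size are assumptions, and the conditional distribution of v given h is formed by symmetry with s_j(h_j, v).

nv = 4; nh = 3;                                  % assumed numbers of visible and hidden units
lambda1 = zeros(nv,1); lambda2 = zeros(nh,1); Lambda12 = 0.01*randn(nv,nh);
v = double(rand(nv,1) > 0.5);                    % an observed visible vector (assumption)
sig = @(x) 1./(1 + exp(-x));                     % logistic function
ph0 = sig(lambda2 + 2*Lambda12'*v);              % s_j(1, v)
h1  = double(rand(nh,1) < ph0);                  % Gibbs step: sample h given v
pv1 = sig(lambda1 + 2*Lambda12*h1);              % conditional for v given h (by symmetry)
v1  = double(rand(nv,1) < pv1);                  % Gibbs step: sample xi_v^(1)
ph1 = sig(lambda2 + 2*Lambda12'*v1);             % s_j(1, xi_v^(1))
t = 0.1;                                         % assumed step size
lambda1  = lambda1  + t*(v - v1);                % approximate gradient ascent on Q
lambda2  = lambda2  + t*(ph0 - ph1);
Lambda12 = Lambda12 + t*2*(v*ph0' - v1*ph1');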
where zi ∈ ℝni is the output of layer i. The function Φi ∶ ℝni−1 → ℝni is called the propagation
function. Typically, it can be a linear or an affine function, and then we may write
Φi (x) = Wi x + 𝑣i , i ∈ ℕL , (10.7)
where Wi ∈ ℝni ×ni−1 and 𝑣i ∈ ℝni . The input to the next layer is obtained by
xi = hi (zi ), i ∈ ℕL−1 , (10.8)
where hi ∶ ℝni → ℝni is called the activation function. It is often the case that this function can be
written as
h_i(z) = ( h_{i1}(z_1), …, h_{in_i}(z_{n_i}) ),     (10.9)
and each component hij is typically a saturation function or a sigmoid function, i.e. a function
𝜎 ∶ ℝ → [0, 1] such that limt→∞ 𝜎(t) = 1 and limt→−∞ 𝜎(t) = 0. Another popular choice is the ReLU,
which was defined in Section 10.5. Figure 10.7 shows an example of a neural network.
We now define the functions fi ∶ ℝni−1 × ℝpi → ℝni , i ∈ ℕL , as
f_i(x_{i−1}, θ_i) = h_i(Φ_i(x_{i−1})) = h_i(W_i x_{i−1} + v_i),     (10.10)
where
θ_i = vec([ W_i  v_i ]) ∈ ℝ^{p_i}.     (10.11)
It follows that
xi = fi (xi−1 ; 𝜃i ), i ∈ ℕL . (10.12)
Finally, we define f : ℝ^{n_0} × ℝ^{p_1} × ⋯ × ℝ^{p_L} → ℝ^{n_L} as f(x_0; θ_1, …, θ_L) = x_L. With x = x_0 ∈ ℝ^{n_0} and
θ = (θ_1, …, θ_L) ∈ ℝ^p, where p = ∑_{i=1}^L p_i, we have hence defined a nonlinear regression model or
predictor f (x; 𝜃). The recursive structure of the predictor is illustrated in Figure 10.8.
Figure 10.7 A neural network with four hidden layers. The nodes represent the activation functions, and
the edges illustrate how the output from one layer is propagated to the next.
Figure 10.8 Figure illustrating the recursive definition of the ANN predictor.
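Evaluating the predictor amounts to iterating (10.7) and (10.8). The MATLAB sketch below is not from the book; the layer sizes, weights, and activation functions are assumptions chosen only for illustration.

W = {randn(5,3), randn(4,5), randn(1,4)};     % assumed weight matrices W_i
v = {zeros(5,1), zeros(4,1), 0};              % assumed biases v_i
h = {@tanh, @tanh, @(z) z};                   % assumed activation functions h_i
x = randn(3,1);                               % input x_0
for i = 1:numel(W)
    z = W{i}*x + v{i};                        % z_i = W_i x_{i-1} + v_i, cf. (10.7)
    x = h{i}(z);                              % x_i = h_i(z_i), cf. (10.8)
end
x                                             % x_L = f(x_0; theta_1, ..., theta_L)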
where 𝜙p,q ∶ [0, 1] → [0, 1] are continuous increasing functions, and where g ∶ ℝ → ℝ is a con-
tinuous function [64, 72]. As a result, a two-layer ANN is sufficient to represent any continuous
function. However, very little is known about how to choose g. A result by Cybenko [30] says that
any continuous function f ∶ [0, 1]n → ℝ can be approximated arbitrarily well as
f(x) ≈ ∑_{j=1}^N α_j σ( a_j^T x + b_j )     (10.13)
if N is sufficiently large, and where 𝜎 is any continuous sigmoid function, and 𝛼j ∈ ℝ, aj ∈ ℝn and
bj ∈ ℝ, j ∈ ℕN .
V(θ) = (1/2) ∑_{k=1}^N ||y_k − f(x_k; θ)||₂²,     (10.14)
the minimization of which is a nonlinear LS problem.
Finally, we interpret the optimization problem for Hebbian learning in (10.5) as a one-layer ANN.
Let 𝑣1 = b, W1 = aT and h(z1 ) = g(z1 ). The training data is (xk , yk ) ∈ ℝn × {−1, 1}, k ∈ ℕN and the
function to maximize is the sum of the outputs of the ANN over all training data.
Because of these interpretations, it is easy to see how one can generalize logistic regression, the
RBM, and Hebbian learning using multilayer ANNs.
x_0 ← x
for i ← 1 to L do
    z_i ← W_i x_{i−1} + v_i
    x_i ← h_i(z_i)
end
H_L ← ∂h_L(z_L)/∂z_L^T
∂f/∂θ_L^T ← H_L ([x_{L−1}^T  1] ⊗ I)
P ← H_L W_L
for i ← L − 1 to 1 do
    H_i ← ∂h_i(z_i)/∂z_i^T
    P ← P H_i
    ∂f/∂θ_i^T ← P ([x_{i−1}^T  1] ⊗ I)
    P ← P W_i
end
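The MATLAB sketch below applies the same idea to one term of the nonlinear LS criterion (10.14) for a small network with tanh activations. It is not the book's code, and it computes the gradient of the squared residual with respect to W_i and v_i rather than the full Jacobian of f; the network and data are assumptions.

W = {randn(5,3), randn(1,5)}; v = {zeros(5,1), 0};   % an assumed two-layer network
x0 = randn(3,1); y = 0.5;                            % one assumed training pair
xs = {x0}; zs = {};
for i = 1:2                                          % forward pass, storing z_i and x_i
    zs{i} = W{i}*xs{i} + v{i};
    xs{i+1} = tanh(zs{i});
end
e = xs{3} - y;                                       % residual f(x0; theta) - y
d = e.*(1 - tanh(zs{2}).^2);                         % back-propagated error at layer 2
for i = 2:-1:1
    gW{i} = d*xs{i}';                                % gradient with respect to W_i
    gv{i} = d;                                       % gradient with respect to v_i
    if i > 1
        d = (W{i}'*d).*(1 - tanh(zs{i-1}).^2);       % propagate the error backwards
    end
end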
which has infinitely many solutions if the system of equations is consistent. It would then be natural
to look for a solution that minimizes the norm ||a||2 . In Exercise 10.11, we show that if X has full
row rank, then this solution is given by
a = X^T (X X^T)^{−1} y.
We will now see how the incremental stochastic optimization method in (6.72) can be used to obtain
this least-norm solution by solving the optimization problem
minimize   (1/N) ∑_{k=1}^N V_k(a^T x_k, y_k),
with variable a, [120]. Here Vk ∶ ℝ × ℝ → ℝ are functions such that Vk (0, 0) = 0 and such that the
incremental method converges to a vector a that satisfies Xa = y. This holds true for many convex
functions. One example is
V_k(z_k, y_k) = (1/2)(y_k − z_k)²,     (10.16)
which corresponds to the LS criterion. We define f_k(a) = V_k(a^T x_k, y_k) to get agreement with
the optimization problem in (5.46). The incremental method reads
ak+1 = ak − tk ∇fik (ak ).
We have that
∇f_{i_k}(a) = ( ∂V_{i_k}(a^T x_{i_k}, y_{i_k}) / ∂z_{i_k} ) x_{i_k},
and hence, if we start with a0 = 0, we find that the incremental method converges to a = X T 𝛼 for
some 𝛼 ∈ ℝN . Together with Xa = y, we obtain the equation
XX T 𝛼 = y,
which has a unique solution if X has full row rank. It then follows that
a = X^T (X X^T)^{−1} y,
which is the least-norm solution. This could also have been obtained by solving the optimization
problem
minimize   ∑_{k=1}^N (y_k − a^T x_k)² + γ ||a||₂²,
with variable a, which is a regularized LS problem, and then letting 𝛾 → 0, see Exercise 10.11.
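The MATLAB sketch below illustrates this numerically. It is not from the book; the data are synthetic and a fixed step size is used, so the iterate only approaches the least-norm solution approximately.

rng(0);
N = 20; n = 50;                              % fewer equations than unknowns
X = randn(N,n); y = randn(N,1);              % consistent underdetermined system
a = zeros(n,1); t = 0.005;                   % start at zero, assumed step size
for k = 1:200000
    i = randi(N);                            % pick one term, cf. (6.72)
    a = a - t*(a'*X(i,:)' - y(i))*X(i,:)';   % gradient of V_i in (10.16)
end
aln = X'*((X*X')\y);                         % least-norm solution
norm(a - aln)/norm(aln)                      % typically small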
10.8.2 Dropout
Another mechanism that results in implicit regularization is the concept of dropout. We will study
it for the linear regressor in (10.15). We consider Vk (z, y) given by (10.16), and the resulting LS
problem is
minimize   (1/2) ∑_{k=1}^N (y_k − a^T x_k)²,
with variable a ∈ ℝn . Dropout is achieved by replacing ai , i.e. the ith component of a with 𝛿i ai ,
where 𝛿i is an outcome of a random variable Δi with a Bernoulli distribution, i.e. P(Δi = 1) = pi
and P(Δi = 0) = 1 − pi . We assume that Δi is independent of Δj for i ≠ j. Let Δ = diag(Δ1 , … , Δn )
and δ = diag(δ_1, …, δ_n). We may then write the kth term in the objective function as the outcome of
(1/2)(y_k − x_k^T Δa)². We actually also assume that we have different outcomes δ of Δ for each k that are
independent of one another, but since we only need to analyze one fixed value of k to understand how
the incremental method performs with dropout, we will neglect all dependence on k and consider
V d ∶ ℝn → ℝ defined as
V^d(a) = (1/2)(y − x^T Δa)²,
for y ∈ ℝ and x ∈ ℝn . Once we have determined a, we would like to use this parameter to predict y
given a new value of x. We therefore define the predictor Ŷ ∶ ℝn × {0, 1}n → ℝ as Ŷ (x; Δ1 , … , Δn ) =
xT Δa. This is, however, not so useful, since it involves the random variable Δi . We are more inter-
ested in the expected value of this predictor, which is often called the ensemble average predictor
ŷ ∶ ℝn → ℝ defined as
ŷ(x) = 𝔼[ Ŷ(x; Δ_1, …, Δ_n) ] = x^T P a,
where P = diag(p1 , … , pn ). The interesting question is if running the incremental method on the
problem
minimize   𝔼[ V^d(a) ],
with variable a will result in a good ensemble average predictor. We therefore introduce the function
V e ∶ ℝn → ℝ as
V^e(a) = (1/2)(y − x^T P a)²,
which we like to be small for the ensemble average predictor to be good. We will now
compare the gradient of this function with the expected value of the gradient of V d . Remem-
ber that this expected value is what determines the behavior of the incremental method,
cf . Section 6.8. We have
∂V^e(a)/∂a = −(y − x^T P a) P x
∂V^d(a)/∂a = −(y − x^T Δa) Δ x.
From this, we obtain
𝔼[ ∂V^d(a)/∂a ] = −y P x + 𝔼[ Δ x x^T Δ ] a,
where element (i, j) of 𝔼[ Δ x x^T Δ ] is given by
𝔼[ Δ_i Δ_j x_i x_j ] = p_i p_j x_i x_j for i ≠ j, and 𝔼[ Δ_i Δ_j x_i x_j ] = p_i x_i² for i = j.
This implies that
𝔼[ Δ x x^T Δ ] = P x x^T P + diag(σ_1² x_1², …, σ_n² x_n²),
where 𝜎i2 = pi (1 − pi ) is the variance of Δi . This means that we have
𝔼[ ∂V^d(a)/∂a ] = ∂V^e(a)/∂a + diag(σ_1² x_1², …, σ_n² x_n²) a,
which is the gradient of
V^e(a) + (1/2) a^T diag(σ_1² x_1², …, σ_n² x_n²) a.     (10.17)
We see that the second term is a ridge regularization, and this explains how dropout implicitly
provides regularization. The largest possible regularization is obtained when pi = 1∕2, since this
value maximizes 𝜎i2 . For more information about dropout see [8].
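The equivalence in (10.17) can be checked numerically. The MATLAB sketch below is not from the book; it compares a Monte Carlo estimate of the expected dropout gradient with the gradient of the regularized criterion for assumed values of x, y, a, and p_i = 1/2.

rng(0);
n = 3; x = randn(n,1); y = randn; a = randn(n,1); p = 0.5*ones(n,1);
P = diag(p); S2 = diag(p.*(1-p).*x.^2);
g = zeros(n,1); M = 2e5;
for s = 1:M                                  % Monte Carlo estimate of E[dV^d/da]
    d = double(rand(n,1) < p);               % one outcome of Delta
    g = g + (-(y - x'*(d.*a))*(d.*x))/M;
end
ge = -(y - x'*P*a)*P*x + S2*a;               % gradient of (10.17)
[g ge]                                       % the two columns nearly agree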
Exercises
(a) Show that this can be done with the following update formula:
a_{N+1} = a_N + P_{N+1} x_{N+1} ( y_{N+1} − x_{N+1}^T a_N ),
where P_N^{−1} = X_N^T X_N and where P_{N+1}^{−1} = P_N^{−1} + x_{N+1} x_{N+1}^T.
(b) Show that the recursion above for P_N can be equivalently written as
P_{N+1} = P_N − ( 1/(1 + x_{N+1}^T P_N x_{N+1}) ) P_N x_{N+1} x_{N+1}^T P_N.
You may use the matrix inversion lemma, which says that for matrices A, C, U, and V such that the dimensions are compatible and such that the inverses below exist, it holds that
(A + UCV)^{−1} = A^{−1} − A^{−1} U ( C^{−1} + V A^{−1} U )^{−1} V A^{−1}.
10.3 Consider the MAP optimization problem in (10.3) restated for ease of reference below
minimize   (1/(2σ²)) ∑_{k=1}^N ( y_k − a^T β(x_k) )² + N ln(√(2π) σ) − ln f_A(a),
with variable a, where fA is the pdf for the prior.
(a) Consider the case of a double-sided exponential distribution given by
f_A(a) = ∏_{i=1}^m (1/(2λ)) e^{−|a_i|/λ},
for some λ > 0, and show that an equivalent optimization problem is
minimize   (1/2) ∑_{k=1}^N ( y_k − a^T β(x_k) )² + c ∑_{i=1}^m |a_i|,
with variable a for some constant c > 0. This is known as lasso regularization.
(b) Consider the case when each component A_i of the random variable A has a uniform
distribution on the interval [−λ, λ], where λ > 0. Show that an equivalent MAP opti-
mization problem is
minimize   (1/2) ∑_{k=1}^N ( y_k − a^T β(x_k) )²
subject to  ||a||_∞ ≤ λ,
with variable a.
10.4 Show that the logistic regression problem in Section 10.4 can be equivalently written as a
conic optimization problem involving the exponential cone.
10.5 Consider Example 10.2 and write MATLAB code that reproduces the result in the
example.
10.8 Consider the Fisher Iris data set that you investigated in Exercise 9.13 using PCA. You
are going to use the compressed data ci that you obtained using the two principal com-
ponents corresponding to the two largest singular values. You are asked to classify these
using an SVM as in (10.6) with f being an affine function, i.e. f (c) = aT c + b, where a ∈ ℝ2
and b ∈ ℝ are the parameters that define the separating straight line. The data set has
three classes, and the SVM is only able to separate between two classes. However, you will
instead carry out three classifications, where you classify species 1 against species 2 and 3,
species 2 against species 1 and 3, and finally species 3 against species 1 and 2. In this way,
you will obtain three separating straight lines. Plot the lines on top of the figure you plot-
ted in Exercise 9.13. How well do the lines separate the different species? Are there species
that cannot be classified? Are there species that are not uniquely classified? How do the
results depend on the choice of the regularization parameter ν? Notice that you may use
different values of 𝜈 in the three different classifications.
Hint: You can directly solve the primal problems using, e.g. YALMIP, since it is of low
dimension.
minimize   (1/n) ∑_{i=1}^n φ(a_i^T w) + γ g(w)     (10.18)
with variable 𝑤 ∈ ℝd and where 𝜙(t) = max (0, 1 − t) and g(𝑤) = (1∕2)||𝑤||22 . The problem
data are ai ∈ ℝd for i = 1, … , n, and 𝛾 > 0 is a parameter.
(a) We start by deriving a dual coordinate ascent method for the problem (10.18).
(1) Show that the dual problem is equivalent to the problem
maximize   −∑_{i=1}^n φ*(−x_i) − (1/(2nγ)) ||A^T x||₂²     (10.19)
(b) Show that the proximal operator associated with gi (𝑤) = 𝜙(aTi 𝑤) + 𝛾g(𝑤) can be
expressed as
prox_{t g_i}(w) = argmin_u { φ(a_i^T u) + (γ/2)||u||₂² + (1/(2t))||u − w||₂² }
   = (1/(1 + tγ)) ( w + t a_i )                                   if t(||a_i||₂² − γ) < 1 − a_i^T w,
   = (1/(1 + tγ)) ( w + ((1 − a_i^T w + tγ)/||a_i||₂²) a_i )      if −tγ ≤ 1 − a_i^T w ≤ t(||a_i||₂² − γ),
   = (1/(1 + tγ)) w                                               if 1 − a_i^T w < −tγ.
Hint: Derive the optimality condition associated with the prox-problem and show that
u⋆ has the form u⋆ = (1 + t𝛾)−1 (𝑤 − 𝛽tai ), where 𝛽 ∈ 𝜕𝜙(aTi u⋆ ).
(c) The so-called linear soft-margin support vector machine training problem is a special
case of the problem (10.18). Specifically, if we partition w as w = (w̃, b) ∈ ℝ^{d−1} × ℝ
and define a_i = (y_i x_i, y_i), where x_i ∈ ℝ^{d−1} corresponds to a so-called feature vec-
tor with corresponding label y_i ∈ {−1, 1}, then a_i^T w = y_i(x_i^T w̃ + b), and hence,
φ(a_i^T w) = max(0, 1 − y_i(x_i^T w̃ + b)). Given a vector of labels y = (y_1, …, y_n) and a
matrix X ∈ ℝ^{n×(d−1)} with rows x_1^T, …, x_n^T, we can express A as
A = diag(y) [ X  𝟙 ].
The file classification.mat is an example of such a data set.
The solution to the SVM training problem defines a hyperplane that can be used to
define a classifier: the function f (x) = sgn(𝑤̃ T x + b) provides a label prediction for an
unlabeled feature vector x.
(1) Implement an incremental proximal method for solving (10.18), and test your
implementation on the provided dataset.
(2) Implement a dual coordinate ascent method for solving (10.19), and test your
implementation on the provided dataset.
(3) Compare the two methods: plot the objective value as a function of the number of
iterations, e.g. integer multiples of n iterations.
10.10 The soft-margin support vector machine training problem is a convex quadratic problem:
minimize   𝟙^T v + (γ/2)||w||₂²
subject to  diag(y)(Aw + 𝟙b) ⪰ 𝟙 − v     (10.20)
            v ⪰ 0
with variables w ∈ ℝ^n, b ∈ ℝ, and v ∈ ℝ^m, and with problem data A ∈ ℝ^{m×n} and
y ∈ {−1, 1}^m. The optimal v can be expressed in terms of w and b as
𝑣 = max (0, 𝟙 − diag(y)(A𝑤 + 𝟙b)).
The Lagrangian is given by
L(v, w, b, z, λ) = 𝟙^T v + (γ/2)||w||₂² + z^T( 𝟙 − v − diag(y)(Aw + 𝟙b) ) − λ^T v,
and the dual problem can be expressed as
maximize   −(1/(2γ)) z^T diag(y) A A^T diag(y) z + 𝟙^T z
subject to  y^T z = 0     (10.21)
            0 ⪯ z ⪯ 𝟙.
z_i^⋆ = 0  ⇔  y_i u_i ≥ 1
0 < z_i^⋆ < 1  ⇔  y_i u_i = 1        for i = 1, …, m     (10.22)
z_i^⋆ = 1  ⇔  y_i u_i ≤ 1
where u = Aw + 𝟙b = γ^{−1} A A^T diag(y) z + 𝟙b.
The dual problem (10.21) can be solved using the so-called sequential minimal optimization
(SMO) method. Suppose zk is the value of the dual variable at the beginning of iteration k.
The SMO iteration then consists of the following three steps:
● Working set selection: select two dual variables zk,i and zk,j (i ≠ j) where at least one of the
two variables violates the optimality conditions (10.22).
● Two-variable subproblem: compute zk+1,i and zk+1,j by solving the dual problem (10.21)
with
z = zk + (zi − zk,i )ei + (zj − zk,j )ej . (10.23)
● Update intercept: compute bk+1 such that the updated dual variables zk+1,i and zk+1,j
satisfy the optimality conditions (10.22).
The main advantage of this seemingly simple method is that the two-variable subproblem
is cheap to solve. Each iteration increases the dual objective, and the method stops when
all dual variables satisfy the optimality conditions within some tolerance. The method can
be shown to converge, but the working set selection can have a significant impact on per-
formance: there are m(m − 1)∕2 potential working sets at each iteration, so a good working
set selection heuristic is crucial. A popular heuristic is the so-called maximal violating pair
working set selection rule which chooses a pair (i, j) as
i ∈ argmin_{l∈I_1} {y_l − u_l},   j ∈ argmax_{l∈I_2} {y_l − u_l},     (10.24)
where
I_1 = {l ∣ z_l > 0, y_l = 1} ∪ {l ∣ z_l < 1, y_l = −1}
I_2 = {l ∣ z_l > 0, y_l = −1} ∪ {l ∣ z_l < 1, y_l = 1}.
This selection heuristic can be motivated from the optimality conditions (10.22): yl ul must
satisfy yl ul ≤ 1 if zl > 0, and similarly, yl ul ≥ 1 if zl < 1. By multiplying both sides of both
yl ul ≤ 1 and yl ul ≥ 1 by yl , we can express the optimality conditions as
0 ≤ yl − ul , l ∈ I1
0 ≥ yl − ul , l ∈ I2 .
It follows that z satisfies the optimality conditions when (i, j) is chosen according to maxi-
mal violating pair heuristic (10.24) and yi − ui ≥ 0 ≥ yj − uj .
(a) Derive a simple method for solving the two variable subproblem with working set (i, j).
You may assume that zk is feasible.
Hint: The constraint yT z = 0 implies that zi and zj can be parameterized as zi (t) = zk,i +
tyi and zj (t) = zk,j − tyj , and hence, z(t) = zk + t(ei yi − ej yj ).
(b) Derive an expression for bk+1 so that (zk+1,i , uk+1,i ) and (zk+1,j , uk+1,j ) satisfy the optimal-
ity conditions (10.22).
(c) Derive a recursive update of u, i.e. express uk+1 in terms of uk , bk , and bk+1 .
(d) Implement the SMO algorithm with the working set selection heuristic (10.24) and
test it on the same data as in the previous exercise.
10.12 Implement a stochastic gradient algorithm that uses dropout. Generate pairs of data
(xk , yk ) ∈ ℝn × ℝ for k ∈ ℕN randomly in the following way. First, generate a0 ∈ ℝn
randomly from a standardized Gaussian distribution. Then generate xk randomly from a
standardized Gaussian distribution and also generate ek randomly from a standardized
Gaussian distribution. Then let yk = xkT a0 + 0.1ek . Use dropout probabilities pi = 0.5 in
the following stochastic gradient method
a_{k+1} = a_k + t_k ( y_{i_k} − x_{i_k}^T δ a ) δ x_{i_k},
where σ² = p_i(1 − p_i).
10.13 When dropout was used with an incremental method for solving the LS problem in
Section 10.8, we realized that the predictor was given by the expected value of the stochastic
predictor. This could be computed exactly. For general ANNs, it is only possible to approx-
imately compute the expectation as will be discussed in this exercise. Let Δi , i ∈ ℕn , be
random variables with a Bernoulli distribution, i.e. P(Δi = 1) = pi and P(Δi = 0) = 1 − pi .
Show that
G(Ẑ) / ( G(Ẑ) + G′(Ẑ) ) = σ( 𝔼 Ŷ ),
where the left-hand side is called the normalized weighted geometric mean.
(b) Show that
G(Ẑ) ≤ G(Ẑ) / ( G(Ẑ) + G′(Ẑ) ) ≤ 𝔼 Ẑ,
if 0 < Ẑ_i ≤ 1/2. Hence, σ(𝔼 Ŷ) is a good approximation for 𝔼 Ẑ = 𝔼[ σ(Ŷ) ] for the case
when 0 < Ẑ_i ≤ 1/2.
Hint: Use the Ky Fan inequality from Exercise 4.10.
10.14 We will use the Deep Learning Toolbox in MATLAB to fit an ANN to pairs of data (xi , yi ).
We will use a two-layer network implementing the function in (10.13). The following com-
mands read in the data, define a net with two layers, three hidden neurons, train the net,
display the net, compute the predicted outputs of the net, and plot a comparison of the
predicted values ŷ i and the original values of yi versus the values xi :
[x,y] = simplefit_dataset;
net = fitnet(3);
net = train(net,x,y);
view(net),
haty = net(x);
plot(x,haty,'-',x,y,'--');
Play around with the number of hidden neurons and try to figure out how many are needed
to get an almost perfect fit.
11
Reinforcement Learning
Reinforcement learning is an area of machine learning concerned with how agents should take
actions in an environment in order to maximize a so-called cumulative reward.
The idea goes back to the work by the Russian Nobel laureate Ivan Pavlov on classical conditioning
where he showed that you can train dogs by rewarding or punishing them. To formalize this, we
consider the dog to be in different states xk at different stages k in time. The value of the state xk+1
depends on the current state xk and the action uk we take during our training. Each and every state
xk is also related to a reward rk . It is then desirable to maximize the sum, possibly discounted, of
rewards over some set of stages, say {0, 1, … , N − 1}, where N could be infinity. In reinforcement
learning, the action is determined by an agent, and the next state and the reward are determined
by the environment. One can view reinforcement learning as a way of solving the optimal control
problems in Chapter 8 without explicitly knowing the details of the mathematical model that
describes how the state evolves. The control signal in optimal control corresponds to the action in
reinforcement learning. The incremental costs in optimal control are the negative of the rewards
in reinforcement learning.
to directly learn the Q-function instead of computing it using Fk from a learned value function Vk .
We will show that the following recursion holds
Q_k(x, ū) = f_k(x, ū) + min_u Q_{k+1}( F_k(x, ū), u ),     (11.2)
starting with k = N − 2 and finishing with k = 0, and where Q_{N−1}(x, u) = f_{N−1}(x, u) + φ(F_{N−1}(x, u)).
The above recursion follows from the fact that (8.4) and (8.5) imply that
V_{k+1}(x) = min_u Q_{k+1}(x, u),
and hence,
V_{k+1}( F_k(x, ū) ) = min_u Q_{k+1}( F_k(x, ū), u ),
for k ∈ ℤN−1 . This is a linear regression model, but we could also have considered a more general
model as we did in Section 8.2. We leave the details of generalizing to this case for the reader to
carry out. We then consider samples (x_k^s, u_k^s) ∈ 𝒳_n × 𝒰_m, where s ∈ ℕ_r, and define
β_{N−1}^s = φ( F_{N−1}(x_{N−1}^s, u_{N−1}^s) ),
and
β_k^s = min_u Q̃_{k+1}( F_k(x_k^s, u_k^s), u, a_{k+1} ),     (11.3)
where ak+1 is a known value from a previous iterate defined below. We see that we do not need
an analytical expression for Fk in order to define 𝛽ks for k ∈ ℤN−1 . It is enough to know the value
of Fk (xks , usk ) which is the next value of the state in a simulation. Moreover, depending on how the
feature vectors are chosen, the minimization above could become very tractable. This can be a great
advantage for the approximation method based directly on the Q-function. After this, we define the
following LS problem:
minimize   (1/2) ∑_{s=1}^r ( Q̃(x_k^s, u_k^s, a) − f_k(x_k^s, u_k^s) − β_k^s )²,     (11.4)
with variable a for k ∈ ℤN−1 . Denote the optimal solution by ak . The iterations start with k = N − 1
and go down to k = 0, where we alternate between solving (11.4) and (11.3). Once all parameters
a_k have been computed, the approximate optimal control is u_k^⋆ = μ_k(x_k), where μ_k : 𝒳_n → 𝒰_m is
given by
μ_k(x) = argmin_u Q̃(x, u, a_k).     (11.5)
u
We notice that using the Q-function instead of using the value function as in Section 8.2 comes at
the price of also having to sample the control signal space m .
Example 11.1 We will in this example perform fitted value iteration for the optimal control prob-
lem in Example 8.1 based on the iteration of the Q-functions. We will as in Example 8.3 specifically
consider the case when m = 1 and n = 2, and we let Ak , Bk , Rk , Sk be independent of k and we write
them as A, B, R, S. Since we know that the Q-functions are quadratic, we will use a feature vec-
tor that is φ(x, u) = (x_1², x_2², u², 2x_1x_2, 2x_1u, 2x_2u), where x = (x_1, x_2) ∈ ℝ² and u ∈ ℝ. Notice that the
indices refer to components of the vector and not to time. We let
Q̃_k(x, u, a) = a^T φ(x, u),
β_k^s = (x_+^s)^T ( P̃_{k+1} − r̃_{k+1} q̃_{k+1}^{−1} r̃_{k+1}^T ) x_+^s,
where x_+^s = A x_k^s + B u_k^s. We then obtain a_k for k ∈ ℤ_{N−1} as the solution to the linear LS problem in
(11.4):
minimize   (1/2) ∑_{s=1}^r ( φ^T(x_k^s, u_k^s) a − (x_k^s)^T S x_k^s − (u_k^s)^T R u_k^s − β_k^s )².
Notice that the subindex k for a now refers to stage index. The solution a_k satisfies the normal
equations, cf. (5.3),
Φ_k^T Φ_k a_k = Φ_k^T γ_k,
where Φ_k has rows φ^T(x_k^s, u_k^s) and γ_k has entries (x_k^s)^T S x_k^s + (u_k^s)^T R u_k^s + β_k^s for s ∈ ℕ_r.
It is here crucial to choose (xks , usk ) such that ΦTk Φk is invertible. We realize that we need r ≥ 6 for this
to hold. Moreover, we need to choose (xks , usk ) sufficiently different to have ΦTk Φk well conditioned.
For a general n, we need r ≥ (m + n)(m + n + 1)/2. Compared to Example 8.3, we require a larger
value of r. The optimal feedback function is by (11.5) and (11.6) given by
μ_k(x) = −q̃_k^{−1} r̃_k^T x,
where we again have used the results in Example 5.7.
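A generic backward step of (11.3)-(11.4) can be sketched in MATLAB as below. This is not the book's code: the dynamics, incremental cost, samples, and the previous parameter vector are placeholders, and the minimization over u in (11.3) is done over a sampled grid rather than analytically as in the example above.

F   = @(x,u) [0.5 1; 0 0.5]*x + [0; 1]*u;            % assumed dynamics F_k
f   = @(x,u) x'*x + u^2;                              % assumed incremental cost (S = I, R = 1)
phi = @(x,u) [x(1)^2; x(2)^2; u^2; 2*x(1)*x(2); 2*x(1)*u; 2*x(2)*u];
ugrid = linspace(-5, 5, 101);                         % sampled control values
r = 20; xs = 4*rand(2,r) - 2; us = 4*rand(1,r) - 2;   % assumed samples (x_k^s, u_k^s)
anext = randn(6,1);                                   % placeholder for a_{k+1}
Phi = zeros(r,6); gam = zeros(r,1);
for s = 1:r
    xp = F(xs(:,s), us(s));                           % next state F_k(x_k^s, u_k^s)
    q  = arrayfun(@(u) phi(xp,u)'*anext, ugrid);      % Qtilde_{k+1}(xp, u, a_{k+1})
    Phi(s,:) = phi(xs(:,s), us(s))';
    gam(s)   = f(xs(:,s), us(s)) + min(q);            % f_k + beta_k^s, cf. (11.3)
end
ak = (Phi'*Phi) \ (Phi'*gam);                         % solution of the LS problem (11.4)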
and
γ V(F(x, ū)) = min_u γ Q( F(x, ū), u ).
By adding f(x, ū) to both sides of the above equation, the desired equation for Q is obtained.
We introduce the Bellman Q-operator T_Q : ℝ^{𝒳_n × 𝒰_m} → ℝ^{𝒳_n × 𝒰_m} defined as
T_Q(Q)(x, ū) = f(x, ū) + min_u γ Q( F(x, ū), u ).     (11.8)
11.2.1 Q-Learning
We will now see how VI can be generalized by developing an alternative iterative method for solving
(11.7). To this end, we introduce the error function e : ℝ^{𝒳_n × 𝒰_m} → ℝ^{𝒳_n × 𝒰_m} defined as
e(Q) = Q − T_Q(Q),
where TQ is defined in (11.8). With this definition, we may write the equation in (11.7) as e(Q) = 0.
We now apply the following standard algorithm
Qk+1 = Qk − tk e(Qk ), k ∈ ℤ+ , (11.10)
to find a root of the equation e(Q) = 0. You can initialize with Q0 = f , but there are better ways.
The step lengths tk should satisfy tk ∈ (0, 1]. We will prove convergence of the above iterations to a
solution Q⋆ of e(Q) = 0, i.e. Q⋆ = TQ (Q⋆ ). Let Δk = Qk − Q⋆ . Then it holds that
( )
Δk+1 = (1 − tk )Δk + tk TQ (Qk ) − TQ (Q⋆ ) .
From this it follows for tk ∈ (0, 1] that
||Δk+1 ||∞ ≤ (1 − tk )||Δk ||∞ + tk ||TQ (Qk ) − TQ (Q⋆ )||∞ ≤ (1 − tk + tk 𝛾)||Δk ||∞ ,
where the last inequality follows from the contraction property of TQ for 𝛾 ∈ [0, 1) shown in
Exercise 11.6. We see that we have a contraction for Δk for tk ∈ (0, 1], if 𝛾 ∈ (0, 1), and therefore,
Δk converges to zero proving that Qk converges to Q⋆ . We remark that we recover VI for tk = 1.
The algorithm is summarized in Algorithm 11.3.
Instead of considering all values of (x, u) ∈ 𝒳_n × 𝒰_m in each iteration k, it is possible to consider
only one sample (x_k, u_k) at a time. These samples could be generated in a cyclic order or in a randomized
cyclic order such that each sample is visited infinitely many times. We assume that 𝒳_n × 𝒰_m is a
finite set. Then it holds that
Q_{k+1}(x_k, u_k) = Q_k(x_k, u_k) − t_k [ Q_k(x_k, u_k) − f(x_k, u_k) − min_u γ Q_k( F(x_k, u_k), u ) ],     (11.11)
converges to a solution of e(Q) = 0 as k goes to infinity when tk ∈ (0, 1] and 𝛾 ∈ (0, 1). The results
follow trivially from the convergence proof above. The algorithm has similarities with the coordi-
nate descent method in Section 6.9, since Q is only updated for one discrete value in each iteration.
This algorithm is often referred to as Q-learning, and it is summarized in Algorithm 11.4.
It is possible to develop VI methods for the stochastic infinite horizon optimal control problem
in (8.33) using the Q-function in (8.35) which satisfies the Bellman equation for the Q-function in
(11.32). The above algorithm can be used by replacing F(xk , uk ) with F(xk , uk , 𝑤k ), where 𝑤k is a
realization of the random process W. If the step sizes satisfy ∑_{k=0}^∞ t_k = ∞ and ∑_{k=0}^∞ t_k² < ∞, then
the algorithm can be shown to converge in a stochastic sense [114].
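For a finite problem, the sample-based iteration (11.11) is a few lines of MATLAB. The sketch below is not from the book; the system is an assumed random lookup table, and a fixed step size t ∈ (0, 1] is used.

rng(0);
nx = 5; nu = 3; gamma = 0.9; t = 0.5;        % assumed sizes, discount, and step size
Ftab = randi(nx, nx, nu);                    % F(x,u) as a lookup table (assumption)
ftab = rand(nx, nu);                         % incremental cost f(x,u) (assumption)
Q = zeros(nx, nu);
for k = 1:20000
    x = randi(nx); u = randi(nu);            % visit all pairs infinitely often
    xp = Ftab(x,u);                          % next state F(x,u)
    Q(x,u) = Q(x,u) - t*(Q(x,u) - ftab(x,u) - gamma*min(Q(xp,:)));   % (11.11)
end
[~, mu] = min(Q, [], 2);                     % greedy policy mu(x) = argmin_u Q(x,u)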
Reinforcement learning based on PI is called self-learning. The policy evaluation step is referred to
as a critic and the policy improvement is referred to as an actor. These types of methods are called
actor-critic methods. In case parametric approximations using artificial neural networks ANNs are
involved, the actor and critic are called actor networks and critic networks, respectively. In order to
do PI for the Q-function formulation in (11.7), we define Q_k : 𝒳_n × 𝒰_m → ℝ as Q_k(x, u) = f(x, u) +
𝛾Vk (F(x, u)). It then follows that
V_k(x) = Q_k( x, μ_k(x) ),
from (8.16), and therefore
V_k( F(x, u) ) = Q_k( F(x, u), μ_k(F(x, u)) ).
Multiply with γ and add f(x, u) to obtain that Q_k is the solution of
Q_k(x, u) = f(x, u) + γ Q_k( F(x, u), μ_k(F(x, u)) ).     (11.12)
This is the policy evaluation step in terms of the Q-function. We then obtain a new feedback policy
by solving
μ_{k+1}(x) = argmin_u Q_k(x, u),     (11.13)
which is the policy improvement step in terms of the Q-function. These iterations are exactly the
same as the iterations in (8.16)–(8.17) except that we need to solve for a function Qk that depends
also on the control signal in the policy iteration step. The PI algorithm for the Q-function is sum-
marized in Algorithm 11.5.
Example 11.2 We consider the infinite horizon LQ control problem in Example 8.4. We guess
that
Q_k(x, u) = [x; u]^T [ U_k  W_k ; W_k^T  V_k ] [x; u],
for some
[ U_k  W_k ; W_k^T  V_k ] ∈ 𝕊_+^{m+n},
where V_k ∈ 𝕊_{++}^m. It then follows from (11.13) that
μ_{k+1}(x) = −L_{k+1} x,
where L_{k+1} = V_k^{−1} W_k^T. The recursion for Q_k in (11.12) is seen to be satisfied if
[ U_k  W_k ; W_k^T  V_k ] = [ S  0 ; 0  R ] + γ [A  B]^T [ I ; −L_k ]^T [ U_k  W_k ; W_k^T  V_k ] [ I ; −L_k ] [A  B],
for a given L_k. This is an algebraic Lyapunov equation, which has a positive semidefinite solution
since
[ S  0 ; 0  R ]
is positive semidefinite. This assumes that
√γ [ I ; −L_k ] [A  B]
has all its eigenvalues strictly inside the unit disc. This is true if √γ(A − BL_k) has all its eigenvalues
strictly inside the unit disc by Exercise 11.1. As in Example 8.6, it can be shown that if we start with
a stabilizing L0 , then all subsequent Lk , will also be stabilizing. Moreover, it can be shown that all
Vk are positive definite so that the inverse in the formula for Lk exists.
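The iterations of Example 11.2 can be sketched in MATLAB as below. This is not the book's code; the system matrices and the discount factor are assumptions, and the Lyapunov equation is solved with dlyap from the Control System Toolbox.

A = [0.5 1; 0 0.5]; B = [0; 1]; S = eye(2); R = 1; gamma = 0.95;   % assumptions
L = zeros(1,2);                               % a stabilizing initial policy
for k = 1:20
    G = [eye(2); -L]*[A B];                   % maps (x,u) to (F(x,u), mu_k(F(x,u)))
    M = dlyap(sqrt(gamma)*G', blkdiag(S,R));  % policy evaluation: M = blkdiag(S,R) + gamma*G'*M*G
    U = M(1:2,1:2); W = M(1:2,3); V = M(3,3);
    L = V \ W';                               % policy improvement: L_{k+1} = V_k^{-1} W_k'
end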
where xi+1 = F(xi , 𝜇k (xi )) for i ∈ ℕN−1 , and x1 = F(x0 , u0 ). In case N is large and 𝜇k is stabilizing,
we have that xN is close to zero and that also Qk (xN ) is close to zero. We realize that the only differ-
ence for the approximate evaluation of the Q-function as compared to the approximate evaluation
of the value function in (8.18) is that the first incremental cost is evaluated using u0 and not 𝜇k (x0 ).
where xi+1 = F(xi , 𝜇k (xi )) for i ∈ ℕN−1 , and x1 = F(xs , us ). We then find the approximation of Qk by
solving
minimize   (1/2) ∑_{s=1}^r ( Q̃(x^s, u^s, a_k) − β_k^s )²,
with variable a_k. After this, we use the following exact policy improvement step
μ_{k+1}(x) = argmin_u Q̃(x, u, a_k).     (11.14)
A drawback of this method compared to when using value functions is that the reuse of trajectories
is more problematic, since sufficiently many different values of the control signal might then not
be explored. For a more detailed discussion on this see [17, Section 5.3].
Example 11.3 We will in this example consider the optimal control problem in Example 8.4.
We will specifically consider the case when m = 1 and n = 2. Since we know that the value func-
tion is quadratic, we will use a feature vector that is φ(x, u) = (x_1², x_2², u², 2x_1x_2, 2x_1u, 2x_2u), where
x = (x_1, x_2) ∈ ℝ² and u ∈ ℝ. Notice that the indices refer to components of the vector and not to
time. We let
Q̃(x, u, a) = a^T φ(x, u),
minimize   (1/2) ∑_{s=1}^r ( φ^T(x^s, u^s) a − β_k^s )²,
with variable a. The solution ak satisfies the normal equations, cf . (5.3),
ΦTk Φk ak = ΦTk 𝛽k ,
where
Φ_k = [ φ^T(x^1, u^1); ⋮ ; φ^T(x^r, u^r) ],   β_k = [ β_k^1; ⋮ ; β_k^r ],
with
β_k^s = (x^s)^T S x^s + (u^s)^T R u^s + ∑_{i=1}^{N−1} γ^i ( x_i^T S x_i + μ_k(x_i)^T R μ_k(x_i) ),
where x1 = Axs + Bus and xi+1 = Axi + B𝜇k (xi ) for i ∈ ℕN−2 with initial values xs , s ∈ ℕr . It is here
crucial to choose (xs , us ) such that ΦTk Φk is invertible. We realize that we need r ≥ 6 for this to hold.
Moreover, we need to choose (xs , us ) sufficiently different for ΦTk Φk to be well conditioned. For a
general n, we will need r ≥ (m + n)(m + n + 1)∕2.
From Example 5.7, we realize that the solution to (11.14) is given by
μ_{k+1}(x) = argmin_u Q̃_k(x, u, a_k) = −q̃_k^{−1} r̃_k^T x,
assuming that q̃_k is positive. Here q̃_k and r̃_k are defined from a_k. We may hence write
μ_{k+1}(x) = −L_{k+1} x,
where L_{k+1} = q̃_k^{−1} r̃_k^T. It is a good idea to start with some L_0 such that μ_0 is stabilizing.
11.3.2 SARSA-Algorithm
We will now present an alternative approach. We define ε : 𝒳_n × 𝒰_m × 𝒰_m × ℝ^p → ℝ as
ε(x, u, v, a) = Q̃(x, u, a) − f(x, u) − γ Q̃( F(x, u), v, a ).
Then the error we obtain when we approximate Qk in (11.12) with Q ̃ can be written as
𝜀(x, u, 𝜇(F(x, u)), a), where 𝜇 = 𝜇k . We then define the LS problem
minimize   (1/2) ∑_{i=0}^{N−1} ε( x_i, u_i, μ_k(x_{i+1}), a )²,
with variable a and solution a_k, where
x_{i+1} = F(x_i, u_i)
u_i = μ_k(x_i)
with x0 given. Then 𝜇 is updated as in (11.14). For each value of k, we should choose different
initial values x0 . This should be done such that many different states and controls are explored. It is
common to add a realization 𝑤 of a zero mean white noise random process W to the control signal
in order to obtain more exploration, i.e. we use
x_{i+1} = F(x_i, u_i + w_i)
u_i = μ_k(x_i).
It is important to gather enough information that the above LS problems have a unique solution.
Also, when updating a, the information from the previous batches should not be disregarded.
Hence, batch-recursive LS should be used.
Sometimes k = i is used, i.e. the policy improvement step is carried out for each i. Then a pure
recursive LS technique can be used to solve the LS problem. These types of algorithms are often
referred to as state-action-reward-state-action (SARSA) algorithms. They are possible to run in real
time by letting xi+1 be given by measurements from a real system when the control signal ui has
been applied. This will result in what is known as adaptive control, see [7].
Exact recursive LS techniques are only available for linear LS problems, but there are approximate
techniques available for nonlinear LS problems for the case when Q̃ is a nonlinear function.
It should be stressed that the theoretical convergence properties of the above schemes are not well
understood.
Example 11.4 We will in this example consider the optimal control problem in Example 8.4.
We will specifically consider the case when m = 1 and n = 2. Since we know that the value function
is quadratic, we will use a feature vector that is φ(x, u) = (x_1², x_2², u², 2x_1x_2, 2x_1u, 2x_2u, 1), where
x = (x_1, x_2) ∈ ℝ² and u ∈ ℝ. Notice that the indices refer to components of the vector and not to
time. We let
Q̃(x, u, a) = a^T φ(x, u),
where a ∈ ℝ⁷. With P̃ ∈ 𝕊², r̃ ∈ ℝ², and q̃ ∈ ℝ defined as
[ P̃  r̃ ; r̃^T  q̃ ] = [ a_1  a_4  a_5 ; a_4  a_2  a_6 ; a_5  a_6  a_3 ],
we may write
Q̃_k(x, u, a) = [x; u]^T [ P̃  r̃ ; r̃^T  q̃ ] [x; u] + a_7.     (11.16)
Here we have added a constant term compared to what we did in Example 11.3, since we have a
random process W involved. From Example 5.7, we realize that the solution to (11.14) is given by
μ_{k+1}(x) = argmin_u Q̃_k(x, u, a_k) = −q̃_k^{−1} r̃_k^T x,
assuming that q̃_k is positive. Here q̃_k and r̃_k are defined from a_k. We may hence write
μ_{k+1}(x) = −L_{k+1} x,
where L_{k+1} = q̃_k^{−1} r̃_k^T. With
x̄_{i+1} = φ(x_i, u_i) − γ φ(x_{i+1}, −L_i x_{i+1}),
and
ȳ_{i+1} = x_i^T S x_i + u_i^T R u_i,
we may write the residual as
ε(x_i, u_i, −L_i x_{i+1}, a) = x̄_{i+1}^T a − ȳ_{i+1},
and hence, the LS problem reads
minimize   (1/2) ∑_{i=1}^N ( x̄_i^T a − ȳ_i )²,
where we have shifted the summation index. We now use recursive LS as in Exercise 10.2 to update
the parameter a for an LS problem with N terms to an LS problem with N + 1 terms in order to
obtain adaptive control. The recursion for this reads
a_{N+1} = a_N + P_{N+1} x̄_{N+1} ( ȳ_{N+1} − x̄_{N+1}^T a_N ),
where
P_{N+1} = P_N − ( 1/(1 + x̄_{N+1}^T P_N x̄_{N+1}) ) P_N x̄_{N+1} x̄_{N+1}^T P_N.
For the definition of P_N, see the solution of Exercise 10.2. We then have that L_{N+1} = q̃_{N+1}^{−1} r̃_{N+1}^T, where
q̃ N+1 and r̃ N+1 are defined by aN+1 . We may start with L0 = 0 and a0 = 0 if we have no better initial
guess. We should take P0 to have a large value, e.g. a large multiple of the identity matrix.
parameters of the linear regression or the ANN, see Section 11.5. One advantage of this is that no
optimization has to be carried out in real time when the controller is running, since the feedback
function will be known explicitly in terms of the linear regression or ANN. Also, it will in many
cases improve the speed of the learning.
In Section 8.6, we showed how to formulate the Bellman equation as an LP. Our intention is now
to show that this is also possible to do for the Bellman equation for the Q-function. To this end, we
start the VI in (11.9) with a Q_0 such that Q_1 = T_Q(Q_0) ≥ Q_0 for all (x, u) ∈ 𝒳_n × 𝒰_m. One possible
choice is Q_0 = 0. We obtain T_Q(Q_1) ≥ T_Q(Q_0) from the monotonicity property, see Exercise 11.6,
and hence Q_2 = T_Q(Q_1) ≥ Q_1 ≥ Q_0. If we repeat this, we obtain Q_k = T_Q(Q_{k−1}) ≥ Q_{k−1} ≥ Q_0.
Since Q_k converges to the solution Q of (11.7), we have shown that Q is the maximum element,
see [22, p. 45], of the set of functions Q that satisfy the linear inequalities
Q(x, u) ≤ f(x, u) + γ Q( F(x, u), v ),   ∀(x, u, v) ∈ 𝒳_n × 𝒰_m × 𝒰_m.
The maximum element can be obtained by solving the LP
maximize   ∑_{(x,u) ∈ 𝒳_n × 𝒰_m} c(x, u) Q(x, u)     (11.17)
minimize   (1/2) ∑_{k=1}^N ||u_k − μ̃(x_k, a)||₂²,
with respect to a, where uk , k ∈ ℕN are solutions to (11.19) for the samples x = xk . One benefit of
the approach is that fewer optimization problems need to be solved when the policy is implemented
in real time, and moreover evaluating 𝜇̃ might be done much faster than solving the optimiza-
tion problem. Hence, smaller sampling times in real-time applications are possible, which may be
important for some applications. Notice that this approach can be used for any policy for which we
know its values for a discrete number of samples, i.e. it does not necessarily have to be related to
(11.19).
From this we realize that not only the states but also their derivatives can be obtained from sim-
ulation using the same dynamical equations or an experiment involving the dynamical system.
For the gradient, one simulation or experiment has to be carried out for each value of l and each
component of the control signal. In case Ak and Bk do not depend on k, the time-invariant case, we
realize that
dx_k/du_l^T = dx_{k−l}/du_0^T,
since the initial value is zero. For the linear time-invariant case, this means that one just needs to
obtain the so-called “impulse response” of the dynamical system. It is of course debatable if it is a
good idea to carry out impulse response experiments to obtain the impulse response. The reason is
that the impulse response is given by A^{k−1}B, k ≥ 1, and hence it can be computed from knowledge
of the system matrices (A, B). They can as will be discussed in Chapter 12 be obtained using system
identification. To use an impulse as the input signal for system identification is normally not a good
experiment design, see Section 12.9. However, using system identification in conjunction with the
methods of Chapter 8 is a more involved approach than just using an impulse response which can
be obtained from a very simple experiment.
For the case of nonlinear Fk , the nonlinear dynamical equations may be used instead. The result
will then not be exact, but only hold approximately, i.e.
dx_{k+1}/du_{i,j} ≈ F_k( dx_k/du_{i,j}, du_k/du_{i,j} ).
Here index i refers to stage index and index j to component index. This all assumes that the coor-
dinate system has been chosen such that Fk (0, 0) = 0. For the nonlinear time-invariant case one
simulation for each component of uk is enough. What has been presented above is strongly related
to what is called iterative learning control (ILC), see, e.g. [49, 77]. This approach is typically used in
applications where the same repeated task is carried out over and over again. Any reference value
is incorporated in the definition of the incremental costs fk . In case the whole state cannot be mea-
sured, it is still possible to use impulse responses for obtaining derivatives assuming that only what
is measured is penalized in the objective function, see Exercise 11.12.
is zero for a given so-called reference value r ∈ ℝNm for y. The following root-finding algorithm,
see Appendix 11.6:
uk+1 = uk − t𝜀(uk ),
will be used. Here the subindex k refers to iteration index and not to stage index or time. Notice
that the computations involved in the root-finding algorithm only require us to be able to evaluate
the error e for a known input u. This can be done with simulations or with experiments on a real
dynamical system, i.e. no explicit knowledge of the functions Fk and Gk is needed.
The assumptions for convergence are that 𝜀 is Lipschitz continuous and strongly monotone.
These conditions are not easy to investigate for a general H. However, we may phrase them in
systems theory terms. The Lipschitz constant 𝛽 is the incremental gain of the dynamical system,
i.e. the smallest 𝛽 such that
||H(u) − H(𝑣)||2 ≤ 𝛽||u − 𝑣||2 , ∀u, 𝑣 ∈ ℝNm .
The strong monotonicity condition is the same as saying that the dynamical system is incrementally
strictly passive with dissipation 𝛼, i.e.
(H(u) − H(v))^T (u − v) ≥ α ||u − v||₂²,   ∀u, v ∈ ℝ^{Nm}.
Example 11.5 We now assume that F_k(x, u) = Ax + Bu and that G_k(x, u) = Cx + Du for matrices
(A, B, C, D) of compatible dimensions. Then H is a linear function, and we may instead write
y = Hu. Here H is the matrix defined as
H = ⎡ h_0      0    ⋯    0   ⎤
    ⎢ h_1      ⋱    ⋱    ⋮   ⎥
    ⎢ ⋮        ⋱    ⋱    0   ⎥
    ⎣ h_{N−1}  ⋯    h_1  h_0 ⎦ ,
where h_0 = D and h_i = CA^{i−1}B ∈ ℝ^{m×m} for i ∈ ℕ are the Markov parameters, or equivalently the
impulse response coefficients for the linear dynamical system. It is straightforward to see that the
Lipschitz constant is 𝛽 = ||H||2 , i.e. the largest singular value of H. Moreover, the criterion for
strong monotonicity is that
(u − v)^T H^T (u − v) ≥ α ||u − v||₂²,   ∀u, v ∈ ℝ^{Nm}.
We realize that this is equivalent to
(1/2) x^T ( H + H^T ) x ≥ α ||x||₂²,   ∀x ∈ ℝ^{Nm},
which is satisfied if and only if the smallest eigenvalue 𝜆min of H + H T is greater than or equal to
2𝛼. If we then denote the largest eigenvalue of H T H with 𝜆max it follows that the above algorithm
converges for t ∈ (0, 4∕(𝜆min + 2𝜆max )], assuming that 𝜆min > 0, see Appendix 11.6.
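For the linear case, the resulting ILC iteration is a few lines of MATLAB. The sketch below is not from the book; the SISO system, the reference, and the step size are assumptions chosen so that the iteration is contractive.

A = 0.5; B = 1; C = 0.4; D = 1; N = 50;      % an assumed stable SISO system
h = [D, C*(A.^(0:N-2))*B];                   % Markov parameters h_0, ..., h_{N-1}
H = toeplitz(h, [h(1) zeros(1,N-1)]);        % lower triangular Toeplitz matrix
r = ones(N,1);                               % assumed reference for the output
u = zeros(N,1); t = 0.5;                     % assumed step size
for k = 1:100
    y = H*u;                                 % simulate, or run an experiment
    u = u - t*(y - r);                       % u_{k+1} = u_k - t*eps(u_k)
end
norm(H*u - r)                                % the tracking error is small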
of this, it is also possible to use feedback in order to make H itself close to the identity matrix,
see e.g. [23].
We realize that there is a strong advantage of using the root-finding algorithm, since no gradients
are needed, as would be the case for formulations involving minimization of, e.g. ε^T ε. However, this
comes at the price of assumptions on Lipschitz continuity and strong monotonicity.
Example 11.6 For the case when F_k(x, u) = A_k x + B_k u, where A_k ∈ ℝ^{n×n} and B_k ∈ ℝ^{n×m}, and
when μ(x, a) = Lx, where L ∈ ℝ^{m×n} with a^T = [L_1 ⋯ L_m], where L_i are the rows of L, we have
dx_{k+1}/da^T = A_k dx_k/da^T + B_k du_k/da^T
du_k/da^T = L dx_k/da^T + blkdiag(x_k^T, …, x_k^T).
We realize that the derivatives are obtained by simulation of the closed-loop system or from exper-
iments involving the closed-loop system with the current xk as an additional input.
The results of the example also hold true approximately for the nonlinear case when Fk (0, 0) is
zero. Clearly, many simulations have to be carried out since one simulation is needed for every
component of a, i.e. there are p = mn simulations. This idea is the basis for what is called iterative
feedback tuning (IFT). There, output feedback is often considered, and reference values for the output
signal are defined explicitly. Then it is possible with transfer function manipulations to show
that only two additional simulations are needed for the case when m = 1 and when there is only
one output signal, see e.g. [56].
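The recursions in Example 11.6 are straightforward to simulate. The following MATLAB sketch propagates dx_k/da^T for a made-up time-invariant system with m = 1; the matrices, policy, and horizon are assumptions used only for the illustration.

% Made-up linear system and current policy u_k = L*x_k (m = 1).
n = 2; N = 20;
A = [0.9 0.2; 0 0.8]; B = [0; 1]; L = [-0.3 -0.4];
x    = [1; -1];                  % state trajectory starting at x_0
dxda = zeros(n, n);              % dx_k/da^T, where a = L^T has n components
for k = 0:N-1
    duda = L*dxda + x.';         % du_k/da^T = L*dx_k/da^T + x_k^T (one block since m = 1)
    dxda = A*dxda + B*duda;      % dx_{k+1}/da^T = A*dx_k/da^T + B*du_k/da^T
    x    = A*x + B*(L*x);        % closed-loop state update
end
% The same quantities could also be generated by experiments on the closed-loop
% system with x_k injected as an additional input, as discussed above.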
Before we end this section, it should be mentioned that just using different initial values to obtain
good experimental conditions for the simulations might not be a good idea. It might be the case
that the simulations are experiments where one does not control the initial values. Also, large initial
values might be needed to obtain informative data, in which case the approximations for nonlinear state
dynamics might not be good. A remedy to this is to inject a perturbation by changing the dynamical
equations to
xk+1 = Fk (xk , uk + 𝑤k ),
where 𝑤k is a realization of white noise. This should only be done in the first experiment in each
iteration for IFT. Sometimes the experimental conditions are such that one has to live with pertur-
bations also in the other experiments. One should then resort to stochastic optimization methods
as discussed in Section 6.8. We will discuss this more in an exercise.
where the random variable X0 is the initial state of the process. We consider an unbiased estimator
Gk of g(xk ) with bounded variance, i.e.
𝔼[ G_k | X_k = x_k ] = g(x_k),   𝔼[ ||G_k − g(x_k)||_2^2 | X_k = x_k ] ≤ c^2,   (11.30)
for all k and for some scalar c ≥ 0. Note that in the special case where c = 0, the iteration (11.28) is
essentially the root-finding method (11.26). We note that G(x, 𝜉) is a random variable, and it is an
unbiased estimator of g(x) and hence, it is natural to choose Gk = G(Xk , 𝜉k ), where 𝜉k has the same
distribution as 𝜉 for all k ≥ 0, and where 𝜉k and 𝜉j are independent for j ≠ k. We assume that g
satisfies (11.24) and (11.25). For a constant step size tk = t, it then follows from the proof preceding
(6.71) that (11.28) converges in mean square to a ball centered at x⋆ , where g(x⋆ ) = 0, and with
a radius that is proportional to the constant step size t. A more elaborate analysis can be done to
show that we have convergence in mean square to x⋆ for a diminishing step size.
Exercises
11.1 Let A ∈ ℝm×n and B ∈ ℝn×m , where m ≤ n. Show that the eigenvalues of BA are the eigen-
values of AB together with n − m zero eigenvalues.
11.2 We will in this exercise compare the fitted value iterations in Example 8.3 with the fitted
Q-function iterations in Example 11.1 for a finite horizon LQ control problem. Implement
the two algorithms in MATLAB and investigate their performance for the case when
A_k = [ 0.5  1 ; 0  0.5 ],   B_k = [ 0 ; 1 ],
and Rk = 1 and Sk = I. Consider a time horizon of N = 10. Also compare the resulting feed-
back gains Lk you obtain with the ones obtained from the Riccati recursion in Example 8.1.
11.3 Show that the Q-functions defined in (8.32) satisfy the recursion
Q_k(x, ū) = 𝔼[ f_k(x, ū) + min_u Q_{k+1}( F_k(x, ū, W_k), u ) ].   (11.31)
11.4 We now investigate how to do fitted Q-function iterations for the stochastic LQ problem in
Example 8.10. We then need to extend Q̃_k in Example 11.1 with a term a_7, i.e.
Q̃_k(x, u, a) = [ x ; u ]^T [ P̄  r̄ ; r̄^T  q̄ ] [ x ; u ] + a_7.
Make the necessary additional modifications to accommodate this in the algorithm.
Specifically, you should replace x_+^s with x_+^s = Ax_k^s + Bu_k^s + 𝑤_k, where 𝑤_k is a realization
from a white noise random process W with unit variance. Use the same problem data as in
Exercise 11.2 and run the algorithm in MATLAB. We know that the same feedback gains Lk
are optimal for both the stochastic and deterministic case by certainty equivalence. Do you
get the same solution? Does the quality of the solution depend on the number of samples
you consider?
11.6 (a) Show that the Bellman Q-operator in (11.8) is such that
T_Q(Q_1)(x, ū) ≤ T_Q(Q_2)(x, ū)
for all (x, ū) ∈ ℝ^n × ℝ^m if Q_1(x, ū) ≤ Q_2(x, ū) for all (x, ū) ∈ ℝ^n × ℝ^m. Also, show that
it is a contraction.
(b) We let the stochastic Bellman Q-operator 𝒯_Q be defined as
𝒯_Q(Q)(x, ū) = 𝔼[ f(x, ū) + min_u 𝛾Q( F(x, ū, W), u ) ],   (11.33)
where F ∶ ℝ^n × ℝ^m × ℝ^p → ℝ^n, and the expectation is with respect to W. Show that the same
results as in the previous subexercise also hold for the stochastic case.
11.8 We now consider a stochastic version of the problem in the previous exercise. This is
obtained by defining F(x, u, 𝑤) as
         u = −1   u = 0   u = 1
x = −1     −1      −1       0
x = 0      −1       0       1
x = 1       0       1       1
for 𝑤 = 0, as
         u = −1   u = 0   u = 1
x = −1     −1       0       1
x = 0       0       1       1
x = 1       1       1       1
for 𝑤 = 1, and as
         u = −1   u = 0   u = 1
x = −1     −1      −1      −1
x = 0      −1      −1       0
x = 1      −1       0       1
for 𝑤 = −1. We assume that the random variable Wk takes the values (−1, 0, 1) with prob-
abilities (p, 1 − 2p, p), where p ∈ [0, 0.5]. Notice that the deterministic case is recovered for
p = 0.
(a) Solve the problem using (11.11) modified to the stochastic setting as discussed at the
end of Section 11.2. For what values of 𝛾 and p do you obtain a nontrivial solution?
(b) Solve the problem using the LP formulation in Section 11.4 for the Q-function. See the
end of Section 11.4 for how to modify the LP to the stochastic case.
Q_𝜇(Q) = 𝔼[ f(x, u) + 𝛾Q( F(x, u, W), 𝜇(F(x, u, W)) ) ],
where F ∶ ℝ^n × ℝ^m × ℝ^p → ℝ^n, and the expectation is with respect to W. Show that if
Q_1(x, u) ≤ Q_2(x, u) for all (x, u) ∈ ℝ^n × ℝ^m, then
Q_𝜇(Q_1) − Q_𝜇(Q_2) ≤ 0.
Also show that Q_𝜇 is a contraction.
11.11 In this exercise, we consider the same stochastic LQ problem as in the previous subexercise.
Implement the approach in Example 11.4. Are you able to obtain convergence of Lk to the
correct value? Does it depend on the initial value P0 ?
11.12 We are given matrices A ∈ ℝn×n , B ∈ ℝn×1 , C ∈ ℝ1×n and D ∈ ℝ. We consider the following
optimal control problem:
minimize   (1/2) ∑_{k=0}^{N} (y_k − r_k)^2
subject to xk+1 = Axk + Buk , k ∈ ℤN−1
yk = Cxk + Duk , k ∈ ℤN ,
for a given initial value x0 with variables (u0 , x1 , y1 … , uN−1 , xN , yN ). Denote by J the objec-
tive function. Let u = (u0 , … , uN ) and y = (y0 , … , yN ).
(a) Show that
dy/du^T = H,
where
H = [ D            0    ⋯    0
      CB           ⋱    ⋱    ⋮
      ⋮            ⋱    ⋱    0
      CA^{N−1}B    ⋯    CB   D ].
(b) Show that the optimality conditions for the optimal control problem are
dJ/du = ( dy/du^T )^T (y − r) = 0,
where
y = 𝒪x_0 + Hu,
with
𝒪 = [ C
      CA
      ⋮
      CA^N ].
(c) Show that the optimal u is the solution to
H^T Hu = H^T r,
if x0 = 0.
11.13 We are given matrices A ∈ ℝn×n , B ∈ ℝn×1 , C ∈ ℝ1×n and D ∈ ℝ. We consider the following
stochastic optimal control problem:
minimize   (1/2) 𝔼[ ∑_{k=0}^{N} (Y_k − r_k)^2 ]
previous exercise. We now want to use a stochastic gradient method to find a solution to the
problem.
(a) Show that for arbitrary u we have that
Y = 𝒪x_0 + Hu + 𝒢W + E,
where
𝒢 = [ 0           0   ⋯   0
      C           ⋱   ⋱   ⋮
      ⋮           ⋱   ⋱   0
      CA^{N−1}    ⋯   C   0 ].
Here 𝒪 and H are defined as in the previous exercise.
(b) We now define
dŶ/du^T = [ Y_imp   SY_imp   ⋯   S^{N−1}Y_imp ],
where S is a shift matrix that is all zeros except for ones on the first subdiagonal, and where
Y_imp = 𝒪x_0 + H_1 + 𝒢W + E
is the impulse response, where H_1 is the first column of H. Notice that this is
obviously not the true value of the Jacobian dY/du^T = H. However, it can be obtained from
an impulse response experiment. Let
G = ( dŶ/du^T )^T (Y − r),
and show that this is an unbiased estimator of the gradient of the objective function
for the stochastic optimal control problem when x_0 = 0. You may assume that we have
evaluated Y and Y_imp for different uncorrelated noise sequences.
(c) Consider a simple example where
A = [ 0.5  1 ; 0  0.5 ],   B = [ 1 ; 0 ],   C = [ 0  1 ],   D = 2.
Let N = 50, generate a reference value given by
r_k = 10 for 0 ≤ k ≤ 24 and r_k = −10 for 25 ≤ k ≤ 50,
assume that the initial value x0 is zero, and let the noise sequences both have zero
mean, unit variance and have Gaussian distribution. Solve the stochastic optimal con-
trol problem with a stochastic gradient method using the gradient suggested above.
11.14 We will now investigate a stochastic version of Example 11.5. To this end, we consider the
Markov process in the previous exercise for which we have that
Y = 𝒪x_0 + Hu + 𝒢W + E,
where W and E are zero mean random vectors. We define the error function
𝜀(u, W, E) = Y − r = 𝒪x_0 + Hu + 𝒢W + E − r,
similarly as for the deterministic case, and then we apply the root-finding algorithm
uk+1 = uk − tgk ,
where gk = yk − r, and yk is a realization of Y for the kth iteration. Notice that the subindex
k does not refer to the stage index for the Markov process.
(a) Show that 𝜀(u, W, E) is an unbiased estimator of 𝔼[ 𝜀(u, W, E) ].
(b) Consider the same numerical example as in the last subexercise of the previous
exercise. Compute the largest singular value of H and the smallest eigenvalue of
H + H T . What is the largest possible step-length that can be used in order to guarantee
convergence in m.s. to a ball centered around the true solution?
(c) Use the stochastic root-finding algorithm applied to the same Markov process as in
the last subexercise in the previous exercise. Use the same reference value. Compare
the convergence for the root-finding algorithm with that of the stochastic gradient
algorithm of the previous exercise. Experiment both with fixed step-size and with
diminishing step-size for the root-finding algorithm. You may assume that x0 = 0.
11.15 We are given matrices A ∈ ℝn×n , B ∈ ℝn×1 , C ∈ ℝ1×n and D ∈ ℝ. We consider the following
stochastic optimal control problem:
minimize   (1/2) 𝔼[ ∑_{k=0}^{N} (Y_k − r_k)^2 ]
where
dX_{k+1}/da_i = A dX_k/da_i + B du_k/da_i + W_k
du_k/da_i = L dX_k/da_i + X_{i,k},
where X_{i,k} is the ith component of X_k for i ∈ ℤ_n. The initial value is dX_0/da^T = 0.
(b) Solve again the same stochastic control problem numerically that we have considered
in the previous two exercises. This time use a stochastic gradient method to compute
the optimal policy discussed above. You may assume that x_0 = 0. You should in each
iteration use three independent noise sequences for computing X_k and dX_k/da_i, and the
noise sequences should of course also be independent for each iteration. Are you able
to obtain as small a value of the objective function as you were before? If not, why? How
does the optimal policy perform for a different reference value than the one it was trained
for? How do the approaches in the previous two exercises perform for another reference
value in case no new training is performed for this new reference value?
11.16 Use the reinforcement learning toolbox in MATLAB to solve the shortest path problem
in Example 8.2 using Q-learning. Use the MDP environment. You need to code the states
and actions differently as compared to what is done in the example. You should label the
states from 1 to 14. You also need to label the actions with positive integers, say 1, 2, and 3.
Moreover, you need to code what happens if you, for example, choose action 1 when you
are at stage k = 1 in node 1. A good idea is to say that you stay in the same node and that
the distance is very long, say 10. In this way, this action cannot be optimal. Also remember
that the rewards in reinforcement learning are the negative of the costs in optimal control,
and hence, all the rewards you code have to be the negative of the distances.
12
System Identification
System identification is about learning models for dynamical systems. Examples include the flow of
water in a river, the planetary motions, and the number of cars on a segment of a freeway. We will
in this chapter limit ourselves to discrete-time dynamical systems. However, many of the results
can be generalized to continuous-time dynamical systems using sampling. The origin of system
identification goes back to the work by Karl Åström and Torsten Bohlin in 1965.
We start by defining what we mean by a dynamical system in state-space form. We define a regres-
sion problem for learning/estimating the dynamical system, and specifically, we define it as an
ML problem. From this, we then derive input–output models and the corresponding ML problem.
We discuss in detail how the parameters can be estimated by solving a nonlinear LS problem.
Then, we discuss how to estimate the model when some of the data are missing. Nuclear norm
system identification is also discussed in this context. Prior information can be incorporated into
system identification easily using Gaussian processes and empirical Bayes. We show how this can
be implemented using the sequential convex optimization technique based on the majorization
minimization principle. Recurrent neural networks and temporal convolutional neural networks
are shown to be generalizations of the linear dynamical models to nonlinear dynamical models.
The chapter is finished off with a discussion on experiment design for system identification.
where K ∈ ℝn×p . The variables x and e in the innovation model are not the same x and e as used
in the previous model. For details about how they are related, see Exercise 3.17. There is no loss in
generality to consider the innovations form. Then we define 𝜃 = (A, B, C, D, K, R2 , x0 ), and the ML
problem for identification is now to solve
minimize   (1/2) ∑_{k=0}^{N} e_k^T R_2^{−1} e_k + ((N+1)/2) ln det R_2
subject to  x_{k+1} = Ax_k + Bu_k + Ke_k,  k ∈ ℤ_{N−1}                    (12.7)
            y_k = Cx_k + Du_k + e_k,  k ∈ ℤ_N
with variables (𝜃, x, e). For the case when p = 1, it follows that we can solve a constrained LS prob-
lem with no weighting with R2 , and then estimate R2 as eT e∕(N + 1), where e is the optimal solution,
cf . Section 9.8. The optimization problem is however not convex due to bilinearity of the variables
in the constraints. Even worse is the fact that there are uncountably many solutions to the prob-
lem because of the fact that the input–output relations are unaffected by state transformations, cf .
Exercise 2.12. How to circumvent the latter problem will be discussed next. We will from now on
restrict ourselves to the case when m = p = 1. Most of the results presented below carry over to the
general case with some slight modifications.
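In practice, problems of this type are attacked with dedicated software. As a small illustration, the following MATLAB sketch, which assumes that the System Identification Toolbox is installed and uses made-up data, estimates an innovations-form state-space model with ssest; it is a practical counterpart to (12.7) rather than a direct implementation of it.

% Made-up data from a first-order system, followed by state-space estimation.
N = 500;
u = randn(N,1);
y = filter([0 0.5], [1 -0.7], u) + 0.1*randn(N,1);
dat = iddata(y, u, 1);               % input-output data with sampling time 1
sys = ssest(dat, 2, 'Ts', 1);        % discrete-time innovations-form model of order 2
compare(dat, sys)                    % assess the fit of the estimated model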
Multiply this equation with z−k ∕(2𝜋i) for k = −n, −n + 1, … N − n and integrate along the unit cir-
cle C in the complex plane.1 Then, since
∮_C z^k dz = 2𝜋i for k = −1, and ∮_C z^k dz = 0 for k ≠ −1,
it follows that
Ta y = 𝜉 + Tb u + Tc e, (12.8)
where y = (y0 , … , yN ), u = (u0 , … , uN ), and e = (e0 , … , eN ), and where
Ta = I + a1 S + · · · + an−1 Sn−1 + an Sn
Tb = b0 I + b1 S + · · · + bn−1 Sn−1 + bn Sn
Tc = I + c1 S + · · · + cn−1 Sn−1 + cn Sn
are Toeplitz matrices with the shift matrix S ∈ ℝ(N+1)×(N+1) being a matrix of all zeros except for
ones on the first subdiagonal, cf . (2.24). The vector 𝜉 ∈ ℝN+1 is a vector of all zeros except for
the first n elements, which are functions of the initial value x0 . We may hence write 𝜉 = (𝜉0 , 0),
where 𝜉0 ∈ ℝn . Each and every row in the above equation is related to a specific k above. Except
for the first n rows in the above equation, one may equivalently write
𝒜(q)y_k = ℬ(q)u_k + 𝒞(q)e_k,
where we have introduced the shift operator q, which shifts the time index of a signal,
i.e. qy_k = y_{k+1}. This is called an autoregressive-moving-average model with exogenous terms
(ARMAX). The case when 𝒞(q) = 1 is called an autoregressive model with exogenous terms
(ARX), the case when 𝒞(q) = 𝒜(q) is called an output error (OE) model, and the case when
𝒜(q) = 𝒞(q) = 1 is called a finite impulse response (FIR) model.
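As a small numerical illustration of the Toeplitz operator notation, the following MATLAB sketch assembles T_a, T_b, and T_c from coefficient vectors and generates data according to (12.8); the coefficient values are made up for the example (n = 1).

% Made-up ARMAX coefficients and data generated via (12.8).
N = 100;
a = 0.7; b = [0.7 0]; c = 0.5;
S  = diag(ones(N,1), -1);                  % (N+1)x(N+1) shift matrix, cf. (2.24)
Ta = eye(N+1) + a(1)*S;                    % Ta = I + a_1*S + ... + a_n*S^n
Tb = b(1)*eye(N+1) + b(2)*S;               % Tb = b_0*I + b_1*S + ... + b_n*S^n
Tc = eye(N+1) + c(1)*S;                    % Tc = I + c_1*S + ... + c_n*S^n
u  = randn(N+1,1); e = randn(N+1,1);
xi = zeros(N+1,1);                         % zero initial condition
y  = Ta \ (xi + Tb*u + Tc*e);              % output according to (12.8)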
cf . Section 6.9, where in every other step (ym , 𝜉0 ) and (a, b, c) are fixed, respectively. When (ym , 𝜉0 )
is fixed, we have a standard system identification problem with no missing data and known initial
value. When (a, b, c) is fixed, we have a linear LS problem for (ym , 𝜉0 ).
Also, we define 𝒯_m = T_y T_m^T. Then it can be shown, see [52, 113], that the ML problem is equivalent to
minimize   (1/2) ‖ ( det 𝒯_m^T 𝒯_m )^{1/(2n_o)} ( T_y y − T_i 𝜉_0 − T_u u ) ‖_2^2     (12.12)
with variables (𝜃, y_m), where n_o is the number of observed outputs. The difference as compared
to (12.11) is the factor ( det 𝒯_m^T 𝒯_m )^{1/(2n_o)}, which makes the optimization problem much more challenging. It is still a separable nonlinear LS problem. For the case when also inputs are missing, see
[52, 113].
Example 12.1 In this example, we consider identification of an ARMAX model for which a = 0.7,
b = (0.7, 0), and c = 0.5. The details of the experimental conditions are given in Exercise 12.1.
In total, 40% of the data is missing. In Figure 12.1, the results of 100 experiments are presented
for the estimates of a and c when using both the criterion in (12.11) and the one in (12.12). It is
seen that the first criterion results in biased estimates, i.e. estimates that are not centered around
the true values. This is not the case for the second criterion, which is the ML criterion.
Figure 12.1 Plots showing the estimated values of a and c for 100 runs of system identification when data
are missing. The crosses show the result when the criterion in (12.11) is used, and the circles shows the
result when the criterion in (12.12) is used. The true values are a = 0.7 and c = 0.5.
So far we have assumed that the system we would like to estimate a model for can be described
within the model class we consider, i.e. that both the model and the true system are a linear system
with state-dimension n. However, it can be shown that ARMAX systems can be arbitrarily well
approximated with ARX models, see [69], if the model order n in the model is taken large enough.
If both the number of data N and the model order n go to infinity, with N growing faster than n, then the
estimate is consistent. Since computing ARX models is equivalent to solving a linear LS problem,
the solution can be obtained both efficiently and accurately. A drawback with the approach of using
high model order is that even for a modest number of data, the variance of the estimate of the model
parameters will be large. A remedy to this is to model the dynamical system as a Gaussian process,
which by our previous discussions in Chapter 10 equivalently can be interpreted as regression in
a Hilbert space or as a MAP estimate. For simplicity, we will in this section only consider the FIR
case, but the extension to the ARX case is immediate.
y = 𝜉 + Tb u + e,
which is a special case of the regression model derived in Section 12.3. It models a FIR system.
Define
U = u0 I + u1 S + · · · + uN S N ,
where S is a lower triangular shift matrix. Now, we let Un+1 be the first n + 1 columns of U. Then
it holds with 𝜃 = (b, 𝜉0 ) that
y = Φ𝜃 + e,
where
Φ = [ U_{n+1}   J ],
with J = [ I  0 ]^T ∈ ℝ^{(N+1)×n}.
Section 10.3 if we take X = Φ and a = 𝜃. We now assume that e is the outcome of a zero mean
normally distributed random vector with covariance 𝜎 2 I and that 𝜃 is the outcome of a zero
mean normally distributed random vector with covariance Σ and independent of e. It then follows
that the estimate of 𝜃 is given by the conditional mean of 𝜃 given observations of y as
𝜃̂ = ΣΦ^T ( 𝜎^2 I + ΦΣΦ^T )^{−1} y,
similarly as in Section 10.3. It can be shown that this is a consistent estimate if Σ ∈ 𝕊^N_{++}, Φ^TΦ∕N
converges to an invertible matrix, and Φ^T e∕N converges to zero as N goes to infinity [26].
Parameterizing Σ with a hyperparameter 𝜂 ∈ ℝ^q, it follows that y is the outcome of a zero mean normally distributed random vector with covariance Z ∶ ℝ^q × ℝ_+ → 𝕊^{N+1}_{++} defined by
Z(𝜂, 𝜎^2) = ΦΣ(𝜂)Φ^T + 𝜎^2 I.
Therefore, the ML problem is equivalent to
minimize yT Z(𝜂, 𝜎 2 )−1 y + ln det Z(𝜂, 𝜎 2 ),
with variables (𝜂, 𝜎 2 ), where we also consider 𝜎 2 as a hyper parameter. This approach is known as
Empirical Bayes or Type II Maximum Likelihood. The advantage with this parameterization is that
the ML criterion then is the difference between the two convex functions g ∶ ℝq → ℝ and h ∶ ℝq →
ℝ defined by g(𝜂) = yT Z(𝜂, 𝜎 2 )−1 y and h(𝜂) = − ln det Z(𝜂, 𝜎 2 ), see exercises 4.5 and 4.7. Hence,
the sequential convex optimization technique based on the majorization minimization principle
applies, see Section 6.2.
How should then a suitable subset be chosen? Some insight can be obtained from the fact that for
a FIR model, the parameter b is the impulse response of a linear dynamical system. One possible
choice is to take Σ_b ∶ ℝ^3 → 𝕊^{n+1}_+ defined by
{Σ_b(𝜂)}_{j,k} = 𝜆 𝛼^{(k+j)/2} 𝜌^{|j−k|},
for (j, k) ∈ ℕn+1 × ℕn+1 , where 𝜂 = (𝜆, 𝛼, 𝜌) with 𝜆 ∈ ℝ+ , 0 ≤ 𝛼 < 1, and |𝜌| ≤ 1, see [26]. This is
called a diagonal/correlated (DC) kernel.3 We then let Σ = bdiag(Σb , Σ𝜉 ), where Σ𝜉 ∈ 𝕊n+ . Here 𝛼
accounts for an exponential decay along the diagonal of Σb and 𝜌 describes the correlation between
neighboring impulse response coefficients. Exponential decay is expected for a stable linear system.
There are several other choices for Σb discussed in the above reference. One interesting choice is to
take Σ(𝜂) = ∑_{i=1}^{q} 𝜂_i Σ_i, where Σ_i ∈ 𝕊^{2n+1}_+ and 𝜂 ∈ ℝ^q_+. This is motivated by the fact that for a linear
dynamical system, the impulse response is the sum of the impulse responses of its partial fraction
expansion. The different Σi could be obtained from fixed DC kernels modeling the different modes
expected to be present in the model. For a survey on kernel methods in system identification, we
refer the reader to [88].
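To make the construction concrete, the following MATLAB sketch forms the regressor matrix Φ for an FIR model, builds a DC kernel, and computes the regularized estimate 𝜃̂; the data, the hyperparameters (𝜆, 𝛼, 𝜌, 𝜎²), and the choice Σ_𝜉 = I are all assumptions made for the illustration, and in practice the hyperparameters would be tuned by minimizing the Empirical Bayes criterion above.

% Made-up data and fixed DC-kernel hyperparameters.
N = 200; n = 50;
u = filter(0.3, [1 -0.7], randn(N+1,1));            % low-pass-like input
btrue = 0.5*0.8.^(0:n)';                            % made-up impulse response
y = filter(btrue, 1, u) + 0.1*randn(N+1,1);

S = diag(ones(N,1), -1);                            % shift matrix
U = zeros(N+1,N+1); v = u;
for k = 1:N+1, U(:,k) = v; v = S*v; end             % U = [u  Su  ...  S^N*u]
Phi = [U(:,1:n+1), [eye(n); zeros(N+1-n, n)]];      % Phi = [U_{n+1}  J]

lambda = 1; alpha = 0.9; rho = 0.95; sigma2 = 0.01;
[jj, kk] = ndgrid(1:n+1, 1:n+1);
Sigma_b = lambda*alpha.^((jj+kk)/2).*rho.^abs(jj-kk);   % DC kernel
Sigma   = blkdiag(Sigma_b, eye(n));                     % Sigma = bdiag(Sigma_b, Sigma_xi)
theta   = Sigma*Phi'*((sigma2*eye(N+1) + Phi*Sigma*Phi')\y);
bhat    = theta(1:n+1);                                 % regularized impulse response estimate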
We consider ek to be realizations of i.i.d. Gaussian random variables with zero mean and unit vari-
ance. The input signal uk is generated in a similar way. However, it is then low-pass filtered with a
filter with cut-off frequency of 0.9. We generate data for k ∈ ℕN , where N = 200. We then estimate
an FIR model of order n = 50 using Empirical Bayes with the DC kernel and another FIR model
without using Empirical Bayes. After this, we generate new data denoted (u, ̄ ȳ ) in the same way as
we generated the data (u, y) for estimation of the FIR model. We then generate ŷ from
ŷ_k = ∑_{j=1}^{n} b̂_j ū(k − j),
where b̂ j are the coefficients for the FIR models we estimated. This results in one ŷ for the Empirical
Bayes estimate and one for the estimate without using Empirical Bayes. We compare these ŷ with
one another and with ȳ in Figure 12.2. We see that Empirical Bayes results in a much better model,
since for that model ȳ and ŷ are much closer to one another.
Figure 12.2 The left plot shows ȳ k (solid), and ŷ (dashed), as function of k, for Empirical Bayes. The right
plot shows ȳ k (solid), and ŷ (dashed), as function of k, without Empirical Bayes.
The vector 𝜃 contains all the parameters that define the predictor. The idea in temporal convolu-
tional networks (TCNs) is to build up the function f using a tree structure, where each node in the
tree is defined by a nonlinear ARX-model as well. The formal definition is as follows: let xk = (yk , uk )
and let
ŷ_{k+1} = f^{(L)}( Z_k^{(L−1)} )
z_k^{(l)} = f^{(l)}( Z_k^{(l−1)} ),  l ∈ ℕ_{L−1}
z_k^{(0)} = x_k,
where
Z_k^{(l−1)} = ( z_k^{(l−1)}, z_{k−d_l}^{(l−1)}, …, z_{k−(n̄−1)d_l}^{(l−1)} ),
and where dl is the so-called dilation factor. Typically, dl increases exponentially with l, e.g.
d_l = 2^{l−1}. We notice that n − 1 = (n̄ − 1)d_L is the effective memory of the overall predictor. Each
function f (l) can be defined in many different ways, and we refer to [4] for some examples. A very
simple example would be to take it as a standard one-layer ANN. Because of the tree-structure of
the TCN, it is possible to introduce further parallelism when computing gradients as compared to
the back-propagation algorithm.
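The following MATLAB sketch evaluates a predictor of this form; the layer maps f^{(l)}, the dilations, and the data are made up, and for simplicity every layer is given the same output dimension.

% Made-up TCN-style predictor with dilations d_l = 2^(l-1).
Lno = 3; nbar = 2; d = 2.^(0:Lno-1);
f = cell(Lno,1);
for l = 1:Lno
    W = 0.1*randn(2, 2*nbar);             % made-up parameters of layer l
    f{l} = @(Z) tanh(W*Z(:));             % a simple one-layer map f^{(l)}
end
x = randn(2, 100);                        % x_k = (y_k, u_k), k = 1,...,100
z = x;                                    % z^{(0)} = x
for l = 1:Lno
    znew = zeros(2, size(z,2));
    for k = 1 + (nbar-1)*d(l) : size(z,2)
        Z = z(:, k - (0:nbar-1)*d(l));    % Z_k^{(l-1)} = (z_k, z_{k-d_l}, ...)
        znew(:,k) = f{l}(Z);
    end
    z = znew;                             % z^{(l)}
end
yhat = z(1, end);                         % prediction formed by the top layer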
4 The estimate 𝜃̂ satisfies the normal equations X T X 𝜃̂ = X T Y , where Y = X𝜃 + E with 𝜃 being the true parameter
value. Hence, X T X𝔼𝜃̂ = X T (X𝜃 + 𝔼E) = X T X𝜃. This shows that 𝔼𝜃̂ = 𝜃. Then the covariance is given by
𝔼[ (𝜃̂ − 𝔼𝜃̂)(𝜃̂ − 𝔼𝜃̂)^T ] = 𝔼[ (𝜃̂ − 𝜃)(𝜃̂ − 𝜃)^T ] = 𝔼[ (X^TX)^{−1}X^T EE^T X(X^TX)^{−1} ] = 𝜎^2 (X^TX)^{−1}.
where Ru ∶ ℤ → ℝ is the covariance function for the input signal of which uk is a realization, see
Sections 3.6 and 3.9, and (5.45). Our idea is now to find a good covariance function Ru for the input.
We consider the vector r ∈ ℝ^{n+1} defined as r = [ R_u(0) ⋯ R_u(n) ]^T as a variable that we are going to
find a good value for. We define the function R ∶ ℝ^{n+1} → 𝕊^{n+1}_+ as
R(r) = N [ r_1       ⋯   r_{n+1}
           ⋮         ⋱   ⋮
           r_{n+1}   ⋯   r_1 ],
which is a linear function of r. We also define the function P ∶ ℝn+1 → 𝕊+n+1 as
P(r) = 𝜎 2 R(r)−1 .
We have that P̄ ≈ P assuming that we are able to generate an input with the covariance function
defined by r.
We now give different measures of what is a good covariance P(r). The following quantities
should be small:
A-optimality: tr P(r)
D-optimality: ln det P(r)
E-optimality: 𝜆_max(P(r))
L-optimality: tr(WP(r)), where W ∈ 𝕊^{n+1}_{++}.
The reason for the name optimality above is that we would like to make the scalar functions of
r above small. Then we will have a small covariance of the estimate 𝜃̂. All optimality criteria can be
related to the eigenvalues of P(r), and the eigenvalues of a symmetric positive definite matrix are
related to the length of the different principal axes of the confidence ellipsoids for the estimate that
are given by {𝜃 ∈ ℝ^{n+1} | (𝜃̂ − 𝜃)^T P(r)^{−1}(𝜃̂ − 𝜃) ≤ 𝛼}, where 𝛼 ∈ ℝ_{++}. Hence, A-optimality tries to
make the sum of the lengths of principal axes as small as possible, and E-optimality tries to make
the largest principal axis as small as possible. Regarding L-optimality this is a scaled version of
A-optimality, and we will discuss why this is the case, and what it means later on. Since the volume
of an ellipsoid is proportional to the product of the eigenvalues it follows that D-optimality tries to
make the volume of the confidence ellipsoid as small as possible. Notice that we can make each of
the criteria arbitrarily small by choosing r big enough, i.e. by using a high power in the signal uk .
In order to avoid this trivial and uninteresting solution, we will impose a bound on r1 ≤ L, where
L ∈ ℝ++ . Then the signal-to-noise ratio (SNR) will be L∕𝜎 2 .
and we realize that Ru (k) = 0 for k > n. Here it holds that c0 = 1. The Z-transform Φu ∶ ℂ → ℂ of
the covariance function Ru is given by
Φ_u(z) = ∑_{k=−n}^{n} R_u(k) z^{−k}.
A necessary and sufficient condition for Ru or equivalently r to be valid is that Φu (ei𝜔 ) ≥ 0 for all
𝜔 ∈ ℝ. It is possible to characterize these r with a convex set. To this end, let
A = [ 0  0 ; I_{n−1}  0 ],   B = [ 1 ; 0 ],   C = [ r_2 ⋯ r_{n+1} ].
Define the function Ψ ∶ ℂ → ℂ as
Ψ(z) = C(zI − A)^{−1}B + (1/2) r_1.
Since
(zI − A) [ z^{−1} ; ⋮ ; z^{−n} ] = B,
it follows that
Ψ(z) = (1/2) r_1 + r_2 z^{−1} + ⋯ + r_{n+1} z^{−n} = (1/2) R_u(0) + R_u(1)z^{−1} + ⋯ + R_u(n)z^{−n}.
We now have that
Φu (z) = Ψ(z) + Ψ(1∕z).
We will see that Φu (ei𝜔 ) = Ψ(ei𝜔 ) + Ψ(e−i𝜔 ) ≥ 0 for all 𝜔 ∈ ℝ is equivalent to the existence of a
Q ∈ 𝕊n such that the following constraint holds
K(Q, r) = [ Q − A^TQA     C^T − A^TQB
            C − B^TQA     r_1 − B^TQB ] ∈ 𝕊^{n+1}_+,
where K ∶ 𝕊n × ℝn+1 → 𝕊n+1 . This equivalence is known as the positive real lemma. The set of Q
and r that satisfies this constraint is convex since 𝕊+n+1 is a convex cone, and the matrix K is affine in
Q and r. We will show one direction of the proof of the positive real lemma. Assume that K(Q, r) ∈
𝕊+n+1 . Notice that
K(Q, r) = [ Q    C^T
            C    r_1 ] − [ A  B ]^T Q [ A  B ],
and that
[ A  B ] [ (zI − A)^{−1}B ; 1 ] = z(zI − A)^{−1}B.
From this, it is possible to show that
[ (z^{−1}I − A)^{−1}B ; 1 ]^T K(Q, r) [ (zI − A)^{−1}B ; 1 ] = Φ_u(z),
from which the result follows that Φu (ei𝜔 ) ≥ 0 for all 𝜔 ∈ ℝ. For a proof in the other direction see,
e.g. [92].
12.9.2 D-Optimality
We will first look at D-optimality. It holds that ln det P(r) = ln det R(r)−1 + 2(n + 1) ln 𝜎 is a con-
vex function of r. Hence, minimizing it is tractable. We get the following equivalent optimization
problem, where we have removed the constant terms in the objective function:
minimize − ln det R(r)
subject to r1 ≤ L
K(Q, r) ∈ 𝕊+n+1
with variables r ∈ ℝn+1 and Q ∈ 𝕊n . This is a conic optimization problem. We will see that the other
optimization problems related to the other optimality criteria can also be cast as conic optimization
problems.
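As an illustration, the following MATLAB sketch, which assumes that CVX is installed, solves the D-optimality problem above for a made-up problem size; R(r) and K(Q, r) are assembled directly from their definitions.

% Made-up problem size and power bound for D-optimal input design.
n = 4; N = 100; Lb = 1;
A = [zeros(1,n); eye(n-1), zeros(n-1,1)];  % matrices appearing in K(Q,r)
B = [1; zeros(n-1,1)];
cvx_begin sdp
    variable r(n+1)
    variable Q(n,n) symmetric
    expression R(n+1,n+1)
    for i = 1:n+1
        for j = 1:n+1
            R(i,j) = N*r(abs(i-j)+1);      % R(r): symmetric Toeplitz with first column N*r
        end
    end
    C = r(2:n+1)';                         % C = [r_2 ... r_{n+1}]
    K = [Q - A'*Q*A,  C' - A'*Q*B;
         C - B'*Q*A,  r(1) - B'*Q*B];      % positive real lemma matrix K(Q,r)
    maximize( log_det(R) )                 % same as minimizing -ln det R(r)
    subject to
        r(1) <= Lb;
        K >= 0;
cvx_end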
12.9.3 E-Optimality
Minimizing a function f ∶ ℝn → ℝ over some set D ⊂ ℝn is equivalent to minimizing t over the
epigraph of f . The minimal value of f (x) will be equal to the minimal value of t. This is called the
epigraph formulation of the optimization problem. Thus, minimizing 𝜆max (P(r)) with respect to r
is equivalent to minimizing t subject to 𝜆max (P(r)) ≤ t or, equivalently, subject to 𝜆min (R(r)) ≥ 1∕t.
The smallest eigenvalue is greater than 1∕t if and only if all eigenvalues are greater than 1∕t, and it
can be shown that these inequalities hold if and only if R(r) − 1∕tI ∈ 𝕊+n+1 . Also notice that mini-
mizing t is the same as maximizing 1∕t. We now introduce a new variable s = 1∕t. Then notice that
maximizing s is the same as minimizing −s. Hence, the optimization problem for E-optimality can
be stated as
minimize − s
subject to r1 ≤ L
R(r) − sI ∈ 𝕊+n+1
K(Q, r) ∈ 𝕊+n+1
12.9.4 A-Optimality
Similarly to the epigraph formulation, it holds that minimizing a sum of functions ∑_{i=1}^{m} f_i(x),
where f_i ∶ ℝ^n → ℝ, is equivalent to minimizing ∑_{i=1}^{m} t_i subject to f_i(x) ≤ t_i, where t_i ∈ ℝ and where
i ∈ ℕ_m, cf. Exercise 4.2. We now apply this to the A-optimality criterion and obtain the optimization
problem:
minimize   ∑_{i=1}^{n+1} t_i
subject to r_1 ≤ L
           ( 𝜎^2 R(r)^{−1} )_{ii} ≤ t_i,  i ∈ ℕ_{n+1}
           K(Q, r) ∈ 𝕊^{n+1}_+
with variables r ∈ ℝn+1 , t ∈ ℝn+1 and Q ∈ 𝕊n . With ei ∈ ℝn+1 , i ∈ ℕn+1 being the standard basis
vectors for ℝ^{n+1}, we can write the last constraints as
t_i − e_i^T ( 𝜎^2 R(r)^{−1} ) e_i ≥ 0,  i ∈ ℕ_{n+1},
and hence, it follows that the above optimization problem is also equivalent to a conic optimization
problem.
12.9.5 L-Optimality
For L-optimality, we factorize W as W = V −1 V −T which can always be done with the help of, e.g. a
Cholesky factorization, see (2.45). Now,
tr(WP(r)) = tr( V^{−T} 𝜎^2 R(r)^{−1} V^{−1} ) = tr( 𝜎^2 ( VR(r)V^T )^{−1} ),
and hence, we realize that we obtain the optimization problem for L-optimality by replacing R(r)
in the optimization problem for A-optimality with VR(r)V T . Hence, this is just a weighted version
of A-optimality with the weight matrix W.
We will discuss an application where L-optimality comes up naturally. It is not necessarily the
case that we are interested in good values only of the parameters, but we might be more interested
in obtaining a small Mean Square Error (MSE) when we use the model for prediction with new
data. Hence, we define X̄ similarly as X in the beginning of this section, but for the typical input
signal for which we would like to have a good prediction. Then it can be shown5 that the MSE for
the predictor is
𝜎^2 tr( X̄(X^TX)^{−1}X̄^T ) + 𝜎^2 tr I,
and hence, the choice of W = X̄^T X̄ in L-optimal experiment design will result in minimizing the
MSE for the predictor. We can interpret W∕N as an estimate of the covariance 𝔼[ Ū Ū^T ], where Ū
is the input for which we would like to have a small MSE.
Since Ru is an even function, we have that Φu (z) = Φu (1∕z). Hence, if zi is a zero or pole of Φu so is
1∕zi . Based on this the following spectral factorization follows, i.e. we may write
Φu (z) = 𝜅H(z)H(1∕z),
5 Let the new data be generated from Ȳ = X̄𝜃 + Ē, where the random vector Ē has the same distribution as E. Also,
let us estimate Ȳ with Ŷ = X̄𝜃̂, where 𝜃̂ solves the normal equations X^TX𝜃̂ = X^TY, where Y = X𝜃 + E. Then
𝔼Ŷ = X̄𝔼𝜃̂ = X̄𝜃. Also, Ŷ − 𝔼Ŷ = X̄(𝜃̂ − 𝔼𝜃̂) = X̄(X^TX)^{−1}X^TE. Hence, tr 𝔼[ (Ȳ − Ŷ)(Ȳ − Ŷ)^T ] = tr 𝔼[ (Ȳ − 𝔼Ŷ +
𝔼Ŷ − Ŷ)(Ȳ − 𝔼Ŷ + 𝔼Ŷ − Ŷ)^T ] = 𝜎^2 tr( X̄(X^TX)^{−1}X̄^T ) + 𝜎^2 tr I.
where the right-hand side will also be a polynomial. Let us now take pi = 0, i ∈ ℕn . With C ∶ ℂ → ℂ
defined as the polynomial
where C̃ ∶ ℂ → ℂ is defined as C̃(z) = z^n C(1∕z). It follows that C̃ will have all its zeros outside
the unit disk, since C has all its zeros inside the unit disk. Moreover, H(z) = C(z)∕zn . From this, it
follows that the MA-process defined as
Uk = Vk + c1 Vk−1 + · · · + cn Vk−n ,
with V_k independent of V_j for j ≠ k and with variance 𝜅, will have the covariance function R_u.
12.9.7 OE Model
It turns out that the solutions to the above optimization problems for optimal experiment design
will all be trivial except for L-optimality. The optimal input will be white noise. However, for more
general models than FIR models this is not the case. We will here consider a special case of an
ARMAX model obtained by letting Tc = Ta in (12.8), i.e.
Ta y = Tb u + Ta e.
Here we have assumed that 𝜉 = 0. This model is an OE model. Since Ta is invertible we may write
y = Ta−1 Tb u + e.
We now consider T_a to be a function of a = [ a_1 ⋯ a_{n_a} ]^T and T_b to be a function of b = [ b_0 ⋯ b_{n_b} ]^T.
We also introduce the vector 𝜃 = [ a^T  b^T ]^T, and we define the function T ∶ ℝ^n → ℝ^{N×N} by T(𝜃) =
T_a(a)^{−1}T_b(b), where n = n_a + n_b + 1. We realize that we have a nonlinear regression model
y = f(𝜃) + e,
6 The inverse of this covariance matrix is known as the Fisher information matrix. Under mild conditions, it holds
that this covariance, as N goes to infinity, is the smallest possible covariance, after normalization with N, that can be
obtained for any unbiased estimator, and this is called the Cramér–Rao lower bound.
We have that
∂f∕∂a_i = −T_a^{−1} S^i T_a^{−1} T_b u,  1 ≤ i ≤ n_a
∂f∕∂b_i = T_a^{−1} S^i u,  0 ≤ i ≤ n_b,
where S^0 = I. These expressions can be simplified since the shift matrix commutes with the
Toeplitz matrices and their inverses. Actually, it can be shown that
Y = T_y U_y,  Z = T_z U_z,
where T_y = −T_a^{−1}T_a^{−1}T_b ∈ ℝ^{N×N} and T_z = T_a^{−1} ∈ ℝ^{N×N} are lower triangular Toeplitz matrices, and
where
U_y = [ u  Su  ⋯  S^{n_a−1}u ],   U_z = [ u  Su  ⋯  S^{n_b}u ].
We realize that U_y and U_z are the first n_a and n_b + 1 columns of the lower triangular Toeplitz matrix
U = [ u  Su  ⋯  S^{N−1}u ] ∈ ℝ^{N×N}.
Since both T_y and T_z commute with this matrix, we have that
Y = U T̃_y,  Z = U T̃_z,
where T̃_y ∈ ℝ^{N×n_a} and T̃_z ∈ ℝ^{N×(n_b+1)} are the first columns of T_y and T_z, respectively. Since we
assume stability of the OE model, it follows that T_y and T_z are diagonally dominant, and hence,
we may approximate Y and Z as
Y = Ũ T̃_y,  Z = Ũ T̃_z,
where Ũ is a Toeplitz matrix with first column u and first row [ u(0)  u(−1)  ⋯  u(−N + 1) ].
Remember that X = [ Y  Z ], and hence,
X^TX = [ T̃_y  T̃_z ]^T Ũ^TŨ [ T̃_y  T̃_z ],
where
Ũ^TŨ ≈ N [ R_u(0)        ⋯   R_u(N − 1)
            ⋮             ⋱   ⋮
            R_u(N − 1)    ⋯   R_u(0) ].
Thus, we can model this covariance matrix as a symmetric Toeplitz function R̄ ∶ ℝ^N → 𝕊^N
defined as
R̄(r) = N [ r_1   ⋯   r_N
            ⋮     ⋱   ⋮
            r_N   ⋯   r_1 ],
and hence, all entries of X T X will be linear in r. We now realize a difference as compared to the
FIR-model case. There the vector r only had dimension n + 1, but here it has dimension N. We will
limit ourselves to a lower dimension by considering inputs u which are such that ri = 0 for i > m.
Hence, the function R̄ will be a banded matrix, and we will consider it to be a function of only
r ∈ ℝ^m. We now define R ∶ ℝ^m → 𝕊^n as R(r) = [ T̃_y  T̃_z ]^T R̄(r) [ T̃_y  T̃_z ], which is a linear function
of r. We finally define P ∶ ℝ^m → 𝕊^n_+ as
P(r) = 𝜎^2 R(r)^{−1},
similarly as for the FIR-model, and we can use the above optimization problem formulations to
compute optimal experiments.
One might think that it is very costly to set up the optimization problems, since there are several
matrix multiplications and inverses involved. However, it is the case that multiplication with both
a banded Toeplitz matrix and the inverse of a banded Toeplitz matrix is equivalent to a filtering, and
the multiplication with shift matrices is the same as delaying signals. Using sparse linear algebra
is just as efficient as the filtering approach. We will now look at an example.
Figure 12.4 Scatter-plots showing the estimated values of a and b. The left figure shows the results from
the preliminary nonoptimal experiment and the right figure shows the results from the optimal
experiments.
If one wants to use the model for prediction, then it is natural to try to make optimal experiments
for minimizing the MSE of the predictor for the intended input signal. It turns out that also other
application criteria related to the model can be interpreted as wanting to have a small MSE for a
certain input signal. One example is if one wants to have a good model of the transfer function in
a certain frequency band. Then this can be accomplished by using an input ū ∈ ℝN that has most
of its energy in this frequency band. We will model this with the covariance Σ̄ = 𝔼[ Ū Ū^T ], where
we assume that 𝔼Ū = 0. Here ū is the outcome of the random variable Ū. We will as an example
consider the OE-model in the previous section. The predictor is given by
Ŷ(𝜃̂) = T(𝜃̂)Ū,
where 𝜃̂ is the estimate of the model parameters 𝜃. Let 𝜇 = 𝔼𝜃̂ and Σ = 𝔼[ (𝜃̂ − 𝜇)(𝜃̂ − 𝜇)^T ] =
P(r), which are the mean and covariance of the ML estimate of the model parameters 𝜃. We make
a first-order Taylor series expansion of the ith column of T(𝜃̂):
T_i(𝜃̂) ≈ T_i(𝜇) + ( ∂T_i(𝜇)∕∂𝜃^T )(𝜃̂ − 𝜇),  i ∈ ℕ_n.
We introduce the notation
L_i = ∂T_i(𝜇)∕∂𝜃^T,  i ∈ ℕ_n,
for convenience. Since 𝜃̂ and Ū are assumed to be independent, it follows that
𝔼Ŷ(𝜃̂) ≈ T(𝜇)𝔼Ū + ∑_{i=1}^{n} L_i 𝔼[ (𝜃̂ − 𝜇)Ū_i ] = 0.
We also have
tr 𝔼[ Ŷ(𝜃̂)Ŷ(𝜃̂)^T ] ≈ A + 2B + C,
where
A = tr( T(𝜇) Σ̄ T^T(𝜇) )
B = tr 𝔼[ ∑_{i=1}^{n} L_i (𝜃̂ − 𝜇) Ū_i Ū^T T^T(𝜇) ] = 0
C = tr 𝔼[ ∑_{i=1}^{n} ∑_{j=1}^{n} L_i (𝜃̂ − 𝜇) Ū_i Ū_j (𝜃̂ − 𝜇)^T L_j^T ] = tr( Σ ∑_{i=1}^{n} ∑_{j=1}^{n} Σ̄_{ij} L_j^T L_i ).
We now define W = ∑_{i=1}^{n} ∑_{j=1}^{n} Σ̄_{ij} L_j^T L_i and remember that Σ = P(r), and we have shown that we
have obtained a criterion for L-optimality with the weight W.
Exercises
12.1 Show that the optimization problems in (12.11) and (12.12) are the same when Ta = Tc .
12.2 We will investigate the variable projection method in Section 6.7 when applied to (12.12).
We will take x = y_m and 𝛼 = (a, b, c). We will not consider 𝜉_0 as a variable to estimate, i.e. we
assume that it is zero, and we will generate data such that this holds true. We then write
F(x, 𝛼) = 𝛾( T_y y − T_u u ), where 𝛾 = ( det 𝒯_m^T 𝒯_m )^{1/(2n_o)}. With A(𝛼) = 𝛾𝒯_m and b(𝛼) = 𝛾(−𝒯_o y_o +
T_u u), it holds that F(x, 𝛼) = A(𝛼)x − b(𝛼).
(a) Show that P and x(𝛼) as defined in Section 6.7 are given by
P = I − 𝒯_m 𝒯_m^†
x(𝛼) = 𝒯_m^† ( −𝒯_o y_o + T_u u ),
search direction and the residual 𝛾e. For this purpose, you can use lsqnonlin in
MATLAB. The calling sequence is
OPTIONS = optimoptions('lsqnonlin','Algorithm',...
'levenberg-marquardt','SpecifyObjectiveGradient',true);
x = lsqnonlin(@(x) missing_ML_kaufman(x,par),...
zeros(3*n+1,1),[],[],OPTIONS);
You then need to write a function missing_ML_kaufman that as first output argu-
ment computes 𝛾e and as second output argument computes the Kaufman–Jacobian
approximation. These computations should then be repeated with new data for in total
100 times. Make a two-dimensional scatter plot of the estimated values of a and c. You
can then compare this with the results you obtain if you fix 𝛾 to one, i.e. solving (12.11).
This means that the residual is e and that the Kaufman–Jacobian approximation is P 𝜕e∕𝜕𝛼^T.
Y = 𝒪X + 𝒯U,   (12.16)
where
𝒪 = [ C
      CA
      ⋮
      CA^{r−1} ],   𝒯 = [ D           0           0    ⋯   0
                           CB          D           0    ⋯   0
                           CAB         CB          D         0
                           ⋮                       ⋱    ⋱
                           CA^{r−2}B   CA^{r−3}B   ⋯    CB   D ],
and where X = [ x_0 ⋯ x_{M−1} ].
(b) Assume that [ X ; U ] (X stacked on top of U) has full row rank and that 𝒪 has full column rank. Then use the results
and definitions from Exercise 2.14 to conclude that = L22 . Let
L_{22} = [ U_1  U_2 ] [ Σ  0 ; 0  0 ] [ V_1^T ; V_2^T ] = U_1 Σ V_1^T,
be an SVD. Show that there exists invertible state transformation T ∈ ℝn×n , c.f.
Exercise 2.12, such that C̄ = CT = Ū 1 , where Ū 1 contains the first n rows of U1 , and
that Ā = T −1 AT can be computed from U1 by solving a linear system of equations.
(c) Show how B̄ = T −1 B and D can be computed from (12.16) by solving a linear system of
equations assuming that C̄ and Ā are known.
12.4 Consider Example 12.2. You are asked to write a MATLAB code using the System Identi-
fication Toolbox that reproduces the result in the example. Play around with the value of
the bandwidth. What happens if you use a bandwidth of one instead of 0.9 as is used in the
example?
(a) Write a MATLAB function utilizing YALMIP that outputs the positive real lemma constraint with input variables Q and r. You should assume that the calling code
has declared the input variables as Q = sdpvar(n) and r = sdpvar(n+1,1),
respectively.
(b) Write a simple code to test your function by checking if r = [ 3  2  1 ]^T is valid. Also, try
r = [ 3  2  −1 ]^T.
12.6 Write a MATLAB function specfac that, given a vector r ∈ ℝ^{n+1} containing values
of a covariance function, computes the vector c ∈ ℝ^n containing the coefficients of
the C-polynomial and the variance 𝜅 such that the MA-process defined in (12.15) has
covariance as specified by r.
Appendix A
We define the maximum of f analogously. Consider the function ln(x), which has domain ℝ++ .
We have that the infimum over ℝ++ is −∞, which is not attained for an element of ℝ++ , and hence,
the function does not have a minimum over its domain. However, it has a minimum over the set
[1, ∞). The infimum is attained for x = 1, and hence, the minimum value is equal to zero.
For a differentiable function f ∶ 𝒳 → ℝ̄, where 𝒳 is an open subset of ℝ, we denote its derivative at x ∈ 𝒳 with
df(x)∕dx,  f′(x),  or  ḟ(x).
For a differentiable function f ∶ 𝒳 → ℝ̄, where 𝒳 is an open subset of ℝ^n, we denote its partial
derivative with respect to x_i at x = (x_1, … , x_n) ∈ 𝒳 with
∂f(x)∕∂x_i.
For an integrable function f ∶ 𝒳 → ℝ̄, we denote the definite integral over C ⊆ 𝒳 ⊆ ℝ^n with
∫_C f(x) dx.
When C = C_1 × ⋯ × C_n, we may also write
A.2 Software
Many of the problems in this book can be solved numerically with readily available software
packages. The landscape of software for optimization, learning, and control is vast, so rather than
attempting to provide a comprehensive list of software, we will provide only a brief overview of
select software packages that are related to problems in this book. We will focus on two types of
software: solvers and modeling tools. Generally speaking, solvers implement numerical methods
for learning or optimization of some class of problems, and modeling tools allow the user to specify
a wide range of problems in a solver-agnostic way using a high-level syntax.
Figure A.1 Modeling tools allow the users to specify their problems using a high-level syntax.
The problem is then transformed to a format that is accepted as some solver and, if possible, the
information returned from the solver is used to recover a solution to the problem provided by the user.
with variables 𝑤 ∈ ℝn , b ∈ ℝ, and u ∈ ℝm , problem data C ∈ ℝm×n and b ∈ {−1, 1}m , and regular-
ization parameter 𝛾 > 0. This problem is equivalent to an second-order cone program (SOCP), and
it can be solved using one of many different software packages for conic optimization. However, the
task of reformulating the problem so that it can be accepted by a given solver is often tedious and
error-prone. Moreover, it is often necessary to start from scratch if we wish to use another solver
that requires the problem to be specified differently. Modeling tools automate this process, as illus-
trated in Figure A.1. The problem specification in the figure is based on the CVX modeling package
for MATLAB, and it is clearly a self-explanatory and high-level representation of the mathemati-
cal description of the problem. Modeling tools also make it very easy to experiment with different
problem formulations, but the user often has little or no control over transformations, which can
make it difficult to exploit problem structure.
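For concreteness, the following CVX sketch specifies a soft-margin support vector machine of the kind alluded to above; since the problem statement itself is not repeated here, the exact formulation (a hinge loss with slack variables u and a norm penalty on 𝑤, which is SOCP-representable) and the random data are assumptions made only for this illustration.

% Made-up data: C holds the feature vectors as rows and b the labels in {-1, +1};
% the bias variable is called b0 to avoid clashing with the label vector b.
m = 100; n = 5; gamma = 1;
C = randn(m, n);
b = sign(randn(m, 1));
cvx_begin
    variables w(n) b0 u(m)
    minimize( norm(w) + gamma*sum(u) )
    subject to
        b.*(C*w + b0) >= 1 - u;            % margin constraints with slack u
        u >= 0;
cvx_end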
A.2.1.2 YALMIP
The MATLAB toolbox YALMIP [71] is a modeling package that can be used with a wide range of
solvers. YALMIP is an acronym for “Yet Another Linear Matrix Inequality Parser” and was ini-
tially developed for applications in control but has since grown into a full-fledged general-purpose
modeling package with support for many classes of both convex and nonconvex problems.
A.2.1.3 JuMP
The Julia modeling package JuMP [36] is a highly flexible modeling language for mathematical opti-
mization, and it supports a large number of solvers through a generic solver-independent interface.
version called TensorLayer, which provides some popular reinforcement learning modules that can
easily be customized and assembled. DeepMind Lab is a Google 3D platform with customization
for agent-based AI research. There is also the Reinforcement Learning Toolbox of MATLAB.
References
80 Y. E. Nesterov. A method for solving the convex programming problem with convergence rate
o(1∕k2 ). Doklady Akademii Nauk SSSR, 269:543–547, 1983.
81 Y. Nesterov. Lectures on Convex Optimization. Springer Optimization and its Applications.
Springer International Publishing, 2nd edition, 2018.
82 Y. Nesterov and B. T. Polyak. Cubic regularization of Newton method and its global
performance. Mathematical Programming, 108(1):177–205, April 2006. doi: 10.1007/
s10107-006-0706-8. URL https://doi.org/10.1007/s10107-006-0706-8.
83 B. O’Donoghue, E. Chu, N. Parikh, and S. Boyd. Conic optimization via operator splitting and
homogeneous self-dual embedding. Journal of Optimization Theory and Applications, 169(3):
1042–1068, June 2016. URL http://stanford.edu/boyd/papers/scs.html.
84 J. Omura. On the Viterbi decoding algorithm. IEEE Transactions on Information Theory,
15(1):177–179, 1969. doi: 10.1109/TIT.1969.1054239.
85 N. Parikh and S. Boyd. Proximal algorithms. Foundations and Trends in Optimization,
1(3):123–231, 2013.
86 T. A. Parks. Reducible Nonlinear Programming Problems. Ph.D. thesis, Rice University, 1985.
87 A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N.
Gimelshein, L. Antiga, A. Desmaison, A. Köpf, E. Yang, Z. DeVito, M. Raison, A. Tejani,
S. Chilamkurthy, B. Steiner, L. Fang, J. Bai, and S. Chintala. PyTorch: An Imperative Style,
High-Performance Deep Learning Library. Curran Associates Inc., Red Hook, NY, 2019.
88 G. Pillonetto, F. Dinuzzo, T. Chen, G. de Nicola, and L. Ljung. Kernel methods in system
identification, machine learning and function estimation: A survey. Automatica, 50:657–682,
2014.
89 B. T. Polyak. Some methods of speeding up the convergence of iteration methods. USSR
Computational Mathematics and Mathematical Physics, 4(5):1–17, January 1964. doi:
10.1016/0041-5553(64)90137-5.
90 M. J. D. Powell. On search directions for minimization algorithms. Mathematical Program-
ming, 4(1):193–201, December 1973. doi: 10.1007/bf01584660. URL https://doi.org/10.1007/
bf01584660.
91 J. C. Pratt. Sequential minimal optimization: A fast algorithm for training support vector
machines. Technical report, Microsoft Research, April 1998.
92 A. Rantzer. On the Kalman–Yakubovich–Popov lemma. Systems & Control Letters, 28(1):7–10,
1996.
93 A. V. Rao. A survey of numerical methods for optimal control. Advances in the Astronautical
Sciences, 135(1):497–528, 2010.
94 C. E. Rasmussen and C. K. I. Williams. Gaussian Processes for Machine Learning. MIT Press,
2006.
95 S. J. Reddi, S. Kale, and S. Kumar. On the convergence of Adam and beyond. In International
Conference on Learning Representations, 2018. URL https://openreview.net/forum?id=ryQu7f-
RZ.
96 H. Robbins and S. Monro. A stochastic approximation method. The Annals of Mathematical
Statistics, 22(3):400–407, 9 1951. doi: 10.1214/aoms/1177729586.
97 W. J. Rugh. Linear System Theory. Prentice Hall, Englewood Cliffs, NJ, 1996.
98 A. N. Shiryayev. Probability. Springer-Verlag, New York, 1984.
99 J. Sjöberg and M. Viberg. Separable non-linear least squares minimization – possible improve-
ments for neural net fitting. In IEEE Workshop on Neural Networks for Signal Processing,
1997.
100 J. F. Sturm. Using SeDuMi 1.02, a Matlab toolbox for optimization over symmetric
cones. Optimization Methods and Software, 11(1–4):625–653, January 1999. doi: 10.1080/
10556789908805766. URL https://doi.org/10.1080/10556789908805766.
101 M. Tawarmalani and N. V. Sahinidis. A polyhedral branch-and-cut approach to global opti-
mization. Mathematical Programming, 103:225–249, 2005.
102 TensorFlow. URL https://www.tensorflow.org/. Software available from tensorflow.org.
103 T. Tieleman and G. Hinton. Lecture 6.5—RmsProp: Divide the gradient by a running average
of its recent magnitude. COURSERA: Neural Networks for Machine Learning, 2012.
104 K. C. Toh, M. J. Todd, and R. H. Tütüncü. SDPT3 — a Matlab software package for semidefi-
nite programming, version 1.3. Optimization Methods and Software, 11(1–4):545–581, January
1999. doi: 10.1080/10556789908805762. URL https://doi.org/10.1080/10556789908805762.
105 M. Udell, K. Mohan, D. Zeng, J. Hong, S. Diamond, and S. Boyd. Convex optimization in
Julia. In SC14 Workshop on High Performance Technical Computing in Dynamic Languages,
2014.
106 L. Vandenberghe and M. S. Andersen. Chordal graphs and semidefinite optimization.
Foundations and Trends® in Optimization, 1(4):241–433, 2015. ISSN 2167-3888. doi: 10.1561/
2400000006. URL http://dx.doi.org/10.1561/2400000006.
107 M. Verhaegen and V. Verdult. Filtering and System Identification. Cambridge University Press,
Cambridge, UK, 2007.
108 S. Vigerske and A. Gleixner. SCIP: Global optimization of mixed-integer nonlinear programs
in a branch-and-cut framework. Optimization Methods and Software, 33(3):563–593, 2018.
doi: 10.1080/10556788.2017.1335312.
109 A. Viterbi. Error bounds for convolutional codes and an asymptotically optimum decod-
ing algorithm. IEEE Transactions on Information Theory, 13(2):260–269, 1967. doi:
10.1109/TIT.1967.1054010.
110 A. J. Viterbi and J. K. Omura. Principles of Digital Communication and Coding. McGraw-Hill
Book Company, New York, 1979.
111 A. Wächter and L. T. Biegler. On the implementation of an interior-point filter line-search
algorithm for large-scale nonlinear programming. Mathematical Programming, 106(1):25–57,
April 2005. doi: 10.1007/s10107-004-0559-y. URL https://doi.org/10.1007/s10107-004-0559-y.
112 M. J Wainwright and M. I. Jordan. Graphical models, exponential families, and variational
inference. Foundations and Trends in Machine Learning, 1(1–2):1–305, 2008.
113 R. Wallin and A. Hansson. Maximum likelihood estimation of linear SISO models subject
to missing output data and missing input data. International Journal of Control, 87(11):
2354–2364, 2014.
114 C.J.C.H. Watkins. Learning from Delayed Rewards. Ph.D. Thesis, Cambridge University,
Cambridge, 1989.
115 S. J. Wright. Primal-Dual Interior-Point Methods. SIAM, 1997. ISBN 978-0-89871-382-4.
116 S. J. Wright. Coordinate descent algorithms. Mathematical Programming, 151(1):3–34, March
2015. doi: 10.1007/s10107-015-0892-3. URL https://doi.org/10.1007/s10107-015-0892-3.
117 C. F. J. Wu. On the convergence properties of the EM algorithm. Annals of Statistics,
11(1):95–103, 1983.
118 S. P. Wu, S. Boyd, and L. Vandenberghe. FIR filter design via spectral factorization and convex
optimization. In B. Datta, editor, Applied Computational Control, Signal and Communications.
Birkhauser, Boston, MA, 1997.
119 M. Yannakakis. Computing the minimum fill-in is NP-complete. SIAM Journal on Algebraic
Discrete Methods, 2(1):77–79, March 1981. doi: 10.1137/0602010. URL https://doi.org/10.1137/
0602010.
120 C. Zhang, S. Bengio, M. Hardt, B. Recht, and O. Vinyals. Understanding deep learning
requires rethinking generalization, 2017. https://doi.org/10.48550/arXiv.1611.03530
121 H. Zhang and W. W. Hager. A nonmonotone line search technique and its application to
unconstrained optimization. SIAM Journal on Optimization, 14(4):1043–1056, January 2004.
doi: 10.1137/s1052623403428208. URL https://doi.org/10.1137/s1052623403428208.