Mathematics For Machine Learning
Foreword
2 Linear Algebra
2.1 Systems of Linear Equations
2.2 Matrices
2.3 Solving Systems of Linear Equations
2.4 Vector Spaces
2.5 Linear Independence
2.6 Basis and Rank
2.7 Linear Mappings
2.8 Affine Spaces
2.9 Further Reading
Exercises
3 Analytic Geometry
3.1 Norms
3.2 Inner Products
3.3 Lengths and Distances
3.4 Angles and Orthogonality
3.5 Orthonormal Basis
3.6 Orthogonal Complement
3.7 Inner Product of Functions
3.8 Orthogonal Projections
3.9 Rotations
3.10 Further Reading
Exercises
4 Matrix Decompositions
4.1 Determinant and Trace
This material will be published by Cambridge University Press as Mathematics for Machine Learning by Marc Peter Deisenroth, A. Aldo Faisal, and Cheng Soon Ong. This pre-publication version is free to view and download for personal use only. Not for re-distribution, re-sale or use in derivative works. © M. P. Deisenroth, A. A. Faisal, and C. S. Ong, 2019. https://mml-book.com.
References
Index

Foreword
covered in high school mathematics and physics. For example, the reader
should have seen derivatives and integrals before, and geometric vectors
in two or three dimensions. Starting from there, we generalize these con-
cepts. Therefore, the target audience of the book includes undergraduate
university students, evening learners and learners participating in online
machine learning courses.
In analogy to music, there are three types of interaction that people
have with machine learning:
Astute Listener The democratization of machine learning by the pro-
vision of open-source software, online tutorials and cloud-based tools al-
lows users to not worry about the specifics of pipelines. Users can focus on
extracting insights from data using off-the-shelf tools. This enables non-
tech-savvy domain experts to benefit from machine learning. This is sim-
ilar to listening to music; the user is able to choose and discern between
different types of machine learning, and benefits from it. More experi-
enced users are like music critics, asking important questions about the
application of machine learning in society such as ethics, fairness, and pri-
vacy of the individual. We hope that this book provides a foundation for thinking about the certification and risk management of machine learning systems, and allows these users to use their domain expertise to build better machine learning systems.
Experienced Artist Skilled practitioners of machine learning can plug
and play different tools and libraries into an analysis pipeline. The stereo-
typical practitioner would be a data scientist or engineer who understands
machine learning interfaces and their use cases, and is able to perform
wonderful feats of prediction from data. This is similar to a virtuoso play-
ing music, where highly skilled practitioners can bring existing instru-
ments to life and bring enjoyment to their audience. Using the mathe-
matics presented here as a primer, practitioners would be able to under-
stand the benefits and limits of their favorite method, and to extend and
generalize existing machine learning algorithms. We hope that this book
provides the impetus for more rigorous and principled development of
machine learning methods.
Fledgling Composer As machine learning is applied to new domains,
developers of machine learning need to develop new methods and extend
existing algorithms. They are often researchers who need to understand
the mathematical basis of machine learning and uncover relationships be-
tween different tasks. This is similar to composers of music who, within
the rules and structure of musical theory, create new and amazing pieces.
We hope this book provides a high-level overview of other technical books
for people who want to become composers of machine learning. There is
a great need in society for new researchers who are able to propose and
explore novel approaches for attacking the many challenges of learning
from data.
Acknowledgments
We are grateful to many people who looked at early drafts of the book and
suffered through painful expositions of concepts. We tried to implement
their ideas that we did not vehemently disagree with. We would like to
especially acknowledge Christfried Webers for his careful reading of many
parts of the book, and his detailed suggestions on structure and presen-
tation. Many friends and colleagues have also been kind enough to pro-
vide their time and energy on different versions of each chapter. We have
been lucky to benefit from the generosity of the online community, who
have suggested improvements via github.com, which greatly improved
the book.
Many people have found bugs, proposed clarifications, and suggested relevant literature, either via github.com or personal communication, among them contributors through github whose real names were not listed on their github profile.
We are also very grateful to Parameswaran Raman and the many anony-
mous reviewers, organized by Cambridge University Press, who read one
or more chapters of earlier versions of the manuscript, and provided con-
structive criticism that led to considerable improvements. A special men-
tion goes to Dinesh Singh Negi, our LaTeX support, for detailed and prompt advice about LaTeX-related issues. Last but not least, we are very grateful
to our editor Lauren Cowles, who has been patiently guiding us through
the gestation process of this book.
Table of Symbols
Mathematical Foundations
1 Introduction and Motivation
Since machine learning is inherently data driven, data is at the core of machine learning. The goal of machine learning is to design general-purpose methodologies to extract valuable patterns from data, ideally without much domain-specific expertise. For example, given a large corpus of documents (e.g., books in many libraries), machine learning methods can be used to automatically find relevant topics that are shared across documents (Hoffman et al., 2010). To achieve this goal, we design models that are typically related to the process that generates data, similar to the dataset we are given. For example, in a regression setting, the model would describe a function that maps inputs to real-valued outputs. To paraphrase Mitchell (1997): A model is said to learn from data if its performance on a given task improves after the data is taken into account. The goal is to find good models that generalize well to yet unseen data, which we may care about in the future. Learning can be understood as a way to automatically find patterns and structure in data by optimizing the parameters of the model.
While machine learning has seen many success stories, and software is
readily available to design and train rich and flexible machine learning
systems, we believe that the mathematical foundations of machine learn-
ing are important in order to understand fundamental principles upon
which more complicated machine learning systems are built. Understand-
ing these principles can facilitate creating new machine learning solutions,
understanding and debugging existing approaches, and learning about the
inherent assumptions and limitations of the methodologies we are work-
ing with.
[Figure: The foundations and four pillars of machine learning. The pillars are regression, dimensionality reduction, density estimation, and classification; the foundations are linear algebra, analytic geometry, matrix decomposition, vector calculus, probability & distributions, and optimization.]
between the two parts of the book to link mathematical concepts with
machine learning algorithms.
Of course there are more than two ways to read this book. Most readers
learn using a combination of top-down and bottom-up approaches, some-
times building up basic mathematical skills before attempting more com-
plex concepts, but also choosing topics based on applications of machine
learning.
2 Linear Algebra

[Figure 2.1: Different types of vectors. Vectors can be surprising objects, including (a) geometric vectors and (b) polynomials.]
[Figure 2.2: A mind map of the concepts introduced in this chapter (vectors, vector spaces, groups, matrices, systems of linear equations, linear independence, bases, linear/affine mappings, Gaussian elimination, matrix inverses), along with where they are used in other parts of the book (Chapter 3 Analytic Geometry, Chapter 5 Vector Calculus, Chapter 10 Dimensionality Reduction, Chapter 12 Classification).]
resources are Gilbert Strang’s Linear Algebra course at MIT and the Linear
Algebra Series by 3Blue1Brown.
Linear algebra plays an important role in machine learning and gen-
eral mathematics. The concepts introduced in this chapter are further ex-
panded to include the idea of geometry in Chapter 3. In Chapter 5, we
will discuss vector calculus, where a principled knowledge of matrix op-
erations is essential. In Chapter 10, we will use projections (to be intro-
duced in Section 3.8) for dimensionality reduction with principal compo-
nent analysis (PCA). In Chapter 9, we will discuss linear regression, where
linear algebra plays a central role for solving least-squares problems.
Example 2.1
A company produces products N1 , . . . , Nn for which resources
R1 , . . . , Rm are required. To produce a unit of product Nj , aij units of
resource Ri are needed, where i = 1, . . . , m and j = 1, . . . , n.
The objective is to find an optimal production plan, i.e., a plan of how
many units xj of product Nj should be produced if a total of bi units of
resource Ri are available and (ideally) no resources are left over.
If we produce $x_1, \dots, x_n$ units of the corresponding products, we need a total of
$$a_{i1}x_1 + \cdots + a_{in}x_n \qquad (2.2)$$
many units of resource $R_i$. An optimal production plan $(x_1, \dots, x_n) \in \mathbb{R}^n$, therefore, has to satisfy the following system of equations:
$$a_{11}x_1 + \cdots + a_{1n}x_n = b_1, \quad \vdots \quad, \quad a_{m1}x_1 + \cdots + a_{mn}x_n = b_m, \qquad (2.3)$$
where $a_{ij} \in \mathbb{R}$ and $b_i \in \mathbb{R}$.

Equation (2.3) is the general form of a system of linear equations, and $x_1, \dots, x_n$ are the unknowns of this system. Every $n$-tuple $(x_1, \dots, x_n) \in \mathbb{R}^n$ that satisfies (2.3) is a solution of the linear equation system.
Example 2.2
The system of linear equations
$$\begin{aligned} x_1 + x_2 + x_3 &= 3 \quad (1)\\ x_1 - x_2 + 2x_3 &= 2 \quad (2)\\ 2x_1 + 3x_3 &= 1 \quad (3) \end{aligned} \qquad (2.4)$$
has no solution: Adding the first two equations yields $2x_1 + 3x_3 = 5$, which contradicts the third equation (3).

Let us have a look at the system of linear equations
$$\begin{aligned} x_1 + x_2 + x_3 &= 3 \quad (1)\\ x_1 - x_2 + 2x_3 &= 2 \quad (2)\\ x_2 + x_3 &= 2 \quad (3) \end{aligned} \qquad (2.5)$$
From the first and third equation, it follows that $x_1 = 1$. From (1)+(2), we get $2x_1 + 3x_3 = 5$, i.e., $x_3 = 1$. From (3), we then get that $x_2 = 1$. Therefore, $(1, 1, 1)$ is the only possible and unique solution (verify that $(1, 1, 1)$ is a solution by plugging in).

As a third example, we consider
$$\begin{aligned} x_1 + x_2 + x_3 &= 3 \quad (1)\\ x_1 - x_2 + 2x_3 &= 2 \quad (2)\\ 2x_1 + 3x_3 &= 5 \quad (3) \end{aligned} \qquad (2.6)$$
Since (1)+(2)=(3), we can omit the third equation (redundancy). From (1) and (2), we get $2x_1 = 5 - 3x_3$ and $2x_2 = 1 + x_3$. We define $x_3 = a \in \mathbb{R}$ as a free variable, such that any triplet
$$\left( \frac{5}{2} - \frac{3}{2}a,\; \frac{1}{2} + \frac{1}{2}a,\; a \right), \quad a \in \mathbb{R} \qquad (2.7)$$

$$\begin{aligned} 4x_1 + 4x_2 &= 5\\ 2x_1 - 4x_2 &= 1 \end{aligned} \qquad (2.8)$$
where the solution space is the point $(x_1, x_2) = (1, \tfrac{1}{4})$. Similarly, for three variables, each linear equation determines a plane in three-dimensional space. When we intersect these planes, i.e., satisfy all linear equations at the same time, we can obtain a solution set that is a plane, a line, a point or empty (when the planes have no common intersection).
For a systematic approach to solving systems of linear equations, we
will introduce a useful compact notation. We collect the coefficients aij
into vectors and collect the vectors into matrices. In other words, we write
the system from (2.3) in the following form:
$$x_1 \begin{bmatrix} a_{11}\\ \vdots\\ a_{m1} \end{bmatrix} + x_2 \begin{bmatrix} a_{12}\\ \vdots\\ a_{m2} \end{bmatrix} + \cdots + x_n \begin{bmatrix} a_{1n}\\ \vdots\\ a_{mn} \end{bmatrix} = \begin{bmatrix} b_1\\ \vdots\\ b_m \end{bmatrix} \qquad (2.9)$$
2.2 Matrices
Matrices play a central role in linear algebra. They can be used to com-
pactly represent systems of linear equations, but they also represent linear
functions (linear mappings) as we will see later in Section 2.7. Before we
discuss some of these interesting topics, let us first define what a matrix
is and what kind of operations we can do with matrices. We will see more
properties of matrices in Chapter 4.
Definition 2.1 (Matrix). With $m, n \in \mathbb{N}$, a real-valued $(m, n)$ matrix $A$ is an $m \cdot n$-tuple of elements $a_{ij}$, $i = 1, \dots, m$, $j = 1, \dots, n$, which is ordered according to a rectangular scheme consisting of $m$ rows and $n$ columns:
$$A = \begin{bmatrix} a_{11} & a_{12} & \cdots & a_{1n}\\ a_{21} & a_{22} & \cdots & a_{2n}\\ \vdots & \vdots & & \vdots\\ a_{m1} & a_{m2} & \cdots & a_{mn} \end{bmatrix}, \quad a_{ij} \in \mathbb{R}. \qquad (2.11)$$
By convention, $(1, n)$-matrices are called rows and $(m, 1)$-matrices are called columns. These special matrices are also called row/column vectors.

$\mathbb{R}^{m \times n}$ is the set of all real-valued $(m, n)$-matrices. $A \in \mathbb{R}^{m \times n}$ can be equivalently represented as $a \in \mathbb{R}^{mn}$ by stacking all $n$ columns of the matrix into a long vector; see Figure 2.4.

[Figure 2.4: By re-shaping, i.e., stacking its columns, a matrix $A \in \mathbb{R}^{4 \times 2}$ can be represented as a long vector $a \in \mathbb{R}^{8}$.]

2.2.1 Matrix Addition and Multiplication

The sum of two matrices $A \in \mathbb{R}^{m \times n}$, $B \in \mathbb{R}^{m \times n}$ is defined as the element-wise sum, i.e.,
$$A + B := \begin{bmatrix} a_{11} + b_{11} & \cdots & a_{1n} + b_{1n}\\ \vdots & & \vdots\\ a_{m1} + b_{m1} & \cdots & a_{mn} + b_{mn} \end{bmatrix} \in \mathbb{R}^{m \times n}. \qquad (2.12)$$
For matrices $A \in \mathbb{R}^{m \times n}$, $B \in \mathbb{R}^{n \times k}$ (note the size of the matrices), the elements $c_{ij}$ of the product $C = AB \in \mathbb{R}^{m \times k}$ are computed as
$$c_{ij} = \sum_{l=1}^{n} a_{il} b_{lj}, \quad i = 1, \dots, m, \quad j = 1, \dots, k. \qquad (2.13)$$
This means, to compute element $c_{ij}$ we multiply the elements of the $i$th row of $A$ with the $j$th column of $B$ and sum them up. There are $n$ columns in $A$ and $n$ rows in $B$ so that we can compute $a_{il}b_{lj}$ for $l = 1, \dots, n$; in NumPy, C = np.einsum('il, lj', A, B). Later in Section 3.2, we will call this the dot product of the corresponding row and column; commonly, the dot product between two vectors $a, b$ is denoted by $a^\top b$ or $\langle a, b \rangle$. In cases where we need to be explicit that we are performing multiplication, we use the notation $A \cdot B$ to denote multiplication (explicitly showing “$\cdot$”).

Remark. Matrices can only be multiplied if their “neighboring” dimensions match. For instance, an $n \times k$-matrix $A$ can be multiplied with a $k \times m$-matrix $B$, but only from the left side:
$$\underbrace{A}_{n \times k} \, \underbrace{B}_{k \times m} = \underbrace{C}_{n \times m} \qquad (2.14)$$
Example 2.3
For $A = \begin{bmatrix} 1 & 2 & 3\\ 3 & 2 & 1 \end{bmatrix} \in \mathbb{R}^{2 \times 3}$, $B = \begin{bmatrix} 0 & 2\\ 1 & -1\\ 0 & 1 \end{bmatrix} \in \mathbb{R}^{3 \times 2}$, we obtain
$$AB = \begin{bmatrix} 1 & 2 & 3\\ 3 & 2 & 1 \end{bmatrix} \begin{bmatrix} 0 & 2\\ 1 & -1\\ 0 & 1 \end{bmatrix} = \begin{bmatrix} 2 & 3\\ 2 & 5 \end{bmatrix} \in \mathbb{R}^{2 \times 2}, \qquad (2.15)$$
$$BA = \begin{bmatrix} 0 & 2\\ 1 & -1\\ 0 & 1 \end{bmatrix} \begin{bmatrix} 1 & 2 & 3\\ 3 & 2 & 1 \end{bmatrix} = \begin{bmatrix} 6 & 4 & 2\\ -2 & 0 & 2\\ 3 & 2 & 1 \end{bmatrix} \in \mathbb{R}^{3 \times 3}. \qquad (2.16)$$
$$(AB)^\top = B^\top A^\top \qquad (2.31)$$
Definition 2.5 (Symmetric Matrix). A matrix $A \in \mathbb{R}^{n \times n}$ is symmetric if $A = A^\top$.

Note that only $(n, n)$-matrices can be symmetric. Generally, we also call $(n, n)$-matrices square matrices because they possess the same number of rows and columns. Moreover, if $A$ is invertible, then so is $A^\top$, and $(A^{-1})^\top = (A^\top)^{-1} =: A^{-\top}$.

Remark (Sum and Product of Symmetric Matrices). The sum of symmetric matrices $A, B \in \mathbb{R}^{n \times n}$ is always symmetric. However, although their product is always defined, it is generally not symmetric:
$$\begin{bmatrix} 1 & 0\\ 0 & 0 \end{bmatrix} \begin{bmatrix} 1 & 1\\ 1 & 1 \end{bmatrix} = \begin{bmatrix} 1 & 1\\ 0 & 0 \end{bmatrix}. \qquad (2.32)$$
and use the rules for matrix multiplication, we can write this equation
system in a more compact form as
$$\begin{bmatrix} 2 & 3 & 5\\ 4 & -2 & -7\\ 9 & 5 & -3 \end{bmatrix} \begin{bmatrix} x_1\\ x_2\\ x_3 \end{bmatrix} = \begin{bmatrix} 1\\ 8\\ 2 \end{bmatrix}. \qquad (2.36)$$
Note that x1 scales the first column, x2 the second one, and x3 the third
one.
Generally, a system of linear equations can be compactly represented in
their matrix form as Ax = b; see (2.3), and the product Ax is a (linear)
combination of the columns of A. We will discuss linear combinations in
more detail in Section 2.5.
Example 2.6
For $a \in \mathbb{R}$, we seek all solutions of the following system of equations:
$$\begin{aligned} -2x_1 + 4x_2 - 2x_3 - x_4 + 4x_5 &= -3\\ 4x_1 - 8x_2 + 3x_3 - 3x_4 + x_5 &= 2\\ x_1 - 2x_2 + x_3 - x_4 + x_5 &= 0\\ x_1 - 2x_2 - 3x_4 + 4x_5 &= a \end{aligned} \qquad (2.44)$$
We start by converting this system of equations into the compact matrix notation $Ax = b$. We no longer mention the variables $x$ explicitly and build the augmented matrix (in the form $[A \mid b]$)
$$\left[\begin{array}{ccccc|c} -2 & 4 & -2 & -1 & 4 & -3\\ 4 & -8 & 3 & -3 & 1 & 2\\ 1 & -2 & 1 & -1 & 1 & 0\\ 1 & -2 & 0 & -3 & 4 & a \end{array}\right] \quad \begin{matrix} \text{swap with } R_3\\ \\ \text{swap with } R_1\\ \\ \end{matrix}$$
where we used the vertical line to separate the left-hand side from the right-hand side in (2.44). We use $\rightsquigarrow$ to indicate a transformation of the augmented matrix using elementary transformations. (The augmented matrix $[A \mid b]$ compactly represents the system of linear equations $Ax = b$.)

Swapping Rows 1 and 3 leads to
$$\left[\begin{array}{ccccc|c} 1 & -2 & 1 & -1 & 1 & 0\\ 4 & -8 & 3 & -3 & 1 & 2\\ -2 & 4 & -2 & -1 & 4 & -3\\ 1 & -2 & 0 & -3 & 4 & a \end{array}\right] \quad \begin{matrix} \\ -4R_1\\ +2R_1\\ -R_1 \end{matrix}$$
When we now apply the indicated transformations (e.g., subtract Row 1 four times from Row 2), we obtain
$$\left[\begin{array}{ccccc|c} 1 & -2 & 1 & -1 & 1 & 0\\ 0 & 0 & -1 & 1 & -3 & 2\\ 0 & 0 & 0 & -3 & 6 & -3\\ 0 & 0 & -1 & -2 & 3 & a \end{array}\right] \quad \begin{matrix} \\ \\ \\ -R_2 - R_3 \end{matrix}$$
$$\rightsquigarrow \left[\begin{array}{ccccc|c} 1 & -2 & 1 & -1 & 1 & 0\\ 0 & 0 & -1 & 1 & -3 & 2\\ 0 & 0 & 0 & -3 & 6 & -3\\ 0 & 0 & 0 & 0 & 0 & a + 1 \end{array}\right] \quad \begin{matrix} \\ \cdot(-1)\\ \cdot(-\tfrac{1}{3})\\ \\ \end{matrix}$$
$$\rightsquigarrow \left[\begin{array}{ccccc|c} 1 & -2 & 1 & -1 & 1 & 0\\ 0 & 0 & 1 & -1 & 3 & -2\\ 0 & 0 & 0 & 1 & -2 & 1\\ 0 & 0 & 0 & 0 & 0 & a + 1 \end{array}\right]$$
This (augmented) matrix is in a convenient form, the row-echelon form (REF). Reverting this compact notation back into the explicit notation with the variables we seek, we obtain
$$\begin{aligned} x_1 - 2x_2 + x_3 - x_4 + x_5 &= 0\\ x_3 - x_4 + 3x_5 &= -2\\ x_4 - 2x_5 &= 1\\ 0 &= a + 1 \end{aligned} \qquad (2.45)$$
Only for $a = -1$ can this system be solved. A particular solution is
$$\begin{bmatrix} x_1\\ x_2\\ x_3\\ x_4\\ x_5 \end{bmatrix} = \begin{bmatrix} 2\\ 0\\ -1\\ 1\\ 0 \end{bmatrix}. \qquad (2.46)$$
The general solution, which captures the set of all possible solutions, is
$$\left\{ x \in \mathbb{R}^5 : x = \begin{bmatrix} 2\\ 0\\ -1\\ 1\\ 0 \end{bmatrix} + \lambda_1 \begin{bmatrix} 2\\ 1\\ 0\\ 0\\ 0 \end{bmatrix} + \lambda_2 \begin{bmatrix} 2\\ 0\\ -1\\ 2\\ 1 \end{bmatrix}, \quad \lambda_1, \lambda_2 \in \mathbb{R} \right\}. \qquad (2.47)$$
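A minimal NumPy sketch can corroborate (2.46) and (2.47): with $a = -1$ (the only solvable case), the particular solution satisfies $Ax = b$ and both spanning vectors of the general solution lie in the kernel of $A$. The arrays restate (2.44):

```python
import numpy as np

A = np.array([[-2, 4, -2, -1, 4],
              [ 4, -8, 3, -3, 1],
              [ 1, -2, 1, -1, 1],
              [ 1, -2, 0, -3, 4]], dtype=float)
b = np.array([-3, 2, 0, -1], dtype=float)   # right-hand side with a = -1

x_p = np.array([2, 0, -1, 1, 0], dtype=float)   # particular solution (2.46)
v1 = np.array([2, 1, 0, 0, 0], dtype=float)     # kernel vector from (2.47)
v2 = np.array([2, 0, -1, 2, 1], dtype=float)    # kernel vector from (2.47)

print(np.allclose(A @ x_p, b))   # True
print(np.allclose(A @ v1, 0))    # True
print(np.allclose(A @ v2, 0))    # True
```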
- All rows that contain only zeros are at the bottom of the matrix; correspondingly, all rows that contain at least one nonzero element are on top of rows that contain only zeros.
- Looking at nonzero rows only, the first nonzero number from the left (also called the pivot or the leading coefficient) is always strictly to the right of the pivot of the row above it. (In other texts, it is sometimes required that the pivot is 1.)

Remark (Basic and Free Variables). The variables corresponding to the pivots in the row-echelon form are called basic variables and the other variables are free variables. For example, in (2.45), $x_1, x_3, x_4$ are basic variables, whereas $x_2, x_5$ are free variables.

Remark (Obtaining a Particular Solution). The row-echelon form makes our lives easier when we need to determine a particular solution.

A matrix is in reduced row-echelon form if
- It is in row-echelon form.
- Every pivot is 1.
- The pivot is the only nonzero entry in its column.

The reduced row-echelon form will play an important role later in Section 2.3.3 because it allows us to determine the general solution of a system of linear equations in a straightforward way.

Remark (Gaussian Elimination). Gaussian elimination is an algorithm that performs elementary transformations to bring a system of linear equations into reduced row-echelon form.
the second column from three times the first column. Now, we look at the fifth column, which is our second non-pivot column. The fifth column can be expressed as 3 times the first pivot column, 9 times the second pivot column, and $-4$ times the third pivot column. We need to keep track of the indices of the pivot columns and translate this into 3 times the first column, 0 times the second column (which is a non-pivot column), 9 times the third column (which is our second pivot column), and $-4$ times the fourth column (which is the third pivot column). Then we need to subtract the fifth column to obtain 0. In the end, we are still solving a homogeneous equation system.

To summarize, all solutions of $Ax = 0$, $x \in \mathbb{R}^5$, are given by
$$\left\{ x \in \mathbb{R}^5 : x = \lambda_1 \begin{bmatrix} 3\\ 1\\ 0\\ 0\\ 0 \end{bmatrix} + \lambda_2 \begin{bmatrix} 3\\ 0\\ 9\\ -4\\ -1 \end{bmatrix}, \quad \lambda_1, \lambda_2 \in \mathbb{R} \right\}. \qquad (2.50)$$
This means that if we bring the augmented equation system into reduced
row-echelon form, we can read out the inverse on the right-hand side of
the equation system. Hence, determining the inverse of a matrix is equiv-
alent to solving systems of linear equations.
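In code, this equivalence means an inverse can be obtained by solving $AX = I$ for $X$, which is exactly the augmented-system view $[A \mid I] \rightsquigarrow [I \mid A^{-1}]$. A minimal NumPy sketch (the $2 \times 2$ matrix below is a hypothetical example, not from the text):

```python
import numpy as np

A = np.array([[2., 1.],
              [1., 1.]])

# Solve A X = I column by column; X is then the inverse of A.
A_inv = np.linalg.solve(A, np.eye(2))

print(np.allclose(A @ A_inv, np.eye(2)))   # True
```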
and use the Moore-Penrose pseudo-inverse $(A^\top A)^{-1} A^\top$ to determine the solution (2.59) that solves $Ax = b$, which also corresponds to the minimum norm least-squares solution. A disadvantage of this approach is that it requires many computations for the matrix-matrix product and computing the inverse of $A^\top A$. Moreover, for reasons of numerical precision it is generally not recommended to compute the inverse or pseudo-inverse. In the following, we therefore briefly discuss alternative approaches to solving systems of linear equations.
Gaussian elimination plays an important role when computing deter-
minants (Section 4.1), checking whether a set of vectors is linearly inde-
pendent (Section 2.5), computing the inverse of a matrix (Section 2.2.2),
computing the rank of a matrix (Section 2.6.2), and determining a basis
of a vector space (Section 2.6.1). Gaussian elimination is an intuitive and
constructive way to solve a system of linear equations with thousands of
variables. However, for systems with millions of variables, it is impracti-
cal as the required number of arithmetic operations scales cubically in the
number of simultaneous equations.
In practice, systems of many linear equations are solved indirectly, by ei-
ther stationary iterative methods, such as the Richardson method, the Ja-
cobi method, the Gauß-Seidel method, and the successive over-relaxation
method, or Krylov subspace methods, such as conjugate gradients, gener-
alized minimal residual, or biconjugate gradients. We refer to the books by Stoer and Bulirsch (2002), Strang (2003), and Liesen and Mehrmann (2015) for further details.

Let $x^*$ be a solution of $Ax = b$. The key idea of these iterative methods is to set up an iteration of the form
$$x^{(k+1)} = Cx^{(k)} + d \qquad (2.60)$$
for suitable $C$ and $d$ that reduces the residual error $\|x^{(k+1)} - x^*\|$ in every iteration and converges to $x^*$. We will introduce norms $\|\cdot\|$, which allow us to compute similarities between vectors, in Section 3.1.
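As an illustration of (2.60), the following sketch implements the Jacobi splitting (one of the stationary methods named above): we choose $C = -D^{-1}R$ and $d = D^{-1}b$, where $D$ is the diagonal part of $A$ and $R = A - D$. The matrix $A$ below is a hypothetical, diagonally dominant example for which the iteration converges:

```python
import numpy as np

A = np.array([[4., 1.],
              [2., 5.]])
b = np.array([1., 2.])

D = np.diag(np.diag(A))          # diagonal part of A
R = A - D                        # off-diagonal remainder
C = -np.linalg.solve(D, R)       # C = -D^{-1} R
d = np.linalg.solve(D, b)        # d = D^{-1} b

x = np.zeros(2)
for _ in range(50):              # iterate x^{(k+1)} = C x^{(k)} + d
    x = C @ x + d

print(x, np.linalg.solve(A, b))  # both are close to the true solution x*
```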
2.4.1 Groups
Groups play an important role in computer science. Besides providing a
fundamental framework for operations on sets, they are heavily used in
cryptography, coding theory, and graphics.
might say, “You can get to Kigali by first going 506 km Northwest to Kampala (Uganda) and then 374 km Southwest.” This is sufficient information
to describe the location of Kigali because the geographic coordinate sys-
tem may be considered a two-dimensional vector space (ignoring altitude
and the Earth’s curved surface). The person may add, “It is about 751 km
West of here.” Although this last statement is true, it is not necessary to
find Kigali given the previous information (see Figure 2.7 for an illus-
tration). In this example, the “506 km Northwest” vector (blue) and the
“374 km Southwest” vector (purple) are linearly independent. This means
the Southwest vector cannot be described in terms of the Northwest vec-
tor, and vice versa. However, the third “751 km West” vector (black) is a
linear combination of the other two vectors, and it makes the set of vec-
tors linearly dependent. Equivalently, “751 km West” and “374 km Southwest” can be linearly combined to obtain “506 km Northwest”.
Remark. The following properties are useful to find out whether vectors
are linearly independent:
Example 2.14
Consider $\mathbb{R}^4$ with
$$x_1 = \begin{bmatrix} 1\\ 2\\ -3\\ 4 \end{bmatrix}, \quad x_2 = \begin{bmatrix} 1\\ 1\\ 0\\ 2 \end{bmatrix}, \quad x_3 = \begin{bmatrix} -1\\ -2\\ 1\\ 1 \end{bmatrix}. \qquad (2.67)$$
To check whether they are linearly dependent, we follow the general approach and solve
$$\lambda_1 x_1 + \lambda_2 x_2 + \lambda_3 x_3 = \lambda_1 \begin{bmatrix} 1\\ 2\\ -3\\ 4 \end{bmatrix} + \lambda_2 \begin{bmatrix} 1\\ 1\\ 0\\ 2 \end{bmatrix} + \lambda_3 \begin{bmatrix} -1\\ -2\\ 1\\ 1 \end{bmatrix} = 0 \qquad (2.68)$$
for $\lambda_1, \dots, \lambda_3$. We write the vectors $x_i$, $i = 1, 2, 3$, as the columns of a matrix and apply elementary row operations until we identify the pivot columns:
$$\begin{bmatrix} 1 & 1 & -1\\ 2 & 1 & -2\\ -3 & 0 & 1\\ 4 & 2 & 1 \end{bmatrix} \rightsquigarrow \cdots \rightsquigarrow \begin{bmatrix} 1 & 1 & -1\\ 0 & 1 & 0\\ 0 & 0 & 1\\ 0 & 0 & 0 \end{bmatrix}. \qquad (2.69)$$
Here, every column of the matrix is a pivot column. Therefore, there is no non-trivial solution, and we require $\lambda_1 = 0, \lambda_2 = 0, \lambda_3 = 0$ to solve the equation system. Hence, the vectors $x_1, x_2, x_3$ are linearly independent.
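We can corroborate this example numerically: the vectors are linearly independent exactly when the rank of the matrix with columns $x_1, x_2, x_3$ equals the number of vectors. A minimal NumPy sketch (restating the vectors above):

```python
import numpy as np

X = np.array([[ 1,  1, -1],
              [ 2,  1, -2],
              [-3,  0,  1],
              [ 4,  2,  1]])     # columns are x1, x2, x3

print(np.linalg.matrix_rank(X))  # 3, so x1, x2, x3 are linearly independent
```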
$$x_j = B\lambda_j, \quad \lambda_j = \begin{bmatrix} \lambda_{1j}\\ \vdots\\ \lambda_{kj} \end{bmatrix}, \quad j = 1, \dots, m, \qquad (2.71)$$
This means that $\{x_1, \dots, x_m\}$ are linearly independent if and only if the column vectors $\{\lambda_1, \dots, \lambda_m\}$ are linearly independent.
Remark. In a vector space V , m linear combinations of k vectors x1 , . . . , xk
are linearly dependent if m > k . }
Example 2.15
Consider a set of linearly independent vectors $b_1, b_2, b_3, b_4 \in \mathbb{R}^n$ and
$$\begin{aligned} x_1 &= b_1 - 2b_2 + b_3 - b_4\\ x_2 &= -4b_1 - 2b_2 + 4b_4\\ x_3 &= 2b_1 + 3b_2 - b_3 - 3b_4\\ x_4 &= 17b_1 - 10b_2 + 11b_3 + b_4 \end{aligned} \qquad (2.73)$$
Are the vectors $x_1, \dots, x_4 \in \mathbb{R}^n$ linearly independent? To answer this question, we investigate whether the column vectors
$$\left\{ \begin{bmatrix} 1\\ -2\\ 1\\ -1 \end{bmatrix}, \begin{bmatrix} -4\\ -2\\ 0\\ 4 \end{bmatrix}, \begin{bmatrix} 2\\ 3\\ -1\\ -3 \end{bmatrix}, \begin{bmatrix} 17\\ -10\\ 11\\ 1 \end{bmatrix} \right\} \qquad (2.74)$$
Generating sets are sets of vectors that span vector (sub)spaces, i.e.,
every vector can be represented as a linear combination of the vectors
in the generating set. Now, we will be more specific and characterize the
smallest generating set that spans a vector (sub)space.
Example 2.16
For a vector subspace $U \subseteq \mathbb{R}^5$ spanned by the vectors
$$x_1 = \begin{bmatrix} 1\\ 2\\ -1\\ -1\\ -1 \end{bmatrix}, \quad x_2 = \begin{bmatrix} 2\\ -1\\ 1\\ 2\\ -2 \end{bmatrix}, \quad x_3 = \begin{bmatrix} 3\\ -4\\ 3\\ 5\\ -3 \end{bmatrix}, \quad x_4 = \begin{bmatrix} -1\\ 8\\ -5\\ -6\\ 1 \end{bmatrix},$$
we are interested in finding out which vectors $x_1, \dots, x_4$ are a basis for $U$. For this, we need to check whether $x_1, \dots, x_4$ are linearly independent. Therefore, we need to solve
$$\sum_{i=1}^{4} \lambda_i x_i = 0, \qquad (2.82)$$
Since the pivot columns indicate which set of vectors is linearly independent, we see from the row-echelon form that $x_1, x_2, x_4$ are linearly independent (because the system of linear equations $\lambda_1 x_1 + \lambda_2 x_2 + \lambda_4 x_4 = 0$ can only be solved with $\lambda_1 = \lambda_2 = \lambda_4 = 0$). Therefore, $\{x_1, x_2, x_4\}$ is a basis of $U$.
2.6.2 Rank

The number of linearly independent columns of a matrix $A \in \mathbb{R}^{m \times n}$ equals the number of linearly independent rows and is called the rank of $A$ and is denoted by $\operatorname{rk}(A)$.

Remark. The rank of a matrix has some important properties:
- $\operatorname{rk}(A) = \operatorname{rk}(A^\top)$, i.e., the column rank equals the row rank.
- The columns of $A \in \mathbb{R}^{m \times n}$ span a subspace $U \subseteq \mathbb{R}^m$ with $\dim(U) = \operatorname{rk}(A)$. Later we will call this subspace the image or range. A basis of $U$ can be found by applying Gaussian elimination to $A$ to identify the pivot columns.
- The rows of $A \in \mathbb{R}^{m \times n}$ span a subspace $W \subseteq \mathbb{R}^n$ with $\dim(W) = \operatorname{rk}(A)$. A basis of $W$ can be found by applying Gaussian elimination to $A^\top$.
- For all $A \in \mathbb{R}^{n \times n}$ it holds that $A$ is regular (invertible) if and only if $\operatorname{rk}(A) = n$.
- For all $A \in \mathbb{R}^{m \times n}$ and all $b \in \mathbb{R}^m$ it holds that the linear equation system $Ax = b$ can be solved if and only if $\operatorname{rk}(A) = \operatorname{rk}(A|b)$, where $A|b$ denotes the augmented system.
- For $A \in \mathbb{R}^{m \times n}$ the subspace of solutions for $Ax = 0$ possesses dimension $n - \operatorname{rk}(A)$. Later, we will call this subspace the kernel or the null space.
- A matrix $A \in \mathbb{R}^{m \times n}$ has full rank if its rank equals the largest possible rank for a matrix of the same dimensions. This means that the rank of a full-rank matrix is the lesser of the number of rows and columns, i.e., $\operatorname{rk}(A) = \min(m, n)$. A matrix is said to be rank deficient if it does not have full rank.
$$A = \begin{bmatrix} 1 & 2 & 1\\ -2 & -3 & 1\\ 3 & 5 & 0 \end{bmatrix}.$$
We use Gaussian elimination to determine the rank:
$$\begin{bmatrix} 1 & 2 & 1\\ -2 & -3 & 1\\ 3 & 5 & 0 \end{bmatrix} \rightsquigarrow \cdots \rightsquigarrow \begin{bmatrix} 1 & 2 & 1\\ 0 & 1 & 3\\ 0 & 0 & 0 \end{bmatrix}. \qquad (2.84)$$
Here, we see that the number of linearly independent rows and columns is 2, such that $\operatorname{rk}(A) = 2$.
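NumPy exposes the rank directly. A sketch restating this example's matrix (note that np.linalg.matrix_rank uses the SVD internally rather than Gaussian elimination, but yields the same value here):

```python
import numpy as np

A = np.array([[ 1,  2, 1],
              [-2, -3, 1],
              [ 3,  5, 0]])

print(np.linalg.matrix_rank(A))   # 2
```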
$$B = (b_1, \dots, b_n) \qquad (2.89)$$
$$x = \alpha_1 b_1 + \cdots + \alpha_n b_n \qquad (2.90)$$
Example 2.20
Let us have a look at a geometric vector $x \in \mathbb{R}^2$ with coordinates $[2, 3]^\top$ with respect to the standard basis $(e_1, e_2)$ of $\mathbb{R}^2$. This means, we can write $x = 2e_1 + 3e_2$. However, we do not have to choose the standard basis to represent this vector. If we use the basis vectors $b_1 = [1, -1]^\top$, $b_2 = [1, 1]^\top$ we will obtain the coordinates $\frac{1}{2}[-1, 5]^\top$ to represent the same vector with respect to $(b_1, b_2)$, i.e., $x = 2e_1 + 3e_2 = -\frac{1}{2}b_1 + \frac{5}{2}b_2$; see Figure 2.9 (different coordinate representations of the vector $x$, depending on the choice of basis).
$$\hat{y} = A_\Phi \hat{x}. \qquad (2.94)$$
This means that the transformation matrix can be used to map coordinates
with respect to an ordered basis in V to coordinates with respect to an
ordered basis in W .
where we first expressed the new basis vectors $\tilde{c}_k \in W$ as linear combinations of the basis vectors $c_l \in W$ and then swapped the order of summation.

Alternatively, when we express the $\tilde{b}_j \in V$ as linear combinations of $b_j \in V$, we arrive at
$$\Phi(\tilde{b}_j) = \Phi\!\left( \sum_{i=1}^{n} s_{ij} b_i \right) \overset{(2.106)}{=} \sum_{i=1}^{n} s_{ij} \Phi(b_i) = \sum_{i=1}^{n} s_{ij} \sum_{l=1}^{m} a_{li} c_l \qquad (2.109a)$$
$$= \sum_{l=1}^{m} \left( \sum_{i=1}^{n} a_{li} s_{ij} \right) c_l, \quad j = 1, \dots, n, \qquad (2.109b)$$
and, therefore,
$$T \tilde{A}_\Phi = A_\Phi S \in \mathbb{R}^{m \times n}, \qquad (2.111)$$
such that
$$\tilde{A}_\Phi = T^{-1} A_\Phi S, \qquad (2.112)$$
$$\Phi_{\tilde{C}\tilde{B}} = \Xi_{\tilde{C}C} \circ \Phi_{CB} \circ \Psi_{B\tilde{B}} = \Xi^{-1}_{C\tilde{C}} \circ \Phi_{CB} \circ \Psi_{B\tilde{B}}. \qquad (2.114)$$
Concretely, we use $\Psi_{B\tilde{B}} = \operatorname{id}_V$ and $\Xi_{C\tilde{C}} = \operatorname{id}_W$, i.e., the identity mappings that map vectors onto themselves, but with respect to a different basis.
$$\tilde{B} = \left( \begin{bmatrix} 1\\ 1\\ 0 \end{bmatrix}, \begin{bmatrix} 0\\ 1\\ 1 \end{bmatrix}, \begin{bmatrix} 1\\ 0\\ 1 \end{bmatrix} \right) \in \mathbb{R}^3, \quad \tilde{C} = \left( \begin{bmatrix} 1\\ 1\\ 0\\ 0 \end{bmatrix}, \begin{bmatrix} 1\\ 0\\ 1\\ 0 \end{bmatrix}, \begin{bmatrix} 0\\ 1\\ 1\\ 0 \end{bmatrix}, \begin{bmatrix} 1\\ 0\\ 0\\ 1 \end{bmatrix} \right). \qquad (2.119)$$
Then,
$$S = \begin{bmatrix} 1 & 0 & 1\\ 1 & 1 & 0\\ 0 & 1 & 1 \end{bmatrix}, \quad T = \begin{bmatrix} 1 & 1 & 0 & 1\\ 1 & 0 & 1 & 0\\ 0 & 1 & 1 & 0\\ 0 & 0 & 0 & 1 \end{bmatrix}, \qquad (2.120)$$
where the $i$th column of $S$ is the coordinate representation of $\tilde{b}_i$ in terms of the basis vectors of $B$. Since $B$ is the standard basis, the coordinate representation is straightforward to find. For a general basis $B$, we would need to solve a linear equation system to find the $\lambda_i$ such that $\sum_{i=1}^{3} \lambda_i b_i = \tilde{b}_j$, $j = 1, \dots, 3$. Similarly, the $j$th column of $T$ is the coordinate representation of $\tilde{c}_j$ in terms of the basis vectors of $C$.

Therefore, we obtain
$$\tilde{A}_\Phi = T^{-1} A_\Phi S = \frac{1}{2} \begin{bmatrix} 1 & 1 & -1 & -1\\ 1 & -1 & 1 & -1\\ -1 & 1 & 1 & 1\\ 0 & 0 & 0 & 2 \end{bmatrix} \begin{bmatrix} 3 & 2 & 1\\ 0 & 4 & 2\\ 10 & 8 & 4\\ 1 & 6 & 3 \end{bmatrix} \qquad (2.121a)$$
$$= \begin{bmatrix} -4 & -4 & -2\\ 6 & 0 & 0\\ 4 & 8 & 4\\ 1 & 6 & 3 \end{bmatrix}. \qquad (2.121b)$$
[Figure: A linear mapping $\Phi : V \to W$ with kernel $\ker(\Phi) \subseteq V$ (containing $0_V$) and image $\operatorname{Im}(\Phi) \subseteq W$ (containing $0_W$).]
$$= \operatorname{span}[a_1, \dots, a_n] \subseteq \mathbb{R}^m, \qquad (2.124b)$$
i.e., the image is the span of the columns of $A$, also called the column space. Therefore, the column space (image) is a subspace of $\mathbb{R}^m$, where $m$ is the “height” of the matrix.
- $\operatorname{rk}(A) = \dim(\operatorname{Im}(\Phi))$.
- The kernel/null space $\ker(\Phi)$ is the general solution to the homogeneous system of linear equations $Ax = 0$ and captures all possible linear combinations of the elements in $\mathbb{R}^n$ that produce $0 \in \mathbb{R}^m$.
- The kernel is a subspace of $\mathbb{R}^n$, where $n$ is the “width” of the matrix.
- The kernel focuses on the relationship among the columns, and we can use it to determine whether/how we can express a column as a linear combination of other columns.
$$= x_1 \begin{bmatrix} 1\\ 1 \end{bmatrix} + x_2 \begin{bmatrix} 2\\ 0 \end{bmatrix} + x_3 \begin{bmatrix} -1\\ 0 \end{bmatrix} + x_4 \begin{bmatrix} 0\\ 1 \end{bmatrix} \qquad (2.125b)$$
is linear. To determine $\operatorname{Im}(\Phi)$, we can take the span of the columns of the transformation matrix and obtain
$$\operatorname{Im}(\Phi) = \operatorname{span}\!\left[ \begin{bmatrix} 1\\ 1 \end{bmatrix}, \begin{bmatrix} 2\\ 0 \end{bmatrix}, \begin{bmatrix} -1\\ 0 \end{bmatrix}, \begin{bmatrix} 0\\ 1 \end{bmatrix} \right]. \qquad (2.126)$$
To compute the kernel (null space) of $\Phi$, we need to solve $Ax = 0$, i.e., we need to solve a homogeneous equation system. To do this, we use Gaussian elimination to transform $A$ into reduced row-echelon form:
$$\begin{bmatrix} 1 & 2 & -1 & 0\\ 1 & 0 & 0 & 1 \end{bmatrix} \rightsquigarrow \cdots \rightsquigarrow \begin{bmatrix} 1 & 0 & 0 & 1\\ 0 & 1 & -\frac{1}{2} & -\frac{1}{2} \end{bmatrix}. \qquad (2.127)$$
This matrix is in reduced row-echelon form, and we can use the Minus-1 Trick to compute a basis of the kernel (see Section 2.3.3). Alternatively, we can express the non-pivot columns (columns 3 and 4) as linear combinations of the pivot columns (columns 1 and 2). The third column $a_3$ is equivalent to $-\frac{1}{2}$ times the second column $a_2$. Therefore, $0 = a_3 + \frac{1}{2}a_2$. In the same way, we see that $a_4 = a_1 - \frac{1}{2}a_2$ and, therefore, $0 = a_1 - \frac{1}{2}a_2 - a_4$. Overall, this gives us the kernel (null space) as
$$\ker(\Phi) = \operatorname{span}\!\left[ \begin{bmatrix} 0\\ \frac{1}{2}\\ 1\\ 0 \end{bmatrix}, \begin{bmatrix} -1\\ \frac{1}{2}\\ 0\\ 1 \end{bmatrix} \right]. \qquad (2.128)$$
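SciPy can corroborate the kernel computation. A sketch restating the transformation matrix (scipy.linalg.null_space returns an orthonormal basis, so its columns differ from (2.128) but span the same subspace):

```python
import numpy as np
from scipy.linalg import null_space

A = np.array([[1, 2, -1, 0],
              [1, 0,  0, 1]], dtype=float)

N = null_space(A)              # columns form an orthonormal basis of ker(A)
print(N.shape)                 # (4, 2): the kernel is two-dimensional
print(np.allclose(A @ N, 0))   # True
```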
Theorem 2.24 (Rank-Nullity Theorem). For vector spaces $V, W$ and a linear mapping $\Phi : V \to W$ it holds that
$$\dim(\ker(\Phi)) + \dim(\operatorname{Im}(\Phi)) = \dim(V). \qquad (2.129)$$
The rank-nullity theorem is also referred to as the fundamental theorem of linear mappings (Axler, 2015, theorem 3.22). The following are direct consequences of Theorem 2.24:
- If $\dim(\operatorname{Im}(\Phi)) < \dim(V)$, then $\ker(\Phi)$ is non-trivial, i.e., the kernel contains more than $0_V$ and $\dim(\ker(\Phi)) \geq 1$.
- If $A_\Phi$ is the transformation matrix of $\Phi$ with respect to an ordered basis and $\dim(\operatorname{Im}(\Phi)) < \dim(V)$, then the system of linear equations $A_\Phi x = 0$ has infinitely many solutions.
- If $\dim(V) = \dim(W)$, then the following three-way equivalence holds: $\Phi$ is injective $\iff$ $\Phi$ is surjective $\iff$ $\Phi$ is bijective, since $\operatorname{Im}(\Phi) \subseteq W$.

One-dimensional affine subspaces are called lines and can be written as $y = x_0 + \lambda x_1$, where $\lambda \in \mathbb{R}$ and $U = \operatorname{span}[x_1] \subseteq \mathbb{R}^n$ is a one-dimensional subspace of $\mathbb{R}^n$. This means that a line is defined by a support point $x_0$ and a vector $x_1$ that defines the direction. See Figure 2.13 for an illustration.
Exercises
2.1 We consider $(\mathbb{R} \setminus \{-1\}, \star)$, where
$$a \star b := ab + a + b, \quad a, b \in \mathbb{R} \setminus \{-1\} \qquad (2.134)$$
$$3 \star x \star x = 15$$
$$\bar{k} = \{x \in \mathbb{Z} \mid x - k = 0 \ (\mathrm{mod}\ n)\} = \{x \in \mathbb{Z} \mid (\exists a \in \mathbb{Z}) : (x - k = n \cdot a)\}.$$
$$\mathbb{Z}_n = \{\bar{0}, \bar{1}, \dots, \overline{n-1}\}$$
$$\bar{a} \oplus \bar{b} := \overline{a + b}, \quad \bar{a} \otimes \bar{b} := \overline{a \times b}, \qquad (2.135)$$
a.
$$\begin{bmatrix} 1 & 2\\ 4 & 5\\ 7 & 8 \end{bmatrix} \begin{bmatrix} 1 & 1 & 0\\ 0 & 1 & 1\\ 1 & 0 & 1 \end{bmatrix}$$
b.
$$\begin{bmatrix} 1 & 2 & 3\\ 4 & 5 & 6\\ 7 & 8 & 9 \end{bmatrix} \begin{bmatrix} 1 & 1 & 0\\ 0 & 1 & 1\\ 1 & 0 & 1 \end{bmatrix}$$
c.
$$\begin{bmatrix} 1 & 1 & 0\\ 0 & 1 & 1\\ 1 & 0 & 1 \end{bmatrix} \begin{bmatrix} 1 & 2 & 3\\ 4 & 5 & 6\\ 7 & 8 & 9 \end{bmatrix}$$
d.
$$\begin{bmatrix} 1 & 2 & 1 & 2\\ 4 & 1 & -1 & -4 \end{bmatrix} \begin{bmatrix} 0 & 3\\ 1 & -1\\ 2 & 1\\ 5 & 2 \end{bmatrix}$$
e.
$$\begin{bmatrix} 0 & 3\\ 1 & -1\\ 2 & 1\\ 5 & 2 \end{bmatrix} \begin{bmatrix} 1 & 2 & 1 & 2\\ 4 & 1 & -1 & -4 \end{bmatrix}$$
2.5 Find the set $\mathcal{S}$ of all solutions in $x$ of the following inhomogeneous linear systems $Ax = b$, where $A$ and $b$ are defined as follows:
a.
$$A = \begin{bmatrix} 1 & 1 & -1 & -1\\ 2 & 5 & -7 & -5\\ 2 & -1 & 1 & 3\\ 5 & 2 & -4 & 2 \end{bmatrix}, \quad b = \begin{bmatrix} 1\\ -2\\ 4\\ 6 \end{bmatrix}$$
b.
$$A = \begin{bmatrix} 1 & -1 & 0 & 0 & 1\\ 1 & 1 & 0 & -3 & 0\\ 2 & -1 & 0 & 1 & -1\\ -1 & 2 & 0 & -2 & -1 \end{bmatrix}, \quad b = \begin{bmatrix} 3\\ 6\\ 5\\ -1 \end{bmatrix}$$
2.6 Using Gaussian elimination, find all solutions of the inhomogeneous equation system $Ax = b$ with
$$A = \begin{bmatrix} 0 & 1 & 0 & 0 & 1 & 0\\ 0 & 0 & 0 & 1 & 1 & 0\\ 0 & 1 & 0 & 0 & 0 & 1 \end{bmatrix}, \quad b = \begin{bmatrix} 2\\ -1\\ 1 \end{bmatrix}.$$
Determine a basis of $U_1 \cap U_2$.
2.13 Consider two subspaces $U_1$ and $U_2$, where $U_1$ is the solution space of the homogeneous equation system $A_1 x = 0$ and $U_2$ is the solution space of the homogeneous equation system $A_2 x = 0$ with
$$A_1 = \begin{bmatrix} 1 & 0 & 1\\ 1 & -2 & -1\\ 2 & 1 & 3\\ 1 & 0 & 1 \end{bmatrix}, \quad A_2 = \begin{bmatrix} 3 & -3 & 0\\ 1 & 2 & 3\\ 7 & -5 & 2\\ 3 & -1 & 2 \end{bmatrix}.$$
where $L^1([a, b])$ denotes the set of integrable functions on $[a, b]$.
b.
$$\Phi : C^1 \to C^0, \quad f \mapsto \Phi(f) = f'$$
c.
$$\Phi : \mathbb{R} \to \mathbb{R}, \quad x \mapsto \Phi(x) = \cos(x)$$
d.
$$\Phi : \mathbb{R}^3 \to \mathbb{R}^2, \quad x \mapsto \begin{bmatrix} 1 & 2 & 3\\ 1 & 4 & 3 \end{bmatrix} x$$
and let us define two ordered bases $B = (b_1, b_2)$ and $B' = (b'_1, b'_2)$ of $\mathbb{R}^2$.
1. Show that $B$ and $B'$ are two bases of $\mathbb{R}^2$ and draw those basis vectors.
2. Compute the matrix $P_1$ that performs a basis change from $B'$ to $B$.
3. We consider $c_1, c_2, c_3$, three vectors of $\mathbb{R}^3$ defined in the standard basis of $\mathbb{R}^3$ as
$$c_1 = \begin{bmatrix} 1\\ 2\\ -1 \end{bmatrix}, \quad c_2 = \begin{bmatrix} 0\\ -1\\ 2 \end{bmatrix}, \quad c_3 = \begin{bmatrix} 1\\ 0\\ -1 \end{bmatrix}$$
3 Analytic Geometry

[Figure: A mind map of the concepts in this chapter: lengths, angles, rotations, and orthogonal projections.]
3.1 Norms
When we think of geometric vectors, i.e., directed line segments that start
at the origin, then intuitively the length of a vector is the distance of the
“end” of this directed line segment from the origin. In the following, we
will discuss the notion of the length of vectors using the concept of a norm.
$$\|\cdot\| : V \to \mathbb{R}, \qquad (3.1)$$
$$x \mapsto \|x\|, \qquad (3.2)$$
which assigns each vector $x$ its length $\|x\| \in \mathbb{R}$, such that for all $\lambda \in \mathbb{R}$ and $x, y \in V$ the following hold:
- Absolutely homogeneous: $\|\lambda x\| = |\lambda| \|x\|$
Definition 3.1 is in terms of a general vector space $V$ (Section 2.4), but in this book we will only consider finite-dimensional vector spaces $\mathbb{R}^n$.
$$\|x\|_1 := \sum_{i=1}^{n} |x_i|, \qquad (3.3)$$
where $|\cdot|$ is the absolute value. The left panel of Figure 3.3 shows all vectors $x \in \mathbb{R}^2$ with $\|x\|_1 = 1$. The Manhattan norm is also called $\ell_1$ norm.

For $x \in \mathbb{R}^n$, the Euclidean norm
$$\|x\|_2 := \sqrt{\sum_{i=1}^{n} x_i^2} = \sqrt{x^\top x} \qquad (3.4)$$
computes the Euclidean distance of $x$ from the origin. The right panel of Figure 3.3 shows all vectors $x \in \mathbb{R}^2$ with $\|x\|_2 = 1$. The Euclidean norm is also called $\ell_2$ norm.
Remark. Throughout this book, we will use the Euclidean norm (3.4) by
default if not stated otherwise. }
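Both norms discussed above are available in NumPy; a minimal sketch with a hypothetical vector:

```python
import numpy as np

x = np.array([1., -2., 3.])

print(np.linalg.norm(x, ord=1))   # Manhattan / l1 norm: 6.0
print(np.linalg.norm(x))          # Euclidean / l2 norm (the default): sqrt(14)
```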
We will refer to this particular inner product as the dot product in this
book. However, inner products are more general concepts with specific
properties, which we will now introduce.
where $A_{ij} := \langle b_i, b_j \rangle$ and $\hat{x}, \hat{y}$ are the coordinates of $x$ and $y$ with respect to the basis $B$. This implies that the inner product $\langle \cdot, \cdot \rangle$ is uniquely determined through $A$. The symmetry of the inner product also means that $A$ is symmetric.
- The null space (kernel) of $A$ consists only of $0$ because $x^\top A x > 0$ for all $x \neq 0$. This implies that $Ax \neq 0$ if $x \neq 0$.
- The diagonal elements $a_{ii}$ of $A$ are positive because $a_{ii} = e_i^\top A e_i > 0$, where $e_i$ is the $i$th vector of the standard basis in $\mathbb{R}^n$.
in a natural way, such that we can compute lengths of vectors using the in-
ner product. However, not every norm is induced by an inner product. The
Manhattan norm (3.3) is an example of a norm without a corresponding
inner product. In the following, we will focus on norms that are induced
by inner products and introduce geometric concepts, such as lengths, dis-
tances, and angles.
Remark (Cauchy-Schwarz Inequality). For an inner product vector space $(V, \langle \cdot, \cdot \rangle)$ the induced norm $\|\cdot\|$ satisfies the Cauchy-Schwarz inequality
$$|\langle x, y \rangle| \leq \|x\| \, \|y\|. \qquad (3.17)$$
is called the distance between $x$ and $y$ for $x, y \in V$. If we use the dot product as the inner product, then the distance is called Euclidean distance. The mapping
$$d : V \times V \to \mathbb{R} \qquad (3.22)$$
$$(x, y) \mapsto d(x, y) \qquad (3.23)$$
is called a metric if the following hold:
1. $d$ is positive definite, i.e., $d(x, y) \geq 0$ for all $x, y \in V$ and $d(x, y) = 0 \iff x = y$.
2. $d$ is symmetric, i.e., $d(x, y) = d(y, x)$ for all $x, y \in V$.
3. Triangle inequality: $d(x, z) \leq d(x, y) + d(y, z)$ for all $x, y, z \in V$.

Remark. At first glance, the lists of properties of inner products and metrics look very similar. However, by comparing Definition 3.3 with Definition 3.6 we observe that $\langle x, y \rangle$ and $d(x, y)$ behave in opposite directions. Very similar $x$ and $y$ will result in a large value for the inner product and a small value for the metric.
$$-1 \leq \frac{\langle x, y \rangle}{\|x\| \, \|y\|} \leq 1. \qquad (3.24)$$
Therefore, there exists a unique $\omega \in [0, \pi]$, illustrated in Figure 3.4, with
$$\cos\omega = \frac{\langle x, y \rangle}{\|x\| \, \|y\|}. \qquad (3.25)$$
The number $\omega$ is the angle between the vectors $x$ and $y$. Intuitively, the angle between two vectors tells us how similar their orientations are. For example, using the dot product, the angle between $x$ and $y = 4x$, i.e., $y$ is a scaled version of $x$, is $0$: Their orientation is the same.

which gives exactly the angle between $x$ and $y$. This means that orthogonal matrices $A$ with $A^\top = A^{-1}$ preserve both angles and distances. It turns out that orthogonal matrices define transformations that are rotations (with the possibility of flips). In Section 3.9, we will discuss more details about rotations.
for all $i, j = 1, \dots, n$, then the basis is called an orthonormal basis (ONB). If only (3.33) is satisfied, then the basis is called an orthogonal basis. Note that (3.34) implies that every basis vector has length/norm 1.

Recall from Section 2.6.1 that we can use Gaussian elimination to find a basis for a vector space spanned by a set of vectors. Assume we are given a set $\{\tilde{b}_1, \dots, \tilde{b}_n\}$ of non-orthogonal and unnormalized basis vectors. We concatenate them into a matrix $\tilde{B} = [\tilde{b}_1, \dots, \tilde{b}_n]$ and apply Gaussian elimination to the augmented matrix (Section 2.3.2) $[\tilde{B}\tilde{B}^\top \mid \tilde{B}]$ to obtain an orthonormal basis. This constructive way to iteratively build an orthonormal basis $\{b_1, \dots, b_n\}$ is called the Gram-Schmidt process (Strang, 2003).
for lower and upper limits $a, b < \infty$, respectively. As with our usual inner
product, we can define norms and orthogonality by looking at the inner
product. If (3.37) evaluates to 0, the functions u and v are orthogonal. To
make the preceding inner product mathematically precise, we need to take
care of measures and the definition of integrals, leading to the definition of
a Hilbert space. Furthermore, unlike inner products on finite-dimensional
vectors, inner products on functions may diverge (have infinite value). All
this requires diving into some more intricate details of real and functional
analysis, which we do not cover in this book.
[Figure: $f(x) = \sin(x)\cos(x)$.]
[Figure 3.9: Orthogonal projection (orange dots) of a two-dimensional dataset (blue dots) onto a one-dimensional subspace (straight line).]

[Figure: (a) Projection of $x \in \mathbb{R}^2$ onto a subspace $U$ with basis vector $b$. (b) Projection of a two-dimensional vector $x$ with $\|x\| = 1$ onto a one-dimensional subspace spanned by $b$.]
We can now exploit the bilinearity of the inner product and arrive at
$$\langle x, b \rangle - \lambda \langle b, b \rangle = 0 \iff \lambda = \frac{\langle x, b \rangle}{\langle b, b \rangle} = \frac{\langle b, x \rangle}{\|b\|^2}. \qquad (3.40)$$
(With a general inner product, we get $\lambda = \langle x, b \rangle$ if $\|b\| = 1$.) In the last step, we exploited the fact that inner products are symmetric. If we choose $\langle \cdot, \cdot \rangle$ to be the dot product, we obtain
$$\lambda = \frac{b^\top x}{b^\top b} = \frac{b^\top x}{\|b\|^2}. \qquad (3.41)$$
$$\pi_U(x) = \lambda b = \frac{\langle x, b \rangle}{\|b\|^2} b = \frac{b^\top x}{\|b\|^2} b, \qquad (3.42)$$
where the last equality holds for the dot product only. We can also compute the length of $\pi_U(x)$ by means of Definition 3.1.
Since
$$\pi_U(x) = \lambda b = \frac{b^\top x}{\|b\|^2} b = \frac{b b^\top}{\|b\|^2} x, \qquad (3.45)$$
we immediately see that
$$P_\pi = \frac{b b^\top}{\|b\|^2}. \qquad (3.46)$$
Note that $bb^\top$ (and, consequently, $P_\pi$) is a symmetric matrix (of rank 1), and $\|b\|^2 = \langle b, b \rangle$ is a scalar. Projection matrices are always symmetric.

The projection matrix $P_\pi$ projects any vector $x \in \mathbb{R}^n$ onto the line through the origin with direction $b$ (equivalently, the subspace $U$ spanned by $b$).

Remark. The projection $\pi_U(x) \in \mathbb{R}^n$ is still an $n$-dimensional vector and not a scalar. However, we no longer require $n$ coordinates to represent the projection, but only a single one if we want to express it with respect to the basis vector $b$ that spans the subspace $U$: the coordinate $\lambda$.
Let us now choose a particular $x$ and see whether it lies in the subspace spanned by $b$. For $x = [1, 1, 1]^\top$, the projection is
$$\pi_U(x) = P_\pi x = \frac{1}{9} \begin{bmatrix} 1 & 2 & 2\\ 2 & 4 & 4\\ 2 & 4 & 4 \end{bmatrix} \begin{bmatrix} 1\\ 1\\ 1 \end{bmatrix} = \frac{1}{9} \begin{bmatrix} 5\\ 10\\ 10 \end{bmatrix} \in \operatorname{span}\!\left[ \begin{bmatrix} 1\\ 2\\ 2 \end{bmatrix} \right]. \qquad (3.48)$$
Note that the application of $P_\pi$ to $\pi_U(x)$ does not change anything, i.e., $P_\pi \pi_U(x) = \pi_U(x)$. This is expected because according to Definition 3.10, we know that a projection matrix $P_\pi$ satisfies $P_\pi^2 x = P_\pi x$ for all $x$.

Remark. With the results from Chapter 4, we can show that $\pi_U(x)$ is an eigenvector of $P_\pi$, and the corresponding eigenvalue is 1.
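A minimal NumPy sketch restating this example ($b = [1, 2, 2]^\top$, so $\|b\|^2 = 9$) confirms the projection and the idempotence of $P_\pi$:

```python
import numpy as np

b = np.array([[1.], [2.], [2.]])   # column vector spanning U
P = (b @ b.T) / (b.T @ b)          # projection matrix b b^T / ||b||^2
x = np.array([[1.], [1.], [1.]])

print(9 * (P @ x))                 # 9 * pi_U(x) = [5, 10, 10]^T, as in (3.48)
print(np.allclose(P @ P, P))       # True: projections are idempotent
```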
$$B = [b_1, \dots, b_m] \in \mathbb{R}^{n \times m}, \quad \lambda = [\lambda_1, \dots, \lambda_m]^\top \in \mathbb{R}^m, \qquad (3.50)$$
is closest to $x \in \mathbb{R}^n$. As in the 1D case, “closest” means “minimum distance”, which implies that the vector connecting $\pi_U(x) \in U$ and $x \in \mathbb{R}^n$ must be orthogonal to all basis vectors of $U$. Therefore, we obtain $m$ simultaneous conditions (assuming the dot product as the inner product)
$$\langle b_1, x - \pi_U(x) \rangle = b_1^\top (x - \pi_U(x)) = 0 \qquad (3.51)$$
$$\vdots$$
$$\langle b_m, x - \pi_U(x) \rangle = b_m^\top (x - \pi_U(x)) = 0 \qquad (3.52)$$
which, with $\pi_U(x) = B\lambda$, can be written as
$$b_1^\top (x - B\lambda) = 0 \qquad (3.53)$$
$$\vdots$$
$$b_m^\top (x - B\lambda) = 0 \qquad (3.54)$$
such that we obtain a homogeneous linear equation system
$$\begin{bmatrix} b_1^\top\\ \vdots\\ b_m^\top \end{bmatrix} \begin{bmatrix} x - B\lambda \end{bmatrix} = 0 \iff B^\top (x - B\lambda) = 0 \qquad (3.55)$$
The matrix $(B^\top B)^{-1} B^\top$ is also called the pseudo-inverse of $B$, which can be computed for non-square matrices $B$. It only requires that $B^\top B$ is positive definite, which is the case if $B$ is full rank. In practical applications (e.g., linear regression), we often add a “jitter term” $\epsilon I$ to $B^\top B$ for numerical stability.

Remark. The solution for projecting onto general subspaces includes the 1D case as a special case: If $\dim(U) = 1$, then $B^\top B \in \mathbb{R}$ is a scalar and we can rewrite the projection matrix in (3.59) $P_\pi = B(B^\top B)^{-1} B^\top$ as $P_\pi = \frac{B B^\top}{B^\top B}$, which is exactly the projection matrix in (3.46).

The corresponding projection error is the norm of the difference vector between the original vector and its projection onto $U$ (the projection error is also called the reconstruction error), i.e.,
$$\|x - \pi_U(x)\| = \left\| [1, -2, 1]^\top \right\| = \sqrt{6}. \qquad (3.63)$$
Remark. The projections $\pi_U(x)$ are still vectors in $\mathbb{R}^n$ although they lie in an $m$-dimensional subspace $U \subseteq \mathbb{R}^n$. However, to represent a projected vector we only need the $m$ coordinates $\lambda_1, \dots, \lambda_m$ with respect to the basis vectors $b_1, \dots, b_m$ of $U$.

Remark. In vector spaces with general inner products, we have to pay attention when computing angles and distances, which are defined by means of the inner product.
Projections allow us to look at situations where we have a linear system $Ax = b$ without a solution; we can then find approximate solutions using projections. Recall that an unsolvable system means that $b$ does not lie in the span of $A$, i.e., the vector $b$ does not lie in the subspace spanned by the columns of $A$. Given that the linear equation cannot be solved exactly, we can find an approximate solution. The idea is to find the vector in the subspace spanned by the columns of $A$ that is closest to $b$, i.e., we compute the orthogonal projection of $b$ onto the subspace spanned by the columns of $A$. This problem arises often in practice, and the solution is called the least-squares solution (assuming the dot product as the inner product) of an overdetermined system. This is discussed further in Section 9.4. Using reconstruction errors (3.63) is one possible approach to derive principal component analysis (Section 10.3).
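In NumPy, np.linalg.lstsq computes exactly this least-squares solution; a minimal sketch with a hypothetical overdetermined system:

```python
import numpy as np

A = np.array([[1., 0.],
              [1., 1.],
              [1., 2.]])      # 3 equations, 2 unknowns: generally unsolvable
b = np.array([0., 1., 3.])

x, residuals, rank, sv = np.linalg.lstsq(A, b, rcond=None)
print(x)                                    # best approximation coefficients
print(np.allclose(A.T @ (b - A @ x), 0))    # True: the residual is orthogonal
                                            # to the columns of A
```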
Remark. We just looked at projections of vectors $x$ onto a subspace $U$ with basis vectors $\{b_1, \dots, b_k\}$. If this basis is an ONB, i.e., (3.33) and (3.34) are satisfied, the projection equation (3.58) simplifies greatly to
$$\pi_U(x) = B B^\top x \qquad (3.65)$$
[Figure 3.12: Gram-Schmidt orthogonalization. (a) Original non-orthogonal basis $(b_1, b_2)$ of $\mathbb{R}^2$; (b) first constructed basis vector $u_1 = b_1$ and orthogonal projection of $b_2$ onto $\operatorname{span}[u_1]$; (c) orthogonal basis $(u_1, u_2)$ of $\mathbb{R}^2$, where $u_2 = b_2 - \pi_{\operatorname{span}[u_1]}(b_2)$.]

Consider a basis $(b_1, b_2)$ of $\mathbb{R}^2$, where
$$b_1 = \begin{bmatrix} 2\\ 0 \end{bmatrix}, \quad b_2 = \begin{bmatrix} 1\\ 1 \end{bmatrix}; \qquad (3.69)$$
see also Figure 3.12(a). Using the Gram-Schmidt method, we construct an orthogonal basis $(u_1, u_2)$ of $\mathbb{R}^2$ as follows (assuming the dot product as the inner product):
$$u_1 := b_1 = \begin{bmatrix} 2\\ 0 \end{bmatrix}, \qquad (3.70)$$
$$u_2 := b_2 - \pi_{\operatorname{span}[u_1]}(b_2) \overset{(3.45)}{=} b_2 - \frac{u_1 u_1^\top}{\|u_1\|^2} b_2 = \begin{bmatrix} 1\\ 1 \end{bmatrix} - \begin{bmatrix} 1 & 0\\ 0 & 0 \end{bmatrix} \begin{bmatrix} 1\\ 1 \end{bmatrix} = \begin{bmatrix} 0\\ 1 \end{bmatrix}. \qquad (3.71)$$
These steps are illustrated in Figures 3.12(b) and (c). We immediately see that $u_1$ and $u_2$ are orthogonal, i.e., $u_1^\top u_2 = 0$.
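The construction translates directly into code. A minimal (classical) Gram-Schmidt sketch, assuming the dot product as the inner product; it is not optimized for numerical stability, for which one would use a QR decomposition such as np.linalg.qr in practice:

```python
import numpy as np

def gram_schmidt(B):
    """Orthogonalize the columns of B, assuming they are linearly independent."""
    U = np.zeros_like(B, dtype=float)
    for j in range(B.shape[1]):
        u = B[:, j].astype(float)
        for i in range(j):   # subtract the projections onto u_0, ..., u_{j-1}
            u -= (U[:, i] @ B[:, j]) / (U[:, i] @ U[:, i]) * U[:, i]
        U[:, j] = u
    return U

B = np.array([[2., 1.],
              [0., 1.]])       # columns b1, b2 from (3.69)
print(gram_schmidt(B))         # columns [2, 0] and [0, 1], as in (3.70)/(3.71)
```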
[Figure 3.14: A rotation rotates objects in a plane about the origin. If the rotation angle is positive, we rotate counterclockwise; shown are original points and points rotated by 112.5°.]
3.9 Rotations
Length and angle preservation, as discussed in Section 3.4, are the two
characteristics of linear mappings with orthogonal transformation matri-
ces. In the following, we will have a closer look at specific orthogonal
transformation matrices, which describe rotations.
A rotation is a linear mapping (more specifically, an automorphism of a Euclidean vector space) that rotates a plane by an angle $\theta$ about the origin, i.e., the origin is a fixed point. For a positive angle $\theta > 0$, by common convention, we rotate in a counterclockwise direction. An example is shown in Figure 3.14, where the transformation matrix is
$$R = \begin{bmatrix} -0.38 & -0.92\\ 0.92 & -0.38 \end{bmatrix}. \qquad (3.74)$$
Important application areas of rotations include computer graphics and
robotics. For example, in robotics, it is often important to know how to
rotate the joints of a robotic arm in order to pick up or place an object,
see Figure 3.15.
[Figure 3.16: Rotation of the standard basis $(e_1, e_2)$ by an angle $\theta$.]
3.9.1 Rotations in $\mathbb{R}^2$

Consider the standard basis $e_1 = \begin{bmatrix} 1\\ 0 \end{bmatrix}$, $e_2 = \begin{bmatrix} 0\\ 1 \end{bmatrix}$ of $\mathbb{R}^2$, which defines the standard coordinate system in $\mathbb{R}^2$. We aim to rotate this coordinate system by an angle $\theta$ as illustrated in Figure 3.16. Note that the rotated vectors are still linearly independent and, therefore, are a basis of $\mathbb{R}^2$. This means that the rotation performs a basis change.

Rotations $\Phi$ are linear mappings so that we can express them by a rotation matrix $R(\theta)$. Trigonometry (see Figure 3.16) allows us to determine the coordinates of the rotated axes (the image of $\Phi$) with respect to the standard basis in $\mathbb{R}^2$. We obtain
$$\Phi(e_1) = \begin{bmatrix} \cos\theta\\ \sin\theta \end{bmatrix}, \quad \Phi(e_2) = \begin{bmatrix} -\sin\theta\\ \cos\theta \end{bmatrix}. \qquad (3.75)$$
Therefore, the rotation matrix that performs the basis change into the rotated coordinates $R(\theta)$ is given as
$$R(\theta) = \begin{bmatrix} \Phi(e_1) & \Phi(e_2) \end{bmatrix} = \begin{bmatrix} \cos\theta & -\sin\theta\\ \sin\theta & \cos\theta \end{bmatrix}. \qquad (3.76)$$
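A minimal NumPy sketch of (3.76), rotating $e_1$ by $\theta = 30°$ as an illustration and checking that $R(\theta)$ is orthogonal:

```python
import numpy as np

def rotation_matrix(theta):
    """2D rotation matrix R(theta) from (3.76)."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s],
                     [s,  c]])

R = rotation_matrix(np.deg2rad(30))
print(R @ np.array([1., 0.]))            # [cos 30deg, sin 30deg] ~ [0.866, 0.5]
print(np.allclose(R.T @ R, np.eye(2)))   # True: R is orthogonal
```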
3.9.2 Rotations in R3
In contrast to the R2 case, in R3 we can rotate any two-dimensional plane
about a one-dimensional axis. The easiest way to specify the general rota-
tion matrix is to specify how the images of the standard basis e1 , e2 , e3 are
supposed to be rotated, and making sure these images Re1 , Re2 , Re3 are
orthonormal to each other. We can then obtain a general rotation matrix
R by combining the images of the standard basis.
To have a meaningful rotation angle, we have to define what “coun-
terclockwise” means when we operate in more than two dimensions. We
use the convention that a “counterclockwise” (planar) rotation about an
axis refers to a rotation about an axis when we look at the axis “head on,
from the end toward the origin”. In R3 , there are therefore three (planar)
rotations about the three standard basis vectors (see Figure 3.17):
[Figure 3.17: Rotation of a vector (gray) in $\mathbb{R}^3$ by an angle $\theta$ about the $e_3$-axis. The rotated vector is shown in blue.]

matrix
$$R_{ij}(\theta) := \begin{bmatrix} I_{i-1} & 0 & \cdots & \cdots & 0\\ 0 & \cos\theta & 0 & -\sin\theta & 0\\ 0 & 0 & I_{j-i-1} & 0 & 0\\ 0 & \sin\theta & 0 & \cos\theta & 0\\ 0 & \cdots & \cdots & 0 & I_{n-j} \end{bmatrix} \in \mathbb{R}^{n \times n}, \qquad (3.80)$$
for $1 \leq i < j \leq n$ and $\theta \in \mathbb{R}$. Then $R_{ij}(\theta)$ is called a Givens rotation. Essentially, $R_{ij}(\theta)$ is the identity matrix $I_n$ with
$$r_{ii} = \cos\theta, \quad r_{ij} = -\sin\theta, \quad r_{ji} = \sin\theta, \quad r_{jj} = \cos\theta. \qquad (3.81)$$
In two dimensions (i.e., $n = 2$), we obtain (3.76) as a special case.
kernel methods (Schölkopf and Smola, 2002). Kernel methods exploit the
fact that many linear algorithms can be expressed purely by inner prod-
uct computations. Then, the “kernel trick” allows us to compute these
inner products implicitly in a (potentially infinite-dimensional) feature
space, without even knowing this feature space explicitly. This allowed the
“non-linearization” of many algorithms used in machine learning, such as
kernel-PCA (Schölkopf et al., 1997) for dimensionality reduction. Gaus-
sian processes (Rasmussen and Williams, 2006) also fall into the category
of kernel methods and are the current state of the art in probabilistic re-
gression (fitting curves to data points). The idea of kernels is explored
further in Chapter 12.
Projections are often used in computer graphics, e.g., to generate shad-
ows. In optimization, orthogonal projections are often used to (iteratively)
minimize residual errors. This also has applications in machine learning,
e.g., in linear regression where we want to find a (linear) function that
minimizes the residual errors, i.e., the lengths of the orthogonal projec-
tions of the data onto the linear function (Bishop, 2006). We will investi-
gate this further in Chapter 9. PCA (Pearson, 1901; Hotelling, 1933) also
uses projections to reduce the dimensionality of high-dimensional data.
We will discuss this in more detail in Chapter 10.
Exercises
3.1 Show that $\langle \cdot, \cdot \rangle$ defined for all $x = [x_1, x_2]^\top \in \mathbb{R}^2$ and $y = [y_1, y_2]^\top \in \mathbb{R}^2$ by
is an inner product.
3.2 Consider $\mathbb{R}^2$ with $\langle \cdot, \cdot \rangle$ defined for all $x$ and $y$ in $\mathbb{R}^2$ as
$$\langle x, y \rangle := x^\top \underbrace{\begin{bmatrix} 2 & 0\\ 1 & 2 \end{bmatrix}}_{=:A} y.$$
using
a. $\langle x, y \rangle := x^\top y$
b. $\langle x, y \rangle := x^\top A y$, $A := \begin{bmatrix} 2 & 1 & 0\\ 1 & 3 & -1\\ 0 & -1 & 2 \end{bmatrix}$
3.4 Compute the angle between
$$x = \begin{bmatrix} 1\\ 2 \end{bmatrix}, \quad y = \begin{bmatrix} -1\\ -1 \end{bmatrix}$$
using
a. $\langle x, y \rangle := x^\top y$
b. $\langle x, y \rangle := x^\top B y$, $B := \begin{bmatrix} 2 & 1\\ 1 & 3 \end{bmatrix}$
3.5 Consider the Euclidean vector space $\mathbb{R}^5$ with the dot product. A subspace $U \subseteq \mathbb{R}^5$ and $x \in \mathbb{R}^5$ are given by
$$U = \operatorname{span}\!\left[ \begin{bmatrix} 0\\ -1\\ 2\\ 0\\ 2 \end{bmatrix}, \begin{bmatrix} 1\\ -3\\ 1\\ -1\\ 2 \end{bmatrix}, \begin{bmatrix} -3\\ 4\\ 1\\ 2\\ 1 \end{bmatrix}, \begin{bmatrix} -1\\ -3\\ 5\\ 0\\ 7 \end{bmatrix} \right], \quad x = \begin{bmatrix} -1\\ -9\\ -1\\ 4\\ 1 \end{bmatrix}$$
3.8 Using the Gram-Schmidt method, turn the basis $B = (b_1, b_2)$ of a two-dimensional subspace $U \subseteq \mathbb{R}^3$ into an ONB $C = (c_1, c_2)$ of $U$, where
$$b_1 := \begin{bmatrix} 1\\ 1\\ 1 \end{bmatrix}, \quad b_2 := \begin{bmatrix} -1\\ 2\\ 0 \end{bmatrix}.$$
Hint: Think about the dot product on Rn . Then, choose specific vectors
x, y 2 Rn and apply the Cauchy-Schwarz inequality.
3.10 Rotate the vectors
$$x_1 := \begin{bmatrix} 2\\ 3 \end{bmatrix}, \quad x_2 := \begin{bmatrix} 0\\ -1 \end{bmatrix}$$
by $30^\circ$.

4 Matrix Decompositions
[Figure: A mind map of the concepts introduced in this chapter (determinant, eigenvalues, eigenvectors, orthogonal matrices, diagonalization, SVD), along with where they are used in other parts of the book (Chapter 6 Probability & Distributions, Chapter 10 Dimensionality Reduction).]
For a memory aid of the product terms in Sarrus’ rule, try tracing the
elements of the triple products in the matrix.
We call a square matrix $T$ an upper-triangular matrix if $T_{ij} = 0$ for $i > j$, i.e., the matrix is zero below its diagonal. Analogously, we define a lower-triangular matrix as a matrix with zeros above its diagonal. For a triangular matrix $T \in \mathbb{R}^{n \times n}$, the determinant is the product of the diagonal elements, i.e.,
$$\det(T) = \prod_{i=1}^{n} T_{ii}. \qquad (4.8)$$
The determinant is the signed volume of the parallelepiped formed by the columns of the matrix.

Example 4.2 (Determinants as Measures of Volume)
The notion of a determinant is natural when we consider it as a mapping from a set of n vectors spanning an object in R^n. It turns out that the determinant det(A) is the signed volume of an n-dimensional parallelepiped formed by columns of the matrix A.
For n = 2, the columns of the matrix form a parallelogram; see Figure 4.2 (the area of the parallelogram, shaded region, spanned by the vectors b and g is |det([b, g])|). As the angle between vectors gets smaller, the area of a parallelogram shrinks, too. Consider two vectors b, g that form the columns of a matrix A = [b, g]. Then, the absolute value of the determinant of A is the area of the parallelogram with vertices 0, b, g, b + g. In particular, if b, g are linearly dependent so that b = λg for some λ ∈ R, they no longer form a two-dimensional parallelogram. Therefore, the corresponding area is 0. On the contrary, if b, g are linearly independent and are multiples of the canonical basis vectors e1, e2, then they can be written as b = [b, 0]^T and g = [0, g]^T, and the determinant is

det([[b, 0], [0, g]]) = bg − 0 = bg .

The sign of the determinant indicates the orientation of the spanning vectors b, g with respect to the standard basis (e1, e2). In our figure, flipping the order to g, b swaps the columns of A and reverses the orientation of the shaded area. This becomes the familiar formula: area = height × length. This intuition extends to higher dimensions. In R^3, we consider three vectors r, b, g ∈ R^3 spanning the edges of a parallelepiped, i.e., a solid with faces that are parallel parallelograms (see Figure 4.3: the volume of the parallelepiped, shaded volume, spanned by vectors r, b, g is |det([r, b, g])|). The absolute value of the determinant of the 3 × 3 matrix [r, b, g] is the volume of the solid. Thus, the determinant acts as a function that measures the signed volume formed by column vectors composed in a matrix.
Consider the three linearly independent vectors r, g, b ∈ R^3 given as

r = [2, 0, −8]^T ,  g = [6, 1, 0]^T ,  b = [1, 4, −1]^T .    (4.9)
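A short NumPy sketch (not part of the original example) confirms that the volume of the parallelepiped spanned by r, g, b in (4.9) is |det([r, g, b])|:

import numpy as np

r = np.array([2.0, 0.0, -8.0])
g = np.array([6.0, 1.0, 0.0])
b = np.array([1.0, 4.0, -1.0])

A = np.column_stack([r, g, b])      # vectors as columns, as in (4.9)
volume = abs(np.linalg.det(A))      # absolute value of the signed volume
print(volume)                       # 186.0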
Here A_{k,j} ∈ R^{(n−1)×(n−1)} is the submatrix of A that we obtain when deleting row k and column j.

c0 = det(A) ,    (4.23)
c_{n−1} = (−1)^{n−1} tr(A) .    (4.24)
The characteristic polynomial (4.22a) will allow us to compute eigen-
values and eigenvectors, covered in the next section.
Definition 4.9. Let a square matrix A have an eigenvalue λ_i. The algebraic multiplicity of λ_i is the number of times the root appears in the characteristic polynomial.
A matrix A and its transpose A> possess the same eigenvalues, but not
necessarily the same eigenvectors.
The eigenspace E_λ is the null space of A − λI since

Ax = λx ⟺ Ax − λx = 0    (4.27a)
⟺ (A − λI)x = 0 ⟺ x ∈ ker(A − λI) .    (4.27b)
p_A(λ) = det(A − λI)    (4.29a)
= det([[4 − λ, 2], [1, 3 − λ]])    (4.29b)
= (4 − λ)(3 − λ) − 2 · 1 .    (4.29c)

p(λ) = (4 − λ)(3 − λ) − 2 · 1 = 10 − 7λ + λ² = (2 − λ)(5 − λ) ,    (4.30)

giving the roots λ1 = 2 and λ2 = 5.
Step 3: Eigenvectors and Eigenspaces. We find the eigenvectors that correspond to these eigenvalues by looking at vectors x such that

[[4 − λ, 2], [1, 3 − λ]] x = 0 .    (4.31)

For λ = 5 we obtain

[[4 − 5, 2], [1, 3 − 5]] x = [[−1, 2], [1, −2]] x = 0 .    (4.32)

We solve this homogeneous system and obtain a solution space

E5 = span[ [2, 1]^T ] .    (4.33)

This eigenspace is one-dimensional as it possesses a single basis vector. Analogously, we find the eigenvector for λ = 2 by solving the homogeneous system of equations

[[4 − 2, 2], [1, 3 − 2]] x = [[2, 2], [1, 1]] x = 0 .    (4.34)

This means any vector x = [x1, x2]^T, where x2 = −x1, such as [1, −1]^T, is an eigenvector with eigenvalue 2. The corresponding eigenspace is given as

E2 = span[ [1, −1]^T ] .    (4.35)
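The eigenvalues and eigenspaces of this example can be cross-checked with NumPy; the sketch below is illustrative only, and note that np.linalg.eig returns unit-length eigenvectors in no guaranteed order, so they may differ from the basis vectors above by scaling:

import numpy as np

A = np.array([[4.0, 2.0],
              [1.0, 3.0]])
eigenvalues, eigenvectors = np.linalg.eig(A)
print(eigenvalues)      # [5. 2.] (order may vary)
print(eigenvectors)     # columns are unit-length eigenvectors, proportional
                        # to [2, 1]^T and [1, -1]^T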
Example 4.6
The matrix A = [[2, 1], [0, 2]] has two repeated eigenvalues λ1 = λ2 = 2 and an algebraic multiplicity of 2. The eigenvalue has, however, only one distinct unit eigenvector x1 = [1, 0]^T and, thus, geometric multiplicity 1.
Figure 4.4 Determinants and eigenspaces. Overview of five linear mappings and their associated transformation matrices A_i ∈ R^{2×2} projecting 400 color-coded points x ∈ R^2 (left column) onto target points A_i x (right column). The central column depicts the first eigenvector, stretched by its associated eigenvalue λ1, and the second eigenvector stretched by its eigenvalue λ2. Each row depicts the effect of one of five transformation matrices A_i with respect to the standard basis. The five rows have (λ1, λ2, det(A)) = (2.0, 0.5, 1.0), (1.0, 1.0, 1.0), (0.87 − 0.5j, 0.87 + 0.5j, 1.0), (0.0, 2.0, 0.0), and (0.5, 1.5, 0.75).
half of the vertical axis, and to the left vice versa. This mapping is area preserving (det(A2) = 1). The eigenvalue λ1 = 1 = λ2 is repeated and the eigenvectors are collinear (drawn here for emphasis in two opposite directions). This indicates that the mapping acts only along one direction (the horizontal axis).
A3 = [[cos(π/6), −sin(π/6)], [sin(π/6), cos(π/6)]] = (1/2) [[√3, −1], [1, √3]]: The matrix A3 rotates the points by π/6 rad = 30° counter-clockwise and has only complex eigenvalues, reflecting that the mapping is a rotation (hence, no eigenvectors are drawn). A rotation has to be volume preserving, and so the determinant is 1. For more details on rotations, we refer to Section 3.9.
A4 = [[1, 1], [1, 1]] represents a mapping in the standard basis that collapses a two-dimensional domain onto one dimension. Since one eigenvalue is 0, the space in the direction of the corresponding eigenvector collapses, while the direction of the eigenvector with eigenvalue λ2 = 2 is stretched by a factor of 2.
Figure 4.5 Caenorhabditis elegans neural network (Kaiser and Hilgetag, 2006). (a) Symmetrized connectivity matrix (axes: neuron index); (b) Eigenspectrum (vertical axis: eigenvalue).
Methods to analyze and learn from network data are an essential component of machine learning methods. The key to understanding networks is the connectivity between network nodes, in particular whether two nodes are connected to each other or not. In data science applications, it is often useful to study the matrix that captures this connectivity data.
We build a connectivity/adjacency matrix A ∈ R^{277×277} of the complete neural network of the worm C. elegans. Each row/column represents one of the 277 neurons of this worm's brain. The connectivity matrix A has a value of a_ij = 1 if neuron i talks to neuron j through a synapse, and a_ij = 0 otherwise. The connectivity matrix is not symmetric, which implies that eigenvalues may not be real valued. Therefore, we compute a symmetrized version of the connectivity matrix as A_sym := A + A^T. This new matrix A_sym is shown in Figure 4.5(a) and has a nonzero value a_ij if and only if two neurons are connected (white pixels), irrespective of the direction of the connection. In Figure 4.5(b), we show the corresponding eigenspectrum of A_sym. The horizontal axis shows the index of the eigenvalues, sorted in descending order. The vertical axis shows the corresponding eigenvalue. The S-like shape of this eigenspectrum is typical for many biological neural networks. The underlying mechanism responsible for this is an area of active neuroscience research.
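The following sketch mimics this analysis in NumPy on a random directed adjacency matrix (a stand-in, since the actual C. elegans data is not reproduced here):

import numpy as np

rng = np.random.default_rng(0)
n = 277
A = (rng.random((n, n)) < 0.05).astype(float)   # random directed connectivity

A_sym = A + A.T                            # symmetrize as in the text
eigenvalues = np.linalg.eigvalsh(A_sym)    # real eigenvalues of a symmetric matrix
spectrum = np.sort(eigenvalues)[::-1]      # sort in descending order
print(spectrum[:5])                        # the largest eigenvalues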
Example 4.8
Consider the matrix

A = [[3, 2, 2], [2, 3, 2], [2, 2, 3]] .    (4.37)
Figure 4.6 Geometric interpretation of eigenvalues. The eigenvectors of A get stretched by the corresponding eigenvalues. The area of the unit square changes by |λ1 λ2|, the circumference changes by a factor 2(|λ1| + |λ2|).

Theorem 4.17. The trace of a matrix A ∈ R^{n×n} is the sum of its eigenvalues, i.e.,

tr(A) = ∑_{i=1}^{n} λ_i .    (4.43)
on a different web site. The matrix A has the property that for any initial rank/importance vector x of a web site the sequence x, Ax, A²x, . . . converges to a vector x*. This vector is called the PageRank and satisfies Ax* = x*, i.e., it is an eigenvector (with corresponding eigenvalue 1) of A. After normalizing x*, such that ‖x*‖ = 1, we can interpret the entries as probabilities. More details and different perspectives on PageRank can be found in the original technical report (Page et al., 1999).
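A minimal sketch of this iteration in NumPy, using a made-up 3 × 3 column-stochastic link matrix rather than a real web graph:

import numpy as np

A = np.array([[0.0, 0.5, 0.5],
              [0.5, 0.0, 0.5],
              [0.5, 0.5, 0.0]])     # each column sums to 1

x = np.ones(3) / 3                  # any initial rank vector
for _ in range(100):                # the sequence x, Ax, A^2 x, ...
    x = A @ x
    x = x / np.linalg.norm(x)       # normalize so that ||x|| = 1

print(x)                            # converged PageRank vector; A @ x ≈ x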
Comparing the left-hand side of (4.45) and the right-hand side of (4.46) shows that there is a simple pattern in the diagonal elements l_ii:

l11 = √a11 ,  l22 = √(a22 − l21²) ,  l33 = √(a33 − (l31² + l32²)) .    (4.47)

Similarly for the elements below the diagonal (l_ij, where i > j), there is also a repeating pattern:

l21 = (1/l11) a21 ,  l31 = (1/l11) a31 ,  l32 = (1/l22)(a32 − l31 l21) .    (4.48)
Thus, we constructed the Cholesky decomposition for any symmetric, positive definite 3 × 3 matrix. The key realization is that we can work backward to calculate the components l_ij of L, given the values a_ij of A and the previously computed values of l_ij.
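The following NumPy sketch (with a made-up symmetric, positive definite matrix) implements (4.47) and (4.48) directly and compares the result with np.linalg.cholesky:

import numpy as np

A = np.array([[4.0, 2.0, 2.0],
              [2.0, 3.0, 1.0],
              [2.0, 1.0, 3.0]])     # symmetric, positive definite

L = np.zeros((3, 3))
L[0, 0] = np.sqrt(A[0, 0])                                 # l11, (4.47)
L[1, 0] = A[1, 0] / L[0, 0]                                # l21, (4.48)
L[2, 0] = A[2, 0] / L[0, 0]                                # l31, (4.48)
L[1, 1] = np.sqrt(A[1, 1] - L[1, 0]**2)                    # l22, (4.47)
L[2, 1] = (A[2, 1] - L[2, 0] * L[1, 0]) / L[1, 1]          # l32, (4.48)
L[2, 2] = np.sqrt(A[2, 2] - (L[2, 0]**2 + L[2, 1]**2))     # l33, (4.47)

print(np.allclose(L @ L.T, A))                 # True
print(np.allclose(L, np.linalg.cholesky(A)))   # True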
The characteristic polynomial of A is

det(A − λI) = det([[2 − λ, 1], [1, 2 − λ]])    (4.56a)
= (2 − λ)² − 1 = λ² − 4λ + 3 = (λ − 3)(λ − 1) .    (4.56b)

Therefore, the eigenvalues of A are λ1 = 1 and λ2 = 3 (the roots of the characteristic polynomial), and the associated (normalized) eigenvectors are obtained via

[[2, 1], [1, 2]] p1 = 1 p1 ,  [[2, 1], [1, 2]] p2 = 3 p2 .    (4.57)

This yields

p1 = (1/√2) [1, −1]^T ,  p2 = (1/√2) [1, 1]^T .    (4.58)

Step 2: Check for existence. The eigenvectors p1, p2 form a basis of R². Therefore, A can be diagonalized.
Step 3: Construct the matrix P to diagonalize A. We collect the eigenvectors of A in P so that

P = [p1, p2] = (1/√2) [[1, 1], [−1, 1]] .    (4.59)

We then obtain

P⁻¹AP = [[1, 0], [0, 3]] = D .    (4.60)

Equivalently, we get (exploiting that P⁻¹ = P^T since the eigenvectors p1 and p2 in this example form an ONB)

[[2, 1], [1, 2]] = (1/√2) [[1, 1], [−1, 1]] · [[1, 0], [0, 3]] · (1/√2) [[1, −1], [1, 1]] ,    (4.61)

i.e., A = P D P^T, and

A^k = (P D P⁻¹)^k = P D^k P⁻¹ .    (4.62)
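A short NumPy sketch verifying this eigendecomposition and using (4.62) to compute a matrix power:

import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 2.0]])
P = np.array([[1.0, 1.0],
              [-1.0, 1.0]]) / np.sqrt(2)   # eigenvectors p1, p2 as columns
D = np.diag([1.0, 3.0])

print(np.allclose(np.linalg.inv(P) @ A @ P, D))   # True: P^{-1} A P = D, (4.60)
print(np.allclose(P @ D @ P.T, A))                # True: here P^{-1} = P^T, (4.61)

k = 5
A_k = P @ np.diag([1.0**k, 3.0**k]) @ P.T         # A^k = P D^k P^{-1}, (4.62)
print(np.allclose(A_k, np.linalg.matrix_power(A, k)))   # True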
A = U Σ V^T ∈ R^{m×n} .    (4.64)

The diagonal entries σ_i, i = 1, . . . , r, of Σ are called the singular values, u_i are called the left-singular vectors, and v_j are called the right-singular vectors. By convention, the singular values are ordered, i.e., σ1 ≥ σ2 ≥ · · · ≥ σ_r ≥ 0.
The singular value matrix Σ is unique, but it requires some attention. Observe that Σ ∈ R^{m×n} is rectangular. In particular, Σ is of the same size as A. This means that Σ has a diagonal submatrix that contains the singular values and needs additional zero padding. Specifically, if m > n, then the matrix Σ has diagonal structure up to row n and then consists of
value matrix Σ. Finally, it performs a second basis change via U. The SVD entails a number of important details and caveats, which is why we will review our intuition in more detail. (It is useful to revise basis changes (Section 2.7.2), orthogonal matrices (Definition 3.8), and orthonormal bases (Section 3.5).)
Assume we are given a transformation matrix of a linear mapping Φ : R^n → R^m with respect to the standard bases B and C of R^n and R^m, respectively. Moreover, assume a second basis B̃ of R^n and C̃ of R^m. Then

1. The matrix V performs a basis change in the domain R^n from B̃ (represented by the red and orange vectors v1 and v2 in the top-left of Figure 4.8) to the standard basis B. V^T = V⁻¹ performs a basis change from B to B̃. The red and orange vectors are now aligned with the canonical basis in the bottom-left of Figure 4.8.
2. Having changed the coordinate system to B̃, Σ scales the new coordinates by the singular values σ_i (and adds or deletes dimensions), i.e., Σ is the transformation matrix of Φ with respect to B̃ and C̃, represented by the red and orange vectors being stretched and lying in the e1-e2 plane, which is now embedded in a third dimension in the bottom-right of Figure 4.8.
3. U performs a basis change in the codomain R^m from C̃ into the canonical basis of R^m, represented by a rotation of the red and orange vectors out of the e1-e2 plane. This is shown in the top-right of Figure 4.8.
The SVD expresses a change of basis in both the domain and codomain.
This is in contrast with the eigendecomposition that operates within the
same vector space, where the same basis change is applied and then un-
done. What makes the SVD special is that these two different bases are
simultaneously linked by the singular value matrix ⌃.
the x1 -x2 plane. The third coordinate is always 0. The vectors in the x1 -x2
plane have been stretched by the singular values.
The direct mapping of the vectors X by A to the codomain R3 equals
the transformation of X by U ⌃V > , where U performs a rotation within
the codomain R3 so that the mapped vectors are no longer restricted to
the x1 -x2 plane; they still are on a plane as shown in the top-right panel
of Figure 4.9.
Figure 4.9 SVD and mapping of vectors, following the structure of Figure 4.8 (axes x1, x2, x3).
The spectral theorem tells us that AA^T = SDS^T can be diagonalized and we can find an ONB of eigenvectors of AA^T, which are collected in S. The orthonormal eigenvectors of AA^T are the left-singular vectors U and form an orthonormal basis in the codomain of the SVD.
This leaves the question of the structure of the matrix Σ. Since AA^T and A^T A have the same nonzero eigenvalues (see page 106), the nonzero entries of the Σ matrices in the SVD for both cases have to be the same.
The last step is to link up all the parts we touched upon so far. We have an orthonormal set of right-singular vectors in V. To finish the construction of the SVD, we connect them with the orthonormal vectors U. To reach this goal, we use the fact that the images of the v_i under A have to be orthogonal, too. We can show this by using the results from Section 3.4. We require that the inner product between Av_i and Av_j must be 0 for i ≠ j. For any two orthogonal eigenvectors v_i, v_j, i ≠ j, it holds that

(Av_i)^T (Av_j) = v_i^T (A^T A) v_j = v_i^T (λ_j v_j) = λ_j v_i^T v_j = 0 .    (4.77)
For the case m > r, it holds that {Av_1, . . . , Av_r} is a basis of an r-dimensional subspace of R^m.
To complete the SVD construction, we need left-singular vectors that are orthonormal: We normalize the images of the right-singular vectors Av_i and obtain
u_i := Av_i / ‖Av_i‖ = (1/√λ_i) Av_i = (1/σ_i) Av_i ,    (4.78)

where the last equality was obtained from (4.75) and (4.76b), showing us that the eigenvalues of AA^T are such that σ_i² = λ_i.
Therefore, the eigenvectors of A^T A, which we know are the right-singular vectors v_i, and their normalized images under A, the left-singular vectors u_i, form two self-consistent ONBs that are connected through the singular value matrix Σ.
Let us rearrange (4.78) to obtain the singular value equation

Av_i = σ_i u_i ,  i = 1, . . . , r .    (4.79)
This equation closely resembles the eigenvalue equation (4.25), but the
vectors on the left- and the right-hand sides are not the same.
For n > m, (4.79) holds only for i ≤ m and (4.79) says nothing about the u_i for i > m. However, we know by construction that they are orthonormal. Conversely, for m > n, (4.79) holds only for i ≤ n. For i > n, we have Av_i = 0 and we still know that the v_i form an orthonormal set.
This means that the SVD also supplies an orthonormal basis of the kernel
(null space) of A, the set of vectors x with Ax = 0 (see Section 2.7.3).
Moreover, concatenating the v_i as the columns of V and the u_i as the columns of U yields

AV = U Σ ,    (4.80)

where Σ has the same dimensions as A and a diagonal structure for rows 1, . . . , r. Hence, right-multiplying with V^T yields A = U Σ V^T, which is the SVD of A.
A^T A we obtain them straight from √D. Since rk(A) = 2, there are only two nonzero singular values: σ1 = √6 and σ2 = 1. The singular value matrix must be the same size as A, and we obtain

Σ = [[√6, 0, 0], [0, 1, 0]] .    (4.85)

Step 3: Left-singular vectors as the normalized image of the right-singular vectors.
We find the left-singular vectors by computing the image of the right-singular vectors under A and normalizing them by dividing them by their corresponding singular value. We obtain

u1 = (1/σ1) A v1 = (1/√6) [[1, 0, 1], [−2, 1, 0]] [5/√30, −2/√30, 1/√30]^T = [1/√5, −2/√5]^T ,    (4.86)

u2 = (1/σ2) A v2 = (1/1) [[1, 0, 1], [−2, 1, 0]] [0, 1/√5, 2/√5]^T = [2/√5, 1/√5]^T ,    (4.87)

U = [u1, u2] = (1/√5) [[1, 2], [−2, 1]] .    (4.88)
Note that on a computer the approach illustrated here has poor numerical
behavior, and the SVD of A is normally computed without resorting to the
eigenvalue decomposition of A> A.
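For instance, the following sketch cross-checks the example with np.linalg.svd, which is also the numerically preferred route; the signs of paired singular vectors may differ from the hand computation:

import numpy as np

A = np.array([[1.0, 0.0, 1.0],
              [-2.0, 1.0, 0.0]])

U, S, Vt = np.linalg.svd(A)     # full SVD: U is 2x2, Vt is 3x3
print(S)                        # [sqrt(6), 1] up to floating-point error

Sigma = np.zeros_like(A)
Sigma[:2, :2] = np.diag(S)      # embed singular values into a 2x3 matrix
print(np.allclose(U @ Sigma @ Vt, A))   # True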
Figure 4.10 Movie ratings of three people (Ali, Beatrix, Chandra) for four movies and its SVD decomposition:

              Ali  Beatrix  Chandra
Star Wars      5      4        1
Blade Runner   5      5        0
Amelie         0      0        5
Delicatessen   1      0        4

= [[−0.6710, 0.0236, 0.4647, −0.5774],
   [−0.7197, 0.2054, −0.4759, 0.4619],
   [−0.0939, −0.7705, −0.5268, −0.3464],
   [−0.1515, −0.6030, 0.5293, 0.5774]]
· [[9.6438, 0, 0],
   [0, 6.3639, 0],
   [0, 0, 0.7056],
   [0, 0, 0]]
· [[−0.7367, −0.6515, −0.1811],
   [0.0852, 0.1762, −0.9807],
   [0.6708, −0.7379, −0.0743]]
represents a movie and each column a user. Thus, the column vectors of movie ratings, one for each viewer, are x_Ali, x_Beatrix, x_Chandra.
Factoring A using the SVD offers us a way to capture the relationships
of how people rate movies, and especially if there is a structure linking
which people like which movies. Applying the SVD to our data matrix A
makes a number of assumptions:
1. All viewers rate movies consistently using the same linear mapping.
2. There are no errors or noise in the ratings.
3. We interpret the left-singular vectors ui as stereotypical movies and
the right-singular vectors v j as stereotypical viewers.
We then make the assumption that any viewer's specific movie preferences can be expressed as a linear combination of the v_j. Similarly, any movie's like-ability can be expressed as a linear combination of the u_i. Therefore, a vector in the domain of the SVD can be interpreted as a viewer in the "space" of stereotypical viewers, and a vector in the codomain of the SVD correspondingly as a movie in the "space" of stereotypical movies. (These two "spaces" are only meaningfully spanned by the respective viewer and movie data if the data itself covers a sufficient diversity of viewers and movies.) Let us inspect the SVD of our movie-user matrix. The first left-singular vector u1 has large absolute values for the two science fiction movies and a large first singular value (red shading in Figure 4.10). Thus, this groups a type of users with a specific set of movies (science fiction theme). Similarly, the first right-singular vector v1 shows large absolute values for Ali and Beatrix, who give high ratings to science fiction movies (green shading in Figure 4.10). This suggests that v1 reflects the notion of a science fiction lover.
Similarly, u2 seems to capture a French art house film theme, and v2 indicates that Chandra is close to an idealized lover of such movies. An idealized science fiction lover is a purist and only loves science fiction movies, so a science fiction lover v1 gives a rating of zero to everything but science fiction themed movies – this logic is implied by the diagonal substructure of the singular value matrix Σ. A specific movie is therefore represented by how it decomposes (linearly) into its stereotypical movies. Likewise, a person would be represented by how they decompose (via linear combination) into movie themes.
Sometimes this formulation is called the reduced SVD (e.g., Datta (2010)) or the SVD (e.g., Press et al. (2007)). This alternative format changes merely how the matrices are constructed but leaves the mathematical structure of the SVD unchanged. The convenience of this alternative formulation is that Σ is diagonal, as in the eigenvalue decomposition. In Section 4.6, we will learn about matrix approximation techniques using the SVD, which is also called the truncated SVD.
It is possible to define the SVD of a rank-r matrix A so that U is an m × r matrix, Σ an r × r diagonal matrix, and V an n × r matrix. This construction is very similar to our definition, and ensures that the diagonal matrix Σ has only nonzero entries along the diagonal. The main convenience of this alternative notation is that Σ is diagonal, as in the eigenvalue decomposition.
A restriction that the SVD for A only applies to m × n matrices with m > n is practically unnecessary. When m < n, the SVD decomposition will yield Σ with more zero columns than rows and, consequently, the singular values σ_{m+1}, . . . , σ_n are 0.
The SVD is used in a variety of applications in machine learning from
least-squares problems in curve fitting to solving systems of linear equa-
tions. These applications harness various important properties of the SVD,
its relation to the rank of a matrix, and its ability to approximate matrices
of a given rank with lower-rank matrices. Substituting a matrix with its SVD often has the advantage of making calculations more robust to numerical rounding errors. As we will explore in the next section, the SVD's
ability to approximate matrices with “simpler” matrices in a principled
manner opens up machine learning applications ranging from dimension-
ality reduction and topic modeling to data compression and clustering.
Figure 4.12 shows low-rank approximations Â(k), with rk(Â(k)) = k, of an original image A of Stonehenge. The shape of the rocks becomes increasingly visible and clearly recognizable in the rank-5 approximation. While the original image requires 1,432 · 1,910 = 2,735,120 numbers, the rank-5 approximation requires us only to store the five singular values and the five left- and right-singular vectors (1,432- and 1,910-dimensional each) for a total of 5 · (1,432 + 1,910 + 1) = 16,715 numbers – just above 0.6% of the original. (Panels (d)-(f) of Figure 4.12 show the rank-3, rank-4, and rank-5 approximations Â(3), Â(4), Â(5).)
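The following sketch illustrates the computation with a random matrix standing in for the grayscale image (the dimensions match the Stonehenge example):

import numpy as np

rng = np.random.default_rng(0)
A = rng.random((1432, 1910))    # stand-in for the image

U, S, Vt = np.linalg.svd(A, full_matrices=False)

k = 5
A_hat = U[:, :k] @ np.diag(S[:k]) @ Vt[:k, :]   # rank-5 approximation

print(k * (1432 + 1910 + 1))        # 16715 numbers to store
print(np.linalg.norm(A - A_hat, 2)) # spectral-norm error, the (k+1)th singular value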
To measure the difference (error) between A and its rank-k approximation Â(k), we need the notion of a norm. In Section 3.1, we already used norms on vectors that measure the length of a vector. By analogy we can also define norms on matrices.
Definition 4.23 (Spectral Norm of a Matrix). For x ∈ R^n \ {0}, the spectral norm of a matrix A ∈ R^{m×n} is defined as

‖A‖₂ := max_x ‖Ax‖₂ / ‖x‖₂ .    (4.93)

We introduce the notation of a subscript in the matrix norm (left-hand side), similar to the Euclidean norm for vectors (right-hand side), which has subscript 2. The spectral norm (4.93) determines how long any vector x can at most become when multiplied by A.

Theorem 4.24. The spectral norm of A is its largest singular value σ1.
‖A − B‖₂ < ‖A − Â(k)‖₂ ,    (4.97)
Figure 4.13 A functional phylogeny of matrices encountered in machine learning. Real matrices R^{n×m} divide into square matrices (R^{n×n}, for which a determinant and trace exist) and nonsquare matrices (for which a pseudo-inverse and the SVD exist). Square matrices with det = 0 are singular; defective matrices have no basis of eigenvectors, whereas non-defective (diagonalizable) matrices do. Non-defective matrices include normal matrices (A^T A = A A^T), which in turn include symmetric matrices (A = A^T, eigenvalues ∈ R). Square matrices with det ≠ 0 are regular (invertible), i.e., an inverse matrix exists.
Exercises
4.1 Compute the determinant using the Laplace expansion (using the first row)
and the Sarrus Rule for
A = [[1, 3, 5], [2, 4, 6], [0, 2, 4]] .
4.2 Compute the following determinant efficiently:
[[2, 0, 1, 2, 0],
 [2, −1, 0, 1, 1],
 [0, 1, 2, 1, 2],
 [−2, 0, 2, −1, 2],
 [2, 0, 0, 1, 1]] .
4.3 Compute the eigenspaces of
a. A := [[1, 0], [1, 1]]
b. B := [[−2, 2], [2, 1]]
4.4 Compute all eigenspaces of

A = [[0, −1, 1, 1],
     [−1, 1, −2, 3],
     [2, −1, 0, 0],
     [1, −1, 1, 0]] .
4.5 Diagonalizability of a matrix is unrelated to its invertibility. Determine for the following four matrices whether they are diagonalizable and/or invertible:

[[1, 0], [0, 1]] ,  [[1, 0], [0, 0]] ,  [[1, 1], [0, 1]] ,  [[0, 1], [0, 0]] .
4.6 Compute the eigenspaces of the following transformation matrices. Are they diagonalizable?
a.
A = [[2, 3, 0], [1, 4, 3], [0, 0, 1]]
b.
A = [[1, 1, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0]]
4.7 Are the following matrices diagonalizable? If yes, determine their diagonal form and a basis with respect to which the transformation matrices are diagonal. If no, give reasons why they are not diagonalizable.
a.
A = [[0, 1], [−8, 4]]
b.
A = [[1, 1, 1], [1, 1, 1], [1, 1, 1]]
c.
A = [[5, 4, 2, 1], [0, 1, −1, −1], [−1, −1, 3, 0], [1, 1, −1, 2]]
d.
A = [[5, −6, −6], [−1, 4, 2], [3, −6, −4]]
4.11 Show that for any A 2 Rm⇥n the matrices A> A and AA> possess the
same nonzero eigenvalues.
4.12 Show that for x ≠ 0 Theorem 4.24 holds, i.e., show that

max_x ‖Ax‖₂ / ‖x‖₂ = σ1 ,

where σ1 is the largest singular value of A ∈ R^{m×n}.
Vector Calculus
Figure 5.1 Vector calculus plays a central role in (a) regression (curve fitting) and (b) density estimation, i.e., modeling data distributions. (a) Regression problem: Find parameters such that the curve explains the observations (crosses) well. (b) Density estimation with a Gaussian mixture model: Find means and covariances such that the data (dots) can be explained well.
Figure 5.2 A mind map of the concepts introduced in this chapter, along with when they are used in other parts of the book: the Jacobian and the Hessian are used in Chapter 6 (Probability) and Chapter 11 (Density estimation).
Example 5.1
Recall the dot product as a special case of an inner product (Section 3.2).
In the previous notation, the function f(x) = x^T x, x ∈ R², would be specified as
f : R² → R    (5.2a)
x ↦ x1² + x2² .    (5.2b)
δy/δx := (f(x + δx) − f(x)) / δx    (5.3)
computes the slope of the secant line through two points on the graph of f. In Figure 5.3, these are the points with x-coordinates x0 and x0 + δx. The difference quotient can also be considered the average slope of f between x and x + δx if we assume f to be a linear function. In the limit for δx → 0, we obtain the tangent of f at x, if f is differentiable. The tangent is then the derivative of f at x.
Definition 5.2 (Derivative). More formally, for h > 0 the derivative of f at x is defined as the limit

df/dx := lim_{h→0} (f(x + h) − f(x)) / h ,    (5.4)

and the secant in Figure 5.3 becomes a tangent.
The derivative of f points in the direction of steepest ascent of f .
= lim_{h→0} ( ∑_{i=0}^{n} (n choose i) x^{n−i} h^i − x^n ) / h .    (5.5c)

We see that x^n = (n choose 0) x^{n−0} h^0. By starting the sum at 1, the x^n-term cancels, and we obtain

df/dx = lim_{h→0} ( ∑_{i=1}^{n} (n choose i) x^{n−i} h^i ) / h    (5.6a)
= lim_{h→0} ∑_{i=1}^{n} (n choose i) x^{n−i} h^{i−1}    (5.6b)
= lim_{h→0} ( (n choose 1) x^{n−1} + ∑_{i=2}^{n} (n choose i) x^{n−i} h^{i−1} ) ,    (5.6c)

where the second term vanishes as h → 0, so that

df/dx = n!/(1!(n−1)!) x^{n−1} = n x^{n−1} .    (5.6d)
Definition 5.3 (Taylor Polynomial). The Taylor polynomial of degree n of f : R → R at x0 is defined as

T_n(x) := ∑_{k=0}^{n} f^{(k)}(x0)/k! (x − x0)^k ,    (5.7)

where we define t⁰ := 1 for all t ∈ R.
For x0 = 0, we obtain the Maclaurin series as a special instance of the Taylor series. If f(x) = T∞(x), then f is called analytic. (Here f ∈ C^∞ means that f is continuously differentiable infinitely many times.)

Remark. In general, a Taylor polynomial of degree n is an approximation of a function, which does not need to be a polynomial. The Taylor polynomial is similar to f in a neighborhood around x0. However, a Taylor polynomial of degree n is an exact representation of a polynomial f of degree k ≤ n since all derivatives f^{(i)}, i > k vanish. }
Figure 5.4 Taylor polynomials (dashed) around x0 = 0. Higher-order Taylor polynomials approximate the function f better and more globally. T10 is already similar to f in [−4, 4].
Example 5.4 (Taylor Series)
Consider the function in Figure 5.4 given by
f(x) = sin(x) + cos(x) ∈ C^∞ .    (5.19)

We seek a Taylor series expansion of f at x0 = 0, which is the Maclaurin series expansion of f. We obtain the following derivatives:

f(0) = sin(0) + cos(0) = 1    (5.20)
f′(0) = cos(0) − sin(0) = 1    (5.21)
f″(0) = −sin(0) − cos(0) = −1    (5.22)
f^{(3)}(0) = −cos(0) + sin(0) = −1    (5.23)
f^{(4)}(0) = sin(0) + cos(0) = f(0) = 1    (5.24)
...

We can see a pattern here: The coefficients in our Taylor series are only ±1 (since sin(0) = 0), each of which occurs twice before switching to the other one. Furthermore, f^{(k+4)}(0) = f^{(k)}(0).
Therefore, the full Taylor series expansion of f at x0 = 0 is given by

T∞(x) = ∑_{k=0}^{∞} f^{(k)}(x0)/k! (x − x0)^k    (5.25a)
= 1 + x − (1/2!) x² − (1/3!) x³ + (1/4!) x⁴ + (1/5!) x⁵ − · · ·    (5.25b)
= 1 − (1/2!) x² + (1/4!) x⁴ ∓ · · · + x − (1/3!) x³ + (1/5!) x⁵ ∓ · · ·    (5.25c)
= ∑_{k=0}^{∞} (−1)^k (1/(2k)!) x^{2k} + ∑_{k=0}^{∞} (−1)^k (1/(2k+1)!) x^{2k+1}    (5.25d)
= cos(x) + sin(x) ,    (5.25e)
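As an illustrative check (not part of the book's example), the truncated Maclaurin series can be evaluated numerically using the coefficient pattern derived above:

import math

def taylor_sin_plus_cos(x, n):
    # f^{(k)}(0) cycles through 1, 1, -1, -1 with period 4, as derived above
    coeffs = [1.0, 1.0, -1.0, -1.0]
    return sum(coeffs[k % 4] / math.factorial(k) * x**k for k in range(n + 1))

x = 1.5
print(taylor_sin_plus_cos(x, 10))    # close to ...
print(math.sin(x) + math.cos(x))     # ... the true value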
where ak are coefficients and c is a constant, which has the special form
in Definition 5.4. }
∂f(x, y)/∂y = 2(x + 2y³) · ∂/∂y (x + 2y³) = 12(x + 2y³) y² ,    (5.42)
where we used the chain rule (5.32) to compute the partial derivatives.
Let us have a closer look at the chain rule. The chain rule (5.48) resembles to some degree the rules for matrix multiplication, where we said that neighboring dimensions have to match for matrix multiplication to be defined; see Section 2.2.1. If we go from left to right, the chain rule exhibits similar properties: ∂f shows up in the "denominator" of the first factor and in the "numerator" of the second factor. If we multiply the factors together, multiplication is defined, i.e., the dimensions of ∂f match, and ∂f "cancels", such that ∂g/∂x remains. (This is only an intuition, but not mathematically correct since the partial derivative is not a fraction.)
Example 5.8
Consider f(x1, x2) = x1² + 2x2, where x1 = sin t and x2 = cos t, then
df/dt = ∂f/∂x1 · ∂x1/∂t + ∂f/∂x2 · ∂x2/∂t    (5.50a)
= 2 sin t · ∂ sin t/∂t + 2 · ∂ cos t/∂t    (5.50b)
= 2 sin t cos t − 2 sin t = 2 sin t (cos t − 1)    (5.50c)
is the corresponding derivative of f with respect to t.
∂f/∂s = ∂f/∂x1 · ∂x1/∂s + ∂f/∂x2 · ∂x2/∂s ,    (5.51)
∂f/∂t = ∂f/∂x1 · ∂x1/∂t + ∂f/∂x2 · ∂x2/∂t ,    (5.52)

and the gradient is obtained by the matrix multiplication

df/d(s, t) = ∂f/∂x · ∂x/∂(s, t) = [∂f/∂x1, ∂f/∂x2] [[∂x1/∂s, ∂x1/∂t], [∂x2/∂s, ∂x2/∂t]] .
This compact way of writing the chain rule as a matrix multiplication only The chain rule can
makes sense if the gradient is defined as a row vector. Otherwise, we will be written as a
matrix
need to start transposing gradients for the matrix dimensions to match.
multiplication.
This may still be straightforward as long as the gradient is a vector or a
matrix; however, when the gradient becomes a tensor (we will discuss this
in the following), the transpose is no longer a triviality.
Remark (Verifying the Correctness of a Gradient Implementation). The definition of the partial derivatives as the limit of the corresponding difference quotient (see (5.39)) can be exploited when numerically checking the correctness of gradients in computer programs: When we compute gradients and implement them, we can use finite differences to numerically test our computation and implementation: We choose the value h to be small (e.g., h = 10⁻⁴) and compare the finite-difference approximation from (5.39) with our (analytic) implementation of the gradient. If the error is small, our gradient implementation is probably correct. "Small" could mean that

√( ∑_i (dh_i − df_i)² / ∑_i (dh_i + df_i)² ) < 10⁻⁶ ,

where dh_i is the finite-difference approximation and df_i is the analytic gradient of f with respect to the ith variable x_i. }
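A minimal sketch of such a gradient check, for a made-up function f(x) = x1² + 2x2 and using central differences as the finite-difference approximation:

import numpy as np

def f(x):
    return x[0]**2 + 2 * x[1]

def analytic_grad(x):
    return np.array([2 * x[0], 2.0])

def finite_difference_grad(f, x, h=1e-4):
    grad = np.zeros_like(x)
    for i in range(len(x)):
        e = np.zeros_like(x)
        e[i] = h
        grad[i] = (f(x + e) - f(x - e)) / (2 * h)   # central differences
    return grad

x = np.array([1.0, 2.0])
dh, df = finite_difference_grad(f, x), analytic_grad(x)
error = np.sqrt(np.sum((dh - df)**2) / np.sum((dh + df)**2))
print(error < 1e-6)   # True if the gradient implementation is correct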
∂f/∂x_i = [∂f1/∂x_i, . . . , ∂fm/∂x_i]^T
= [lim_{h→0} (f1(x1, . . . , x_{i−1}, x_i + h, x_{i+1}, . . . , x_n) − f1(x))/h, . . . ,
   lim_{h→0} (fm(x1, . . . , x_{i−1}, x_i + h, x_{i+1}, . . . , x_n) − fm(x))/h]^T ∈ R^m .    (5.55)
From (5.40), we know that the gradient of f with respect to a vector is the row vector of the partial derivatives. In (5.55), every partial derivative ∂f/∂x_i is a column vector. Therefore, we obtain the gradient of f : R^n → R^m with respect to x ∈ R^n by collecting these partial derivatives:
df(x)/dx = [∂f(x)/∂x1, · · · , ∂f(x)/∂x_n]    (5.56a)
= [[∂f1(x)/∂x1, · · · , ∂f1(x)/∂x_n],
   ...
   [∂fm(x)/∂x1, · · · , ∂fm(x)/∂x_n]] ∈ R^{m×n} .    (5.56b)
exists also the denominator layout, which is the transpose of the numerator denominator layout
layout. In this book, we will use the numerator layout. }
We will see how the Jacobian is used in the change-of-variable method
for probability distributions in Section 6.7. The amount of scaling due to
the transformation of a variable is provided by the determinant.
In Section 4.1, we saw that the determinant can be used to compute
the area of a parallelogram. If we are given two vectors b1 = [1, 0]^T, b2 = [0, 1]^T as the sides of the unit square (blue; see Figure 5.5), the area of this square is
det([[1, 0], [0, 1]]) = 1 .    (5.60)
If we take a parallelogram with the sides c1 = [−2, 1]^T, c2 = [1, 1]^T (orange in Figure 5.5), its area is given as the absolute value of the determinant (see Section 4.1)

|det([[−2, 1], [1, 1]])| = |−3| = 3 ,    (5.61)

i.e., the area of this parallelogram is exactly three times the area of the unit square.
We can find this scaling factor by finding a mapping that transforms the
unit square into the other square. In linear algebra terms, we effectively
perform a variable transformation from (b1 , b2 ) to (c1 , c2 ). In our case,
the mapping is linear and the absolute value of the determinant of this
mapping gives us exactly the scaling factor we are looking for.
We will describe two approaches to identify this mapping. First, we ex-
ploit that the mapping is linear so that we can use the tools from Chapter 2
to identify this mapping. Second, we will find the mapping using partial
derivatives using the tools we have been discussing in this chapter.
Approach 1 To get started with the linear algebra approach, we
identify both {b1 , b2 } and {c1 , c2 } as bases of R2 (see Section 2.6.1 for a
recap). What we effectively perform is a change of basis from (b1 , b2 ) to
(c1 , c2 ), and we are looking for the transformation matrix that implements
the basis change. Using results from Section 2.7.2, we identify the desired
basis change matrix as
J = [[−2, 1], [1, 1]] ,    (5.62)
such that J b1 = c1 and J b2 = c2 . The absolute value of the determi-
nant of J , which yields the scaling factor we are looking for, is given as
|det(J )| = 3, i.e., the area of the square spanned by (c1 , c2 ) is three times
greater than the area spanned by (b1 , b2 ).
Approach 2 The linear algebra approach works for linear trans-
formations; for nonlinear transformations (which become relevant in Sec-
tion 6.7), we follow a more general approach using partial derivatives.
For this approach, we consider a function f : R2 ! R2 that performs a
variable transformation. In our example, f maps the coordinate represen-
tation of any vector x 2 R2 with respect to (b1 , b2 ) onto the coordinate
representation y 2 R2 with respect to (c1 , c2 ). We want to identify the
mapping so that we can compute how an area (or volume) changes when
it is being transformed by f . For this, we need to find out how f (x)
changes if we modify x a bit. This question is exactly answered by the
Jacobian matrix df/dx ∈ R^{2×2}. Since we can write
y1 = −2x1 + x2    (5.63)
y2 = x1 + x2 ,    (5.64)

we obtain the functional relationship between x and y, which allows us to get the partial derivatives

∂y1/∂x1 = −2 ,  ∂y1/∂x2 = 1 ,  ∂y2/∂x1 = 1 ,  ∂y2/∂x2 = 1    (5.65)

and compose the Jacobian as

J = [[∂y1/∂x1, ∂y1/∂x2], [∂y2/∂x1, ∂y2/∂x2]] = [[−2, 1], [1, 1]] .    (5.66)
The Jacobian represents the coordinate transformation we are looking for. It is exact if the coordinate transformation is linear (as in our case), and (5.66) recovers exactly the basis change matrix in (5.62). If the coordinate transformation is nonlinear, the Jacobian approximates this nonlinear transformation locally with a linear one. Geometrically, the absolute value of the Jacobian determinant |det(J)| is the magnification/scaling factor by which areas or volumes are scaled when coordinates are transformed. Our case yields |det(J)| = 3.
The Jacobian determinant and variable transformations will become relevant in Section 6.7 when we transform random variables and probability distributions. These transformations are extremely relevant in machine learning in the context of training deep neural networks using the reparametrization trick, also called infinite perturbation analysis.
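A one-line check of this scaling factor in NumPy:

import numpy as np

J = np.array([[-2.0, 1.0],
              [1.0, 1.0]])
print(abs(np.linalg.det(J)))    # 3.0, the factor by which areas are scaled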
In this chapter, we encountered derivatives of functions; Figure 5.6 summarizes the dimensionality of these (partial) derivatives. If f : R → R, the gradient is simply a scalar (top-left entry). For f : R^D → R, the gradient is a 1 × D row vector (top-right entry). For f : R → R^E, the gradient is an E × 1 column vector, and for f : R^D → R^E the gradient is an E × D matrix.
We collect the partial derivatives in the Jacobian and obtain the gradient

df/dx = [[∂f1/∂x1, · · · , ∂f1/∂x_N],
         ...
         [∂fM/∂x1, · · · , ∂fM/∂x_N]]
      = [[A11, · · · , A1N],
         ...
         [AM1, · · · , AMN]] = A ∈ R^{M×N} .    (5.68)
We define the least-squares loss L and the error vector e as

L(e) := ‖e‖² ,    (5.76)
e(θ) := y − Φθ .    (5.77)

We seek ∂L/∂θ, and we will use the chain rule for this purpose. L is called a least-squares loss function.
Before we start our calculation, we determine the dimensionality of the gradient as

∂L/∂θ ∈ R^{1×D} .    (5.78)

The chain rule allows us to compute the gradient as

∂L/∂θ = ∂L/∂e · ∂e/∂θ ,    (5.79)

where the dth element is given by

∂L/∂θ [1, d] = ∑_{n=1}^{N} ∂L/∂e [n] · ∂e/∂θ [n, d] .    (5.80)

(In code: dLdtheta = np.einsum('n,nd', dLde, dedtheta).)
We know that ‖e‖² = e^T e (see Section 3.2) and determine

∂L/∂e = 2e^T ∈ R^{1×N} .    (5.81)

Furthermore, we obtain

∂e/∂θ = −Φ ∈ R^{N×D} ,    (5.82)

such that our desired derivative is

∂L/∂θ = −2e^T Φ = −2(y^T − θ^T Φ^T) Φ ∈ R^{1×D} ,    (5.83)

where we used (5.77), and the factors have sizes 1 × N and N × D.
Remark. We would have obtained the same result without using the chain rule by immediately looking at the function

L2(θ) := ‖y − Φθ‖² = (y − Φθ)^T (y − Φθ) .    (5.84)
This approach is still practical for simple functions like L2 but becomes
impractical for deep function compositions. }
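A sketch of this computation in NumPy, including the einsum contraction (5.80) noted above; Φ, y, and θ are randomly generated for illustration:

import numpy as np

N, D = 10, 3
rng = np.random.default_rng(0)
Phi = rng.standard_normal((N, D))
y = rng.standard_normal(N)
theta = rng.standard_normal(D)

e = y - Phi @ theta                 # e(theta), (5.77)
dLde = 2 * e                        # dL/de = 2 e^T as a vector, (5.81)
dedtheta = -Phi                     # de/dtheta = -Phi, (5.82)

dLdtheta = np.einsum('n,nd', dLde, dedtheta)   # chain rule, (5.80)
print(np.allclose(dLdtheta, -2 * e @ Phi))     # True, matching (5.83)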
Figure 5.8: the gradient dÃ/dx ∈ R^{8×3} of a matrix A ∈ R^{4×2} (flattened into Ã ∈ R^8) with respect to x ∈ R^3 can be reshaped into the gradient tensor dA/dx ∈ R^{4×2×3}.

∂f_i/∂A_{k≠i,:} = 0^T ∈ R^{1×1×N} ,    (5.91)

where we have to pay attention to the correct dimensionality. Since f_i maps onto R and each row of A is of size 1 × N, we obtain a 1 × 1 × N-sized tensor as the partial derivative of f_i with respect to a row of A.
We stack the partial derivatives (5.91) and get the desired gradient in (5.87) via

∂f_i/∂A = [0^T, . . . , 0^T, x^T, 0^T, . . . , 0^T] ∈ R^{1×(M×N)} ,    (5.92)

where x^T appears as the ith of the M stacked 1 × 1 × N blocks.
∂K_pq/∂R_ij = ∑_{m=1}^{M} ∂/∂R_ij (R_mp R_mq) = ∂_pqij ,    (5.97)

∂_pqij = { R_iq   if j = p, p ≠ q
           R_ip   if j = q, p ≠ q
           2R_iq  if j = p, p = q
           0      otherwise .    (5.98)

From (5.94), we know that the desired gradient has the dimension (N × N) × (M × N), and every single entry of this tensor is given by ∂_pqij in (5.98), where p, q, j = 1, . . . , N and i = 1, . . . , M.
In neural networks with multiple layers, we have functions f_i(x_{i−1}) = σ(A_{i−1} x_{i−1} + b_{i−1}) in the ith layer. Here x_{i−1} is the output of layer i − 1 and σ an activation function, such as the logistic sigmoid 1/(1 + e^{−x}), tanh, or a rectified linear unit (ReLU). (We discuss the case where the activation functions are identical in each layer to unclutter notation.) In order to train these models, we require the gradient of a loss function L with respect to all model parameters A_j, b_j for j = 1, . . . , K. This also requires us to compute the gradient of L with respect to the inputs of each layer. For example, if we have inputs x and observations y and a network structure defined by

f_0 := x    (5.112)
f_i := σ_i(A_{i−1} f_{i−1} + b_{i−1}) ,  i = 1, . . . , K ,    (5.113)
∂L/∂θ_i = ∂L/∂f_K · ∂f_K/∂f_{K−1} · · · ∂f_{i+2}/∂f_{i+1} · ∂f_{i+1}/∂θ_i .    (5.118)

The orange terms are partial derivatives of the output of a layer with respect to its inputs, whereas the blue terms are partial derivatives of the output of a layer with respect to its parameters. Assuming we have already computed the partial derivatives ∂L/∂θ_{i+1}, then most of the computation can be reused to compute ∂L/∂θ_i. The additional terms that we
Figure 5.9 Backward pass in a multi-layer neural network (x → f_1 → . . . → f_{K−1} → f_K → L, with parameters A_0, b_0 through A_{K−1}, b_{K−1}) to compute the gradients of the loss function.
Example 5.14
Consider the function
f(x) = √(x² + exp(x²)) + cos(x² + exp(x²))    (5.122)
from (5.109). If we were to implement a function f on a computer, we would be able to save some computation by using intermediate variables:

a = x² ,    (5.123)
b = exp(a) ,    (5.124)
c = a + b ,    (5.125)
d = √c ,    (5.126)
e = cos(c) ,    (5.127)
f = d + e .    (5.128)
Figure 5.11 Computation graph with inputs x, function values f, and intermediate variables a, b, c, d, e: x feeds a = x², which feeds b = exp(a) and c = a + b; c feeds d = √c and e = cos(c), which sum to f.
This is the same kind of thinking process that occurs when applying
the chain rule. Note that the preceding set of equations requires fewer
operations than a direct implementation of the function f (x) as defined
in (5.109). The corresponding computation graph in Figure 5.11 shows
the flow of data and computations required to obtain the function value
f.
The set of equations that include intermediate variables can be thought
of as a computation graph, a representation that is widely used in imple-
mentations of neural network software libraries. We can directly compute
the derivatives of the intermediate variables with respect to their corre-
sponding inputs by recalling the definition of the derivative of elementary
functions. We obtain the following:
∂a/∂x = 2x    (5.129)
∂b/∂a = exp(a)    (5.130)
∂c/∂a = 1 = ∂c/∂b    (5.131)
∂d/∂c = 1/(2√c)    (5.132)
∂e/∂c = −sin(c)    (5.133)
∂f/∂d = 1 = ∂f/∂e .    (5.134)
By looking at the computation graph in Figure 5.11, we can compute ∂f/∂x by working backward from the output and obtain

∂f/∂c = ∂f/∂d · ∂d/∂c + ∂f/∂e · ∂e/∂c    (5.135)
∂f/∂b = ∂f/∂c · ∂c/∂b    (5.136)
∂f/∂a = ∂f/∂b · ∂b/∂a + ∂f/∂c · ∂c/∂a    (5.137)
∂f/∂x = ∂f/∂a · ∂a/∂x .    (5.138)
Note that we implicitly applied the chain rule to obtain ∂f/∂x. By substituting the results of the derivatives of the elementary functions, we get

∂f/∂c = 1 · 1/(2√c) + 1 · (−sin(c))    (5.139)
∂f/∂b = ∂f/∂c · 1    (5.140)
∂f/∂a = ∂f/∂b · exp(a) + ∂f/∂c · 1    (5.141)
∂f/∂x = ∂f/∂a · 2x .    (5.142)
By thinking of each of the derivatives above as a variable, we observe that the computation required for calculating the derivative is of similar complexity as the computation of the function itself. This is quite counterintuitive since the mathematical expression for the derivative ∂f/∂x in (5.110) is significantly more complicated than the mathematical expression of the function f(x) in (5.109).
where the g_i(·) are elementary functions and x_{Pa(x_i)} are the parent nodes of the variable x_i in the graph. Given a function defined in this way, we can use the chain rule to compute the derivative of the function in a step-by-step fashion. Recall that by definition f = x_D and hence

∂f/∂x_D = 1 .    (5.144)

For other variables x_i, we apply the chain rule

∂f/∂x_i = ∑_{x_j : x_i ∈ Pa(x_j)} ∂f/∂x_j · ∂x_j/∂x_i = ∑_{x_j : x_i ∈ Pa(x_j)} ∂f/∂x_j · ∂g_j/∂x_i ,    (5.145)
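Applied to Example 5.14, this recipe can be written directly as a forward and a backward pass; the sketch below is illustrative and cross-checks the result against a finite difference:

import numpy as np

def f_and_grad(x):
    # forward pass: intermediate variables (5.123)-(5.128)
    a = x**2
    b = np.exp(a)
    c = a + b
    d = np.sqrt(c)
    e = np.cos(c)
    f = d + e
    # backward pass: accumulate derivatives from the output back, (5.139)-(5.142)
    df_dc = 1.0 / (2 * np.sqrt(c)) - np.sin(c)
    df_db = df_dc
    df_da = df_db * np.exp(a) + df_dc
    df_dx = df_da * 2 * x
    return f, df_dx

x, h = 0.5, 1e-6
fx, grad = f_and_grad(x)
finite_diff = (f_and_grad(x + h)[0] - f_and_grad(x - h)[0]) / (2 * h)
print(grad, finite_diff)   # the two values agree closely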
First-order Taylor series expansion: f(x0) + f′(x0)(x − x0) approximates f in a neighborhood of x0.
Figure 5.13 Visualizing outer products. Outer products of vectors increase the dimensionality of the array by 1 per term. (a) The outer product of two vectors results in a matrix; (b) the outer product of three vectors yields a third-order tensor. (a) Given a vector δ ∈ R⁴, we obtain the outer product δ² := δ ⊗ δ = δδ^T ∈ R^{4×4} as a matrix.
where D_x^k f(x0) is the k-th (total) derivative of f with respect to x, evaluated at x0.

Definition 5.8 (Taylor Polynomial). The Taylor polynomial of degree n of f at x0 contains the first n + 1 components of the series in (5.151) and is defined as

T_n(x) = ∑_{k=0}^{n} D_x^k f(x0)/k! δ^k .    (5.152)
array.

δ² := δ ⊗ δ = δδ^T ,  δ²[i, j] = δ[i] δ[j]    (5.153)
in the Taylor series, where D_x^k f(x0) δ^k contains k-th order polynomials.
Now that we defined the Taylor series for vector fields, let us explicitly write down the first terms D_x^k f(x0) δ^k of the Taylor series expansion for k = 0, . . . , 3 and δ := x − x0:

k = 0 : D_x^0 f(x0) δ^0 = f(x0) ∈ R    (5.156)
k = 1 : D_x^1 f(x0) δ^1 = ∇_x f(x0) δ = ∑_{i=1}^{D} ∇_x f(x0)[i] δ[i] ∈ R    (5.157)
k = 2 : D_x^2 f(x0) δ^2 = tr(H(x0) δ δ^T) = δ^T H(x0) δ    (5.158)
      = ∑_{i=1}^{D} ∑_{j=1}^{D} H[i, j] δ[i] δ[j] ∈ R    (5.159)
k = 3 : D_x^3 f(x0) δ^3 = ∑_{i=1}^{D} ∑_{j=1}^{D} ∑_{k=1}^{D} D_x^3 f(x0)[i, j, k] δ[i] δ[j] δ[k] ∈ R    (5.160)

Here, H(x0) is the Hessian of f evaluated at x0. (In code, these contractions read np.einsum('i,i', Df1, d), np.einsum('ij,i,j', Df2, d, d), and np.einsum('ijk,i,j,k', Df3, d, d, d).)
∂f/∂x = 2x + 2y  ⟹  ∂f/∂x (1, 2) = 6    (5.163)
∂f/∂y = 2x + 3y²  ⟹  ∂f/∂y (1, 2) = 14 .    (5.164)

Therefore, we obtain

D¹_{x,y} f(1, 2) = ∇_{x,y} f(1, 2) = [∂f/∂x (1, 2), ∂f/∂y (1, 2)] = [6, 14] ∈ R^{1×2}    (5.165)

such that

D¹_{x,y} f(1, 2)/1! · δ = [6, 14] [x − 1, y − 2]^T = 6(x − 1) + 14(y − 2) .    (5.166)

Note that D¹_{x,y} f(1, 2) δ contains only linear terms, i.e., first-order polynomials.
The second-order partial derivatives are given by

∂²f/∂x² = 2  ⟹  ∂²f/∂x² (1, 2) = 2    (5.167)
∂²f/∂y² = 6y  ⟹  ∂²f/∂y² (1, 2) = 12    (5.168)
∂²f/∂y∂x = 2  ⟹  ∂²f/∂y∂x (1, 2) = 2    (5.169)
∂²f/∂x∂y = 2  ⟹  ∂²f/∂x∂y (1, 2) = 2 .    (5.170)

When we collect the second-order partial derivatives, we obtain the Hessian

H = [[∂²f/∂x², ∂²f/∂x∂y], [∂²f/∂y∂x, ∂²f/∂y²]] = [[2, 2], [2, 6y]] ,    (5.171)

such that

H(1, 2) = [[2, 2], [2, 12]] ∈ R^{2×2} .    (5.172)

Therefore, the next term of the Taylor-series expansion is given by

D²_{x,y} f(1, 2)/2! · δ² = (1/2) δ^T H(1, 2) δ    (5.173a)
= (1/2) [x − 1, y − 2] [[2, 2], [2, 12]] [x − 1, y − 2]^T    (5.173b)
= (x − 1)² + 2(x − 1)(y − 2) + 6(y − 2)² .    (5.173c)

Here, D²_{x,y} f(1, 2) δ² contains only quadratic terms, i.e., second-order polynomials.
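The quadratic Taylor approximation can be evaluated with the einsum contractions noted in the margin; the sketch below assumes f(x, y) = x² + 2xy + y³, which is consistent with the partial derivatives computed above:

import numpy as np

def f(x, y):
    return x**2 + 2 * x * y + y**3   # assumed form, matching (5.163)/(5.164)

Df1 = np.array([6.0, 14.0])          # gradient at (1, 2), see (5.165)
Df2 = np.array([[2.0, 2.0],
                [2.0, 12.0]])        # Hessian at (1, 2), see (5.172)

def taylor2(x, y):
    d = np.array([x - 1.0, y - 2.0])                  # delta = (x, y) - (1, 2)
    return (f(1.0, 2.0)
            + np.einsum('i,i', Df1, d)                # linear term (5.166)
            + np.einsum('ij,i,j', Df2, d, d) / 2.0)   # quadratic term (5.173)

print(taylor2(1.1, 2.1), f(1.1, 2.1))   # close; the remainder is third order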
Exercises
5.1 Compute the derivative f′(x) for
f(x) = log(x⁴) sin(x³) .
where x, µ ∈ R^D, S ∈ R^{D×D}.
b.
f(x) = tr(xx^T + σ² I) ,  x ∈ R^D
Here tr(A) is the trace of A, i.e., the sum of the diagonal elements Aii .
Hint: Explicitly write out the outer product.
c. Use the chain rule. Provide the dimensions of every single partial deriva-
tive. You do not need to compute the product of the partial derivatives
explicitly.
f = tanh(z) ∈ R^M
z = Ax + b ,  x ∈ R^N , A ∈ R^{M×N} , b ∈ R^M .
Here, tanh is applied to every component of z .
5.9 We define

g(z, ν) := log p(x, z) − log q(z, ν)
z := t(ε, ν)

for differentiable functions p, q, t. By using the chain rule, compute the gradient

(d/dν) g(z, ν) .
6.1 Construction of a Probability Space
A mind map of the concepts introduced in this chapter (independence, Bernoulli and other finite distributions, sufficient statistics, conjugacy) and where they are used in other parts of the book, e.g., Chapter 11 (Density estimation).
(Grinstead and Snell, 1997; Jaynes, 2003) that introduce the three con-
cepts of sample space, event space, and probability measure. The prob-
ability space models a real-world process (referred to as an experiment)
with random outcomes.
The probability of a single event must lie in the interval [0, 1], and the total probability over all outcomes in the sample space Ω must be 1, i.e., P(Ω) = 1. Given a probability space (Ω, A, P), we want to use it to model some real-world phenomenon. In machine learning, we often avoid explicitly referring to the probability space, but instead refer to probabilities on quantities of interest, which we denote by T. In this book, we refer to T as the target space and refer to elements of T as states. We introduce a function X : Ω → T that takes an element of Ω (an outcome) and returns a particular quantity of interest x, a value in T. This association/mapping from Ω to T is called a random variable. (The name "random variable" is a great source of misunderstanding as it is neither random nor is it a variable: it is a function.) For example, in the case of tossing two coins and counting the number of heads, a random variable X maps to the three possible outcomes: X(hh) = 2, X(ht) = 1, X(th) = 1, and X(tt) = 0. In this particular case, T = {0, 1, 2}, and it is the probabilities on elements of T that we are interested in. For a finite sample space Ω and finite T, the function corresponding to a random variable is essentially a lookup table. For any subset S ⊆ T, we associate P_X(S) ∈ [0, 1] (the probability) to a particular event occurring corresponding to the random variable X. Example 6.1 provides a concrete illustration of the terminology.
Remark. The aforementioned sample space Ω unfortunately is referred to by different names in different books. Another common name for Ω is "state space" (Jacod and Protter, 2004), but state space is sometimes reserved for referring to states in a dynamical system (Hasselblatt and Katok, 2003).
Example 6.1
We assume that the reader is already familiar with computing probabilities of intersections and unions of sets of events. A gentler introduction to probability with many examples can be found in chapter 2 of Walpole et al. (2011). This toy example is essentially a biased coin flip example.
Consider a statistical experiment where we model a funfair game con-
sisting of drawing two coins from a bag (with replacement). There are
coins from USA (denoted as $) and UK (denoted as £) in the bag, and
since we draw two coins from the bag, there are four outcomes in total.
The state space or sample space Ω of this experiment is then {($, $), ($, £), (£, $), (£, £)}. Let us assume that the composition of the bag of coins is such that a draw returns at random a $ with probability 0.3.
The event we are interested in is the total number of times the repeated draw returns $. Let us define a random variable X that maps the sample space Ω to T, which denotes the number of times we draw $ out of the bag. We can see from the preceding sample space that we can get zero $, one $, or two $s, and therefore T = {0, 1, 2}. The random variable X (a function or lookup table) can be represented as a table like the following:
X(($, $)) = 2 (6.1)
X(($, £)) = 1 (6.2)
X((£, $)) = 1 (6.3)
X((£, £)) = 0 . (6.4)
Since we return the first coin we draw before drawing the second, this
implies that the two draws are independent of each other, which we will
discuss in Section 6.4.5. Note that there are two experimental outcomes,
which map to the same event, where only one of the draws returns $.
Therefore, the probability mass function (Section 6.2.1) of X is given by

P(X = 2) = P(($, $)) = P($) · P($) = 0.3 · 0.3 = 0.09    (6.5)
P(X = 1) = P(($, £) ∪ (£, $)) = P(($, £)) + P((£, $)) = 0.3 · (1 − 0.3) + (1 − 0.3) · 0.3 = 0.42    (6.6)
P(X = 0) = P((£, £)) = P(£) · P(£) = (1 − 0.3) · (1 − 0.3) = 0.49 .    (6.7)
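These probabilities can be reproduced with a few lines of Python (an illustrative sketch, not from the book):

p = 0.3   # probability that a single draw returns $
pmf = {
    2: p * p,                        # ($, $)
    1: p * (1 - p) + (1 - p) * p,    # ($, £) or (£, $)
    0: (1 - p) * (1 - p),            # (£, £)
}
print(pmf)   # {2: 0.09, 1: 0.42, 0: 0.49}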
P_X(S) = P(X ∈ S) = P(X⁻¹(S)) = P({ω ∈ Ω : X(ω) ∈ S}) .    (6.8)
The left-hand side of (6.8) is the probability of the set of possible outcomes
(e.g., number of $ = 1) that we are interested in. Via the random variable
X , which maps states to outcomes, we see in the right-hand side of (6.8)
that this is the probability of the set of states (in ⌦) that have the property
(e.g., $£, £$). We say that a random variable X is distributed according to a particular probability distribution P_X, which defines the probability mapping between the event and the probability of the outcome of the random variable. In other words, the function P_X, or equivalently P ∘ X⁻¹, is the law or distribution of the random variable X.
Remark. The target space, that is, the range T of the random variable X, is used to indicate the kind of probability space, i.e., a T random variable. When T is finite or countably infinite, this is called a discrete random variable (Section 6.2.1). For continuous random variables (Section 6.2.2), we only consider T = R or T = R^D. }
6.1.3 Statistics
Probability theory and statistics are often presented together, but they con-
cern different aspects of uncertainty. One way of contrasting them is by the
kinds of problems that are considered. Using probability, we can consider
a model of some process, where the underlying uncertainty is captured
by random variables, and we use the rules of probability to derive what
happens. In statistics, we observe that something has happened and try
to figure out the underlying process that explains the observations. In this
sense, machine learning is close to statistics in its goals to construct a
model that adequately represents the process that generated the data. We
can use the rules of probability to obtain a “best-fitting” model for some
data.
Another aspect of machine learning systems is that we are interested
in generalization error (see Chapter 8). This means that we are actually
interested in the performance of our system on instances that we will
observe in future, which are not identical to the instances that we have
Figure 6.2 Visualization of a discrete bivariate probability mass function, with random variables X (states x1, . . . , x5) and Y (states y1, y2, y3); n_ij denotes the count for (x_i, y_j), c_i the ith column sum, and r_j the jth row sum. This diagram is adapted from Bishop (2006).
Example 6.2
Consider two random variables X and Y , where X has five possible states
and Y has three possible states, as shown in Figure 6.2. We denote by nij
the number of events with state X = xi and Y = yj , and denote by
N the total number of events. The value c_i is the sum of the individual frequencies for the ith column, that is, c_i = ∑_{j=1}^{3} n_ij. Similarly, the value r_j is the row sum, that is, r_j = ∑_{i=1}^{5} n_ij. Using these definitions, we can compactly express the distribution of X and Y.
The probability distribution of each random variable, the marginal probability, can be seen as the sum over a row or column

P(X = x_i) = c_i / N = ∑_{j=1}^{3} n_ij / N    (6.10)

and

P(Y = y_j) = r_j / N = ∑_{i=1}^{5} n_ij / N ,    (6.11)

where c_i and r_j are the ith column and jth row of the probability table, respectively. By convention, for discrete random variables with a finite number of events, we assume that probabilities sum up to one, that is,

∑_{i=1}^{5} P(X = x_i) = 1  and  ∑_{j=1}^{3} P(Y = y_j) = 1 .    (6.12)
Remark. We reiterate that there are in fact two distinct concepts when
talking about distributions. First is the idea of a pdf (denoted by f (x)),
which is a nonnegative function that sums to one. Second is the law of a
random variable X , that is, the association of a random variable X with
the pdf f (x). }
Figure 6.3 Examples of discrete and continuous uniform distributions. See Example 6.3 for details of the distributions. (a) Discrete distribution; (b) Continuous distribution.
For most of this book, we will not use the notation f (x) and FX (x) as
we mostly do not need to distinguish between the pdf and cdf. However,
we will need to be careful about pdfs and cdfs in Section 6.7.
Example 6.3
We consider two examples of the uniform distribution, where each state is equally likely to occur. This example illustrates some differences between discrete and continuous probability distributions.
Let Z be a discrete uniform random variable with three states {z = −1.1, z = 0.3, z = 1.5}. (The actual values of these states are not meaningful here, and we deliberately chose numbers to drive home the point that we do not want to use (and should ignore) the ordering of the states.) The probability mass function can be represented as a table of probability values:

z          −1.1   0.3   1.5
P(Z = z)    1/3   1/3   1/3

Alternatively, we can think of this as a graph (Figure 6.3(a)), where we use the fact that the states can be located on the x-axis, and the y-axis represents the probability of a particular state. The y-axis in Figure 6.3(a) is deliberately extended so that it is the same as in Figure 6.3(b).
Let X be a continuous random variable taking values in the range 0.9 ≤ X ≤ 1.6, as represented by Figure 6.3(b). Observe that the height of the
naturally from fulfilling the desiderata (Jaynes, 2003, chapter 2). Prob-
abilistic modeling (Section 8.4) provides a principled foundation for de-
signing machine learning methods. Once we have defined probability dis-
tributions (Section 6.2) corresponding to the uncertainties of the data and
our problem, it turns out that there are only two fundamental rules, the
sum rule and the product rule.
Recall from (6.9) that p(x, y) is the joint distribution of the two ran-
dom variables x, y . The distributions p(x) and p(y) are the correspond-
ing marginal distributions, and p(y | x) is the conditional distribution of y
given x. Given the definitions of the marginal and conditional probability
for discrete and continuous random variables in Section 6.2, we can now
present the two fundamental rules in probability theory. (These two rules arise naturally (Jaynes, 2003) from the requirements we discussed in Section 6.1.1.) The first rule, the sum rule, states that

p(x) = ∑_{y∈Y} p(x, y)  if y is discrete,  p(x) = ∫_Y p(x, y) dy  if y is continuous,    (6.20)

where Y are the states of the target space of random variable Y. This means that we sum out (or integrate out) the set of states y of the random variable Y. The sum rule is also known as the marginalization property. The sum rule relates the joint distribution to a marginal distribution. In general, when the joint distribution contains more than two random variables, the sum rule can be applied to any subset of the random variables, resulting in a marginal distribution of potentially more than one random variable. More concretely, if x = [x1, . . . , xD]^T, we obtain the marginal

p(x_i) = ∫ p(x1, . . . , xD) dx_{\i}    (6.21)
The product rule can be interpreted as the fact that every joint distribu-
tion of two random variables can be factorized (written as a product)
of two other distributions. The two factors are the marginal distribu-
tion of the first random variable p(x), and the conditional distribution
of the second random variable given the first p(y | x). Since the ordering
of random variables is arbitrary in p(x, y), the product rule also implies
p(x, y) = p(x | y)p(y). To be precise, (6.22) is expressed in terms of the
probability mass functions for discrete random variables. For continuous
random variables, the product rule is expressed in terms of the probability
density functions (Section 6.2.3).
In machine learning and Bayesian statistics, we are often interested in making inferences of unobserved (latent) random variables given that we have observed other random variables. Let us assume we have some prior knowledge p(x) about an unobserved random variable x and some relationship p(y | x) between x and a second random variable y, which we can observe. If we observe y, we can use Bayes' theorem (also Bayes' rule or Bayes' law) to draw some conclusions about x given the observed values of y.
The quantity
Z
p(y) := p(y | x)p(x)dx = EX [p(y | x)] (6.27)
is the marginal likelihood/evidence. The right-hand side of (6.27) uses the expectation operator, which we define in Section 6.4.1. By definition, the
marginal likelihood integrates the numerator of (6.23) with respect to the
latent variable x. Therefore, the marginal likelihood is independent of
x, and it ensures that the posterior p(x | y) is normalized. The marginal
likelihood can also be interpreted as the expected likelihood where we
take the expectation with respect to the prior p(x). Beyond normalization
of the posterior, the marginal likelihood also plays an important role in
Bayesian model selection, as we will discuss in Section 8.6. Due to the
Bayes’ theorem is integration in (8.44), the evidence is often hard to compute.
also called the Bayes’ theorem (6.23) allows us to invert the relationship between x
“probabilistic
and y given by the likelihood. Therefore, Bayes’ theorem is sometimes
inverse.”
probabilistic inverse called the probabilistic inverse. We will discuss Bayes’ theorem further in
Section 8.4.
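As a small numerical illustration of (6.23) (all numbers hypothetical), the posterior over a discrete latent variable is obtained by multiplying the prior by the likelihood of the observation and normalizing by the evidence:

```python
import numpy as np

prior = np.array([0.6, 0.4])        # hypothetical p(x) over two states
likelihood = np.array([0.2, 0.7])   # hypothetical p(y_obs | x) per state

# Evidence (6.27): p(y) = sum_x p(y | x) p(x); it normalizes the posterior.
evidence = np.sum(likelihood * prior)

# Bayes' theorem (6.23): p(x | y) = p(y | x) p(x) / p(y).
posterior = likelihood * prior / evidence
print(posterior, posterior.sum())   # posterior sums to 1
```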
Remark. In Bayesian statistics, the posterior distribution is the quantity
of interest as it encapsulates all available information from the prior and
the data. Instead of carrying the posterior around, it is possible to focus
on some statistic of the posterior, such as the maximum of the posterior,
which we will discuss in Section 8.3. However, focusing on some statistic
of the posterior leads to loss of information. If we think in a bigger con-
text, then the posterior can be used within a decision-making system, and
having the full posterior can be extremely useful and lead to decisions that
are robust to disturbances. For example, in the context of model-based re-
inforcement learning, Deisenroth et al. (2015) show that using the full
posterior distribution of plausible transition functions leads to very fast
(data/sample efficient) learning, whereas focusing on the maximum of
the posterior leads to consistent failures. Therefore, having the full pos-
terior can be very useful for a downstream task. In Chapter 9, we will
continue this discussion in the context of linear regression. }
Definition 6.3 (Expected Value). The expected value of a function $g: \mathbb{R} \to \mathbb{R}$ of a univariate continuous random variable $X \sim p(x)$ is given by

$$\mathbb{E}_X[g(x)] = \int_{\mathcal{X}} g(x)\,p(x)\,\mathrm{d}x , \tag{6.28}$$

where $\mathcal{X}$ is the set of possible outcomes (the target space) of the random variable $X$.
where the subscript $\mathbb{E}_{X_d}$ indicates that we are taking the expected value with respect to the $d$th element of the vector $x$. }
Definition 6.3 defines the meaning of the notation EX as the operator
indicating that we should take the integral with respect to the probabil-
ity density (for continuous distributions) or the sum over all states (for
discrete distributions). The definition of the mean (Definition 6.4) is a special case of the expected value, obtained by choosing $g$ to be the identity function.
Definition 6.4 (Mean). The mean of a random variable $X$ with states $x \in \mathbb{R}^D$ is an average, defined as $\mathbb{E}_X[x] = [\mathbb{E}_{X_1}[x_1], \ldots, \mathbb{E}_{X_D}[x_D]]^\top \in \mathbb{R}^D$, where each entry is the expected value (6.28) of the corresponding marginal.
Example 6.4

Consider the two-dimensional distribution illustrated in Figure 6.4:

$$p(x) = 0.4\,\mathcal{N}\!\left(x \,\Big|\, \begin{bmatrix} 10 \\ 2 \end{bmatrix}, \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix}\right) + 0.6\,\mathcal{N}\!\left(x \,\Big|\, \begin{bmatrix} 0 \\ 0 \end{bmatrix}, \begin{bmatrix} 8.4 & 2.0 \\ 2.0 & 1.7 \end{bmatrix}\right) . \tag{6.33}$$

We will define the Gaussian distribution $\mathcal{N}(\mu, \sigma^2)$ in Section 6.5. Also shown is its corresponding marginal distribution in each dimension. Observe that the distribution is bimodal (has two modes), but one of the marginal distributions is unimodal (has one mode).

Figure 6.4 Illustration of the mean, modes, and median for a two-dimensional dataset, as well as its marginal densities.
Remark. The expected value (Definition 6.3) is a linear operator. For example, given a real-valued function $f(x) = a g(x) + b h(x)$ where $a, b \in \mathbb{R}$ and $x \in \mathbb{R}^D$, we obtain

$$\mathbb{E}_X[f(x)] = \int f(x)\,p(x)\,\mathrm{d}x \tag{6.34a}$$
$$= \int [a g(x) + b h(x)]\,p(x)\,\mathrm{d}x \tag{6.34b}$$
$$= a \int g(x)\,p(x)\,\mathrm{d}x + b \int h(x)\,p(x)\,\mathrm{d}x \tag{6.34c}$$
$$= a\,\mathbb{E}_X[g(x)] + b\,\mathbb{E}_X[h(x)] . \tag{6.34d}$$

}
For two random variables, we may wish to characterize their correspondence.

Figure 6.5 Two-dimensional datasets with identical means and variances along each axis (colored lines) but with different covariances. (a) x and y are negatively correlated; (b) x and y are positively correlated.
The diagonal entries of the covariance matrix contain the variances of the marginals

$$p(x_i) = \int p(x_1, \ldots, x_D)\,\mathrm{d}x_{\setminus i} , \tag{6.39}$$

where "$\setminus i$" denotes "all variables but $i$". The off-diagonal entries are the cross-covariance terms $\mathrm{Cov}[x_i, x_j]$ for $i, j = 1, \ldots, D$, $i \neq j$.
where $x_n \in \mathbb{R}^D$.

Similar to the empirical mean, the empirical covariance matrix is a $D \times D$ matrix

$$\boldsymbol{\Sigma} := \frac{1}{N} \sum_{n=1}^{N} (x_n - \bar{x})(x_n - \bar{x})^\top . \tag{6.42}$$

Throughout the book, we use the empirical covariance, which is a biased estimate; the unbiased (sometimes called corrected) covariance has the factor $N - 1$ in the denominator instead of $N$. To compute the statistics for a particular dataset, we would use the realizations (observations) $x_1, \ldots, x_N$ and use (6.41) and (6.42). Empirical covariance matrices are symmetric, positive semidefinite (see Section 3.2.3).

6.4.3 Three Expressions for the Variance

We now focus on a single random variable $X$ and use the preceding empirical formulas to derive three possible expressions for the variance; the derivations are exercises at the end of this chapter. The following derivation is the same for the population variance, except that we need to take care of integrals. The standard definition of variance, corresponding to the definition of covariance (Definition 6.5), is the expectation of the squared deviation of a random variable $X$ from its expected value $\mu$, i.e.,

$$\mathbb{V}_X[x] := \mathbb{E}_X[(x - \mu)^2] . \tag{6.43}$$
The expectation in (6.43) and the mean $\mu = \mathbb{E}_X[x]$ are computed using (6.32), depending on whether $X$ is a discrete or continuous random variable. The variance as expressed in (6.43) is the mean of a new random variable $Z := (X - \mu)^2$.

When estimating the variance in (6.43) empirically, we need to resort to a two-pass algorithm: one pass through the data to calculate the mean $\mu$ using (6.41), and then a second pass using this estimate $\hat{\mu}$ to calculate the variance. It turns out that we can avoid two passes by rearranging the terms. The formula in (6.43) can be converted to the so-called raw-score formula for variance:

$$\mathbb{V}_X[x] = \mathbb{E}_X[x^2] - (\mathbb{E}_X[x])^2 . \tag{6.44}$$

The expression in (6.44) can be remembered as "the mean of the square minus the square of the mean". It can be calculated empirically in one pass through the data since we can accumulate $x_i$ (to calculate the mean) and $x_i^2$ simultaneously, where $x_i$ is the $i$th observation. Unfortunately, if implemented in this way, it can be numerically unstable: if the two terms in (6.44) are huge and approximately equal, we may suffer from an unnecessary loss of numerical precision in floating-point arithmetic. The raw-score version of the variance can be useful in machine learning, e.g., when deriving the bias–variance decomposition (Bishop, 2006).

A third way to understand the variance is that it is a sum of pairwise differences between all pairs of observations. Consider a sample $x_1, \ldots, x_N$ of realizations of random variable $X$, and we compute the squared difference between pairs of $x_i$ and $x_j$. By expanding the square, we can show that the sum of the $N^2$ pairwise differences is the empirical variance of the observations:

$$\frac{1}{N^2} \sum_{i,j=1}^{N} (x_i - x_j)^2 = 2\left[ \frac{1}{N} \sum_{i=1}^{N} x_i^2 - \left( \frac{1}{N} \sum_{i=1}^{N} x_i \right)^2 \right] . \tag{6.45}$$
We see that (6.45) is twice the raw-score expression (6.44). This means that we can express the sum of pairwise distances (of which there are $N^2$) as a sum of deviations from the mean (of which there are $N$). Geometrically, this means that there is an equivalence between the pairwise distances and the distances from the center of the set of points. From a computational perspective, this means that by computing the mean ($N$ terms in the summation), and then computing the variance (again $N$ terms in the summation), we can obtain an expression (left-hand side of (6.45)) that has $N^2$ terms.
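The three expressions (6.43)–(6.45) are straightforward to verify numerically. A sketch (arbitrary synthetic data, assuming NumPy) that also hints at the floating-point caveat mentioned above:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=5.0, scale=2.0, size=1000)
N = len(x)

# Two-pass definition (6.43): compute the mean, then squared deviations.
var_two_pass = np.mean((x - x.mean()) ** 2)

# Raw-score formula (6.44): one pass, accumulating x and x^2.
var_raw_score = np.mean(x ** 2) - x.mean() ** 2

# Pairwise differences (6.45): N^2 terms, twice the empirical variance.
var_pairwise = np.sum((x[:, None] - x[None, :]) ** 2) / (2 * N ** 2)

print(var_two_pass, var_raw_score, var_pairwise)  # (nearly) identical

# Shifting the data by a huge constant makes the two terms of (6.44)
# nearly equal and exposes its numerical instability:
x_shifted = x + 1e8
print(np.mean(x_shifted ** 2) - x_shifted.mean() ** 2)  # unreliable
```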
Example 6.5
Consider a random variable $X$ with zero mean ($\mathbb{E}_X[x] = 0$) and also $\mathbb{E}_X[x^3] = 0$. Let $y = x^2$ (hence, $Y$ is dependent on $X$) and consider the covariance (6.36) between $X$ and $Y$. But this gives

$$\mathrm{Cov}[x, y] = \mathbb{E}[xy] - \mathbb{E}[x]\mathbb{E}[y] = \mathbb{E}[x^3] = 0 , \tag{6.54}$$

so $X$ and $Y$ are uncorrelated even though $Y$ depends on $X$.
Figure 6.6 Geometry of random variables. If random variables $X$ and $Y$ are uncorrelated, they are orthogonal vectors in a corresponding vector space, and the Pythagorean theorem applies: the hypotenuse $c = \sqrt{\mathrm{var}[x + y]}$ of a right triangle has legs $a = \sqrt{\mathrm{var}[x]}$ and $b = \sqrt{\mathrm{var}[y]}$.
Figure 6.7 Gaussian distribution of two random variables $x_1$ and $x_2$.

Figure 6.8 Gaussian distributions: (a) one-dimensional case; (b) two-dimensional case.
Example 6.6

Figure 6.9 (a) Bivariate Gaussian; (b) marginal of a joint Gaussian distribution is Gaussian; (c) the conditional distribution of a Gaussian is also Gaussian.
Example 6.7

Since expectations are linear operations, the weighted sum of independent Gaussian random variables is again Gaussian:

$$p(ax + by) = \mathcal{N}\!\big(a\mu_x + b\mu_y ,\; a^2\Sigma_x + b^2\Sigma_y\big) . \tag{6.79}$$
Consider next a mixture of two univariate Gaussian densities

$$p(x) = \alpha p_1(x) + (1 - \alpha) p_2(x) , \tag{6.80}$$

where the scalar $0 < \alpha < 1$ is the mixture weight, and $p_1(x)$ and $p_2(x)$ are univariate Gaussian densities (Equation (6.62)) with different parameters, i.e., $(\mu_1, \sigma_1^2) \neq (\mu_2, \sigma_2^2)$.
Then the mean of the mixture density $p(x)$ is given by the weighted sum of the means of each random variable:

$$\mathbb{E}[x] = \alpha\mu_1 + (1 - \alpha)\mu_2 . \tag{6.81}$$
Proof The mean of the mixture density $p(x)$ is given by the weighted sum of the means of each random variable. We apply the definition of the mean (Definition 6.4) and plug in our mixture (6.80), which yields

$$\mathbb{E}[x] = \int_{-\infty}^{\infty} x\,p(x)\,\mathrm{d}x \tag{6.83a}$$
$$= \int_{-\infty}^{\infty} \big( \alpha x p_1(x) + (1 - \alpha) x p_2(x) \big)\,\mathrm{d}x \tag{6.83b}$$
$$= \alpha \int_{-\infty}^{\infty} x p_1(x)\,\mathrm{d}x + (1 - \alpha) \int_{-\infty}^{\infty} x p_2(x)\,\mathrm{d}x \tag{6.83c}$$
$$= \alpha\mu_1 + (1 - \alpha)\mu_2 . \tag{6.83d}$$
To compute the variance, we can use the raw-score version of the variance from (6.44), which requires an expression of the expectation of the squared random variable. Here we use the definition of an expectation of a function (the square) of a random variable (Definition 6.3),

$$\mathbb{E}[x^2] = \int_{-\infty}^{\infty} x^2 p(x)\,\mathrm{d}x \tag{6.84a}$$
$$= \int_{-\infty}^{\infty} \big( \alpha x^2 p_1(x) + (1 - \alpha) x^2 p_2(x) \big)\,\mathrm{d}x \tag{6.84b}$$
$$= \alpha (\mu_1^2 + \sigma_1^2) + (1 - \alpha)(\mu_2^2 + \sigma_2^2) . \tag{6.84c}$$
Remark. The preceding derivation holds for any density, but since the
Gaussian is fully determined by the mean and variance, the mixture den-
sity can be determined in closed form. }
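Since the Gaussian mixture's mean and variance are available in closed form via (6.83d) and (6.84), we can check them against Monte Carlo estimates; a sketch with hypothetical parameters:

```python
import numpy as np

alpha, mu1, s1, mu2, s2 = 0.3, -1.0, 0.5, 2.0, 1.5   # hypothetical values

# Closed-form moments of p(x) = alpha * p1(x) + (1 - alpha) * p2(x).
mean = alpha * mu1 + (1 - alpha) * mu2                            # (6.83d)
e_x2 = alpha * (mu1**2 + s1**2) + (1 - alpha) * (mu2**2 + s2**2)  # (6.84c)
var = e_x2 - mean**2                                 # raw-score rule (6.44)

# Monte Carlo comparison: sample a component, then sample from it.
rng = np.random.default_rng(1)
n = 1_000_000
use_first = rng.random(n) < alpha
x = np.where(use_first, rng.normal(mu1, s1, n), rng.normal(mu2, s2, n))

print(mean, x.mean())   # closed form vs. sample mean
print(var, x.var())     # closed form vs. sample variance
```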
For a mixture density, the individual components can be considered
to be conditional distributions (conditioned on the component identity).
Equation (6.85c) is an example of the conditional variance formula, also known as the law of total variance, which generally states that for two random variables $X$ and $Y$ it holds that $\mathbb{V}_X[x] = \mathbb{E}_Y[\mathbb{V}_X[x \mid y]] + \mathbb{V}_Y[\mathbb{E}_X[x \mid y]]$, i.e., the (total) variance of $X$ is the expected conditional variance plus the variance of a conditional mean.
We consider in Example 6.17 a bivariate standard Gaussian random variable $X$ and perform a linear transformation $Ax$ on it. The outcome is a Gaussian random variable with mean zero and covariance $AA^\top$. Observe that adding a constant vector will change the mean of the distribution, without affecting its variance, that is, the random variable $x + \mu$ is Gaussian with mean $\mu$ and identity covariance. Hence, any linear/affine transformation of a Gaussian random variable is also Gaussian distributed.

Consider a Gaussian distributed random variable $X \sim \mathcal{N}(\mu, \Sigma)$. For a given matrix $A$ of appropriate shape, let $Y$ be a random variable such that $y = Ax$ is a transformed version of $x$. We can compute the mean of $y$ by exploiting that the expectation is a linear operator (6.50):

$$\mathbb{E}[y] = \mathbb{E}[Ax] = A\,\mathbb{E}[x] = A\mu .$$
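The claim that $y = Ax$ is Gaussian with mean $A\mu$ (and covariance $A\Sigma A^\top$, which follows from the same linearity argument applied to the covariance) can be checked by simulation; a minimal sketch with arbitrary $\mu$, $\Sigma$, and $A$:

```python
import numpy as np

rng = np.random.default_rng(2)
mu = np.array([1.0, -2.0])
Sigma = np.array([[2.0, 0.5],
                  [0.5, 1.0]])
A = np.array([[1.0, 2.0],
              [0.0, 1.0]])             # arbitrary transformation matrix

x = rng.multivariate_normal(mu, Sigma, size=200_000)
y = x @ A.T                            # y_n = A x_n for every sample

print(A @ mu)                          # analytic mean  E[y] = A mu
print(y.mean(axis=0))                  # empirical mean agrees
print(A @ Sigma @ A.T)                 # analytic covariance A Sigma A^T
print(np.cov(y.T))                     # empirical covariance agrees
```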
It turns out that the class of distributions called the exponential family
provides the right balance of generality while retaining favorable compu-
tation and inference properties. Before we introduce the exponential fam-
ily, let us see three more members of “named” probability distributions,
the Bernoulli (Example 6.8), Binomial (Example 6.9), and Beta (Exam-
ple 6.10) distributions.
Example 6.8

The Bernoulli distribution is a distribution for a single binary random variable $X$ with state $x \in \{0, 1\}$. It is governed by a single continuous parameter $\mu \in [0, 1]$ that represents the probability of $X = 1$. The Bernoulli distribution $\mathrm{Ber}(\mu)$ is defined as

$$p(x \mid \mu) = \mu^x (1 - \mu)^{1 - x} , \quad x \in \{0, 1\} , \tag{6.92}$$
$$\mathbb{E}[x] = \mu , \tag{6.93}$$
$$\mathbb{V}[x] = \mu(1 - \mu) , \tag{6.94}$$

where $\mathbb{E}[x]$ and $\mathbb{V}[x]$ are the mean and variance of the binary random variable $X$.
Figure 6.10 Examples of the Binomial distribution for $\mu \in \{0.1, 0.4, 0.75\}$ and $N = 15$.

Figure 6.11 Examples of the Beta distribution for different values of $\alpha$ and $\beta$.
Remark. There is a whole zoo of distributions with names, and they are
related in different ways to each other (Leemis and McQueston, 2008).
It is worth keeping in mind that each named distribution is created for a
particular reason, but may have other applications. Knowing the reason
behind the creation of a particular distribution often allows insight into
how to best use it. We introduced the preceding three distributions to be able to illustrate conjugacy, which we discuss next. }
6.6.1 Conjugacy
According to Bayes’ theorem (6.23), the posterior is proportional to the
product of the prior and the likelihood. The specification of the prior can
be tricky for two reasons: First, the prior should encapsulate our knowl-
edge about the problem before we see any data. This is often difficult to
describe. Second, it is often not possible to compute the posterior distribu-
tion analytically. However, there are some priors that are computationally
convenient: conjugate priors.

Definition 6.13 (Conjugate Prior). A prior is conjugate for the likelihood function if the posterior is of the same form/type as the prior.
Conjugacy is particularly convenient because we can algebraically cal-
culate our posterior distribution by updating the parameters of the prior
distribution.
Remark. When considering the geometry of probability distributions, con-
jugate priors retain the same distance structure as the likelihood (Agarwal
and Daumé III, 2010). }
To introduce a concrete example of conjugate priors, we describe in Ex-
ample 6.11 the Binomial distribution (defined on discrete random vari-
ables) and the Beta distribution (defined on continuous random vari-
ables).
$$\propto \mathrm{Beta}(h + \alpha,\; N - h + \beta) , \tag{6.104d}$$

i.e., the posterior distribution is a Beta distribution like the prior, i.e., the Beta prior is conjugate for the parameter $\mu$ in the Binomial likelihood function.
Table 6.2 lists examples for conjugate priors for the parameters of some standard likelihoods used in probabilistic modeling. Distributions such as Multinomial, inverse Gamma, inverse Wishart, and Dirichlet can be found in any statistical text, and are described in Bishop (2006), for example. The Beta distribution is the conjugate prior for the parameter $\mu$ in both the Binomial and the Bernoulli likelihood. For a Gaussian likelihood function, we can place a conjugate Gaussian prior on the mean. The reason why the Gaussian likelihood appears twice in the table is that we need to distinguish the univariate from the multivariate case. In the univariate (scalar) case, the inverse Gamma is the conjugate prior for the variance. In the multivariate case, we use a conjugate inverse Wishart distribution as a prior on the covariance matrix. (Equivalently, the Gamma prior is conjugate for the precision, i.e., the inverse variance, in the univariate Gaussian likelihood, and the Wishart prior is conjugate for the precision matrix, i.e., the inverse covariance matrix, in the multivariate Gaussian likelihood.) The Dirichlet distribution is the conjugate prior for the multinomial likelihood function. For further details, we refer to Bishop (2006).
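The conjugate update (6.104d) requires no integration at all, only parameter bookkeeping; a sketch with made-up prior parameters and data:

```python
import numpy as np

alpha, beta = 2.0, 2.0                      # hypothetical Beta prior on mu
x = np.array([1, 0, 1, 1, 1, 0, 1, 1])      # hypothetical binary outcomes
N, h = len(x), int(x.sum())                 # trials and number of ones

# Posterior (6.104d): Beta(h + alpha, N - h + beta) -- same family
# as the prior, so the update is purely algebraic.
alpha_post, beta_post = h + alpha, N - h + beta

# Posterior mean of mu, using the Beta mean a / (a + b).
print(alpha_post / (alpha_post + beta_post))
```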
$$\theta = \log\frac{\mu}{1 - \mu} \tag{6.115}$$
$$\phi(x) = x \tag{6.116}$$
$$A(\theta) = -\log(1 - \mu) = \log(1 + \exp(\theta)) . \tag{6.117}$$

The relationship between $\theta$ and $\mu$ is invertible, so that

$$\mu = \frac{1}{1 + \exp(-\theta)} . \tag{6.118}$$

The relation (6.118) is used to obtain the right equality of (6.117).
Example 6.15

Recall the exponential family form of the Bernoulli distribution (6.113d)

$$p(x \mid \mu) = \exp\!\left[ x \log\frac{\mu}{1 - \mu} + \log(1 - \mu) \right] . \tag{6.121}$$
ables. However, we may not be able to obtain the functional form of the
distribution under transformations. Furthermore, we may be interested
in nonlinear transformations of random variables for which closed-form
expressions are not readily available.
Remark (Notation). In this section, we will be explicit about random vari-
ables and the values they take. Hence, recall that we use capital letters
X, Y to denote random variables and small letters x, y to denote the val-
ues in the target space T that the random variables take. We will explicitly
write pmfs of discrete random variables X as P (X = x). For continuous
random variables X (Section 6.2.2), the pdf is written as f (x) and the cdf
is written as FX (x). }
We will look at two approaches for obtaining distributions of transfor-
mations of random variables: a direct approach using the definition of a
cumulative distribution function and a change-of-variable approach that
uses the chain rule of calculus (Section 5.2.2). Moment generating functions can also be used to study transformations of random variables (Casella and Berger, 2002, chapter 2). The change-of-variable approach is widely used because it provides a "recipe" for attempting to compute the resulting distribution due to a transformation. We will explain the techniques for univariate random variables, and will only briefly provide the results for the general case of multivariate random variables.

Transformations of discrete random variables can be understood directly. Suppose that there is a discrete random variable $X$ with pmf $P(X = x)$ (Section 6.2.1), and an invertible function $U(x)$. Consider the transformed random variable $Y := U(X)$, with pmf $P(Y = y)$. Then

$$P(Y = y) = P\big(U(X) = y\big) = P\big(X = U^{-1}(y)\big) .$$
We also need to keep in mind that the domain of the random variable may
have changed due to the transformation by U .
Example 6.16

Let $X$ be a continuous random variable with probability density function

$$f(x) = 3x^2 \quad \text{on } 0 \le x \le 1 . \tag{6.128}$$

We are interested in finding the pdf of $Y = X^2$.

The function $f$ is an increasing function of $x$, and therefore the resulting value of $y$ lies in the interval $[0, 1]$. We obtain

$$F_Y(y) = P(Y \le y) \qquad \text{definition of cdf} \tag{6.129a}$$
$$= P(X^2 \le y) \qquad \text{transformation of interest} \tag{6.129b}$$
$$= P(X \le y^{1/2}) \qquad \text{inverse} \tag{6.129c}$$
$$= F_X(y^{1/2}) \qquad \text{definition of cdf} \tag{6.129d}$$
$$= \int_0^{y^{1/2}} 3t^2\,\mathrm{d}t \qquad \text{cdf as a definite integral} \tag{6.129e}$$
$$= \left[ t^3 \right]_{t=0}^{t=y^{1/2}} \qquad \text{result of integration} \tag{6.129f}$$
$$= y^{3/2} , \quad 0 \le y \le 1 . \tag{6.129g}$$

Therefore, the cdf of $Y$ is

$$F_Y(y) = y^{3/2} \tag{6.130}$$

for $0 \le y \le 1$. To obtain the pdf, we differentiate the cdf:

$$f(y) = \frac{\mathrm{d}}{\mathrm{d}y} F_Y(y) = \frac{3}{2}\, y^{1/2} \tag{6.131}$$

for $0 \le y \le 1$.
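We can sanity-check (6.131) by simulation. Since $F_X(x) = x^3$, a sample of $X$ is obtained by pushing uniform samples through $F_X^{-1}(u) = u^{1/3}$; squaring the samples should then match the derived density $\frac{3}{2}\sqrt{y}$. A minimal sketch:

```python
import numpy as np

rng = np.random.default_rng(3)
u = rng.random(500_000)
x = u ** (1 / 3)     # inverse of F_X(x) = x^3, so x has density 3x^2
y = x ** 2           # the transformed variable Y = X^2

# Compare a normalized histogram of Y against the pdf from (6.131).
hist, edges = np.histogram(y, bins=20, range=(0.0, 1.0), density=True)
centers = 0.5 * (edges[:-1] + edges[1:])
pdf = 1.5 * np.sqrt(centers)
print(np.max(np.abs(hist - pdf)))   # small deviation if (6.131) holds
```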
Theorem 6.15. Let $X$ be a continuous random variable with a strictly monotonic cumulative distribution function $F_X(x)$. Then the random variable

$$Y := F_X(X) \tag{6.132}$$

has a uniform distribution.

Theorem 6.15 is known as the probability integral transform, and it is used to derive algorithms for sampling from distributions by transforming
the result of sampling from a uniform random variable (Bishop, 2006).
The algorithm works by first generating a sample from a uniform distribu-
tion, then transforming it by the inverse cdf (assuming this is available)
to obtain a sample from the desired distribution. The probability integral
transform is also used in hypothesis testing, to test whether a sample comes from
a particular distribution (Lehmann and Romano, 2005). The idea that the
output of a cdf gives a uniform distribution also forms the basis of copu-
las (Nelsen, 2006).
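A minimal sketch of the sampling algorithm described above, assuming SciPy's inverse cdf of the standard normal (scipy.stats.norm.ppf) as the target distribution's inverse cdf:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)

u = rng.random(100_000)       # step 1: uniform samples on [0, 1]
z = stats.norm.ppf(u)         # step 2: transform by the inverse cdf

print(z.mean(), z.std())      # approximately 0 and 1, as expected
# Conversely, pushing any sample through its own cdf is uniform (6.132):
print(stats.kstest(stats.norm.cdf(z), "uniform").pvalue)
```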
Let us break down the reasoning step by step, with the goal of understanding the more general change-of-variables approach in Theorem 6.16. Change of variables in probability relies on the change-of-variables method in calculus (Tandra, 2014).

Remark. The name "change of variables" comes from the idea of changing the variable of integration when faced with a difficult integral. For univariate functions, we use the substitution rule of integration,

$$\int f(g(x))\, g'(x)\,\mathrm{d}x = \int f(u)\,\mathrm{d}u , \quad \text{where } u = g(x) . \tag{6.133}$$

The derivation of this rule is based on the chain rule of calculus (5.32) and by applying twice the fundamental theorem of calculus. The fundamental theorem of calculus formalizes the fact that integration and differentiation are somehow "inverses" of each other. An intuitive understanding of the rule can be obtained by thinking (loosely) about small changes (differentials) to the equation $u = g(x)$, that is, by considering $\Delta u = g'(x)\Delta x$ as a differential of $u = g(x)$. By substituting $u = g(x)$, the argument inside the integral on the right-hand side of (6.133) becomes $f(g(x))$. By pretending that the term $\mathrm{d}u$ can be approximated by $\mathrm{d}u \approx \Delta u = g'(x)\Delta x$, and that $\mathrm{d}x \approx \Delta x$, we obtain (6.133). }
Consider a univariate random variable X , and an invertible function
U , which gives us another random variable Y = U (X). We assume that
random variable X has states x 2 [a, b]. By the definition of the cdf, we
have
$$F_Y(y) = P(Y \le y) . \tag{6.134}$$

Applying the transformation $Y = U(X)$ and, because $U$ is invertible (and assumed strictly increasing here), applying its inverse inside the probability yields

$$F_Y(y) = P\big(U(X) \le y\big) = P\big(X \le U^{-1}(y)\big) . \tag{6.136}$$

The right-most term in (6.136) is an expression of the cdf of $X$. Recall the definition of the cdf in terms of the pdf:

$$P\big(X \le U^{-1}(y)\big) = \int_a^{U^{-1}(y)} f(x)\,\mathrm{d}x . \tag{6.137}$$
Theorem 6.16. [Theorem 17.2 in Billingsley (1995)] Let f (x) be the value
of the probability density of the multivariate continuous random variable X .
If the vector-valued function y = U (x) is differentiable and invertible for
all values within the domain of x, then for corresponding values of y , the
probability density of $Y = U(X)$ is given by

$$f(y) = f_x\big(U^{-1}(y)\big) \cdot \left| \det\!\left( \frac{\partial}{\partial y} U^{-1}(y) \right) \right| . \tag{6.144}$$
The theorem looks intimidating at first glance, but the key point is that
a change of variable of a multivariate random variable follows the pro-
cedure of the univariate change of variable. First we need to work out
the inverse transform, and substitute that into the density of x. Then we
calculate the determinant of the Jacobian and multiply the density by it. The
following example illustrates the case of a bivariate random variable.
Example 6.17

Consider a bivariate random variable $X$ with states $x = [x_1, x_2]^\top$ and probability density function

$$f\!\left( \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} \right) = \frac{1}{2\pi} \exp\!\left( -\frac{1}{2} \begin{bmatrix} x_1 \\ x_2 \end{bmatrix}^{\!\top} \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} \right) . \tag{6.145}$$
Exercises
6.1 Consider the following bivariate distribution p(x, y) of two discrete random
variables X and Y .
Compute:
a. The marginal distributions p(x) and p(y).
b. The conditional distributions p(x|Y = y1 ) and p(y|X = x3 ).
6.2 Consider a mixture of two Gaussian distributions (illustrated in Figure 6.4),

$$0.4\,\mathcal{N}\!\left( \begin{bmatrix} 10 \\ 2 \end{bmatrix},\; \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix} \right) + 0.6\,\mathcal{N}\!\left( \begin{bmatrix} 0 \\ 0 \end{bmatrix},\; \begin{bmatrix} 8.4 & 2.0 \\ 2.0 & 1.7 \end{bmatrix} \right) .$$

6.3 Consider the Bernoulli distribution

$$p(x \mid \mu) = \mu^x (1 - \mu)^{1 - x} , \quad x \in \{0, 1\} .$$

Choose a conjugate prior for the Bernoulli likelihood and compute the posterior distribution $p(\mu \mid x_1, \ldots, x_N)$.
6.4 There are two bags. The first bag contains four mangos and two apples; the
second bag contains four mangos and four apples.
We also have a biased coin, which shows "heads" with probability 0.6 and "tails" with probability 0.4. If the coin shows "heads", we pick a fruit at random from bag 1; otherwise we pick a fruit at random from bag 2.
Your friend flips the coin (you cannot see the result), picks a fruit at random
from the corresponding bag, and presents you a mango.
What is the probability that the mango was picked from bag 2?
Hint: Use Bayes’ theorem.
6.5 Consider the time-series model

$$x_{t+1} = A x_t + w , \quad w \sim \mathcal{N}(0, Q) ,$$
$$y_t = C x_t + v , \quad v \sim \mathcal{N}(0, R) ,$$

where $w, v$ are i.i.d. Gaussian noise variables. Further, assume that $p(x_0) = \mathcal{N}(\mu_0, \Sigma_0)$.
Continuous Optimization
but there are several design choices, which we discuss in Section 7.1. For
constrained optimization, we need to introduce other concepts to man-
age the constraints (Section 7.2). We will also introduce a special class
of problems (convex optimization problems in Section 7.3) where we can
make statements about reaching the global optimum.
Consider the function in Figure 7.2. The function has a global minimum around $x = -4.5$, with a function value of approximately $-47$. Since the function is "smooth," the gradients can be used to help find the minimum by indicating whether we should take a step to the right or left. This assumes that we are in the correct bowl, as there exists another local minimum around $x = 0.7$. Recall that we can solve for all the stationary points of a function by calculating its derivative and setting it to zero. (Stationary points are the real roots of the derivative, that is, points that have zero gradient.) For

$$\ell(x) = x^4 + 7x^3 + 5x^2 - 17x + 3 , \tag{7.1}$$

we obtain the corresponding gradient as

$$\frac{\mathrm{d}\ell(x)}{\mathrm{d}x} = 4x^3 + 21x^2 + 10x - 17 . \tag{7.2}$$
Since this is a cubic equation, it has in general three solutions when set to zero. In the example, two of them are minimums and one is a maximum (around $x = -1.4$). To check whether a stationary point is a minimum or maximum, we need to take the derivative a second time and check whether the second derivative is positive or negative at the stationary point. In our case, the second derivative is

$$\frac{\mathrm{d}^2\ell(x)}{\mathrm{d}x^2} = 12x^2 + 42x + 10 . \tag{7.3}$$

By substituting our visually estimated values of $x = -4.5, -1.4, 0.7$, we will observe that, as expected, the middle point is a maximum $\left( \frac{\mathrm{d}^2\ell(x)}{\mathrm{d}x^2} < 0 \right)$ and the other two stationary points are minimums.
Note that we have avoided analytically solving for values of x in the
previous discussion, although for low-order polynomials such as the pre-
ceding we could do so. In general, we are unable to find analytic solu-
tions, and hence we need to start at some value, say $x_0 = -6$, and follow the negative gradient. The negative gradient indicates that we should go
right, but not how far (this is called the step-size). (According to the Abel–Ruffini theorem, there is in general no algebraic solution for polynomials of degree 5 or more (Abel, 1826).) Furthermore, if we had started at the right side (e.g., $x_0 = 0$), the negative gradient would have led us to the wrong minimum. Figure 7.2 illustrates the fact that for $x > -1$, the negative gradient points toward the minimum on the right of the figure, which has a larger objective value.

In Section 7.3, we will learn about a class of functions, called convex functions, that do not exhibit this tricky dependency on the starting point of the optimization algorithm. For convex functions, all local minima are global minima. It turns out that many machine learning objective functions are designed such that they are convex, and we will see an example in Chapter 12.
The discussion in this chapter so far was about a one-dimensional func-
tion, where we are able to visualize the ideas of gradients, descent direc-
tions, and optimal values. In the rest of this chapter we develop the same
ideas in high dimensions. Unfortunately, we can only visualize the concepts in one dimension, and some concepts do not generalize directly to higher dimensions; therefore, some care needs to be taken when reading.
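A minimal gradient descent sketch on $\ell(x)$ from (7.1) using the gradient (7.2); the step-size and starting points are illustrative choices, not taken from the text:

```python
def dl(x):
    """Gradient (7.2) of the polynomial (7.1)."""
    return 4 * x**3 + 21 * x**2 + 10 * x - 17

def gradient_descent(x0, step_size=0.01, iters=500):
    x = x0
    for _ in range(iters):
        x = x - step_size * dl(x)   # step in the negative gradient direction
    return x

# Starting on the left finds the global minimum near x = -4.5; starting
# on the right converges to the local minimum near x = 0.7 instead.
print(gradient_descent(-6.0))
print(gradient_descent(0.0))
```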
Example 7.1

Consider a quadratic function in two dimensions

$$f\!\left( \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} \right) = \frac{1}{2} \begin{bmatrix} x_1 \\ x_2 \end{bmatrix}^{\!\top} \begin{bmatrix} 2 & 1 \\ 1 & 20 \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} - \begin{bmatrix} 5 \\ 3 \end{bmatrix}^{\!\top} \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} \tag{7.7}$$

with gradient

$$\nabla f\!\left( \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} \right) = \begin{bmatrix} x_1 \\ x_2 \end{bmatrix}^{\!\top} \begin{bmatrix} 2 & 1 \\ 1 & 20 \end{bmatrix} - \begin{bmatrix} 5 \\ 3 \end{bmatrix}^{\!\top} . \tag{7.8}$$

Starting at the initial location $x_0 = [-3, -1]^\top$, we iteratively apply (7.6) to obtain a sequence of estimates that converge to the minimum value (illustrated in Figure 7.3). We can see (both from the figure and by plugging $x_0$ into (7.8) with $\gamma = 0.085$) that the negative gradient at $x_0$ points north and east, leading to $x_1 = [-1.98, 1.21]^\top$. Repeating that argument gives us $x_2 = [-1.32, -0.42]^\top$, and so on.
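The iterates of Example 7.1 are easy to reproduce; a sketch using the step-size $\gamma = 0.085$ quoted in the example:

```python
import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 20.0]])
b = np.array([5.0, 3.0])

def grad(x):
    # Gradient (7.8); A is symmetric, so the row vector x^T A - b^T
    # corresponds to the column vector A x - b.
    return A @ x - b

x = np.array([-3.0, -1.0])   # initial location x0
gamma = 0.085                # step-size from the example
for i in range(3):
    x = x - gamma * grad(x)
    print(f"x{i + 1} =", np.round(x, 2))   # x1 = [-1.98, 1.21], ...
```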
7.1.1 Step-size

As mentioned earlier, choosing a good step-size is important in gradient descent; the step-size is also called the learning rate. If the step-size is too small, gradient descent can be slow. If the step-size is chosen too large, gradient descent can overshoot, fail to converge, or even diverge. We will discuss the use of momentum in the next section. It is a method that smoothes out erratic behavior of gradient updates and dampens oscillations.

Adaptive gradient methods rescale the step-size at each iteration, depending on local properties of the function. There are two simple heuristics (Toussaint, 2012), sketched in code after this list:

When the function value increases after a gradient step, the step-size was too large. Undo the step and decrease the step-size.
When the function value decreases, the step could have been larger. Try to increase the step-size.
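A minimal sketch of these two heuristics (the shrink and growth factors are arbitrary illustrative choices):

```python
def adaptive_gradient_descent(f, grad, x0, step=0.1, iters=100):
    x, fx = x0, f(x0)
    for _ in range(iters):
        x_new = x - step * grad(x)
        f_new = f(x_new)
        if f_new > fx:
            step *= 0.5          # value increased: undo step, shrink step-size
        else:
            x, fx = x_new, f_new
            step *= 1.1          # value decreased: accept and grow step-size
    return x

# Applied to the polynomial (7.1) from the start of this chapter.
l = lambda x: x**4 + 7 * x**3 + 5 * x**2 - 17 * x + 3
dl = lambda x: 4 * x**3 + 21 * x**2 + 10 * x - 17
print(adaptive_gradient_descent(l, dl, x0=-6.0))   # near -4.5
```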
where ↵ 2 [0, 1]. Sometimes we will only know the gradient approxi-
mately. In such cases, the momentum term is useful since it averages out
different noisy estimates of the gradient. One particularly useful way to
obtain an approximate gradient is by using a stochastic approximation,
which we discuss next.
where xn 2 RD are the training inputs, yn are the training targets, and ✓
are the parameters of the regression model.
Standard gradient descent, as introduced previously, is a "batch" optimization method, i.e., optimization is performed using the full training set by updating the vector of parameters according to

$$\theta_{i+1} = \theta_i - \gamma_i \sum_{n=1}^{N} \nabla L_n(\theta_i) \tag{7.15}$$

for a suitable step-size parameter $\gamma_i$. Evaluating the sum gradient may re-
quire expensive evaluations of the gradients from all individual functions
$L_n$. When the training set is enormous and/or no simple formulas exist, evaluating the sums of gradients becomes very expensive.

Consider the term $\sum_{n=1}^{N} \nabla L_n(\theta_i)$ in (7.15). We can reduce the amount of computation by taking a sum over a smaller set of $L_n$. In contrast to batch gradient descent, which uses all $L_n$ for $n = 1, \ldots, N$, we randomly choose a subset of $L_n$ for mini-batch gradient descent. In the extreme case, we randomly select only a single $L_n$ to estimate the gradient. The key insight about why taking a subset of data is sensible is to realize that for gradient descent to converge, we only require that the gradient is an unbiased estimate of the true gradient. In fact, the term $\sum_{n=1}^{N} \nabla L_n(\theta_i)$ in (7.15) is an empirical estimate of the expected value (Section 6.4.1) of the gradient. Therefore, any other unbiased empirical estimate of the expected value, for example using any subsample of the data, would suffice for convergence of gradient descent.
Remark. When the learning rate decreases at an appropriate rate, and subject to relatively mild assumptions, stochastic gradient descent converges almost surely to a local minimum (Bottou, 1998). }
Why should one consider using an approximate gradient? A major rea-
son is practical implementation constraints, such as the size of central
processing unit (CPU)/graphics processing unit (GPU) memory or limits
on computational time. We can think of the size of the subset used to esti-
mate the gradient in the same way that we thought of the size of a sample
when estimating empirical means (Section 6.4.1). Large mini-batch sizes
will provide accurate estimates of the gradient, reducing the variance in
the parameter update. Furthermore, large mini-batches take advantage of
highly optimized matrix operations in vectorized implementations of the
cost and gradient. The reduction in variance leads to more stable conver-
gence, but each gradient calculation will be more expensive.
In contrast, small mini-batches are quick to estimate. If we keep the
mini-batch size small, the noise in our gradient estimate will allow us to
get out of some bad local optima, which we may otherwise get stuck in.
In machine learning, optimization methods are used for training by min-
imizing an objective function on the training data, but the overall goal
is to improve generalization performance (Chapter 8). Since the goal in
machine learning does not necessarily need a precise estimate of the min-
imum of the objective function, approximate gradients using mini-batch
approaches have been widely used. Stochastic gradient descent is very
effective in large-scale machine learning problems (Bottou et al., 2018).
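A sketch of mini-batch stochastic gradient descent for least-squares regression (synthetic data; batch size and step-size are arbitrary illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(5)
N, D = 1000, 3
X = rng.normal(size=(N, D))
theta_true = np.array([1.0, -2.0, 0.5])
y = X @ theta_true + 0.1 * rng.normal(size=N)

theta = np.zeros(D)
step, batch_size = 0.05, 32
for _ in range(500):
    idx = rng.choice(N, size=batch_size, replace=False)  # random mini-batch
    Xb, yb = X[idx], y[idx]
    # Mini-batch gradient of the mean squared loss: an unbiased estimate
    # of the full-batch gradient, so convergence is preserved.
    g = 2.0 * Xb.T @ (Xb @ theta - yb) / batch_size
    theta -= step * g

print(theta)   # close to theta_true
```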
Figure 7.4 Illustration of constrained optimization. The unconstrained problem (indicated by the contour lines) has a minimum on the right side (indicated by the circle). The box constraints ($-1 \le x \le 1$ and $-1 \le y \le 1$) require that the optimal solution is within the box, resulting in an optimal value indicated by the star.
where f : RD ! R.
In this section, we have additional constraints. That is, for real-valued
functions gi : RD ! R for i = 1, . . . , m, we consider the constrained
optimization problem (see Figure 7.4 for an illustration)
$$\min_{x} f(x) \tag{7.17}$$
$$\text{subject to } g_i(x) \le 0 \quad \text{for all } i = 1, \ldots, m .$$
This gives infinite penalty if the constraint is not satisfied, and hence
would provide the same solution. However, this infinite step function is
equally difficult to optimize. We can overcome this difficulty by introduc-
ing Lagrange multipliers. The idea of Lagrange multipliers is to replace the step function with a linear function.

We associate to problem (7.17) the Lagrangian by introducing the Lagrange multipliers $\lambda_i \ge 0$ corresponding to each inequality constraint respectively (Boyd and Vandenberghe, 2004, chapter 4), so that

$$L(x, \lambda) = f(x) + \sum_{i=1}^{m} \lambda_i g_i(x) \tag{7.20a}$$
$$= f(x) + \lambda^\top g(x) , \tag{7.20b}$$

where in the last line we have concatenated all constraints $g_i(x)$ into a vector $g(x)$, and all the Lagrange multipliers into a vector $\lambda \in \mathbb{R}^m$.
We now introduce the idea of Lagrangian duality. In general, duality
in optimization is the idea of converting an optimization problem in one
set of variables x (called the primal variables), into another optimization
problem in a different set of variables (called the dual variables). We
introduce two different approaches to duality: In this section, we discuss
Lagrangian duality; in Section 7.3.3, we discuss Legendre–Fenchel duality.
Remark. In the discussion of Definition 7.1, we use two concepts that are
also of independent interest (Boyd and Vandenberghe, 2004).
First is the minimax inequality, which says that for any function with two arguments $\varphi(x, y)$, the maximin is less than the minimax, i.e.,

$$\max_{y} \min_{x} \varphi(x, y) \le \min_{x} \max_{y} \varphi(x, y) . \tag{7.23}$$

This inequality can be proved by considering the inequality

$$\text{for all } x, y: \quad \min_{x} \varphi(x, y) \le \max_{y} \varphi(x, y) . \tag{7.24}$$
Note that taking the maximum over y of the left-hand side of (7.24) main-
tains the inequality since the inequality is true for all y . Similarly, we can
take the minimum over x of the right-hand side of (7.24) to obtain (7.23).
The second concept is weak duality, which uses (7.23) to show that
primal values are always greater than or equal to dual values. This is de-
scribed in more detail in (7.27). }
Recall that the difference between J(x) in (7.18) and the Lagrangian
in (7.20b) is that we have relaxed the indicator function to a linear func-
tion. Therefore, when $\lambda \ge 0$, the Lagrangian $L(x, \lambda)$ is a lower bound of $J(x)$. Hence, the maximum of $L(x, \lambda)$ with respect to $\lambda$ is

$$J(x) = \max_{\lambda \ge 0} L(x, \lambda) . \tag{7.25}$$

Recall that we are interested in minimizing $J(x)$; by the minimax inequality (7.23), swapping the order of the minimum and the maximum results in a smaller value, i.e.,

$$\min_{x} \max_{\lambda \ge 0} L(x, \lambda) \ge \max_{\lambda \ge 0} \min_{x} L(x, \lambda) . \tag{7.27}$$

This is also known as weak duality. Note that the inner part of the right-hand side of (7.27) is the dual objective function $D(\lambda)$, and the definition follows.
In contrast to the original optimization problem, which has constraints, $\min_{x \in \mathbb{R}^d} L(x, \lambda)$ is an unconstrained optimization problem for a given value of $\lambda$. If solving $\min_{x \in \mathbb{R}^d} L(x, \lambda)$ is easy, then the overall problem is easy to solve. The reason is that the outer problem (maximization over $\lambda$) is a maximum over a set of affine functions, and hence is a concave function, even though $f(\cdot)$ and $g_i(\cdot)$ may be nonconvex. The maximum of a concave function can be efficiently computed.
Assuming f (·) and gi (·) are differentiable, we find the Lagrange dual
problem by differentiating the Lagrangian with respect to x, setting the
differential to zero, and solving for the optimal value. We will discuss two
concrete examples in Sections 7.3.1 and 7.3.2, where f (·) and gi (·) are
convex.
Remark (Equality Constraints). Consider (7.17) with additional equality
constraints
$$\min_{x} f(x)$$
$$\text{subject to } g_i(x) \le 0 \quad \text{for all } i = 1, \ldots, m ,$$
$$\qquad h_j(x) = 0 \quad \text{for all } j = 1, \ldots, n .$$
Figure: Example of a convex function, $y = 3x^2 - 5x + 2$.
Example 7.3
The negative entropy f (x) = x log2 x is convex for x > 0. A visualization
of the function is shown in Figure 7.8, and we can see that the function is
convex. To illustrate the previous definitions of convexity, let us check the calculations for two points $x = 2$ and $x = 4$. Note that to prove convexity of $f(x)$ we would need to check for all points $x \in \mathbb{R}$.

Recall Definition 7.3. Consider a point midway between the two points (that is, $\theta = 0.5$); then the left-hand side is $f(0.5 \cdot 2 + 0.5 \cdot 4) = 3 \log_2 3 \approx 4.75$. The right-hand side is $0.5(2 \log_2 2) + 0.5(4 \log_2 4) = 1 + 4 = 5$. Therefore the definition is satisfied.
Since $f(x)$ is differentiable, we can alternatively use (7.31). Calculating the derivative of $f(x)$, we obtain

$$\nabla_x (x \log_2 x) = 1 \cdot \log_2 x + x \cdot \frac{1}{x \log_e 2} = \log_2 x + \frac{1}{\log_e 2} . \tag{7.32}$$

Using the same two test points $x = 2$ and $x = 4$, the left-hand side of (7.31) is given by $f(4) = 8$. The right-hand side is

$$f(x) + \nabla f(x)^\top (y - x) = f(2) + \nabla f(2) \cdot (4 - 2) \tag{7.33a}$$
$$= 2 + \left(1 + \frac{1}{\log_e 2}\right) \cdot 2 \approx 6.9 . \tag{7.33b}$$
Example 7.4

A nonnegative weighted sum of convex functions is convex. Observe that if $f$ is a convex function, and $\alpha \ge 0$ is a nonnegative scalar, then the function $\alpha f$ is convex. We can see this by multiplying both sides of the inequality in Definition 7.3 by $\alpha$, and recalling that multiplying by a nonnegative number does not change the inequality.
If $f_1$ and $f_2$ are convex functions, then we have by the definition

$$f_1(\theta x + (1 - \theta) y) \le \theta f_1(x) + (1 - \theta) f_1(y) \tag{7.34}$$
$$f_2(\theta x + (1 - \theta) y) \le \theta f_2(x) + (1 - \theta) f_2(y) . \tag{7.35}$$

Summing up both sides gives us

$$f_1(\theta x + (1 - \theta) y) + f_2(\theta x + (1 - \theta) y) \le \theta f_1(x) + (1 - \theta) f_1(y) + \theta f_2(x) + (1 - \theta) f_2(y) , \tag{7.36}$$

where the right-hand side can be rearranged to

$$\theta \big( f_1(x) + f_2(x) \big) + (1 - \theta) \big( f_1(y) + f_2(y) \big) , \tag{7.37}$$

completing the proof that the sum of convex functions is convex.
Combining the preceding two facts, we see that $\alpha f_1(x) + \beta f_2(x)$ is convex for $\alpha, \beta \ge 0$. This closure property can be extended using a similar argument for nonnegative weighted sums of more than two convex functions.
Remark. The inequality in (7.30) is sometimes called Jensen’s inequality. Jensen’s inequality
In fact, a whole class of inequalities for taking nonnegative weighted sums
of convex functions are all called Jensen’s inequality. }
In summary, a constrained optimization problem is called a convex optimization problem if

$$\min_{x} f(x)$$
$$\text{subject to } g_i(x) \le 0 \quad \text{for all } i = 1, \ldots, m ,$$
$$\qquad h_j(x) = 0 \quad \text{for all } j = 1, \ldots, n ,$$

where all the functions $f(x)$ and $g_i(x)$ are convex functions, and all $h_j(x) = 0$ are convex sets.

Consider the special case when all the preceding functions are linear:

$$\min_{x \in \mathbb{R}^d} c^\top x \tag{7.39}$$
$$\text{subject to } Ax \le b ,$$

where $A \in \mathbb{R}^{m \times d}$ and $b \in \mathbb{R}^m$. This is known as a linear program. It has $d$ variables and $m$ linear constraints. Linear programs are one of the most widely used approaches in industry. The Lagrangian is given by

$$L(x, \lambda) = c^\top x + \lambda^\top (Ax - b) , \tag{7.40}$$

where $\lambda \in \mathbb{R}^m$ is the vector of non-negative Lagrange multipliers. Rearranging the terms corresponding to $x$ yields

$$L(x, \lambda) = (c + A^\top \lambda)^\top x - \lambda^\top b . \tag{7.41}$$
Taking the derivative of $L(x, \lambda)$ with respect to $x$ and setting it to zero gives us

$$c + A^\top \lambda = 0 . \tag{7.42}$$

Therefore, the dual Lagrangian is $D(\lambda) = -\lambda^\top b$. Recall we would like to maximize $D(\lambda)$. In addition to the constraint due to the derivative of $L(x, \lambda)$ being zero, we also have the fact that $\lambda \ge 0$, resulting in the following dual optimization problem (it is convention to minimize the primal and maximize the dual):

$$\max_{\lambda \in \mathbb{R}^m} -b^\top \lambda \tag{7.43}$$
$$\text{subject to } c + A^\top \lambda = 0 ,$$
$$\qquad \lambda \ge 0 .$$
This is also a linear program, but with m variables. We have the choice
of solving the primal (7.39) or the dual (7.43) program depending on
Figure 7.9 Illustration of the linear program: the unconstrained problem (indicated by the contour lines) has a minimum on the right side; the optimal value under the constraints is marked by the star.
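Linear programs of the form (7.39) can be handed to off-the-shelf solvers; a sketch using scipy.optimize.linprog on a small hypothetical instance (all numbers invented for illustration; bounds=(None, None) leaves the variables unconstrained apart from A x <= b):

```python
import numpy as np
from scipy.optimize import linprog

# Hypothetical primal LP (7.39): minimize c^T x subject to A x <= b.
c = np.array([-5.0, -3.0])
A = np.array([[ 2.0,  2.0],
              [ 2.0, -4.0],
              [-2.0,  1.0],
              [ 0.0, -1.0],
              [ 0.0,  1.0]])
b = np.array([33.0, 8.0, 5.0, -1.0, 8.0])

res = linprog(c, A_ub=A, b_ub=b, bounds=(None, None))
print(res.x, res.fun)   # optimal point and optimal primal value
# By duality, the dual (7.43) attains a matching optimal value.
```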
Note that the preceding convex conjugate definition does not need the function $f$ to be convex nor differentiable. In Definition 7.4, we have used a general inner product (Section 3.2), but in the rest of this section we will assume the standard dot product.
Example 7.8

In machine learning, we often use sums of functions; for example, the objective function of the training set includes a sum of the losses for each example in the training set. In the following, we derive the convex conjugate of a sum of losses $\ell(t)$, where $\ell: \mathbb{R} \to \mathbb{R}$. This also illustrates the application of the convex conjugate to the vector case. Let $L(t) = \sum_{i=1}^{n} \ell_i(t_i)$. Then,

$$L^*(z) = \sup_{t \in \mathbb{R}^n} \langle z, t \rangle - \sum_{i=1}^{n} \ell_i(t_i) \tag{7.63a}$$
$$= \sup_{t \in \mathbb{R}^n} \sum_{i=1}^{n} \big( z_i t_i - \ell_i(t_i) \big) \qquad \text{definition of dot product} \tag{7.63b}$$
$$= \sum_{i=1}^{n} \sup_{t_i \in \mathbb{R}} \big( z_i t_i - \ell_i(t_i) \big) \tag{7.63c}$$
$$= \sum_{i=1}^{n} \ell_i^*(z_i) . \qquad \text{definition of conjugate} \tag{7.63d}$$
Example 7.9

Let $f(y)$ and $g(x)$ be convex functions, and $A$ a real matrix of appropriate dimensions such that $Ax = y$. Then

$$\min_{x} f(Ax) + g(x) = \min_{Ax = y} f(y) + g(x) . \tag{7.64}$$

Introducing the Lagrange multiplier $u$ for the constraint $Ax = y$,

$$\min_{Ax = y} f(y) + g(x) = \max_{u} \min_{x, y} f(y) + g(x) + (Ax - y)^\top u , \tag{7.65}$$

where the last step of swapping max and min is due to the fact that $f(y)$ and $g(x)$ are convex functions. By splitting up the dot product term and collecting $x$ and $y$,

$$\max_{u} \min_{x, y} f(y) + g(x) + (Ax - y)^\top u \tag{7.66a}$$
$$= \max_{u} \left[ \min_{y} -y^\top u + f(y) \right] + \left[ \min_{x} (Ax)^\top u + g(x) \right] \tag{7.66b}$$
$$= \max_{u} \left[ \min_{y} -y^\top u + f(y) \right] + \left[ \min_{x} x^\top A^\top u + g(x) \right] . \tag{7.66c}$$

Recall the convex conjugate (Definition 7.4) and the fact that dot products are symmetric (for general inner products, $A^\top$ is replaced by the adjoint $A^*$):

$$\max_{u} \left[ \min_{y} -y^\top u + f(y) \right] + \left[ \min_{x} x^\top A^\top u + g(x) \right] \tag{7.67a}$$
$$= \max_{u} -f^*(u) - g^*(-A^\top u) . \tag{7.67b}$$
Exercises
7.1 Consider the univariate function

$$f(x) = x^3 + 6x^2 - 3x - 5 .$$
Find its stationary points and indicate whether they are maximum, mini-
mum, or saddle points.
7.2 Consider the update equation for stochastic gradient descent (Equation (7.15)).
Write down the update when we use a mini-batch size of one.
7.3 Consider whether the following statements are true or false:
a. The intersection of any two convex sets is convex.
b. The union of any two convex sets is convex.
c. The difference of a convex set A from another convex set B is convex.
7.4 Consider whether the following statements are true or false:
a. The sum of any two convex functions is convex.
b. The difference of any two convex functions is convex.
c. The product of any two convex functions is convex.
d. The maximum of any two convex functions is convex.
7.5 Express the following optimization problem as a standard linear program in matrix notation:

$$\max_{x \in \mathbb{R}^2, \, \xi \in \mathbb{R}} p^\top x + \xi$$
Derive the convex conjugate function $f^*(s)$, by assuming the standard dot product.
Hint: Take the gradient of an appropriate function and set the gradient to zero.
7.10 Consider the function

$$f(x) = \frac{1}{2} x^\top A x + b^\top x + c ,$$
where A is strictly positive definite, which means that it is invertible. Derive
the convex conjugate of f (x).
Hint: Take the gradient of an appropriate function and set the gradient to zero.
7.11 The hinge loss (which is the loss used by the support vector machine) is given by

$$L(\alpha) = \max\{0, 1 - \alpha\} ,$$
In the first part of the book, we introduced the mathematics that form
the foundations of many machine learning methods. The hope is that a
reader would be able to learn the rudimentary forms of the language of
mathematics from the first part, which we will now use to describe and
discuss machine learning. The second part of the book introduces four
pillars of machine learning:
Regression (Chapter 9)
Dimensionality reduction (Chapter 10)
Density estimation (Chapter 11)
Classification (Chapter 12)
The main aim of this part of the book is to illustrate how the mathematical
concepts introduced in the first part of the book can be used to design
machine learning algorithms that can be used to solve tasks within the
remit of the four pillars. We do not intend to introduce advanced machine
learning concepts, but instead to provide a set of practical methods that
allow the reader to apply the knowledge they gained from the first part
of the book. It also provides a gateway to the wider machine learning
literature for readers already familiar with the mathematics.
8 When Models Meet Data
Table 8.1 Example data from a fictitious human resource database that is not in a numerical format.

Name       Gender  Degree  Postcode  Age  Annual salary
Aditya     M       MSc     W21BG     36   89563
Bob        M       PhD     EC1A1BA   47   123543
Chloé      F       BEcon   SW1A1BH   26   23989
Daisuke    M       BSc     SE207AT   68   138769
Elisabeth  F       MBA     SE10AA    33   113888
used to talk about machine learning models. By doing so, we briefly out-
line the current best practices for training a model such that the resulting
predictor does well on data that we have not yet seen.
As mentioned in Chapter 1, there are two different senses in which we
use the phrase “machine learning algorithm”: training and prediction. We
will describe these ideas in this chapter, as well as the idea of selecting
among different models. We will introduce the framework of empirical
risk minimization in Section 8.2, the principle of maximum likelihood in
Section 8.3, and the idea of probabilistic models in Section 8.4. We briefly
outline a graphical language for specifying probabilistic models in Sec-
tion 8.5 and finally discuss model selection in Section 8.6. The rest of this
section expands upon the three main components of machine learning:
data, models and learning.
1. Prediction or inference
2. Training or parameter estimation
3. Hyperparameter tuning or model selection
the model to only fit the training data well, the predictor needs to perform well on unseen data. We simulate the behavior of our predictor on future unseen data using cross-validation (Section 8.2.4). As we will see in this chapter, to achieve the goal of performing well on unseen data, we will need to balance between fitting well on training data and finding "simple" explanations of the phenomenon. This trade-off is achieved using regularization (Section 8.2.3) or by adding a prior (Section 8.3.2). In philosophy, this is considered to be neither induction nor deduction, but is called abduction. According to the Stanford Encyclopedia of Philosophy, abduction is the process of inference to the best explanation (Douven, 2017). (A good movie title is "AI abduction".)

We often need to make high-level modeling decisions about the structure of the predictor, such as the number of components to use or the class of probability distributions to consider. The choice of the number of components is an example of a hyperparameter, and this choice can affect the performance of the model significantly. The problem of choosing among different models is called model selection, which we describe in Section 8.6. For non-probabilistic models, model selection is often done using nested cross-validation, which is described in Section 8.6.1. We also use model selection to choose hyperparameters of our model.
Remark. The distinction between parameters and hyperparameters is some-
what arbitrary, and is mostly driven by the distinction between what can
be numerically optimized versus what needs to use search techniques.
Another way to consider the distinction is to consider parameters as the
explicit parameters of a probabilistic model, and to consider hyperparam-
eters (higher-level parameters) as parameters that control the distribution
of these explicit parameters. }
In the following sections, we will look at three flavors of machine learn-
ing: empirical risk minimization (Section 8.2), the principle of maximum
likelihood (Section 8.3), and probabilistic modeling (Section 8.4).
Section 8.2.1 What is the set of functions we allow the predictor to take?
Section 8.2.2 How do we measure how well the predictor performs on
the training data?
Section 8.2.3 How do we construct predictors from only training data that perform well on unseen test data?
Section 8.2.4 What is the procedure for searching over the space of mod-
els?
Example 8.1
We introduce the problem of ordinary least-squares regression to illustrate
empirical risk minimization. A more comprehensive account of regression
is given in Chapter 9. When the label yn is real-valued, a popular choice
of function class for predictors is the set of affine functions. (Affine functions are often referred to as linear functions in machine learning.) We choose a more compact notation for an affine function by concatenating an additional unit feature $x^{(0)} = 1$ to $x_n$, i.e., $x_n = [1, x_n^{(1)}, x_n^{(2)}, \ldots, x_n^{(D)}]^\top$. The parameter vector is correspondingly $\theta = [\theta_0, \theta_1, \theta_2, \ldots, \theta_D]^\top$, allowing us to write the predictor as a linear function

$$f(x_n, \theta) = \theta^\top x_n . \tag{8.4}$$

This linear predictor is equivalent to the affine model

$$f(x_n, \theta) = \theta_0 + \sum_{d=1}^{D} \theta_d x_n^{(d)} . \tag{8.5}$$
The empirical risk is the average loss over the training data,

$$R_{\text{emp}}(f, X, y) = \frac{1}{N} \sum_{n=1}^{N} \ell(y_n, \hat{y}_n) , \tag{8.6}$$

where $\hat{y}_n = f(x_n, \theta)$. Equation (8.6) is called the empirical risk and depends on three arguments: the predictor $f$ and the data $X, y$. This general strategy for learning is called empirical risk minimization.
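A minimal empirical risk minimization sketch for the affine predictor (8.4) with the squared loss on synthetic data; for this loss the minimizer is the least-squares solution:

```python
import numpy as np

rng = np.random.default_rng(6)
N = 200
x = rng.uniform(0, 10, size=N)
y = 3.0 + 2.0 * x + rng.normal(scale=1.0, size=N)  # synthetic labels

# Unit feature turns the affine model into theta^T x_n as in (8.4).
X = np.column_stack([np.ones(N), x])

def empirical_risk(theta):
    """Average squared loss over the training set, cf. (8.6)."""
    return np.mean((y - X @ theta) ** 2)

theta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)  # minimizes the risk
print(theta_hat, empirical_risk(theta_hat))
```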
where $y$ is the label and $f(x)$ is the prediction based on the example $x$. The notation $R_{\text{true}}(f)$ indicates that this is the true risk if we had access to an infinite amount of data. The expectation is over the (infinite) set of all possible data and labels. Another phrase commonly used for expected risk is "population risk". There are two practical questions that arise from our desire to minimize expected risk, which we address in the following two subsections:
The regularization term is sometimes called the penalty term, which bi-
ases the vector ✓ to be closer to the origin. The idea of regularization also
appears in probabilistic models as the prior probability of the parameters.
Recall from Section 6.6 that for the posterior distribution to be of the same
form as the prior distribution, the prior and the likelihood need to be con-
jugate. We will revisit this idea in Section 8.3.2. We will see in Chapter 12
that the idea of the regularizer is equivalent to the idea of a large margin.
$$\mathbb{E}_{\mathcal{V}}[R(f, \mathcal{V})] \approx \frac{1}{K} \sum_{k=1}^{K} R\big(f^{(k)}, \mathcal{V}^{(k)}\big) , \tag{8.13}$$
where R(f (k) , V (k) ) is the risk (e.g., RMSE) on the validation set V (k) for
predictor f (k) . The approximation has two sources: first, due to the finite
training set, which results in not the best possible f (k) ; and second, due to
the finite validation set, which results in an inaccurate estimation of the
risk R(f (k) , V (k) ). A potential disadvantage of K -fold cross-validation is
the computational cost of training the model K times, which can be bur-
densome if the training cost is computationally expensive. In practice, it
is often not sufficient to look at the direct parameters alone. For example,
we need to explore multiple complexity parameters (e.g., multiple regu-
larization parameters), which may not be direct parameters of the model.
Evaluating the quality of the model, depending on these hyperparameters,
may result in a number of training runs that is exponential in the number
of model parameters. One can use nested cross-validation (Section 8.6.1)
to search for good hyperparameters.
However, cross-validation is an embarrassingly parallel problem, i.e., little effort is needed to separate the problem into a number of parallel
tasks. Given sufficient computing resources (e.g., cloud computing, server
farms), cross-validation does not require longer than a single performance
assessment.
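A sketch of the K-fold approximation (8.13), using K = 5, RMSE as the risk, and the least-squares predictor from the earlier example:

```python
import numpy as np

rng = np.random.default_rng(7)
N, K = 200, 5
x = rng.uniform(0, 10, size=N)
y = 3.0 + 2.0 * x + rng.normal(scale=1.0, size=N)
X = np.column_stack([np.ones(N), x])

folds = np.array_split(rng.permutation(N), K)   # K disjoint validation sets

risks = []
for k in range(K):
    val = folds[k]                                            # V^(k)
    train = np.concatenate([folds[j] for j in range(K) if j != k])
    theta, *_ = np.linalg.lstsq(X[train], y[train], rcond=None)
    rmse = np.sqrt(np.mean((y[val] - X[val] @ theta) ** 2))   # R(f^(k), V^(k))
    risks.append(rmse)

print(np.mean(risks))   # the K-fold estimate of the expected risk (8.13)
```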
In this section, we saw that empirical risk minimization is based on the
following concepts: the hypothesis class of functions, the loss function and
regularization. In Section 8.3, we will see the effect of using a probability
distribution to replace the idea of loss functions and regularization.
The notation Lx (✓) emphasizes the fact that the parameter ✓ is varying
and the data x is fixed. We very often drop the reference to x when writing
the negative log-likelihood, as it is really a function of ✓ , and write it as
L(✓) when the random variable representing the uncertainty in the data
is clear from the context.
Let us interpret what the probability density p(x | ✓) is modeling for a
fixed value of ✓ . It is a distribution that models the uncertainty of the data.
In other words, once we have chosen the type of function we want as a
predictor, the likelihood provides the probability of observing data x.
In a complementary view, if we consider the data to be fixed (because
it has been observed), and we vary the parameters ✓ , what does L(✓) tell
us? It tells us how likely a particular setting of ✓ is for the observations x.
Based on this second view, the maximum likelihood estimator gives us the
most likely parameter ✓ for the set of data.
We consider the supervised learning setting, where we obtain pairs
(x1 , y1 ), . . . , (xN , yN ) with xn 2 RD and labels yn 2 R. We are inter-
ested in constructing a predictor that takes a feature vector xn as input
and produces a prediction yn (or something close to it), i.e., given a vec-
tor xn we want the probability distribution of the label yn . In other words,
we specify the conditional probability distribution of the labels given the
examples for the particular parameter setting ✓ .
Example 8.4
The first example that is often used is to specify that the conditional
probability of the labels given the examples is a Gaussian distribution. In
other words, we assume that we can explain our observation uncertainty
by independent Gaussian noise (refer to Section 6.5) with zero mean,
"n ⇠ N 0, 2 . We further assume that the linear model x> n ✓ is used for
prediction. This means we specify a Gaussian likelihood for each example
label pair (xn , yn ),
p(yn | xn , ✓) = N yn | x>
n ✓,
2
. (8.15)
An illustration of a Gaussian likelihood for a given parameter ✓ is shown
in Figure 8.3. We will see in Section 9.2 how to explicitly expand the
preceding expression out in terms of the Gaussian distribution.
We assume that the set of examples $(x_1, y_1), \ldots, (x_N, y_N)$ is independent and identically distributed (i.i.d.). The word "independent" (Section 6.4.5) implies that the likelihood of the whole dataset ($\mathcal{Y} = \{y_1, \ldots, y_N\}$ and $\mathcal{X} = \{x_1, \ldots, x_N\}$) factorizes into a product of the likelihoods of each individual example,

$$p(\mathcal{Y} \mid \mathcal{X}, \theta) = \prod_{n=1}^{N} p(y_n \mid x_n, \theta) , \tag{8.16}$$
While it is tempting to interpret the fact that $\theta$ is on the right of the conditioning in $p(y_n \mid x_n, \theta)$ (8.15), and hence should be interpreted as observed and fixed, this interpretation is incorrect. The negative log-likelihood $L(\theta)$ is a function of $\theta$. Therefore, to find a good parameter vector $\theta$ that explains the data $(x_1, y_1), \ldots, (x_N, y_N)$ well, we minimize the negative log-likelihood $L(\theta)$ with respect to $\theta$.
Remark. The negative sign in (8.17) is a historical artifact that is due
to the convention that we want to maximize likelihood, but numerical
optimization literature tends to study minimization of functions. }
Example 8.5

Continuing on our example of Gaussian likelihoods (8.15), the negative log-likelihood can be rewritten as

$$L(\theta) = -\sum_{n=1}^{N} \log p(y_n \mid x_n, \theta) = -\sum_{n=1}^{N} \log \mathcal{N}\big(y_n \mid x_n^\top \theta, \sigma^2\big) \tag{8.18a}$$
$$= -\sum_{n=1}^{N} \log \frac{1}{\sqrt{2\pi\sigma^2}} \exp\!\left( -\frac{(y_n - x_n^\top \theta)^2}{2\sigma^2} \right) \tag{8.18b}$$
$$= -\sum_{n=1}^{N} \log \exp\!\left( -\frac{(y_n - x_n^\top \theta)^2}{2\sigma^2} \right) - \sum_{n=1}^{N} \log \frac{1}{\sqrt{2\pi\sigma^2}} \tag{8.18c}$$
$$= \frac{1}{2\sigma^2} \sum_{n=1}^{N} (y_n - x_n^\top \theta)^2 - \sum_{n=1}^{N} \log \frac{1}{\sqrt{2\pi\sigma^2}} . \tag{8.18d}$$
Figure 8.6 Comparing the predictions with the maximum likelihood estimate and the MAP estimate at $x = 60$. The prior biases the slope to be less steep and the intercept to be closer to zero. In this example, the bias that moves the intercept closer to zero actually increases the slope.
$$p(\theta \mid x) = \frac{p(x \mid \theta)\, p(\theta)}{p(x)} . \tag{8.19}$$
Recall that we are interested in finding the parameter ✓ that maximizes
the posterior. Since the distribution p(x) does not depend on ✓ , we can
ignore the value of the denominator for the optimization and obtain
p(✓ | x) / p(x | ✓)p(✓) . (8.20)
The preceding proportion relation hides the density of the data $p(x)$, which may be difficult to estimate. Instead of estimating the minimum of the negative log-likelihood, we now estimate the minimum of the negative log-posterior, which is referred to as maximum a posteriori estimation (MAP estimation). An illustration of the effect of adding a zero-mean Gaussian prior is shown in Figure 8.6.
Example 8.6
In addition to the assumption of Gaussian likelihood in the previous
example, we assume that the parameter vector is distributed as a multivariate
Gaussian with zero mean, i.e., p(θ) = N(0, Σ), where Σ is the covariance
matrix (Section 6.5). Note that the conjugate prior of a Gaussian
is also a Gaussian (Section 6.6.1), and therefore we expect the posterior
distribution to also be a Gaussian. We will see the details of maximum a
posteriori estimation in Chapter 9.
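To make the effect of the Gaussian prior concrete, here is a minimal sketch
(our own construction, anticipating the closed form derived in Chapter 9):
with likelihood N(y_n | x_n^T θ, σ²) and prior N(0, τ²I), minimizing the
negative log-posterior is a regularized least-squares problem. The data and
the values of σ and τ below are assumptions for illustration.

import numpy as np

rng = np.random.default_rng(1)
N, D = 20, 2
X = rng.normal(size=(N, D))
theta_true = np.array([1.5, -0.5])
sigma, tau = 0.5, 1.0                 # assumed noise and prior scales
y = X @ theta_true + sigma * rng.normal(size=N)

# MAP estimate under p(theta) = N(0, tau^2 I): the negative log-prior adds
# the quadratic penalty ||theta||^2 / (2 tau^2) to (8.18d), which yields the
# ridge-regression normal equations below.
lam = sigma**2 / tau**2
theta_map = np.linalg.solve(X.T @ X + lam * np.eye(D), X.T @ y)
theta_ml, *_ = np.linalg.lstsq(X, y, rcond=None)
print("ML: ", theta_ml)
print("MAP:", theta_map)   # shrunk toward zero by the prior, cf. Figure 8.6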
[Figure: fitting of different model classes to a regression dataset (three
panels).]
Such a model class is the one we would want to work with, since it has good
generalization properties.
In practice, we often define very rich model classes M_θ with many
parameters, such as deep neural networks. To mitigate the problem of
overfitting, we can use regularization (Section 8.2.3) or priors
(Section 8.3.2). We will discuss how to choose the model class in Section 8.6.
The resulting predictions no longer depend on the model parameters θ, which
have been marginalized/integrated out. Equation (8.23) reveals that the
prediction is an average over all plausible parameter values θ, where the
plausibility is encapsulated by the parameter distribution p(θ).
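Since the average in (8.23) is rarely available in closed form, a common
approach is to approximate it with samples from the parameter distribution.
The sketch below is our own illustration; the Gaussian over the parameters
of a linear model stands in for a posterior p(θ | data) and its moments are
made-up values.

import numpy as np

rng = np.random.default_rng(2)

# Assumed (made-up) Gaussian distribution over the parameters of a linear
# model y = x^T theta + noise; in practice this would be the posterior.
theta_mean = np.array([1.5, -0.5])
theta_cov = 0.1 * np.eye(2)

x_star = np.array([0.8, -1.2])   # test input
S = 10_000                       # number of Monte Carlo samples

# Monte Carlo version of (8.23): average the prediction over plausible
# parameter values theta^(s) ~ p(theta).
theta_samples = rng.multivariate_normal(theta_mean, theta_cov, size=S)
pred_samples = theta_samples @ x_star
print("predictive mean      ~", pred_samples.mean())
print("parameter-driven std ~", pred_samples.std())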
Having discussed parameter estimation in Section 8.3 and Bayesian in-
ference here, let us compare these two approaches to learning. Parameter
estimation via maximum likelihood or MAP estimation yields a consistent
point estimate θ* of the parameters, and the key computational problem
to be solved is optimization. In contrast, Bayesian inference yields a (pos-
terior) distribution, and the key computational problem to be solved is
integration. Predictions with point estimates are straightforward, whereas
predictions in the Bayesian framework require solving another integration
problem; see (8.23). However, Bayesian inference gives us a principled
way to incorporate prior knowledge, account for side information, and
incorporate structural knowledge, none of which is easily done in the
context of parameter estimation. Moreover, the propagation of parameter
uncertainty to the prediction can be valuable in decision-making systems
for risk assessment and exploration in the context of data-efficient learn-
ing (Deisenroth et al., 2015; Kamthe and Deisenroth, 2018).
While Bayesian inference is a mathematically principled framework for
learning about parameters and making predictions, there are some prac-
tical challenges that come with it because of the integration problems we
need to solve; see (8.22) and (8.23). More specifically, if we do not choose
a conjugate prior on the parameters (Section 6.6.1), the integrals in (8.22)
and (8.23) are not analytically tractable, and we cannot compute the
posterior in closed form.
While latent variables may make the model structure and the generative
process easier to describe, learning in latent-variable models is generally
hard, as we will see in Chapter 11.
Since latent-variable models also allow us to define the process that
generates data from parameters, let us have a look at this generative pro-
cess. Denoting the data by x, the model parameters by θ, and the latent
variables by z, we obtain the conditional distribution

p(x | θ, z)   (8.24)
that allows us to generate data for any model parameters and latent vari-
ables. Given that z are latent variables, we place a prior p(z) on them.
Like the models we discussed previously, models with latent variables
can be used for parameter learning and inference within the frameworks
we discussed in Sections 8.3 and 8.4.2. To facilitate learning (e.g., by
means of maximum likelihood estimation or Bayesian inference), we fol-
low a two-step procedure. First, we compute the likelihood p(x | ✓) of the
model, which does not depend on the latent variables. Second, we use this
likelihood for parameter estimation or Bayesian inference, where we use
exactly the same expressions as in Sections 8.3 and 8.4.2, respectively.
Since the likelihood function p(x | θ) is the predictive distribution of the
data given the model parameters, we need to marginalize out the latent
variables so that

p(x | θ) = ∫ p(x | θ, z) p(z) dz ,   (8.25)

where p(x | z, θ) is given in (8.24) and p(z) is the prior on the latent
variables. Note that the likelihood must not depend on the latent variables
z; it is only a function of the data x and the model parameters θ.
The likelihood in (8.25) directly allows for parameter estimation via
maximum likelihood. MAP estimation is also straightforward with an
additional prior on the model parameters θ as discussed in Section 8.3.2.
Moreover, with the likelihood (8.25), Bayesian inference (Section 8.4.2)
in a latent-variable model works in the usual way: We place a prior p(θ)
on the model parameters and use Bayes’ theorem to obtain a posterior
distribution

p(θ | X) = p(X | θ) p(θ) / p(X)   (8.26)

over the model parameters given a dataset X. The posterior in (8.26) can
be used for predictions within a Bayesian inference framework; see (8.23).
One challenge we have in this latent-variable model is that the likelihood
p(X | θ) requires the marginalization of the latent variables according
to (8.25). Except when we choose a conjugate prior p(z) for p(x | z, θ),
the marginalization in (8.25) is not analytically tractable, and we need to
resort to approximations (Bishop, 2006; Paquet, 2008; Murphy, 2012;
Moustaki et al., 2015).
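When (8.25) is intractable, the simplest (though often high-variance)
approximation replaces the integral with an average over samples from the
prior p(z). The sketch below is our own toy example, not one from the book:
the latent z selects the mean of a Gaussian, so the Monte Carlo estimate can
be checked against exact marginalization (for discrete z, a sum over the
latent states replaces the integral).

import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(3)

# Toy latent-variable model (our assumption): z ~ Bernoulli(0.5) picks one of
# two means, and p(x | theta, z) = N(x | mu_z, 1) with theta = (mu_0, mu_1).
mu = np.array([-2.0, 2.0])
x = 0.5   # a single observed data point

# Monte Carlo estimate of (8.25): p(x | theta) ~ (1/S) sum_s p(x | theta, z_s).
S = 100_000
z_samples = rng.integers(0, 2, size=S)
mc_estimate = norm.pdf(x, loc=mu[z_samples], scale=1.0).mean()

# Exact marginalization for comparison (sum over the two latent states).
exact = 0.5 * norm.pdf(x, mu[0], 1.0) + 0.5 * norm.pdf(x, mu[1], 1.0)
print(mc_estimate, exact)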
Moustaki et al. (2015) and Paquet (2008) provide a good
overview of Bayesian inference in latent-variable models.
In recent years, several programming languages have been proposed
that aim to treat the variables defined in software as random variables
corresponding to probability distributions. The objective is to be able to
write complex functions of probability distributions, while under the hood
the compiler automatically takes care of the rules of Bayesian inference.
This rapidly changing field is called probabilistic programming.
Example 8.7
Consider the joint distribution
p(a, b, c) = p(c | a, b)p(b | a)p(a) (8.29)
of three random variables a, b, c. The factorization of the joint distribution
in (8.29) tells us something about the relationship between the random
variables:
c depends directly on a and b.
b depends directly on a.
a depends neither on b nor on c.
For the factorization in (8.29), we obtain the directed graphical model in
Figure 8.9(a).
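The factorization (8.29) also tells us how to simulate from the joint
distribution: sample the variables in topological order of the graph, each
conditioned on its parents (ancestral sampling). Below is a minimal sketch
under our own assumed concrete choices for p(a), p(b | a), and p(c | a, b);
the book does not specify these distributions.

import numpy as np

rng = np.random.default_rng(4)

def sample_joint():
    # Sample in topological order, following the factorization (8.29).
    a = rng.normal(0.0, 1.0)        # p(a): assumed standard normal
    b = rng.normal(2.0 * a, 1.0)    # p(b | a): assumed mean 2a
    c = rng.normal(a + b, 1.0)      # p(c | a, b): assumed mean a + b
    return a, b, c

for a, b, c in (sample_joint() for _ in range(5)):
    print(f"a={a:+.2f}  b={b:+.2f}  c={c:+.2f}")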
We now do exactly the opposite and describe how to extract the joint
distribution of a set of random variables from a given graphical model.
Example 8.8
Looking at the graphical model in Figure 8.9(b), we exploit two proper-
ties:
The joint distribution p(x1 , . . . , x5 ) we seek is the product of a set of
conditionals, one for each node in the graph. In this particular example,
we will need five conditionals.
Each conditional depends only on the parents of the corresponding
node in the graph. For example, x4 will be conditioned on x2 .
These two properties yield the desired factorization of the joint distribu-
tion
p(x1 , x2 , x3 , x4 , x5 ) = p(x1 )p(x5 )p(x2 | x5 )p(x3 | x1 , x2 )p(x4 | x2 ) . (8.30)
In general, for a directed graphical model with K nodes, the joint
distribution factorizes as

p(x) = ∏_{k=1}^{K} p(x_k | Pa_k) ,   (8.31)

where Pa_k means “the parent nodes of x_k”. Parent nodes of x_k are nodes
that have arrows pointing to x_k.
We conclude this subsection with a concrete example of the coin-flip
experiment. Consider a Bernoulli experiment (Example 6.8) where the
probability that the outcome x of this experiment is “heads” is governed by

p(x | μ) = Ber(μ) .   (8.32)

We now repeat this experiment N times and observe outcomes x_1, …, x_N
so that we obtain the joint distribution

p(x_1, …, x_N | μ) = ∏_{n=1}^{N} p(x_n | μ) .   (8.33)
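As a quick sketch (our own, not the book’s): because (8.33) factorizes, the
log-likelihood of μ is a sum over the N outcomes, and maximizing it yields
the sample mean as the maximum likelihood estimate. The snippet below checks
this numerically for simulated coin flips; the true value of μ is an
assumption for illustration.

import numpy as np

rng = np.random.default_rng(5)
mu_true = 0.7                              # assumed "heads" probability
x = rng.binomial(1, mu_true, size=1000)    # N coin-flip outcomes x_n

def log_likelihood(mu):
    # log of (8.33): sum of Bernoulli log-probabilities.
    return np.sum(x * np.log(mu) + (1 - x) * np.log(1 - mu))

mu_grid = np.linspace(0.01, 0.99, 99)
mu_ml = mu_grid[np.argmax([log_likelihood(m) for m in mu_grid])]
print(mu_ml, x.mean())   # both close to mu_true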
[Figure 8.10: graphical models for a repeated Bernoulli experiment: (a) with
x_1, …, x_N as individual nodes; (b) the same model using plate notation;
(c) with a hyperprior on μ.]
Figure 8.10(b) shows the same model using the plate notation. The plate (box)
repeats everything inside (in this case, the observations x_n) N times.
Therefore, both graphical models are equivalent, but the plate notation is
more compact. Graphical models immediately allow us to place a hyperprior on
μ. A hyperprior is a second layer of prior distributions on the parameters of
the first layer of priors. Figure 8.10(c) places a Beta(α, β) prior on the
latent variable μ. If we treat α and β as deterministic parameters, i.e., not
random variables, we omit the circles around them.
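Read generatively, Figure 8.10(c) says: first draw μ from the Beta
hyperprior, then draw each x_n given μ. The sketch below simulates this
two-stage process; the hyperparameter values α = β = 2 and the sample size
are our own choices for illustration.

import numpy as np

rng = np.random.default_rng(6)
alpha, beta, N = 2.0, 2.0, 10    # assumed hyperparameters and sample size

# Ancestral sampling of the hierarchical model in Figure 8.10(c):
mu = rng.beta(alpha, beta)           # mu ~ Beta(alpha, beta)
x = rng.binomial(1, mu, size=N)      # x_n | mu ~ Ber(mu), n = 1, ..., N
print(f"mu = {mu:.3f}, outcomes = {x}")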
To check whether a conditional independence statement

A ⊥⊥ B | C   (8.34)

holds, we consider all possible paths from any node in A to any node in B.
Any such path is blocked if it includes a node such that either of the
following holds:

The arrows on the path meet either head to tail or tail to tail at the
node, and the node is in the set C.
The arrows meet head to head at the node, and neither the node nor
any of its descendants is in the set C.
[Figure 8.11: D-separation example.]
The difficulty is that at training time we can only use the training set to
evaluate the performance of the model and learn its parameters. However, the
performance on the training set is not really what we are interested in. In
Section 8.3, we have seen that maximum likelihood estimation can lead
to overfitting, especially when the training dataset is small. Ideally, our
model (also) works well on the test set (which is not available at training
time). Therefore, we need some mechanisms for assessing how a model
generalizes to unseen test data. Model selection is concerned with exactly
this problem.
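A standard such mechanism is cross-validation: the training data is split
into K folds, and each fold in turn serves as a surrogate test set. The
plain-numpy sketch below is our own minimal illustration, scoring a linear
least-squares model by mean squared error; the synthetic data and K = 5 are
assumptions.

import numpy as np

rng = np.random.default_rng(7)
N, D, K = 100, 3, 5
X = rng.normal(size=(N, D))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.3 * rng.normal(size=N)

# K-fold cross-validation: hold out each fold once, train on the rest.
indices = rng.permutation(N)
folds = np.array_split(indices, K)
errors = []
for k in range(K):
    test_idx = folds[k]
    train_idx = np.concatenate([folds[j] for j in range(K) if j != k])
    theta, *_ = np.linalg.lstsq(X[train_idx], y[train_idx], rcond=None)
    residuals = y[test_idx] - X[test_idx] @ theta
    errors.append(np.mean(residuals**2))
print("estimated generalization MSE:", np.mean(errors))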
[Figure 8.14: Bayesian inference embodies Occam’s razor. The horizontal axis
describes the space of all possible datasets D. The evidence (vertical axis,
curves p(D | M1) and p(D | M2)) evaluates how well a model predicts available
data. Since p(D | Mi) needs to integrate to 1, we should choose the model
with the greatest evidence. Adapted from MacKay (2003).]
Once the model is chosen, we can evaluate the final performance on the test
set.
With a uniform prior p(M_k) = 1/K, which gives every model equal (prior)
probability, determining the MAP estimate over models amounts to picking
the model that maximizes the model evidence (marginal likelihood) (8.44).
Remark (Likelihood and Marginal Likelihood). There are some important
differences between a likelihood and a marginal likelihood (evidence):
While the likelihood is prone to overfitting, the marginal likelihood is
typically not, as the model parameters have been marginalized out (i.e., we
no longer have to fit the parameters). Furthermore, the marginal likelihood
automatically embodies a trade-off between model complexity and data fit
(Occam’s razor). ♦
The ratio of the posteriors is also called the posterior odds. The first
fraction on the right-hand side of (8.46), the prior odds, measures how much
our prior (initial) beliefs favor M1 over M2. The ratio of the marginal
likelihoods (the second fraction on the right-hand side) is called the Bayes
factor and measures how well the data D is predicted by M1 compared to M2.
Remark. The Jeffreys-Lindley paradox states that the “Bayes factor always
favors the simpler model since the probability of the data under a complex
model with a diffuse prior will be very small” (Murphy, 2012). Here, a
diffuse prior refers to a prior that does not favor specific models, i.e.,
many models are a priori plausible under this prior. ♦
If we choose a uniform prior over models, the prior odds term in (8.46)
is 1, i.e., the posterior odds is the ratio of the marginal likelihoods
(the Bayes factor)

p(D | M1) / p(D | M2) .   (8.47)

If the Bayes factor is greater than 1, we choose model M1; otherwise, we
choose model M2. In a similar way to frequentist statistics, there are
guidelines on the size of the ratio that one should consider before declaring
“significance” of the result (Jeffreys, 1961).
Remark (Computing the Marginal Likelihood). The marginal likelihood
plays an important role in model selection: We need to compute Bayes
factors (8.46) and posterior distributions over models (8.43).
Unfortunately, computing the marginal likelihood requires us to solve
an integral (8.44). This integration is generally analytically intractable,
and we will have to resort to approximation techniques, e.g., numerical
integration (Stoer and Bulirsch, 2002), stochastic approximations using
Monte Carlo (Murphy, 2012), or Bayesian Monte Carlo techniques (O’Hagan,
1991; Rasmussen and Ghahramani, 2003).
However, there are special cases in which we can solve it. In Section 6.6.1,
we discussed conjugate models. If we choose a conjugate parameter prior
p(θ), we can compute the marginal likelihood in closed form. In Chapter 9,
we will do exactly this in the context of linear regression. ♦
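As a concrete instance of such a closed form (our own example; the book’s
worked case in Chapter 9 is linear regression): for a sequence of coin flips
with a conjugate Beta(α, β) prior on μ, the marginal likelihood of h heads
and t tails is B(α + h, β + t)/B(α, β), where B is the Beta function. The
sketch below uses this to compute a Bayes factor between two hypothetical
priors.

import numpy as np
from scipy.special import betaln

def log_marginal_likelihood(h, t, alpha, beta):
    # Closed-form evidence of a Beta-Bernoulli model for a sequence with
    # h heads and t tails: B(alpha + h, beta + t) / B(alpha, beta).
    return betaln(alpha + h, beta + t) - betaln(alpha, beta)

h, t = 8, 2   # hypothetical observed coin flips: 8 heads, 2 tails

# Two hypothetical models: M1 expects a heads-biased coin, M2 is agnostic.
log_ev1 = log_marginal_likelihood(h, t, alpha=5.0, beta=1.0)
log_ev2 = log_marginal_likelihood(h, t, alpha=1.0, beta=1.0)
bayes_factor = np.exp(log_ev1 - log_ev2)
print("Bayes factor p(D|M1)/p(D|M2) =", bayes_factor)   # > 1 favors M1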
We have seen a brief introduction to the basic concepts of machine
learning in this chapter. For the rest of this part of the book we will see
how the three different flavors of learning in Sections 8.2, 8.3, and 8.4 are
applied to the four pillars of machine learning (regression, dimensionality
reduction, density estimation, and classification).
Linear Regression
[Figure 9.1: (a) Regression problem: observed noisy function values from
which we wish to infer the underlying function that generated the data.
(b) Regression solution: a possible function that could have generated the
data (blue), with an indication of the measurement noise of the function
value at the corresponding inputs (orange distributions).]
p(y | x) = N(y | f(x), σ²) .   (9.1)

y = f(x) + ε ,   (9.2)

p(y | x, θ) = N(y | x^T θ, σ²)   (9.3)
⟺ y = x^T θ + ε ,  ε ∼ N(0, σ²) ,   (9.4)
[Figure: (a) example functions that fall into this category.]
Linear regression refers to models that are “linear in the parameters”, i.e.,
models that describe a function by a linear combination of input features.
Here, a “feature” is a representation φ(x) of the inputs x.
In the following, we will discuss in more detail how to find good parameters
θ and how to evaluate whether a parameter set “works well”. For the time
being, we assume that the noise variance σ² is known.
p(y_* | x_*, θ*) = N(y_* | x_*^T θ*, σ²) .   (9.6)
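To make (9.6) concrete, the following sketch (with our own toy numbers)
evaluates the predictive density of a test label under a fitted parameter θ*;
here θ* and σ are assumed to be given, e.g., from a maximum likelihood
computation like the one sketched in Section 8.3.

import numpy as np
from scipy.stats import norm

theta_star = np.array([1.5, -0.5])   # assumed fitted parameter, e.g., theta_ML
sigma = 0.5                          # assumed known noise standard deviation
x_star = np.array([0.8, -1.2])       # test input

# Predictive distribution (9.6): y_* | x_*, theta* ~ N(x_*^T theta*, sigma^2).
mean_star = x_star @ theta_star
print("predictive mean:", mean_star)
print("density at y_* = 2.0:", norm.pdf(2.0, loc=mean_star, scale=sigma))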