
3

Analytic Geometry

In Chapter 2, we studied vectors, vector spaces, and linear mappings at


a general but abstract level. In this chapter, we will add some geomet-
ric interpretation and intuition to all of these concepts. In particular, we
will look at geometric vectors and compute their lengths and distances
or angles between two vectors. To be able to do this, we equip the vec-
tor space with an inner product that induces the geometry of the vector
space. Inner products and their corresponding norms and metrics capture
the intuitive notions of similarity and distances, which we use to develop
the support vector machine in Chapter 12. We will then use the concepts
of lengths and angles between vectors to discuss orthogonal projections,
which will play a central role when we discuss principal component anal-
ysis in Chapter 10 and regression via maximum likelihood estimation in
Chapter 9. Figure 3.1 gives an overview of how concepts in this chapter
are related and how they are connected to other chapters of the book.

[Figure 3.1: A mind map of the concepts introduced in this chapter, along with when they are used in other parts of the book. The inner product induces a norm; norms, lengths, angles, orthogonal projections, and rotations connect to Chapter 12 (Classification), Chapter 9 (Regression), Chapter 4 (Matrix decomposition), and Chapter 10 (Dimensionality reduction).]
This material is published by Cambridge University Press as Mathematics for Machine Learning by
Marc Peter Deisenroth, A. Aldo Faisal, and Cheng Soon Ong (2020). This version is free to view
and download for personal use only. Not for re-distribution, re-sale, or use in derivative works.
©by M. P. Deisenroth, A. A. Faisal, and C. S. Ong, 2021. https://mml-book.com.

[Figure 3.3: For different norms, the red lines indicate the set of vectors with norm 1, i.e., ‖x‖₁ = 1 (left) and ‖x‖₂ = 1 (right). Left: Manhattan norm; right: Euclidean distance.]

3.1 Norms
When we think of geometric vectors, i.e., directed line segments that start
at the origin, then intuitively the length of a vector is the distance of the
“end” of this directed line segment from the origin. In the following, we
will discuss the notion of the length of vectors using the concept of a norm.

Definition 3.1 (Norm). A norm on a vector space V is a function

\|\cdot\| : V \to \mathbb{R} ,    (3.1)
x \mapsto \|x\| ,    (3.2)

which assigns each vector x its length ‖x‖ ∈ ℝ, such that for all λ ∈ ℝ and x, y ∈ V the following hold:

Absolutely homogeneous: ‖λx‖ = |λ| ‖x‖
Triangle inequality: ‖x + y‖ ≤ ‖x‖ + ‖y‖
Positive definite: ‖x‖ ≥ 0 and ‖x‖ = 0 ⟺ x = 0

[Figure 3.2: Triangle inequality, c ≤ a + b.]

In geometric terms, the triangle inequality states that for any triangle, the sum of the lengths of any two sides must be greater than or equal to the length of the remaining side; see Figure 3.2 for an illustration.
Definition 3.1 is in terms of a general vector space V (Section 2.4), but in this book we will only consider a finite-dimensional vector space ℝ^n. Recall that for a vector x ∈ ℝ^n we denote the elements of the vector using a subscript, that is, x_i is the ith element of the vector x.

Example 3.1 (Manhattan Norm)
The Manhattan norm on ℝ^n is defined for x ∈ ℝ^n as

\|x\|_1 := \sum_{i=1}^{n} |x_i| ,    (3.3)

where |·| is the absolute value. The left panel of Figure 3.3 shows all vectors x ∈ ℝ^2 with ‖x‖₁ = 1. The Manhattan norm is also called ℓ₁ norm.


Example 3.2 (Euclidean Norm)
The Euclidean norm of x ∈ ℝ^n is defined as

\|x\|_2 := \sqrt{\sum_{i=1}^{n} x_i^2} = \sqrt{x^\top x}    (3.4)

and computes the Euclidean distance of x from the origin. The right panel of Figure 3.3 shows all vectors x ∈ ℝ^2 with ‖x‖₂ = 1. The Euclidean norm is also called ℓ₂ norm.

Remark. Throughout this book, we will use the Euclidean norm (3.4) by
default if not stated otherwise. ♦
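As a quick numerical companion (a sketch of ours, not part of the book; NumPy and the particular vectors are assumptions), the two norms and the norm axioms of Definition 3.1 can be checked directly:

```python
import numpy as np

x = np.array([1.0, -2.0, 3.0])  # arbitrary example vector

# Manhattan (l1) norm, (3.3), and Euclidean (l2) norm, (3.4)
l1 = np.sum(np.abs(x))          # same as np.linalg.norm(x, ord=1)
l2 = np.sqrt(x @ x)             # same as np.linalg.norm(x)
print(l1, l2)                   # 6.0  3.7416...

# sanity-check the norm axioms from Definition 3.1 for these vectors
lam = -2.5
y = np.array([0.5, 4.0, -1.0])
assert np.isclose(np.linalg.norm(lam * x), abs(lam) * np.linalg.norm(x))
assert np.linalg.norm(x + y) <= np.linalg.norm(x) + np.linalg.norm(y)
assert np.linalg.norm(x) >= 0
```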

3.2 Inner Products


Inner products allow for the introduction of intuitive geometrical con-
cepts, such as the length of a vector and the angle or distance between
two vectors. A major purpose of inner products is to determine whether
vectors are orthogonal to each other.

3.2.1 Dot Product


We may already be familiar with a particular type of inner product, the scalar product/dot product in ℝ^n, which is given by

x^\top y = \sum_{i=1}^{n} x_i y_i .    (3.5)

We will refer to this particular inner product as the dot product in this book. However, inner products are more general concepts with specific properties, which we will now introduce.

3.2.2 General Inner Products


Recall the linear mapping from Section 2.7, where we can rearrange the mapping with respect to addition and multiplication with a scalar. A bilinear mapping Ω is a mapping with two arguments, and it is linear in each argument, i.e., when we look at a vector space V, then it holds for all x, y, z ∈ V and λ, ψ ∈ ℝ that

\Omega(\lambda x + \psi y, z) = \lambda\Omega(x, z) + \psi\Omega(y, z)    (3.6)
\Omega(x, \lambda y + \psi z) = \lambda\Omega(x, y) + \psi\Omega(x, z) .    (3.7)

Here, (3.6) asserts that Ω is linear in the first argument, and (3.7) asserts that Ω is linear in the second argument (see also (2.87)).


Definition 3.2. Let V be a vector space and Ω : V × V → ℝ be a bilinear mapping that takes two vectors and maps them onto a real number. Then

Ω is called symmetric if Ω(x, y) = Ω(y, x) for all x, y ∈ V, i.e., the order of the arguments does not matter.
Ω is called positive definite if

\forall x \in V \setminus \{0\} : \Omega(x, x) > 0 , \quad \Omega(0, 0) = 0 .    (3.8)

Definition 3.3. Let V be a vector space and Ω : V × V → ℝ be a bilinear mapping that takes two vectors and maps them onto a real number. Then

A positive definite, symmetric bilinear mapping Ω : V × V → ℝ is called an inner product on V. We typically write ⟨x, y⟩ instead of Ω(x, y).
The pair (V, ⟨·, ·⟩) is called an inner product space or (real) vector space with inner product. If we use the dot product defined in (3.5), we call (V, ⟨·, ·⟩) a Euclidean vector space.

We will refer to these spaces as inner product spaces in this book.

Example 3.3 (Inner Product That Is Not the Dot Product)
Consider V = ℝ^2. If we define

\langle x, y\rangle := x_1 y_1 - (x_1 y_2 + x_2 y_1) + 2 x_2 y_2 ,    (3.9)

then ⟨·, ·⟩ is an inner product but different from the dot product. The proof will be an exercise.
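Anticipating Section 3.2.3, the inner product (3.9) can also be written as ⟨x, y⟩ = x^⊤Ay with A = [[1, −1], [−1, 2]]. The following NumPy sketch (ours, not from the book) checks symmetry and positive definiteness of this A numerically; the example vectors are arbitrary:

```python
import numpy as np

A = np.array([[1.0, -1.0],
              [-1.0, 2.0]])      # matrix behind the inner product (3.9)

def inner(x, y):
    # <x, y> = x1*y1 - (x1*y2 + x2*y1) + 2*x2*y2, written as x^T A y
    return x @ A @ y

x = np.array([1.0, 2.0])
y = np.array([3.0, -1.0])
print(inner(x, y), inner(y, x))   # both -6.0: the mapping is symmetric

# positive definiteness: all eigenvalues of the symmetric matrix A are > 0
print(np.linalg.eigvalsh(A))      # approx [0.38, 2.62], both positive
```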

3.2.3 Symmetric, Positive Definite Matrices


Symmetric, positive definite matrices play an important role in machine
learning, and they are defined via the inner product. In Section 4.3, we
will return to symmetric, positive definite matrices in the context of matrix
decompositions. The idea of symmetric positive semidefinite matrices is
key in the definition of kernels (Section 12.4).
Consider an n-dimensional vector space V with an inner product ⟨·, ·⟩ : V × V → ℝ (see Definition 3.3) and an ordered basis B = (b₁, ..., bₙ) of V. Recall from Section 2.6.1 that any vectors x, y ∈ V can be written as linear combinations of the basis vectors so that x = \sum_{i=1}^{n} \psi_i b_i ∈ V and y = \sum_{j=1}^{n} \lambda_j b_j ∈ V for suitable ψ_i, λ_j ∈ ℝ. Due to the bilinearity of the inner product, it holds for all x, y ∈ V that

\langle x, y\rangle = \Big\langle \sum_{i=1}^{n} \psi_i b_i, \sum_{j=1}^{n} \lambda_j b_j \Big\rangle = \sum_{i=1}^{n}\sum_{j=1}^{n} \psi_i \langle b_i, b_j\rangle \lambda_j = \hat{x}^\top A \hat{y} ,    (3.10)

where A_{ij} := ⟨b_i, b_j⟩ and x̂, ŷ are the coordinates of x and y with respect to the basis B. This implies that the inner product ⟨·, ·⟩ is uniquely determined through A.


The symmetry of the inner product also means that A is symmetric. Furthermore, the positive definiteness of the inner product implies that

\forall x \in V \setminus \{0\} : x^\top A x > 0 .    (3.11)

Definition 3.4 (Symmetric, Positive Definite Matrix). A symmetric matrix A ∈ ℝ^{n×n} that satisfies (3.11) is called symmetric, positive definite, or just positive definite. If only ≥ holds in (3.11), then A is called symmetric, positive semidefinite.

Example 3.4 (Symmetric, Positive Definite Matrices)
Consider the matrices

A_1 = \begin{bmatrix} 9 & 6 \\ 6 & 5 \end{bmatrix}, \quad A_2 = \begin{bmatrix} 9 & 6 \\ 6 & 3 \end{bmatrix}.    (3.12)

A₁ is positive definite because it is symmetric and

x^\top A_1 x = \begin{bmatrix} x_1 & x_2 \end{bmatrix} \begin{bmatrix} 9 & 6 \\ 6 & 5 \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \end{bmatrix}    (3.13a)
= 9x_1^2 + 12x_1x_2 + 5x_2^2 = (3x_1 + 2x_2)^2 + x_2^2 > 0    (3.13b)

for all x ∈ V \ {0}. In contrast, A₂ is symmetric but not positive definite because x^⊤A₂x = 9x₁² + 12x₁x₂ + 3x₂² = (3x₁ + 2x₂)² − x₂² can be less than 0, e.g., for x = [2, −3]^⊤.
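A convenient numerical cross-check (a sketch of ours; eigenvalues are only treated in Chapter 4, and NumPy is an assumption) uses the fact that a symmetric matrix is positive definite exactly when all its eigenvalues are positive:

```python
import numpy as np

A1 = np.array([[9.0, 6.0], [6.0, 5.0]])
A2 = np.array([[9.0, 6.0], [6.0, 3.0]])

# for a symmetric matrix: positive definite <=> all eigenvalues > 0
print(np.linalg.eigvalsh(A1))   # both eigenvalues positive -> positive definite
print(np.linalg.eigvalsh(A2))   # one negative eigenvalue   -> not positive definite

# the counterexample from the text: x = [2, -3] gives x^T A2 x < 0
x = np.array([2.0, -3.0])
print(x @ A2 @ x)               # -9.0
```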

If A ∈ ℝ^{n×n} is symmetric, positive definite, then

\langle x, y\rangle = \hat{x}^\top A \hat{y}    (3.14)

defines an inner product with respect to an ordered basis B, where x̂ and ŷ are the coordinate representations of x, y ∈ V with respect to B.

Theorem 3.5. For a real-valued, finite-dimensional vector space V and an ordered basis B of V, it holds that ⟨·, ·⟩ : V × V → ℝ is an inner product if and only if there exists a symmetric, positive definite matrix A ∈ ℝ^{n×n} with

\langle x, y\rangle = \hat{x}^\top A \hat{y} .    (3.15)

The following properties hold if A ∈ ℝ^{n×n} is symmetric and positive definite:

The null space (kernel) of A consists only of 0 because x^⊤Ax > 0 for all x ≠ 0. This implies that Ax ≠ 0 if x ≠ 0.
The diagonal elements a_{ii} of A are positive because a_{ii} = e_i^⊤ A e_i > 0, where e_i is the ith vector of the standard basis in ℝ^n.


3.3 Lengths and Distances

In Section 3.1, we already discussed norms that we can use to compute the length of a vector. Inner products and norms are closely related in the sense that any inner product induces a norm

\|x\| := \sqrt{\langle x, x\rangle}    (3.16)

in a natural way, such that we can compute lengths of vectors using the inner product. However, not every norm is induced by an inner product. The Manhattan norm (3.3) is an example of a norm without a corresponding inner product. In the following, we will focus on norms that are induced by inner products and introduce geometric concepts, such as lengths, distances, and angles.

Remark (Cauchy-Schwarz Inequality). For an inner product vector space (V, ⟨·, ·⟩) the induced norm ‖·‖ satisfies the Cauchy-Schwarz inequality

|\langle x, y\rangle| \le \|x\|\,\|y\| .    (3.17)

Example 3.5 (Lengths of Vectors Using Inner Products)
In geometry, we are often interested in lengths of vectors. We can now use an inner product to compute them using (3.16). Let us take x = [1, 1]^⊤ ∈ ℝ^2. If we use the dot product as the inner product, with (3.16) we obtain

\|x\| = \sqrt{x^\top x} = \sqrt{1^2 + 1^2} = \sqrt{2}    (3.18)

as the length of x. Let us now choose a different inner product:

\langle x, y\rangle := x^\top \begin{bmatrix} 1 & -\tfrac{1}{2} \\ -\tfrac{1}{2} & 1 \end{bmatrix} y = x_1 y_1 - \tfrac{1}{2}(x_1 y_2 + x_2 y_1) + x_2 y_2 .    (3.19)

If we compute the norm of a vector, then this inner product returns smaller values than the dot product if x₁ and x₂ have the same sign (and x₁x₂ > 0); otherwise, it returns greater values than the dot product. With this inner product, we obtain

\langle x, x\rangle = x_1^2 - x_1 x_2 + x_2^2 = 1 - 1 + 1 = 1 \implies \|x\| = \sqrt{1} = 1 ,    (3.20)

such that x is "shorter" with this inner product than with the dot product.
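A minimal NumPy sketch (ours, not from the book) reproduces the two lengths of x = [1, 1]^⊤ from the example:

```python
import numpy as np

x = np.array([1.0, 1.0])
A = np.array([[1.0, -0.5],
              [-0.5, 1.0]])          # matrix of the inner product (3.19)

len_dot = np.sqrt(x @ x)             # Euclidean length, (3.18): sqrt(2)
len_A = np.sqrt(x @ A @ x)           # length induced by (3.19): 1.0

print(len_dot, len_A)                # 1.4142...  1.0
```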

Definition 3.6 (Distance and Metric). Consider an inner product space (V, ⟨·, ·⟩). Then

d(x, y) := \|x - y\| = \sqrt{\langle x - y, x - y\rangle}    (3.21)

is called the distance between x and y for x, y ∈ V. If we use the dot product as the inner product, then the distance is called Euclidean distance.


The mapping

d : V \times V \to \mathbb{R}    (3.22)
(x, y) \mapsto d(x, y)    (3.23)

is called a metric.

Remark. Similar to the length of a vector, the distance between vectors


does not require an inner product: a norm is sufficient. If we have a norm
induced by an inner product, the distance may vary depending on the
choice of the inner product. ♦
A metric d satisfies the following:

1. d is positive definite, i.e., d(x, y) ≥ 0 for all x, y ∈ V and d(x, y) = 0 ⟺ x = y.
2. d is symmetric, i.e., d(x, y) = d(y, x) for all x, y ∈ V.
3. Triangle inequality: d(x, z) ≤ d(x, y) + d(y, z) for all x, y, z ∈ V.

Remark. At first glance, the lists of properties of inner products and metrics look very similar. However, by comparing Definition 3.3 with Definition 3.6 we observe that ⟨x, y⟩ and d(x, y) behave in opposite directions. Very similar x and y will result in a large value for the inner product and a small value for the metric. ♦

3.4 Angles and Orthogonality

[Figure 3.4: When restricted to [0, π], f(ω) = cos(ω) returns a unique number in the interval [−1, 1].]

In addition to enabling the definition of lengths of vectors, as well as the distance between two vectors, inner products also capture the geometry of a vector space by defining the angle ω between two vectors. We use the Cauchy-Schwarz inequality (3.17) to define angles ω in inner product spaces between two vectors x, y, and this notion coincides with our intuition in ℝ^2 and ℝ^3. Assume that x ≠ 0, y ≠ 0. Then

-1 \le \frac{\langle x, y\rangle}{\|x\|\,\|y\|} \le 1 .    (3.24)

Therefore, there exists a unique ω ∈ [0, π], illustrated in Figure 3.4, with

\cos\omega = \frac{\langle x, y\rangle}{\|x\|\,\|y\|} .    (3.25)

The number ω is the angle between the vectors x and y. Intuitively, the angle between two vectors tells us how similar their orientations are. For example, using the dot product, the angle between x and y = 4x, i.e., y is a scaled version of x, is 0: Their orientation is the same.


Example 3.6 (Angle between Vectors)
Let us compute the angle between x = [1, 1]^⊤ ∈ ℝ^2 and y = [1, 2]^⊤ ∈ ℝ^2; see Figure 3.5, where we use the dot product as the inner product. Then we get

\cos\omega = \frac{\langle x, y\rangle}{\sqrt{\langle x, x\rangle\langle y, y\rangle}} = \frac{x^\top y}{\sqrt{x^\top x\, y^\top y}} = \frac{3}{\sqrt{10}} ,    (3.26)

and the angle between the two vectors is arccos(3/√10) ≈ 0.32 rad, which corresponds to about 18°.

[Figure 3.5: The angle ω between two vectors x, y is computed using the inner product.]
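The computation in (3.26) can be reproduced numerically; the following sketch (ours, assuming NumPy) uses the vectors from the example:

```python
import numpy as np

x = np.array([1.0, 1.0])
y = np.array([1.0, 2.0])

cos_omega = (x @ y) / (np.linalg.norm(x) * np.linalg.norm(y))  # (3.25) with the dot product
omega = np.arccos(cos_omega)

print(cos_omega)            # 3/sqrt(10) ~ 0.9487
print(omega)                # ~0.3217 rad
print(np.degrees(omega))    # ~18.43 degrees
```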
A key feature of the inner product is that it also allows us to characterize vectors that are orthogonal.
Definition 3.7 (Orthogonality). Two vectors x and y are orthogonal if and only if ⟨x, y⟩ = 0, and we write x ⊥ y. If additionally ‖x‖ = 1 = ‖y‖, i.e., the vectors are unit vectors, then x and y are orthonormal.

An implication of this definition is that the 0-vector is orthogonal to


every vector in the vector space.
Remark. Orthogonality is the generalization of the concept of perpendic-
ularity to bilinear forms that do not have to be the dot product. In our
context, geometrically, we can think of orthogonal vectors as having a
right angle with respect to a specific inner product. ♦

Example 3.7 (Orthogonal Vectors)

[Figure 3.6: The angle ω between two vectors x, y can change depending on the inner product.]

Consider two vectors x = [1, 1]^⊤, y = [−1, 1]^⊤ ∈ ℝ^2; see Figure 3.6. We are interested in determining the angle ω between them using two different inner products. Using the dot product as the inner product yields an angle ω between x and y of 90°, such that x ⊥ y. However, if we choose the inner product

\langle x, y\rangle = x^\top \begin{bmatrix} 2 & 0 \\ 0 & 1 \end{bmatrix} y ,    (3.27)

we get that the angle ω between x and y is given by

\cos\omega = \frac{\langle x, y\rangle}{\|x\|\,\|y\|} = -\frac{1}{3} \implies \omega \approx 1.91\ \text{rad} \approx 109.5^\circ ,    (3.28)

and x and y are not orthogonal. Therefore, vectors that are orthogonal with respect to one inner product do not have to be orthogonal with respect to a different inner product.
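A short sketch (ours, assuming NumPy) reproduces both angles from the example:

```python
import numpy as np

x = np.array([1.0, 1.0])
y = np.array([-1.0, 1.0])
A = np.array([[2.0, 0.0],
              [0.0, 1.0]])       # matrix of the inner product (3.27)

def angle(x, y, inner):
    """Angle via (3.25) for a given inner product."""
    cos_omega = inner(x, y) / np.sqrt(inner(x, x) * inner(y, y))
    return np.degrees(np.arccos(cos_omega))

print(angle(x, y, lambda a, b: a @ b))       # 90.0   -> orthogonal w.r.t. the dot product
print(angle(x, y, lambda a, b: a @ A @ b))   # ~109.47 -> not orthogonal w.r.t. (3.27)
```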

Definition 3.8 (Orthogonal Matrix). A square matrix A ∈ ℝ^{n×n} is an orthogonal matrix if and only if its columns are orthonormal so that

A A^\top = I = A^\top A ,    (3.29)

which implies that

A^{-1} = A^\top ,    (3.30)

i.e., the inverse is obtained by simply transposing the matrix. (It is convention to call these matrices "orthogonal", but a more precise description would be "orthonormal".)

Transformations by orthogonal matrices are special because the length of a vector x is not changed when transforming it using an orthogonal matrix A. For the dot product, we obtain

\|Ax\|^2 = (Ax)^\top(Ax) = x^\top A^\top A x = x^\top I x = x^\top x = \|x\|^2 .    (3.31)

Moreover, the angle between any two vectors x, y, as measured by their inner product, is also unchanged when transforming both of them using an orthogonal matrix A. Assuming the dot product as the inner product, the angle of the images Ax and Ay is given as

\cos\omega = \frac{(Ax)^\top(Ay)}{\|Ax\|\,\|Ay\|} = \frac{x^\top A^\top A y}{\sqrt{x^\top A^\top A x\; y^\top A^\top A y}} = \frac{x^\top y}{\|x\|\,\|y\|} ,    (3.32)

which gives exactly the angle between x and y. This means that orthogonal matrices A with A^⊤ = A^{−1} preserve both angles and distances. It turns out that orthogonal matrices define transformations that are rotations (with the possibility of flips). In Section 3.9, we will discuss more details about rotations.
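A quick numerical check of (3.31) and (3.32) with a concrete orthogonal matrix (a rotation by an arbitrary angle; the specific numbers are illustrative choices of ours, and NumPy is assumed):

```python
import numpy as np

theta = 0.7                                        # arbitrary angle
A = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])    # rotation matrix, hence orthogonal

assert np.allclose(A.T @ A, np.eye(2))             # (3.29): A^T A = I

x = np.array([3.0, -1.0])
y = np.array([0.5, 2.0])

# (3.31): lengths are preserved
assert np.isclose(np.linalg.norm(A @ x), np.linalg.norm(x))

# (3.32): angles are preserved
cos_before = (x @ y) / (np.linalg.norm(x) * np.linalg.norm(y))
cos_after = ((A @ x) @ (A @ y)) / (np.linalg.norm(A @ x) * np.linalg.norm(A @ y))
assert np.isclose(cos_before, cos_after)
print("length and angle preserved")
```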

3.5 Orthonormal Basis


In Section 2.6.1, we characterized properties of basis vectors and found
that in an n-dimensional vector space, we need n basis vectors, i.e., n
vectors that are linearly independent. In Sections 3.3 and 3.4, we used
inner products to compute the length of vectors and the angle between
vectors. In the following, we will discuss the special case where the basis
vectors are orthogonal to each other and where the length of each basis
vector is 1. We will call this basis then an orthonormal basis.


Let us introduce this more formally.

Definition 3.9 (Orthonormal Basis). Consider an n-dimensional vector space V and a basis {b₁, ..., bₙ} of V. If

\langle b_i, b_j\rangle = 0 \quad \text{for } i \ne j    (3.33)
\langle b_i, b_i\rangle = 1    (3.34)

for all i, j = 1, ..., n, then the basis is called an orthonormal basis (ONB). If only (3.33) is satisfied, then the basis is called an orthogonal basis. Note that (3.34) implies that every basis vector has length/norm 1.

Recall from Section 2.6.1 that we can use Gaussian elimination to find a basis for a vector space spanned by a set of vectors. Assume we are given a set {b̃₁, ..., b̃ₙ} of non-orthogonal and unnormalized basis vectors. We concatenate them into a matrix B̃ = [b̃₁, ..., b̃ₙ] and apply Gaussian elimination to the augmented matrix (Section 2.3.2) [B̃B̃^⊤ | B̃] to obtain an orthonormal basis. This constructive way to iteratively build an orthonormal basis {b₁, ..., bₙ} is called the Gram-Schmidt process (Strang, 2003).

Example 3.8 (Orthonormal Basis)
The canonical/standard basis for a Euclidean vector space ℝ^n is an orthonormal basis, where the inner product is the dot product of vectors. In ℝ^2, the vectors

b_1 = \frac{1}{\sqrt{2}}\begin{bmatrix} 1 \\ 1 \end{bmatrix}, \quad b_2 = \frac{1}{\sqrt{2}}\begin{bmatrix} 1 \\ -1 \end{bmatrix}    (3.35)

form an orthonormal basis since b₁^⊤ b₂ = 0 and ‖b₁‖ = 1 = ‖b₂‖.

We will exploit the concept of an orthonormal basis in Chapter 12 and


Chapter 10 when we discuss support vector machines and principal com-
ponent analysis.

3.6 Orthogonal Complement


Having defined orthogonality, we will now look at vector spaces that are
orthogonal to each other. This will play an important role in Chapter 10,
when we discuss linear dimensionality reduction from a geometric per-
spective.
Consider a D-dimensional vector space V and an M-dimensional subspace U ⊆ V. Then its orthogonal complement U^⊥ is a (D − M)-dimensional subspace of V and contains all vectors in V that are orthogonal to every vector in U. Furthermore, U ∩ U^⊥ = {0}.


[Figure 3.7: A plane U in a three-dimensional vector space can be described by its normal vector, which spans its orthogonal complement U^⊥.]

Any vector x ∈ V can therefore be uniquely decomposed into

x = \sum_{m=1}^{M} \lambda_m b_m + \sum_{j=1}^{D-M} \psi_j b_j^{\perp} , \quad \lambda_m, \psi_j \in \mathbb{R} ,    (3.36)

where (b₁, ..., b_M) is a basis of U and (b₁^⊥, ..., b_{D−M}^⊥) is a basis of U^⊥.
Therefore, the orthogonal complement can also be used to describe a plane U (two-dimensional subspace) in a three-dimensional vector space. More specifically, the vector w with ‖w‖ = 1, which is orthogonal to the plane U, is the basis vector of U^⊥. Figure 3.7 illustrates this setting. All vectors that are orthogonal to w must (by construction) lie in the plane U. The vector w is called the normal vector of U.
Generally, orthogonal complements can be used to describe hyperplanes in n-dimensional vector and affine spaces.

3.7 Inner Product of Functions


Thus far, we looked at properties of inner products to compute lengths,
angles and distances. We focused on inner products of finite-dimensional
vectors. In the following, we will look at an example of inner products of
a different type of vectors: inner products of functions.
The inner products we discussed so far were defined for vectors with a
finite number of entries. We can think of a vector x ∈ Rn as a function
with n function values. The concept of an inner product can be generalized
to vectors with an infinite number of entries (countably infinite) and also
continuous-valued functions (uncountably infinite). Then the sum over
individual components of vectors (see Equation (3.5) for example) turns
into an integral.
An inner product of two functions u : ℝ → ℝ and v : ℝ → ℝ can be defined as the definite integral

\langle u, v\rangle := \int_a^b u(x)\, v(x)\, dx    (3.37)


for lower and upper limits a, b < ∞, respectively. As with our usual inner
product, we can define norms and orthogonality by looking at the inner
product. If (3.37) evaluates to 0, the functions u and v are orthogonal. To
make the preceding inner product mathematically precise, we need to take
care of measures and the definition of integrals, leading to the definition of
a Hilbert space. Furthermore, unlike inner products on finite-dimensional
vectors, inner products on functions may diverge (have infinite value). All
this requires diving into some more intricate details of real and functional
analysis, which we do not cover in this book.

Example 3.9 (Inner Product of Functions)
If we choose u = sin(x) and v = cos(x), the integrand f(x) = u(x)v(x) of (3.37) is shown in Figure 3.8. We see that this function is odd, i.e., f(−x) = −f(x). Therefore, the integral with limits a = −π, b = π of this product evaluates to 0. Therefore, sin and cos are orthogonal functions.

[Figure 3.8: f(x) = sin(x) cos(x).]
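The integral in Example 3.9 can also be approximated numerically; the following sketch (ours, assuming NumPy and a simple Riemann sum) confirms the orthogonality of sin and cos on [−π, π] and contrasts it with a non-orthogonal pair:

```python
import numpy as np

a, b = -np.pi, np.pi
xs = np.linspace(a, b, 100001)
f = np.sin(xs) * np.cos(xs)             # integrand u(x) v(x) of (3.37)

# simple Riemann-sum approximation of the integral (3.37)
integral = np.sum(f[:-1] * np.diff(xs))
print(integral)                          # ~0: sin and cos are orthogonal on [-pi, pi]

# a non-orthogonal pair for contrast: <sin, sin> = pi
g = np.sin(xs) * np.sin(xs)
print(np.sum(g[:-1] * np.diff(xs)))      # ~3.1416
```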

Remark. It also holds that the collection of functions

\{1, \cos(x), \cos(2x), \cos(3x), \dots\}    (3.38)

is orthogonal if we integrate from −π to π, i.e., any pair of functions are orthogonal to each other. The collection of functions in (3.38) spans a large subspace of the functions that are even and periodic on [−π, π), and projecting functions onto this subspace is the fundamental idea behind Fourier series. ♦
In Section 6.4.6, we will have a look at a second type of unconventional
inner products: the inner product of random variables.

3.8 Orthogonal Projections


Projections are an important class of linear transformations (besides rota-
tions and reflections) and play an important role in graphics, coding the-
ory, statistics and machine learning. In machine learning, we often deal
with data that is high-dimensional. High-dimensional data is often hard
to analyze or visualize. However, high-dimensional data quite often pos-
sesses the property that only a few dimensions contain most information,
and most other dimensions are not essential to describe key properties
of the data. When we compress or visualize high-dimensional data, we
will lose information. To minimize this compression loss, we ideally find
the most informative dimensions in the data. As discussed in Chapter 1, data can be represented as vectors ("feature" is a common expression for data representation), and in this chapter, we will discuss some of the fundamental tools for data compression. More specifically, we can project the original high-dimensional data onto a lower-dimensional feature space and work in this lower-dimensional space to learn more about the dataset and extract relevant patterns.


[Figure 3.9: Orthogonal projection (orange dots) of a two-dimensional dataset (blue dots) onto a one-dimensional subspace (straight line).]

For example, machine learning algorithms, such as principal component analysis (PCA) by Pearson (1901) and Hotelling (1933) and deep neural networks (e.g., deep auto-encoders (Deng et al., 2010)), heavily exploit the idea of dimensionality reduction. In the following, we will focus on orthogonal projections, which we will use in Chapter 10 for linear dimensionality reduction and in Chapter 12 for classification. Even linear regression, which we discuss in Chapter 9, can be interpreted using orthogonal projections. For a given lower-dimensional subspace, orthogonal projections of high-dimensional data retain as much information as possible and minimize the difference/error between the original data and the corresponding projection. An illustration of such an orthogonal projection is given in Figure 3.9. Before we detail how to obtain these projections, let us define what a projection actually is.

Definition 3.10 (Projection). Let V be a vector space and U ⊆ V a subspace of V. A linear mapping π : V → U is called a projection if π² = π ∘ π = π.

Since linear mappings can be expressed by transformation matrices (see Section 2.7), the preceding definition applies equally to a special kind of transformation matrices, the projection matrices P_π, which exhibit the property that P_π² = P_π.
In the following, we will derive orthogonal projections of vectors in the inner product space (ℝ^n, ⟨·, ·⟩) onto subspaces. We will start with one-dimensional subspaces, which are also called lines. If not mentioned otherwise, we assume the dot product ⟨x, y⟩ = x^⊤y as the inner product.

3.8.1 Projection onto One-Dimensional Subspaces (Lines)


Assume we are given a line (one-dimensional subspace) through the origin with basis vector b ∈ ℝ^n. The line is a one-dimensional subspace U ⊆ ℝ^n spanned by b. When we project x ∈ ℝ^n onto U, we seek the vector π_U(x) ∈ U that is closest to x.

[Figure 3.10: Examples of projections onto one-dimensional subspaces. (a) Projection of x ∈ ℝ^2 onto a subspace U with basis vector b. (b) Projection of a two-dimensional vector x with ‖x‖ = 1 onto a one-dimensional subspace spanned by b.]

Using geometric arguments, let us characterize some properties of the projection π_U(x) (Figure 3.10(a) serves as an illustration):

The projection π_U(x) is closest to x, where "closest" implies that the distance ‖x − π_U(x)‖ is minimal. It follows that the segment π_U(x) − x from π_U(x) to x is orthogonal to U, and therefore to the basis vector b of U. The orthogonality condition yields ⟨π_U(x) − x, b⟩ = 0 since angles between vectors are defined via the inner product.
The projection π_U(x) of x onto U must be an element of U and, therefore, a multiple of the basis vector b that spans U. Hence, π_U(x) = λb, for some λ ∈ ℝ; λ is then the coordinate of π_U(x) with respect to b.

In the following three steps, we determine the coordinate λ, the projection π_U(x) ∈ U, and the projection matrix P_π that maps any x ∈ ℝ^n onto U:

1. Finding the coordinate λ. The orthogonality condition yields

\langle x - \pi_U(x), b\rangle = 0 \overset{\pi_U(x) = \lambda b}{\iff} \langle x - \lambda b, b\rangle = 0 .    (3.39)

We can now exploit the bilinearity of the inner product and arrive at

\langle x, b\rangle - \lambda\langle b, b\rangle = 0 \iff \lambda = \frac{\langle x, b\rangle}{\langle b, b\rangle} = \frac{\langle b, x\rangle}{\|b\|^2} .    (3.40)

In the last step, we exploited the fact that inner products are symmetric. (With a general inner product, we get λ = ⟨x, b⟩ if ‖b‖ = 1.) If we choose ⟨·, ·⟩ to be the dot product, we obtain

\lambda = \frac{b^\top x}{b^\top b} = \frac{b^\top x}{\|b\|^2} .    (3.41)

If ‖b‖ = 1, then the coordinate λ of the projection is given by b^⊤x.


2. Finding the projection point π_U(x) ∈ U. Since π_U(x) = λb, we immediately obtain with (3.40) that

\pi_U(x) = \lambda b = \frac{\langle x, b\rangle}{\|b\|^2}\, b = \frac{b^\top x}{\|b\|^2}\, b ,    (3.42)

where the last equality holds for the dot product only. We can also compute the length of π_U(x) by means of Definition 3.1 as

\|\pi_U(x)\| = \|\lambda b\| = |\lambda|\, \|b\| .    (3.43)

Hence, our projection is of length |λ| times the length of b. This also adds the intuition that λ is the coordinate of π_U(x) with respect to the basis vector b that spans our one-dimensional subspace U.
If we use the dot product as an inner product, we get

\|\pi_U(x)\| \overset{(3.42)}{=} \frac{|b^\top x|}{\|b\|^2}\,\|b\| \overset{(3.25)}{=} |\cos\omega|\,\|x\|\,\|b\|\,\frac{\|b\|}{\|b\|^2} = |\cos\omega|\,\|x\| .    (3.44)

Here, ω is the angle between x and b. This equation should be familiar from trigonometry: If ‖x‖ = 1, then x lies on the unit circle. It follows that the projection onto the horizontal axis spanned by b (the horizontal axis is a one-dimensional subspace) is exactly cos ω, and the length of the corresponding vector π_U(x) = |cos ω|. An illustration is given in Figure 3.10(b).
3. Finding the projection matrix P_π. We know that a projection is a linear mapping (see Definition 3.10). Therefore, there exists a projection matrix P_π, such that π_U(x) = P_π x. With the dot product as inner product and

\pi_U(x) = \lambda b = b\lambda = b\,\frac{b^\top x}{\|b\|^2} = \frac{b b^\top}{\|b\|^2}\, x ,    (3.45)

we immediately see that

P_\pi = \frac{b b^\top}{\|b\|^2} .    (3.46)

Note that bb^⊤ (and, consequently, P_π) is a symmetric matrix (of rank 1), and ‖b‖² = ⟨b, b⟩ is a scalar; projection matrices are always symmetric.

The projection matrix P_π projects any vector x ∈ ℝ^n onto the line through the origin with direction b (equivalently, the subspace U spanned by b).
Remark. The projection π_U(x) ∈ ℝ^n is still an n-dimensional vector and not a scalar. However, we no longer require n coordinates to represent the projection, but only a single one if we want to express it with respect to the basis vector b that spans the subspace U: λ. ♦

[Figure 3.11: Projection onto a two-dimensional subspace U with basis b₁, b₂. The projection π_U(x) of x ∈ ℝ^3 onto U can be expressed as a linear combination of b₁, b₂, and the displacement vector x − π_U(x) is orthogonal to both b₁ and b₂.]

Example 3.10 (Projection onto a Line)
Find the projection matrix P_π onto the line through the origin spanned by b = [1, 2, 2]^⊤. b is a direction and a basis of the one-dimensional subspace (line through the origin).
With (3.46), we obtain

P_\pi = \frac{b b^\top}{b^\top b} = \frac{1}{9}\begin{bmatrix} 1 \\ 2 \\ 2 \end{bmatrix}\begin{bmatrix} 1 & 2 & 2 \end{bmatrix} = \frac{1}{9}\begin{bmatrix} 1 & 2 & 2 \\ 2 & 4 & 4 \\ 2 & 4 & 4 \end{bmatrix} .    (3.47)

Let us now choose a particular x and see whether it lies in the subspace spanned by b. For x = [1, 1, 1]^⊤, the projection is

\pi_U(x) = P_\pi x = \frac{1}{9}\begin{bmatrix} 1 & 2 & 2 \\ 2 & 4 & 4 \\ 2 & 4 & 4 \end{bmatrix}\begin{bmatrix} 1 \\ 1 \\ 1 \end{bmatrix} = \frac{1}{9}\begin{bmatrix} 5 \\ 10 \\ 10 \end{bmatrix} \in \operatorname{span}\!\left[\begin{bmatrix} 1 \\ 2 \\ 2 \end{bmatrix}\right].    (3.48)

Note that the application of P_π to π_U(x) does not change anything, i.e., P_π π_U(x) = π_U(x). This is expected because, according to Definition 3.10, we know that a projection matrix P_π satisfies P_π² x = P_π x for all x.

Remark. With the results from Chapter 4, we can show that π_U(x) is an eigenvector of P_π, and the corresponding eigenvalue is 1. ♦
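A sketch (ours, assuming NumPy) that reproduces Example 3.10 and checks the idempotence and eigenvector observations numerically:

```python
import numpy as np

b = np.array([1.0, 2.0, 2.0])
P = np.outer(b, b) / (b @ b)          # projection matrix (3.46): b b^T / ||b||^2

x = np.array([1.0, 1.0, 1.0])
pi_x = P @ x
print(pi_x)                           # [5/9, 10/9, 10/9], cf. (3.48)

assert np.allclose(P @ P, P)          # P is idempotent (Definition 3.10)
assert np.allclose(P @ pi_x, pi_x)    # projecting again changes nothing, i.e.,
                                      # pi_x is an eigenvector of P with eigenvalue 1
```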

3.8.2 Projection onto General Subspaces

In the following, we look at orthogonal projections of vectors x ∈ ℝ^n onto lower-dimensional subspaces U ⊆ ℝ^n with dim(U) = m ≥ 1. An illustration is given in Figure 3.11. (If U is given by a set of spanning vectors, which are not a basis, make sure you determine a basis b₁, ..., b_m before proceeding.)
Assume that (b₁, ..., b_m) is an ordered basis of U. Any projection π_U(x) onto U is necessarily an element of U.


Therefore, it can be represented as a linear combination of the basis vectors b₁, ..., b_m of U, such that π_U(x) = \sum_{i=1}^{m} \lambda_i b_i. (The basis vectors form the columns of B ∈ ℝ^{n×m}, where B = [b₁, ..., b_m].)
As in the 1D case, we follow a three-step procedure to find the projection π_U(x) and the projection matrix P_π:
1. Find the coordinates λ₁, ..., λ_m of the projection (with respect to the basis of U), such that the linear combination

\pi_U(x) = \sum_{i=1}^{m} \lambda_i b_i = B\lambda ,    (3.49)
B = [b_1, \dots, b_m] \in \mathbb{R}^{n\times m}, \quad \lambda = [\lambda_1, \dots, \lambda_m]^\top \in \mathbb{R}^m ,    (3.50)

is closest to x ∈ ℝ^n. As in the 1D case, "closest" means "minimum distance", which implies that the vector connecting π_U(x) ∈ U and x ∈ ℝ^n must be orthogonal to all basis vectors of U. Therefore, we obtain m simultaneous conditions (assuming the dot product as the inner product)

\langle b_1, x - \pi_U(x)\rangle = b_1^\top(x - \pi_U(x)) = 0    (3.51)
\vdots
\langle b_m, x - \pi_U(x)\rangle = b_m^\top(x - \pi_U(x)) = 0    (3.52)

which, with π_U(x) = Bλ, can be written as

b_1^\top(x - B\lambda) = 0    (3.53)
\vdots
b_m^\top(x - B\lambda) = 0    (3.54)

such that we obtain a homogeneous linear equation system

\begin{bmatrix} b_1^\top \\ \vdots \\ b_m^\top \end{bmatrix}\bigl(x - B\lambda\bigr) = 0 \iff B^\top(x - B\lambda) = 0    (3.55)
\iff B^\top B\lambda = B^\top x .    (3.56)

The last expression is called normal equation. Since b₁, ..., b_m are a basis of U and, therefore, linearly independent, B^⊤B ∈ ℝ^{m×m} is regular and can be inverted. This allows us to solve for the coefficients/coordinates

\lambda = (B^\top B)^{-1} B^\top x .    (3.57)

The matrix (B^⊤B)^{−1}B^⊤ is also called the pseudo-inverse of B, which can be computed for non-square matrices B. It only requires that B^⊤B is positive definite, which is the case if B is full rank.


In practical applications (e.g., linear regression), we often add a "jitter term" εI to B^⊤B to guarantee increased numerical stability and positive definiteness. This "ridge" can be rigorously derived using Bayesian inference. See Chapter 9 for details.
2. Find the projection π_U(x) ∈ U. We already established that π_U(x) = Bλ. Therefore, with (3.57),

\pi_U(x) = B(B^\top B)^{-1}B^\top x .    (3.58)

3. Find the projection matrix P_π. From (3.58), we can immediately see that the projection matrix that solves P_π x = π_U(x) must be

P_\pi = B(B^\top B)^{-1}B^\top .    (3.59)

Remark. The solution for projecting onto general subspaces includes the 1D case as a special case: If dim(U) = 1, then B^⊤B ∈ ℝ is a scalar and we can rewrite the projection matrix in (3.59), P_π = B(B^⊤B)^{−1}B^⊤, as P_π = BB^⊤/(B^⊤B), which is exactly the projection matrix in (3.46). ♦

Example 3.11 (Projection onto a Two-dimensional Subspace)
For a subspace U = span[[1, 1, 1]^⊤, [0, 1, 2]^⊤] ⊆ ℝ^3 and x = [6, 0, 0]^⊤ ∈ ℝ^3, find the coordinates λ of x in terms of the subspace U, the projection point π_U(x), and the projection matrix P_π.
First, we see that the generating set of U is a basis (linear independence) and write the basis vectors of U into a matrix

B = \begin{bmatrix} 1 & 0 \\ 1 & 1 \\ 1 & 2 \end{bmatrix}.

Second, we compute the matrix B^⊤B and the vector B^⊤x as

B^\top B = \begin{bmatrix} 1 & 1 & 1 \\ 0 & 1 & 2 \end{bmatrix}\begin{bmatrix} 1 & 0 \\ 1 & 1 \\ 1 & 2 \end{bmatrix} = \begin{bmatrix} 3 & 3 \\ 3 & 5 \end{bmatrix}, \quad B^\top x = \begin{bmatrix} 1 & 1 & 1 \\ 0 & 1 & 2 \end{bmatrix}\begin{bmatrix} 6 \\ 0 \\ 0 \end{bmatrix} = \begin{bmatrix} 6 \\ 0 \end{bmatrix}.    (3.60)

Third, we solve the normal equation B^⊤Bλ = B^⊤x to find λ:

\begin{bmatrix} 3 & 3 \\ 3 & 5 \end{bmatrix}\begin{bmatrix} \lambda_1 \\ \lambda_2 \end{bmatrix} = \begin{bmatrix} 6 \\ 0 \end{bmatrix} \iff \lambda = \begin{bmatrix} 5 \\ -3 \end{bmatrix}.    (3.61)

Fourth, the projection π_U(x) of x onto U, i.e., into the column space of B, can be directly computed via

\pi_U(x) = B\lambda = \begin{bmatrix} 5 \\ 2 \\ -1 \end{bmatrix}.    (3.62)

The corresponding projection error (also called the reconstruction error) is the norm of the difference vector between the original vector and its projection onto U, i.e.,

\|x - \pi_U(x)\| = \left\|\begin{bmatrix} 1 & -2 & 1 \end{bmatrix}^\top\right\| = \sqrt{6} .    (3.63)

Fifth, the projection matrix (for any x ∈ ℝ^3) is given by

P_\pi = B(B^\top B)^{-1}B^\top = \frac{1}{6}\begin{bmatrix} 5 & 2 & -1 \\ 2 & 2 & 2 \\ -1 & 2 & 5 \end{bmatrix}.    (3.64)

To verify the results, we can (a) check whether the displacement vector π_U(x) − x is orthogonal to all basis vectors of U, and (b) verify that P_π = P_π² (see Definition 3.10).
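The five steps of Example 3.11 can be reproduced with NumPy (a sketch of ours, not from the book):

```python
import numpy as np

B = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [1.0, 2.0]])                 # basis vectors of U as columns
x = np.array([6.0, 0.0, 0.0])

lam = np.linalg.solve(B.T @ B, B.T @ x)    # normal equation (3.56): lambda = [5, -3]
pi_x = B @ lam                             # projection (3.62): [5, 2, -1]
P = B @ np.linalg.inv(B.T @ B) @ B.T       # projection matrix (3.64)

print(lam, pi_x)
print(np.linalg.norm(x - pi_x))            # projection error sqrt(6), cf. (3.63)

assert np.allclose(B.T @ (x - pi_x), 0.0)  # displacement orthogonal to U
assert np.allclose(P @ P, P)               # P is idempotent
```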

Remark. The projections π_U(x) are still vectors in ℝ^n although they lie in an m-dimensional subspace U ⊆ ℝ^n. However, to represent a projected vector we only need the m coordinates λ₁, ..., λ_m with respect to the basis vectors b₁, ..., b_m of U. ♦
Remark. In vector spaces with general inner products, we have to pay attention when computing angles and distances, which are defined by means of the inner product. ♦
Projections allow us to look at situations where we have a linear system Ax = b without a solution: we can find approximate solutions to unsolvable linear equation systems using projections. Recall that this means that b does not lie in the span of A, i.e., the vector b does not lie in the subspace spanned by the columns of A. Given that the linear equation cannot be solved exactly, we can find an approximate solution. The idea is to find the vector in the subspace spanned by the columns of A that is closest to b, i.e., we compute the orthogonal projection of b onto the subspace spanned by the columns of A. This problem arises often in practice, and the solution is called the least-squares solution (assuming the dot product as the inner product) of an overdetermined system. This is discussed further in Section 9.4. Using reconstruction errors (3.63) is one possible approach to derive principal component analysis (Section 10.3).
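As a sketch of this idea (ours; the matrix and right-hand side reuse the data of Example 3.11, and NumPy's least-squares routine is an assumption), the least-squares solution of an overdetermined system coincides with the coordinates of the orthogonal projection:

```python
import numpy as np

A = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [1.0, 2.0]])
b = np.array([6.0, 0.0, 0.0])          # b is not in the column space of A

# least-squares solution: coefficients of the orthogonal projection of b
x_ls, residual, rank, sv = np.linalg.lstsq(A, b, rcond=None)
print(x_ls)                            # [5, -3], identical to lambda in Example 3.11

b_proj = A @ x_ls                      # orthogonal projection of b onto span of A's columns
print(b_proj, np.linalg.norm(b - b_proj))   # [5, 2, -1], error sqrt(6)
```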
Remark. We just looked at projections of vectors x onto a subspace U with basis vectors {b₁, ..., b_k}. If this basis is an ONB, i.e., (3.33) and (3.34) are satisfied, the projection equation (3.58) simplifies greatly to

\pi_U(x) = B B^\top x    (3.65)

since B^⊤B = I, with coordinates

\lambda = B^\top x .    (3.66)

This means that we no longer have to compute the inverse from (3.58), which saves computation time. ♦


3.8.3 Gram-Schmidt Orthogonalization

Projections are at the core of the Gram-Schmidt method that allows us to constructively transform any basis (b₁, ..., bₙ) of an n-dimensional vector space V into an orthogonal/orthonormal basis (u₁, ..., uₙ) of V. This basis always exists (Liesen and Mehrmann, 2015) and span[b₁, ..., bₙ] = span[u₁, ..., uₙ]. The Gram-Schmidt orthogonalization method iteratively constructs an orthogonal basis (u₁, ..., uₙ) from any basis (b₁, ..., bₙ) of V as follows:

u_1 := b_1    (3.67)
u_k := b_k - \pi_{\operatorname{span}[u_1,\dots,u_{k-1}]}(b_k) , \quad k = 2, \dots, n .    (3.68)

In (3.68), the kth basis vector b_k is projected onto the subspace spanned by the first k − 1 constructed orthogonal vectors u₁, ..., u_{k−1}; see Section 3.8.2. This projection is then subtracted from b_k and yields a vector u_k that is orthogonal to the (k − 1)-dimensional subspace spanned by u₁, ..., u_{k−1}. Repeating this procedure for all n basis vectors b₁, ..., bₙ yields an orthogonal basis (u₁, ..., uₙ) of V. If we normalize the u_k, we obtain an ONB where ‖u_k‖ = 1 for k = 1, ..., n.

Example 3.12 (Gram-Schmidt Orthogonalization)

[Figure 3.12: Gram-Schmidt orthogonalization. (a) Non-orthogonal basis (b₁, b₂) of ℝ²; (b) first constructed basis vector u₁ and orthogonal projection of b₂ onto span[u₁]; (c) orthogonal basis (u₁, u₂) of ℝ².]

Consider a basis (b₁, b₂) of ℝ^2, where

b_1 = \begin{bmatrix} 2 \\ 0 \end{bmatrix}, \quad b_2 = \begin{bmatrix} 1 \\ 1 \end{bmatrix};    (3.69)

see also Figure 3.12(a). Using the Gram-Schmidt method, we construct an orthogonal basis (u₁, u₂) of ℝ^2 as follows (assuming the dot product as the inner product):

u_1 := b_1 = \begin{bmatrix} 2 \\ 0 \end{bmatrix},    (3.70)
u_2 := b_2 - \pi_{\operatorname{span}[u_1]}(b_2) \overset{(3.45)}{=} b_2 - \frac{u_1 u_1^\top}{\|u_1\|^2}\, b_2 = \begin{bmatrix} 1 \\ 1 \end{bmatrix} - \begin{bmatrix} 1 & 0 \\ 0 & 0 \end{bmatrix}\begin{bmatrix} 1 \\ 1 \end{bmatrix} = \begin{bmatrix} 0 \\ 1 \end{bmatrix}.    (3.71)
These steps are illustrated in Figures 3.12(b) and (c). We immediately see that u₁ and u₂ are orthogonal, i.e., u₁^⊤ u₂ = 0.
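A minimal Gram-Schmidt sketch following (3.67)–(3.68) (ours, assuming NumPy; basis vectors are passed as the columns of a matrix, and linear independence is assumed):

```python
import numpy as np

def gram_schmidt(B):
    """Orthogonalize the columns of B via (3.67)-(3.68); returns orthogonal columns."""
    n, k = B.shape
    U = np.zeros((n, k))
    for i in range(k):
        u = B[:, i].copy()
        for j in range(i):
            # subtract the projection of b_i onto the already constructed u_j, cf. (3.45)
            u -= (U[:, j] @ B[:, i]) / (U[:, j] @ U[:, j]) * U[:, j]
        U[:, i] = u
    return U

B = np.array([[2.0, 1.0],
              [0.0, 1.0]])              # columns b1, b2 from (3.69)
U = gram_schmidt(B)
print(U)                                # columns [2, 0] and [0, 1], cf. (3.70)-(3.71)
print(U / np.linalg.norm(U, axis=0))    # normalize the columns to obtain an ONB
```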

3.8.4 Projection onto Affine Subspaces

Thus far, we discussed how to project a vector onto a lower-dimensional subspace U. In the following, we provide a solution to projecting a vector onto an affine subspace.
Consider the setting in Figure 3.13(a). We are given an affine space L = x₀ + U, where b₁, b₂ are basis vectors of U. To determine the orthogonal projection π_L(x) of x onto L, we transform the problem into a problem that we know how to solve: the projection onto a vector subspace. In order to get there, we subtract the support point x₀ from x and from L, so that L − x₀ = U is exactly the vector subspace U. We can now use the orthogonal projections onto a subspace we discussed in Section 3.8.2 and obtain the projection π_U(x − x₀), which is illustrated in Figure 3.13(b). This projection can now be translated back into L by adding x₀, such that we obtain the orthogonal projection onto an affine space L as

\pi_L(x) = x_0 + \pi_U(x - x_0) ,    (3.72)

where π_U(·) is the orthogonal projection onto the subspace U, i.e., the direction space of L; see Figure 3.13(c).

[Figure 3.13: Projection onto an affine space. (a) Original setting; (b) the setting shifted by −x₀ so that x − x₀ can be projected onto the direction space U; (c) the projection is translated back to x₀ + π_U(x − x₀), which gives the final orthogonal projection π_L(x).]

From Figure 3.13, it is also evident that the distance of x from the affine space L is identical to the distance of x − x₀ from U, i.e.,

d(x, L) = \|x - \pi_L(x)\| = \|x - (x_0 + \pi_U(x - x_0))\|    (3.73a)
= d(x - x_0, \pi_U(x - x_0)) = d(x - x_0, U) .    (3.73b)

We will use projections onto an affine subspace to derive the concept of a separating hyperplane in Section 12.1.


[Figure 3.14: A rotation rotates objects in a plane about the origin. If the rotation angle is positive, we rotate counterclockwise. Shown: an original object and the same object rotated by 112.5°.]

[Figure 3.15: The robotic arm needs to rotate its joints in order to pick up objects or to place them correctly. Figure taken from (Deisenroth et al., 2015).]

3.9 Rotations

Length and angle preservation, as discussed in Section 3.4, are the two characteristics of linear mappings with orthogonal transformation matrices. In the following, we will have a closer look at specific orthogonal transformation matrices, which describe rotations.
A rotation is a linear mapping (more specifically, an automorphism of a Euclidean vector space) that rotates a plane by an angle θ about the origin, i.e., the origin is a fixed point. For a positive angle θ > 0, by common convention, we rotate in a counterclockwise direction. An example is shown in Figure 3.14, where the transformation matrix is

R = \begin{bmatrix} -0.38 & -0.92 \\ 0.92 & -0.38 \end{bmatrix}.    (3.74)

Important application areas of rotations include computer graphics and robotics. For example, in robotics, it is often important to know how to rotate the joints of a robotic arm in order to pick up or place an object; see Figure 3.15.


[Figure 3.16: Rotation of the standard basis in ℝ^2 by an angle θ: Φ(e₁) = [cos θ, sin θ]^⊤ and Φ(e₂) = [−sin θ, cos θ]^⊤.]

3.9.1 Rotations in ℝ²

Consider the standard basis e₁ = [1, 0]^⊤, e₂ = [0, 1]^⊤ of ℝ^2, which defines the standard coordinate system in ℝ^2. We aim to rotate this coordinate system by an angle θ as illustrated in Figure 3.16. Note that the rotated vectors are still linearly independent and, therefore, are a basis of ℝ^2. This means that the rotation performs a basis change.
Rotations Φ are linear mappings so that we can express them by a rotation matrix R(θ). Trigonometry (see Figure 3.16) allows us to determine the coordinates of the rotated axes (the image of Φ) with respect to the standard basis in ℝ^2. We obtain

\Phi(e_1) = \begin{bmatrix} \cos\theta \\ \sin\theta \end{bmatrix}, \quad \Phi(e_2) = \begin{bmatrix} -\sin\theta \\ \cos\theta \end{bmatrix}.    (3.75)

Therefore, the rotation matrix that performs the basis change into the rotated coordinates R(θ) is given as

R(\theta) = \begin{bmatrix} \Phi(e_1) & \Phi(e_2) \end{bmatrix} = \begin{bmatrix} \cos\theta & -\sin\theta \\ \sin\theta & \cos\theta \end{bmatrix}.    (3.76)
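A small sketch (ours, assuming NumPy) that builds R(θ) from (3.76) and rotates a vector:

```python
import numpy as np

def rotation_matrix(theta):
    """2D rotation matrix R(theta) from (3.76)."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s],
                     [s,  c]])

theta = np.deg2rad(90.0)
R = rotation_matrix(theta)

x = np.array([1.0, 0.0])
print(R @ x)                               # ~[0, 1]: e1 rotated counterclockwise by 90 degrees

# R is orthogonal, so it preserves lengths (Section 3.4)
assert np.allclose(R.T @ R, np.eye(2))
assert np.isclose(np.linalg.norm(R @ x), np.linalg.norm(x))
```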

3.9.2 Rotations in R3
In contrast to the R2 case, in R3 we can rotate any two-dimensional plane
about a one-dimensional axis. The easiest way to specify the general rota-
tion matrix is to specify how the images of the standard basis e1 , e2 , e3 are
supposed to be rotated, and making sure these images Re1 , Re2 , Re3 are
orthonormal to each other. We can then obtain a general rotation matrix
R by combining the images of the standard basis.
To have a meaningful rotation angle, we have to define what “coun-
terclockwise” means when we operate in more than two dimensions. We
use the convention that a “counterclockwise” (planar) rotation about an
axis refers to a rotation about an axis when we look at the axis “head on,
from the end toward the origin”. In R3 , there are therefore three (planar)
rotations about the three standard basis vectors (see Figure 3.17):


[Figure 3.17: Rotation of a vector (gray) in ℝ^3 by an angle θ about the e₃-axis. The rotated vector is shown in blue.]

Rotation about the e₁-axis:

R_1(\theta) = \begin{bmatrix} \Phi(e_1) & \Phi(e_2) & \Phi(e_3) \end{bmatrix} = \begin{bmatrix} 1 & 0 & 0 \\ 0 & \cos\theta & -\sin\theta \\ 0 & \sin\theta & \cos\theta \end{bmatrix}.    (3.77)

Here, the e₁ coordinate is fixed, and the counterclockwise rotation is performed in the e₂e₃ plane.

Rotation about the e₂-axis:

R_2(\theta) = \begin{bmatrix} \cos\theta & 0 & \sin\theta \\ 0 & 1 & 0 \\ -\sin\theta & 0 & \cos\theta \end{bmatrix}.    (3.78)

If we rotate the e₁e₃ plane about the e₂ axis, we need to look at the e₂ axis from its "tip" toward the origin.

Rotation about the e₃-axis:

R_3(\theta) = \begin{bmatrix} \cos\theta & -\sin\theta & 0 \\ \sin\theta & \cos\theta & 0 \\ 0 & 0 & 1 \end{bmatrix}.    (3.79)

Figure 3.17 illustrates this.

3.9.3 Rotations in n Dimensions

The generalization of rotations from 2D and 3D to n-dimensional Euclidean vector spaces can be intuitively described as fixing n − 2 dimensions and restricting the rotation to a two-dimensional plane in the n-dimensional space. As in the three-dimensional case, we can rotate any plane (two-dimensional subspace of ℝ^n).

Definition 3.11 (Givens Rotation). Let V be an n-dimensional Euclidean vector space and Φ : V → V an automorphism with transformation matrix

R_{ij}(\theta) := \begin{bmatrix} I_{i-1} & 0 & \cdots & \cdots & 0 \\ 0 & \cos\theta & 0 & -\sin\theta & 0 \\ 0 & 0 & I_{j-i-1} & 0 & 0 \\ 0 & \sin\theta & 0 & \cos\theta & 0 \\ 0 & \cdots & \cdots & 0 & I_{n-j} \end{bmatrix} \in \mathbb{R}^{n\times n} ,    (3.80)

for 1 ≤ i < j ≤ n and θ ∈ ℝ. Then R_{ij}(θ) is called a Givens rotation. Essentially, R_{ij}(θ) is the identity matrix I_n with

r_{ii} = \cos\theta , \quad r_{ij} = -\sin\theta , \quad r_{ji} = \sin\theta , \quad r_{jj} = \cos\theta .    (3.81)

In two dimensions (i.e., n = 2), we obtain (3.76) as a special case.
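A sketch (ours, assuming NumPy; note the 0-based indices, whereas (3.80) uses 1-based indices) that constructs R_ij(θ) by modifying the identity matrix as in (3.81) and checks that it only affects the chosen plane:

```python
import numpy as np

def givens(n, i, j, theta):
    """Givens rotation R_ij(theta) in R^{n x n}, following (3.81); 0-based indices i < j."""
    R = np.eye(n)
    c, s = np.cos(theta), np.sin(theta)
    R[i, i] = c
    R[j, j] = c
    R[i, j] = -s
    R[j, i] = s
    return R

R = givens(4, 1, 3, np.pi / 4)      # rotate the plane of the 2nd and 4th axes of R^4 by 45 degrees
print(R)

x = np.array([1.0, 1.0, 1.0, 1.0])
print(R @ x)                        # components 1 and 3 change, components 0 and 2 do not

assert np.allclose(R.T @ R, np.eye(4))   # a Givens rotation is orthogonal
```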

3.9.4 Properties of Rotations

Rotations exhibit a number of useful properties, which can be derived by considering them as orthogonal matrices (Definition 3.8):

Rotations preserve distances, i.e., ‖x − y‖ = ‖R_θ(x) − R_θ(y)‖. In other words, rotations leave the distance between any two points unchanged after the transformation.
Rotations preserve angles, i.e., the angle between R_θ x and R_θ y equals the angle between x and y.
Rotations in three (or more) dimensions are generally not commutative. Therefore, the order in which rotations are applied is important, even if they rotate about the same point. Only in two dimensions are vector rotations commutative, such that R(φ)R(θ) = R(θ)R(φ) for all φ, θ ∈ [0, 2π); they form an Abelian group (with multiplication) only if they rotate about the same point (e.g., the origin).

3.10 Further Reading


In this chapter, we gave a brief overview of some of the important concepts
of analytic geometry, which we will use in later chapters of the book.
For a broader and more in-depth overview of some of the concepts we
presented, we refer to the following excellent books: Axler (2015) and
Boyd and Vandenberghe (2018).
Inner products allow us to determine specific bases of vector (sub)spaces,
where each vector is orthogonal to all others (orthogonal bases) using the
Gram-Schmidt method. These bases are important in optimization and
numerical algorithms for solving linear equation systems. For instance,
Krylov subspace methods, such as conjugate gradients or the generalized
minimal residual method (GMRES), minimize residual errors that are or-
thogonal to each other (Stoer and Burlirsch, 2002).
In machine learning, inner products are important in the context of


kernel methods (Schölkopf and Smola, 2002). Kernel methods exploit the
fact that many linear algorithms can be expressed purely by inner prod-
uct computations. Then, the “kernel trick” allows us to compute these
inner products implicitly in a (potentially infinite-dimensional) feature
space, without even knowing this feature space explicitly. This allowed the
“non-linearization” of many algorithms used in machine learning, such as
kernel-PCA (Schölkopf et al., 1997) for dimensionality reduction. Gaus-
sian processes (Rasmussen and Williams, 2006) also fall into the category
of kernel methods and are the current state of the art in probabilistic re-
gression (fitting curves to data points). The idea of kernels is explored
further in Chapter 12.
Projections are often used in computer graphics, e.g., to generate shad-
ows. In optimization, orthogonal projections are often used to (iteratively)
minimize residual errors. This also has applications in machine learning,
e.g., in linear regression where we want to find a (linear) function that
minimizes the residual errors, i.e., the lengths of the orthogonal projec-
tions of the data onto the linear function (Bishop, 2006). We will investi-
gate this further in Chapter 9. PCA (Pearson, 1901; Hotelling, 1933) also
uses projections to reduce the dimensionality of high-dimensional data.
We will discuss this in more detail in Chapter 10.


Exercises

3.1 Show that ⟨·, ·⟩ defined for all x = [x₁, x₂]^⊤ ∈ ℝ^2 and y = [y₁, y₂]^⊤ ∈ ℝ^2 by

\langle x, y\rangle := x_1 y_1 - (x_1 y_2 + x_2 y_1) + 2(x_2 y_2)

is an inner product.

3.2 Consider ℝ^2 with ⟨·, ·⟩ defined for all x and y in ℝ^2 as

\langle x, y\rangle := x^\top \underbrace{\begin{bmatrix} 2 & 0 \\ 1 & 2 \end{bmatrix}}_{=:A} y .

Is ⟨·, ·⟩ an inner product?

3.3 Compute the distance between

x = \begin{bmatrix} 1 \\ 2 \\ 3 \end{bmatrix}, \quad y = \begin{bmatrix} -1 \\ -1 \\ 0 \end{bmatrix}

using
a. ⟨x, y⟩ := x^⊤ y
b. ⟨x, y⟩ := x^⊤ A y, A := \begin{bmatrix} 2 & 1 & 0 \\ 1 & 3 & -1 \\ 0 & -1 & 2 \end{bmatrix}

3.4 Compute the angle between

x = \begin{bmatrix} 1 \\ 2 \end{bmatrix}, \quad y = \begin{bmatrix} -1 \\ -1 \end{bmatrix}

using
a. ⟨x, y⟩ := x^⊤ y
b. ⟨x, y⟩ := x^⊤ B y, B := \begin{bmatrix} 2 & 1 \\ 1 & 3 \end{bmatrix}

3.5 Consider the Euclidean vector space ℝ^5 with the dot product. A subspace U ⊆ ℝ^5 and x ∈ ℝ^5 are given by

U = \operatorname{span}\!\left[\begin{bmatrix} 0 \\ -1 \\ 2 \\ 0 \\ 2 \end{bmatrix}, \begin{bmatrix} 1 \\ -3 \\ -1 \\ -1 \\ 2 \end{bmatrix}, \begin{bmatrix} -3 \\ 4 \\ 1 \\ 2 \\ 1 \end{bmatrix}, \begin{bmatrix} -1 \\ -3 \\ 1 \\ 0 \\ 7 \end{bmatrix}\right], \quad x = \begin{bmatrix} -1 \\ -9 \\ 5 \\ 4 \\ 1 \end{bmatrix}.

a. Determine the orthogonal projection π_U(x) of x onto U.
b. Determine the distance d(x, U).

3.6 Consider ℝ^3 with the inner product

\langle x, y\rangle := x^\top \begin{bmatrix} 2 & 1 & 0 \\ 1 & 2 & -1 \\ 0 & -1 & 2 \end{bmatrix} y .

Furthermore, we define e₁, e₂, e₃ as the standard/canonical basis in ℝ^3.
a. Determine the orthogonal projection π_U(e₂) of e₂ onto U = span[e₁, e₃]. Hint: Orthogonality is defined through the inner product.
b. Compute the distance d(e₂, U).
c. Draw the scenario: standard basis vectors and π_U(e₂).

3.7 Let V be a vector space and π an endomorphism of V.
a. Prove that π is a projection if and only if id_V − π is a projection, where id_V is the identity endomorphism on V.
b. Assume now that π is a projection. Calculate Im(id_V − π) and ker(id_V − π) as a function of Im(π) and ker(π).

3.8 Using the Gram-Schmidt method, turn the basis B = (b₁, b₂) of a two-dimensional subspace U ⊆ ℝ^3 into an ONB C = (c₁, c₂) of U, where

b_1 := \begin{bmatrix} 1 \\ 1 \\ 1 \end{bmatrix}, \quad b_2 := \begin{bmatrix} -1 \\ 2 \\ 0 \end{bmatrix}.

3.9 Let n ∈ ℕ and let x₁, ..., xₙ > 0 be n positive real numbers so that x₁ + ... + xₙ = 1. Use the Cauchy-Schwarz inequality and show that
a. \sum_{i=1}^{n} x_i^2 \ge \frac{1}{n}
b. \sum_{i=1}^{n} \frac{1}{x_i} \ge n^2
Hint: Think about the dot product on ℝ^n. Then, choose specific vectors x, y ∈ ℝ^n and apply the Cauchy-Schwarz inequality.

3.10 Rotate the vectors

x_1 := \begin{bmatrix} 2 \\ 3 \end{bmatrix}, \quad x_2 := \begin{bmatrix} 0 \\ -1 \end{bmatrix}

by 30°.
