3 Analytic Geometry
3.1 Norms
When we think of geometric vectors, i.e., directed line segments that start
at the origin, then intuitively the length of a vector is the distance of the
“end” of this directed line segment from the origin. In the following, we
will discuss the notion of the length of vectors using the concept of a norm.
A norm on a vector space V is a function

‖·‖ : V → R ,   (3.1)
x ↦ ‖x‖ ,   (3.2)

which assigns each vector x its length ‖x‖ ∈ R, such that for all λ ∈ R and x, y ∈ V the following hold:

Absolutely homogeneous: ‖λx‖ = |λ| ‖x‖
Triangle inequality: ‖x + y‖ ≤ ‖x‖ + ‖y‖
Positive definite: ‖x‖ ≥ 0 and ‖x‖ = 0 ⟺ x = 0
The Manhattan norm on Rⁿ is defined for x ∈ Rⁿ as

$\|x\|_1 := \sum_{i=1}^{n} |x_i|$ ,   (3.3)

where |·| is the absolute value. The left panel of Figure 3.3 shows all vectors x ∈ R² with ‖x‖₁ = 1. The Manhattan norm is also called the ℓ₁ norm.
The Euclidean norm of x ∈ Rⁿ is defined as

$\|x\|_2 := \sqrt{\sum_{i=1}^{n} x_i^2} = \sqrt{x^\top x}$   (3.4)

and computes the Euclidean distance of x from the origin. The right panel of Figure 3.3 shows all vectors x ∈ R² with ‖x‖₂ = 1. The Euclidean norm is also called the ℓ₂ norm.
Remark. Throughout this book, we will use the Euclidean norm (3.4) by
default if not stated otherwise. ♦
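To make the two norms concrete, the following minimal sketch (not from the book; it assumes NumPy and uses an arbitrary example vector) evaluates (3.3) and (3.4):

```python
import numpy as np

x = np.array([1.0, -2.0, 3.0])  # arbitrary example vector

# Manhattan (l1) norm: sum of absolute values, cf. (3.3)
l1 = np.sum(np.abs(x))      # equivalently: np.linalg.norm(x, ord=1)

# Euclidean (l2) norm: square root of the sum of squares, cf. (3.4)
l2 = np.sqrt(np.sum(x**2))  # equivalently: np.linalg.norm(x, ord=2)

print(l1, l2)  # 6.0  3.7416...
```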
3.2 Inner Products

For x, y ∈ Rⁿ, the scalar product (dot product) is given by $x^\top y = \sum_{i=1}^{n} x_i y_i$. We will refer to this particular inner product as the dot product in this book. However, inner products are more general concepts with specific properties, which we will now introduce.
For a basis B = (b₁, ..., bₙ) of an n-dimensional vector space V, every inner product can be written as ⟨x, y⟩ = x̂⊤A ŷ, where A_ij := ⟨bᵢ, bⱼ⟩ and x̂, ŷ are the coordinates of x and y with respect to the basis B. This implies that the inner product ⟨·, ·⟩ is uniquely determined through A. The symmetry of the inner product also means that A is symmetric. Furthermore, the positive definiteness of the inner product implies that x⊤Ax > 0 for all x ≠ 0, i.e., A is a symmetric, positive definite matrix.
If A is symmetric and positive definite, then the following properties hold:

The null space (kernel) of A consists only of 0 because x⊤Ax > 0 for all x ≠ 0. This implies that Ax ≠ 0 if x ≠ 0.

The diagonal elements a_ii of A are positive because a_ii = eᵢ⊤Aeᵢ > 0, where eᵢ is the ith vector of the standard basis in Rⁿ.
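As a brief numerical sketch of these properties (the matrix A and the coordinate vectors below are illustrative choices, not from the book):

```python
import numpy as np

A = np.array([[9.0, 6.0],
              [6.0, 5.0]])  # an illustrative symmetric, positive definite matrix

# Symmetry and positive definiteness (all eigenvalues strictly positive)
assert np.allclose(A, A.T)
assert np.all(np.linalg.eigvalsh(A) > 0)

# Inner product of x and y via their coordinates x_hat, y_hat: <x, y> = x_hat^T A y_hat
x_hat = np.array([1.0, 2.0])
y_hat = np.array([-1.0, 1.0])
print(x_hat @ A @ y_hat)

# The diagonal entries of A are positive, as claimed above
print(np.diag(A))
```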
3.3 Lengths and Distances

Inner products and norms are closely related: any inner product induces a norm

‖x‖ := √⟨x, x⟩   (3.16)

in a natural way, such that we can compute lengths of vectors using the inner product. However, not every norm is induced by an inner product. The Manhattan norm (3.3) is an example of a norm without a corresponding inner product. In the following, we will focus on norms that are induced by inner products and introduce geometric concepts, such as lengths, distances, and angles.
Remark (Cauchy-Schwarz Inequality). For an inner product vector space (V, ⟨·, ·⟩), the induced norm ‖·‖ satisfies the Cauchy-Schwarz inequality

|⟨x, y⟩| ≤ ‖x‖ ‖y‖ .   (3.17)

♦
For x, y ∈ V,

d(x, y) := ‖x − y‖ = √⟨x − y, x − y⟩   (3.21)

is called the distance between x and y. If we use the dot product as the inner product, then the distance is called Euclidean distance. The mapping

d : V × V → R   (3.22)
(x, y) ↦ d(x, y)   (3.23)

is called a metric. A metric d satisfies the following:

1. d is positive definite, i.e., d(x, y) ≥ 0 for all x, y ∈ V and d(x, y) = 0 ⟺ x = y.
2. d is symmetric, i.e., d(x, y) = d(y, x) for all x, y ∈ V.
3. Triangle inequality: d(x, z) ≤ d(x, y) + d(y, z) for all x, y, z ∈ V.
Remark. At first glance, the lists of properties of inner products and metrics look very similar. However, by comparing Definition 3.3 with Definition 3.6, we observe that ⟨x, y⟩ and d(x, y) behave in opposite directions: very similar x and y result in a large value of the inner product and a small value of the metric. ♦
3.4 Angles and Orthogonality

To define the angle between two vectors x, y, assume that x ≠ 0 and y ≠ 0. Then the Cauchy-Schwarz inequality (3.17) implies that

−1 ≤ ⟨x, y⟩ / (‖x‖ ‖y‖) ≤ 1 .   (3.24)

Therefore, there exists a unique ω ∈ [0, π], illustrated in Figure 3.4, with

cos ω = ⟨x, y⟩ / (‖x‖ ‖y‖) .   (3.25)
angle The number ω is the angle between the vectors x and y . Intuitively, the
angle between two vectors tells us how similar their orientations are. For
example, using the dot product, the angle between x and y = 4x, i.e., y
is a scaled version of x, is 0: Their orientation is the same.
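A short numerical sketch of (3.25) with the dot product as the inner product (the clipping guards against tiny floating-point violations of (3.24)); the vectors reproduce the y = 4x example above:

```python
import numpy as np

def angle(x, y):
    """Angle omega in [0, pi] between x and y, cf. (3.25), using the dot product."""
    cos_omega = x @ y / (np.linalg.norm(x) * np.linalg.norm(y))
    return np.arccos(np.clip(cos_omega, -1.0, 1.0))

x = np.array([1.0, 1.0])
print(angle(x, 4 * x))                              # 0.0: y = 4x has the same orientation as x
print(np.degrees(angle(x, np.array([-1.0, 1.0]))))  # 90.0
```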
Two vectors x and y are orthogonal if and only if ⟨x, y⟩ = 0, and we write x ⊥ y. Consider two vectors x = [1, 1]⊤, y = [−1, 1]⊤ ∈ R²; see Figure 3.6. We are interested in determining the angle ω between them using two different inner products. Using the dot product as the inner product yields an angle ω between x and y of 90°, such that x ⊥ y. However, if we choose the inner product

$\langle x, y \rangle = x^\top \begin{bmatrix} 2 & 0 \\ 0 & 1 \end{bmatrix} y ,$   (3.27)

we find that cos ω = −1/3, i.e., ω ≈ 109.5°, so x and y are not orthogonal with respect to this inner product. Hence, vectors that are orthogonal with respect to one inner product do not have to be orthogonal with respect to a different inner product.
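The following sketch redoes this example numerically. For the general inner product ⟨x, y⟩ = x⊤Ay it uses the induced norm ‖x‖ = √(x⊤Ax) in (3.25); the helper function is illustrative, not from the book:

```python
import numpy as np

def angle_general(x, y, A):
    """Angle between x and y with respect to <x, y> = x^T A y."""
    inner = x @ A @ y
    norm_x = np.sqrt(x @ A @ x)
    norm_y = np.sqrt(y @ A @ y)
    return np.arccos(np.clip(inner / (norm_x * norm_y), -1.0, 1.0))

x = np.array([1.0, 1.0])
y = np.array([-1.0, 1.0])
A = np.array([[2.0, 0.0],
              [0.0, 1.0]])  # inner product matrix from (3.27)

print(np.degrees(angle_general(x, y, np.eye(2))))  # 90.0: orthogonal w.r.t. the dot product
print(np.degrees(angle_general(x, y, A)))          # ~109.5: not orthogonal w.r.t. (3.27)
```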
A square matrix A is orthogonal if and only if its columns are orthonormal, so that A⊤A = AA⊤ = I, i.e., A⊤ = A⁻¹. Transformations by orthogonal matrices preserve both angles and distances: ‖Ax‖ = ‖x‖, and

cos ω = (Ax)⊤(Ay) / (‖Ax‖ ‖Ay‖) = x⊤y / (‖x‖ ‖y‖) ,

which gives exactly the angle between x and y. It turns out that orthogonal matrices define transformations that are rotations (with the possibility of flips). In Section 3.9, we will discuss more details about rotations.
3.5 Orthonormal Basis

Consider an n-dimensional vector space V and a basis {b₁, ..., bₙ} of V. If

⟨bᵢ, bⱼ⟩ = 0 for i ≠ j   (3.33)
⟨bᵢ, bᵢ⟩ = 1   (3.34)

for all i, j = 1, ..., n, then the basis is called an orthonormal basis (ONB). If only (3.33) is satisfied, then the basis is called an orthogonal basis. Note that (3.34) implies that every basis vector has length/norm 1.
Recall from Section 2.6.1 that we can use Gaussian elimination to find a
basis for a vector space spanned by a set of vectors. Assume we are given
a set {b̃1 , . . . , b̃n } of non-orthogonal and unnormalized basis vectors. We
concatenate them into a matrix B̃ = [b̃₁, ..., b̃ₙ] and apply Gaussian elimination to the augmented matrix (Section 2.3.2) [B̃B̃⊤ | B̃] to obtain an orthonormal basis. This constructive way to iteratively build an orthonormal basis {b₁, ..., bₙ} is called the Gram-Schmidt process (Strang, 2003).
3.7 Inner Product of Functions

An inner product of two functions u : R → R and v : R → R can be defined as the definite integral

$\langle u, v \rangle := \int_a^b u(x)\, v(x)\, dx$   (3.37)

for lower and upper limits a, b < ∞, respectively. As with our usual inner product, we can define norms and orthogonality by looking at the inner product. If (3.37) evaluates to 0, the functions u and v are orthogonal. To make the preceding inner product mathematically precise, we need to take care of measures and the definition of integrals, leading to the definition of a Hilbert space. Furthermore, unlike inner products on finite-dimensional vectors, inner products on functions may diverge (have infinite value). All this requires diving into some more intricate details of real and functional analysis, which we do not cover in this book.
If we choose u = sin(x) and v = cos(x), the integrand f(x) = sin(x) cos(x) of (3.37) is odd, i.e., f(−x) = −f(x). Therefore, the integral of this product with limits a = −π, b = π evaluates to 0, and sin and cos are orthogonal functions (with respect to this inner product).
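A quick numerical check of this example (a simple Riemann-sum approximation of (3.37) with a = −π, b = π; the grid size is an arbitrary choice):

```python
import numpy as np

a, b = -np.pi, np.pi
xs = np.linspace(a, b, 100_001)
dx = xs[1] - xs[0]

# <u, v> approximated by a Riemann sum of u(x) v(x) over [a, b]
print(np.sum(np.sin(xs) * np.cos(xs)) * dx)  # ~0: sin and cos are orthogonal
print(np.sum(np.sin(xs) * np.sin(xs)) * dx)  # ~pi: sin is not orthogonal to itself
```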
3.8 Orthogonal Projections

Figure 3.9 Orthogonal projection (orange dots) of a two-dimensional dataset (blue dots) onto a one-dimensional subspace (straight line).
Figure 3.10 Examples of projections onto one-dimensional subspaces: (a) projection of x ∈ R² onto a subspace U with basis vector b; (b) projection of a two-dimensional vector x with ‖x‖ = 1 onto a one-dimensional subspace spanned by b.
To find the projection πU(x) = λb of x onto U, we use the condition that x − πU(x) is orthogonal to b, i.e., ⟨x − λb, b⟩ = 0. We can now exploit the bilinearity of the inner product and arrive at

⟨x, b⟩ − λ⟨b, b⟩ = 0 ⟺ λ = ⟨x, b⟩/⟨b, b⟩ = ⟨b, x⟩/‖b‖² .   (3.40)

(With a general inner product, we get λ = ⟨x, b⟩ if ‖b‖ = 1.)
In the last step, we exploited the fact that inner products are symmetric. If we choose ⟨·, ·⟩ to be the dot product, we obtain

λ = b⊤x/(b⊤b) = b⊤x/‖b‖² .   (3.41)

With this coordinate λ, the projection point is

πU(x) = λb = (⟨x, b⟩/‖b‖²) b = (b⊤x/‖b‖²) b ,   (3.42)
where the last equality holds for the dot product only. We can also compute the length of πU(x) by means of Definition 3.1 as

‖πU(x)‖ = ‖λb‖ = |λ| ‖b‖ .   (3.43)

Hence, our projection is of length |λ| times the length of b. This also adds the intuition that λ is the coordinate of πU(x) with respect to the basis vector b that spans our one-dimensional subspace U.
If we use the dot product as an inner product, we get

πU(x) = λb = bλ = b (b⊤x/‖b‖²) = (bb⊤/‖b‖²) x ,   (3.45)

and since πU(x) = Pπ x, we immediately see that

Pπ = bb⊤/‖b‖² .   (3.46)
Note that bb⊤ (and, consequently, Pπ) is a symmetric matrix (of rank 1), and ‖b‖² = ⟨b, b⟩ is a scalar; projection matrices are always symmetric. The projection matrix Pπ projects any vector x ∈ Rⁿ onto the line through the origin with direction b (equivalently, the subspace U spanned by b).
Remark. The projection πU(x) ∈ Rⁿ is still an n-dimensional vector and not a scalar. However, we no longer require n coordinates to represent the projection, but only a single coordinate, λ, if we want to express it with respect to the basis vector b that spans the subspace U. ♦
Let us now choose a particular x and see whether it lies in the subspace spanned by b = [1, 2, 2]⊤, for which the projection matrix is

$P_\pi = \dfrac{bb^\top}{b^\top b} = \dfrac{1}{9}\begin{bmatrix} 1 & 2 & 2 \\ 2 & 4 & 4 \\ 2 & 4 & 4 \end{bmatrix} .$   (3.47)

For x = [1, 1, 1]⊤, the projection is

$\pi_U(x) = P_\pi x = \dfrac{1}{9}\begin{bmatrix} 1 & 2 & 2 \\ 2 & 4 & 4 \\ 2 & 4 & 4 \end{bmatrix}\begin{bmatrix} 1 \\ 1 \\ 1 \end{bmatrix} = \dfrac{1}{9}\begin{bmatrix} 5 \\ 10 \\ 10 \end{bmatrix} \in \operatorname{span}\!\left[\begin{bmatrix} 1 \\ 2 \\ 2 \end{bmatrix}\right] .$   (3.48)
Note that the application of Pπ to πU(x) does not change anything, i.e., Pπ πU(x) = πU(x). This is expected because, according to Definition 3.10, we know that a projection matrix Pπ satisfies Pπ² x = Pπ x for all x.
Remark. With the results from Chapter 4, we can show that πU (x) is an
eigenvector of P π , and the corresponding eigenvalue is 1. ♦
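A minimal sketch that reproduces this example and the two remarks with NumPy (b = [1, 2, 2]⊤ and x = [1, 1, 1]⊤ as above):

```python
import numpy as np

b = np.array([1.0, 2.0, 2.0])
x = np.array([1.0, 1.0, 1.0])

P = np.outer(b, b) / (b @ b)  # projection matrix b b^T / ||b||^2, cf. (3.46), (3.47)
proj = P @ x                  # (1/9) [5, 10, 10], cf. (3.48)

print(proj)
print(np.allclose(P @ proj, proj))        # applying P again changes nothing
print(np.allclose(P @ P, P))              # P is idempotent
print(np.allclose(P @ proj, 1.0 * proj))  # proj is an eigenvector of P with eigenvalue 1
```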
When projecting onto a general subspace U with basis B = [b₁, ..., bₘ], the projection matrix is Pπ = B(B⊤B)⁻¹B⊤ (3.59). The matrix (B⊤B)⁻¹B⊤ is also called the pseudo-inverse of B, which can be computed for non-square matrices B. It only requires that B⊤B is positive definite, which is the case if B is full rank. In practical applications (e.g., linear regression), we often add a "jitter term" εI to B⊤B to guarantee increased numerical stability and positive definiteness.
Remark. The solution for projecting onto general subspaces includes the 1D case as a special case: If dim(U) = 1, then B⊤B ∈ R is a scalar and we can rewrite the projection matrix in (3.59), Pπ = B(B⊤B)⁻¹B⊤, as Pπ = BB⊤/(B⊤B), which is exactly the projection matrix in (3.46). ♦
The corresponding projection error (also called the reconstruction error) is the norm of the difference vector between the original vector and its projection onto U, i.e.,

‖x − πU(x)‖ = ‖[1, −2, 1]⊤‖ = √6 .   (3.63)

To verify the results, we can (a) check whether the displacement vector πU(x) − x is orthogonal to all basis vectors of U, and (b) verify that Pπ = Pπ² (see Definition 3.10).
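The same verification can be scripted for a projection onto a general subspace. The basis B and the vector x below are illustrative choices (picked so that the projection error reproduces the value √6 in (3.63)); the formula Pπ = B(B⊤B)⁻¹B⊤ is the one stated above:

```python
import numpy as np

# Columns of B span a two-dimensional subspace U of R^3
B = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [1.0, 2.0]])
x = np.array([6.0, 0.0, 0.0])

P = B @ np.linalg.inv(B.T @ B) @ B.T  # projection matrix onto U, cf. (3.59)
proj = P @ x

print(np.allclose(B.T @ (x - proj), 0.0))  # (a) displacement orthogonal to all basis vectors
print(np.allclose(P @ P, P))               # (b) P is idempotent
print(np.linalg.norm(x - proj))            # projection error, here sqrt(6), cf. (3.63)
```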
Remark. The projections πU (x) are still vectors in Rn although they lie in
an m-dimensional subspace U ⊆ Rn . However, to represent a projected
vector we only need the m coordinates λ1 , . . . , λm with respect to the
basis vectors b1 , . . . , bm of U . ♦
Remark. In vector spaces with general inner products, we have to pay
attention when computing angles and distances, which are defined by
means of the inner product. ♦
Projections allow us to find approximate solutions to unsolvable linear equation systems: consider a linear system Ax = b without a solution. Recall that this means that b does not lie in the span of A, i.e., the vector b does not lie in the subspace spanned by the columns of A. Given that the linear equation cannot be solved exactly, we can find an approximate solution. The idea is to find the vector in the subspace spanned by the columns of A that is closest to b, i.e., we compute the orthogonal projection of b onto the subspace spanned by the columns of A. This problem arises often in practice, and the solution is called the least-squares solution (assuming the dot product as the inner product) of an overdetermined system. This is discussed further in Section 9.4. Using reconstruction errors (3.63) is one possible approach to derive principal component analysis (Section 10.3).
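A small sketch of this idea (the overdetermined system below is an arbitrary illustrative choice): the least-squares solution can be obtained from the normal equations A⊤Ax = A⊤b or with a library routine, and Ax is then the orthogonal projection of b onto the column space of A:

```python
import numpy as np

# Overdetermined system: 3 equations, 2 unknowns; b is not in the column space of A
A = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [1.0, 2.0]])
b = np.array([6.0, 0.0, 0.0])

x_ls = np.linalg.solve(A.T @ A, A.T @ b)         # normal equations
x_lstsq, *_ = np.linalg.lstsq(A, b, rcond=None)  # equivalent library call

print(np.allclose(x_ls, x_lstsq))  # True
print(A @ x_ls)                    # projection of b onto the column space of A
```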
Remark. We just looked at projections of vectors x onto a subspace U with basis vectors {b₁, ..., bₖ}. If this basis is an ONB, i.e., (3.33) and (3.34) are satisfied, the projection equation (3.58) simplifies greatly to

πU(x) = BB⊤x   (3.65)

since B⊤B = I, with coordinates λ = B⊤x. ♦
Figure 3.12 Gram-Schmidt orthogonalization: (a) original non-orthogonal basis vectors b₁, b₂ of R²; (b) first constructed basis vector u₁ = b₁ and orthogonal projection of b₂ onto span[u₁]; (c) orthogonal basis vectors u₁ and u₂ = b₂ − π_span[u₁](b₂) of R².

The Gram-Schmidt method constructs an orthogonal basis (u₁, ..., uₙ) from any basis (b₁, ..., bₙ) by setting u₁ := b₁ and uₖ := bₖ − π_span[u₁,...,uₖ₋₁](bₖ) for k = 2, ..., n, i.e., by subtracting from bₖ its projection onto the span of the previously constructed vectors.

Consider a basis (b₁, b₂) of R², where

b₁ = [2, 0]⊤ ,  b₂ = [1, 1]⊤ ;   (3.69)
see also Figure 3.12(a). Using the Gram-Schmidt method, we construct an
orthogonal basis (u1 , u2 ) of R2 as follows (assuming the dot product as
the inner product):
u₁ := b₁ = [2, 0]⊤ ,   (3.70)

$u_2 := b_2 - \pi_{\operatorname{span}[u_1]}(b_2) \overset{(3.45)}{=} b_2 - \dfrac{u_1 u_1^\top}{\|u_1\|^2}\, b_2 = \begin{bmatrix} 1 \\ 1 \end{bmatrix} - \begin{bmatrix} 1 & 0 \\ 0 & 0 \end{bmatrix}\begin{bmatrix} 1 \\ 1 \end{bmatrix} = \begin{bmatrix} 0 \\ 1 \end{bmatrix} .$   (3.71)

These steps are illustrated in Figures 3.12(b) and (c). We immediately see that u₁ and u₂ are orthogonal, i.e., u₁⊤u₂ = 0.
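A compact sketch of the Gram-Schmidt iteration with the dot product (the function is an illustrative implementation, not the book's code); applied to (3.69) it reproduces (3.70) and (3.71):

```python
import numpy as np

def gram_schmidt(B):
    """Columns of B form a basis; returns a matrix whose columns are an orthogonal basis."""
    U = np.zeros_like(B, dtype=float)
    for k in range(B.shape[1]):
        u = B[:, k].astype(float)
        for j in range(k):  # subtract the projection of b_k onto each previous u_j
            u -= (U[:, j] @ B[:, k]) / (U[:, j] @ U[:, j]) * U[:, j]
        U[:, k] = u
    return U

B = np.array([[2.0, 1.0],
              [0.0, 1.0]])  # columns are b1, b2 from (3.69)
U = gram_schmidt(B)
print(U)                              # columns [2, 0] and [0, 1], cf. (3.70) and (3.71)
print(U[:, 0] @ U[:, 1])              # 0.0: orthogonal
print(U / np.linalg.norm(U, axis=0))  # normalizing the columns yields an ONB
```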
Figure 3.14 A rotation rotates objects in a plane about the origin. If the rotation angle is positive, we rotate counterclockwise (shown: an original object and the object rotated by 112.5°).
3.9 Rotations
Length and angle preservation, as discussed in Section 3.4, are the two
characteristics of linear mappings with orthogonal transformation matrices. In the following, we will have a closer look at specific orthogonal
transformation matrices, which describe rotations.
A rotation is a linear mapping (more specifically, an automorphism of a Euclidean vector space) that rotates a plane by an angle θ about the origin, i.e., the origin is a fixed point. For a positive angle θ > 0, by common convention, we rotate in a counterclockwise direction. An example is shown in Figure 3.14, where the transformation matrix is

$R = \begin{bmatrix} -0.38 & -0.92 \\ 0.92 & -0.38 \end{bmatrix} .$   (3.74)
Important application areas of rotations include computer graphics and
robotics. For example, in robotics, it is often important to know how to
rotate the joints of a robotic arm in order to pick up or place an object,
see Figure 3.15.
3.9.1 Rotations in R2
1 0
Consider the standard basis e1 = , e2 = of R2 , which defines
0 1
the standard coordinate system in R2 . We aim to rotate this coordinate
system by an angle θ as illustrated in Figure 3.16. Note that the rotated
vectors are still linearly independent and, therefore, are a basis of R2 . This
means that the rotation performs a basis change.
Rotations Φ are linear mappings so that we can express them by a rotation matrix R(θ). Trigonometry (see Figure 3.16) allows us to determine the coordinates of the rotated axes (the image of Φ) with respect to the standard basis in R². We obtain

Φ(e₁) = [cos θ, sin θ]⊤ ,  Φ(e₂) = [−sin θ, cos θ]⊤ .   (3.75)

Therefore, the rotation matrix that performs the basis change into the rotated coordinates R(θ) is given as

$R(\theta) = \begin{bmatrix} \Phi(e_1) & \Phi(e_2) \end{bmatrix} = \begin{bmatrix} \cos\theta & -\sin\theta \\ \sin\theta & \cos\theta \end{bmatrix} .$   (3.76)
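A small sketch of (3.76): the matrix for θ = 112.5° reproduces (3.74) up to rounding, the images of e₁, e₂ reproduce (3.75), and R(θ) is orthogonal, so it preserves lengths and angles:

```python
import numpy as np

def rotation_matrix(theta):
    """2D rotation matrix R(theta), cf. (3.76); theta in radians."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s],
                     [s,  c]])

R = rotation_matrix(np.radians(112.5))
print(np.round(R, 2))                   # [[-0.38, -0.92], [0.92, -0.38]], cf. (3.74)
print(np.allclose(R.T @ R, np.eye(2)))  # orthogonal: R^T R = I

theta = 0.3
R = rotation_matrix(theta)
print(R @ np.array([1.0, 0.0]))  # [cos(theta), sin(theta)], cf. (3.75)
print(R @ np.array([0.0, 1.0]))  # [-sin(theta), cos(theta)], cf. (3.75)
```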
3.9.2 Rotations in R3
In contrast to the R2 case, in R3 we can rotate any two-dimensional plane
about a one-dimensional axis. The easiest way to specify the general rotation matrix is to specify how the images of the standard basis e₁, e₂, e₃ are
supposed to be rotated, and making sure these images Re1 , Re2 , Re3 are
orthonormal to each other. We can then obtain a general rotation matrix
R by combining the images of the standard basis.
To have a meaningful rotation angle, we have to define what "counterclockwise" means when we operate in more than two dimensions. We use the convention that a "counterclockwise" (planar) rotation about an axis refers to a rotation about an axis when we look at the axis "head on, from the end toward the origin". In R³, there are therefore three (planar) rotations, one about each of the three standard basis vectors e₁, e₂, and e₃ (see Figure 3.17).
Figure 3.17 Rotation of a vector (gray) in R³ by an angle θ about the e₃-axis. The rotated vector is shown in blue.
3.9.3 Rotations in n Dimensions

More generally, let V be an n-dimensional Euclidean vector space and Φ : V → V an automorphism with transformation matrix

$R_{ij}(\theta) := \begin{bmatrix} I_{i-1} & 0 & \cdots & \cdots & 0 \\ 0 & \cos\theta & 0 & -\sin\theta & 0 \\ 0 & 0 & I_{j-i-1} & 0 & 0 \\ 0 & \sin\theta & 0 & \cos\theta & 0 \\ 0 & \cdots & \cdots & 0 & I_{n-j} \end{bmatrix} \in \mathbb{R}^{n \times n} ,$   (3.80)

for 1 ≤ i < j ≤ n and θ ∈ R. Then Rij(θ) is called a Givens rotation. Essentially, Rij(θ) is the identity matrix Iₙ with

r_ii = cos θ ,  r_ij = −sin θ ,  r_ji = sin θ ,  r_jj = cos θ .   (3.81)
In two dimensions (i.e., n = 2), we obtain (3.76) as a special case.
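A sketch that builds Rij(θ) as the identity matrix with the four entries from (3.81) overwritten (indices are zero-based here, unlike the 1 ≤ i < j ≤ n convention in the text):

```python
import numpy as np

def givens(n, i, j, theta):
    """n x n Givens rotation in the (i, j) plane, cf. (3.80), (3.81); zero-based i < j."""
    R = np.eye(n)
    c, s = np.cos(theta), np.sin(theta)
    R[i, i], R[j, j] = c, c
    R[i, j], R[j, i] = -s, s
    return R

R = givens(4, 1, 3, np.pi / 6)
print(np.allclose(R.T @ R, np.eye(4)))  # orthogonal, hence a rotation
# n = 2 recovers R(theta) from (3.76):
print(np.allclose(givens(2, 0, 1, 0.7),
                  np.array([[np.cos(0.7), -np.sin(0.7)],
                            [np.sin(0.7),  np.cos(0.7)]])))
```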
3.10 Further Reading

Inner products lie at the core of kernel methods (Schölkopf and Smola, 2002). Kernel methods exploit the fact that many linear algorithms can be expressed purely by inner product computations. Then, the "kernel trick" allows us to compute these inner products implicitly in a (potentially infinite-dimensional) feature space, without even knowing this feature space explicitly. This allowed the "non-linearization" of many algorithms used in machine learning, such as kernel-PCA (Schölkopf et al., 1997) for dimensionality reduction. Gaussian processes (Rasmussen and Williams, 2006) also fall into the category of kernel methods and are the current state of the art in probabilistic regression (fitting curves to data points). The idea of kernels is explored further in Chapter 12.
Projections are often used in computer graphics, e.g., to generate shadows. In optimization, orthogonal projections are often used to (iteratively) minimize residual errors. This also has applications in machine learning, e.g., in linear regression, where we want to find a (linear) function that minimizes the residual errors, i.e., the lengths of the orthogonal projections of the data onto the linear function (Bishop, 2006). We will investigate this further in Chapter 9. PCA (Pearson, 1901; Hotelling, 1933) also uses projections to reduce the dimensionality of high-dimensional data. We will discuss this in more detail in Chapter 10.
Exercises
3.1 Show that ⟨·, ·⟩ defined for all x = [x₁, x₂]⊤ ∈ R² and y = [y₁, y₂]⊤ ∈ R² by

⟨x, y⟩ := x₁y₁ − (x₁y₂ + x₂y₁) + 2 x₂y₂

is an inner product.
3.2 Consider R² with ⟨·, ·⟩ defined for all x and y in R² as

$\langle x, y \rangle := x^\top \underbrace{\begin{bmatrix} 2 & 0 \\ 1 & 2 \end{bmatrix}}_{=:A} y .$

Is ⟨·, ·⟩ an inner product?
3.3 Compute the distance between

x = [1, 2, 3]⊤ ,  y = [−1, −1, 0]⊤

using

a. ⟨x, y⟩ := x⊤y

$\text{b. } \langle x, y \rangle := x^\top A y , \quad A := \begin{bmatrix} 2 & 1 & 0 \\ 1 & 3 & -1 \\ 0 & -1 & 2 \end{bmatrix}$
3.4 Compute the angle between

x = [1, 2]⊤ ,  y = [−1, −1]⊤

using

a. ⟨x, y⟩ := x⊤y

$\text{b. } \langle x, y \rangle := x^\top B y , \quad B := \begin{bmatrix} 2 & 1 \\ 1 & 3 \end{bmatrix}$
3.5 Consider the Euclidean vector space R⁵ with the dot product. A subspace U ⊆ R⁵ and x ∈ R⁵ are given by

$U = \operatorname{span}\!\left[\begin{bmatrix} 0 \\ -1 \\ 2 \\ 0 \\ 2 \end{bmatrix}, \begin{bmatrix} 1 \\ -3 \\ 1 \\ -1 \\ 2 \end{bmatrix}, \begin{bmatrix} -3 \\ 4 \\ 1 \\ 2 \\ 1 \end{bmatrix}, \begin{bmatrix} -1 \\ -3 \\ 5 \\ 0 \\ 7 \end{bmatrix}\right] , \quad x = \begin{bmatrix} -1 \\ -9 \\ -1 \\ 4 \\ 1 \end{bmatrix} .$

a. Determine the orthogonal projection πU(x) of x onto U.
b. Determine the distance d(x, U).
3.8 Using the Gram-Schmidt method, turn the basis B = (b₁, b₂) of a two-dimensional subspace U ⊆ R³ into an ONB C = (c₁, c₂) of U, where

b₁ := [1, 1, 1]⊤ ,  b₂ := [−1, 2, 0]⊤ .
3.10 Rotate the vectors x₁ := [2, 3]⊤ and x₂ := [0, −1]⊤ by 30°.