02 - Math of Pattern Recognition

This document provides an overview of mathematical concepts required to understand the book, including: 1. Linear algebra topics such as vectors, inner products, norms, orthogonality, matrices, matrix multiplication, determinants, eigenvalues, and singular value decomposition. 2. Probability topics like distributions, expectations, variances, independence, and the normal distribution. 3. Optimization and matrix calculus concepts involving local minima, convexity, constrained optimization, and Lagrange multipliers. 4. Algorithm complexity analysis. Exercises are provided at the end to solidify the concepts.


Mathematical background

Jianxin Wu

LAMDA Group
National Key Lab for Novel Software Technology
Nanjing University, China
[email protected]

February 11, 2020

Contents

1 Linear algebra
  1.1 Inner product, norm, distance, and orthogonality
  1.2 Angle and inequality
  1.3 Vector projection
  1.4 Basics of matrices
  1.5 Matrix multiplication
  1.6 Determinant and inverse of a square matrix
  1.7 Eigenvalue, eigenvector, rank, and trace of a square matrix
  1.8 Singular value decomposition
  1.9 Positive (semi-)definite real symmetric matrices

2 Probability
  2.1 Basics
  2.2 Joint and conditional distributions, and Bayes’ theorem
  2.3 Expectation and variance/covariance matrices
  2.4 Inequalities
  2.5 Independence and correlation
  2.6 The normal distribution

3 Optimization and matrix calculus
  3.1 Local minimum, necessary condition, and matrix calculus
  3.2 Convex and concave optimization
  3.3 Constrained optimization and the Lagrange multipliers

4 Complexity of algorithms

Exercises

This chapter provides a brief review of the basic mathematical background that is required for understanding this book. Most of the contents in this chapter can be found in standard undergraduate math textbooks, hence details such as proofs will be omitted.
This book also requires some mathematics that is a little more advanced. We will provide the statements in this chapter, but detailed proofs are again omitted.

1 Linear algebra
We will not consider complex numbers in this book. Hence, what we will deal
with are all real numbers.
Scalar. We use R to denote the set of real numbers. A real number x ∈ R
is also called a scalar.
Vector. A sequence of real numbers form a vector. We use bold face letters
to denote vectors, e.g., x ∈ Rd is a vector formed by a sequence of d real
numbers. We use
x = (x_1, x_2, . . . , x_d)^T

to indicate that x is formed by d numbers in a column shape, and the i-th number in the sequence is a scalar x_i,^1 i.e.,

$$x = \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_d \end{bmatrix} . \quad (1)$$

d is called the length (or dimensionality, or size) of the vector, and the vector is
called a d-dimensional one. We use 1d and 0d to denote d-dimensional vectors
whose elements are all 1 and all 0, respectively. When the vector size is obvious
from its context, we simply write 1 or 0.

1.1 Inner product, norm, distance, and orthogonality


The inner product of two vectors x and y is denoted by x^T y (or x · y, or ⟨x, y⟩, or x′y, or xᵗy; in this book, we will use the notation x^T y). It is also called the dot product. The dot product of two d-dimensional vectors x = (x_1, x_2, . . . , x_d)^T and y = (y_1, y_2, . . . , y_d)^T is defined as

$$x^T y = \sum_{i=1}^{d} x_i y_i . \quad (2)$$

1 The T superscript means the transpose of a matrix, which will be defined soon.

Hence, the inner product is a scalar, and we obviously have

$$x^T y = y^T x . \quad (3)$$

The above fact will sometimes help us in this book—e.g., making transformations:

$$(x^T y)z = z(x^T y) = zx^T y = zy^T x = (zy^T)x , \quad (4)$$

and so on.
The norm of a vector x is denoted by ‖x‖, and defined by

$$\|x\| = \sqrt{x^T x} . \quad (5)$$

Other types of vector norms are available. The specific form in Equation 5 is called the ℓ2 norm. It is also called the length of x in some cases. Note that the norm ‖x‖ and the squared norm x^T x are always non-negative for any x ∈ R^d.
A vector whose length is 1 is called a unit vector. We usually say a unit vector determines a direction. End points of unit vectors reside on the surface of the unit hypersphere in the d-dimensional space whose center is the zero vector 0 and whose radius is 1. A ray from the center to any unit vector uniquely determines a direction in that space, and vice versa. When x = cy and c > 0, we say the two vectors x and y are in the same direction.
The distance between x and y is denoted by ‖x − y‖. A frequently used fact is about the squared distance:

$$\|x - y\|^2 = (x - y)^T (x - y) = \|x\|^2 + \|y\|^2 - 2x^T y . \quad (6)$$

The above equality utilizes the facts that ‖x‖² = x^T x and x^T y = y^T x.
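As a quick numerical check of Equation 6, the identity can be verified with a short sketch (using NumPy, which the text itself does not assume; the vectors below are arbitrary examples):

```python
import numpy as np

# Two arbitrary example vectors in R^4.
x = np.array([1.0, -2.0, 3.0, 0.5])
y = np.array([4.0, 0.0, -1.0, 2.0])

# Left-hand side of Equation 6: the squared distance.
lhs = np.linalg.norm(x - y) ** 2

# Right-hand side: ||x||^2 + ||y||^2 - 2 x^T y.
rhs = np.linalg.norm(x) ** 2 + np.linalg.norm(y) ** 2 - 2 * (x @ y)

print(abs(lhs - rhs))  # numerically zero
```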

1.2 Angle and inequality


If x^T y = 0, we say the two vectors are orthogonal, or perpendicular, also denoted by x ⊥ y. From geometry, we know the angle between these two vectors is 90° or π/2.
Let the angle between vectors x and y be denoted by θ (0 ≤ θ ≤ 180°); then

$$x^T y = \|x\| \|y\| \cos\theta . \quad (7)$$

The above equation in fact defines the angle as

$$\theta = \arccos\left( \frac{x^T y}{\|x\| \|y\|} \right) . \quad (8)$$

Because −1 ≤ cos θ ≤ 1 for any θ, these equations also tell us

$$x^T y \le |x^T y| \le \|x\| \|y\| . \quad (9)$$
If we expand the vector form of this inequality and take the square of both sides, it appears as

$$\left( \sum_{i=1}^{d} x_i y_i \right)^2 \le \left( \sum_{i=1}^{d} x_i^2 \right) \left( \sum_{i=1}^{d} y_i^2 \right) , \quad (10)$$

Figure 1: Illustration of vector projection.

which is the Cauchy–Schwarz inequality.2 The equality holds if and only if there
is a constant c ∈ R such that xi = cyi for all 1 ≤ i ≤ d. In the vector form, the
equality condition is equivalent to x = cy for some constant c.
This inequality (and the equality condition) can be extended to integrals:

$$\left( \int f(x)g(x)\,dx \right)^2 \le \left( \int f^2(x)\,dx \right) \left( \int g^2(x)\,dx \right) , \quad (11)$$

assuming all integrals exist, in which f²(x) means f(x)f(x).
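The discrete form of the Cauchy–Schwarz inequality (Equation 10), including its equality condition, can be checked numerically (a sketch using NumPy and randomly generated vectors, which are not from the text):

```python
import numpy as np

rng = np.random.default_rng(0)

# Check Equation 10 on many random vector pairs.
for _ in range(1000):
    x = rng.standard_normal(5)
    y = rng.standard_normal(5)
    assert (x @ y) ** 2 <= (x @ x) * (y @ y) + 1e-12

# Equality holds when x = c * y for some constant c (here c = 3).
y = rng.standard_normal(5)
x = 3.0 * y
lhs = (x @ y) ** 2
rhs = (x @ x) * (y @ y)
print(np.isclose(lhs, rhs))  # True
```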

1.3 Vector projection


Sometimes we need to compute the projection of one vector onto another. As
illustrated in Figure 1, x is projected onto y (which must be non-zero). Hence,
x is decomposed as
x = x⊥ + z ,
where x⊥ is the projected vector, and z can be considered as the residue (or
error) of the projection. Note that x⊥ ⊥ z.
In order to determine x⊥, we take two steps: finding its direction and its norm separately. This trick will be useful in some other scenarios in this book, too.
For any non-zero vector x, its norm is ‖x‖. Since x = ‖x‖ (x/‖x‖), the vector x/‖x‖ is in the same direction as x and it is also a unit vector. Hence, x/‖x‖ is the

2 The two names in this inequality are Augustin-Louis Cauchy, the famous French mathematician who first published this inequality, and Karl Hermann Amandus Schwarz, a German mathematician. The integral-form generalization of this inequality was by Viktor Yakovlevich Bunyakovsky, a Ukrainian/Russian mathematician.

direction of x. The combination of norm and direction uniquely determines any
vector. The norm alone determines the zero vector.
The direction of y is y/‖y‖. It is obvious that the direction of x⊥ is y/‖y‖ if the angle θ between x and y is acute (< 90°), as illustrated in Figure 1. The norm of x⊥ is also simple:

$$\|x_\perp\| = \|x\| \cos\theta = \|x\| \frac{x^T y}{\|x\| \|y\|} = \frac{x^T y}{\|y\|} . \quad (12)$$

Hence, the projection x⊥ is

$$x_\perp = \frac{x^T y}{\|y\|} \frac{y}{\|y\|} = \frac{x^T y}{y^T y}\, y . \quad (13)$$

Equation 13 is derived assuming θ is acute. However, it is easy to verify that this equation is also correct when the angle is right (= 90°), obtuse (> 90°), or straight (= 180°). The term x^T y / y^T y (which is a scalar) is called the projected value, and (x^T y / y^T y) y is the projected vector, which is also denoted by proj_y x.
Vector projection is very useful in this book. For example, let y = (2, 1)^T and x = (1, 1)^T. The direction of y specifies all the points that possess this property: the first dimension is twice the second dimension. Using Equation 13, we obtain proj_y x = (1.2, 0.6)^T, which also exhibits the same property. We may treat proj_y x as the best approximation of x that satisfies the property specified by y. The residue of this approximation, z = x − proj_y x = (−0.2, 0.4)^T, does not satisfy this property and can be considered as noise or error in certain applications.
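The projection example above can be reproduced in a few lines (a sketch using NumPy; the helper name `proj` is introduced here, not from the text):

```python
import numpy as np

def proj(x, y):
    """Project x onto the direction of y (Equation 13); y must be non-zero."""
    return (x @ y) / (y @ y) * y

y = np.array([2.0, 1.0])
x = np.array([1.0, 1.0])

p = proj(x, y)   # the projected vector
z = x - p        # the residue of the projection

print(p)      # [1.2 0.6], as in the text
print(z @ y)  # zero up to rounding: the residue is orthogonal to y
```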

1.4 Basics of matrices


An m × n matrix contains mn numbers organized in m rows and n columns,
and we use xij (or xi,j ) to denote the element at the i-th row and j-th column
in a matrix X, that is,

$$X = \begin{bmatrix} x_{11} & \cdots & x_{1n} \\ \vdots & \ddots & \vdots \\ x_{m1} & \cdots & x_{mn} \end{bmatrix} . \quad (14)$$

We also use [X]ij to refer to the element at the i-th row and j-th column in a
matrix X.
There are a few special cases. When m = n, we call the matrix a square
matrix. When n = 1, the matrix contains only one column, and we call it a
column matrix, or a column vector, or simply a vector. When m = 1, we call it
a row matrix, or a row vector. Note that when we say x is a vector, we mean a column vector if not otherwise specified. That is, when we write x = (1, 2, 3)^T, we are referring to the column matrix $\begin{bmatrix} 1 \\ 2 \\ 3 \end{bmatrix}$.
There are also a few special cases within square matrices that are worth
noting. In a square matrix X of size n × n, the diagonal entries refer to those

elements x_ij in X satisfying i = j. If x_ij = 0 whenever i ≠ j (i.e., when non-diagonal entries are all 0), we say X is a diagonal matrix. The unit matrix is a special diagonal matrix, whose diagonal entries are all 1. A unit matrix is usually denoted by I (when its size can be inferred from the context) or I_n (indicating the size is n × n).
Following the Matlab convention, we use

X = diag(x11 , x22 , . . . , xnn )

to denote an n × n diagonal matrix whose diagonal entries are x11 , x22 , . . . , xnn ,
sequentially. Similarly, for an n × n square matrix X, diag(X) is a vector
(x11 , x22 , . . . , xnn )T .
The transpose of a matrix X is denoted by X T , and is defined by

[X T ]ji = xij .

X^T has size n × m if X is m × n. When X is square, X^T and X have the same size. If in addition X^T = X, then we say X is a symmetric matrix.

1.5 Matrix multiplication


Addition and subtraction can be applied to matrices with the same size. Let X
and Y be two matrices with size m × n, then

[X + Y ]ij = xij + yij , (15)


[X − Y ]ij = xij − yij , (16)

for any 1 ≤ i ≤ m, 1 ≤ j ≤ n. For any matrix X and a scalar c, the scalar multiplication cX is defined by

$$[cX]_{ij} = c\,x_{ij} .$$

Not any two matrices can be multiplied. The multiplication XY exists (i.e.,
is well defined) if and only if the number of columns in X equals the number of
rows in Y —i.e., when there are positive integers m, n, and p such that the size
of X is m × n and the size of Y is n × p. The product XY is a matrix of size
m × p, and is defined by

$$[XY]_{ij} = \sum_{k=1}^{n} x_{ik} y_{kj} . \quad (17)$$

When XY is well defined, we always have

(XY )T = Y T X T .

Note that Y X does not necessarily exist when XY exists. Even if both XY
and Y X are well-defined, XY 6= Y X except in a few special cases. However, for
any matrix X, both XX T and X T X exist and both are symmetric matrices.

Let x = (x1 , x2 , . . . , xm )T and y = (y1 , y2 , . . . , yp )T be two vectors. Treating
vectors as special matrices, the dimensions of x and y T satisfy the multiplication
constraint, such that xy T always exists for any x and y. We call xy T the
outer product between x and y. This outer product is an m × p matrix, and
[xy T ]ij = xi yj . Note that in general xy T 6= yxT .
The block matrix representation is sometimes useful. Let x_{i:} denote the i-th row of X (of size 1 × n), and x_{:i} the i-th column (of size m × 1); we can write X either in a column format,

$$X = \begin{bmatrix} x_{1:} \\ x_{2:} \\ \vdots \\ x_{m:} \end{bmatrix} , \quad (18)$$

or in a row format,

$$X = [x_{:1} \,|\, x_{:2} \,|\, \dots \,|\, x_{:n}] . \quad (19)$$

Using the block matrix representation, we have

$$XY = [x_{:1} \,|\, x_{:2} \,|\, \dots \,|\, x_{:n}] \begin{bmatrix} y_{1:} \\ y_{2:} \\ \vdots \\ y_{n:} \end{bmatrix} = \sum_{i=1}^{n} x_{:i}\, y_{i:} . \quad (20)$$

That is, the product XY is the summation of n outer products (between x_{:i} and y_{i:}^T, which are both column vectors). X and Y^T have the same number of columns; if we compute the outer product of their corresponding columns, we get n matrices of size m × p, and the summation of these matrices equals XY.
Similarly, we also have

$$XY = \begin{bmatrix} x_{1:} \\ x_{2:} \\ \vdots \\ x_{m:} \end{bmatrix} \left[\, y_{:1} \,|\, y_{:2} \,|\, \dots \,|\, y_{:p} \,\right] . \quad (21)$$

This block representation tells us [XY]_{ij} = x_{i:}\, y_{:j}, which is exactly Equation 17.
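Both block views of matrix multiplication can be checked numerically (a sketch using NumPy with randomly generated matrices, which are not from the text):

```python
import numpy as np

rng = np.random.default_rng(1)
m, n, p = 4, 3, 5
X = rng.standard_normal((m, n))
Y = rng.standard_normal((n, p))

# Equation 20: XY equals the sum of n outer products between the
# columns of X and the rows of Y.
outer_sum = sum(np.outer(X[:, i], Y[i, :]) for i in range(n))
print(np.allclose(X @ Y, outer_sum))  # True

# Equation 17 / 21: each entry [XY]_ij is the inner product of
# row i of X with column j of Y.
print(np.isclose((X @ Y)[1, 2], X[1, :] @ Y[:, 2]))  # True
```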
For a square matrix X and a natural number k, the k-th power of X is well defined, as

$$X^k = \underbrace{XX \cdots X}_{k \text{ times}} .$$

1.6 Determinant and inverse of a square matrix


There are many ways to define the determinant of a square matrix, and we adopt
Laplace’s formula to define it recursively. The determinant of X is usually

denoted by det(X) or simply |X|, and is a scalar. Note that although the
| · | symbol looks like the absolute value operator, its meaning is different. The
determinant could be positive, zero, or negative, while absolute values are always
non-negative.
Given an n × n square matrix X, by removing its i-th row and j-th column, we obtain an (n − 1) × (n − 1) matrix, and the determinant of this matrix is called the (i, j)-th minor of X, denoted by M_ij. Then, Laplace's formula states that

$$|X| = \sum_{j=1}^{n} (-1)^{i+j} x_{ij} M_{ij} \quad (22)$$

for any 1 ≤ i ≤ n. Similarly,

$$|X| = \sum_{i=1}^{n} (-1)^{i+j} x_{ij} M_{ij}$$

for any 1 ≤ j ≤ n. For a scalar (i.e., a 1 × 1 matrix), the determinant is the scalar itself. Hence, this recursive formula can be used to define the determinant of any square matrix.
It is easy to prove that

$$|X| = |X^T|$$

for any square matrix X, and

$$|XY| = |X| |Y|$$

when the product is well-defined. For a scalar c and an n × n matrix X,

$$|cX| = c^n |X| .$$

For a square matrix X, if there exists another matrix Y such that XY = Y X = I, then we say Y is the inverse of X, denoted by X^{−1}. When the inverse of X exists, we say X is invertible. X^{−1} is of the same size as X. If X^{−1} exists, then its transpose (X^{−1})^T is abbreviated as X^{−T}.
The following statement is useful for determining whether X is invertible or not:

$$X \text{ is invertible} \iff |X| \ne 0 . \quad (23)$$

In other words, a square matrix is invertible if and only if its determinant is non-zero.
Assuming both X and Y are invertible, XY exists, and c is a non-zero scalar, we have the following properties.
• X^{−1} is also invertible, and (X^{−1})^{−1} = X;
• (cX)^{−1} = (1/c) X^{−1};
• (XY)^{−1} = Y^{−1} X^{−1}; and,
• X^{−T} = (X^{−1})^T = (X^T)^{−1}.

1.7 Eigenvalue, eigenvector, rank, and trace of a square
matrix
For a square matrix A, if there exist a non-zero vector x and a scalar λ such
that
Ax = λx ,
we say λ is an eigenvalue of A and x is an eigenvector of A (associated with this eigenvalue λ). An n × n real square matrix has n eigenvalues, although some of them may be equal to each other. The eigenvalues and eigenvectors of a real square matrix, however, may contain complex numbers.
Eigenvalues have connections with the diagonal entries and the determinant of A. Denote the n eigenvalues by λ_1, λ_2, . . . , λ_n; the following equations hold (even if the eigenvalues are complex numbers):

$$\sum_{i=1}^{n} \lambda_i = \sum_{i=1}^{n} a_{ii} , \quad (24)$$

$$\prod_{i=1}^{n} \lambda_i = |A| . \quad (25)$$

The latter equation shows that a square matrix is invertible if and only if all of its eigenvalues are non-zero. The summation of all eigenvalues ($\sum_{i=1}^{n} \lambda_i$) has a special name: the trace. The trace of a square matrix X is denoted by tr(X). Now we know

$$\mathrm{tr}(X) = \sum_{i=1}^{n} x_{ii} . \quad (26)$$
If we assume all matrix multiplications are well-defined, we have
tr(XY ) = tr(Y X) . (27)
Applying this rule, we can easily derive
tr(XY Z) = tr(ZXY ) = tr(Y ZX)
and many other similar results.
The rank of a square matrix X is denoted by rank(X); for a diagonalizable matrix (in particular, for any real symmetric matrix), it equals the number of non-zero eigenvalues of X.
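The cyclic property of the trace (Equation 27) and the rank statement for symmetric matrices can be checked numerically (a sketch using NumPy; the rank-2 matrix below is constructed as BBᵀ with a random 4 × 2 matrix B, an example chosen here, not from the text):

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.standard_normal((3, 5))
Y = rng.standard_normal((5, 3))

# Equation 27: tr(XY) = tr(YX), even though XY is 3x3 while YX is 5x5.
print(np.isclose(np.trace(X @ Y), np.trace(Y @ X)))  # True

# For a real symmetric matrix, rank = number of non-zero eigenvalues.
# A = B B^T has rank 2 by construction (B has 2 independent columns).
B = rng.standard_normal((4, 2))
A = B @ B.T
num_nonzero = int(np.sum(np.abs(np.linalg.eigvalsh(A)) > 1e-8))
print(num_nonzero == np.linalg.matrix_rank(A) == 2)  # True
```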
If X is also symmetric, then the properties of its eigenvalues and eigenvec-
tors are a lot nicer. Given any n × n real symmetric matrix X, the following
statements are true.
• All the eigenvalues of X are real numbers, hence can be sorted. We will
denote the eigenvalues of an n × n real symmetric matrix as λ1 , λ2 , . . . , λn ,
and assume λ1 ≥ λ2 ≥ · · · ≥ λn — i.e., they are sorted in descending order.
• All the eigenvectors of X only contain real values. We will denote the
eigenvectors as ξ 1 , ξ 2 , . . . , ξ n , and ξ i is associated with λi —i.e., the eigen-
vectors are also sorted according to their associated eigenvalues. The
eigenvectors are normalized—i.e., kξ i k = 1 for any 1 ≤ i ≤ n.

• The eigenvectors satisfy (for 1 ≤ i, j ≤ n)

$$\xi_i^T \xi_j = \begin{cases} 1 & \text{if } i = j \\ 0 & \text{otherwise} \end{cases} . \quad (28)$$

That is, the n eigenvectors form an orthonormal basis of R^n. Let E be an n × n matrix whose i-th column is ξ_i, that is,

$$E = [\xi_1 \,|\, \xi_2 \,|\, \dots \,|\, \xi_n] .$$

Then, Equation 28 is equivalent to

$$EE^T = E^T E = I . \quad (29)$$

• rank(E) = n, because E is an orthogonal matrix. It is also easy to see that |E| = ±1 and E^{−1} = E^T.
• If we define a diagonal matrix Λ = diag(λ_1, λ_2, . . . , λ_n), then the eigendecomposition of X is

$$X = E \Lambda E^T . \quad (30)$$

• The eigendecomposition can also be written in an equivalent form as

$$X = \sum_{i=1}^{n} \lambda_i\, \xi_i \xi_i^T , \quad (31)$$

which is called the spectral decomposition. The spectral decomposition says that the matrix X equals a weighted sum of n matrices, each being the outer product between one eigenvector and itself, weighted by the corresponding eigenvalue.
We will also encounter the generalized eigenvalue problem. Let A and B
be two square matrices (and we assume they are real symmetric in this book).
Then, a vector x and a scalar λ that satisfy

Ax = λBx

are called the generalized eigenvector and generalized eigenvalue of A and B, respectively. The generalized eigenvectors, however, are usually not normalized, and are not orthogonal.

1.8 Singular value decomposition


Eigendecomposition is related to the Singular Value Decomposition (SVD). We
will briefly introduce a few facts for real matrices.
Let X be an m × n matrix; the SVD of X is

$$X = U \Sigma V^T , \quad (32)$$

where U is an m × m matrix, Σ is an m × n matrix whose non-diagonal elements
are all 0, and V is an n × n matrix.
If there are a scalar σ and two vectors u ∈ R^m and v ∈ R^n (both unit vectors) that satisfy the following two equalities simultaneously:

$$Xv = \sigma u \quad \text{and} \quad X^T u = \sigma v , \quad (33)$$

we say σ is a singular value of X, and u and v are its associated left- and right-singular vectors, respectively.
If (σ, u, v) satisfy the above equation, so does (−σ, −u, v). To remove this ambiguity, the singular value is required to be non-negative (i.e., σ ≥ 0).
The SVD finds all singular values and singular vectors. The columns of U are called the left-singular vectors of X, and the columns of V are the right-singular vectors. The matrices U and V are orthogonal. The diagonal entries in Σ are the corresponding singular values.
Because XX^T = (UΣV^T)(VΣ^T U^T) = UΣΣ^T U^T and ΣΣ^T is diagonal, we get that the left-singular vectors of X are the eigenvectors of XX^T; similarly, the right-singular vectors of X are the eigenvectors of X^T X; and the non-zero singular values of X (the non-zero diagonal entries in Σ) are the square roots of the non-zero eigenvalues of XX^T and X^T X. A by-product is that the non-zero eigenvalues of XX^T and X^T X are exactly the same.
This connection is helpful. When m ≫ n (e.g., n = 10 but m = 100 000), finding the eigenvectors of XX^T directly requires the eigendecomposition of a 100 000 × 100 000 matrix, which is infeasible or at least very inefficient. However, we can compute the SVD of X. The squared positive singular values and the left-singular vectors are the positive eigenvalues and their associated eigenvectors of XX^T. The same trick also works well when n ≫ m, in which case the right-singular vectors are useful in finding the eigendecomposition of X^T X.
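The trick can be demonstrated numerically (a sketch using NumPy; m = 2000 stands in for the 100 000 of the text so the example runs quickly, and the thin SVD is used so that U is only m × n):

```python
import numpy as np

rng = np.random.default_rng(5)
m, n = 2000, 10  # m >> n
X = rng.standard_normal((m, n))

# Thin SVD of X: U is m x n, s holds the singular values (descending).
U, s, Vt = np.linalg.svd(X, full_matrices=False)

# Eigendecomposition of the small n x n matrix X^T X instead of the
# huge m x m matrix XX^T; reorder eigenvalues to descending.
lam = np.linalg.eigvalsh(X.T @ X)[::-1]

# Non-zero eigenvalues of X^T X (and XX^T) are the squared singular values.
print(np.allclose(lam, s ** 2))  # True

# Columns of U are eigenvectors of XX^T: (XX^T) u = sigma^2 u,
# computed without ever forming the m x m matrix XX^T.
u = U[:, 0]
print(np.allclose(X @ (X.T @ u), s[0] ** 2 * u))  # True
```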

1.9 Positive (semi-)definite real symmetric matrices


We only consider real symmetric matrices and real vectors in this section, although the definitions of positive definite and positive semi-definite matrices are wider than that.
An n × n matrix A is positive definite if for any non-zero real vector x (i.e., x ∈ R^n and x ≠ 0),

$$x^T A x > 0 . \quad (34)$$

We say A is positive semi-definite if x^T Ax ≥ 0 holds for any x. The fact that matrix A is positive definite can be abbreviated as “A is PD” or in mathematical notation as A ≻ 0. Similarly, “A is positive semi-definite” is equivalent to A ⪰ 0 or “A is PSD.”
The term

$$x^T A x = \sum_{i=1}^{n} \sum_{j=1}^{n} x_i x_j a_{ij} \quad (35)$$

is a real quadratic form, which will be frequently used in this book.

There is a simple connection between eigenvalues and PD (PSD) matri-
ces. A real symmetric matrix is PD/PSD if and only if all its eigenvalues are
positive/non-negative.
One type of PSD matrix we will use frequently is of the form AA^T or A^T A, in which A is any real matrix. The proof is pretty simple: because

$$x^T A A^T x = (A^T x)^T (A^T x) = \|A^T x\|^2 \ge 0 ,$$

AA^T is PSD; similarly, A^T A is also PSD.


Now, for a PSD real symmetric matrix, we sort its eigenvalues as λ1 ≥ λ2 ≥
· · · ≥ λn ≥ 0, in which the final ≥ 0 relationship is always true for PSD matrices,
but not for all real symmetric matrices.
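Both PSD criteria above—the quadratic-form definition and the eigenvalue characterization—can be checked on a matrix of the form AAᵀ (a sketch using NumPy; A is a random example, and small negative tolerances absorb floating-point rounding):

```python
import numpy as np

rng = np.random.default_rng(6)
A = rng.standard_normal((4, 6))
M = A @ A.T  # of the form AA^T, hence PSD

# 1) Quadratic-form check: x^T M x >= 0 for many random x.
ok = all(x @ M @ x >= -1e-10 for x in rng.standard_normal((1000, 4)))
print(ok)  # True

# 2) Eigenvalue check: a real symmetric matrix is PSD iff all of its
#    eigenvalues are non-negative.
print(np.all(np.linalg.eigvalsh(M) >= -1e-10))  # True
```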

2 Probability
A random variable is usually denoted by an upper case letter, such as X. A random variable is a variable that can take a value from a finite or infinite set. To keep things simple, we refrain from using the measure-theoretic definition of random variables and probabilities. We will use the terms random variable and distribution interchangeably.

2.1 Basics
If a random variable X can take a value from a finite or countably infinite set,
it is called a discrete random variable. Suppose the outcome of a particular
trial is either success or failure, and the chance of success is p (0 ≤ p ≤ 1).
When multiple trials are tested, the chance of success in any one trial is not
affected by any other trial (i.e., the trials are independent). Then, we denote
the number of trials that are required till we see the first successful outcome
as X. X is a random variable, which can take its value from the countably
infinite set {1, 2, 3, . . . }, hence is a discrete random variable. We say X follows
a geometric distribution with parameter p.3
A random variable is different from a usual variable: it may take different values with different likelihoods (or probabilities). Hence, a random variable is a function rather than a variable whose value can be fixed. Let the set E = {x_1, x_2, x_3, . . . } denote all values a discrete random variable X can possibly take. We call each x_i an event. The number of events should be either finite or countably infinite, and the events are mutually exclusive; that is, if an event x_i happens, then any other event x_j (j ≠ i) cannot happen in the same trial. Hence, the probability that either one of two events (say x_1 or x_2) happens equals the sum of the probabilities of the two events:

$$\Pr(X = x_1 \,||\, X = x_2) = \Pr(X = x_1) + \Pr(X = x_2) ,$$


3 Another definition of the geometric distribution defines it as the number of failures before the first success, with possible values {0, 1, 2, . . . }.

in which Pr(·) means the probability and || is the logical or. The summation
rule can be extended to a countable number of elements.
A discrete random variable is determined by a probability mass function (p.m.f.) p(X). A p.m.f. is specified by the probability of each event: Pr(X = x_i) = c_i (c_i ∈ R), and it is a valid p.m.f. if and only if

$$c_i \ge 0 \ (\forall\, x_i \in E) \quad \text{and} \quad \sum_{x_i \in E} c_i = 1 . \quad (36)$$

For the geometric distribution, we have Pr(X = 1) = p, Pr(X = 2) = (1 − p)p, and in general c_i = (1 − p)^{i−1} p. Since $\sum_{i=1}^{\infty} c_i = 1$, this is a valid p.m.f. And, Pr(X ≤ 2) = Pr(X = 1) + Pr(X = 2) = 2p − p². The function

$$F(x) = \Pr(X \le x) \quad (37)$$

is called the cumulative distribution function (c.d.f. or CDF).
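The geometric p.m.f. and the value Pr(X ≤ 2) = 2p − p² can be checked directly (a sketch in plain Python; p = 0.3 is an arbitrary example, and the infinite sum is truncated since its tail (1 − p)^N is negligible):

```python
import math

p = 0.3  # an arbitrary example success probability

def pmf(i, p):
    """Geometric p.m.f.: c_i = (1 - p)^(i - 1) * p, for i = 1, 2, 3, ..."""
    return (1 - p) ** (i - 1) * p

# The p.m.f. sums to 1 (truncated sum; the neglected tail is tiny).
total = sum(pmf(i, p) for i in range(1, 200))
print(abs(total - 1) < 1e-12)  # True

# Pr(X <= 2) = p + (1 - p) p = 2p - p^2, as in the text.
cdf2 = pmf(1, p) + pmf(2, p)
print(math.isclose(cdf2, 2 * p - p ** 2))  # True
```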


If the set of possible values E is infinite and uncountable (most likely R or a subset of it), and in addition Pr(X = x) = 0 for each possible x ∈ E, then we say X is a continuous random variable or continuous distribution.4
As in the discrete case, the c.d.f. of X is still

$$F(x) = \Pr(X \le x) = \Pr(X < x) ,$$

where the additional equality follows from Pr(X = x) = 0. The continuous counterpart of the discrete p.m.f. is called the probability density function (p.d.f.) p(x), which should satisfy

$$p(x) \ge 0 \quad \text{and} \quad \int_{-\infty}^{\infty} p(x)\,dx = 1 \quad (38)$$

to make p(x) a valid p.d.f. In this book, we assume a continuous c.d.f. is differentiable; then

$$p(x) = F'(x) , \quad (39)$$

where F′ means the derivative of F.
The c.d.f. measures the accumulation of probability in both discrete and continuous domains. The p.d.f. p(x) (which is the derivative of the c.d.f.) measures the rate of accumulation of probability—in other words, how dense X is at x. Hence, the higher p(x) is, the larger the probability Pr(x − ε ≤ X ≤ x + ε) (but not a larger Pr(X = x), which is always 0).
A few statements about c.d.f. and p.d.f.:

• F(x) is non-decreasing, and

$$F(-\infty) \triangleq \lim_{x \to -\infty} F(x) = 0 ,$$
4 In fact, if Pr(X = x) = 0 for all x ∈ E, then E cannot be finite or countable.

$$F(\infty) \triangleq \lim_{x \to \infty} F(x) = 1 ,$$

in which ≜ means “defined as”. This property is true for both discrete and continuous distributions.

• $\Pr(a \le X \le b) = \int_a^b p(x)\,dx = F(b) - F(a)$.
• Although the p.m.f. is always between 0 and 1, the p.d.f. can be any
non-negative value.
• If a continuous X only takes values in the range E = [a, b], we can still
say that E = R, and let the p.d.f. p(x) = 0 for x < a or x > b.
When there is more than one random variable, we will use a subscript to distinguish them—e.g., p_Y(y) or p_X(x). If Y is a continuous random variable and g is a fixed function (i.e., no randomness in the computation of g) and is monotonic, then X = g(Y) is also a random variable, and the two p.d.f.s are related by

$$p_Y(y) = p_X(x) \left| \frac{dx}{dy} \right| = p_X(g(y))\, |g'(y)| , \quad (40)$$

in which | · | is the absolute value function.
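Equation 40 can be illustrated with a concrete example (a sketch using NumPy; the uniform/square setup below is an assumption made for illustration): take Y uniform on (0, 1), so p_Y(y) = 1, and X = g(Y) = Y². A standard computation gives p_X(x) = 1/(2√x) on (0, 1), and plugging into Equation 40 should recover the constant density 1.

```python
import numpy as np

# Y ~ Uniform(0,1), so p_Y(y) = 1 on (0,1); X = g(Y) = Y^2.
# The density of X is p_X(x) = 1 / (2 sqrt(x)) on (0,1).
def p_X(x):
    return 1.0 / (2.0 * np.sqrt(x))

def g(y):
    return y ** 2

def g_prime(y):
    return 2.0 * y

# Equation 40: p_Y(y) = p_X(g(y)) * |g'(y)| should equal 1 everywhere.
ys = np.linspace(0.1, 0.9, 9)
recovered = p_X(g(ys)) * np.abs(g_prime(ys))
print(np.allclose(recovered, 1.0))  # True
```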

2.2 Joint and conditional distributions, and Bayes’ theo-


rem
In many situations we need to consider two or more random variables simul-
taneously. For example, let A be the age and I be the annual income in year
2016 (in RMB, using 10 000 as a step) for a person in China. Then, the joint
CDF Pr(A ≤ a, I ≤ i) is the percentage of people in China whose age is not
larger than a years old and whose income is not higher than i RMB in the
year 2016. If we denote the random vector X = (A, I)T and x = (30, 80 000)T ,
then F (x) = Pr(X ≤ x) = Pr(A ≤ 30, I ≤ 80 000) defines the c.d.f. of a joint
distribution. This definition also applies to any number of random variables.
The joint distribution can be discrete (if all random variables in it are discrete),
continuous (if all random variables in it are continuous), or hybrid (if both dis-
crete and continuous random variables exist). We will not deal with hybrid
distributions in this book.
For the discrete case, a multidimensional p.m.f. p(x) requires p(x) ≥ 0 for any x and $\sum_x p(x) = 1$. For the continuous case, we require the p.d.f. p(x) to satisfy p(x) ≥ 0 for any x and $\int p(x)\,dx = 1$.
It is obvious that for a discrete p.m.f.,

$$p(x) = \sum_y p(x, y)$$

when x and y are two random vectors (and one or both can be random variables—i.e., 1-dimensional random vectors). In the continuous case,

$$p(x) = \int_y p(x, y)\,dy .$$

The distributions obtained by summing or integrating one or more random
variables out are called marginal distributions. The summation is taken over all
possible values of y (the variable that is summed or integrated out).
Note that in general
p(x, y) 6= p(x)p(y) .
For example, let us guess Pr(A = 3) = 0.04 and Pr(I = 80 000) = 0.1—i.e.,
in China the percentage of people aged 3 is 4%, and people with 80 000 yearly
income is 10%; then, Pr(A = 3) Pr(I = 80 000) = 0.004. However, we would
expect Pr(A = 3, I = 80 000) to be almost 0: how many 3-year-old babies have
80 000 RMB yearly income?
In a random vector, if we know the value of one random variable for a particular example (or sample, or instance, or instantiation), it will affect our estimate of the other random variable(s) in that sample. In the hypothetical age–income example, if we know I = 80 000, then we know A = 3 is almost impossible for the same individual. Our estimate of the age will change to a new one when we know the income, and this new distribution is called the conditional distribution. We use x|Y = y to denote the random vector (distribution) of x conditioned on Y = y, and use p(x|Y = y) to denote the conditional p.m.f. or
p.d.f. For conditional distributions, we have

$$p(x|y) = \frac{p(x, y)}{p(y)} , \quad (41)$$

$$p(y) = \int_x p(y|x)\,p(x)\,dx . \quad (42)$$

In the discrete case, the $\int$ in Equation 42 is changed to $\sum$; this equation is called the law of total probability. Putting these two together, we get Bayes’ theorem:5

$$p(x|y) = \frac{p(y|x)\,p(x)}{p(y)} = \frac{p(y|x)\,p(x)}{\int_x p(y|x)\,p(x)\,dx} . \quad (43)$$

Let y be some random vector we can observe (or measure), and x be the random variables that we cannot directly observe but want to estimate or predict. Then, the known values of y are “evidence” that will make us update our estimate (“belief”) of x. Bayes’ theorem provides a mathematically precise way to perform such updates, and we will use it frequently in this book.
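As an illustration, Bayes’ theorem can be applied to a small discrete example (a sketch in plain Python; the binary “disease/test” setup and all the numbers below are hypothetical, chosen only for illustration):

```python
# Hypothetical discrete example of Bayes' theorem (Equation 43):
# x is an unobserved binary state ("has the condition"),
# y is an observed binary test result ("positive").
p_x = {True: 0.01, False: 0.99}          # prior p(x): assumed prevalence
p_y_given_x = {True: 0.95, False: 0.05}  # p(y = positive | x): assumed rates

# Law of total probability (Equation 42, discrete form).
p_y = sum(p_y_given_x[x] * p_x[x] for x in (True, False))

# Posterior p(x = True | y = positive), via Equation 43.
posterior = p_y_given_x[True] * p_x[True] / p_y

print(round(p_y, 4))        # 0.059
print(round(posterior, 3))  # 0.161
```

Even though the test is accurate, the posterior is small because the prior is small—exactly the kind of belief update Equation 43 formalizes.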

2.3 Expectation and variance/covariance matrices


The expectation (or mean, or average, or expected value) of a random vector X is denoted by E[X] (or EX, or E(X), or 𝔼(X), etc.), and is computed as

$$E[X] = \int_x p(x)\,x\,dx , \quad (44)$$
5 It is also called Bayes’ rule or Bayes’ law, named after Thomas Bayes, a famous British statistician and philosopher.

i.e., a weighted sum of x, where the weights are the p.d.f. or p.m.f. (the $\int$ changes to $\sum$ in the discrete case). Note that the expectation is an ordinary scalar or vector, which is not affected by randomness anymore (or at least not the randomness related to X). Two obvious properties of expectations are
• E[X + Y] = E[X] + E[Y]; and,
• E[cX] = cE[X] for a scalar c.
The expectation concept can be generalized. Let g(·) be a function; then g(X) is a random vector, and its expectation is

$$E[g(X)] = \int_x p(x)\,g(x)\,dx . \quad (45)$$

Similarly, g(X)|Y is also a random vector, and its expectation (the conditional expectation) is

$$E[g(X)|Y = y] = \int_x p(x|Y = y)\,g(x)\,dx . \quad (46)$$

We can also write

$$h(y) = E[g(X)|Y = y] . \quad (47)$$

Note that the expectation E[g(X)|Y = y] does not depend on X, because X is integrated (or summed) out. Hence, h(y) is an ordinary function of y, which is not affected by the randomness caused by X anymore.
Now, in Equation 45 we can specify

$$g(x) = (x - E[X])^2 ,$$

in which E[X] is a completely determined scalar (if the p.m.f. or p.d.f. of X is known) and g(x) is thus not affected by randomness. The expectation of this particular choice is called the variance (if X is a random variable) or the covariance matrix (if X is a random vector) of X:

$$\mathrm{Var}(X) = E[(X - E[X])^2] \quad \text{or} \quad \mathrm{Cov}(X) = E[(X - E[X])(X - E[X])^T] . \quad (48)$$

When X is a random variable, this expectation is called the variance and is denoted by Var(X). Variance is a scalar (which is always non-negative), and its square root is called the standard deviation of X, denoted by σ_X.
When X is a random vector, this expectation is called the covariance matrix and is denoted by Cov(X). The covariance matrix of a d-dimensional random vector is a d × d real symmetric matrix, and is always positive semi-definite.
For a random variable X, it is easy to prove the following useful formula:

Var(X) = E[X²] − (E[X])² .  (49)

Hence, the variance is the difference between two terms: the expectation of
the squared random variable and the square of the mean. Because variance is
non-negative, we always have

E[X²] ≥ (E[X])² .

A similar formula holds for random vectors:

Cov(X) = E[XXᵀ] − E[X]E[X]ᵀ .  (50)
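Equations 49 and 50 are easy to verify numerically. The following sketch (assuming NumPy is available; all names are our own) compares the two sides of Equation 50 on a large sample:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100000, 2))  # samples of a 2-d random vector

# Sample versions of E[X X^T] and E[X] E[X]^T (right-hand side of Eq. 50)
mean = X.mean(axis=0)
second_moment = (X.T @ X) / len(X)
cov_via_formula = second_moment - np.outer(mean, mean)

# Direct covariance estimate E[(X - E X)(X - E X)^T] (left-hand side)
centered = X - mean
cov_direct = (centered.T @ centered) / len(X)

print(np.max(np.abs(cov_via_formula - cov_direct)))  # ~1e-16: the two agree
```

The two estimates agree up to floating-point rounding, since the identity holds exactly whenever the sample mean is computed from the same sample.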

In a complex expectation involving multiple random variables or random
vectors, we may specify for which random variable (or vector) we want to compute
the expectation by adding a subscript. For example, E_X[g(X, Y)] computes
the expectation of g(X, Y) with respect to X.
One final note about expectation is that an expectation may not exist—e.g.,
if the integration or summation is undefined. The standard Cauchy distribution
provides an example.6 The p.d.f. of the standard Cauchy distribution is defined
as

p(x) = 1/(π(1 + x²)) .  (51)

Since ∫_{−∞}^{∞} 1/(π(1 + x²)) dx = (1/π)(arctan(∞) − arctan(−∞)) = 1 and p(x) ≥ 0, this is
a valid p.d.f. The expectation, however, does not exist, because the defining integral is
the sum of two infinite values of opposite signs, which is not well-defined in
mathematical analysis.
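This non-existence is visible in simulation: the running mean of Cauchy samples never settles down, unlike that of a normal sample. A small illustrative sketch, assuming NumPy:

```python
import numpy as np

rng = np.random.default_rng(42)
n = 100000

cauchy = rng.standard_cauchy(n)   # standard Cauchy samples
normal = rng.normal(size=n)       # standard normal samples, for contrast

# Running means: the normal one converges to 0 by the law of large numbers;
# the Cauchy one keeps jumping because the expectation does not exist
# (the sample mean of n Cauchy variables is itself standard Cauchy).
running_cauchy = np.cumsum(cauchy) / np.arange(1, n + 1)
running_normal = np.cumsum(normal) / np.arange(1, n + 1)

print(running_normal[-1])  # close to 0
print(running_cauchy[-1])  # essentially arbitrary, even after 100000 samples
```

The heavy tails are the culprit: a single extreme Cauchy sample can dominate the entire running mean at any point.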

2.4 Inequalities
If we are asked to estimate a probability Pr(a ≤ X ≤ b) but know nothing about
X, the best we can say is: it is between 0 and 1 (including both ends). That
is, if there is no information in, there is no information out—the estimation is
valid for any distribution and is not useful at all.
If we know more about X, we can say more about probabilities involving
X. Markov’s inequality states that if X is a non-negative random variable (or,
Pr(X < 0) = 0) and a > 0 is a scalar, then

Pr(X ≥ a) ≤ E[X]/a ,  (52)
assuming the mean is finite.7
Chebyshev’s inequality depends on both the mean and the variance. For a
random variable X, if its mean is finite and its variance is non-zero, then for
any scalar k > 0,

Pr(|X − E[X]| ≥ kσ) ≤ 1/k² ,  (53)

in which σ = √Var(X) is the standard deviation of X.8
There is also a one-tailed version of Chebyshev’s inequality, which states
that for k > 0,
Pr(X − E[X] ≥ kσ) ≤ 1/(1 + k²) .  (54)
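Both inequalities can be checked by Monte Carlo simulation. The sketch below (assuming NumPy; the exponential distribution with mean 1 is our own arbitrary choice) compares the bounds against the actual tail probabilities:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.exponential(scale=1.0, size=1000000)  # non-negative, mean 1, variance 1

a, k = 3.0, 2.0
mean, std = x.mean(), x.std()

# Markov (Equation 52): Pr(X >= a) <= E[X] / a, since X is non-negative
markov_bound = mean / a
markov_actual = np.mean(x >= a)

# Chebyshev (Equation 53): Pr(|X - E X| >= k sigma) <= 1 / k^2
cheb_bound = 1 / k**2
cheb_actual = np.mean(np.abs(x - mean) >= k * std)

print(markov_actual, "<=", markov_bound)  # ~0.050 <= ~0.333
print(cheb_actual, "<=", cheb_bound)      # ~0.050 <= 0.25
```

Both bounds hold but are loose for this distribution (the true tail Pr(X ≥ 3) is e⁻³ ≈ 0.0498), which is the price of making no further assumptions about X.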
6 The Cauchy distribution is named, once again, after Augustin-Louis Cauchy.
7 This inequality is named after Andrey (Andrei) Andreyevich Markov, a famous Russian
mathematician.
8 This inequality is named after Pafnuty Lvovich Chebyshev, another Russian mathematician.

2.5 Independence and correlation
Two random variables X and Y are independent if and only if the joint c.d.f.
FX,Y and the marginal c.d.f. FX and FY satisfy

FX,Y (x, y) = FX (x)FY (y) (55)

for any x and y; or equivalently, if and only if the p.d.f. satisfies

fX,Y (x, y) = fX (x)fY (y) . (56)

When X and Y are independent, knowing the distribution of X does not give
us any information about Y , and vice versa; in addition, E[XY ] = E[X]E[Y ].
When X and Y are not independent, we say they are dependent.
Another concept related to independence (or dependence) is correlatedness
(or uncorrelatedness). Two random variables are said to be uncorrelated if their
covariance is zero and correlated if their covariance is nonzero. The covariance
between two random variables X and Y is defined as

Cov(X, Y ) = E[XY ] − E[X]E[Y ] , (57)

which measures the level of linear relationship between them.


The range of Cov(X, Y) is not bounded. A proper normalization converts it
to a closed interval. Pearson's correlation coefficient is denoted by
ρ_{X,Y} or corr(X, Y),9 and is defined as

ρ_{X,Y} = corr(X, Y) = Cov(X, Y)/(σ_X σ_Y) = (E[XY] − E[X]E[Y])/(σ_X σ_Y) .  (58)

The range of Pearson’s correlation coefficient is [−1, +1]. When the correlation
coefficient is +1 or −1, X and Y are related by a perfect linear relationship
X = cY + b; when the correlation coefficient is 0, they are uncorrelated.
When X and Y are random vectors (m- and n-dimensional, respectively),
Cov(X, Y) is an m × n covariance matrix, defined as

Cov(X, Y) = E[(X − E[X])(Y − E[Y])ᵀ]  (59)
          = E[XYᵀ] − E[X]E[Y]ᵀ .  (60)

Note that when X = Y , we get the covariance matrix of X (cf. Equation 50).
Independence is a much stronger condition than uncorrelatedness:

X and Y are independent =⇒ X and Y are uncorrelated. (61)


X and Y are uncorrelated ⇏ X and Y are independent.  (62)
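A classic illustration of Equation 62 is Y = X² with X uniform on [−1, 1]: Y is a deterministic function of X, yet Cov(X, Y) = E[X³] − E[X]E[X²] = 0. A numerical sketch (assuming NumPy):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=1000000)
y = x**2  # Y is a deterministic function of X: clearly dependent on X

# Sample covariance via Equation 57: E[XY] - E[X]E[Y] = E[X^3] - E[X]E[X^2]
cov = np.mean(x * y) - x.mean() * y.mean()
print(cov)  # close to 0: uncorrelated despite full dependence

# The dependence shows up in higher moments: E[X^2 Y] != E[X^2] E[Y]
print(np.mean(x**2 * y), np.mean(x**2) * y.mean())  # ~0.2 vs ~0.111
```

Covariance only measures the linear part of a relationship; here the relationship is purely quadratic, so the covariance misses it entirely.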
9 It is named after Karl Pearson, a famous British mathematician and biostatistician.

2.6 The normal distribution
Among all distributions, the normal distribution is probably the most widely
used. A random variable X follows a normal distribution if its p.d.f. is in the
form of

p(x) = (1/(√(2π) σ)) exp(−(x − µ)²/(2σ²)) ,  (63)

for some µ ∈ R and σ² > 0. We can denote this as X ∼ N(µ, σ²) or p(x) =
N(x; µ, σ²). A normal distribution is also called a Gaussian distribution.10
Note that the parameters that determine a normal distribution are (µ, σ²), not
(µ, σ).
A d-dimensional random vector is jointly normal (or has a multivariate normal
distribution) if its p.d.f. is in the form of

p(x) = (2π)^{−d/2} |Σ|^{−1/2} exp(−(1/2)(x − µ)ᵀ Σ⁻¹ (x − µ)) ,  (64)

for some µ ∈ Rᵈ and symmetric positive definite matrix Σ (so that Σ⁻¹ exists), where
| · | denotes the determinant of a matrix. We can write this distribution as X ∼ N(µ, Σ) or
p(x) = N(x; µ, Σ).
Examples of the normal p.d.f. are shown in Figure 2. Figure 2a is a normal
distribution with µ = 0 and σ 2 = 1, and Figure 2b is a 2-dimensional normal
distribution with µ = 0 and Σ = I2 .
The expectations of the single- and multivariate normal distributions are µ and
µ, respectively. Their variance and covariance matrix are σ² and Σ, respectively.
Hence, µ and σ² (not σ) are the counterparts of µ and Σ, respectively.
We might remember p(x) = (1/(√(2π) σ)) exp(−(x − µ)²/(2σ²)) very well, but be less
familiar with the multivariate version p(x). However, the 1D distribution can help
us remember the more complex multivariate p.d.f. If we rewrite the univariate
normal density in the equivalent form
 
p(x) = (2π)^{−1/2} (σ²)^{−1/2} exp(−(1/2)(x − µ)ᵀ (σ²)⁻¹ (x − µ))  (65)

and change the dimensionality from 1 to d, the variance σ² to the covariance
matrix Σ (with determinant |Σ|), x to x, and the mean µ to µ, we get
exactly the multivariate p.d.f.

p(x) = (2π)^{−d/2} |Σ|^{−1/2} exp(−(1/2)(x − µ)ᵀ Σ⁻¹ (x − µ)) .  (66)

The Gaussian distribution has many nice properties, some of which can be
found in Chapter 13, the chapter devoted to the properties of normal distribu-
tions. One particularly useful property is: if X and Y are jointly Gaussian and
X and Y are uncorrelated, then they are independent.
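Equation 66 can be transcribed directly into code. The sketch below (assuming NumPy; `normal_pdf` is our own name) evaluates the density and recovers the familiar peak values 1/√(2π) ≈ 0.399 in 1D and 1/(2π) ≈ 0.159 in 2D:

```python
import numpy as np

def normal_pdf(x, mu, sigma):
    """Multivariate normal density, a direct transcription of Equation 66."""
    d = len(mu)
    diff = x - mu
    # solve(sigma, diff) computes Sigma^{-1} (x - mu) without forming the inverse
    return ((2 * np.pi) ** (-d / 2)
            * np.linalg.det(sigma) ** (-0.5)
            * np.exp(-0.5 * diff @ np.linalg.solve(sigma, diff)))

mu = np.zeros(2)
sigma = np.eye(2)

# At the mean of a standard 2-d normal the density is 1/(2*pi) ~ 0.159,
# the peak height of the surface shown in Figure 2b.
print(normal_pdf(mu, mu, sigma))
```

Setting d = 1 with σ² = 1 recovers the 1D peak 1/√(2π), which is the consistency between Equations 63 and 66 described above.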
10 It is named after Johann Carl Friedrich Gauss, a very influential German mathematician.

[Figure 2 has two panels: (a) a 1D normal, plotting p(x) against x; (b) a 2D normal, plotting p(x₁, x₂) against (x₁, x₂).]

Figure 2: Probability density function of example normal distributions.

3 Optimization and matrix calculus
Optimization will be frequently encountered in this book. However, details of
optimization principles and techniques are beyond the scope of this book. We
will touch only a little bit on this huge topic in this chapter.
Informally speaking, given a cost (or objective) function f(x): D ↦ R, the
purpose of mathematical optimization is to find an x⋆ in the domain D such
that f(x⋆) ≤ f(x) for any x ∈ D. This type of optimization problem is called
a minimization problem, and is usually denoted as

min_{x∈D} f(x) .  (67)

A solution x⋆ that makes f(x) reach its minimum value is called a minimizer
of f, and is denoted as

x⋆ = arg min_{x∈D} f(x) .  (68)

Note that the minimizers of a minimization objective can form a (possibly
infinite) set of values rather than a single point. For example, the set of minimizers
of min_{x∈R} sin(x) contains infinitely many numbers: −π/2 + 2nπ for any
integer n. In many applications, however, we are satisfied with any one of the
minimizers.
In contrast, an optimization problem can also maximize a function
f(x), denoted as max_{x∈D} f(x). The maximizers are similarly denoted as x⋆ =
arg max_{x∈D} f(x). However, since maximizing f(x) is equivalent to minimizing
−f(x), we often talk only about minimization problems.

3.1 Local minimum, necessary condition, and matrix calculus
An x⋆ ∈ D that satisfies f(x⋆) ≤ f(x) for all x ∈ D is called a global minimum.
However, global minima are difficult to find in many (complex) optimization
problems. In such cases, we are usually satisfied with a local minimum.
A local minimum, in layman's language, is some x that leads to the smallest
objective value in its local neighborhood. In mathematics, x⋆ is a local minimum
if it belongs to the domain D and there exists some radius r > 0 such that for
all x ∈ D satisfying ‖x − x⋆‖ ≤ r, we always have f(x⋆) ≤ f(x).
There is one commonly used criterion for determining whether a particular
point x is a candidate for being a minimizer of f (x). If f is differentiable, then
∂f/∂x = 0  (69)
is a necessary condition for x to be a local minimum (or a local maximum). In
other words, for x to be either a minimum or a maximum point, the gradient
at that point should be an all-zero vector. Note that this is only a necessary
condition, and it may not be sufficient: we do not know whether an x satisfying this
gradient test is a maximizer, a minimizer, or a saddle point (neither a maximum
nor a minimum). Points with an all-zero gradient are also called stationary points
or critical points.
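A small numerical sketch (the function f(x) = x³ − 3x is our own example) shows why the gradient test alone cannot separate minima from maxima: both x = ±1 are stationary, and the second derivative is what tells them apart:

```python
def f(x):
    return x**3 - 3 * x   # f'(x) = 3x^2 - 3 vanishes at x = -1 and x = 1

def grad(x, h=1e-6):
    # Central finite-difference approximation to df/dx
    return (f(x + h) - f(x - h)) / (2 * h)

def second(x, h=1e-4):
    # Central finite-difference approximation to the second derivative
    return (f(x + h) - 2 * f(x) + f(x - h)) / h**2

# Both points pass the gradient test, but only the second-derivative
# test reveals which is a local minimum and which is a local maximum.
for x in (-1.0, 1.0):
    kind = "local min" if second(x) > 0 else "local max"
    print(x, grad(x), kind)   # gradient ~0 at both; max at -1, min at +1
```

For f(x) = x³ alone, x = 0 would also pass the gradient test while being neither a minimum nor a maximum, which is exactly the saddle-point case mentioned above.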
The gradient ∂f/∂x is defined in all undergraduate mathematics texts when x ∈ R,
i.e., a scalar variable; in this case, the gradient is simply the derivative df/dx. The
gradient ∂f/∂x for multivariate functions, however, is rarely covered in those textbooks.
Such gradients are defined via matrix calculus as partial derivatives.
For vectors x, y, scalars x, y, and a matrix X, the matrix forms are defined
elementwise as

(∂x/∂y)ᵢ = ∂xᵢ/∂y ,  (70)
(∂x/∂y)ᵢ = ∂x/∂yᵢ ,  (71)
(∂x/∂y)ᵢⱼ = ∂xᵢ/∂yⱼ  (which is a matrix) ,  (72)
(∂y/∂X)ᵢⱼ = ∂y/∂xᵢⱼ .  (73)

Using these definitions, it is easy to calculate some gradients (partial derivatives),
e.g.,

∂(xᵀy)/∂x = y ,  (74)
∂(aᵀXb)/∂X = abᵀ .  (75)

However, for more complex gradients, for example those involving the matrix
inverse, eigenvalues, or the matrix determinant, the solutions are not obvious. We
recommend The Matrix Cookbook, which lists many useful results, e.g.,

∂ det(X)/∂X = det(X) X⁻ᵀ .  (76)
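Analytic gradients such as Equations 74 and 76 can be spot-checked against central finite differences; the sketch below (assuming NumPy; the helper names are ours) does this for both:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4
x = rng.normal(size=d)
y = rng.normal(size=d)

def num_grad(f, x, h=1e-6):
    """Central finite-difference gradient of a scalar function f at x."""
    g = np.zeros_like(x)
    for i in range(len(x)):
        e = np.zeros_like(x)
        e[i] = h
        g[i] = (f(x + e) - f(x - e)) / (2 * h)
    return g

# Equation 74: the gradient of x^T y with respect to x is y
g = num_grad(lambda v: v @ y, x)

# Equation 76: the gradient of det(X) with respect to X is det(X) X^{-T}
X = rng.normal(size=(3, 3))
analytic = np.linalg.det(X) * np.linalg.inv(X).T
numeric = np.zeros((3, 3))
h = 1e-6
for i in range(3):
    for j in range(3):
        E = np.zeros((3, 3))
        E[i, j] = h
        numeric[i, j] = (np.linalg.det(X + E) - np.linalg.det(X - E)) / (2 * h)

print(np.max(np.abs(g - y)))               # tiny: matches Equation 74
print(np.max(np.abs(numeric - analytic)))  # tiny: matches Equation 76
```

This kind of finite-difference check is a standard way to validate hand-derived (or Cookbook-derived) gradients before using them in an optimization routine.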

3.2 Convex and concave optimization


Some functions have nicer properties than others in the optimization realm. For
example, when f (x) is a convex function whose domain is Rd , any local mini-
mum is also a global minimum. More generally, a convex minimization problem
is to minimize a convex objective on a convex set. In convex minimization, any
local minimum must also be a global minimum.
In this book, we only consider subsets of Rd . If S ⊆ Rd , then S is a convex
set if for any x ∈ S, y ∈ S and 0 ≤ λ ≤ 1,

λx + (1 − λ)y ∈ S

always holds. In other words, if we pick any two points from a set S and the line
segment connecting them falls entirely inside S, then S is convex. For example,
in the 2-D space, all points inside a circle form a convex set, but the set of points
outside that circle is not convex.

Figure 3: Illustration of a simple convex function.
A function f (whose domain is S) is convex if for any x and y in S and any
λ (0 ≤ λ ≤ 1), we have

f(λx + (1 − λ)y) ≤ λf(x) + (1 − λ)f(y) .

For example, f(x) = x² is a convex function: if we pick any two points on its curve
with x-coordinates a < b, the line segment that connects them lies above the
f(x) = x² curve in the range (a, b), as illustrated in Figure 3.
If f is a convex function, we say that −f is a concave function. Any local
maximum of a concave function (on a convex domain) is also a global maximum.
Jensen’s inequality shows that the constraints in the convex function defini-
tion can be extended to an arbitrary number of points. Let f (x) be a convex
function defined on a convex set S, and x1 , x2 , . . . , xn are points inP
S. Then, for
n
weights w1 , w2 , . . . , wn satisfying wi ≥ 0 (for all 1 ≤ i ≤ n) and i=1 wi = 1,

23
Jensen’s inequality states that
n
! n
X X
f wi xi ≤ wi f (xi ) . (77)
i=1 i=1

If f(x) is a concave function defined on a convex set S, we have

f(Σᵢ₌₁ⁿ wᵢ xᵢ) ≥ Σᵢ₌₁ⁿ wᵢ f(xᵢ) .  (78)

If we assume wᵢ > 0 for all i, equality holds if and only if x₁ = x₂ = · · · = xₙ
or f is a linear function. If wᵢ = 0 for some i, we can remove these wᵢ and their
corresponding xᵢ and apply the equality condition again.
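Jensen's inequality is easy to test numerically; the sketch below (assuming NumPy; the choice f = exp and the random weights are our own) checks Equation 77 and the equality case where all points coincide:

```python
import numpy as np

rng = np.random.default_rng(0)

f = np.exp                 # exp is convex on the whole real line
n = 5
x = rng.normal(size=n)     # arbitrary points
w = rng.uniform(size=n)
w = w / w.sum()            # non-negative weights summing to 1

lhs = f(np.dot(w, x))      # f(sum_i w_i x_i)
rhs = np.dot(w, f(x))      # sum_i w_i f(x_i)
print(lhs <= rhs)          # True, by Jensen's inequality (Equation 77)

# Equality when all the points coincide, even though exp is not linear
x_equal = np.full(n, 2.0)
print(np.isclose(f(np.dot(w, x_equal)), np.dot(w, f(x_equal))))  # True
```

Because exp is strictly convex, the first inequality is strict whenever the xᵢ are not all equal, matching the equality condition stated above.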
A twice differentiable function is convex (concave) if its second derivative is
non-negative (non-positive). For example, ln(x) is concave on (0, ∞) because
(ln x)″ = −1/x² < 0. Similarly, f(x) = x⁴ is convex because its second derivative
12x² ≥ 0. The same convexity test applies on a convex subset of R, e.g., an
interval [a, b].
For a scalar-valued function involving many variables, if it is continuous and
twice differentiable, its second-order partial derivatives form a square matrix,
called the Hessian matrix, or simply the Hessian. Such a function is convex if
the Hessian is positive semi-definite. For example, f(x) = xᵀAx is convex if A
is positive semi-definite.
A function f is strictly convex if for any x ≠ y in a convex domain S and
any λ (0 < λ < 1), we have f(λx + (1 − λ)y) < λf(x) + (1 − λ)f(y).11
A twice differentiable function is strictly convex if its second derivative is
positive (or its Hessian is positive definite in the multivariate case). For example,
f(x) = x² is strictly convex, but any linear function is not. Hence, the equality
condition for Jensen's inequality applied to a strictly convex function is xᵢ = c
whenever wᵢ > 0, in which c is a fixed vector.
For more treatments on convex functions and convex optimization, Convex
Optimization is an excellent textbook and reference.

3.3 Constrained optimization and the Lagrange multipliers
Sometimes, beyond the objective f(x), we also require the variables x to satisfy
some constraints. For example, we may require that x has unit length (a
requirement that will appear quite frequently later in this book). For x = (x₁, x₂)ᵀ and the
domain D = R², a concrete example is

min f(x) = vᵀx  (79)
s.t. xᵀx = 1 ,  (80)
11 Please pay attention to the three changes in this definition (compared to the convex function definition): x ≠ y, (0, 1) instead of [0, 1], and < instead of ≤.

in which v = (1, 2)ᵀ is a constant vector and "s.t." means "subject to," which
specifies a constraint on x. There can be more than one constraint, and a
constraint can also be an inequality.
Let us focus on equality constraints for now. A minimization problem with
only equality constraints is

min f(x)  (81)
s.t. g₁(x) = 0 ,  (82)
 · · ·
gₘ(x) = 0 .  (83)
The method of Lagrange multipliers is a good tool to deal with this kind of
problem.12 This method defines a Lagrange function (or Lagrangian) as
L(x, λ) = f(x) − λᵀg(x) ,  (84)

in which λ = (λ₁, λ₂, . . . , λₘ)ᵀ are the m Lagrange multipliers, with the i-th
Lagrange multiplier λᵢ associated with the i-th constraint gᵢ(x) = 0; and we
use g(x) to denote (g₁(x), g₂(x), . . . , gₘ(x))ᵀ, the values of all m constraints.
Then, L is an unconstrained optimization objective, and
∂L
= 0, (85)
∂x
∂L
=0 (86)
∂λ
are necessary conditions for (x, λ) to be a stationary point of L(x, λ). Note that
the domain of the Lagrange multipliers is Rᵐ, i.e., without any restriction.
Hence, we can also change the minus sign (−) to the plus sign (+) in the
Lagrangian.
The method of Lagrange multipliers states that if x₀ is a stationary point
of the original constrained optimization problem, there always exists a λ₀ such
that (x₀, λ₀) is also a stationary point of the unconstrained objective L(x, λ).
In other words, we can use Equations 85 and 86 to find all stationary points of
the original problem.
If we move back to our example at the beginning of this section, its Lagrangian is

L(x, λ) = vᵀx − λ(xᵀx − 1) .  (87)

Setting ∂L/∂x = 0 leads to v = 2λx, and setting ∂L/∂λ = 0 recovers the original
constraint xᵀx = 1.
Because v = 2λx, we have ‖v‖² = vᵀv = 4λ²xᵀx = 4λ². Hence, |λ| = ‖v‖/2,
and the stationary point is x = v/(2λ).
Since v = (1, 2)ᵀ in our example, we have λ² = ‖v‖²/4 = 5/4, i.e., λ = ±√5/2.
Thus, f(x) = vᵀx = 2λxᵀx = 2λ. Hence, min f(x) = −√5 and max f(x) = √5.
The minimizer is −(1/√5)(1, 2)ᵀ, and the maximizer is (1/√5)(1, 2)ᵀ.
12 This method is named after Joseph-Louis Lagrange, an Italian mathematician and astronomer.

These solutions are easily verified. Applying the Cauchy–Schwarz inequality,
we know |f(x)| = |vᵀx| ≤ ‖v‖‖x‖ = √5. That is, −√5 ≤ f(x) ≤ √5, and
equality is obtained when v = cx for some constant c. Because ‖v‖ = √5 and
‖x‖ = 1, we know c = ±√5, and we get the maximum and minimum points as above.
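The same answer can be recovered numerically by parameterizing the unit circle as x = (cos t, sin t), which turns the constrained problem into a 1-D search (a brute-force sketch assuming NumPy; not how one would solve larger problems):

```python
import numpy as np

v = np.array([1.0, 2.0])

# On the unit circle x = (cos t, sin t), so the constrained problem becomes
# an unconstrained 1-d minimization over t; a fine grid suffices as a check.
t = np.linspace(0, 2 * np.pi, 100001)
values = v[0] * np.cos(t) + v[1] * np.sin(t)   # f(x) = v^T x on the circle

print(values.min())   # ~ -sqrt(5) ~ -2.2361
print(values.max())   # ~ +sqrt(5)

# The grid minimizer agrees with -v / ||v|| from the Lagrange analysis
i = values.argmin()
print(np.cos(t[i]), np.sin(t[i]))  # ~ (-1/sqrt(5), -2/sqrt(5))
```

The parameterization trick only works because the constraint set here is one-dimensional; the method of Lagrange multipliers handles the general case.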
The handling of inequality constraints is more complex than that of equality
constraints, and involves duality, saddle points, and duality gaps. The method of
Lagrange multipliers can be extended to handle these cases, but that is beyond the
scope of this book. Interested readers are referred to the Convex Optimization
book for more details.

4 Complexity of algorithms
In the next chapter, we will discuss how modern algorithms and systems require
a lot of computing and storage resources, in terms of the number of instructions
to be executed by the CPU and GPU, or the amount of data that needs to be stored
in main memory or on the hard disk. Of course, we prefer algorithms whose resource
consumption (i.e., running time or storage complexity) is low.
In the theoretical analysis of an algorithm’s complexity, we are often inter-
ested in how fast the complexity grows when the input size gets larger. The
unit of such complexity, however, is usually variable. For example, when the
complexity analysis is based on a specific algorithm’s pseudocode, and we are
interested in how many arithmetic operations are involved when the input size
is 100, the running time complexity might evaluate to 50,000 arithmetic oper-
ations. If instead we are interested in the number of CPU instructions that
are executed, the same algorithm may have a complexity of 200,000 in terms of
CPU instructions.
The big-O notation (O) is often used to analyze the theoretical complexity of
algorithms, which measures how the running time or storage requirement grows
when the size of the input increases. Note that the input size may be measured
by more than one number—e.g., by both the number of training examples n
and the length of the feature vector d.
When the input size is a single number n and the complexity is f (n), we
say this algorithm’s complexity is O(g(n)) if and only if there exist a positive
constant M and an input size n0 , such that when n ≥ n0 , we always have

f (n) ≤ M g(n) . (88)

We assume both f and g are positive in the above equation. With a slight abuse
of notation, we can write the complexity as

f (n) = O(g(n)) . (89)

Informally speaking, Equation 89 states that f (n) grows at most as fast as g(n)
after the problem size n is large enough.
An interesting observation can be made from Equation 88. If this equation
holds, then we should have f (n) ≤ cM g(n) when c > 1, too. That is, when

c > 1, f (n) = O(g(n)) implies f (n) = O(cg(n)). In other words, a positive
constant scalar will not change the complexity result in the big O notation. A
direct consequence of this observation is that we do not need to be very careful
in deciding the unit for our complexity result.
However, in a specific application, different constant scalars might have very
different impacts. Both f₁(n) = 2n² and f₂(n) = 20n² are O(n²) in the big-O
notation, but their running speeds may differ by a factor of 10, and this speed
variation makes a big difference in real-world systems.
The big O notation can be generalized to scenarios when there are more
variables involved in determining the input size. For example, the first pattern
recognition system we introduce (cf. Chapter 3) has a complexity f (n, d) =
O(nd) when there are n training examples and each training example is d-
dimensional. This notation means that there exist numbers n0 and d0 , and a
positive constant M , such that when n ≥ n0 and d ≥ d0 , we always have

f (n, d) ≤ M nd . (90)

The generalization to more than two variables is trivial.
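The definition can be checked mechanically by counting operations instead of measuring wall-clock time (which is noisy). The sketch below (a hypothetical O(nd) loop of our own) confirms that doubling n or d doubles the work:

```python
def count_ops(n, d):
    """Count the inner-loop operations of a hypothetical O(nd) algorithm,
    e.g., computing distances from one query to n d-dimensional examples."""
    ops = 0
    for _ in range(n):        # loop over training examples
        for _ in range(d):    # loop over feature dimensions
            ops += 1
    return ops

base = count_ops(1000, 50)
print(count_ops(2000, 50) / base)   # 2.0: doubling n doubles the work
print(count_ops(1000, 100) / base)  # 2.0: doubling d doubles the work
```

Exact proportionality holds here because the hypothetical algorithm does a fixed amount of work per (example, dimension) pair; real implementations also carry lower-order terms and constant factors that the big-O notation deliberately ignores.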

Exercises
1. Let x = (√3, 1)ᵀ and y = (1, √3)ᵀ be two vectors, and x⊥ be the projection
of x onto y.
(a) What is the value of x⊥ ?
(b) Prove that y ⊥ (x − x⊥ ).
(c) Draw a graph to illustrate the relationship between these vectors.
(d) Prove that for any λ ∈ R, ‖x − x⊥‖ ≤ ‖x − λy‖. (Hint: the geometry
of these vectors suggests that ‖x − x⊥‖² + ‖x⊥ − λy‖² = ‖x − λy‖².)
2. Let X be a 5 × 5 real symmetric matrix, whose eigenvalues are 1, 1, 3, 4,
and x.
(a) Define a necessary and sufficient condition on x such that X is a
positive definite matrix.
(b) If det(X) = 72, what is the value of x?
3. Let x be a d-dimensional random vector, and x ∼ N (µ, Σ).
(a) Let p(x) be the probability density function for x. What is the equa-
tion that defines p(x)?
(b) Write down the equation that defines ln p(x).
(c) If you have access to The Matrix Cookbook, which equation will you
use to help you derive ∂ ln p(x)/∂µ? What is your result?
(d) Similarly, if we treat Σ⁻¹ as the variables (rather than Σ), which equation
(or equations) will you use to help you derive ∂ ln p(x)/∂Σ⁻¹, and what is the
result?
4. (Schwarz inequality) Let X and Y be two random variables (discrete or
continuous) such that E[XY] exists. Prove that

(E[XY])² ≤ E[X²] E[Y²] .

5. Prove the following equality and inequalities.


(a) Starting from the definition of covariance matrix for a random vector
X, prove
Cov(X) = E[XX T ] − E[X]E[X]T .
(b) Let X and Y be two random variables. Prove that for any constant
u ∈ R and v ∈ R,
Cov(X, Y ) = Cov(X + u, Y + v) .

(c) Let X and Y be two random variables (discrete or continuous). Prove


that the correlation coefficient ρX,Y satisfies
−1 ≤ ρX,Y ≤ 1 .

6. Answer the following questions related to the exponential distribution.
(a) Calculate the expectation and variance of the exponential distribution,
whose p.d.f. is p(x) = βe^{−βx} for x ≥ 0 and p(x) = 0 for x < 0 (in which β > 0).
(b) What is the c.d.f. of this distribution?
(c) (Memoryless property) Let X denote the continuous exponential ran-
dom variable. Prove that for any a > 0 and b > 0,
Pr(X ≥ a + b|X ≥ a) = Pr(X ≥ b) .

(d) Assume X, the lifetime of a light bulb (in hours), follows the exponential
distribution with β = 10⁻³. What is its expected lifetime? If a particular
light bulb has worked for 2000 hours, what is the expectation of its remaining
lifetime?
7. Suppose X is a random variable following the exponential distribution,
whose probability density function is p(x) = 3e^{−3x} for x ≥ 0 and p(x) = 0
for x < 0.
(a) What is the value of E[X] and Var(X)? Just give the results, no
derivation is needed.
(b) Can we apply Markov’s inequality to this distribution? If the answer
is yes, what is the estimate for Pr(X ≥ 1)?
(c) Can we apply Chebyshev’s inequality? If the answer is yes, what is
the estimate for Pr(X ≥ 1)?
(d) The one-sided (or one-tailed) Chebyshev inequality states that if
E[X] and Var(X) both exist, then for any positive number a > 0, we have

Pr(X ≥ E[X] + a) ≤ Var(X)/(Var(X) + a²)  and  Pr(X ≤ E[X] − a) ≤ Var(X)/(Var(X) + a²) .

Apply this inequality to estimate Pr(X ≥ 1).


(e) What is the exact value for Pr(X ≥ 1)?
(f) Compare the four values: estimate based on Markov’s inequality, esti-
mate based on Chebyshev’s inequality, estimate based on one-sided Cheby-
shev inequality, and the true value; what conclusion do you get?
8. Let A be a d × d real symmetric matrix, whose eigenvalues are sorted and
denoted by λ₁ ≥ λ₂ ≥ · · · ≥ λd. The eigenvector associated with λᵢ is ξᵢ.
All the eigenvectors form an orthogonal matrix E, whose i-th column is
ξᵢ. If we denote Λ = diag(λ₁, λ₂, . . . , λd), we have A = EΛEᵀ.
(a) For any non-zero vector x ≠ 0, the term

xᵀAx / (xᵀx)

is called the Rayleigh quotient, denoted by R(x, A). Prove that for any
c ≠ 0,

R(x, A) = R(cx, A) .

(b) Show that

max_{x≠0} xᵀAx/(xᵀx) = max_{xᵀx=1} xᵀAx .

(c) Show that any unit-norm vector x (i.e., ‖x‖ = 1) can be expressed as
a linear combination of the eigenvectors, as x = Ew, or equivalently,

x = Σᵢ₌₁ᵈ wᵢ ξᵢ ,

where w = (w₁, w₂, . . . , wd)ᵀ, with ‖w‖ = 1.


(d) Prove that

max_{xᵀx=1} xᵀAx = λ₁ ,

i.e., the maximum value of the Rayleigh quotient R(x, A) is λ₁, the largest
eigenvalue of A. What is the optimal x that achieves this maximum?
(Hint: express x as a linear combination of the ξᵢ.)
(e) Prove that

min_{xᵀx=1} xᵀAx = λd ,

i.e., the minimum value of the Rayleigh quotient R(x, A) is λd, the smallest
eigenvalue of A. What is the optimal x that achieves this minimum?
(Hint: express x as a linear combination of the ξᵢ.)

9. Answer the following questions on the Cauchy distribution.


(a) Show that the Cauchy distribution is a valid continuous distribution.
(b) Show that the expectation of the Cauchy distribution does not exist.
10. Answer the following questions related to convex and concave functions.
(a) Show that f(x) = e^{ax} is a convex function for any a ∈ R.
(b) Show that g(x) = ln(x) is a concave function on {x | x > 0}.
(c) Show that h(x) = x ln(x) is a convex function on {x | x ≥ 0} (we define
0 ln 0 = 0).
(d) Given a discrete distribution with p.m.f. (p₁, p₂, . . . , pₙ) (pᵢ ≥ 0,
Σᵢ₌₁ⁿ pᵢ = 1), its entropy is defined as

H = −Σᵢ₌₁ⁿ pᵢ log₂ pᵢ ,

in which we assume 0 log₂ 0 = 0. Use the method of Lagrange multipliers to
find which values of pᵢ maximize the entropy.

11. Let X and Y be two random variables.
(a) Prove that if X and Y are independent, then they are uncorrelated.
(b) Let X be uniformly distributed on [−1, 1], and Y = X 2 . Show that X
and Y are uncorrelated but not independent.
(c) Let X and Y be two discrete random variables whose values can be ei-
ther 1 or 2. The joint probability is pij = Pr(X = i, Y = j) (i, j ∈ {1, 2}).
Prove that if X and Y are uncorrelated, then they are independent, too.
