02 - Math of Pattern Recognition
Jianxin Wu
LAMDA Group
National Key Lab for Novel Software Technology
Nanjing University, China
[email protected]
Contents
1 Linear algebra
 1.1 Inner product, norm, distance, and orthogonality
 1.2 Angle and inequality
 1.3 Vector projection
 1.4 Basics of matrices
 1.5 Matrix multiplication
 1.6 Determinant and inverse of a square matrix
 1.7 Eigenvalue, eigenvector, rank, and trace of a square matrix
 1.8 Singular value decomposition
 1.9 Positive (semi-)definite real symmetric matrices
2 Probability
 2.1 Basics
 2.2 Joint and conditional distributions, and Bayes’ theorem
 2.3 Expectation and variance/covariance matrices
 2.4 Inequalities
 2.5 Independence and correlation
 2.6 The normal distribution
3 Optimization and matrix calculus
4 Complexity of algorithms
Exercises
1 Linear algebra
We will not consider complex numbers in this book. Hence, all the numbers we deal with are real numbers.
Scalar. We use R to denote the set of real numbers. A real number x ∈ R
is also called a scalar.
Vector. A sequence of real numbers forms a vector. We use boldface letters to denote vectors, e.g., x ∈ Rd is a vector formed by a sequence of d real numbers. We use
x = (x1 , x2 , . . . , xd )T
to indicate that x is formed by d numbers in a column shape, and the i-th number in the sequence is a scalar xi , i.e.,1

x = \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_d \end{bmatrix} . (1)
d is called the length (or dimensionality, or size) of the vector, and the vector is
called a d-dimensional one. We use 1d and 0d to denote d-dimensional vectors
whose elements are all 1 and all 0, respectively. When the vector size is obvious
from its context, we simply write 1 or 0.
1 The T superscript means the transpose of a matrix, which will be defined soon.
1.1 Inner product, norm, distance, and orthogonality
The inner product (or dot product) between two vectors x, y ∈ Rd is defined as
x^T y = \sum_{i=1}^{d} x_i y_i . (2)
Hence, the inner product is a scalar, and we obviously have
xT y = y T x . (3)
The above fact will sometimes help us in this book—e.g., making transfor-
mations:
(xT y)z = z(xT y) = zxT y = zy T x = (zy T )x , (4)
and so on.
The norm of a vector x is denoted by ‖x‖, and defined by
‖x‖ = √(xT x) . (5)
Other types of vector norms are available. The specific form in Equation 5 is called the ℓ2 norm. It is also called the length of x in some cases. Note that the norm ‖x‖ and the squared norm xT x are always non-negative for any x ∈ Rd .
A vector whose length is 1 is called a unit vector. We usually say a unit
vector determines a direction. End points of unit vectors reside on the surface of
the unit hypersphere in the d-dimensional space whose center is the zero vector
0 and radius is 1. A ray from the center to any unit vector uniquely determines
a direction in that space, and vice versa. When x = cy and c > 0, we say the
two vectors x and y are in the same direction.
The distance between x and y is denoted by ‖x − y‖. A frequently used fact is about the squared distance:
‖x − y‖2 = (x − y)T (x − y) = ‖x‖2 + ‖y‖2 − 2xT y . (6)
The above equality utilizes the facts that ‖x‖2 = xT x and xT y = y T x.
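As a quick numerical check, the identity in Equation 6 can be verified in a few lines of plain Python (a sketch; the vectors below are arbitrary examples, not from the text):

```python
import math

def dot(x, y):
    """Inner product xT y of two equal-length vectors."""
    return sum(a * b for a, b in zip(x, y))

def norm(x):
    """The l2 norm of x, i.e., sqrt(xT x)."""
    return math.sqrt(dot(x, x))

x = [1.0, 2.0, 3.0]
y = [4.0, -1.0, 0.5]

# Squared distance computed directly ...
lhs = norm([a - b for a, b in zip(x, y)]) ** 2
# ... and via the expansion in Equation 6.
rhs = norm(x) ** 2 + norm(y) ** 2 - 2 * dot(x, y)
print(abs(lhs - rhs) < 1e-12)
```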
Figure 1: Illustration of vector projection.
( \sum_{i=1}^{d} x_i y_i )^2 ≤ ( \sum_{i=1}^{d} x_i^2 )( \sum_{i=1}^{d} y_i^2 ) , (10)
which is the Cauchy–Schwarz inequality.2 The equality holds if and only if there is a constant c ∈ R such that xi = cyi for all 1 ≤ i ≤ d. In the vector form, the equality condition is equivalent to x = cy for some constant c.
This inequality (and the equality condition) can be extended to integrals:
( \int f(x) g(x) dx )^2 ≤ \int f^2(x) dx \int g^2(x) dx , (11)
2 The two names in this inequality are Augustin-Louis Cauchy, the famous French mathematician who first published this inequality, and Karl Hermann Amandus Schwarz, a German mathematician. The integral form generalization of this inequality was by Viktor Yakovlevich Bunyakovsky, a Ukrainian/Russian mathematician.
direction of x. The combination of norm and direction uniquely determines any
vector. The norm alone determines the zero vector.
The direction of y is y/‖y‖. It is obvious that the direction of x⊥ is also y/‖y‖ if the angle θ between x and y is acute (< 90°), as illustrated in Figure 1. The norm of x⊥ is also simple:
‖x⊥ ‖ = ‖x‖ cos θ = ‖x‖ · (xT y)/(‖x‖‖y‖) = (xT y)/‖y‖ . (12)
Hence, the projection x⊥ is
x⊥ = ((xT y)/‖y‖) · (y/‖y‖) = ((xT y)/(yT y)) y . (13)
Equation 13 is derived assuming θ is acute. However, it is easy to verify that this equation is correct too when the angle is right (= 90°), obtuse (> 90°), or straight (= 180°). The term (xT y)/(yT y) (which is a scalar) is called the projected value, and ((xT y)/(yT y)) y is the projected vector, which is also denoted by projy x.
Vector projection is very useful in this book. For example, let y = (2, 1) and
x = (1, 1). The direction of y specifies all the points that possess this property:
its first dimension is twice its second dimension. Using Equation 13, we obtain
projy x = (1.2, 0.6), which also exhibits the same property. We may treat projy x
as the best approximation of x that satisfies the property specified in y. The
residue of this approximation z = x − projy x = (−0.2, 0.4) does not satisfy this
property and can be considered as noise or error in certain applications.
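The example above is easy to check in code. The following plain-Python sketch (the helper names are ours, not standard notation) reproduces projy x and the residue, and verifies that the residue is orthogonal to y:

```python
def dot(x, y):
    return sum(a * b for a, b in zip(x, y))

def project(x, y):
    """proj_y x = (xT y / yT y) y (Equation 13)."""
    c = dot(x, y) / dot(y, y)      # the projected value (a scalar)
    return [c * yi for yi in y]    # the projected vector

x, y = [1.0, 1.0], [2.0, 1.0]
p = project(x, y)                  # the projection of x onto y
z = [a - b for a, b in zip(x, p)]  # the residue x - proj_y x
print(p, z, dot(y, z))             # y is orthogonal to the residue
```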
We also use [X]ij to refer to the element at the i-th row and j-th column in a
matrix X.
There are a few special cases. When m = n, we call the matrix a square
matrix. When n = 1, the matrix contains only one column, and we call it a
column matrix, or a column vector, or simply a vector. When m = 1, we call it
a row matrix, or a row vector. Note that when we say x is a vector, we mean a column vector if not otherwise specified. That is, when we write x = (1, 2, 3)T , we are referring to a column matrix
\begin{bmatrix} 1 \\ 2 \\ 3 \end{bmatrix} .
There are also a few special cases within square matrices that are worth
noting. In a square matrix X of size n × n, the diagonal entries refer to those
elements xij in X satisfying i = j. If xij = 0 whenever i ≠ j (i.e., when non-diagonal entries are all 0), we say X is a diagonal matrix. The unit matrix is a special diagonal matrix, whose diagonal entries are all 1. A unit matrix is usually denoted by I (when the size of it can be inferred from the context) or In (indicating the size is n × n).
Following the Matlab convention, we use diag(x11 , x22 , . . . , xnn ) to denote an n × n diagonal matrix whose diagonal entries are x11 , x22 , . . . , xnn ,
sequentially. Similarly, for an n × n square matrix X, diag(X) is a vector
(x11 , x22 , . . . , xnn )T .
The transpose of a matrix X is denoted by X T , and is defined by
[X T ]ji = xij .
The multiplication of a scalar c and a matrix X is defined elementwise: [cX]ij = cxij .
Not any two matrices can be multiplied. The multiplication XY exists (i.e.,
is well defined) if and only if the number of columns in X equals the number of
rows in Y —i.e., when there are positive integers m, n, and p such that the size
of X is m × n and the size of Y is n × p. The product XY is a matrix of size
m × p, and is defined by
[XY ]ij = \sum_{k=1}^{n} x_{ik} y_{kj} . (17)
(XY )T = Y T X T .
Note that Y X does not necessarily exist when XY exists. Even if both XY and Y X are well-defined, XY ≠ Y X except in a few special cases. However, for any matrix X, both XX T and X T X exist and both are symmetric matrices.
Let x = (x1 , x2 , . . . , xm )T and y = (y1 , y2 , . . . , yp )T be two vectors. Treating
vectors as special matrices, the dimensions of x and y T satisfy the multiplication
constraint, such that xy T always exists for any x and y. We call xy T the
outer product between x and y. This outer product is an m × p matrix, and
[xy T ]ij = xi yj . Note that in general xy T ≠ yxT .
The block matrix representation is sometimes useful. Let xi: denote the i-th
row of X (of size 1 × n), and x:i the i-th column (of size m × 1); we can write
X as either a column format
X = \begin{bmatrix} x_{1:} \\ x_{2:} \\ \vdots \\ x_{m:} \end{bmatrix} , (18)
or a row format
X = [x:1 |x:2 | . . . |x:n ] . (19)
Using the block matrix representation, we have
XY = [x:1 |x:2 | . . . |x:n ] \begin{bmatrix} y_{1:} \\ y_{2:} \\ \vdots \\ y_{n:} \end{bmatrix} = \sum_{i=1}^{n} x_{:i} y_{i:} . (20)
That is, the product XY is the summation of n outer products (between x:i and y Ti: , which are both column vectors). X and Y T have the same number of columns; if we compute the outer product of their corresponding columns, we get n matrices of size m × p. The summation of these matrices equals XY .
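Equation 20 can be verified numerically. The sketch below (plain Python, with hypothetical helper names matmul and outer) checks that the sum of outer products equals the usual product of Equation 17 for a small example:

```python
def matmul(X, Y):
    """[XY]_ij = sum_k x_ik y_kj (Equation 17)."""
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y)))
             for j in range(len(Y[0]))] for i in range(len(X))]

def outer(u, v):
    """Outer product u vT of two column vectors."""
    return [[ui * vj for vj in v] for ui in u]

X = [[1, 2], [3, 4], [5, 6]]   # 3 x 2
Y = [[7, 8, 9], [0, 1, 2]]     # 2 x 3

# Equation 20: XY equals the sum of n outer products between the
# columns of X and the rows of Y.
S = [[0] * len(Y[0]) for _ in range(len(X))]
for k in range(len(Y)):
    O = outer([row[k] for row in X], Y[k])
    S = [[S[i][j] + O[i][j] for j in range(len(S[0]))]
         for i in range(len(S))]
print(matmul(X, Y) == S)
```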
Similarly, we also have
XY = \begin{bmatrix} x_{1:} \\ x_{2:} \\ \vdots \\ x_{m:} \end{bmatrix} [y :1 |y :2 | . . . |y :p ] . (21)
This block representation tells us [XY ]ij = xi: y :j , which is exactly Equation 17.
For a square matrix X and a natural number k, the k-th power of X is well
defined, as
X^k = X X · · · X (k times).
denoted by det(X) or simply |X|, and is a scalar. Note that although the
| · | symbol looks like the absolute value operator, its meaning is different. The
determinant could be positive, zero, or negative, while absolute values are always
non-negative.
Given an n × n square matrix X, by removing its i-th row and j-th column,
we obtain an (n − 1) × (n − 1) matrix, and the determinant of this matrix is
called the (i, j)-th minor of X, denoted by Mij . Then, Laplace’s formula states
that
|X| = \sum_{j=1}^{n} (−1)^{i+j} x_{ij} M_{ij} (22)
holds for any fixed row index i.
Two frequently used facts are
|XY | = |X||Y | ,
|cX| = c^n |X| ,
in which X and Y are n × n square matrices and c is a scalar.
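Both properties are easy to confirm for 2 × 2 matrices; the plain-Python sketch below uses arbitrary example matrices:

```python
def det2(M):
    """Determinant of a 2 x 2 matrix."""
    return M[0][0] * M[1][1] - M[0][1] * M[1][0]

def matmul2(X, Y):
    return [[sum(X[i][k] * Y[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

X = [[2, 1], [0, 3]]
Y = [[1, 4], [2, 5]]
c, n = 3, 2

prop1 = det2(matmul2(X, Y)) == det2(X) * det2(Y)                   # |XY| = |X||Y|
prop2 = det2([[c * e for e in r] for r in X]) == c ** n * det2(X)  # |cX| = c^n |X|
print(prop1, prop2)
```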
1.7 Eigenvalue, eigenvector, rank, and trace of a square matrix
For a square matrix A, if there exist a non-zero vector x and a scalar λ such
that
Ax = λx ,
we say λ is an eigenvalue of A and x is an eigenvector of A (which is associated with this eigenvalue λ). An n × n real square matrix has n eigenvalues, although some of them may be equal to each other. The eigenvalues and eigenvectors of a real square matrix, however, may contain complex numbers.
Eigenvalues have connections with the diagonal entries and the determinant
of A. Denote the n eigenvalues by λ1 , λ2 , . . . , λn ; the following equations hold
(even if the eigenvalues are complex numbers):
\sum_{i=1}^{n} λ_i = \sum_{i=1}^{n} a_{ii} , (24)
\prod_{i=1}^{n} λ_i = |A| . (25)
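For a 2 × 2 matrix the eigenvalues are the roots of λ^2 − (a11 + a22 )λ + |A| = 0, so Equations 24 and 25 can be checked directly (a sketch with an arbitrary example matrix):

```python
import math

A = [[4.0, 1.0], [2.0, 3.0]]   # arbitrary 2 x 2 matrix with real eigenvalues

trace = A[0][0] + A[1][1]                      # sum of diagonal entries
det = A[0][0] * A[1][1] - A[0][1] * A[1][0]    # |A|

# Roots of lambda^2 - trace * lambda + det = 0:
disc = math.sqrt(trace ** 2 - 4 * det)
lam1, lam2 = (trace + disc) / 2, (trace - disc) / 2

print(abs((lam1 + lam2) - trace) < 1e-12)   # Equation 24
print(abs(lam1 * lam2 - det) < 1e-12)       # Equation 25
```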
• The eigenvectors satisfy (for 1 ≤ i, j ≤ n)
ξ Ti ξ j = 1 if i = j, and 0 otherwise. (28)
Collecting these eigenvectors into a matrix E = [ξ 1 |ξ 2 | . . . |ξ n ], we have
EE T = E T E = I . (29)
Given two square matrices A and B, the generalized eigenvalue problem seeks a non-zero vector x and a scalar λ such that Ax = λBx .
X = U ΣV T , (32)
where U is an m × m matrix, Σ is an m × n matrix whose non-diagonal elements
are all 0, and V is an n × n matrix.
If there are a scalar σ and two vectors u ∈ Rm and v ∈ Rn (both are unit
vectors) that satisfy the following two equalities simultaneously:
Xv = σu and X T u = σv , (33)
we say σ is a singular value of X, and u and v are its associated left- and
right-singular vectors, respectively.
If (σ, u, v) satisfies the above equation, so does (−σ, −u, v). To remove this ambiguity, the singular value is required to be non-negative (i.e., σ ≥ 0).
The SVD finds all singular values and singular vectors. The columns of U
are called the left-singular vectors of X, and the columns of V are the right-
singular vectors. The matrices U and V are orthogonal. The diagonal entries
in Σ are the corresponding singular values.
Because XX T = (U ΣV T )(V ΣT U T ) = U ΣΣT U T and ΣΣT is diagonal, we
get that the left-singular vectors of X are the eigenvectors of XX T ; similarly,
the right-singular vectors of X are the eigenvectors of X T X; and the non-zero
singular values of X (diagonal non-zero entries in Σ) are the square roots of
the non-zero eigenvalues of XX T and X T X. A by-product is that the non-zero
eigenvalues of XX T and X T X are exactly the same.
This connection is helpful. When m ≫ n (e.g., n = 10 but m = 100 000), finding the eigendecomposition of XX T directly means decomposing a 100 000 × 100 000 matrix, which is infeasible or at least very inefficient. However, we can compute the SVD of X. The squared positive singular values and the left-singular vectors are the positive eigenvalues and their associated eigenvectors of XX T . The same trick also works well when n ≫ m, in which case the right-singular vectors are useful in finding the eigendecomposition of X T X.
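The following sketch illustrates this connection with power iteration on X T X (a deliberately minimal substitute for a full SVD routine; the matrix and function names are illustrative only):

```python
import math

def matvec(M, v):
    return [sum(r[j] * v[j] for j in range(len(v))) for r in M]

def transpose(M):
    return [list(col) for col in zip(*M)]

def top_singular_value(X, iters=200):
    """Power iteration on XT X: its largest eigenvalue is the square of
    the largest singular value of X (a sketch, not a full SVD)."""
    Xt = transpose(X)
    v = [1.0] * len(X[0])
    for _ in range(iters):
        w = matvec(Xt, matvec(X, v))           # (XT X) v
        nrm = math.sqrt(sum(c * c for c in w))
        v = [c / nrm for c in w]
    lam = sum(a * b for a, b in zip(v, matvec(Xt, matvec(X, v))))
    return math.sqrt(lam)                       # sigma = sqrt(lambda)

X = [[3.0, 0.0], [0.0, 2.0], [0.0, 0.0]]        # singular values 3 and 2
print(abs(top_singular_value(X) - 3.0) < 1e-9)
```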
There is a simple connection between eigenvalues and PD (PSD) matri-
ces. A real symmetric matrix is PD/PSD if and only if all its eigenvalues are
positive/non-negative.
One type of PSD matrix we will use frequently is of the form AAT or AT A,
in which A is any real matrix. The proof is pretty simple: because
xT AAT x = (AT x)T (AT x) = ‖AT x‖2 ≥ 0 ,
2 Probability
A random variable is usually denoted by an uppercase letter, such as X. A random variable is a variable that can take values from a finite or infinite set. To keep things simple, we refrain from using the measure-theoretic definition of random variables and probabilities. We will use the terms random variable and distribution interchangeably.
2.1 Basics
If a random variable X can take a value from a finite or countably infinite set,
it is called a discrete random variable. Suppose the outcome of a particular
trial is either success or failure, and the chance of success is p (0 ≤ p ≤ 1). When multiple trials are performed, the chance of success in any one trial is not affected by any other trial (i.e., the trials are independent). Then, we denote by X the number of trials required until we see the first successful outcome. X is a random variable, which can take its value from the countably infinite set {1, 2, 3, . . . }, and hence is a discrete random variable. We say X follows a geometric distribution with parameter p.3
A random variable is different from a usual variable: it may take different values with different likelihoods (or probabilities). Hence, a random variable is a function rather than a variable whose value can be fixed. Let the set E = {x1 , x2 , x3 , . . . } denote all values a discrete random variable X can possibly take. We call each xi an event. The number of events should be either finite or countably infinite, and the events are mutually exclusive; that is, if an event xi happens, then any other event xj (j ≠ i) cannot happen in the same trial. Hence, the probability that either one of two events xi or xj happens equals the sum of the probabilities of the two events:
in which Pr(·) means the probability and || is the logical or. The summation
rule can be extended to a countable number of elements.
A discrete random variable is determined by a probability mass function
(p.m.f.) p(X). A p.m.f. is specified by the probability of each event: Pr(X =
xi ) = ci (ci ∈ R), and it is a valid p.m.f. if and only if
c_i ≥ 0 (∀ x_i ∈ E) and \sum_{x_i ∈ E} c_i = 1 . (36)
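For instance, the geometric distribution above satisfies Equation 36; a truncated sum already comes very close to 1 (a plain-Python sketch, with p = 0.3 chosen arbitrarily):

```python
p = 0.3   # success probability of each independent trial

def pmf(k):
    """Geometric distribution: first success occurs on trial k."""
    return p * (1 - p) ** (k - 1)

probs = [pmf(k) for k in range(1, 200)]   # truncated support {1, ..., 199}
print(all(c >= 0 for c in probs))         # non-negativity
print(abs(sum(probs) - 1.0) < 1e-12)      # the full sum equals 1
```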
F (∞) ≜ lim_{x→∞} F (x) = 1 ,
in which ≜ means “defined as”. This property is true for both discrete and continuous distributions.
• Pr(a ≤ X ≤ b) = \int_a^b p(x) dx = F (b) − F (a).
• Although the p.m.f. is always between 0 and 1, the p.d.f. can be any
non-negative value.
• If a continuous X only takes values in the range E = [a, b], we can still
say that E = R, and let the p.d.f. p(x) = 0 for x < a or x > b.
When there is more than one random variable, we will use a subscript to
distinguish them—e.g., pY (y) or pX (x). If Y is a continuous random variable
and g is a fixed function (i.e., no randomness in the computation of g) and is
monotonic, then X = g(Y ) is also a random variable, and its p.d.f. can be
computed as
p_Y (y) = p_X (x) |dx/dy| = p_X (g(y)) |g′(y)| , (40)
in which | · | is the absolute value function.
when x and y are two random vectors (and one or both can be random variables—
i.e., 1-dimensional random vectors). In the continuous case,
p(x) = \int_y p(x, y) dy .
The distributions obtained by summing or integrating one or more random
variables out are called marginal distributions. The summation is taken over all
possible values of y (the variable that is summed or integrated out).
Note that in general
p(x, y) 6= p(x)p(y) .
For example, let us guess Pr(A = 3) = 0.04 and Pr(I = 80 000) = 0.1—i.e.,
in China the percentage of people aged 3 is 4%, and people with 80 000 yearly
income is 10%; then, Pr(A = 3) Pr(I = 80 000) = 0.004. However, we would
expect Pr(A = 3, I = 80 000) to be almost 0: how many 3-year-old babies have
80 000 RMB yearly income?
In a random vector, if we know the value of one random variable for a partic-
ular example (or sample, or instance, or instantiation), it will affect our estimate
of other random variable(s) in that sample. In the age–income hypothetical example, if we know I = 80 000, then we know A = 3 is almost impossible for
the same individual. Our estimate of the age will change to a new one when
we know the income, and this new distribution is called the conditional dis-
tribution. We use x|Y = y to denote the random vector (distribution) of x
conditioned on Y = y, and use p(x|Y = y) to denote the conditional p.m.f. or
p.d.f. For conditional distributions, we have
p(x|y) = p(x, y)/p(y) , (41)
p(y) = \int_x p(y|x) p(x) dx . (42)
In the discrete case, the \int in Equation 42 is changed to \sum, and Equation 42 is called the law of total probability. Putting these two together, we get Bayes’ theorem:5
p(x|y) = p(y|x)p(x)/p(y) = p(y|x)p(x) / \int_x p(y|x)p(x) dx . (43)
Let y be a random vector we can observe (or measure), and x be the random variables that we cannot directly observe but want to estimate or predict. Then, the observed values of y are “evidence” that should make us update our estimate (“belief”) of x. Bayes’ theorem provides a mathematically precise way to perform such updates, and we will use it frequently in this book.
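A minimal discrete sketch of such an update (all numbers are hypothetical; x is a two-valued unobserved state, y a two-valued observation):

```python
# Hypothetical numbers: x is an unobserved state ('ill' or 'healthy'),
# y is an observed test result ('pos').
prior = {'ill': 0.01, 'healthy': 0.99}       # p(x)
likelihood = {('pos', 'ill'): 0.95,          # p(y | x)
              ('pos', 'healthy'): 0.05}

# Law of total probability (Equation 42), discrete case:
p_pos = sum(likelihood[('pos', x)] * prior[x] for x in prior)

# Bayes' theorem (Equation 43):
posterior_ill = likelihood[('pos', 'ill')] * prior['ill'] / p_pos
print(p_pos, posterior_ill)
```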
i.e., a weighted sum of x, in which the weights are given by the p.d.f. or p.m.f. (with the \int changing to \sum in the discrete case). Note that the expectation is a normal scalar or vector, which is not affected by randomness anymore (or at least not the randomness related to X). Two obvious properties of expectations are
• E[X + Y ] = E[X] + E[Y ]; and,
• E[cX] = cE[X] for a scalar c.
The expectation concept can be generalized. Let g(·) be a function; then
g(X) is a random vector, and its expectation is
E[g(X)] = \int_x p(x) g(x) dx . (45)
Similarly, g(X)|Y is also a random vector, and its expectation (the conditional
expectation) is
E[g(X)|Y = y] = \int_x p(x|Y = y) g(x) dx . (46)
A similar formula holds for random vectors:
a valid p.d.f. The expectation, however, does not exist because it is the sum of
two infinite values, which is not well-defined in mathematical analysis.
2.4 Inequalities
If we are asked to estimate a probability Pr(a ≤ X ≤ b) but know nothing about
X, the best we can say is: it is between 0 and 1 (including both ends). That
is, if there is no information in, there is no information out—the estimation is
valid for any distribution and is not useful at all.
If we know more about X, we can say more about probabilities involving
X. Markov’s inequality states that if X is a non-negative random variable (or,
Pr(X < 0) = 0) and a > 0 is a scalar, then
Pr(X ≥ a) ≤ E[X]/a , (52)
assuming the mean is finite.7
Chebyshev’s inequality depends on both the mean and the variance. For a
random variable X, if its mean is finite and its variance is non-zero, then for
any scalar k > 0,
Pr(|X − E[X]| ≥ kσ) ≤ 1/k^2 , (53)
in which σ = √Var(X) is the standard deviation of X.8
There is also a one-tailed version of Chebyshev’s inequality, which states
that for k > 0,
Pr(X − E[X] ≥ kσ) ≤ 1/(1 + k^2) . (54)
6 The Cauchy distribution is named, once again, after Augustin-Louis Cauchy.
7 This inequality is named after Andrey (Andrei) Andreyevich Markov, a famous Russian
mathematician.
8 This inequality is named after Pafnuty Lvovich Chebyshev, another Russian mathemati-
cian.
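Both inequalities can be checked exhaustively on a small discrete distribution (a sketch; the distribution and the values of a and k are arbitrary):

```python
import math

# A small non-negative discrete distribution: value -> probability.
dist = {0: 0.2, 1: 0.5, 4: 0.3}

mean = sum(x * p for x, p in dist.items())
var = sum((x - mean) ** 2 * p for x, p in dist.items())
sigma = math.sqrt(var)

a = 2    # Markov's inequality (Equation 52)
markov_ok = sum(p for x, p in dist.items() if x >= a) <= mean / a

k = 1.2  # Chebyshev's inequality (Equation 53)
cheb_ok = (sum(p for x, p in dist.items() if abs(x - mean) >= k * sigma)
           <= 1 / k ** 2)
print(markov_ok, cheb_ok)
```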
2.5 Independence and correlation
Two random variables X and Y are independent if and only if the joint c.d.f.
FX,Y and the marginal c.d.f.s FX and FY satisfy FX,Y (x, y) = FX (x) FY (y) for all x and y.
When X and Y are independent, knowing the distribution of X does not give
us any information about Y , and vice versa; in addition, E[XY ] = E[X]E[Y ].
When X and Y are not independent, we say they are dependent.
Another concept related to independence (or dependence) is correlatedness
(or uncorrelatedness). Two random variables are said to be uncorrelated if their
covariance is zero, and correlated if their covariance is nonzero. The covariance between two random variables X and Y is defined as
Cov(X, Y ) = E[(X − E[X])(Y − E[Y ])] = E[XY ] − E[X] E[Y ] .
The range of Pearson’s correlation coefficient is [−1, +1]. When the correlation
coefficient is +1 or −1, X and Y are related by a perfect linear relationship
X = cY + b; when the correlation coefficient is 0, they are uncorrelated.
When X and Y are random vectors (m- and n-dimensional, respectively),
Cov(X, Y ) is an m × n covariance matrix, defined as Cov(X, Y ) = E[(X − E[X])(Y − E[Y ])T ].
Note that when X = Y , we get the covariance matrix of X (cf. Equation 50).
Independence is a much stronger condition than uncorrelatedness: independent random variables are always uncorrelated, but uncorrelated random variables can still be dependent.
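A classic sketch of this gap: X uniform on {−1, 0, 1} and Y = X^2 are uncorrelated but clearly dependent (plain Python, exact arithmetic on the three-point distribution):

```python
# X uniform on {-1, 0, 1} and Y = X^2, as (x, y) pairs with prob 1/3 each.
support = [(-1, 1), (0, 0), (1, 1)]
p = 1 / 3

ex = sum(x * p for x, _ in support)   # E[X] = 0
ey = sum(y * p for _, y in support)   # E[Y] = 2/3
cov = sum((x - ex) * (y - ey) * p for x, y in support)
print(abs(cov) < 1e-12)               # uncorrelated ...

# ... but dependent: Y = 0 can only happen together with X = 0.
print(any(y == 0 and x != 0 for x, y in support))
```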
2.6 The normal distribution
Among all distributions, the normal distribution is probably the most widely
used. A random variable X follows a normal distribution if its p.d.f. is in the
form of
p(x) = \frac{1}{\sqrt{2π} σ} \exp( −\frac{(x − µ)^2}{2σ^2} ) , (63)
for some µ ∈ R and σ 2 > 0. We can denote it as X ∼ N (µ, σ 2 ) or p(x) =
N (x; µ, σ 2 ). A normal distribution is also called a Gaussian distribution.10
Note that the parameters that determine a normal distribution are (µ, σ 2 ), not
(µ, σ).
A d-dimensional random vector is jointly normal (or has a multivariate nor-
mal distribution) if its p.d.f. is in the form of
p(x) = (2π)^{−d/2} |Σ|^{−1/2} \exp( −\frac{1}{2} (x − µ)^T Σ^{−1} (x − µ) ) , (64)
If we rewrite the univariate p.d.f. of Equation 63 in this form and change the dimensionality from 1 to d, the variance σ2 to the covariance matrix Σ (and |Σ| for its determinant), x to x, and the mean µ to µ, we get exactly the multivariate p.d.f.
p(x) = (2π)^{−d/2} |Σ|^{−1/2} \exp( −\frac{1}{2} (x − µ)^T Σ^{−1} (x − µ) ) . (66)
The Gaussian distribution has many nice properties, some of which can be
found in Chapter 13, the chapter devoted to the properties of normal distribu-
tions. One particularly useful property is: if X and Y are jointly Gaussian and
X and Y are uncorrelated, then they are independent.
10 It is named after Johann Carl Friedrich Gauss, a very influential German mathematician.
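Equation 63 is simple to evaluate; the sketch below checks the standard normal density at its peak and numerically integrates it to (approximately) 1 (the step size and integration range are arbitrary choices):

```python
import math

def normal_pdf(x, mu, sigma2):
    """Univariate normal density (Equation 63)."""
    return (math.exp(-(x - mu) ** 2 / (2 * sigma2))
            / math.sqrt(2 * math.pi * sigma2))

# The standard normal density peaks at x = mu with value 1/sqrt(2*pi).
peak_ok = abs(normal_pdf(0.0, 0.0, 1.0) - 1 / math.sqrt(2 * math.pi)) < 1e-15
print(peak_ok)

# A coarse Riemann sum over [-8, 8] integrates to roughly 1.
step = 0.001
total = sum(normal_pdf(-8 + i * step, 0.0, 1.0) * step for i in range(16000))
print(abs(total - 1.0) < 1e-3)
```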
Figure 2: The p.d.f. of (a) a 1D normal distribution p(x), and (b) a 2D normal distribution p(x1 , x2 ).
3 Optimization and matrix calculus
Optimization will be frequently encountered in this book. However, details of
optimization principles and techniques are beyond the scope of this book. We
will touch only a little bit on this huge topic in this chapter.
Informally speaking, given a cost (or objective) function f : D → R, the purpose of mathematical optimization is to find an x⋆ in the domain D such that f (x⋆ ) ≤ f (x) for any x ∈ D. This type of optimization problem is called a minimization problem, and is usually denoted as
\min_{x ∈ D} f (x) . (67)
A solution x⋆ that makes f (x) reach its minimum value is called a minimizer of f , and is denoted as
x⋆ = \arg\min_{x ∈ D} f (x) . (68)
nor a minimum). Points with an all-zero gradient are also called stationary points or critical points.
The gradient ∂f /∂x is defined in all undergraduate mathematics texts if x ∈ R, i.e., a scalar variable. In this case, the gradient is the derivative df /dx. The gradient ∂f /∂x for multivariate functions, however, is rarely included in these textbooks. These gradients are defined via matrix calculus as partial derivatives.
For vectors x, y, scalars x, y, and a matrix X, the matrix forms are defined as
[∂x/∂y]_i = ∂x_i/∂y , (70)
[∂x/∂y]_i = ∂x/∂y_i , (71)
[∂x/∂y]_{ij} = ∂x_i/∂y_j (which is a matrix) , (72)
[∂y/∂X]_{ij} = ∂y/∂x_{ij} . (73)
Some frequently used results follow directly from these definitions, e.g.,
∂(xT y)/∂x = y , (74)
∂(aT Xb)/∂X = abT . (75)
However, for more complex gradients—for example, those involving matrix in-
verse, eigenvalues, and matrix determinant, the solutions are not obvious. We
recommend The Matrix Cookbook , which lists many useful results—e.g.,
∂ det(X)/∂X = det(X) X^{−T} . (76)
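Identities such as Equation 74 can be sanity-checked with central finite differences (a sketch; the vectors and the step size h are arbitrary):

```python
def dot(x, y):
    return sum(a * b for a, b in zip(x, y))

x = [0.5, -1.0, 2.0]
y = [3.0, 4.0, -2.0]
h = 1e-6   # finite-difference step

grad = []
for i in range(len(x)):
    xp = list(x); xp[i] += h
    xm = list(x); xm[i] -= h
    grad.append((dot(xp, y) - dot(xm, y)) / (2 * h))   # d(xT y)/dx_i

# Equation 74 says the gradient equals y.
print(all(abs(g - yi) < 1e-6 for g, yi in zip(grad, y)))
```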
A set S is convex if for any x, y ∈ S and any λ (0 ≤ λ ≤ 1),
λx + (1 − λ)y ∈ S
always holds. In other words, if for any two points we pick from a set S the line segment connecting them falls entirely inside S, then S is convex. For example,
Figure 3: Illustration of a simple convex function.
in the 2-D space, all points inside a circle form a convex set, but the set of points
outside that circle is not convex.
A function f (whose domain is S) is convex if for any x and y in S and any λ (0 ≤ λ ≤ 1), we have
f (λx + (1 − λ)y) ≤ λf (x) + (1 − λ)f (y) .
For example, f (x) = x2 is a convex function. If we pick any two points a < b on its curve, the line segment that connects them is above the f (x) = x2 curve in the range (a, b), as illustrated in Figure 3.
If f is a convex function, we say that −f is a concave function. Any local
maximum of a concave function (on a convex domain) is also a global maximum.
Jensen’s inequality shows that the constraints in the convex function definition can be extended to an arbitrary number of points. Let f (x) be a convex function defined on a convex set S, and let x1 , x2 , . . . , xn be points in S. Then, for weights w1 , w2 , . . . , wn satisfying wi ≥ 0 (for all 1 ≤ i ≤ n) and \sum_{i=1}^{n} w_i = 1, Jensen’s inequality states that
f( \sum_{i=1}^{n} w_i x_i ) ≤ \sum_{i=1}^{n} w_i f(x_i) . (77)
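A quick numerical illustration of Equation 77 with f (x) = x^2 and randomly drawn points and weights (the seed and the sampling ranges are arbitrary):

```python
import random

random.seed(0)
f = lambda t: t * t   # a convex function

xs = [random.uniform(-5, 5) for _ in range(10)]   # points in S = R
ws = [random.random() for _ in range(10)]
s = sum(ws)
ws = [w / s for w in ws]   # non-negative weights summing to one

lhs = f(sum(w * x for w, x in zip(ws, xs)))   # f(sum_i w_i x_i)
rhs = sum(w * f(x) for w, x in zip(ws, xs))   # sum_i w_i f(x_i)
print(lhs <= rhs)
```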
in which v = (1, 2)T is a constant vector and “s.t.” means “subject to,” which specifies a constraint on x. There can be more than one constraint, and the constraint can also be an inequality.
Let us focus on equality constraints for now. A minimization problem with
only equality constraints is
min f (x) (81)
s.t. g1 (x) = 0, (82)
···
gm (x) = 0 . (83)
The method of Lagrange multipliers is a good tool to deal with this kind of
problem.12 This method defines a Lagrange function (or Lagrangian) as
L(x, λ) = f (x) − λT g(x) , (84)
in which λ = (λ1 , λ2 , . . . , λm )T are the m Lagrange multipliers, with the i-th
Lagrange multiplier λi associated with the i-th constraint gi (x) = 0; and we
use g(x) to denote (g1 (x), g2 (x), . . . , gm (x))T , the values of all m constraints.
Then, L is an unconstrained optimization objective, and
∂L/∂x = 0 , (85)
∂L/∂λ = 0 (86)
are necessary conditions for (x, λ) to be a stationary point of L(x, λ). Note that
the domain of the Lagrange multipliers is Rm —i.e., without any restriction.
Hence, we can also change the minus sign (−) to the plus sign (+) in the
Lagrangian.
The method of Lagrange multipliers states that if x0 is a stationary point
of the original constrained optimization problem, there always exists a λ0 such
that (x0 , λ0 ) is also a stationary point of the unconstrained objective L(x, λ).
In other words, we can use Equations 85 and 86 to find all stationary points of
the original problem.
If we move back to our example at the beginning of this section, its La-
grangian is
L(x, λ) = v T x − λ(xT x − 1) . (87)
Setting ∂L/∂x = 0 leads to v = 2λx; and setting ∂L/∂λ = 0 gives us the original constraint xT x = 1.
Because v = 2λx, we have ‖v‖2 = v T v = 4λ2 xT x = 4λ2 . Hence, |λ| = ‖v‖/2, and the stationary point is x = v/(2λ).
Since v = (1, 2)T in our example, we have λ2 = ‖v‖2 /4 = 5/4; or, λ = ±√5/2. Thus, f (x) = v T x = 2λxT x = 2λ. Hence, min f (x) = −√5 and max f (x) = √5. The minimizer is −(1/√5)(1, 2)T , and the maximizer is (1/√5)(1, 2)T .
12 This method is named after Joseph-Louis Lagrange, an Italian mathematician and as-
tronomer.
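The stationary points derived above can be re-checked numerically (a sketch, with v = (1, 2)T as in the example):

```python
import math

v = [1.0, 2.0]                          # the constant vector of the example
nv = math.sqrt(sum(c * c for c in v))   # ||v|| = sqrt(5)

x_min = [-c / nv for c in v]            # minimizer  -(1/sqrt(5)) v
x_max = [c / nv for c in v]             # maximizer  +(1/sqrt(5)) v

f = lambda x: sum(a * b for a, b in zip(v, x))     # f(x) = vT x
print(abs(f(x_min) + nv) < 1e-12)                  # min f = -sqrt(5)
print(abs(f(x_max) - nv) < 1e-12)                  # max f = +sqrt(5)
print(abs(sum(c * c for c in x_min) - 1) < 1e-12)  # constraint xT x = 1
```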
These solutions are easily verified. Applying the Cauchy–Schwarz inequality, we know |f (x)| = |v T x| ≤ ‖v‖‖x‖ = √5. That is, −√5 ≤ f (x) ≤ √5, and the equality is attained when v = cx for some constant c. Because ‖v‖ = √5 and ‖x‖ = 1, we know c = ±√5, and we get the maximum and minimum points as above.
The handling of inequality constraints is more complex than that of equality constraints, and involves duality, saddle points, and duality gaps. The method of Lagrange multipliers can be extended to handle these cases, but they are beyond the scope of this book. Interested readers are referred to the Convex Optimization book for more details.
4 Complexity of algorithms
In the next chapter, we will discuss how modern algorithms and systems require
a lot of computing and storage resources, in terms of the number of instructions
to be executed by CPU and GPU, or the amount of data that need to be stored
in main memory or hard disk. Of course, we prefer algorithms whose resource
consumption (i.e., running time or storage complexity) is low.
In the theoretical analysis of an algorithm’s complexity, we are often inter-
ested in how fast the complexity grows when the input size gets larger. The
unit of such complexity, however, is usually variable. For example, when the
complexity analysis is based on a specific algorithm’s pseudocode, and we are
interested in how many arithmetic operations are involved when the input size
is 100, the running time complexity might evaluate to 50,000 arithmetic oper-
ations. If instead we are interested in the number of CPU instructions that
are executed, the same algorithm may have a complexity of 200,000 in terms of
CPU instructions.
The big-O notation (O) is often used to analyze the theoretical complexity of
algorithms, which measures how the running time or storage requirement grows
when the size of the input increases. Note that the input size may be measured
by more than one number—e.g., by both the number of training examples n
and the length of the feature vector d.
When the input size is a single number n and the complexity is f (n), we
say this algorithm’s complexity is O(g(n)) if and only if there exist a positive
constant M and an input size n0 , such that when n ≥ n0 , we always have
f (n) ≤ M g(n) . (88)
We assume both f and g are positive in the above equation. With a slight abuse of notation, we can write the complexity as
f (n) = O(g(n)) . (89)
Informally speaking, Equation 89 states that f (n) grows at most as fast as g(n) after the problem size n is large enough.
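A tiny sketch of the definition: f (n) = 2n^2 + 3n is O(n^2 ), witnessed by, e.g., M = 3 and n0 = 3 (the witnesses are our choice; many others work):

```python
f = lambda n: 2 * n ** 2 + 3 * n   # the complexity being bounded
g = lambda n: n ** 2               # the bounding function
M, n0 = 3, 3                       # one valid choice of witnesses

# Equation 88: f(n) <= M * g(n) for all n >= n0.
print(all(f(n) <= M * g(n) for n in range(n0, 10000)))
```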
An interesting observation can be made from Equation 88. If this equation
holds, then we should have f (n) ≤ cM g(n) when c > 1, too. That is, when
26
c > 1, f (n) = O(g(n)) implies f (n) = O(cg(n)). In other words, a positive
constant scalar will not change the complexity result in the big O notation. A
direct consequence of this observation is that we do not need to be very careful
in deciding the unit for our complexity result.
However, in a specific application, different constant scalars might have very
different impacts. Both f1 (n) = 2n2 and f2 (n) = 20n2 are O(n2 ) in the big O
notation, but their running speed may differ by a factor of 10, and this speed
variation makes a big difference in real world systems.
The big O notation can be generalized to scenarios when there are more
variables involved in determining the input size. For example, the first pattern
recognition system we introduce (cf. Chapter 3) has a complexity f (n, d) =
O(nd) when there are n training examples and each training example is d-
dimensional. This notation means that there exist numbers n0 and d0 , and a
positive constant M , such that when n ≥ n0 and d ≥ d0 , we always have
f (n, d) ≤ M nd . (90)
Exercises
1. Let x = (√3, 1)T and y = (1, √3)T be two vectors, and x⊥ be the projection of x onto y.
(a) What is the value of x⊥ ?
(b) Prove that y ⊥ (x − x⊥ ).
(c) Draw a graph to illustrate the relationship between these vectors.
(d) Prove that for any λ ∈ R, ‖x − x⊥ ‖ ≤ ‖x − λy‖. (Hint: the geometry between these vectors suggests that ‖x − x⊥ ‖2 + ‖x⊥ − λy‖2 = ‖x − λy‖2 .)
2. Let X be a 5 × 5 real symmetric matrix, whose eigenvalues are 1, 1, 3, 4,
and x.
(a) Define a necessary and sufficient condition on x such that X is a
positive definite matrix.
(b) If det(X) = 72, what is the value of x?
3. Let x be a d-dimensional random vector, and x ∼ N (µ, Σ).
(a) Let p(x) be the probability density function for x. What is the equa-
tion that defines p(x)?
(b) Write down the equation that defines ln p(x).
(c) If you have access to The Matrix Cookbook, which equation will you use to help you derive ∂ ln p(x)/∂µ? What is your result?
(d) Similarly, if we treat Σ−1 as the variables (rather than Σ), which equation (or equations) will you use to help you derive ∂ ln p(x)/∂Σ−1 , and what is the result?
4. (Schwarz inequality) Let X and Y be two random variables (discrete or
continuous) and E[XY ] exists. Prove that
(E[XY ])^2 ≤ E[X^2 ] E[Y^2 ] .
6. Answer the following questions related to the exponential distribution.
(a) Calculate the expectation and variance of the exponential distribution with p.d.f. p(x) = βe−βx for x ≥ 0 and p(x) = 0 for x < 0 (in which β > 0).
(b) What is the c.d.f. of this distribution?
(c) (Memoryless property) Let X denote the continuous exponential ran-
dom variable. Prove that for any a > 0 and b > 0,
Pr(X ≥ a + b|X ≥ a) = Pr(X ≥ b) .
(b) Show that
\max_{x ≠ 0} (xT Ax)/(xT x) = \max_{xT x=1} xT Ax .
(c) Show that any unit norm vector x (i.e., ‖x‖ = 1) can be expressed as a linear combination of the eigenvectors, as x = Ew, or equivalently,
x = \sum_{i=1}^{d} w_i ξ_i ,
(d) Prove that
\max_{xT x=1} xT Ax = λ1 ,
i.e., the maximum value of the Rayleigh quotient R(x, A) is λ1 , the largest eigenvalue of A. What is the optimal x that achieves this maximum? (Hint: Express x as a linear combination of ξ i .)
(e) Prove that
\min_{xT x=1} xT Ax = λd ,
i.e., the minimum value of the Rayleigh quotient R(x, A) is λd , the smallest
eigenvalue of A. What is the optimal x that achieves this minimum?
(Hint: Express x as a linear combination of ξ i .)
11. Let X and Y be two random variables.
(a) Prove that if X and Y are independent, then they are uncorrelated.
(b) Let X be uniformly distributed on [−1, 1], and Y = X 2 . Show that X
and Y are uncorrelated but not independent.
(c) Let X and Y be two discrete random variables whose values can be ei-
ther 1 or 2. The joint probability is pij = Pr(X = i, Y = j) (i, j ∈ {1, 2}).
Prove that if X and Y are uncorrelated, then they are independent, too.