
THEORY AND PROBLEMS of MATHEMATICS

FOR MACHINE LEARNING

DINH PHUOC VINH

July 27, 2024

Department of Mathematics - FPTU HCM


Contents
1 ANALYTIC GEOMETRY 1
1.1 Norms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Inner Products, Length and Distance . . . . . . . . . . . . . . . . . . . . . . 1
1.3 Orthogonality and Orthogonal Projection . . . . . . . . . . . . . . . . . . . 1
1.4 Solved Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.5 Supplementary Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6

2 MATRIX DECOMPOSITION 9
2.1 Eigenvalues, Eigenvectors and Eigenspaces . . . . . . . . . . . . . . . . . . . 9
2.2 Eigendecomposition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.3 Singular Value Decomposition (SVD) . . . . . . . . . . . . . . . . . . . . . . 10
2.4 Matrix Approximation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.5 Solved Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.6 Supplementary Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12

3 VECTOR CALCULUS 15
3.1 Derivatives, Partial Derivatives and Gradients . . . . . . . . . . . . . . . . . 15
3.2 The Chain Rule . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
3.3 Hessian . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
3.4 Taylor Series and Taylor Polynomials . . . . . . . . . . . . . . . . . . . . . . 16
3.5 Solved Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
3.6 Supplementary Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19

4 PROBABILITY AND DISTRIBUTIONS 21


4.1 Bivariate Discrete Random Variables . . . . . . . . . . . . . . . . . . . . . . 21
4.2 Bivariate Continuous Random Variables . . . . . . . . . . . . . . . . . . . . 21
4.3 Variance Matrix, Covariance Matrix and Correlation . . . . . . . . . . . . . 22
4.4 Bivariate Gaussian . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
4.5 Solved Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
4.6 Supplementary Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27

5 CONTINUOUS OPTIMIZATION 31
5.1 Gradient Descent Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
5.2 Convex Sets and Convex Functions . . . . . . . . . . . . . . . . . . . . . . . 31
5.3 Constrained Optimization and Lagrange Multipliers . . . . . . . . . . . . . 32
5.4 Linear Programming and Quadratic Programming . . . . . . . . . . . . . . 32
5.5 Solved Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
5.6 Supplementary Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36

6 MULTIPLE CHOICE QUESTIONS 37


6.1 Questions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
6.2 Answer key . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45


List of Figures
3.1 The Chain Rule . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15

4.1 Region {(x, y) | 0 ≤ x ≤ 2, 0 ≤ y ≤ 1, 2 ≤ x + y} . . . . . . . . . . . . . . . 25

5.1 Examples of convex and non-convex sets . . . . . . . . . . . . . . . . . . . . 31
5.2 Region {(x, y) | |x|^(1/3) + |y|^(1/3) ≤ 1} . . . . . . . . . . . . . . . . . . . 34
5.3 Region {(x, y) | |x|^(3/2) + |y|^(3/2) ≤ 1} . . . . . . . . . . . . . . . . . . . 34
Chapter 1

ANALYTIC GEOMETRY
1.1 Norms

• A norm on a vector space V is a function ∥.∥ : V → R that satisfies

(i) ∥u∥ ≥ 0 for any vector u ∈ V and ∥u∥ = 0 if and only if u = 0.


(ii) ∥αu∥ = |α|∥u∥ for α ∈ R, u ∈ V.
(iii) ∥u + v∥ ≤ ∥u∥ + ∥v∥ for any u, v ∈ V.
• ∥x∥1 := |x1| + |x2| + · · · + |xn| and ∥x∥2 := √(x1^2 + x2^2 + · · · + xn^2) are two
different norms on R^n.

1.2 Inner Products, Length and Distance

• An inner product on a vector space V is a symmetric, positive definite bilinear
mapping ⟨., .⟩ : V × V → R.

• The mapping ∥.∥ : V → R, ∥x∥ := √⟨x, x⟩, is the norm induced by the inner product
⟨., .⟩.

• ∥x∥ is also called the length of x.

• Given an inner product on a vector space V, the distance between two vectors u, v
is defined by d(u, v) := ∥u − v∥ = √⟨u − v, u − v⟩.

• (Cauchy–Schwarz inequality). Let (V, ⟨., .⟩) be an inner product space and ∥.∥ be
the norm induced by the inner product. Then, we have the following inequality

−∥u∥∥v∥ ≤ ⟨u, v⟩ ≤ ∥u∥∥v∥ (1.1)

for all vectors u, v ∈ V.

1.3 Orthogonality and Orthogonal Projection

• The angle 0 ≤ θ ≤ π between two nonzero vectors u, v is defined by

cos θ = ⟨u, v⟩ / (∥u∥∥v∥). (1.2)

If θ = π/2, equivalently ⟨u, v⟩ = 0, the two vectors are called orthogonal.


• The set of vectors {v1, · · · , vk} is called an orthogonal set if and only if

⟨vi, vj⟩ = 0 for i ≠ j, and ⟨vi, vi⟩ = ∥vi∥^2 > 0 for each i. (1.3)

The set of vectors {v1, · · · , vk} is called an orthonormal set if and only if

⟨vi, vj⟩ = 0 for i ≠ j, and ⟨vi, vi⟩ = ∥vi∥^2 = 1 for each i. (1.4)

• (Gram-Schmidt process) Suppose {v1 , v2 , · · · , vk } is a basis of a vector space V.


Set b1 = v1 and for i = 2, 3, · · · , k let

bi := vi − (⟨vi, b1⟩/⟨b1, b1⟩) b1 − (⟨vi, b2⟩/⟨b2, b2⟩) b2 − · · · − (⟨vi, bi−1⟩/⟨bi−1, bi−1⟩) bi−1. (1.5)

Then, {b1 /∥b1 ∥, · · · , bk /∥bk ∥} is an orthonormal basis of V.

• (Orthogonal Projection) If {b1, b2, · · · , bm} is an ONB of a subspace U of V and
x ∈ V, then the orthogonal projection of x on U is given by

projU (x) := ⟨x, b1 ⟩b1 + ⟨x, b2 ⟩b2 + · · · + ⟨x, bm ⟩bm . (1.6)
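The Gram–Schmidt process (1.5) and the projection formula (1.6) are easy to check numerically. The following sketch (an illustration added here, assuming NumPy and the standard dot product on R^n; the vectors are arbitrary examples) orthonormalizes a basis and projects a vector onto the resulting subspace.

    import numpy as np

    def gram_schmidt(vectors):
        """Orthonormalize linearly independent vectors, following (1.5)."""
        basis = []
        for v in vectors:
            b = v.astype(float)
            for q in basis:                      # subtract the components along the
                b = b - np.dot(v, q) * q         # already-constructed orthonormal vectors
            basis.append(b / np.linalg.norm(b))
        return basis

    def project(x, onb):
        """Orthogonal projection of x onto span(onb), following (1.6)."""
        return sum(np.dot(x, b) * b for b in onb)

    # Example: project x onto the plane U spanned by v1 and v2 in R^3.
    v1, v2 = np.array([1.0, 0.0, -1.0]), np.array([1.0, 1.0, 0.0])
    onb = gram_schmidt([v1, v2])
    x = np.array([2.0, 2.0, 1.0])
    p = project(x, onb)
    print(p)                                     # the projection of x onto U
    print([np.dot(x - p, b) for b in onb])       # both ~0: x - p is orthogonal to U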

1.4 Solved Problems

1°. Suppose ∥.∥ is the norm induced by an inner product ⟨., .⟩. Show that if ∥u∥ = 3,
∥v∥ = 4, and ∥u + v∥ = 5, then u and v are orthogonal.

Solution. From the definitions and properties of an inner product and the norm
induced from this inner product, we have

∥u + v∥^2 = ⟨u + v, u + v⟩ = ⟨u, u⟩ + 2⟨u, v⟩ + ⟨v, v⟩ = 2⟨u, v⟩ + ∥u∥^2 + ∥v∥^2.

Substituting the given values, 25 = 2⟨u, v⟩ + 9 + 16, so ⟨u, v⟩ = 0, which shows that u
and v are orthogonal.


" #
1 −1
2°. (a) Show that ⟨x, y⟩ = xT y is an inner product on R2 .
−1 2
(b) Find m such that v = [1 m]T is a unit vector.

Solution.

(a) We show
" that #⟨., .⟩ is a symmetric and positive bilinear mapping. Indeed, let
1 −1
A= .
−1 2
• Symmetry: ⟨u, v⟩ = uT Av = uT AT v = (Au)T v = vT (Au) = ⟨v, u⟩.
• Bilinear: ⟨au+bu′ , v⟩ = (au+bu′ )T Av = (au)T Av+(bu′ )T Av = a⟨u, v⟩+
b⟨u′ , v⟩.
Similarly, ⟨u, av + bv′ ⟩ = a⟨u, v⟩ + b⟨u, v′ ⟩, for all scalars a, b.
1.4 Solved Problems 3

• Positive definite: For every u = (x, y) ̸= (0, 0), ⟨u, u⟩ = uT Au = x2 −


2xy + 2y 2 = (x − y)2 + y 2 > 0.
" #
1 −1
(b) We have 1 = ∥v∥2 = ⟨v, v⟩ = [1 m] [1 m]T = 1 − 2m + 2m2 ⇔
−1 2
m = 0 or m = 1.

3°. Find a symmetric matrix A such that ⟨v, w⟩ = v^T Aw if ⟨v, w⟩ = v1w1 − v1w2 −
v2w1 + 3v2w2.

Solution.
Let A = [1 −1; −1 3]. Then

v^T Aw = [v1 v2] [1 −1; −1 3] [w1 w2]^T = v1w1 − v1w2 − v2w1 + 3v2w2.

4°. Let a > 0 and b > 0. If v = (x1, y1) and w = (x2, y2), define an inner product on
R^2 by
⟨v, w⟩ := x1x2/a^2 + y1y2/b^2.
Find all unit vectors in R^2.

Solution.
Let v = (x, y) be a unit vector. Then

1 = ∥v∥^2 = ⟨v, v⟩ = x^2/a^2 + y^2/b^2.

Therefore, the set of all unit vectors is {(x, y) : x^2/a^2 + y^2/b^2 = 1}. This is an ellipse.

5°. Are there vectors u, v such that ⟨u, v⟩ = −7, ∥u∥ = 3, and ∥v∥ = 2?

Solution.
Using the Cauchy-Schwarz inequality, we obtain

−6 = −∥u∥∥v∥ ≤ ⟨u, v⟩ = −7 ≤ ∥u∥∥v∥ = 6,

which leads to a contradiction. Thus, there are no such vectors.

6°. Let C be the vector space of all continuous functions defined on [0, 1].

(a) Show that ⟨f, g⟩ = ∫_0^1 f(x)g(x) dx is an inner product on C.
(b) Find the length of the function f(x) = 2x − 3.
(c) Find the distance between f(x) = 1 + x and g(x) = 2.

Solution.

(a) • Symmetry: For any functions f, g ∈ C, we have

      ⟨f, g⟩ = ∫_0^1 f(x)g(x) dx = ∫_0^1 g(x)f(x) dx = ⟨g, f⟩.

    • Bilinearity: For any scalars a and b and any functions f, g, h ∈ C, we have

      ⟨af + bh, g⟩ = ∫_0^1 [af(x) + bh(x)]g(x) dx
                   = a ∫_0^1 f(x)g(x) dx + b ∫_0^1 h(x)g(x) dx
                   = a⟨f, g⟩ + b⟨h, g⟩.

      Similarly, ⟨f, ag + bh⟩ = a⟨f, g⟩ + b⟨f, h⟩.

    • Positive definiteness: If f is nonzero, then

      ⟨f, f⟩ = ∫_0^1 [f(x)]^2 dx > 0.

(b) ∥f∥^2 = ⟨f, f⟩ = ∫_0^1 [f(x)]^2 dx = ∫_0^1 (2x − 3)^2 dx = 13/3. Hence, ∥f∥ = √(13/3).

(c) Let h(x) = f(x) − g(x) = x − 1. Then,

      ∥h∥^2 = ⟨h, h⟩ = ∫_0^1 [h(x)]^2 dx = ∫_0^1 (x − 1)^2 dx = 1/3.

    Thus, ∥f − g∥ = 1/√3 = √3/3.

7°. An inner product on the vector space of all matrices of size 2 × 2 is defined by
⟨A, B⟩ = trace(A^T B). Find all matrices orthogonal to A = [0 1; −1 0].

Solution.
Let X = [a b; c d]. Then, A^T X = [0 −1; 1 0][a b; c d] = [−c −d; a b] and

0 = ⟨A, X⟩ = trace(A^T X) = b − c ⇔ b = c.

Thus, X is orthogonal to A if and only if it is symmetric.

8°. Prove that if Q is an orthogonal matrix, then Q preserves the angle between two
vectors in a vector space equipped with the dot product.

Solution.
We need to show that if Q is an orthogonal matrix, then the angle between Qu and
Qv equals the angle between u and v for any vectors u, v. Indeed, for any vector
x, we have
∥Qx∥^2 = (Qx)^T Qx = x^T Q^T Qx = x^T Ix = ∥x∥^2.
Therefore,
∥Qx∥ = ∥x∥ (1.7)
On the other hand, for any two vectors x and y,

Qx · Qy = (Qx)T Qy = xT QT Qy = xT y. (1.8)

From (1.7), (1.8) and the formula (1.2), we obtain the conclusion of the problem.

9°. Show that if a matrix is positive definite, then its eigenvalues are real and positive.

Solution.
Suppose A is a positive definite matrix, λ is an eigenvalue of A, and v is a
λ-eigenvector. Then, Av = λv and therefore,

0 < vT Av = vT (λv) = λvT v = λ∥v∥2 .

Hence, λ is a positive real number.

10°. Let P2 be the vector space of all polynomials of degree less than or equal to 2. Define
an inner product on P2 ,

⟨p(x), q(x)⟩ = p(0)q(0) + p(1)q(1) + p(2)q(2).

Find the polynomial in U = span{1 + x, x2 } closest to f (x) = x.

Solution.
First, we use the Gram–Schmidt process to construct an orthonormal basis of U.
Let p1(x) = x + 1 and apply (1.5):

p2(x) = x^2 − (⟨x + 1, x^2⟩/⟨x + 1, x + 1⟩)(x + 1) = x^2 − (14/14)(x + 1) = x^2 − x − 1.

Normalizing pi(x) by setting bi(x) = pi(x)/∥pi(x)∥ (i = 1, 2), we obtain

{b1(x) = (x + 1)/√14, b2(x) = (x^2 − x − 1)/√3}

as an ONB of U.
Applying (1.6), the orthogonal projection of f(x) on U is

f1(x) := projU(f(x)) = ⟨f(x), b1(x)⟩b1(x) + ⟨f(x), b2(x)⟩b2(x)
       = (4x + 4)/7 + (x^2 − x − 1)/3.

Thus, the polynomial in U closest to f(x) = x is f1(x) = x^2/3 + 5x/21 + 5/21.

1.5 Supplementary Problems

1°. Let ∥.∥2 , ∥.∥1 be Euclidean and Manhattan norms, respectively.

(a) Compute ∥x∥1 and ∥x∥2 , where x = (2, −1, 0, 1, 3).


(b) Show that ∥v∥1 ≥ ∥v∥2 for any vector v ∈ Rn .

2°. Let (V, ⟨., .⟩) be an inner product space and ∥.∥ be the norm induced by the inner
product.

(a) Show that ∥u + v∥2 + ∥u − v∥2 = 2∥u∥2 + 2∥v∥2 for any vectors u, v.
(b) Show that ⟨u + v, u − v⟩ = ∥u∥2 − ∥v∥2 for any vectors u, v.
(c) Compute ⟨u, v⟩ if ∥u + v∥ = 8 and ∥u − v∥ = 6.

3°. Show that two unit vectors u, v are orthogonal if ⟨3u − v, u + 3v⟩ = 0.

4°. Suppose {u, v, w} is an orthonormal set. Compute ⟨2u − 3v, 3u − w⟩.


" #
1 1
5°. (a) Show that ⟨x, y⟩ = xT y is an inner product on R2 .
1 3
h iT
(b) Find all real numbers a such that the length of a −1 with respect to the
" #
1 1 √
inner product ⟨u, v⟩ := uT v is 11.
1 3

6°. Consider the inner product ⟨u, v⟩ = u^T [1 0 0; 0 3 0; 0 0 1] v.

(a) Find the length of x = (1, 1, −3).


(b) Find the distance between x = (1, 1, 0) and y = (0, 2, −1).
(c) Find all values of m such that two vectors x = (m, −1, m) and y = (m, 1, 2)
are orthogonal.

7°. In each case, find a symmetric matrix A such that ⟨v, w⟩ = vT Aw.

(a) ⟨v, w⟩ = v1 w1 + v1 w2 + v2 w1 + 3v2 w2 .


(b) ⟨v, w⟩ = 2v1 w1 − v1 w2 − v2 w1 + v2 w2 .

8°. Let a > 0 and b > 0. If v = (x1, y1) and w = (x2, y2), define an inner product on
R^2 by ⟨v, w⟩ := x1x2/a^2 + y1y2/b^2.
Find all vectors orthogonal to (−1, 1).

9°. Let C be the vector space of all continuous functions defined on [0, 1].

(a) Show that ⟨f, g⟩ = ∫_0^1 f(x)g(x) dx is an inner product on C.
(b) Find the distance between f (x) = 1 + x and g(x) = x2 .

10°. A non-euclidean inner product on R3 is defined by ⟨(x, y, z), (x′ , y ′ , z ′ )⟩ = xx′ + xy ′ +


yx′ + 2yy ′ + zz ′ . Find all unit vectors orthogonal to both (1, 0, 0) and (0, 0, 1).

11°. An inner product on the vector space of all matrices of size 2 × 2 is defined by
⟨A, B⟩ = tr(A^T B).

(a) Find the length of the matrix [1 2; 3 −1].
(b) Find two nonzero matrices that are orthogonal to each other.

12°. Prove that if Q is an orthogonal matrix, then ∥Qv∥ = ∥v∥ for every vector v.

13°. Show that


|⟨u, v⟩| ≤ ∥u∥∥v∥,
where ⟨., .⟩ is an inner product.

14°. Show that if ∥.∥ is the norm induced by an inner product ⟨., .⟩, then for all vectors
x, y, the following hold:
∥x + y∥ ≤ ∥x∥ + ∥y∥.

15°. Let R2 be the inner product space with the inner product ⟨(x, y), (x′ , y ′ )⟩ = xx′ +2yy ′
and ∥.∥ be the norm derived from the inner product ⟨., .⟩. Find the value of α that
minimizes the value of ∥x − αy∥, where x = (1, 2) and y = (0, −1).

16°. Suppose two unit vectors u and v are not orthogonal. Show that w = u − v(vT u)
is orthogonal to v.

17°. Suppose A and B are two positive definite matrices.

(a) Show that A + B is positive definite.


(b) Is A − B positive definite?

18°. Let P2 be the vector space of all polynomials of degree less than or equal to 2. Define
an inner product on P2 ,

⟨p(x), q(x)⟩ = p(0)q(0) + p(1)q(1) + p(2)q(2).

Find the polynomial in U closest to f (x) = 2x where U = span{1, x2 }.


Chapter 2

MATRIX DECOMPOSITION
2.1 Eigenvalues, Eigenvectors and Eigenspaces

In this section, let A = [aij ] denote an n × n matrix of real numbers.

• A nonzero vector v is called an eigenvector of A if there exists a real (or complex)


number λ such that
Av = λv. (2.1)
λ is called an eigenvalue of A.

• If v is a λ−eigenvector, then av is also a λ−eigenvector if a ̸= 0.

• The characteristic polynomial of a square matrix A is defined by

pA (λ) := det(A − λI) (2.2)

and eigenvalues of a square matrix are roots of its characteristic polynomial.

• If λ1 , · · · , λn are eigenvalues of A, then

det(A) = λ1 · λ2 · · · λn (2.3)

and
trace(A) = a11 + a22 + · · · + ann = λ1 + λ2 + · · · + λn . (2.4)

• The eigenspace corresponding to an eigenvalue λ of A is defined by

Eλ := {x : (A − λI)x = 0} = null(A − λI). (2.5)

2.2 Eigendecomposition

• A square matrix A of size n × n is called diagonalizable if we can find an invertible


matrix P such that
D := P−1 AP is a diagonal matrix.
The matrix P is not unique; its columns are n linearly independent eigenvectors of A,
and D = diag{λ1, · · · , λn}, where λ1, λ2, · · · , λn are the corresponding eigenvalues of A.

• In general, a matrix A is diagonalizable if and only if the dimension of the eigenspace
Eλ equals the (algebraic) multiplicity of λ for every eigenvalue λ of A.

• (Spectral Theorem) If A ∈ Rn×n is a symmetric matrix, then there exists an orthog-


onal matrix Q consisting of eigenvectors of A such that

QT AQ = D = diag{λ1 , · · · , λn }. (2.6)


2.3 Singular Value Decomposition (SVD)


• (SVD Theorem) Every matrix A ∈ R^{m×n} can be expressed in the form
A = UΣV^T, (2.7)
where U, V are orthogonal matrices and Σ = [Σij] is a diagonal matrix with non-negative
singular values σ1 ≥ σ2 ≥ · · · ≥ σr > 0 and σ_{r+1} = · · · = 0 on the main diagonal
(r = rank(A)). Moreover, Σ has the same size as A and Σii = σi = √λi, where λi is an
eigenvalue of A^T A (equivalently, of AA^T), i = 1, · · · , r.
• An SVD of a matrix is not unique. Indeed, A = UΣVT and A = (−U)Σ(−V)T
are two different SVDs of A.

2.4 Matrix Approximation


• Given an SVD of A, A = UΣVT , then the best rank-k approximation of A is defined
by
Â(k) := σ1 u1 v1T + σ2 u2 v2T + · · · + σk uk vkT (2.8)

• The error of the approximation is given by


∥A − Â(k)∥2 = σk+1 , (2.9)
where ∥.∥2 denotes the spectral norm of a matrix.
• Using Â(k) to approximate A, we need only k(m + n + 1) numbers to store the
approximated matrix.
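As a numerical illustration of (2.8) and (2.9) (a hedged sketch assuming NumPy; the random matrix below is only an example, not taken from the text), the best rank-k approximation and its spectral-norm error can be computed directly from numpy.linalg.svd:

    import numpy as np

    A = np.random.randn(5, 4)                          # an arbitrary example matrix
    U, s, Vt = np.linalg.svd(A, full_matrices=False)   # s holds sigma_1 >= sigma_2 >= ...

    k = 2
    A_k = sum(s[i] * np.outer(U[:, i], Vt[i, :]) for i in range(k))   # formula (2.8)

    # Spectral-norm error equals sigma_{k+1}, formula (2.9).
    print(np.linalg.norm(A - A_k, ord=2), s[k])        # the two numbers agree up to round-off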

2.5 Solved Problems


1°. Recall that a symmetric matrix A is called positive semi-definite if x^T Ax ⩾ 0, ∀x.
Prove that all eigenvalues of a positive semi-definite matrix are real and non-negative.

Solution.
Suppose A is a positive semi-definite matrix and λ is an eigenvalue of A. Since A is
symmetric, λ is real and there exists a real eigenvector v ≠ 0 such that Av = λv. Therefore,
0 ≤ v^T Av = v^T (λv) = λ v^T v = λ∥v∥^2.
Hence, λ is non-negative.
2°. Show that if A ∈ R^{n×n}, then AA^T is symmetric and has real, non-negative eigenvalues.

Solution.
First, (AA^T)^T = (A^T)^T A^T = AA^T, so AA^T is symmetric. Moreover, for any v ∈ R^n, we have
v^T AA^T v = (A^T v)^T (A^T v) = ∥A^T v∥^2 ≥ 0.
Hence, AA^T is positive semi-definite and it follows from Problem 1° that AA^T has
real and non-negative eigenvalues.

" #
1 0 −1
3°. Find σ1 , the largest singular value of the matrix A = .
0 1 1

Solution. √
Since σ1 = λ1 , where λ1 is the largest eigenvalue of AAT , we first compute AAT
and yield " #
2 −1
AA =T
.
−1 2
The characteristic polynomial
√ of AA
√ is (2 − λ) − 1 and its eigenvalues are λ1 =
T 2

3, and λ2 = 1. Thus, σ1 = λ1 = 3.
4°. Use an SVD to construct the best rank-k approximation Â(k) of a matrix A of rank
r > k. What is the relative error of the approximation with respect to the spectral
norm?

Solution.
By (2.8), the best rank-k approximation is Â(k) = σ1u1v1^T + σ2u2v2^T + · · · + σkukvk^T.
Since ∥A∥2 = σ1 and ∥A − Â(k)∥2 = σk+1, the relative error is

∥A − Â(k)∥2 / ∥A∥2 = σk+1/σ1.
" #
1 0 1
5°. Find an SVD of the matrix .
−1 1 0
Solution.

• Step 1. Find VT . We have


2 −1 1
 

AT A = −1 1 0 .
 
1 0 1

Therefore, the characteristic polynomial of AT A is λ(1 − λ)(λ − 3) and its


eigenvalues are λ1 = 3, λ2 = 1, λ3 = 0.

– For λ1 = 3, solve the system


(AT A − λ1 I)x = 0
we obtain the general solution x = [2t −t t]T , t ∈ R. Therefore,
v1 = [ √26 √
−1
6
√1 ]T .
6
– For λ2 = 1, solve the system
(AT A − λ2 I)x = 0
we obtain the general solution x = [0 t t]T , t ∈ R. Therefore, v2 =
[0 √12 √12 ]T .
– For λ3 = 0, solve the system
(AT A − λ3 I)x = 0
we obtain the general solution x = [−t −t t]T , t ∈ R. Therefore,
v3 = [ √
−1
3
−1

3
√1 ]T .
3
12 MATRIX DECOMPOSITION

" # "√ #
σ 0 0 3 0 0
• Step 2. Construct Σ. We have Σ = 1 = .
0 σ2 0 0 1 0
• Step 3. Construct U. use the formula
Ai vi
ui = , i = 1, 2,
σi
we have
√ 
2/ √6

" #
1 0 1 
−1/√ 6

−1 1 0 " 1 # " 1 #
1/ 6 √ √
u1 = √ = √
−1
2 and u2 = √1
2
3 2 2

Thus, an SVD of A is
−1
 
" 1 # "√ # √2 √ √1
√ √1 3 0 0  6 6 6
A= 2 2  0 √1 √1  .
−1
√ √1 0 1 0  −1 −1
2 2
2 2 √ √ √1
3 3 3
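The factorization above can be double-checked with NumPy (an added sketch, not part of the original solution). An SVD is not unique, so numpy's own factors may differ by signs, but the product UΣV^T and the singular values must agree:

    import numpy as np

    A = np.array([[1.0, 0.0, 1.0],
                  [-1.0, 1.0, 0.0]])

    U = np.array([[1.0, 1.0], [-1.0, 1.0]]) / np.sqrt(2)         # columns u1, u2
    Sigma = np.array([[np.sqrt(3), 0.0, 0.0], [0.0, 1.0, 0.0]])
    Vt = np.array([[2.0, -1.0, 1.0],                             # rows v1^T, v2^T, v3^T
                   [0.0, 1.0, 1.0],
                   [-1.0, -1.0, 1.0]])
    Vt[0] /= np.sqrt(6); Vt[1] /= np.sqrt(2); Vt[2] /= np.sqrt(3)

    print(np.allclose(U @ Sigma @ Vt, A))        # True: the factors reproduce A
    print(np.linalg.svd(A, compute_uv=False))    # [1.732..., 1.0], the singular values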

2.6 Supplementary Problems

1°. Recall that a matrix A is positive semi-definite if xT Ax ⩾ 0, ∀x.

(a) Show that if D = diag(d1 , d2 , · · · , dn ) and each di ⩾ 0, then D is positive


semi-definite.
(b) If A is positive semi-definite and λ is an eigenvalue, then λ ⩾ 0.

2°. Show that if A ∈ Rn×n , AAT is symmetric and has real, non-negative eigenvalues.

3°. Suppose A ∈ R^{n×n} is symmetric and v1, v2 are a λ1-eigenvector and a λ2-eigenvector,
respectively. Prove that if λ1 ≠ λ2, then v1 ⊥ v2.

4°. Suppose A = U ΣV T is an SVD of a matrix A ∈ Rn×n . Find an SVD of AT .


" #
1 0 2
5°. Find the spectral norm of the matrix A = .
0 1 0

6°. (a) Use an SVD to construct the best rank-k approximation Â(k) of a matrix A
of rank r > k. What is the relative error of the approximation with respect to
the spectral norm?
(b) Suppose a matrix A has positive singular values 4, 3, 2, 1. What is the spectral
norm of this matrix? What is the Frobenius norm of this matrix?

7°. Suppose a matrix A has positive singular values 4, 3, 2, 1.

(a) What is the largest eigenvalue of AAT ?


(b) What is the spectral norm of AT ?
" #
1 2
8°. Let σ1 ⩾ σ2 be two singular values of the matrix , find the value of σ12 + σ22 .
3 4

" #
1 0 1
9°. Given an SVD of a matrix A = ,
−1 1 0

−1
 
" 1 # "√ # √2 √ √1
√ √1 3 0 0  6 6 6
A= 2 2  0 √1 √1  .
−1
√ √1 0 1 0 −1

−1
2 2
2 2 √ √ √1
3 3 3

(a) Let Â(1) be the best rank-1 approximation of A. Find (1,2)-entry of Â(1)
(b) Let Â(2) be the best rank-2 approximation of A. Find (1,3)-entry of Â(2)
Chapter 3

VECTOR CALCULUS
3.1 Derivatives, Partial Derivatives and Gradients
• The derivative of the function f(x), denoted by f′(x) or (d/dx)f(x), is the function
defined by

f′(x) = lim_{h→0} [f(x + h) − f(x)] / h. (3.1)

The domain of f′(x) is the set of all x such that the limit in (3.1) exists and is finite.
• The partial derivative of f(x1, x2, · · · , xn) with respect to xi, i = 1, · · · , n, is the
function defined by

∂f/∂xi = lim_{h→0} [f(x1, · · · , xi + h, · · · , xn) − f(x1, · · · , xi, · · · , xn)] / h. (3.2)

When computing the partial derivative of f with respect to xi, we treat the other
variables as constants.
• The gradient of a multivariate function x = (x1, x2, · · · , xn) ↦ f(x) ∈ R is the row
vector

∇x f := [∂f/∂x1  ∂f/∂x2  · · ·  ∂f/∂xn]. (3.3)

• If f : Rn → Rm , then the size of the gradient ∇x f is m × n.


• The gradient of f (A) : Rm×n → Rk , with respect to a matrix of variables A, is a
k × (m × n) tensor.

3.2 The Chain Rule


• For univariate functions u = u(x), f = f (u), we have
df df du
= . (3.4)
dx du dx

Fig. 3.1: The Chain Rule

• For vector-valued functions u : Rn → Rm , f : Rm → Rk , we have


∇x f = ∇u f ∇x u. (3.5)
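The chain rule (3.5) can be sanity-checked numerically (a sketch assuming NumPy; the composite map below is an arbitrary example, and the finite-difference step h is a heuristic choice):

    import numpy as np

    # Composite map: f(u) = u1^2 + u2 with u(x) = (x1*x2, x1 + x2).
    f = lambda u: u[0]**2 + u[1]
    u = lambda x: np.array([x[0] * x[1], x[0] + x[1]])

    grad_f_u = lambda w: np.array([2 * w[0], 1.0])               # 1 x 2 gradient of f w.r.t. u
    jac_u_x = lambda x: np.array([[x[1], x[0]], [1.0, 1.0]])     # 2 x 2 Jacobian of u w.r.t. x

    x0 = np.array([1.5, -0.7])
    chain = grad_f_u(u(x0)) @ jac_u_x(x0)                        # formula (3.5)

    h = 1e-6
    numeric = np.array([(f(u(x0 + h * np.eye(2)[i])) - f(u(x0 - h * np.eye(2)[i]))) / (2 * h)
                        for i in range(2)])
    print(chain, numeric)                                        # the two gradients agree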


3.3 Hessian

• While the gradient is the collection of all first-order partial derivatives of a
function, the Hessian is the collection of all its second-order partial derivatives.

• If f = f(x1, x2, · · · , xn), the Hessian of f, denoted by ∇^2 f, is the n × n matrix

∇^2 f = [ ∂^2f/∂x1^2    ∂^2f/∂x1∂x2   · · ·   ∂^2f/∂x1∂xn
          ∂^2f/∂x2∂x1   ∂^2f/∂x2^2    · · ·   ∂^2f/∂x2∂xn
          · · ·
          ∂^2f/∂xn∂x1   ∂^2f/∂xn∂x2   · · ·   ∂^2f/∂xn^2 ].   (3.6)

• If f is a vector-valued function f : R^n → R^m, the Hessian of f is an (n × n) × m tensor.

3.4 Taylor Series and Taylor Polynomials

• The Taylor series of f(x) at x0 (or centered at x0) is

f(x) = Σ_{n=0}^∞ cn (x − x0)^n,   |x − x0| < R, (3.7)

where the coefficients are given by cn = f^(n)(x0)/n!.

• When x0 = 0, the Taylor series is called the Maclaurin series, which is of the form

f(x) = f(0) + (f′(0)/1!) x + (f′′(0)/2!) x^2 + · · · + (f^(n)(0)/n!) x^n + · · · (3.8)

• The Taylor polynomial of degree n is defined by

Tn(x) := f(x0) + (f′(x0)/1!)(x − x0) + (f′′(x0)/2!)(x − x0)^2 + · · · + (f^(n)(x0)/n!)(x − x0)^n. (3.9)
The Taylor polynomials of a function at x0 can be used to approximate the value of
the function near x0 .

• The Taylor series of a function f(x, y) at (x0, y0) is of the form

f(x, y) = f(x0, y0) + (1/1!)[ (∂f/∂x)(x0, y0)(x − x0) + (∂f/∂y)(x0, y0)(y − y0) ]
        + (1/2!)[ (∂^2f/∂x^2)(x0, y0)(x − x0)^2 + 2(∂^2f/∂x∂y)(x0, y0)(x − x0)(y − y0)
                  + (∂^2f/∂y^2)(x0, y0)(y − y0)^2 ]
        + · · · + (1/n!) Σ_{k=0}^n C(n, k) (∂^n f/∂x^k ∂y^{n−k})(x0, y0) (x − x0)^k (y − y0)^{n−k} + · · · . (3.10)

3.5 Solved Problems


1°. Find the partial derivatives ∂f/∂s and ∂f/∂t when (t, s) = (0, 1).

(a) f(x, y) = 2x/(1 + y^2), x(t, s) = s(1 − t), y(t, s) = t − s.
(b) f(u, v) = u^2 + v^2, u(t, s) = s cos t, v(t, s) = t^2 − s.

Solution.

When (t, s) = (0, 1), we have (x, y) = (1, −1) in part (a) and (u, v) = (1, −1) in part (b).

(a) Since ∂f/∂x = 2/(1 + y^2) and ∂f/∂y = −4xy/(1 + y^2)^2, both equal 1 at (x, y) = (1, −1).
    Therefore,

    ∂f/∂s = (∂f/∂x)(∂x/∂s) + (∂f/∂y)(∂y/∂s) = [2/(1 + y^2)](1 − t) + [−4xy/(1 + y^2)^2](−1) = 1 − 1 = 0

    and

    ∂f/∂t = (∂f/∂x)(∂x/∂t) + (∂f/∂y)(∂y/∂t) = [2/(1 + y^2)](−s) + [−4xy/(1 + y^2)^2](1) = −1 + 1 = 0.

(b) We have

    ∂f/∂s = (∂f/∂u)(∂u/∂s) + (∂f/∂v)(∂v/∂s) = 2u cos t + 2v(−1) = 2 + 2 = 4

    and

    ∂f/∂t = (∂f/∂u)(∂u/∂t) + (∂f/∂v)(∂v/∂t) = 2u(−s sin t) + 2v(2t) = 0.
2°. Find the Taylor series of the function f(x) = ln(x + 2) centered at a = 1 and use
the Taylor polynomial of degree 2 to approximate the value of f(1.2).

Solution.
We have

f′(x) = 1/(x + 2), f′′(x) = −(x + 2)^{−2}, and in general f^(n)(x) = (−1)^{n−1}(n − 1)!(x + 2)^{−n}.

Direct computations yield f(1) = ln 3, f′(1) = 1/3, f′′(1) = −1/3^2, f′′′(1) = 2/3^3, · · · ,
f^(n)(1) = (−1)^{n−1}(n − 1)!/3^n.
Thus, the Taylor series of f(x) about a = 1 is

ln(x + 2) = ln 3 + (1/3)(x − 1) − (1/(3^2 · 2))(x − 1)^2 + (2/(3^3 · 3!))(x − 1)^3 + · · ·
            + ((−1)^{n−1}/(n · 3^n))(x − 1)^n + · · · .

The Taylor polynomial of degree 2 of f(x) at a = 1 is

T2(x) = ln 3 + (1/3)(x − 1) − (1/18)(x − 1)^2.

Then, f(1.2) = ln(3.2) ≈ 1.1632, while T2(1.2) = ln 3 + 0.2/3 − 0.04/18 ≈ 1.1631, so the
approximation is accurate near a = 1.
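A quick numerical confirmation of these values (a sketch assuming NumPy):

    import numpy as np

    T2 = lambda x: np.log(3) + (x - 1) / 3 - (x - 1)**2 / 18
    print(np.log(3.2), T2(1.2))     # 1.16315...  and  1.16306...: both round to 1.1631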

3°. Given the function f (x, y) = x3 + xy 2 − 4xy + 2. Find the Taylor series of the
function around the point (1, −1).

Solution.
We have fx = ∂f/∂x = 3x^2 + y^2 − 4y, fy = 2xy − 4x, and

fxx = ∂^2f/∂x^2 = 6x,  fxy = fyx = ∂^2f/∂x∂y = 2y − 4,  fyy = 2x.

Hence, fxxx = 6, fxxy = fxyx = fyxx = 0, fxyy = fyxy = fyyx = 2, fyyy = 0, and all
fourth-order partial derivatives are zero.
Therefore,

f(1, −1) = 8, fx(1, −1) = 8, fy(1, −1) = −6, fxx(1, −1) = 6,
fxy(1, −1) = fyx(1, −1) = −6, fyy(1, −1) = 2.

The Taylor series of f(x, y) about (1, −1) is

f(x, y) = 8 + (1/1!)(8(x − 1) − 6(y + 1))
        + (1/2!)(6(x − 1)^2 + 2(−6)(x − 1)(y + 1) + 2(y + 1)^2)
        + (1/3!)(6(x − 1)^3 + 3(2)(x − 1)(y + 1)^2).
4°. Find the linear approximation of the function f(x, y, z) = xyz − y^2 + x^2 + 2y at the
point (2, 1, −1).

Solution.
We have fx = ∂f/∂x = yz + 2x, fy = xz − 2y + 2, fz = xy and therefore,

f(2, 1, −1) = 3, fx(2, 1, −1) = 3, fy(2, 1, −1) = −2, and fz(2, 1, −1) = 2.

The linear approximation of f at (2, 1, −1) is

L(x, y, z) = f(2, 1, −1) + fx(2, 1, −1)(x − 2) + fy(2, 1, −1)(y − 1) + fz(2, 1, −1)(z + 1)
           = 3 + 3(x − 2) − 2(y − 1) + 2(z + 1)
           = 3x − 2y + 2z + 1.

5°. Find the gradient of the function f : R^n → R, f(x) = ∥x − c∥, with respect to x.

Solution.
Since f(x) = ∥x − c∥ = √((x1 − c1)^2 + (x2 − c2)^2 + · · · + (xn − cn)^2), we have, for x ≠ c,

∂f/∂x1 = (x1 − c1)/∥x − c∥,  ∂f/∂x2 = (x2 − c2)/∥x − c∥,  · · · ,  ∂f/∂xn = (xn − cn)/∥x − c∥.

Hence,

∇x f = (1/∥x − c∥) [x1 − c1  x2 − c2  · · ·  xn − cn] = (x − c)^T / ∥x − c∥,

which is a row vector, consistent with (3.3).

6°. Find the Jacobian determinant of (x(t, s), y(t, s)) = (s cos t, t sin s).

Solution.

• x = s cos t ⇒ xt = −s sin t, xs = cos t;


• y = t sin s ⇒ yt = sin s, ys = t cos s.
• The Jacobian is

J = [xt xs; yt ys] = [−s sin t  cos t; sin s  t cos s].

• The Jacobian determinant is

det(J) = (−s sin t)(t cos s) − (cos t)(sin s) = −ts sin t cos s − cos t sin s.

7°. Find the gradient of f (x, y) = (x − 3y, x + 4y, −2x + y).

Solution.
The gradient of f with respect to (x, y) is the 3 × 2 matrix

[1 −3; 1 4; −2 1].

8°. Find the Hessian matrix of the function f(x, y, z) = x^3 e^{y−z^2} at the point (1, 0, −1).

Solution.

• We have fx = 3x^2 e^{y−z^2}, fy = x^3 e^{y−z^2}, fz = −2zx^3 e^{y−z^2}.
• From fx = 3x^2 e^{y−z^2}: fxx = 6x e^{y−z^2}, fxy = 3x^2 e^{y−z^2}, fxz = −6zx^2 e^{y−z^2}.
• From fy = x^3 e^{y−z^2}: fyy = x^3 e^{y−z^2}, fyz = −2zx^3 e^{y−z^2}.
• From fz = −2zx^3 e^{y−z^2}: fzz = −2x^3 e^{y−z^2} + 4z^2 x^3 e^{y−z^2}.
• The Hessian of f at the point (1, 0, −1) is

  ∇^2 f(1, 0, −1) = [fxx fxy fxz; fyx fyy fyz; fzx fzy fzz](1, 0, −1)
                  = [6/e 3/e 6/e; 3/e 1/e 2/e; 6/e 2/e 2/e].
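This Hessian can be confirmed with central finite differences (an added NumPy sketch; the step size h is a heuristic choice, not part of the text):

    import numpy as np

    f = lambda p: p[0]**3 * np.exp(p[1] - p[2]**2)
    p0, h = np.array([1.0, 0.0, -1.0]), 1e-4

    H = np.zeros((3, 3))
    for i in range(3):
        for j in range(3):
            ei, ej = h * np.eye(3)[i], h * np.eye(3)[j]
            H[i, j] = (f(p0 + ei + ej) - f(p0 + ei - ej)
                       - f(p0 - ei + ej) + f(p0 - ei - ej)) / (4 * h**2)

    print(np.round(H * np.e, 3))    # ~[[6, 3, 6], [3, 1, 2], [6, 2, 2]], as computed above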

3.6 Supplementary Problems

1°. Find the derivative of each of following functions.



(a) g(x) = 1/(1 + e^{−ax−b}).
(b) h(t) = · · · /(1 + t^2).
(c) L(θ) = p^θ (1 − p)^{1−θ}.

2°. Find the Taylor series of the function f (x) = x4 + 2x2 − 3x2 + 8x − 2 centered at
a = 1.

3°. Suppose f is a continuous function such that f ′′′ exists and f (0) = 2, f ′ (0) =
−1, f ′′ (0) = 6 and f ′′′ (0) = 12. Find the Taylor polynomial of degree 3 of f around
a = 0.

4°. Given the function f (x, y) = x3 − 6xy 2 + 12xy − 3x + 4y.

(a) Find the Taylor series of the function around the point (1, −1).
(b) Use the Taylor polynomial of degree 2 of f around (1, −1) to approximate the
value of f (1.2, −0.9).

5°. Use the linear approximation of the function f (x, y, z) = x2 y − yz 3 + 2x at the point
(0, 1, −1) to approximate the value of f (0.1, 0.8, −1.2).

6°. Given the functions f(u, v) = u(u − v), u(x, y) = x + y, v(x, y) = y. Find the partial
derivatives ∂f/∂x and ∂f/∂y.

7°. Find the gradient of the function f : Rn → R; f (x) = ∥x∥2 + cT x.

8°. Find the Jacobian determinant of (u, v, w) = (x − z, 2x + y + z, y + yz).

9°. Find the gradient of f (x, y, z) = (xyz, x + y − z).

10°. Find the Hessian matrix of the function f (x, y, z) = x3 + y 2 − 3yez at the point
(1, −1, 0).

11°. Find the gradient of f (x, y, z) = (x3 − 2xy, y + z 2 ) at the point (1, 0, 2).
Chapter 4

PROBABILITY AND DISTRIBUTIONS


4.1 Bivariate Discrete Random Variables
• The joint distribution p(x, y) = P(X = x and Y = y) of two discrete random
variables X, Y has the following properties:

(i) p(x, y) ≥ 0 for every pair (x, y).
(ii) Σ_x Σ_y p(x, y) = 1.

• The marginal distribution of X is given by

p(x) = Σ_y p(x, y) (4.1)

and the marginal distribution of Y is given by

p(y) = Σ_x p(x, y). (4.2)

• The conditional distribution of X given Y is given by

p(x|y) = p(x, y) / p(y) (4.3)

and the conditional distribution of Y given X is given by

p(y|x) = p(x, y) / p(x). (4.4)

4.2 Bivariate Continuous Random Variables


• The joint distribution f(x, y) of two continuous random variables X, Y has the
following properties:

(i) f(x, y) ≥ 0 for all x, y.
(ii) ∫_{−∞}^{∞} ∫_{−∞}^{∞} f(x, y) dx dy = 1.

• The probability P(a < X < b, c < Y < d) is given by

∫_a^b ∫_c^d f(x, y) dy dx. (4.5)

• The marginal distribution of X is given by

fX(x) = ∫_{−∞}^{∞} f(x, y) dy (4.6)

and the marginal distribution of Y is given by

fY(y) = ∫_{−∞}^{∞} f(x, y) dx. (4.7)


• The conditional distribution of X given Y = y is given by

fX|y(x) = f(x, y) / fY(y) (4.8)

and the conditional distribution of Y given X = x is given by

fY|x(y) = f(x, y) / fX(x). (4.9)

4.3 Variance Matrix, Covariance Matrix and Correlation

• The covariance of two random variable is given by

Cov(X, Y ) = σXY = E(XY ) − E(X)E(Y ) (4.10)

• If X is a random vector, X = [X1 X2 · · · Xn]^T, then the variance matrix of X is

V[X] := [σ1^2 σ12 · · · σ1n; σ21 σ2^2 · · · σ2n; · · · ; σn1 σn2 · · · σn^2], (4.11)

where σij = σ_{XiXj} = Cov[Xi, Xj] for i ≠ j and σi^2 is the variance of the variable
Xi, i = 1, · · · , n. The variance matrix is symmetric and positive semi-definite.

• If X is a random vector with mean vector E(X) and variance matrix V[X], and Y is
a random vector such that Y = AX, where A is a real matrix, then

E(Y) = AE(X) (4.12)

and
V[Y ] = AV[X]AT . (4.13)

• The correlation coefficient of two random variables X, Y is given by

corr[X, Y] := σXY / (σX σY). (4.14)

For all random variables X, Y, it holds that

−1 ≤ corr[X, Y] ≤ 1. (4.15)

• The correlation matrix of a random vector X is

[1 c12 · · · c1n; c21 1 · · · c2n; · · · ; cn1 cn2 · · · 1], (4.16)

where cij = corr[Xi, Xj] for i ≠ j.



4.4 Bivariate Gaussian

• A bivariate Gaussian is of the form

p(x, y) = N(µ = [µx; µy], Σ = [σ1^2 σ12; σ21 σ2^2]), (4.17)

where µ is the mean vector and Σ is the covariance matrix of X and Y.

• If the joint distribution p(x, y) is a Gaussian, then the marginal distributions are
also Gaussians (graphs of pdfs are bell-shaped curves):

p(x) = N (µx , σ12 ), p(y) = N (µy , σ22 ), (4.18)

• The conditional distributions are Gaussians,

p(x|y) = N(µx|y, σx|y^2), (4.19)

where µx|y and σx|y^2 are given by

µx|y = µx + (σ12/σ2^2)(y − µy),   σx|y^2 = σ1^2 − σ12^2/σ2^2. (4.20)

• The mixture of two Gaussians p1 (x) = N (µ1 , σ12 ) and p2 (x) = N (µ2 , σ22 ) with a
weight 0 < α < 1 is defined by

p(x) = αp1 (x) + (1 − α)p2 (x). (4.21)

Then, the mean and variance of the corresponding random variable are given by

µ = αµ1 + (1 − α)µ2 , σ 2 = α(µ21 + σ12 ) + (1 − α)(µ22 + σ22 ) − µ2 . (4.22)
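The conditional-mean formula in (4.20) can be sanity-checked by simulation (an added sketch assuming NumPy; the mean vector and covariance matrix below are the ones used in Solved Problem 4° of this chapter, and the sample size and window width are arbitrary choices):

    import numpy as np

    rng = np.random.default_rng(0)
    mu = np.array([-2.0, 1.0])
    Sigma = np.array([[3.0, -2.0], [-2.0, 2.0]])
    xy = rng.multivariate_normal(mu, Sigma, size=500_000)

    # Empirical E[X | Y near 0] versus the formula mu_x + (sigma_12 / sigma_2^2)(y - mu_y).
    near = np.abs(xy[:, 1]) < 0.05
    print(xy[near, 0].mean(), mu[0] + Sigma[0, 1] / Sigma[1, 1] * (0.0 - mu[1]))   # both ~ -1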

4.5 Solved Problems

1°. Given the joint distribution of two discrete random variables X and Y,

f(x, y) = c/(x + y) for x = 1, 2, 3; y = 1, 2,  and f(x, y) = 0 elsewhere.

(a) Find the value of the constant c.


(b) Find P (Y = 2|X = 1).
(c) Find E(X)

Solution.

(a) Since

    1 = Σ_{x=1}^{3} Σ_{y=1}^{2} f(x, y)
      = f(1, 1) + f(1, 2) + f(2, 1) + f(2, 2) + f(3, 1) + f(3, 2)
      = c(1/2 + 1/3 + 1/3 + 1/4 + 1/4 + 1/5)
      = 28c/15,

    we obtain c = 15/28.

(b) We have

    P(Y = 2 | X = 1) = P(Y = 2, X = 1) / P(X = 1)
                     = f(1, 2) / [f(1, 1) + f(1, 2)]
                     = (1/3) / (1/2 + 1/3) = 2/5.

(c) P(X = 1) = Σ_{y=1}^{2} f(1, y) = f(1, 1) + f(1, 2) = (15/28)(1/2 + 1/3) = 25/56.
    Similarly, P(X = 2) = 35/112 and P(X = 3) = 27/112.
    Therefore,

    E(X) = Σ_{x=1}^{3} x P(X = x) = 25/56 + 2 × 35/112 + 3 × 27/112 = 201/112.

2°. Given the joint distribution of two continuous random variables X and Y,

f(x, y) = c(x + y^2) for 0 ≤ x ≤ 2; 0 ≤ y ≤ 1,  and f(x, y) = 0 elsewhere.

(a) Find the value of the constant c.

(b) Find P (X + Y > 2).

(c) Find E(Y |X = 1).

Solution.

(a) We have

    1 = ∫_{−∞}^{∞} ∫_{−∞}^{∞} f(x, y) dy dx = ∫_0^2 ∫_0^1 c(x + y^2) dy dx
      = ∫_0^2 c(x + 1/3) dx = 8c/3,

    so c = 3/8.

(b) We have

    P(X + Y > 2) = ∫_0^1 ∫_{2−y}^2 f(x, y) dx dy
                 = ∫_0^1 ∫_{2−y}^2 (3/8)(x + y^2) dx dy
                 = ∫_0^1 (3/8)[x^2/2 + y^2 x]_{x=2−y}^{x=2} dy
                 = ∫_0^1 (3/8)(2y − y^2/2 + y^3) dy
                 = (3/8)(1 − 1/6 + 1/4) = 13/32.

    Fig. 4.1: Region {(x, y) | 0 ≤ x ≤ 2, 0 ≤ y ≤ 1, 2 ≤ x + y}

(c) The conditional distribution of Y given X = 1 is

    fY|x=1(y) = f(1, y) / fX(1) = (3/8)(1 + y^2) / ∫_0^1 (3/8)(1 + y^2) dy = (3/4)(1 + y^2).

    Then,

    E(Y | X = 1) = ∫_0^1 y fY|x=1(y) dy = ∫_0^1 y (3/4)(1 + y^2) dy = (3/4)(1/2 + 1/4) = 9/16.
3°. Suppose X and Y are two continuous random variables such that V (X) = 4, V (Y ) =
3, and Cov[X, Y ] = −8. Let Z = 2X − 3Y, find V (Z).

Solution.
V (Z) = V (2X − 3Y ) = 22 V (X) + 2 × 2 × (−3) × Cov[X, Y ] + (−3)2 V (Y ) = 139.
" # " #!
−2 3 −2
4°. Given the bivariate Gaussian p(x, y) = N µ = ,Σ = .
1 −2 2

(a) Find µx|y=0 and µy|x=−1 .


(b) Find V (X|y = 2).

Solution.

(a) We have

1 1
µx|y=0 = µx + σ12 × (µy − y) = −2 + (−2) (1 − 0) = −3
σ22 2

and
σ21 −2 5
µy|x=−1 = µy + (µx − x) = 1 + (−2 − (−1)) = .
σ12 3 3

(b) The variance of X given y = 2 is


2
σ12 (−2)2
V (X|y = 2) = σx|y=2
2
= σ12 − =3− = 1.
σ22 2

5°. Given two Gaussians p1 (x) = N (µ1 = 2, σ12 = 3) and p2 (x) = N (µ2 = 1, σ22 = 2).
Consider p(x) = αp1 (x) + (1 − α)p2 (x), where 0 ≤ α ≤ 1. Then p(x) is the pdf of a
continuous random variable X.

(a) Find E(X) and V (X), if α = 0.4.


(b) Given α = 0.3, find the probability that X > 1.

Solution.

(a) We have

E(X) = µ = αµ1 + (1 − α)µ2 = 0.4 × 2 + (1 − 0.4) × 1 = 1.4

and

V (X) = σ 2 = α(µ21 + σ12 ) + (1 − α)(µ22 + σ22 ) − µ2


= 0.4(22 + 3) + (1 − 0.4)(12 + 2) − 1.42 = 2.64.

(b) We have

    P(X > 1) = 1 − P(X ≤ 1)
             = 1 − ∫_{−∞}^1 p(x) dx
             = 1 − ∫_{−∞}^1 [αp1(x) + (1 − α)p2(x)] dx
             = 1 − α ∫_{−∞}^1 p1(x) dx − (1 − α) ∫_{−∞}^1 p2(x) dx
             = 1 − αP(Z ≤ (1 − 2)/√3) − (1 − α)P(Z ≤ (1 − 1)/√2)
             = 1 − 0.3 × 0.282 − 0.7 × 0.5 = 0.5654.

6°. Consider a random vector X with mean vector µ = [3; 4; 2] and variance matrix

    V[X] = [3 1 0; 1 2 1; 0 1 3].

    Let Y = [2 3 1; 1 −2 2] X.

(a) Compute E(Y).
(b) Compute the variance matrix of Y.

Solution.
Let A = [2 3 1; 1 −2 2].

(a) E(Y) = AE(X) = [2 3 1; 1 −2 2] [3; 4; 2] = [20; −1].

(b) V[Y] = AV[X]A^T = [2 3 1; 1 −2 2] [3 1 0; 1 2 1; 0 1 3] [2 3 1; 1 −2 2]^T = [51 3; 3 11].
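Formulas (4.12)–(4.13) used above are easy to verify numerically (an added NumPy sketch, not part of the original solution):

    import numpy as np

    mu = np.array([3.0, 4.0, 2.0])
    V = np.array([[3.0, 1.0, 0.0],
                  [1.0, 2.0, 1.0],
                  [0.0, 1.0, 3.0]])
    A = np.array([[2.0, 3.0, 1.0],
                  [1.0, -2.0, 2.0]])

    print(A @ mu)        # [20. -1.]               = E(Y)
    print(A @ V @ A.T)   # [[51. 3.], [3. 11.]]    = V[Y]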

4.6 Supplementary Problems

1. Given the joint distribution of two discrete random variables X and Y,


(
c(x + 2y), for x = 1, 2; y = 0, 1, 2,
f (x, y) =
0, elsewhere.

(a) Find the value of the constant c.


(b) Find P (Y |X = 1).
(c) Find E(X|Y = 0).

2. Given the joint distribution of two discrete random variables X and Y,

 cx ,

for x = 1, 2; y = 1, 2, 3,
f (x, y) = y
 0, elsewhere.

(a) Find the value of the constant c.


(b) Find P (X + Y = 3).
(c) Find Cov[X, Y ].

3. Given the joint distribution of two continuous random variables X and Y,


(
c(x2 + 2y), for 0 ≤ x ≤ 1; 0 ≤ y ≤ 2,
f (x, y) =
0, elsewhere.

(a) Find the value of the constant c.


(b) Find P (Y > 1).
(c) Find E(X|Y = 1).
(d) Find P (X + Y < 2).

4. Given the joint distribution of two continuous random variables X and Y,


(
c(xy + 1), for 0 ≤ x ≤ 1; 0 ≤ y ≤ 1,
f (x, y) =
0, elsewhere.

(a) Find the value of the constant c.


(b) E(X)
(c) P (X + Y < 1).

5. Suppose X and Y are two continuous random variables such that V (X) = 3, V (Y ) =
5, and Cov[X, Y ] = −4. Let Z = 2X − Y, find V (Z).
" # " #!
2 2 3
6. Given the bivariate Gaussian p(x, y) = N µ = ,Σ = .
3 3 5

(a) Find µx|y=1 and µy|x=2 .


(b) Find V (X|Y = 1).

7. Given two Gaussians p1 (x) = N (µ1 = 3, σ12 = 4) and p2 (x) = N (µ2 = 3, σ22 = 6).
Consider p(x) = αp1 (x) + (1 − α)p2 (x), where 0 ≤ α ≤ 1.

(a) Prove that p(x) is the probability density function of some continuous random
variable X.
(b) Find E(X) and V (X), when α = 0.3.
(c) Given α = 0.3, find the probability that X < 3.

8. Given the data points in R3 : (1, 1, 0); (−1, 0, 1); (2, 2, 3); (−2, 1, 0).

(a) Find the sample mean vector.


(b) Find the data covariance matrix of the data points.

9. Consider a random vector X with mean vector µ = [2; −1; 1] and variance matrix
   [2 0 1; 0 2 1; 1 1 3]. Let Y = [−2 1 3; 1 −2 0] X.

(a) Compute E(Y).


(b) Compute the variance matrix of Y.
Chapter 5

CONTINUOUS OPTIMIZATION
5.1 Gradient Descent Algorithm
• To find a local minimum value of a function f(x), an algorithm called gradient
descent can be used as follows (a minimal implementation is sketched after this list):
– Choose/guess x0 near the optimal point x∗.
– Repeatedly compute xn using
xn = xn−1 − γ[∇f(xn−1)]^T, (5.1)
for n = 1, 2, · · · until the stopping criterion is met.
• For a suitable learning rate γ > 0, the sequence f(xn) is decreasing and tends to the
local minimum value f(x∗).
• A small learning rate may cause slow convergence, while the algorithm may fail
to converge with a large learning rate. The learning rate can be changed at each
iteration of the algorithm.
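A minimal implementation of the update (5.1) (a sketch in Python/NumPy; the quadratic objective, learning rate, and number of steps are illustrative choices only, taken from Solved Problem 3° below):

    import numpy as np

    def gradient_descent(grad, x0, lr=0.1, n_steps=100):
        """Iterate x_n = x_{n-1} - lr * grad(x_{n-1}), i.e. formula (5.1)."""
        x = np.asarray(x0, dtype=float)
        for _ in range(n_steps):
            x = x - lr * grad(x)
        return x

    # f(x, y) = x^2 + 3y^2 - 2xy has its unique minimum at the origin.
    grad_f = lambda p: np.array([2 * p[0] - 2 * p[1], 6 * p[1] - 2 * p[0]])
    print(gradient_descent(grad_f, [0.0, 1.0], lr=0.1, n_steps=200))   # ~[0, 0]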

5.2 Convex Sets and Convex Functions


• A non-empty set C is called convex if for any two points u, v ∈ C, the line segment
[u, v] := {θu + (1 − θ)v | θ ∈ [0, 1]} is entirely contained in C.

(a) A convex set (b) A non-convex set

Fig. 5.1: Examples of convex and non-convex sets

• A function f defined on a convex set D is called convex if

f(θx + (1 − θ)y) ≤ θf(x) + (1 − θ)f(y), for all x, y ∈ D and θ ∈ [0, 1]. (5.2)

• A function f is called concave if −f is a convex function.


• If f is convex (respectively concave) on D, then every local minimum (maximum) of f
on D is a global minimum (maximum); if f is strictly convex (concave), the minimizer
(maximizer) is unique when it exists.
• A twice-differentiable function f is convex if and only if its Hessian ∇^2 f is positive
semi-definite everywhere.


5.3 Constrained Optimization and Lagrange Multipliers

• To minimize/maximize f (x) subject to the constraints gi (x) = 0, i = 1, 2, · · · , m, we


set the Lagrangian
L(x, λ) = f (x) + λT g(x)
with the Lagrange multipliers λ = [λ1 λ2 · · · λm ]T .

The optimal point can be found by solving the system

∂L/∂x = 0,  ∂L/∂λ = 0. (5.3)

• The dual Lagrangian of the problem is defined by

D(λ) := min L(x, λ). (5.4)


x

• Consider the constrained optimization problem

min f (x) subject to gi (x) ≤ 0, hj (x) = 0. (5.5)


x∈Rn

By a convex optimization problem, we mean that f(.) and the gi(.) are convex functions
and the sets {x : hj(x) = 0} are convex.

5.4 Linear Programming and Quadratic Programming

• The linear program is in the form

min cT x subject to Ax ≤ b, (5.6)


x∈Rn

where A ∈ Rm×n , b ∈ Rm , and c ∈ Rn .

• The dual Lagrangian of the linear program is D(λ) = −λ^T b (attained when c + A^T λ = 0).

• The quadratic program is in the form

1 T
min x Qx + cT x subject to Ax ≤ b, (5.7)
x∈Rn 2

where Q ∈ Rn×n is a positive definite matrix and A ∈ Rm×n , b ∈ Rm , and c ∈ Rn .
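Small instances of (5.6) can be solved numerically. The sketch below assumes SciPy is available and solves the linear program of Solved Problem 8° further down; the nonnegativity bounds x, y ≥ 0 are an extra assumption added here so that the program is bounded:

    import numpy as np
    from scipy.optimize import linprog

    # min 2x - y  subject to  x - y <= 3,  2x + y <= 11,  x >= 0, y >= 0
    c = np.array([2.0, -1.0])
    A = np.array([[1.0, -1.0], [2.0, 1.0]])
    b = np.array([3.0, 11.0])

    res = linprog(c, A_ub=A, b_ub=b, bounds=[(0, None), (0, None)])
    print(res.x, res.fun)   # optimal point [0, 11], optimal value -11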

5.5 Solved Problems

1°. Show that the function f (x, y, z) = x2 + y 2 − xy + z 2 is convex on R3 .



Solution.
The Hessian of f is

H = [2 −1 0; −1 2 0; 0 0 2].

We have det[2] = 2 > 0, det[2 −1; −1 2] = 3 > 0, and det(H) = 6 > 0. Since the
determinants of all leading principal submatrices are positive, H is positive definite.
Therefore, f is convex.
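The positive-definiteness argument can be double-checked numerically (an added NumPy sketch): all eigenvalues of H are positive.

    import numpy as np

    H = np.array([[2.0, -1.0, 0.0],
                  [-1.0, 2.0, 0.0],
                  [0.0, 0.0, 2.0]])
    print(np.linalg.eigvalsh(H))   # [1. 2. 3.] -- all positive, so H is positive definite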

2°. Find the minimum value of f (x, y) = x2 − xy + y 2 − 4y.

Solution.
Using the Hessian of f, one can show that f is convex; therefore, any local minimum
of f is also its global minimum. Moreover, f attains its minimum at the point (x0, y0)
at which the partial derivatives vanish. We have

∂f/∂x = 0, ∂f/∂y = 0  ⇔  2x − y = 0, −x + 2y − 4 = 0  ⇔  x = 4/3, y = 8/3.

The minimum value of f is f(4/3, 8/3) = −16/3.

3°. Use gradient descent algorithm to find a local minimum value of the function f (x, y) =
x2 +3y 2 −2xy using (x0 , y0 ) = (0, 1) and learning rate γ = 0.1. Find the point (x2 , y2 ).

Solution.
Compute partial derivatives of f , we obtain fx (x, y) = 2x − 2y, fy (x, y) = 6y − 2x.
Use (5.1) we have

x1 = x0 − γfx (x0 , y0 ) = 0 − 0.1(−2) = 0.2, y1 = y0 − γfy (x0 , y0 ) = 1 − 0.1(6) = 0.4,

and then,

x2 = x1 − γfx(x1, y1) = 0.2 − 0.1(−0.4) = 0.24, y2 = y1 − γfy(x1, y1) = 0.4 − 0.1(2) = 0.2.

4°. Prove that the set C = {(x, y) | |x|^{1/3} + |y|^{1/3} ≤ 1} is not convex in R^2.

Solution.
Consider the two points (1, 0) and (0, 1) in C. For θ = 0.5, we have θ(1, 0) + (1 −
θ)(0, 1) = (0.5, 0.5) ∉ C since 0.5^{1/3} + 0.5^{1/3} > 1. Hence, C is not convex (see
Figure 5.2).
5°. Is the set {(x, y) | |x|^{3/2} + |y|^{3/2} ≤ 1} convex in R^2?

Solution.
The set is convex (see Figure 5.3): it is the unit ball of the norm
∥(x, y)∥_{3/2} = (|x|^{3/2} + |y|^{3/2})^{2/3}, and every norm ball is convex by the
triangle inequality and the homogeneity of the norm.


Fig. 5.2: Region {(x, y) | |x|^{1/3} + |y|^{1/3} ≤ 1}

Fig. 5.3: Region {(x, y) | |x|^{3/2} + |y|^{3/2} ≤ 1}

6°. Show that the intersection of two convex sets is also convex. Is the union of two
convex sets convex?

Solution.
Suppose A and B are two convex sets and x and y are two points in A ∩ B. For
any 0 ≤ θ ≤ 1, we have

θx + (1 − θ)y ∈ A since A is convex

and
θx + (1 − θ)y ∈ B since B is convex.
Therefore,
θx + (1 − θ)y ∈ A ∩ B.
It follows that A ∩ B is convex.
The union of two convex sets need not be convex. For example, the sets {x | x > 1} and
{x | x < 0} are convex, but {x | x > 1 or x < 0} is not convex.

7°. Find the Lagrangian and dual Lagrangian of the optimization problem

Minimize f (x, y) = y 2 − 6x subject to the constraint x2 − 2y ≤ 2.

Solution.
Let g(x, y) = x^2 − 2y − 2 ≤ 0. Then the Lagrangian is

L(x, y, λ) = f(x, y) + λg(x, y) = y^2 − 6x + λ(x^2 − 2y − 2).

Solving

Lx = −6 + 2λx = 0,  Ly = 2y − 2λ = 0

yields x0 = 3/λ, y0 = λ.
Hence, the dual Lagrangian is

D(λ) = min_{(x,y)∈R^2} L(x, y, λ) = L(x0, y0, λ) = −λ^2 − 2λ − 9/λ.

8°. Find the dual Lagrangian of the linear program

min_{(x,y)∈R^2} 2x − y subject to x − y ≤ 3 and 2x + y ≤ 11.

Solution.
This is a linear program of the form

min_x c^T x subject to Ax ≤ b,

where c = [2 −1]^T, A = [1 −1; 2 1], and b = [3 11]^T. The dual Lagrangian is
−λ^T b = −3λ1 − 11λ2.

5.6 Supplementary Problems

1°. Use gradient descent algorithm to find the local minimum value of the function
f (x, y) = x2 + y 2 − 4y using (x0 , y0 ) = (1, 1) and learning rate γ = 0.2. Find the
point (x3 , y3 ).

2°. Show that the function f (x, y, z) = x2 + y 2 − yz + 3z 2 is convex on R3 .

3°. Show that every norm on Rn is a convex function.

4°. Find the minimum value of f (x, y) = x2 − 2xy + 3y 2 − 5y.

5°. Find the minimum value of the function f (x, y, z) = x2 + y 2 − xy + z 2 subject to the
constraint x + y = 2z − 4.

6°. Use gradient descent to find the local minimum value of the function f (x, y) =
2x2 + y 2 − xy + 2x using (x0 , y0 ) = (1, 1) and learning rate γ = 0.1. Find the point
(x3 , y3 ).

7°. Find the point closest to the origin and on the line of intersection of the planes
2x + y + 3z = 9 and 3x + 2y + z = 6.

8°. Prove that the set {(x, y) | √|x| + √|y| ≤ 1} is not convex in R^2.

9°. Is the set {(x, y) | |x|^{3/2} + |y|^{3/2} ≤ 1} convex in R^2?

10°. Use the method of Lagrange multipliers to find critical points of the function f (x, y) =
x − 2y subject to the constraint xy = 1.

11°. Find the dual Lagrangian of the linear program


min_{(x,y,z)∈R^3} 2x − y + z subject to x + y ≤ 2 and 2x + z ≤ −1.

12°. Find the dual Lagrangian of the linear program


T
2 2 1 1
    

min −1 x subject to −1 1 x ≤ −1 .


     
x∈R3
1 3 2 3

13°. Find the Lagrangian and dual Lagrangian of the problem

min x2 − 4y subject to 6x − 2y 2 ≥ 3.
(x,y)∈R2
Chapter 6

MULTIPLE CHOICE QUESTIONS

6.1 Questions

1. Compute the distance between u = (2, 1) and v = (−1, 1) with respect to the inner
product defined by ⟨x, y⟩ := x^T Ay, where A = [4 2; 2 9].

(a) 3
(b) 6
(c) 9
(d) 36

2. Compute the norm of u = (2, 1, −1) using the inner product defined by ⟨x, y⟩ :=
x^T Ay, where A = [1 0 0; 0 2 1; 0 1 3].

(a) 2
(b) √5
(c) √7
(d) 7

3. Compute cos w, where w is the angle between u = (1, 0) and v = (0, 1) using the
inner product defined by ⟨x, y⟩ := x^T Ay, where A = [1 1; 1 3].

(a) 1/3
(b) 2/3
(c) √3/3
(d) √3

4. Which sets are orthogonal?

(i) {(1, −1); (√2/2, √2/2)}
(ii) {(1, 0, 2); (2, 0, 1), (0, 1, 0)}

(a) (i) only


(b) (ii) only
(c) Both (i) and (ii)
(d) Neither (i) nor (ii)

5. Let Q be an orthogonal matrix. Which statements are true?


(i) detQ = 1
(ii) Q preserves the angle between two vectors.


(a) (i) only


(b) (ii) only
(c) Both (i) and (ii)
(d) Neither (i) nor (ii)

6. Find the orthogonal projection of u = (2, 2, 1) on the subspace spanned by b =


(1, 0, −1).

(a) ( 12 , 0, −1
2 )
(b) ( 12 , 0, 21 )
(c) (2, 0, 2)
(d) (2, 0, −2)

7. Find the (2, 1)-entry of projection matrix on the space spanned by b = (1, 0, −1).

(a) 0
(b) −1
2
(c) 1
2
(d) 1

8. Find all eigenvalues of the matrix [2 3 −1; 4 1 −1; 0 3 1].

(a) 1, 2, 3
(b) −2, 2, 4
(c) 1, 2
(d) 1, 3

9. Find the determinant of the matrix [1 2 0 0; 3 4 0 0; 5 6 9 10; 7 8 11 12].

(a) −4
(b) −2
(c) 2
(d) 4

10. Given the characteristic polynomial of a matrix A.

pA (x) = (2 − x)(−1 − x)(3 − x)

Find det(A) - tr(A).

(a) −10
(b) −6

(c) −2
(d) 4

11. Find the dimension of the eigenspace corresponding to the eigenvalue 1 of the matrix
[−1 −1 1; −2 0 1; −2 −1 2].

(a) 0
(b) 1
(c) 2
(d) 3

12. Given an SVD of A,

A = [0 1; 1 0] [2 0 0; 0 1 0] [0 1 0; 1 0 0; 0 0 1].

Let B be the best rank-1 approximation of A and ∥.∥2 be the spectral norm of a
matrix. Compute ∥A − B∥2.

(a) 0
(b) 1
(c) √2
(d) 2

13. Given an SVD of A,

A = [0 1; 1 0] [2 0 0; 0 1 0] [0 1 0; 1 0 0; 0 0 1].

Let B be the best rank-1 approximation of A. Find the (2, 2)-entry of B.

(a) 0
(b) 1/2
(c) 1
(d) 2
" #
1 2 0
14. Let σ1 > σ2 be two largest singular values of the matrix . Find σ12 − σ22 .
0 1 2

(a) 4
(b) 5
(c) 7
(d) 10
" #
1 0
15. Select a matrix P such that P −1 P is a diagonal matrix.
2 3
40 MULTIPLE CHOICE QUESTIONS

" #
0 1
(a) P =
1 −1
" #
0 1
(b) P =
1 1
" #
1 1
(c) P =
0 1
" #
1 1
(d) P =
0 −1

16. Let H be the Hessian of f (x, y, z) = x2 + 3yz 2 − 4x at the point (1, 2, 0). Find tr(H).

(a) 2
(b) 6
(c) 8
(d) 14

17. Given f (u, v) = u2 + 2uv and u(s) = 1/s, v(s) = s. Find the derivative of f with
respect to s at s = 4.

(a) −5
32
(b) −5
16
(c) 5
32
(d) 3
16
" #
1 2
18. Find the gradient of f (x) = xT x at x = [1 0]T .
2 1

(a) [1 2]
(b) [2 1]
(c) [2 4]
(d) [4 2]

19. Find the coefficient of the term (x − 1)^2 y in the Taylor series of f(x, y) = x^2/(1 + y)
at (1, 0).

(a) −6
(b) −3
(c) −2
(d) −1

20. Let T1 (x, y) be the Taylor polynomial of degree 1 of the function f (x, y) = x2 + y 2 +
2xy at the point (2, −1). Find T1 (3, 0).

(a) 3
(b) 4
(c) 5

(d) 6

∂f
21. Given f (x, y) = x2 − xy and x(s, t) = s/t, y(s, t) = s − t. Find ∂s at s = 1, t = 2.

(a) 1/2
(b) 1
(c) 3/2
(d) 2

22. Given the joint probability of X, Y . Compute P (X = 1|Y = 2).

(a) 1/6
(b) 1/4
(c) 1/3
(d) 1/2

23. Given the joint probability of X, Y . Compute Cov[X, Y ].

(a) 0.12
(b) 0.2
(c) 10.32
(d) 0.6

24. Given the joint pdf of two continuous random variables X, Y .

f(x, y) = k(x^2 + 2y) for 0 ≤ x ≤ 1, 0 ≤ y ≤ 1, and f(x, y) = 0 otherwise.

Find the value of the constant k.

(a) 1/3
(b) 1/2
(c) 3/4
(d) 4/3

25. Given the joint pdf of two continuous random variables X, Y .

f(x, y) = (3/4)(x^2 + 2y) for 0 ≤ x ≤ 1, 0 ≤ y ≤ 1, and f(x, y) = 0 otherwise.

Find E(X).

(a) 1/4
(b) 9/16
(c) 3/4
(d) 1

26. Given the joint pdf of two continuous random variables X, Y .

f(x, y) = (3/4)(x^2 + 2y) for 0 ≤ x ≤ 1, 0 ≤ y ≤ 1, and f(x, y) = 0 otherwise.

Find P(X < 1/2 | Y = 1/2).

(a) 13/32
(b) 13/27
(c) 13/24
(d) 3/4
" #
1 −1
27. Given bivariate Gaussian p(x, y) = N (µ, Σ) where µ = [2 3]T , Σ =
−1 4
Find corr[X, Y ].

(a) −0.5
(b) −0.25
(c) 0.25
(d) 0.5
" #
1 −1
28. Given bivariate Gaussian p(x, y) = N (µ, Σ) where µ = [2 3]T , Σ =
−1 4
Consider Z = 3X − Y . Find V (Z).

(a) 1
(b) 7
(c) 13
(d) 19
" #
1 −1
29. Given bivariate Gaussian p(x, y) = N (µ, Σ) where µ = [2 3]T , Σ =
−1 4
Find E(X | Y = 1).

(a) 0

(b) 1.5
(c) 2
(d) 2.5

30. Which sets are convex in R2 ?

A = {(x, y) | − 1 < x < 1}, B = {(x, y) | − 1 < x + y < 1}.

(a) A only
(b) B only
(c) Both A and B
(d) Neither A nor B

31. Which sets are convex in R2 ?


( )
n o x2 y 2
A = (x, y) | 1 < x + y < 2 ,
2 2
B= (x, y) + =1 .
4 9

(a) A only
(b) B only
(c) Both A and B
(d) Neither A nor B

32. Which functions are convex on R2 ?

f (x, y) = x2 + y 2 − 3xy, g(x, y) = 3x2 + y 2 − 3xy.

(a) f only
(b) g only
(c) Both f and g
(d) Neither f nor g

33. Which functions are convex on Rn ?

f (x) = cT x, g(x) = xT Qx,

where c, x ∈ Rn , Q is a positive definite matrix.

(a) f only
(b) g only
(c) Both f and g
(d) Neither f nor g

34. Find the minimum value of f (x, y) = x2 + y 2 − 2x + 6y.

(a) −12
(b) −10
(c) −9

(d) −8

35. Find the minimum value of f (x, y) = x2 + 2y 2 − 2x subject to x2 + y 2 = 4.

(a) 0
(b) 2
(c) 3
(d) 4

36. Consider the linear program

−x + y
(
≤ −3,
min −2x + 3y − z subject to
(x,y,z)∈R3 27 + z ≤ 4.

Find the dual Lagrangian.

(a) 3λ1 − 4λ2


(b) −3λ1 + 4λ2
(c) 4λ1 − 3λ2
(d) −4λ1 + 3λ2

37. Consider the optimization problem

min x2 − 4y subject to y 2 − 2x ≤ 3.
(x,y)∈R2

Find the dual Lagrangian.

(a) λ2 − 4
λ
(b) −λ2 − 4
λ
(c) λ2 + 4
λ
(d) −λ2 + 4
λ

38. Let ∥.∥ be the norm induced by an inner product ⟨., .⟩. Compute ⟨u, v⟩ if ∥u + v∥ = 8
and ∥u − v∥ = 6.

(a) 2
(b) 7
(c) 16
(d) 24

39. Suppose {u, v, w} is an orthonormal set in a vector space with an inner product ⟨., .⟩.
Compute ⟨u − v + w, v − 3w⟩.

(a) −4
(b) −3
(c) 2
(d) 4

40. Consider the mapping f(x) = Ax + x, where A is an n × n real matrix. Find the
gradient of f with respect to x.

(a) A
(b) AT
(c) A + I
(d) AT + I

41. Let L(x, y) be the linear approximation of the function f (x, y) = x3 + y 2 − 3x at


(1, −1). Find L(1.2, −0.8).

(a) −0.8
(b) −0.4
(c) 1.2
(d) 3.6

6.2 Answer key

1b 2c 3c 4c 5b 6a 7a 8b 9d 10a 11c 12b 13d 14a 15a 16d 17a 18c 19d 20c 21a 22c 23a 24c
25b 26a 27a 28d 29d 30c 31d 32b 33c 34b 35a 36a 37b 38b 39a 40d 41b
Bibliography
[1] Marc Peter Deisenroth, A. Aldo Faisal, and Cheng Soon Ong, Mathematics for
Machine Learning, Cambridge University Press, 2020.

[2] Prasanna Sahoo, Probability and Mathematical Statistics, University of Louisville,
Louisville, KY, 2013.

[3] Stephen Boyd, and Lieven Vandenberghe, Convex Optimization, Cambridge Uni-
versity Press, 2004.
