
CSC 311: Introduction to Machine Learning

Tutorial - Matrix Decomposition & Probabilistic Models

TA: Vahid Balazadeh


Instructors: Michael Zhang and Chandra Gummaluru

University of Toronto

Based on slides by Haonan Duan


Matrix Decomposition

We can decompose an integer into its prime factors, e.g.,


12 = 2 × 2 × 3.
Similarly, matrices can be decomposed into products of other matrices.
Examples include eigendecomposition, SVD, Schur decomposition, LU decomposition, and so on.
Here, we focus on eigendecomposition and SVD.



Eigenvector

An eigenvector of a square matrix A is a nonzero vector v such


that multiplication by A only changes the scale of v:

Av = λv

The scalar λ is known as the eigenvalue.


If v is an eigenvector of A, so is any rescaled vector αv (α ≠ 0), and αv has the
same eigenvalue as v. Thus, we constrain eigenvectors to have unit length.



Compute eigenvalues - characteristic polynomial

Eigenvalue equation of matrix A:

Av = λv
λv − Av = 0
(λI − A)v = 0

If a nonzero solution for v exists, then it must be the case that:

det(λI − A) = 0

Unpacking the determinant as a function of λ, we get a polynomial, called the
characteristic polynomial:

P_A(λ) = det(λI − A) = λ^n + c_{n−1} λ^{n−1} + · · · + c_1 λ + c_0

To compute the eigenvalues of A, solve P_A(λ) = 0.
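As a quick aside (not from the slides), NumPy can recover the characteristic-polynomial coefficients and the eigenvalues directly; the 2×2 matrix below is just an arbitrary example:

```python
import numpy as np

# Arbitrary example matrix (assumed for illustration, not from the tutorial)
A = np.array([[4.0, 1.0],
              [2.0, 3.0]])

# Coefficients of the characteristic polynomial det(lambda*I - A): lambda^2 - 7*lambda + 10
coeffs = np.poly(A)
print(coeffs)                 # [  1.  -7.  10.]

# Eigenvalues are the roots of the characteristic polynomial
print(np.roots(coeffs))       # [5. 2.]
print(np.linalg.eigvals(A))   # same values, computed directly
```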



Exercise

Consider the matrix:


 
A = \begin{pmatrix} 2 & 1 \\ 1 & 2 \end{pmatrix}

What are the eigenvalues and eigenvectors of A?



Solution

We first need to calculate the eigenvalues:

det(A − λI) = 0 ⟹ det \begin{pmatrix} 2−λ & 1 \\ 1 & 2−λ \end{pmatrix} = 0
⟹ (2 − λ)² − 1 = 0 ⟹ λ₁ = 3, λ₂ = 1

Then, we solve (A − λᵢ I)vᵢ = 0 to find the eigenvectors:

(A − λ₁ I)v₁ = 0 ⟹ \begin{pmatrix} −1 & 1 \\ 1 & −1 \end{pmatrix} v₁ = 0
⟹ v₁ = (1, 1)⊤, which after normalization gives v₁ = (1/√2) (1, 1)⊤

Similarly,

(A − λ₂ I)v₂ = 0 ⟹ v₂ = (1/√2) (−1, 1)⊤
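As a sanity check (not part of the original slides), the same eigenpairs can be obtained with NumPy:

```python
import numpy as np

# Verify the worked example numerically
A = np.array([[2.0, 1.0],
              [1.0, 2.0]])

eigenvalues, eigenvectors = np.linalg.eigh(A)   # eigh is for symmetric matrices
print(eigenvalues)    # [1. 3.]
print(eigenvectors)   # columns are unit-length eigenvectors:
                      # (1/sqrt(2))(-1, 1) for eigenvalue 1 and (1/sqrt(2))(1, 1) for eigenvalue 3, up to sign
```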




Eigendecomposition
Spectral Theorem - Every symmetric matrix A ∈ Rn×n has a
set of n orthonormal eigenvectors forming a basis. Furthermore,
all eigenvalues are real.
Therefore, A can be decomposed into the following form:

A = P D P⁻¹

where P is an orthogonal matrix whose columns are the eigenvectors of A (so P⁻¹ = P⊤),
and D is a diagonal matrix of the corresponding eigenvalues.
A [v₁, …, vₙ] = [Av₁, …, Avₙ] = [λ₁v₁, …, λₙvₙ]
             = [v₁, …, vₙ] \begin{pmatrix} λ₁ & ⋯ & 0 \\ ⋮ & ⋱ & ⋮ \\ 0 & ⋯ & λₙ \end{pmatrix}

That is, AP = PD with P = [v₁, …, vₙ], which gives A = P D P⁻¹.
Intuitions of Eigendecomposition

Diagonal matrices allow fast computation of determinants, powers, and inverses.
Eigendecomposition transforms a matrix into a diagonal form by changing the basis.

det(A) = det(P D P⁻¹) = det(P) det(D) det(P)⁻¹ = det(D) = ∏_{i=1}^{n} λᵢ

A⁻¹ = P D⁻¹ P⁻¹
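A small numerical illustration of both identities (the matrix below is an arbitrary example, not from the slides):

```python
import numpy as np

# Arbitrary symmetric matrix, assumed for illustration
A = np.array([[4.0, 1.0],
              [1.0, 3.0]])

eigvals, P = np.linalg.eigh(A)        # for symmetric A: A = P D P^T with P orthogonal
D_inv = np.diag(1.0 / eigvals)

print(np.isclose(np.linalg.det(A), np.prod(eigvals)))      # det(A) = product of eigenvalues
print(np.allclose(P @ D_inv @ P.T, np.linalg.inv(A)))      # A^{-1} = P D^{-1} P^{-1}
```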



Geometric intuitions of eigendecomposition

Top-left to bottom-left: P⁻¹ performs a basis change.
Bottom-left to bottom-right: D performs a scaling.
Bottom-right to top-right: P undoes the basis change.



Singular Value Decomposition (SVD)

If A ∈ ℝ^{m×n} is not square, the eigendecomposition is undefined.
SVD is a decomposition of the form A = U Σ V⊤.
SVD is more general than eigendecomposition: every real matrix has an SVD.



SVD - Terminology

U and V are orthogonal matrices, and Σ is a diagonal matrix (not necessarily square).
The diagonal entries of Σ are called the singular values of A.
Columns of U are the left singular vectors, and columns of V are the right singular vectors.



SVD and eigendecomposition

SVD can be interpreted in terms of eigendecomposition.
The left singular vectors of A are the eigenvectors of AA⊤.
The right singular vectors of A are the eigenvectors of A⊤A.
The nonzero singular values of A are the square roots of the eigenvalues of A⊤A and
AA⊤. A⊤A and AA⊤ are positive semi-definite (PSD), so their eigenvalues are nonnegative.



Informal Proof

Since B = AA⊤ ∈ ℝ^{m×m} is symmetric, the eigendecomposition holds:

B = P D P⁻¹

Now, assume SVD exists, i.e., A = U ΣV ⊤ . Therefore,

B = AA⊤ = (U ΣV ⊤ )(V Σ⊤ U ⊤ ) = U ΣΣ⊤ U ⊤

Matching those two:

P D P⁻¹ = U ΣΣ⊤ U⊤

Therefore, U = P and Σ ≡ D^{1/2}, i.e., σᵢ = √dᵢ.

A similar approach on C = A⊤ A ∈ Rn×n leads to V .
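A rough numerical check of this relationship (the random matrix below is an assumed example):

```python
import numpy as np

# Assumed example: a random 4x3 matrix
rng = np.random.default_rng(0)
A = rng.normal(size=(4, 3))

U, S, Vt = np.linalg.svd(A)
eigvals_B, eigvecs_B = np.linalg.eigh(A @ A.T)   # B = A A^T

# Nonzero singular values are the square roots of the largest eigenvalues of A A^T
print(np.allclose(np.sort(S ** 2), np.sort(eigvals_B)[-3:]))   # True
```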



Exercise

Compute SVD of the matrix:


 
A = \begin{pmatrix} 3 & 2 & 2 \\ 2 & 3 & −2 \end{pmatrix}



Solution
Here, we calculate U and Σ. First, define B = AA⊤:

B = \begin{pmatrix} 3 & 2 & 2 \\ 2 & 3 & −2 \end{pmatrix} \begin{pmatrix} 3 & 2 \\ 2 & 3 \\ 2 & −2 \end{pmatrix} = \begin{pmatrix} 17 & 8 \\ 8 & 17 \end{pmatrix}

Then, we can calculate the eigenvalues and eigenvectors (using the characteristic
polynomial): λ₁ = 25, λ₂ = 9, with v₁ = (1/√2)(1, 1)⊤ and v₂ = (1/√2)(1, −1)⊤.
Therefore, B = P D P⁻¹ where

P = (1/√2) \begin{pmatrix} 1 & 1 \\ 1 & −1 \end{pmatrix},   D = \begin{pmatrix} 25 & 0 \\ 0 & 9 \end{pmatrix}

We had U = P and Σ ≡ D^{1/2}, padded with a zero column to match the shape of A:

Σ = \begin{pmatrix} 5 & 0 & 0 \\ 0 & 3 & 0 \end{pmatrix}

Finding V is left as an exercise.
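As a cross-check (not part of the slides), NumPy's SVD reproduces these values:

```python
import numpy as np

# Cross-check the worked example
A = np.array([[3.0, 2.0, 2.0],
              [2.0, 3.0, -2.0]])

U, S, Vt = np.linalg.svd(A)
print(S)    # [5. 3.]
print(U)    # columns match (1/sqrt(2))(1, 1) and (1/sqrt(2))(1, -1), up to sign
print(Vt)   # rows are the right singular vectors, i.e. V^T (the part left as an exercise)
```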


Rank-r approximation

Given a matrix A, SVD allows us to find its “best” rank-r approximation Aᵣ (r < n).
Why? To store fewer parameters.
We can write A = U Σ V⊤ as A = ∑_{i=1}^{n} σᵢ uᵢ vᵢ⊤, where the σᵢ are sorted from
the largest to the smallest.



Rank-r approximation

The rank-r approximation Aᵣ is defined as:

Aᵣ = ∑_{i=1}^{r} σᵢ uᵢ vᵢ⊤

Aᵣ is the best rank-r approximation under many norms, such as the spectral norm

∥A∥₂ := sup_{x ≠ 0} ∥Ax∥₂ / ∥x∥₂

That is, ∥A − Aᵣ∥₂ ≤ ∥A − B∥₂ for any rank-r matrix B.
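A minimal sketch of computing Aᵣ via a truncated SVD in NumPy (random data, assumed for illustration); it also checks the known fact that the spectral-norm error equals the (r+1)-th singular value:

```python
import numpy as np

# Random matrix, assumed for illustration
rng = np.random.default_rng(0)
A = rng.normal(size=(20, 10))
r = 3

U, S, Vt = np.linalg.svd(A, full_matrices=False)
A_r = U[:, :r] @ np.diag(S[:r]) @ Vt[:r, :]   # A_r = sum of the top-r terms sigma_i u_i v_i^T

# Spectral-norm error of the best rank-r approximation is the (r+1)-th singular value
print(np.linalg.norm(A - A_r, ord=2), S[r])   # the two numbers agree
```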





Maximum Likelihood Estimation (MLE)

Goal: estimate parameters θ from observed data {x1 , · · · , xN }


Main idea: We should choose parameters that assign high
probability to the observed data:

θ̂ = argmax_θ L(θ; x1, · · · , xN)



Three steps for computing MLE

1. Write down the likelihood objective:

   L(θ; x1, · · · , xN) = ∏_{i=1}^{N} L(θ; xi)

2. Transform to the log likelihood:

   l(θ; x1, · · · , xN) = ∑_{i=1}^{N} log L(θ; xi)

3. Compute the critical point:

   ∂l/∂θ = 0
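As an illustration of the recipe (not from the slides), here is a minimal numerical sketch for an assumed exponential model p(x | θ) = θ e^{−θx}, whose closed-form MLE is 1/x̄; SciPy is assumed to be available:

```python
import numpy as np
from scipy.optimize import minimize_scalar

# Assumed example: exponential model p(x | theta) = theta * exp(-theta * x)
rng = np.random.default_rng(0)
data = rng.exponential(scale=2.0, size=1000)     # true theta = 0.5

def neg_log_likelihood(theta):
    # Steps 1-2: likelihood of i.i.d. data, written directly in log space
    return -np.sum(np.log(theta) - theta * data)

# Step 3: find the critical point, here by numerical minimization
result = minimize_scalar(neg_log_likelihood, bounds=(1e-6, 10.0), method="bounded")
print(result.x, 1.0 / data.mean())               # both are close to 0.5
```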



Example - categorical distribution

X is a discrete random variable with the following probability mass function
(0 ≤ θ ≤ 1 is an unknown parameter):

X        0       1      2            3
P(X)     2θ/3    θ/3    2(1 − θ)/3   (1 − θ)/3

The following 10 independent observations were taken from X:


{3, 0, 2, 1, 3, 2, 1, 0, 2, 1}.
What is the MLE for θ?



Step 1: Likelihood objective

L(θ) = P(X = 3) P(X = 0) P(X = 2) P(X = 1) P(X = 3)
       × P(X = 2) P(X = 1) P(X = 0) P(X = 2) P(X = 1)
     = (2θ/3)² (θ/3)³ (2(1 − θ)/3)³ ((1 − θ)/3)²



Step 2: Log likelihood

l(θ) = log L(θ)
     = 2(log(2/3) + log θ) + 3(log(1/3) + log θ)
       + 3(log(2/3) + log(1 − θ)) + 2(log(1/3) + log(1 − θ))
     = C + 5(log θ + log(1 − θ))



Step 3: critical points

∂l/∂θ = 0
⟹ 5(1/θ − 1/(1 − θ)) = 0
⟹ θ̂ = 0.5
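A quick numerical check (not from the slides): evaluate l(θ) = C + 5(log θ + log(1 − θ)) on a grid, dropping the constant C since it does not affect the argmax:

```python
import numpy as np

# Grid search over theta for l(theta) = 5 * (log(theta) + log(1 - theta)), constant dropped
thetas = np.linspace(0.01, 0.99, 9801)
log_lik = 5 * (np.log(thetas) + np.log(1 - thetas))
print(thetas[np.argmax(log_lik)])   # 0.5
```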



Exercise

Suppose that X1, · · · , Xn form a random sample from a uniform distribution on the
interval (0, θ), where the parameter θ > 0 is unknown. Find the MLE of θ.



Solution

- Calculate the likelihood:

L(X1, . . . , Xn; θ) = ∏ᵢ P_θ(Xᵢ) = ∏ᵢ I(Xᵢ ∈ (0, θ)) / θ

- Calculate the log-likelihood:

l(θ) = log ∏ᵢ P_θ(Xᵢ) = ∑ᵢ log ( I(Xᵢ ∈ (0, θ)) / θ )

If Xᵢ ∉ (0, θ), then log 0 is undefined. Therefore, θ ∈ [maxᵢ Xᵢ, ∞).

- What value of θ maximizes l(θ)?
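One way to build intuition is to evaluate the likelihood for simulated data (an assumed example, not from the slides); it is zero below maxᵢ Xᵢ and proportional to θ⁻ⁿ above it:

```python
import numpy as np

# Assumed simulated data from Uniform(0, 3); the estimator does not know theta = 3
rng = np.random.default_rng(0)
X = rng.uniform(0.0, 3.0, size=50)

thetas = np.linspace(0.01, 5.0, 5000)
likelihood = np.where(thetas >= X.max(), thetas ** (-len(X)), 0.0)
print(thetas[np.argmax(likelihood)], X.max())   # the maximizer sits right at max(X_i)
```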



Bayesian Inference - Philosophy

Bayesians interpret probability as a degree of belief.
Bayesians treat parameters as random variables.
Bayesian learning means updating our beliefs (probability distributions) based on
observations.



Bayesian versus Frequentist

MLE is the standard frequentist inference method.
The Bayesian and frequentist perspectives are the two main approaches in statistical
machine learning. Some of their ideological differences can be summarized as:

                     Frequentist          Bayesian
  Probability is     relative frequency   degree of belief
  Parameter θ is     unknown constant     random variable

Han Liu and Larry Wasserman, Statistical Machine Learning, 2014


The Bayesian approach to machine learning

1. We define a model that expresses qualitative aspects of our knowledge (e.g., forms of
   distributions, independence assumptions). The model will have some unknown parameters.
2. We specify a prior probability distribution for these unknown parameters that expresses
   our beliefs about which values are more or less likely, before seeing the data.
3. We gather data.
4. We compute the posterior probability distribution for the parameters, given the
   observed data.
5. We use this posterior distribution to draw scientific conclusions and make predictions.

Radford M. Neal, Bayesian Methods for Machine Learning, NIPS 2004 tutorial
Computing the posterior

The posterior distribution is computed by the Bayes’ rule:

P(parameter | data) = P(parameter) P(data | parameter) / P(data)

The denominator is just the required normalizing constant, so up to proportionality
we can write:

posterior ∝ prior × likelihood
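A minimal grid-based sketch of this rule for an assumed coin-flip model with a Beta(2, 2) prior and 7 heads in 10 flips (example values, not from the slides; SciPy assumed available):

```python
import numpy as np
from scipy.stats import beta, binom

thetas = np.linspace(0.001, 0.999, 999)
prior = beta.pdf(thetas, 2, 2)                   # prior over theta
likelihood = binom.pmf(7, 10, thetas)            # probability of the observed 7 heads in 10 flips

unnormalized_posterior = prior * likelihood      # posterior up to the normalizing constant P(data)
print(thetas[np.argmax(unnormalized_posterior)]) # ~0.667, the mode of the conjugate Beta(9, 5) posterior
```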



Exercise

Suppose you have a Beta(4, 4) prior distribution on the probability


θ that a coin will yield a ‘head’ when spun in a specified manner.
The coin is independently spun ten times, and ‘heads’ appear
fewer than 3 times. You are not told how many heads were seen,
only that the number is less than 3.
Calculate your exact posterior density (up to a proportionality
constant) for θ and sketch it.

