
Machine learning: lecture 14

Tommi S. Jaakkola
MIT CSAIL
[email protected]

Topics

• Gaussian mixtures and the EM-algorithm
  – complete, incomplete, and inferred data
  – EM for mixtures
  – demo
  – EM and convergence
  – regularized mixtures
  – selecting the number of mixture components
  – Gaussian mixtures for classification

Review: mixture densities

• A Gaussian mixture model with m components is defined as

      p(x|θ) = \sum_{j=1}^m p_j p(x|µ_j, Σ_j)

  where θ = {p_1, . . . , p_m, µ_1, . . . , µ_m, Σ_1, . . . , Σ_m} contains all the
  parameters of the mixture model.
• We have to estimate these models from incomplete data involving only the x
  samples; the assignment to components has to be inferred.

Types of data: complete

      p(x|θ) = \sum_{j=1}^m p_j p(x|µ_j, Σ_j)

• When the available data is complete, each sample contains the setting of all
  the variables in the model:

      x      y
      x_1    0 1 . . . 0
      x_2    0 0 . . . 1
      ...
      x_n    0 1 . . . 0

  The parameter estimation problem is in this case straightforward (each
  component Gaussian can be estimated separately).
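As a concrete illustration of the mixture density above, here is a minimal numerical sketch (not from the lecture; the two-component parameter values are made up) that evaluates p(x|θ) at a point.

```python
import numpy as np
from scipy.stats import multivariate_normal

def mixture_density(x, mix_props, means, covs):
    """Evaluate p(x|theta) = sum_j p_j N(x | mu_j, Sigma_j)."""
    return sum(p_j * multivariate_normal.pdf(x, mean=mu_j, cov=S_j)
               for p_j, mu_j, S_j in zip(mix_props, means, covs))

# Illustrative (made-up) two-component mixture in 2D.
mix_props = [0.3, 0.7]
means = [np.array([0.0, 0.0]), np.array([3.0, 3.0])]
covs = [np.eye(2), 2.0 * np.eye(2)]

print(mixture_density(np.array([1.0, 1.0]), mix_props, means, covs))
```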

Types of data: incomplete

      p(x|θ) = \sum_{j=1}^m p_j p(x|µ_j, Σ_j)

• Incomplete data for a mixture model typically contain only the x samples:

      x
      x_1
      x_2
      ...
      x_n

  To estimate the parameters we have to infer which component Gaussian was
  responsible for generating each sample x_i.

Types of data: inferred

      p(x|θ) = \sum_{j=1}^m p_j p(x|µ_j, Σ_j)

• We can infer the values for the missing data based on the current setting of
  the parameters:

      x      y
      x_1    P(y = 1|x_1, θ)   P(y = 2|x_1, θ)   . . .   P(y = m|x_1, θ)
      x_2    P(y = 1|x_2, θ)   P(y = 2|x_2, θ)   . . .   P(y = m|x_2, θ)
      ...
      x_n    P(y = 1|x_n, θ)   P(y = 2|x_n, θ)   . . .   P(y = m|x_n, θ)

  The parameter estimation problem is again easy if we treat the inferred data
  as complete data. The solution has to be iterative, however.
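The inferred table above is just Bayes' rule applied per sample; below is a minimal sketch (the helper name is my own, not from the lecture) that fills in P(y = j|x_i, θ) for every sample.

```python
import numpy as np
from scipy.stats import multivariate_normal

def posterior_assignments(X, mix_props, means, covs):
    """Return an (n, m) table with entries P(y = j | x_i, theta)."""
    # Joint weights p_j * p(x_i | mu_j, Sigma_j) for each sample and component.
    joint = np.column_stack([
        p_j * multivariate_normal.pdf(X, mean=mu_j, cov=S_j)
        for p_j, mu_j, S_j in zip(mix_props, means, covs)
    ])
    # Normalize each row so the responsibilities sum to one.
    return joint / joint.sum(axis=1, keepdims=True)
```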


The EM-algorithm

Step 0: specify the initial setting of the parameters θ = θ(0) of the mixture

      p(x|θ) = \sum_{j=1}^m p_j p(x|µ_j, Σ_j)

For example, we could
– set each µ_j to an x sampled at random from the training set
– set each Σ_j to be the sample covariance of the whole data
– set the mixing proportions p_j to be uniform, p_j = 1/m.
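A minimal sketch of the Step 0 initialization just described (random means from the training set, shared sample covariance, uniform mixing proportions); the function name and random-number handling are my own.

```python
import numpy as np

def initialize_parameters(X, m, rng=None):
    """Step 0: random means from the data, shared sample covariance, uniform p_j."""
    rng = np.random.default_rng() if rng is None else rng
    n, d = X.shape
    means = X[rng.choice(n, size=m, replace=False)]   # each mu_j = a random training point
    covs = [np.cov(X, rowvar=False)] * m              # each Sigma_j = sample covariance of all data
    mix_props = np.full(m, 1.0 / m)                   # p_j = 1/m
    return mix_props, means, covs
```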

The EM-algorithm (continued)

E-step: complete the incomplete data with the posterior probabilities

      P(y = j|x_i, θ(k)),   j = 1, . . . , m,   i = 1, . . . , n

M-step: find the new setting of the parameters θ(k+1) by maximizing the
log-likelihood of the completed (inferred) data,

      θ(k+1) = argmax_θ \sum_{i=1}^n \sum_{j=1}^m P(y = j|x_i, θ(k)) log [ p_j p(x_i|µ_j, Σ_j) ]

where p_j p(x_i|µ_j, Σ_j) = P(x_i, y = j|θ).

Demo

[Demo of EM on a Gaussian mixture in the original lecture.]
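Below is a compact sketch of one full EM iteration following the E-step and M-step above. The function name is mine, and the closed-form updates used in the M-step (weighted proportions, means, and covariances) are the standard maximizers of the completed-data log-likelihood rather than code taken from the slides.

```python
import numpy as np
from scipy.stats import multivariate_normal

def em_step(X, mix_props, means, covs):
    """One EM iteration: E-step responsibilities, then closed-form M-step updates."""
    n, d = X.shape
    m = len(mix_props)

    # E-step: q[i, j] = P(y = j | x_i, theta^(k)).
    joint = np.column_stack([
        p_j * multivariate_normal.pdf(X, mean=mu_j, cov=S_j)
        for p_j, mu_j, S_j in zip(mix_props, means, covs)
    ])
    q = joint / joint.sum(axis=1, keepdims=True)

    # M-step: weighted maximum-likelihood estimates on the completed data.
    n_hat = q.sum(axis=0)                          # effective counts n_hat_j
    new_props = n_hat / n                          # p_j
    new_means = (q.T @ X) / n_hat[:, None]         # mu_j
    new_covs = []
    for j in range(m):
        diff = X - new_means[j]
        new_covs.append((q[:, j, None] * diff).T @ diff / n_hat[j])  # Sigma_j
    return new_props, new_means, new_covs
```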

Topics (recap; next: EM and convergence)

EM-algorithm: convergence

      p(x|θ) = \sum_{j=1}^m p_j p(x|µ_j, Σ_j)

• The EM-algorithm monotonically increases the log-likelihood of the training
  data. In other words,

      l(θ(0)) < l(θ(1)) < l(θ(2)) < . . .   until convergence,

  where l(θ(k)) = \sum_{i=1}^n log p(x_i|θ(k)).

  [Figure: training-data log-likelihood (roughly −500 up to 200) plotted
  against the EM iteration (0 to 35); the curve increases monotonically and
  flattens out at convergence.]
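To observe the monotone increase numerically, one can track l(θ(k)) across iterations. A minimal sketch follows; it assumes the initialize_parameters and em_step helpers sketched earlier and some (n, d) data matrix X, so the usage part is left as comments.

```python
import numpy as np
from scipy.stats import multivariate_normal

def log_likelihood(X, mix_props, means, covs):
    """l(theta) = sum_i log p(x_i | theta) for a Gaussian mixture."""
    joint = np.column_stack([
        p_j * multivariate_normal.pdf(X, mean=mu_j, cov=S_j)
        for p_j, mu_j, S_j in zip(mix_props, means, covs)
    ])
    return np.log(joint.sum(axis=1)).sum()

# Assuming the earlier sketches and a data matrix X:
# params = initialize_parameters(X, m=3)
# history = [log_likelihood(X, *params)]
# for _ in range(30):
#     params = em_step(X, *params)
#     history.append(log_likelihood(X, *params))
# assert all(b >= a - 1e-9 for a, b in zip(history, history[1:]))
```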


EM-algorithm: auxiliary objective

• We first introduce possible posterior assignments {Q(j|i)} and the
  corresponding auxiliary likelihood objective:

      l(θ(k)) = \sum_{i=1}^n log p(x_i|θ(k))
              = \sum_{i=1}^n log \sum_{j=1}^m p_j(k) p(x_i|µ_j(k), Σ_j(k))
              = \sum_{i=1}^n log \sum_{j=1}^m Q(j|i) [ p_j(k) p(x_i|µ_j(k), Σ_j(k)) / Q(j|i) ]
              ≥ \sum_{i=1}^n \sum_{j=1}^m Q(j|i) log [ p_j(k) p(x_i|µ_j(k), Σ_j(k)) / Q(j|i) ]
              = l(Q; θ(k))

  where the inequality is Jensen's inequality applied to the concave log.

• The auxiliary objective

      l(Q; θ(k)) = \sum_{i=1}^n \sum_{j=1}^m Q(j|i) log [ p_j(k) p(x_i|µ_j(k), Σ_j(k)) / Q(j|i) ]  ≤  l(θ(k))

  recovers the log-likelihood of the data at the correct posterior assignments.
  In other words,

      max_Q l(Q; θ(k)) = l(Q(k); θ(k)) = l(θ(k))

  where Q(k)(j|i) = P(y = j|x_i, θ(k)) are the posterior assignments
  corresponding to the parameters θ(k).
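A quick numerical sanity check of the bound (my own sketch, with made-up data and parameters): for arbitrary assignments Q the auxiliary objective stays below l(θ), and at Q(j|i) = P(y = j|x_i, θ) the two coincide.

```python
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(0)

# Made-up data and a made-up 2-component mixture in 2D.
X = rng.normal(size=(50, 2))
mix_props = [0.4, 0.6]
means = [np.zeros(2), np.ones(2)]
covs = [np.eye(2), 2.0 * np.eye(2)]

joint = np.column_stack([
    p_j * multivariate_normal.pdf(X, mean=mu_j, cov=S_j)
    for p_j, mu_j, S_j in zip(mix_props, means, covs)
])                                           # joint[i, j] = p_j p(x_i | mu_j, Sigma_j)

log_lik = np.log(joint.sum(axis=1)).sum()    # l(theta)

def auxiliary(Q):
    """l(Q; theta) = sum_ij Q(j|i) log( joint[i, j] / Q(j|i) )."""
    return (Q * np.log(joint / Q)).sum()

Q_random = rng.dirichlet(np.ones(2), size=50)            # arbitrary assignments
Q_posterior = joint / joint.sum(axis=1, keepdims=True)   # Q(j|i) = P(y = j|x_i, theta)

print(auxiliary(Q_random) <= log_lik + 1e-9)        # the bound holds
print(np.isclose(auxiliary(Q_posterior), log_lik))  # equality at the posterior
```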

EM-algorithm: max-max and monotonicity

• We can now rewrite the EM-algorithm in terms of two maximization steps
  involving the auxiliary objective:

      E-step:  Q(k) = argmax_Q l(Q; θ(k))
      M-step:  θ(k+1) = argmax_θ l(Q(k); θ)

  The monotonic increase of the log-likelihood now follows from the facts that
  1) neither step can decrease the auxiliary objective, and 2) the auxiliary
  objective equals the log-likelihood after each E-step:

      l(θ(k)) = l(Q(k); θ(k))
              ≤ l(Q(k); θ(k+1))
              ≤ l(Q(k+1); θ(k+1)) = l(θ(k+1))

Topics (recap; next: regularized mixtures)

Regularized EM

• Even a single covariance matrix in the Gaussian mixture model

      p(x|θ) = \sum_{j=1}^m p_j p(x|µ_j, Σ_j)

  involves a number of parameters and can easily lead to over-fitting.
• We can regularize the model by assigning a prior distribution over the
  parameters, especially the covariance matrices.

Regularized EM: prior

• A Wishart prior over each covariance matrix is given by

      P(Σ|S, n′) ∝ (1 / |Σ|^{n′/2}) exp( −(n′/2) Trace(Σ^{−1} S) )

  (written here in a bit non-standard way), where

      S  = “prior” covariance matrix
      n′ = equivalent sample size

  The equivalent sample size represents the number of training samples we
  would have to see in order for the prior and the data to have an equal
  effect on the solution.
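The prior enters the estimation only through its logarithm. Here is a small sketch (the helper name is mine) of the log-prior in the slide's parameterization, up to an additive constant.

```python
import numpy as np

def wishart_log_prior(Sigma, S, n_prime):
    """log P(Sigma | S, n') up to an additive constant, in the slide's form:
    -(n'/2) * log|Sigma| - (n'/2) * Trace(Sigma^{-1} S)."""
    sign, logdet = np.linalg.slogdet(Sigma)
    return -0.5 * n_prime * logdet - 0.5 * n_prime * np.trace(np.linalg.solve(Sigma, S))
```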


Regularized EM Regularized EM: demo
• The E-step is unaffected (though the resulting values for the
soft assignments will change)
• In the M-step we now maximize a penalized log-likelihood of
the weighted training set:
m " p̂(j|i)
n !
! #$ % ) * !m
P (y = j|xi, θ(k)) log pj p(xi|µj , Σj ) + log P (Σj |S, n!)
i=1 j=1 j=1

Formally the regularization penalty changes the resulting


covariance estimates only slightly:
+ n ,
(k+1) 1 !
Σj ← p̂(j|i) (xi − µ̂j )(xi − µ̂j ) + n S
T !
n̂j + n! i=1

Tommi Jaakkola, MIT CSAIL 19 Tommi Jaakkola, MIT CSAIL 20
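A minimal sketch (my own) of the regularized covariance update above, given the responsibilities p̂(j|i), the updated means µ̂_j, the prior covariance S, and the equivalent sample size n′.

```python
import numpy as np

def regularized_covariances(X, q, means, S, n_prime):
    """Sigma_j <- (sum_i q[i,j] (x_i - mu_j)(x_i - mu_j)^T + n' S) / (n_hat_j + n')."""
    n, d = X.shape
    m = q.shape[1]
    n_hat = q.sum(axis=0)                          # n_hat_j = sum_i q[i, j]
    covs = []
    for j in range(m):
        diff = X - means[j]
        scatter = (q[:, j, None] * diff).T @ diff  # weighted scatter matrix
        covs.append((scatter + n_prime * S) / (n_hat[j] + n_prime))
    return covs
```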

Topics (recap; next: selecting the number of mixture components)

Model selection and mixtures

• As a simple strategy for selecting the appropriate number of mixture
  components, we can find the m that minimizes the overall description length
  (cf. BIC):

      DL ≈ − log p(data|θ̂_m) + (d_m/2) log(n)

  where
  – n is the number of training points,
  – θ̂_m are the maximum likelihood parameters for the m-component mixture, and
  – d_m is the (effective) number of parameters in the m-component mixture.
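A sketch of scoring candidate values of m by description length; the helper name and the explicit parameter count for full-covariance Gaussian mixtures in d dimensions are my own additions, not taken from the slides.

```python
import numpy as np

def description_length(neg_log_lik, m, d, n):
    """DL ~ -log p(data | theta_hat_m) + (d_m / 2) * log(n).

    Assumes full-covariance Gaussian mixtures in d dimensions, so
    d_m = (m - 1) + m*d + m*d*(d + 1)/2  (mixing proportions + means + covariances).
    """
    d_m = (m - 1) + m * d + m * d * (d + 1) // 2
    return neg_log_lik + 0.5 * d_m * np.log(n)

# With d = 2 and n = 400 this parameter count reproduces the penalties reported
# on the example slides, e.g. 14.98 for m = 1:
print(description_length(2017.38, m=1, d=2, n=400))   # ~2032.36
```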

Model selection: example

• Typical cases:

      m=1, -logP(data)=2017.38, penalty=14.98, DL=2032.36
      m=2, -logP(data)=1712.69, penalty=32.95, DL=1745.65
      m=3, -logP(data)=1711.40, penalty=50.93, DL=1762.32
      m=4, -logP(data)=1682.06, penalty=68.90, DL=1750.97

• Best cases (out of several runs):

      m=1, -logP(data)=2017.38, penalty=14.98, DL=2032.36
      m=2, -logP(data)=1712.69, penalty=32.95, DL=1745.65
      m=3, -logP(data)=1678.56, penalty=50.93, DL=1729.49
      m=4, -logP(data)=1649.08, penalty=68.90, DL=1717.98

  [Figures: for each case, four scatter-plot panels, one per value of m
  (axes roughly −4 to 8 by −4 to 12).]


Topics Classification example
• Gaussian mixtures and the EM-algorithm • A digit recognition problem (8x8 binary digits)
– complete, incomplete, and inferred data Training set n = 100 (50 examples of each digit).
– EM for mixtures Test set n = 400 (200 examples of each digit).
– demo • We’d like to estimate class conditional mixture models (and
– EM and convergence prior class frequencies) to solve the classification problem
– regularized mixtures
– selecting the number of mixture components
– Gaussian mixtures for classification

Tommi Jaakkola, MIT CSAIL 25 Tommi Jaakkola, MIT CSAIL 26

• For example:

      Class 1: P(y = 1), p(x|θ_1)   (e.g., a 3-component mixture)
      Class 0: P(y = 0), p(x|θ_0)   (e.g., a 3-component mixture)

  A new test example x would be classified according to

      Class = 1  if  log [ P̂(y = 1) p(x|θ̂_1) / ( P̂(y = 0) p(x|θ̂_0) ) ] > 0

  and Class = 0 otherwise.
• Each class-conditional density is itself a mixture, e.g.,

      p(x|θ_0) = \sum_{j=1}^3 p_{j|0} p(x|µ_{j|0}, Σ_{j|0})

  (a hierarchical mixture model).

  [Figure: the class prior P(y) over the class labels y = 0 and y = 1, with
  each class expanding into its conditional mixture components j = 1, . . . , 3
  (mixing proportions p_{j|0}, component densities p(x|µ_{j|0}, Σ_{j|0})).]
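A sketch of the decision rule above (function names are mine); scikit-learn's GaussianMixture stands in here for the class-conditional mixture estimates, which is my assumption rather than the lecture's own implementation.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def fit_class_conditionals(X0, X1, n_components=3):
    """Fit one mixture per class plus the prior class frequency P_hat(y = 1)."""
    gm0 = GaussianMixture(n_components=n_components).fit(X0)
    gm1 = GaussianMixture(n_components=n_components).fit(X1)
    prior1 = len(X1) / (len(X0) + len(X1))
    return gm0, gm1, prior1

def classify(X, gm0, gm1, prior1):
    """Class = 1 if log [ P_hat(y=1) p(x|theta_1) / (P_hat(y=0) p(x|theta_0)) ] > 0."""
    log_odds = (np.log(prior1) + gm1.score_samples(X)
                - np.log(1.0 - prior1) - gm0.score_samples(X))
    return (log_odds > 0).astype(int)
```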

Classification example

• The figure gives the number of misclassified examples on the test set as a
  function of the number of mixture components in each class-conditional
  model.

  [Figure: test-set misclassification count (roughly 26 to 44) versus the
  number of mixture components per class (0 to 10).]
