
ISyE 6416: Computational Statistics

Spring 2023

Lecture 8: EM algorithm and Gaussian Mixture Model

Prof. Yao Xie

H. Milton Stewart School of Industrial and Systems Engineering

Georgia Institute of Technology
Expectation-Maximization (EM) Algorithm
▶ an algorithm for computing a maximum likelihood estimate in non-ideal cases: missing data,
indirect observations
▶ missing data
▶ clustering (unknown labels)
▶ hidden states in HMMs
▶ latent factors

▶ replace one difficult likelihood maximization with a sequence of easier maximizations
▶ in the limit, obtain the answer to the original problem
Applications of EM
▶ Data clustering in machine learning
▶ Natural language processing (Baum-Welch algorithm to fit hidden Markov model)
▶ Imputing missing data
General set-up

[Diagram: hidden state S in the state space, observed only indirectly through the observation O]

▶ we do not observe S directly; we only observe it indirectly through O

▶ Joint distribution of state and observation: f (S, O|θ)
Deriving EM

▶ Introduce

Q(θ; θ′ ) = E{log f (S, O|θ)|θ′ , O}

▶ The expectation uses the conditional distribution of S given O and the assumed parameter value θ′

Intuition
Given O, the “best guess” we could have for S is its conditional expectation with respect to
S|O, θ (a notion of projection); but computing this expectation involves the parameter values.
We take a guess and improve it in the next round.
Comment on the Q function
The Q-function is the conditional expectation of the complete-data log-likelihood:

Q(θ; θ′ ) = E{log f (S, O|θ)|θ′ , O}

▶ The expectation is taken with respect to the conditional distribution f (S|O)


▶ O: observed data
▶ In this sense, it has a Bayesian flavor: we have to compute the posterior
distribution of the state given the observation
▶ θ′ : assumed value of the parameter when deriving the posterior distribution
f (S|O)
▶ θ is the parameter in the complete-data log-likelihood log f (S, O|θ), the quantity we maximize over
▶ θ and θ′ are usually not the same within an iteration of the algorithm
E-M algorithm

▶ E-step: compute the expectation of the complete-data log-likelihood, given the observed data O
and with the state S unknown:

Q(θ; θ′) = E{log f (S, O|θ)|θ′, O}

▶ M-step: maximize the expected log-likelihood from the previous E-step

E-step ⇒ M-step ⇒ E-step ⇒ M-step ⇒ · · ·

▶ stop when ∥θk+1 − θk∥ < ϵ or |Q(θk+1|θk) − Q(θk|θk−1)| < ϵ (a generic sketch of this loop is given below)
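A minimal generic skeleton of this loop (an illustrative sketch, not course code; the helper names e_step and m_step are placeholders for the problem-specific computations):

```python
import numpy as np

def em(observed, theta0, e_step, m_step, tol=1e-6, max_iter=200):
    """Generic EM loop: alternate E-steps and M-steps until the parameters stop moving."""
    theta = np.asarray(theta0, dtype=float)
    for _ in range(max_iter):
        stats = e_step(observed, theta)                   # E-step: condition on O and current theta'
        theta_new = np.asarray(m_step(observed, stats))   # M-step: maximize Q(theta; theta')
        if np.linalg.norm(theta_new - theta) < tol:       # stopping rule from the slide
            break
        theta = theta_new
    return theta
```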
Example: EM for missing data

n = 4, p = 2

x1 = (0, 2)T , x2 = (1, 0)T , x3 = (2, 2)T , x4 = (∗, 4)T


Assume they are i.i.d. samples from the Gaussian N([µ1, µ2]ᵀ, Σ) with diagonal covariance Σ = diag(σ1², σ2²).

Use EM algorithm to impute the missing data *.


Hidden state: Missing data.

Pattern classification, R. O. Duda, P. E. Hart, and D. G. Stork


(Cont.) Example: missing data
▶ Initialization: θ0 = (0, 0, 1, 1)T , i.e., mean [0, 0]T and covariance I2 .
▶ E-step

Q(θ|θ0) = E_x41[log p(x|θ) | x1, x2, x3, x42]

= Σ_{i=1}^{3} log p(xi|θ) + ∫ log p([x41, 4]ᵀ|θ) · p(x41|θ0, x42 = 4) dx41

= Σ_{i=1}^{3} log p(xi|θ) − (1 + µ1²)/(2σ1²) − (4 − µ2)²/(2σ2²) − log(2πσ1σ2)

▶ M-step: θ1 = arg max_θ Q(θ|θ0)
(Cont.) Example: missing data - iterations
 
θ1 = (0.75, 2.0, 0.938, 2.0)ᵀ  ⇒  µ1 = [0.75, 2.0]ᵀ,  Σ1 = [[0.938, 0], [0, 2.0]]

θ2 = (1.0, 2.0, 0.667, 2.0)ᵀ
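A short numerical sketch of this example (an illustrative re-implementation under the slide's assumptions: diagonal covariance and only the first coordinate of x4 missing):

```python
import numpy as np

# Data from the example; the first coordinate of x4 is missing, its second coordinate is 4.
first_obs = np.array([0., 1., 2.])   # observed first coordinates of x1, x2, x3
second = np.array([2., 0., 2., 4.])  # second coordinates of all four points

mu = np.zeros(2)    # initialization theta0: mean [0, 0]
var = np.ones(2)    # ... and covariance I (variances 1, 1)

for _ in range(200):
    # E-step: with a diagonal covariance, x41 is independent of x42, so
    # E[x41 | theta] = mu[0] and E[x41^2 | theta] = var[0] + mu[0]^2.
    e_x41 = mu[0]
    e_x41_sq = var[0] + mu[0] ** 2

    # M-step: means and variances from the expected sufficient statistics.
    mu_new = np.array([(first_obs.sum() + e_x41) / 4, second.mean()])
    var_new = np.array([
        (np.sum((first_obs - mu_new[0]) ** 2)
         + e_x41_sq - 2 * mu_new[0] * e_x41 + mu_new[0] ** 2) / 4,
        np.mean((second - mu_new[1]) ** 2),
    ])
    if np.allclose(np.r_[mu_new, var_new], np.r_[mu, var], atol=1e-10):
        break
    mu, var = mu_new, var_new

print(mu, var)
```

The first pass reproduces θ1 = (0.75, 2.0, 0.938, 2.0); further iterations converge to mean [1.0, 2.0] and variances [0.667, 2.0].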
The absent-minded biologist
197 animals
Distributed into 4 categories

125 18 20 34

A multinomial model with 5 categories and unknown parameter θ


(1/2, θ/4, (1 − θ)/4, (1 − θ)/4, θ/4)
Can we figure out the count for Monkey A based on the data?
(Cont.) The absent-minded biologist

▶ data y = (125, 18, 20, 34)


▶ now assume y1 = y11 + y12 = 125
▶ Likelihood function
f (y|θ) = n! / (y11! y12! y2! y3! y4!) · (1/2)^y11 (θ/4)^y12 ((1 − θ)/4)^y2 ((1 − θ)/4)^y3 (θ/4)^y4
▶ log-likelihood

ℓ(θ|y) ∝ (y12 + y4 ) log θ + (y2 + y3 ) log(1 − θ)

▶ y12 is unknown, so we cannot directly maximize ℓ(θ|y)


(Cont.) The absent-minded biologist: set-up EM

Q(θ|θ′ ) = Ey12 [(y12 + y4 ) log θ + (y2 + y3 ) log(1 − θ)|y1 , . . . , y4 , θ′ ]


= (Ey12 [y12 |y1 , θ′ ] + y4 ) log θ + (y2 + y3 ) log(1 − θ)

Conditional distribution of y12 given y1: Binomial(y1, (θ′/4) / (θ′/4 + 1/2))

E_y12[y12|y1, θ′] = y1 θ′ / (2 + θ′) := y12^(θ′)

E-step:

Q(θ|θ′) = (y12^(θ′) + y4) log θ + (y2 + y3) log(1 − θ)

M-step:

θk+1 = arg max_θ Q(θ|θk) = (y12^(θk) + y4) / (y12^(θk) + y2 + y3 + y4)
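A minimal numerical sketch of this update (an illustrative re-implementation, not course code):

```python
# EM for the lumped multinomial count: y1 = y11 + y12 = 125 mixes two latent categories.
y1, y2, y3, y4 = 125, 18, 20, 34

theta = 0.5  # initial guess
for _ in range(100):
    # E-step: expected y12 given y1 and the current theta
    # (y12 | y1 ~ Binomial(y1, (theta/4) / (theta/4 + 1/2))).
    y12 = y1 * theta / (2 + theta)
    # M-step: closed-form maximizer of the expected log-likelihood.
    theta_new = (y12 + y4) / (y12 + y2 + y3 + y4)
    if abs(theta_new - theta) < 1e-12:
        break
    theta = theta_new

print(theta)                      # roughly 0.627
print(y1 * theta / (2 + theta))   # expected count for the latent category y12
```

The fixed point is a stationary point of the observed-data likelihood; for these counts it is the maximum likelihood estimate θ ≈ 0.627, and the last line gives the expected split of the lumped first category.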
Fitting Gaussian mixture model (GMM)
xi ∼ Σ_{c=1}^{C} πc ϕ(xi|µc, Σc)

ϕ: density of multi-variate normal
▶ parameters {µc, Σc, πc}, c = 1, . . . , C
▶ assume C is known.
▶ observed data {x1 , . . . , xn }
▶ complete data {(x1 , y1 ), . . . , (xn , yn )}
yi: “label” for each sample, missing (latent).
[Figure: a sample (xi, yi) drawn from a mixture with component weights π1, π2, π3]
EM for GMM
▶ If we know the label information yi, the likelihood function can easily be written

πyi ϕ(xi|µyi, Σyi)

▶ now yi is unknown, so we take the expectation with respect to its conditional distribution under the current parameters

Q(θ|θ′) = Σ_{i=1}^{n} E[log πyi + log ϕ(xi|µyi, Σyi)|xi, θ′]

E-step

▶ (πc^(k), µc^(k), Σc^(k)): parameter values at the kth iteration
▶ we need yi|xi, the posterior distribution of the label given the observation xi

pi,c := p(yi = c|xi) ∝ πc^(k) ϕ(xi|µc^(k), Σc^(k)),  and Σ_{c=1}^{C} p(yi = c|xi) = 1
Q(θ|θk) = Σ_{i=1}^{n} E[log πyi + log ϕ(xi|µyi, Σyi)|xi, θk]
        = Σ_{i=1}^{n} Σ_{c=1}^{C} pi,c log πc + Σ_{i=1}^{n} Σ_{c=1}^{C} pi,c log ϕ(xi|µc, Σc)

Q: where is θ?
M-step
▶ Maximize Q(θ|θk ) with respect to πc , µc , Σc (note that they can be maximized
separately)
θk+1 = arg max_θ Q(θ|θk)

▶ note that Σ_{c=1}^{C} πc = 1

µc^(k+1) = Σ_{i=1}^{n} pi,c xi / Σ_{i=1}^{n} pi,c

Σc^(k+1) = Σ_{i=1}^{n} pi,c (xi − µc^(k+1))(xi − µc^(k+1))ᵀ / Σ_{i=1}^{n} pi,c

πc^(k+1) = (1/n) Σ_{i=1}^{n} pi,c
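A compact sketch of these updates (an illustrative implementation, not the course's code; the small ridge 1e-6·I added to each covariance is an extra assumption to keep the densities well defined):

```python
import numpy as np
from scipy.stats import multivariate_normal

def em_gmm(X, C, n_iter=100, seed=0):
    """Fit a C-component Gaussian mixture to X (n x d) with the E/M updates above."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    pi = np.full(C, 1.0 / C)                      # mixing weights
    mu = X[rng.choice(n, size=C, replace=False)]  # means initialized at random data points
    Sigma = np.array([np.cov(X.T) + 1e-6 * np.eye(d) for _ in range(C)])

    for _ in range(n_iter):
        # E-step: responsibilities p_{i,c} proportional to pi_c * phi(x_i | mu_c, Sigma_c).
        resp = np.column_stack([
            pi[c] * multivariate_normal.pdf(X, mean=mu[c], cov=Sigma[c]) for c in range(C)
        ])
        resp /= resp.sum(axis=1, keepdims=True)

        # M-step: weighted means, covariances, and mixing proportions.
        Nc = resp.sum(axis=0)                     # expected number of samples per component
        mu = (resp.T @ X) / Nc[:, None]
        for c in range(C):
            diff = X - mu[c]
            Sigma[c] = (resp[:, c, None] * diff).T @ diff / Nc[c] + 1e-6 * np.eye(d)
        pi = Nc / n
    return pi, mu, Sigma, resp
```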
Interpretation

▶ pi,c: the probability that sample i belongs to component c
▶ πc^(k+1): the expected number of samples belonging to component c, divided by n
▶ soft assignment: xi belongs to component c with assignment probability pi,c
▶ µc^(k+1): “average” centroid using the soft assignment
▶ Σc^(k+1): “average” covariance using the soft assignment

[Figure: a sample xi with posterior probabilities P(yi = j|xi), e.g. 0.5, 0.3, 0.2 over components 1, 2, 3]
k-means
▶ K-means: “hard” assignment
▶ EM algorithm: “soft” assignment; in the end, pi,c can be viewed as a soft label for each
sample, which can be converted into a hard label (see the snippet below):

ĉi = arg max_{c=1,...,C} pi,c

[Figure: a sample xi with hard assignment, probability 1 for one component and 0 for the others]
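A one-line conversion from soft responsibilities to hard labels (the example resp matrix here is made up; in practice it would come from the em_gmm sketch above):

```python
import numpy as np
# resp: (n, C) matrix of responsibilities p_{i,c}.
resp = np.array([[0.5, 0.3, 0.2],
                 [0.1, 0.8, 0.1]])
hard_labels = np.argmax(resp, axis=1)   # most probable component per sample -> [0, 1]
```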
Demo
▶ The wine data set was introduced by Forina et al. (1986)
▶ It originally includes the results of 27 chemical measurements on 178 wines made
in the same region of Italy but derived from three different cultivars: Barolo,
Grignolino and Barbera
▶ We use the first two principal components of the data
Mixture of 3 Gaussian components
▶ First run PCA to reduce the data dimension to 2
▶ Use pi,c, c = 1, 2, 3 as the proportions of the “red”, “green”, and “blue” color components (a rough reproduction is sketched below)
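One rough way to reproduce this demo (a sketch only; it uses scikit-learn's bundled copy of the wine data, which keeps 13 of the original measurements, and scikit-learn's GMM fit rather than the course code):

```python
import matplotlib.pyplot as plt
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.mixture import GaussianMixture

X = load_wine().data                        # 178 wines, 13 chemical measurements
X2 = PCA(n_components=2).fit_transform(X)   # first two principal components

gmm = GaussianMixture(n_components=3, random_state=0).fit(X2)
resp = gmm.predict_proba(X2)                # p_{i,c}: soft assignment to the 3 components

# Color each wine by its responsibilities: RGB = (p_{i,1}, p_{i,2}, p_{i,3}).
plt.scatter(X2[:, 0], X2[:, 1], c=resp)
plt.xlabel("PC 1"); plt.ylabel("PC 2")
plt.show()
```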
Properties of EM

▶ EM algorithm converges to a local maximum
▶ Heuristic: escape a local maximum by trying random restarts
▶ EM works on improving Q(θ|θ′ ) rather than directly improving log f (x|θ)
▶ one can show that improvement on Q(θ|θ′ ) improves log f (x|θ)
▶ EM works well with exponential families
▶ E-step: a sum of expectations of the sufficient statistics
▶ M-step: maximizing a linear function of these statistics, so a closed-form update is usually available
Convergence of EM

▶ Proof by A. Dempster, N. Laird, and D. Rubin in 1977, later generalized by C. F. J. Wu in 1983
▶ Basic idea: find a sequence of lower bounds for the likelihood function
▶ EM monotonically increases the observed-data log-likelihood

ℓ(θk+1 ) ≥ Q(θk+1 ; θk ) ≥ Q(θk ; θk ) = ℓ(θk )
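A short sketch of the standard monotonicity argument (the usual decomposition following Dempster, Laird, and Rubin; written here in LaTeX, not copied verbatim from the slides):

```latex
% Decompose the observed-data log-likelihood using f(O|\theta) = f(S,O|\theta)/f(S|O,\theta):
%   \ell(\theta) = Q(\theta;\theta_k) + H(\theta;\theta_k),
%   where H(\theta;\theta_k) := -\mathbb{E}\{\log f(S\mid O,\theta)\mid O,\theta_k\}.
% Gibbs' (Jensen's) inequality gives H(\theta;\theta_k) \ge H(\theta_k;\theta_k) for all \theta, hence
\ell(\theta_{k+1}) - \ell(\theta_k)
  = \underbrace{Q(\theta_{k+1};\theta_k) - Q(\theta_k;\theta_k)}_{\ge\, 0 \text{ (M-step)}}
  + \underbrace{H(\theta_{k+1};\theta_k) - H(\theta_k;\theta_k)}_{\ge\, 0 \text{ (Gibbs' inequality)}}
  \;\ge\; 0.
```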
