This document presents Lecture 5 of a course on Probabilistic Machine Learning, focusing on the Expectation Maximization (EM) algorithm and its application to Gaussian Mixture Models (GMMs). It outlines the concepts of latent variables, the steps involved in the EM algorithm, and the derivation of the algorithm for GMMs, along with potential issues such as local optima and label-switching. The lecture emphasizes the advantages of GMMs over k-means clustering and provides references for further reading.


Probabilistic Machine Learning

Lecture 5: Expectation maximization

Pekka Marttinen

Aalto University

February, 2025



Lecture 5 overview

Gaussian mixture models (GMMs), recap
EM algorithm
EM for Gaussian mixture models
Suggested reading: Bishop, Pattern Recognition and Machine Learning:
  p. 110-113 (2.3.9): Mixtures of Gaussians
  p. 430-443: EM for Gaussian mixtures
simple_example.pdf



GMMs, latent variable representation
Introduce latent variables z_n = (z_n1, . . . , z_nK), where z_n specifies the component k of observation x_n:

$$z_n = (0, \ldots, 0, \underbrace{1}_{k\text{th elem.}}, 0, \ldots, 0)^T$$

Define

$$p(z_n) = \prod_{k=1}^{K} \pi_k^{z_{nk}} \qquad \text{and} \qquad p(x_n \mid z_n) = \prod_{k=1}^{K} \mathcal{N}(x_n \mid \mu_k, \Sigma_k)^{z_{nk}}$$

Then the marginal distribution p(x_n) is a GMM:

$$p(x_n) = \sum_{k=1}^{K} \pi_k \, \mathcal{N}(x_n \mid \mu_k, \Sigma_k)$$
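
To make the latent-variable representation concrete, here is a minimal NumPy sketch (my own illustration, not code from the lecture; the parameter values are arbitrary): each x_n is generated by first drawing its component indicator z_n from Categorical(π) and then sampling from the corresponding Gaussian, and the marginal density is the weighted sum of the component densities.

```python
# Minimal sketch of the latent-variable view of a GMM (illustrative parameters).
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(0)

pi = np.array([0.3, 0.7])                         # mixing coefficients pi_k
mus = np.array([[0.0, 0.0], [3.0, 3.0]])          # component means mu_k
Sigmas = np.array([np.eye(2), 0.5 * np.eye(2)])   # component covariances Sigma_k

def sample_gmm(n):
    """Sample x_n by first drawing the component indicator z_n ~ Categorical(pi)."""
    z = rng.choice(len(pi), size=n, p=pi)
    x = np.stack([rng.multivariate_normal(mus[k], Sigmas[k]) for k in z])
    return x, z

def gmm_density(x):
    """Marginal density p(x) = sum_k pi_k N(x | mu_k, Sigma_k)."""
    return sum(pi[k] * multivariate_normal.pdf(x, mus[k], Sigmas[k])
               for k in range(len(pi)))

X, Z = sample_gmm(500)
print(gmm_density(X[:3]))   # marginal density of the first three samples
```
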
GMM: responsibilities, complete data

The posterior probability (responsibility) p(z_nk = 1 | x_n) that observation x_n was generated by component k:

$$\gamma(z_{nk}) \equiv p(z_{nk} = 1 \mid x_n) = \frac{\pi_k \, \mathcal{N}(x_n \mid \mu_k, \Sigma_k)}{\sum_{j=1}^{K} \pi_j \, \mathcal{N}(x_n \mid \mu_j, \Sigma_j)}$$

Complete data: the latent variables z and the data x together: (x, z)

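
As a sketch (again my own code, reusing the X, pi, mus and Sigmas arrays defined in the previous snippet), the responsibilities for a whole data set can be computed in vectorized form:

```python
# gamma[n, k] = pi_k N(x_n | mu_k, Sigma_k) / sum_j pi_j N(x_n | mu_j, Sigma_j)
import numpy as np
from scipy.stats import multivariate_normal

def responsibilities(X, pi, mus, Sigmas):
    # Numerators for every (n, k): rows index observations, columns index components.
    dens = np.column_stack([pi[k] * multivariate_normal.pdf(X, mus[k], Sigmas[k])
                            for k in range(len(pi))])
    return dens / dens.sum(axis=1, keepdims=True)   # normalize over components k

gamma = responsibilities(X, pi, mus, Sigmas)         # each row sums to one
```
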


Idea of the EM algorithm (1/2)

Let X denote the observed data and θ the model parameters. The goal in maximum likelihood is to find the estimate

$$\hat{\theta} = \arg\max_{\theta} \{ \log p(X \mid \theta) \}$$

If the model contains latent variables Z, the log-likelihood is given by

$$\log p(X \mid \theta) = \log \Big\{ \sum_{Z} p(X, Z \mid \theta) \Big\},$$

which may be difficult to maximize analytically.

Possible solutions: 1) numerical optimization, 2) the EM algorithm (expectation-maximization).



Idea of the EM algorithm (2/2)

X: observed data, Z: unobserved latent variables
{X, Z}: complete data, X: incomplete data

In the EM algorithm, we assume that the complete data log-likelihood

$$\log p(X, Z \mid \theta)$$

is easy to maximize.

Problem: Z is not observed.
Solution: maximize

$$Q(\theta, \theta_0) \equiv \mathbb{E}_{Z \mid X, \theta_0} [\log p(X, Z \mid \theta)] = \sum_{Z} p(Z \mid X, \theta_0) \log p(X, Z \mid \theta),$$

where p(Z | X, θ_0) is the posterior distribution of the latent variables computed using the current parameter estimate θ_0.
Illustration of the EM algorithm for GMMs

(Figure not reproduced: illustration of EM iterations for a GMM.)


EM algorithm in detail

Goal: maximize log p(X | θ) w.r.t. θ.

1. Initialize θ_0.
2. E-step: Evaluate p(Z | X, θ_0), and then compute

$$Q(\theta, \theta_0) = \mathbb{E}_{Z \mid X, \theta_0} [\log p(X, Z \mid \theta)] = \sum_{Z} p(Z \mid X, \theta_0) \log p(X, Z \mid \theta)$$

3. M-step: Evaluate θ_new using

$$\theta^{\text{new}} = \arg\max_{\theta} Q(\theta, \theta_0).$$

   Set θ_0 ← θ_new.
4. Repeat E and M steps until convergence.

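
The loop structure can be written down generically. The following is a minimal sketch (my own; e_step, m_step and log_lik are hypothetical callables supplied by the user, not an API from the course material):

```python
# Generic EM loop: alternate the E-step and M-step until the log-likelihood stabilizes.
def em(theta0, e_step, m_step, log_lik, max_iter=100, tol=1e-8):
    theta = theta0
    prev_ll = -float("inf")
    for _ in range(max_iter):
        stats = e_step(theta)      # E-step: p(Z | X, theta_0), or its sufficient statistics
        theta = m_step(stats)      # M-step: theta_new = argmax_theta Q(theta, theta_0)
        ll = log_lik(theta)        # monitor log p(X | theta); EM never decreases it
        if ll - prev_ll < tol:
            break
        prev_ll = ll
    return theta
```
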


Why EM works

Figure: 11.16 in Murphy (2012)

As a function of θ, Q(θ, θ_0) is a lower bound of the log-likelihood log p(X | θ) (plus a constant, see Bishop, Ch. 9.4).
EM iterates between 1) updating the lower bound (E-step) and 2) maximizing the lower bound (M-step).
EM algorithm, comments

In general, Z does not have to be discrete; just replace the summation in Q(θ, θ_0) by integration.
The EM algorithm can be used to compute the MAP (maximum a posteriori) estimate by maximizing Q(θ, θ_0) + log p(θ) in the M-step.
In general, the EM algorithm is applicable whenever the observed data X can be augmented into complete data {X, Z} such that log p(X, Z | θ) is easy to maximize; Z does not have to be latent variables but can represent, for example, unobserved values of missing or censored observations.



EM algorithm, simple example

Consider N independent observations x = (x_1, . . . , x_N) from a two-component mixture of univariate Gaussians:

$$p(x_n \mid \theta) = \frac{1}{2} \mathcal{N}(x_n \mid 0, 1) + \frac{1}{2} \mathcal{N}(x_n \mid \theta, 1). \tag{1}$$

One unknown parameter, θ, the mean of the second component.

Goal: estimate

$$\hat{\theta} = \arg\max_{\theta} \{ \log p(\mathbf{x} \mid \theta) \}.$$

simple_example.pdf
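
A compact NumPy sketch of EM for this toy model (my own illustration, not the contents of simple_example.pdf). The E-step computes the responsibility of the second component for each observation; the M-step sets θ to the responsibility-weighted mean of the data, which is what setting the derivative of Q to zero gives for model (1):

```python
# EM for p(x_n | theta) = 0.5 N(x_n | 0, 1) + 0.5 N(x_n | theta, 1),
# where only the second component's mean theta is unknown.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)
true_theta = 4.0
# Simulated data: each point comes from component 0 or 1 with probability 1/2.
z = rng.integers(0, 2, size=200)
x = rng.normal(loc=z * true_theta, scale=1.0)

theta = 1.0                       # initial guess
for _ in range(50):
    # E-step: responsibility of the second component for each x_n.
    p0 = 0.5 * norm.pdf(x, 0.0, 1.0)
    p1 = 0.5 * norm.pdf(x, theta, 1.0)
    gamma = p1 / (p0 + p1)
    # M-step: responsibility-weighted mean of the data.
    theta = np.sum(gamma * x) / np.sum(gamma)

print(theta)   # should end up close to true_theta
```
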



EM algorithm for GMMs
The model:

$$p(x) = \sum_{k=1}^{K} \pi_k \, \mathcal{N}(x \mid \mu_k, \Sigma_k)$$

1. Initialize the means μ_k, covariances Σ_k and mixing coefficients π_k. Repeat until convergence:
2. E-step: Evaluate the responsibilities using the current parameter values:

$$\gamma(z_{nk}) = \frac{\pi_k \, \mathcal{N}(x_n \mid \mu_k, \Sigma_k)}{\sum_{j=1}^{K} \pi_j \, \mathcal{N}(x_n \mid \mu_j, \Sigma_j)}$$

3. M-step: Re-estimate the parameters using the current responsibilities:

$$\mu_k^{\text{new}} = \frac{1}{N_k} \sum_{n=1}^{N} \gamma(z_{nk}) \, x_n$$

$$\Sigma_k^{\text{new}} = \frac{1}{N_k} \sum_{n=1}^{N} \gamma(z_{nk}) (x_n - \mu_k^{\text{new}})(x_n - \mu_k^{\text{new}})^T$$

$$\pi_k^{\text{new}} = \frac{N_k}{N}, \qquad \text{where } N_k = \sum_{n=1}^{N} \gamma(z_{nk}).$$
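
The updates above translate almost line by line into NumPy. The following is one possible implementation (my own sketch, not the course code); the small ridge added to the covariances is a practical safeguard against singular covariances, anticipating the caveats slide below.

```python
# EM for a GMM with K components; X has shape (N, D).
import numpy as np
from scipy.stats import multivariate_normal

def em_gmm(X, K, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    N, D = X.shape

    # Initialization: uniform weights, K random data points as means, data covariance.
    pi = np.full(K, 1.0 / K)
    mus = X[rng.choice(N, size=K, replace=False)].copy()
    Sigmas = np.array([np.cov(X.T) + 1e-6 * np.eye(D) for _ in range(K)])

    for _ in range(n_iter):
        # E-step: gamma[n, k] = pi_k N(x_n | mu_k, Sigma_k) / sum_j pi_j N(x_n | mu_j, Sigma_j)
        dens = np.column_stack([pi[k] * multivariate_normal.pdf(X, mus[k], Sigmas[k])
                                for k in range(K)])
        gamma = dens / dens.sum(axis=1, keepdims=True)

        # M-step: N_k, then the weighted mean, covariance and mixing-coefficient updates.
        Nk = gamma.sum(axis=0)
        mus = (gamma.T @ X) / Nk[:, None]
        for k in range(K):
            diff = X - mus[k]
            Sigmas[k] = (gamma[:, k, None] * diff).T @ diff / Nk[k] + 1e-6 * np.eye(D)
        pi = Nk / N

    return pi, mus, Sigmas, gamma
```

For instance, `em_gmm(X, K=2)` applied to the data simulated in the first sketch should recover parameters close to the ones used there, up to label switching (see the caveats slide).
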
Derivation of the EM algorithm for GMMs

In the M-step, the formulas for μ_k^new and Σ_k^new are obtained by differentiating the expected complete data log-likelihood Q(θ, θ_0) with respect to the particular parameters and setting the derivatives to zero.
The formula for π_k^new can be derived by maximizing Q(θ, θ_0) under the constraint ∑_{k=1}^K π_k = 1. This can be done using Lagrange multipliers, as sketched below.
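
A brief sketch of that constrained maximization (the standard textbook argument; only the terms of Q that depend on π are kept, and λ denotes the Lagrange multiplier):

```latex
% Requires amsmath. Sketch of the pi_k update via a Lagrange multiplier.
\begin{align*}
  \mathcal{L} &= \sum_{n=1}^{N}\sum_{k=1}^{K} \gamma(z_{nk}) \log \pi_k
                 + \lambda \Big( \sum_{k=1}^{K} \pi_k - 1 \Big) \\
  \frac{\partial \mathcal{L}}{\partial \pi_k}
              &= \frac{\sum_{n=1}^{N} \gamma(z_{nk})}{\pi_k} + \lambda
               = \frac{N_k}{\pi_k} + \lambda = 0
  \quad\Longrightarrow\quad \pi_k = -\frac{N_k}{\lambda}. \\
  \intertext{Summing over $k$ and using $\sum_k \pi_k = 1$ and $\sum_k N_k = N$
             gives $\lambda = -N$, hence}
  \pi_k^{\text{new}} &= \frac{N_k}{N}.
\end{align*}
```
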



EM for GMM, caveats
EM converges to a local optimum. In fact, ML estimation for GMMs is not well-defined due to singularities: if σ_k → 0 for a component k containing a single data point, the likelihood goes to infinity (figure omitted). Remedy: a prior on σ_k; a simple practical version is sketched below.
Label-switching: non-identifiability due to the fact that the cluster labels can be switched while the likelihood remains the same.
In practice it is recommended to initialize the EM for the GMM by k-means.
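
One simple, commonly used way to act on that remedy in code (a sketch, not the lecture's specific prescription) is to add a small ridge to each covariance estimate in the M-step, which keeps Σ_k from collapsing onto a single data point:

```python
# Add a small ridge to a covariance matrix so it cannot become singular;
# apply to Sigma_k right after the M-step update.
import numpy as np

def regularize_covariance(Sigma, eps=1e-3):
    return Sigma + eps * np.eye(Sigma.shape[0])
```
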



GMM vs. k-means
"Why use GMMs and not just k-means?"

from Wikipedia

1. Clusters can be of different sizes and shapes.
2. Probabilistic assignment of data items to clusters.
3. Possibility to include prior knowledge (structure of the model / prior distributions on the parameters).
Important points

ML estimation of GMMs can be done using numerical optimization or the EM algorithm.
The main idea of the EM algorithm is to maximize the expectation of the complete data log-likelihood, where the expectation is computed with respect to the current posterior distributions (responsibilities) of the latent variables.

