
CSE 3521: (Gaussian) Mixture Models

[Many slides are adapted from UC Berkeley CS-188 and the previous CSE 5521 course at OSU.]
Learning for Hidden Variables
• We saw previously how to estimate parameters for random variable
distributions…
o But only when all variables are visible (i.e., our data points have values for all variables)
• Today: How to deal with hidden (unobserved) variables?
• But first, where might such a situation arise?
Example: Homework Scores
• Suppose I keep a spreadsheet with homework scores in it…

  Student          HW 1  HW 2  HW 3
  Bloggs, Joe        70    85    81
  Doe, Jane          72    78    77
  Public, John Q.    77    87    83
  Roe, Richard       90    99    89
  Smith, John        81    91    84
  Stu, Gary          79    85    80
  Sue, Mary          80    88    83

• What kind of model might I construct from this?
• Let's ignore individual students for the moment, and try to model just the homework assignments…
Example: Homework Scores
• Two variables: C P(C) Student HW 1 HW 2 HW 3
o X: Homework score 1 ? Bloggs, Joe 70 85 81
o C: Which homework? 2 ?
Doe, Jane 72 78 77
(C for “choice” or “class”) Public, John Q. 77 87 83
C 3 ?
Roe, Richard 90 99 89
Smith, John 81 91 84
• Distributions?
Stu, Gary 79 85 80
o C is categorical (1,2,3) C P(X|C)
Sue, Mary 80 88 83
o X, Gaussian is a X 1 X ~ N(μ1,σ1)
reasonable choice 2 X ~ N(μ2,σ2)
3 X ~ N(μ3,σ3)
Example: Homework Scores
• Parameters?
• Easy…
• P(C) = 1/3 for each homework
  o We have 3 homework assignments and an equal number of scores for each
• μi, σi
  o Separately for each homework assignment, calculate the mean and standard deviation

  C   P(C)        C   P(X|C)
  1   0.33        1   X ~ N(μ1, σ1)
  2   0.33        2   X ~ N(μ2, σ2)
  3   0.33        3   X ~ N(μ3, σ3)
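A minimal sketch of this fully observed estimation in Python, assuming NumPy and the score table above (the array layout and variable names are my own):

```python
import numpy as np

# Rows = students, columns = HW 1, HW 2, HW 3 (copied from the table above)
scores = np.array([
    [70, 85, 81],
    [72, 78, 77],
    [77, 87, 83],
    [90, 99, 89],
    [81, 91, 84],
    [79, 85, 80],
    [80, 88, 83],
], dtype=float)

pi = np.full(3, 1 / 3)        # P(C): each assignment contributes the same number of scores
mu = scores.mean(axis=0)      # per-assignment means
sigma = scores.std(axis=0)    # per-assignment standard deviations (MLE, i.e., ddof=0)

print(pi, mu.round(1), sigma.round(2))
```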
Problem: Information Loss
• Now suppose my spreadsheet gets corrupted and all the scores get mixed up…

  HW ?: 83, 81, 79, 72, 78, 91, 81, 84, 88, 77, 85, 85

• Does my model change?
  o No!
  o BUT, the homework source (C) is now a hidden variable
  o Only the homework score (X) is visible

  C   P(C)        C   P(X|C)
  1   ?           1   X ~ N(μ1, σ1)
  2   ?           2   X ~ N(μ2, σ2)
  3   ?           3   X ~ N(μ3, σ3)
Mixture Model
• This is now a type of model known as a mixture model
  o The data (scores) are a mixture of values from different sources
• Specifically, a Gaussian Mixture Model (GMM) or Mixture of Gaussians (MoG) model
  o Because the different sources are all Gaussian distributed
• There are many different types of mixture models, depending on how the visible variables are distributed
Mixture Model, Generative Models
• Mixture models are one of the simplest examples of what are called generative models
• If we choose a source C, the model tells us how to generate new examples for X
  o In this case, by generating random numbers from the appropriate Gaussian distribution
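A minimal sketch of that generative process, assuming illustrative parameter values (the π, μ, σ below are placeholders, not values estimated from the data):

```python
import numpy as np

rng = np.random.default_rng(0)

pi = np.array([1 / 3, 1 / 3, 1 / 3])   # P(C): prior over sources (illustrative)
mu = np.array([70.0, 77.0, 85.0])      # per-source means (illustrative)
sigma = np.array([2.5, 2.5, 2.5])      # per-source standard deviations (illustrative)

def sample_gmm(n):
    """Generate n scores: first pick a source C ~ P(C), then draw X ~ N(mu_C, sigma_C)."""
    c = rng.choice(len(pi), size=n, p=pi)   # hidden source for each sample
    x = rng.normal(mu[c], sigma[c])         # visible score for each sample
    return c, x

sources, scores = sample_gmm(12)
print(sources, scores.round(1))
```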
Questions?
Parameter Estimation in Mixture Models
• Essentially, can we reverse the mixing process?
• Can we recover the original source distributions (Gaussians)?
o (Generally) Yes!
• Can we recover the original sources for each datapoint?
o Sometimes, but generally, no.
Parameter Estimation in Mixture Models
• How? MLE… Start from the global joint distribution:

$$
P(X_1, \dots, X_n) = \prod_i P(X = x_i)
= \prod_i \sum_c P(X = x_i, C = c) \quad \text{(C is hidden, so marginalize over it)}
= \prod_i \sum_c P(C = c)\, P(X = x_i \mid C = c) \quad \text{(factorize)}
= \prod_i \sum_c \pi_c\, N(x_i; \mu_c, \sigma_c)
= \prod_i \sum_c \pi_c\, \frac{1}{\sqrt{2\pi\sigma_c^2}}\, e^{-\frac{(x_i - \mu_c)^2}{2\sigma_c^2}}
$$

  C   P(C)        C   P(X|C)
  1   π1          1   X ~ N(μ1, σ1)
  2   π2          2   X ~ N(μ2, σ2)
  3   π3          3   X ~ N(μ3, σ3)

• Now take derivatives w.r.t. μc, σc, πc and set them equal to 0
  o But unlike previous MLE examples, there is no closed-form solution…
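A minimal sketch of evaluating this likelihood numerically on the mixed-up scores (working in log space with logsumexp is my addition for numerical stability, not something from the slides):

```python
import numpy as np
from scipy.stats import norm
from scipy.special import logsumexp

x = np.array([83, 81, 79, 72, 78, 91, 81, 84, 88, 77, 85, 85], dtype=float)

def gmm_log_likelihood(x, pi, mu, sigma):
    """log of prod_i sum_c pi_c N(x_i; mu_c, sigma_c), computed in log space."""
    # log_joint[i, c] = log( pi_c * N(x_i; mu_c, sigma_c) )
    log_joint = np.log(pi) + norm.logpdf(x[:, None], loc=mu, scale=sigma)
    return logsumexp(log_joint, axis=1).sum()   # marginalize over c, then sum over i

pi = np.full(3, 1 / 3)
mu = np.array([70.0, 77.0, 85.0])
sigma = np.full(3, 2.5)
print(gmm_log_likelihood(x, pi, mu, sigma))
```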
Parameter Estimation in Mixture Models
• With the hidden variable present (and marginalized over), there is too much ambiguity to find a direct solution
• Idea: Consider hidden and visible variables separately
o Break the problem into two pieces…
• Expectation
o What are the different possible values for the hidden variables and how likely do we
expect those values to be for each data point?
• Maximization
o Maximize the likelihood of our data points (MLE), i.e., the visible variables, using our
expectations about the hidden variables
• This is the essence of what we call the EM (Expectation-Maximization)
Algorithm
Parameter Estimation: EM Algorithm
• Although often referred to as “the EM algorithm”, it actually represents a
family of algorithms for dealing with problems that have hidden variables
• It is an example of an iterative-improvement algorithm
o It takes a set of guesses for the parameters and gives you better (more likely) values
for the parameters
o Needs to be repeated many times to get “best” values
• It is susceptible to local maxima
o The starting set of guesses used does matter!
o Often run multiple times with different (random) starting values
Parameter Estimation: EM Algorithm
• Generic form of EM:
• Given:
  o Θ^t, parameters at iteration t
  o Zi, hidden variables for datapoint i
  o Xi, visible variables for datapoint i

$$
\Theta^{t+1} = \operatorname*{argmax}_{\hat{\Theta}} \sum_i \sum_z P(Z_i = z \mid X_i, \Theta^t)\, L(X_i, Z_i = z \mid \hat{\Theta})
$$

• E-step: calculate the probabilities of Z (the first factor)
• M-step: calculate the most likely Θ (the argmax)
Mixture Models: Expectation
• Let's apply EM to mixture models, specifically…
• Expectation: For every data point x_i, how likely is it to have been generated by source c?

$$
P_{ci} = P(C = c \mid X = x_i) = \alpha\, P(C = c, X = x_i) = \alpha\, P(X = x_i \mid C = c)\, P(C = c)
$$

  o α normalizes over all possible sources
  o For a GMM, P(X = x_i | C = c) is N(x_i; μc, σc), and P(C = c) is the likelihood of the source, πc

$$
N_c = \sum_i P_{ci}
$$

  o The overall "weight" of a particular source (a convenience value we'll use in the next step)
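A minimal sketch of this E-step for the GMM case, assuming the same array conventions as the earlier snippets (the name `e_step` is mine, not from the slides):

```python
import numpy as np
from scipy.stats import norm

def e_step(x, pi, mu, sigma):
    """Return P[i, c] = P(C = c | X = x_i) and the per-source weights N_c."""
    joint = pi * norm.pdf(x[:, None], loc=mu, scale=sigma)   # pi_c * N(x_i; mu_c, sigma_c)
    P = joint / joint.sum(axis=1, keepdims=True)             # alpha-normalize over sources
    N = P.sum(axis=0)                                        # N_c = sum_i P_ci
    return P, N
```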
Mixture Models: Maximization
• Maximization: Use our data points to estimate each parameter, weighted by our (previously calculated) expectations

$$
\hat{\pi}_c = \frac{N_c}{\sum_j N_j} = \frac{N_c}{N}
$$

  o "Count" the number of data points that belong to each source, where the expectation is used as a partial "count"
  o That is, P_{1i} = 0.6 is interpreted to mean we give 60% of the i-th data point to source 1 (and the remaining 40% to the other sources)

• The rest depends on the source distribution, but for a GMM:

$$
\hat{\mu}_c = \frac{1}{N_c} \sum_i P_{ci}\, x_i
\qquad
\hat{\sigma}_c^2 = \frac{1}{N_c} \sum_i P_{ci}\, (x_i - \hat{\mu}_c)^2
$$

  o A weighted mean and standard deviation, where the expectation is used as the weight
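And a minimal sketch of the matching M-step, under the same assumptions as the E-step sketch above:

```python
import numpy as np

def m_step(x, P):
    """Re-estimate pi, mu, sigma from the E-step responsibilities P[i, c]."""
    N = P.sum(axis=0)                                    # N_c
    pi = N / N.sum()                                     # hat(pi)_c = N_c / N
    mu = (P * x[:, None]).sum(axis=0) / N                # weighted means
    var = (P * (x[:, None] - mu) ** 2).sum(axis=0) / N   # weighted variances
    return pi, mu, np.sqrt(var)
```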
Questions?
Example: EM with GMMs
• Let’s return to our C P(C) HW ?
homework example 1 ? 83

• Start with some guesses… 2 ?


81
79
• πc=1/3 C 3 ?
72
o All homeworks sources are 78
equally likely 91
C P(X|C)
• μ1=70, μ2=77, μ3=85 X 81
1 X ~ N(μ1,σ1)
84
• σ1=σ2=σ3=2.5 2 X ~ N(μ2,σ2) 88
o Each homework’s scores
3 X ~ N(μ3,σ3) 77
cover about a 10 point range
85
85
Example: EM with GMMs
• Expectation, for the first data point (x_1 = 83):

$$
P_{c,1} = \alpha\, P(X = 83 \mid C = c)\, P(C = c)
$$

$$
P_{1,1} = \alpha\, \frac{1}{\sqrt{2\pi (2.5)^2}}\, e^{-\frac{(83-70)^2}{2 (2.5)^2}} \cdot \frac{1}{3} = \alpha \cdot 7.148 \cdot 10^{-8} = 0.00
$$

$$
P_{2,1} = \alpha\, \frac{1}{\sqrt{2\pi (2.5)^2}}\, e^{-\frac{(83-77)^2}{2 (2.5)^2}} \cdot \frac{1}{3} = \alpha \cdot 0.002986 = 0.07
$$

$$
P_{3,1} = \alpha\, \frac{1}{\sqrt{2\pi (2.5)^2}}\, e^{-\frac{(83-85)^2}{2 (2.5)^2}} \cdot \frac{1}{3} = \alpha \cdot 0.03862 = 0.93
$$

$$
\alpha = \frac{1}{7.148 \cdot 10^{-8} + 0.002986 + 0.03862}
$$

  i   x_i   P_1i   P_2i   P_3i
  1    83   0.00   0.07   0.93
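A quick numerical check of these hand calculations, using the same setup as the sketches above (the printed values should match the table rows and the N_c totals on the next slide, up to rounding):

```python
import numpy as np
from scipy.stats import norm

x = np.array([83, 81, 79, 72, 78, 91, 81, 84, 88, 77, 85, 85], dtype=float)
pi = np.full(3, 1 / 3)
mu = np.array([70.0, 77.0, 85.0])
sigma = np.full(3, 2.5)

joint = pi * norm.pdf(x[:, None], loc=mu, scale=sigma)   # pi_c * N(x_i; mu_c, sigma_c)
P = joint / joint.sum(axis=1, keepdims=True)             # responsibilities P_ci

print(P[0].round(2))            # first data point: roughly [0.   0.07 0.93]
print(P.sum(axis=0).round(2))   # N_c: roughly [0.87 4.14 6.99]
```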
Example: EM with GMMs
• Expectation:
  o Continue for all data points

  i   x_i   P_1i   P_2i   P_3i
  1    83   0.00   0.07   0.93
  2    81   0.00   0.50   0.50
  3    79   0.00   0.93   0.07
  4    72   0.84   0.16   0.00
  5    78   0.01   0.97   0.02
  6    91   0.00   0.00   1.00
  7    81   0.00   0.50   0.50
  8    84   0.00   0.02   0.98
  9    88   0.00   0.00   1.00
  10   77   0.02   0.97   0.01
  11   85   0.00   0.01   0.99
  12   85   0.00   0.01   0.99

$$
N_1 = 0.00 + 0.00 + 0.00 + 0.84 + 0.01 + 0.00 + 0.00 + 0.00 + 0.00 + 0.02 + 0.00 + 0.00 = 0.87
$$
$$
N_2 = 0.07 + 0.50 + 0.93 + 0.16 + 0.97 + 0.00 + 0.50 + 0.02 + 0.00 + 0.97 + 0.01 + 0.01 = 4.14
$$
$$
N_3 = 0.93 + 0.50 + 0.07 + 0.00 + 0.02 + 1.00 + 0.50 + 0.98 + 1.00 + 0.01 + 0.99 + 0.99 = 6.99
$$
$$
N = N_1 + N_2 + N_3 = 0.87 + 4.14 + 6.99 = 12
$$

• Quick check for mistakes… Yep, we have 12 data points!
Example: EM with GMMs
• Maximization:

$$
\pi_1 = \frac{0.87}{12} = 0.07 \qquad \pi_2 = \frac{4.14}{12} = 0.35 \qquad \pi_3 = \frac{6.99}{12} = 0.58
$$

• Check: π1 + π2 + π3 = 0.07 + 0.35 + 0.58 = 1
Example: EM with GMMs
• Maximization (weighted means):

$$
\mu_1 = \frac{1}{0.87} \big( 0.00 \cdot 83 + 0.00 \cdot 81 + 0.00 \cdot 79 + 0.84 \cdot 72 + 0.01 \cdot 78 + 0.00 \cdot 91 + 0.00 \cdot 81 + 0.00 \cdot 84 + 0.00 \cdot 88 + 0.02 \cdot 77 + 0.00 \cdot 85 + 0.00 \cdot 85 \big) = 72.2 \quad \text{(was 70)}
$$

$$
\mu_2 = \frac{1}{4.14} \big( 0.07 \cdot 83 + 0.50 \cdot 81 + 0.93 \cdot 79 + 0.16 \cdot 72 + 0.97 \cdot 78 + 0.00 \cdot 91 + 0.50 \cdot 81 + 0.02 \cdot 84 + 0.00 \cdot 88 + 0.97 \cdot 77 + 0.01 \cdot 85 + 0.01 \cdot 85 \big) = 78.6 \quad \text{(was 77)}
$$

$$
\mu_3 = \frac{1}{6.99} \big( 0.93 \cdot 83 + 0.50 \cdot 81 + 0.07 \cdot 79 + 0.00 \cdot 72 + 0.02 \cdot 78 + 1.00 \cdot 91 + 0.50 \cdot 81 + 0.98 \cdot 84 + 1.00 \cdot 88 + 0.01 \cdot 77 + 0.99 \cdot 85 + 0.99 \cdot 85 \big) = 85.2 \quad \text{(was 85)}
$$
Example: EM with GMMs
• Maximization (weighted standard deviations, computed via σc² = (1/Nc) Σ_i P_ci x_i² − μc²; the 0.96, 2.09, 3.15 come from unrounded intermediate values):

$$
\sigma_1 = \sqrt{ \frac{1}{0.87} \big( 0.00 \cdot 83^2 + 0.00 \cdot 81^2 + 0.00 \cdot 79^2 + 0.84 \cdot 72^2 + 0.01 \cdot 78^2 + 0.00 \cdot 91^2 + 0.00 \cdot 81^2 + 0.00 \cdot 84^2 + 0.00 \cdot 88^2 + 0.02 \cdot 77^2 + 0.00 \cdot 85^2 + 0.00 \cdot 85^2 \big) - 72.2^2 } = 0.96 \quad \text{(was 2.5)}
$$

$$
\sigma_2 = \sqrt{ \frac{1}{4.14} \big( 0.07 \cdot 83^2 + \dots + 0.01 \cdot 85^2 \big) - 78.6^2 } = 2.09 \quad \text{(was 2.5)}
$$

$$
\sigma_3 = \sqrt{ \frac{1}{6.99} \big( 0.93 \cdot 83^2 + \dots + 0.99 \cdot 85^2 \big) - 85.2^2 } = 3.15 \quad \text{(was 2.5)}
$$
Example: EM with GMMs
• So we have:

  π1 = 0.07   μ1 = 72.2   σ1 = 0.96
  π2 = 0.35   μ2 = 78.6   σ2 = 2.09
  π3 = 0.58   μ3 = 85.2   σ3 = 3.15

• Parameter estimation done?
  o No!
• EM is iterative
  o These are just better guesses, not the solution
• Repeat the previous steps using the above values as the new guesses
  o Possibly many repetitions…
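Putting the two steps together, a minimal sketch of the full iteration on the mixed-up scores, under the same assumptions as the earlier snippets (the small variance floor is my addition as a numerical safeguard: on only 12 points a source can collapse onto a single score):

```python
import numpy as np
from scipy.stats import norm

x = np.array([83, 81, 79, 72, 78, 91, 81, 84, 88, 77, 85, 85], dtype=float)

# Initial guesses from the example
pi = np.full(3, 1 / 3)
mu = np.array([70.0, 77.0, 85.0])
sigma = np.full(3, 2.5)

for t in range(5):
    # E-step: responsibilities P[i, c] = P(C = c | X = x_i)
    joint = pi * norm.pdf(x[:, None], loc=mu, scale=sigma)
    P = joint / joint.sum(axis=1, keepdims=True)

    # M-step: weighted counts, means, and standard deviations
    N = P.sum(axis=0)
    pi = N / N.sum()
    mu = (P * x[:, None]).sum(axis=0) / N
    var = (P * (x[:, None] - mu) ** 2).sum(axis=0) / N
    sigma = np.sqrt(var + 1e-6)   # tiny floor (my addition) so a collapsed source stays numerically usable

    print(t + 1, pi.round(2), mu.round(1), sigma.round(2))

# The first printed line should roughly reproduce the hand-computed values above:
# pi ≈ [0.07 0.35 0.58], mu ≈ [72.2 78.6 85.2], sigma ≈ [0.96 2.09 3.15]
```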
More Examples: Many Iterations…
Questions?
Unsupervised Learning, Clustering
Unsupervised Learning
• Results from Mixture Models?
  o Likelihoods of sources and source parameters
  o Target variables? None
  o Hard to make decisions about individual data points...
• Unsupervised Learning
  o Parameter estimation is the goal, rather than a means to an end (i.e., inference)
  o No target variable, and no supervision about results on a target (accuracy, etc.)
  o Kernel density estimation is one example
• Clustering
  o Very common example of unsupervised learning
  o Grouping data points into 'similar' vs. 'dissimilar'
  o GMMs/mixture models are examples of probabilistic clustering
     No hard decisions are made about which data points belong to which groups
Hard Decisions
• Consider, in a GMM, letting the standard deviation get smaller...
  o Though not necessary, let's assume equally likely sources and equal standard deviations

$$
\lim_{\sigma \to 0} \frac{P(C = a \mid X)}{P(C = b \mid X)}
= \lim_{\sigma \to 0} \frac{\alpha\, \pi\, \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{(x - \mu_a)^2}{2\sigma^2}}}{\alpha\, \pi\, \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{(x - \mu_b)^2}{2\sigma^2}}}
= \lim_{\sigma \to 0} e^{-\frac{(x - \mu_a)^2 - (x - \mu_b)^2}{2\sigma^2}}
= \begin{cases} \infty & \text{if } |x - \mu_a| < |x - \mu_b| \text{, i.e., } x \text{ is closer to } a \text{ than } b \\ 0 & \text{if } |x - \mu_b| < |x - \mu_a| \text{, i.e., } x \text{ is closer to } b \text{ than } a \end{cases}
$$

• With the α normalization, having the ratio become infinitely large is the same as the probabilities approaching 1 or 0, respectively:

$$
P_{ci} = P(C = c \mid X = x_i) = \begin{cases} 1 & \text{if } x_i \text{ is closer to } c \text{ than to any other source/class/group} \\ 0 & \text{otherwise} \end{cases}
$$

  o In other words, the choice of source/class/group becomes a hard decision…
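A minimal sketch illustrating this hardening numerically, for a single observation and three equally likely sources (the specific numbers are illustrative):

```python
import numpy as np
from scipy.stats import norm

x = 78.0
mu = np.array([70.0, 77.0, 85.0])   # equally likely sources, so the prior cancels

for s in [2.5, 1.0, 0.5, 0.1]:
    joint = norm.pdf(x, loc=mu, scale=s)
    print(s, (joint / joint.sum()).round(3))   # responsibilities harden toward the nearest mean
```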
K-means
• In fact, the previous has a name: the K-Means Clustering algorithm
• Data: x_1, …, x_N
• An iterative clustering algorithm:
  o Pick K random cluster centers: c_1, c_2, …, c_K
  o For t = 1, …, T: [or stop early if the assignments don't change]
     For i = 1, …, N: [update cluster assignments; a "hard" (non-probabilistic) decision about group membership]

$$
y_i = \operatorname*{argmin}_k \lVert x_i - c_k \rVert_2^2
$$

     For k = 1, …, K: [update cluster centers; equivalent to the GMM weighted mean when all the weights are 0 or 1]

$$
c_k = \frac{1}{|\{i : y_i = k\}|} \sum_{i : y_i = k} x_i
$$
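A minimal sketch of this loop in NumPy, assuming the data are rows of a 2-D array (initializing the centers from K random data points is a common choice, not something the slide specifies):

```python
import numpy as np

def kmeans(X, K, T=100, seed=0):
    """Basic K-means: hard assignments, then mean updates; stop when assignments no longer change."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=K, replace=False)].astype(float)  # K random data points
    y = None
    for _ in range(T):
        # Assignment step: nearest center by squared Euclidean distance (a hard decision)
        dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        new_y = dists.argmin(axis=1)
        if y is not None and np.array_equal(new_y, y):
            break
        y = new_y
        # Update step: each center becomes the mean of its assigned points
        for k in range(K):
            if np.any(y == k):
                centers[k] = X[y == k].mean(axis=0)
    return y, centers

# Tiny usage example: two well-separated blobs
X = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1], [5.0, 5.0], [5.1, 4.9], [4.9, 5.2]])
labels, centers = kmeans(X, K=2)
print(labels, centers)
```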
K-means: animation
Questions?
Agglomerative clustering
• Agglomerative clustering:
  o First merge very similar instances
  o Incrementally build larger clusters out of smaller clusters
• Algorithm:
  o Maintain a set of clusters
  o Initially, each instance is in its own cluster
  o Repeat:
     Pick the two closest clusters
     Merge them into a new cluster
     Stop when there is only one cluster left
• Produces not one clustering, but a family of clusterings represented by a dendrogram
Agglomerative clustering
• How should we define the "closeness" between two clusters of multiple instances?
• Many options:
  o Closest pair (single-link clustering)
  o Farthest pair (complete-link clustering)
  o Average of all pairs
  o Ward's method (min variance, like k-means)
• Different choices create different clustering behaviors (see the sketch below)
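A minimal sketch of comparing these linkage choices with SciPy's hierarchical clustering (the 1-D data values are illustrative, not the ones from the figure that follows):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster, dendrogram

# Illustrative 1-D instances; SciPy expects one row per instance
X = np.array([0.0, 0.1, 0.25, 1.0, 1.1, 1.3, 3.0]).reshape(-1, 1)

for method in ["single", "complete", "average", "ward"]:
    Z = linkage(X, method=method)                     # merge history, one row per merge
    labels = fcluster(Z, t=2, criterion="maxclust")   # cut the dendrogram into 2 clusters
    print(method, labels)

# dendrogram(Z) draws the full tree of merges (requires matplotlib)
```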
Agglomerative clustering
[Figure (animation frames): dendrograms built up merge by merge under complete-link and single-link clustering. Each discrete value along the x-axis is treated as one data instance.]
Appendix
Derivation: EM Algorithm
• Suppose we have an arbitrary probabilistic graphical model with:
  o Θ, parameters
  o Z, hidden variables
  o X, visible variables
  o P(X, Z | Θ), the joint probability distribution of the model
• How do we estimate the parameters? The EM Algorithm!
• But where does the EM Algorithm come from?...
Derivation of EM: Step 1 - MLE
• So, given a set of (IID) example data points i, we wish to estimate the parameters.
• Via Maximum Likelihood Estimation:

$$
\tilde{\Theta} = \operatorname*{argmax}_{\hat{\Theta}} \prod_i P(X_i, Z_i \mid \hat{\Theta})
= \operatorname*{argmax}_{\hat{\Theta}} \log \prod_i P(X_i, Z_i \mid \hat{\Theta}) \quad \text{(log trick: an order-preserving transformation)}
= \operatorname*{argmax}_{\hat{\Theta}} \sum_i \log P(X_i, Z_i \mid \hat{\Theta})
= \operatorname*{argmax}_{\hat{\Theta}} \sum_i L(X_i, Z_i \mid \hat{\Theta})
$$

where L is the log-likelihood, i.e., L(…) = log P(…).
Derivation of EM: Step 2 - Expected Values
• But what we derived isn't calculable: while X_i is known for each example, Z_i is unknown. How do we deal with this?
• Let's consider the average case! That is, if Z_i is unknown, let us think about all the values it could possibly be…
• In probabilistic terms, the average means the Expected Value (the likelihood of each possibility times the value of that possibility):

$$
\tilde{\Theta} = \operatorname*{argmax}_{\hat{\Theta}} \sum_i E_{Z_i}\!\left[ L(X_i, Z_i \mid \hat{\Theta}) \right]
= \operatorname*{argmax}_{\hat{\Theta}} \sum_i \sum_z P(Z_i = z \mid X_i, \Theta_?)\, L(X_i, Z_i = z \mid \hat{\Theta})
$$

• Problem! We need parameter values (Θ_?) to calculate this probability… but the parameters are what we're trying to estimate!
Derivation of EM: Step 2 - Expected Values
• How to resolve the ambiguity of needing parameter values to estimate parameter values?
• Trick: Make a guess, and use that guess to generate a better guess

$$
\Theta^{(t+1)} = \operatorname*{argmax}_{\hat{\Theta}} \sum_i \sum_z P(Z_i = z \mid X_i, \Theta^t)\, L(X_i, Z_i = z \mid \hat{\Theta})
$$

  o Θ^(t+1) is the new guess for the parameters (i.e., at step t+1); Θ^t is the current guess (at step t)

• Cost: multiple iterations
• This trick turns the algorithm into an iterative-improvement algorithm.
• That is, it takes a guess and makes a better guess, so it must be applied repeatedly to get the best answer possible.
• Also, results will now depend on the starting guess (Θ^0).
Derivation of EM: Step 3 - Expectation/Maximization
• The final step is to break the calculation into two phases, the classic Expectation and Maximization phases
• Expectation: Calculate the expectation probabilities for every possible combination of data point i and hidden variable value z

$$
P_{zi} = P(Z_i = z \mid X_i, \Theta^t)
$$

• Maximization: Find new likelihood-maximizing parameter values based on the expectation

$$
\Theta^{(t+1)} = \operatorname*{argmax}_{\hat{\Theta}} \sum_i \sum_z P_{zi}\, L(X_i, Z_i = z \mid \hat{\Theta})
$$
Derivation: Gaussian Mixture Model (GMM)
• How specifically is the EM algorithm applied to the Gaussian Mixture Model?
• Recall, we have…
• Visible variable: the observation X, with X | C = c ~ N(μc, σc)
• Hidden variable: the source (choice) C, with P(C = c) = πc
• Parameters:
  o πc, the prior probability of source c
  o μc, the mean of the Gaussian for source c
  o σc, the standard deviation of the Gaussian for source c
• Expectation: we derived this in the main slide sequence, so we'll skip it here
Derivation: GMM - Maximization

$$
\Theta^{(t+1)} = \operatorname*{argmax}_{\hat{\Theta}} \sum_i \sum_z P_{zi}\, L(X_i, Z_i = z \mid \hat{\Theta})
= \operatorname*{argmax}_{\hat{\Theta}} \sum_i \sum_z P_{zi} \log P(X_i, Z_i = z \mid \hat{\Theta})
$$
$$
= \operatorname*{argmax}_{\hat{\Theta}} \sum_i \sum_c P_{ci} \log \Big[ \underbrace{P(C_i = c \mid \hat{\Theta})}_{\text{source prior}}\, \underbrace{P(X_i \mid C_i = c, \hat{\Theta})}_{\text{Gaussian source}} \Big]
= \operatorname*{argmax}_{\pi_c, \mu_c, \sigma_c} \sum_i \sum_c P_{ci} \log \left[ \pi_c\, \frac{1}{\sigma_c \sqrt{2\pi}}\, e^{-\frac{1}{2\sigma_c^2}(x_i - \mu_c)^2} \right]
$$
$$
= \operatorname*{argmax}_{\pi_c, \mu_c, \sigma_c} \left[ \sum_i \sum_c P_{ci} \log \pi_c \;-\; \sum_i \sum_c P_{ci} \log \sigma_c \;-\; \sum_i \sum_c P_{ci} \frac{1}{2\sigma_c^2}(x_i - \mu_c)^2 \right]
$$

Note: the √(2π) term disappears because, as a constant, it doesn't affect the maximum.
Derivation: GMM - Maximization - μc
• Now we can maximize using the zero-derivative trick… so let's start with the mean parameter μc:

$$
0 = \frac{d}{d\mu_c} \left[ \sum_i \sum_c P_{ci} \log \pi_c - \sum_i \sum_c P_{ci} \log \sigma_c - \sum_i \sum_c P_{ci} \frac{1}{2\sigma_c^2} (x_i - \mu_c)^2 \right]
$$

The first two groups of terms are constant with respect to μc, so their derivatives are 0. Splitting the remaining sum over sources ω:

$$
0 = \frac{d}{d\mu_c} \left[ - \sum_{\omega=1}^{c-1} \sum_i P_{\omega i} \frac{(x_i - \mu_\omega)^2}{2\sigma_\omega^2} - \sum_i P_{ci} \frac{(x_i - \mu_c)^2}{2\sigma_c^2} - \sum_{\omega=c+1}^{C} \sum_i P_{\omega i} \frac{(x_i - \mu_\omega)^2}{2\sigma_\omega^2} \right]
$$

The first and third sums are also constant with respect to μc, leaving:

$$
0 = \frac{d}{d\mu_c} \left[ - \sum_i P_{ci} \frac{1}{2\sigma_c^2} (x_i - \mu_c)^2 \right]
$$
Derivation: GMM - Maximization - μc
• With all the constant terms eliminated, all that's left is to take the derivative and solve…

$$
0 = \frac{d}{d\mu_c} \left[ - \sum_i P_{ci} \frac{1}{2\sigma_c^2} (x_i - \mu_c)^2 \right]
= - \frac{2}{2\sigma_c^2} \sum_i P_{ci} (x_i - \mu_c)(-1)
\quad \Rightarrow \quad
0 = \sum_i P_{ci} (x_i - \mu_c)
$$

$$
\mu_c \sum_i P_{ci} = \sum_i P_{ci}\, x_i
\quad \Rightarrow \quad
\mu_c = \frac{\sum_i P_{ci}\, x_i}{\sum_i P_{ci}}
$$
Derivation: GMM - Maximization - σc
• σc is very similar, so starting from the non-constant terms…

$$
0 = \frac{d}{d\sigma_c} \left[ - \log \sigma_c \sum_i P_{ci} - \frac{1}{2\sigma_c^2} \sum_i P_{ci} (x_i - \mu_c)^2 \right]
$$

$$
0 = - \frac{1}{\sigma_c} \sum_i P_{ci} - \frac{-2}{2\sigma_c^3} \sum_i P_{ci} (x_i - \mu_c)^2 \quad \text{(multiply both sides by } \sigma_c^3 \text{, then rearrange)}
$$

$$
\sigma_c^2 \sum_i P_{ci} = \sum_i P_{ci} (x_i - \mu_c)^2
\quad \Rightarrow \quad
\sigma_c^2 = \frac{\sum_i P_{ci} (x_i - \mu_c)^2}{\sum_i P_{ci}}
$$
Derivation: GMM - Maximization - πc
• From the set of non-constant terms, it would seem πc should be the easiest of them all…

$$
0 = \frac{d}{d\pi_c} \sum_i P_{ci} \log \pi_c
$$

• However, remember that the set of all πc forms a probability distribution, and so we have the additional constraint:

$$
\sum_\omega \pi_\omega = 1
$$

• We resort to the method of Lagrange multipliers to implement the constraint:

$$
0 = \frac{d}{d\pi_c} \left[ \sum_i P_{ci} \log \pi_c + \lambda \left( 1 - \sum_\omega \pi_\omega \right) \right]
$$
Derivation: GMM - Maximization - πc
• Then take the derivative and solve…

$$
0 = \frac{d}{d\pi_c} \left[ \log \pi_c \sum_i P_{ci} + \lambda \left( 1 - \sum_\omega \pi_\omega \right) \right]
= \frac{1}{\pi_c} \sum_i P_{ci} - \lambda
\quad \Rightarrow \quad
\pi_c = \frac{1}{\lambda} \sum_i P_{ci}
$$

• Then plug back into the constraint to resolve the Lagrange multiplier…

$$
1 = \sum_\omega \frac{1}{\lambda} \sum_i P_{\omega i}
\quad \Rightarrow \quad
\lambda = \sum_\omega \sum_i P_{\omega i} = \sum_i \sum_\omega P_{\omega i} = \sum_i 1 = N \quad \text{(the number of data points)}
$$

$$
\pi_c = \frac{1}{N} \sum_i P_{ci}
$$