14 Gaussian Mixture Models
[Many slides are adapted from UC Berkeley CS-188 and a previous offering of CSE 5521 at OSU.]
Learning for Hidden Variables
• We saw previously how to estimate parameters for random variable
distributions…
o But only when all variables are visible (i.e., our data points have values for all variables)
• Today: How to deal with hidden (unobserved) variables?
• But first, where might such a situation arise?
Example: Homework Scores
• Suppose I keep a spreadsheet with homework scores in it…

  Student           HW 1   HW 2   HW 3
  Bloggs, Joe        70     85     81
  Doe, Jane          72     78     77
  Public, John Q.    77     87     83
  Roe, Richard       90     99     89
  Smith, John        81     91     84
  Stu, Gary          79     85     80
  Sue, Mary          80     88     83

• What kind of model might I construct from this?
• Let's ignore individual students for the moment, and try to model just the homework assignments…
Example: Homework Scores
• Two variables:
o X: Homework score
o C: Which homework? (C for "choice" or "class")
• Graphical model: C → X
• Distributions?
o C is categorical (1, 2, 3)
o X: a Gaussian is a reasonable choice for each value of C

  C   P(C)        C   P(X|C)
  1   ?           1   X ~ N(μ1, σ1)
  2   ?           2   X ~ N(μ2, σ2)
  3   ?           3   X ~ N(μ3, σ3)
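As a minimal sketch of the generative story this model encodes (assuming Python with NumPy; the parameter values are made up purely for illustration): first pick which homework C from the categorical prior, then draw the score X from that homework's Gaussian.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical parameter values, for illustration only -- the real ones are unknown.
priors = np.array([1/3, 1/3, 1/3])      # P(C = 1), P(C = 2), P(C = 3)
means  = np.array([78.0, 88.0, 82.0])   # mu_1, mu_2, mu_3
stds   = np.array([6.0, 6.0, 4.0])      # sigma_1, sigma_2, sigma_3

def sample_score():
    """Sample one (homework, score) pair from the mixture's generative story."""
    c = rng.choice(3, p=priors)          # choose the homework (hidden variable C)
    x = rng.normal(means[c], stds[c])    # draw the score X ~ N(mu_c, sigma_c)
    return c + 1, x                      # report C as 1, 2, 3 to match the slides

print([sample_score() for _ in range(5)])
```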
Example: Homework Scores
• Parameters?
o For C: the prior probabilities P(C = 1), P(C = 2), P(C = 3)
o For X given C: a mean μc and standard deviation σc for each of the three Gaussians
Parameter Estimation in Mixture Models
• How? MLE on the global joint distribution:

$P(X_1, \dots, X_n) = \prod_i P(X = x_i)$

$= \prod_i \sum_c P(X = x_i, C = c)$   (C is hidden, so we must marginalize over C)

$= \prod_i \sum_c P(C = c)\, P(X = x_i \mid C = c)$   (Factorize!)

$= \prod_i \sum_c \pi_c\, N(x_i; \mu_c, \sigma_c)$

$= \prod_i \sum_c \pi_c\, \frac{1}{\sqrt{2\pi\sigma_c^2}}\, e^{-\frac{(x_i - \mu_c)^2}{2\sigma_c^2}}$

  C   P(C)        C   P(X|C)
  1   π1          1   X ~ N(μ1, σ1)
  2   π2          2   X ~ N(μ2, σ2)
  3   π3          3   X ~ N(μ3, σ3)

• Now take derivatives with respect to μc, σc, and πc and set them equal to 0
o But unlike previous MLE examples, there is no closed-form solution…
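To see where the difficulty comes from, it can help to write the quantity being maximized as code. A minimal sketch (assuming Python with NumPy; the scores and parameter guesses are illustrative only): the sum over c sits inside the log of every factor, which is what prevents the usual closed-form derivative tricks.

```python
import numpy as np

scores = np.array([70, 85, 81, 72, 78, 77, 90, 99, 89], dtype=float)  # some x_i values

def mixture_log_likelihood(x, priors, means, stds):
    """log prod_i sum_c pi_c N(x_i; mu_c, sigma_c), computed as a sum of logs."""
    # densities[i, c] = N(x_i; mu_c, sigma_c)
    densities = (1.0 / np.sqrt(2 * np.pi * stds**2)) * \
                np.exp(-(x[:, None] - means[None, :])**2 / (2 * stds**2))
    per_point = densities @ priors   # sum_c pi_c N(x_i; mu_c, sigma_c), one per data point
    return np.sum(np.log(per_point))

# Hypothetical parameter guesses, just to show the call.
print(mixture_log_likelihood(scores,
                             np.array([0.3, 0.3, 0.4]),
                             np.array([75.0, 85.0, 82.0]),
                             np.array([5.0, 5.0, 5.0])))
```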
Parameter Estimation in Mixture Models
• With the hidden variable present (and marginalized out), there is too much
ambiguity to find a direct solution
• Idea: Consider hidden and visible variables separately
o Break the problem into two pieces…
• Expectation
o What are the different possible values for the hidden variables and how likely do we
expect those values to be for each data point?
• Maximization
o Maximize the likelihood of our data points (MLE), i.e., the visible variables, using our
expectations about the hidden variables
• This is the essence of what we call the EM (Expectation-Maximization)
Algorithm
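To make the two pieces concrete, here is a minimal, illustrative EM loop for a 1-D Gaussian mixture (assuming Python with NumPy; the function name em_gmm_1d and its defaults are my own, not from the slides). The E-step computes the expected class memberships; the M-step re-fits the parameters with those expectations, using the closed-form updates derived in the appendix.

```python
import numpy as np

def em_gmm_1d(x, k=3, iters=50, seed=0):
    """A bare-bones EM loop for a 1-D Gaussian mixture (illustrative sketch only)."""
    rng = np.random.default_rng(seed)
    priors = np.full(k, 1.0 / k)
    means = rng.choice(x, size=k, replace=False)   # random starting guesses
    stds = np.full(k, np.std(x))
    for _ in range(iters):
        # E-step: responsibilities P(C = c | X = x_i) under the current parameters
        dens = (1 / np.sqrt(2 * np.pi * stds**2)) * \
               np.exp(-(x[:, None] - means)**2 / (2 * stds**2))
        resp = dens * priors
        resp /= resp.sum(axis=1, keepdims=True)
        # M-step: weighted MLE updates (derived in the appendix)
        weight = resp.sum(axis=0)
        means = (resp * x[:, None]).sum(axis=0) / weight
        stds = np.sqrt((resp * (x[:, None] - means)**2).sum(axis=0) / weight)
        stds = np.maximum(stds, 1e-3)   # guard against a component collapsing to zero width
        priors = weight / len(x)
    return priors, means, stds

scores = np.array([70, 85, 81, 72, 78, 77, 90, 99, 89, 81, 91, 84], dtype=float)
print(em_gmm_1d(scores))
```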
Parameter Estimation: EM Algorithm
• Although often referred to as “the EM algorithm”, it actually represents a
family of algorithms for dealing with problems that have hidden variables
• It is an example of an iterative-improvement algorithm
o It takes a set of guesses for the parameters and gives you better (more likely) values
for the parameters
o Needs to be repeated many times to get “best” values
• It is susceptible to local maxima
o The starting set of guesses used does matter!
o Often run multiple times with different (random) starting values
Parameter Estimation: EM Algorithm
• Generic form of EM:
• Given:
o Θt, the parameters at iteration t
o Zi, the hidden variables for data point i
o Xi, the visible variables for data point i

$\Theta^{t+1} = \underset{\Theta}{\operatorname{argmax}} \sum_i \sum_z P(Z_i = z \mid X_i, \Theta^t)\, L(X_i, Z_i = z \mid \Theta)$

where L denotes the log-likelihood (see the appendix derivation).
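The weights $P(Z_i = z \mid X_i, \Theta^t)$ are posterior probabilities of the hidden variable and come from Bayes' rule; writing this out (a standard identity the slide leaves implicit):

$P(Z_i = z \mid X_i, \Theta^t) = \dfrac{P(Z_i = z \mid \Theta^t)\, P(X_i \mid Z_i = z, \Theta^t)}{\sum_{z'} P(Z_i = z' \mid \Theta^t)\, P(X_i \mid Z_i = z', \Theta^t)}$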
[Fragment of a worked EM example: the estimated class priors, and the Expectation-step responsibilities P(C = c | X = xi) for data points 8–12.]

$\pi_1 + \pi_2 + \pi_3 = 0.07 + 0.35 + 0.58 = 1$

  i    P1,i   P2,i   P3,i   xi
  8    0.00   0.02   0.98   84
  9    0.00   0.00   1.00   88
  10   0.02   0.97   0.01   77
  11   0.00   0.01   0.99   85
  12   0.00   0.01   0.99   85
Example: EM with GMMs
• Maximization, using the per-point responsibilities P1,i, P2,i, P3,i and scores xi tabulated above
• With alpha normalization, having the ratio become infinitely large is the same as the probabilities approaching 1 or 0, respectively:

$P_{ci} = P(C = c \mid X = x_i) = \begin{cases} 1 & \text{if } x_i \text{ is closer to } c \text{ than to any other source/class/group} \\ 0 & \text{otherwise} \end{cases}$

o In other words, the choice of source/class/group becomes a hard decision…
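A minimal sketch of this hard-decision rule (assuming Python with NumPy; the scores and class means below are made-up illustrations, not values from the example):

```python
import numpy as np

scores = np.array([70.0, 84.0, 95.0])    # some x_i values (illustrative)
means = np.array([75.0, 85.0, 92.0])     # hypothetical class means mu_1, mu_2, mu_3

# Hard E-step: P_ci collapses to 1 for the closest class mean and 0 for all others.
closest = np.argmin(np.abs(scores[:, None] - means[None, :]), axis=1)
hard_resp = np.zeros((len(scores), len(means)))
hard_resp[np.arange(len(scores)), closest] = 1.0

print(closest + 1)   # chosen class for each score, numbered 1..3 as on the slides
print(hard_resp)
```

Replacing the soft responsibilities with these hard ones turns the E-step into the familiar k-means assignment step.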
K-means
• With hard decisions like these, EM for a Gaussian mixture reduces to a form of k-means clustering
Agglomerative Clustering
• Agglomerative clustering:
o First merge very similar instances
o Incrementally build larger clusters out of smaller clusters
• Algorithm (see the code sketch after the figures below):
o Maintain a set of clusters
o Initially, each instance in its own cluster
o Repeat:
Pick the two closest clusters
Merge them into a new cluster
Stop when there is only one cluster left
• Many options
o Closest pair (single-link clustering)
o Farthest pair (complete-link clustering)
o Average of all pairs
o Ward’s method (min variance, like k-means)
[Figure sequence: agglomerative clustering illustrated step by step on 1-D data, with complete-link and single-link panels shown side by side; each discrete value along the x-axis is treated as one data instance.]
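A rough sketch of the procedure above (assuming Python with NumPy and SciPy are available; the 1-D data values are made up to mimic the figures). scipy.cluster.hierarchy supports single-link and complete-link merging directly:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Each value is one data instance along the x-axis, as in the figures (values are made up).
x = np.array([0.10, 0.15, 0.20, 0.80, 0.85, 1.40, 1.45, 1.50]).reshape(-1, 1)

single_link   = linkage(x, method='single')     # merge clusters by their closest pair
complete_link = linkage(x, method='complete')   # merge clusters by their farthest pair

# Cut each merge tree into 3 clusters and compare the resulting assignments.
print(fcluster(single_link, t=3, criterion='maxclust'))
print(fcluster(complete_link, t=3, criterion='maxclust'))
```

The two linkages can produce different merge orders and different final clusters, which is the contrast the figures are drawing.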
Appendix
Derivation: EM Algorithm
• Suppose we have an arbitrary probabilistic graphical model with:
o Θ, the parameters
o Z, the hidden variables
o X, the visible variables
o P(X, Z | Θ), the joint probability distribution of the model
• Expectation: we derived this in the main slide sequence, so we skip it here
Derivation: GMM – Maximization

$\Theta^{(t+1)} = \underset{\Theta}{\operatorname{argmax}} \sum_i \sum_z P_{zi}\, L(X_i, Z_i = z \mid \Theta)$

$= \underset{\Theta}{\operatorname{argmax}} \sum_i \sum_z P_{zi} \log P(X_i, Z_i = z \mid \Theta)$

$= \underset{\Theta}{\operatorname{argmax}} \sum_i \sum_c P_{ci} \log \big[ P(C_i = c \mid \Theta)\, P(X_i \mid C_i = c, \Theta) \big]$   (source prior × Gaussian source)

$= \underset{\pi_c, \mu_c, \sigma_c}{\operatorname{argmax}} \sum_i \sum_c P_{ci} \log \left[ \pi_c \frac{1}{\sigma_c \sqrt{2\pi}}\, e^{-\frac{(x_i - \mu_c)^2}{2\sigma_c^2}} \right]$

$= \underset{\pi_c, \mu_c, \sigma_c}{\operatorname{argmax}} \left[ \sum_i \sum_c P_{ci} \log \pi_c - \sum_i \sum_c P_{ci} \log \sigma_c - \sum_i \sum_c P_{ci} \frac{(x_i - \mu_c)^2}{2\sigma_c^2} \right]$

(Note: the $\sqrt{2\pi}$ term disappears because, as a constant, it doesn't affect the maximum.)
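As a small numerical sanity check of this decomposition (assuming Python with NumPy; the data and parameters are made up), the three-term form plus the dropped constant matches the direct expected complete-data log-likelihood:

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(80.0, 8.0, size=12)            # made-up 1-D data points
priors = np.array([0.3, 0.3, 0.4])            # pi_c
means  = np.array([75.0, 85.0, 90.0])         # mu_c
stds   = np.array([5.0, 4.0, 6.0])            # sigma_c

# Responsibilities P_ci (here simply computed from the same parameters).
dens = (1 / np.sqrt(2 * np.pi * stds**2)) * np.exp(-(x[:, None] - means)**2 / (2 * stds**2))
P = dens * priors
P /= P.sum(axis=1, keepdims=True)

# Direct form: sum_i sum_c P_ci log( pi_c * N(x_i; mu_c, sigma_c) )
direct = np.sum(P * np.log(priors * dens))

# Decomposed form from the slide, plus the constant log(sqrt(2*pi)) term that the
# slide drops because it does not change where the maximum is.
decomposed = (np.sum(P * np.log(priors))
              - np.sum(P * np.log(stds))
              - np.sum(P * (x[:, None] - means)**2 / (2 * stds**2))
              - np.sum(P) * np.log(np.sqrt(2 * np.pi)))

print(np.isclose(direct, decomposed))          # True: same objective, just regrouped
```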
Derivation: GMM – Maximization – μc
• Now, we can maximize using the zero-derivative trick… So, let's start with the mean parameter μc:

$0 = \frac{d}{d\mu_c} \left[ \sum_i \sum_c P_{ci} \log \pi_c - \sum_i \sum_c P_{ci} \log \sigma_c - \sum_i \sum_c P_{ci} \frac{(x_i - \mu_c)^2}{2\sigma_c^2} \right]$

The first two terms are constant with respect to μc, so their derivatives are 0. Splitting the remaining sum over classes, the terms for classes ω ≠ c are also constant:

$0 = \frac{d}{d\mu_c} \left[ -\sum_{\omega=1}^{c-1} \sum_i P_{\omega i} \frac{(x_i - \mu_\omega)^2}{2\sigma_\omega^2} - \sum_i P_{ci} \frac{(x_i - \mu_c)^2}{2\sigma_c^2} - \sum_{\omega=c+1}^{C} \sum_i P_{\omega i} \frac{(x_i - \mu_\omega)^2}{2\sigma_\omega^2} \right]$

$0 = \frac{d}{d\mu_c} \left[ -\sum_i P_{ci} \frac{(x_i - \mu_c)^2}{2\sigma_c^2} \right]$
Derivation: GMM – Maximization – μc
• With all the constant terms eliminated, all that's left is to take the derivative and solve…

$0 = \frac{d}{d\mu_c} \left[ -\sum_i P_{ci} \frac{(x_i - \mu_c)^2}{2\sigma_c^2} \right]$

$0 = -\frac{2}{2\sigma_c^2} \sum_i P_{ci} (x_i - \mu_c)(-1) = \sum_i P_{ci} (x_i - \mu_c)$

$\mu_c \sum_i P_{ci} = \sum_i P_{ci}\, x_i \qquad \Rightarrow \qquad \mu_c = \frac{\sum_i P_{ci}\, x_i}{\sum_i P_{ci}}$
Derivation: GMM – Maximization – σc
• σc is very similar, so starting from the non-constant terms…

$0 = \frac{d}{d\sigma_c} \left[ -\log \sigma_c \sum_i P_{ci} - \frac{1}{2\sigma_c^2} \sum_i P_{ci} (x_i - \mu_c)^2 \right]$

$0 = -\frac{1}{\sigma_c} \sum_i P_{ci} - \frac{-2}{2\sigma_c^3} \sum_i P_{ci} (x_i - \mu_c)^2$   (multiply both sides by $\sigma_c^3$, then shift terms)

$\sigma_c^2 \sum_i P_{ci} = \sum_i P_{ci} (x_i - \mu_c)^2 \qquad \Rightarrow \qquad \sigma_c = \sqrt{\frac{\sum_i P_{ci} (x_i - \mu_c)^2}{\sum_i P_{ci}}}$
Derivation: GMM – Maximization – πc
• From the set of non-constant terms, it would seem πc should be the easiest of them all…

$0 = \frac{d}{d\pi_c} \sum_i P_{ci} \log \pi_c$

• However, remember that the set of all πc forms a probability distribution and so has the additional constraint:

$\sum_\omega \pi_\omega = 1$

• We resort to the method of Lagrange multipliers to enforce the constraint:

$0 = \frac{d}{d\pi_c} \left[ \sum_i P_{ci} \log \pi_c + \lambda \left( 1 - \sum_\omega \pi_\omega \right) \right]$
Derivation: GMM – Maximization – πc
• Then take derivatives and solve…

$0 = \frac{d}{d\pi_c} \left[ \log \pi_c \sum_i P_{ci} + \lambda \left( 1 - \sum_\omega \pi_\omega \right) \right] = \frac{1}{\pi_c} \sum_i P_{ci} - \lambda$

$\pi_c = \frac{1}{\lambda} \sum_i P_{ci}$

• Then plug back into the constraint to resolve the Lagrange multiplier…

$1 = \sum_\omega \frac{1}{\lambda} \sum_i P_{\omega i} \qquad \Rightarrow \qquad \lambda = \sum_\omega \sum_i P_{\omega i} = \sum_i \sum_\omega P_{\omega i} = \sum_i 1 = N$   (the number of data points, N)

$\pi_c = \frac{1}{N} \sum_i P_{ci}$
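Putting the three closed-form updates together, a minimal NumPy sketch of the resulting M-step (the function name m_step and the toy inputs are illustrative assumptions, not from the slides):

```python
import numpy as np

def m_step(x, P):
    """Closed-form M-step updates derived above.

    x : (n,) array of data points x_i
    P : (n, k) array of responsibilities P_ci = P(C = c | X = x_i)
    """
    weight = P.sum(axis=0)                                 # sum_i P_ci, one value per class
    mu = (P * x[:, None]).sum(axis=0) / weight             # mu_c = sum_i P_ci x_i / sum_i P_ci
    sigma = np.sqrt((P * (x[:, None] - mu)**2).sum(axis=0) / weight)
    pi = weight / len(x)                                   # pi_c = (1/N) sum_i P_ci
    return pi, mu, sigma

# Tiny check with made-up responsibilities for four points and two classes.
x = np.array([70.0, 72.0, 88.0, 90.0])
P = np.array([[0.90, 0.10],
              [0.80, 0.20],
              [0.10, 0.90],
              [0.05, 0.95]])
print(m_step(x, P))
```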