14 Gaussian Mixture Models
[Many slides are adapted from UC Berkeley CS-188 and a previous offering of CSE 5521 at OSU.]
Learning for Hidden Variables
• We saw previously how to estimate parameters for random variable
distributions…
o But only when all variables are visible (i.e., our data points have values for all variables)
• Today: How to deal with hidden (unobserved) variables?
• But first, where might such a situation arise?
Example: Homework Scores
• Suppose I keep a spreadsheet with homework scores in it…

  Student           HW 1   HW 2   HW 3
  Bloggs, Joe        70     85     81
  Doe, Jane          72     78     77
  Public, John Q.    77     87     83
  Roe, Richard       90     99     89
  Smith, John        81     91     84
  Stu, Gary          79     85     80
  Sue, Mary          80     88     83

• What kind of model might I construct from this?
• Let's ignore individual students for the moment, and try to model just the homework assignments…
Example: Homework Scores
• Two variables:
o X: Homework score
o C: Which homework? (C for "choice" or "class")
• Graphical model: C → X
• Distributions?
o C is categorical (1, 2, 3)
o X: a Gaussian is a reasonable choice for each value of C

  C   P(C)        C   P(X|C)
  1   ?           1   X ~ N(μ1, σ1)
  2   ?           2   X ~ N(μ2, σ2)
  3   ?           3   X ~ N(μ3, σ3)
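As a minimal sketch of the generative story this model encodes (assuming Python with NumPy; the parameter values are made up purely for illustration): first pick which homework C from the categorical prior, then draw the score X from that homework's Gaussian.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical parameter values, for illustration only -- the real ones are unknown.
priors = np.array([1/3, 1/3, 1/3])      # P(C = 1), P(C = 2), P(C = 3)
means  = np.array([78.0, 88.0, 82.0])   # mu_1, mu_2, mu_3
stds   = np.array([6.0, 6.0, 4.0])      # sigma_1, sigma_2, sigma_3

def sample_score():
    """Sample one (homework, score) pair from the mixture's generative story."""
    c = rng.choice(3, p=priors)          # choose the homework (hidden variable C)
    x = rng.normal(means[c], stds[c])    # draw the score X ~ N(mu_c, sigma_c)
    return c + 1, x                      # report C as 1, 2, 3 to match the slides

print([sample_score() for _ in range(5)])
```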
Example: Homework Scores
• Parameters?
o For C: the prior probabilities P(C = 1), P(C = 2), P(C = 3)
o For X given C: a mean μc and standard deviation σc for each of the three Gaussians
Parameter Estimation in Mixture Models
• How? MLE on the global joint distribution:

$P(X_1, \dots, X_n) = \prod_i P(X = x_i)$

$= \prod_i \sum_c P(X = x_i, C = c)$   (C is hidden, so we must marginalize over C)

$= \prod_i \sum_c P(C = c)\, P(X = x_i \mid C = c)$   (Factorize!)

$= \prod_i \sum_c \pi_c\, N(x_i; \mu_c, \sigma_c)$

$= \prod_i \sum_c \pi_c\, \frac{1}{\sqrt{2\pi\sigma_c^2}}\, e^{-\frac{(x_i - \mu_c)^2}{2\sigma_c^2}}$

  C   P(C)        C   P(X|C)
  1   π1          1   X ~ N(μ1, σ1)
  2   π2          2   X ~ N(μ2, σ2)
  3   π3          3   X ~ N(μ3, σ3)

• Now take derivatives with respect to μc, σc, and πc and set them equal to 0
o But unlike previous MLE examples, there is no closed-form solution…
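To see where the difficulty comes from, it can help to write the quantity being maximized as code. A minimal sketch (assuming Python with NumPy; the scores and parameter guesses are illustrative only): the sum over c sits inside the log of every factor, which is what prevents the usual closed-form derivative tricks.

```python
import numpy as np

scores = np.array([70, 85, 81, 72, 78, 77, 90, 99, 89], dtype=float)  # some x_i values

def mixture_log_likelihood(x, priors, means, stds):
    """log prod_i sum_c pi_c N(x_i; mu_c, sigma_c), computed as a sum of logs."""
    # densities[i, c] = N(x_i; mu_c, sigma_c)
    densities = (1.0 / np.sqrt(2 * np.pi * stds**2)) * \
                np.exp(-(x[:, None] - means[None, :])**2 / (2 * stds**2))
    per_point = densities @ priors   # sum_c pi_c N(x_i; mu_c, sigma_c), one per data point
    return np.sum(np.log(per_point))

# Hypothetical parameter guesses, just to show the call.
print(mixture_log_likelihood(scores,
                             np.array([0.3, 0.3, 0.4]),
                             np.array([75.0, 85.0, 82.0]),
                             np.array([5.0, 5.0, 5.0])))
```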
Parameter Estimation in Mixture Models
• With the hidden variable present (and marginalized out), there is too much
ambiguity to find a direct solution
• Idea: Consider hidden and visible variables separately
o Break the problem into two pieces…
• Expectation
o What are the different possible values for the hidden variables and how likely do we
expect those values to be for each data point?
• Maximization
o Maximize the likelihood of our data points (MLE), i.e., the visible variables, using our
expectations about the hidden variables
• This is the essence of what we call the EM (Expectation-Maximization)
Algorithm
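To make the two pieces concrete, here is a minimal, illustrative EM loop for a 1-D Gaussian mixture (assuming Python with NumPy; the function name em_gmm_1d and its defaults are my own, not from the slides). The E-step computes the expected class memberships; the M-step re-fits the parameters with those expectations, using the closed-form updates derived in the appendix.

```python
import numpy as np

def em_gmm_1d(x, k=3, iters=50, seed=0):
    """A bare-bones EM loop for a 1-D Gaussian mixture (illustrative sketch only)."""
    rng = np.random.default_rng(seed)
    priors = np.full(k, 1.0 / k)
    means = rng.choice(x, size=k, replace=False)   # random starting guesses
    stds = np.full(k, np.std(x))
    for _ in range(iters):
        # E-step: responsibilities P(C = c | X = x_i) under the current parameters
        dens = (1 / np.sqrt(2 * np.pi * stds**2)) * \
               np.exp(-(x[:, None] - means)**2 / (2 * stds**2))
        resp = dens * priors
        resp /= resp.sum(axis=1, keepdims=True)
        # M-step: weighted MLE updates (derived in the appendix)
        weight = resp.sum(axis=0)
        means = (resp * x[:, None]).sum(axis=0) / weight
        stds = np.sqrt((resp * (x[:, None] - means)**2).sum(axis=0) / weight)
        stds = np.maximum(stds, 1e-3)   # guard against a component collapsing to zero width
        priors = weight / len(x)
    return priors, means, stds

scores = np.array([70, 85, 81, 72, 78, 77, 90, 99, 89, 81, 91, 84], dtype=float)
print(em_gmm_1d(scores))
```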
Parameter Estimation: EM Algorithm
• Although often referred to as “the EM algorithm”, it actually represents a
family of algorithms for dealing with problems that have hidden variables
• It is an example of an iterative-improvement algorithm
o It takes a set of guesses for the parameters and gives you better (more likely) values
for the parameters
o Needs to be repeated many times to get “best” values
• It is susceptible to local maxima
o The starting set of guesses used does matter!
o Often run multiple times with different (random) starting values
Parameter Estimation: EM Algorithm
• Generic form of EM:
• Given:
o Θt, the parameters at iteration t
o Zi, the hidden variables for data point i
o Xi, the visible variables for data point i

$\Theta^{t+1} = \underset{\Theta}{\operatorname{argmax}} \sum_i \sum_z P(Z_i = z \mid X_i, \Theta^t)\, L(X_i, Z_i = z \mid \Theta)$

where L denotes the log-likelihood (see the appendix derivation).
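The weights $P(Z_i = z \mid X_i, \Theta^t)$ are posterior probabilities of the hidden variable and come from Bayes' rule; writing this out (a standard identity the slide leaves implicit):

$P(Z_i = z \mid X_i, \Theta^t) = \dfrac{P(Z_i = z \mid \Theta^t)\, P(X_i \mid Z_i = z, \Theta^t)}{\sum_{z'} P(Z_i = z' \mid \Theta^t)\, P(X_i \mid Z_i = z', \Theta^t)}$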
[Fragment of a worked EM example: the estimated class priors, and the Expectation-step responsibilities P(C = c | X = xi) for data points 8–12.]

$\pi_1 + \pi_2 + \pi_3 = 0.07 + 0.35 + 0.58 = 1$

  i    P1,i   P2,i   P3,i   xi
  8    0.00   0.02   0.98   84
  9    0.00   0.00   1.00   88
  10   0.02   0.97   0.01   77
  11   0.00   0.01   0.99   85
  12   0.00   0.01   0.99   85
Example: EM with GMMs
• Maximization, using the per-point responsibilities P1,i, P2,i, P3,i and scores xi tabulated above
• With alpha normalization, having the ratio become infinitely large is the same as the probabilities approaching 1 or 0, respectively:

$P_{ci} = P(C = c \mid X = x_i) = \begin{cases} 1 & \text{if } x_i \text{ is closer to } c \text{ than to any other source/class/group} \\ 0 & \text{otherwise} \end{cases}$

o In other words, the choice of source/class/group becomes a hard decision…
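A minimal sketch of this hard-decision rule (assuming Python with NumPy; the scores and class means below are made-up illustrations, not values from the example):

```python
import numpy as np

scores = np.array([70.0, 84.0, 95.0])    # some x_i values (illustrative)
means = np.array([75.0, 85.0, 92.0])     # hypothetical class means mu_1, mu_2, mu_3

# Hard E-step: P_ci collapses to 1 for the closest class mean and 0 for all others.
closest = np.argmin(np.abs(scores[:, None] - means[None, :]), axis=1)
hard_resp = np.zeros((len(scores), len(means)))
hard_resp[np.arange(len(scores)), closest] = 1.0

print(closest + 1)   # chosen class for each score, numbered 1..3 as on the slides
print(hard_resp)
```

Replacing the soft responsibilities with these hard ones turns the E-step into the familiar k-means assignment step.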
K-means
• With hard decisions like these, EM for a Gaussian mixture reduces to a form of k-means clustering
Agglomerative Clustering
• Agglomerative clustering:
o First merge very similar instances
o Incrementally build larger clusters out of smaller clusters
• Algorithm (see the code sketch after the figures below):
o Maintain a set of clusters
o Initially, each instance in its own cluster
o Repeat:
Pick the two closest clusters
Merge them into a new cluster
Stop when there is only one cluster left
• Many options
o Closest pair (single-link clustering)
o Farthest pair (complete-link clustering)
o Average of all pairs
o Ward’s method (min variance, like k-means)
[Figure sequence: agglomerative clustering illustrated step by step on 1-D data, with complete-link and single-link panels shown side by side; each discrete value along the x-axis is treated as one data instance.]
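A rough sketch of the procedure above (assuming Python with NumPy and SciPy are available; the 1-D data values are made up to mimic the figures). scipy.cluster.hierarchy supports single-link and complete-link merging directly:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Each value is one data instance along the x-axis, as in the figures (values are made up).
x = np.array([0.10, 0.15, 0.20, 0.80, 0.85, 1.40, 1.45, 1.50]).reshape(-1, 1)

single_link   = linkage(x, method='single')     # merge clusters by their closest pair
complete_link = linkage(x, method='complete')   # merge clusters by their farthest pair

# Cut each merge tree into 3 clusters and compare the resulting assignments.
print(fcluster(single_link, t=3, criterion='maxclust'))
print(fcluster(complete_link, t=3, criterion='maxclust'))
```

The two linkages can produce different merge orders and different final clusters, which is the contrast the figures are drawing.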
Appendix
Derivation: EM Algorithm
• Suppose we have an arbitrary probabilistic graphical model with:
o Θ, the parameters
o Z, the hidden variables
o X, the visible variables
o P(X, Z | Θ), the joint probability distribution of the model
• Expectation: we derived this in the main slide sequence, so we skip it here
Derivation: GMM – Maximization

$\Theta^{(t+1)} = \underset{\Theta}{\operatorname{argmax}} \sum_i \sum_z P_{zi}\, L(X_i, Z_i = z \mid \Theta)$

$= \underset{\Theta}{\operatorname{argmax}} \sum_i \sum_z P_{zi} \log P(X_i, Z_i = z \mid \Theta)$

$= \underset{\Theta}{\operatorname{argmax}} \sum_i \sum_c P_{ci} \log \big[ P(C_i = c \mid \Theta)\, P(X_i \mid C_i = c, \Theta) \big]$   (source prior × Gaussian source)

$= \underset{\pi_c, \mu_c, \sigma_c}{\operatorname{argmax}} \sum_i \sum_c P_{ci} \log \left[ \pi_c \frac{1}{\sigma_c \sqrt{2\pi}}\, e^{-\frac{(x_i - \mu_c)^2}{2\sigma_c^2}} \right]$

$= \underset{\pi_c, \mu_c, \sigma_c}{\operatorname{argmax}} \left[ \sum_i \sum_c P_{ci} \log \pi_c - \sum_i \sum_c P_{ci} \log \sigma_c - \sum_i \sum_c P_{ci} \frac{(x_i - \mu_c)^2}{2\sigma_c^2} \right]$

(Note: the $\sqrt{2\pi}$ term disappears because, as a constant, it doesn't affect the maximum.)
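As a small numerical sanity check of this decomposition (assuming Python with NumPy; the data and parameters are made up), the three-term form plus the dropped constant matches the direct expected complete-data log-likelihood:

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(80.0, 8.0, size=12)            # made-up 1-D data points
priors = np.array([0.3, 0.3, 0.4])            # pi_c
means  = np.array([75.0, 85.0, 90.0])         # mu_c
stds   = np.array([5.0, 4.0, 6.0])            # sigma_c

# Responsibilities P_ci (here simply computed from the same parameters).
dens = (1 / np.sqrt(2 * np.pi * stds**2)) * np.exp(-(x[:, None] - means)**2 / (2 * stds**2))
P = dens * priors
P /= P.sum(axis=1, keepdims=True)

# Direct form: sum_i sum_c P_ci log( pi_c * N(x_i; mu_c, sigma_c) )
direct = np.sum(P * np.log(priors * dens))

# Decomposed form from the slide, plus the constant log(sqrt(2*pi)) term that the
# slide drops because it does not change where the maximum is.
decomposed = (np.sum(P * np.log(priors))
              - np.sum(P * np.log(stds))
              - np.sum(P * (x[:, None] - means)**2 / (2 * stds**2))
              - np.sum(P) * np.log(np.sqrt(2 * np.pi)))

print(np.isclose(direct, decomposed))          # True: same objective, just regrouped
```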
Derivation: GMM – Maximization – μc
• Now, we can maximize using the zero-derivative trick… So, let's start with the mean parameter μc:

$0 = \frac{d}{d\mu_c} \left[ \sum_i \sum_c P_{ci} \log \pi_c - \sum_i \sum_c P_{ci} \log \sigma_c - \sum_i \sum_c P_{ci} \frac{(x_i - \mu_c)^2}{2\sigma_c^2} \right]$

The first two terms are constant with respect to μc, so their derivatives are 0. Splitting the remaining sum over classes, the terms for classes ω ≠ c are also constant:

$0 = \frac{d}{d\mu_c} \left[ -\sum_{\omega=1}^{c-1} \sum_i P_{\omega i} \frac{(x_i - \mu_\omega)^2}{2\sigma_\omega^2} - \sum_i P_{ci} \frac{(x_i - \mu_c)^2}{2\sigma_c^2} - \sum_{\omega=c+1}^{C} \sum_i P_{\omega i} \frac{(x_i - \mu_\omega)^2}{2\sigma_\omega^2} \right]$

$0 = \frac{d}{d\mu_c} \left[ -\sum_i P_{ci} \frac{(x_i - \mu_c)^2}{2\sigma_c^2} \right]$
Derivation: GMM – Maximization – μc
• With all the constant terms eliminated, all that's left is to take the derivative and solve…

$0 = \frac{d}{d\mu_c} \left[ -\sum_i P_{ci} \frac{(x_i - \mu_c)^2}{2\sigma_c^2} \right]$

$0 = -\frac{2}{2\sigma_c^2} \sum_i P_{ci} (x_i - \mu_c)(-1) = \sum_i P_{ci} (x_i - \mu_c)$

$\mu_c \sum_i P_{ci} = \sum_i P_{ci}\, x_i \qquad \Rightarrow \qquad \mu_c = \frac{\sum_i P_{ci}\, x_i}{\sum_i P_{ci}}$
Derivation: GMM – Maximization – σc
• σc is very similar, so starting from the non-constant terms…

$0 = \frac{d}{d\sigma_c} \left[ -\log \sigma_c \sum_i P_{ci} - \frac{1}{2\sigma_c^2} \sum_i P_{ci} (x_i - \mu_c)^2 \right]$

$0 = -\frac{1}{\sigma_c} \sum_i P_{ci} - \frac{-2}{2\sigma_c^3} \sum_i P_{ci} (x_i - \mu_c)^2$   (multiply both sides by $\sigma_c^3$, then shift terms)

$\sigma_c^2 \sum_i P_{ci} = \sum_i P_{ci} (x_i - \mu_c)^2 \qquad \Rightarrow \qquad \sigma_c = \sqrt{\frac{\sum_i P_{ci} (x_i - \mu_c)^2}{\sum_i P_{ci}}}$
Derivation: GMM – Maximization – πc
• From the set of non-constant terms, it would seem πc should be the easiest of them all…

$0 = \frac{d}{d\pi_c} \sum_i P_{ci} \log \pi_c$

• However, remember that the set of all πc forms a probability distribution and so has the additional constraint:

$\sum_\omega \pi_\omega = 1$

• We resort to the method of Lagrange multipliers to enforce the constraint:

$0 = \frac{d}{d\pi_c} \left[ \sum_i P_{ci} \log \pi_c + \lambda \left( 1 - \sum_\omega \pi_\omega \right) \right]$
Derivation: GMM – Maximization – πc
• Then take derivatives and solve…

$0 = \frac{d}{d\pi_c} \left[ \log \pi_c \sum_i P_{ci} + \lambda \left( 1 - \sum_\omega \pi_\omega \right) \right] = \frac{1}{\pi_c} \sum_i P_{ci} - \lambda$

$\pi_c = \frac{1}{\lambda} \sum_i P_{ci}$

• Then plug back into the constraint to resolve the Lagrange multiplier…

$1 = \sum_\omega \frac{1}{\lambda} \sum_i P_{\omega i} \qquad \Rightarrow \qquad \lambda = \sum_\omega \sum_i P_{\omega i} = \sum_i \sum_\omega P_{\omega i} = \sum_i 1 = N$   (the number of data points, N)

$\pi_c = \frac{1}{N} \sum_i P_{ci}$
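Putting the three closed-form updates together, a minimal NumPy sketch of the resulting M-step (the function name m_step and the toy inputs are illustrative assumptions, not from the slides):

```python
import numpy as np

def m_step(x, P):
    """Closed-form M-step updates derived above.

    x : (n,) array of data points x_i
    P : (n, k) array of responsibilities P_ci = P(C = c | X = x_i)
    """
    weight = P.sum(axis=0)                                 # sum_i P_ci, one value per class
    mu = (P * x[:, None]).sum(axis=0) / weight             # mu_c = sum_i P_ci x_i / sum_i P_ci
    sigma = np.sqrt((P * (x[:, None] - mu)**2).sum(axis=0) / weight)
    pi = weight / len(x)                                   # pi_c = (1/N) sum_i P_ci
    return pi, mu, sigma

# Tiny check with made-up responsibilities for four points and two classes.
x = np.array([70.0, 72.0, 88.0, 90.0])
P = np.array([[0.90, 0.10],
              [0.80, 0.20],
              [0.10, 0.90],
              [0.05, 0.95]])
print(m_step(x, P))
```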