Gaussian Mixture Models
One compelling application of Gaussian mixture models is image denoising: given a noisy patch y, the posterior over the underlying clean patch x is
p(x | y) ∝ p(y | x) p(x),
where p(y |x) is a noise model, and p(x) is the prior over image patches, obtained by fitting
a mixture of Gaussians.
Because modelling a joint density is such a general task, Mixtures of Gaussians have many
possible applications. They were used as part of the vision system of an early successful
self-driving car. They also have several possible applications in astronomy.
To define the model, we imagine how each datapoint could be generated. First a mixture component is chosen for the datapoint, by sampling an indicator from the mixing proportions π:
z(n) ∼ Discrete(π).
Then the datapoints are drawn from the corresponding Gaussian “mixture component”:
x(n) | z(n) = k, θ ∼ N(µ(k), Σ(k)).
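As a concrete illustration, here is a minimal NumPy sketch of this two-stage generative process. The particular values of the mixing proportions and component parameters (pi, mus, Sigmas) are made-up examples, not part of the notes:

    import numpy as np

    rng = np.random.default_rng(0)

    # Example 2-D mixture with K = 2 components (made-up parameters).
    pi = np.array([0.3, 0.7])                        # mixing proportions, sum to one
    mus = np.array([[0.0, 0.0], [4.0, 4.0]])         # component means mu^(k)
    Sigmas = np.array([np.eye(2), 0.5 * np.eye(2)])  # component covariances Sigma^(k)

    N = 1000
    # z^(n) ~ Discrete(pi): choose which component generates each datapoint.
    z = rng.choice(len(pi), size=N, p=pi)
    # x^(n) | z^(n) = k ~ N(mu^(k), Sigma^(k)): draw from that component's Gaussian.
    X = np.stack([rng.multivariate_normal(mus[k], Sigmas[k]) for k in z])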
The likelihood of the parameters given the N observed datapoints X = {x(n)} is p(X | θ) = ∏n p(x(n) | θ), where p(x(n) | θ) doesn’t involve the latent z(n) because they aren’t present in our observed
data. To find the probability of the observations, we need to introduce the unknown terms
as dummy variables that get summed out, as we have done several times before:
p(x(n) | θ) = ∑k p(x(n), z(n) = k | θ) = ∑k πk N(x(n); µ(k), Σ(k)).
So the negative log-likelihood cost function that we would like to minimize is:
− log p(X | θ) = − ∑n log ∑k πk N(x(n); µ(k), Σ(k)).
Unlike the log of a product, the log of a sum does not immediately simplify. We can’t find the
maximum likelihood parameters in closed form: setting the derivatives of this cost function
to zero doesn’t give equations we can solve analytically.
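To make the “log of a sum” concrete, here is one way the cost could be evaluated, using the log-sum-exp trick so that very small component densities don’t underflow. The function name is invented, and the arrays X, pi, mus, Sigmas are the ones assumed in the sketch above:

    import numpy as np
    from scipy.special import logsumexp
    from scipy.stats import multivariate_normal

    def gmm_negative_log_likelihood(X, pi, mus, Sigmas):
        """-log p(X | theta) = -sum_n log sum_k pi_k N(x^(n); mu^(k), Sigma^(k))."""
        K = len(pi)
        # log[ pi_k N(x^(n); mu^(k), Sigma^(k)) ] for every n and k, shape (N, K)
        log_terms = np.stack([
            np.log(pi[k]) + multivariate_normal.logpdf(X, mus[k], Sigmas[k])
            for k in range(K)], axis=1)
        # log-sum-exp over components, then sum over datapoints
        return -np.sum(logsumexp(log_terms, axis=1))

This is the quantity a gradient-based optimizer would minimize in the next section; an autodiff framework would supply the gradients.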
Gradient-based fitting
We can fit the parameters θ = {π, {µ(k), Σ(k)}} with gradient methods. However, to use
standard gradient-based optimizers we need to transform the parameters to be unconstrained.
As previously discussed for stochastic variational inference, one way to represent the
covariance matrix is in terms of an unconstrained lower-triangular matrix L̃, where:
Lij = L̃ij for i ≠ j,    Lii = exp(L̃ii),
Σ = LLᵀ.
Because the exponentiated diagonal is positive, L is invertible and Σ is a valid (symmetric, positive-definite) covariance matrix for any setting of L̃.
The π vector is also constrained: it must be positive and add up to one. We can represent it
as the softmax of a vector containing arbitrary values (as discussed earlier in the course).
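A minimal NumPy sketch of these two transformations; the function names and the argument names L_tilde and logits are invented for illustration:

    import numpy as np

    def constrain_covariance(L_tilde):
        """Map an unconstrained matrix L_tilde to a valid covariance Sigma = L L^T."""
        L = np.tril(L_tilde, k=-1)              # keep the strictly lower-triangular part
        L += np.diag(np.exp(np.diag(L_tilde)))  # exponentiate the diagonal so it is positive
        return L @ L.T                          # symmetric and positive definite

    def constrain_mixing(logits):
        """Softmax: map an arbitrary real vector to positive values that sum to one."""
        a = logits - np.max(logits)             # subtract the max for numerical stability
        e = np.exp(a)
        return e / np.sum(e)

A gradient-based optimizer then works with the unconstrained quantities (L_tilde, logits, and the unconstrained means), applying these transformations inside the cost function.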
Mixtures of Gaussians aren’t usually fitted with gradient methods (see next section). However,
by using gradient-based optimization, we can consider a mixture of Gaussians as a “module”
in a neural network or deep learning system. Here’s a sketch of the idea: The vectors
modelled by the MoG could be the target outputs in a probabilistic regression problem with
multiple outputs. We’re simply replacing the usual Gaussian assumption for regression with
a mixture of Gaussians. The parameters of the mixture model, θ, would be specified by a
hidden layer that depends on some inputs. The gradients of the MoG log-likelihood would
be backpropagated through the neural network as usual. This model is known as a
Mixture Density Network5.
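A rough sketch of the idea, not the implementation from the cited reference: a network with one hidden layer outputs the mixture parameters for a one-dimensional target, and we evaluate the mixture’s negative log-likelihood. The layer sizes and weight names are invented, and in practice an autodiff framework would supply the gradients:

    import numpy as np
    from scipy.special import logsumexp

    def mdn_negative_log_likelihood(x, y, W1, b1, W_pi, W_mu, W_s):
        """NLL of 1-D targets y (shape (N,)) given inputs x (shape (N, Din)),
        under a mixture of K Gaussians whose parameters depend on x."""
        h = np.tanh(x @ W1 + b1)                    # hidden layer, shape (N, H)
        log_pi = h @ W_pi                           # unnormalized log mixing proportions, (N, K)
        log_pi = log_pi - logsumexp(log_pi, axis=1, keepdims=True)  # log-softmax
        mu = h @ W_mu                               # component means, (N, K)
        log_sigma = h @ W_s                         # log standard deviations, (N, K)
        sigma = np.exp(log_sigma)                   # exp keeps sigma positive
        # log N(y; mu_k, sigma_k^2) for each component
        log_norm = (-0.5 * np.log(2 * np.pi) - log_sigma
                    - 0.5 * ((y[:, None] - mu) / sigma) ** 2)
        return -np.sum(logsumexp(log_pi + log_norm, axis=1))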
For keen students: there is some literature analyzing gradient-based methods for mixture
models6. There are also more sophisticated gradient-based optimizers that can deal with the
constraints, which work better in some cases.
The EM algorithm
The Expectation Maximization (EM) algorithm is really a framework for optimizing a wide
class of models.8 Applied to Mixtures of Gaussians, we obtain a simple and interpretable
iterative fitting procedure, which is more popular than gradient-based optimization.
If we knew the latent indicator variables {z(n)}, then fitting the parameters by maximum
likelihood would be easy. We’d just find the empirical mean and covariance of the points
belonging to each component. In addition, the mixing proportion of a component/cluster
πk would be set to the fraction of points that belong to component k. We’d be fitting the
parameters of a Gaussian Bayes classifier, with labels {z(n)}.
To set up some notation, we can indicate which cluster is responsible for each datapoint with
a vector of binary variables giving a “one-hot” encoding, with entries rk(n) = δz(n),k (that is, rk(n) = 1 if z(n) = k, and 0 otherwise). Then we can write
down expressions for what the maximum likelihood parameters would be, if we knew the
assignments of the datapoints to components:
πk = rk / N,    where rk = ∑n rk(n),
µ(k) = (1/rk) ∑n rk(n) x(n),
Σ(k) = (1/rk) ∑n rk(n) (x(n) − µ(k)) (x(n) − µ(k))ᵀ,
with the sums running over n = 1, . . . , N.
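If the assignments really were known, these updates are easy to compute. A minimal NumPy sketch, assuming X is an N×D array of datapoints and R is an N×K array whose rows are the one-hot vectors r(n); the same code also works when the rows of R are soft responsibilities, which is what the M-step below needs:

    import numpy as np

    def fit_given_assignments(X, R):
        """Maximum-likelihood parameters when the responsibilities R are known."""
        N = X.shape[0]
        r = R.sum(axis=0)                         # r_k = sum_n r_k^(n), shape (K,)
        pi = r / N                                # mixing proportions
        mus = (R.T @ X) / r[:, None]              # mu^(k) = (1/r_k) sum_n r_k^(n) x^(n)
        Sigmas = []
        for k in range(R.shape[1]):
            diff = X - mus[k]                     # (N, D) deviations from mu^(k)
            Sigmas.append((R[:, k, None] * diff).T @ diff / r[k])
        return pi, mus, np.stack(Sigmas)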
The EM algorithm uses the equations above, with probabilistic settings of the unknown
component assignments. The algorithm alternates between two steps:
1) E-step: Compute the “responsibilities”, the posterior probabilities that each component generated each datapoint, given the current parameters:
rk(n) = P(z(n) = k | x(n), θ) = πk N(x(n); µ(k), Σ(k)) / ∑j πj N(x(n); µ(j), Σ(j)).
2) M-step: Update the parameters θ = {π, {µ(k), Σ(k)}} using the responsibilities from
the E-step in the equations for the parameters above.
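Putting the two steps together, one possible implementation sketch, reusing the hypothetical fit_given_assignments function from above; the stopping rule here (a maximum number of iterations, or the likelihood barely improving) is a variant of the termination options mentioned below:

    import numpy as np
    from scipy.special import logsumexp
    from scipy.stats import multivariate_normal

    def em_fit(X, pi, mus, Sigmas, max_iters=100, tol=1e-6):
        """Fit a mixture of Gaussians by EM, starting from initial parameters."""
        prev_nll = np.inf
        for _ in range(max_iters):
            # E-step: responsibilities r_k^(n) = P(z^(n) = k | x^(n), theta)
            log_terms = np.stack([
                np.log(pi[k]) + multivariate_normal.logpdf(X, mus[k], Sigmas[k])
                for k in range(len(pi))], axis=1)           # shape (N, K)
            log_marg = logsumexp(log_terms, axis=1, keepdims=True)
            R = np.exp(log_terms - log_marg)                # rows sum to one
            # M-step: re-estimate the parameters using the soft assignments
            pi, mus, Sigmas = fit_given_assignments(X, R)
            # negative log-likelihood under the parameters used in this E-step
            nll = -np.sum(log_marg)
            if prev_nll - nll < tol:                        # stop when barely improving
                break
            prev_nll = nll
        return pi, mus, Sigmas, nll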
If these steps are repeated until convergence, it can be shown that the algorithm will converge
to a local maximum of the likelihood. In practice we terminate after some maximum number
of iterations, or when the parameters are changing by less than some tolerance. Early
stopping based on a validation set could also be used.
Some parameter settings have infinite likelihood. For example, place the mean of one
component on top of a datapoint, and set the corresponding covariance to zero. Infinite
likelihoods can also be obtained by explaining D or fewer of the D-dimensional datapoints
with one of the components and making its covariance matrix low-rank. There are a variety
of solutions to this issue, including reinitializing ill-behaved components, and regularizing
the covariances.
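One simple version of regularizing the covariances is to add a small multiple of the identity to each estimate in the M-step, so that no component can collapse onto a single point or a low-rank subspace. The value of eps below is an arbitrary illustrative choice:

    import numpy as np

    def regularize_covariances(Sigmas, eps=1e-6):
        """Add eps*I to each of the (K, D, D) covariances so none becomes singular."""
        D = Sigmas.shape[-1]
        return Sigmas + eps * np.eye(D)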
Whether we use EM or gradient-based methods, the likelihood is multi-modal, and different
initializations of the parameters lead to different answers.
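A common precaution is therefore to run the fitting procedure from several random initializations and keep the solution with the highest likelihood. A sketch, reusing the hypothetical em_fit above; the initialization scheme (means at random datapoints, broad covariances, equal mixing proportions) is just one arbitrary choice:

    import numpy as np

    def fit_with_restarts(X, K, n_restarts=5, seed=0):
        """Run EM several times and keep the fit with the lowest negative log-likelihood."""
        rng = np.random.default_rng(seed)
        N, D = X.shape
        best = None
        for _ in range(n_restarts):
            mus = X[rng.choice(N, size=K, replace=False)]          # means at random datapoints
            Sigmas = np.stack([np.cov(X.T) + 1e-6 * np.eye(D) for _ in range(K)])
            pi = np.full(K, 1.0 / K)                               # equal mixing proportions
            pi, mus, Sigmas, nll = em_fit(X, pi, mus, Sigmas)
            if best is None or nll < best[-1]:
                best = (pi, mus, Sigmas, nll)
        return best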