
Gaussian mixture model

Real-world data is rarely Gaussian. Sometimes there are clear clusters that we might reasonably model as separate Gaussians. Sometimes there are no clear clusters, but we might be able to approximate the underlying density as a combination of overlapping Gaussians. (Given enough Gaussians, we can closely approximate any reasonable density.)
The Mixture of Gaussians (MoG) model is used to represent the probability distribution of real-valued D-dimensional feature vectors, written x in this note. A MoG model can be of interest on its own, as it can reveal clusters in a dataset. However, MoGs are usually part of a larger probabilistic model. As a simple example, MoGs can be used to model the data in each class for a Bayes classifier, replacing the single Gaussian used for each class in examples earlier in the course.
As a more complex example, MoGs can be used to model the distribution over patches of
pixels in an image. A noisy image patch y can be restored by finding the most probable
underlying image x by maximizing:

p(x | y) ∝ p(y | x) p(x),

where p(y | x) is a noise model, and p(x) is the prior over image patches, obtained by fitting a mixture of Gaussians.
Because modelling a joint density is such a general task, Mixtures of Gaussians have many possible applications. They were used as part of the vision system of an early successful self-driving car. They also have several possible applications in astronomy.

The model and its likelihood


According to the MoG model, each datapoint has an integer indicator z^(n) ∈ {1, ..., K}, stating which of K Gaussians generated the point. However, we don't observe {z^(n)} — these are not labels, but hidden or latent variables. Under the model, the latents are drawn from a general discrete or categorical distribution with probability vector π:

z^(n) ∼ Discrete(π).

Then the datapoints are drawn from the corresponding Gaussian “mixture component”:

p(x^(n) | z^(n) = k, θ) = N(x^(n); µ^(k), Σ^(k)),

where θ = {π, {µ^(k), Σ^(k)}} are the parameters of the model.
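To make the generative story concrete, here is a minimal NumPy sketch of sampling from this model; the parameter values and variable names are invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative parameters for K = 2 components in D = 2 dimensions (made up).
pi = np.array([0.3, 0.7])                       # mixing proportions, sum to one
mu = np.array([[0.0, 0.0], [3.0, 3.0]])         # component means, shape (K, D)
Sigma = np.array([np.eye(2), 0.5 * np.eye(2)])  # component covariances, (K, D, D)

N = 1000
z = rng.choice(len(pi), size=N, p=pi)           # latent indicators z^(n) ~ Discrete(pi)
X = np.stack([rng.multivariate_normal(mu[k], Sigma[k]) for k in z])  # x^(n) | z^(n) = k
```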


To fit the model by maximum likelihood, we need to maximize
log p(D | θ) = ∑_{n=1}^{N} log p(x^(n) | θ),

where p(x^(n) | θ) doesn't involve the latent variables z^(n) because they aren't present in our observed
data. To find the probability of the observations, we need to introduce the unknown terms

as dummy variables that get summed out, as we have done several times before:

p(x^(n) | θ) = ∑_k p(x^(n), z^(n) = k | θ),                         (sum rule)
             = ∑_k p(x^(n) | z^(n) = k, θ) P(z^(n) = k | θ),        (product rule)
             = ∑_k π_k N(x^(n); µ^(k), Σ^(k)).

So the negative log-likelihood cost function that we would like to minimize is:


−log p(D | θ) = −∑_{n=1}^{N} log ∑_k π_k N(x^(n); µ^(k), Σ^(k)).

Unlike the log of a product, the log of a sum does not immediately simplify. We can’t find the
maximum likelihood parameters in closed form: setting the derivatives of this cost function
to zero doesn’t give equations we can solve analytically.
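Even though it has no closed-form maximizer, the cost itself is straightforward to evaluate numerically. Here is a minimal SciPy/NumPy sketch (the function name and array-shape conventions are my own, not from the notes), using the log-sum-exp trick to avoid underflow when the component densities are tiny:

```python
import numpy as np
from scipy.special import logsumexp
from scipy.stats import multivariate_normal

def neg_log_likelihood(X, pi, mu, Sigma):
    """-log p(D | theta) for a mixture of Gaussians.

    X: (N, D) data; pi: (K,) mixing proportions;
    mu: (K, D) means; Sigma: (K, D, D) covariances.
    """
    # log[ pi_k N(x^(n); mu^(k), Sigma^(k)) ] for every n and k, shape (N, K)
    log_terms = np.stack(
        [np.log(pi[k]) + multivariate_normal.logpdf(X, mu[k], Sigma[k])
         for k in range(len(pi))], axis=1)
    # Stable log of the inner sum over k, then sum over datapoints and negate.
    return -np.sum(logsumexp(log_terms, axis=1))
```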

Gradient-based fitting
We can fit the parameters θ = {π, {µ^(k), Σ^(k)}} with gradient methods. However, to use
standard gradient-based optimizers we need to transform the parameters to be unconstrained.
As previously discussed for stochastic variational inference, one way to represent the
covariance matrix is in terms of an unconstrained lower-triangular matrix L̃, where:

L_ij = L̃_ij         if i ≠ j,
L_ij = exp(L̃_ii)    if i = j,

Σ = L Lᵀ.
The π vector is also constrained: it must be positive and add up to one. We can represent it
as the softmax of a vector containing arbitrary values (as discussed earlier in the course).
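As a concrete illustration, here is a minimal NumPy sketch of both transformations; the function names are invented for this example:

```python
import numpy as np

def covariance_from_unconstrained(L_tilde):
    """Build a valid covariance Sigma = L L^T from an unconstrained square matrix."""
    L = np.tril(L_tilde, k=-1)                  # strictly lower-triangular part passes through
    L = L + np.diag(np.exp(np.diag(L_tilde)))   # diagonal forced positive with exp
    return L @ L.T

def mixing_from_unconstrained(a):
    """Softmax: map an arbitrary real vector to positive weights that sum to one."""
    a = a - np.max(a)                           # shift for numerical stability
    e = np.exp(a)
    return e / e.sum()
```

An optimizer can then work directly with the unconstrained entries of L̃ and with the unconstrained vector passed to the softmax.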
Mixtures of Gaussians aren’t usually fitted with gradient methods (see next section). However,
by using gradient-based optimization, we can consider a mixture of Gaussians as a “module”
in a neural network or deep learning system. Here’s a sketch of the idea: The vectors
modelled by the MoG could be the target outputs in a probabilistic regression problem with
multiple outputs. We’re simply replacing the usual Gaussian assumption for regression with
a mixture of Gaussians. The parameters of the mixture model, θ, would be specified by a
hidden layer that depends on some inputs. The gradients of the MoG log-likelihood would
be backpropagated through the neural network as usual. This model is known as a Mixture Density Network.
For keen students: there is some literature analyzing gradient-based methods for mixture models. There are also more sophisticated gradient-based optimizers that can deal with the
constraints, which work better in some cases.

The EM algorithm

The Expectation Maximization (EM) algorithm is really a framework for optimizing a wide
class of models. Applied to Mixtures of Gaussians, we obtain a simple and interpretable
iterative fitting procedure, which is more popular than gradient-based optimization.
If we knew the latent indicator variables {z^(n)}, then fitting the parameters by maximum
likelihood would be easy. We’d just find the empirical mean and covariance of the points
belonging to each component. In addition, the mixing proportion π_k of a component/cluster would be set to the fraction of points that belong to component k. We'd be fitting the parameters of a Gaussian Bayes classifier, with labels {z^(n)}.
To set up some notation, we can indicate which cluster is responsible for each datapoint with
a vector of binary variables giving a “one-hot” encoding: r_k^(n) = δ_{z^(n),k}. Then we can write
down expressions for what the maximum likelihood parameters would be, if we knew the
assignments of the datapoints to components:

π_k = r_k / N,    where r_k = ∑_{n=1}^{N} r_k^(n),

µ^(k) = (1/r_k) ∑_{n=1}^{N} r_k^(n) x^(n),


Σ^(k) = (1/r_k) ∑_{n=1}^{N} r_k^(n) x^(n) x^(n)ᵀ − µ^(k) µ^(k)ᵀ.
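As a sketch, these three updates can be written in a few lines of NumPy, assuming the responsibilities are stored in an (N, K) array R; the function name is my own:

```python
import numpy as np

def m_step(X, R):
    """Maximum-likelihood parameters given responsibilities (soft or hard).

    X: (N, D) data; R: (N, K) responsibilities r_k^(n).
    Returns pi (K,), mu (K, D), Sigma (K, D, D).
    """
    N = X.shape[0]
    r = R.sum(axis=0)                              # r_k = sum_n r_k^(n), shape (K,)
    pi = r / N
    mu = (R.T @ X) / r[:, None]                    # mu^(k) = (1/r_k) sum_n r_k^(n) x^(n)
    # Sigma^(k) = (1/r_k) sum_n r_k^(n) x^(n) x^(n)^T  -  mu^(k) mu^(k)^T
    Sigma = (np.einsum('nk,nd,ne->kde', R, X, X) / r[:, None, None]
             - np.einsum('kd,ke->kde', mu, mu))
    return pi, mu, Sigma
```

With one-hot (hard) responsibilities these reduce exactly to the per-cluster counts, empirical means and covariances described above.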

The EM algorithm uses the equations above, with probabilistic settings of the unknown
component assignments. The algorithm alternates between two steps:

1) E-step: Set soft responsibilities using Bayes’ rule:

r_k^(n) = P(z^(n) = k | x^(n), θ) = π_k N(x^(n); µ^(k), Σ^(k)) / ∑_l π_l N(x^(n); µ^(l), Σ^(l)).

2) M-step: Update the parameters θ = {π, {µ^(k), Σ^(k)}} using the responsibilities from
the E-step in the equations for the parameters above.

If these steps are repeated until convergence, it can be shown that the algorithm will converge
to a local maximum of the likelihood. In practice we terminate after some maximum number
of iterations, or when the parameters are changing by less than some tolerance. Early
stopping based on a validation set could also be used.
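To make the procedure concrete, here is one possible EM loop, reusing the m_step and neg_log_likelihood helpers sketched earlier; the initialization, tolerance, and the small covariance ridge are arbitrary illustrative choices, not part of the notes:

```python
import numpy as np
from scipy.stats import multivariate_normal

def e_step(X, pi, mu, Sigma):
    """Soft responsibilities r_k^(n) = P(z^(n) = k | x^(n), theta), shape (N, K).

    (Evaluated with densities directly; not robust to extreme underflow.)
    """
    R = np.stack([pi[k] * multivariate_normal.pdf(X, mu[k], Sigma[k])
                  for k in range(len(pi))], axis=1)
    return R / R.sum(axis=1, keepdims=True)

def fit_gmm_em(X, K, max_iters=100, tol=1e-6, seed=0):
    rng = np.random.default_rng(seed)
    N, D = X.shape
    # Crude initialization: equal mixing weights, random datapoints as means,
    # and the overall data covariance for every component.
    pi = np.full(K, 1.0 / K)
    mu = X[rng.choice(N, size=K, replace=False)]
    Sigma = np.stack([np.cov(X.T) + 1e-6 * np.eye(D) for _ in range(K)])
    prev_nll = np.inf
    for _ in range(max_iters):
        R = e_step(X, pi, mu, Sigma)        # E-step: soft assignments
        pi, mu, Sigma = m_step(X, R)        # M-step: re-fit parameters
        Sigma = Sigma + 1e-6 * np.eye(D)    # small ridge: a crude guard against the
                                            # degenerate solutions discussed below
        nll = neg_log_likelihood(X, pi, mu, Sigma)
        if prev_nll - nll < tol:            # stop when the improvement is tiny
            break
        prev_nll = nll
    return pi, mu, Sigma
```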
Some parameter settings have infinite likelihood. For example, place the mean of one component on top of a datapoint and shrink the corresponding covariance towards zero. Infinite
likelihoods can also be obtained by explaining D or fewer of the D-dimensional datapoints
with one of the components and making its covariance matrix low-rank. There are a variety
of solutions to this issue, including reinitializing ill-behaved components, and regularizing
the covariances.
Whether we use EM or gradient-based methods, the likelihood is multi-modal, and different
initializations of the parameters lead to different answers.

MLPR:w9b Iain Murray, http://www.inf.ed.ac.uk/teaching/courses/mlpr/2016/
