Gaussian Mixture Models
One compelling application of Gaussian mixture models is image denoising: given a noisy patch y, the posterior over the underlying clean patch x is
p(x | y) ∝ p(y | x) p(x),
where p(y |x) is a noise model, and p(x) is the prior over image patches, obtained by fitting
a mixture of Gaussians.
Because modelling a joint density is such a general task, Mixtures of Gaussians have many
possible applications. They were used as part of the vision system of an early successful
self-driving car. They also have several possible applications in astronomy.
To define the model, we imagine how each datapoint could be generated. First a mixture component is chosen for the datapoint, by sampling an indicator from the mixing proportions π:
z(n) ∼ Discrete(π).
Then the datapoints are drawn from the corresponding Gaussian “mixture component”:
x(n) | z(n) = k, θ ∼ N(µ(k), Σ(k)).
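As a concrete illustration, here is a minimal NumPy sketch of this two-stage generative process. The particular values of the mixing proportions and component parameters (pi, mus, Sigmas) are made-up examples, not part of the notes:

    import numpy as np

    rng = np.random.default_rng(0)

    # Example 2-D mixture with K = 2 components (made-up parameters).
    pi = np.array([0.3, 0.7])                        # mixing proportions, sum to one
    mus = np.array([[0.0, 0.0], [4.0, 4.0]])         # component means mu^(k)
    Sigmas = np.array([np.eye(2), 0.5 * np.eye(2)])  # component covariances Sigma^(k)

    N = 1000
    # z^(n) ~ Discrete(pi): choose which component generates each datapoint.
    z = rng.choice(len(pi), size=N, p=pi)
    # x^(n) | z^(n) = k ~ N(mu^(k), Sigma^(k)): draw from that component's Gaussian.
    X = np.stack([rng.multivariate_normal(mus[k], Sigmas[k]) for k in z])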
The likelihood of the parameters given the N observed datapoints X = {x(n)} is p(X | θ) = ∏n p(x(n) | θ), where p(x(n) | θ) doesn’t involve the latent z(n) because they aren’t present in our observed
data. To find the probability of the observations, we need to introduce the unknown terms
as dummy variables that get summed out, as we have done several times before:
p(x(n) | θ) = ∑k p(x(n), z(n) = k | θ) = ∑k πk N(x(n); µ(k), Σ(k)).
So the negative log-likelihood cost function that we would like to minimize is:
− log p(X | θ) = − ∑n log ∑k πk N(x(n); µ(k), Σ(k)).
Unlike the log of a product, the log of a sum does not immediately simplify. We can’t find the
maximum likelihood parameters in closed form: setting the derivatives of this cost function
to zero doesn’t give equations we can solve analytically.
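To make the “log of a sum” concrete, here is one way the cost could be evaluated, using the log-sum-exp trick so that very small component densities don’t underflow. The function name is invented, and the arrays X, pi, mus, Sigmas are the ones assumed in the sketch above:

    import numpy as np
    from scipy.special import logsumexp
    from scipy.stats import multivariate_normal

    def gmm_negative_log_likelihood(X, pi, mus, Sigmas):
        """-log p(X | theta) = -sum_n log sum_k pi_k N(x^(n); mu^(k), Sigma^(k))."""
        K = len(pi)
        # log[ pi_k N(x^(n); mu^(k), Sigma^(k)) ] for every n and k, shape (N, K)
        log_terms = np.stack([
            np.log(pi[k]) + multivariate_normal.logpdf(X, mus[k], Sigmas[k])
            for k in range(K)], axis=1)
        # log-sum-exp over components, then sum over datapoints
        return -np.sum(logsumexp(log_terms, axis=1))

This is the quantity a gradient-based optimizer would minimize in the next section; an autodiff framework would supply the gradients.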
Gradient-based fitting
We can fit the parameters θ = {π, {µ(k), Σ(k)}} with gradient methods. However, to use
standard gradient-based optimizers we need to transform the parameters to be unconstrained.
As previously discussed for stochastic variational inference, one way to represent the
covariance matrix is in terms of an unconstrained lower-triangular matrix L̃, where:
Lij = L̃ij for i ≠ j,    Lii = exp(L̃ii),
Σ = LLᵀ.
Because the exponentiated diagonal is positive, L is invertible and Σ is a valid (symmetric, positive-definite) covariance matrix for any setting of L̃.
The π vector is also constrained: it must be positive and add up to one. We can represent it
as the softmax of a vector containing arbitrary values (as discussed earlier in the course).
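A minimal NumPy sketch of these two transformations; the function names and the argument names L_tilde and logits are invented for illustration:

    import numpy as np

    def constrain_covariance(L_tilde):
        """Map an unconstrained matrix L_tilde to a valid covariance Sigma = L L^T."""
        L = np.tril(L_tilde, k=-1)              # keep the strictly lower-triangular part
        L += np.diag(np.exp(np.diag(L_tilde)))  # exponentiate the diagonal so it is positive
        return L @ L.T                          # symmetric and positive definite

    def constrain_mixing(logits):
        """Softmax: map an arbitrary real vector to positive values that sum to one."""
        a = logits - np.max(logits)             # subtract the max for numerical stability
        e = np.exp(a)
        return e / np.sum(e)

A gradient-based optimizer then works with the unconstrained quantities (L_tilde, logits, and the unconstrained means), applying these transformations inside the cost function.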
Mixtures of Gaussians aren’t usually fitted with gradient methods (see next section). However,
by using gradient-based optimization, we can consider a mixture of Gaussians as a “module”
in a neural network or deep learning system. Here’s a sketch of the idea: The vectors
modelled by the MoG could be the target outputs in a probabilistic regression problem with
multiple outputs. We’re simply replacing the usual Gaussian assumption for regression with
a mixture of Gaussians. The parameters of the mixture model, θ, would be specified by a
hidden layer that depends on some inputs. The gradients of the MoG log-likelihood would
be backpropagated through the neural network as usual. This model is known as a
Mixture Density Network5.
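A rough sketch of the idea, not the implementation from the cited reference: a network with one hidden layer outputs the mixture parameters for a one-dimensional target, and we evaluate the mixture’s negative log-likelihood. The layer sizes and weight names are invented, and in practice an autodiff framework would supply the gradients:

    import numpy as np
    from scipy.special import logsumexp

    def mdn_negative_log_likelihood(x, y, W1, b1, W_pi, W_mu, W_s):
        """NLL of 1-D targets y (shape (N,)) given inputs x (shape (N, Din)),
        under a mixture of K Gaussians whose parameters depend on x."""
        h = np.tanh(x @ W1 + b1)                    # hidden layer, shape (N, H)
        log_pi = h @ W_pi                           # unnormalized log mixing proportions, (N, K)
        log_pi = log_pi - logsumexp(log_pi, axis=1, keepdims=True)  # log-softmax
        mu = h @ W_mu                               # component means, (N, K)
        log_sigma = h @ W_s                         # log standard deviations, (N, K)
        sigma = np.exp(log_sigma)                   # exp keeps sigma positive
        # log N(y; mu_k, sigma_k^2) for each component
        log_norm = (-0.5 * np.log(2 * np.pi) - log_sigma
                    - 0.5 * ((y[:, None] - mu) / sigma) ** 2)
        return -np.sum(logsumexp(log_pi + log_norm, axis=1))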
For keen students: there is some literature analyzing gradient-based methods for mixture
models6. There are also more sophisticated gradient-based optimizers that can deal with the
constraints, which work better in some cases.
The EM algorithm
The Expectation Maximization (EM) algorithm is really a framework for optimizing a wide
class of models.8 Applied to Mixtures of Gaussians, we obtain a simple and interpretable
iterative fitting procedure, which is more popular than gradient-based optimization.
If we knew the latent indicator variables {z(n)}, then fitting the parameters by maximum
likelihood would be easy. We’d just find the empirical mean and covariance of the points
belonging to each component. In addition, the mixing proportion of a component/cluster
πk would be set to the fraction of points that belong to component k. We’d be fitting the
parameters of a Gaussian Bayes classifier, with labels {z(n)}.
To set up some notation, we can indicate which cluster is responsible for each datapoint with
a vector of binary variables giving a “one-hot” encoding, with entries rk(n) = δz(n),k (that is, rk(n) = 1 if z(n) = k, and 0 otherwise). Then we can write
down expressions for what the maximum likelihood parameters would be, if we knew the
assignments of the datapoints to components:
πk = rk / N,    where rk = ∑n rk(n),
µ(k) = (1/rk) ∑n rk(n) x(n),
Σ(k) = (1/rk) ∑n rk(n) (x(n) − µ(k)) (x(n) − µ(k))ᵀ,
with the sums running over n = 1, . . . , N.
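If the assignments really were known, these updates are easy to compute. A minimal NumPy sketch, assuming X is an N×D array of datapoints and R is an N×K array whose rows are the one-hot vectors r(n); the same code also works when the rows of R are soft responsibilities, which is what the M-step below needs:

    import numpy as np

    def fit_given_assignments(X, R):
        """Maximum-likelihood parameters when the responsibilities R are known."""
        N = X.shape[0]
        r = R.sum(axis=0)                         # r_k = sum_n r_k^(n), shape (K,)
        pi = r / N                                # mixing proportions
        mus = (R.T @ X) / r[:, None]              # mu^(k) = (1/r_k) sum_n r_k^(n) x^(n)
        Sigmas = []
        for k in range(R.shape[1]):
            diff = X - mus[k]                     # (N, D) deviations from mu^(k)
            Sigmas.append((R[:, k, None] * diff).T @ diff / r[k])
        return pi, mus, np.stack(Sigmas)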
The EM algorithm uses the equations above, with probabilistic settings of the unknown
component assignments. The algorithm alternates between two steps:
1) E-step: Compute the “responsibilities”, the posterior probabilities that each component generated each datapoint, given the current parameters:
rk(n) = P(z(n) = k | x(n), θ) = πk N(x(n); µ(k), Σ(k)) / ∑j πj N(x(n); µ(j), Σ(j)).
2) M-step: Update the parameters θ = {π, {µ(k), Σ(k)}} using the responsibilities from
the E-step in the equations for the parameters above.
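Putting the two steps together, one possible implementation sketch, reusing the hypothetical fit_given_assignments function from above; the stopping rule here (a maximum number of iterations, or the likelihood barely improving) is a variant of the termination options mentioned below:

    import numpy as np
    from scipy.special import logsumexp
    from scipy.stats import multivariate_normal

    def em_fit(X, pi, mus, Sigmas, max_iters=100, tol=1e-6):
        """Fit a mixture of Gaussians by EM, starting from initial parameters."""
        prev_nll = np.inf
        for _ in range(max_iters):
            # E-step: responsibilities r_k^(n) = P(z^(n) = k | x^(n), theta)
            log_terms = np.stack([
                np.log(pi[k]) + multivariate_normal.logpdf(X, mus[k], Sigmas[k])
                for k in range(len(pi))], axis=1)           # shape (N, K)
            log_marg = logsumexp(log_terms, axis=1, keepdims=True)
            R = np.exp(log_terms - log_marg)                # rows sum to one
            # M-step: re-estimate the parameters using the soft assignments
            pi, mus, Sigmas = fit_given_assignments(X, R)
            # negative log-likelihood under the parameters used in this E-step
            nll = -np.sum(log_marg)
            if prev_nll - nll < tol:                        # stop when barely improving
                break
            prev_nll = nll
        return pi, mus, Sigmas, nll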
If these steps are repeated until convergence, it can be shown that the algorithm will converge
to a local maximum of the likelihood. In practice we terminate after some maximum number
of iterations, or when the parameters are changing by less than some tolerance. Early
stopping based on a validation set could also be used.
Some parameter settings have infinite likelihood. For example, place the mean of one
component on top of a datapoint, and set the corresponding covariance to zero. Infinite
likelihoods can also be obtained by explaining D or fewer of the D-dimensional datapoints
with one of the components and making its covariance matrix low-rank. There are a variety
of solutions to this issue, including reinitializing ill-behaved components, and regularizing
the covariances.
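One simple version of regularizing the covariances is to add a small multiple of the identity to each estimate in the M-step, so that no component can collapse onto a single point or a low-rank subspace. The value of eps below is an arbitrary illustrative choice:

    import numpy as np

    def regularize_covariances(Sigmas, eps=1e-6):
        """Add eps*I to each of the (K, D, D) covariances so none becomes singular."""
        D = Sigmas.shape[-1]
        return Sigmas + eps * np.eye(D)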
Whether we use EM or gradient-based methods, the likelihood is multi-modal, and different
initializations of the parameters lead to different answers.
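A common precaution is therefore to run the fitting procedure from several random initializations and keep the solution with the highest likelihood. A sketch, reusing the hypothetical em_fit above; the initialization scheme (means at random datapoints, broad covariances, equal mixing proportions) is just one arbitrary choice:

    import numpy as np

    def fit_with_restarts(X, K, n_restarts=5, seed=0):
        """Run EM several times and keep the fit with the lowest negative log-likelihood."""
        rng = np.random.default_rng(seed)
        N, D = X.shape
        best = None
        for _ in range(n_restarts):
            mus = X[rng.choice(N, size=K, replace=False)]          # means at random datapoints
            Sigmas = np.stack([np.cov(X.T) + 1e-6 * np.eye(D) for _ in range(K)])
            pi = np.full(K, 1.0 / K)                               # equal mixing proportions
            pi, mus, Sigmas, nll = em_fit(X, pi, mus, Sigmas)
            if best is None or nll < best[-1]:
                best = (pi, mus, Sigmas, nll)
        return best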