Gaussian Mixture Modelling GMM
Daniel Foley
In a previous post, I discussed k-means clustering as a way of summarising text data. I also
talked about some of the limitations of k-means and in what situations it may not be the
most appropriate solution. Probably the biggest limitation is that each cluster shares the same
diagonal covariance matrix with unit variance. This produces spherical clusters that are quite inflexible in
terms of the types of distributions they can model. In this post, I wanted to address some of
those limitations and talk about one method in particular that can avoid these issues,
Gaussian Mixture Modelling (GMM). The format of this post will be very similar to the last
one where I explain the theory behind GMM and how it works. I then want to dive into
coding the algorithm in Python and we can see how the results differ from k-means and
why using GMM may be a good alternative.
How can we estimate this type of model? Well, one thing we could do is to introduce a
latent variable (γ) for each data point. This assumes that each data point was
generated using some information about its latent variable; in other words, the latent variable tells us
which Gaussian generated a particular data point. In practice, however, we do not observe
these latent variables so we need to estimate them. How do we do this? Well, luckily for us
there is already an algorithm to use in cases like these, the Expectation Maximisation (EM)
Algorithm and this is what we will discuss next.
The EM algorithm
The EM algorithm consists of two steps: an E-step (Expectation step) and an M-step (Maximisation
step). Let's say we have some latent variables (which are unobserved and
denoted by the vector Z below) and our data points X. Our goal is to maximise the marginal
likelihood of X given our parameters (denoted by the vector θ). Essentially we can find the
marginal distribution as the joint of X and Z and sum over all Z’s (sum rule of probability).
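The equation referred to here was an embedded image and is missing from this copy; the marginalisation it describes (the sum rule) is simply:

```latex
p(X \mid \theta) = \sum_{Z} p(X, Z \mid \theta)
```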
The above equation often results in a complicated function that is hard to maximise. What
we can do in this case is to use Jensen's Inequality to construct a lower bound function which
is much easier to optimise. If we optimise this by minimising the KL divergence (gap)
between the two distributions we can approximate the original function. This process is
illustrated in Figure 1 below. I have also provided a video link above which shows a
derivation of KL divergence for those of you who want a more rigorous mathematical
explanation.
To estimate our model we essentially only need to carry out two steps. In the first step (E-
step) we want to estimate the posterior distribution of our latent variables conditional on
the weights (π), means (µ) and covariances (Σ) of our Gaussians. The vector of parameters is
denoted as θ in Figure 1. Estimating the E-step requires initialising these values first and we
can do this with k-means which is usually a good starting point (more on this in the code
below). We can then move to the second step (M-step) and use the estimated posterior (γ) to maximise the
likelihood with respect to our parameters θ. This process is repeated until the algorithm
converges (loss function doesn't change).
Figure 1. Note: θ is a vector of all parameters. Source: Bayesian Methods for Machine Learning
The E-Step
Ok, now that we have visualised what the EM algorithm is doing I want to outline and
explain the equations we need to calculate in the E-step and the M-step. These will be really
important when it comes time to write our code. We can write the Gaussian Mixture
distribution as a combination of Gaussians with weights equal to π, as shown below, where K is the
number of Gaussians we want to model.
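The mixture-density equation was an embedded image in the original; the standard form it refers to is:

```latex
p(\mathbf{x}) = \sum_{k=1}^{K} \pi_k \,\mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k),
\qquad \sum_{k=1}^{K} \pi_k = 1
```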
Taking the above results we can calculate the posterior distribution of the responsibilities
that each Gaussian has for each data point using the formula below. This equation is just
Bayes rule where π is the prior weights and the likelihood is normal.
Equation 3: Posterior Responsibilities using Bayes Rule
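The equation image itself is missing from this copy; the standard Bayes-rule form matching the description above is:

```latex
\gamma(z_{nk}) =
\frac{\pi_k \,\mathcal{N}(\mathbf{x}_n \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)}
     {\sum_{j=1}^{K} \pi_j \,\mathcal{N}(\mathbf{x}_n \mid \boldsymbol{\mu}_j, \boldsymbol{\Sigma}_j)}
```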
The M-Step
After calculating our posterior all we need to do is get an estimate of the parameters of each
Gaussian defined by the equations below and then evaluate the log-likelihood. These two
steps are then repeated until convergence.
Equation 6: weights
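Only the caption for equation 6 survived extraction. Based on the later reference to "equations 4, 5 and 6" (means, covariances and weights), the standard closed-form M-step updates they correspond to are presumably:

```latex
N_k = \sum_{n=1}^{N} \gamma(z_{nk}), \qquad
\boldsymbol{\mu}_k = \frac{1}{N_k} \sum_{n=1}^{N} \gamma(z_{nk})\,\mathbf{x}_n, \qquad
\boldsymbol{\Sigma}_k = \frac{1}{N_k} \sum_{n=1}^{N} \gamma(z_{nk})
  (\mathbf{x}_n - \boldsymbol{\mu}_k)(\mathbf{x}_n - \boldsymbol{\mu}_k)^{\top}, \qquad
\pi_k = \frac{N_k}{N}
```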
Remember though, we have set the problem up in such a way that we can instead maximise
a lower bound (or minimise the distance between the distributions) which will approximate
equation 8 above. We can write our lower bound as follows where z is our latent variable.
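The lower-bound equation was also an embedded image; the standard form it describes, with γ(z_nk) the responsibilities from the E-step, is:

```latex
\mathcal{L}(\theta, \gamma) = \sum_{n=1}^{N} \sum_{k=1}^{K} \gamma(z_{nk})
\Big( \ln \pi_k + \ln \mathcal{N}(\mathbf{x}_n \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)
      - \ln \gamma(z_{nk}) \Big)
```

The last term is the entropy of the responsibilities and does not depend on θ, which is why the M-step only needs the first two terms.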
Notice our summation now appears outside the logarithm instead of inside it, resulting in a
much simpler expression than equation 8.
Python Code
Now that we have explained the theory behind the modelling I want to code up this
algorithm using Python. Like my previous post, I am going to be using the same data set so
we can compare the results between k-means and GMM. The preprocessing steps are
exactly the same as those in the previous post and I provide a link to the full code at the
end of this post.
K-means estimation
As I mentioned before, in order to start the algorithm (perform 1st E-step) we need initial
values for our parameters. Rather than just randomly setting these values it is usually a
good idea to estimate them using k-means. This will usually give us a good starting point
and can help our model converge faster. Before we estimate GMM let’s have a quick look at
what kind of clusters k-means gives us.
sklearn k-means
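The embedded code block has not survived in this copy; a minimal sketch of fitting k-means with sklearn, assuming the preprocessed feature matrix is called X and that we want three clusters (as used later in the post), might look like:

```python
from sklearn.cluster import KMeans

# X is assumed to be the preprocessed feature matrix from the previous post;
# it is not defined in this snippet.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
labels = kmeans.fit_predict(X)        # hard cluster assignments
centroids = kmeans.cluster_centers_   # cluster centres used for the plot below
```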
Using our estimates from sklearn we can create a nice visualisation of our clusters (Figure
2). Notice the clusters are all spherical in shape and are the same size. The spherical
clusters do not seem to model the spread of the data very well indicating that k-means in
this particular case may not be the best approach. This illustrates one of the limitations of k-
means as all covariance matrices are diagonal with unit variance. This limitation means that
the model is not particularly flexible. With that in mind, let’s try out GMM and see what kind
of results that gives us.
Figure 2: k-means spherical Gaussians
GMM estimation
Figure 3 below illustrates what GMM is doing. It clearly shows three clusters modelled by
three different Gaussian distributions. I have used a toy data set here just to illustrate this
clearly as it is less clear with the Enron data set. As you can see, compared to Figure 2
modelled using spherical clusters, GMM is much more flexible allowing us to generate much
better fitting distributions.
Figure 3: GMM example: simple data set: Full Covariance
The next bit of code implements our initialise_parameters method which uses k-means
from the sklearn library to calculate our clusters. Notice that this function actually calls our
calculate_mean_covariance method defined above. We could have probably used one
method to calculate our clusters and initial parameters but it is usually much easier to
debug and avoid errors if each method only carries out one specific task.
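The gist containing this code did not survive the export. Below is a standalone sketch of what the two helpers described here might look like; the function names (initialise_parameters, calculate_mean_covariance) follow the text, but the bodies, the default of C=3 components and the k-means settings are reconstructions rather than the author's original class methods.

```python
import numpy as np
from sklearn.cluster import KMeans

def calculate_mean_covariance(X, prediction, C=3):
    """Turn hard k-means labels into initial GMM weights, means and covariances."""
    d = X.shape[1]
    pi = np.zeros(C)
    mu = np.zeros((C, d))
    sigma = np.zeros((C, d, d))
    for k in range(C):
        idx = (prediction == k)
        pi[k] = idx.sum() / X.shape[0]        # cluster proportion -> initial weight
        mu[k] = X[idx].mean(axis=0)           # cluster centroid -> initial mean
        diff = X[idx] - mu[k]
        sigma[k] = diff.T @ diff / idx.sum()  # within-cluster scatter -> initial covariance
    return pi, mu, sigma

def initialise_parameters(X, C=3):
    """Run k-means and convert its hard assignments into GMM starting values."""
    labels = KMeans(n_clusters=C, n_init=10, random_state=42).fit_predict(X)
    return calculate_mean_covariance(X, labels, C)
```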
It’s time to get right into the most important methods in our class. The E-step of the
algorithm is defined below and takes in our parameters and data which makes perfect
sense given the equations we defined above. Remember, the purpose of this step is to
calculate the posterior distribution of our responsibilities (γ). The main thing to note here is
that we loop through each of the C Gaussians (3 in our case) and calculate the posterior
using a function from scipy to calculate the multivariate normal pdf.
After we have calculated this value for each Gaussian we just need to normalise the gamma
(γ), corresponding to the denominator in equation 3. This is to ensure our gammas are
valid probabilities. If we sum the values across clusters for each data point they should
equal 1.
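The original gist is missing here as well; the following is a sketch of an E-step consistent with the description and with equation 3, written as a standalone function rather than the author's class method. The use of scipy's multivariate_normal.pdf matches what the text says; everything else is reconstruction.

```python
import numpy as np
from scipy.stats import multivariate_normal

def e_step(X, pi, mu, sigma):
    """Compute the (N x C) matrix of responsibilities gamma, as in equation 3."""
    N, C = X.shape[0], len(pi)
    gamma = np.zeros((N, C))
    for k in range(C):
        # numerator of equation 3: prior weight times Gaussian likelihood
        gamma[:, k] = pi[k] * multivariate_normal.pdf(X, mean=mu[k], cov=sigma[k])
    # denominator of equation 3: normalise so each row sums to 1
    gamma /= gamma.sum(axis=1, keepdims=True)
    return gamma
```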
After we calculate the values for the responsibilities (γ) we can feed these into the M-step.
Again the purpose of the M-step is to calculate our new parameter values using the results
from the E-step corresponding to equations 4, 5 and 6. To make debugging easier I have
separated the m_step method and the compute_loss_function method in my code below.
The compute_loss_function does exactly what its name implies. It takes in the
responsibilities and parameters returned by the E-step and M-step and uses these to
calculate our lower bound loss function defined in equation 9.
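As above, the original code block is not preserved. A sketch of the two methods described here, again as standalone functions, might look as follows; whether the author's loss includes the entropy term in γ is not stated in the text, so treat that detail as an assumption.

```python
import numpy as np
from scipy.stats import multivariate_normal

def m_step(X, gamma):
    """Update weights, means and covariances from the responsibilities (equations 4-6)."""
    N, d = X.shape
    C = gamma.shape[1]
    Nk = gamma.sum(axis=0)            # effective number of points assigned to each Gaussian
    pi = Nk / N
    mu = (gamma.T @ X) / Nk[:, None]
    sigma = np.zeros((C, d, d))
    for k in range(C):
        diff = X - mu[k]
        sigma[k] = (gamma[:, k, None] * diff).T @ diff / Nk[k]
    return pi, mu, sigma

def compute_loss_function(X, gamma, pi, mu, sigma, eps=1e-10):
    """Lower bound of equation 9: sum_n sum_k gamma_nk [ln pi_k + ln N(x_n) - ln gamma_nk]."""
    C = gamma.shape[1]
    loss = 0.0
    for k in range(C):
        log_pdf = multivariate_normal.logpdf(X, mean=mu[k], cov=sigma[k])
        loss += np.sum(gamma[:, k] * (np.log(pi[k] + eps) + log_pdf - np.log(gamma[:, k] + eps)))
    return loss
```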
All of our most important methods have now been coded up. Keeping consistent with
sklearn I am going to define a fit method which will call the methods we just defined. In
particular, we start by initialising our parameter values. After this, we perform the steps
outlined in the EM-algorithm for our chosen number of iterations. Note that it doesn't
actually take a large number of iterations to converge particularly when you use k-means to
get values of the initial parameters (I think my algorithm converged in about 30 iterations).
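A sketch of such a fit routine, reusing the helper functions sketched above (initialise_parameters, e_step, m_step, compute_loss_function), might look like this; the iteration count and the absence of an explicit convergence check are placeholder choices, not the author's.

```python
def fit(X, C=3, n_iters=50):
    """Run EM: initialise with k-means, then alternate E- and M-steps."""
    pi, mu, sigma = initialise_parameters(X, C)
    loss = None
    for _ in range(n_iters):
        gamma = e_step(X, pi, mu, sigma)          # E-step: responsibilities
        pi, mu, sigma = m_step(X, gamma)          # M-step: parameter updates
        loss = compute_loss_function(X, gamma, pi, mu, sigma)
    return pi, mu, sigma, loss
```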
Since we are probably also interested in using this model to predict what Gaussian new data
might belong to we can implement a predict and predict_proba method. The predict_proba
method will take in new data points and predict the responsibilities for each Gaussian. In
other words, the probability that this data point came from each distribution. This is the
essence of the soft assignment that I mentioned at the start of the post. The predict method
does essentially the same but assigns the cluster which has the highest probability using
np.argmax.
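A minimal sketch of these two methods, again as standalone functions reusing the e_step sketch above:

```python
import numpy as np

def predict_proba(X_new, pi, mu, sigma):
    """Soft assignment: responsibility of each Gaussian for each new point."""
    return e_step(X_new, pi, mu, sigma)

def predict(X_new, pi, mu, sigma):
    """Hard assignment: the Gaussian with the highest responsibility."""
    return np.argmax(predict_proba(X_new, pi, mu, sigma), axis=1)
```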
The main takeaway from this figure is that the distributions are clearly no longer spherical.
GMM has allowed us to relax our restrictions on the covariance matrix, allowing the
distribution to have a much better fit to the data. This is particularly useful given that the
shape of our data was clearly not spherical. Now, this is probably not a perfect solution and
there are some data points which do not fit any distribution very well but it is an
improvement over k-means.
sklearn GMM
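The sklearn comparison code was also an embedded gist; a minimal equivalent using sklearn's GaussianMixture, assuming the same feature matrix X, would be:

```python
from sklearn.mixture import GaussianMixture

# Sanity check against sklearn's implementation on the same data.
gmm = GaussianMixture(n_components=3, covariance_type='full', random_state=42)
gmm.fit(X)
hard_labels = gmm.predict(X)        # analogous to our predict method
soft_labels = gmm.predict_proba(X)  # analogous to our predict_proba method
```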
Alright, guys, that’s it for this post. I hope that was a useful and pretty intuitive explanation
of Gaussian Mixture Modelling. If any of you want to get a deeper understanding of the
material I recommend the Coursera course Bayesian Methods for Machine Learning. I
sourced a lot of the material from this course and I think it gives really nice and in-depth
explanations of the concepts I presented here. I would also recommend the book, Pattern
Recognition and Machine Learning by Bishop. This book is a great reference for most of the
classic algorithms you will come across in machine learning. Below I provide the full code
for the GMM class outlined in the post as well as a link to the Kaggle kernel where I did all
the analysis. As always, feedback is welcome.