Gaussian Mixture Modelling GMM
Daniel Foley
In a previous post, I discussed k-means clustering as a way of summarising text data. I also
talked about some of the limitations of k-means and in what situations it may not be the
most appropriate solution. Probably the biggest limitation is that each cluster shares the same
diagonal covariance matrix with unit variance. This produces spherical clusters that are quite inflexible in
terms of the types of distributions they can model. In this post, I wanted to address some of
those limitations and talk about one method in particular that can avoid these issues,
Gaussian Mixture Modelling (GMM). The format of this post will be very similar to the last
one where I explain the theory behind GMM and how it works. I then want to dive into
coding the algorithm in Python and we can see how the results differ from k-means and
why using GMM may be a good alternative.
How can we estimate this type of model? Well, one thing we could do is to introduce a
latent variable (γ) for each data point. This assumes that each data point was
generated using some information about its latent variable; in other words, the latent variable tells us
which Gaussian generated a particular data point. In practice, however, we do not observe
these latent variables so we need to estimate them. How do we do this? Well, luckily for us
there is already an algorithm to use in cases like these, the Expectation Maximisation (EM)
Algorithm and this is what we will discuss next.
The EM algorithm
The EM algorithm consists of two steps: an E-step (Expectation step) and an M-step (Maximisation
step). Let's say we have some latent variables (which are unobserved and
denoted by the vector Z below) and our data points X. Our goal is to maximise the marginal
likelihood of X given our parameters (denoted by the vector θ). Essentially we can find the
marginal distribution as the joint of X and Z and sum over all Z’s (sum rule of probability).
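The equation referred to here was an embedded image and is missing from this copy; the marginalisation it describes (the sum rule) is simply:

```latex
p(X \mid \theta) = \sum_{Z} p(X, Z \mid \theta)
```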
The above equation often results in a complicated function that is hard to maximise. What
we can do in this case is to use Jensen's Inequality to construct a lower bound function which
is much easier to optimise. If we optimise this by minimising the KL divergence (gap)
between the two distributions we can approximate the original function. This process is
illustrated in Figure 1 below. I have also provided a video link above which shows a
derivation of KL divergence for those of you who want a more rigorous mathematical
explanation.
To estimate our model we essentially only need to carry out two steps. In the first step (E-
step) we want to estimate the posterior distribution of our latent variables conditional on
the weights (π), means (µ) and covariances (Σ) of our Gaussians. The vector of parameters is
denoted as θ in Figure 1. Estimating the E-step requires initialising these values first and we
can do this with k-means which is usually a good starting point (more on this in the code
below). We can then move to the second step (M-step) and use the estimated posterior (γ) to maximise the
likelihood with respect to our parameters θ. This process is repeated until the algorithm
converges (loss function doesn't change).
Figure 1. Note: θ is a vector of all parameters. Source: Bayesian Methods for Machine Learning
The E-Step
Ok, now that we have visualised what the EM algorithm is doing I want to outline and
explain the equations we need to calculate in the E-step and the M-step. These will be really
important when it comes time to write our code. We can write the Gaussian Mixture
distribution as a combination of Gaussians with weights equal to π, as shown below, where K is the
number of Gaussians we want to model.
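The mixture-density equation was an embedded image in the original; the standard form it refers to is:

```latex
p(\mathbf{x}) = \sum_{k=1}^{K} \pi_k \,\mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k),
\qquad \sum_{k=1}^{K} \pi_k = 1
```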
Taking the above results we can calculate the posterior distribution of the responsibilities
that each Gaussian has for each data point using the formula below. This equation is just
Bayes rule where π is the prior weights and the likelihood is normal.
Equation 3: Posterior Responsibilities using Bayes Rule
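The equation image itself is missing from this copy; the standard Bayes-rule form matching the description above is:

```latex
\gamma(z_{nk}) =
\frac{\pi_k \,\mathcal{N}(\mathbf{x}_n \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)}
     {\sum_{j=1}^{K} \pi_j \,\mathcal{N}(\mathbf{x}_n \mid \boldsymbol{\mu}_j, \boldsymbol{\Sigma}_j)}
```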
The M-Step
After calculating our posterior all we need to do is get an estimate of the parameters of each
Gaussian defined by the equations below and then evaluate the log-likelihood. These two
steps are then repeated until convergence.
Equation 6: weights
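Only the caption for equation 6 survived extraction. Based on the later reference to "equations 4, 5 and 6" (means, covariances and weights), the standard closed-form M-step updates they correspond to are presumably:

```latex
N_k = \sum_{n=1}^{N} \gamma(z_{nk}), \qquad
\boldsymbol{\mu}_k = \frac{1}{N_k} \sum_{n=1}^{N} \gamma(z_{nk})\,\mathbf{x}_n, \qquad
\boldsymbol{\Sigma}_k = \frac{1}{N_k} \sum_{n=1}^{N} \gamma(z_{nk})
  (\mathbf{x}_n - \boldsymbol{\mu}_k)(\mathbf{x}_n - \boldsymbol{\mu}_k)^{\top}, \qquad
\pi_k = \frac{N_k}{N}
```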
Remember though, we have set the problem up in such a way that we can instead maximise
a lower bound (or minimise the distance between the distributions) which will approximate
equation 8 above. We can write our lower bound as follows where z is our latent variable.
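The lower-bound equation was also an embedded image; the standard form it describes, with γ(z_nk) the responsibilities from the E-step, is:

```latex
\mathcal{L}(\theta, \gamma) = \sum_{n=1}^{N} \sum_{k=1}^{K} \gamma(z_{nk})
\Big( \ln \pi_k + \ln \mathcal{N}(\mathbf{x}_n \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)
      - \ln \gamma(z_{nk}) \Big)
```

The last term is the entropy of the responsibilities and does not depend on θ, which is why the M-step only needs the first two terms.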
Notice our summation now appears outside the logarithm instead of inside it, resulting in a
much simpler expression than equation 8.
Python Code
Now that we have explained the theory behind the modelling I want to code up this
algorithm using Python. Like my previous post, I am going to be using the same data set so
we can compare the results between k-means and GMM. The preprocessing steps are
exactly the same as those in the previous post and I provide a link to the full code at the
end of this post.
K-means estimation
As I mentioned before, in order to start the algorithm (perform 1st E-step) we need initial
values for our parameters. Rather than just randomly setting these values it is usually a
good idea to estimate them using k-means. This will usually give us a good starting point
and can help our model converge faster. Before we estimate GMM let’s have a quick look at
what kind of clusters k-means gives us.
sklearn k-means
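The embedded code block has not survived in this copy; a minimal sketch of fitting k-means with sklearn, assuming the preprocessed feature matrix is called X and that we want three clusters (as used later in the post), might look like:

```python
from sklearn.cluster import KMeans

# X is assumed to be the preprocessed feature matrix from the previous post;
# it is not defined in this snippet.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
labels = kmeans.fit_predict(X)        # hard cluster assignments
centroids = kmeans.cluster_centers_   # cluster centres used for the plot below
```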
Using our estimates from sklearn we can create a nice visualisation of our clusters (Figure
2). Notice the clusters are all spherical in shape and are the same size. The spherical
clusters do not seem to model the spread of the data very well indicating that k-means in
this particular case may not be the best approach. This illustrates one of the limitations of k-
means as all covariance matrices are diagonal with unit variance. This limitation means that
the model is not particularly flexible. With that in mind, let’s try out GMM and see what kind
of results that gives us.
Figure 2: k-means spherical Gaussians
GMM estimation
Figure 3 below illustrates what GMM is doing. It clearly shows three clusters modelled by
three different Gaussian distributions. I have used a toy data set here just to illustrate this
clearly as it is less clear with the Enron data set. As you can see, compared to Figure 2
modelled using spherical clusters, GMM is much more flexible allowing us to generate much
better fitting distributions.
Figure 3: GMM example: simple data set: Full Covariance
The next bit of code implements our initialise_parameters method which uses k-means
from the sklearn library to calculate our clusters. Notice that this function actually calls our
calculate_mean_covariance method defined above. We could have probably used one
method to calculate our clusters and initial parameters but it is usually much easier to
debug and avoid errors if each method only carries out one specific task.
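The gist containing this code did not survive the export. Below is a standalone sketch of what the two helpers described here might look like; the function names (initialise_parameters, calculate_mean_covariance) follow the text, but the bodies, the default of C=3 components and the k-means settings are reconstructions rather than the author's original class methods.

```python
import numpy as np
from sklearn.cluster import KMeans

def calculate_mean_covariance(X, prediction, C=3):
    """Turn hard k-means labels into initial GMM weights, means and covariances."""
    d = X.shape[1]
    pi = np.zeros(C)
    mu = np.zeros((C, d))
    sigma = np.zeros((C, d, d))
    for k in range(C):
        idx = (prediction == k)
        pi[k] = idx.sum() / X.shape[0]        # cluster proportion -> initial weight
        mu[k] = X[idx].mean(axis=0)           # cluster centroid -> initial mean
        diff = X[idx] - mu[k]
        sigma[k] = diff.T @ diff / idx.sum()  # within-cluster scatter -> initial covariance
    return pi, mu, sigma

def initialise_parameters(X, C=3):
    """Run k-means and convert its hard assignments into GMM starting values."""
    labels = KMeans(n_clusters=C, n_init=10, random_state=42).fit_predict(X)
    return calculate_mean_covariance(X, labels, C)
```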
It’s time to get right into the most important methods in our class. The E-step of the
algorithm is defined below and takes in our parameters and data which makes perfect
sense given the equations we defined above. Remember, the purpose of this step is to
calculate the posterior distribution of our responsibilities (γ). The main thing to note here is
that we loop through each of the C Gaussians (3 in our case) and calculate the posterior
using a function from scipy to calculate the multivariate normal pdf.
After we have calculated this value for each Gaussian we just need to normalise the gamma
(γ), corresponding to the denominator in equation 3. This is to ensure our gammas are
valid probabilities. If we sum the values across clusters for each data point they should
equal 1.
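The original gist is missing here as well; the following is a sketch of an E-step consistent with the description and with equation 3, written as a standalone function rather than the author's class method. The use of scipy's multivariate_normal.pdf matches what the text says; everything else is reconstruction.

```python
import numpy as np
from scipy.stats import multivariate_normal

def e_step(X, pi, mu, sigma):
    """Compute the (N x C) matrix of responsibilities gamma, as in equation 3."""
    N, C = X.shape[0], len(pi)
    gamma = np.zeros((N, C))
    for k in range(C):
        # numerator of equation 3: prior weight times Gaussian likelihood
        gamma[:, k] = pi[k] * multivariate_normal.pdf(X, mean=mu[k], cov=sigma[k])
    # denominator of equation 3: normalise so each row sums to 1
    gamma /= gamma.sum(axis=1, keepdims=True)
    return gamma
```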
After we calculate the values for the responsibilities (γ) we can feed these into the M-step.
Again the purpose of the M-step is to calculate our new parameter values using the results
from the E-step corresponding to equations 4, 5 and 6. To make debugging easier I have
separated the m_step method and the compute_loss_function method in my code below.
The compute_loss_function does exactly what its name implies. It takes in the
responsibilities and parameters returned by the E-step and M-step and uses these to
calculate our lower bound loss function defined in equation 9.
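As above, the original code block is not preserved. A sketch of the two methods described here, again as standalone functions, might look as follows; whether the author's loss includes the entropy term in γ is not stated in the text, so treat that detail as an assumption.

```python
import numpy as np
from scipy.stats import multivariate_normal

def m_step(X, gamma):
    """Update weights, means and covariances from the responsibilities (equations 4-6)."""
    N, d = X.shape
    C = gamma.shape[1]
    Nk = gamma.sum(axis=0)            # effective number of points assigned to each Gaussian
    pi = Nk / N
    mu = (gamma.T @ X) / Nk[:, None]
    sigma = np.zeros((C, d, d))
    for k in range(C):
        diff = X - mu[k]
        sigma[k] = (gamma[:, k, None] * diff).T @ diff / Nk[k]
    return pi, mu, sigma

def compute_loss_function(X, gamma, pi, mu, sigma, eps=1e-10):
    """Lower bound of equation 9: sum_n sum_k gamma_nk [ln pi_k + ln N(x_n) - ln gamma_nk]."""
    C = gamma.shape[1]
    loss = 0.0
    for k in range(C):
        log_pdf = multivariate_normal.logpdf(X, mean=mu[k], cov=sigma[k])
        loss += np.sum(gamma[:, k] * (np.log(pi[k] + eps) + log_pdf - np.log(gamma[:, k] + eps)))
    return loss
```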
All of our most important methods have now been coded up. Keeping consistent with
sklearn I am going to define a fit method which will call the methods we just defined. In
particular, we start by initialising our parameter values. After this, we perform the steps
outlined in the EM-algorithm for our chosen number of iterations. Note that it doesn't
actually take a large number of iterations to converge particularly when you use k-means to
get values of the initial parameters (I think my algorithm converged in about 30 iterations).
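A sketch of such a fit routine, reusing the helper functions sketched above (initialise_parameters, e_step, m_step, compute_loss_function), might look like this; the iteration count and the absence of an explicit convergence check are placeholder choices, not the author's.

```python
def fit(X, C=3, n_iters=50):
    """Run EM: initialise with k-means, then alternate E- and M-steps."""
    pi, mu, sigma = initialise_parameters(X, C)
    loss = None
    for _ in range(n_iters):
        gamma = e_step(X, pi, mu, sigma)          # E-step: responsibilities
        pi, mu, sigma = m_step(X, gamma)          # M-step: parameter updates
        loss = compute_loss_function(X, gamma, pi, mu, sigma)
    return pi, mu, sigma, loss
```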
Since we are probably also interested in using this model to predict what Gaussian new data
might belong to we can implement a predict and predict_proba method. The predict_proba
method will take in new data points and predict the responsibilities for each Gaussian. In
other words, the probability that this data point came from each distribution. This is the
essence of the soft assignment that I mentioned at the start of the post. The predict method
does essentially the same but assigns the cluster which has the highest probability using
np.argmax.
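A minimal sketch of these two methods, again as standalone functions reusing the e_step sketch above:

```python
import numpy as np

def predict_proba(X_new, pi, mu, sigma):
    """Soft assignment: responsibility of each Gaussian for each new point."""
    return e_step(X_new, pi, mu, sigma)

def predict(X_new, pi, mu, sigma):
    """Hard assignment: the Gaussian with the highest responsibility."""
    return np.argmax(predict_proba(X_new, pi, mu, sigma), axis=1)
```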
The main takeaway from this figure is that the distributions are clearly no longer spherical.
GMM has allowed us to relax our restrictions on the covariance matrix, allowing the
distribution to have a much better fit to the data. This is particularly useful given that the
shape of our data was clearly not spherical. Now, this is probably not a perfect solution and
there are some data points which do not fit any distribution very well but it is an
improvement over k-means.
sklearn GMM
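The sklearn comparison code was also an embedded gist; a minimal equivalent using sklearn's GaussianMixture, assuming the same feature matrix X, would be:

```python
from sklearn.mixture import GaussianMixture

# Sanity check against sklearn's implementation on the same data.
gmm = GaussianMixture(n_components=3, covariance_type='full', random_state=42)
gmm.fit(X)
hard_labels = gmm.predict(X)        # analogous to our predict method
soft_labels = gmm.predict_proba(X)  # analogous to our predict_proba method
```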
Alright, guys, that’s it for this post. I hope that was a useful and pretty intuitive explanation
of Gaussian Mixture Modelling. If any of you want to get a deeper understanding of the
material I recommend the Coursera course Bayesian Methods for Machine Learning. I
sourced a lot of the material from this course and I think it gives really nice and in-depth
explanations of the concepts I presented here. I would also recommend the book, Pattern
Recognition and Machine Learning by Bishop. This book is a great reference for most of the
classic algorithms you will come across in machine learning. Below I provide the full code
for the GMM class outlined in the post as well as a link to the Kaggle kernel where I did all
the analysis. As always, feedback is welcome.