Fitting A Mixture Distribution to Data: Tutorial

arXiv:1901.06708v2 [stat.OT] 11 Oct 2020

Benyamin Ghojogh  BGHOJOGH@UWATERLOO.CA
Department of Electrical and Computer Engineering,
Machine Learning Laboratory, University of Waterloo, Waterloo, ON, Canada

Aydin Ghojogh  AYDIN.GHOJOGH@GMAIL.COM

Mark Crowley  MCROWLEY@UWATERLOO.CA
Department of Electrical and Computer Engineering,
Machine Learning Laboratory, University of Waterloo, Waterloo, ON, Canada

Fakhri Karray  KARRAY@UWATERLOO.CA
Department of Electrical and Computer Engineering,
Centre for Pattern Analysis and Machine Intelligence, University of Waterloo, Waterloo, ON, Canada

Abstract

This paper is a step-by-step tutorial for fitting a mixture distribution to data. It merely assumes that the reader has a background in calculus and linear algebra. Other required background is briefly reviewed before explaining the main algorithm. In explaining the main algorithm, first, fitting a mixture of two distributions is detailed, and examples of fitting two Gaussians and two Poissons, respectively for the continuous and discrete cases, are introduced. Thereafter, fitting several distributions in the general case is explained, and examples with several Gaussians (Gaussian Mixture Model) and several Poissons are again provided. Model-based clustering, as one of the applications of mixture distributions, is also introduced. Numerical simulations are also provided for both the Gaussian and Poisson examples for the sake of better clarification.

1. Introduction

Every random variable can be considered as a sample from a distribution, whether a well-known distribution or a not very well-known (or "ugly") distribution. Some random variables are drawn from one single distribution, such as a normal distribution. But life is not always so easy! Most real-life random variables might have been generated from a mixture of several distributions rather than a single distribution. The mixture distribution is a weighted summation of K distributions {g1(x; Θ1), . . . , gK(x; ΘK)} where the weights {w1, . . . , wK} sum to one. Every distribution in the mixture has its own parameter Θk. The mixture distribution is formulated as:

    f(x; Θ1, . . . , ΘK) = Σ_{k=1}^{K} wk gk(x; Θk),   subject to   Σ_{k=1}^{K} wk = 1.    (1)

The distributions can be from different families, for example beta and normal distributions. However, this makes the problem very complex and sometimes intractable; therefore, the distributions in a mixture are usually from one family (e.g., all normal distributions) but with different parameters. This paper aims to find the parameters of the distributions in the mixture distribution f(x; Θ), as well as the weights (also called "mixing probabilities") wk.

The remainder of the paper is organized as follows. Section 2 reviews some technical background required for explaining the main algorithm. Afterwards, the methodology of fitting a mixture distribution to data is explained in Section 3. In that section, first the mixture of two distributions, as a special case of mixture distributions, is introduced and analyzed. Then, the general mixture distribution is discussed. Meanwhile, examples of mixtures of Gaussians (for the continuous case) and Poissons (for the discrete case) are given for better clarification. Section 4 briefly introduces clustering as one of the applications of mixture distributions. In Section 5, the discussed methods are implemented through some simulations in order to give a better sense of how these algorithms work. Finally, Section 6 concludes the paper.
2. Background

This section reviews some technical background required for explaining the main algorithm. This review includes probability and Bayes' rule, the probability mass/density function, expectation, maximum likelihood estimation, expectation maximization, and the Lagrange multiplier.

2.1. Probability and Bayes' Rule

If S denotes the total sample space and A denotes an event in this sample space, the probability of event A is:

    P(A) = |A| / |S|.    (2)

The conditional probability, i.e., the probability of occurrence of event A given that event B happens, is:

    P(A|B) = P(A, B) / P(B)    (3)
           = P(B|A) P(A) / P(B),    (4)

where P(A|B), P(B|A), P(A), and P(B) are called the posterior, likelihood, prior, and marginal probabilities, respectively. If we assume that the event A consists of some cases A = {A1, . . . , An}, we can write:

    P(Ai|B) = P(B|Ai) P(Ai) / Σ_{j=1}^{n} P(B|Aj) P(Aj).    (5)

Equations (4) and (5) are two versions of Bayes' rule.

2.2. Probability Mass/Density Function

In discrete cases, the probability mass function is defined as:

    f(x) = P(X = x),    (6)

where X and x are a random variable and a number, respectively.
In continuous cases, the probability density function is:

    f(x) = lim_{Δx→0} P(x ≤ X ≤ x + Δx) / Δx = ∂P(X ≤ x) / ∂x.    (7)

In this work, by a mixture of distributions, we mean a mixture of mass/density functions.

2.3. Expectation

Expectation is the average value of a random variable X. Therefore, the expectation is a weighted average, where the weights are the probabilities of the random variable X taking its different values. In the discrete and continuous cases, the expectation is:

    E(X) = Σ_{dom x} x f(x),    (8)
    E(X) = ∫_{dom x} x f(x) dx,    (9)

respectively, where dom x is the domain of X. The conditional expectation is defined as:

    E_{X|Y}(X|Y) = Σ_{dom x} x f(x|y),    (10)
    E_{X|Y}(X|Y) = ∫_{dom x} x f(x|y) dx,    (11)

for the discrete and continuous cases, respectively.

2.4. Maximum Likelihood Estimation

Assume we have a sample of size n, i.e., {x1, . . . , xn}. Also assume that we know the family of the distribution from which this sample has been randomly drawn, but we do not know the parameters of that distribution. For example, we know it is drawn from a normal distribution, but the mean and variance of this distribution are unknown. The goal is to estimate the parameters of the distribution using the sample {x1, . . . , xn} available from it. This estimation of parameters from the available sample is called "point estimation". One of the approaches for point estimation is Maximum Likelihood Estimation (MLE). As is obvious from its name, MLE deals with the likelihood of the data.
We postulate that the values of the sample, i.e., x1, . . . , xn, are independent random variates from the same distribution. In other words, the data have a joint distribution fX(x1, . . . , xn|Θ) with parameter Θ, and the variates are assumed to be independent and identically distributed (iid), i.e., xi ∼ fX(xi; Θ) (iid) with the same parameter Θ.
Considering Bayes' rule, equation (4), we have:

    L(Θ|x1, . . . , xn) = fX(x1, . . . , xn|Θ) π(Θ) / fX(x1, . . . , xn).    (12)

MLE aims to find the parameter Θ which maximizes the likelihood:

    Θ̂ = arg max_Θ L(Θ).    (13)

According to the definition, the likelihood can be written as:

    L(Θ|x1, . . . , xn) := f(x1, . . . , xn; Θ) =(a) Π_{i=1}^{n} f(xi; Θ),    (14)

where (a) is because the x1, . . . , xn are iid. Note that in the literature, L(Θ|x1, . . . , xn) is also denoted by L(Θ) for simplicity.
Usually, for more convenience, we use the log-likelihood rather than the likelihood:

    ℓ(Θ) := log L(Θ)    (15)
          = log Π_{i=1}^{n} f(xi; Θ) = Σ_{i=1}^{n} log f(xi; Θ).    (16)

Often, the logarithm is a natural logarithm for the sake of compatibility with the exponential in the well-known normal density function. Notice that, as the logarithm is a monotonic function, it does not change the location of the maximum of the likelihood.
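As a quick illustration of MLE (this example is mine, not from the paper), the closed-form MLE for a normal distribution is the sample mean and the biased sample variance. The following minimal Python/NumPy sketch checks this on a synthetic sample with hypothetical parameter values:

    import numpy as np

    rng = np.random.default_rng(0)
    # Hypothetical sample: n = 1000 draws from N(mean=3, std=2)
    sample = rng.normal(loc=3.0, scale=2.0, size=1000)

    # MLE for a normal distribution (obtained by setting the derivative of
    # the log-likelihood to zero): sample mean and biased sample variance.
    mu_hat = sample.mean()
    sigma2_hat = ((sample - mu_hat) ** 2).mean()

    print(mu_hat, sigma2_hat)  # expected to be close to 3.0 and 4.0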
2.5. Expectation Maximization

Sometimes, the data are not fully observable. For example, the data may be known only to be either zero or greater than zero. As an illustration, assume that data are collected for a particular disease, but for the convenience of the patients who participated in the survey, the severity of the disease is not recorded; only the existence or non-existence of the disease is reported. So, the data do not give us complete information, as Xi > 0 does not tell us whether Xi = 2 or Xi = 1000.
In this case, MLE cannot be directly applied, as we do not have access to complete information and some data are missing. Here, Expectation Maximization (EM) is useful. The main idea of EM can be summarized in this short friendly conversation:
– What shall we do? The data is missing! The log-likelihood is not known completely so MLE cannot be used.
– Mmm, probably we can replace the missing data with something...
– Aha! Let us replace it with its mean.
– You are right! We can take the mean of the log-likelihood over the possible values of the missing data. Then everything in the log-likelihood will be known, and then...
– And then we can do MLE!
Let D(obs) and D(miss) denote the observed data (the Xi's equal to 0 in the above example) and the missing data (the Xi's greater than 0 in the above example). The EM algorithm includes two main steps, i.e., the E-step and the M-step.
In the E-step, the expectation of the log-likelihood (equation (15)) is taken with respect to the missing data D(miss) in order to obtain a mean estimate of it. Let Q(Θ) denote the expectation of the log-likelihood with respect to D(miss):

    Q(Θ) := E_{D(miss)|D(obs), Θ}[ℓ(Θ)].    (17)

Note that in the above expectation, D(obs) and Θ are conditioned on, so they are treated as constants and not as random variables.
In the M-step, the MLE approach is used, where the log-likelihood is replaced with its expectation, i.e., Q(Θ); therefore:

    Θ̂ = arg max_Θ Q(Θ).    (18)

These two steps are iteratively repeated until convergence of the estimated parameters Θ̂.

2.6. Lagrange Multiplier

Suppose we have a multivariate function Q(Θ1, . . . , ΘK) (called the "objective function") and we want to maximize (or minimize) it. However, this optimization is constrained, and its constraint is the equality P(Θ1, . . . , ΘK) = c, where c is a constant. So, the constrained optimization problem is:

    maximize_{Θ1, . . . , ΘK}  Q(Θ1, . . . , ΘK),
    subject to  P(Θ1, . . . , ΘK) = c.    (19)

To solve this problem, we can introduce a new variable α, called the "Lagrange multiplier", and a new function L(Θ1, . . . , ΘK, α), called the "Lagrangian":

    L(Θ1, . . . , ΘK, α) = Q(Θ1, . . . , ΘK) − α (P(Θ1, . . . , ΘK) − c).    (20)

Maximizing (or minimizing) this Lagrangian function gives us the solution to the optimization problem (Boyd & Vandenberghe, 2004):

    ∇_{Θ1, . . . , ΘK, α} L = 0,    (21)

which gives us:

    ∇_{Θ1, . . . , ΘK} L = 0  ⟹  ∇_{Θ1, . . . , ΘK} Q = α ∇_{Θ1, . . . , ΘK} P,
    ∇_α L = 0  ⟹  P(Θ1, . . . , ΘK) = c.
3. Fitting a Mixture Distribution

As was mentioned in the introduction, the goal of fitting a mixture distribution is to find the parameters and weights of a weighted summation of distributions (see equation (1)). First, as a special case of mixture distributions, we work on the mixture of two distributions, and then we discuss the general mixture of distributions.

3.1. Mixture of Two Distributions

Assume that we want to fit a mixture of two distributions g1(x; Θ1) and g2(x; Θ2) to the data. Note that, in theory, these two distributions are not necessarily from the same distribution family. As we have only two distributions in the mixture, equation (1) is simplified to:

    f(x; Θ1, Θ2) = w g1(x; Θ1) + (1 − w) g2(x; Θ2).    (22)

Note that the parameter w (or wk in general) is called the "mixing probability" (Friedman et al., 2009) and is sometimes denoted by π (or πk in general) in the literature.
The likelihood and log-likelihood for this mixture are:

    L(Θ1, Θ2) = f(x1, . . . , xn; Θ1, Θ2) =(a) Π_{i=1}^{n} f(xi; Θ1, Θ2)
              = Π_{i=1}^{n} [w g1(xi; Θ1) + (1 − w) g2(xi; Θ2)],

    ℓ(Θ1, Θ2) = Σ_{i=1}^{n} log[w g1(xi; Θ1) + (1 − w) g2(xi; Θ2)],

where (a) is because of the assumption that x1, . . . , xn are iid. Optimizing this log-likelihood is difficult because of the summation within the logarithm. However, we can use a nice trick here (Friedman et al., 2009). Let Δi be defined as:

    Δi := 1 if xi belongs to g1(x; Θ1),
          0 if xi belongs to g2(x; Θ2),

and let its probability be:

    P(Δi = 1) = w,
    P(Δi = 0) = 1 − w.

Therefore, the log-likelihood can be written as:

    ℓ(Θ1, Θ2) = Σ_{i=1}^{n} log[w g1(xi; Θ1)]          if Δi = 1,
    ℓ(Θ1, Θ2) = Σ_{i=1}^{n} log[(1 − w) g2(xi; Θ2)]    if Δi = 0.

The above expression can be restated as:

    ℓ(Θ1, Θ2) = Σ_{i=1}^{n} [Δi log(w g1(xi; Θ1)) + (1 − Δi) log((1 − w) g2(xi; Θ2))].

The Δi here is the incomplete (missing) datum, because we do not know whether Δi = 0 or Δi = 1 for xi. Hence, using the EM algorithm, we try to estimate it by its expectation.
The E-step in EM:

    Q(Θ1, Θ2) = Σ_{i=1}^{n} [E[Δi|X, Θ1, Θ2] log(w g1(xi; Θ1)) + E[(1 − Δi)|X, Θ1, Θ2] log((1 − w) g2(xi; Θ2))].

Notice that the above expression is linear with respect to Δi, and that is why the two logarithms could be factored out of the expectation. Let γ̂i := E[Δi|X, Θ1, Θ2], which is called the "responsibility" of xi (Friedman et al., 2009).
The Δi is either 0 or 1; therefore:

    E[Δi|X, Θ1, Θ2] = 0 × P(Δi = 0|X, Θ1, Θ2) + 1 × P(Δi = 1|X, Θ1, Θ2) = P(Δi = 1|X, Θ1, Θ2).

According to Bayes' rule (equation (5)), we have:

    P(Δi = 1|X, Θ1, Θ2) = P(X, Θ1, Θ2, Δi = 1) / P(X; Θ1, Θ2)
                        = P(X, Θ1, Θ2|Δi = 1) P(Δi = 1) / Σ_{j=0}^{1} P(X, Θ1, Θ2|Δi = j) P(Δi = j).

The marginal probability in the denominator is:

    P(X; Θ1, Θ2) = w g1(xi; Θ1) + (1 − w) g2(xi; Θ2).

Thus:

    γ̂i = ŵ g1(xi; Θ1) / [ŵ g1(xi; Θ1) + (1 − ŵ) g2(xi; Θ2)],    (23)

and

    Q(Θ1, Θ2) = Σ_{i=1}^{n} [γ̂i log(w g1(xi; Θ1)) + (1 − γ̂i) log((1 − w) g2(xi; Θ2))].    (24)

Some simplification of Q(Θ1, Θ2) will help in the next step:

    Q(Θ1, Θ2) = Σ_{i=1}^{n} [γ̂i log w + γ̂i log g1(xi; Θ1) + (1 − γ̂i) log(1 − w) + (1 − γ̂i) log g2(xi; Θ2)].

The M-step in EM:

    Θ̂1, Θ̂2, ŵ = arg max_{Θ1, Θ2, w} Q(Θ1, Θ2, w).

Note that the function Q(Θ1, Θ2) is also a function of w, and that is why we write it as Q(Θ1, Θ2, w). Setting its derivatives to zero gives:

    ∂Q/∂Θ1 = Σ_{i=1}^{n} [γ̂i / g1(xi; Θ1)] ∂g1(xi; Θ1)/∂Θ1 = 0,    (25)

    ∂Q/∂Θ2 = Σ_{i=1}^{n} [(1 − γ̂i) / g2(xi; Θ2)] ∂g2(xi; Θ2)/∂Θ2 = 0,    (26)

    ∂Q/∂w = Σ_{i=1}^{n} [γ̂i (1/w) + (1 − γ̂i)(−1/(1 − w))] = 0  ⟹  ŵ = (1/n) Σ_{i=1}^{n} γ̂i.    (27)

So, the mixing probability is the average of the responsibilities, which makes sense.
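The E-step of equation (23) and the weight update of equation (27) are easy to express in code. The following NumPy sketch is my own illustration; g1 and g2 are assumed to be callables that return density (or mass) values for given data and parameters.

    import numpy as np

    def responsibilities(x, g1, theta1, g2, theta2, w):
        """E-step, equation (23): responsibility of component 1 for each point."""
        p1 = w * g1(x, theta1)
        p2 = (1.0 - w) * g2(x, theta2)
        return p1 / (p1 + p2)

    def mixing_probability(gamma):
        """Part of the M-step, equation (27): the average responsibility."""
        return float(np.mean(gamma))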
Solving equations (25), (26), and (27) gives us the estimates Θ̂1, Θ̂2, and ŵ in every iteration.
The iterative algorithm for finding the parameters of the mixture of two distributions is shown in Algorithm 1.

Algorithm 1: Fitting a Mixture of Two Distributions
1  START: Initialize Θ̂1, Θ̂2, ŵ
2  while not converged do
3      // E-step in EM:
4      for i from 1 to n do
5          γ̂i ← equation (23)
6      // M-step in EM:
7      Θ̂1 ← equation (25)
8      Θ̂2 ← equation (26)
9      ŵ ← equation (27)
10     // Check convergence:
11     Compare Θ̂1, Θ̂2, and ŵ with their values in the previous iteration

3.1.1. Mixture of Two Gaussians

Here, we consider a mixture of two one-dimensional Gaussian distributions as an example of a mixture of two continuous distributions. In this case, we have:

    g1(x; µ1, σ1²) = (1 / √(2π σ1²)) exp(−(x − µ1)² / (2σ1²)) = φ((x − µ1)/σ1),
    g2(x; µ2, σ2²) = (1 / √(2π σ2²)) exp(−(x − µ2)² / (2σ2²)) = φ((x − µ2)/σ2),

where φ(x) is the probability density function of the normal distribution. Therefore, equation (22) becomes:

    f(x; µ1, µ2, σ1², σ2²) = w φ((x − µ1)/σ1) + (1 − w) φ((x − µ2)/σ2).    (28)

Equation (23) becomes:

    γ̂i = ŵ φ((xi − µ1)/σ1) / [ŵ φ((xi − µ1)/σ1) + (1 − ŵ) φ((xi − µ2)/σ2)].    (29)

The Q(µ1, µ2, σ1², σ2²) is:

    Q(µ1, µ2, σ1², σ2²) = Σ_{i=1}^{n} [γ̂i log w + γ̂i (−(1/2) log(2π) − log σ1 − (xi − µ1)²/(2σ1²))
                          + (1 − γ̂i) log(1 − w) + (1 − γ̂i)(−(1/2) log(2π) − log σ2 − (xi − µ2)²/(2σ2²))].

Therefore, setting the derivatives to zero:

    ∂Q/∂µ1 = Σ_{i=1}^{n} γ̂i (xi − µ1)/σ1² = 0  ⟹  µ̂1 = Σ_{i=1}^{n} γ̂i xi / Σ_{i=1}^{n} γ̂i,    (30)

    ∂Q/∂µ2 = Σ_{i=1}^{n} (1 − γ̂i)(xi − µ2)/σ2² = 0  ⟹  µ̂2 = Σ_{i=1}^{n} (1 − γ̂i) xi / Σ_{i=1}^{n} (1 − γ̂i),    (31)

    ∂Q/∂σ1 = Σ_{i=1}^{n} γ̂i (−1/σ1 + (xi − µ1)²/σ1³) = 0  ⟹  σ̂1² = Σ_{i=1}^{n} γ̂i (xi − µ̂1)² / Σ_{i=1}^{n} γ̂i,    (32)

    ∂Q/∂σ2 = Σ_{i=1}^{n} (1 − γ̂i)(−1/σ2 + (xi − µ2)²/σ2³) = 0  ⟹  σ̂2² = Σ_{i=1}^{n} (1 − γ̂i)(xi − µ̂2)² / Σ_{i=1}^{n} (1 − γ̂i),    (33)

and ŵ is the same as equation (27).
Iteratively solving equations (29), (30), (31), (32), (33), and (27) using Algorithm 1 gives us the estimates of µ̂1, µ̂2, σ̂1, σ̂2, and ŵ in equation (28).
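A minimal NumPy sketch of Algorithm 1 for two Gaussians, directly translating equations (29)-(33) and (27), could look as follows. The initialization, convergence tolerance, and variable names are my own choices, and I use the standard normal density scipy.stats.norm.pdf (which includes the 1/σ normalization) for the components.

    import numpy as np
    from scipy.stats import norm

    def em_two_gaussians(x, n_iter=300, tol=1e-6, seed=0):
        rng = np.random.default_rng(seed)
        # Simple initialization (an assumption; see Section 5 for the paper's scheme)
        mu1, mu2 = rng.choice(x, size=2, replace=False)
        s1 = s2 = np.std(x)
        w = 0.5
        for _ in range(n_iter):
            # E-step: responsibilities, equation (29)
            p1 = w * norm.pdf(x, mu1, s1)
            p2 = (1 - w) * norm.pdf(x, mu2, s2)
            gamma = p1 / (p1 + p2)
            # M-step: equations (30)-(33) and (27)
            mu1_new = np.sum(gamma * x) / np.sum(gamma)
            mu2_new = np.sum((1 - gamma) * x) / np.sum(1 - gamma)
            s1_new = np.sqrt(np.sum(gamma * (x - mu1_new) ** 2) / np.sum(gamma))
            s2_new = np.sqrt(np.sum((1 - gamma) * (x - mu2_new) ** 2) / np.sum(1 - gamma))
            w_new = np.mean(gamma)
            # Convergence check: compare with the previous iteration
            change = max(abs(mu1_new - mu1), abs(mu2_new - mu2),
                         abs(s1_new - s1), abs(s2_new - s2), abs(w_new - w))
            mu1, mu2, s1, s2, w = mu1_new, mu2_new, s1_new, s2_new, w_new
            if change < tol:
                break
        return mu1, mu2, s1, s2, w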
3.1.2. Mixture of Two Poissons

Here, we consider a mixture of two Poisson distributions as an example of a mixture of two discrete distributions. In this case, we have:

    g1(x; λ1) = e^{−λ1} λ1^x / x!,
    g2(x; λ2) = e^{−λ2} λ2^x / x!,

therefore, equation (22) becomes:

    f(x; λ1, λ2) = w e^{−λ1} λ1^x / x! + (1 − w) e^{−λ2} λ2^x / x!.    (34)

Equation (23) becomes:

    γ̂i = ŵ (e^{−λ̂1} λ̂1^{xi} / xi!) / [ŵ (e^{−λ̂1} λ̂1^{xi} / xi!) + (1 − ŵ)(e^{−λ̂2} λ̂2^{xi} / xi!)].    (35)

The Q(λ1, λ2) is:

    Q(λ1, λ2) = Σ_{i=1}^{n} [γ̂i log w + γ̂i (−λ1 + xi log λ1 − log xi!)
                + (1 − γ̂i) log(1 − w) + (1 − γ̂i)(−λ2 + xi log λ2 − log xi!)].

Therefore, setting the derivatives to zero:

    ∂Q/∂λ1 = Σ_{i=1}^{n} γ̂i (−1 + xi/λ1) = 0  ⟹  λ̂1 = Σ_{i=1}^{n} γ̂i xi / Σ_{i=1}^{n} γ̂i,    (36)

    ∂Q/∂λ2 = Σ_{i=1}^{n} (1 − γ̂i)(−1 + xi/λ2) = 0  ⟹  λ̂2 = Σ_{i=1}^{n} (1 − γ̂i) xi / Σ_{i=1}^{n} (1 − γ̂i),    (37)

and ŵ is the same as equation (27).
Iteratively solving equations (35), (36), (37), and (27) using Algorithm 1 gives us the estimates of λ̂1, λ̂2, and ŵ in equation (34).
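Analogously to the Gaussian case, Algorithm 1 with equations (35)-(37) and (27) can be sketched for two Poisson components as follows; the percentile-based initialization is my own arbitrary choice.

    import numpy as np
    from scipy.stats import poisson

    def em_two_poissons(x, n_iter=300, tol=1e-6):
        # Hypothetical initialization: spread the two rates over the data range
        lam1, lam2 = np.percentile(x, 25), np.percentile(x, 75)
        w = 0.5
        for _ in range(n_iter):
            # E-step: responsibilities, equation (35)
            p1 = w * poisson.pmf(x, lam1)
            p2 = (1 - w) * poisson.pmf(x, lam2)
            gamma = p1 / (p1 + p2)
            # M-step: equations (36), (37), and (27)
            lam1_new = np.sum(gamma * x) / np.sum(gamma)
            lam2_new = np.sum((1 - gamma) * x) / np.sum(1 - gamma)
            w_new = np.mean(gamma)
            change = max(abs(lam1_new - lam1), abs(lam2_new - lam2), abs(w_new - w))
            lam1, lam2, w = lam1_new, lam2_new, w_new
            if change < tol:
                break
        return lam1, lam2, w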
3.2. Mixture of Several Distributions

Now, assume a more general case where we want to fit a mixture of K distributions g1(x; Θ1), . . . , gK(x; ΘK) to the data. Again, in theory, these K distributions are not necessarily from the same distribution family. For the convenience of the reader, equation (1) is repeated here:

    f(x; Θ1, . . . , ΘK) = Σ_{k=1}^{K} wk gk(x; Θk),   subject to   Σ_{k=1}^{K} wk = 1.

The likelihood and log-likelihood for this mixture are:

    L(Θ1, . . . , ΘK) = f(x1, . . . , xn; Θ1, . . . , ΘK) =(a) Π_{i=1}^{n} f(xi; Θ1, . . . , ΘK)
                      = Π_{i=1}^{n} Σ_{k=1}^{K} wk gk(xi; Θk),

    ℓ(Θ1, . . . , ΘK) = Σ_{i=1}^{n} log[Σ_{k=1}^{K} wk gk(xi; Θk)],

where (a) is because of the assumption that x1, . . . , xn are iid. Optimizing this log-likelihood is difficult because of the summation within the logarithm. We use the same trick as for the mixture of two distributions:

    Δi,k := 1 if xi belongs to gk(x; Θk),
            0 otherwise,

and its probability is:

    P(Δi,k = 1) = wk,
    P(Δi,k = 0) = 1 − wk.

Therefore, the log-likelihood can be written as:

    ℓ(Θ1, . . . , ΘK) = Σ_{i=1}^{n} log(wk gk(xi; Θk))   for the k with Δi,k = 1 (and Δi,k' = 0 for all k' ≠ k).

The above expression can be restated as:

    ℓ(Θ1, . . . , ΘK) = Σ_{i=1}^{n} Σ_{k=1}^{K} Δi,k log(wk gk(xi; Θk)).

The Δi,k here is the incomplete (missing) datum, because we do not know whether Δi,k = 0 or Δi,k = 1 for xi and a specific k. Therefore, using the EM algorithm, we try to estimate it by its expectation.
The E-step in EM:

    Q(Θ1, . . . , ΘK) = Σ_{i=1}^{n} Σ_{k=1}^{K} E[Δi,k|X, Θ1, . . . , ΘK] log(wk gk(xi; Θk)).

The Δi,k is either 0 or 1; therefore:

    E[Δi,k|X, Θ1, . . . , ΘK] = 0 × P(Δi,k = 0|X, Θ1, . . . , ΘK) + 1 × P(Δi,k = 1|X, Θ1, . . . , ΘK)
                             = P(Δi,k = 1|X, Θ1, . . . , ΘK).

According to Bayes' rule (equation (5)), we have:

    P(Δi,k = 1|X, Θ1, . . . , ΘK) = P(X, Θ1, . . . , ΘK, Δi,k = 1) / P(X; Θ1, . . . , ΘK)
                                  = P(X, Θ1, . . . , ΘK|Δi,k = 1) P(Δi,k = 1) / Σ_{k'=1}^{K} P(X, Θ1, . . . , ΘK|Δi,k' = 1) P(Δi,k' = 1).

The marginal probability in the denominator is:

    P(X; Θ1, . . . , ΘK) = Σ_{k'=1}^{K} wk' gk'(xi; Θk').

Letting γ̂i,k := E[Δi,k|X, Θ1, . . . , ΘK] (called the responsibility of xi for the k-th component), we have:

    γ̂i,k = ŵk gk(xi; Θk) / Σ_{k'=1}^{K} ŵk' gk'(xi; Θk'),    (38)

and

    Q(Θ1, . . . , ΘK) = Σ_{i=1}^{n} Σ_{k=1}^{K} γ̂i,k log(wk gk(xi; Θk)).    (39)

Some simplification of Q(Θ1, . . . , ΘK) will help in the next step:

    Q(Θ1, . . . , ΘK) = Σ_{i=1}^{n} Σ_{k=1}^{K} [γ̂i,k log wk + γ̂i,k log gk(xi; Θk)].

The M-step in EM:

    Θ̂k, ŵk = arg max_{Θk, wk} Q(Θ1, . . . , ΘK, w1, . . . , wK),
    subject to  Σ_{k=1}^{K} wk = 1.

Note that the function Q(Θ1, . . . , ΘK) is also a function of w1, . . . , wK, and that is why we write it as Q(Θ1, . . . , ΘK, w1, . . . , wK).
The above problem is a constrained optimization problem, which can be solved using a Lagrange multiplier (see Section 2.6):

    L(Θ1, . . . , ΘK, w1, . . . , wK, α) = Q(Θ1, . . . , ΘK, w1, . . . , wK) − α (Σ_{k=1}^{K} wk − 1)
        = Σ_{i=1}^{n} Σ_{k=1}^{K} [γ̂i,k log wk + γ̂i,k log gk(xi; Θk)] − α (Σ_{k=1}^{K} wk − 1).

Setting the derivatives of the Lagrangian to zero:

    ∂L/∂Θk = Σ_{i=1}^{n} [γ̂i,k / gk(xi; Θk)] ∂gk(xi; Θk)/∂Θk = 0,    (40)

    ∂L/∂wk = Σ_{i=1}^{n} γ̂i,k / wk − α = 0  ⟹  wk = (1/α) Σ_{i=1}^{n} γ̂i,k,

    ∂L/∂α = Σ_{k=1}^{K} wk − 1 = 0  ⟹  Σ_{k=1}^{K} wk = 1.

Therefore, Σ_{k=1}^{K} (1/α) Σ_{i=1}^{n} γ̂i,k = 1, which gives α = Σ_{i=1}^{n} Σ_{k=1}^{K} γ̂i,k, and hence:

    ŵk = Σ_{i=1}^{n} γ̂i,k / Σ_{i=1}^{n} Σ_{k'=1}^{K} γ̂i,k'.    (41)

Since the responsibilities of each point sum to one over k, the denominator equals n, so ŵk is again the average responsibility of the k-th component, consistent with equation (27).
Solving equations (40) and (41) gives us the estimates Θ̂k and ŵk (for k ∈ {1, . . . , K}) in every iteration.
The iterative algorithm for finding the parameters of the mixture of several distributions is shown in Algorithm 2.

Algorithm 2: Fitting a Mixture of Several Distributions
1  START: Initialize Θ̂1, . . . , Θ̂K, ŵ1, . . . , ŵK
2  while not converged do
3      // E-step in EM:
4      for i from 1 to n do
5          for k from 1 to K do
6              γ̂i,k ← equation (38)
7      // M-step in EM:
8      for k from 1 to K do
9          Θ̂k ← equation (40)
10         ŵk ← equation (41)
11     // Check convergence:
12     Compare Θ̂1, . . . , Θ̂K and ŵ1, . . . , ŵK with their values in the previous iteration
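Algorithm 2 is agnostic to the component family: only the density gk and its M-step update change. A generic NumPy sketch (my own structuring, with hypothetical callback names pdf and m_step) might look like this.

    import numpy as np

    def em_mixture(x, K, pdf, m_step, init_params, n_iter=300, tol=1e-6):
        """Generic Algorithm 2: pdf(x, params) -> densities, m_step(x, gamma_k) -> params."""
        params = list(init_params)          # one parameter set per component
        w = np.full(K, 1.0 / K)             # start with equal mixing probabilities
        for _ in range(n_iter):
            # E-step: responsibilities, equation (38)
            dens = np.stack([w[k] * pdf(x, params[k]) for k in range(K)])  # shape (K, n)
            gamma = dens / dens.sum(axis=0)
            # M-step: component parameters (equation (40)) and weights (equation (41))
            params = [m_step(x, gamma[k]) for k in range(K)]
            w_new = gamma.sum(axis=1) / gamma.sum()
            # Simple convergence check on the weights; in practice one would
            # also compare the component parameters with the previous iteration
            if np.max(np.abs(w_new - w)) < tol:
                w = w_new
                break
            w = w_new
        return params, w

For example, for Poisson components one could pass pdf = lambda x, lam: poisson.pmf(x, lam) (with poisson imported from scipy.stats) and m_step = lambda x, g: np.sum(g * x) / np.sum(g), which is exactly the weighted-mean update derived below for the Poisson case.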
3.2.1. Mixture of Several Gaussians

Here, we consider a mixture of K one-dimensional Gaussian distributions as an example of a mixture of several continuous distributions. In this case, we have:

    gk(x; µk, σk²) = (1 / √(2π σk²)) exp(−(x − µk)² / (2σk²)) = φ((x − µk)/σk),   ∀k ∈ {1, . . . , K}.

Therefore, equation (1) becomes:

    f(x; µ1, . . . , µK, σ1², . . . , σK²) = Σ_{k=1}^{K} wk φ((x − µk)/σk).    (42)

Equation (38) becomes:

    γ̂i,k = ŵk φ((xi − µk)/σk) / Σ_{k'=1}^{K} ŵk' φ((xi − µk')/σk').    (43)

The Q(µ1, . . . , µK, σ1², . . . , σK²) is:

    Q(µ1, . . . , µK, σ1², . . . , σK²) = Σ_{i=1}^{n} Σ_{k=1}^{K} [γ̂i,k log wk
        + γ̂i,k (−(1/2) log(2π) − log σk − (xi − µk)²/(2σk²))].

The Lagrangian is:

    L(µ1, . . . , µK, σ1², . . . , σK², w1, . . . , wK, α) = Σ_{i=1}^{n} Σ_{k=1}^{K} [γ̂i,k log wk
        + γ̂i,k (−(1/2) log(2π) − log σk − (xi − µk)²/(2σk²))] − α (Σ_{k=1}^{K} wk − 1).

Therefore, setting the derivatives to zero:

    ∂L/∂µk = Σ_{i=1}^{n} γ̂i,k (xi − µk)/σk² = 0  ⟹  µ̂k = Σ_{i=1}^{n} γ̂i,k xi / Σ_{i=1}^{n} γ̂i,k,    (44)

    ∂L/∂σk = Σ_{i=1}^{n} γ̂i,k (−1/σk + (xi − µk)²/σk³) = 0  ⟹  σ̂k² = Σ_{i=1}^{n} γ̂i,k (xi − µ̂k)² / Σ_{i=1}^{n} γ̂i,k,    (45)

and ŵk is the same as equation (41).
Iteratively solving equations (43), (44), (45), and (41) using Algorithm 2 gives us the estimates of µ̂1, . . . , µ̂K, σ̂1, . . . , σ̂K, and ŵ1, . . . , ŵK in equation (42).
3.2.2. Multivariate Mixture of Gaussians

The data might be multivariate (x ∈ R^d), and in that case the Gaussian distributions in the mixture model should be multi-dimensional. We consider a mixture of K multivariate Gaussian distributions. In this case, we have:

    gk(x; µk, Σk) = (1 / √((2π)^d |Σk|)) exp(−(x − µk)^T Σk^{−1} (x − µk) / 2),   ∀k ∈ {1, . . . , K},

where |Σk| is the determinant of Σk.
Therefore, equation (1) becomes:

    f(x; µ1, . . . , µK, Σ1, . . . , ΣK) = Σ_{k=1}^{K} wk gk(x; µk, Σk).    (46)

Equation (38) becomes:

    γ̂i,k = ŵk gk(xi; µk, Σk) / Σ_{k'=1}^{K} ŵk' gk'(xi; µk', Σk'),    (47)

where x1, . . . , xn ∈ R^d, µ1, . . . , µK ∈ R^d, Σ1, . . . , ΣK ∈ R^{d×d}, ŵk ∈ R, and γ̂i,k ∈ R.
The Q(µ1, . . . , µK, Σ1, . . . , ΣK) is:

    Q(µ1, . . . , µK, Σ1, . . . , ΣK) = Σ_{i=1}^{n} Σ_{k=1}^{K} [γ̂i,k log wk
        + γ̂i,k (−(d/2) log(2π) − (1/2) log |Σk| − (1/2) tr((xi − µk)^T Σk^{−1} (xi − µk)))],

where tr(·) denotes the trace of a matrix. The trace is used here because (xi − µk)^T Σk^{−1} (xi − µk) is a scalar, so it is equal to its trace.
The Lagrangian is:

    L(µ1, . . . , µK, Σ1, . . . , ΣK, w1, . . . , wK, α) = Σ_{i=1}^{n} Σ_{k=1}^{K} [γ̂i,k log wk
        + γ̂i,k (−(d/2) log(2π) − (1/2) log |Σk| − (1/2) tr((xi − µk)^T Σk^{−1} (xi − µk)))] − α (Σ_{k=1}^{K} wk − 1).

Therefore:

    ∂L/∂µk = Σ_{i=1}^{n} γ̂i,k Σk^{−1} (xi − µk) = 0 ∈ R^d
        ⟹(a)  Σ_{i=1}^{n} γ̂i,k (xi − µk) = 0
        ⟹  µ̂k = Σ_{i=1}^{n} γ̂i,k xi / Σ_{i=1}^{n} γ̂i,k ∈ R^d,    (48)

    ∂L/∂Σk^{−1} =(b) Σ_{i=1}^{n} γ̂i,k [(1/2) Σk − (1/2)(xi − µk)(xi − µk)^T] = 0 ∈ R^{d×d}
        ⟹  Σk Σ_{i=1}^{n} γ̂i,k = Σ_{i=1}^{n} γ̂i,k (xi − µk)(xi − µk)^T
        ⟹  Σ̂k = Σ_{i=1}^{n} γ̂i,k (xi − µk)(xi − µk)^T / Σ_{i=1}^{n} γ̂i,k ∈ R^{d×d},    (49)

and ŵk ∈ R is the same as equation (41). In the above expressions, (a) is because Σk^{−1} does not depend on i and is nonsingular, so it can be factored out of the summation and cancelled (note that γ̂i,k is a scalar). In (b), it is more convenient to differentiate with respect to Σk^{−1}, which yields the same stationary point: using tr((xi − µk)^T Σk^{−1} (xi − µk)) = tr(Σk^{−1} (xi − µk)(xi − µk)^T), ∂ log |Σk| / ∂Σk^{−1} = −Σk, and ∂ tr(Σk^{−1} A) / ∂Σk^{−1} = A^T = A for the symmetric matrix A = (xi − µk)(xi − µk)^T.
Iteratively solving equations (47), (48), (49), and (41) using Algorithm 2 gives us the estimates of µ̂1, . . . , µ̂K, Σ̂1, . . . , Σ̂K, and ŵ1, . . . , ŵK in equation (46). The multivariate mixture of Gaussians is also discussed in (Lee & Scott, 2012). Moreover, note that the mixture of Gaussians is also referred to as the Gaussian Mixture Model (GMM) in the literature.

3.2.3. Mixture of Several Poissons

Here, we consider a mixture of K Poisson distributions as an example of a mixture of several discrete distributions. In this case, we have:

    gk(x; λk) = e^{−λk} λk^x / x!,

therefore, equation (1) becomes:

    f(x; λ1, . . . , λK) = Σ_{k=1}^{K} wk e^{−λk} λk^x / x!.    (50)

Equation (38) becomes:

    γ̂i,k = ŵk (e^{−λ̂k} λ̂k^{xi} / xi!) / Σ_{k'=1}^{K} ŵk' (e^{−λ̂k'} λ̂k'^{xi} / xi!).    (51)

The Q(λ1, . . . , λK) is:

    Q(λ1, . . . , λK) = Σ_{i=1}^{n} Σ_{k=1}^{K} [γ̂i,k log wk + γ̂i,k (−λk + xi log λk − log xi!)].

The Lagrangian is:

    L(λ1, . . . , λK, w1, . . . , wK, α) = Σ_{i=1}^{n} Σ_{k=1}^{K} [γ̂i,k log wk + γ̂i,k (−λk + xi log λk − log xi!)] − α (Σ_{k=1}^{K} wk − 1).

Therefore, setting the derivative to zero:

    ∂L/∂λk = Σ_{i=1}^{n} γ̂i,k (−1 + xi/λk) = 0  ⟹  λ̂k = Σ_{i=1}^{n} γ̂i,k xi / Σ_{i=1}^{n} γ̂i,k,    (52)

and ŵk is the same as equation (41).
Iteratively solving equations (51), (52), and (41) using Algorithm 2 gives us the estimates of λ̂1, . . . , λ̂K and ŵ1, . . . , ŵK in equation (50).

4. Using Mixture Distribution for Clustering

Mixture distributions have a variety of applications, including clustering. Assuming that the number of clusters, denoted by K, is known, the cluster label of a point xi (i ∈ {1, . . . , n}) is determined as:

    label of xi ← arg max_k gk(xi; Θk),    (53)

where gk(xi; Θk) is the k-th component of the mixture distribution f(x; Θ1, . . . , ΘK) = Σ_{k=1}^{K} wk gk(x; Θk) fitted to the data x1, . . . , xn. The reason why this clustering works is that the density/mass function which produces a point with the highest probability is the best candidate for the cluster of that point. This method of clustering is referred to as "model-based clustering" in the literature (Fraley & Raftery, 1998; 2002).
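As an illustration of equation (53) (my own sketch, with hypothetical parameter arrays), the cluster labels for a one-dimensional Gaussian mixture can be obtained by evaluating each fitted component at every point and taking the argmax:

    import numpy as np
    from scipy.stats import norm

    def cluster_labels(x, mu, sigma):
        """Model-based clustering, equation (53): argmax over the fitted components."""
        dens = norm.pdf(x[None, :], mu[:, None], sigma[:, None])  # shape (K, n)
        return np.argmax(dens, axis=0)  # 0-based component indices

    # Hypothetical usage with parameters returned by the em_k_gaussians sketch:
    # mu, sigma, w = em_k_gaussians(x, K=3)
    # labels = cluster_labels(x, mu, sigma)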
5. Simulations

In this section, we report some simulations on fitting a mixture of densities in both the continuous and discrete cases. For the continuous case, a mixture of three Gaussians and, for the discrete case, a mixture of three Poissons are simulated.

5.1. Mixture of Three Gaussians

A sample of size n = 2200 from three distributions is randomly generated for this experiment:

    φ((x − µ1)/σ1) = φ((x + 10)/1.2),
    φ((x − µ2)/σ2) = φ((x − 0)/2),
    φ((x − µ3)/σ3) = φ((x − 5)/5).

For generality, the sizes of the subsets of the sample generated from the three densities are different, i.e., 700, 1000, and 500. The three densities are shown in Fig. 1.

Figure 1. The original probability density functions from which the sample is drawn. The mixture includes three different Gaussians shown in blue, red, and green.

Applying Algorithm 2 and using equations (43), (44), (45), and (41) for a mixture of K = 3 Gaussians gives us the estimated values of the parameters:

    µ1 = −9.99,  σ1 = 1.17,  w1 = 0.317
    µ2 = −0.05,  σ2 = 1.93,  w2 = 0.445
    µ3 = 4.64,   σ3 = 4.86,  w3 = 0.237

Comparing the estimates of µ1, µ2, µ3 and σ1, σ2, σ3 with those of the original densities from which the data were generated verifies the correctness of the estimation.
The progress of the parameters µk, σk, and wk through the iterations until convergence is shown in Figures 2, 3, and 4, respectively.

Figure 2. The change and convergence of µ1 (shown in blue), µ2 (shown in red), and µ3 (shown in green) over the iterations.
Figure 3. The change and convergence of σ1 (shown in blue), σ2 (shown in red), and σ3 (shown in green) over the iterations.
Figure 4. The change and convergence of w1 (shown in blue), w2 (shown in red), and w3 (shown in green) over the iterations.
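The synthetic sample of this experiment can be reproduced roughly as follows (a sketch under the assumption that the denominators 1.2, 2, and 5 are the standard deviations and that the three subsets are simply concatenated); em_k_gaussians is the hypothetical helper sketched earlier.

    import numpy as np

    rng = np.random.default_rng(1)
    # Three subsets of sizes 700, 1000, and 500, as described above
    x = np.concatenate([
        rng.normal(-10.0, 1.2, size=700),
        rng.normal(0.0, 2.0, size=1000),
        rng.normal(5.0, 5.0, size=500),
    ])
    rng.shuffle(x)

    # mu, sigma, w = em_k_gaussians(x, K=3)  # should land near the values reported above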
Note that for setting the initial values of the parameters in a mixture of Gaussians, one reasonable option is:

    range ← max_i(xi) − min_i(xi),
    µk^(0) ∼ U(min_i(xi), max_i(xi)),    (54)
    σk^(0) ∼ U(0, range/6),    (55)
    wk^(0) ∼ U(0, 1),    (56)

where U(α, β) is the continuous uniform distribution on the range (α, β). This initialization makes sense because, in a normal distribution, the mean belongs to the range of the data and about 99.7% of the data fall in the range (µ − 3σ, µ + 3σ); therefore, the spread of the data is roughly 6σ. This initialization is utilized in the experiment of this section.
The fitted densities and the mixture distribution are depicted in Fig. 5. Comparing this figure with Fig. 1 verifies the correct estimation of the three densities. Figure 5 also shows the mixture distribution, i.e., the weighted summation of the estimated densities.
Moreover, for the sake of better comparison, a single distribution is also fitted to the data using MLE. The MLE estimates of its parameters are µ̂(mle) = x̄ = (1/n) Σ_{i=1}^{n} xi and σ̂²(mle) = (1/n) Σ_{i=1}^{n} (xi − x̄)². This fitted distribution is also depicted in Fig. 5. We can see that this poor estimate has not captured the multi-modality of the data, in contrast to the estimated mixture distribution.

Figure 5. The estimated probability density functions. The estimated mixture includes three different Gaussians shown in blue, red, and green. The dashed purple density is the weighted summation of these three densities, i.e., Σ_{k=1}^{3} wk φ((x − µk)/σk). The dashed brown density is the fitted density whose parameters are estimated by MLE.

5.2. Mixture of Three Poissons

A sample of size n = 2666 is made (see Table 1) for this experiment, where the frequency of the data, displayed in Fig. 6, suggests that the data are sampled from a mixture of three Poissons.

    x          0    1    2    3    4    5    6    7    8    9    10
    frequency  162  267  271  185  111  61   120  210  215  136  73

    x          11   12   13   14   15   16   17   18   19   20
    frequency  43   14   160  230  243  104  36   15   10   0

Table 1. The discrete data for the simulation of fitting a mixture of Poissons.

Figure 6. The frequency of the discrete data sample.

Applying Algorithm 2 and using equations (51), (52), and (41) for a mixture of K = 3 Poissons gives us the estimated values of the parameters:

    λ1 = 1.66,   w1 = 0.328
    λ2 = 6.72,   w2 = 0.256
    λ3 = 12.85,  w3 = 0.416

Comparing the estimates of λ1, λ2, λ3 with Fig. 6 verifies the correctness of the estimation. The progress of the parameters λk and wk through the iterations until convergence is shown in Figures 7 and 8, respectively.
For setting the initial values of the parameters in a mixture of Poissons, one reasonable option is:

    λk^(0) ∼ U(min_i(xi), max_i(xi)),    (57)
    wk^(0) ∼ U(0, 1).    (58)

The reason for this initialization is that the MLE estimate of λ is λ̂(mle) = x̄ = (1/n) Σ_{i=1}^{n} xi, which belongs to the range of the data. This initialization is used in this experiment.
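For reference (my own sketch), the sample behind Table 1 can be reconstructed from the frequencies with np.repeat, and the initialization of equations (57)-(58) can be drawn as follows; a Poisson-specific EM loop (or the generic em_mixture sketch) is then run on this sample.

    import numpy as np

    values = np.arange(21)
    freq = np.array([162, 267, 271, 185, 111, 61, 120, 210, 215, 136, 73,
                     43, 14, 160, 230, 243, 104, 36, 15, 10, 0])
    x = np.repeat(values, freq)               # n = 2666 counts, as in Table 1
    assert x.size == freq.sum() == 2666

    rng = np.random.default_rng(2)
    K = 3
    lam0 = rng.uniform(x.min(), x.max(), size=K)    # equation (57)
    w0 = rng.uniform(0, 1, size=K); w0 /= w0.sum()  # equation (58), normalized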
The fitted mass functions and the mixture distribution are depicted in Fig. 9. Comparing this figure with Fig. 6 verifies the correct estimation of the three mass functions. The mixture distribution, i.e., the weighted summation of the estimated mass functions, is also shown in Fig. 9.
For a better comparison, a single mass function is also fitted to the data using MLE. For that, the parameter λ is estimated using λ̂(mle) = x̄ = (1/n) Σ_{i=1}^{n} xi. This fitted distribution is also depicted in Fig. 9. Again, the poor performance of this single mass function in capturing the multi-modality is obvious.

Figure 7. The change and convergence of λ1 (shown in blue), λ2 (shown in red), and λ3 (shown in green) over the iterations.
Figure 8. The change and convergence of w1 (shown in blue), w2 (shown in red), and w3 (shown in green) over the iterations.
Figure 9. The estimated probability mass functions. The estimated mixture includes three different Poissons shown in blue, red, and green. The purple mass function is the weighted summation of these three mass functions, i.e., Σ_{k=1}^{3} wk e^{−λk} λk^x / x!. The brown mass function is the fitted mass function whose parameter is estimated by MLE.

6. Conclusion

In this paper, a simple-to-understand and step-by-step tutorial on fitting a mixture distribution to data was proposed. The only assumed background was prior knowledge of calculus and basic linear algebra. For more clarification, fitting a mixture of two distributions was introduced first and then generalized to K distributions. Fitting mixtures of Gaussians and Poissons was also covered as examples for the continuous and discrete cases, respectively. Simulations were also shown for further clarification.

Acknowledgment

The authors hugely thank Prof. Mu Zhu for his great course "Statistical Concepts for Data Science". This great course partly covered the materials mentioned in this tutorial paper.

References

Boyd, Stephen and Vandenberghe, Lieven. Convex Optimization. Cambridge University Press, 2004.

Fraley, Chris and Raftery, Adrian E. How many clusters? Which clustering method? Answers via model-based cluster analysis. The Computer Journal, 41(8):578-588, 1998.

Fraley, Chris and Raftery, Adrian E. Model-based clustering, discriminant analysis, and density estimation. Journal of the American Statistical Association, 97(458):611-631, 2002.

Friedman, Jerome, Hastie, Trevor, and Tibshirani, Robert. The Elements of Statistical Learning, volume 2. Springer Series in Statistics, New York, NY, USA, 2009.

Lee, Gyemin and Scott, Clayton. EM algorithms for multivariate Gaussian mixture models with truncated and censored data. Computational Statistics & Data Analysis, 56(9):2816-2829, 2012.