Fitting A Mixture Distribution To Data
f(x) = \lim_{\Delta x \to 0} \frac{P(x \leq X \leq x + \Delta x)}{\Delta x} = \frac{\partial P(X \leq x)}{\partial x}.   (7)

In this work, by mixture of distributions, we imply mixture of mass/density functions.

2.3. Expectation

Expectation means the value of a random variable X on average. Therefore, expectation is a weighted average where the weights are the probabilities of the random variable X taking its different values. In the discrete and continuous cases, the expectation is E[X] = \sum_{x} x f(x) and E[X] = \int x f(x) \, dx, respectively.

2.4. Maximum Likelihood Estimation

The MLE aims to find the parameter \Theta which maximizes the likelihood:

\hat{\Theta} = \arg\max_{\Theta} L(\Theta).   (13)

According to the definition, the likelihood can be written as:

L(\Theta | x_1, \dots, x_n) := f(x_1, \dots, x_n; \Theta) \overset{(a)}{=} \prod_{i=1}^{n} f(x_i; \Theta),   (14)
where (a) is because the x_1, \dots, x_n are iid. Note that in literature, the L(\Theta | x_1, \dots, x_n) is also denoted by L(\Theta) for simplicity.

Usually, for more convenience, we use the log-likelihood rather than the likelihood:

\ell(\Theta) := \log L(\Theta)   (15)
= \log \prod_{i=1}^{n} f(x_i; \Theta) = \sum_{i=1}^{n} \log f(x_i; \Theta).   (16)

Often, the logarithm is a natural logarithm for the sake of compatibility with the exponential in the well-known normal density function. Notice that as the logarithm function is monotonic, it does not change the location of maximization of the likelihood.
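As a small numerical illustration of equations (13)-(16), the following Python sketch maximizes the log-likelihood of a single Poisson distribution over a grid of candidate parameters. The synthetic sample, the grid range, and the function names are arbitrary choices made only for this illustration.

import numpy as np

# Maximum likelihood estimation for a single Poisson distribution:
# the grid search below approximates equation (13) numerically.
rng = np.random.default_rng(0)
x = rng.poisson(lam=3.0, size=500)          # illustrative sample, true parameter 3.0

def log_likelihood(lam, x):
    # l(lambda) = sum_i log f(x_i; lambda), dropping the lambda-independent
    # term -log(x_i!) of the Poisson mass function (it does not move the maximum).
    return np.sum(x * np.log(lam) - lam)

grid = np.linspace(0.5, 8.0, 2000)          # candidate parameters
lam_hat = grid[np.argmax([log_likelihood(lam, x) for lam in grid])]
print(lam_hat, x.mean())                    # both are close to the true parameter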
2.5. Expectation Maximization

Sometimes, the data are not fully observable. For example, the data may be known only to be either zero or greater than zero. As an illustration, assume the data are collected for a particular disease but, for the convenience of the patients who participated in the survey, the severity of the disease is not recorded; only the existence or non-existence of the disease is reported. So, the data do not give us complete information, as X_i > 0 does not tell us whether X_i = 2 or X_i = 1000.

In this case, MLE cannot be directly applied because we do not have access to the complete information and some data are missing. Here, Expectation Maximization (EM) is useful. The main idea of EM can be summarized in this short friendly conversation:

– What shall we do? The data is missing! The log-likelihood is not known completely, so MLE cannot be used.
– Mmm, probably we can replace the missing data with something...
– Aha! Let us replace it with its mean.
– You are right! We can take the mean of the log-likelihood over the possible values of the missing data. Then everything in the log-likelihood will be known, and then...
– And then we can do MLE!

Assume D^{(obs)} and D^{(miss)} denote the observed data (the X_i's = 0 in the above example) and the missing data (the X_i's > 0 in the above example). The EM algorithm includes two main steps, i.e., the E-step and the M-step.

In the E-step, the expectation of the log-likelihood (equation (15)) is taken with respect to the missing data D^{(miss)} in order to have a mean estimation of it. Let Q(\Theta) denote the expectation of the likelihood with respect to D^{(miss)}:

Q(\Theta) := \mathbb{E}_{D^{(miss)} | D^{(obs)}, \Theta} [\ell(\Theta)].   (17)

Note that in the above expectation, D^{(obs)} and \Theta are conditioned on, so they are treated as constants and not as random variables.

In the M-step, the MLE approach is used, where the log-likelihood is replaced with its expectation, i.e., Q(\Theta); therefore:

\hat{\Theta} = \arg\max_{\Theta} Q(\Theta).   (18)

These two steps are iteratively repeated until convergence of the estimated parameters \hat{\Theta}.
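The E-step and M-step of equations (17) and (18) can be organized as a simple loop. The sketch below is one possible skeleton, assuming the caller supplies concrete e_step and m_step functions; those names, the tolerance, and the stopping rule are assumptions made for illustration and are not prescribed by the text above.

import numpy as np

def expectation_maximization(x, theta0, e_step, m_step, tol=1e-6, max_iter=300):
    # Iterate the two steps until the parameters stop changing.
    theta = theta0
    for _ in range(max_iter):
        expectations = e_step(x, theta)       # E-step: expectation over the missing data, equation (17)
        theta_new = m_step(x, expectations)   # M-step: maximize Q(Theta), equation (18)
        if np.max(np.abs(np.asarray(theta_new) - np.asarray(theta))) < tol:
            return theta_new                  # converged
        theta = theta_new
    return theta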
2.6. Lagrange Multiplier

Suppose we have a multi-variate function Q(\Theta_1, \dots, \Theta_K) (called the "objective function") and we want to maximize (or minimize) it. However, this optimization is constrained and its constraint is the equality P(\Theta_1, \dots, \Theta_K) = c, where c is a constant. So, the constrained optimization problem is:

\max_{\Theta_1, \dots, \Theta_K} Q(\Theta_1, \dots, \Theta_K), \quad \text{subject to} \quad P(\Theta_1, \dots, \Theta_K) = c.   (19)

For solving this problem, we can introduce a new variable \alpha which is called the "Lagrange multiplier". Also, a new function L(\Theta_1, \dots, \Theta_K, \alpha), called the "Lagrangian", is introduced:

L(\Theta_1, \dots, \Theta_K, \alpha) = Q(\Theta_1, \dots, \Theta_K) - \alpha \big( P(\Theta_1, \dots, \Theta_K) - c \big).   (20)

Maximizing (or minimizing) this Lagrangian function gives us the solution to the optimization problem (Boyd & Vandenberghe, 2004):

\nabla_{\Theta_1, \dots, \Theta_K, \alpha} L \overset{\text{set}}{=} 0,   (21)

which gives us:

\nabla_{\Theta_1, \dots, \Theta_K} L \overset{\text{set}}{=} 0 \implies \nabla_{\Theta_1, \dots, \Theta_K} Q = \alpha \nabla_{\Theta_1, \dots, \Theta_K} P,
\nabla_{\alpha} L \overset{\text{set}}{=} 0 \implies P(\Theta_1, \dots, \Theta_K) = c.
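As a toy instance of equations (19)-(21), the sketch below maximizes Q(x, y) = x + y subject to x^2 + y^2 = 1 with sympy; the objective, the constraint, and the variable names are chosen only for this example.

import sympy as sp

x, y, alpha = sp.symbols('x y alpha', real=True)
Q = x + y                       # objective function
P = x**2 + y**2                 # constraint function, with c = 1
L = Q - alpha * (P - 1)         # Lagrangian, as in equation (20)

# Set the gradient of L to zero (equation (21)) and solve.
stationary = sp.solve([sp.diff(L, v) for v in (x, y, alpha)], [x, y, alpha], dict=True)
best = max(stationary, key=lambda s: float(Q.subs(s)))
print(best)                     # x = y = sqrt(2)/2: the constrained maximum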
3. Fitting A Mixture Distribution

As was mentioned in the introduction, the goal of fitting a mixture distribution is to find the parameters and weights of a weighted summation of distributions (see equation (1)). First, as a special case of mixture distributions, we work on a mixture of two distributions and then we discuss the general mixture of distributions.

3.1. Mixture of Two Distributions

Assume that we want to fit a mixture of two distributions g_1(x; \Theta_1) and g_2(x; \Theta_2) to the data. Note that, in theory, these two distributions are not necessarily from the same distribution family. As we have only two distributions in the mixture, equation (1) is simplified to:

f(x; \Theta_1, \Theta_2) = w \, g_1(x; \Theta_1) + (1 - w) \, g_2(x; \Theta_2).   (22)
Note that the parameter w (or w_k in general) is called the "mixing probability" (Friedman et al., 2009) and is sometimes denoted by \pi (or \pi_k in general) in the literature.

The likelihood and log-likelihood for this mixture are:

L(\Theta_1, \Theta_2) = f(x_1, \dots, x_n; \Theta_1, \Theta_2) \overset{(a)}{=} \prod_{i=1}^{n} f(x_i; \Theta_1, \Theta_2)
= \prod_{i=1}^{n} \big[ w \, g_1(x_i; \Theta_1) + (1 - w) \, g_2(x_i; \Theta_2) \big],

\ell(\Theta_1, \Theta_2) = \sum_{i=1}^{n} \log \big[ w \, g_1(x_i; \Theta_1) + (1 - w) \, g_2(x_i; \Theta_2) \big],

where (a) is because of the assumption that x_1, \dots, x_n are iid. Optimizing this log-likelihood is difficult because of the summation within the logarithm. However, we can use a nice trick here (Friedman et al., 2009): let \Delta_i be defined as:

\Delta_i := \begin{cases} 1 & \text{if } x_i \text{ belongs to } g_1(x; \Theta_1), \\ 0 & \text{if } x_i \text{ belongs to } g_2(x; \Theta_2), \end{cases}

and its probability be:

P(\Delta_i = 1) = w,
P(\Delta_i = 0) = 1 - w.

Therefore, the log-likelihood can be written as:

\ell(\Theta_1, \Theta_2) = \begin{cases} \sum_{i=1}^{n} \log \big[ w \, g_1(x_i; \Theta_1) \big] & \text{if } \Delta_i = 1, \\ \sum_{i=1}^{n} \log \big[ (1 - w) \, g_2(x_i; \Theta_2) \big] & \text{if } \Delta_i = 0. \end{cases}

The above expression can be restated as:

\ell(\Theta_1, \Theta_2) = \sum_{i=1}^{n} \Big[ \Delta_i \log \big( w \, g_1(x_i; \Theta_1) \big) + (1 - \Delta_i) \log \big( (1 - w) \, g_2(x_i; \Theta_2) \big) \Big].

The \Delta_i here is the incomplete (missing) datum because we do not know whether \Delta_i = 0 or \Delta_i = 1 for x_i. Hence, using the EM algorithm, we try to estimate it by its expectation.

The E-step in EM:

Q(\Theta_1, \Theta_2) = \sum_{i=1}^{n} \Big[ \mathbb{E}[\Delta_i | X, \Theta_1, \Theta_2] \log \big( w \, g_1(x_i; \Theta_1) \big) + \mathbb{E}[(1 - \Delta_i) | X, \Theta_1, \Theta_2] \log \big( (1 - w) \, g_2(x_i; \Theta_2) \big) \Big].

Notice that the above expressions are linear with respect to \Delta_i and that is why the two logarithms were factored out. Assume \hat{\gamma}_i := \mathbb{E}[\Delta_i | X, \Theta_1, \Theta_2], which is called the "responsibility" of x_i (Friedman et al., 2009).

The \Delta_i is either 0 or 1; therefore:

\mathbb{E}[\Delta_i | X, \Theta_1, \Theta_2] = 0 \times P(\Delta_i = 0 | X, \Theta_1, \Theta_2) + 1 \times P(\Delta_i = 1 | X, \Theta_1, \Theta_2) = P(\Delta_i = 1 | X, \Theta_1, \Theta_2).

According to Bayes' rule (equation (5)), we have:

P(\Delta_i = 1 | X, \Theta_1, \Theta_2) = \frac{P(X, \Theta_1, \Theta_2, \Delta_i = 1)}{P(X; \Theta_1, \Theta_2)} = \frac{P(X, \Theta_1, \Theta_2 | \Delta_i = 1) \, P(\Delta_i = 1)}{\sum_{j=0}^{1} P(X, \Theta_1, \Theta_2 | \Delta_i = j) \, P(\Delta_i = j)}.

The marginal probability in the denominator is:

P(X; \Theta_1, \Theta_2) = w \, g_1(x_i; \Theta_1) + (1 - w) \, g_2(x_i; \Theta_2).

Thus:

\hat{\gamma}_i = \frac{\hat{w} \, g_1(x_i; \Theta_1)}{\hat{w} \, g_1(x_i; \Theta_1) + (1 - \hat{w}) \, g_2(x_i; \Theta_2)},   (23)

and

Q(\Theta_1, \Theta_2) = \sum_{i=1}^{n} \Big[ \hat{\gamma}_i \log \big( w \, g_1(x_i; \Theta_1) \big) + (1 - \hat{\gamma}_i) \log \big( (1 - w) \, g_2(x_i; \Theta_2) \big) \Big].   (24)
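Equation (23) is easy to evaluate elementwise once the two densities are computed at every sample. The sketch below does so for a mixture of two Gaussian densities; the choice of Gaussians, the sample, and the value w = 0.5 are illustrative assumptions.

import numpy as np

def gaussian_pdf(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

def responsibilities(w, g1_vals, g2_vals):
    # Equation (23): posterior probability that each sample came from g1.
    num = w * g1_vals
    return num / (num + (1 - w) * g2_vals)

x = np.array([-2.1, -1.9, 0.2, 2.0, 2.3])
gamma = responsibilities(0.5, gaussian_pdf(x, -2.0, 1.0), gaussian_pdf(x, 2.0, 1.0))
print(gamma)   # close to 1 for points near -2, close to 0 for points near +2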
Some simplification of Q(\Theta_1, \Theta_2) will help in the next step:

Q(\Theta_1, \Theta_2) = \sum_{i=1}^{n} \Big[ \hat{\gamma}_i \log w + \hat{\gamma}_i \log g_1(x_i; \Theta_1) + (1 - \hat{\gamma}_i) \log (1 - w) + (1 - \hat{\gamma}_i) \log g_2(x_i; \Theta_2) \Big].

The M-step in EM:

\hat{\Theta}_1, \hat{\Theta}_2, \hat{w} = \arg\max_{\Theta_1, \Theta_2, w} Q(\Theta_1, \Theta_2, w).

Note that the function Q(\Theta_1, \Theta_2) is also a function of w and that is why we wrote it as Q(\Theta_1, \Theta_2, w).

\frac{\partial Q}{\partial \Theta_1} = \sum_{i=1}^{n} \Big[ \frac{\hat{\gamma}_i}{g_1(x_i; \Theta_1)} \frac{\partial g_1(x_i; \Theta_1)}{\partial \Theta_1} \Big] \overset{\text{set}}{=} 0,   (25)

\frac{\partial Q}{\partial \Theta_2} = \sum_{i=1}^{n} \Big[ \frac{1 - \hat{\gamma}_i}{g_2(x_i; \Theta_2)} \frac{\partial g_2(x_i; \Theta_2)}{\partial \Theta_2} \Big] \overset{\text{set}}{=} 0,   (26)

\frac{\partial Q}{\partial w} = \sum_{i=1}^{n} \Big[ \hat{\gamma}_i \Big( \frac{1}{w} \Big) + (1 - \hat{\gamma}_i) \Big( \frac{-1}{1 - w} \Big) \Big] \overset{\text{set}}{=} 0 \implies \hat{w} = \frac{1}{n} \sum_{i=1}^{n} \hat{\gamma}_i.   (27)
As an example, consider fitting a mixture of two Poisson distributions, g_1(x; \lambda_1) = \frac{e^{-\lambda_1} \lambda_1^x}{x!} and g_2(x; \lambda_2) = \frac{e^{-\lambda_2} \lambda_2^x}{x!} (see equation (34)). Taking derivatives of Q with respect to \lambda_1 and \lambda_2 gives:

\frac{\partial Q}{\partial \lambda_1} = \sum_{i=1}^{n} \Big[ \hat{\gamma}_i \Big( -1 + \frac{x_i}{\lambda_1} \Big) \Big] \overset{\text{set}}{=} 0 \implies \hat{\lambda}_1 = \frac{\sum_{i=1}^{n} \hat{\gamma}_i x_i}{\sum_{i=1}^{n} \hat{\gamma}_i},   (36)

\frac{\partial Q}{\partial \lambda_2} = \sum_{i=1}^{n} \Big[ (1 - \hat{\gamma}_i) \Big( -1 + \frac{x_i}{\lambda_2} \Big) \Big] \overset{\text{set}}{=} 0 \implies \hat{\lambda}_2 = \frac{\sum_{i=1}^{n} (1 - \hat{\gamma}_i) x_i}{\sum_{i=1}^{n} (1 - \hat{\gamma}_i)},   (37)

and \hat{w} is the same as equation (27).

Iteratively solving equations (35), (36), (37), and (27) using Algorithm (1) gives us the estimations for \hat{\lambda}_1, \hat{\lambda}_2, and \hat{w} in equation (34).
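A compact sketch of this iterative scheme for the two-Poisson case is given below, alternating the E-step of equation (23) with the updates (36), (37), and (27). The initial values, the number of iterations, and the synthetic sample are assumptions made only for illustration; the paper's Algorithm (1) is the reference procedure.

import numpy as np
from math import lgamma

def poisson_pmf(x, lam):
    x = np.asarray(x, dtype=float)
    log_p = x * np.log(lam) - lam - np.array([lgamma(v + 1.0) for v in x])
    return np.exp(log_p)

def fit_two_poissons(x, lam1=1.0, lam2=10.0, w=0.5, n_iter=200):
    for _ in range(n_iter):
        # E-step: responsibilities, equation (23)
        num = w * poisson_pmf(x, lam1)
        gamma = num / (num + (1 - w) * poisson_pmf(x, lam2))
        # M-step: equations (36), (37), and (27)
        lam1 = np.sum(gamma * x) / np.sum(gamma)
        lam2 = np.sum((1 - gamma) * x) / np.sum(1 - gamma)
        w = gamma.mean()
    return lam1, lam2, w

rng = np.random.default_rng(1)
x = np.concatenate([rng.poisson(2.0, 300), rng.poisson(9.0, 700)])
print(fit_two_poissons(x))   # roughly (2, 9, 0.3)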
3.2. Mixture of Several Distributions

Now, assume a more general case where we want to fit a mixture of K distributions g_1(x; \Theta_1), \dots, g_K(x; \Theta_K) to the data. Again, in theory, these K distributions are not necessarily from the same distribution family. For more convenience of the reader, equation (1) is repeated here:

f(x; \Theta_1, \dots, \Theta_K) = \sum_{k=1}^{K} w_k \, g_k(x; \Theta_k), \quad \text{subject to} \quad \sum_{k=1}^{K} w_k = 1.

The likelihood and log-likelihood for this mixture are:

L(\Theta_1, \dots, \Theta_K) = \prod_{i=1}^{n} \sum_{k=1}^{K} w_k \, g_k(x_i; \Theta_k),
\ell(\Theta_1, \dots, \Theta_K) = \sum_{i=1}^{n} \log \sum_{k=1}^{K} w_k \, g_k(x_i; \Theta_k).

As before, the summation within the logarithm makes direct optimization difficult, so we define the indicator \Delta_{i,k}, which is one if x_i belongs to g_k(x; \Theta_k) and zero otherwise, with P(\Delta_{i,k} = 1) = w_k. Therefore, the log-likelihood can be written as:

\ell(\Theta_1, \dots, \Theta_K) = \begin{cases} \sum_{i=1}^{n} \log \big( w_1 \, g_1(x_i; \Theta_1) \big) & \text{if } \Delta_{i,1} = 1 \text{ and } \Delta_{i,k} = 0 \; \forall k \neq 1, \\ \sum_{i=1}^{n} \log \big( w_2 \, g_2(x_i; \Theta_2) \big) & \text{if } \Delta_{i,2} = 1 \text{ and } \Delta_{i,k} = 0 \; \forall k \neq 2, \\ \quad \vdots \\ \sum_{i=1}^{n} \log \big( w_K \, g_K(x_i; \Theta_K) \big) & \text{if } \Delta_{i,K} = 1 \text{ and } \Delta_{i,k} = 0 \; \forall k \neq K. \end{cases}

The above expression can be restated as:

\ell(\Theta_1, \dots, \Theta_K) = \sum_{i=1}^{n} \sum_{k=1}^{K} \Delta_{i,k} \log \big( w_k \, g_k(x_i; \Theta_k) \big).

The \Delta_{i,k} here is the incomplete (missing) datum because we do not know whether \Delta_{i,k} = 0 or \Delta_{i,k} = 1 for x_i and a specific k. Therefore, using the EM algorithm, we try to estimate it by its expectation.

The E-step in EM:

Q(\Theta_1, \dots, \Theta_K) = \sum_{i=1}^{n} \sum_{k=1}^{K} \mathbb{E}[\Delta_{i,k} | X, \Theta_1, \dots, \Theta_K] \, \log \big( w_k \, g_k(x_i; \Theta_k) \big).
Similar to equation (23), the responsibility \hat{\gamma}_{i,k} := \mathbb{E}[\Delta_{i,k} | X, \Theta_1, \dots, \Theta_K] is obtained by Bayes' rule as \hat{\gamma}_{i,k} = \hat{w}_k \, g_k(x_i; \hat{\Theta}_k) / \sum_{k'=1}^{K} \hat{w}_{k'} \, g_{k'}(x_i; \hat{\Theta}_{k'}), and

Q(\Theta_1, \dots, \Theta_K) = \sum_{i=1}^{n} \sum_{k=1}^{K} \hat{\gamma}_{i,k} \log \big( w_k \, g_k(x_i; \Theta_k) \big).   (39)

Some simplification of Q(\Theta_1, \dots, \Theta_K) will help in the next step:

Q(\Theta_1, \dots, \Theta_K) = \sum_{i=1}^{n} \sum_{k=1}^{K} \Big[ \hat{\gamma}_{i,k} \log w_k + \hat{\gamma}_{i,k} \log g_k(x_i; \Theta_k) \Big].
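In code, the responsibilities for all n samples and K components form an (n, K) matrix whose rows sum to one. A minimal sketch, assuming the densities g_k(x_i; Theta_k) have already been evaluated into a matrix (the numbers below are placeholders):

import numpy as np

def responsibilities_k(densities, weights):
    # densities: shape (n, K) holding g_k(x_i; Theta_k); weights: shape (K,)
    weighted = densities * weights                     # w_k * g_k(x_i; Theta_k)
    return weighted / weighted.sum(axis=1, keepdims=True)

densities = np.array([[0.20, 0.01, 0.03],
                      [0.02, 0.15, 0.05]])
weights = np.array([0.5, 0.3, 0.2])
gamma = responsibilities_k(densities, weights)
print(gamma, gamma.sum(axis=1))   # each row sums to one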
The M-step in EM:

\hat{\Theta}_k, \hat{w}_k = \arg\max_{\Theta_k, w_k} Q(\Theta_1, \dots, \Theta_K, w_1, \dots, w_K), \quad \text{subject to} \quad \sum_{k=1}^{K} w_k = 1.

Note that the function Q(\Theta_1, \dots, \Theta_K) is also a function of w_1, \dots, w_K and that is why we wrote it as Q(\Theta_1, \dots, \Theta_K, w_1, \dots, w_K). Because of the constraint, we use the Lagrange multiplier of Section 2.6:

L(\Theta_1, \dots, \Theta_K, w_1, \dots, w_K, \alpha) = Q(\Theta_1, \dots, \Theta_K, w_1, \dots, w_K) - \alpha \Big( \sum_{k=1}^{K} w_k - 1 \Big)
= \sum_{i=1}^{n} \sum_{k=1}^{K} \Big[ \hat{\gamma}_{i,k} \log w_k + \hat{\gamma}_{i,k} \log g_k(x_i; \Theta_k) \Big] - \alpha \Big( \sum_{k=1}^{K} w_k - 1 \Big).

\frac{\partial L}{\partial \Theta_k} = \sum_{i=1}^{n} \Big[ \frac{\hat{\gamma}_{i,k}}{g_k(x_i; \Theta_k)} \frac{\partial g_k(x_i; \Theta_k)}{\partial \Theta_k} \Big] \overset{\text{set}}{=} 0,   (40)

\frac{\partial L}{\partial w_k} = \sum_{i=1}^{n} \frac{\hat{\gamma}_{i,k}}{w_k} - \alpha \overset{\text{set}}{=} 0 \implies w_k = \frac{1}{\alpha} \sum_{i=1}^{n} \hat{\gamma}_{i,k},

\frac{\partial L}{\partial \alpha} = \sum_{k=1}^{K} w_k - 1 \overset{\text{set}}{=} 0 \implies \sum_{k=1}^{K} w_k = 1,

\therefore \quad \sum_{k=1}^{K} \frac{1}{\alpha} \sum_{i=1}^{n} \hat{\gamma}_{i,k} = 1 \implies \alpha = \sum_{i=1}^{n} \sum_{k=1}^{K} \hat{\gamma}_{i,k},

\therefore \quad \hat{w}_k = \frac{\sum_{i=1}^{n} \hat{\gamma}_{i,k}}{\sum_{i=1}^{n} \sum_{k'=1}^{K} \hat{\gamma}_{i,k'}}.   (41)
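Equation (41) has a convenient matrix form: because every row of the responsibility matrix sums to one, the denominator equals n, so each mixing probability is simply the column average of the responsibilities. A small sketch with made-up numbers:

import numpy as np

def update_weights(gamma):
    # gamma: shape (n, K), rows summing to one; equation (41)
    return gamma.sum(axis=0) / gamma.sum()

gamma = np.array([[0.9, 0.1, 0.0],
                  [0.2, 0.7, 0.1],
                  [0.1, 0.2, 0.7]])
print(update_weights(gamma))   # equivalent to gamma.mean(axis=0)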
For a mixture of K univariate Gaussian distributions g_k(x; \mu_k, \sigma_k^2), the Lagrangian becomes:

L(\mu_1, \dots, \mu_K, \sigma_1^2, \dots, \sigma_K^2, w_1, \dots, w_K, \alpha)
= \sum_{i=1}^{n} \sum_{k=1}^{K} \Big[ \hat{\gamma}_{i,k} \log w_k + \hat{\gamma}_{i,k} \Big( -\frac{1}{2} \log(2\pi) - \log \sigma_k - \frac{(x_i - \mu_k)^2}{2 \sigma_k^2} \Big) \Big] - \alpha \Big( \sum_{k=1}^{K} w_k - 1 \Big).

Therefore:

\frac{\partial L}{\partial \mu_k} = \sum_{i=1}^{n} \Big[ \hat{\gamma}_{i,k} \Big( \frac{x_i - \mu_k}{\sigma_k^2} \Big) \Big] \overset{\text{set}}{=} 0 \implies \hat{\mu}_k = \frac{\sum_{i=1}^{n} \hat{\gamma}_{i,k} \, x_i}{\sum_{i=1}^{n} \hat{\gamma}_{i,k}},   (44)

\frac{\partial L}{\partial \sigma_k} = \sum_{i=1}^{n} \Big[ \hat{\gamma}_{i,k} \Big( \frac{-1}{\sigma_k} + \frac{(x_i - \mu_k)^2}{\sigma_k^3} \Big) \Big] \overset{\text{set}}{=} 0 \implies \hat{\sigma}_k^2 = \frac{\sum_{i=1}^{n} \hat{\gamma}_{i,k} (x_i - \hat{\mu}_k)^2}{\sum_{i=1}^{n} \hat{\gamma}_{i,k}},   (45)

and \hat{w}_k is the same as equation (41).

Iteratively solving equations (43), (44), (45), and (41) using Algorithm (2) gives us the estimations for \hat{\mu}_1, \dots, \hat{\mu}_K, \hat{\sigma}_1, \dots, \hat{\sigma}_K, and \hat{w}_1, \dots, \hat{w}_K in equation (42).
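The updates (44), (45), and (41) are weighted averages and can be written in a few lines. The sample and the responsibility matrix below are made up for illustration; in practice the responsibilities come from the E-step.

import numpy as np

def m_step_univariate_gaussians(x, gamma):
    # x: shape (n,); gamma: shape (n, K) responsibilities
    nk = gamma.sum(axis=0)                                      # effective count per component
    mu = (gamma * x[:, None]).sum(axis=0) / nk                  # equation (44)
    var = (gamma * (x[:, None] - mu) ** 2).sum(axis=0) / nk     # equation (45)
    w = nk / nk.sum()                                           # equation (41)
    return mu, var, w

x = np.array([-3.1, -2.9, 0.1, 3.0, 3.2])
gamma = np.array([[1.0, 0.0], [1.0, 0.0], [0.5, 0.5], [0.0, 1.0], [0.0, 1.0]])
print(m_step_univariate_gaussians(x, gamma))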
For a mixture of K multivariate Gaussian distributions in dimension d, the expectation of the complete log-likelihood is:

Q(\mu_1, \dots, \mu_K, \Sigma_1, \dots, \Sigma_K) = \sum_{i=1}^{n} \sum_{k=1}^{K} \Big[ \hat{\gamma}_{i,k} \log w_k + \hat{\gamma}_{i,k} \Big( -\frac{d}{2} \log(2\pi) - \frac{1}{2} \log |\Sigma_k| - \frac{1}{2} \mathrm{tr}\big( (x_i - \mu_k)^\top \Sigma_k^{-1} (x_i - \mu_k) \big) \Big) \Big],

where \mathrm{tr}(\cdot) denotes the trace of a matrix. The trace is used here because (x_i - \mu_k)^\top \Sigma_k^{-1} (x_i - \mu_k) is a scalar, so it is equal to its trace.

The Lagrangian is:

L(\mu_1, \dots, \mu_K, \Sigma_1, \dots, \Sigma_K, w_1, \dots, w_K, \alpha)
= \sum_{i=1}^{n} \sum_{k=1}^{K} \Big[ \hat{\gamma}_{i,k} \log w_k + \hat{\gamma}_{i,k} \Big( -\frac{d}{2} \log(2\pi) - \frac{1}{2} \log |\Sigma_k| - \frac{1}{2} \mathrm{tr}\big( (x_i - \mu_k)^\top \Sigma_k^{-1} (x_i - \mu_k) \big) \Big) \Big] - \alpha \Big( \sum_{k=1}^{K} w_k - 1 \Big).
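Maximizing this Lagrangian yields weighted sample means and weighted sample covariance matrices, the multivariate analogues of equations (44) and (45). These closed forms are standard results and are only sketched below, without the derivation; the inputs are made up for illustration.

import numpy as np

def m_step_multivariate_gaussians(X, gamma):
    # X: shape (n, d); gamma: shape (n, K) responsibilities
    nk = gamma.sum(axis=0)
    mu = (gamma.T @ X) / nk[:, None]                  # (K, d) weighted means
    K, d = mu.shape
    sigma = np.empty((K, d, d))
    for k in range(K):
        diff = X - mu[k]                              # (n, d)
        sigma[k] = (gamma[:, k, None] * diff).T @ diff / nk[k]   # weighted covariance
    w = nk / nk.sum()                                 # equation (41)
    return mu, sigma, w

X = np.array([[0.0, 0.1], [0.2, -0.1], [5.0, 4.9], [5.1, 5.2]])
gamma = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]])
mu, sigma, w = m_step_multivariate_gaussians(X, gamma)
print(mu, w)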
Figure 1. The original probability density functions from which the sample is drawn. The mixture includes three different Gaussians shown in blue, red, and green.

Figure 2. The change and convergence of \mu_1 (shown in blue), \mu_2 (shown in red), and \mu_3 (shown in green) over the iterations.

Figure 3. The change and convergence of \sigma_1 (shown in blue), \sigma_2 (shown in red), and \sigma_3 (shown in green) over the iterations.

Figure 4. The change and convergence of w_1 (shown in blue), w_2 (shown in red), and w_3 (shown in green) over the iterations.
Figure 6. The frequency of the discrete data sample:

x          0    1    2    3    4    5    6    7    8    9    10
frequency  162  267  271  185  111  61   120  210  215  136  73

x          11   12   13   14   15   16   17   18   19   20
frequency  43   14   160  230  243  104  36   15   10   0
Figure 7. The change and convergence of \lambda_1 (shown in blue), \lambda_2 (shown in red), and \lambda_3 (shown in green) over the iterations.

Figure 8. The change and convergence of w_1 (shown in blue), w_2 (shown in red), and w_3 (shown in green) over the iterations.

Figure 9. The estimated probability mass functions. The estimated mixture includes three different Poissons shown in blue, red, and green. The purple density is the weighted summation of these three densities, i.e., \sum_{k=1}^{3} w_k \frac{e^{-\lambda_k} \lambda_k^x}{x!}. The brown density is the fitted density whose parameter is estimated by MLE.

For comparison, a single Poisson mass function is also fitted to the whole sample; its parameter is estimated using \hat{\lambda}^{(\text{mle})} = \bar{x} = (1/n) \sum_{i=1}^{n} x_i. This fitted distribution is also depicted in Fig. 9. Again, the poor performance of this single mass function in capturing the multi-modality is obvious.
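For the discrete sample of Fig. 6, the same E-step/M-step iterations can be run for a mixture of three Poissons. The sketch below does so; the initial values of lambda_k and w_k are arbitrary assumptions (not the settings behind the reported figures), and the single-Poisson MLE x-bar is printed for comparison.

import numpy as np
from math import lgamma

counts = np.array([162, 267, 271, 185, 111, 61, 120, 210, 215, 136, 73,
                   43, 14, 160, 230, 243, 104, 36, 15, 10, 0])
x = np.repeat(np.arange(21), counts)                 # expand the frequency table into samples

def log_poisson_pmf(x, lam):
    return x * np.log(lam) - lam - np.array([lgamma(v + 1.0) for v in x])

lam = np.array([1.0, 7.0, 16.0])                     # assumed initial parameters
w = np.array([1 / 3, 1 / 3, 1 / 3])
for _ in range(300):
    # E-step: responsibilities gamma_{i,k}
    log_num = np.log(w) + np.stack([log_poisson_pmf(x, l) for l in lam], axis=1)
    num = np.exp(log_num)
    gamma = num / num.sum(axis=1, keepdims=True)
    # M-step: lambda_k as weighted means, w_k as in equation (41)
    lam = (gamma * x[:, None]).sum(axis=0) / gamma.sum(axis=0)
    w = gamma.mean(axis=0)

print(lam, w)      # roughly one Poisson per mode of the histogram
print(x.mean())    # the single-Poisson MLE lambda_hat = x_bar, for comparison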
6. Conclusion

In this paper, a simple-to-understand and step-by-step tutorial on fitting a mixture distribution to data was proposed. The only assumed prior knowledge was calculus and basic linear algebra. For clarity, fitting a mixture of two distributions was introduced first and then generalized to a mixture of K distributions. Mixtures of Gaussians and of Poissons were also covered as examples for the continuous and discrete cases, respectively. Simulations were also shown for further clarification.
Acknowledgment

The authors hugely thank Prof. Mu Zhu for his great course "Statistical Concepts for Data Science", which partly covered the materials mentioned in this tutorial paper.

References

Boyd, Stephen and Vandenberghe, Lieven. Convex optimization. Cambridge University Press, 2004.

Fraley, Chris and Raftery, Adrian E. How many clusters? Which clustering method? Answers via model-based cluster analysis. The Computer Journal, 41(8):578–588, 1998.

Fraley, Chris and Raftery, Adrian E. Model-based clustering, discriminant analysis, and density estimation. Journal of the American Statistical Association, 97(458):611–631, 2002.

Friedman, Jerome, Hastie, Trevor, and Tibshirani, Robert. The elements of statistical learning, volume 2. Springer Series in Statistics, New York, NY, USA, 2009.
Lee, Gyemin and Scott, Clayton. EM algorithms for multivariate Gaussian mixture models with truncated and censored data. Computational Statistics & Data Analysis, 56(9):2816–2829, 2012.