
Statistical Inference III

(Expectation Maximization (EM) Algorithm)

Mohammad Samsul Alam


Assistant Professor of Applied Statistics
Institute of Statistical Research and Training (ISRT)
University of Dhaka

https://www.isrt.ac.bd/people/msalam

Email: [email protected]


Background I
The maximum likelihood estimate (mle) of a population quantity θ is the value of θ for which the probability of observing the data in hand is maximum. It is obtained by maximizing the likelihood function L(θ|y) = f(y_1, y_2, ..., y_n | θ) with respect to θ. In notation,

    θ̂ = arg max_θ L(θ|y)                                                                    (1)

This approach implicitly assumes that the data vector y is complete, in the sense that there are no missing values.
Now suppose the data set is not complete, so that we can write the complete data as y = {y_o, y_m}.
In such a case, a simple approach is to discard the y_m observations and obtain the mle using y_o alone.
Background II
This approach is not always good practice, because we are not using the information in y_m while estimating the parameter θ.
Dempster, Laird, and Rubin (1977) formalized a two-step iterative procedure for obtaining the mle in such cases.
The first step is the Expectation step and the second is the Maximization step; hence the name EM algorithm.
The EM algorithm uses two likelihood functions: one is the complete data likelihood L(θ | y = {y_o, y_m}), and the other is the incomplete data likelihood, or observed data likelihood, L(θ | y_o).
The main idea is to maximize the observed data likelihood by repeatedly maximizing the (expected) complete data likelihood.



EM Algorithm I

Assume that, among the n units of the sample, n_1 observations have been observed and are the elements of y_o, while the other n_2 = n − n_1 are the elements of y_m.
Also assume that the elements of y_o are iid with common probability distribution P_θ(·), where θ ∈ Ω, and that the elements of y_o and y_m are mutually independent.
From the definition of conditional probability we can write

    P_θ(y_m | y_o) = P_θ(y_o, y_m) / P_θ(y_o),

which we can rewrite as

    log P_θ(y_o) = log P_θ(y_o, y_m) − log P_θ(y_m | y_o).


EM Algorithm II
Let θ_k be a particular realization of θ obtained at the k-th iteration. Multiplying both sides by P_{θ_k}(y_m | y_o) and integrating over y_m gives

    ∫ log P_θ(y_o) P_{θ_k}(y_m | y_o) dy_m = ∫ log P_θ(y_o, y_m) P_{θ_k}(y_m | y_o) dy_m
                                             − ∫ log P_θ(y_m | y_o) P_{θ_k}(y_m | y_o) dy_m   (2)

Using the facts

    ∫ log P_θ(y_o) P_{θ_k}(y_m | y_o) dy_m = log P_θ(y_o)   and   E_θ[g(X)] = ∫ g(x) f(x; θ) dx,

equation (2) can be written as


EM Algorithm III
    log P_θ(y_o) = E_{θ_k}[log P_θ(y_o, y_m) | y_o] − E_{θ_k}[log P_θ(y_m | y_o) | y_o].

This can be written compactly as

    l_θ(y_o) = Q(θ, θ_k) − ν(θ, θ_k)                                                         (3)

by letting

    Q(θ, θ_k) = E_{θ_k}[log P_θ(y_o, y_m) | y_o],
    ν(θ, θ_k) = E_{θ_k}[log P_θ(y_m | y_o) | y_o].

The EM algorithm maximizes the observed data log-likelihood in (3), the left-hand side, by maximizing the complete data log-likelihood.



EM Algorithm IV

Computing the expectation in the first term on the right-hand side of (3), that is, finding Q(θ, θ_k), is the E step, and maximizing Q(θ, θ_k) with respect to θ is the M step of the EM algorithm.



EM Algorithm V
EM Algorithm
Let θ̂^(m) denote the estimate at the m-th step. To compute the estimate at the (m+1)-st step:
1. Expectation Step: Compute

    Q(θ, θ̂^(m)) = E_{θ̂^(m)}[log P_θ(y_o, y_m) | y_o],

   where the expectation is taken under the conditional pdf P_{θ̂^(m)}(y_m | y_o).
2. Maximization Step: Let

    θ̂^(m+1) = arg max_θ Q(θ, θ̂^(m)).
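The two steps above translate directly into a fixed-point loop. Below is a minimal sketch of that loop in Python, assuming a scalar θ and user-supplied model-specific functions e_step and m_step (both names are ours, not part of the lecture).

```python
# A minimal sketch of the generic EM loop, assuming scalar theta and
# user-supplied model-specific functions (hypothetical names):
#   e_step(theta_k) -> a callable Q with Q(theta) = E_{theta_k}[log P_theta(y_o, Y_m) | y_o]
#   m_step(Q)       -> argmax over theta of Q(theta)
def em(theta0, e_step, m_step, tol=1e-8, max_iter=500):
    theta = theta0
    for _ in range(max_iter):
        Q = e_step(theta)                 # E step: build Q(., theta_k)
        theta_new = m_step(Q)             # M step: maximize Q over theta
        if abs(theta_new - theta) < tol:  # stop once the estimates converge
            return theta_new
        theta = theta_new
    return theta
```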



EM Algorithm VI

Let us now study the difference between the log-likelihood function l_θ(y_o) evaluated at two different values θ and θ_k:

    l_θ(y_o) − l_{θ_k}(y_o) = (Q(θ, θ_k) − ν(θ, θ_k)) − (Q(θ_k, θ_k) − ν(θ_k, θ_k))          (4)
                            = (Q(θ, θ_k) − Q(θ_k, θ_k)) + (ν(θ_k, θ_k) − ν(θ, θ_k)).

The second term on the right-hand side of this equation can be simplified as follows.



EM Algorithm VII

    ν(θ_k, θ_k) − ν(θ, θ_k)
      = E_{θ_k}[log P_{θ_k}(y_m | y_o) | y_o] − E_{θ_k}[log P_θ(y_m | y_o) | y_o]
      = ∫ log P_{θ_k}(y_m | y_o) P_{θ_k}(y_m | y_o) dy_m − ∫ log P_θ(y_m | y_o) P_{θ_k}(y_m | y_o) dy_m
      = ∫ [log P_{θ_k}(y_m | y_o) − log P_θ(y_m | y_o)] P_{θ_k}(y_m | y_o) dy_m
      = ∫ −log [ P_θ(y_m | y_o) / P_{θ_k}(y_m | y_o) ] P_{θ_k}(y_m | y_o) dy_m
      = E_{θ_k}[ −log { P_θ(y_m | y_o) / P_{θ_k}(y_m | y_o) } ]



EM Algorithm VIII
Since the negative logarithm is a convex function, by Jensen's inequality¹,

    E_{θ_k}[ −log { P_θ(y_m | y_o) / P_{θ_k}(y_m | y_o) } ] ≥ −log E_{θ_k}[ P_θ(y_m | y_o) / P_{θ_k}(y_m | y_o) ]
      = −log ∫ [ P_θ(y_m | y_o) / P_{θ_k}(y_m | y_o) ] P_{θ_k}(y_m | y_o) dy_m
      = −log ∫ P_θ(y_m | y_o) dy_m
      = −log(1)
      = 0.
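The quantity on the left is the Kullback–Leibler divergence between P_{θ_k}(y_m | y_o) and P_θ(y_m | y_o), and its nonnegativity can also be checked numerically. The snippet below is only an illustration, not part of the lecture; the two normal densities are arbitrary stand-ins for P_{θ_k}(· | y_o) and P_θ(· | y_o).

```python
# Numerical illustration (not part of the proof) that E_q[-log(p/q)] >= 0.
# Here q stands in for P_{theta_k}(y_m | y_o) and p for P_theta(y_m | y_o);
# the two normal densities are arbitrary example choices.
import numpy as np
from scipy.integrate import quad
from scipy.stats import norm

q = norm(loc=0.0, scale=1.0).pdf
p = norm(loc=1.0, scale=1.5).pdf

kl, _ = quad(lambda z: q(z) * (np.log(q(z)) - np.log(p(z))), -np.inf, np.inf)
print(kl)  # about 0.35 here; it equals 0 only when the two densities coincide
```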



EM Algorithm IX

So we can conclude that the left-hand side of equation (4) is nonnegative whenever the first term on the right-hand side is nonnegative, since the second term is nonnegative as shown in the above equation.
In particular, the first term is made as large as possible by maximizing Q(θ, θ_k), the expected complete data log-likelihood; any such maximizer θ satisfies Q(θ, θ_k) ≥ Q(θ_k, θ_k) and hence l_θ(y_o) ≥ l_{θ_k}(y_o).
This implies that, starting from an initial value, iterative maximization of the expectation of the complete data log-likelihood never decreases the observed data log-likelihood, and under regularity conditions the iterates converge to a (possibly local) maximum of it.



EM Algorithm X

Case I: Censoring Model

Suppose X_1, X_2, ..., X_{n_1} are iid with pdf f(x − θ), for −∞ < x < ∞, where −∞ < θ < ∞. Denote the cdf of X_i by F(x − θ).
Let Z_1, Z_2, ..., Z_{n_2} denote the censored observations. For these observations we only know that Z_j > a, for some known a, and that the Z_j's are independent of the X_i's.

¹Jensen's inequality states that if f is a convex function, then

    E{f(X)} ≥ f(E{X}),

provided that both expectations exist.


Solution I

The observed data and complete data likelihoods are given, respectively, by

    L(θ|x) = [1 − F(a − θ)]^{n_2} ∏_{i=1}^{n_1} f(x_i − θ),                                  (5)

    L^c(θ|x, z) = ∏_{i=1}^{n_1} f(x_i − θ) ∏_{i=1}^{n_2} f(z_i − θ).                         (6)



Solution II

By expressions (5) and (6), the conditional pdf of Z given X is the ratio of (6) to (5), that is,

    k(z|θ, x) = ∏_{i=1}^{n_1} f(x_i − θ) ∏_{i=1}^{n_2} f(z_i − θ) / { [1 − F(a − θ)]^{n_2} ∏_{i=1}^{n_1} f(x_i − θ) }
              = [1 − F(a − θ)]^{−n_2} ∏_{i=1}^{n_2} f(z_i − θ),   a < z_i, i = 1, ..., n_2.   (7)

Thus Z and X are independent, and Z_1, Z_2, ..., Z_{n_2} are iid with common pdf f(z − θ)/[1 − F(a − θ)], for z > a.
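This truncated pdf can be sampled by inverting its cdf, which is occasionally handy (for example, in a Monte Carlo version of the E step). The sketch below is our own illustration, with the standard normal chosen as an example f; the function name is hypothetical.

```python
# Our own illustration: drawing Z from the truncated pdf
# f(z - theta) / [1 - F(a - theta)], z > a, by inverting the cdf.
# The standard normal is used as an example choice of f/F.
import numpy as np
from scipy.stats import norm

def sample_truncated(theta, a, size, seed=0):
    rng = np.random.default_rng(seed)
    u = rng.uniform(size=size)
    lower = norm.cdf(a - theta)                        # F(a - theta)
    return theta + norm.ppf(lower + u * (1.0 - lower))

print(sample_truncated(theta=0.0, a=1.0, size=5))      # every draw exceeds a = 1.0
```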



Solution III
Based on these observations and expression (7), we have the following derivation:

    Q(θ|θ_0, x) = E_{θ_0}[log L^c(θ|x, Z)]                                                   (8)
                = E_{θ_0}[ Σ_{i=1}^{n_1} log f(x_i − θ) + Σ_{i=1}^{n_2} log f(Z_i − θ) ]
                = Σ_{i=1}^{n_1} log f(x_i − θ) + n_2 E_{θ_0}[log f(Z − θ)]
                = Σ_{i=1}^{n_1} log f(x_i − θ) + n_2 ∫_a^∞ log f(z − θ) · f(z − θ_0)/[1 − F(a − θ_0)] dz   (9)

This last result is the E step of the EM algorithm.
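For a concrete f, the integral in (9) can be evaluated numerically, so Q(θ|θ_0, x) can be computed (and then maximized) without further algebra. The sketch below is our own assumed illustration, not part of the lecture; the example call plugs in the standard normal pdf/cdf and made-up data.

```python
# An assumed sketch (not from the slides) of evaluating (9) by numerical
# integration; logf, pdf and cdf are passed in so any f(x - theta) can be used.
import numpy as np
from scipy.integrate import quad
from scipy.stats import norm

def Q(theta, theta0, x_obs, n2, a, logf, pdf, cdf):
    obs_term = sum(logf(xi - theta) for xi in x_obs)
    tail_density = lambda z: pdf(z - theta0) / (1.0 - cdf(a - theta0))  # pdf of Z given Z > a
    tail, _ = quad(lambda z: logf(z - theta) * tail_density(z), a, np.inf)
    return obs_term + n2 * tail

# example call with the standard normal as f and made-up data
print(Q(0.5, 0.0, [0.2, 1.1, -0.4], n2=2, a=1.0,
        logf=norm.logpdf, pdf=norm.pdf, cdf=norm.cdf))
```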



Solution IV
For the M step we need the partial derivative of Q(θ|θ_0, x) with respect to θ. This is easily found to be

    ∂Q/∂θ = −{ Σ_{i=1}^{n_1} f'(x_i − θ)/f(x_i − θ) + n_2 ∫_a^∞ [f'(z − θ)/f(z − θ)] · f(z − θ_0)/[1 − F(a − θ_0)] dz }.   (10)

Taking θ_0 = θ̂^(0), an initial estimate, the first-step EM estimate is the value of θ, say θ̂^(1), which solves ∂Q/∂θ = 0.

Example: Censoring Model


For the censoring model discussed so far, assume that X has a N(θ, 1) distribution. Find the M step of the EM algorithm.



Solution V

Then

    f(x) = φ(x) = (2π)^{−1/2} exp{−x²/2},

and it is easy to show that f'(x)/f(x) = −x.
Letting Φ(z) denote, as usual, the cdf of a standard normal random variable, by (10) the partial derivative of Q(θ|θ_0, x) with respect to θ for the censoring model simplifies to



Solution VI

    ∂Q/∂θ = Σ_{i=1}^{n_1}(x_i − θ) + n_2 ∫_a^∞ (z − θ) (1/√(2π)) exp{−(z − θ_0)²/2} / [1 − Φ(a − θ_0)] dz
          = n_1(x̄ − θ) + n_2 ∫_a^∞ (z − θ_0) (1/√(2π)) exp{−(z − θ_0)²/2} / [1 − Φ(a − θ_0)] dz − n_2(θ − θ_0)
          = n_1(x̄ − θ) − n_2 ∫_a^∞ [f'(z − θ_0)/f(z − θ_0)] · f(z − θ_0)/[1 − Φ(a − θ_0)] dz − n_2(θ − θ_0)
          = n_1(x̄ − θ) − n_2 ∫_a^∞ f'(z − θ_0)/[1 − Φ(a − θ_0)] dz − n_2(θ − θ_0)
          = n_1(x̄ − θ) − n_2 ∫_a^∞ (d/dz) f(z − θ_0)/[1 − Φ(a − θ_0)] dz − n_2(θ − θ_0),

where the second line writes (z − θ) = (z − θ_0) − (θ − θ_0) and the third uses f'(x)/f(x) = −x.



Solution VII
    ∂Q/∂θ = n_1(x̄ − θ) − n_2 ∫_a^∞ (d/dz) f(z − θ_0)/[1 − Φ(a − θ_0)] dz − n_2(θ − θ_0)
          = n_1(x̄ − θ) + n_2 φ(a − θ_0)/[1 − Φ(a − θ_0)] − n_2(θ − θ_0),

since ∫_a^∞ (d/dz) f(z − θ_0) dz = −f(a − θ_0) = −φ(a − θ_0).

Solving ∂Q/∂θ = 0 for θ determines the EM step estimates. In particular, given that θ̂^(m) is the EM estimate at the m-th step, the (m+1)-st step estimate is

    θ̂^(m+1) = (n_1/n) x̄ + (n_2/n) θ̂^(m) + (n_2/n) · φ(a − θ̂^(m)) / [1 − Φ(a − θ̂^(m))],

where n = n_1 + n_2.
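This recursion is simple to run in code. The following is a small sketch of the fixed-point iteration for the N(θ, 1) censoring model; the function name, stopping rule, and example data are our own choices, not part of the lecture.

```python
# A sketch of the fixed-point iteration above for the N(theta, 1) censoring
# model; the function name, stopping rule and example data are our own.
import numpy as np
from scipy.stats import norm

def em_censored_normal(x_obs, n2, a, theta0=0.0, tol=1e-10, max_iter=1000):
    x_obs = np.asarray(x_obs, dtype=float)
    n1 = x_obs.size
    n = n1 + n2
    theta = theta0
    for _ in range(max_iter):
        mills = norm.pdf(a - theta) / (1.0 - norm.cdf(a - theta))
        theta_new = (n1 / n) * x_obs.mean() + (n2 / n) * theta + (n2 / n) * mills
        if abs(theta_new - theta) < tol:
            break
        theta = theta_new
    return theta_new

print(em_censored_normal(x_obs=[1.2, 0.4, -0.3, 0.9], n2=3, a=1.5))
```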



Solution VIII

Example II
Let X_i and Z_i have identical exponential distributions with rate λ, and suppose they are independent of each other. Assume that the Z_i are censored observations (we only know that Z_i > a), and that there are n_2 = n − n_1 of them, where n_1 is the number of observed values. Estimate λ using the EM algorithm.

For the censoring model, we have

    Q(λ|x, λ_0) = Σ_{i=1}^{n_1} log(λ e^{−λ x_i}) + n_2 ∫_a^∞ log f(z; λ) · f(z; λ_0)/[1 − F(a; λ_0)] dz



Solution IX
Using the fact that, for an exponential distribution, F(a; λ) = 1 − exp{−λa}, we can write

    Q(λ|x, λ_0) = Σ_{i=1}^{n_1} log(λ e^{−λ x_i}) + n_2 ∫_a^∞ log f(z; λ) · f(z; λ_0)/e^{−λ_0 a} dz.

Now,

    n_2 ∫_a^∞ log f(z; λ) · f(z; λ_0)/e^{−λ_0 a} dz = n_2 ∫_a^∞ log(λ e^{−λz}) · λ_0 e^{−λ_0 z}/e^{−λ_0 a} dz



Solution X
    = n_2 ∫_a^∞ [log λ − λz] · λ_0 e^{−λ_0 z}/e^{−λ_0 a} dz
    = (n_2 log λ / e^{−λ_0 a}) ∫_a^∞ λ_0 e^{−λ_0 z} dz − (n_2 λ λ_0 / e^{−λ_0 a}) ∫_a^∞ z e^{−λ_0 z} dz
    = (n_2 log λ / e^{−λ_0 a}) [1 − F(a; λ_0)] − (n_2 λ λ_0 / e^{−λ_0 a}) ∫_a^∞ z e^{−λ_0 z} dz
    = n_2 log λ − (n_2 λ λ_0 / e^{−λ_0 a}) ∫_a^∞ z e^{−λ_0 z} dz

To apply ∫_a^b u dv = uv|_a^b − ∫_a^b v du, let us assume that

    u = z  ⟹  du = dz,   and
    dv = e^{−λ_0 z} dz  ⟹  v = ∫ e^{−λ_0 z} dz = −(1/λ_0) e^{−λ_0 z}.
Solution XI
Therefore,

    (n_2 λ λ_0 / e^{−λ_0 a}) ∫_a^∞ z e^{−λ_0 z} dz
      = (n_2 λ λ_0 / e^{−λ_0 a}) { [ −(z/λ_0) e^{−λ_0 z} ]_a^∞ − ∫_a^∞ −(1/λ_0) e^{−λ_0 z} dz }
      = (n_2 λ λ_0 / e^{−λ_0 a}) { (a/λ_0) e^{−λ_0 a} + [ −(1/λ_0²) e^{−λ_0 z} ]_a^∞ }
      = (n_2 λ λ_0 / e^{−λ_0 a}) { (a/λ_0) e^{−λ_0 a} + (1/λ_0²) e^{−λ_0 a} }
      = n_2 a λ + n_2 λ/λ_0



Solution XII
Finally, we have

    Q(λ|x, λ_0) = Σ_{i=1}^{n_1} log(λ e^{−λ x_i}) + n_2 log λ − n_2 a λ − n_2 λ/λ_0
                = n_1 log λ − n_1 λ x̄ + n_2 log λ − n_2 a λ − n_2 λ/λ_0
                = n log λ − n_1 λ x̄ − n_2 a λ − n_2 λ/λ_0

Then, for the M step, we set ∂Q(λ|x, λ_0)/∂λ = 0 and solve for λ. This yields, for the (m+1)-th iteration,

    λ^(m+1) = n / ( n_1 x̄ + n_2 a + n_2/λ^(m) )
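A small sketch of this update follows; the function name, stopping rule, and example numbers are our own choices, not part of the lecture.

```python
# A sketch of the update above for the censored exponential model; the
# function name, stopping rule and example numbers are our own.
import numpy as np

def em_censored_exponential(x_obs, n2, a, lam0=1.0, tol=1e-12, max_iter=1000):
    x_obs = np.asarray(x_obs, dtype=float)
    n1 = x_obs.size
    n = n1 + n2
    lam = lam0
    for _ in range(max_iter):
        lam_new = n / (n1 * x_obs.mean() + n2 * a + n2 / lam)
        if abs(lam_new - lam) < tol:
            break
        lam = lam_new
    return lam_new

print(em_censored_exponential(x_obs=[0.3, 1.7, 0.8, 2.4], n2=2, a=3.0))
```

As a check on the algebra, the fixed point of this recursion is λ̂ = n_1/(n_1 x̄ + n_2 a), which is also the maximizer of the corresponding observed data likelihood [1 − F(a; λ)]^{n_2} ∏_{i=1}^{n_1} f(x_i; λ) for this model.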



Solution XIII

Case 2: Mixture Distribution

Consider a mixture problem involving normal distributions. Suppose Y_1 has a N(µ_1, σ_1²) distribution and Y_2 has a N(µ_2, σ_2²) distribution. Let W be a Bernoulli random variable, independent of Y_1 and Y_2, with probability of success ε = P(W = 1). Suppose the random variable we observe is X = (1 − W)Y_1 + W Y_2. In this case, the vector of parameters is θ = (µ_1, σ_1², µ_2, σ_2², ε)'.

The pdf of the mixture random variable X = (1 − W)Y_1 + W Y_2 is

    f(x) = (1 − ε) f_1(x) + ε f_2(x),   −∞ < x < ∞,



Solution XIV
where f_j(x) = σ_j^{−1} φ[(x − µ_j)/σ_j], j = 1, 2, and φ(z) is the pdf of a standard normal random variable (see Section 3.4 of Hogg et al., 2013).
Suppose we observe a random sample X' = (X_1, X_2, ..., X_n) from this mixture distribution with pdf f(x). Then the log of the likelihood function is

    l(θ|x) = Σ_{i=1}^n log [ (1 − ε) f_1(x_i) + ε f_2(x_i) ].                                (11)

In this mixture problem, the unobserved data are the random variables that identify the distribution membership.
For i = 1, 2, ..., n, define the random variables

    W_i = 0 if X_i has pdf f_1(x),
    W_i = 1 if X_i has pdf f_2(x).
Solution XV
These variables, of course, constitute a random sample on the Bernoulli random variable W.
Accordingly, assume that W_1, W_2, ..., W_n are iid Bernoulli random variables with probability of success ε.
The complete likelihood function is

    L^c(θ|x, w) = ∏_{W_i=0} f_1(x_i) ∏_{W_i=1} f_2(x_i).                                     (12)

Hence the log of the complete likelihood function is

    l^c(θ|x, w) = Σ_{W_i=0} log f_1(x_i) + Σ_{W_i=1} log f_2(x_i)
                = Σ_{i=1}^n [ (1 − w_i) log f_1(x_i) + w_i log f_2(x_i) ].                   (13)



Solution XVI
For the E step of the algorithm, we need the conditional expectation of W_i given x under θ_0; that is,

    E_{θ_0}[W_i | θ_0, x] = P[W_i = 1 | θ_0, x].                                             (14)

An estimate of this expectation is the likelihood that x_i was drawn from distribution f_2(x), which is given by

    γ_i = ε̂ f_{2,0}(x_i) / [ (1 − ε̂) f_{1,0}(x_i) + ε̂ f_{2,0}(x_i) ],                        (15)

where the subscript 0 signifies that the parameters at θ_0 are being used.
Expression (15) is intuitively evident; see McLachlan and Krishnan (1997) for more discussion.



Solution XVII

Replacing w_i by γ_i in expression (13), the M step of the algorithm is to maximize

    Q(θ|θ_0, x) = Σ_{i=1}^n [ (1 − γ_i) log f_1(x_i) + γ_i log f_2(x_i) ].                   (16)

This maximization is easy to carry out by taking partial derivatives of Q(θ|θ_0, x) with respect to the parameters. For example,

    ∂Q/∂µ_1 = Σ_{i=1}^n (1 − γ_i)(−1/(2σ_1²))(−2)(x_i − µ_1) = (1/σ_1²) Σ_{i=1}^n (1 − γ_i)(x_i − µ_1).



Solution XVIII

Setting this to 0 and solving for µ1 yields the estimate of µ1 .


The estimates of the other mean and the variances can be
obtained similarly. Thesee estiamtes are
P1
(1 − γi )xi
µ̂1 = Pi=1
1
i=1 (1 − γi )
P1
(1 − γi )(xi − µ̂1 )2
σ̂12 = i=1 P1
i=1 (1 − γi )
P1
γi xi
µ̂2 = Pi=1
1
i=1 γi
P1
γi (xi − µ̂1 )2
σ̂22 = i=1 P1
i=1 γi



Solution XIX

Since γ_i is an estimate of P[W_i = 1 | θ_0, x], the average (1/n) Σ_{i=1}^n γ_i is an estimate of ε = P[W_i = 1]. This average is the estimate ε̂.
So, to start the algorithm, we need an initial guess for each component of θ = (µ_1, σ_1², µ_2, σ_2², ε)'.
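Putting the E step (15), the M step estimates above, and the update of ε together gives one complete EM pass. The sketch below is our own implementation outline; the function name, starting values, fixed iteration count, and simulated data in the example call are arbitrary choices, not part of the lecture.

```python
# A sketch of the full EM iteration for the two-component normal mixture,
# combining (15) with the closed-form M-step estimates above; the function
# name, starting values, iteration count and simulated data are our own.
import numpy as np
from scipy.stats import norm

def em_mixture(x, eps, mu1, var1, mu2, var2, n_iter=200):
    x = np.asarray(x, dtype=float)
    for _ in range(n_iter):
        # E step: gamma_i estimates P(W_i = 1 | x), equation (15)
        f1 = norm.pdf(x, loc=mu1, scale=np.sqrt(var1))
        f2 = norm.pdf(x, loc=mu2, scale=np.sqrt(var2))
        gamma = eps * f2 / ((1 - eps) * f1 + eps * f2)
        # M step: weighted means and variances, then the mixing proportion
        w1, w2 = (1 - gamma).sum(), gamma.sum()
        mu1 = np.sum((1 - gamma) * x) / w1
        var1 = np.sum((1 - gamma) * (x - mu1) ** 2) / w1
        mu2 = np.sum(gamma * x) / w2
        var2 = np.sum(gamma * (x - mu2) ** 2) / w2
        eps = gamma.mean()  # the average of the gamma_i estimates epsilon
    return eps, mu1, var1, mu2, var2

rng = np.random.default_rng(1)
x = np.concatenate([rng.normal(0.0, 1.0, 150), rng.normal(4.0, 1.5, 50)])
print(em_mixture(x, eps=0.5, mu1=-1.0, var1=1.0, mu2=1.0, var2=1.0))
```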

