Lecture 3: EM

The EM algorithm is an iterative method for finding maximum likelihood estimates of parameters in probabilistic models with latent variables. It alternates between performing an expectation (E) step, which computes the expected value of the log-likelihood using the current parameter estimates, and a maximization (M) step, which computes parameter estimates maximizing the expected log-likelihood from the E step. This process is guaranteed to increase the observed-data log-likelihood at each iteration until convergence. The algorithm is demonstrated on examples involving estimating parameters of mixtures of normal distributions and allele frequencies from genetic data.


EM Algorithm

Last lecture — 1/35 —

• General optimization problems
  – Newton-Raphson
  – Fisher scoring
  – Quasi-Newton
• Nonlinear regression models
  – Gauss-Newton
• Generalized linear models
  – Iteratively reweighted least squares
Expectation–maximization (EM) algorithm — 2/35 —

• An iterative algorithm for maximizing the likelihood when the model contains
  unobserved latent variables.
• Initially invented by computer scientists for special cases.
• Generalized by Arthur Dempster, Nan Laird, and Donald Rubin in a classic 1977
  JRSSB paper, widely known as the “DLR” paper.
• The algorithm iterates between an E-step (expectation) and an M-step (maximization).
• E-step: create a function for the expectation of the log-likelihood, evaluated
  using the current estimate of the parameters.
• M-step: compute the parameters maximizing the expected log-likelihood found in
  the E-step.
Motivating example of EM algorithm — 3/35 —

• Assume people’s heights (in cm) follow normal distributions with different means
  for males and females: N(µ1, σ1²) for males and N(µ2, σ2²) for females.
• We observe the heights of 5 people (gender unknown): 182, 163, 175, 185, 158.
• We want to estimate µ1, µ2, σ1 and σ2.

This is the typical “two-component normal mixture model”, i.e., the data come from a
mixture of two normal distributions. The goal is to estimate the model parameters.

We could, of course, form the likelihood function (a product of normal densities)
and find its maximum by Newton-Raphson.
A sketch of an EM algorithm — 4/35 —

Some notation: for person i, denote his/her height by xi and use Zi to indicate
gender (Zi = 1 for male). Let π denote the proportion of males in the population.

Start by choosing reasonable initial values. Then:

• In the E-step, compute the probability of each person being male or female,
given the current model parameters. We have (after some derivation)
  λi^(k) ≡ E[Zi | xi, µ1^(k), µ2^(k), σ1^(k), σ2^(k), π^(k)]
         = π^(k) φ(xi; µ1^(k), σ1^(k)) / [ π^(k) φ(xi; µ1^(k), σ1^(k)) + (1 − π^(k)) φ(xi; µ2^(k), σ2^(k)) ]

• In the M-step, update the parameters and the group proportion using the
  probabilities from the E-step as weights. These are basically weighted averages
  and variances. For example,

  µ1^(k+1) = Σi λi^(k) xi / Σi λi^(k),   µ2^(k+1) = Σi (1 − λi^(k)) xi / Σi (1 − λi^(k)),   π^(k+1) = Σi λi^(k) / 5
Example results — 5/35 —

We choose µ1 = 175, µ2 = 165, σ1 = σ2 = 10 as initial values.

• After the first iteration, the E-step gives


Person 1 2 3 4 5
xi: height (cm) 179 165 175 185 158
λi: prob. male 0.79 0.48 0.71 0.87 0.31

The estimates for parameters after M-step are (weighted average and variance):
µ1 = 176, µ2 = 167, σ1 = 8.7, σ2 = 9.2, π = 0.63.
• At iteration 15 (converged), we have:
Person 1 2 3 4 5
Height (cm) 179 165 175 185 158
Prob. male 9.999968e-01 4.009256e-03 9.990943e-01 1.000000e+00 2.443061e-06

The estimates for the parameters are: µ1 = 179.6, µ2 = 161.5, σ1 = 4.1, σ2 = 3.5, π = 0.6.
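
For reference, these numbers can be reproduced (approximately) by calling the
EM_TwoMixtureNormal function given later (slide 20) on the heights shown in the table;
the initial proportion π = 0.6 below is an assumption, since the slide only states the
initial means and standard deviations.

## Reproducing the height example with the function from slide 20;
## the initial p = 0.6 is assumed (not stated on the slide).
X <- c(179, 165, 175, 185, 158)
EM_TwoMixtureNormal(p = 0.6, mu1 = 175, mu2 = 165, sd1 = 10, sd2 = 10, X = X)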
Another motivating example of EM algorithm — 6/35 —

ABO blood groups

Genotype   Genotype frequency   Phenotype
AA         pA²                  A
AO         2 pA pO              A
BB         pB²                  B
BO         2 pB pO              B
OO         pO²                  O
AB         2 pA pB              AB

• The genotype frequencies above assume Hardy-Weinberg equilibrium.


• For a random sample of n individuals, we observe their phenotype, but not their
genotype.
• We wish to obtain the MLEs of the underlying allele frequencies pA, pB, and
pO = 1 − pA − pB. The likelihood is (from multinomial):
  L(pA, pB) = (pA² + 2 pA pO)^nA × (pB² + 2 pB pO)^nB × (pO²)^nO × (2 pA pB)^nAB
Motivating example (Allele counting algorithm) — 7/35 —

Let nA, nB, nO, nAB be the observed numbers of individuals with phenotypes A, B, O,
AB, respectively.

Let nAA, nAO, nBB and nBO be the unobserved numbers of individuals with genotypes
AA, AO, BB and BO, respectively. They satisfy nAA + nAO = nA and nBB + nBO = nB.

1. Start with initial estimates p^(0) = (pA^(0), pB^(0), pO^(0)).

2. Calculate the expected nAA and nBB, given the observed data and p^(k):

   nAA^(k+1) = E(nAA | nA, p^(k)) = nA · (pA^(k))² / [ (pA^(k))² + 2 pA^(k) pO^(k) ],   nBB^(k+1) = ?

3. Update p^(k+1), imagining that nAA^(k+1) and nBB^(k+1) (and hence nAO^(k+1) = nA − nAA^(k+1)
   and nBO^(k+1) = nB − nBB^(k+1)) were actually observed:

   pA^(k+1) = ( 2 nAA^(k+1) + nAO^(k+1) + nAB ) / (2n),   pB^(k+1) = ?

4. Repeat steps 2 and 3 until the estimates converge (see the R sketch below).
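
A minimal R sketch of this allele-counting EM, following steps 1–4 above; the function
name and the phenotype counts in the example call are illustrative, not from the slides.

abo_em = function(nA, nB, nO, nAB, pA = 1/3, pB = 1/3, maxiter = 100, tol = 1e-8)
{
  n = nA + nB + nO + nAB
  for (k in 1:maxiter) {
    pO = 1 - pA - pB
    ## E-step: expected genotype counts given the current allele frequencies
    nAA = nA * pA^2 / (pA^2 + 2 * pA * pO);  nAO = nA - nAA
    nBB = nB * pB^2 / (pB^2 + 2 * pB * pO);  nBO = nB - nBB
    ## M-step: gene counting (each individual carries two alleles)
    pA.new = (2 * nAA + nAO + nAB) / (2 * n)
    pB.new = (2 * nBB + nBO + nAB) / (2 * n)
    if (abs(pA.new - pA) + abs(pB.new - pB) < tol) { pA = pA.new; pB = pB.new; break }
    pA = pA.new; pB = pB.new
  }
  c(pA = pA, pB = pB, pO = 1 - pA - pB, iterations = k)
}
## Example call with illustrative counts: abo_em(nA = 186, nB = 38, nO = 284, nAB = 13)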


EM algorithm: Applications — 8/35 —

The expectation–maximization (EM) algorithm (Dempster, Laird, & Rubin, 1977, JRSSB,
39:1–38) is a general iterative algorithm for parameter estimation by maximum
likelihood (optimization problems).

It is useful when
• some of the random variables involved are not observed, i.e., they are considered
  missing or incomplete.
• directly maximizing the target likelihood function is difficult, but one can introduce
  (missing) random variables so that maximizing the complete-data likelihood is
  simple.

Typical problems include:


• Filling in missing data in a sample
• Discovering the value of latent variables
• Estimating parameters of HMMs
• Estimating parameters of finite mixtures
Description of EM — 9/35 —

Consider (Yobs, Ymis) ∼ f (yobs, ymis|θ), where we observe Yobs but not Ymis

It can be difficult to find the MLE θ̂ = arg maxθ g(Yobs|θ) = arg maxθ ∫ f(Yobs, ymis|θ) dymis.

But it could be easy to find θ̂C = arg maxθ f(Yobs, Ymis|θ), if we had observed Ymis.

• E step: h^(k)(θ) ≡ E[ log f(Yobs, Ymis|θ) | Yobs, θ^(k) ]
• M step: θ^(k+1) = arg maxθ h^(k)(θ)

Nice properties (compared to Newton-Raphson):


1. simplicity of implementation
2. stable monotone convergence
Justification of EM — 10/35 —

The E-step creates a surrogate function (often called the “Q function”), which is the
expected value of the complete-data log-likelihood with respect to the conditional
distribution of Ymis given Yobs, under the current estimate of the parameters θ^(k).

The M-step maximizes the surrogate function.


Ascent property of EM — 11/35 —

Theorem: At each iteration of the EM algorithm,


log g(Yobs|θ(k+1)) ≥ log g(Yobs|θ(k))
and the equality holds if and only if θ(k+1) = θ(k).

Proof: The definition of θ(k+1) gives

E{log f (Yobs, Ymis|θ(k+1))|Yobs, θ(k)} ≥ E{log f (Yobs, Ymis|θ(k))|Yobs, θ(k)},

which, using f(Yobs, Ymis|θ) = c(Ymis|Yobs, θ) g(Yobs|θ), where c is the conditional density of Ymis given Yobs, can be expanded to

E{log c(Ymis|Yobs, θ(k+1))|Yobs, θ(k)}+log g(Yobs|θ(k+1)) ≥ E{log c(Ymis|Yobs, θ(k))|Yobs, θ(k)}+log g(Yobs|θ(k)).
(1)
By the non-negativity of the Kullback–Leibler information, i.e.,

  ∫ log[ p(x)/q(x) ] p(x) dx ≥ 0   for densities p(x), q(x),

we have

  ∫ log[ c(Ymis|Yobs, θ^(k)) / c(Ymis|Yobs, θ^(k+1)) ] c(Ymis|Yobs, θ^(k)) dymis
    = E[ log{ c(Ymis|Yobs, θ^(k)) / c(Ymis|Yobs, θ^(k+1)) } | Yobs, θ^(k) ] ≥ 0.        (2)
Ascent property of EM (continued) — 12/35 —

Combining (1) and (2) yields

log g(Yobs|θ(k+1)) ≥ log g(Yobs|θ(k)),

thus we partially proved the theorem. If the equality holds, i.e.,

log g(Yobs|θ(k+1)) = log g(Yobs|θ(k)), (3)

by (1) and (2),

E{log c(Ymis|Yobs, θ(k+1))|Yobs, θ(k)} = E{log c(Ymis|Yobs, θ(k))|Yobs, θ(k)}.

The Kullback–Leibler information is zero if and only if

log c(Ymis|Yobs, θ(k+1)) = log c(Ymis|Yobs, θ(k)). (4)

Combining (3) and (4), we have

log f (Y|θ(k+1)) = log f (Y|θ(k)).

The uniqueness of θ leads to θ(k+1) = θ(k) .


Example 1: Grouped Multinomial Data — 13/35 —

Suppose Y = (y1, y2, y3, y4) has a multinomial distribution with cell probabilities

  ( 1/2 + θ/4,  (1 − θ)/4,  (1 − θ)/4,  θ/4 ).

Then the probability for Y is given by

  L(θ|Y) = [ (y1 + y2 + y3 + y4)! / (y1! y2! y3! y4!) ] (1/2 + θ/4)^y1 ((1 − θ)/4)^y2 ((1 − θ)/4)^y3 (θ/4)^y4.

If we use Newton-Raphson to directly maximize L(θ|Y), we need

  l̇(θ|Y) = (y1/4)/(1/2 + θ/4) − (y2 + y3)/(1 − θ) + y4/θ
  l̈(θ|Y) = − y1/(2 + θ)² − (y2 + y3)/(1 − θ)² − y4/θ²

The probability of the first cell is a trouble-maker!

How to avoid?
Example 1: Grouped Multinomial Data (continued) — 14/35 —

Suppose Y = (y1, y2, y3, y4) has a multinomial distribution with cell probabilities

  ( 1/2 + θ/4,  (1 − θ)/4,  (1 − θ)/4,  θ/4 ).

Define the complete data X = (x0, x1, y2, y3, y4) to have a multinomial distribution
with cell probabilities

  ( 1/2,  θ/4,  (1 − θ)/4,  (1 − θ)/4,  θ/4 ),

and to satisfy x0 + x1 = y1.

Observed-data log-likelihood:

  l(θ|Y) ≡ y1 log(1/2 + θ/4) + (y2 + y3) log(1 − θ) + y4 log θ

Complete-data log-likelihood:

  lC(θ|X) ≡ (x1 + y4) log θ + (y2 + y3) log(1 − θ)


Example 1: Grouped Multinomial Data (continued) — 15/35 —

E step: evaluate

  x1^(k+1) = E(x1 | Y, θ^(k)) = y1 · (θ^(k)/4) / (1/2 + θ^(k)/4)

M step: maximize the complete-data log-likelihood with x1 replaced by x1^(k+1):

  θ^(k+1) = ( x1^(k+1) + y4 ) / ( x1^(k+1) + y4 + y2 + y3 )
Example 1: Grouped Multinomial Data (continued) — 16/35 —

We observe Y = (125, 18, 20, 34) and start EM with θ^(0) = 0.5.

  k     θ^(k) (parameter update)   θ^(k) − θ̂ (convergence to θ̂)   (θ^(k) − θ̂)/(θ^(k−1) − θ̂) (convergence rate)
  0     .500000000                 .126821498
  1     .608247423                 .018574075                      .1465
  2     .624321051                 .002500447                      .1346
  3     .626488879                 .000332619                      .1330
  4     .626777323                 .000044176                      .1328
  5     .626815632                 .000005866                      .1328
  6     .626820719                 .000000779                      .1328
  7     .626821395                 .000000104
  8     .626821484                 .000000014
  θ̂     .626821498                 Stop
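
A minimal R sketch of the two updates above, which reproduces the θ^(k) column of the table.

y = c(125, 18, 20, 34)   # observed counts (y1, y2, y3, y4)
theta = 0.5              # theta^(0)
for (k in 1:8) {
  x1 = y[1] * (theta/4) / (1/2 + theta/4)          # E-step: expected count in the theta/4 part of cell 1
  theta = (x1 + y[4]) / (x1 + y[4] + y[2] + y[3])  # M-step: complete-data MLE
  cat(sprintf("k = %d, theta = %.9f\n", k, theta))
}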
Example 2: Normal mixtures — 17/35 —

Consider a J-group normal mixture, where x1, . . . , xn are i.i.d. with density
Σ_{j=1}^J pj φ(x | µj, σj), and φ(·|µ, σ) is the normal density. This is the
clustering / finite mixture problem for which EM is typically used.

Define indicator variables for observation i: (yi1, yi2, . . . , yiJ) follows a multinomial
distribution (with trial number = 1) and cell probabilities p = (p1, p2, . . . , pJ). Clearly,
Σ_j yij = 1. Given yij* = 1 and yij = 0 for j ≠ j*, we assume

  xi ∼ N(µj*, σj*).

You can check that, marginally, xi ∼ Σ_{j=1}^J pj φ(xi | µj, σj).

Here, {xi}i is the observed data; {xi, yi1, . . . , yiJ}i is the complete data.

Observed-data log-likelihood:

  l(µ, σ, p | x) ≡ Σ_i log [ Σ_{j=1}^J pj φ(xi | µj, σj) ]

Complete-data log-likelihood:

  lC(µ, σ, p | x, y) ≡ Σ_{ij} yij { log pj + log φ(xi | µj, σj) }
Example 2: Normal mixtures (continued) — 18/35 —

Complete-data log-likelihood:

  lC(µ, σ, p | x, y) ≡ Σ_{ij} yij { log pj − (xi − µj)²/(2σj²) − log σj }

E step: evaluate, for i = 1, . . . , n and j = 1, . . . , J,

  ωij^(k) ≡ E(yij | xi, µ^(k), σ^(k), p^(k))
          = Pr(yij = 1 | xi, µ^(k), σ^(k), p^(k))
          = pj^(k) φ(xi | µj^(k), σj^(k)) / Σ_{j'} pj'^(k) φ(xi | µj'^(k), σj'^(k))

M step: maximize the complete-data log-likelihood with yij replaced by ωij^(k):

  pj^(k+1) = n^{−1} Σ_i ωij^(k)
  µj^(k+1) = Σ_i ωij^(k) xi / Σ_i ωij^(k)
  σj^(k+1) = sqrt[ Σ_i ωij^(k) (xi − µj^(k))² / Σ_i ωij^(k) ]

Practice: When all groups share the same variance (σ2), what’s the M-step update
for σ2?
Example 2: Normal mixtures in R — 20/35 —

### two component EM


### pN(0,1)+(1-p)N(4,1)

EM_TwoMixtureNormal = function(p, mu1, mu2, sd1, sd2, X, maxiter=1000, tol=1e-5)
{
  diff=1
  iter=0

  while (diff>tol & iter<maxiter) {

    ## E-step: compute omega, the posterior probability of group 1
    d1=dnorm(X, mean=mu1, sd=sd1)   # density under component 1
    d2=dnorm(X, mean=mu2, sd=sd2)   # density under component 2
    omega=d1*p/(d1*p+d2*(1-p))

    ## M-step: update p, mu and sd, using omega as weights
    p.new=mean(omega)
    mu1.new=sum(X*omega) / sum(omega)
    mu2.new=sum(X*(1-omega)) / sum(1-omega)
    resid1=X-mu1   # residuals use the previous means, matching the sigma update on slide 18
    resid2=X-mu2
    sd1.new=sqrt(sum(resid1^2*omega) / sum(omega))
    sd2.new=sqrt(sum(resid2^2*(1-omega)) / sum(1-omega))

    ## calculate diff to check convergence
    diff=sqrt(sum((mu1.new-mu1)^2+(mu2.new-mu2)^2
                  +(sd1.new-sd1)^2+(sd2.new-sd2)^2))

    p=p.new
    mu1=mu1.new
    mu2=mu2.new
    sd1=sd1.new
    sd2=sd2.new

    iter=iter+1

    cat("Iter", iter, ": mu1=", mu1.new, ", mu2=", mu2.new, ", sd1=", sd1.new,
        ", sd2=", sd2.new, ", p=", p.new, ", diff=", diff, "\n")
  }

  ## return the final estimates invisibly, so the printed trace matches the output shown later
  invisible(list(p=p, mu1=mu1, mu2=mu2, sd1=sd1, sd2=sd2, iter=iter))
}
Example 2: Normal mixtures in R (continued) — 22/35 —

> ## simulation
> p0=0.3;
> n=5000;
> X1=rnorm(n*p0); # n*p0 individuals from N(0,1)
> X2=rnorm(n*(1-p0), mean=4) # n*(1-p0) individuals from N(4,1)
> X=c(X1,X2) # observed data
> hist(X, 50)

[Histogram of X (50 bins): Frequency (0 to 60) versus X (−2 to 8), showing two modes, near 0 and near 4.]
Example 2: Normal mixtures in R (continued) — 23/35 —

> ## initial values for EM


> p=0.5
> mu1=quantile(X, 0.1);
> mu2=quantile(X, 0.9)
> sd1=sd2=sd(X)

> c(p, mu1, mu2, sd1, sd2)


0.5000000 -0.3903964 5.0651073 2.0738555 2.0738555

> EM_TwoMixtureNormal(p, mu1, mu2, sd1, sd2, X)


Iter 1: mu1=0.8697, mu2=4.0109, sd1=2.1342, sd2=1.5508, p=0.3916, diff=1.7252
Iter 2: mu1=0.9877, mu2=3.9000, sd1=1.8949, sd2=1.2262, p=0.3843, diff=0.4345
Iter 3: mu1=0.8353, mu2=4.0047, sd1=1.7812, sd2=1.0749, p=0.3862, diff=0.2645
Iter 4: mu1=0.7203, mu2=4.0716, sd1=1.6474, sd2=0.9899, p=0.3852, diff=0.2070
...
Iter 44: mu1=-0.0048, mu2=3.9515, sd1=0.9885, sd2=1.0316, p=0.2959, diff=1.9e-05
Iter 45: mu1=-0.0048, mu2=3.9515, sd1=0.9885, sd2=1.0316, p=0.2959, diff=1.4e-05
Iter 46: mu1=-0.0049, mu2=3.9515, sd1=0.9885, sd2=1.0316, p=0.2959, diff=1.1e-05
Iter 47: mu1=-0.0049, mu2=3.9515, sd1=0.9885, sd2=1.0316, p=0.2959, diff=8.7e-06
In class practice: Poisson mixture — 24/35 —

Using the same notation as in the normal mixture model, now assume the data come
from a mixture of Poisson distributions.

Consider x1, . . . , xn ∼ Σ_{j=1}^J pj φ(xi | λj), where φ(·|λ) is the Poisson density. Again use
yij to indicate group assignment: (yi1, yi2, . . . , yiJ) follows a multinomial distribution
with cell probabilities p = (p1, p2, . . . , pJ).

Now the observed-data log-likelihood is

  l(λ, p | x) ≡ Σ_i log [ Σ_{j=1}^J pj φ(xi | λj) ]

Complete-data log-likelihood (dropping the constant log xi! term):

  lC(λ, p | x, y) ≡ Σ_{ij} yij { log pj + (xi log λj − λj) }

Derive the EM iterations! (A minimal R sketch follows.)
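
For comparison with your derivation, here is a minimal R sketch of a two-component
Poisson-mixture EM, written in the same style as the normal-mixture function on slide 20;
the E-step weights and the weighted-mean M-step below are the standard updates, stated
here as an assumption rather than a worked solution.

EM_TwoMixturePoisson = function(p, lambda1, lambda2, X, maxiter = 1000, tol = 1e-5)
{
  for (iter in 1:maxiter) {
    ## E-step: posterior probability that observation i belongs to group 1
    d1 = dpois(X, lambda1)
    d2 = dpois(X, lambda2)
    omega = d1 * p / (d1 * p + d2 * (1 - p))
    ## M-step: weighted proportion and weighted means
    p.new = mean(omega)
    lambda1.new = sum(omega * X) / sum(omega)
    lambda2.new = sum((1 - omega) * X) / sum(1 - omega)
    diff = abs(p.new - p) + abs(lambda1.new - lambda1) + abs(lambda2.new - lambda2)
    p = p.new; lambda1 = lambda1.new; lambda2 = lambda2.new
    if (diff < tol) break
  }
  list(p = p, lambda1 = lambda1, lambda2 = lambda2, iter = iter)
}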


Example 3: Mixed-effects model — 25/35 —

For a longitudinal dataset of i = 1, . . . , N subjects, each with ni measurements of
the outcome, the linear mixed-effects model is given by

  Yi = Xi β + Zi bi + εi,   bi ∼ Nq(0, D),   εi ∼ N_{ni}(0, σ² I_{ni}),   bi and εi independent

Observed-data log-likelihood:

  l(β, D, σ² | Y1, . . . , YN) ≡ Σ_i { −(1/2) (Yi − Xi β)' Σi^{−1} (Yi − Xi β) − (1/2) log |Σi| },

where Σi = Zi D Zi' + σ² I_{ni}.

• In fact, this likelihood can be directly maximized for (β, D, σ²) by using
  Newton-Raphson or Fisher scoring.
• Given (D, σ²) and hence Σi, we obtain the β that maximizes the likelihood by solving

  ∂l(β, D, σ² | Y1, . . . , YN)/∂β = Σ_i Xi' Σi^{−1} (Yi − Xi β) = 0,

  which implies

  β = ( Σ_{i=1}^N Xi' Σi^{−1} Xi )^{−1} Σ_{i=1}^N Xi' Σi^{−1} Yi.
Example 3: Mixed-effects model (continued) — 26/35 —

Complete-data log-likelihood
Note the equivalence of (εi, bi) and (Yi, bi), and the fact that

  (bi, εi)' ∼ N( 0, blockdiag(D, σ² I_{ni}) ).

  lC(β, D, σ² | ε1, . . . , εN, b1, . . . , bN)
      ≡ Σ_i { −(1/2) bi' D^{−1} bi − (1/2) log |D| − εi'εi/(2σ²) − (ni/2) log σ² }

The parameters that maximize the complete-data log-likelihood are obtained, conditional
on the other quantities, as

  D  = N^{−1} Σ_{i=1}^N bi bi'
  σ² = ( Σ_{i=1}^N ni )^{−1} Σ_{i=1}^N εi'εi
  β  = ( Σ_{i=1}^N Xi' Xi )^{−1} Σ_{i=1}^N Xi' (Yi − Zi bi).
Example 3: Mixed-effects model (continued) — 27/35 —

E step: evaluate

  E( bi bi' | Yi, β^(k), D^(k), σ^{2(k)} ),   E( εi'εi | Yi, β^(k), D^(k), σ^{2(k)} ),   E( bi | Yi, β^(k), D^(k), σ^{2(k)} )

We use the relationship

  E(bi bi' | Yi) = E(bi | Yi) E(bi | Yi)' + Var(bi | Yi).

Thus we need to calculate E(bi | Yi) and Var(bi | Yi). Recall the conditional
distribution derived from a joint multivariate normal:

  (Yi, bi)' ∼ N( (Xi β, 0)',  [ Zi D Zi' + σ² I_{ni}   Zi D ;  D Zi'   D ] ).

Let Σi = Zi D Zi' + σ² I_{ni}. We know that

  E(bi | Yi)   = 0 + D Zi' Σi^{−1} (Yi − Xi β)
  Var(bi | Yi) = D − D Zi' Σi^{−1} Zi D.
Example 3: Mixed-effects model (continued) — 28/35 —

Similarly, we use the relationship

  E(εi'εi | Yi) = E(εi | Yi)' E(εi | Yi) + tr{ Var(εi | Yi) }.

We can derive

  (Yi, εi)' ∼ N( (Xi β, 0)',  [ Zi D Zi' + σ² I_{ni}   σ² I_{ni} ;  σ² I_{ni}   σ² I_{ni} ] ).

Let Σi = Zi D Zi' + σ² I_{ni}. Then we have

  E(εi | Yi)   = 0 + σ² Σi^{−1} (Yi − Xi β)
  Var(εi | Yi) = σ² I_{ni} − σ⁴ Σi^{−1}.

M step:

  D^(k+1)    = N^{−1} Σ_{i=1}^N E( bi bi' | Yi, β^(k), D^(k), σ^{2(k)} )
  σ^{2(k+1)} = ( Σ_{i=1}^N ni )^{−1} Σ_{i=1}^N E( εi'εi | Yi, β^(k), D^(k), σ^{2(k)} )
  β^(k+1)    = ( Σ_{i=1}^N Xi' Xi )^{−1} Σ_{i=1}^N Xi' E( Yi − Zi bi | Yi, β^(k), D^(k), σ^{2(k)} ).
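
A minimal R sketch of one implementation of these E- and M-steps; the function name and
the per-subject list layout of Y, X, Z are illustrative assumptions, not from the slides.

em_lmm = function(Y, X, Z, beta, D, sigma2, maxiter = 200, tol = 1e-6)
{
  N = length(Y); q = ncol(Z[[1]]); p = length(beta)
  for (iter in 1:maxiter) {
    Ebb = matrix(0, q, q); Eee = 0; n.tot = 0
    XtX = matrix(0, p, p); Xtr = rep(0, p)
    for (i in 1:N) {
      ni = length(Y[[i]])
      Sigma = Z[[i]] %*% D %*% t(Z[[i]]) + sigma2 * diag(ni)
      Sinv = solve(Sigma)
      r = Y[[i]] - X[[i]] %*% beta
      ## E-step: conditional moments of b_i and eps_i given Y_i (slides 27-28)
      Eb = D %*% t(Z[[i]]) %*% Sinv %*% r
      Vb = D - D %*% t(Z[[i]]) %*% Sinv %*% Z[[i]] %*% D
      Ee = sigma2 * Sinv %*% r
      Ve = sigma2 * diag(ni) - sigma2^2 * Sinv
      ## accumulate the sufficient statistics needed by the M-step
      Ebb = Ebb + Eb %*% t(Eb) + Vb
      Eee = Eee + sum(Ee^2) + sum(diag(Ve))
      n.tot = n.tot + ni
      XtX = XtX + t(X[[i]]) %*% X[[i]]
      Xtr = Xtr + t(X[[i]]) %*% (Y[[i]] - Z[[i]] %*% Eb)
    }
    ## M-step (slide 28)
    D.new = Ebb / N
    sigma2.new = Eee / n.tot
    beta.new = solve(XtX, Xtr)
    diff = max(abs(beta.new - beta)) + max(abs(D.new - D)) + abs(sigma2.new - sigma2)
    beta = beta.new; D = D.new; sigma2 = sigma2.new
    if (diff < tol) break
  }
  list(beta = beta, D = D, sigma2 = sigma2, iter = iter)
}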
Issues — 29/35 —

1. Stopping rules

• |l(θ^(k+1)) − l(θ^(k))| < ε for m consecutive steps, where l(θ) is the observed-data
  log-likelihood.

  This is bad! l(θ) may not change much even when θ does.

• ||θ^(k+1) − θ^(k)|| < ε for m consecutive steps

  This could run into problems when the components of θ are of very different
  magnitudes.

• |θj^(k+1) − θj^(k)| < ε1 (|θj^(k)| + ε2) for j = 1, . . . , p

  In practice, take ε1 = 10^{−8} and ε2 = 10 ε1 to 100 ε1.
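
A one-line R sketch of the third (relative) stopping rule, assuming theta and theta.new
hold consecutive EM iterates:

eps1 = 1e-8
eps2 = 100 * eps1
## stop when every component of theta has stabilized in relative terms
converged = all(abs(theta.new - theta) < eps1 * (abs(theta) + eps2))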
Issues (continued) — 30/35 —

2. Local vs. global max


• There may be multiple modes
• EM may converge to a saddle point
• Solution: Multiple starting points
3. Starting points
• Use information from the context
• Use a crude method (such as the method of moments)
• Use an alternative model formulation
4. Slow convergence
• EM can be painfully slow to converge near the maximum
• Solution: Switch to another optimization algorithm when you get near the
maximum
5. Standard errors
• Numerical approximation of the Hessian matrix
• Louis (1982), Meng and Rubin (1991)
Numerical approximation of the Hessian matrix — 31/35 —

Note: l(θ) = observed-data log-likelihood

We estimate the gradient using


  {l̇(θ)}i = ∂l(θ)/∂θi ≈ [ l(θ + δi ei) − l(θ − δi ei) ] / (2 δi),

where ei is a unit vector with 1 for the ith element and 0 otherwise.

In calculating derivatives using this formula, I generally start with some medium
size δ and then repeatedly halve it until the estimated derivative stabilizes.

We can estimate the Hessian by applying the above formula twice:


  {l̈(θ)}ij ≈ [ l(θ + δi ei + δj ej) − l(θ + δi ei − δj ej) − l(θ − δi ei + δj ej) + l(θ − δi ei − δj ej) ] / (4 δi δj)
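
A minimal R sketch of this central-difference approximation; the function name is
illustrative, and a single step size delta is used for all coordinates (in practice,
shrink delta as described above until the estimates stabilize).

num_hessian = function(loglik, theta, delta = 1e-4)
{
  ## loglik: observed-data log-likelihood l(theta); theta: parameter vector (e.g., the MLE)
  p = length(theta)
  H = matrix(0, p, p)
  for (i in 1:p) {
    for (j in 1:p) {
      ei = rep(0, p); ej = rep(0, p)
      ei[i] = delta; ej[j] = delta
      H[i, j] = (loglik(theta + ei + ej) - loglik(theta + ei - ej) -
                 loglik(theta - ei + ej) + loglik(theta - ei - ej)) / (4 * delta^2)
    }
  }
  H
}
## Standard errors from the observed information -H, evaluated at the MLE:
## se = sqrt(diag(solve(-num_hessian(loglik, theta.hat))))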
Louis estimator (1982) — 32/35 —

  lC(θ | Yobs, Ymis) ≡ log f(Yobs, Ymis | θ)

  lO(θ | Yobs) ≡ log { ∫ f(Yobs, ymis | θ) dymis }

  l̇C(θ | Yobs, Ymis), l̇O(θ | Yobs) = gradients of lC, lO
  l̈C(θ | Yobs, Ymis), l̈O(θ | Yobs) = second derivatives of lC, lO

We can prove that (here a^{⊗2} ≡ a a')

  (5)  l̇O(θ | Yobs) = E{ l̇C(θ | Yobs, Ymis) | Yobs }

  (6)  −l̈O(θ | Yobs) = E{ −l̈C(θ | Yobs, Ymis) | Yobs } − E{ [ l̇C(θ | Yobs, Ymis) ]^{⊗2} | Yobs } + [ l̇O(θ | Yobs) ]^{⊗2}

• MLE: θ̂ = arg maxθ lO(θ|Yobs)


• Louis variance estimator: { −l̈O(θ | Yobs) }^{−1}, evaluated at θ = θ̂
• Note: all of the conditional expectations can be computed within the EM algorithm
  using only l̇C and l̈C, the first and second derivatives of the complete-data
  log-likelihood. The Louis estimator should be evaluated at the last step of EM.
Proof of (5) — 33/35 —

Proof: By the definition of lO(θ | Yobs),

  l̇O(θ | Yobs) = ∂ log { ∫ f(Yobs, ymis | θ) dymis } / ∂θ
               = [ ∂ ∫ f(Yobs, ymis | θ) dymis / ∂θ ] / ∫ f(Yobs, ymis | θ) dymis
               = ∫ f'(Yobs, ymis | θ) dymis / ∫ f(Yobs, ymis | θ) dymis.          (7)

Multiplying and dividing the integrand of the numerator by f(Yobs, ymis | θ) gives (5):

  l̇O(θ | Yobs) = ∫ [ f'(Yobs, ymis | θ) / f(Yobs, ymis | θ) ] f(Yobs, ymis | θ) dymis / ∫ f(Yobs, ymis | θ) dymis
               = ∫ [ ∂ log { f(Yobs, ymis | θ) } / ∂θ ] f(Yobs, ymis | θ) dymis / ∫ f(Yobs, ymis | θ) dymis
               = ∫ l̇C(θ | Yobs, ymis) [ f(Yobs, ymis | θ) / ∫ f(Yobs, ymis | θ) dymis ] dymis
               = E{ l̇C(θ | Yobs, Ymis) | Yobs }.
Proof of (6) — 34/35 —

Proof: We take an additional derivative of l̇O(θ | Yobs) in expression (7) to obtain

  l̈O(θ | Yobs) = ∫ f''(Yobs, ymis | θ) dymis / ∫ f(Yobs, ymis | θ) dymis
                 − [ ∫ f'(Yobs, ymis | θ) dymis / ∫ f(Yobs, ymis | θ) dymis ]^{⊗2}
               = ∫ f''(Yobs, ymis | θ) dymis / ∫ f(Yobs, ymis | θ) dymis − { l̇O(θ | Yobs) }^{⊗2}.

To see how the first term breaks down, we take an additional derivative of

  ∫ f'(Yobs, ymis | θ) dymis = ∫ [ ∂ log{ f(Yobs, ymis | θ) } / ∂θ ] f(Yobs, ymis | θ) dymis

to obtain

  ∫ f''(Yobs, ymis | θ) dymis = ∫ [ ∂² log{ f(Yobs, ymis | θ) } / ∂θ ∂θ' ] f(Yobs, ymis | θ) dymis
                                + ∫ [ ∂ log{ f(Yobs, ymis | θ) } / ∂θ ]^{⊗2} f(Yobs, ymis | θ) dymis.

Dividing by ∫ f(Yobs, ymis | θ) dymis, we express the first term as

  E{ l̈C(θ | Yobs, Ymis) | Yobs } + E{ [ l̇C(θ | Yobs, Ymis) ]^{⊗2} | Yobs },

which, together with the display above, gives (6).
Convergence rate of EM — 35/35 —

Let IC(θ) and IO(θ) denote the complete information and observed information,
respectively.

One can show that, when EM converges, the linear convergence rate
(θ^(k+1) − θ̂)/(θ^(k) − θ̂) approximates 1 − IO(θ̂)/IC(θ̂). (later)

This means that


• When missingness is small, EM converges quickly
• Otherwise EM converges slowly.
