Lecture 3: EM
• Assume people’s heights (in cm) follow normal distributions with different means
for males and females: N(µ1, σ1²) for males and N(µ2, σ2²) for females.
• We observe the heights of 5 people (gender unknown): 182, 163, 175, 185, 158.
• We want to estimate µ1, µ2, σ1 and σ2.
This is the typical “two-component normal mixture model”, i.e., the data come from a
mixture of two normal distributions. The goal is to estimate the model parameters.
Some notation: for person i, denote his/her height by xi, and use Zi to indicate
gender (Zi = 1 for male). Let π be the proportion of males in the population.
• In the E-step, compute the probability of each person being male or female,
given the current model parameters. We have (after some derivation)
\[
\lambda_i^{(k)} \equiv E\!\left[ Z_i \,\middle|\, x_i,\, \mu_1^{(k)}, \mu_2^{(k)}, \sigma_1^{(k)}, \sigma_2^{(k)}, \pi^{(k)} \right]
= \frac{\pi^{(k)}\,\phi(x_i;\, \mu_1^{(k)}, \sigma_1^{(k)})}
       {\pi^{(k)}\,\phi(x_i;\, \mu_1^{(k)}, \sigma_1^{(k)}) + \bigl(1-\pi^{(k)}\bigr)\,\phi(x_i;\, \mu_2^{(k)}, \sigma_2^{(k)})}.
\]
• In the M-step, update the parameters by weighted averages and variances, with
weights λi^(k) for the male component and 1 − λi^(k) for the female component:
\[
\pi^{(k+1)} = \frac{1}{n}\sum_i \lambda_i^{(k)}, \qquad
\mu_1^{(k+1)} = \frac{\sum_i \lambda_i^{(k)} x_i}{\sum_i \lambda_i^{(k)}}, \qquad
\mu_2^{(k+1)} = \frac{\sum_i \bigl(1-\lambda_i^{(k)}\bigr) x_i}{\sum_i \bigl(1-\lambda_i^{(k)}\bigr)},
\]
\[
\sigma_1^{2\,(k+1)} = \frac{\sum_i \lambda_i^{(k)} \bigl(x_i - \mu_1^{(k+1)}\bigr)^2}{\sum_i \lambda_i^{(k)}}, \qquad
\sigma_2^{2\,(k+1)} = \frac{\sum_i \bigl(1-\lambda_i^{(k)}\bigr) \bigl(x_i - \mu_2^{(k+1)}\bigr)^2}{\sum_i \bigl(1-\lambda_i^{(k)}\bigr)}.
\]
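For concreteness, a minimal R sketch of one E-step and M-step for the five observed heights, using hypothetical starting values (the starting values and variable names below are my own, not from the lecture):

## Heights of the 5 people (gender unknown)
x <- c(182, 163, 175, 185, 158)

## Hypothetical starting values (my own, not from the lecture)
pi.cur <- 0.5
mu1 <- 180; sd1 <- 10     # "male" component
mu2 <- 165; sd2 <- 10     # "female" component

## E-step: probability that each person is male given the current parameters
num    <- pi.cur * dnorm(x, mu1, sd1)
lambda <- num / (num + (1 - pi.cur) * dnorm(x, mu2, sd2))

## M-step: weighted averages and variances
pi.new  <- mean(lambda)
mu1.new <- sum(lambda * x) / sum(lambda)
mu2.new <- sum((1 - lambda) * x) / sum(1 - lambda)
sd1.new <- sqrt(sum(lambda * (x - mu1.new)^2) / sum(lambda))
sd2.new <- sqrt(sum((1 - lambda) * (x - mu2.new)^2) / sum(1 - lambda))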
Example results
The parameter estimates after the M-step (weighted averages and variances) are:
µ1 = 176, µ2 = 167, σ1 = 8.7, σ2 = 9.2, π = 0.63.
• At iteration 15 (converged), we have:
Person 1 2 3 4 5
Height (cm) 179 165 175 185 158
Prob. male 9.999968e-01 4.009256e-03 9.990943e-01 1.000000e+00 2.443061e-06
Let nA, nB, nO, nAB be the observed numbers of individuals with phenotypes A, B, O,
AB, respectively.
Let nAA, nAO, nBB and nBO be the unobserved numbers of individuals with genotypes
AA, AO, BB and BO, respectively. They satisfy nAA + nAO = nA and nBB + nBO = nB.
2. Calculate the expected nAA and nBB, given the observed data and p^(k):
\[
n_{AA}^{(k+1)} = E\!\left(n_{AA} \mid n_A, p^{(k)}\right)
= n_A\, \frac{p_A^{(k)} p_A^{(k)}}{p_A^{(k)} p_A^{(k)} + 2\, p_O^{(k)} p_A^{(k)}},
\qquad n_{BB}^{(k+1)} = \,?
\]
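A minimal R sketch of this E-step, assuming hypothetical observed phenotype counts and current allele-frequency estimates (all numbers and names below are placeholders of my own, not from the lecture):

## Hypothetical observed phenotype counts (placeholders)
nA <- 186; nB <- 38; nO <- 284; nAB <- 13

## Current allele frequency estimates p^(k) (placeholders)
pA <- 0.3; pB <- 0.1; pO <- 0.6

## E-step: expected genotype counts given phenotype counts and p^(k)
## (under Hardy-Weinberg, P(AA) = pA^2 and P(AO) = 2*pA*pO)
nAA <- nA * pA^2 / (pA^2 + 2 * pA * pO)
nAO <- nA - nAA
nBB <- nB * pB^2 / (pB^2 + 2 * pB * pO)   # mirrors the nAA formula
nBO <- nB - nBB
c(nAA = nAA, nAO = nAO, nBB = nBB, nBO = nBO)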
The EM algorithm is useful when
• some of the random variables involved are not observed, i.e., they are considered
missing or incomplete;
• directly maximizing the target likelihood function is difficult, but one can introduce
(missing) random variables so that maximizing the complete-data likelihood is
simple.
Consider (Yobs, Ymis) ∼ f(yobs, ymis|θ), where we observe Yobs but not Ymis.
It can be difficult to find the MLE
\[
\hat{\theta} = \arg\max_\theta g(Y_{obs}\mid\theta) = \arg\max_\theta \int f(Y_{obs}, y_{mis}\mid\theta)\, dy_{mis}.
\]
But it could be easy to find θ̂C = arg maxθ f(Yobs, Ymis|θ), if we had observed Ymis.
• E step: compute
\[
h^{(k)}(\theta) \equiv E\!\left[\log f(Y_{obs}, Y_{mis}\mid\theta) \,\middle|\, Y_{obs}, \theta^{(k)}\right].
\]
The E-step creates a surrogate function (often called the “Q function”): the
expected value of the complete-data log-likelihood with respect to the conditional
distribution of Ymis given Yobs, under the current estimate of the parameters θ(k).
• M step: maximize the surrogate, θ(k+1) = arg maxθ h(k)(θ), and iterate until convergence.
Since θ(k+1) maximizes h(k)(θ) and log f(Yobs, Ymis|θ) = log c(Ymis|Yobs, θ) + log g(Yobs|θ), we have
\[
E\{\log c(Y_{mis}\mid Y_{obs}, \theta^{(k+1)}) \mid Y_{obs}, \theta^{(k)}\} + \log g(Y_{obs}\mid\theta^{(k+1)})
\;\ge\;
E\{\log c(Y_{mis}\mid Y_{obs}, \theta^{(k)}) \mid Y_{obs}, \theta^{(k)}\} + \log g(Y_{obs}\mid\theta^{(k)}). \tag{1}
\]
By the non-negativity of the Kullback-Leibler information, i.e.,
\[
\int \log\!\left\{\frac{p(x)}{q(x)}\right\} p(x)\, dx \ge 0 \quad \text{for densities } p(x), q(x),
\]
we have
\[
\int \log\!\left\{\frac{c(y_{mis}\mid Y_{obs}, \theta^{(k)})}{c(y_{mis}\mid Y_{obs}, \theta^{(k+1)})}\right\} c(y_{mis}\mid Y_{obs}, \theta^{(k)})\, dy_{mis}
= E\!\left\{\log\frac{c(Y_{mis}\mid Y_{obs}, \theta^{(k)})}{c(Y_{mis}\mid Y_{obs}, \theta^{(k+1)})} \,\middle|\, Y_{obs}, \theta^{(k)}\right\} \ge 0. \tag{2}
\]
Ascent property of EM (continued)
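Combining (1) and (2) yields the ascent property; a minimal LaTeX sketch of that step (my wording, following the two displayed inequalities):

% Rearranging (1) and applying (2)
\begin{align*}
\log g(Y_{obs}\mid\theta^{(k+1)}) - \log g(Y_{obs}\mid\theta^{(k)})
  &\ge E\{\log c(Y_{mis}\mid Y_{obs},\theta^{(k)}) \mid Y_{obs},\theta^{(k)}\}
     - E\{\log c(Y_{mis}\mid Y_{obs},\theta^{(k+1)}) \mid Y_{obs},\theta^{(k)}\} \\
  &\ge 0 \quad \text{by (2)},
\end{align*}

so the observed-data likelihood g(Yobs|θ(k)) never decreases from one EM iteration to the next.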
Suppose Y = (y1, y2, y3, y4) has a multinomial distribution with cell probabilities
\[
\left(\frac{1}{2} + \frac{\theta}{4},\;\; \frac{1-\theta}{4},\;\; \frac{1-\theta}{4},\;\; \frac{\theta}{4}\right).
\]
Then the likelihood for Y is given by
\[
L(\theta\mid Y) \equiv \frac{(y_1+y_2+y_3+y_4)!}{y_1!\,y_2!\,y_3!\,y_4!}
\left(\frac{1}{2} + \frac{\theta}{4}\right)^{y_1}
\left(\frac{1-\theta}{4}\right)^{y_2}
\left(\frac{1-\theta}{4}\right)^{y_3}
\left(\frac{\theta}{4}\right)^{y_4}.
\]
If we use Newton-Raphson to directly maximize L(θ|Y), we need
\[
\dot{l}(\theta\mid Y) = \frac{y_1/4}{1/2 + \theta/4} - \frac{y_2 + y_3}{1-\theta} + \frac{y_4}{\theta},
\qquad
\ddot{l}(\theta\mid Y) = -\frac{y_1}{(2 + \theta)^2} - \frac{y_2 + y_3}{(1 - \theta)^2} - \frac{y_4}{\theta^2}.
\]
How can we avoid these derivative calculations?
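To see what the direct approach involves, here is a minimal R sketch of Newton-Raphson for this log-likelihood, using the observed counts Y = (125, 18, 20, 34) introduced in the EM example below (the code and variable names are my own):

## Newton-Raphson for the grouped multinomial log-likelihood (a sketch)
y <- c(125, 18, 20, 34)                      # observed counts (same data as the EM example)
ldot  <- function(th) (y[1]/4)/(1/2 + th/4) - (y[2] + y[3])/(1 - th) + y[4]/th
lddot <- function(th) -y[1]/(2 + th)^2 - (y[2] + y[3])/(1 - th)^2 - y[4]/th^2
th <- 0.5                                    # starting value
for (i in 1:20) {
  th.new <- th - ldot(th)/lddot(th)          # Newton-Raphson update
  if (abs(th.new - th) < 1e-10) break
  th <- th.new
}
th                                           # the MLE of theta (roughly 0.63)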
Example 1: Grouped Multinomial Data (continued)
Suppose Y = (y1, y2, y3, y4) has a multinomial distribution with cell probabilities
\[
\left(\frac{1}{2} + \frac{\theta}{4},\;\; \frac{1-\theta}{4},\;\; \frac{1-\theta}{4},\;\; \frac{\theta}{4}\right).
\]
Define the complete data X = (x0, x1, y2, y3, y4) to have a multinomial distribution
with cell probabilities
\[
\left(\frac{1}{2},\;\; \frac{\theta}{4},\;\; \frac{1-\theta}{4},\;\; \frac{1-\theta}{4},\;\; \frac{\theta}{4}\right),
\]
and to satisfy x0 + x1 = y1.
E step: evaluate
\[
x_1^{(k+1)} = E\!\left(x_1 \mid Y, \theta^{(k)}\right) = y_1\, \frac{\theta^{(k)}/4}{1/2 + \theta^{(k)}/4}.
\]
M step: maximize the complete-data log-likelihood with x1 replaced by x1^(k+1):
\[
\theta^{(k+1)} = \frac{x_1^{(k+1)} + y_4}{x_1^{(k+1)} + y_4 + y_2 + y_3}.
\]
Example 1: Grouped Multinomial Data (continued)
We observe Y = (125, 18, 20, 34) and start EM with θ(0) = 0.5.
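A minimal R sketch of this EM iteration for the observed Y = (125, 18, 20, 34) with θ(0) = 0.5 (the code and variable names are my own):

y  <- c(125, 18, 20, 34)   # observed counts
th <- 0.5                  # theta^(0)
for (k in 1:100) {
  ## E step: expected value of x1 given the data and the current theta
  x1 <- y[1] * (th/4) / (1/2 + th/4)
  ## M step: closed-form maximizer of the complete-data log-likelihood
  th.new <- (x1 + y[4]) / (x1 + y[4] + y[2] + y[3])
  if (abs(th.new - th) < 1e-10) break
  th <- th.new
}
th                         # the EM estimate of theta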
Consider a J-group normal mixture, where
\[
x_1, \ldots, x_n \sim \sum_{j=1}^{J} p_j\, \phi(x_i \mid \mu_j, \sigma_j),
\]
and φ(·|µ, σ) denotes the normal density.
Define indicator variables for observation i: (yi1, yi2, . . . , yiJ) follows a multinomial
distribution (with trial number 1) and cell probabilities p = (p1, p2, . . . , pJ). Clearly,
Σj yij = 1. Given yij∗ = 1 and yij = 0 for j ≠ j∗, we assume xi ∼ N(µj∗, σ²j∗).
Here, {xi}i is the observed data; {xi, yi1, . . . , yiJ}i is the complete data.
E step: compute
\[
\omega_{ij}^{(k)} \equiv E\!\left(y_{ij} \mid x_i, \mu^{(k)}, \sigma^{(k)}, p^{(k)}\right)
= \frac{p_j^{(k)}\, \phi\!\left(x_i \mid \mu_j^{(k)}, \sigma_j^{(k)}\right)}
       {\sum_{l=1}^{J} p_l^{(k)}\, \phi\!\left(x_i \mid \mu_l^{(k)}, \sigma_l^{(k)}\right)}.
\]
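For reference, a sketch of the standard M-step updates for this mixture (standard weighted-average results, stated here for completeness rather than transcribed from the slide):

\[
p_j^{(k+1)} = \frac{1}{n}\sum_{i=1}^{n} \omega_{ij}^{(k)}, \qquad
\mu_j^{(k+1)} = \frac{\sum_i \omega_{ij}^{(k)} x_i}{\sum_i \omega_{ij}^{(k)}}, \qquad
\sigma_j^{2\,(k+1)} = \frac{\sum_i \omega_{ij}^{(k)} \bigl(x_i - \mu_j^{(k+1)}\bigr)^2}{\sum_i \omega_{ij}^{(k)}}.
\]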
Practice: When all groups share the same variance (σ2), what’s the M-step update
for σ2?
Example 2: Normal mixtures in R
## end of the EM while-loop: accept the new parameter values and iterate
p=p.new;
mu1=mu1.new;
mu2=mu2.new;
sd1=sd1.new;
sd2=sd2.new;
iter=iter+1;
}
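Only the tail of the EM loop is shown above. A minimal self-contained sketch of the full loop it could belong to, assuming the observed data vector X (simulated on the next slide) and the same variable names; this is a reconstruction, not the lecture's exact code:

p <- 0.5; mu1 <- 0; mu2 <- 1; sd1 <- 1; sd2 <- 1   # arbitrary starting values
iter <- 1; tol <- 1e-8; maxit <- 1000
while (iter <= maxit) {
  ## E step: posterior probability that each observation comes from component 1
  d1 <- p * dnorm(X, mu1, sd1)
  d2 <- (1 - p) * dnorm(X, mu2, sd2)
  w  <- d1 / (d1 + d2)
  ## M step: weighted proportion, means and standard deviations
  p.new   <- mean(w)
  mu1.new <- sum(w * X) / sum(w)
  mu2.new <- sum((1 - w) * X) / sum(1 - w)
  sd1.new <- sqrt(sum(w * (X - mu1.new)^2) / sum(w))
  sd2.new <- sqrt(sum((1 - w) * (X - mu2.new)^2) / sum(1 - w))
  ## stop once all parameters have stabilized
  if (max(abs(c(p.new - p, mu1.new - mu1, mu2.new - mu2,
                sd1.new - sd1, sd2.new - sd2))) < tol) break
  p=p.new;
  mu1=mu1.new;
  mu2=mu2.new;
  sd1=sd1.new;
  sd2=sd2.new;
  iter=iter+1;
}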
Example 2: Normal mixtures in R (continued)
> ## simulation
> p0=0.3;
> n=5000;
> X1=rnorm(n*p0); # n*p0 individuals from N(0,1)
> X2=rnorm(n*(1-p0), mean=4) # n*(1-p0) individuals from N(4,1)
> X=c(X1,X2) # observed data
> hist(X, 50)
[Figure: histogram of X (Frequency vs. X), with X ranging from about −2 to 8 and frequencies up to about 60.]
Example 2: Normal mixtures in R (continued)
Using the same notation as in the normal mixture model, now assume the data come
from a mixture of Poisson distributions.
Consider
\[
x_1, \ldots, x_n \sim \sum_{j=1}^{J} p_j\, \phi(x_i \mid \lambda_j),
\]
where φ(·|λ) is the Poisson density. Again use the multinomial indicator variables yij as before.
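A minimal R sketch of one EM iteration for a two-group Poisson mixture (J = 2), assuming observed counts x and hypothetical current values for p and λ (all data and names below are my own placeholders):

## Placeholder data from a two-component Poisson mixture
x     <- c(rpois(100, 1), rpois(100, 6))
p.k   <- c(0.5, 0.5)                         # current mixing proportions
lam.k <- c(1, 5)                             # current Poisson means

## E step: posterior membership weights w[i, j]
dens <- cbind(dpois(x, lam.k[1]), dpois(x, lam.k[2]))
num  <- sweep(dens, 2, p.k, "*")
w    <- num / rowSums(num)

## M step: update proportions and Poisson means by weighted averages
p.new   <- colMeans(w)
lam.new <- colSums(w * x) / colSums(w)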
Observed-data log-likelihood
Here Yi = Xiβ + Zibi + εi is the usual linear mixed-effects model, with bi ∼ N(0, D)
and εi ∼ N(0, σ²Ini). The observed-data log-likelihood is
\[
l(\beta, D, \sigma^2 \mid Y_1, \ldots, Y_N) \equiv \sum_i \left\{ -\frac{1}{2}\,(Y_i - X_i\beta)'\,\Sigma_i^{-1}\,(Y_i - X_i\beta) - \frac{1}{2}\log|\Sigma_i| \right\},
\]
where Σi = ZiDZi′ + σ²Ini.
• In fact, this likelihood can be directly maximized for (β, D, σ2) by using
Newton-Raphson or Fisher scoring.
• Given (D, σ2) and hence Σi, we obtain the β that maximizes the likelihood by solving
\[
\frac{\partial l(\beta, D, \sigma^2 \mid Y_1, \ldots, Y_N)}{\partial \beta} = \sum_i X_i'\,\Sigma_i^{-1}(Y_i - X_i\beta) = 0,
\]
which implies
\[
\beta = \left( \sum_{i=1}^{N} X_i'\,\Sigma_i^{-1} X_i \right)^{-1} \sum_{i=1}^{N} X_i'\,\Sigma_i^{-1} Y_i.
\]
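A minimal R sketch of this generalized-least-squares computation, assuming per-subject design matrices and responses are stored in lists Xlist, Ylist, Zlist and that D and sigma2 hold the current values (all names are my own):

## GLS estimate of beta given D and sigma^2 (a sketch)
beta.gls <- function(Xlist, Ylist, Zlist, D, sigma2) {
  A <- 0; b <- 0
  for (i in seq_along(Ylist)) {
    ni    <- length(Ylist[[i]])
    Sigma <- Zlist[[i]] %*% D %*% t(Zlist[[i]]) + sigma2 * diag(ni)
    Sinv  <- solve(Sigma)
    A <- A + t(Xlist[[i]]) %*% Sinv %*% Xlist[[i]]
    b <- b + t(Xlist[[i]]) %*% Sinv %*% Ylist[[i]]
  }
  solve(A, b)    # (sum X' Sigma^{-1} X)^{-1} (sum X' Sigma^{-1} Y)
}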
Example 3: Mixed-effects model (continued)
Complete-data log-likelihood
Note the equivalence of (εi, bi) and (Yi, bi), and the fact that
\[
\begin{pmatrix} b_i \\ \varepsilon_i \end{pmatrix} \sim N\!\left( \begin{pmatrix} 0 \\ 0 \end{pmatrix}, \begin{pmatrix} D & 0 \\ 0 & \sigma^2 I_{n_i} \end{pmatrix} \right).
\]
Then
\[
l_C(\beta, D, \sigma^2 \mid \varepsilon_1, \ldots, \varepsilon_N, b_1, \ldots, b_N) \equiv \sum_i \left\{ -\frac{1}{2}\, b_i' D^{-1} b_i - \frac{1}{2}\log|D| - \frac{1}{2\sigma^2}\,\varepsilon_i'\varepsilon_i - \frac{n_i}{2}\log\sigma^2 \right\}.
\]
E step: evaluate
\[
E\!\left(b_i b_i' \mid Y_i, \beta^{(k)}, D^{(k)}, \sigma^{2(k)}\right), \qquad
E\!\left(\varepsilon_i'\varepsilon_i \mid Y_i, \beta^{(k)}, D^{(k)}, \sigma^{2(k)}\right), \qquad
E\!\left(b_i \mid Y_i, \beta^{(k)}, D^{(k)}, \sigma^{2(k)}\right).
\]
Note that Var(εi | Yi) = σ²Ini − σ⁴Σi⁻¹.
M step:
\[
D^{(k+1)} = N^{-1} \sum_{i=1}^{N} E\!\left(b_i b_i' \mid Y_i, \beta^{(k)}, D^{(k)}, \sigma^{2(k)}\right),
\]
\[
\sigma^{2(k+1)} = \left( \sum_{i=1}^{N} n_i \right)^{-1} \sum_{i=1}^{N} E\!\left(\varepsilon_i'\varepsilon_i \mid Y_i, \beta^{(k)}, D^{(k)}, \sigma^{2(k)}\right),
\]
\[
\beta^{(k+1)} = \left( \sum_{i=1}^{N} X_i' X_i \right)^{-1} \sum_{i=1}^{N} X_i'\, E\!\left(Y_i - Z_i b_i \mid Y_i, \beta^{(k)}, D^{(k)}, \sigma^{2(k)}\right).
\]
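A minimal R sketch of one full EM iteration for this model; the closed-form conditional moments used in the E-step are standard normal-theory results added here for completeness, and all function and variable names are my own:

## One EM iteration for the mixed-effects model (a sketch)
## Xlist, Ylist, Zlist: per-subject X_i, Y_i, Z_i; beta, D, sigma2: current values
em.step.lmm <- function(Xlist, Ylist, Zlist, beta, D, sigma2) {
  N <- length(Ylist)
  q <- nrow(D); p <- length(beta)
  sumBB <- matrix(0, q, q); sumEE <- 0; sumN <- 0
  XtX <- matrix(0, p, p); Xtr <- rep(0, p)
  for (i in 1:N) {
    Xi <- Xlist[[i]]; Yi <- Ylist[[i]]; Zi <- Zlist[[i]]
    ni    <- length(Yi)
    Sigma <- Zi %*% D %*% t(Zi) + sigma2 * diag(ni)
    Sinv  <- solve(Sigma)
    resid <- Yi - Xi %*% beta
    ## E step: conditional moments of b_i and eps_i given Y_i (standard results)
    Ebi   <- D %*% t(Zi) %*% Sinv %*% resid
    Vbi   <- D - D %*% t(Zi) %*% Sinv %*% Zi %*% D
    Ebibi <- Vbi + Ebi %*% t(Ebi)
    Eei   <- sigma2 * Sinv %*% resid
    Eee   <- sum(diag(sigma2 * diag(ni) - sigma2^2 * Sinv)) + sum(Eei^2)
    ## accumulate sufficient statistics for the M step
    sumBB <- sumBB + Ebibi
    sumEE <- sumEE + Eee
    sumN  <- sumN + ni
    XtX   <- XtX + t(Xi) %*% Xi
    Xtr   <- Xtr + t(Xi) %*% (Yi - Zi %*% Ebi)
  }
  ## M step: closed-form updates
  list(D      = sumBB / N,
       sigma2 = sumEE / sumN,
       beta   = solve(XtX, Xtr))
}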
Issues
1. Stopping rules
• Stopping when the change in the log-likelihood, |l(θ(k+1)) − l(θ(k))|, is small: this is
bad! l(θ) may not change much even when θ does.
• Stopping when |θj(k+1) − θj(k)| < ε1 (|θj(k)| + ε2) for j = 1, . . . , p.
In practice, take
ε1 = 10⁻⁸
ε2 = 10 ε1 to 100 ε1
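A minimal R sketch of this parameter-based stopping rule (the function name and the default eps2 = 10·eps1 are my own choices):

## Stop when every parameter has stabilized in relative terms
converged <- function(theta.new, theta.old, eps1 = 1e-8, eps2 = 10 * eps1) {
  all(abs(theta.new - theta.old) < eps1 * (abs(theta.old) + eps2))
}
## usage inside an EM loop, e.g.: if (converged(c(p.new, mu1.new), c(p, mu1))) break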
Issues (continued)
In calculating derivatives using this formula, I generally start with some medium
size δ and then repeatedly halve it until the estimated derivative stabilizes.
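As an illustration of this halving strategy, a small R sketch using a central-difference approximation (the exact formula on the slide may differ; the function name is my own):

## Numerical derivative of f at x: start with a medium delta and halve it
## until successive central-difference estimates stabilize
num.deriv <- function(f, x, delta = 0.1, tol = 1e-6, max.halve = 30) {
  est <- (f(x + delta) - f(x - delta)) / (2 * delta)
  for (h in 1:max.halve) {
    delta   <- delta / 2
    est.new <- (f(x + delta) - f(x - delta)) / (2 * delta)
    if (abs(est.new - est) < tol) return(est.new)
    est <- est.new
  }
  est
}
num.deriv(function(t) t^3, 2)   # roughly 12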
Differentiating the observed-data log-likelihood gives
\[
\dot{l}(\theta \mid Y_{obs})
= \int \dot{l}_C(\theta \mid Y_{obs}, y_{mis})\, \frac{f(Y_{obs}, y_{mis} \mid \theta)}{\int f(Y_{obs}, y_{mis} \mid \theta)\, dy_{mis}}\, dy_{mis}
= E\!\left\{ \dot{l}_C(\theta \mid Y_{obs}, Y_{mis}) \,\middle|\, Y_{obs} \right\}.
\]
Proof of (6)
Let IC(θ) and IO(θ) denote the complete information and observed information,
respectively.
One can show that when EM converges, the linear convergence rate
(θ(k+1) − θ̂)/(θ(k) − θ̂) is approximately 1 − IO(θ̂)/IC(θ̂). (More on this later.)