
EM-algorithm

Max Welling

California Institute of Technology 136-93
Pasadena, CA 91125
[email protected]

1 Introduction
In the previous class we already mentioned that many of the most powerful probabilistic models contain
hidden variables. We will denote these variables with y. It is usually also the case that these models are
most easily written in terms of their joint density,

p(d, y, θ) = p(d|y, θ) p(y|θ) p(θ) (1)

Remember also that the objective function we want to maximize is the log-likelihood (possibly including the
prior term like in MAP-estimation) given by,

L(d, θ) = log[p(d|θ)] + log[p(θ)] (2)


= log[ ∫ dy p(d, y|θ) ] + log[p(θ)]   (3)

Notice that maximum likelihood estimation is a special case, obtained by neglecting the prior term (i.e. setting p(θ) = 1). In the following we will include p(θ), because it does not complicate the derivation and treats a slightly more general case. It is then very easy to switch to ML estimation by changing p(..., θ) → p(...|θ). For unsupervised learning we can simply set d = x^N,
L(x^N, θ) = Σ_{n=1}^{N} log[p(x_n|θ)] + log[p(θ)],   (4)

while for supervised learning we may put d = {x^N, t^N} and decompose,


L(t^N, x^N, θ) = Σ_{n=1}^{N} log[p(t_n|x_n, θ)] + Σ_{n=1}^{N} log[p(x_n|θ)] + log[p(θ)].   (5)

In the case of supervised learning we may decide that we are not interested in modelling the input distribution
p(x^N|θ) and simply omit it (i.e. set it equal to one).
We could now directly take derivatives of these likelihood functions with respect to θ and use them to find
the maximum (for instance through gradient ascent). It turns out that the resulting equations can become quite
hairy, and an easier method exists, called expectation maximization (EM). The main idea is that it would be much
easier to optimize the joint density log[p(d, y, θ)] if we knew the values of y. Unfortunately we don't
know them, which is the reason we integrated them out in the likelihood. Instead we will optimize
the function,
Q(θ_t|θ_{t−1}) = E[ log[p(d, y, θ_t)] | d, θ_{t−1} ],   (6)
i.e. the expected complete-data log-likelihood, given the data and the parameter values from the previous iteration. We therefore iterate,
E-step Calculate Q(θ_t|θ_{t−1}), given the parameter estimates θ_{t−1} from the previous iteration.
M-step Maximize Q(θ_t|θ_{t−1}) over θ_t.
This procedure is guaranteed to improve the log-likelihood at every iteration.
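In code, the iteration has the following shape; this is a minimal sketch in which `e_step` and `m_step` are hypothetical placeholders for the model-specific computations (a concrete instance follows in the next section):

```python
def em(theta, e_step, m_step, n_iter=100, tol=1e-8):
    """Generic EM loop (a sketch, assuming scalar theta for the
    convergence test). e_step(theta) returns Q(. | theta_{t-1});
    m_step(Q) returns the maximizer of Q over theta_t."""
    for _ in range(n_iter):
        Q = e_step(theta)        # E-step
        theta_new = m_step(Q)    # M-step
        if abs(theta_new - theta) < tol:
            break
        theta = theta_new
    return theta
```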

2 Example
Let’s consider the case where we have a random variable distributed according to a multinomial distribution,
p(n1, n2, n3) = N!/(n1! n2! n3!) · p1^{n1} p2^{n2} p3^{n3}   (7)
with n1 + n2 + n3 = N and p1 + p2 + p3 = 1. For instance, think of a jar with three different colors of balls.
The probability of drawing a red ball is p1, a green ball p2, and a blue ball p3. After we pick a ball we
return it to the jar (i.e. the draws are independent). If we decide on picking N balls, the probability of a
particular draw of n1 red balls, n2 green balls and n3 blue balls is then described by the above distribution.
The combinatorial factor counts the different ways to arrive at this draw. Now, let us assume that we
have the following parametrized model for the probabilities of drawing the different colors,
p1 = 1/4   (8)
p2 = 1/4 + p/4   (9)
p3 = 1/2 − p/4   (10)
where p is thus the only parameter.
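To make this concrete, here is a small simulation of the model; the true value p = 0.6 and the random seed are arbitrary assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
p_true = 0.6                                      # assumed ground truth
probs = [1/4, 1/4 + p_true/4, 1/2 - p_true/4]     # eqs. (8)-(10)
n1, n2, n3 = rng.multinomial(100, probs)          # one draw of N = 100 balls
m1, m2 = n1 + n2, n3                              # what a colorblind observer sees
```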
Now assume that the man who is doing the experiment is actually colorblind and cannot discern the red from the green balls. He draws N balls, but only sees m1 = n1 + n2 red/green balls and m2 = n3 blue balls. The question is: can the man still estimate the parameter p, and with that in hand calculate his best guess for the numbers of red and green balls (obviously, he knows the number of blue balls)? Fortunately the man is smart and maximizes the likelihood of his observations,
L(m1, m2) = N!/(m1! m2!) · (p1 + p2)^{m1} p3^{m2}   (11)
          = N!/(m1! m2!) · (1/2 + p/4)^{m1} (1/2 − p/4)^{m2}   (12)
which is now a binomial distribution, where the probability of drawing a red or a green ball is given by
p1 + p2 . Taking the logarithm and maximizing with respect to p gives,
p = 2 (m1 − m2)/(m1 + m2)   (13)
and therefore he estimates the expected numbers of red and green balls to be,
E[n1|m1] = m1 · p1/(p1 + p2) = (1/4)(m1 + m2)   (14)
E[n2|m1] = m1 · p2/(p1 + p2) = (1/4)(3 m1 − m2)   (15)
Now let's see what the EM procedure would give us in this case. First we have to compute the average of
the log of the complete-data pdf over the unobserved variables, given the observed variables. Therefore we
need,
p(n1, n2|m1) = m1!/(n1! n2!) · (p1/(p1 + p2))^{n1} (p2/(p1 + p2))^{n2}   (16)
First notice that n3 is determined if m2 is given (they are equal). For the red and green balls we know
that their total must be m1, and that each ball is drawn with relative probability p1/(p1 + p2) and p2/(p1 + p2)
respectively; hence they are distributed according to the above binomial distribution. Now let's look at the
complete-data likelihood. In our formalism we need to look at,
p(n1, n2, n3, m1, m2) = δ(m1 − n1 − n2) δ(m2 − n3) p(n1, n2, n3).   (17)
The delta functions reflect the fact that the random variables m_i are deterministic functions of the n_i. We may
now reason that the logarithm changes the product into a sum of three terms, of which the terms which

contain delta functions do not depend on any parameters to be optimized and can be omitted for that reason.
The same holds for some terms in the expression for p(n1, n2, n3). Putting everything together we find,
Q ∝ E[n2|m1] log(1/4 + p̃/4) + m2 log(1/2 − p̃/4)   (18)
where we have,
E[n2|m1] = m1 · p2/(p1 + p2) = m1 (1 + p)/(2 + p)   (19)
This calculation is formally the E-step. Notice that p̃ denotes the "new" p which is going to be updated,
while p denotes the value inserted from the previous iteration. Now taking the derivative and equating it to zero
gives,
p̃ = (2 E[n2|m1] − m2)/(E[n2|m1] + m2)   (20)
which formally comprises the M-step. Iterating these equations will lead to the same answer as the analytical
expression (see demo-EM).
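For concreteness, here is a minimal implementation of this E/M pair (a sketch, not the referenced demo; the counts m1 = 63, m2 = 37 are hypothetical):

```python
def em_colorblind(m1, m2, p_init=0.5, n_iter=50):
    """EM for the colorblind multinomial example (a sketch)."""
    p = p_init
    for _ in range(n_iter):
        # E-step, eq. (19): expected number of green balls among the m1 draws
        e_n2 = m1 * (1 + p) / (2 + p)
        # M-step, eq. (20): closed-form maximizer of Q
        p = (2 * e_n2 - m2) / (e_n2 + m2)
    return p

m1, m2 = 63, 37                       # hypothetical observed counts
print(em_colorblind(m1, m2))          # 0.52
print(2 * (m1 - m2) / (m1 + m2))      # 0.52, the analytic MLE of eq. (13)
```

The fixed point of the iteration coincides with the analytic maximum-likelihood estimate, as the derivation above predicts.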

3 EM-Algorithm
Consider a function L_θ(x) that needs to be maximized over the parameters θ, and assume that it can be
written as follows,
L_θ(x) = F_θ[x, q] + R_θ[x, q],   (21)
where R_θ[x, q] is a non-negative rest-term, depending on the variable x and the function q, with the following
properties,
R[x, q] ≥ 0  ∀q   (22)
∃ p such that R[x, p] = 0   (23)
With these properties it is easy to prove that the following iterative scheme increases L(x) at every iteration,
L_{θ_t}(x) = F_{θ_t}[x, p_t]
           ≤ F_{θ_{t+1}}[x, p_t]                      (maximize over θ_t)
           = L_{θ_{t+1}}(x) − R_{θ_{t+1}}[x, p_t]     (using p_t = q_{t+1})
           ≤ L_{θ_{t+1}}(x)                           (since R[x, q] ≥ 0 ∀q)   (24)

The mysterious part about this is of course where we get the function R[x, q]. In the following we will show
that if we take L to be the log-likelihood, R can be easily derived, and the iterative scheme above can be used
to maximize the log-likelihood. In the following, q(y) will represent an arbitrary probability distribution.
L(d, θ) = log[p(d, θ)]
        = ∫ dy q(y) log[p(d, θ)]
        = ∫ dy q(y) log[ p(d, y, θ)/p(y|d, θ) × q(y)/q(y) ]
        = Q(q‖p_joint) + H(q‖q) + KL(q‖p_post)   (25)
where we used ∫ dy q(y) = 1 and we have defined,
Q(q‖p_joint) = ∫ dy q(y) log[p(d, y, θ)]   (26)
H(q‖q) = − ∫ dy q(y) log[q(y)]   (27)
KL(q‖p_post) = ∫ dy q(y) log[ q(y)/p(y|d, θ) ]   (28)

The last term KL(q‖p_post) is the Kullback-Leibler divergence between the arbitrary probability density q(y)
and the posterior density p(y|d, θ). If we look back at our original optimization problem we see that we need
to identify F = Q + H and R = KL. Finally, we need to prove that the two properties of R hold. Both are
direct consequences of the fact that the KL-term measures the distance between the distributions q(y) and
p(y|d, θ), which is always non-negative and equal to zero if and only if q(y) = p(y|d, θ). The proof goes as follows.
It is easy to see that if we set
q(y) = p(y|d, θ), then R = 0,   (29)
confirming the second property (23). The first property (22) is a direct consequence of Jensen's inequality,
E[f(x)] ≥ f(E[x])   (30)
for convex functions f. Using that f(x) = − log(x) is convex, we write,


KL(q‖p) = ∫ dx q log(q/p)
        = ∫ dx q [ − log(p/q) ]
        ≥ − log[ ∫ dx q · (p/q) ]
        = − log[ ∫ dx p ]
        = − log(1)
        = 0
⇒ KL(q‖p) ≥ 0,   (31)

confirming the first property (22).


We can now identify the E-step as substituting q(y) = p(y|d, θ_{t−1}) into Q and averaging log[p(d, y, θ_t)]
over it, while the M-step consists of maximizing Q with respect to θ_t. Notice that the M-step is equivalent to
maximizing F = Q + H, since H does not depend on θ_t.
We have shown that the above iterations will increase the log-likelihood at every iteration, but we haven’t
shown that the algorithm converges only when it has reached a local maximum of this likelihood. This proof
is beyond the scope of this class.
We also notice that the above derivation is completely symmetric under the exchange d ↔ θ. We could
imagine a situation where we needed to maximize p(x, θ) over x, given θ. EM could therefore also be
applied to this problem, using the same algorithm but with the roles of d = x and θ interchanged.

4 Generalizations
From the above derivation it is also clear that we can perform partial M-steps. As long as each M-step
improves Q, without necessarily maximizing it, we are still guaranteed that the log-likelihood increases at every
iteration. We could for instance use gradient ascent as a partial M-step, as sketched below. This algorithm is
called "generalized EM" (GEM).
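To illustrate, one GEM iteration for the colorblind example of Section 2 could keep the exact E-step but replace the closed-form M-step by a single gradient-ascent update (a sketch; the step size eta is an arbitrary assumption):

```python
def gem_step(p, m1, m2, eta=0.1):
    """One GEM iteration: exact E-step, partial M-step by gradient ascent."""
    e_n2 = m1 * (1 + p) / (2 + p)          # E-step, eq. (19)
    # dQ/dp for Q = E[n2|m1] log(1/4 + p/4) + m2 log(1/2 - p/4), eq. (18)
    grad = e_n2 / (1 + p) - m2 / (2 - p)
    return p + eta * grad                   # improve Q, without maximizing it
```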
In the same spirit we can also perform partial E-steps. First notice that by substituting q(y) =
p(y|d, θ_{t−1}) we have actually maximized F with respect to q(y). This is easily seen by taking the derivative
with respect to q and enforcing the fact that q integrates to 1 by a Lagrange multiplier term,
∂/∂q [ F(q) + λ( ∫ q − 1 ) ]
= ∂/∂q [ ∫ q ( log[p_joint] − log[q] + λ ) ]
= log[p_joint] − log[q] + λ − 1
Setting this to zero gives,
q ∝ p_joint  ⇒  q(y) = p(y|d, θ)   (32)

where the last line follows because q must be properly normalized. This derivation is slightly sloppy, since we
are taking derivatives of functionals with respect to functions, which would, when done properly, add a lot of
notational burden while arriving at the same result. Thus, we see that a full E-step is done by maximizing F
with respect to q. This leads to the generalization that a partial E-step would involve improving F, without
actually maximizing it. This can be done by modelling q as a function of parameters and possibly of the
data, q = q(y|d, ν). Maximization over q is then performed by maximization over ν, for instance by gradient
ascent. Only when the true posterior is included in this parametrized family can we actually reach the true
maximum likelihood solution. But in case q is at best only an approximation to the true posterior, not all
is lost. In that case we improve a bound on the log-likelihood, since we have
F = L − KL ≤ L   since KL ≥ 0   (33)
Maximizing F is thus equivalent to maximizing a lower bound on L. This has two consequences: it pushes
up the lower bound on L, and at the same time it minimizes KL, the KL-divergence between the true posterior
and our parametrized model of it; i.e. q is driven as close as possible (in the sense of
KL-divergence) to the true posterior within the current parametrized family. Naturally, this only makes
sense when the true posterior is too difficult to calculate analytically. After averaging the joint density
p_joint over the approximate posterior q, the M-step then maximizes F (or Q) over θ_t. Summarizing, the
variational EM algorithm thus alternates the following steps,
E-step Maximize F with respect to q.
M-step Maximize F (or Q) with respect to θ.
Notice that now we need to include the entropy term H in the E-step of the optimization, since it depends
on q.
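The bound (33) can be checked numerically on the colorblind example of Section 2, where y = n2 and the sums over n2 are finite; the counts and the parameter value below are arbitrary assumptions:

```python
import numpy as np
from scipy.stats import binom, multinomial

m1, m2, p = 63, 37, 0.4                       # hypothetical setup
N = m1 + m2
p1, p2, p3 = 1/4, 1/4 + p/4, 1/2 - p/4

L = binom.logpmf(m1, N, p1 + p2)              # exact log-likelihood, eq. (12)

def F(q):
    """Lower bound F = Q + H for a distribution q over n2 in {0, ..., m1}."""
    n2 = np.arange(m1 + 1)
    counts = np.stack([m1 - n2, n2, np.full(m1 + 1, m2)], axis=1)
    log_joint = np.array([multinomial.logpmf(c, N, [p1, p2, p3])
                          for c in counts])   # complete-data log-pdf, eq. (7)
    return np.sum(q * (log_joint - np.log(q)))   # Q + H

q_post = binom.pmf(np.arange(m1 + 1), m1, p2 / (p1 + p2))   # posterior, eq. (16)
q_flat = np.ones(m1 + 1) / (m1 + 1)                         # an arbitrary q

print(np.isclose(F(q_post), L))   # True: the bound is tight at the posterior
print(F(q_flat) < L)              # True: any other q gives a smaller F
```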
Apart from the above variational procedure there are alternative methods to approximate the E-step if
it turns out to be too complicated. The most widely used methods are sampling techniques, which approximate
the average by,
∫ dy p(y) f(y) ≈ (1/N) Σ_{n=1}^{N} f(y_n)   (34)
Examples are “Markov Chain Monte Carlo” (MCMC) and “Gibbs” sampling. We will discuss these methods
in more detail in later lectures.
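As a minimal illustration of (34), where the distribution and test function are arbitrary choices not tied to any particular model:

```python
import numpy as np

rng = np.random.default_rng(1)
samples = rng.normal(size=10_000)     # y_n drawn from p(y), here a standard normal
estimate = np.mean(samples**2)        # approximates E[f(y)] with f(y) = y^2
print(estimate)                       # ~ 1.0, the exact value of E[y^2]
```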
Finally, we notice that the derivation of the EM algorithm did not depend on the fact that p(d, θ) is a
probability density. Of course, it has to be positive everywhere, since otherwise the logarithm would be
undefined, but the derivation does not depend on p(d) being normalized. Notice also that the derivation did crucially
depend on the fact that the posterior p(y|d) is a normalized probability density, since otherwise the KL term is
not guaranteed to be non-negative. These facts suggest a generalization to positive unnormalized functions g(d)
which can be written as,
g(d) = ∫ dy f(d, y),   (35)

with positive f(d, y). In this case we can define,
p(y|d) = f(d, y)/g(d) = f(d, y) / ∫ dy f(d, y)   (36)

which is automatically normalized (and positive) with respect to y, i.e. it defines a probability density. We
can now repeat the whole derivation for the objective function,
L′ = log[g(d|θ)]   (37)
where the E-step consists of evaluating the posterior through (36) and using it to average log[f(d, y)], while the
M-step maximizes this resultant average with respect to some parameters in f(d, y).

5 EM-Example
As an example we will consider incomplete data generated from a K-dimensional Gaussian density, p(x) =
G_x[µ, Σ]. The data are incomplete because for every sample the values of some dimensions may be missing.

One can imagine a malfunctioning measuring device. This example is therefore different in spirit from the
latent variable models we will encounter in future classes, since there the models are built with hidden
variables whose values cannot be observed in principle. In the present case, the model is such that every
sample could in principle be fully observed, but we were unlucky and had a malfunctioning sensor.
For every datapoint we will divide the vector x_n into an observed part and a missing part: x_n = (x_n^o, x_n^m).
This notation does not imply that it is always the last part of the vector that is missing; any dimension could be
missing. The log-likelihood that we aim to optimize is thus given by,
L(x^{o,N}|θ) = Σ_{n=1}^{N} log[ ∫ dx_n^m G_{x_n}[µ, Σ] ]   (38)

Because the missing dimensions are different for every datapoint, this sum of integrals becomes messy.
Therefore we will do EM. First write down Q,
Q(θ_t|θ_{t−1}) = Σ_{n=1}^{N} ∫ dx_n^m p(x_n^m|x_n^o, θ_{t−1}) log[p(x_n^o, x_n^m|θ_t)]   (39)

The log of the joint is determined by the Gaussian density,
log[p(x_n^o, x_n^m|θ_t)] = −(K/2) log(2π) − (1/2) log det[Σ] − (1/2)(x_n − µ)^T Σ^{−1} (x_n − µ)   (40)
where θ = (µ, Σ). For the posterior we will use the following lemma,
Lemma. Let G_x[µ, Σ] denote a normal density with mean µ and covariance Σ. If we write x = [x1, x2],
µ = [µ1, µ2] and Σ = [[Σ11, Σ12], [Σ21, Σ22]], then we have,
p(x2|x1) = G_{x2}[ µ2 − Σ21 (Σ11)^{−1} (µ1 − x1),  Σ22 − Σ21 (Σ11)^{−1} Σ12 ]   (41)
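In code, the lemma reads as follows (a sketch; the function name and the boolean-mask interface are my own conventions):

```python
import numpy as np

def conditional_gaussian(mu, Sigma, obs_mask, x_obs):
    """Mean and covariance of p(x_mis | x_obs) for a joint Gaussian, eq. (41),
    with (x1, x2) = (observed, missing) selected by the boolean obs_mask."""
    o = np.asarray(obs_mask, dtype=bool)
    m = ~o
    S_oo = Sigma[np.ix_(o, o)]
    S_mo = Sigma[np.ix_(m, o)]
    mean = mu[m] - S_mo @ np.linalg.solve(S_oo, mu[o] - x_obs)
    cov = Sigma[np.ix_(m, m)] - S_mo @ np.linalg.solve(S_oo, Sigma[np.ix_(o, m)])
    return mean, cov
```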
In the E-step we need to calculate E[log p(x^o, x^m|θ)|x_n^o], which reduces, using (40), to the calculation of the
sufficient statistics E[x_n^m|x_n^o] and E[x_n^m x_n^{mT}|x_n^o]. We can use the above lemma for this calculation.
Suppose we have a datapoint from which the last dimensions are missing. Again, I stress that this need not
be the case, and the expectation below must be taken for every datapoint separately, over a different set of
missing dimensions. In stacked-vector notation (writing (a; b) for the vector with a on top of b),
E[ (x^o; x^m) ] = ( x^o; E[x^m] )   (42)
and
E[ (x^o; x^m)(x^o; x^m)^T ] = [[ x^o x^{oT},  x^o E[x^m]^T ], [ E[x^m] x^{oT},  E[x^m x^{mT}] ]]   (43)
where we substitute,
E[x^m] = µ^m − Σ^{mo} (Σ^{oo})^{−1} (µ^o − x^o)   (44)
E[x^m x^{mT}] = E[x^m] E[x^m]^T + Σ^{mm} − Σ^{mo} (Σ^{oo})^{−1} Σ^{om}   (45)
Above we assumed for simplicity that the last dimensions are missing, but in general the expectations appear
at the locations of the missing dimensions. This, then, is the solution to the E-step: for every datapoint, split
the vector into observed and unobserved dimensions and calculate the expectations of the first and second
moments with respect to the posterior, as given above.
The M-step then consists of taking derivatives of Q = E[log p(x^o, x^m|θ)|x^o] with respect to µ and Σ and
equating them to zero. Recall that we are maximizing with respect to the parameters in the joint density
only; the parameters defining the posterior are from the previous iteration and considered constant. Taking
the derivatives we find,
∂Q/∂µ = Σ_{n=1}^{N} E[ Σ^{−1} (x_n − µ) ] = 0  ⇒
µ_new = (1/N) Σ_{n=1}^{N} E[x_n]   (46)

and
∂Q/∂Σ^{−1} = (1/2) Σ_{n=1}^{N} E[ Σ − (x_n − µ)(x_n − µ)^T ] = 0  ⇒
Σ_new = (1/N) Σ_{n=1}^{N} E[x_n x_n^T] − µ_new µ_new^T   (47)

To calculate the expectations in these update rules we use (42) and (43), where the expectations are always
taken over the missing dimensions only, which may be different from point to point. In the derivation we
made use of the facts that,
∂/∂A [ a^T A a ] = a a^T   (48)
∂/∂A [ log det[A] ] = A^{−T}   (49)
The final algorithm thus consists of alternating (44) and (45) with (46) and (47), as in the sketch below. The
result is easy to interpret, since the mean and covariance are updated by calculating the sample mean and
covariance, but with the missing dimensions replaced by their expectations, given the data.
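Putting the E-step moments (44)-(45) and the M-step updates (46)-(47) together, a compact implementation might look like this (a sketch, assuming missing entries are coded as np.nan and every row and every column has at least one observed value):

```python
import numpy as np

def em_gaussian_missing(X, n_iter=100):
    """EM for a K-dimensional Gaussian with missing entries marked np.nan."""
    N, K = X.shape
    obs = ~np.isnan(X)
    # Initialize from the per-dimension means of the observed entries
    mu = np.array([X[obs[:, k], k].mean() for k in range(K)])
    Sigma = np.eye(K)
    for _ in range(n_iter):
        Ex = np.where(obs, X, 0.0)          # observed values; missing filled below
        Exx = np.zeros((N, K, K))
        for n in range(N):
            o, m = obs[n], ~obs[n]
            if m.any():
                S_oo = Sigma[np.ix_(o, o)]
                S_mo = Sigma[np.ix_(m, o)]
                # E-step, eq. (44): conditional mean of the missing part
                Ex[n, m] = mu[m] - S_mo @ np.linalg.solve(S_oo, mu[o] - Ex[n, o])
                # Posterior covariance of the missing part, from the lemma (41)
                C = Sigma[np.ix_(m, m)] - S_mo @ np.linalg.solve(S_oo, Sigma[np.ix_(o, m)])
            Exx[n] = np.outer(Ex[n], Ex[n])
            if m.any():
                Exx[n][np.ix_(m, m)] += C   # eq. (45): add posterior covariance
        # M-step, eqs. (46)-(47)
        mu = Ex.mean(axis=0)
        Sigma = Exx.mean(axis=0) - np.outer(mu, mu)
    return mu, Sigma
```

With no missing entries the update reduces to the ordinary sample mean and (biased, 1/N) sample covariance in a single iteration, which matches the interpretation given above.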
