Tutorial On Generalized Expectation
Javier R. Movellan
Preliminaries
The goal of this primer is to introduce the EM (expectation maximization) algorithm and some of its modern generalizations, including variational approximations.

Notational conventions

Unless otherwise stated, capital letters are used for random variables, small letters for specific values taken by random variables, and Greek letters for model parameters. We adhere to a Bayesian framework and treat model parameters as random variables with known priors. From this point of view, maximum-likelihood methods can be interpreted as using weak priors. The probability space in which random variables are defined is left implicit and is assumed to be endowed with the conditions needed to support the derivations being presented. We present the results using discrete random variables; conversion to continuous variables simply requires changing probability mass functions into probability density functions and sums into integrals. When the context makes it clear, we identify probability functions by their arguments and drop commas between arguments: e.g., p(xy) is shorthand for the joint probability mass or joint probability density that the random variable X takes the specific value x and the random variable Y takes the value y.

Let O, H be random vectors representing observable data and hidden states, and let $\theta$ represent the model parameters controlling the distribution of O, H. We treat $\theta$ as a random variable with a known prior. We have two problems of interest:

- For a fixed sample o from O, find values of $\theta$ with large posterior probability.
- For a fixed sample o from O, find values of H with large posterior probability.

Both problems are formally identical, so we will focus on the first one. Note that

$$p(\theta \mid o) = \frac{p(o, \theta)}{p(o)} \quad (1)$$

Thus

$$\operatorname*{argmax}_{\theta} p(\theta \mid o) = \operatorname*{argmax}_{\theta} \log p(o, \theta) \quad (2)$$
Let $q = \{q_\psi(\cdot \mid o) : \psi \in \Psi\}$ be a family of distributions of $H$ parameterized by $\psi$. We call $q$ a variational family, and $\psi$ the variational parameters of that family. Note

$$\log p(o, \theta) = \sum_h q_\psi(h \mid o) \log p(o, \theta) \quad (3)$$

$$= \sum_h q_\psi(h \mid o) \log \frac{p(o, h, \theta)\, q_\psi(h \mid o)}{q_\psi(h \mid o)\, p(h \mid o, \theta)} \quad (4)$$

$$= F(\psi, \theta) + K(\psi, \theta) \quad (5)$$

where

$$F(\psi, \theta) \stackrel{\text{def}}{=} \sum_h q_\psi(h \mid o) \log \frac{p(o, h, \theta)}{q_\psi(h \mid o)} \quad (6)$$

$$K(\psi, \theta) \stackrel{\text{def}}{=} \sum_h q_\psi(h \mid o) \log \frac{q_\psi(h \mid o)}{p(h \mid o, \theta)} \quad (7)$$
Note $K(\psi, \theta)$ is the KL divergence between the distribution $q_\psi(\cdot \mid o)$ and $p(\cdot \mid o, \theta)$. Since KL divergences are non-negative, it follows that $F(\psi, \theta)$ is a lower bound on $\log p(o, \theta)$, i.e.,

$$\log p(o, \theta) \geq F(\psi, \theta) \quad (8)$$

This inequality becomes an equality for values of $\psi$ for which $K(\psi, \theta) = 0$, i.e., values of $\psi$ such that $q_\psi(h \mid o) = p(h \mid o, \theta)$ for all $h$.
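To make the decomposition concrete, the short sketch below checks the identity $\log p(o, \theta) = F(\psi, \theta) + K(\psi, \theta)$ and the bound (8) numerically on a toy model with a discrete hidden state. The joint table $p(o, h \mid \theta)$, the prior weight on $\theta$, and the particular variational distribution $q_\psi$ are illustrative choices made for this check, not part of the text above.

```python
import numpy as np

# Toy model: hidden state h in {0, 1, 2}, one fixed observation o.
p_theta = 0.5                                      # prior mass placed on this particular theta
p_oh_given_theta = np.array([0.10, 0.25, 0.15])    # p(o, h | theta) for h = 0, 1, 2
p_o_h_theta = p_oh_given_theta * p_theta           # joint p(o, h, theta)

log_p_o_theta = np.log(p_o_h_theta.sum())          # log p(o, theta), marginalizing over h
posterior = p_o_h_theta / p_o_h_theta.sum()        # exact posterior p(h | o, theta)

q = np.array([0.5, 0.3, 0.2])                      # an arbitrary member q_psi(h | o)

F = np.sum(q * np.log(p_o_h_theta / q))            # F(psi, theta): the lower bound
K = np.sum(q * np.log(q / posterior))              # K(psi, theta): the KL divergence

print(np.isclose(log_p_o_theta, F + K))            # True: log p(o, theta) = F + K
print(F <= log_p_o_theta, K >= 0)                  # True True: F is indeed a lower bound
```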
We obtain a sequence $(\psi^{(1)}, \theta^{(1)}), (\psi^{(2)}, \theta^{(2)}), \ldots$ by iterating over two steps:

E Step:
$$\psi^{(k+1)} = \operatorname*{argmax}_{\psi} F(\psi, \theta^{(k)}) \quad (9)$$

Note that since
$$\log p(o, \theta) = F(\psi, \theta) + K(\psi, \theta) \quad (10)$$
and $\log p(o, \theta)$ is a constant with respect to $\psi$, this step amounts to minimizing $K(\psi, \theta^{(k)})$ with respect to $\psi$, i.e., choosing the member of the variational family $q$ that is as close as possible to the current posterior $p(\cdot \mid o, \theta^{(k)})$.

M Step:
$$\theta^{(k+1)} = \operatorname*{argmax}_{\theta} F(\psi^{(k+1)}, \theta) \quad (11)$$
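Read as an algorithm, the two steps are a coordinate ascent on $F$. Below is a minimal generic sketch in Python, assuming the caller supplies functions that perform each partial maximization; the names generalized_em, e_step, m_step, and F are illustrative, not taken from the text.

```python
def generalized_em(psi, theta, e_step, m_step, F, n_iters=100, tol=1e-8):
    """Coordinate ascent on the lower bound F(psi, theta).

    e_step(theta) -> psi maximizing F(psi, theta) over the variational family
    m_step(psi)   -> theta maximizing F(psi, theta) over the model parameters
    F(psi, theta) -> value of the lower bound (used only to monitor progress)
    """
    bound = F(psi, theta)
    for _ in range(n_iters):
        psi = e_step(theta)          # E step: minimize K(psi, theta) over psi
        theta = m_step(psi)          # M step: maximize F(psi, theta) over theta
        new_bound = F(psi, theta)
        if new_bound - bound < tol:  # the bound never decreases, so this gap is >= 0
            break
        bound = new_bound
    return psi, theta
```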
Successive application of the E and M steps maximizes the lower bound $F$ on $\log p(o, \theta)$, i.e.,

$$F(\psi^{(k+1)}, \theta^{(k)}) \geq F(\psi^{(k)}, \theta^{(k)}) \quad (12)$$
$$F(\psi^{(k+1)}, \theta^{(k+1)}) \geq F(\psi^{(k+1)}, \theta^{(k)}) \quad (13)$$

Interpretation

Optimizing $F(\psi, \theta)$ with respect to $\theta$ is equivalent to optimizing

$$\sum_h q_\psi(h \mid o) \log p(o, h, \theta) \quad (14)$$

since the entropy term $-\sum_h q_\psi(h \mid o) \log q_\psi(h \mid o)$ of $F$ does not depend on $\theta$. Moreover, since the log function is concave,

$$\sum_h q_\psi(h \mid o) \log p(o, h, \theta) \leq \log \sum_h q_\psi(h \mid o)\, p(o, h, \theta) \quad (15)$$
Successive applications of EM increase a lower bound $F$ on $\log p(o, \theta)$. This lower bound consists of two terms: a data-driven term $\log p(o, \theta)$ that measures how well the distribution $p(\cdot, \theta)$ fits the observable data, minus a term $K(\psi, \theta)$ that penalizes deviations from the variational family $q$:

$$F(\psi, \theta) = \log p(o, \theta) - K(\psi, \theta) \quad (16)$$
Thus we can think of the generalized EM algorithm as solving a penalized maximum-likelihood problem. Note that (13) and (16) imply

$$\log p(o, \theta^{(k+1)}) - \log p(o, \theta^{(k)}) \geq K(\psi^{(k+1)}, \theta^{(k+1)}) - K(\psi^{(k+1)}, \theta^{(k)}) \quad (17)$$

Note $q_{\psi^{(k+1)}}$ was chosen to be the member of the variational family closest to $p(\cdot \mid o, \theta^{(k)})$. Thus it is not unreasonable (but also not guaranteed) to expect that it may not be as close to $p(\cdot \mid o, \theta^{(k+1)})$. In other words, it is not unreasonable (but also not guaranteed) to expect that

$$K(\psi^{(k+1)}, \theta^{(k+1)}) - K(\psi^{(k+1)}, \theta^{(k)}) \geq 0 \quad (18)$$

and thus

$$\log p(o, \theta^{(k+1)}) \geq \log p(o, \theta^{(k)}) \quad (19)$$
An important special case occurs when the variational family $\{q_\psi(\cdot \mid o)\}$ equals the family of posteriors $\{p(\cdot \mid o, \theta)\}$. In this case the E step sets $\psi^{(k+1)} = \theta^{(k)}$, so that $K(\psi^{(k+1)}, \theta^{(k)}) = 0$, and we can guarantee that

$$\log p(o, \theta^{(k+1)}) - \log p(o, \theta^{(k)}) \geq K(\theta^{(k)}, \theta^{(k+1)}) \geq 0 \quad (20)$$
Moreover, in this case, to maximize $F$ with respect to $\theta^{(k+1)}$ we just need to maximize

$$Q(\theta^{(k)}, \theta^{(k+1)}) \stackrel{\text{def}}{=} \sum_h p(h \mid o, \theta^{(k)}) \log p(o, h, \theta^{(k+1)}) \quad (21)$$
which is the objective function maximized by the standard EM algorithm. In the same vein, note that $F(\psi, \theta)$ is a free energy, i.e., the expected energy of the states plus the entropy of the distribution under which the expected value is computed. In this case the energy of a state $h$ is $\log p(o, h, \theta)$. Thus, if there are no further constraints on the variational family, the optimal distribution $q(\cdot \mid o)$ is Boltzmann:

$$q(h \mid o) \propto \exp\bigl(\log p(o, h, \theta)\bigr) = p(o, h, \theta) \quad (23)$$
$$q(h \mid o) = \frac{p(o, h, \theta)}{\sum_{h'} p(o, h', \theta)} = p(h \mid o, \theta) \quad (24)$$
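As a quick numerical check of this last claim, the sketch below (reusing the same illustrative toy joint table as in the earlier check) verifies that the bound $F$ is tight at the exact posterior and that no other distribution over $h$ achieves a larger value; the numbers are again arbitrary assumptions, not part of the text.

```python
import numpy as np

rng = np.random.default_rng(0)
p_o_h_theta = np.array([0.10, 0.25, 0.15]) * 0.5    # joint p(o, h, theta), as before
log_p_o_theta = np.log(p_o_h_theta.sum())
posterior = p_o_h_theta / p_o_h_theta.sum()          # p(h | o, theta)

def F(q):
    # Lower bound F for a variational distribution q over the hidden state h.
    return np.sum(q * np.log(p_o_h_theta / q))

print(np.isclose(F(posterior), log_p_o_theta))       # True: the bound is tight at the posterior
for _ in range(5):
    q = rng.dirichlet(np.ones(3))                    # a random distribution over h
    print(F(q) <= F(posterior))                      # True: the posterior maximizes F
```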
Consider the case in which we are given a set of iid observations $o = (o_1, \ldots, o_n)$. If we directly optimize $\log p(o \mid \theta)$ with respect to $\theta$ we get

$$\nabla_\theta \log p(o \mid \theta) = \sum_{i=1}^n \nabla_\theta \log p(o_i \mid \theta) = \sum_{i=1}^n \frac{1}{p(o_i \mid \theta)} \nabla_\theta\, p(o_i \mid \theta)$$
$$= \sum_{i=1}^n \frac{1}{p(o_i \mid \theta)} \sum_h \nabla_\theta\, p(o_i, h \mid \theta) = \sum_{i=1}^n \frac{1}{p(o_i \mid \theta)} \sum_h p(o_i, h \mid \theta)\, \nabla_\theta \log p(o_i, h \mid \theta)$$
$$= \sum_{i=1}^n \sum_h p(h \mid o_i, \theta)\, \nabla_\theta \log p(o_i, h \mid \theta) = 0 \quad (28)$$
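The gradient identity in (28) can be checked numerically. The sketch below compares its right-hand side against a finite-difference approximation of $\nabla_\theta \log p(o \mid \theta)$ for an illustrative two-component 1-D Gaussian mixture whose only free parameter is the mean of one component; the model, the helper names (norm_pdf, joint, log_lik), and all numbers are assumptions made for this check.

```python
import numpy as np

# Illustrative model: with probability PI the observation has mean mu (h = 1),
# otherwise mean MU0 (h = 0); standard deviation SIGMA is fixed.
PI, MU0, SIGMA = 0.4, -2.0, 1.0

def norm_pdf(x, m):
    return np.exp(-0.5 * ((x - m) / SIGMA) ** 2) / (SIGMA * np.sqrt(2.0 * np.pi))

def joint(o, h, mu):                        # p(o_i, h | mu)
    return np.where(h == 1, PI * norm_pdf(o, mu), (1 - PI) * norm_pdf(o, MU0))

def log_lik(o, mu):                         # log p(o | mu) = sum_i log sum_h p(o_i, h | mu)
    return np.log(joint(o, 1, mu) + joint(o, 0, mu)).sum()

rng = np.random.default_rng(1)
o = rng.normal(size=20)
mu = 0.7

# Right-hand side of (28): only the h = 1 term depends on mu, and
# d/dmu log p(o_i, 1 | mu) = (o_i - mu) / SIGMA**2.
resp1 = joint(o, 1, mu) / (joint(o, 1, mu) + joint(o, 0, mu))   # p(h = 1 | o_i, mu)
grad_identity = np.sum(resp1 * (o - mu) / SIGMA ** 2)

# Left-hand side via central finite differences of log p(o | mu).
eps = 1e-6
grad_fd = (log_lik(o, mu + eps) - log_lik(o, mu - eps)) / (2 * eps)

print(np.isclose(grad_identity, grad_fd))   # True: the two gradients agree
```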
Example 1

Consider a simple Gaussian mixture model and a vector of independent observations $o = (o_1, \ldots, o_n)^T$ from that model,

$$\log p(o \mid \theta) = \sum_{i=1}^n \log p(o_i \mid \theta) \quad (29)$$

where the prior mixture terms are fixed. Taking derivatives of $\log p(o \mid \theta)$ with respect to $\theta$ and setting them to zero gives a non-linear equation that is difficult to solve directly. However, EM asks us to optimize (keeping only the terms that depend on the mean of the first component)

$$\sum_{i=1}^n p(H = 1 \mid o_i, \theta) \log p(o_i, H = 1 \mid \theta) \quad (34)$$

whose maximization with respect to that mean has the closed-form solution

$$\frac{\sum_i o_i\, p(H = 1 \mid o_i, \theta)}{\sum_i p(H = 1 \mid o_i, \theta)} \quad (36)$$

i.e., a weighted average of the observations, with weights given by the posterior probability that each observation came from that component.
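The sketch below implements the resulting EM iteration for such a mixture, alternating the posterior responsibilities (E step) with the weighted-average update of equation (36) (M step), and asserts the monotone increase of the likelihood guaranteed by (20). The specific modeling details (two components, unit variance, a fixed mean for the second component, the particular mixing weight) are illustrative assumptions; the text only specifies a simple Gaussian mixture with fixed prior mixture terms.

```python
import numpy as np

# EM for a two-component 1-D Gaussian mixture in which the mixing weight PI,
# the variance, and the mean of component H = 0 are held fixed; only the mean
# mu of component H = 1 is estimated.
rng = np.random.default_rng(0)
PI, SIGMA, MU0 = 0.6, 1.0, 0.0           # fixed mixing weight, std, and mean of component 0
TRUE_MU = 3.0
h = rng.random(1000) < PI                 # simulate hidden component labels
o = np.where(h, rng.normal(TRUE_MU, SIGMA, 1000), rng.normal(MU0, SIGMA, 1000))

def norm_pdf(x, m):
    return np.exp(-0.5 * ((x - m) / SIGMA) ** 2) / (SIGMA * np.sqrt(2.0 * np.pi))

mu, prev_ll = 1.0, -np.inf                # initial guess for the free parameter
for _ in range(50):
    # E step: responsibilities p(H = 1 | o_i, theta) under the current mu.
    w1 = PI * norm_pdf(o, mu)
    w0 = (1 - PI) * norm_pdf(o, MU0)
    resp = w1 / (w1 + w0)
    ll = np.log(w1 + w0).sum()            # log-likelihood at the current mu
    assert ll >= prev_ll - 1e-9           # EM never decreases the likelihood
    prev_ll = ll
    # M step: the closed-form weighted-average update of equation (36).
    mu = np.sum(resp * o) / np.sum(resp)

print(mu)                                 # close to TRUE_MU
```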
History
The first version of this document was written by Javier R. Movellan in January 2005, as part of the Kolmogorov project.