Theory and Use of the EM Algorithm
Maya R. Gupta
Yihua Chen
Boston – Delft
The preferred citation for this publication is M. R. Gupta and Y. Chen, Theory and Use of the EM Algorithm, Foundations and Trends® in Signal Processing, vol. 4, no. 3, pp. 223–296, 2010.
ISBN: 978-1-60198-430-2
© 2011 M. R. Gupta and Y. Chen
All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, mechanical, photocopying, recording or otherwise, without prior written permission of the publishers.

Photocopying. In the USA: This journal is registered at the Copyright Clearance Center, Inc., 222 Rosewood Drive, Danvers, MA 01923. Authorization to photocopy items for internal or personal use, or the internal or personal use of specific clients, is granted by now Publishers Inc for users registered with the Copyright Clearance Center (CCC). The 'services' for users can be found on the internet at: www.copyright.com

For those organizations that have been granted a photocopy license, a separate system of payment has been arranged. Authorization does not extend to other kinds of copying, such as that for general distribution, for advertising or promotional purposes, for creating new collective works, or for resale. In the rest of the world: Permission to photocopy must be obtained from the copyright owner. Please apply to now Publishers Inc., PO Box 1024, Hanover, MA 02339, USA; Tel. +1-781-871-0245; www.nowpublishers.com; [email protected]

now Publishers Inc. has an exclusive license to publish this material worldwide. Permission to use this content must be obtained from the copyright license holder. Please apply to now Publishers, PO Box 179, 2600 AD Delft, The Netherlands, www.nowpublishers.com; e-mail: [email protected]
Editor-in-Chief:
Robert M. Gray
Dept of Electrical Engineering
Stanford University
350 Serra Mall
Stanford, CA 94305
USA
[email protected]
1 Department of Electrical Engineering, University of Washington, Seattle, WA 98195, USA, [email protected]
2 Department of Electrical Engineering, University of Washington, Seattle, WA 98195, USA, [email protected]
Abstract
This introduction to the expectation–maximization (EM) algorithm
provides an intuitive and mathematically rigorous understanding of
EM. Two of the most popular applications of EM are described in
detail: estimating Gaussian mixture models (GMMs), and estimating
hidden Markov models (HMMs). EM solutions are also derived
for learning an optimal mixture of fixed models, for estimating the
parameters of a compound Dirichlet distribution, and for disentangling
superimposed signals. Practical issues that arise in the use of EM are
discussed, as well as variants of the algorithm that help deal with these
challenges.
Contents
1 The Expectation-Maximization Method 1
2 Analysis of EM 15
2.1 Convergence 15
2.2 Maximization–Maximization 19
3 Learning Mixtures 23
4 More EM Examples 41
5 EM Variants 63
5.1 EM May Not Find the Global Optimum 63
5.2 EM May Not Simplify the Computation 64
5.3 Speed 66
5.4 When Maximizing the Likelihood Is Not the Goal 66
Acknowledgments 71
References 73
1 The Expectation-Maximization Method
1 A different standard choice of notation for a parametric density would be p(y; θ), but
we prefer p(y | θ) because this notation is clearer when one wants to find the maximum
a posteriori estimate rather than the maximum likelihood estimate—we will talk more
about the maximum a posteriori estimate of θ in Section 1.3.
2 The treatment of discrete random vectors is a straightforward special case of the continuous treatment: one only needs to replace the probability density function with the probability mass function and the integral with a summation.
3 We assume that the support of X, denoted by X, which is the closure of the set {x : p(x | θ) > 0}, does not depend on θ. An example where the support does depend on θ is if X is uniformly distributed on the interval [0, θ]. If the support does depend on θ, then the monotonicity of the EM algorithm might not hold. See Section 2.1 for details.
4 A rigorous description of this dependency is deferred to Section 1.4.
Given that you only have y, the goal here is to find the maximum
likelihood estimate (MLE) of θ:
$$\hat{\theta}_{\mathrm{MLE}} = \arg\max_{\theta \in \Omega} \log p(y \mid \theta).$$
5 Note this Q-function has nothing to do with the sum of the tail of a Gaussian, which is
also called the Q-function. People call (1.3) the Q-function because the original paper [11]
used a Q to notate it. We like to say that the Q stands for quixotic because it is a bit
crazy and hopeful and beautiful to think you can find the maximum likelihood estimate
of θ in this way that iterates round-and-round like a windmill, and if Don Quixote had
been a statistician, it is just the sort of thing he might have done.
The EM estimate is only guaranteed to never get worse (see Section 2.1
for details). Usually, it will find a peak in the likelihood p(y | θ), but
if the likelihood function p(y | θ) has multiple peaks, EM will not nec-
essarily find the global maximum of the likelihood. In practice, it is
common to start EM from multiple random initial guesses, and choose
the one with the largest likelihood as the final guess for θ.
The traditional description of the EM algorithm consists of only
two steps. The above Steps 2 and 3 combined are called the E-step for
expectation, and Step 4 is called the M-step for maximization:
E-step: Given the estimate from the previous iteration θ(m) , compute
the conditional expectation Q(θ | θ(m) ) given in (1.3).
M-step: The (m + 1)th guess of θ is:
$$\theta^{(m+1)} = \arg\max_{\theta \in \Omega} Q(\theta \mid \theta^{(m)}).$$
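To make the two-step iteration concrete, here is a minimal sketch of the generic EM loop in Python (an illustration, not code from this text). The callables `e_step` and `m_step` are hypothetical placeholders for the problem-specific computation of Q(θ | θ(m)) and its maximizer, and the stopping rule shown is just one common choice.

```python
import numpy as np

def em(y, theta_init, e_step, m_step, max_iter=100, tol=1e-8):
    """Generic EM loop.

    e_step(y, theta): returns the expected complete-data statistics under
                      p(x | y, theta), i.e., whatever is needed to form Q.
    m_step(y, stats): returns the theta that maximizes Q(theta | theta_m).
    """
    theta = np.asarray(theta_init, dtype=float)
    for _ in range(max_iter):
        stats = e_step(y, theta)                     # E-step
        theta_new = np.asarray(m_step(y, stats))     # M-step
        if np.max(np.abs(theta_new - theta)) < tol:  # simple convergence check
            return theta_new
        theta = theta_new
    return theta
```

As noted above, in practice one would run this loop from several random choices of `theta_init` and keep the estimate with the largest likelihood.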
A familiar EM-like method is the k-means clustering algorithm, in which one assumes that the n observed
data points belong to k clusters. Let the complete data be the observed
data points and the missing information that specifies which of the k
clusters each observed data point belongs to. The goal is to estimate
the k cluster centers θ. First, one makes an initial guess θ(0) of the k cluster
centers. Then in the E-like step, one assigns each of the n points
to the closest cluster based on the estimated cluster centers θ(m) . Then
in the M-like step, one takes all the points assigned to each cluster,
and computes the mean of those points to form a new estimate of the
cluster’s centroid. Underlying k-means is a model that the clusters are
defined by Gaussian distributions with unknown means (the θ to be
estimated) and identity covariance matrices.
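The E-like and M-like steps just described can be sketched as follows; this is a generic k-means implementation in Python (my own illustration, not code from this text), with random selection of k data points as one possible choice for the initial cluster centers.

```python
import numpy as np

def kmeans(points, k, n_iter=100, seed=0):
    """k-means clustering: hard-assign points to centers, then re-average.

    points: array of shape (n, d); returns (centers, labels).
    """
    points = np.asarray(points, dtype=float)
    rng = np.random.default_rng(seed)
    # Initial guess: k distinct data points chosen at random as cluster centers.
    centers = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iter):
        # E-like step: assign each point to the closest estimated center.
        dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        labels = np.argmin(dists, axis=1)
        # M-like step: each center becomes the mean of its assigned points.
        new_centers = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return centers, labels
```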
EM clustering differs from k-means clustering in that at each iteration
one does not choose a single x(m); that is, one does not force each
observed point yi to belong to only one cluster. Instead, each observed
point yi is probabilistically assigned to the k clusters by estimating
p(x | y, θ(m) ). We treat EM clustering in more depth in Section 3.2.
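Here is a minimal sketch of the soft assignment computed in the EM clustering E-step, under the identity-covariance Gaussian model mentioned above and assuming (for simplicity) equal cluster weights; Section 3.2 gives the general GMM treatment.

```python
import numpy as np

def soft_assignments(points, centers):
    """E-step responsibilities p(cluster j | y_i, theta) for identity-covariance
    Gaussians with equal cluster weights (a simplifying assumption)."""
    points = np.asarray(points, dtype=float)
    centers = np.asarray(centers, dtype=float)
    # Squared distance of each point to each center, shape (n, k).
    sq_dists = np.sum((points[:, None, :] - centers[None, :, :]) ** 2, axis=2)
    # Unnormalized log p(y_i | cluster j); constants cancel after normalization.
    log_resp = -0.5 * sq_dists
    log_resp -= log_resp.max(axis=1, keepdims=True)  # for numerical stability
    resp = np.exp(log_resp)
    return resp / resp.sum(axis=1, keepdims=True)    # each row sums to 1
```

The corresponding M-like step then updates each cluster center as a responsibility-weighted mean of all the points, rather than the plain mean of a hard-assigned subset.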
6 The k-means clustering algorithm dates to 1967 [35] and is a special case of vector
quantization, which was first proposed as Lloyd’s algorithm in 1957 [32]. See [17] for
details.
$$\hat{\theta}_{\mathrm{MAP}} = \arg\max_{\theta \in \Omega} \log p(\theta \mid y) = \arg\max_{\theta \in \Omega} \big( \log p(y \mid \theta) + \log p(\theta) \big).$$
then
$$Q(\theta \mid \theta^{(m)}) = \sum_{i=1}^{n} Q_i(\theta \mid \theta^{(m)}),$$
where
$$Q_i(\theta \mid \theta^{(m)}) = \mathbb{E}_{X_i \mid y_i, \theta^{(m)}}\big[\log p(X_i \mid \theta)\big], \quad i = 1, \ldots, n.$$
Proof. First, we show that given θ, the elements of the set {(Xi, Yi)}, i = 1, . . . , n, are mutually independent, that is,
$$p(x, y \mid \theta) = \prod_{i=1}^{n} p(x_i, y_i \mid \theta). \tag{1.7}$$
This mutual independence holds because
$$\begin{aligned}
p(x, y \mid \theta) &= p(y_1 \mid y_2, \ldots, y_n, x, \theta) \cdots p(y_n \mid x, \theta)\, p(x \mid \theta) && \text{(by the chain rule)} \\
&= p(y_1 \mid x_1, \theta) \cdots p(y_n \mid x_n, \theta)\, p(x \mid \theta) && \text{(by (1.6), but keep } \theta \text{ in the condition)} \\
&= p(y_1 \mid x_1, \theta) \cdots p(y_n \mid x_n, \theta) \prod_{i=1}^{n} p(x_i \mid \theta) && \text{(by the independence assumption on } X\text{)} \\
&= \prod_{i=1}^{n} p(y_i \mid x_i, \theta)\, p(x_i \mid \theta) \\
&= \prod_{i=1}^{n} p(x_i, y_i \mid \theta).
\end{aligned}$$
Then we show that for all i = 1, . . . , n, we have
$$p(x_i \mid y, \theta) = p(x_i \mid y_i, \theta). \tag{1.8}$$
This is because
$$\begin{aligned}
p(x_i \mid y, \theta) &= \frac{p(x_i, y \mid \theta)}{p(y \mid \theta)} && \text{(by Bayes' rule)} \\
&= \frac{\int_{\mathcal{X}^{n-1}} p(x, y \mid \theta)\, dx_1 \cdots dx_{i-1}\, dx_{i+1} \cdots dx_n}{\int_{\mathcal{X}^{n}} p(x, y \mid \theta)\, dx}
\end{aligned}$$
Then,
$$P(y \mid \theta) = \frac{n!}{y_1!\, y_2!\, y_3!\, y_4!} \left(\frac{1}{2} + \frac{\theta}{4}\right)^{y_1} \left(\frac{1-\theta}{4}\right)^{y_2} \left(\frac{1-\theta}{4}\right)^{y_3} \left(\frac{\theta}{4}\right)^{y_4}.$$
For this simple example, one could directly maximize the log-likelihood
log P (y | θ), but here we will instead illustrate how to use the EM algo-
rithm to find the maximum likelihood estimate of θ.
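For illustration, the direct maximization could be approximated by a simple grid search over θ ∈ (0, 1) in Python; the counts in `y` below are hypothetical, since the observed data are not reproduced in this excerpt.

```python
import numpy as np

def log_lik(theta, y):
    """Observed-data log-likelihood log P(y | theta), dropping the constant
    multinomial coefficient, which does not affect the maximizer."""
    probs = np.array([0.5 + theta / 4, (1 - theta) / 4, (1 - theta) / 4, theta / 4])
    return float(np.dot(y, np.log(probs)))

y = [125, 18, 20, 34]                       # hypothetical counts for illustration
grid = np.linspace(1e-6, 1 - 1e-6, 100001)  # grid over the open interval (0, 1)
theta_direct = grid[np.argmax([log_lik(t, y) for t in grid])]
```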
To use EM, we need to specify what the complete data X is. We
will choose the complete data to enable us to specify the probability
mass function (pmf) in terms of only θ and 1 − θ. To that end, we
define the complete data to be $X = \begin{bmatrix} X_1 & \cdots & X_5 \end{bmatrix}^T$, where X has a
multinomial distribution with number of trials n and the probability
of each event is:
$$q_\theta = \begin{bmatrix} \dfrac{1}{2} & \dfrac{\theta}{4} & \dfrac{1-\theta}{4} & \dfrac{1-\theta}{4} & \dfrac{\theta}{4} \end{bmatrix}^T, \quad \theta \in (0, 1).$$
By defining X this way, we can then write the observed data Y as:
$$Y = T(X) = \begin{bmatrix} X_1 + X_2 & X_3 & X_4 & X_5 \end{bmatrix}^T.$$
$$\theta^{(m+1)} = \arg\max_{\theta \in (0,1)} Q(\theta \mid \theta^{(m)}) = \arg\max_{\theta \in (0,1)} \mathbb{E}_{X \mid y, \theta^{(m)}}\big[\log p(X \mid \theta)\big].$$
The expectation is taken with respect to the conditional pmf of the complete data given the observed data and the current estimate:
$$\begin{aligned}
P(x \mid y, \theta^{(m)}) &= \frac{y_1!}{x_1!\, x_2!} \left(\frac{\tfrac{1}{2}}{\tfrac{1}{2} + \tfrac{\theta^{(m)}}{4}}\right)^{x_1} \left(\frac{\tfrac{\theta^{(m)}}{4}}{\tfrac{1}{2} + \tfrac{\theta^{(m)}}{4}}\right)^{x_2} 1_{\{x_1 + x_2 = y_1\}} \prod_{i=3}^{5} 1_{\{x_i = y_{i-1}\}} \\
&= \frac{y_1!}{x_1!\, x_2!} \left(\frac{2}{2 + \theta^{(m)}}\right)^{x_1} \left(\frac{\theta^{(m)}}{2 + \theta^{(m)}}\right)^{x_2} 1_{\{x_1 + x_2 = y_1\}} \prod_{i=3}^{5} 1_{\{x_i = y_{i-1}\}}.
\end{aligned}$$
Given an initial estimate θ(0) = 0.5, the above algorithm reaches θ̂MLE
to MATLAB’s numerical precision on the 18th iteration.
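Here is a minimal sketch of this EM iteration in Python (rather than MATLAB), using the same hypothetical counts as in the grid-search sketch above. The E-step enters only through the conditional expectation E[X2 | y, θ(m)] = y1 θ(m)/(2 + θ(m)) implied by the binomial pmf above, and the M-step maximizes the expected complete-data log-likelihood in closed form.

```python
def em_multinomial(y, theta0=0.5, max_iter=100, tol=1e-12):
    """EM for the multinomial example: the complete data split the first
    observed count y1 into X1 (cell probability 1/2) and X2 (cell probability theta/4)."""
    y1, y2, y3, y4 = y
    theta = theta0
    for m in range(max_iter):
        # E-step: E[X2 | y, theta] = y1 * (theta/4) / (1/2 + theta/4) = y1 * theta / (2 + theta).
        x2 = y1 * theta / (2.0 + theta)
        # M-step: the expected complete-data log-likelihood is, up to a constant,
        # (X2 + X5) log(theta) + (X3 + X4) log(1 - theta), with X3 = y2, X4 = y3, X5 = y4,
        # so the maximizer is:
        theta_new = (x2 + y4) / (x2 + y2 + y3 + y4)
        if abs(theta_new - theta) < tol:
            return theta_new, m + 1
        theta = theta_new
    return theta, max_iter

theta_hat, iters = em_multinomial([125, 18, 20, 34])  # hypothetical counts
```

For these counts the fixed point of the iteration matches the grid-search maximizer sketched earlier; the observed-data log-likelihood here is concave in θ, so it has a single peak.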
References
[61] L. R. Welch, “Hidden Markov Models and the Baum-Welch Algorithm,” IEEE
Information Theory Society Newsletter, vol. 53, no. 4, pp. 1–13, December 2003.
[62] C. F. J. Wu, “On the convergence properties of the EM algorithm,” The Annals
of Statistics, vol. 11, no. 1, pp. 95–103, March 1983.
[63] L. Xu and M. I. Jordan, “On convergence properties of the EM algorithm for
Gaussian mixtures,” Neural Computation, vol. 8, no. 1, pp. 129–151, January
1996.
[64] R. W. Yeung, A First Course in Information Theory. New York, NY: Springer,
2002.
[65] J. Zhang, “The mean field theory in EM procedures for Markov random
fields,” IEEE Transactions on Signal Processing, vol. 40, no. 10, pp. 2570–2583,
October 1992.