
Theory and Use of the EM Algorithm

Maya R. Gupta

Department of Electrical Engineering


University of Washington
Seattle, WA 98195
USA
[email protected]

Yihua Chen

Department of Electrical Engineering


University of Washington
Seattle, WA 98195
USA
[email protected]

Boston – Delft

Foundations and Trends® in Signal Processing

Published, sold and distributed by:


now Publishers Inc.
PO Box 1024
Hanover, MA 02339
USA
Tel. +1-781-985-4510
www.nowpublishers.com
[email protected]

Outside North America:


now Publishers Inc.
PO Box 179
2600 AD Delft
The Netherlands
Tel. +31-6-51115274

The preferred citation for this publication is M. R. Gupta and Y. Chen, Theory and Use of the EM Algorithm, Foundations and Trends® in Signal Processing, vol. 4, no. 3, pp. 223–296, 2010.

ISBN: 978-1-60198-430-2
© 2011 M. R. Gupta and Y. Chen

All rights reserved. No part of this publication may be reproduced, stored in a retrieval
system, or transmitted in any form or by any means, mechanical, photocopying, recording
or otherwise, without prior written permission of the publishers.
Photocopying. In the USA: This journal is registered at the Copyright Clearance Center, Inc., 222 Rosewood Drive, Danvers, MA 01923. Authorization to photocopy items for internal or personal use, or the internal or personal use of specific clients, is granted by now Publishers Inc for users registered with the Copyright Clearance Center (CCC). The ‘services’ for users can be found on the internet at: www.copyright.com
For those organizations that have been granted a photocopy license, a separate system of payment has been arranged. Authorization does not extend to other kinds of copying, such as that for general distribution, for advertising or promotional purposes, for creating new collective works, or for resale. In the rest of the world: Permission to photocopy must be obtained from the copyright owner. Please apply to now Publishers Inc., PO Box 1024, Hanover, MA 02339, USA; Tel. +1-781-871-0245; www.nowpublishers.com; [email protected]
now Publishers Inc. has an exclusive license to publish this material worldwide. Permission
to use this content must be obtained from the copyright license holder. Please apply to now
Publishers, PO Box 179, 2600 AD Delft, The Netherlands, www.nowpublishers.com; e-mail:
[email protected]

Foundations and Trends® in Signal Processing
Volume 4 Issue 3, 2010
Editorial Board

Editor-in-Chief:
Robert M. Gray
Dept of Electrical Engineering
Stanford University
350 Serra Mall
Stanford, CA 94305
USA
[email protected]

Editors

Abeer Alwan (UCLA)
John Apostolopoulos (HP Labs)
Pamela Cosman (UCSD)
Michelle Effros (California Institute of Technology)
Yonina Eldar (Technion)
Yariv Ephraim (George Mason University)
Sadaoki Furui (Tokyo Institute of Technology)
Vivek Goyal (MIT)
Sinan Gunturk (Courant Institute)
Christine Guillemot (IRISA)
Sheila Hemami (Cornell)
Lina Karam (Arizona State University)
Nick Kingsbury (Cambridge University)
Alex Kot (Nanyang Technical University)
Jelena Kovacevic (CMU)
Jia Li (Pennsylvania State University)
B.S. Manjunath (UCSB)
Urbashi Mitra (USC)
Thrasos Pappas (Northwestern University)
Mihaela van der Shaar (UCLA)
Michael Unser (EPFL)
P.P. Vaidyanathan (California Institute of Technology)
Rabab Ward (University of British Columbia)
Susie Wee (HP Labs)
Clifford J. Weinstein (MIT Lincoln Laboratories)
Min Wu (University of Maryland)
Josiane Zerubia (INRIA)
Pao-Chi Chang (National Central University)

Editorial Scope

Foundations and Trends® in Signal Processing will publish survey and tutorial articles on the foundations, algorithms, methods, and applications of signal processing including the following topics:

• Adaptive signal processing
• Audio signal processing
• Biological and biomedical signal processing
• Complexity in signal processing
• Digital and multirate signal processing
• Distributed and network signal processing
• Image and video processing
• Linear and nonlinear filtering
• Multidimensional signal processing
• Multimodal signal processing
• Multiresolution signal processing
• Nonlinear signal processing
• Randomized algorithms in signal processing
• Sensor and multiple source signal processing, source separation
• Signal decompositions, subband and transform methods, sparse representations
• Signal processing for communications
• Signal processing for security and forensic analysis, biometric signal processing
• Signal quantization, sampling, analog-to-digital conversion, coding and compression
• Signal reconstruction, digital-to-analog conversion, enhancement, decoding and inverse problems
• Speech/audio/image/video compression
• Speech and spoken language processing
• Statistical/machine learning
• Statistical signal processing
  • classification and detection
  • estimation and regression
  • tree-structured methods

Information for Librarians

Foundations and Trends® in Signal Processing, 2010, Volume 4, 4 issues. ISSN paper version 1932-8346. ISSN online version 1932-8354. Also available as a combined paper and online subscription.

Foundations and Trends® in Signal Processing
Vol. 4, No. 3 (2010) 223–296
© 2011 M. R. Gupta and Y. Chen
DOI: 10.1561/2000000034

Theory and Use of the EM Algorithm

Maya R. Gupta1 and Yihua Chen2

1 Department of Electrical Engineering, University of Washington, Seattle, WA 98195, USA, [email protected]
2 Department of Electrical Engineering, University of Washington, Seattle, WA 98195, USA, [email protected]

Abstract
This introduction to the expectation–maximization (EM) algorithm
provides an intuitive and mathematically rigorous understanding of
EM. Two of the most popular applications of EM are described in
detail: estimating Gaussian mixture models (GMMs), and estimat-
ing hidden Markov models (HMMs). EM solutions are also derived
for learning an optimal mixture of fixed models, for estimating the
parameters of a compound Dirichlet distribution, and for disentangling
superimposed signals. Practical issues that arise in the use of EM are
discussed, as well as variants of the algorithm that help deal with these
challenges.

Contents

1 The Expectation-Maximization Method
  1.1 The EM Algorithm
  1.2 Contrasting EM with a Simple Variant
  1.3 Using a Prior with EM (MAP EM)
  1.4 Specifying the Complete Data
  1.5 A Toy Example

2 Analysis of EM
  2.1 Convergence
  2.2 Maximization–Maximization

3 Learning Mixtures
  3.1 Learning an Optimal Mixture of Fixed Models
  3.2 Learning a GMM
  3.3 Estimating a Constrained GMM

4 More EM Examples
  4.1 Learning a Hidden Markov Model
  4.2 Estimating Multiple Transmitter Locations
  4.3 Estimating a Compound Dirichlet Distribution

5 EM Variants
  5.1 EM May Not Find the Global Optimum
  5.2 EM May Not Simplify the Computation
  5.3 Speed
  5.4 When Maximizing the Likelihood Is Not the Goal

6 Conclusions and Some Historical Notes

Acknowledgments

References

1 The Expectation-Maximization Method

Expectation–maximization (EM) is an iterative method that attempts


to find the maximum likelihood estimator of a parameter θ of a para-
metric probability distribution. Let us begin with an example. Consider
the temperature outside your window for each of the 24 hours of a
day, represented by x ∈ R^24, and say that this temperature depends on
the season θ ∈ {summer, fall, winter, spring}, and that you know the
seasonal temperature distribution p(x | θ). But what if you could only
measure the average temperature y = x̄ for some day, and you would
like to estimate what season θ it is (for example, is spring here yet?). In
particular, you might seek the maximum likelihood estimate of θ, that
is, the value θ̂ that maximizes p(y | θ). If this is not a trivial maximum
likelihood problem, you might call upon EM. EM iteratively alternates
between making guesses about the complete data x, and finding the θ
that maximizes p(x | θ) over θ. In this way, EM tries to find the maxi-
mum likelihood estimate of θ given y. We will see in later sections that
EM does not actually promise to find the θ that maximizes p(y | θ),
but there are some theoretical guarantees, and it often does a good job
in practice, though it may need a little help in the form of multiple
random starts.


This exposition is designed to be useful to both the EM novice and


the experienced EM user looking to better understand the method and
its use. To this end, we err on the side of providing too many explicit
details rather than too few.
First, we go over the steps of EM, breaking down the usual two-step
description into a five-step description. Table 1.1 summarizes the key
notation. We recommend reading this document linearly up through
Section 1.4, after which sections can generally be read out-of-order.
Section 1 ends with a detailed version of a historical toy example for
EM. In Section 2 we show that EM never gets worse as it iterates in
terms of the likelihood of the estimate it produces, and we explain the
maximization–maximization interpretation of EM. We also explain the
general advantages and disadvantages of EM compared to other options
for maximizing the likelihood, like the Newton–Raphson method. The

Table 1.1. Notation summary.

R                  Set of real numbers
R+                 Set of positive real numbers
N                  Set of natural numbers
y ∈ R^d            Given measurement or observation
Y ∈ R^d            Random measurement; y is a realization of Y
x ∈ R^{d1}         Complete data you wish you had
X ∈ R^{d1}         Random complete data; x is a realization of X
z ∈ R^{d2}         Missing data; in some problems x = (y, z)
Z ∈ R^{d2}         Random missing data; z is a realization of Z
θ ∈ Ω              Parameter(s) to estimate; Ω is the parameter space
θ(m) ∈ Ω           mth estimate of θ
p(y | θ)           Density of y given θ; also written as p(Y = y | θ)
X                  Support of X (closure of the set of x where p(x | θ) > 0)
X(y)               Support of X conditioned on y (closure of the set of x where p(x | y, θ) > 0)
≜                  "Is defined to be"
L(θ)               Likelihood of θ given y, that is, p(y | θ)
ℓ(θ)               Log-likelihood of θ given y, that is, log p(y | θ)
E_{X|y,θ}[X]       Expectation of X conditioned on y and θ, that is, ∫_{X(y)} x p(x | y, θ) dx
1{·}               Indicator function: equals 1 if the expression {·} is true, and 0 otherwise
1                  Vector of ones
D_KL(P ‖ Q)        Kullback–Leibler divergence (a.k.a. relative entropy) between distributions P and Q

advantages of EM are made clearer in Sections 3 and 4, in which we


derive a number of popular applications of EM and use these applica-
tions to illustrate practical issues that can arise with EM. Section 3
covers learning the optimal combination of fixed models to explain the
observed data, and fitting a Gaussian mixture model (GMM) to the
data. Section 4 covers learning hidden Markov models (HMMs), sep-
arating superimposed signals, and estimating the parameter for the
compound Dirichlet distribution. In Section 5, we categorize and dis-
cuss some of the variants of EM and related methods, and we conclude
this manuscript in Section 6 with some historical notes.

1.1 The EM Algorithm


To use EM, you must be given some observed data y, a parametric
density p(y | θ), a description of some complete data x that you wish
you had, and the parametric density p(x | θ).1 In Sections 3 and 4 we
will explain how to define the complete data x for some standard EM
applications.
We assume that the complete data can be modeled as a continuous2
random vector X with density p(x | θ),3 where θ ∈ Ω for some set Ω. You
do not observe X directly; instead, you observe a realization y of the
random vector Y that depends4 on X. For example, X might be a
random vector and Y the mean of its components, or if X is a complex
number then Y might be only its magnitude, or Y might be the first
component of the vector X.

1 A different standard choice of notation for a parametric density would be p(y; θ), but we prefer p(y | θ) because this notation is clearer when one wants to find the maximum a posteriori estimate rather than the maximum likelihood estimate—we will talk more about the maximum a posteriori estimate of θ in Section 1.3.
2 The treatment of discrete random vectors is a straightforward special case of the continuous treatment: one only needs to replace the probability density function with probability mass function and integral with summation.
3 We assume that the support of X, denoted by X, which is the closure of the set {x : p(x | θ) > 0}, does not depend on θ. An example where the support does depend on θ is if X is uniformly distributed on the interval [0, θ]. If the support does depend on θ, then the monotonicity of the EM algorithm might not hold. See Section 2.1 for details.
4 A rigorous description of this dependency is deferred to Section 1.4.

Given that you only have y, the goal here is to find the maximum likelihood estimate (MLE) of θ:

θ̂_MLE = arg max_{θ∈Ω} p(y | θ).    (1.1)

It is often easier to calculate the θ that maximizes the log-likelihood of y:

θ̂_MLE = arg max_{θ∈Ω} log p(y | θ).    (1.2)

Because log is a monotonically increasing function, the solution to (1.1)


will be the same as the solution to (1.2). However, for some problems it
is difficult to solve either (1.1) or (1.2). Then we can try EM: we make
a guess about the complete data X and solve for the θ that maximizes
the (expected) log-likelihood of X. And once we have an estimate for
θ, we can make a better guess about the complete data X, and iterate.
EM is usually described as two steps (the E-step and the M-step),
but let us first break it down into five steps:

Step 1: Let m = 0 and make an initial estimate θ(m) for θ.


Step 2: Given the observed data y and pretending for the moment
that your current guess θ(m) is correct, formulate the condi-
tional probability distribution p(x | y, θ(m) ) for the complete
data x.
Step 3: Using the conditional probability distribution p(x | y, θ(m) ) cal-
culated in Step 2, form the conditional expected log-likelihood,
which is called the Q-function5:

Q(θ | θ(m)) = ∫_{X(y)} log p(x | θ) p(x | y, θ(m)) dx
            = E_{X|y,θ(m)}[log p(X | θ)],    (1.3)

5 Note this Q-function has nothing to do with the sum of the tail of a Gaussian, which is
also called the Q-function. People call (1.3) the Q-function because the original paper [11]
used a Q to notate it. We like to say that the Q stands for quixotic because it is a bit
crazy and hopeful and beautiful to think you can find the maximum likelihood estimate
of θ in this way that iterates round-and-round like a windmill, and if Don Quixote had
been a statistician, it is just the sort of thing he might have done.

where the integral is over the set X(y), which is the closure of the set {x : p(x | y, θ) > 0}, and we assume that X(y) does not depend on θ.
Note that θ is a free variable in (1.3), so the Q-function is
a function of θ, but also depends on your current guess θ(m)
implicitly through the p(x | y, θ(m) ) calculated in Step 2.
Step 4: Find the θ that maximizes the Q-function (1.3); the result is
your new estimate θ(m+1) .
Step 5: Let m := m + 1 and go back to Step 2. (The EM algorithm does not specify a stopping criterion; standard criteria are to iterate until the estimate stops changing: ‖θ(m+1) − θ(m)‖ < ε for some ε > 0, or to iterate until the log-likelihood ℓ(θ) = log p(y | θ) stops changing: |ℓ(θ(m+1)) − ℓ(θ(m))| < ε for some ε > 0.)

The EM estimate is only guaranteed to never get worse (see Section 2.1
for details). Usually, it will find a peak in the likelihood p(y | θ), but
if the likelihood function p(y | θ) has multiple peaks, EM will not nec-
essarily find the global maximum of the likelihood. In practice, it is
common to start EM from multiple random initial guesses, and choose
the one with the largest likelihood as the final guess for θ.
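As a concrete sketch of this recipe, the following Python driver (not code from the original text) runs EM from several random starts and keeps the estimate with the largest likelihood. The callables init, e_step, m_step, and log_likelihood are hypothetical, problem-specific helpers that the user must supply.

    def run_em(y, init, e_step, m_step, log_likelihood,
               n_starts=10, tol=1e-8, max_iter=500):
        """Generic EM driver with random restarts.

        init()                   -> random initial estimate theta^(0)             (Step 1)
        e_step(y, theta)         -> statistics of p(x | y, theta) for the M-step  (Steps 2-3)
        m_step(y, stats)         -> maximizer of the Q-function                   (Step 4)
        log_likelihood(y, theta) -> log p(y | theta), used for the stopping rule
                                    and for choosing among restarts
        """
        best_theta, best_ll = None, float("-inf")
        for _ in range(n_starts):
            theta = init()
            ll = log_likelihood(y, theta)
            for _ in range(max_iter):              # Step 5: iterate until convergence
                stats = e_step(y, theta)
                theta = m_step(y, stats)
                new_ll = log_likelihood(y, theta)
                converged = abs(new_ll - ll) < tol
                ll = new_ll
                if converged:
                    break
            if ll > best_ll:                       # keep the restart with the largest likelihood
                best_theta, best_ll = theta, ll
        return best_theta, best_ll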
The traditional description of the EM algorithm consists of only
two steps. The above Steps 2 and 3 combined are called the E-step for
expectation, and Step 4 is called the M-step for maximization:
E-step: Given the estimate from the previous iteration θ(m) , compute
the conditional expectation Q(θ | θ(m) ) given in (1.3).
M-step: The (m + 1)th guess of θ is:

θ(m+1) = arg max_{θ∈Ω} Q(θ | θ(m)).    (1.4)

Since the E-step is just to compute the Q-function which is used


in the M-step, EM can be summarized as just iteratively solving the
M-step given by (1.4). When applying EM to a particular problem, this
is usually the best way to think about EM because then one does not
waste time computing parts of the Q-function that do not depend on θ.

1.2 Contrasting EM with a Simple Variant


As a comparison that may help illuminate EM, we next consider a
simple variant of EM. In Step 2 above, one computes the conditional
distribution p(x | y, θ(m) ) over all possible values of x, and this entire
conditional distribution is taken into account in the M-step. A simple
variant is to instead use only the mth maximum likelihood estimate
x(m) of the complete data x:

E-like-step: x(m) = arg max_{x∈X(y)} p(x | y, θ(m)),
M-like-step: θ(m+1) = arg max_{θ∈Ω} p(x(m) | θ).

We call this variant the point-estimate variant of EM ; it has also been


called classification EM. More on this variant can be found in [7, 9].
Perhaps the most famous example of this variant is k-means clustering6 [21, 35]. In k-means clustering, we have n observed data points y = [y1 y2 . . . yn]^T, where each yi ∈ R^d, and it is believed that the data points belong to k clusters. Let the complete data be the observed
data points and the missing information that specifies which of the k
clusters each observed data point belongs to. The goal is to estimate
the k cluster centers θ. First, one makes an initial guess θ̂0 of the k clus-
ter centers. Then in the E-like step, one assigns each of the n points
to the closest cluster based on the estimated cluster centers θ(m) . Then
in the M-like step, one takes all the points assigned to each cluster,
and computes the mean of those points to form a new estimate of the
cluster’s centroid. Underlying k-means is a model that the clusters are
defined by Gaussian distributions with unknown means (the θ to be
estimated) and identity covariance matrices.
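A minimal NumPy sketch of this point-estimate variant for the k-means model is given below; it is our own illustration (not code from the text) and assumes the data are stacked in an (n, d) array y.

    import numpy as np

    def kmeans(y, k, max_iter=100, rng=None):
        """Point-estimate ("classification EM") variant: hard assignments, then mean updates."""
        rng = np.random.default_rng() if rng is None else rng
        centers = y[rng.choice(len(y), size=k, replace=False)]   # initial guess of the k centers
        labels = np.zeros(len(y), dtype=int)
        for _ in range(max_iter):
            # E-like step: assign each point to its closest cluster center
            d2 = ((y[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
            labels = d2.argmin(axis=1)
            # M-like step: re-estimate each center as the mean of the points assigned to it
            new_centers = np.array([y[labels == j].mean(axis=0) if np.any(labels == j)
                                    else centers[j] for j in range(k)])
            if np.allclose(new_centers, centers):
                break
            centers = new_centers
        return centers, labels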
EM clustering differs from k-means clustering in that at each iter-
ation you do not choose a single x(m) , that is, one does not force each
observed point yi to belong to only one cluster. Instead, each observed
point yi is probabilistically assigned to the k clusters by estimating
p(x | y, θ(m) ). We treat EM clustering in more depth in Section 3.2.

6 The k-means clustering algorithm dates to 1967 [35] and is a special case of vector
quantization, which was first proposed as Lloyd’s algorithm in 1957 [32]. See [17] for
details.

1.3 Using a Prior with EM (MAP EM)


The EM algorithm can fail due to singularities of the log-likelihood
function — for example, for learning a GMM with 10 components, it
may decide that the most likely solution is for one of the Gaussians to
only have one data point assigned to it, with the bad result that the
Gaussian is estimated as having zero covariance (see Section 3.2.5 for
details).
A straightforward solution to such degeneracies is to take into
account or impose some prior information on the solution for θ. One
approach would be to restrict the set of possible θ. Such a restriction
is equivalent to putting a uniform prior probability over the restricted
set. More generally, one can impose any prior p(θ), and then modify
EM to maximize the posterior rather than the likelihood:

θ̂_MAP = arg max_{θ∈Ω} log p(θ | y) = arg max_{θ∈Ω} (log p(y | θ) + log p(θ)).

The EM algorithm is easily extended to maximum a posteriori (MAP)


estimation by modifying the M-step:
E-step: Given the estimate from the previous iteration θ(m) , compute
as a function of θ ∈ Ω the conditional expectation

Q(θ | θ(m) ) = EX|y,θ(m) [log p(X | θ)].

M-step: Maximize Q(θ | θ(m) ) + log p(θ) over θ ∈ Ω to find

θ(m+1) = arg max_{θ∈Ω} (Q(θ | θ(m)) + log p(θ)).

An example of MAP EM is given in Section 3.3.
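As a small illustration of the modified M-step, the sketch below maximizes Q(θ | θ(m)) + log p(θ) numerically for a scalar θ ∈ (0, 1). The expected counts c1, c2 and the Beta prior are hypothetical stand-ins, not quantities from the text.

    import numpy as np
    from scipy.optimize import minimize_scalar

    # Hypothetical expected counts from an E-step; they define
    # Q(theta | theta_m) = c1*log(theta) + c2*log(1 - theta) (constants dropped).
    c1, c2 = 7.3, 4.7
    alpha, beta = 2.0, 2.0          # hypothetical Beta(alpha, beta) prior on theta

    def neg_map_objective(theta):
        # -(Q(theta | theta_m) + log p(theta)), additive constants dropped
        q = c1 * np.log(theta) + c2 * np.log(1.0 - theta)
        log_prior = (alpha - 1.0) * np.log(theta) + (beta - 1.0) * np.log(1.0 - theta)
        return -(q + log_prior)

    # MAP M-step: maximize Q + log p(theta) instead of Q alone
    res = minimize_scalar(neg_map_objective, bounds=(1e-9, 1.0 - 1e-9), method="bounded")
    theta_map = res.x   # for this Q and prior, equals (c1 + alpha - 1) / (c1 + c2 + alpha + beta - 2)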

1.4 Specifying the Complete Data


Practically, the complete data should be defined so that given x it is
relatively easy to maximize p(x | θ) with respect to θ. Theoretically,
the complete data X must satisfy the Markov relationship θ → X → Y
with respect to the parameter θ and the observed data Y , that is, it
must be that

p(y | x, θ) = p(y | x).



A special case is when Y is a function of X, that is, Y = T (X); in


this case, X → Y is a deterministic function, and thus the Markov
relationship always holds.

1.4.1 EM for Missing Data Problems


For many applications of EM, including GMM and HMM, the com-
plete data X is the observed data Y plus some missing (sometimes
called latent or hidden) data Z, such that X = (Y, Z). This is a spe-
cial case of Y = T (X), where the function T simply removes Z from
X to produce Y . In general when using EM with missing data, one
can write the Q-function as an integral over the domain of Z, denoted
by Z, rather than over the domain of X, because the only random part
of the complete data X is the missing data Z. Then, for missing data
problems where x = (y, z),
Q(θ | θ(m)) = ∫_X log p(x | θ) p(x | y, θ(m)) dx
            = ∫_X log p(y, z | θ) p(y, z | y, θ(m)) dx
            = ∫_Z log p(y, z | θ) p(z | y, θ(m)) dz
            = E_{Z|y,θ(m)}[log p(y, Z | θ)].    (1.5)
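For instance, in a Gaussian mixture the missing data z_i is the label of the component that generated y_i, so the E-step reduces to computing p(z | y, θ(m)), the so-called responsibilities. The sketch below fits a two-component one-dimensional GMM this way; it is a simplified illustration under our own initialization choices (the full multivariate GMM derivation appears in Section 3.2).

    import numpy as np
    from scipy.stats import norm

    def em_gmm_1d(y, n_iter=200):
        """EM for a two-component 1-D Gaussian mixture via the missing-data form (1.5)."""
        w = 0.5                                            # mixing weight of component 0
        mu = np.array([np.min(y), np.max(y)], dtype=float)
        sigma = np.array([np.std(y), np.std(y)]) + 1e-6
        for _ in range(n_iter):
            # E-step: responsibilities p(z_i = k | y_i, theta^(m))
            p0 = w * norm.pdf(y, mu[0], sigma[0])
            p1 = (1.0 - w) * norm.pdf(y, mu[1], sigma[1])
            r0 = p0 / (p0 + p1)
            r1 = 1.0 - r0
            # M-step: closed-form maximizers of E_{Z|y,theta(m)}[log p(y, Z | theta)]
            w = r0.mean()
            mu = np.array([np.sum(r0 * y) / np.sum(r0), np.sum(r1 * y) / np.sum(r1)])
            sigma = np.sqrt(np.array([np.sum(r0 * (y - mu[0]) ** 2) / np.sum(r0),
                                      np.sum(r1 * (y - mu[1]) ** 2) / np.sum(r1)])) + 1e-9
        return w, mu, sigma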

1.4.2 EM for Independently, Identically Distributed Samples
For many common applications such as learning a GMM or HMM, the
complete data X is a set of n independent and identically distributed (i.i.d.) random vectors, X = [X1 X2 . . . Xn]^T, and the ith observed
sample yi is only a function of xi . Then the following proposition is
useful for decomposing the Q-function into a sum:

Proposition 1.1. Suppose p(x | θ) = ∏_{i=1}^n p(xi | θ) for all x ∈ X^n and all θ ∈ Ω, and the Markov relationship θ → Xi → Yi holds for all i = 1, . . . , n, that is,

p(yi | x, y1, . . . , yi−1, yi+1, . . . , yn, θ) = p(yi | xi),    (1.6)

then

Q(θ | θ(m)) = Σ_{i=1}^n Qi(θ | θ(m)),

where

Qi(θ | θ(m)) = E_{Xi|yi,θ(m)}[log p(Xi | θ)],   i = 1, . . . , n.

Proof. First, we show that given θ, the elements of the set {(Xi , Yi )},
i = 1, . . . , n, are mutually independent, that is,
p(x, y | θ) = ∏_{i=1}^n p(xi, yi | θ).    (1.7)

This mutual independence holds because

p(x, y | θ) = p(y1 | y2, . . . , yn, x, θ) · · · p(yn | x, θ) p(x | θ)    (by the chain rule)
            = p(y1 | x1, θ) · · · p(yn | xn, θ) p(x | θ)    (by (1.6), but keep θ in the condition)
            = p(y1 | x1, θ) · · · p(yn | xn, θ) ∏_{i=1}^n p(xi | θ)    (by the independence assumption on X)
            = ∏_{i=1}^n p(yi | xi, θ) p(xi | θ)
            = ∏_{i=1}^n p(xi, yi | θ).
Then we show that for all i = 1, . . . , n, we have

p(xi | y, θ) = p(xi | yi, θ).    (1.8)

This is because

p(xi | y, θ) = p(xi, y | θ) / p(y | θ)    (by Bayes’ rule)
            = [∫_{X^(n−1)} p(x, y | θ) dx1 . . . dxi−1 dxi+1 . . . dxn] / [∫_{X^n} p(x, y | θ) dx]
            = [∫_{X^(n−1)} ∏_{j=1}^n p(xj, yj | θ) dx1 . . . dxi−1 dxi+1 . . . dxn] / [∫_{X^n} ∏_{j=1}^n p(xj, yj | θ) dx1 . . . dxn]    (by (1.7))
            = [p(xi, yi | θ) ∏_{j=1, j≠i}^n ∫_X p(xj, yj | θ) dxj] / [∏_{j=1}^n ∫_X p(xj, yj | θ) dxj]
            = [p(xi, yi | θ) ∏_{j=1, j≠i}^n p(yj | θ)] / [∏_{j=1}^n p(yj | θ)]
            = p(xi, yi | θ) / p(yi | θ)
            = p(xi | yi, θ).

Then,

Q(θ | θ(m)) = E_{X|y,θ(m)}[log p(X | θ)]
            = E_{X|y,θ(m)}[log ∏_{i=1}^n p(Xi | θ)]    (by the independence assumption on X)
            = E_{X|y,θ(m)}[Σ_{i=1}^n log p(Xi | θ)]
            = Σ_{i=1}^n E_{Xi|y,θ(m)}[log p(Xi | θ)]
            = Σ_{i=1}^n E_{Xi|yi,θ(m)}[log p(Xi | θ)],

where the last line holds because of (1.8).

1.5 A Toy Example


We next present a fully worked-out version of a “toy example” of EM
that was used in the seminal EM paper [11]. Here, we give more details,
and we have changed it to literally be a toy example.
Imagine you ask n kids to choose a toy out of four choices. Let Y = [Y1 . . . Y4]^T denote the histogram of their n choices, where Yi is the
number of the kids that chose toy i, for i = 1, . . . , 4. We can model this

random histogram Y as being distributed according to a multinomial


distribution. The multinomial has two parameters: the number of kids
asked, denoted by n ∈ N, and the probability that a kid will choose each
of the four toys, denoted by p ∈ [0, 1]^4, where p1 + p2 + p3 + p4 = 1.
Then the probability of seeing some particular histogram y is:

P(y | p) = (n! / (y1! y2! y3! y4!)) p1^y1 p2^y2 p3^y3 p4^y4.    (1.9)
Next, say that we have reason to believe that the unknown proba-
bility p of choosing each of the toys is parameterized by some hidden
value θ ∈ (0, 1) such that

pθ = [1/2 + θ/4,  (1/4)(1 − θ),  (1/4)(1 − θ),  θ/4]^T,   θ ∈ (0, 1).    (1.10)
The estimation problem is to guess the θ that maximizes the probability
of the observed histogram y of toy choices.
Combining (1.9) and (1.10), we can write the probability of seeing the histogram y = [y1 y2 y3 y4]^T as

P(y | θ) = (n! / (y1! y2! y3! y4!)) (1/2 + θ/4)^y1 ((1 − θ)/4)^y2 ((1 − θ)/4)^y3 (θ/4)^y4.
For this simple example, one could directly maximize the log-likelihood
log P (y | θ), but here we will instead illustrate how to use the EM algo-
rithm to find the maximum likelihood estimate of θ.
To use EM, we need to specify what the complete data X is. We
will choose the complete data to enable us to specify the probability
mass function (pmf) in terms of only θ and 1 − θ. To that end, we define the complete data to be X = [X1 . . . X5]^T, where X has a multinomial distribution with number of trials n and the probability of each event is:

qθ = [1/2,  θ/4,  (1/4)(1 − θ),  (1/4)(1 − θ),  θ/4]^T,   θ ∈ (0, 1).

By defining X this way, we can then write the observed data Y as:

Y = T(X) = [X1 + X2,  X3,  X4,  X5]^T.

The likelihood of a realization x of the complete data is

P(x | θ) = (n! / ∏_{i=1}^5 xi!) (1/2)^x1 (θ/4)^(x2+x5) ((1 − θ)/4)^(x3+x4).    (1.11)

For EM, we need to maximize the Q-function:

θ(m+1) = arg max_{θ∈(0,1)} Q(θ | θ(m)) = arg max_{θ∈(0,1)} E_{X|y,θ(m)}[log p(X | θ)].

To solve the above equation, we actually only need the terms of


log p(x | θ) that depend on θ, because the other terms are irrelevant
as far as maximizing over θ is concerned. Take the log of (1.11) and
ignore those terms that do not depend on θ, then

θ(m+1) = arg max_{θ∈(0,1)} E_{X|y,θ(m)}[(X2 + X5) log θ + (X3 + X4) log(1 − θ)]
       = arg max_{θ∈(0,1)} (E_{X|y,θ(m)}[X2] + E_{X|y,θ(m)}[X5]) log θ
                           + (E_{X|y,θ(m)}[X3] + E_{X|y,θ(m)}[X4]) log(1 − θ).

To solve the above maximization problem, we need the expectation


of the complete data X conditioned on the already known incomplete
data y, which only leaves the uncertainty about X1 and X2 . Since we
know that X1 + X2 = y1 , we can use the indicator function 1{·} to
write that given y1 , the pair (X1 , X2 ) is binomially distributed with X1
“successes” in y1 events:

P(x | y, θ(m)) = 1{x1+x2=y1} (y1! / (x1! x2!)) ((1/2) / (1/2 + θ(m)/4))^x1 ((θ(m)/4) / (1/2 + θ(m)/4))^x2 ∏_{i=3}^5 1{xi=yi−1}
               = 1{x1+x2=y1} (y1! / (x1! x2!)) (2 / (2 + θ(m)))^x1 (θ(m) / (2 + θ(m)))^x2 ∏_{i=3}^5 1{xi=yi−1}.

Then the conditional expectation of X given y and θ(m) is


E_{X|y,θ(m)}[X] = [ (2/(2 + θ(m))) y1,  (θ(m)/(2 + θ(m))) y1,  y2,  y3,  y4 ]^T,

and the M-step becomes


θ(m+1) = arg max_{θ∈(0,1)} ( (θ(m)/(2 + θ(m))) y1 + y4 ) log θ + (y2 + y3) log(1 − θ)
       = ( (θ(m)/(2 + θ(m))) y1 + y4 ) / ( (θ(m)/(2 + θ(m))) y1 + y2 + y3 + y4 ).

Given an initial estimate θ(0) = 0.5, the above algorithm reaches θ̂MLE
to MATLAB’s numerical precision on the 18th iteration.
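A minimal implementation of this iteration is sketched below; the demo histogram passed to em_toy is hypothetical (the text does not specify one), and the update is exactly the closed-form M-step above.

    def em_toy(y, theta=0.5, tol=1e-12, max_iter=100):
        """EM for the toy multinomial example; y = (y1, y2, y3, y4)."""
        y1, y2, y3, y4 = y
        for m in range(1, max_iter + 1):
            # E-step: E[X2 | y, theta^(m)], the "theta/4" share of the y1 kids
            a = theta / (2.0 + theta) * y1
            # M-step: closed-form maximizer of the expected complete-data log-likelihood
            theta_new = (a + y4) / (a + y2 + y3 + y4)
            if abs(theta_new - theta) < tol:
                return theta_new, m
            theta = theta_new
        return theta, max_iter

    theta_hat, iterations = em_toy((30, 18, 20, 32))   # hypothetical histogram of toy choices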

References

[1] M. M. Ali, C. Khompatraporn, and Z. B. Zabinsky, “A numerical evaluation


of several stochastic algorithms on selected continuous global optimization test
problems,” Journal of Global Optimization, vol. 31, no. 4, pp. 635–672, April
2005.
[2] T. L. Bailey and C. Elkan, “Unsupervised learning of multiple motifs in biopoly-
mers using expectation maximization,” Machine Learning, vol. 21, pp. 51–80,
1995.
[3] A. Banerjee, X. Guo, and H. Wang, “On the optimality of conditional expec-
tation as a Bregman predictor,” IEEE Transactions on Information Theory,
vol. 51, no. 7, pp. 2664–2669, July 2005.
[4] L. E. Baum, T. Petrie, G. Soules, and N. Weiss, “A maximization technique
occurring in the statistical analysis of probabilistic functions of Markov chains,”
The Annals of Mathematical Statistics, vol. 41, no. 1, pp. 164–171, February
1970.
[5] S. Boyd and L. Vandenberghe, Convex Optimization. Cambridge, UK: Cam-
bridge University Press, 2004.
[6] R. A. Boyles, “On the convergence of the EM algorithm,” Journal of the Royal
Statistical Society, Series B (Methodological), vol. 45, no. 1, pp. 47–50, 1983.
[7] P. Bryant and J. A. Williamson, “Asymptotic behavior of classification maxi-
mum likelihood estimates,” Biometrika, vol. 65, no. 2, pp. 273–281, 1978.
[8] G. Celeux and J. Diebolt, “The SEM algorithm: a probabilistic teacher algo-
rithm derived from the EM algorithm for the mixture problem,” Computational
Statistics Quarterly, vol. 2, pp. 73–82, 1985.


[9] G. Celeux and G. Govaert, “A classification EM algorithm for clustering


and two stochastic versions,” Computational Statistics & Data Analysis, vol. 14,
pp. 315–332, 1992.
[10] Y. Chen and J. Krumm, “Probabilistic modeling of traffic lanes from GPS
traces,” in Proceedings of 18th ACM SIGSPATIAL International Conference
on Advances in Geographic Information Systems, 2010.
[11] A. P. Dempster, N. M. Laird, and D. B. Rubin, “Maximum likelihood from
incomplete data via the EM algorithm,” Journal of the Royal Statistical Society,
Series B (Methodological), vol. 39, no. 1, pp. 1–38, 1977.
[12] M. Feder and E. Weinstein, “Parameter estimation of superimposed signals
using the EM algorithm,” IEEE Transactions on Acoustics, Speech and Signal
Processing, vol. 36, no. 4, pp. 477–489, April 1988.
[13] G. B. Folland, Real Analysis: Modern Techniques and Their Applications. New
York, NY: John Wiley & Sons, 2nd Edition, 1999.
[14] B. A. Frigyik, A. Kapila, and M. R. Gupta, “Introduction to the Dirichlet distri-
bution and related processes,” Department of Electrical Engineering, University
of Washington, UWEETR-2010-0006, 2010.
[15] B. A. Frigyik, S. Srivastava, and M. R. Gupta, “Functional Bregman divergence
and Bayesian estimation of distributions,” IEEE Transactions on Information
Theory, vol. 54, no. 11, pp. 5130–5139, November 2008.
[16] M. Gales and S. Young, “The application of hidden Markov models in speech
recognition,” Foundations and Trends in Signal Processing, vol. 1, no. 3,
pp. 195–304, 2008.
[17] A. Gersho and R. M. Gray, Vector Quantization and Signal Compression.
Norwell, MA: Kluwer, 1991.
[18] A. Goldsmith, Wireless Communications. Cambridge, UK: Cambridge Univer-
sity Press, 2005.
[19] M. I. Gurelli and L. Onural, “On a parameter estimation method for Gibbs-
Markov random fields,” IEEE Transactions on Pattern Analysis and Machine
Intelligence, vol. 16, no. 4, pp. 424–430, April 1994.
[20] H. O. Hartley, “Maximum likelihood estimation from incomplete data,” Bio-
metrics, vol. 14, no. 2, pp. 174–194, June 1958.
[21] T. Hastie, R. Tibshirani, and J. Friedman, The Elements of Statistical Learning:
Data Mining, Inference, and Prediction. New York, NY: Springer, 2nd Edition,
2009.
[22] S. Haykin, “Cognitive radio: Brain-empowered wireless communications,” IEEE
Journal on Selected Areas in Communications, vol. 23, no. 2, pp. 201–220,
February 2005.
[23] M. Hazen and M. R. Gupta, “A multiresolutional estimated gradient architec-
ture for global optimization,” in Proceedings of the IEEE Congress on Evolu-
tionary Computation, pp. 3013–3020, 2006.
[24] M. Hazen and M. R. Gupta, “Gradient estimation in global optimization algo-
rithms,” in Proceedings of the IEEE Congress on Evolutionary Computation,
pp. 1841–1848, 2009.
[25] T. Hebert and R. Leahy, “A generalized EM algorithm for 3-D Bayesian recon-
struction from Poisson data using Gibbs priors,” IEEE Transactions on Medical
Imaging, vol. 8, no. 2, pp. 194–202, June 1989.

[26] T. J. Hebert and K. Lu, “Expectation–maximization algorithms, null spaces,


and MAP image restoration,” IEEE Transactions on Image Processing, vol. 4,
no. 8, pp. 1084–1095, August 1995.
[27] M. Jamshidian and R. I. Jennrich, “Acceleration of the EM algorithm by
using quasi-Newton methods,” Journal of the Royal Statistical Society, Series
B (Methodological), vol. 59, no. 3, pp. 569–587, 1997.
[28] D. R. Jones, “A taxonomy of global optimization methods based on response
surfaces,” Journal of Global Optimization, vol. 21, no. 4, pp. 345–383, December
2001.
[29] K. Lange, “Convergence of EM image reconstruction algorithms with Gibbs
smoothing,” IEEE Transactions on Medical Imaging, vol. 9, no. 4, pp. 439–446,
December 1990.
[30] K. Lange, “A gradient algorithm locally equivalent to the EM algorithm,” Jour-
nal of the Royal Statistical Society, Series B (Methodological), vol. 57, no. 2,
pp. 425–437, 1995.
[31] J. Li and R. M. Gray, Image Segmentation and Compression Using Hidden
Markov Models. New York, NY: Springer, 2000.
[32] S. P. Lloyd, “Least squares quantization in PCM,” IEEE Transactions on Infor-
mation Theory, vol. 28, no. 2, pp. 129–137, First published in 1957 as a Bell
Labs technical note, 1982.
[33] L. B. Lucy, “An iterative technique for the rectification of observed distribu-
tions,” Astronomical Journal, vol. 79, no. 6, pp. 745–754, June 1974.
[34] D. J. C. MacKay, Information Theory, Inference, and Learning Algorithms.
Cambridge, UK: Cambridge University Press, 2003.
[35] J. MacQueen, “Some methods for classification and analysis of multivariate
observations,” in Proceedings of the fifth Berkeley Symposium on Mathematical
Statistics and Probability, pp. 281–297, 1967.
[36] G. J. McLachlan and T. Krishnan, The EM Algorithm and Extensions. New
York, NY: John Wiley & Sons, 2nd Edition, 2008.
[37] G. J. McLachlan and D. Peel, Finite Mixture Models. New York, NY: John
Wiley & Sons, 2000.
[38] R. Mendes, J. Kennedy, and J. Neves, “The fully informed particle swarm: sim-
pler, maybe better,” IEEE Transactions on Evolutionary Computation, vol. 8,
no. 3, pp. 204–210, June 2004.
[39] X.-L. Meng and D. B. Rubin, “On the global and componentwise rates of con-
vergence of the EM algorithm,” Linear Algebra and its Applications, vol. 199,
pp. 413–425, March 1994.
[40] X.-L. Meng and D. A. van Dyk, “The EM algorithm — an old folk-song sung
to a fast new tune,” Journal of the Royal Statistical Society, Series B (Method-
ological), vol. 59, no. 3, pp. 511–567, 1997.
[41] R. M. Neal and G. E. Hinton, “A view of the EM algorithm that justifies
incremental, sparse, and other variants,” in Learning in Graphical Models, (M. I.
Jordan, ed.), MIT Press, November 1998.
[42] J. K. Nelson and M. R. Gupta, “An EM technique for multiple transmitter
localization,” in Proceedings of the 41st Annual Conference on Information
Sciences and Systems, pp. 610–615, 2007.

[43] J. K. Nelson, M. R. Gupta, J. Almodovar, and W. H. Mortensen, “A quasi EM


method for estimating multiple transmitter locations,” IEEE Signal Processing
Letters, vol. 16, no. 5, pp. 354–357, May 2009.
[44] S. Newcomb, “A generalized theory of the combination of observations so as
to obtain the best result,” American Journal of Mathematics, vol. 8, no. 4,
pp. 343–366, August 1886.
[45] J. Nocedal and S. J. Wright, Numerical Optimization. New York, NY: Springer,
2nd Edition, 2006.
[46] P. M. Pardalos and H. E. Romeijn, eds., Handbook of Global Optimization.
Vol. 2, Norwell, MA: Kluwer, 2002.
[47] K. B. Petersen and M. S. Pedersen, The Matrix Cookbook. November 2008.
http://matrixcookbook.com/.
[48] W. Qian and D. M. Titterington, “Stochastic relaxations and EM algorithms
for Markov random fields,” Journal of Statistical Computation and Simulation,
vol. 40, pp. 55–69, 1992.
[49] L. R. Rabiner, “A tutorial on hidden Markov models and selected applications
in speech recognition,” Proceedings of the IEEE, vol. 77, no. 2, pp. 257–286,
February 1989.
[50] R. A. Redner and H. F. Walker, “Mixture densities, maximum likelihood and
the EM algorithm,” SIAM Review, vol. 26, no. 2, pp. 195–239, April 1984.
[51] W. H. Richardson, “Bayesian-based iterative method of image restoration,”
Journal of Optical Society of America, vol. 62, no. 1, pp. 55–59, 1972.
[52] C. P. Robert and G. Casella, Monte Carlo Statistical Methods. New York, NY:
Springer, 2nd Edition, 2004.
[53] A. Roche, “EM algorithm and variants: An informal tutorial,” Unpublished
(available online at ftp://ftp.cea.fr/pub/dsv/madic/publis/Roche em.
pdf), 2003.
[54] G. Ronning, “Maximum Likelihood estimation of Dirichlet distributions,” Jour-
nal of Statistical Computation and Simulation, vol. 32, no. 4, pp. 215–221, 1989.
[55] H. Stark and Y. Yang, Vector Space Projections: A Numerical Approach to
Signal and Image Processing, Neural Nets, and Optics. New York, NY: John
Wiley & Sons, 1998.
[56] C. A. Sugar and G. M. James, “Finding the number of clusters in a dataset:
An information-theoretic approach,” Journal of the American Statistical Asso-
ciation, vol. 98, no. 463, pp. 750–763, September 2003.
[57] M. A. Tanner and W. H. Wong, “The calculation of posterior distributions by
data augmentation,” Journal of the American Statistical Association, vol. 82,
no. 398, pp. 528–540, June 1987.
[58] D. A. van Dyk and X.-L. Meng, “The art of data augmentation,” Journal of
Computational and Graphical Statistics, vol. 10, no. 1, pp. 1–50, March 2001.
[59] J. Wang, A. Dogandzic, and A. Nehorai, “Maximum likelihood estimation of
compound-Gaussian clutter and target parameters,” IEEE Transactions on
Signal Processing, vol. 54, no. 10, pp. 3884–3898, October 2006.
[60] G. C. G. Wei and M. A. Tanner, “A Monte Carlo implementation of the EM
algorithm and the poor man’s data augmentation algorithms,” Journal of the
American Statistical Association, vol. 85, no. 411, pp. 699–704, September 1990.

[61] L. R. Welch, “Hidden Markov Models and the Baum-Welch Algorithm,” IEEE
Information Theory Society Newsletter, vol. 53, no. 4, pp. 1–13, December 2003.
[62] C. F. J. Wu, “On the convergence properties of the EM algorithm,” The Annals
of Statistics, vol. 11, no. 1, pp. 95–103, March 1983.
[63] L. Xu and M. I. Jordan, “On convergence properties of the EM algorithm for
Gaussian mixtures,” Neural Computation, vol. 8, no. 1, pp. 129–151, January
1996.
[64] R. W. Yeung, A First Course in Information Theory. New York, NY: Springer,
2002.
[65] J. Zhang, “The mean field theory in EM procedures for Markov random
fields,” IEEE Transactions on Signal Processing, vol. 40, no. 10, pp. 2570–2583,
October 1992.
