
Expectation Maximization

Dekang Lin
Department of Computing Science
University of Alberta

Objectives
Expectation Maximization (EM) is perhaps the most often used, and most often only half understood, algorithm for unsupervised learning.

It is very intuitive.

Many people rely on their intuition to apply the algorithm in different problem domains.

I will present a proof of the EM Theorem that explains why the algorithm works.

Hopefully this will help with applying EM when the intuition is not obvious.

Model Building with Partial Observations

Our goal is to build a probabilistic model.

A model is defined by a set of parameters.

The model parameters can be estimated from a set of training examples: x_1, x_2, ..., x_n.

The x_i's are identically and independently distributed (iid).

Unfortunately, we only get to observe part of each training example: x_i = (t_i, y_i) and we can only observe y_i.

How do we build the model?

Example: POS Tagging
Complete data: A sentence (a sequence of
words) and a corresponding sequence of
POS tags.
Observed data: the sentence
Unobserved data: the sequence of tags
Model: an HMM with transition/emission
probability tables.

Training with Tagged Corpus
Pierre NNP Vinken NNP , , 61 CD years NNS
old JJ , , will MD join VB the DT board NN
as IN a DT nonexecutive JJ director NN Nov.
NNP 29 CD . .
Mr. NNP Vinken NNP is VBZ chairman NN of IN
Elsevier NNP N.V. NNP , , the DT Dutch NNP
publishing VBG group NN . .
Rudolph NNP Agnew NNP , , 55 CD years NNS
old JJ and CC former JJ chairman NN of IN
Consolidated NNP Gold NNP Fields NNP PLC NNP
, , was VBD named VBN a DT nonexecutive JJ
director NN of IN this DT British JJ
industrial JJ conglomerate NN . .
c(JJ)=7 c(JJ, NN)=4, P(NN|JJ)=4/7
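
Not part of the slides: a minimal sketch of how these relative-frequency counts could be computed from a tagged corpus. The "word TAG word TAG ..." token layout is assumed to match the corpus shown above.

from collections import Counter

# Each sentence is a flat "word TAG word TAG ..." string, as in the corpus above.
corpus = [
    "Pierre NNP Vinken NNP , , 61 CD years NNS old JJ , , will MD join VB "
    "the DT board NN as IN a DT nonexecutive JJ director NN Nov. NNP 29 CD . .",
]

unigram, bigram = Counter(), Counter()
for sentence in corpus:
    tags = sentence.split()[1::2]          # every second token is a POS tag
    unigram.update(tags)
    bigram.update(zip(tags, tags[1:]))

# MLE of the transition probability: P(NN | JJ) = c(JJ, NN) / c(JJ)
print(bigram[("JJ", "NN")] / unigram["JJ"])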

Example: Parsing
Complete data: a sentence and its parse tree.

Observed data: the sentence.

Unobserved data: the nonterminal categories and their relationships that form the parse tree.

Model: a PCFG, or anything that allows one to compute the probability of parse trees.

Example: Semantic Labeling
Complete data: (context, cluster, word)

Observed data: (context, word)

Unobserved data: cluster

Model: P(context, cluster, word) = P(context) P(cluster|context) P(word|cluster)
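
As an aside (not on the slides), this factored model makes the role of the hidden variable easy to see: the probability of an observed (context, word) pair must sum over clusters. A minimal sketch with made-up parameter tables:

# Hypothetical parameter tables for the factored model
# P(context, cluster, word) = P(context) P(cluster|context) P(word|cluster)
p_context = {"eat _": 0.6, "drink _": 0.4}
p_cluster_given_context = {"eat _": {"food": 0.9, "drink": 0.1},
                           "drink _": {"food": 0.2, "drink": 0.8}}
p_word_given_cluster = {"food": {"apple": 0.7, "tea": 0.3},
                        "drink": {"apple": 0.1, "tea": 0.9}}

def p_observed(context, word):
    # P(context, word) = sum over the hidden cluster of the complete-data probability
    return sum(p_context[context]
               * p_cluster_given_context[context][c]
               * p_word_given_cluster[c][word]
               for c in p_word_given_cluster)

print(p_observed("eat _", "apple"))   # marginal probability of an observed pair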

What is the best Model?
There are many possible models.

Many possible ways to set the model parameters.

We obviously want the best model.

Which model is the best?

The model that assigns the highest probability to the observations is the best.

Maximize ∏_i P_θ(y_i), or equivalently Σ_i log P_θ(y_i).

This is known as maximum likelihood estimation (MLE).

What about maximizing the probability of the hidden data?

MLE Example
A coin with P(H)=p, P(T)=q.

We observed m Hs and n Ts.

What are p and q according to MLE?

Maximize Σ_i log P_θ(y_i) = log(p^m q^n) = m log p + n log q

Under the constraint: p + q = 1.

Lagrange method:

Define g(p,q) = m log p + n log q + λ(p + q − 1)

Solve the equations:

∂g(p,q)/∂p = 0,   ∂g(p,q)/∂q = 0,   p + q = 1
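
The slide leaves the solution implicit; solving these equations (my addition, not on the slide) gives the familiar relative-frequency estimate:

∂g/∂p = m/p + λ = 0 and ∂g/∂q = n/q + λ = 0, so p = −m/λ and q = −n/λ.
Substituting into p + q = 1 gives λ = −(m+n), hence p = m/(m+n) and q = n/(m+n).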

Example
Suppose we have two coins. Coin 1 is fair. Coin 2 generates H with probability p.

They each have probability ½ of being chosen and tossed.

The complete data is (1, H), (1, T), (2, T), (1, H), (2, T).

We only know the result of each toss, but we don't know which coin was chosen.

The observed data is H, T, T, H, T.

Problem:

Suppose the observations include m Hs and n Ts.

How do we estimate p to maximize Σ_i log P_θ(y_i)?
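
Under these assumptions (equal ½ choice probability, coin 1 fair), the observed-data likelihood can be written down and, in this small case, even maximized directly. A minimal sketch, not from the slides, using a simple grid search:

import math

def log_likelihood(p, m, n):
    # P(H) = 1/2 * 1/2 + 1/2 * p,  P(T) = 1/2 * 1/2 + 1/2 * (1 - p)
    p_h = 0.25 + 0.5 * p
    p_t = 0.25 + 0.5 * (1.0 - p)
    return m * math.log(p_h) + n * math.log(p_t)

m, n = 2, 3            # observed data H, T, T, H, T
grid = [i / 1000 for i in range(1, 1000)]
best_p = max(grid, key=lambda p: log_likelihood(p, m, n))
print(best_p)          # close to the analytic maximizer (3m - n) / (2(m + n)) = 0.3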

Need for Iterative Algorithm
Unfortunately, we often cannot find the best θ by solving equations.

Example:

Three coins, 0, 1, and 2, generate H with probabilities p_0, p_1, and p_2, respectively.

Experiment: toss coin 0.

If H, toss coin 1 three times.

If T, toss coin 2 three times.

Observations:

<HHH>, <TTT>, <HHH>, <TTT>, <HHH>

What is the MLE for p_0, p_1, and p_2?
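
The difficulty is that each observation's probability is a sum over the hidden coin-0 outcome, so the log no longer decomposes into a sum of logs of individual parameters. A small sketch (my own, not from the slides) of the observed-data log-likelihood for this model:

import math

def log_likelihood(p0, p1, p2, observations):
    total = 0.0
    for seq in observations:          # e.g. "HHH" or "TTT"
        h = seq.count("H")
        t = len(seq) - h
        # Marginalize over the unobserved coin-0 outcome:
        prob = p0 * p1**h * (1 - p1)**t + (1 - p0) * p2**h * (1 - p2)**t
        total += math.log(prob)       # log of a sum: no simple closed-form maximizer
    return total

obs = ["HHH", "TTT", "HHH", "TTT", "HHH"]
print(log_likelihood(0.6, 0.9, 0.1, obs))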

Overview of EM
Create an initial model, θ_0.

Arbitrarily, randomly, or with a small set of training examples.

Use the current model θ to obtain another model θ' such that

Σ_i log P_θ'(y_i) > Σ_i log P_θ(y_i)

Repeat the above step until reaching a local maximum.

Guaranteed to find a better model after each iteration.


Maximizing Likelihood
How do we find a better model θ' given a model θ?

Can we use the Lagrange method to maximize Σ_i log P_θ'(y_i)?

If this can be done, there is no need to iterate!



EM Theorem
The following EM Theorem holds.

This theorem is similar to (but is not identical to, nor does it follow from) the EM Theorem in [Jelinek 1997, p.148]; the proof is almost identical.

EM Theorem:

Σ_t denotes summation over all possible values of the unobserved data.

If  Σ_i Σ_t P_θ(t|y_i) log P_θ'(t, y_i) ≥ Σ_i Σ_t P_θ(t|y_i) log P_θ(t, y_i),
then  Σ_i log P_θ'(y_i) ≥ Σ_i log P_θ(y_i).



What does EM Theorem Mean?
If we can find a θ' that maximizes

Σ_i Σ_t P_θ(t|y_i) log P_θ'(t, y_i)

the same θ' will also satisfy the condition

Σ_i log P_θ'(y_i) ≥ Σ_i log P_θ(y_i)

which is needed in the EM algorithm.

We can maximize the former by taking its partial derivatives w.r.t. the parameters in θ'.



EM Theorem: why?
Why is optimizing

Σ_i Σ_t P_θ(t|y_i) log P_θ'(t, y_i)

easier than optimizing

Σ_i log P_θ'(y_i) ?

P_θ'(t, y_i) involves the complete data and is usually a product of a set of parameters. P_θ'(y_i) usually involves a summation over all hidden variables.


EM Theorem: Proof

Σ_i log P_θ'(y_i) − Σ_i log P_θ(y_i)

= Σ_i Σ_t P_θ(t|y_i) log P_θ'(y_i) − Σ_i Σ_t P_θ(t|y_i) log P_θ(y_i)          (since Σ_t P_θ(t|y_i) = 1)

= Σ_i Σ_t P_θ(t|y_i) log [P_θ'(t, y_i) / P_θ'(t|y_i)] − Σ_i Σ_t P_θ(t|y_i) log [P_θ(t, y_i) / P_θ(t|y_i)]

= Σ_i Σ_t P_θ(t|y_i) log P_θ'(t, y_i) − Σ_i Σ_t P_θ(t|y_i) log P_θ(t, y_i)
  + Σ_i Σ_t P_θ(t|y_i) log [P_θ(t|y_i) / P_θ'(t|y_i)]                          (≥ 0 by Jensen's Inequality)

≥ Σ_i Σ_t P_θ(t|y_i) log P_θ'(t, y_i) − Σ_i Σ_t P_θ(t|y_i) log P_θ(t, y_i)

Therefore, if Σ_i Σ_t P_θ(t|y_i) log P_θ'(t, y_i) ≥ Σ_i Σ_t P_θ(t|y_i) log P_θ(t, y_i), then Σ_i log P_θ'(y_i) ≥ Σ_i log P_θ(y_i).

Jensen's Inequality

The proof used the inequality

Σ_t P_θ(t|y_i) log [P_θ(t|y_i) / P_θ'(t|y_i)] ≥ 0

More generally, if p and q are probability distributions,

Σ_x p(x) log [p(x) / q(x)] ≥ 0

Even more generally, if f is a convex function,

E[f(x)] ≥ f(E[x])
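
A quick numeric sanity check of the inequality Σ_x p(x) log(p(x)/q(x)) ≥ 0 (my own illustration, not from the slides):

import math

def kl(p, q):
    # Kullback-Leibler divergence between two discrete distributions
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

print(kl([0.5, 0.3, 0.2], [0.2, 0.5, 0.3]))   # positive
print(kl([0.5, 0.3, 0.2], [0.5, 0.3, 0.2]))   # 0.0 when p == q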

What is Σ_t P_θ(t|y_i) log P_θ'(t, y_i) ?

It is the expected value of log P_θ'(t, y_i) according to the model θ.

The EM Theorem states that we can get a better model by maximizing the sum (over all instances) of this expectation.


A Generic Set Up for EM
Assume P_θ(t, y) is a product of a set of parameters.

Assume θ consists of M groups of parameters.

The parameters in each group sum up to 1.

Let u_jk be a parameter: Σ_m u_jm = 1.

Let T_jk be the subset of hidden data such that if t is in T_jk, the computation of P_θ(t, y_i) involves u_jk.

Let n(t, y_i) be the number of times u_jk is used in P_θ(t, y_i), i.e.,

P_θ(t, y_i) = u_jk^n(t, y_i) · v(t, y_i), where v(t, y_i) is the product of all the other parameters.

To find the θ' (i.e., the u_jk values) that maximizes Σ_i Σ_t P_θ(t|y_i) log P_θ'(t, y_i) under the constraints Σ_m u_jm = 1, introduce a Lagrange multiplier λ_j for each group:

∂/∂u_jk [ Σ_i Σ_t P_θ(t|y_i) log P_θ'(t, y_i) + Σ_j λ_j (1 − Σ_m u_jm) ] = 0

Since P_θ'(t, y_i) = u_jk^n(t, y_i) · v(t, y_i) for t in T_jk, this gives

Σ_i Σ_{t in T_jk} P_θ(t|y_i) n(t, y_i) / u_jk − λ_j = 0

u_jk = (1/λ_j) Σ_i Σ_{t in T_jk} P_θ(t|y_i) n(t, y_i)

Here Σ_i Σ_{t in T_jk} P_θ(t|y_i) n(t, y_i) is the pseudo count of instances involving u_jk, and λ_j simply renormalizes the pseudo counts within group j so that Σ_m u_jm = 1.
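
Read as an algorithm, the update says: accumulate expected ("pseudo") counts for each parameter under the old model, then renormalize within each group. A minimal sketch of that M-step, with hypothetical names, not code from the slides:

def m_step(expected_counts):
    """expected_counts[j][k] = sum over instances i and hidden t in T_jk of
    P_theta(t | y_i) * n(t, y_i), accumulated during the E-step."""
    new_params = {}
    for j, counts in expected_counts.items():
        total = sum(counts.values())            # plays the role of lambda_j
        new_params[j] = {k: c / total for k, c in counts.items()}
    return new_params

# Example: one group of two parameters with pseudo counts 3.2 and 0.8
counts = {"group_j": {"u_j1": 3.2, "u_j2": 0.8}}
print(m_step(counts))    # {'group_j': {'u_j1': 0.8, 'u_j2': 0.2}}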

Summary
EM Theorem

Intuition

Proof
Generic Set-up
