Expectation Maximization
Dekang Lin
Department of Computing Science
University of Alberta
Objectives
Expectation Maximization (EM) is perhaps the most often used, and mostly half-understood, algorithm for unsupervised learning.
It is very intuitive.
The training examples $x_i$ are independently and identically distributed (iid).
Unfortunately, we only get to observe part of each training example: $x_i = (t_i, y_i)$ and we can only observe $y_i$.
How do we build the model?
Example: POS Tagging
Complete data: A sentence (a sequence of
words) and a corresponding sequence of
POS tags.
Observed data: the sentence
Unobserved data: the sequence of tags
Model: an HMM with transition/emission
probability tables.
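As a minimal illustration (not the lecture's code) of what such a model looks like, an HMM tagger is just two probability tables; all numeric values below except 4/7 are made up:

```python
# Illustrative sketch (not the lecture's code): an HMM tagger is defined by
# two probability tables.

# Transition table P(tag_i | tag_{i-1})
transition = {
    ("JJ", "NN"): 4 / 7,    # estimated from the tagged corpus on the next slide
    ("DT", "NN"): 0.6,      # made-up value for illustration
}

# Emission table P(word | tag)
emission = {
    ("NN", "board"): 0.001,     # made-up value for illustration
    ("NNP", "Vinken"): 0.0005,  # made-up value for illustration
}
```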
Training with Tagged Corpus
Pierre NNP Vinken NNP , , 61 CD years NNS
old JJ , , will MD join VB the DT board NN
as IN a DT nonexecutive JJ director NN Nov.
NNP 29 CD . .
Mr. NNP Vinken NNP is VBZ chairman NN of IN
Elsevier NNP N.V. NNP , , the DT Dutch NNP
publishing VBG group NN . .
Rudolph NNP Agnew NNP , , 55 CD years NNS
old JJ and CC former JJ chairman NN of IN
Consolidated NNP Gold NNP Fields NNP PLC NNP
, , was VBD named VBN a DT nonexecutive JJ
director NN of IN this DT British JJ
industrial JJ conglomerate NN . .
c(JJ) = 7, c(JJ, NN) = 4, P(NN|JJ) = c(JJ, NN) / c(JJ) = 4/7
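With the tags observed, training is plain counting. A small Python sketch (illustrative, not the lecture's code) that reproduces the count above from the three example sentences' tag sequences:

```python
# Supervised (complete-data) estimation: when the tags are observed,
# transition probabilities are just normalized counts, e.g. P(NN|JJ) = c(JJ,NN)/c(JJ).
from collections import Counter

# Tag sequences of the three example sentences above (words omitted).
tag_sequences = [
    "NNP NNP , CD NNS JJ , MD VB DT NN IN DT JJ NN NNP CD .".split(),
    "NNP NNP VBZ NN IN NNP NNP , DT NNP VBG NN .".split(),
    "NNP NNP , CD NNS JJ CC JJ NN IN NNP NNP NNP NNP , "
    "VBD VBN DT JJ NN IN DT JJ JJ NN .".split(),
]

unigram = Counter(t for seq in tag_sequences for t in seq)
bigram = Counter((a, b) for seq in tag_sequences for a, b in zip(seq, seq[1:]))

print(unigram["JJ"], bigram[("JJ", "NN")])       # 7 4
print(bigram[("JJ", "NN")] / unigram["JJ"])      # P(NN|JJ) = 4/7 ≈ 0.571
```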
Example: Parsing
Complete Data: a sentence and its parse
tree.
Observed data: the sentence
Unobserved data: the nonterminal
categories and their relationships that form
the parse tree.
Model: a probabilistic grammar (e.g., a PCFG) with rule probabilities.
Maximize $\prod_i P_\theta(y_i)$, or equivalently $\sum_i \log P_\theta(y_i)$, where $\theta$ denotes the model parameters.
Example
Suppose we have two coins. Coin 1 is fair. Coin 2 generates H with probability p.
The complete data is (1, H), (1, T), (2, T), (1, H), (2, T).
We only observe the result of each toss, but we don't know which coin was chosen.
What is $P_\theta(y_i)$?
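For concreteness (an added illustration; the slide leaves the coin-selection probability implicit), write $\lambda$ for the probability of picking coin 1. Then for a single toss:

$$P_\theta(y = \mathrm{H}) = \lambda \cdot \tfrac{1}{2} + (1-\lambda)\,p, \qquad P_\theta(y = \mathrm{T}) = \lambda \cdot \tfrac{1}{2} + (1-\lambda)(1-p),$$

and the observed data H, T, T, H, T has log-likelihood $2\log P_\theta(\mathrm{H}) + 3\log P_\theta(\mathrm{T})$.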
Need for Iterative Algorithm
Unfortunately, we often cannot find the best $\theta$ by solving equations.
Example: the two-coin problem above.
Observations: H, T, T, H, T.
Find a $\theta'$ such that $\sum_i \log P_{\theta'}(y_i) > \sum_i \log P_\theta(y_i)$.
Repeat the above step until reaching a local maximum.
But how do we find a $\theta'$ that increases $\sum_i \log P_\theta(y_i)$?
EM Theorem
If
$$\sum_i \sum_t P_\theta(t \mid y_i) \log P_{\theta'}(t, y_i) \;\ge\; \sum_i \sum_t P_\theta(t \mid y_i) \log P_\theta(t, y_i)$$
then
$$\sum_i \log P_{\theta'}(y_i) \;\ge\; \sum_i \log P_\theta(y_i),$$
where $\sum_t$ is a summation over all possible values of the unobserved data.
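A minimal sketch of this iterative scheme (the function and argument names are placeholders, not from the lecture):

```python
# Generic EM loop: keep replacing theta with a theta' that increases the
# observed-data log-likelihood until it stops improving (a local maximum).
# e_step, m_step, and log_likelihood are placeholders to be supplied per model.

def em(theta, observed, e_step, m_step, log_likelihood, tol=1e-6, max_iter=100):
    prev = log_likelihood(theta, observed)
    for _ in range(max_iter):
        expectations = e_step(theta, observed)   # expectations of hidden data under theta
        theta = m_step(expectations)             # theta' maximizing the expected complete-data log-likelihood
        curr = log_likelihood(theta, observed)
        if curr - prev < tol:                    # no further improvement: local maximum
            break
        prev = curr
    return theta
```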
What does EM Theorem Mean?
If we can find a $\theta'$ that maximizes
$$\sum_i \sum_t P_\theta(t \mid y_i) \log P_{\theta'}(t, y_i),$$
the same $\theta'$ will also satisfy the condition
$$\sum_i \sum_t P_\theta(t \mid y_i) \log P_{\theta'}(t, y_i) \;\ge\; \sum_i \sum_t P_\theta(t \mid y_i) \log P_\theta(t, y_i),$$
which is needed in the EM algorithm.
We can maximize the former by taking its partial derivatives w.r.t. the parameters in $\theta'$.
EM Theorem: why?
Why is optimizing
$$\sum_i \sum_t P_\theta(t \mid y_i) \log P_{\theta'}(t, y_i)$$
easier than optimizing
$$\sum_i \log P_{\theta'}(y_i)?$$
$P_{\theta'}(t, y_i)$ involves the complete data and is usually a product of a set of parameters. $P_{\theta'}(y_i)$ usually involves a summation over all hidden variables.
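In the two-coin example (again writing $\lambda$ for the assumed coin-selection probability):

$$P_\theta(t = 2,\, y = \mathrm{H}) = (1-\lambda)\,p \qquad \text{(a product of parameters)},$$
$$P_\theta(y = \mathrm{H}) = \tfrac{1}{2}\lambda + (1-\lambda)\,p \qquad \text{(a sum over the hidden coin $t$)}.$$

The log of the first expression splits into a sum of logs of individual parameters, so its derivatives are simple; the log of the second does not.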
EM Theorem: Proof
$$
\begin{aligned}
\sum_i \log P_{\theta'}(y_i) - \sum_i \log P_\theta(y_i)
&= \sum_i \underbrace{\Big(\sum_t P_\theta(t \mid y_i)\Big)}_{=\,1} \log P_{\theta'}(y_i)
 - \sum_i \Big(\sum_t P_\theta(t \mid y_i)\Big) \log P_\theta(y_i) \\
&= \sum_i \sum_t P_\theta(t \mid y_i) \log \frac{P_{\theta'}(t, y_i)}{P_{\theta'}(t \mid y_i)}
 - \sum_i \sum_t P_\theta(t \mid y_i) \log \frac{P_\theta(t, y_i)}{P_\theta(t \mid y_i)} \\
&= \sum_i \sum_t P_\theta(t \mid y_i) \log P_{\theta'}(t, y_i)
 - \sum_i \sum_t P_\theta(t \mid y_i) \log P_\theta(t, y_i)
 + \underbrace{\sum_i \sum_t P_\theta(t \mid y_i) \log \frac{P_\theta(t \mid y_i)}{P_{\theta'}(t \mid y_i)}}_{\ge\, 0 \text{ (Jensen's inequality)}} \\
&\ge \sum_i \sum_t P_\theta(t \mid y_i) \log P_{\theta'}(t, y_i)
 - \sum_i \sum_t P_\theta(t \mid y_i) \log P_\theta(t, y_i).
\end{aligned}
$$

So if $\theta'$ satisfies the condition of the theorem, the right-hand side is $\ge 0$, and therefore $\sum_i \log P_{\theta'}(y_i) \ge \sum_i \log P_\theta(y_i)$.
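As a quick sanity check (not from the lecture), the bound just derived can be verified numerically on the two-coin example; treating the coin-selection probability `lam` as a parameter is an assumption, and the function names are illustrative:

```python
# Numeric check: observed-data log-likelihood gain >= expected complete-data gain.
import math, random

def joint(theta, t, y):
    """P_theta(t, y): coin t is chosen, then produces outcome y."""
    lam, p = theta
    if t == 1:
        return lam * 0.5                           # coin 1 is fair
    return (1 - lam) * (p if y == "H" else 1 - p)

def marginal(theta, y):
    """P_theta(y): sum over the hidden coin t."""
    return joint(theta, 1, y) + joint(theta, 2, y)

def posterior(theta, t, y):
    """P_theta(t | y)."""
    return joint(theta, t, y) / marginal(theta, y)

observed = ["H", "T", "T", "H", "T"]
random.seed(0)
for _ in range(5):
    theta = (random.uniform(0.1, 0.9), random.uniform(0.1, 0.9))
    theta2 = (random.uniform(0.1, 0.9), random.uniform(0.1, 0.9))
    lhs = sum(math.log(marginal(theta2, y) / marginal(theta, y)) for y in observed)
    rhs = sum(posterior(theta, t, y)
              * math.log(joint(theta2, t, y) / joint(theta, t, y))
              for y in observed for t in (1, 2))
    assert lhs >= rhs - 1e-12
print("observed-data gain >= expected complete-data gain in every trial")
```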
Jensen's Inequality
The proof used the inequality
$$\sum_t P_\theta(t \mid y_i) \log \frac{P_\theta(t \mid y_i)}{P_{\theta'}(t \mid y_i)} \;\ge\; 0.$$
More generally, if $p$ and $q$ are probability distributions, then
$$\sum_x p(x) \log \frac{p(x)}{q(x)} \;\ge\; 0.$$
Even more generally, if $f$ is a convex function, then $E[f(x)] \ge f(E[x])$.
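A small numerical illustration of both inequalities (an added sketch, not part of the lecture):

```python
# Check that KL(p||q) >= 0 for distributions p, q, and that E[f(X)] >= f(E[X])
# for the convex function f(x) = x*x.
import math, random

random.seed(0)
w_p = [random.random() for _ in range(4)]
w_q = [random.random() for _ in range(4)]
p = [w / sum(w_p) for w in w_p]
q = [w / sum(w_q) for w in w_q]
kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))
assert kl >= 0

values = [random.uniform(-1, 1) for _ in range(1000)]
e_fx = sum(v * v for v in values) / len(values)     # E[X^2]
f_ex = (sum(values) / len(values)) ** 2             # (E[X])^2
assert e_fx >= f_ex
print("KL(p||q) =", kl, ">= 0;  E[X^2] =", e_fx, ">= (E[X])^2 =", f_ex)
```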
What is $\sum_t P_\theta(t \mid y_i) \log P_{\theta'}(t, y_i)$?
It is the expected value of $\log P_{\theta'}(t, y_i)$ according to the model $\theta$.
The EM Theorem states that we can get a better model by maximizing the sum (over all instances) of this expectation.
A Generic Set Up for EM
Assume $P_\theta(t, y)$ is a product of parameters.
Let $u_{jk}$ be a parameter, with $\sum_m u_{jm} = 1$.
Let $T_{jk}$ be the subset of hidden data such that if $t$ is in $T_{jk}$, the computation of $P_{\theta'}(t, y_i)$ involves $u_{jk}$.
Let $n(t, y_i)$ be the number of times $u_{jk}$ is used in $P_{\theta'}(t, y_i)$, i.e., $P_{\theta'}(t, y_i) = u_{jk}^{\,n(t, y_i)}\, v(t, y)$, where $v(t, y)$ is the product of all the other parameters.
Maximize $\sum_i \sum_t P_\theta(t \mid y_i) \log P_{\theta'}(t, y_i)$ subject to $\sum_m u_{jm} = 1$, using a Lagrange multiplier $\lambda_j$ for each constraint:
$$\frac{\partial}{\partial u_{jk}} \left[ \sum_i \sum_t P_\theta(t \mid y_i) \log P_{\theta'}(t, y_i) \;-\; \sum_j \lambda_j \Big( \sum_m u_{jm} - 1 \Big) \right] = 0$$
Since $P_{\theta'}(t, y_i) = u_{jk}^{\,n(t, y_i)}\, v(t, y_i)$ for $t \in T_{jk}$, this gives
$$\sum_i \sum_{t \in T_{jk}} P_\theta(t \mid y_i)\, \frac{n(t, y_i)}{u_{jk}} \;-\; \lambda_j = 0$$
$$u_{jk} = \frac{1}{\lambda_j} \sum_i \sum_{t \in T_{jk}} P_\theta(t \mid y_i)\, n(t, y_i)$$
The sum $\sum_i \sum_{t \in T_{jk}} P_\theta(t \mid y_i)\, n(t, y_i)$ is the pseudo count of the instances involving $u_{jk}$; $\lambda_j$ normalizes the pseudo counts within group $j$ so that $\sum_m u_{jm} = 1$.
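Applying the generic recipe to the two-coin example gives the following sketch; the function name `em_two_coins` and the treatment of the coin-selection probability `lam` as a parameter are assumptions of this illustration, not part of the slides:

```python
# EM for the two-coin example, following the generic recipe: the E-step computes
# pseudo counts sum_i P_theta(t|y_i) * n(t, y_i), the M-step normalizes them
# within each parameter group.

observed = ["H", "T", "T", "H", "T"]

def em_two_coins(observed, lam=0.6, p=0.3, iterations=20):
    for _ in range(iterations):
        # E-step: posterior probability of each hidden coin for each toss.
        count_coin1 = count_coin2 = count_coin2_heads = 0.0
        for y in observed:
            p1 = lam * 0.5                               # P(t=1, y): coin 1 is fair
            p2 = (1 - lam) * (p if y == "H" else 1 - p)  # P(t=2, y)
            post2 = p2 / (p1 + p2)                       # P(t=2 | y)
            count_coin1 += 1 - post2
            count_coin2 += post2
            if y == "H":
                count_coin2_heads += post2
        # M-step: each parameter = pseudo count / group total (u_jk = count / lambda_j).
        lam = count_coin1 / len(observed)
        p = count_coin2_heads / count_coin2
    return lam, p

print(em_two_coins(observed))
```

Each iteration computes the pseudo counts $\sum_i \sum_{t \in T_{jk}} P_\theta(t \mid y_i)\, n(t, y_i)$ and renormalizes them within each parameter group, which is exactly the update $u_{jk} = \frac{1}{\lambda_j} \sum_i \sum_{t \in T_{jk}} P_\theta(t \mid y_i)\, n(t, y_i)$ derived above.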
Summary
EM Theorem
Intuition
Proof
Generic Set-up