
Expectation Maximization

Dekang Lin
Department of Computing Science
University of Alberta

Objectives
Expectation Maximization (EM) is perhaps the most often used, and most often only half understood, algorithm for unsupervised learning.

It is very intuitive.

Many people rely on their intuition to apply the algorithm in different problem domains.

I will present a proof of the EM Theorem that explains why the algorithm works.

Hopefully this will help with applying EM when the intuition is not obvious.

Model Building with Partial Observations

Our goal is to build a probabilistic model.

A model is defined by a set of parameters.

The model parameters can be estimated from a set of training examples: x_1, x_2, ..., x_n.

The x_i's are identically and independently distributed (iid).

Unfortunately, we only get to observe part of each training example: x_i = (t_i, y_i) and we can only observe y_i.

How do we build the model?

Example: POS Tagging
Complete data: A sentence (a sequence of
words) and a corresponding sequence of
POS tags.
Observed data: the sentence
Unobserved data: the sequence of tags
Model: an HMM with transition/emission
probability tables.

Training with Tagged Corpus
Pierre NNP Vinken NNP , , 61 CD years NNS
old JJ , , will MD join VB the DT board NN
as IN a DT nonexecutive JJ director NN Nov.
NNP 29 CD . .
Mr. NNP Vinken NNP is VBZ chairman NN of IN
Elsevier NNP N.V. NNP , , the DT Dutch NNP
publishing VBG group NN . .
Rudolph NNP Agnew NNP , , 55 CD years NNS
old JJ and CC former JJ chairman NN of IN
Consolidated NNP Gold NNP Fields NNP PLC NNP
, , was VBD named VBN a DT nonexecutive JJ
director NN of IN this DT British JJ
industrial JJ conglomerate NN . .
c(JJ)=7 c(JJ, NN)=4, P(NN|JJ)=4/7
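
Not part of the slides: a minimal sketch of how these relative-frequency counts could be computed from a tagged corpus. The "word TAG word TAG ..." token layout is assumed to match the corpus shown above.

from collections import Counter

# Each sentence is a flat "word TAG word TAG ..." string, as in the corpus above.
corpus = [
    "Pierre NNP Vinken NNP , , 61 CD years NNS old JJ , , will MD join VB "
    "the DT board NN as IN a DT nonexecutive JJ director NN Nov. NNP 29 CD . .",
]

unigram, bigram = Counter(), Counter()
for sentence in corpus:
    tags = sentence.split()[1::2]          # every second token is a POS tag
    unigram.update(tags)
    bigram.update(zip(tags, tags[1:]))

# MLE of the transition probability: P(NN | JJ) = c(JJ, NN) / c(JJ)
print(bigram[("JJ", "NN")] / unigram["JJ"])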

Example: Parsing
Complete data: a sentence and its parse tree.

Observed data: the sentence.

Unobserved data: the nonterminal categories and their relationships that form the parse tree.

Model: a PCFG, or anything that allows one to compute the probability of parse trees.

Example: Semantic Labeling
Complete data: (context, cluster, word)

Observed data: (context, word)

Unobserved data: cluster

Model: P(context, cluster, word) = P(context) P(cluster|context) P(word|cluster)
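
As an aside (not on the slides), this factored model makes the role of the hidden variable easy to see: the probability of an observed (context, word) pair must sum over clusters. A minimal sketch with made-up parameter tables:

# Hypothetical parameter tables for the factored model
# P(context, cluster, word) = P(context) P(cluster|context) P(word|cluster)
p_context = {"eat _": 0.6, "drink _": 0.4}
p_cluster_given_context = {"eat _": {"food": 0.9, "drink": 0.1},
                           "drink _": {"food": 0.2, "drink": 0.8}}
p_word_given_cluster = {"food": {"apple": 0.7, "tea": 0.3},
                        "drink": {"apple": 0.1, "tea": 0.9}}

def p_observed(context, word):
    # P(context, word) = sum over the hidden cluster of the complete-data probability
    return sum(p_context[context]
               * p_cluster_given_context[context][c]
               * p_word_given_cluster[c][word]
               for c in p_word_given_cluster)

print(p_observed("eat _", "apple"))   # marginal probability of an observed pair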

What is the best Model?
There are many possible models.

Many possible ways to set the model parameters.

We obviously want the best model.

Which model is the best?

The model that assigns the highest probability to the observations is the best.

Maximize ∏_i P_θ(y_i), or equivalently Σ_i log P_θ(y_i).

This is known as maximum likelihood estimation (MLE).

What about maximizing the probability of the hidden data?

MLE Example
A coin with P(H)=p, P(T)=q.

We observed m Hs and n Ts.

What are p and q according to MLE?

Maximize Σ_i log P_θ(y_i) = log(p^m q^n) = m log p + n log q

Under the constraint: p + q = 1.

Lagrange method:

Define g(p,q) = m log p + n log q + λ(p + q − 1)

Solve the equations:

∂g(p,q)/∂p = 0,   ∂g(p,q)/∂q = 0,   p + q = 1
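
The slide leaves the solution implicit; solving these equations (my addition, not on the slide) gives the familiar relative-frequency estimate:

∂g/∂p = m/p + λ = 0 and ∂g/∂q = n/q + λ = 0, so p = −m/λ and q = −n/λ.
Substituting into p + q = 1 gives λ = −(m+n), hence p = m/(m+n) and q = n/(m+n).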

Example
Suppose we have two coins. Coin 1 is fair. Coin 2 generates H with probability p.

They each have probability ½ of being chosen and tossed.

The complete data is (1, H), (1, T), (2, T), (1, H), (2, T).

We only know the result of each toss, but we don't know which coin was chosen.

The observed data is H, T, T, H, T.

Problem:

Suppose the observations include m Hs and n Ts.

How do we estimate p to maximize Σ_i log P_θ(y_i)?
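
Under these assumptions (equal ½ choice probability, coin 1 fair), the observed-data likelihood can be written down and, in this small case, even maximized directly. A minimal sketch, not from the slides, using a simple grid search:

import math

def log_likelihood(p, m, n):
    # P(H) = 1/2 * 1/2 + 1/2 * p,  P(T) = 1/2 * 1/2 + 1/2 * (1 - p)
    p_h = 0.25 + 0.5 * p
    p_t = 0.25 + 0.5 * (1.0 - p)
    return m * math.log(p_h) + n * math.log(p_t)

m, n = 2, 3            # observed data H, T, T, H, T
grid = [i / 1000 for i in range(1, 1000)]
best_p = max(grid, key=lambda p: log_likelihood(p, m, n))
print(best_p)          # close to the analytic maximizer (3m - n) / (2(m + n)) = 0.3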

Need for Iterative Algorithm
Unfortunately, we often cannot find the best θ by solving equations.

Example:

Three coins, 0, 1, and 2, generate H with probabilities p_0, p_1, and p_2, respectively.

Experiment: toss coin 0.

If H, toss coin 1 three times.

If T, toss coin 2 three times.

Observations:

<HHH>, <TTT>, <HHH>, <TTT>, <HHH>

What is the MLE for p_0, p_1, and p_2?
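
The difficulty is that each observation's probability is a sum over the hidden coin-0 outcome, so the log no longer decomposes into a sum of logs of individual parameters. A small sketch (my own, not from the slides) of the observed-data log-likelihood for this model:

import math

def log_likelihood(p0, p1, p2, observations):
    total = 0.0
    for seq in observations:          # e.g. "HHH" or "TTT"
        h = seq.count("H")
        t = len(seq) - h
        # Marginalize over the unobserved coin-0 outcome:
        prob = p0 * p1**h * (1 - p1)**t + (1 - p0) * p2**h * (1 - p2)**t
        total += math.log(prob)       # log of a sum: no simple closed-form maximizer
    return total

obs = ["HHH", "TTT", "HHH", "TTT", "HHH"]
print(log_likelihood(0.6, 0.9, 0.1, obs))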

Overview of EM
Create an initial model, θ_0.

Arbitrarily, randomly, or with a small set of training examples.

Use the current model θ to obtain another model θ' such that

Σ_i log P_θ'(y_i) > Σ_i log P_θ(y_i)

Repeat the above step until reaching a local maximum.

Guaranteed to find a better model after each iteration.


Maximizing Likelihood
How do we find a better model θ' given a model θ?

Can we use the Lagrange method to maximize Σ_i log P_θ'(y_i)?

If this can be done, there is no need to iterate!



EM Theorem
The following EM Theorem holds.

This theorem is similar to (but is not identical to, nor does it follow from) the EM Theorem in [Jelinek 1997, p.148]; the proof is almost identical.

EM Theorem:

Σ_t denotes summation over all possible values of the unobserved data.

If  Σ_i Σ_t P_θ(t|y_i) log P_θ'(t, y_i) ≥ Σ_i Σ_t P_θ(t|y_i) log P_θ(t, y_i),
then  Σ_i log P_θ'(y_i) ≥ Σ_i log P_θ(y_i).



What does EM Theorem Mean?
If we can find a θ' that maximizes

Σ_i Σ_t P_θ(t|y_i) log P_θ'(t, y_i)

the same θ' will also satisfy the condition

Σ_i log P_θ'(y_i) ≥ Σ_i log P_θ(y_i)

which is needed in the EM algorithm.

We can maximize the former by taking its partial derivatives w.r.t. the parameters in θ'.



EM Theorem: why?
Why is optimizing

Σ_i Σ_t P_θ(t|y_i) log P_θ'(t, y_i)

easier than optimizing

Σ_i log P_θ'(y_i) ?

P_θ'(t, y_i) involves the complete data and is usually a product of a set of parameters. P_θ'(y_i) usually involves a summation over all hidden variables.


EM Theorem: Proof

Σ_i log P_θ'(y_i) − Σ_i log P_θ(y_i)

= Σ_i Σ_t P_θ(t|y_i) log P_θ'(y_i) − Σ_i Σ_t P_θ(t|y_i) log P_θ(y_i)          (since Σ_t P_θ(t|y_i) = 1)

= Σ_i Σ_t P_θ(t|y_i) log [P_θ'(t, y_i) / P_θ'(t|y_i)] − Σ_i Σ_t P_θ(t|y_i) log [P_θ(t, y_i) / P_θ(t|y_i)]

= Σ_i Σ_t P_θ(t|y_i) log P_θ'(t, y_i) − Σ_i Σ_t P_θ(t|y_i) log P_θ(t, y_i)
  + Σ_i Σ_t P_θ(t|y_i) log [P_θ(t|y_i) / P_θ'(t|y_i)]                          (≥ 0 by Jensen's Inequality)

≥ Σ_i Σ_t P_θ(t|y_i) log P_θ'(t, y_i) − Σ_i Σ_t P_θ(t|y_i) log P_θ(t, y_i)

Therefore, if Σ_i Σ_t P_θ(t|y_i) log P_θ'(t, y_i) ≥ Σ_i Σ_t P_θ(t|y_i) log P_θ(t, y_i), then Σ_i log P_θ'(y_i) ≥ Σ_i log P_θ(y_i).

Jensen's Inequality

The proof used the inequality

Σ_t P_θ(t|y_i) log [P_θ(t|y_i) / P_θ'(t|y_i)] ≥ 0

More generally, if p and q are probability distributions,

Σ_x p(x) log [p(x) / q(x)] ≥ 0

Even more generally, if f is a convex function,

E[f(x)] ≥ f(E[x])
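
A quick numeric sanity check of the inequality Σ_x p(x) log(p(x)/q(x)) ≥ 0 (my own illustration, not from the slides):

import math

def kl(p, q):
    # Kullback-Leibler divergence between two discrete distributions
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

print(kl([0.5, 0.3, 0.2], [0.2, 0.5, 0.3]))   # positive
print(kl([0.5, 0.3, 0.2], [0.5, 0.3, 0.2]))   # 0.0 when p == q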

What is Σ_t P_θ(t|y_i) log P_θ'(t, y_i) ?

It is the expected value of log P_θ'(t, y_i) according to the model θ.

The EM Theorem states that we can get a better model by maximizing the sum (over all instances) of this expectation.


A Generic Set Up for EM
Assume P_θ(t, y) is a product of a set of parameters.

Assume θ consists of M groups of parameters.

The parameters in each group sum up to 1.

Let u_jk be a parameter: Σ_m u_jm = 1.

Let T_jk be the subset of hidden data such that if t is in T_jk, the computation of P_θ(t, y_i) involves u_jk.

Let n(t, y_i) be the number of times u_jk is used in P_θ(t, y_i), i.e.,

P_θ(t, y_i) = u_jk^n(t, y_i) · v(t, y_i), where v(t, y_i) is the product of all the other parameters.

To find the θ' (i.e., the u_jk values) that maximizes Σ_i Σ_t P_θ(t|y_i) log P_θ'(t, y_i) under the constraints Σ_m u_jm = 1, introduce a Lagrange multiplier λ_j for each group:

∂/∂u_jk [ Σ_i Σ_t P_θ(t|y_i) log P_θ'(t, y_i) + Σ_j λ_j (1 − Σ_m u_jm) ] = 0

Since P_θ'(t, y_i) = u_jk^n(t, y_i) · v(t, y_i) for t in T_jk, this gives

Σ_i Σ_{t in T_jk} P_θ(t|y_i) n(t, y_i) / u_jk − λ_j = 0

u_jk = (1/λ_j) Σ_i Σ_{t in T_jk} P_θ(t|y_i) n(t, y_i)

Here Σ_i Σ_{t in T_jk} P_θ(t|y_i) n(t, y_i) is the pseudo count of instances involving u_jk, and λ_j simply renormalizes the pseudo counts within group j so that Σ_m u_jm = 1.
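
Read as an algorithm, the update says: accumulate expected ("pseudo") counts for each parameter under the old model, then renormalize within each group. A minimal sketch of that M-step, with hypothetical names, not code from the slides:

def m_step(expected_counts):
    """expected_counts[j][k] = sum over instances i and hidden t in T_jk of
    P_theta(t | y_i) * n(t, y_i), accumulated during the E-step."""
    new_params = {}
    for j, counts in expected_counts.items():
        total = sum(counts.values())            # plays the role of lambda_j
        new_params[j] = {k: c / total for k, c in counts.items()}
    return new_params

# Example: one group of two parameters with pseudo counts 3.2 and 0.8
counts = {"group_j": {"u_j1": 3.2, "u_j2": 0.8}}
print(m_step(counts))    # {'group_j': {'u_j1': 0.8, 'u_j2': 0.2}}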

Summary
EM Theorem

Intuition

Proof
Generic Set-up
