
The Expectation Maximization (EM) Algorithm

General Idea
▪ Start by devising a noisy channel
  ▪ Any model that predicts the corpus observations via some hidden structure (tags, parses, …)
▪ Initially guess the parameters of the model!
  ▪ An educated guess is best, but random can work
▪ Expectation step: Use current parameters (and observations) to reconstruct hidden structure
▪ Maximization step: Use that hidden structure (and observations) to reestimate parameters
▪ Repeat until convergence! (A minimal loop sketch follows below.)
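A minimal Python sketch of this loop, not from the slides: the E and M steps are passed in as functions because their details depend on the model, and the names run_em, e_step, and m_step are hypothetical.

    # Minimal EM loop sketch; run_em, e_step, and m_step are hypothetical names,
    # and the two step functions must be supplied by the model (HMM, PCFG, clustering, ...).
    def run_em(observations, initial_params, e_step, m_step, n_iterations=50):
        params = initial_params                       # educated or random initial guess
        for _ in range(n_iterations):
            hidden = e_step(observations, params)     # E: reconstruct hidden structure / expected counts
            params = m_step(observations, hidden)     # M: reestimate parameters from that structure
        return params                                 # in practice, stop once params stop changing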
General Idea
[Diagram: the EM cycle. An initial guess of the unknown parameters (probabilities) feeds the E step, which uses the observed structure (words, ice cream) to produce a guess of the unknown hidden structure (tags, parses, weather); the M step then reestimates the parameters from that hidden structure, and the cycle repeats.]
Grammar Reestimation
[Diagram: the reestimation cycle for parsing. E step – the current Grammar drives a PARSER over test sentences; scoring its output against correct test trees gives accuracy, but such gold trees are expensive and/or from the wrong sublanguage. Raw sentences are cheap, plentiful and appropriate, so the parser's output on them serves as training trees. M step – the LEARNER reestimates the Grammar from those training trees.]
EM by Dynamic Programming: Two Versions

▪ The Viterbi approximation
  ▪ Expectation: pick the best parse of each sentence
  ▪ Maximization: retrain on this best-parsed corpus
  ▪ Advantage: Speed!
▪ Real EM (why slower?)
  ▪ Expectation: find all parses of each sentence
  ▪ Maximization: retrain on all parses in proportion to their probability (as if we observed fractional counts – sketched below)
  ▪ Advantage: p(training corpus) guaranteed to increase
  ▪ Exponentially many parses, so don’t extract them from the chart – need some kind of clever counting
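To make the fractional counts concrete, here is a toy Python sketch (not from the slides) that enumerates one sentence's parses explicitly and contrasts the two E steps. The parses and probabilities are invented for illustration; real EM computes the same expected counts inside the chart (via inside-outside) instead of enumerating parses.

    from collections import Counter

    # Toy E step for one sentence: each parse is (rules used, probability).
    # These parses and numbers are invented for illustration only.
    parses = [
        (["S -> NP VP", "NP -> stocks", "VP -> V PRT"], 0.09),
        (["S -> AdvP S", "S -> NP VP", "AdvP -> Today", "NP -> stocks", "VP -> V PRT"], 0.01),
    ]

    # Viterbi approximation: count only the single best parse (whole counts).
    best_rules, _ = max(parses, key=lambda p: p[1])
    viterbi_counts = Counter(best_rules)

    # Real EM: count every parse, weighted by its posterior p(parse | sentence).
    total = sum(prob for _, prob in parses)
    fractional_counts = Counter()
    for rules, prob in parses:
        posterior = prob / total                 # this parse's share of one observation
        for rule in rules:
            fractional_counts[rule] += posterior

    print("Viterbi:   ", dict(viterbi_counts))
    print("Fractional:", {r: round(c, 3) for r, c in fractional_counts.items()})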
Examples of EM
▪ Finite-State case: Hidden Markov Models
  ▪ “forward-backward” or “Baum-Welch” algorithm
  ▪ Applications:
    ▪ explain ice cream in terms of underlying weather sequence (sketched below)
    ▪ explain words in terms of underlying tag sequence
    ▪ explain phoneme sequence in terms of underlying word sequence
    ▪ explain sound sequence in terms of underlying phoneme sequence
    ▪ (compose these?)
▪ Context-Free case: Probabilistic CFGs
  ▪ “inside-outside” algorithm: unsupervised grammar learning!
  ▪ Explain raw text in terms of underlying context-free parse
  ▪ In practice, the local maximum problem gets in the way
  ▪ But can improve a good starting grammar via raw text
▪ Clustering case: explain points via clusters
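Below is a minimal sketch of the forward-backward E step for a tiny ice-cream/weather HMM. The transition and emission probabilities are invented for illustration (they are not the course's numbers); the sketch only computes the posterior weather distribution for each day, which Baum-Welch would then turn into expected counts for the M step.

    # Minimal forward-backward E step for a tiny ice-cream/weather HMM.
    # All probabilities below are invented for illustration.
    states = ["Hot", "Cold"]
    start = {"Hot": 0.5, "Cold": 0.5}                         # p(first day's weather)
    trans = {"Hot": {"Hot": 0.8, "Cold": 0.2},                # p(tomorrow | today)
             "Cold": {"Hot": 0.2, "Cold": 0.8}}
    emit = {"Hot": {1: 0.1, 2: 0.2, 3: 0.7},                  # p(# ice creams | weather)
            "Cold": {1: 0.7, 2: 0.2, 3: 0.1}}
    obs = [2, 3, 3, 1, 1]                                     # ice creams eaten each day

    # Forward pass: alpha[t][s] = p(obs[0..t], state_t = s)
    alpha = [{s: start[s] * emit[s][obs[0]] for s in states}]
    for t in range(1, len(obs)):
        alpha.append({s: emit[s][obs[t]] *
                         sum(alpha[t - 1][r] * trans[r][s] for r in states)
                      for s in states})

    # Backward pass: beta[t][s] = p(obs[t+1..] | state_t = s)
    beta = [{} for _ in obs]
    beta[-1] = {s: 1.0 for s in states}
    for t in range(len(obs) - 2, -1, -1):
        beta[t] = {s: sum(trans[s][r] * emit[r][obs[t + 1]] * beta[t + 1][r]
                          for r in states)
                   for s in states}

    # E step output: posterior weather distribution for each day, p(state_t | obs)
    evidence = sum(alpha[-1][s] for s in states)              # p(obs)
    for t in range(len(obs)):
        posterior = {s: alpha[t][s] * beta[t][s] / evidence for s in states}
        print("day", t + 1, {s: round(p, 3) for s, p in posterior.items()})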
Our old friend PCFG

[Parse tree for “time flies like an arrow”: S → NP VP; NP → time; VP → V PP; V → flies; PP → P NP; P → like; NP → Det N; Det → an; N → arrow]

p(time flies like an arrow | S) = p(S → NP VP | S) * p(NP → time | NP)
                                  * p(VP → V PP | VP)
                                  * p(V → flies | V) * …
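A short sketch of how that product is computed over an explicit tree. The rule probabilities below are invented for illustration; only the multiplication over the tree's rules comes from the slide.

    import math

    # Invented rule probabilities p(rule | left-hand side), for illustration only.
    rule_prob = {
        ("S", ("NP", "VP")): 0.6,   ("NP", ("time",)): 0.02,
        ("VP", ("V", "PP")): 0.3,   ("V", ("flies",)): 0.01,
        ("PP", ("P", "NP")): 0.9,   ("P", ("like",)): 0.2,
        ("NP", ("Det", "N")): 0.5,  ("Det", ("an",)): 0.3,
        ("N", ("arrow",)): 0.005,
    }

    # The parse tree above, as nested (label, children) pairs; words have no children.
    tree = ("S", (
        ("NP", (("time", ()),)),
        ("VP", (
            ("V", (("flies", ()),)),
            ("PP", (
                ("P", (("like", ()),)),
                ("NP", (("Det", (("an", ()),)), ("N", (("arrow", ()),)))),
            )),
        )),
    ))

    def tree_prob(node):
        """p(tree | root): multiply the probability of every rule used in the tree."""
        label, children = node
        if not children:                                   # a word: no rule below it
            return 1.0
        rhs = tuple(child[0] for child in children)
        return rule_prob[(label, rhs)] * math.prod(tree_prob(c) for c in children)

    print(tree_prob(tree))   # 0.6 * 0.02 * 0.3 * 0.01 * 0.9 * 0.2 * 0.5 * 0.3 * 0.005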
Viterbi reestimation for parsing

▪ Start with a “pretty good” grammar
  ▪ E.g., it was trained on supervised data (a treebank) that is small, imperfectly annotated, or has sentences in a different style from what you want to parse.
▪ Parse a corpus of unparsed sentences:
  ▪ e.g., the sentence “Today stocks were up” occurs 12 times in the corpus, and its best parse is
    [S [AdvP Today] [S [NP stocks] [VP [V were] [PRT up]]]]
▪ Reestimate:
  ▪ Collect counts: …; c(S → NP VP) += 12; c(S) += 2*12 (the parse contains two S nodes); …
  ▪ Divide: p(S → NP VP | S) = c(S → NP VP) / c(S)
  ▪ May be wise to smooth (a count-and-divide sketch follows below)
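A minimal Python sketch of this count-and-divide step, hard-coding the slide's example parse and its 12 copies; the rule-string format is just for illustration.

    from collections import Counter

    # Count-and-divide M step for the Viterbi case, using the slide's example:
    # the best parse of “Today stocks were up”, which occurs 12 times in the corpus.
    best_parse_rules = ["S -> AdvP S", "S -> NP VP", "AdvP -> Today",
                        "NP -> stocks", "VP -> V PRT", "V -> were", "PRT -> up"]
    copies = 12

    rule_count, lhs_count = Counter(), Counter()
    for rule in best_parse_rules:
        rule_count[rule] += copies                      # c(S -> NP VP) += 12, etc.
        lhs_count[rule.split(" -> ")[0]] += copies      # c(S) += 2*12 (two S nodes)

    # Divide (unsmoothed relative frequencies); in practice it may be wise to smooth.
    prob = {rule: count / lhs_count[rule.split(" -> ")[0]]
            for rule, count in rule_count.items()}
    print(prob["S -> NP VP"])                           # 12 / 24 = 0.5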
True EM for parsing

▪ Similar, but now we consider all parses of each sentence
▪ Parse our corpus of unparsed sentences:
  ▪ “Today stocks were up” occurs 12 times; its parses receive fractional counts in proportion to their probability:
    ▪ 10.8 copies of [S [AdvP Today] [S [NP stocks] [VP [V were] [PRT up]]]]
    ▪ 1.2 copies of [S [NP [NP Today] [NP stocks]] [VP [V were] [PRT up]]]
▪ Collect counts fractionally (see the arithmetic sketch below):
  ▪ …; c(S → NP VP) += 10.8; c(S) += 2*10.8; …
  ▪ …; c(S → NP VP) += 1.2; c(S) += 1*1.2; …
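A small sketch tying these numbers together: the two parses above contribute weighted counts of every rule they use, and the M step then divides as before. (10.8 + 1.2 = 12, i.e. posteriors of 0.9 and 0.1 per copy of the sentence.)

    from collections import Counter

    # Fractional counts for the 12 copies of “Today stocks were up”, split 10.8 / 1.2
    # between the two parses shown above (rule lists written out by hand).
    weighted_parses = [
        (10.8, ["S -> AdvP S", "S -> NP VP", "AdvP -> Today", "NP -> stocks",
                "VP -> V PRT", "V -> were", "PRT -> up"]),
        (1.2,  ["S -> NP VP", "NP -> NP NP", "NP -> Today", "NP -> stocks",
                "VP -> V PRT", "V -> were", "PRT -> up"]),
    ]

    rule_count, lhs_count = Counter(), Counter()
    for weight, rules in weighted_parses:
        for rule in rules:
            rule_count[rule] += weight                    # c(S -> NP VP) += 10.8, then += 1.2
            lhs_count[rule.split(" -> ")[0]] += weight    # c(S) += 2*10.8 + 1*1.2

    print(round(rule_count["S -> NP VP"], 3),             # 12.0
          round(lhs_count["S"], 3))                       # 22.8
    print(round(rule_count["S -> NP VP"] / lhs_count["S"], 3))   # reestimated p(S -> NP VP | S)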


Why do we want this info?

▪ Grammar reestimation by the EM method
  ▪ E step collects those expected counts
  ▪ M step sets the rule probabilities from them, e.g., p(S → NP VP | S) = c(S → NP VP) / c(S)
▪ Minimum Bayes Risk decoding
  ▪ Find a tree that maximizes expected reward, e.g., expected total # of correct constituents
  ▪ CKY-like dynamic programming algorithm (a simplified sketch follows below)
  ▪ The input specifies the probability of correctness for each possible constituent (e.g., a VP spanning words 1 to 5)
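Below is a simplified, unlabeled sketch of the idea: given posterior probabilities for spans (which a real system would obtain from inside-outside), a CKY-like dynamic program picks the binary bracketing whose expected number of correct constituents is largest. The span posteriors here are invented, constituent labels are ignored, and single words are treated as trivially correct spans to keep the sketch short.

    # Simplified Minimum-Bayes-Risk-style decoding sketch (unlabeled spans only).
    # post[(i, j)] = p(some constituent covers words i..j-1 | sentence); invented numbers.
    words = ["Today", "stocks", "were", "up"]
    n = len(words)
    post = {(0, 4): 1.0, (1, 4): 0.9, (2, 4): 0.95, (0, 2): 0.1,
            (0, 1): 1.0, (1, 2): 1.0, (2, 3): 1.0, (3, 4): 1.0}

    best = {}   # best[(i, j)] = (max expected # correct constituents inside span, best split)
    for length in range(1, n + 1):
        for i in range(0, n - length + 1):
            j = i + length
            if length == 1:
                best[(i, j)] = (post.get((i, j), 0.0), None)
                continue
            split, score = max(((k, best[(i, k)][0] + best[(k, j)][0])
                                for k in range(i + 1, j)), key=lambda x: x[1])
            best[(i, j)] = (post.get((i, j), 0.0) + score, split)

    def brackets(i, j):
        """Read off the chosen bracketing by following the stored split points."""
        k = best[(i, j)][1]
        return [(i, j)] if k is None else [(i, j)] + brackets(i, k) + brackets(k, j)

    print(best[(0, n)][0], brackets(0, n))   # expected reward and the chosen spans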
