Maximum Entropy Markov Models
Alan Ritter
CSE 5525
- E.g., w_1, ..., w_{i-1} = ...
- Trigram estimates:

  q(model | w_1, ..., w_{i-1}) = λ_1 q_ML(model | w_{i-2} = any, w_{i-1} = statistical)
                               + λ_2 q_ML(model | w_{i-1} = statistical)
                               + λ_3 q_ML(model)

  where λ_i ≥ 0, Σ_i λ_i = 1, and q_ML(y | x) = Count(x, y) / Count(x)
Trigram Models

  q(model | w_1, ..., w_{i-1}) = λ_1 q_ML(model | w_{i-2} = any, w_{i-1} = statistical)
                               + λ_2 q_ML(model | w_{i-1} = statistical)
                               + λ_3 q_ML(model)

- Many other "features" of the history w_1, ..., w_{i-1} may be useful, e.g.:

  q(model | w_1, ..., w_{i-1}) = λ_1 q_ML(model | w_{i-2} = any, w_{i-1} = statistical)
                               + λ_2 q_ML(model | w_{i-1} = statistical)
                               + λ_3 q_ML(model)
                               + λ_4 q_ML(model | w_{i-2} = any)
                               + λ_5 q_ML(model | w_{i-1} is an adjective)
                               + ...
                               + λ_7 q_ML(model | author = Chomsky)
                               + ...
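A minimal Python sketch of the basic three-way linear interpolation above; the count tables, weights, and function names are illustrative assumptions, not data from the slides.

# Sketch of a linearly interpolated trigram estimate.
from collections import Counter

def q_ml(count_xy, count_x):
    """Maximum-likelihood estimate: Count(x, y) / Count(x)."""
    return count_xy / count_x if count_x > 0 else 0.0

def interpolated_q(w, u, v, tri, bi, uni, total, lambdas):
    """q(w | u, v) = l1*qML(w | u, v) + l2*qML(w | v) + l3*qML(w)."""
    l1, l2, l3 = lambdas                     # each >= 0, summing to 1
    return (l1 * q_ml(tri[(u, v, w)], bi[(u, v)]) +
            l2 * q_ml(bi[(v, w)], uni[v]) +
            l3 * q_ml(uni[w], total))

# Toy counts for illustration only
tri = Counter({("any", "statistical", "model"): 2})
bi = Counter({("any", "statistical"): 3, ("statistical", "model"): 2})
uni = Counter({"statistical": 5, "model": 2})
print(interpolated_q("model", "any", "statistical",
                     tri, bi, uni, total=sum(uni.values()), lambdas=(0.5, 0.3, 0.2)))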
A Second Example: Part-of-Speech Tagging

OUTPUT:
Profits/N soared/V at/P Boeing/N Co./N ,/, easily/ADV topping/V
forecasts/N on/P Wall/N Street/N ,/, as/P their/POSS CEO/N
Alan/N Mulally/N announced/V first/ADJ quarter/N results/N ./.

N = Noun
V = Verb
P = Preposition
ADV = Adverb
ADJ = Adjective
...

- The distribution to model: p(t_i | t_1, ..., t_{i-1}, w_1, ..., w_n)
- Log-linear models
- x is a "history" w_1, w_2, ..., w_{i-1}
- y is an "outcome" w_i

Feature Vector Representations

- y is an "outcome" w_i
- Example features:
  f_1(x, y) = 1 if y = model, 0 otherwise
  f_2(x, y) = 1 if y = model and w_{i-1} = statistical, 0 otherwise
  f_3(x, y) = 1 if y = model, w_{i-2} = any, w_{i-1} = statistical, 0 otherwise
  f_4(x, y) = 1 if y = model, w_{i-2} = any, 0 otherwise
  f_5(x, y) = 1 if y = model, w_{i-1} is an adjective, 0 otherwise
  f_6(x, y) = 1 if y = model, w_{i-1} ends in "ical", 0 otherwise
  f_7(x, y) = 1 if y = model, author = Chomsky, 0 otherwise
  f_8(x, y) = 1 if y = model, "model" is not in w_1, ..., w_{i-1}, 0 otherwise
  f_9(x, y) = 1 if y = model, "grammatical" is in w_1, ..., w_{i-1}, 0 otherwise
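As a sketch, the first few features above can be written directly as indicator functions; representing the history x as the plain list of preceding words is an assumption made here purely for illustration.

# Sketch of features f1-f3 as indicator functions over (history, outcome);
# x is assumed to be the list of preceding words w_1, ..., w_{i-1}.

def f1(x, y):
    return 1 if y == "model" else 0

def f2(x, y):
    return 1 if y == "model" and len(x) >= 1 and x[-1] == "statistical" else 0

def f3(x, y):
    return 1 if (y == "model" and len(x) >= 2
                 and x[-2] == "any" and x[-1] == "statistical") else 0

x = ["suppose", "we", "have", "any", "statistical"]
print([f(x, "model") for f in (f1, f2, f3)])   # -> [1, 1, 1]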
Defining Features in Practice

- We had the following "trigram" feature:

  f_3(x, y) = 1 if y = model, w_{i-2} = any, w_{i-1} = statistical, 0 otherwise

- In practice, we would probably introduce one trigram feature for every trigram
  seen in the training data: i.e., for every trigram (u, v, w) seen in the training
  data, create a feature

  f_{N(u,v,w)}(x, y) = 1 if y = w, w_{i-2} = u, w_{i-1} = v, 0 otherwise

  where N(u, v, w) maps each trigram to a distinct feature index.
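A sketch of how such trigram features might be instantiated in practice: each trigram seen in training gets its own feature index. The dictionary-based indexing below plays the role of N(u, v, w); the concrete names are illustrative.

# Sketch: one indicator feature per trigram (u, v, w) seen in training data.

def build_feature_index(training_pairs):
    index = {}
    for x, y in training_pairs:              # x = history (previous words), y = outcome
        if len(x) >= 2:
            key = ("TRIGRAM", x[-2], x[-1], y)
            index.setdefault(key, len(index))
    return index

def trigram_features(index, x, y):
    """Return the indices of trigram features that fire for (x, y)."""
    if len(x) < 2:
        return []
    key = ("TRIGRAM", x[-2], x[-1], y)
    return [index[key]] if key in index else []

training = [(["any", "statistical"], "model"), (["the", "statistical"], "model")]
index = build_feature_index(training)
print(trigram_features(index, ["any", "statistical"], "model"))   # -> [0]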
For example:

  f_1(x, y) = 1 if current word w_i is "base" and y = Vt, 0 otherwise
  f_2(x, y) = 1 if current word w_i ends in "ing" and y = VBG, 0 otherwise

  ...
The Full Set of Features in Ratnaparkhi, 1996

- Word/tag features for all word/tag pairs, e.g.,

  f_100(x, y) = 1 if current word w_i is "base" and y = Vt, 0 otherwise
  f_101(x, y) = 1 if current word w_i ends in "ing" and y = VBG, 0 otherwise
  f_102(x, y) = 1 if current word w_i starts with "pre" and y = NN, 0 otherwise
The Full Set of Features in Ratnaparkhi, 1996

- Contextual features, e.g.,

  f_103(x, y) = 1 if ⟨t_{i-2}, t_{i-1}, y⟩ = ⟨DT, JJ, Vt⟩, 0 otherwise
  f_104(x, y) = 1 if ⟨t_{i-1}, y⟩ = ⟨JJ, Vt⟩, 0 otherwise
  f_105(x, y) = 1 if ⟨y⟩ = ⟨Vt⟩, 0 otherwise
  f_106(x, y) = 1 if previous word w_{i-1} = "the" and y = Vt, 0 otherwise
  f_107(x, y) = 1 if next word w_{i+1} = "the" and y = Vt, 0 otherwise
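A sketch of a feature extractor in this spirit, returning feature strings rather than numbered indicator functions. The exact feature names and the (t_prev2, t_prev1, words, i) interface are illustrative assumptions, not the paper's implementation.

# Sketch of Ratnaparkhi-style word/tag and contextual features for a candidate tag y.

def extract_features(t_prev2, t_prev1, words, i, y):
    w = words[i]
    feats = [
        f"WORD={w},TAG={y}",                    # word/tag feature
        f"SUFFIX3={w[-3:]},TAG={y}",            # fires e.g. for "ing" + VBG
        f"PREFIX3={w[:3]},TAG={y}",             # fires e.g. for "pre" + NN
        f"TAGS={t_prev2},{t_prev1},{y}",        # <t_{i-2}, t_{i-1}, y>
        f"TAGS={t_prev1},{y}",                  # <t_{i-1}, y>
        f"TAG={y}",                             # <y>
    ]
    if i > 0:
        feats.append(f"PREVWORD={words[i-1]},TAG={y}")
    if i + 1 < len(words):
        feats.append(f"NEXTWORD={words[i+1]},TAG={y}")
    return feats

print(extract_features("DT", "JJ", ["the", "new", "base"], 2, "Vt"))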
The Final Result

- x is a "history" w_1, w_2, ..., w_{i-1}, e.g.,

  Third, the notion "grammatical in English" cannot be identified in any
  way with the notion "high order of statistical approximation to English".
  It is fair to assume that neither sentence (1) nor (2) (nor indeed any
  part of these sentences) has ever occurred in an English discourse.
  Hence, in any statistical ...

- Each possible y gets a different score:

  p(y | x; v) = exp(v · f(x, y)) / Σ_{y' ∈ Y} exp(v · f(x, y'))
Why the name?

  log p(y | x; v) = v · f(x, y)  -  log Σ_{y' ∈ Y} exp(v · f(x, y'))
                    (linear term)   (normalization term)
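A sketch of this definition in code, representing f(x, y) sparsely as a list of active feature indices; that representation and the toy numbers are assumptions for illustration.

# Sketch of p(y | x; v) = exp(v.f(x, y)) / sum_y' exp(v.f(x, y')).
import math

def p_y_given_x(v, x, labels, feats):
    """feats(x, y) returns the list of active (binary) feature indices for (x, y)."""
    scores = {y: sum(v[k] for k in feats(x, y)) for y in labels}    # v . f(x, y)
    m = max(scores.values())                                        # subtract max for stability
    exps = {y: math.exp(s - m) for y, s in scores.items()}
    z = sum(exps.values())                                          # normalization term
    return {y: e / z for y, e in exps.items()}

# Toy usage: three features, two possible outcomes
v = [1.0, -0.5, 0.2]
feats = lambda x, y: [0, 2] if y == "model" else [1]
print(p_y_given_x(v, None, ["model", "other"], feats))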
Overview

- Log-linear models

Maximum-Likelihood Estimation

  v_ML = argmax_{v ∈ R^m} L(v)

  where

  L(v) = Σ_{i=1}^n log p(y^(i) | x^(i); v)
       = Σ_{i=1}^n v · f(x^(i), y^(i))  -  Σ_{i=1}^n log Σ_{y' ∈ Y} exp(v · f(x^(i), y'))
Calculating the Maximum-Likelihood Estimates

- Need to maximize:

  L(v) = Σ_{i=1}^n v · f(x^(i), y^(i))  -  Σ_{i=1}^n log Σ_{y' ∈ Y} exp(v · f(x^(i), y'))

- Calculating gradients:

  dL(v)/dv_k = Σ_{i=1}^n f_k(x^(i), y^(i))
               - Σ_{i=1}^n [ Σ_{y' ∈ Y} f_k(x^(i), y') exp(v · f(x^(i), y')) ] / [ Σ_{z' ∈ Y} exp(v · f(x^(i), z')) ]

             = Σ_{i=1}^n f_k(x^(i), y^(i))
               - Σ_{i=1}^n Σ_{y' ∈ Y} f_k(x^(i), y') [ exp(v · f(x^(i), y')) / Σ_{z' ∈ Y} exp(v · f(x^(i), z')) ]

             = Σ_{i=1}^n f_k(x^(i), y^(i))  -  Σ_{i=1}^n Σ_{y' ∈ Y} f_k(x^(i), y') p(y' | x^(i); v)

               (empirical counts)              (expected counts)
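A sketch of this gradient computation (empirical counts minus expected counts), again using sparse lists of active feature indices as an assumed representation.

# Sketch: dL/dv_k = sum_i f_k(x_i, y_i) - sum_i sum_y' f_k(x_i, y') p(y' | x_i; v).
import math

def probs(v, x, labels, feats):
    scores = {y: sum(v[k] for k in feats(x, y)) for y in labels}
    m = max(scores.values())
    exps = {y: math.exp(s - m) for y, s in scores.items()}
    z = sum(exps.values())
    return {y: e / z for y, e in exps.items()}

def gradient(v, data, labels, feats):
    grad = [0.0] * len(v)
    for x, y in data:
        for k in feats(x, y):                 # empirical feature counts
            grad[k] += 1.0
        p = probs(v, x, labels, feats)
        for y2 in labels:                     # expected counts under the model
            for k in feats(x, y2):
                grad[k] -= p[y2]
    return grad

feats = lambda x, y: [0, 2] if y == "model" else [1]
data = [("x1", "model"), ("x2", "model")]
print(gradient([0.0, 0.0, 0.0], data, ["model", "other"], feats))   # -> [1.0, -1.0, 1.0]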
Gradient Ascent Methods

  dL(v)/dv = Σ_{i=1}^n f(x^(i), y^(i))  -  Σ_{i=1}^n Σ_{y' ∈ Y} f(x^(i), y') p(y' | x^(i); v)

Initialization: v = 0

Iterate until convergence:
- Calculate Δ = dL(v)/dv
- Calculate β* = argmax_β L(v + βΔ)  (line search)
- Set v ← v + β*Δ
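A sketch of this loop, with a fixed step size standing in for the exact line search; the step size, iteration cap, and toy objective are illustrative assumptions.

# Sketch of gradient ascent: v = 0, then repeatedly step in the direction of dL/dv.

def gradient_ascent(grad_fn, num_params, step=0.1, iters=1000, tol=1e-6):
    v = [0.0] * num_params                       # Initialization: v = 0
    for _ in range(iters):                       # Iterate until convergence
        delta = grad_fn(v)                       # Delta = dL(v)/dv
        v = [vk + step * dk for vk, dk in zip(v, delta)]
        if max(abs(d) for d in delta) < tol:     # gradient ~ 0: stop
            break
    return v

# Toy usage: maximize L(v) = -(v_0 - 1)^2 - (v_1 + 2)^2, whose maximum is at (1, -2)
toy_grad = lambda v: [-2.0 * (v[0] - 1.0), -2.0 * (v[1] + 2.0)]
print(gradient_ascent(toy_grad, num_params=2))   # -> approximately [1.0, -2.0]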
Conjugate Gradient Methods

- Log-linear models
- Calculating gradients (with a regularization term added to L(v), the gradient
  picks up an extra -λ v_k):

  dL(v)/dv_k = Σ_{i=1}^n f_k(x^(i), y^(i))  -  Σ_{i=1}^n Σ_{y' ∈ Y} f_k(x^(i), y') p(y' | x^(i); v)  -  λ v_k

               (empirical counts)              (expected counts)
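In practice the objective and its gradient are usually handed to an off-the-shelf optimizer such as conjugate gradient or L-BFGS. A sketch using scipy's conjugate-gradient routine, with a toy quadratic standing in for the negated (regularized) log-likelihood.

# Sketch: minimize -L(v) with a conjugate-gradient optimizer.
import numpy as np
from scipy.optimize import minimize

def neg_L(v):
    # Stand-in for the negated log-likelihood (plus any regularizer).
    return 0.5 * np.sum(v ** 2) - v[0]

def neg_grad(v):
    # Stand-in for the negated gradient: expected counts - empirical counts (+ lambda * v).
    g = v.copy()
    g[0] -= 1.0
    return g

result = minimize(neg_L, x0=np.zeros(3), jac=neg_grad, method="CG")
print(result.x)   # -> approximately [1, 0, 0]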
Part-of-Speech Tagging

OUTPUT:
Profits/N soared/V at/P Boeing/N Co./N ,/, easily/ADV topping/V
forecasts/N on/P Wall/N Street/N ,/, as/P their/POSS CEO/N
Alan/N Mulally/N announced/V first/ADJ quarter/N results/N ./.

N = Noun
V = Verb
P = Preposition
ADV = Adverb
ADJ = Adjective
...
Named Entity Recognition
OUTPUT:
Profits/NA soared/NA at/NA Boeing/SC Co./CC ,/NA easily/NA
topping/NA forecasts/NA on/NA Wall/SL Street/CL ,/NA as/NA
their/NA CEO/NA Alan/SP Mulally/CP announced/NA first/NA
quarter/NA results/NA ./NA
NA = No entity
SC = Start Company
CC = Continue Company
SL = Start Location
CL = Continue Location
SP = Start Person
CP = Continue Person
Our Goal
Training set:
1 Pierre/NNP Vinken/NNP ,/, 61/CD years/NNS old/JJ ,/, will/MD
join/VB the/DT board/NN as/IN a/DT nonexecutive/JJ director/NN
Nov./NNP 29/CD ./.
2 Mr./NNP Vinken/NNP is/VBZ chairman/NN of/IN Elsevier/NNP
N.V./NNP ,/, the/DT Dutch/NNP publishing/VBG group/NN ./.
3 Rudolph/NNP Agnew/NNP ,/, 55/CD years/NNS old/JJ and/CC
chairman/NN of/IN Consolidated/NNP Gold/NNP Fields/NNP PLC/NNP
,/, was/VBD named/VBN a/DT nonexecutive/JJ director/NN of/IN
this/DT British/JJ industrial/JJ conglomerate/NN ./.
...
38,219 It/PRP is/VBZ also/RB pulling/VBG 20/CD people/NNS out/IN
of/IN Puerto/NNP Rico/NNP ,/, who/WP were/VBD helping/VBG
Huricane/NNP Hugo/NNP victims/NNS ,/, and/CC sending/VBG
them/PRP to/TO San/NNP Francisco/NNP instead/RB ./.
- From the training set, induce a function/algorithm that maps new sentences to
  their tag sequences.
Overview
- Log-linear taggers
Log-Linear Models for Tagging

- We have an input sentence w_[1:n] = w_1, w_2, ..., w_n
  (w_i is the i'th word in the sentence)
- We want to model the conditional distribution over tag sequences t_[1:n] = t_1, t_2, ..., t_n:

  p(t_[1:n] | w_[1:n]) = Π_{j=1}^n p(t_j | w_1 ... w_n, t_1 ... t_{j-1})      (chain rule)
How to model p(t_[1:n] | w_[1:n])?

A Trigram Log-Linear Tagger:

  p(t_[1:n] | w_[1:n]) = Π_{j=1}^n p(t_j | w_1 ... w_n, t_1 ... t_{j-1})        (chain rule)
                       = Π_{j=1}^n p(t_j | w_1, ..., w_n, t_{j-2}, t_{j-1})     (independence assumptions)

- We take t_0 = t_{-1} = *
- Independence assumption: each tag depends only on the previous two tags

  p(t_j | w_1, ..., w_n, t_1, ..., t_{j-1}) = p(t_j | w_1, ..., w_n, t_{j-2}, t_{j-1})
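A sketch of this decomposition: the probability of a full tag sequence is the product of local conditionals, each conditioning only on the previous two tags. The `local_model` callable is an assumed stand-in for a trained log-linear model.

# Sketch: p(t_[1:n] | w_[1:n]) = product over j of p(t_j | w_1..w_n, t_{j-2}, t_{j-1}),
# with t_0 = t_{-1} = '*'.

def sequence_prob(words, tags, local_model):
    padded = ["*", "*"] + list(tags)             # t_{-1}, t_0 padding
    prob = 1.0
    for j, t in enumerate(tags):
        t_prev2, t_prev1 = padded[j], padded[j + 1]
        prob *= local_model(t, words, t_prev2, t_prev1, j)
    return prob

# Toy usage with a uniform local model over three tags
uniform = lambda t, words, t2, t1, j: 1.0 / 3.0
print(sequence_prob(["the", "dog", "barks"], ["DT", "NN", "VBZ"], uniform))  # (1/3)^3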
An Example

- t_{i-2}, t_{i-1} = DT, JJ
- w_[1:n] = ⟨Hispaniola, quickly, became, ..., Hemisphere, .⟩
- i = 6
Recap: Feature Vector Representations in Log-Linear Models

- A feature is a function f : X × Y → R
  (often binary features or indicator functions f : X × Y → {0, 1}).

For example:

  f_1(h, t) = 1 if current word w_i is "base" and t = Vt, 0 otherwise
  f_2(h, t) = 1 if current word w_i ends in "ing" and t = VBG, 0 otherwise

  ...
The Full Set of Features in Ratnaparkhi, 1996

- Word/tag features for all word/tag pairs, e.g.,

  f_100(h, t) = 1 if current word w_i is "base" and t = Vt, 0 otherwise
  f_101(h, t) = 1 if current word w_i ends in "ing" and t = VBG, 0 otherwise
  f_102(h, t) = 1 if current word w_i starts with "pre" and t = NN, 0 otherwise
The Full Set of Features in Ratnaparkhi, 1996

- Contextual features, e.g.,

  f_103(h, t) = 1 if ⟨t_{i-2}, t_{i-1}, t⟩ = ⟨DT, JJ, Vt⟩, 0 otherwise
  f_104(h, t) = 1 if ⟨t_{i-1}, t⟩ = ⟨JJ, Vt⟩, 0 otherwise
  f_105(h, t) = 1 if ⟨t⟩ = ⟨Vt⟩, 0 otherwise
  f_106(h, t) = 1 if previous word w_{i-1} = "the" and t = Vt, 0 otherwise
  f_107(h, t) = 1 if next word w_{i+1} = "the" and t = Vt, 0 otherwise
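A sketch of f(h, t) as a sparse vector (a dict of indicator features keyed by name) together with the inner product v · f(h, t). The history layout ⟨t_{i-2}, t_{i-1}, w_[1:n], i⟩ follows the slides; the concrete feature names and zero-based indexing are illustrative assumptions.

# Sketch: sparse feature vector f(h, t) and its inner product with a weight vector v.

def feature_vector(h, t):
    t_prev2, t_prev1, words, i = h               # history h = <t_{i-2}, t_{i-1}, w_[1:n], i>
    w = words[i]
    f = {
        ("WORD+TAG", w, t): 1,
        ("TAG-TRIGRAM", t_prev2, t_prev1, t): 1,
        ("TAG-BIGRAM", t_prev1, t): 1,
    }
    if i > 0:
        f[("PREV-WORD+TAG", words[i - 1], t)] = 1
    if i + 1 < len(words):
        f[("NEXT-WORD+TAG", words[i + 1], t)] = 1
    return f

def score(v, f):
    """v . f(h, t) for a sparse weight dict v and sparse feature dict f."""
    return sum(v.get(k, 0.0) * val for k, val in f.items())

h = ("DT", "JJ", ["Hispaniola", "quickly", "became", "a", "new", "base"], 5)
f = feature_vector(h, "NN")
print(score({("TAG-BIGRAM", "JJ", "NN"): 1.5}, f))   # -> 1.5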
Log-Linear Models

- We have some input domain X, and a finite label set Y. Aim is to provide a
  conditional probability p(y | x) for any x ∈ X and y ∈ Y.
- A feature is a function f : X × Y → R
  (often binary features or indicator functions f : X × Y → {0, 1}).
- Say we have m features f_k for k = 1 ... m
  ⇒ a feature vector f(x, y) ∈ R^m for any x ∈ X and y ∈ Y.
- We also have a parameter vector v ∈ R^m
- We define

  p(y | x; v) = exp(v · f(x, y)) / Σ_{y' ∈ Y} exp(v · f(x, y'))
Training the Log-Linear Model

The Viterbi Algorithm

Base case:
  π(0, *, *) = 1

Recursive definition:
For any k ∈ {1 ... n}, for any u ∈ S_{k-1} and v ∈ S_k:
  π(k, u, v) = max_{t ∈ S_{k-2}} ( π(k-1, t, u) × q(v | t, u, w_[1:n], k) )

Input: a sentence w_1 ... w_n, and a log-linear model that provides q(v | t, u, w_[1:n], i)
for any tag trigram t, u, v and any i ∈ {1 ... n}

Initialization: Set π(0, *, *) = 1.

Algorithm:
- For k = 1 ... n,
  - For u ∈ S_{k-1}, v ∈ S_k:
      π(k, u, v) = max_{t ∈ S_{k-2}} ( π(k-1, t, u) × q(v | t, u, w_[1:n], k) )
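A sketch of this dynamic program in Python, including a backpointer-based recovery of the best tag sequence; the uniform `q` in the usage example is a placeholder for a trained log-linear model.

# Sketch of the Viterbi algorithm for a trigram log-linear tagger.
# q(v_tag, t, u, words, k) stands in for q(v | t, u, w_[1:n], k); positions k run 1..n.

def viterbi(words, tag_set, q):
    n = len(words)
    S = lambda k: ["*"] if k <= 0 else tag_set            # S_k: possible tags at position k
    pi = {(0, "*", "*"): 1.0}                             # base case
    bp = {}
    for k in range(1, n + 1):
        for u in S(k - 1):
            for v in S(k):
                best_t, best_score = None, -1.0
                for t in S(k - 2):
                    s = pi[(k - 1, t, u)] * q(v, t, u, words, k)
                    if s > best_score:
                        best_t, best_score = t, s
                pi[(k, u, v)] = best_score
                bp[(k, u, v)] = best_t
    # Recover the highest-scoring tag sequence from the backpointers.
    best_u, best_v = max(((u, v) for u in S(n - 1) for v in S(n)),
                         key=lambda uv: pi[(n,) + uv])
    if n == 1:
        return [best_v]
    tags = [None] * (n + 1)
    tags[n - 1], tags[n] = best_u, best_v
    for k in range(n - 2, 0, -1):
        tags[k] = bp[(k + 2, tags[k + 1], tags[k + 2])]
    return tags[1:]

# Toy usage: a uniform local model (ties broken arbitrarily)
uniform_q = lambda v, t, u, words, k: 1.0 / 3.0
print(viterbi(["the", "dog", "barks"], ["DT", "NN", "VBZ"], uniform_q))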
p(word | tag)
  ⇒ replaced by features such as:
      "tag=question;prev=head;begins-with-number"
      "tag=question;prev=head;contains-alphanum"
      "tag=question;prev=head;contains-nonspace"
      "tag=question;prev=head;contains-number"
      "tag=question;prev=head;prev-is-blank"
FAQ Segmentation: An HMM Tagger