
Maximum Entropy Markov Models

Alan Ritter
CSE 5525

Many slides from Michael Collins


The Language Modeling Problem

• w_i is the i’th word in a document

• Estimate a distribution p(w_i | w_1, w_2, ..., w_{i-1}) given the previous
  “history” w_1, ..., w_{i-1}.

• E.g., w_1, ..., w_{i-1} =

  Third, the notion “grammatical in English” cannot be identified in any way
  with the notion “high order of statistical approximation to English”. It is fair
  to assume that neither sentence (1) nor (2) (nor indeed any part of these
  sentences) has ever occurred in an English discourse. Hence, in any statistical
Trigram Models

• Estimate a distribution p(w_i | w_1, w_2, ..., w_{i-1}) given the previous
  “history” w_1, ..., w_{i-1} =

  Third, the notion “grammatical in English” cannot be identified in any way
  with the notion “high order of statistical approximation to English”. It is fair
  to assume that neither sentence (1) nor (2) (nor indeed any part of these
  sentences) has ever occurred in an English discourse. Hence, in any statistical

• Trigram estimates:

  q(model | w_1, ..., w_{i-1}) = λ_1 q_ML(model | w_{i-2} = any, w_{i-1} = statistical)
                               + λ_2 q_ML(model | w_{i-1} = statistical)
                               + λ_3 q_ML(model)

  where λ_i ≥ 0, Σ_i λ_i = 1, and q_ML(y | x) = Count(x, y) / Count(x)
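As a concrete illustration (not part of the original slides), the interpolated estimate can be computed directly from count tables; the toy corpus and the fixed weights λ_1–λ_3 below are hypothetical, and in practice the weights would be tuned on held-out data:

from collections import defaultdict

def q_ml(count_xy, count_x):
    # Maximum-likelihood estimate q_ML(y | x) = Count(x, y) / Count(x)
    return count_xy / count_x if count_x > 0 else 0.0

def interpolated_trigram(w, u, v, unigram, bigram, trigram, total,
                         lam=(0.5, 0.3, 0.2)):
    """q(w | u, v) = lam1*q_ML(w | u, v) + lam2*q_ML(w | v) + lam3*q_ML(w)."""
    lam1, lam2, lam3 = lam
    return (lam1 * q_ml(trigram[(u, v, w)], bigram[(u, v)]) +
            lam2 * q_ml(bigram[(v, w)], unigram[v]) +
            lam3 * q_ml(unigram[w], total))

# Hypothetical counts collected from a toy corpus
unigram = defaultdict(int); bigram = defaultdict(int); trigram = defaultdict(int)
corpus = "any statistical model of any statistical process".split()
for i, w in enumerate(corpus):
    unigram[w] += 1
    if i >= 1: bigram[(corpus[i-1], w)] += 1
    if i >= 2: trigram[(corpus[i-2], corpus[i-1], w)] += 1

print(interpolated_trigram("model", "any", "statistical",
                           unigram, bigram, trigram, total=len(corpus)))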
Trigram Models

  q(model | w_1, ..., w_{i-1}) = λ_1 q_ML(model | w_{i-2} = any, w_{i-1} = statistical)
                               + λ_2 q_ML(model | w_{i-1} = statistical)
                               + λ_3 q_ML(model)

• Makes use of only trigram, bigram, and unigram estimates

• Many other “features” of w_1, ..., w_{i-1} may be useful, e.g.:

  q_ML(model | w_{i-2} = any)
  q_ML(model | w_{i-1} is an adjective)
  q_ML(model | w_{i-1} ends in “ical”)
  q_ML(model | author = Chomsky)
  q_ML(model | “model” does not occur somewhere in w_1, ..., w_{i-1})
  q_ML(model | “grammatical” occurs somewhere in w_1, ..., w_{i-1})
A Naive Approach

  q(model | w_1, ..., w_{i-1}) =
      λ_1 q_ML(model | w_{i-2} = any, w_{i-1} = statistical)
    + λ_2 q_ML(model | w_{i-1} = statistical)
    + λ_3 q_ML(model)
    + λ_4 q_ML(model | w_{i-2} = any)
    + λ_5 q_ML(model | w_{i-1} is an adjective)
    + λ_6 q_ML(model | w_{i-1} ends in “ical”)
    + λ_7 q_ML(model | author = Chomsky)
    + λ_8 q_ML(model | “model” does not occur somewhere in w_1, ..., w_{i-1})
    + λ_9 q_ML(model | “grammatical” occurs somewhere in w_1, ..., w_{i-1})

This quickly becomes very unwieldy...


A Second Example: Part-of-Speech Tagging
INPUT:
Profits soared at Boeing Co., easily topping forecasts on Wall Street,
as their CEO Alan Mulally announced first quarter results.

OUTPUT:
Profits/N soared/V at/P Boeing/N Co./N ,/, easily/ADV topping/V
forecasts/N on/P Wall/N Street/N ,/, as/P their/POSS CEO/N
Alan/N Mulally/N announced/V first/ADJ quarter/N results/N ./.
N = Noun
V = Verb
P = Preposition
Adv = Adverb
Adj = Adjective
...
A Second Example: Part-of-Speech Tagging

  Hispaniola/NNP quickly/RB became/VB an/DT important/JJ
  base/?? from which Spain expanded its empire into the rest of the
  Western Hemisphere .

• There are many possible tags in the position ??

  {NN, NNS, Vt, Vi, IN, DT, . . .}

• The task: model the distribution

  p(t_i | t_1, ..., t_{i-1}, w_1 ... w_n)

  where t_i is the i’th tag in the sequence and w_i is the i’th word

A Second Example: Part-of-Speech Tagging

  Hispaniola/NNP quickly/RB became/VB an/DT important/JJ
  base/?? from which Spain expanded its empire into the rest of the
  Western Hemisphere .

• The task: model the distribution

  p(t_i | t_1, ..., t_{i-1}, w_1 ... w_n)

  where t_i is the i’th tag in the sequence and w_i is the i’th word

• Again, many “features” of t_1, ..., t_{i-1}, w_1 ... w_n may be relevant:

  q_ML(NN | w_i = base)
  q_ML(NN | t_{i-1} is JJ)
  q_ML(NN | w_i ends in “e”)
  q_ML(NN | w_i ends in “se”)
  q_ML(NN | w_{i-1} is “important”)
  q_ML(NN | w_{i+1} is “from”)
Overview

• Log-linear models

• Parameter estimation in log-linear models

• Smoothing/regularization in log-linear models


The General Problem

• We have some input domain X

• We have a finite label set Y

• The aim is to provide a conditional probability p(y | x)
  for any x, y where x ∈ X, y ∈ Y
Language Modeling

• x is a “history” w_1, w_2, ..., w_{i-1}, e.g.,

  Third, the notion “grammatical in English” cannot be identified in any
  way with the notion “high order of statistical approximation to English”.
  It is fair to assume that neither sentence (1) nor (2) (nor indeed any
  part of these sentences) has ever occurred in an English discourse.
  Hence, in any statistical

• y is an “outcome” w_i
Feature Vector Representations

• The aim is to provide a conditional probability p(y | x) for
  “decision” y given “history” x

• A feature is a function f_k(x, y) ∈ ℝ
  (Often binary features or indicator functions
  f_k(x, y) ∈ {0, 1}.)

• Say we have m features f_k for k = 1 ... m
  ⇒ a feature vector f(x, y) ∈ ℝ^m for any x, y
Language Modeling

• x is a “history” w_1, w_2, ..., w_{i-1}, e.g.,

  Third, the notion “grammatical in English” cannot be identified in any
  way with the notion “high order of statistical approximation to English”.
  It is fair to assume that neither sentence (1) nor (2) (nor indeed any
  part of these sentences) has ever occurred in an English discourse.
  Hence, in any statistical

• y is an “outcome” w_i

• Example features:

  f_1(x, y) = 1 if y = model, 0 otherwise
  f_2(x, y) = 1 if y = model and w_{i-1} = statistical, 0 otherwise
  f_3(x, y) = 1 if y = model, w_{i-2} = any, w_{i-1} = statistical, 0 otherwise
  f_4(x, y) = 1 if y = model, w_{i-2} = any, 0 otherwise
  f_5(x, y) = 1 if y = model, w_{i-1} is an adjective, 0 otherwise
  f_6(x, y) = 1 if y = model, w_{i-1} ends in “ical”, 0 otherwise
  f_7(x, y) = 1 if y = model, author = Chomsky, 0 otherwise
  f_8(x, y) = 1 if y = model, “model” is not in w_1, ..., w_{i-1}, 0 otherwise
  f_9(x, y) = 1 if y = model, “grammatical” is in w_1, ..., w_{i-1}, 0 otherwise
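Written out as code, these are just indicator functions over the (history, outcome) pair; a small sketch (the representation of the history x as the list of preceding words is an assumption made for illustration):

def f1(x, y):
    # 1 if y = model
    return 1 if y == "model" else 0

def f2(x, y):
    # 1 if y = model and w_{i-1} = statistical
    return 1 if y == "model" and len(x) >= 1 and x[-1] == "statistical" else 0

def f6(x, y):
    # 1 if y = model and w_{i-1} ends in "ical"
    return 1 if y == "model" and len(x) >= 1 and x[-1].endswith("ical") else 0

def f9(x, y):
    # 1 if y = model and "grammatical" occurs somewhere in w_1 ... w_{i-1}
    return 1 if y == "model" and "grammatical" in x else 0

history = "in any statistical".split()                  # a toy history w_1 ... w_{i-1}
print([f(history, "model") for f in (f1, f2, f6, f9)])  # -> [1, 1, 1, 0]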
Defining Features in Practice

• We had the following “trigram” feature:

  f_3(x, y) = 1 if y = model, w_{i-2} = any, w_{i-1} = statistical, 0 otherwise

• In practice, we would probably introduce one trigram feature
  for every trigram seen in the training data: i.e., for all
  trigrams (u, v, w) seen in the training data, create a feature

  f_{N(u,v,w)}(x, y) = 1 if y = w, w_{i-2} = u, w_{i-1} = v, 0 otherwise

  where N(u, v, w) is a function that maps each (u, v, w)
  trigram to a different integer
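One simple way to realize N(u, v, w) in code is a dictionary that hands out the next unused integer the first time a trigram is seen; a minimal sketch with a toy training corpus:

feature_index = {}          # maps (u, v, w) -> a distinct integer N(u, v, w)

def N(u, v, w):
    key = (u, v, w)
    if key not in feature_index:
        feature_index[key] = len(feature_index)   # next unused integer
    return feature_index[key]

# Register one trigram feature for every trigram seen in (toy) training data
training_words = "the notion grammatical in English cannot be identified".split()
for i in range(2, len(training_words)):
    N(training_words[i - 2], training_words[i - 1], training_words[i])

print(len(feature_index))                  # number of trigram features created
print(N("the", "notion", "grammatical"))   # index of this particular feature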
The POS-Tagging Example

• Each x is a “history” of the form ⟨t_1, t_2, ..., t_{i-1}, w_1 ... w_n, i⟩

• Each y is a POS tag, such as NN, NNS, Vt, Vi, IN, DT, ...

• We have m features f_k(x, y) for k = 1 ... m

  For example:

  f_1(x, y) = 1 if current word w_i is base and y = Vt, 0 otherwise
  f_2(x, y) = 1 if current word w_i ends in ing and y = VBG, 0 otherwise
  ...
The Full Set of Features in Ratnaparkhi, 1996

• Word/tag features for all word/tag pairs, e.g.,

  f_100(x, y) = 1 if current word w_i is base and y = Vt, 0 otherwise

• Spelling features for all prefixes/suffixes of length ≤ 4, e.g.,

  f_101(x, y) = 1 if current word w_i ends in ing and y = VBG, 0 otherwise

  f_102(x, y) = 1 if current word w_i starts with pre and y = NN, 0 otherwise
The Full Set of Features in Ratnaparkhi, 1996

• Contextual features, e.g.,

  f_103(x, y) = 1 if ⟨t_{i-2}, t_{i-1}, y⟩ = ⟨DT, JJ, Vt⟩, 0 otherwise

  f_104(x, y) = 1 if ⟨t_{i-1}, y⟩ = ⟨JJ, Vt⟩, 0 otherwise

  f_105(x, y) = 1 if ⟨y⟩ = ⟨Vt⟩, 0 otherwise

  f_106(x, y) = 1 if previous word w_{i-1} = the and y = Vt, 0 otherwise

  f_107(x, y) = 1 if next word w_{i+1} = the and y = Vt, 0 otherwise
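A sketch of how such feature templates might be instantiated for a single history/tag pair; the string-valued feature names and the exact template set are illustrative choices, not Ratnaparkhi's exact implementation:

def extract_features(t2, t1, words, i, y):
    """Return active (binary) features for history <t2, t1, words, i> and proposed tag y."""
    w = words[i]
    feats = [
        f"word={w};tag={y}",                       # word/tag feature
        f"trigram={t2},{t1},{y}",                  # contextual features
        f"bigram={t1},{y}",
        f"unigram={y}",
    ]
    for k in range(1, 5):                          # prefixes/suffixes of length <= 4
        feats.append(f"prefix={w[:k]};tag={y}")
        feats.append(f"suffix={w[-k:]};tag={y}")
    if i > 0:
        feats.append(f"prev_word={words[i-1]};tag={y}")
    if i + 1 < len(words):
        feats.append(f"next_word={words[i+1]};tag={y}")
    return feats

words = "Hispaniola quickly became an important base from which Spain expanded".split()
print(extract_features("DT", "JJ", words, 5, "Vt")[:6])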
The Final Result

• We can come up with practically any questions (features)
  regarding history/tag pairs.

• For a given history x ∈ X, each label in Y is mapped to a
  different feature vector

  f(⟨JJ, DT, ⟨Hispaniola, ...⟩, 6⟩, Vt) = 1001011001001100110
  f(⟨JJ, DT, ⟨Hispaniola, ...⟩, 6⟩, JJ) = 0110010101011110010
  f(⟨JJ, DT, ⟨Hispaniola, ...⟩, 6⟩, NN) = 0001111101001100100
  f(⟨JJ, DT, ⟨Hispaniola, ...⟩, 6⟩, IN) = 0001011011000000010
  ...
Parameter Vectors

• Given features f_k(x, y) for k = 1 ... m,
  also define a parameter vector v ∈ ℝ^m

• Each (x, y) pair is then mapped to a “score”

  v · f(x, y) = Σ_k v_k f_k(x, y)
Language Modeling

• x is a “history” w_1, w_2, ..., w_{i-1}, e.g.,

  Third, the notion “grammatical in English” cannot be identified in any
  way with the notion “high order of statistical approximation to English”.
  It is fair to assume that neither sentence (1) nor (2) (nor indeed any
  part of these sentences) has ever occurred in an English discourse.
  Hence, in any statistical

• Each possible y gets a different score:

  v · f(x, model) = 5.6     v · f(x, the) = 3.2
  v · f(x, is) = 1.5        v · f(x, of) = 1.3
  v · f(x, models) = 4.5    ...
Log-Linear Models

• We have some input domain X and a finite label set Y. The aim is
  to provide a conditional probability p(y | x) for any x ∈ X and
  y ∈ Y.

• A feature is a function f : X × Y → ℝ
  (Often binary features or indicator functions
  f_k : X × Y → {0, 1}.)

• Say we have m features f_k for k = 1 ... m
  ⇒ a feature vector f(x, y) ∈ ℝ^m for any x ∈ X and y ∈ Y.

• We also have a parameter vector v ∈ ℝ^m

• We define

  p(y | x; v) = e^{v · f(x, y)} / Σ_{y' ∈ Y} e^{v · f(x, y')}
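In code, this conditional distribution is just a softmax over the scores v · f(x, y'); a minimal sketch (not from the slides), assuming features are represented sparsely as a dict from feature index to value:

import math

def score(v, feats):
    # v . f(x, y), with f represented sparsely as {feature_index: value}
    return sum(v.get(k, 0.0) * val for k, val in feats.items())

def p(y, x, v, labels, feature_vector):
    """p(y | x; v) = exp(v.f(x,y)) / sum_{y'} exp(v.f(x,y'))."""
    scores = {yp: score(v, feature_vector(x, yp)) for yp in labels}
    m = max(scores.values())                       # subtract max for numerical stability
    exps = {yp: math.exp(s - m) for yp, s in scores.items()}
    Z = sum(exps.values())
    return exps[y] / Z

# toy usage with two labels and two hypothetical features
fv = lambda x, y: {0: 1.0, 1: 1.0 if y == "model" else 0.0}
print(p("model", None, {0: 0.5, 1: 2.0}, ["model", "the"], fv))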
Why the name?

  log p(y | x; v) = v · f(x, y)  −  log Σ_{y' ∈ Y} e^{v · f(x, y')}
                    (linear term)   (normalization term)
Overview

• Log-linear models

• Parameter estimation in log-linear models

• Smoothing/regularization in log-linear models


Maximum-Likelihood Estimation

• Maximum-likelihood estimates given a training sample
  (x^(i), y^(i)) for i = 1 ... n, each (x^(i), y^(i)) ∈ X × Y:

  v_ML = argmax_{v ∈ ℝ^m} L(v)

  where

  L(v) = Σ_{i=1}^{n} log p(y^(i) | x^(i); v)
       = Σ_{i=1}^{n} v · f(x^(i), y^(i)) − Σ_{i=1}^{n} log Σ_{y' ∈ Y} e^{v · f(x^(i), y')}
Calculating the Maximum-Likelihood Estimates

• Need to maximize:

  L(v) = Σ_{i=1}^{n} v · f(x^(i), y^(i)) − Σ_{i=1}^{n} log Σ_{y' ∈ Y} e^{v · f(x^(i), y')}

• Calculating gradients:

  dL(v)/dv_k = Σ_{i=1}^{n} f_k(x^(i), y^(i))
             − Σ_{i=1}^{n} [ Σ_{y' ∈ Y} f_k(x^(i), y') e^{v · f(x^(i), y')} ] / [ Σ_{z' ∈ Y} e^{v · f(x^(i), z')} ]

             = Σ_{i=1}^{n} f_k(x^(i), y^(i))
             − Σ_{i=1}^{n} Σ_{y' ∈ Y} f_k(x^(i), y') · e^{v · f(x^(i), y')} / Σ_{z' ∈ Y} e^{v · f(x^(i), z')}

             = Σ_{i=1}^{n} f_k(x^(i), y^(i))  −  Σ_{i=1}^{n} Σ_{y' ∈ Y} f_k(x^(i), y') p(y' | x^(i); v)
               (empirical counts)               (expected counts)
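This empirical-minus-expected-counts form translates directly into code; a sketch under the same sparse-dict conventions as the p(y | x; v) sketch above (the data, labels, and feature_vector arguments are placeholders):

from collections import defaultdict

def gradient(v, data, labels, feature_vector, p):
    """dL/dv_k = empirical counts of f_k minus expected counts under p(. | x; v)."""
    grad = defaultdict(float)
    for x, y in data:
        for k, val in feature_vector(x, y).items():     # empirical counts
            grad[k] += val
        for yp in labels:                               # expected counts
            prob = p(yp, x, v, labels, feature_vector)
            for k, val in feature_vector(x, yp).items():
                grad[k] -= prob * val
    return grad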
Gradient Ascent Methods

• Need to maximize L(v) where

  dL(v)/dv = Σ_{i=1}^{n} f(x^(i), y^(i)) − Σ_{i=1}^{n} Σ_{y' ∈ Y} f(x^(i), y') p(y' | x^(i); v)

  Initialization: v = 0
  Iterate until convergence:
    • Calculate Δ = dL(v)/dv
    • Calculate β* = argmax_β L(v + βΔ)   (line search)
    • Set v ← v + β*Δ
Conjugate Gradient Methods

• (Vanilla) gradient ascent can be very slow

• Conjugate gradient methods require calculation of the gradient at
  each iteration, but do a line search in a direction which is a
  function of the current gradient and the previous step taken.

• Conjugate gradient packages are widely available.
  In general, they require a function

  calc_gradient(v) → ( L(v), dL(v)/dv )

  and that’s about it!
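In practice one rarely hand-rolls the line search: off-the-shelf optimizers such as scipy's conjugate-gradient routine accept exactly this kind of function. A sketch with dense NumPy feature vectors (scipy minimizes, so the negative log-likelihood and negative gradient are returned; the tiny random problem is purely illustrative):

import numpy as np
from scipy.optimize import minimize

def neg_log_likelihood_and_grad(v, X_feats, y_idx):
    """X_feats: array of shape (n, |Y|, m) holding f(x_i, y); y_idx: gold label indices."""
    scores = X_feats @ v                                  # (n, |Y|)
    scores -= scores.max(axis=1, keepdims=True)           # numerical stability
    probs = np.exp(scores)
    probs /= probs.sum(axis=1, keepdims=True)
    n = len(y_idx)
    ll = np.log(probs[np.arange(n), y_idx]).sum()
    grad = (X_feats[np.arange(n), y_idx].sum(axis=0)      # empirical counts
            - np.einsum("ny,nym->m", probs, X_feats))     # expected counts
    return -ll, -grad

# hypothetical tiny problem: 4 examples, 3 labels, 5 features
rng = np.random.default_rng(0)
X_feats = rng.integers(0, 2, size=(4, 3, 5)).astype(float)
y_idx = np.array([0, 1, 2, 0])
result = minimize(neg_log_likelihood_and_grad, x0=np.zeros(5),
                  args=(X_feats, y_idx), jac=True, method="CG")
print(result.x)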


Overview

• Log-linear models

• Parameter estimation in log-linear models

• Smoothing/regularization in log-linear models


Smoothing in Log-Linear Models

• Say we have a feature:

  f_100(x, y) = 1 if current word w_i is base and y = Vt, 0 otherwise

• In the training data, base is seen 3 times, with Vt every time

• The maximum-likelihood solution satisfies

  Σ_i f_100(x^(i), y^(i)) = Σ_i Σ_y p(y | x^(i); v) f_100(x^(i), y)

  ⇒ p(Vt | x^(i); v) = 1 for any history x^(i) where w_i = base
  ⇒ v_100 → ∞ at the maximum-likelihood solution (most likely)
  ⇒ p(Vt | x; v) = 1 for any test data history x where w_i = base
Regularization

• Modified loss function

  L(v) = Σ_{i=1}^{n} v · f(x^(i), y^(i)) − Σ_{i=1}^{n} log Σ_{y' ∈ Y} e^{v · f(x^(i), y')} − (λ/2) Σ_{k=1}^{m} v_k²

• Calculating gradients:

  dL(v)/dv_k = Σ_{i=1}^{n} f_k(x^(i), y^(i))  −  Σ_{i=1}^{n} Σ_{y' ∈ Y} f_k(x^(i), y') p(y' | x^(i); v)  −  λ v_k
               (empirical counts)               (expected counts)

• Can run conjugate gradient methods as before

• Adds a penalty for large weights
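The penalty changes the earlier gradient sketch by a single term; an illustrative extension, assuming the gradient(...) function from the parameter-estimation sketch above, a dict-valued v, and a hypothetical regularization strength lam (λ):

def regularized_gradient(v, data, labels, feature_vector, p, lam=0.1):
    """Gradient of the L2-regularized log-likelihood: empirical - expected - lam * v_k."""
    grad = gradient(v, data, labels, feature_vector, p)   # from the earlier sketch
    for k, vk in v.items():
        grad[k] -= lam * vk                               # penalty term for large weights
    return grad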


Experiments with Regularization

• [Chen and Rosenfeld, 1998]: apply log-linear models to
  language modeling: estimate q(w_i | w_{i-2}, w_{i-1})

• Unigram, bigram, and trigram features, e.g.,

  f_1(w_{i-2}, w_{i-1}, w_i) = 1 if the trigram is (the, dog, laughs), 0 otherwise
  f_2(w_{i-2}, w_{i-1}, w_i) = 1 if the bigram is (dog, laughs), 0 otherwise
  f_3(w_{i-2}, w_{i-1}, w_i) = 1 if the unigram is (laughs), 0 otherwise

  q(w_i | w_{i-2}, w_{i-1}) = e^{f(w_{i-2}, w_{i-1}, w_i) · v} / Σ_w e^{f(w_{i-2}, w_{i-1}, w) · v}
Experiments with Gaussian Priors

• In regular (unregularized) log-linear models, if all n-gram
  features are included, then the solution is equivalent to the
  maximum-likelihood estimates!

  q(w_i | w_{i-2}, w_{i-1}) = Count(w_{i-2}, w_{i-1}, w_i) / Count(w_{i-2}, w_{i-1})

• [Chen and Rosenfeld, 1998]: with regularization, get very
  good results. Performs as well as or better than the standardly
  used “discounting methods” (see lecture 2).

• Downside: computing Σ_w e^{f(w_{i-2}, w_{i-1}, w) · v} is SLOW.
Log-Linear Models for Tagging
Part-of-Speech Tagging
INPUT:
Profits soared at Boeing Co., easily topping forecasts on Wall Street,
as their CEO Alan Mulally announced first quarter results.

OUTPUT:
Profits/N soared/V at/P Boeing/N Co./N ,/, easily/ADV topping/V
forecasts/N on/P Wall/N Street/N ,/, as/P their/POSS CEO/N
Alan/N Mulally/N announced/V first/ADJ quarter/N results/N ./.

N = Noun
V = Verb
P = Preposition
Adv = Adverb
Adj = Adjective
...
Named Entity Recognition

INPUT: Profits soared at Boeing Co., easily topping forecasts on
Wall Street, as their CEO Alan Mulally announced first quarter
results.

OUTPUT: Profits soared at [Company Boeing Co.], easily
topping forecasts on [Location Wall Street], as their CEO [Person
Alan Mulally] announced first quarter results.
Named Entity Extraction as Tagging
INPUT:
Profits soared at Boeing Co., easily topping forecasts on Wall Street,
as their CEO Alan Mulally announced first quarter results.

OUTPUT:
Profits/NA soared/NA at/NA Boeing/SC Co./CC ,/NA easily/NA
topping/NA forecasts/NA on/NA Wall/SL Street/CL ,/NA as/NA
their/NA CEO/NA Alan/SP Mulally/CP announced/NA first/NA
quarter/NA results/NA ./NA

NA = No entity
SC = Start Company
CC = Continue Company
SL = Start Location
CL = Continue Location
Our Goal
Training set:
1 Pierre/NNP Vinken/NNP ,/, 61/CD years/NNS old/JJ ,/, will/MD
join/VB the/DT board/NN as/IN a/DT nonexecutive/JJ director/NN
Nov./NNP 29/CD ./.
2 Mr./NNP Vinken/NNP is/VBZ chairman/NN of/IN Elsevier/NNP
N.V./NNP ,/, the/DT Dutch/NNP publishing/VBG group/NN ./.
3 Rudolph/NNP Agnew/NNP ,/, 55/CD years/NNS old/JJ and/CC
chairman/NN of/IN Consolidated/NNP Gold/NNP Fields/NNP PLC/NNP
,/, was/VBD named/VBN a/DT nonexecutive/JJ director/NN of/IN
this/DT British/JJ industrial/JJ conglomerate/NN ./.
...
38,219 It/PRP is/VBZ also/RB pulling/VBG 20/CD people/NNS out/IN
of/IN Puerto/NNP Rico/NNP ,/, who/WP were/VBD helping/VBG
Huricane/NNP Hugo/NNP victims/NNS ,/, and/CC sending/VBG
them/PRP to/TO San/NNP Francisco/NNP instead/RB ./.
• From the training set, induce a function/algorithm that maps new
  sentences to their tag sequences.
Overview

• Recap: The Tagging Problem

• Log-linear taggers
Log-Linear Models for Tagging

• We have an input sentence w_[1:n] = w_1, w_2, ..., w_n
  (w_i is the i’th word in the sentence)

• We have a tag sequence t_[1:n] = t_1, t_2, ..., t_n
  (t_i is the i’th tag in the sentence)

• We’ll use a log-linear model to define

  p(t_1, t_2, ..., t_n | w_1, w_2, ..., w_n)

  for any sentence w_[1:n] and tag sequence t_[1:n] of the same length.
  (Note: contrast with an HMM, which defines p(t_1 ... t_n, w_1 ... w_n))

• Then the most likely tag sequence for w_[1:n] is

  t*_[1:n] = argmax_{t_[1:n]} p(t_[1:n] | w_[1:n])
How to model p(t_[1:n] | w_[1:n])?

A Trigram Log-Linear Tagger:

  p(t_[1:n] | w_[1:n]) = Π_{j=1}^{n} p(t_j | w_1 ... w_n, t_1 ... t_{j-1})      (chain rule)

                       = Π_{j=1}^{n} p(t_j | w_1, ..., w_n, t_{j-2}, t_{j-1})   (independence assumptions)

• We take t_0 = t_{-1} = *

• Independence assumption: each tag depends only on the previous two tags:

  p(t_j | w_1, ..., w_n, t_1, ..., t_{j-1}) = p(t_j | w_1, ..., w_n, t_{j-2}, t_{j-1})
An Example

  Hispaniola/NNP quickly/RB became/VB an/DT important/JJ
  base/?? from which Spain expanded its empire into the rest of the
  Western Hemisphere .

• There are many possible tags in the position ??

  Y = {NN, NNS, Vt, Vi, IN, DT, . . .}
Representation: Histories

• A history is a 4-tuple ⟨t_{-2}, t_{-1}, w_[1:n], i⟩

• t_{-2}, t_{-1} are the previous two tags.

• w_[1:n] are the n words in the input sentence.

• i is the index of the word being tagged

• X is the set of all possible histories

  Hispaniola/NNP quickly/RB became/VB an/DT important/JJ
  base/?? from which Spain expanded its empire into the rest of the
  Western Hemisphere .

• t_{-2}, t_{-1} = DT, JJ
• w_[1:n] = ⟨Hispaniola, quickly, became, ..., Hemisphere, .⟩
• i = 6
Recap: Feature Vector Representations in Log-Linear Models

• We have some input domain X and a finite label set Y. The aim
  is to provide a conditional probability p(y | x) for any x ∈ X
  and y ∈ Y.

• A feature is a function f : X × Y → ℝ
  (Often binary features or indicator functions
  f : X × Y → {0, 1}.)

• Say we have m features f_k for k = 1 ... m
  ⇒ a feature vector f(x, y) ∈ ℝ^m for any x ∈ X and y ∈ Y.
An Example (continued)

• X is the set of all possible histories of the form ⟨t_{-2}, t_{-1}, w_[1:n], i⟩

• Y = {NN, NNS, Vt, Vi, IN, DT, . . .}

• We have m features f_k : X × Y → ℝ for k = 1 ... m

  For example:

  f_1(h, t) = 1 if current word w_i is base and t = Vt, 0 otherwise
  f_2(h, t) = 1 if current word w_i ends in ing and t = VBG, 0 otherwise
  ...

  f_1(⟨JJ, DT, ⟨Hispaniola, ...⟩, 6⟩, Vt) = 1
  f_2(⟨JJ, DT, ⟨Hispaniola, ...⟩, 6⟩, Vt) = 0
  ...
The Full Set of Features in Ratnaparkhi, 1996

• Word/tag features for all word/tag pairs, e.g.,

  f_100(h, t) = 1 if current word w_i is base and t = Vt, 0 otherwise

• Spelling features for all prefixes/suffixes of length ≤ 4, e.g.,

  f_101(h, t) = 1 if current word w_i ends in ing and t = VBG, 0 otherwise

  f_102(h, t) = 1 if current word w_i starts with pre and t = NN, 0 otherwise
The Full Set of Features in Ratnaparkhi, 1996

• Contextual features, e.g.,

  f_103(h, t) = 1 if ⟨t_{-2}, t_{-1}, t⟩ = ⟨DT, JJ, Vt⟩, 0 otherwise

  f_104(h, t) = 1 if ⟨t_{-1}, t⟩ = ⟨JJ, Vt⟩, 0 otherwise

  f_105(h, t) = 1 if ⟨t⟩ = ⟨Vt⟩, 0 otherwise

  f_106(h, t) = 1 if previous word w_{i-1} = the and t = Vt, 0 otherwise

  f_107(h, t) = 1 if next word w_{i+1} = the and t = Vt, 0 otherwise
Log-Linear Models

• We have some input domain X and a finite label set Y. The aim
  is to provide a conditional probability p(y | x) for any x ∈ X
  and y ∈ Y.

• A feature is a function f : X × Y → ℝ
  (Often binary features or indicator functions
  f : X × Y → {0, 1}.)

• Say we have m features f_k for k = 1 ... m
  ⇒ a feature vector f(x, y) ∈ ℝ^m for any x ∈ X and y ∈ Y.

• We also have a parameter vector v ∈ ℝ^m

• We define

  p(y | x; v) = e^{v · f(x, y)} / Σ_{y' ∈ Y} e^{v · f(x, y')}
Training the Log-Linear Model

• To train a log-linear model, we need a training set (x_i, y_i) for
  i = 1 ... n. Then search for

  v* = argmax_v ( Σ_i log p(y_i | x_i; v)  −  (λ/2) Σ_k v_k² )
                  (log-likelihood)            (regularizer)

  (see the last lecture on log-linear models)

• The training set is simply all history/tag pairs seen in the training data
The Viterbi Algorithm

Problem: for an input w_1 ... w_n, find

  argmax_{t_1 ... t_n} p(t_1 ... t_n | w_1 ... w_n)

We assume that p takes the form

  p(t_1 ... t_n | w_1 ... w_n) = Π_{i=1}^{n} q(t_i | t_{i-2}, t_{i-1}, w_[1:n], i)

(In our case q(t_i | t_{i-2}, t_{i-1}, w_[1:n], i) is the estimate from a
log-linear model.)
The Viterbi Algorithm

• Define n to be the length of the sentence

• Define

  r(t_1 ... t_k) = Π_{i=1}^{k} q(t_i | t_{i-2}, t_{i-1}, w_[1:n], i)

• Define a dynamic programming table

  π(k, u, v) = maximum probability of a tag sequence ending
               in tags u, v at position k

  that is,

  π(k, u, v) = max_{⟨t_1, ..., t_{k-2}⟩} r(t_1 ... t_{k-2}, u, v)
A Recursive Definition

Base case:

  π(0, *, *) = 1

Recursive definition:
For any k ∈ {1 ... n}, for any u ∈ S_{k-1} and v ∈ S_k:

  π(k, u, v) = max_{t ∈ S_{k-2}} [ π(k-1, t, u) × q(v | t, u, w_[1:n], k) ]

where S_k is the set of possible tags at position k


The Viterbi Algorithm with Backpointers

Input: a sentence w_1 ... w_n, and a log-linear model that provides q(v | t, u, w_[1:n], i)
for any tag trigram t, u, v and any i ∈ {1 ... n}

Initialization: Set π(0, *, *) = 1.

Algorithm:

• For k = 1 ... n,
  • For u ∈ S_{k-1}, v ∈ S_k,

    π(k, u, v) = max_{t ∈ S_{k-2}} [ π(k-1, t, u) × q(v | t, u, w_[1:n], k) ]

    bp(k, u, v) = argmax_{t ∈ S_{k-2}} [ π(k-1, t, u) × q(v | t, u, w_[1:n], k) ]

• Set (t_{n-1}, t_n) = argmax_{(u,v)} π(n, u, v)

• For k = (n-2) ... 1, set t_k = bp(k+2, t_{k+1}, t_{k+2})

• Return the tag sequence t_1 ... t_n
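A direct transcription of this pseudocode into Python (a sketch, not a reference implementation); q(v, t, u, words, k) is assumed to be supplied by the trained log-linear model, and the special symbol * is used for positions 0 and -1. For long sentences one would normally work with log probabilities to avoid underflow:

def viterbi(words, tagset, q):
    """Return the most likely tag sequence t_1 ... t_n under the trigram model q."""
    n = len(words)
    S = lambda k: ["*"] if k <= 0 else tagset          # possible tags at position k
    pi = {(0, "*", "*"): 1.0}
    bp = {}
    for k in range(1, n + 1):
        for u in S(k - 1):
            for v in S(k):
                best_t, best_score = None, float("-inf")
                for t in S(k - 2):
                    s = pi[(k - 1, t, u)] * q(v, t, u, words, k)
                    if s > best_score:
                        best_t, best_score = t, s
                pi[(k, u, v)] = best_score
                bp[(k, u, v)] = best_t
    # recover the highest-scoring final tag pair, then follow backpointers
    (u, v) = max(((u, v) for u in S(n - 1) for v in S(n)), key=lambda uv: pi[(n, *uv)])
    tags = [None] * (n + 1)
    tags[n - 1], tags[n] = u, v
    for k in range(n - 2, 0, -1):
        tags[k] = bp[(k + 2, tags[k + 1], tags[k + 2])]
    return tags[1:]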
FAQ Segmentation: McCallum et al.

• McCallum et al. compared HMM and log-linear taggers on a
  FAQ segmentation task

• Main point: in an HMM, modeling

  p(word | tag)

  is difficult in this domain

FAQ Segmentation: McCallum et al.

<head>X-NNTP-POSTER: NewsHound v1.33


<head>
<head>Archive name: acorn/faq/part2
<head>Frequency: monthly
<head>
<question>2.6) What configuration of serial cable should I use
<answer>
<answer> Here follows a diagram of the necessary connections
<answer>programs to work properly. They are as far as I know t
<answer>agreed upon by commercial comms software developers fo
<answer>
<answer> Pins 1, 4, and 8 must be connected together inside
<answer>is to avoid the well known serial port chip bugs. The
FAQ Segmentation: Line Features
begins-with-number
begins-with-ordinal
begins-with-punctuation
begins-with-question-word
begins-with-subject
blank
contains-alphanum
contains-bracketed-number
contains-http
contains-non-space
contains-number
contains-pipe
contains-question-mark
ends-with-question-mark
first-alpha-is-capitalized
indented-1-to-4
FAQ Segmentation: The Log-Linear Tagger
<head>X-NNTP-POSTER: NewsHound v1.33
<head>
<head>Archive name: acorn/faq/part2
<head>Frequency: monthly
<head>
<question>2.6) What configuration of serial cable should I use

Here follows a diagram of the necessary connections

⇒ “tag=question;prev=head;begins-with-number”
“tag=question;prev=head;contains-alphanum”
“tag=question;prev=head;contains-nonspace”
“tag=question;prev=head;contains-number”
“tag=question;prev=head;prev-is-blank”
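A rough sketch of how a raw line might be mapped to such feature strings; the predicates shown are a small subset of the feature list above, and the regular expressions and naming scheme are illustrative guesses rather than McCallum et al.'s exact definitions:

import re

def line_features(line, prev_line):
    """Map a raw line to binary feature names like those used in the MEMM tagger."""
    feats = []
    if re.match(r"\s*\d", line):
        feats.append("begins-with-number")
    if re.search(r"[A-Za-z0-9]", line):
        feats.append("contains-alphanum")
    if line.strip():
        feats.append("contains-nonspace")
    if re.search(r"\d", line):
        feats.append("contains-number")
    if prev_line is not None and not prev_line.strip():
        feats.append("prev-is-blank")
    return feats

line = "2.6) What configuration of serial cable should I use"
print(line_features(line, prev_line=""))
# With tag=question and prev=head, each name would be combined into strings like
# "tag=question;prev=head;begins-with-number", as on the slide.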
FAQ Segmentation: An HMM Tagger

<question>2.6) What configuration of serial cable should I use

• First solution for p(word | tag):

  p(“2.6) What configuration of serial cable should I use” | question) =
    e(2.6) | question) ×
    e(What | question) ×
    e(configuration | question) ×
    e(of | question) ×
    e(serial | question) ×
    ...

• i.e., have a language model for each tag
FAQ Segmentation: McCallum et al.

• Second solution: first map each sentence to a string of features:

  <question>2.6) What configuration of serial cable should I use
  ⇒
  <question>begins-with-number contains-alphanum contains-nonspace
  contains-number prev-is-blank

• Use a language model again:

  p(“2.6) What configuration of serial cable should I use” | question) =
    e(begins-with-number | question) ×
    e(contains-alphanum | question) ×
    e(contains-nonspace | question) ×
    e(contains-number | question) ×
    e(prev-is-blank | question)
FAQ Segmentation: Results

  Method        Precision  Recall
  ME-Stateless  0.038      0.362
  TokenHMM      0.276      0.140
  FeatureHMM    0.413      0.529
  MEMM          0.867      0.681

• Precision and recall results are for recovering segments

• ME-Stateless is a log-linear model that treats every sentence
  separately (no dependence between adjacent tags)

• TokenHMM is an HMM with the first solution we’ve just seen

• FeatureHMM is an HMM with the second solution we’ve just seen

• MEMM is a log-linear trigram tagger (MEMM stands for
  “Maximum-Entropy Markov Model”)
Summary

• Key ideas in log-linear taggers:

  • Decompose

    p(t_1 ... t_n | w_1 ... w_n) = Π_{i=1}^{n} p(t_i | t_{i-2}, t_{i-1}, w_1 ... w_n)

  • Estimate

    p(t_i | t_{i-2}, t_{i-1}, w_1 ... w_n)

    using a log-linear model

  • For a test sentence w_1 ... w_n, use the Viterbi algorithm to find

    argmax_{t_1 ... t_n} Π_{i=1}^{n} p(t_i | t_{i-2}, t_{i-1}, w_1 ... w_n)

• Key advantage over HMM taggers: flexibility in the features
  they can use
