19 Parsing
Mausam

[Figure: parse tree for the example sentence "The boy put the tortoise on the rug"]
Why Parse?
• Part of speech information
• Phrase information
• Useful relationships
The rise of annotated data:
The Penn Treebank
[Marcus et al. 1993, Computational Linguistics]
( (S
(NP-SBJ (DT The) (NN move))
(VP (VBD followed)
(NP
(NP (DT a) (NN round))
(PP (IN of)
(NP
(NP (JJ similar) (NNS increases))
(PP (IN by)
(NP (JJ other) (NNS lenders)))
(PP (IN against)
(NP (NNP Arizona) (JJ real) (NN estate) (NNS loans))))))
(, ,)
(S-ADV
(NP-SBJ (-NONE- *))
(VP (VBG reflecting)
(NP
(NP (DT a) (VBG continuing) (NN decline))
(PP-LOC (IN in)
(NP (DT that) (NN market)))))))
(. .)))
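The bracketed Treebank format above is easy to read mechanically. A minimal sketch (not part of the original slides) that parses such S-expressions into nested lists:

```python
# Sketch: a tiny reader for Penn-Treebank-style bracketed trees like the one
# above, turning "(NP (DT The) (NN move))" into nested Python lists.
import re

def read_tree(text):
    tokens = re.findall(r"\(|\)|[^\s()]+", text)
    pos = 0
    def parse():
        nonlocal pos
        assert tokens[pos] == "("
        pos += 1
        node = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                node.append(parse())    # nested constituent
            else:
                node.append(tokens[pos])
                pos += 1
        pos += 1                        # consume ")"
        return node
    return parse()

t = read_tree("(NP (DT The) (NN move))")
print(t)   # ['NP', ['DT', 'The'], ['NN', 'move']]
```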
Penn Treebank Non-terminals
Statistical parsing applications
Statistical parsers are now robust and widely used in larger NLP
applications:
• High precision question answering [Pasca and Harabagiu SIGIR 2001]
• Improving biological named entity finding [Finkel et al. JNLPBA 2004]
• Syntactically based sentence compression [Lin and Wilbur 2007]
• Extracting opinions about products [Bloom et al. NAACL 2007]
• Improved interaction in computer games [Gorniak and Roy 2005]
• Helping linguists find data [Resnik et al. BLS 2005]
• Source sentence analysis for machine translation [Xu et al. 2009]
• Relation extraction systems [Fundel et al. Bioinformatics 2006]
Example Application: Machine Translation

[Figure: a sequence of parse trees for "The boy put the tortoise on the rug", showing syntax-based reordering step by step (NP VP PP → NP PP VP, with the verb moved to clause-final position) and the leaves finally replaced by Hindi words (लड़के ने "the boy", कछुआ "tortoise", कालीन "rug", ऊपर "on", रखा "put"), illustrating source-tree reordering for English-to-Hindi translation]
Pre 1990 (“Classical”) NLP Parsing
• Goes back to Chomsky’s PhD thesis in the 1950s
• Wrote a symbolic grammar (CFG or often richer) and lexicon:

Grammar:        Lexicon:
S → NP VP       NN → interest
NP → (DT) NN    NNS → rates
NP → NN NNS     NNS → raises
NP → NNP        VBP → interest
VP → V NP       VBZ → rates
Context-Free Grammars
Context-Free Grammars in NLP
Left-Most Derivations
Properties of CFGs
A Fragment of a Noun Phrase Grammar
Extended Grammar with Prepositional Phrases
Verbs, Verb Phrases and Sentences
PPs Modifying Verb Phrases
Complementizers and SBARs
More Verbs
Coordination
Much more remains…
Attachment ambiguities
• Dislocation / gapping
• Which book should Peter buy?
• A debate arose which continued until the election.
• Binding
• Reference
• The IRS audits itself
• Control
• I want to go
• I want you to go
Context-Free Grammars in NLP
Probabilistic – or stochastic – context-free
grammars (PCFGs)
• G = (N, Σ, S, R, P)
• Σ is a set of terminal symbols
• N is a set of nonterminal symbols
• S is the start symbol (S ∈ N)
• R is a set of rules/productions of the form X → γ, with X ∈ N and γ ∈ (N ∪ Σ)*
• P is a probability function
  • P: R → [0,1]
  • for each X ∈ N: ∑_{X→γ ∈ R} P(X → γ) = 1
A Probabilistic Context-Free Grammar (PCFG)
PCFG Example
Grammar:            Lexicon:
S ⇒ NP VP 1.0       Vi ⇒ sleeps 1.0
VP ⇒ Vi 0.4         Vt ⇒ saw 1.0
VP ⇒ Vt NP 0.4      NN ⇒ man 0.7
VP ⇒ VP PP 0.2      NN ⇒ woman 0.2
NP ⇒ DT NN 0.3      NN ⇒ telescope 0.1
NP ⇒ NP PP 0.7      DT ⇒ the 1.0
PP ⇒ P NP 1.0       IN ⇒ with 0.5
                    IN ⇒ in 0.5
• Probability of a tree t with rules
  α₁ → β₁, α₂ → β₂, . . . , αₙ → βₙ
  is
  p(t) = ∏_{i=1}^{n} q(αᵢ → βᵢ)
  where q(α → β) is the probability for rule α → β.
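The product formula above can be sketched directly in code; the tuple-based tree representation here is an assumption for illustration, and the rule probabilities are the ones from the example grammar:

```python
# Minimal sketch: p(t) = prod of q(alpha_i -> beta_i) over all rules used in t.
# Trees are nested tuples (label, child, ...); a leaf is a plain word string.

q = {
    ("S", ("NP", "VP")): 1.0,
    ("VP", ("Vi",)): 0.4,
    ("VP", ("Vt", "NP")): 0.4,
    ("VP", ("VP", "PP")): 0.2,
    ("NP", ("DT", "NN")): 0.3,
    ("NP", ("NP", "PP")): 0.7,
    ("PP", ("P", "NP")): 1.0,
    ("Vi", ("sleeps",)): 1.0,
    ("Vt", ("saw",)): 1.0,
    ("NN", ("man",)): 0.7,
    ("NN", ("woman",)): 0.2,
    ("NN", ("telescope",)): 0.1,
    ("DT", ("the",)): 1.0,
    ("IN", ("with",)): 0.5,
    ("IN", ("in",)): 0.5,
}

def tree_prob(t):
    """Multiply the probabilities of all rules used in tree t."""
    label, *children = t
    rhs = tuple(c if isinstance(c, str) else c[0] for c in children)
    p = q[(label, rhs)]
    for c in children:
        if not isinstance(c, str):      # recurse into nonterminal children
            p *= tree_prob(c)
    return p

t1 = ("S", ("NP", ("DT", "the"), ("NN", "man")), ("VP", ("Vi", "sleeps")))
print(tree_prob(t1))   # 1.0 * 0.3 * 1.0 * 0.7 * 0.4 * 1.0 = 0.084 (up to float rounding)
```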
Example of a PCFG
A Probabilistic Context-Free Grammar (PCFG)
Probability of a Parse

[Figure: the example grammar above applied to two parse trees]

t₁ = parse of "The man sleeps":
  S → NP VP (1.0), NP → DT NN (0.3), DT → the (1.0), NN → man (0.7), VP → Vi (0.4), Vi → sleeps (1.0)
  p(t₁) = 1.0 × 0.3 × 1.0 × 0.7 × 0.4 × 1.0

t₂ = parse of "The man saw the woman with the telescope", with the PP attached to the VP:
  p(t₂) = 1.0 × 0.3 × 1.0 × 0.7 × 0.2 × 0.4 × 1.0 × 0.3 × 1.0 × 0.2 × 1.0 × 0.5 × 0.3 × 1.0 × 0.1
PCFGs: Learning and Inference
Model
• The probability of a tree t with n rules αᵢ → βᵢ, i = 1..n, is p(t) = ∏ᵢ q(αᵢ → βᵢ)
Learning
• Read the rules off of labeled sentences; use maximum-likelihood estimates for the
probabilities
Inference
• For input sentence s, define T(s) to be the set of trees whose yield is s
(whose leaves, read left to right, match the words in s)
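The maximum-likelihood estimate mentioned above is just rule counting: q(α → β) = Count(α → β) / Count(α). A sketch, where the toy two-tree "treebank" and the tuple tree representation are illustrative assumptions:

```python
# Sketch: learning PCFG rule probabilities by maximum likelihood from a toy
# treebank of nested-tuple trees: q(alpha -> beta) = Count(alpha -> beta) / Count(alpha).
from collections import Counter

def rules(t):
    """Yield (lhs, rhs) for every rule used in tree t."""
    label, *children = t
    yield label, tuple(c if isinstance(c, str) else c[0] for c in children)
    for c in children:
        if not isinstance(c, str):
            yield from rules(c)

def ml_estimate(treebank):
    rule_count, lhs_count = Counter(), Counter()
    for t in treebank:
        for lhs, rhs in rules(t):
            rule_count[(lhs, rhs)] += 1
            lhs_count[lhs] += 1
    return {r: c / lhs_count[r[0]] for r, c in rule_count.items()}

treebank = [
    ("S", ("NP", ("N", "people")), ("VP", ("V", "fish"))),
    ("S", ("NP", ("N", "people")), ("VP", ("V", "fish"), ("NP", ("N", "tanks")))),
]
q = ml_estimate(treebank)
print(q[("VP", ("V",))])   # 0.5: one of the two observed VP expansions is VP -> V
```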
Grammar Transforms
Chomsky Normal Form

Grammar:          Lexicon:
S → NP VP         N → people
VP → V NP         N → fish
VP → V NP PP      N → tanks
NP → NP NP        N → rods
NP → NP PP        V → people
NP → N            V → fish
NP → ε            V → tanks
PP → P NP         P → with
Chomsky Normal Form steps

Grammar:          Lexicon:
S → NP VP         N → people
S → VP            N → fish
VP → V NP         N → tanks
VP → V            N → rods
VP → V NP PP      V → people
VP → V PP         V → fish
NP → NP NP        V → tanks
NP → NP           P → with
NP → NP PP
NP → PP
NP → N
PP → P NP
PP → P
Chomsky Normal Form steps

Grammar:          Lexicon:
S → NP VP         N → people
VP → V NP         N → fish
S → V NP          N → tanks
VP → V            N → rods
S → V             V → people
VP → V NP PP      V → fish
S → V NP PP       V → tanks
VP → V PP         P → with
S → V PP
NP → NP NP
NP → NP
NP → NP PP
NP → PP
NP → N
PP → P NP
PP → P
Chomsky Normal Form steps

Grammar:          Lexicon:
S → NP VP         N → people
VP → V NP         N → fish
S → V NP          N → tanks
VP → V            N → rods
VP → V NP PP      V → people
S → V NP PP       S → people
VP → V PP         V → fish
S → V PP          S → fish
NP → NP NP        V → tanks
NP → NP           S → tanks
NP → NP PP        P → with
NP → PP
NP → N
PP → P NP
PP → P
Chomsky Normal Form steps

Grammar:          Lexicon:
S → NP VP         N → people
VP → V NP         N → fish
S → V NP          N → tanks
VP → V NP PP      N → rods
S → V NP PP       V → people
VP → V PP         S → people
S → V PP          VP → people
NP → NP NP        V → fish
NP → NP           S → fish
NP → NP PP        VP → fish
NP → PP           V → tanks
NP → N            S → tanks
PP → P NP         VP → tanks
PP → P            P → with
Chomsky Normal Form steps

Grammar:          Lexicon:
S → NP VP         NP → people
VP → V NP         NP → fish
S → V NP          NP → tanks
VP → V NP PP      NP → rods
S → V NP PP       V → people
VP → V PP         S → people
S → V PP          VP → people
NP → NP NP        V → fish
NP → NP PP        S → fish
NP → P NP         VP → fish
PP → P NP         V → tanks
                  S → tanks
                  VP → tanks
                  P → with
                  PP → with
Chomsky Normal Form steps

Grammar:          Lexicon:
S → NP VP         NP → people
VP → V NP         NP → fish
S → V NP          NP → tanks
VP → V @VP_V      NP → rods
@VP_V → NP PP     V → people
S → V @S_V        S → people
@S_V → NP PP      VP → people
VP → V PP         V → fish
S → V PP          S → fish
NP → NP NP        VP → fish
NP → NP PP        V → tanks
NP → P NP         S → tanks
PP → P NP         VP → tanks
                  P → with
                  PP → with
A phrase structure grammar

Grammar:          Lexicon:
S → NP VP         N → people
VP → V NP         N → fish
VP → V NP PP      N → tanks
NP → NP NP        N → rods
NP → NP PP        V → people
NP → N            V → fish
NP → ε            V → tanks
PP → P NP         P → with
Chomsky Normal Form steps

Grammar:          Lexicon:
S → NP VP         NP → people
VP → V NP         NP → fish
S → V NP          NP → tanks
VP → V @VP_V      NP → rods
@VP_V → NP PP     V → people
S → V @S_V        S → people
@S_V → NP PP      VP → people
VP → V PP         V → fish
S → V PP          S → fish
NP → NP NP        VP → fish
NP → NP PP        V → tanks
NP → P NP         S → tanks
PP → P NP         VP → tanks
                  P → with
                  PP → with
Chomsky Normal Form
• The rest isn’t necessary; it just makes the algorithms cleaner and
a bit quicker
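The right-binarization used in the steps above (introducing intermediate @-symbols such as @VP_V) can be sketched as follows; the `(lhs, rhs)` rule representation is an assumption for illustration:

```python
# Sketch: right-binarization with intermediate "@" symbols, as in the steps above
# (e.g. VP -> V NP PP becomes VP -> V @VP_V and @VP_V -> NP PP).

def binarize(rules):
    """rules: list of (lhs, rhs) pairs, rhs a tuple of symbols.
    Returns an equivalent rule list where every rhs has length <= 2."""
    out = []
    for lhs, rhs in rules:
        cur, rest = lhs, list(rhs)
        while len(rest) > 2:
            first, *rest = rest
            mid = f"@{lhs}_{first}"          # intermediate symbol, e.g. @VP_V
            out.append((cur, (first, mid)))
            cur = mid
        out.append((cur, tuple(rest)))
    return out

print(binarize([("VP", ("V", "NP", "PP")), ("S", ("NP", "VP"))]))
# [('VP', ('V', '@VP_V')), ('@VP_V', ('NP', 'PP')), ('S', ('NP', 'VP'))]
```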
An example: before binarization…

[Figure: a tree ROOT → S with S → NP VP and a flat VP → V NP PP (the PP containing P NP); after binarization the VP becomes VP → V @VP_V with @VP_V → NP PP]
Constituency Parsing

PCFG:
Rule          Prob
S → NP VP     θ₀
NP → NP NP    θ₁
…
N → fish      θ₄₂
N → people    θ₄₃
V → fish      θ₄₄
…

[Figure: a parse tree over the sentence "fish people fish tanks"]
Cocke-Kasami-Younger (CKY)
Constituency Parsing (Parse Triangle/Chart)

S → NP VP 0.9
S → VP 0.1
VP → V NP 0.5
VP → V 0.1
VP → V @VP_V 0.3
VP → V PP 0.1
@VP_V → NP PP 1.0
NP → NP NP 0.1
NP → NP PP 0.2
NP → N 0.7
PP → P NP 1.0

[Figure: chart cells over the words "people fish" — e.g. the "people" cell holds N 0.5, NP 0.35, V 0.1 and the "fish" cell holds N 0.2, NP 0.14, V 0.6, VP 0.06]
Extended CKY parsing
• Binarization is vital
• Without binarization, you don’t get parsing cubic in the length of the
sentence and in the number of nonterminals in the grammar
• Binarization may be an explicit transformation or implicit in how the parser
works (Earley-style dotted rules), but it’s always there.
A Recursive Parser

bestScore(X, i, j, s)
  if (j == i)
    return q(X -> s[i])
  else
    return max over k and X -> Y Z of
      q(X -> Y Z) * bestScore(Y, i, k, s) * bestScore(Z, k+1, j, s)
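With memoization over (X, i, j), the recursive parser above becomes polynomial. A Python sketch with a toy grammar; the probabilities are loosely based on the running example, with unary chains folded into the lexicon for simplicity (an illustrative assumption, not the slides' exact grammar):

```python
# Sketch of bestScore with memoization: with the memo this is CKY run top-down.
from functools import lru_cache

# Toy CNF-ish grammar: binary rules and word-level scores (unary chains pre-folded).
q_bin = {("S", "NP", "VP"): 0.9, ("VP", "V", "NP"): 0.5, ("NP", "NP", "NP"): 0.1}
q_lex = {("NP", "people"): 0.35, ("NP", "fish"): 0.14, ("NP", "tanks"): 0.14,
         ("V", "people"): 0.1, ("V", "fish"): 0.6, ("V", "tanks"): 0.3}

def parse(sent):
    @lru_cache(maxsize=None)
    def best_score(X, i, j):
        if i == j:                                 # single word: lexical rule
            return q_lex.get((X, sent[i]), 0.0)
        best = 0.0
        for (A, Y, Z), p in q_bin.items():         # try every rule and split point
            if A != X:
                continue
            for k in range(i, j):
                best = max(best, p * best_score(Y, i, k) * best_score(Z, k + 1, j))
        return best
    return best_score("S", 0, len(sent) - 1)

print(parse(("people", "fish", "tanks")))   # ≈ 0.01323 (S -> NP VP, VP -> V NP)
```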
The CKY algorithm (1960/1965)
… extended to unaries

Grammar used in the running example:
S → NP VP 0.9
S → VP 0.1
VP → V NP 0.5
VP → V 0.1
VP → V @VP_V 0.3
VP → V PP 0.1
@VP_V → NP PP 1.0
NP → NP NP 0.1
NP → NP PP 0.2
NP → N 0.7
PP → P NP 1.0
N → people 0.5
N → fish 0.2
N → tanks 0.2
N → rods 0.1
V → people 0.1
V → fish 0.6
V → tanks 0.3

Sentence: fish people fish tanks

[Figure: an empty parse triangle of cells score[i][j] over the sentence]

Lexical (base-case) step:
for i = 0; i < #(words); i++
  for A in nonterms
    if A -> words[i] in grammar
      score[i][i+1][A] = P(A -> words[i])
Unary step (run to closure after the lexical step, and again after each binary step):

// handle unaries
boolean added = true
while added
  added = false
  for A, B in nonterms
    if score[i][i+1][B] > 0 && A->B in grammar
      prob = P(A->B) * score[i][i+1][B]
      if (prob > score[i][i+1][A])
        score[i][i+1][A] = prob
        back[i][i+1][A] = B
        added = true

[Figure: diagonal cells after the lexical step — "fish": N 0.2, V 0.6; "people": N 0.5, V 0.1; "tanks": N 0.2, V 0.1]
Binary step (combine two smaller spans):

prob = score[begin][split][B] * score[split][end][C] * P(A -> B C)
if (prob > score[begin][end][A])
  score[begin][end][A] = prob
  back[begin][end][A] = new Triple(split, B, C)

[Figure: the diagonal cells after the unary step — "fish": N 0.2, V 0.6, NP 0.14, VP 0.06, S 0.006; "people": N 0.5, V 0.1, NP 0.35, VP 0.01, S 0.001; "tanks": N 0.2, V 0.1, NP 0.14, VP 0.03, S 0.003]
[Figure: filling the length-2 spans — the "fish people" cell gets NP → NP NP 0.0049, VP → V NP 0.105, S → NP VP 0.00126, and the "people fish" cell gets NP → NP NP 0.0049, VP → V NP 0.007, S → NP VP 0.0189; the process continues up the chart]
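Putting the three steps together (lexical, binary, unary closure), a compact sketch of the extended CKY recognizer over the running example's grammar; dict-of-dict chart cells are an implementation assumption:

```python
# Sketch of the CKY recognizer from the pseudocode above: lexical step,
# binary step over increasing span lengths, and unary closure after each.

lexicon = {("N", "fish"): 0.2, ("N", "people"): 0.5, ("N", "tanks"): 0.2,
           ("N", "rods"): 0.1, ("V", "people"): 0.1, ("V", "fish"): 0.6,
           ("V", "tanks"): 0.3}
unary   = {("S", "VP"): 0.1, ("VP", "V"): 0.1, ("NP", "N"): 0.7}
binary  = {("S", "NP", "VP"): 0.9, ("VP", "V", "NP"): 0.5,
           ("VP", "V", "PP"): 0.1, ("VP", "V", "@VP_V"): 0.3,
           ("@VP_V", "NP", "PP"): 1.0, ("NP", "NP", "NP"): 0.1,
           ("NP", "NP", "PP"): 0.2, ("PP", "P", "NP"): 1.0}

def close_unaries(cell):
    added = True
    while added:                           # run unary rules to closure
        added = False
        for (A, B), p in unary.items():
            prob = p * cell.get(B, 0.0)
            if prob > cell.get(A, 0.0):
                cell[A] = prob
                added = True

def cky(words):
    n = len(words)
    score = [[{} for _ in range(n + 1)] for _ in range(n + 1)]
    for i, w in enumerate(words):          # lexical step
        for (A, word), p in lexicon.items():
            if word == w:
                score[i][i + 1][A] = p
        close_unaries(score[i][i + 1])
    for span in range(2, n + 1):           # binary step, shortest spans first
        for begin in range(0, n - span + 1):
            end = begin + span
            cell = score[begin][end]
            for split in range(begin + 1, end):
                for (A, B, C), p in binary.items():
                    prob = (score[begin][split].get(B, 0.0)
                            * score[split][end].get(C, 0.0) * p)
                    if prob > cell.get(A, 0.0):
                        cell[A] = prob
            close_unaries(cell)
    return score

chart = cky(["fish", "people"])
print(chart[0][2]["S"])   # ≈ 0.0105, via the unary S -> VP over VP -> V NP
```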
Weaknesses
A Case of PP Attachment Ambiguity
A Case of Coordination Ambiguity
Structural Preferences: Close Attachment
PCFGs and Independence

[Figure: bar chart of expansion frequencies — 11%, 9%, 9%, 9%, 6%, 7%, 4%]
Horizontal Markovization

[Figure: two plots against horizontal Markov order (0, 1, 2v, 2, inf) — parsing accuracy ranges roughly 70%–74%, and the number of grammar symbols ranges roughly 0–12000]
Vertical Markovization

[Figure: two plots against vertical Markov order (1, 2v, 2, 3v, 3) — parsing accuracy ranges roughly 72%–79%, and the number of symbols roughly 5000–25000]

Model   F1    Size
v=h=2v  77.8  7.5K
Unary Splits

• Problem: unary rewrites are used to transmute categories so a high-probability rule can be used.
• Solution: mark unary rewrite sites with -U

Annotation  F1    Size
Base        77.8  7.5K
UNARY       78.3  8.0K
Tag Splits

• Partial solution: subdivide the IN tag.

Annotation  F1    Size
Previous    78.3  8.0K
SPLIT-IN    80.3  8.1K
Other Tag Splits

                                                                    F1    Size
• UNARY-DT: mark demonstratives as DT^U (“the X” vs. “those”)       80.4  8.1K
• UNARY-RB: mark phrasal adverbs as RB^U (“quickly” vs. “very”)     80.5  8.1K
• TAG-PA: mark tags with non-canonical parents (“not” is an RB^VP)  81.2  8.5K
• SPLIT-AUX: mark auxiliary verbs with –AUX [cf. Charniak 97]       81.6  9.0K
• SPLIT-CC: separate “but” and “&” from other conjunctions          81.7  9.1K
• SPLIT-%: “%” gets its own tag.                                    81.8  9.3K
Yield Splits
• Examples:
• Possessive NPs
• Finite vs. infinite VPs
• Lexical heads!
Parser LP LR F1
Magerman 95 84.9 84.6 84.7
Collins 96 86.3 85.8 86.0
Klein & Manning 03 86.9 85.7 86.3
Charniak 97 87.4 87.5 87.4
Collins 99 88.7 88.6 88.6
Heads in Context Free Rules
Heads
Rules to Recover Heads: An Example for NPs
Rules to Recover Heads: An Example for VPs
Adding Headwords to Trees
Lexicalized CFGs in Chomsky Normal Form
Example
Lexicalized CKY

[Figure: combining Y[h] and Z[h'] into X[h] over span (i, j) with split point k — e.g. (VP -> VBD...NP)[saw] built from VBD[saw] and NP[her]]

bestScore(X, i, j, h)
  if (j == i)
    return score(X, s[i])
  else
    return max of
      max over k, w, X -> Y Z of
        score(X[h] -> Y[h] Z[w]) * bestScore(Y, i, k, h) * bestScore(Z, k+1, j, w)
      max over k, w, X -> Y Z of
        score(X[h] -> Y[w] Z[h]) * bestScore(Y, i, k, w) * bestScore(Z, k+1, j, h)
Parsing with Lexicalized CFGs
Pruning with Beams

[Figure: X[h] → Y[h] Z[h'] over span (i, j) with split point k]

• The Collins parser prunes with per-cell beams [Collins 99]
• Essentially, run the O(n⁵) CKY
• Remember only a few hypotheses for each span ⟨i, j⟩
• If we keep K hypotheses at each span, then we do at most O(nK²) work per span (why?)
• Keeps things more or less cubic
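A per-cell beam can be sketched as simple top-K pruning of a chart cell (the dict-based cell representation is an assumption for illustration, not the Collins parser itself):

```python
# Sketch: per-cell beam pruning keeps only the K best hypotheses in each chart
# cell, so later spans combine at most K x K child pairs per split point.
import heapq

def prune_cell(cell, K):
    """cell: dict mapping a hypothesis (e.g. (label, head)) -> score.
    Return a new dict holding only the K highest-scoring entries."""
    best = heapq.nlargest(K, cell.items(), key=lambda kv: kv[1])
    return dict(best)

cell = {("NP", "man"): 0.21, ("VP", "saw"): 0.05, ("S", "saw"): 0.002}
print(prune_cell(cell, 2))   # {('NP', 'man'): 0.21, ('VP', 'saw'): 0.05}
```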
A Model from Charniak (1997)
Final Test Set Results
Parser LP LR F1
Magerman 95 84.9 84.6 84.7
Collins 96 86.3 85.8 86.0
Klein & Manning 03 86.9 85.7 86.3
Charniak 97 87.4 87.5 87.4
Collins 99 88.7 88.6 88.6
Strengths and Weaknesses of PCFG Parsers