Head-Driven Statistical Models for Natural Language Parsing
Michael Collins∗
MIT Computer Science and
Artificial Intelligence Laboratory
This article describes three statistical models for natural language parsing. The models extend
1. Introduction
1. Which linguistic objects (e.g., context-free rules, parse moves) should the
model’s parameters be associated with? In other words, which features
should be used to discriminate among alternative parse trees?
2. How can this choice be instantiated in a sound probabilistic model?
In this article we explore these issues within the framework of generative models,
more precisely, the history-based models originally introduced to parsing by Black
∗ MIT Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology,
545 Technology Square, Cambridge, MA 02139. E-mail: [email protected].
© 2003 Association for Computational Linguistics
Booth and Thompson (1973) specify the conditions under which the PCFG does in fact
define a distribution over the possible derivations (trees) generated by the underlying
grammar. The first condition is that the rule probabilities define conditional distribu-
tions over how each nonterminal in the grammar can expand. The second is a technical
condition that guarantees that the stochastic process generating trees terminates in a
finite number of steps with probability one.
A central problem in PCFGs is to define the conditional probability P(β | X) for
each rule X → β in the grammar. A simple way to do this is to take counts from a
treebank and then to use the maximum-likelihood estimates:
P(β | X) = Count(X → β) / Count(X)    (1)
If the treebank has actually been generated from a probabilistic context-free grammar
with the same rules and nonterminals as the model, then in the limit, as the training
sample size approaches infinity, the probability distribution implied by these estimates
will converge to the distribution of the underlying grammar.2
Once the model has been trained, we have a model that defines P(T, S) for any
sentence-tree pair in the grammar. The output on a new test sentence S is the most
likely tree under this model,
Tbest = arg maxT P(T | S) = arg maxT [P(T, S)/P(S)] = arg maxT P(T, S)
The parser itself is an algorithm that searches for the tree, Tbest , that maximizes P(T, S).
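As an illustration of equation (1) and this decision rule, the following Python sketch (ours, not part of the original article; the data structures and function names are purely illustrative) estimates rule probabilities as relative frequencies and scores candidate trees by the product of their rule probabilities. A real parser searches the space of trees with dynamic programming rather than enumerating candidates.

from collections import defaultdict

def estimate_pcfg(rule_occurrences):
    """Maximum-likelihood PCFG estimation as in equation (1):
    P(beta | X) = Count(X -> beta) / Count(X)."""
    rule_counts = defaultdict(int)
    lhs_counts = defaultdict(int)
    for lhs, rhs in rule_occurrences:       # one (lhs, rhs) pair per rule occurrence
        rule_counts[(lhs, rhs)] += 1
        lhs_counts[lhs] += 1
    return {(lhs, rhs): c / lhs_counts[lhs]
            for (lhs, rhs), c in rule_counts.items()}

def tree_probability(tree_rules, rule_probs):
    """P(T, S): the product of the probabilities of the rules in the tree."""
    p = 1.0
    for rule in tree_rules:
        p *= rule_probs.get(rule, 0.0)      # unseen rules receive zero probability
    return p

def best_tree(candidate_trees, rule_probs):
    """T_best = arg max_T P(T, S); P(S) is constant across the candidates."""
    return max(candidate_trees, key=lambda t: tree_probability(t, rule_probs))

# Toy example: rule occurrences extracted from a tiny "treebank."
occurrences = [("S", ("NP", "VP")), ("S", ("NP", "VP")),
               ("NP", ("DT", "NN")), ("NP", ("NN",)), ("VP", ("VBD", "NP"))]
probs = estimate_pcfg(occurrences)
print(probs[("S", ("NP", "VP"))])           # 1.0
print(probs[("NP", ("DT", "NN"))])          # 0.5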
2 This point is actually more subtle than it first appears (we thank one of the anonymous reviewers for
pointing this out), and we were unable to find proofs of this property in the literature for PCFGs. The
rule probabilities for any nonterminal that appears with probability greater than zero in parse
derivations will converge to their underlying values, by the usual properties of maximum-likelihood
estimation for multinomial distributions. Assuming that the underlying PCFG generating the training
examples meets both criteria in Booth and Thompson (1973), it can be shown that convergence of rule
probabilities implies that the distribution over trees will converge to that of the underlying PCFG, at
least when Kullback-Leibler divergence or the infinity norm is taken to be the measure of distance
between the two distributions. Thanks to Tommi Jaakkola and Nathan Srebro for discussions on this
topic.
3 We find lexical heads in Penn Treebank data using the rules described in Appendix A of Collins (1999).
The rules are a modified version of a head table provided by David Magerman and used in the parser
described in Magerman (1995).
Figure 1
A nonlexicalized parse tree and a list of the rules it contains.
Lexical Rules:
JJ(Last,JJ) → Last
NN(week,NN) → week
NNP(IBM,NNP) → IBM
VBD(bought,VBD) → bought
NNP(Lotus,NNP) → Lotus
Figure 2
A lexicalized parse tree and a list of the rules it contains.
(lexicalization expands the set of nonterminals by a factor of up to |V| × |T|, where |V| is the number
of words in the vocabulary and |T| is the number of part-of-speech tags).
Although nothing has changed from a formal point of view, the practical conse-
quences of expanding the number of nonterminals quickly become apparent when
one is attempting to define a method for parameter estimation. The simplest solution
would be to use the maximum-likelihood estimate as in equation (1), for example,

P(NP(week,NN) NP(IBM,NNP) VP(bought,VBD) | S(bought,VBD)) =
    Count(S(bought,VBD) → NP(week,NN) NP(IBM,NNP) VP(bought,VBD)) / Count(S(bought,VBD))
But the addition of lexical items makes the statistics for this estimate very sparse: The
count for the denominator is likely to be relatively low, and the number of outcomes
(possible lexicalized RHSs) is huge, meaning that the numerator is very likely to be
zero. Predicting the whole lexicalized rule in one go is too big a step.
One way to overcome these sparse-data problems is to break down the generation of the RHS of a rule into a sequence of smaller steps.
3.1 Model 1
This section describes how the generation of the RHS of a rule is broken down into a
sequence of smaller steps in model 1. The first thing to note is that each internal rule
in a lexicalized PCFG has the form4

P(h) → Ln(ln) . . . L1(l1) H(h) R1(r1) . . . Rm(rm)    (2)
H is the head-child of the rule, which inherits the headword/tag pair h from its parent
P. L1 (l1 ) . . . Ln (ln ) and R1 (r1 ) . . . Rm (rm ) are left and right modifiers of H. Either n or m
may be zero, and n = m = 0 for unary rules. Figure 2 shows a tree that will be used
as an example throughout this article. We will extend the left and right sequences to
include a terminating STOP symbol, allowing a Markov process to model the left and
right sequences. Thus Ln+1 = Rm+1 = STOP.
For example, in S(bought,VBD) → NP(week,NN) NP(IBM,NNP) VP(bought,VBD): n = 2, m = 0, P = S,
H = VP, h = ⟨bought, VBD⟩, L1(l1) = NP(IBM,NNP), L2(l2) = NP(week,NN), and L3 = R1 = STOP.
4 With the exception of the top rule in the tree, which has the form TOP → H(h).
Note that lexical rules, in contrast to the internal rules, are completely deterministic.
They always take the form
P(h) → w
where P is a part-of-speech tag, h is a word/tag pair ⟨w, t⟩, and the rule rewrites to just
the word w. (See Figure 2 for examples of lexical rules.) Formally, we will always take
a lexicalized nonterminal P(h) to expand deterministically (with probability one) in
this way if P is a part-of-speech symbol. Thus for the parsing models we require the
nonterminal labels to be partitioned into two sets: part-of-speech symbols and other
nonterminals. Internal rules always have an LHS in which P is not a part-of-speech
symbol. Because lexicalized rules are deterministic, they will not be discussed in the
remainder of this article: All of the modeling choices concern internal rules.
The probability of an internal rule can be rewritten (exactly) using the chain rule, conditioning
each item on the LHS and on all previously generated items. (The subscripts h, l and r are used to
denote the head, left-modifier, and right-modifier parameter types, respectively.) Next, we make
the assumption that the modifiers are generated independently of each other:

P(Ln(ln) . . . L1(l1) H(h) R1(r1) . . . Rm(rm) | P, h) =
    Ph(H | P, h) × ∏i=1...n+1 Pl(Li(li) | P, h, H) × ∏i=1...m+1 Pr(Ri(ri) | P, h, H)
In summary, the generation of the RHS of a rule such as (2), given the LHS, has
been decomposed into three steps:5

1. Generate the head constituent label of the phrase, with probability Ph(H | P, h).
2. Generate the left modifiers with probability ∏i=1...n+1 Pl(Li(li) | P, h, H), where Ln+1(ln+1) = STOP.
3. Generate the right modifiers with probability ∏i=1...m+1 Pr(Ri(ri) | P, h, H), where Rm+1(rm+1) = STOP.

For example, the probability of the rule S(bought) → NP(week) NP(IBM) VP(bought)
would be estimated as

Ph(VP | S, bought) × Pl(NP(IBM) | S, VP, bought) × Pl(NP(week) | S, VP, bought)
    × Pl(STOP | S, VP, bought) × Pr(STOP | S, VP, bought)
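The decomposition can be made concrete with a short sketch (ours, not from the article; the parameter functions p_head, p_left, and p_right stand in for estimated Ph, Pl, and Pr distributions, and the numbers below are invented for illustration). Distance and subcategorization features from later sections are omitted.

import math

def rule_log_prob(P, h, H, left_mods, right_mods, p_head, p_left, p_right):
    """Model 1 decomposition of an internal rule (no distance features):
    generate the head label H, then the left modifiers L1..Ln and STOP,
    then the right modifiers R1..Rm and STOP."""
    logp = math.log(p_head(H, P, h))                  # P_h(H | P, h)
    for mod in list(left_mods) + ["STOP"]:            # L_1 ... L_n, STOP
        logp += math.log(p_left(mod, P, H, h))
    for mod in list(right_mods) + ["STOP"]:           # R_1 ... R_m, STOP
        logp += math.log(p_right(mod, P, H, h))
    return logp

# Invented parameter values for S(bought) -> NP(week) NP(IBM) VP(bought):
p_head  = lambda H, P, h: {"VP": 0.7}.get(H, 0.1)
p_left  = lambda L, P, H, h: {"NP(IBM)": 0.2, "NP(week)": 0.1, "STOP": 0.6}.get(L, 0.01)
p_right = lambda R, P, H, h: {"STOP": 0.8}.get(R, 0.01)

lp = rule_log_prob("S", "bought", "VP",
                   left_mods=["NP(IBM)", "NP(week)"],  # ordered from the head outward
                   right_mods=[],
                   p_head=p_head, p_left=p_left, p_right=p_right)
print(round(math.exp(lp), 5))                          # 0.00672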
5 An exception is the first rule in the tree, TOP → H(h), which has probability PTOP(H, h | TOP).
In this example, and in the examples in the rest of the article, for brevity we omit
the part-of-speech tags associated with words, writing, for example, S(bought) rather
than S(bought,VBD). We emphasize that throughout the models in this article, each
word is always paired with its part of speech, either when the word is generated or
when the word is being conditioned upon.
3.1.1 Adding Distance to the Model. In this section we first describe how the model
can be extended to be “history-based.” We then show how this extension can be
utilized in incorporating “distance” features into the model.
Black et al. (1992) originally introduced history-based models for parsing. Equations (3) and (4)
of the current article made the independence assumption that each modifier is generated conditioned
only on P, H, and h, independently of all other previously generated structure; in a history-based
model, each decision can in principle be conditioned on any part of the derivation history (see Figure 3).
Figure 3
A partially completed tree derived depth-first. “????” marks the position of the next modifier
to be generated—it could be a nonterminal/headword/head-tag triple, or the STOP symbol.
The distribution over possible symbols in this position could be conditioned on any
previously generated structure, that is, any structure appearing in the figure.
Figure 4
The next child, R3 (r3 ), is generated with probability P(R3 (r3 ) | P, H, h, distancer (2)). The distance
is a function of the surface string below previous modifiers R1 and R2 . In principle the model
could condition on any structure dominated by H, R1 , or R2 (or, for that matter, on any
structure previously generated elsewhere in the tree).
The modifier parameters are accordingly extended to Pl(Li(li) | P, H, h, distancel(i − 1)) and
Pr(Ri(ri) | P, H, h, distancer(i − 1)). Here distancel and distancer are functions of the surface
string below the previous modifiers. (See Figure 4 for illustration.) The distance measure is similar to that in
Collins (1996), a vector with the following two elements: (1) Is the string of zero
length? (2) Does the string contain a verb? The first feature allows the model to learn
a preference for right-branching structures. The second feature6 allows the model to
learn a preference for modification of the most recent verb.7
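A minimal sketch of this two-element distance vector follows (our code, not the article's; the set of verb POS tags is our assumption, and the real implementation tracks this information inside the dynamic-programming parser, as footnote 6 notes).

def distance_features(surface_string,
                      verb_tags=("VB", "VBD", "VBG", "VBN", "VBP", "VBZ")):
    """Compute the two distance questions for the string of (word, tag) pairs
    spanned by the previously generated modifiers:
    (1) is the string of zero length?  (2) does the string contain a verb?"""
    is_zero_length = (len(surface_string) == 0)
    contains_verb = any(tag in verb_tags for _, tag in surface_string)
    return (is_zero_length, contains_verb)

print(distance_features([]))                                     # (True, False)
print(distance_features([("last", "JJ"), ("week", "NN")]))       # (False, False)
print(distance_features([("bought", "VBD"), ("Lotus", "NNP")]))  # (False, True)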
6 Note that this feature means that dynamic programming parsing algorithms for the model must keep
track of whether each constituent does or does not have a verb in the string to the right or left of its
head. See Collins (1999) for a full description of the parsing algorithms.
7 In the models described in Collins (1997), there was a third question concerning punctuation: (3) Does
the string contain 0, 1, 2 or more than 2 commas? (where a comma is anything tagged as “,” or “:”).
The model described in this article has a cleaner incorporation of punctuation into the generative
process, as described in section 4.3.
8 Except that IBM is closer to the VP, but note that IBM is also the subject in IBM last week bought Lotus.
9 We use the term complement in a broad sense that includes both complements and specifiers under the
terminology of government and binding.
Figure 5
A tree with the -C suffix used to identify complements. IBM and Lotus are in subject and
object position, respectively. Last week is an adjunct.
the relative improbability of week’s being the headword of a subject. These problems
are not restricted to NPs; compare The spokeswoman said (SBAR that the asbestos was dan-
gerous) with Bonds beat short-term investments (SBAR because the market is down), in which
an SBAR headed by that is a complement, but an SBAR headed by because is an adjunct.
A second reason for incorporating the complement/adjunct distinction into the
parsing model is that this may help parsing accuracy. The assumption that comple-
ments are generated independently of one another often leads to incorrect parses. (See
Figure 6 for examples.)
3.2.1 Identifying Complements and Adjuncts in the Penn Treebank. We add the -C
suffix to all nonterminals in training data that satisfy the following conditions:
2. The nonterminal must not have one of the following semantic tags: ADV,
VOC, BNF, DIR, EXT, LOC, MNR, TMP, CLR or PRP. See Marcus et al.
(1994) for an explanation of what these tags signify. For example, the NP
Last week in Figure 2 would have the TMP (temporal) tag, and the SBAR in
(SBAR because the market is down) would have the ADV (adverbial) tag.
3. The nonterminal must not be on the RHS of a coordinated phrase. For
example, in the rule S → S CC S, the two child Ss would not be marked
as complements.
In addition, the first child following the head of a prepositional phrase is marked as
a complement.
Here the head initially decides to take a single NP-C (subject) to its left and no com-
plements to its right. NP-C(IBM) is immediately generated as the required subject, and
NP-C is removed from LC, leaving it empty when the next modifier, NP(week), is gen-
erated. The incorrect structures in Figure 6 should now have low probability, because
Plc ({NP-C,NP-C} | S,VP,was) and Prc ({NP-C,VP-C} | VP,VB,was) should be small.
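The bookkeeping behind the subcategorization frames can be sketched as follows (our code; in the model these constraints are expressed probabilistically, through the Plc, Prc, and modifier parameters, rather than as the hard checks used here).

from collections import Counter

def generate_modifier(subcat, label):
    """Check a candidate modifier against the remaining subcategorization
    requirements.  Complements (labels ending in -C) must have been requested;
    adjuncts are always allowed; STOP is legal only when nothing remains."""
    remaining = Counter(subcat)
    if label == "STOP":
        return sum(remaining.values()) == 0, remaining
    if label.endswith("-C"):
        if remaining[label] == 0:
            return False, remaining       # a complement that was never requested
        remaining[label] -= 1
    return True, remaining

# S(bought): the head VP takes a single NP-C (subject) to its left.
lc = Counter({"NP-C": 1})
ok, lc = generate_modifier(lc, "NP-C")    # the required subject, NP-C(IBM)
print(ok, dict(lc))                       # True {'NP-C': 0}
ok, lc = generate_modifier(lc, "NP")      # the adjunct NP(week)
print(ok)                                 # True
ok, _ = generate_modifier(lc, "STOP")     # legal: LC is now empty
print(ok)                                 # True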
It might be possible to write rule-based patterns that identify traces in a parse tree.
We argue again, however, that this task is best integrated into the parser: The task
(1) NP → NP SBAR(+gap)
(2) SBAR(+gap) → WHNP S-C(+gap)
(3) S(+gap) → NP-C VP(+gap)
(4) VP(+gap) → VB TRACE NP
Figure 7
A +gap feature can be added to nonterminals to describe wh-movement. The top-level NP
initially generates an SBAR modifier but specifies that it must contain an NP trace by adding
the +gap feature. The gap is then passed down through the tree, until it is discharged as a
TRACE complement to the right of bought.
Given that the LHS of the rule has a gap, there are three ways that the gap can
be passed down to the RHS:
Head: The gap is passed to the head of the phrase, as in rule (3) in Figure 7.
Left, Right: The gap is passed on recursively to one of the left or right modifiers
of the head or is discharged as a TRACE argument to the left or right of
the head. In rule (2) in Figure 7, it is passed on to a right modifier, the S
complement. In rule (4), a TRACE is generated to the right of the head VB.
In rule (2), Right is chosen, so the +gap requirement is added to RC. Generation of
S-C(bought)(+gap) fulfills both the S-C and +gap requirements in RC. In rule (4),
Right is chosen again. Note that generation of TRACE satisfies both the NP-C and +gap
subcategorization requirements.
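The following check (our code, a simplification of the probabilistic Head/Left/Right choice just described) identifies which of the three realizations a rule's RHS instantiates when its LHS carries +gap.

def gap_realizations(head_label, left_mods, right_mods):
    """The three ways a +gap on the LHS can be realized on the RHS:
    Head  - the head child carries +gap;
    Left  - a left modifier carries +gap, or a TRACE is generated on the left;
    Right - likewise for the right modifiers."""
    ways = []
    if "+gap" in head_label:
        ways.append("Head")
    if any("+gap" in m or m == "TRACE" for m in left_mods):
        ways.append("Left")
    if any("+gap" in m or m == "TRACE" for m in right_mods):
        ways.append("Right")
    return ways

# Rules (2)-(4) from Figure 7:
print(gap_realizations("WHNP", [], ["S-C(+gap)"]))   # ['Right']  (rule 2)
print(gap_realizations("VP(+gap)", ["NP-C"], []))    # ['Head']   (rule 3)
print(gap_realizations("VB", [], ["TRACE", "NP"]))   # ['Right']  (rule 4)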
Sections 3.1 to 3.3 described the basic framework for the parsing models in this article.
In this section we describe how some linguistic phenomena (nonrecursive NPs and
coordination, for example) clearly violate the independence assumptions of the general
models. We describe a number of these special cases, in each instance arguing that the
phenomenon violates the independence assumptions, then describing how the model
can be refined to deal with the problem.
Figure 8
Three examples of structures with base-NPs.
the model will fail to learn that the STOP symbol is very likely to follow
a determiner. As a result, the model will assign unreasonably high
probabilities to NPs such as [NP yesterday the dog] in sentences such as
Yesterday the dog barked.
• The annotation standard in the treebank leaves the internal structure of
base-NPs underspecified. For example, both pet food volume (where pet
modifies food and food modifies volume) and vanilla ice cream (where both
vanilla and ice modify cream) would have the structure NPB → NN NN NN.
Because of this, there is no reason to believe that modifiers within NPBs
are dependent on the head rather than the previous modifier. In fact, if it
so happened that a majority of phrases were like pet food volume, then
conditioning on the previous modifier rather than the head would be
preferable.
• In general it is important (in particular for the distance measure to be
effective) to have different nonterminal labels for what are effectively
different X-bar levels. (See section 7.3.2 for further discussion.)
For these reasons the following modifications are made to the models:
11 For simplicity, we give probability terms under model 1 with no distance variables; the probability
terms with distance variables, or for models 2 and 3, will be similar, but with the addition of various
pieces of conditioning information.
4.2 Coordination
Coordination constructions are another example in which the independence assump-
tions in the basic models fail badly (at least given the current annotation method in
the treebank). Figure 9 shows how coordination is annotated in the treebank.12 To
use an example to illustrate the problems, take the rule NP(man) → NP(man) CC(and)
NP(dog), which has probability
The independence assumptions mean that the model fails to learn that there is always
exactly one phrase following the coordinator (CC). The basic probability models will
give much too high probabilities to unlikely phrases such as NP → NP CC or NP →
NP CC NP NP. For this reason we alter the generative process to allow generation of
both the coordinator and the following phrase in one step; instead of just generating a
nonterminal at each step, a nonterminal and a binary-valued coord flag are generated.
coord = 1 if there is a coordination relationship. In the generative process, generation
of a coord = 1 flag along with a modifier triggers an additional step in the generative
process: the generation of the coordinator word and its POS tag.
Figure 9
(a) The generic way of annotating coordination in the treebank. (b) and (c) show specific
examples (with base-NPs added as described in section 4.1). Note that the first item of the
conjunct is taken as the head of the phrase.
12 See Appendix A of Collins (1999) for a description of how the head rules treat phrases involving
coordination.
Note the new type of parameter, Pcc , for the generation of the coordinator word and
POS tag. The generation of coord=1 along with NP(dog) in the example implicitly
requires generation of a coordinator tag/word pair through the Pcc parameter. The
generation of this tag/word pair is conditioned on the two words in the coordination
dependency (man and dog in the example) and the label on their relationship (NP,NP,NP in the example).
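A sketch of this extra generative step is given below (our code; p_mod and p_cc stand in for the estimated modifier and Pcc distributions, and the numerical values are invented).

import math

def modifier_with_coord_logprob(modifier, coord, coordinator,
                                parent, head_word, p_mod, p_cc):
    """Generate a modifier together with its coord flag; coord = 1 triggers the
    additional generation of the coordinator word/tag pair through P_cc,
    conditioned on the two words in the coordination dependency."""
    logp = math.log(p_mod(modifier, coord, parent, head_word))
    if coord == 1:
        logp += math.log(p_cc(coordinator, parent, head_word, modifier))
    return logp

# Invented estimates for NP(man) -> NP(man) CC(and) NP(dog):
p_mod = lambda mod, coord, P, hw: 0.05 if (mod, coord) == ("NP(dog)", 1) else 0.01
p_cc  = lambda cc, P, hw, mod: 0.6 if cc == ("and", "CC") else 0.01

lp = modifier_with_coord_logprob("NP(dog)", 1, ("and", "CC"), "NP", "man", p_mod, p_cc)
print(round(math.exp(lp), 3))    # 0.03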
4.3 Punctuation
This section describes our treatment of “punctuation” in the model, where “punctu-
ation” is used to refer to words tagged as a comma or colon. Previous work—the
generative models described in Collins (1996) and the earlier version of these mod-
els described in Collins (1997)—conditioned on punctuation as surface features of the
string, treating it quite differently from lexical items. In particular, the model in Collins
(1997) failed to generate punctuation, a deficiency of the model. This section describes
how punctuation is integrated into the generative models.
Our first step is to raise punctuation as high in the parse trees as possible. Punc-
tuation at the beginning or end of sentences is removed from the training/test data
altogether.13 All punctuation items apart from those tagged as comma or colon (items
such as quotation marks and periods) are removed altogether. These
transformations mean that punctuation always appears between two nonterminals, as
opposed to appearing at the end of a phrase. (See Figure 10 for an example.)
Figure 10
A parse tree before and after punctuation transformations.
13 As one of the anonymous reviewers of this article pointed out, this choice of discarding the
sentence-final punctuation may not be optimal, as the final punctuation mark may well carry useful
information about the sentence structure.
Figure 11
(a) The treebank annotates sentences with empty subjects with an empty -NONE- element
under subject position; (b) in training (and for evaluation), this null element is removed; (c) in
models 2 and 3, sentences without subjects are changed to have a nonterminal SG.
Table 1
The conditioning variables for each level of back-off. For example, Ph estimation interpolates
e1 = Ph (H | P, w, t), e2 = Ph (H | P, t), and e3 = Ph (H | P). ∆ is the distance measure.
These two probabilities are then smoothed separately. Eisner (1996b) originally used
POS tags to smooth a generative model in this way. In each case the final estimate is
e = λ1 e1 + (1 − λ1 )(λ2 e2 + (1 − λ2 )e3 )
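The following sketch shows the backed-off estimate itself (our code; how λ1 and λ2 are computed from training-data counts is not shown here).

def interpolate(e1, e2, e3, lambda1, lambda2):
    """Backed-off estimate for one parameter class:
    e = lambda1*e1 + (1 - lambda1)*(lambda2*e2 + (1 - lambda2)*e3),
    where e1 is the most specific estimate and e3 the least specific."""
    return lambda1 * e1 + (1.0 - lambda1) * (lambda2 * e2 + (1.0 - lambda2) * e3)

# e1 = P_h(H | P, w, t), e2 = P_h(H | P, t), e3 = P_h(H | P), as in Table 1.
print(round(interpolate(e1=0.8, e2=0.6, e3=0.3, lambda1=0.7, lambda2=0.5), 3))   # 0.695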
14 In Collins (1999) we erroneously stated that all words occurring less than five times in training data
were classified as “unknown.” Thanks to Dan Bikel for pointing out this error.
tagger is used to decode each test data sentence. All other words are tagged during
parsing, the output from Ratnaparkhi’s tagger being ignored. The POS tags allowed
for each word are limited to those that have been seen in training data for that word
(any tag/word pairs not seen in training would give an estimate of zero in the PL2
and PR2 distributions). The model is fully integrated, in that part-of-speech tags are
statistically generated along with words in the models, so that the parser will make a
statistical decision as to the most likely tag for each known word in the sentence.
6. Results
The parser was trained on sections 2–21 of the Wall Street Journal portion of the
Penn Treebank (Marcus, Santorini, and Marcinkiewicz 1993) (approximately 40,000
sentences) and tested on section 23 (2,416 sentences). We use the PARSEVAL measures
(Black et al. 1991) to compare performance:
For a constituent to be “correct,” it must span the same set of words (ignoring punctu-
ation, i.e., all tokens tagged as commas, colons, or quotation marks) and have the same
label15 as a constituent in the treebank parse. Table 2 shows the results for models 1, 2
and 3 and a variety of other models in the literature. Two models (Collins 2000; Char-
niak 2000) outperform models 2 and 3 on section 23 of the treebank. Collins (2000)
uses a technique based on boosting algorithms for machine learning that reranks n-best
output from model 2 in this article. Charniak (2000) describes a series of enhancements
to the earlier model of Charniak (1997).
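A simplified version of the constituent-level scoring is sketched below (our code; the official PARSEVAL implementation also handles the ADVP/PRT collapsing of footnote 15, duplicate constituents, and crossing-bracket statistics, which are all omitted here).

def labeled_scores(gold, predicted):
    """Labeled recall/precision over constituents, where a constituent is a
    (label, start, end) tuple over the punctuation-stripped word sequence and
    is correct if the same tuple appears in the gold tree."""
    gold_set, pred_set = set(gold), set(predicted)
    correct = len(gold_set & pred_set)
    return correct / len(gold_set), correct / len(pred_set)

gold = [("S", 0, 5), ("NP", 0, 2), ("VP", 2, 5), ("NP", 3, 5)]
pred = [("S", 0, 5), ("NP", 0, 2), ("VP", 2, 5), ("PP", 3, 5)]
print(labeled_scores(gold, pred))    # (0.75, 0.75)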
The precision and recall of the traces found by Model 3 were 93.8% and 90.1%,
respectively (out of 437 cases in section 23 of the treebank), where three criteria must be
met for a trace to be “correct”: (1) It must be an argument to the correct headword; (2)
It must be in the correct position in relation to that headword (preceding or following);
15 Magerman (1995) collapses ADVP and PRT into the same label; for comparison, we also removed this
distinction when calculating scores.
Table 2
Results on Section 23 of the WSJ Treebank. LR/LP = labeled recall/precision. CBs is the
average number of crossing brackets per sentence. 0 CBs and ≤ 2 CBs are the percentages of
sentences with 0 or ≤ 2 crossing brackets, respectively. All the results in this table are for
models trained and tested on the same data, using the same evaluation metric. (Note that
these results show a slight improvement over those in Collins (1997); the main model changes
were the improved treatment of punctuation (section 4.3) together with the addition of the Pp
and Pcc parameters.)
Sentences of ≤ 40 words (2,245 sentences):

Model              LR      LP      CBs    0 CBs    ≤ 2 CBs
Magerman (1995)    84.6%   84.9%   1.26   56.6%    81.4%
and (3) It must be dominated by the correct nonterminal label. For example, in Figure 7,
the trace is an argument to bought, which it follows, and it is dominated by a VP. Of the
437 cases, 341 were string-vacuous extraction from subject position, recovered with
96.3% precision and 98.8% recall; and 96 were longer distance cases, recovered with
81.4% precision and 59.4% recall.16
7. Discussion
This section discusses some aspects of the models in more detail. Section 7.1 gives a
much more detailed analysis of the parsers’ performance. In section 7.2 we examine
16 We exclude infinitival relative clauses from these figures (for example, I called a plumber TRACE to fix the
sink, where plumber is coindexed with the trace subject of the infinitival). The algorithm scored 41%
precision and 18% recall on the 60 cases in section 23—but infinitival relatives are extremely difficult
even for human annotators to distinguish from purpose clauses (in this case, the infinitival could be a
purpose clause modifying called) (Ann Taylor, personal communication, 1997).
the distance features in the model. In section 7.3 we examine how the model interacts
with the Penn Treebank style of annotation. Finally, in section 7.4 we discuss the need
to break down context-free rules in the treebank in such a way that the model will
generalize to give nonzero probability to rules not seen in training. In each case we
use three methods of analysis. First, we consider how various aspects of the model
affect parsing performance, through accuracy measurements on the treebank. Second,
we look at the frequency of different constructions in the treebank. Third, we consider
linguistically motivated examples as a way of justifying various modeling choices.
Table 3
Recall and precision for different constituent types, for section 0 of the treebank with model 2.
Label is the nonterminal label; Proportion is the percentage of constituents in the treebank
section 0 that have this label; Count is the number of constituents that have this label.
Figure 12
A tree and its associated dependencies. Note that in “normalizing” dependencies, all POS tags
are replaced with TAG, and the NP-C parent in the fifth relation is replaced with NP.
In addition, the relation is “normalized” to some extent. First, all POS tags are
replaced with the token TAG, so that POS-tagging errors do not lead to errors in
dependencies.17 Second, any complement markings on the parent or head nontermi-
nal are removed. For example, NP-C, NPB, PP, R is replaced by NP, NPB, PP, R. This
prevents parsing errors where a complement has been mistaken to be an adjunct (or
vice versa), leading to more than one dependency error. As an example, in Figure 12,
if the NP the man with the telescope was mistakenly identified as an adjunct, then without
normalization, this would lead to two dependency errors: Both the PP dependency and
the verb-object relation would be incorrect. With normalization, only the verb-object
relation is incorrect.
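A sketch of this normalization follows (our code; we read each relation as a ⟨parent, head nonterminal, modifier nonterminal, direction⟩ tuple, and the POS tag set shown is a small illustrative subset).

POS_TAGS = {"NN", "NNS", "NNP", "VBD", "IN", "DT", "JJ"}   # illustrative subset

def normalize_dependency(relation):
    """Replace POS tags with the token TAG and strip complement (-C) markings
    from the parent and head nonterminals (markings on the modifier are kept)."""
    parent, head, modifier, direction = relation
    tag = lambda x: "TAG" if x in POS_TAGS else x
    strip_c = lambda x: x[:-2] if x.endswith("-C") else x
    return (strip_c(tag(parent)), strip_c(tag(head)), tag(modifier), direction)

# The PP modification of an NP-C, as in Figure 12:
print(normalize_dependency(("NP-C", "NPB", "PP", "R")))    # ('NP', 'NPB', 'PP', 'R')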
17 The justification for this is that there is an estimated 3% error rate in the hand-assigned POS tags in the
treebank (Ratnaparkhi 1996), and we didn’t want this noise to contribute to dependency errors.
Table 4
Dependency accuracy on section 0 of the treebank with Model 2. No labels means that only the
dependency needs to be correct; the relation may be wrong; No complements means all
complement (-C) markings are stripped before comparing relations; All means complement
markings are retained on the modifying nonterminal.
A conclusion to draw from these accuracies is that the parser is doing very well at
recovering the core structure of sentences: complements, sentential heads, and base-NP
relationships (NP chunks) are all recovered with over 90% accuracy. The main sources
of errors are adjuncts. Coordination is especially difficult for the parser, most likely
Table 5
Accuracy of the 50 most frequent dependency types in section 0 of the treebank, as recovered
by model 2.
Table 6
Accuracy for various types/subtypes of dependency (part 1). Only subtypes occurring more
than 10 times are shown.
Table 7
Results on section 0 of the WSJ Treebank. A “YES” in the A column means that the adjacency
conditions were used in the distance measure; likewise, a “YES” in the V column indicates
that the verb conditions were used in the distance measure. LR = labeled recall; LP = labeled
precision. CBs is the average number of crossing brackets per sentence. 0 CBs and ≤ 2 CBs are the
percentages of sentences with 0 and ≤ 2 crossing brackets, respectively.
because it often involves a dependency between two content words, leading to very
sparse statistics.
7.2.1 Impact of the Distance Measure on Accuracy. Table 7 shows the results for
models 1 and 2 with and without the adjacency and verb distance measures. It is clear
that the distance measure improves the models’ accuracy.
What is most striking is just how badly model 1 performs without the distance
measure. Looking at the parser’s output, the reason for this poor performance is that
the adjacency condition in the distance measure is approximating subcategorization
information. In particular, in phrases such as PPs and SBARs (and, to a lesser extent,
in VPs) that almost always take exactly one complement to the right of their head,
the adjacency feature encodes this monovalency through parameters P(STOP|PP/SBAR,
adjacent) = 0 and P(STOP|PP/SBAR, not adjacent) = 1. Figure 13 shows some par-
ticularly bad structures returned by model 1 with no distance variables.
Another surprise is that subcategorization can be very useful, but that the dis-
tance measure has masked this utility. One interpretation, in moving from the least
parameterized model (Model 1 [No, No]) to the fully parameterized model (Model 2
[Yes, Yes]), is that the adjacency condition adds around 11% in accuracy; the verb
condition adds another 1.5%; and subcategorization finally adds a mere 0.8%. Under
this interpretation subcategorization information isn’t all that useful (and this was my
original assumption, as this was the order in which features were originally added
to the model). But under another interpretation subcategorization is very useful: In
moving from Model 1 (No, No) to Model 2 (No, No), we see a 10% improvement as a
result of subcategorization parameters; adjacency then adds a 1.5% improvement; and
the verb condition adds a final 1% improvement.
From an engineering point of view, given a choice of whether to add just distance
or subcategorization to the model, distance is preferable. But linguistically it is clear
that adjacency can only approximate subcategorization and that subcategorization is
Figure 13
Two examples of bad parses produced by model 1 with no distance or subcategorization
conditions (Model 1 (No, No) in table 7). In (a) one PP has two complements, the other has
none; in (b) the SBAR has two complements. In both examples either the adjacency condition
or the subcategorization parameters will correct the errors, so these are examples in which the
two kinds of information overlap.
Table 8
Distribution of nonterminals generated as postmodifiers to an NP (see tree to the left), at
various distances from the head. A = True means the modifier is adjacent to the head, V =
True means there is a verb between the head and the modifier. Distributions were calculated
from the first 10,000 events for each of the three cases in sections 2–21 of the treebank.

A = True, V = False          A = False, V = False         A = False, V = True
Percentage  Nonterminal      Percentage  Nonterminal      Percentage  Nonterminal
70.78 STOP 88.53 STOP 97.65 STOP
17.7 PP 5.57 PP 0.93 PP
3.54 SBAR 2.28 SBAR 0.55 SBAR
3.43 NP 1.55 NP 0.35 NP
2.22 VP 0.92 VP 0.22 VP
0.61 SG 0.38 SG 0.09 SG
0.56 ADJP 0.26 PRN 0.07 PRN
0.54 PRN 0.22 ADVP 0.04 ADJP
0.36 ADVP 0.15 ADJP 0.03 ADVP
0.08 TO 0.09 -RRB- 0.02 S
0.08 CONJP 0.02 UCP 0.02 -RRB-
0.03 UCP 0.01 X 0.01 X
0.02 JJ 0.01 RRC 0.01 VBG
0.01 VBN 0.01 RB 0.01 RB
0.01 RRC
0.01 FRAG
0.01 CD
0.01 -LRB-
more “correct” in some sense. In free-word-order languages, distance may not approx-
imate subcategorization at all well: A complement may appear to either the right or
left of the head, confusing the adjacency condition.
7.2.2 Frequencies in Training Data. Tables 8 and 9 show the effect of distance on the
distribution of modifiers in two of the most frequent syntactic environments: NP and
verb modification. The distribution varies a great deal with distance. Most striking is
the way that the probability of STOP increases with increasing distance: from 71% to
89% to 98% in the NP case, from 8% to 60% to 96% in the verb case. Each modifier
probability generally decreases with distance. For example, the probability of seeing
a PP modifier to an NP decreases from 17.7% to 5.57% to 0.93%.
Table 9
Distribution of nonterminals generated as postmodifiers to a verb within a VP (see tree to the
left), at various distances from the head. A = True means the modifier is adjacent to the head;
V = True means there is a verb between the head and the modifier. The distributions were
calculated from the first 10,000 events for each of the distributions in sections 2–21. Auxiliary
verbs (verbs taking a VP complement to their right) were excluded from these statistics.

A = True, V = False          A = False, V = False         A = False, V = True
Percentage  Nonterminal      Percentage  Nonterminal      Percentage  Nonterminal
39 NP-C 59.87 STOP 95.92 STOP
15.8 PP 22.7 PP 1.73 PP
8.43 SBAR-C 3.3 NP-C 0.92 SBAR
8.27 STOP 3.16 SG 0.5 NP
7.2.3 Distance Features and Right-Branching Structures. Both the adjacency and verb
components of the distance measure allow the model to learn a preference for right-
branching structures. First, consider the adjacency condition. Figure 14 shows some
examples in which right-branching structures are more frequent. Using the statistics
from Tables 8 and 9, the probability of the alternative structures can be calculated. The
results are given below. The right-branching structures get higher probability (although
this is before the lexical-dependency probabilities are multiplied in, so this “prior”
preference for right-branching structures can be overruled by lexical preferences). If
the distance variables were not conditioned on, the product of terms for the two
alternatives would be identical, and the model would have no preference for one
structure over another.
Probabilities for the two alternative PP structures in Figure 14 (excluding probabil-
ity terms that are constant across the two structures; A = 1 means the modifier is adjacent
to the head, A = 0 means it is not) are as follows:
Right-branching:
P(PP|NP,NPB,A=1)P(STOP|NP,NPB,A=0)
P(PP|NP,NPB,A=1)P(STOP|NP,NPB,A=0)
= 0.177 × 0.8853 × 0.177 × 0.8853 = 0.02455
Non-right-branching:
P(PP|NP,NPB,A=1)P(PP|NP,NPB,A=0)
P(STOP|NP,NPB,A=0)P(STOP|NP,NPB,A=1)
= 0.177 × 0.0557 × 0.8853 × 0.7078 = 0.006178
Probabilities for the SBAR case in Figure 14, assuming the SBAR contains a verb (V=0
means modification does not cross a verb, V=1 means it does), are as follows:
Right-branching:
P(PP|NP,NPB,A=1,V=0)P(SBAR|NP,NPB,A=1,V=0)
P(STOP|NP,NPB,A=0,V=1)P(STOP|NP,NPB,A=0,V=1)
= 0.177 × 0.0354 × 0.9765 × 0.9765 = 0.005975
Non-right-branching:
P(PP|NP,NPB,A=1)P(STOP|NP,NPB,A=1)
P(SBAR|NP,NPB,A=0)P(STOP|NP,NPB,A=0,V=1)
= 0.177 × 0.7078 × 0.0228 × 0.9765 = 0.002789
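The same arithmetic can be reproduced directly from the Table 8 probabilities (terms that are constant across the competing structures are excluded, as above):

# Probabilities taken from Table 8 (PP, SBAR, and STOP under the three distance conditions).
p_pp_adj, p_sbar_adj, p_stop_adj = 0.177, 0.0354, 0.7078            # A = True,  V = False
p_pp_nonadj, p_sbar_nonadj, p_stop_nonadj = 0.0557, 0.0228, 0.8853  # A = False, V = False
p_stop_nonadj_verb = 0.9765                                         # A = False, V = True

right_branching_pp = p_pp_adj * p_stop_nonadj * p_pp_adj * p_stop_nonadj
non_right_branching_pp = p_pp_adj * p_pp_nonadj * p_stop_nonadj * p_stop_adj
print(round(right_branching_pp, 5), round(non_right_branching_pp, 6))      # 0.02455 0.006178

right_branching_sbar = p_pp_adj * p_sbar_adj * p_stop_nonadj_verb * p_stop_nonadj_verb
non_right_branching_sbar = p_pp_adj * p_stop_adj * p_sbar_nonadj * p_stop_nonadj_verb
print(round(right_branching_sbar, 6), round(non_right_branching_sbar, 6))  # 0.005975 0.002789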
7.2.4 Verb Condition and Right-Branching Structures. Figure 15 shows some exam-
ples in which the verb condition is important in differentiating the probability of two
structures. In both cases an adjunct can attach either high or low, but high attachment
results in a dependency’s crossing a verb and has lower probability.
An alternative to the surface string feature would be a predicate such as were any
of the previous modifiers in X, where X is a set of nonterminals that are likely to contain
a verb, such as VP, SBAR, S, or SG. This would allow the model to handle cases like the
first example in Figure 15 correctly. The second example shows why it is preferable to
condition on the surface string. In this case the verb is “invisible” to the top level, as
it is generated recursively below the NP object.
7.2.5 Structural versus Semantic Preferences. One hypothesis would be that lexical
statistics are really what is important in parsing: that arriving at a correct interpretation
for a sentence is simply a matter of finding the most semantically plausible analysis,
and that the statistics related to lexical dependencies approximate this notion of plau-
sibility. Implicitly, we would be just as well off (maybe even better off) if statistics were
calculated between items at the predicate-argument level, with no reference to struc-
ture. The distance preferences under this interpretation are just a way of mitigating
sparse-data problems: When the lexical statistics are too sparse, then falling back on
some structural preference is not ideal, but is at least better than chance. This hypoth-
esis is suggested by previous work on specific cases of attachment ambiguity such
as PP attachment (see, e.g., Collins and Brooks 1995), which has shown that models
will perform better given lexical statistics and that a straight structural preference is
merely a fallback.
But some examples suggest this is not the case: that, in fact, many sentences
have several equally semantically plausible analyses, but that structural preferences
distinguish strongly among them. Take the following example (from Pereira and War-
ren 1980):

(4) John was believed to have been shot by Bill.
Surprisingly, this sentence has two analyses: Bill can be the deep subject of either
believed or shot. Yet people have a very strong preference for Bill to be doing the
shooting, so much so that they may even miss the second analysis. (To see that the
dispreferred analysis is semantically quite plausible, consider Bill believed John to have
been shot.)
As evidence that structural preferences can even override semantic plausibility,
take the following example (from Pinker 1994):

(5) Flip said that Squeaky will do the work yesterday.
This sentence is a garden path: The structural preference for yesterday to modify the
most recent verb is so strong that it is easy to miss the (only) semantically plausible
interpretation, paraphrased as Flip said yesterday that Squeaky will do the work.
The model makes the correct predictions in these cases. In example (4), the statistics
in Table 9 show that a PP is nine times as likely to attach low as to attach high when
two verbs are candidate attachment points (the chances of seeing a PP modifier are
15.8% and 1.73% in columns 1 and 5 of the table, respectively). In example (5), the
probability of seeing an NP (adjunct) modifier to do in a nonadjacent but non-verb-
crossing environment is 2.11% in sections 2–21 of the treebank (8 out of 379 cases); in
contrast, the chance of seeing an NP adjunct modifying said across a verb is 0.026% (1
out of 3,778 cases). The two probabilities differ by a factor of almost 80.
Figure 16
Alternative annotation styles for a sentence S with a verb head V, left modifiers X1, X2, and
right modifiers Y1, Y2: (a) the Penn Treebank style of analysis (one level of structure for each
bar level); (b) an alternative but equivalent binary branching representation.
Figure 17
Alternative annotation styles for a noun phrase with a noun head N, left modifiers X1, X2,
and right modifiers Y1, Y2: (a) the Penn Treebank style of analysis (one level of structure for
each bar level, although note that both the nonrecursive and the recursive noun phrases are
labeled NP); (b) an alternative but equivalent binary branching representation; (a′) our
modification of the Penn Treebank style to differentiate recursive and nonrecursive NPs (in
some sense NPB is a bar 1 structure and NP is a bar 2 structure).
As long as there is a one-to-one mapping between the treebank and the new rep-
resentation, nothing is lost in making such a transformation. Goodman (1997) and
Johnson (1997) both suggest this strategy. Goodman (1997) converts the treebank into
binary-branching trees. Johnson (1997) considers conversion to a number of different
representations and discusses how this influences accuracy for nonlexicalized PCFGs.
The models developed in this article have tacitly assumed the Penn Treebank
style of annotation and will perform badly given other representations (for example,
binary-branching trees). This section makes this point more explicit, describing exactly
what annotation style is suitable for the models and showing how other annotation
styles will cause problems. This dependence on Penn Treebank–style annotations does
not imply that the models are inappropriate for a treebank annotated in a different
style: In this case we simply recommend transforming the trees into flat, one-level-
per-X-bar-level trees before training the model, as in the three-step procedure outlined
above.
Other models in the literature are also very likely to be sensitive to annotation
style. Charniak’s (1997) models will most likely perform quite differently with binary-
branching trees (for example, his current models will learn that rules such as VP →
V SG PP are very rare, but with binary-branching structures, this context sensitivity
will be lost). The models of Magerman (1995) and Ratnaparkhi (1997) use contextual
predicates that would most likely need to be modified given a different annotation
style. Goodman’s (1997) models are the exception, as he already specifies that the
treebank should be transformed into his chosen representation, binary-branching trees.
7.3.1 Representation Affects Structural, not Lexical, Preferences. The alternative rep-
resentations in Figures 16 and 17 have the same lexical dependencies (providing that
the binary-branching structures are centered about the head of the phrase, as in the
examples). The difference between the representations involves structural preferences
such as the right-branching preferences encoded by the distance measure. Applying
the models in this article to treebank analyses that use this type of “head-centered”
Figure 18
BB = binary-branching structures; FLAT = Penn Treebank style annotations. In each case the
binary-branching annotation style prevents the model from learning that these structures
should receive low probability.
binary-branching tree will result in a distance measure that no longer correctly encodes a pref-
erence for right-branching structures.
To see this, consider the examples in Figure 18. In each binary-branching example,
the generation of the final modifying PP is “blind” to the distance between it and the
head that it modifies. At the top level of the tree, it is apparently adjacent to the head;
crucially, the closer modifier (SG in (a), the other PP in (b)) is hidden lower in the tree
structure. So the model will be unable to differentiate generation of the PP in adjacent
versus nonadjacent or non-verb-crossing versus verb-crossing environments, and the
structures in Figure 18 will be assigned unreasonably high probabilities.
This does not mean that distance preferences cannot be encoded in a binary-
branching PCFG. Goodman (1997) achieves this by adding distance features to the non-
terminals. The spirit of this implementation is that the top-level rules VP → VP PP and
NP → NP PP would be modified to VP → VP(+rverb) PP and NP → NP(+rmod) PP,
respectively, where (+rverb) means a phrase in which the head has a verb in its right
modifiers, and (+rmod) means a phrase that has at least one right modifier to the
head. The model will learn from training data that P(VP → VP(+rverb) PP|VP)
P(VP → VP(-rverb) PP|VP), that is, that a prepositional-phrase modification is much
more likely when it does not cross a verb.
Figure 19
(a) The way the Penn Treebank annotates NPs. (a ) Our modification to the annotation, to
differentiate recursive (NP) from nonrecursive (NPB) noun phrases. (b) A structure that is never
seen in training data but will receive much too high a probability from a model trained on
trees of style (a).
Figure 20
Examples of other phrases in the Penn Treebank in which nonrecursive and recursive phrases
are not differentiated.
are never seen in training data. (Johnson [1997] notes that this structure has a higher
probability than the correct, flat structure, given counts taken from the treebank for
7.3.3 Summary. To summarize, the models in this article assume the following:
1. Tree representations are “flat”: that is, one level per X-bar level.
2. Different X-bar levels have different labels (in particular, nonrecursive
and recursive levels are differentiated, at least for the most frequent case
of NPs).
The estimation technique used in Charniak (1997) for the CF rule probabilities inter-
polates several estimates, the lowest being P(Ln . . . L1 H R1 . . . Rm | P). Any rules not
seen in training data will be assigned zero probability with this model. Parse trees in
test data will be limited to include rules seen in training.
A problem with this approach is coverage. As shown in this section, many test data
sentences will require rules that have not been seen in training. This gives motivation
for breaking down rules into smaller components. This section motivates the need to
break down rules from four perspectives. First, we discuss how the Penn Treebank
annotation style leads to a very large number of grammar rules. Second, we assess the
extent of the coverage problem by looking at rule frequencies in training data. Third,
we conduct experiments to assess the impact of the coverage problem on accuracy.
Fourth, we discuss how breaking rules down may improve estimation as well as coverage.
7.4.1 The Penn Treebank Annotation Style Leads to Many Rules. The “flatness” of
the Penn Treebank annotation style has already been discussed, in section 7.3. The
flatness of the trees leads to a very large (and constantly growing) number of rules,
primarily because the number of adjuncts to a head is potentially unlimited: For ex-
ample, there can be any number of PP adjuncts to a head verb. A binary-branching
(Chomsky adjunction) grammar can generate an unlimited number of adjuncts with
very few rules. For example, the following grammar generates any sequence VP → V
NP PP*:
VP → V NP
VP → VP PP
In contrast, the Penn Treebank style would create a new rule for each number of PPs
seen in training data. The grammar would be
VP → V NP
VP → V NP PP
VP → V NP PP PP
VP → V NP PP PP PP
and so on
Other adverbial adjuncts, such as adverbial phrases or adverbial SBARs, can also modify
a verb several times, and all of these different types of adjuncts can be seen together in
the same rule. The result is a combinatorial explosion in the number of rules. To give
a flavor of this, here is a random sample of rules of the format VP → VB modifier*
that occurred only once in sections 2–21 of the Penn Treebank:
VP → VB NP NP NP PRN
VP → VB NP SBAR PP SG ADVP
VP → VB NP ADVP ADVP PP PP
VP → VB RB
VP → VB NP PP NP SBAR
VP → VB NP PP SBAR PP
It is not only verb phrases that cause this kind of combinatorial explosion: Other
phrases, in particular nonrecursive noun phrases, also contribute a huge number of
rules. The next section considers the distributional properties of the rules in more
detail.
Note that there is good motivation for the Penn Treebank’s decision to repre-
sent rules in this way, rather than with rules expressing Chomsky adjunction (i.e., a
schema in which complements and adjuncts are separated, through rule types VP →
VB {complement}* and VP → VP {adjunct}). First, it allows the argument/adjunct
distinction for PP modifiers to verbs to be left undefined: This distinction was found
to be very difficult for annotators. Second, in the surface ordering (as opposed to deep
structure), adjuncts are often found closer to the head than complements, thereby yield-
ing structures that fall outside the Chomsky adjunction schema. For example, a rule
such as VP → VB NP-C PP SBAR-C is found very frequently in the Penn Treebank;
SBAR complements nearly always extrapose over adjuncts.
7.4.2 Quantifying the Coverage Problem. To quantify the coverage problem, rules
were collected from sections 2–21 of the treebank, and their frequencies are summarized in
Tables 10 and 11.
Table 10
Statistics for rules taken from sections 2–21 of the treebank, with complement markings not
included on nonterminals.
Table 11
Statistics for rules taken from sections 2–21 of the treebank, with complement markings
included on nonterminals.
Rule count     Number of rules    Percentage of rules    Number of rules    Percentage of rules
               (by type)          (by type)              (by token)         (by token)
1              7865               55.00                  7865               0.84
2              1918               13.41                  3836               0.41
3               815                5.70                  2445               0.26
4               528                3.69                  2112               0.22
5               377                2.64                  1885               0.20
6 . . . 10      928                6.49                  7112               0.76
11 . . . 20     595                4.16                  8748               0.93
Table 12
Results on section 0 of the Treebank. The label restricted means the model is restricted to
recovering rules that have been seen in training data. LR = labeled recall. LP = labeled
precision. CBs is the average number of crossing brackets per sentence. 0 CBs and ≤ 2 CBs are
the percentages of sentences with 0 and ≤ 2 crossing brackets, respectively.
Model                     LR     LP     CBs    0 CBs   ≤ 2 CBs
Model 1                   87.9   88.3   1.02   63.9    84.4
Model 1 (restricted)      87.4   86.7   1.19   61.7    81.8
Model 2                   88.8   89.0   0.94   65.9    85.6
Model 2 (restricted)      87.9   87.0   1.19   62.5    82.4
7.4.3 The Impact of Coverage on Accuracy. Parsing experiments were used to assess
the impact of the coverage problem on parsing accuracy. Section 0 of the treebank was
parsed with models 1 and 2 as before, but the parse trees were restricted to include
rules already seen in training data. Table 12 shows the results. Restricting the rules
leads to a 0.5% decrease in recall and a 1.6% decrease in precision for model 1, and a
0.9% decrease in recall and a 2.0% decrease in precision for model 2.
7.4.4 Breaking Down Rules Improves Estimation. Coverage problems are not the
only motivation for breaking down rules. The method may also improve estimation.
To see this, consider the rules headed by told, whose counts are shown in Table 13.
Estimating the probability P(Rule | VP, told) using Charniak's (1997) method would
interpolate two maximum-likelihood estimates, Pml(Rule | VP, told) and Pml(Rule | VP).
Table 13
(a) Distribution over rules with told as the head (from sections 2–21 of the treebank); (b)
distribution over subcategorization frames with told as the head.
Count Rule
70 VP told → VBD NP-C SBAR-C
23 VP told → VBD NP-C
6 VP told → VBD NP-C SG-C
5 VP told → VBD NP-C NP SBAR-C
5 VP told → VBD NP-C : S-C
4 VP told → VBD NP-C PP SBAR-C
4 VP told → VBD NP-C PP
4 VP told → VBD NP-C NP
probability mass must be left to the backed-off estimate Pml (Rule | VP).
This estimation method is missing a crucial generalization: In spite of there being
many different rules, the distribution over subcategorization frames is much sharper.
Told is seen with only five subcategorization frames in training data: The large number
of rules is almost entirely due to adjuncts or punctuation appearing after or between
complements. The estimation method in model 2 effectively estimates the probability
of a rule as
Plc (LC | VP, told) × Prc (RC | VP, told) × P(Rule | VP, told, LC, RC)
The left and right subcategorization frames, LC and RC, are chosen first. The entire
rule is then generated by Markov processes.
Once armed with the Plc and Prc parameters, the model has the ability to learn the
generalization that told appears with a quite limited, sharp distribution over subcatego-
rization frames. Say that these parameters are again estimated through interpolation,
for example

Plc(LC | VP, told) = λ Pml(LC | VP, told) + (1 − λ) Pml(LC | VP)
In this case λ can be quite high. Only five subcategorization frames (as opposed to
26 rule types) have been seen in the 147 cases. The lexically specific distribution
Pml (LC | VP, told) can therefore be quite highly trusted. Relatively little probability
mass is left to the backed-off estimate.
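The point can be made concrete with the counts listed above (our code; only the eight rules reproduced in Table 13(a) are used, so the frame counts below cover 121 of the 147 cases, and the fifth frame mentioned in the text comes from rows not shown in the table).

from collections import Counter

# The rule counts reproduced in Table 13(a) for VP(told); the table's remaining
# rows (26 rule types, 147 cases in total) are not shown here.
rule_counts = {
    ("VBD", "NP-C", "SBAR-C"): 70,
    ("VBD", "NP-C"): 23,
    ("VBD", "NP-C", "SG-C"): 6,
    ("VBD", "NP-C", "NP", "SBAR-C"): 5,
    ("VBD", "NP-C", ":", "S-C"): 5,
    ("VBD", "NP-C", "PP", "SBAR-C"): 4,
    ("VBD", "NP-C", "PP"): 4,
    ("VBD", "NP-C", "NP"): 4,
}

def right_subcat_frame(rhs):
    """Project a rule's RHS onto its right subcategorization frame: the multiset
    of -C nonterminals to the right of the head (VBD); adjuncts and punctuation
    are ignored."""
    return frozenset(Counter(x for x in rhs[1:] if x.endswith("-C")).items())

frame_counts = Counter()
for rhs, count in rule_counts.items():
    frame_counts[right_subcat_frame(rhs)] += count

print(len(rule_counts), "rule types ->", len(frame_counts), "right frames")
for frame, count in frame_counts.most_common():
    print(dict(frame), count)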
In summary, from the distributions in Table 13, the model should be quite uncertain
about what rules told can appear with. It should be relatively certain, however, about
the subcategorization frame. Introducing subcategorization parameters allows the
model to generalize in an important way about rules. We have carefully isolated the
“core” of rules—the subcategorization frame—that the model should be certain about.
We should note that Charniak’s method will certainly have some advantages in
estimation: It will capture some statistical properties of rules that our independence
assumptions will lose (e.g., the distribution over the number of PP adjuncts seen for a
particular head).
(2000) has developed generative statistical models that integrate word sense informa-
tion into the parsing process. Eisner (2002) develops a sophisticated generative model
for lexicalized context-free rules, making use of a probabilistic model of lexicalized
transformations between rules. Blaheta and Charniak (2000) describe methods for the
recovery of the semantic tags in the Penn Treebank annotations, a significant step
forward from the complement/adjunct distinction recovered in model 2 of the cur-
rent article. Charniak (2001) gives measurements of perplexity for a lexicalized PCFG.
Gildea (2001) reports on experiments investigating the utility of different features in
bigram lexical-dependency models for parsing. Miller et al. (2000) develop generative,
lexicalized models for information extraction of relations. The approach enhances non-
terminals in the parse trees to carry semantic labels and develops a probabilistic model
that takes these labels into account. Collins et al. (1999) describe how the models in
PL2 (lwi | Li , P, w)
The next few sections give further explanation of the differences between Charniak’s
8.1.1 Additional Features of Charniak’s Model. There are some notable additional
features of Charniak’s model. First, the rule probabilities are conditioned on the par-
ent of the nonterminal being expanded. Our models do not include this information,
although distinguishing recursive from nonrecursive NPs can be considered a reduced
form of this information. (See section 7.3.2 for a discussion of this distinction; the argu-
ments in that section are also motivation for Charniak’s choice of conditioning on the
parent.) Second, Charniak uses word-class information to smooth probabilities and re-
ports a 0.35% improvement from this feature. Finally, Charniak uses 30 million words
of text for unsupervised training. A parser is trained from the treebank and used to
parse this text; statistics are then collected from this machine-parsed text and merged
with the treebank statistics to train a second model. This gives a 0.5% improvement
in performance.
H The head nonterminal label (VP in the previous profits/rose example). At first
glance this might seem redundant: For example, an S will usually take
a VP as its head. In some cases, however, the head label can vary: For
example, an S can take another S as its head in coordination cases.
lti , t The POS tags for the head and modifier words. Inclusion of these tags al-
lows our models to use POS tags as word class information. Charniak’s
model may be missing an important generalization in this respect. Char-
niak (2000) shows that using the POS tags as word class information in
the model is important for parsing accuracy.
c The coordination flag. This distinguishes, for example, coordination cases from
appositives: Charniak’s model will have the same parameter—P(modifier|
head, NP, NP)—in both of these cases.
8.1.3 The Rule Parameters of Charniak’s Model. The rule parameters in Charniak’s
model are effectively decomposed into our L1 parameters (section 5.1), the head pa-
rameters, and—in models 2 and 3—the subcategorization and gap parameters. This
decomposition allows our model to assign probability to rules not seen in training
data: See section 7.4 for an extensive discussion.
8.1.4 Right-Branching Structures in Charniak’s Model. Our models use distance fea-
tures to encode preferences for right-branching structures. Charniak’s model does not
represent this information explicitly but instead learns it implicitly through rule prob-
abilities. For example, for an NP PP PP sequence, the preference for a right-branching
structure is encoded through a much higher probability for the rule NP → NP PP than
for the rule NP → NP PP PP. (Note that conditioning on the rule’s parent is needed to
8.2 A Comparison to the Models of Jelinek et al. (1994), Magerman (1995), and Ratnaparkhi (1997)
We now make a detailed comparison of our models to the history-based models of Rat-
naparkhi (1997), Jelinek et al. (1994), and Magerman (1995). A strength of these models
is undoubtedly the powerful estimation techniques that they use: maximum-entropy
modeling (in Ratnaparkhi 1997) or decision trees (in Jelinek et al. 1994 and Magerman
1995). A weakness, we will argue in this section, is the method of associating parame-
ters with transitions taken by bottom-up, shift-reduce-style parsers. We give examples
in which this method leads to the parameters’ unnecessarily fragmenting the training
data in some cases or ignoring important context in other cases. Similar observations
have been made in the context of tagging problems using maximum-entropy models
(Lafferty, McCallum, and Pereira 2001; Klein and Manning 2002).
We first analyze the model of Magerman (1995) through three common examples
of ambiguity: PP attachment, coordination, and appositives. In each case a word se-
quence S has two competing structures, T1 and T2 , with associated decision sequences
d1 , . . . , dn and e1 , . . . , em , respectively. Thus the probability of the two structures can
be written as
P(T1 | S) = ∏i=1...n P(di | d1 . . . di−1, S)
P(T2 | S) = ∏i=1...m P(ei | e1 . . . ei−1, S)
It will be useful to isolate the decision between the two structures to a single probability
term. Let the value j be the minimum value of i such that di ≠ ei . Then we can rewrite
the two probabilities as follows:
P(T1 | S) = ∏i=1...j−1 P(di | d1 . . . di−1, S) × P(dj | d1 . . . dj−1, S) × ∏i=j+1...n P(di | d1 . . . di−1, S)
P(T2 | S) = ∏i=1...j−1 P(ei | e1 . . . ei−1, S) × P(ej | e1 . . . ej−1, S) × ∏i=j+1...m P(ei | e1 . . . ei−1, S)
The first thing to note is that ∏i=1...j−1 P(di | d1 . . . di−1, S) = ∏i=1...j−1 P(ei | e1 . . . ei−1, S),
so that these probability terms are irrelevant to the decision between the two structures.
We make one additional assumption, that
∏i=j+1...n P(di | d1 . . . di−1, S) ≈ ∏i=j+1...m P(ei | e1 . . . ei−1, S) ≈ 1
This is justified for the examples in this section, because once the jth decision is made,
the following decisions are practically deterministic. Equivalently, we are assuming
that P(T1 |S) + P(T2 |S) ≈ 1, that is, that very little probability mass is lost to trees other
than T1 or T2 . Given these two equalities, we have isolated the decision between the
two structures to the parameters P(dj |d1 . . . dj−1 , S) and P(ej |e1 . . . ej−1 , S).
Figure 21
(a) and (b) are two candidate structures for the same sequence of words. (c) shows the first
decision (labeled “?”) where the two structures differ. The arc above the NP can go either left
(for verb attachment of the PP, as in (a)) or right (for noun attachment of the PP, as in (b)).
Figure 22
Figure 23
(a) and (b) are two candidate structures for the same sequence of words. (c) shows the first
decision (labeled “?”) in which the two structures differ. The arc above the NP can go either
left (for high attachment (a) of the appositive phrase) or right (for noun attachment (b) of the
appositive phrase).
in John likes Mary and Bill loves Jill, the decision not to coordinate Mary and Bill is made
just after the NP Mary is built. At this point, the verb loves is outside the contextual
window, and the model has no way of telling that Bill is the subject of the following
clause. The model is assigning probability mass to globally implausible structures as
a result of points of local ambiguity in the parsing process.
Some of these problems can be repaired by changing the derivation order or the
conditioning context. Ratnaparkhi (1997) has an additional chunking stage, which
means that the head noun does fall within the contextual window for the coordination
and appositive cases.
9. Conclusions
The models in this article incorporate parameters that track a number of linguistic
phenomena: bigram lexical dependencies, subcategorization frames, the propagation
of slash categories, and so on. The models are generative models in which parse
trees are decomposed into a number of steps in a top-down derivation of the tree
and the decisions in the derivation are modeled as conditional probabilities. With a
careful choice of derivation and independence assumptions, the resulting model has
parameters corresponding to the desired linguistic phenomena.
In addition to introducing the three parsing models and evaluating their perfor-
mance on the Penn Wall Street Journal Treebank, we have aimed in our discussion
(in sections 7 and 8) to give more insight into the models: their strengths and weak-
nesses, the effect of various features on parsing accuracy, and the relationship of the
models to other work on statistical parsing. In conclusion, we would like to highlight
the following points: