Notes 3
plainly consists of two smaller sentences: I know something and what it is that I know, namely,
that this President enjoys the exercise of military power . This second word group functions
as a single unit in several senses. First of all, it intuitively stands for what it is that I know,
and so is a meaningful unit. Second, one can move it around as one chunk—as in, That this
President enjoys the exercise of military power is something I know . This group of words is
a syntactic unit. Finally, echoing a common theme, there are other word sequences—not just
single words—that can be substituted for it, while retaining grammaticality: I know that the
guy with his finger on the button is the President. To put it bluntly, with our flat finite-state
networks there is no way for us to say that the guy with his finger on the button and the President
can both play the same syntactic roles. As we have seen, this leads to an unwelcome network
duplication. A flat description does not allow us to say that sentences are built out of hierarchical parts.
We’ve already seen that there are several reasons to use hierarchical descriptions for natural
languages. Let’s summarize these here.
• Larger units than single words. Natural languages have word sequences that act as if
they were single units.
• Obvious hierarchy and recursion. Natural language sentences themselves may contain
embedded sentences, and these sentences in turn may contain sentences.
• Succinctness. Natural languages are built by combining a few types of phrases in differ-
ent combinations, like Noun Phrases and Verb Phrases. Important: the phrase names
themselves, and indeed their very existence, are in a sense purely taxonomic, just as word classes are.
(Phrases don’t exist except as our theoretical apparatus requires them to.)
[Diagram: the same hierarchical description shown in four equivalent forms: a network, a tree, a grammar, and an automaton. The grammar panel shows rules such as S → NP VP and NP → Article Noun; the automaton panel shows the corresponding Predict, Complete, Push, Scan, and Pop operations.]
Note how this method introduces a way to say where each phrase begins, ends, and attaches.
How can we do this using our FTN network descriptions? We already have defined one kind
of phrase using our “flat” networks—namely, a Sentence phrase, represented by the whole
network. Why not just extend this to all phrases and represent each by a network of its own?
It’s clear enough what the networks for Noun Phrases and Verb Phrases should be in this case—
just our usual finite-transition diagrams will do. What about the network for Sentences? It
no longer consists of adjacent words, but of adjacent phrases: namely, a Noun Phrase followed
by a Verb Phrase.
We have now answered the begin and end questions, but not the attach question. We must
say how the subnetworks are linked together, and for this we’ll introduce two new arc types to
encode the dominance relation. We now do this by extending our categorization relation from
words to groups of words. To get from state S-0 to state S-1 of the main Sentence network, we
must determine that there is a Noun Phrase at that point in the input.
To check whether there is a Noun Phrase in the input, we must refer to the subnetwork
labeled Noun Phrase. We can encode this reference to the Noun Phrase network in several
ways. One way is just to add jump arcs from the Start state of the Sentence to the first state
of the Noun Phrase subnetwork. This is also very much like a subroutine call: the subnetwork
acts as a named procedure that we can invoke. In this case, we need to know the starting
address of the procedure (the subnetwork) so that we can go and find it. Whatever the means,
we have defined the beginning of a phrase. We now move through the Noun Phrase subnet,
checking that all is-a relations for single words are satisfied. Reaching the final state for that
network, we now have defined the end of a Noun Phrase. Now what? We should not just
stop, but return to the main network that we came from—to the state on the other end of
the Noun Phrase arc, since we have now seen that there is a Noun Phrase at this position in
the sentence. In short, each subnetwork checks the is-a relation for each distinct phrase, while
the main sentence network checks the is-a relation for the sentence as a whole. (Note that in
general we must keep track of the proper return addresses, and we can use a stack for that.)

Figure 1: Using jump arcs for network calls. In general, a stack must be used to keep track of
(arbitrary) return addresses. [Diagram: the Sentence network runs S-0 to S-1 to S-2 via arcs
labeled NP and VP; ε (jump) arcs link these arcs to the Noun Phrase subnet (NP-0, NP-1, NP-3,
with determiner and noun arcs) and to the Verb Phrase subnet (VP-0, VP-1, VP-2, with verb
and noun-phrase arcs), and ε arcs lead back to the main network.]
This arrangement also answers the attachment question: the subunit Noun Phrase fits into
the larger picture of things depending on how we get back to a main network from a subnetwork.
Similarly, we now have to refer to the Verb Phrase subnetwork to check the is-a relation for a
Verb Phrase, and then, returning from there, come back to state S-2, and finish. Note that the
beginning and end of the main network are defined as before for simple finite-state transition
systems, by the start state and end state of the network itself.
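To make the push, scan, and pop operations concrete, here is a small sketch in Common Lisp
(the same language as the Laboratory 2 examples later in this handout). The network format,
state names, and lexicon below are illustrative assumptions, not the course tools, and the search
is a deliberately simplified backtracking traversal; the point is only that an arc labeled with a
subnetwork name is handled by a recursive call, so the host language's call stack holds the
return addresses.

;;; Networks as lists of arcs (from-state label to-state); a label is either
;;; a word category or the name of another network.  (Assumed format.)
(defparameter *networks*
  '((sentence (s-0 np s-1) (s-1 vp s-2))
    (np (np-0 determiner np-1) (np-1 noun np-2))
    (vp (vp-0 verb vp-1) (vp-1 np vp-2))))

(defparameter *starts* '((sentence . s-0) (np . np-0) (vp . vp-0)))
(defparameter *finals* '((sentence . s-2) (np . np-2) (vp . vp-2)))

(defparameter *lexicon*
  '((the . determiner) (a . determiner) (guy . noun) (ball . noun)
    (kicked . verb)))

(defun subnet-p (label) (assoc label *networks*))
(defun category-of (word) (cdr (assoc word *lexicon*)))

(defun traverse (net state words)
  "Try to move from STATE to the final state of NET, consuming a prefix of
WORDS.  Returns the leftover words on success, or :FAIL."
  (if (eq state (cdr (assoc net *finals*)))
      words                                   ; end of this phrase: pop back
      (loop for (from label to) in (cdr (assoc net *networks*))
            when (eq from state)
              do (let ((leftover
                         (cond
                           ;; Subnetwork arc: push into the called network at
                           ;; its start state; returning lands us back here.
                           ((subnet-p label)
                            (traverse label (cdr (assoc label *starts*)) words))
                           ;; Word-category arc: check the is-a relation.
                           ((and words (eq (category-of (first words)) label))
                            (rest words))
                           (t :fail))))
                   (unless (eq leftover :fail)
                     (let ((result (traverse net to leftover)))
                       (unless (eq result :fail)
                         (return result)))))
            finally (return :fail))))

(defun recognize-sentence (words)
  "T if WORDS is a Sentence: the traversal succeeds and consumes every word."
  (let ((leftover (traverse 'sentence 's-0 words)))
    (and (not (eq leftover :fail)) (null leftover))))

;; (recognize-sentence '(the guy kicked the ball))  ; => T
;; (recognize-sentence '(the guy kicked))           ; => NIL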
The revised network now answers the key questions of hierarchical analysis:
• A phrase of a certain type begins by referring to its network description, either as the
initial state of the main network (S-0 in our example), or by a call from one network to
another. The name of a phrase comes from the name of the subnetwork.
• A phrase ends when we reach the final state of the network describing it. This means
that we have completed the construction of a phrase of a particular type.
• A phrase is attached to the phrase named by the network that referred to (called) it.
To look at these answers from a slightly different perspective, note that each basic network is
itself a finite-state automaton that gives the basic linear order of elements in a phrase. The
same linear order of states and arcs imposes a linear order on phrases. This establishes a
precedes relation between every pair of elements (words or phrases). Hierarchical domination
is fixed by the pattern of subnetwork jumps and returns. But because the sentence patterns
described by network and subnetwork traversal must lead all the way from the start state to
Computation & Hierarchical parsing I; Earley’s algorithm 5
Figure 2: [Diagram: a structure for John bought or sold a new car in which the two Verb Phrases, headed by bought and sold, both dominate the single Noun Phrase a new car.]
a final state without interruption, we can see that such a network must relate every pair of
elements (=arc labels) by either dominates or precedes. You can check informally that this
seems to be so. If a phrase, like a Noun Phrase, does not dominate another phrase, like a Verb
Phrase, then either the Noun Phrase precedes the Verb Phrase or the Verb Phrase precedes
the Noun Phrase.
In summary, the hierarchical network system we have described can model languages where
every pair of phrases or words satisfies either the dominates or precedes relation. It is interest-
ing to ask whether this is a necessary property of natural languages. Could we have a natural
language where two words or phrases were not related by either dominates or precedes? This
seems hard to imagine because words and phrases are spoken linearly, and so seem to auto-
matically satisfy the precedes relation if they don’t satisfy dominates. However, remember
that we are interested not just in the external representation of word strings, but also in their
internal (mental and computer) representation. There is nothing that bars us or a computer
from storing the representation of a linear string of words in a non-linear way.
In fact, there are constructions in English and other languages that suggest just this pos-
sibility. If we look at the sentence in Figure 2, John bought or sold a new car,
we note that sold a new car forms a Verb Phrase, as we would expect since sell subcategorizes
for a Noun Phrase. But what about bought? It too demands a Noun Phrase Object. By the
restrictions on subcategorization described in the last chapter, we know that this Noun Phrase
must appear in the same Verb Phrase that bought forms. That is, the Verb Phrase bought
. . . must dominate a new car . This is a problem, because the only way we can do this using
our hierarchical networks is to have bought dominate the Verb Phrase sold a new car as well.
Suppose, though, that we relax the condition that all elements of a sentence must be in either a
dominates or precedes relation. Then we could have the Verb Phrases bought and sold bearing
no dominance or precedence relation to each other. This would be compatible with the picture
in Figure 2. Here we have two Verb Phrases dominating a new car. Neither dominates nor
precedes the other. Note how the subcategorization constraint is met by both phrases.
To define the notion of a language, we define a derivation relation as with FTNs: two
strings of nonterminals and terminals α, β are related via a derivation step, ⇒, according to
grammar G if there is a rule of G, X → ϕ such that α = ψXγ and β = ψϕγ. That is, we can
obtain the string β from α by replacing X by ϕ. The string of nonterminals and terminals
produced at each step of a derivation from S is called a sentential form. ⇒* denotes the
application of zero or more derivation steps.
of zero or more derivation steps. The language generated by a context-free grammar , L(G), is
the set of all the terminal strings that can be derived from S using the rules of G:
L(G) = {w | w ∈ T* and S ⇒* w}
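For instance, with illustrative rules S → NP VP, NP → Det Noun, VP → Verb NP, Det → the,
Noun → guy, Noun → ice-cream, Verb → ate (assumed here just for this example; a similar
grammar appears in section 4), one derivation is

S ⇒ NP VP ⇒ Det Noun VP ⇒ the Noun VP ⇒ the guy VP ⇒ the guy Verb NP ⇒ the guy ate NP ⇒ the guy ate Det Noun ⇒ the guy ate the Noun ⇒ the guy ate the ice-cream

Each intermediate string is a sentential form, S ⇒* the guy ate the ice-cream, and so the guy
ate the ice-cream ∈ L(G).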
Notation: Let L(a) denote the label of a node a in a (phrase structure) tree. Let the set of
node labels (with respect to a grammar) be denoted VN . V = VN ∪ VT (the set of node labels
plus terminal elements).
Definition 2: The categories of a phrase structure tree immediately dominating the actual
words of the sentence correspond to lexical categories and are called preterminals or X^0
categories. These are written in the format, e.g., N^0 → dog, cat, . . . .
Definition 3: A network (grammar) is cyclic if there exists a nonterminal N such that the
nonterminal can be reduced to itself (derive itself). For example, a grammar permitting the
following derivation is cyclic:

N ⇒* N1 ⇒* N
Ambiguity in a context-free grammar is exactly analogous to the notion of different paths
through a finite-state transition network. If a context-free grammar G has at least one sentence
with two or more derivations, then it is ambiguous; otherwise it is unambiguous.
Definition 4: A grammar is infinitely ambiguous if for some sentence there are infinitely many
derivations; otherwise, it is finitely ambiguous.
Example. The following grammar is infinitely ambiguous and cyclic.
Start→ S
S→ S
S→ a
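For instance, the single-word sentence a has the derivations Start ⇒ S ⇒ a, Start ⇒ S ⇒ S ⇒ a,
Start ⇒ S ⇒ S ⇒ S ⇒ a, and so on without bound: one distinct derivation for every number of
applications of the rule S → S.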
Neither of these cases seems to arise in natural language. Consider the second case. Such
rules are generally not meaningful linguistically, because a nonbranching chain introduces no
new description in terms of is-a relationships. For example, consider the grammar fragment,
S → NP VP
NP → NP
NP → Det Noun
This grammar has a derivation tree with NP–NP–NP dominating whatever string of ter-
minals forms a Noun Phrase (such as an ice-cream). Worse still, there could be an arbitrary
number of NPs dominating an ice-cream. This causes computational nightmares; suppose a
parser had to build all of these possibilities—it could go on forever. Fortunately, this extra
layer of description is superfluous. If we know that an ice-cream is the Subject Noun Phrase,
occupying the first two positions in the sentence, then that is all we need to know. The extra
NPs do not add anything new in the way of description, since all they can do is simply repeat
the statement that there is an NP occupying the first two positions. To put the same point
another way, we would expect no grammatical process to operate on the lowest NP without
also affecting the higher NPs in the same way.
We can contrast this nonbranching NP with a branching one:
S → NP VP
NP → NP PP
PP → Prep NP
NP → Det Noun
Here we can have an NP dominating the branching structure NP–PP. We need both NPs,
because they describe different phrases in the input. For example, the lowest NP might domi-
nate an ice-cream and the higher NP might dominate an ice-cream with raspberry toppings—
plainly different things, to an ice-cream aficionado.
Note that like the FTN case, ambiguity is a property of a grammar. However, there are
two ways that ambiguity can arise in a context-free system. First, just as with finite-state
systems, we can have lexical or category ambiguity: one and the same word can be analyzed
in two different ways, for example, as either a Noun or a Verb. For example, we could have
Noun→time or Verb→time. Second, because we now have phrases, not just single words,
one can sometimes analyze the same sequence of word categories in more than one way. For
example, the guy on the hill with the telescope can be analyzed as a single Noun Phrase in
either of two ways: as [the guy [on the hill [with the telescope]]], where the telescope is on the
hill, or as [the guy [on the hill] [with the telescope]], where the guy has the telescope.
The word categories are not any different in these two analyses. The only thing that changes
is how the categories are stitched together. In simple cases, one can easily spot this kind of
structural ambiguity in a context-free system. It often shows up when two different rules share
a common phrase boundary. Here’s an example:
VP→ Verb NP PP
NP→ Det Noun PP
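For instance, taken together with rules like VP → Verb NP and NP → Det Noun, these two
rules let the category string Verb Det Noun Prep Det Noun be analyzed either as
[VP Verb [NP Det Noun] [PP Prep [NP Det Noun]]] or as
[VP Verb [NP Det Noun [PP Prep [NP Det Noun]]]] (assuming PP → Prep NP); the shared PP
boundary is what allows the same categories to be stitched together in two ways. The two
parse trees at the end of section 4 show exactly this kind of ambiguity for John ate ice-cream
on the table.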
(lexicality) N = {X^i | 1 ≤ i ≤ n, X ∈ T}
(centrality) Start = X^n, X ∈ T
(succession) rules in P have the form X^i → α X^(i−1) β
(uniformity, maximality) where α and β are possibly empty strings over the set of "maximal categories (projections)," N_M = {X^n | X ∈ T}
Notation: The superscripts are called bar levels and are sometimes written with overbars
(X-bar, X-double-bar) rather than as superscripts. The items α (not a single category, note)
are called the specifiers of a phrase, while the items β comprise the complements of a phrase.
The most natural example is that of a verb phrase: the objects of the verb phrase are its
COMPlements, while its Specifier might be the Subject Noun Phrase.
Example. Suppose n = 2. Then the X^2 grammar for noun expansions could include the rules:

N^2 → Determiner N^1 P^2
N^1 → N^0 (= noun)
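For instance, taking P^2 to be a prepositional phrase such as on the hill, these rules analyze
the dog on the hill as [N^2 [Determiner the] [N^1 [N^0 dog]] [P^2 on the hill]].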
The X^0 are lexical items (word categories). The X^i are called projections of X, and X is the
head of X^i. Note that the definition enforces the constraint that all full phrases, like noun
phrases (NPs), verb phrases (VPs), etc., are uniformly maximal projections (this restriction is
relaxed in some accounts).
It is not hard to show that the number of bar levels doesn't really matter to the language
that is generated: given, say, an X^3 grammar, we can always find an X^1 grammar that
generates the same language by simply substituting for X^3 and X^2. Conversely, given any X^n
grammar, we can find an X^(n+1) grammar generating the same language: the new grammar
has the same rules with every bar level incremented by 1, plus a new rule X^1 → X^0 for every
lexical item X.
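For example, applying this construction to an X^1 grammar containing the rule N^1 → N^0 P^1
yields the X^2 rules N^2 → N^1 P^2 together with the new rules N^1 → N^0 and P^1 → P^0.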
Unfortunately, Kornai's definition itself is flawed given current views. Let us see what sort
of X-bar theory people actually use nowadays. We can restrict the number of bar levels to 2,
make the rules binary branching, and add rules of the form X^i → α X^i or X^i → X^i β. The
binary branching requirement is quite natural for some systems of semantic interpretation, a
matter to which we will return. On this most recent view, the central assumption of the X-bar
model is that a word category X can function as the head of a phrase and be projected to a
corresponding phrasal category XP by (optionally) adding one of three kinds of modifiers: X
can be projected into an X′ by adding a complement; the X′ can be recursively turned into
another X′ by the addition of a modifying adjunct; and this X′ can be projected into an XP
(X″) by adding a specifier. In other words, the basic skeleton structure for phrases in English
looks like this:

XP → Specifier X′
X′ → Adjunct X′ (recursively)
X′ → X Complement
The Verb criticize is projected into a V′ by adding the noun complement the other; the V′
criticize the other is projected into another V′ by adding the adverbial adjunct bitterly; the
new V′ is projected into a full VP by adding the specifier each. Continuing, consider the modal
auxiliary verb will. Since it is Inflected for tense, we can assign it the category I. I is projected
into an I′ (and ultimately an IP) by the addition of the verb phrase complement [each bitterly
criticize the other]. This I′ is projected into another I′ by adding the adverbial adjunct probably;
this finally forms an IP by adding the specifier (subject) they, so we get a full sentence. Thus
IP = S(entence) phrase, in this case.
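Putting these steps together, the structure just described can be written out with bar levels
marked by primes (a reconstruction for illustration):
[IP they [I′ probably [I′ will [VP each [V′ bitterly [V′ criticize [NP the other]]]]]]].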
Similarly, a phrase like that you should be tired is composed of the Complementizer word
that, which heads up what is now called a Complementizer Phrase or CP (and was sometimes
called an S′ (S-bar) phrase; don't get this confused with the Complement of a phrase), and the
Complement IP (=S) you should be tired; here the specifier and adjunct are missing.
The basic configuration of Specifier-Head and Head-Complement relations may exhaust
what we need to get out of local syntactic relations, so it appears. Thus the apparent tree
structure is really derivative. (Recent linguistic work may show that the binary branching
structure is also derivative from more basic principles.)
2. The left edge of the tree, i.e., its starting position in the input. This is given by a
number indexing the interword position at which we began construction of the tree.
Computation & Hierarchical parsing I; Earley’s algorithm 11
3. The right edge of the tree built so far (redundant with the state set number as before).
4 Earley’s algorithm
Earley’s algorithm is like the state set simulation of a nondeterministic FTN presented earlier,
with the addition of a single new integer representing the starting point of a hierarchical phrase
(since now phrases can start at any point in the input). Given an input of n words, a series of state sets S0 ,
S1 , . . ., Sn is built, where Si contains all the valid items after reading i words. The algorithm
as presented is a simple recognizer; as usual, parsing involves more work.
In theorem-proving terms, the Earley algorithm selects the leftmost nonterminal (phrase)
in a rule as the next candidate to see if one can find a “proof” for it in the input. (By varying
which nonterminal is selected, one can come up with a different strategy for parsing.)
tree rather than a partial linear sequence), one must add a single new integer index to mark
the return address in hierarchical structure.
Note that prediction and completion both act like ε-transitions: they spark parser opera-
tions without consuming any input; hence, one must close each state set construction under
these operations (= we must add all states we can reach after reading i words, including those
reached under ε-transitions).
Question: where is the stack in the Earley algorithm? (Since we need a stack for return
pointers.)
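Before walking through the trace below, here is a minimal recognizer sketch in Common Lisp
that may make the bookkeeping concrete. It is not the Laboratory 2 (gpsg) parser: the grammar
format, the function names, and the one-category-per-word lexicon are assumptions made only
for this sketch, and empty (ε) rules are omitted. Note that there is no explicit stack; each item
carries its own return pointer, namely the index of the state set where its phrase began, and
the complete step consults that set.

;;; Rules as (lhs . rhs); any symbol appearing as an lhs is a nonterminal,
;;; everything else is treated as a lexical category.  (Assumed format.)
(defparameter *grammar*
  '((start s) (s np vp) (np name) (np det noun) (vp v np)))

(defun nonterminal-p (sym) (assoc sym *grammar*))
(defun rules-for (sym)
  (remove-if-not (lambda (r) (eq (first r) sym)) *grammar*))

;; An item is a dotted rule plus the input position where its phrase began:
;; (lhs rhs dot-position start-index).
(defun make-item (lhs rhs dot start) (list lhs rhs dot start))
(defun item-lhs (item) (first item))
(defun item-rhs (item) (second item))
(defun item-dot (item) (third item))
(defun item-start (item) (fourth item))
(defun next-symbol (item)
  (nth (item-dot item) (item-rhs item)))     ; NIL when the dot is at the end

(defun earley-recognize (words categories)
  "WORDS is a list of word symbols; CATEGORIES is an alist giving each word
one lexical category, e.g. ((john . name) ...).  Returns T if WORDS is
derivable from START using *GRAMMAR*."
  (let* ((n (length words))
         (sets (make-array (1+ n) :initial-element nil)))
    ;; State set S0 starts with the dotted Start rules.
    (dolist (r (rules-for 'start))
      (push (make-item 'start (rest r) 0 0) (aref sets 0)))
    (loop for i from 0 to n
          do (let ((agenda (aref sets i)))
               (flet ((add (new)             ; add NEW to Si (and agenda) once
                        (unless (member new (aref sets i) :test #'equal)
                          (push new (aref sets i))
                          (push new agenda))))
                 (loop while agenda
                       do (let* ((item (pop agenda))
                                 (sym (next-symbol item)))
                            (cond
                              ;; COMPLETE: dot at the end.  Advance every item
                              ;; waiting for this phrase in the state set named
                              ;; by the start index -- the "return pointer".
                              ((null sym)
                               (dolist (caller (aref sets (item-start item)))
                                 (when (eq (next-symbol caller) (item-lhs item))
                                   (add (make-item (item-lhs caller)
                                                   (item-rhs caller)
                                                   (1+ (item-dot caller))
                                                   (item-start caller))))))
                              ;; PREDICT: dot before a nonterminal.
                              ((nonterminal-p sym)
                               (dolist (r (rules-for sym))
                                 (add (make-item sym (rest r) 0 i))))
                              ;; SCAN: dot before a lexical category that matches
                              ;; the next word; move the advanced item to Si+1.
                              ((and (< i n)
                                    (eq sym (cdr (assoc (nth i words) categories))))
                               (pushnew (make-item (item-lhs item)
                                                   (item-rhs item)
                                                   (1+ (item-dot item))
                                                   (item-start item))
                                        (aref sets (1+ i))
                                        :test #'equal)))))))
          finally
             ;; Accept if some Start rule is complete over the whole input.
             (return (some (lambda (item)
                             (and (eq (item-lhs item) 'start)
                                  (null (next-symbol item))
                                  (zerop (item-start item))))
                           (aref sets n))))))

;; (earley-recognize '(john ate the ice-cream)
;;                   '((john . name) (ate . v) (the . det) (ice-cream . noun)))
;; => T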
Start → S
S → NP VP
NP → Name
NP → Det Noun
NP → Name PP
PP → Prep NP
VP → V NP
VP → V NP PP
V → ate
Noun → ice-cream
Name → John
Name → ice-cream
Noun → table
Det → the
Prep → on
Let’s follow how this parse works using Earley’s algorithm and the parser used in laboratory
2. (The headings and running count of state numbers aren’t supplied by the parser. Also note
that Start is replaced by *DO*. Some additional duplicated states that are printed during
tracing have been removed for clarity, and comments added.)
(in-package 'gpsg)
(remove-rule-set 'testrules)
(remove-rule-set 'testdict)
(add-rule-set 'testrules 'CFG)
(add-rule-list 'testrules
  '((S ==> NP VP)
    (NP ==> name)
    (NP ==> Name PP)
    (VP ==> V NP)
    (NP ==> Det Noun)
    (PP ==> Prep NP)
    (VP ==> V NP PP)))
John [Name]
1 0 NP ==> NAME . (6) (scan over 3)
1 0 NP ==> NAME . PP (7) (scan over 4)
1 0 S ==> NP . VP (8) (complete 6 to 2)
1 1 PP ==> . PREP NP (9) (predict from 7)
1 1 VP ==> . V NP (10) (predict from 8)
1 1 VP ==> . V NP PP (11) (predict from 8)
ate [V]
2 1 VP ==> V . NP (12) (scan over 10)
2 1 VP ==> V . NP PP (13) (scan over 11)
2 2 NP ==> . NAME (14) (predict from 12/13)
2 2 NP ==> . NAME PP (15) (predict from 12/13)
2 2 NP ==> . DET NOUN (16) (predict from 12/13)
on [Prep]
4 3 PP ==> PREP . NP (24) (scan over 21)
4 4 NP ==> . NAME (25) (predict from 24)
4 4 NP ==> . NAME PP (26) (predict from 24)
4 4 NP ==> . DET NOUN (27) (predict from 24)
the [Det]
5 4 NP ==> DET . NOUN (28) (scan over 27)
table [Noun]
6 4 NP ==> DET NOUN . (29) (scan over 28)
6 3 PP ==> PREP NP . (30) (complete 29 to 24)
6 1 VP ==> V NP PP . (31) (complete 24 to 19)
6 2 NP ==> NAME PP . (32) (complete 24 to 18)
6 0 S ==> NP VP . (33) (complete 8 to 1)
6 0 *DO* ==> S . (34) (complete 1) [parse 1]
6 1 VP ==> V NP . (35) (complete 18 to 12)
6 0 S ==> NP VP . (36) (complete 12 to 1) = 33
[Diagram: paths through completed items such as PP ==> PREP NP . (30), NP ==> NAME PP . (32), and *DO* ==> S . (34).]
Figure 3: Distinct parses lead to distinct state triple paths in the Earley algorithm.
((START
(S (NP (NAME JOHN))
(VP (V ATE) (NP (NAME ICE-CREAM))
(PP (PREP ON) (NP (DET THE) (NOUN TABLE))))))
(START
(S (NP (NAME JOHN))
(VP (V ATE)
(NP (NAME ICE-CREAM) (PP (PREP ON) (NP (DET THE) (NOUN TABLE))))))))
set). In the worst case, the maximum number of distinct items is the maximum number of
dotted rules times the maximum number of distinct return values, or |G| · n. The time to
process a single item can be found by considering separately the scan, predict and complete
actions. Scan and predict are effectively constant time (we can build in advance all the
possible single next-state transitions, given a possible category). The complete action could
force the algorithm to advance the dot in all the items in a state set, which from the previous
calculation, could be |G| · n items, hence proportional to that much time. We can combine
the values as shown below to get an upper bound on the execution time, assuming that the
primitive operations of our computer allow us to maintain lists without duplicates without any
additional overhead (say, by using bit-vectors; if this is not done, then searching through or
ordering the list of states could add another |G| factor). Note as before that grammar size
(measured by the total number of symbols in the rule system) is an important component of
this bound, more so than the input sentence length, as you will see in Laboratory 2.
If there is no ambiguity, then this worst case does not arise (why?). The parse then takes linear
time (why?). If there is only finite ambiguity in the grammar (at each step there is only a
finite number, bounded in advance, of ambiguous attachment possibilities), then the worst-case
time is proportional to n^2.
Maximum number of state sets × Maximum time to build ONE state set
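Combining these quantities gives a rough sketch of the bound (using the estimates above):
there are at most n + 1 state sets, each state set holds at most |G| · n items, and completing
one item may touch up to |G| · n other items, so the total is on the order of
(n + 1) · (|G| · n) · (|G| · n), i.e., O(|G|^2 · n^3) time in the worst case.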