Formal Grammars and Parsing
Aims:
In this section we describe several types of formal grammars for natural language processing, parse
trees, and a number of parsing methods, including a bottom-up chart parser in some detail. Read the
subsections below and the recommended readings at the end of this unit.
What is a Grammar?
a formal description of the structure of a language. (Prescriptive grammar is something else: rules
for a high-status variant of a language, e.g. "don't split infinitives".)
What is a Parser?
an algorithm for analysing sentences given a grammar:
- it may give just a yes/no answer to the question Does this sentence conform to the given grammar?
Such a parser is termed an acceptor;
- it may also produce a structure description (a "parse tree") for correct sentences:
Grammar rules:
1. S -> NP VP
2. VP -> V NP
3. NP -> NAME
4. NP -> ART N

Lexical entries:
5. NAME -> John
6. V -> ate
7. ART -> the
8. N -> cat
Formally, a grammar consists of:
A, an alphabet of grammar symbols;
N, a set of non-terminal symbols;
T, a set of terminal symbols (and N U T = A);
P, a set of context-free productions, i.e. objects of the form X -> beta, where X is a member of
N, and beta is a string over the alphabet A;
S, a distinguished non-terminal called the start symbol (think of it as "sentence" in NLP
applications).
E.g.
P = { S -> NP VP, VP -> V NP, NP -> NAME, NP -> ART N, NAME -> John, V -> ate, ART -> the,
N -> cat }.
Notice how the productions were split into grammar rules and lexicon above. N, V, NAME and ART are
called pre-terminal or lexical symbols.
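The split into grammar rules and lexicon can be mirrored directly in code. A possible encoding in Python (the variable names are our own):

```python
# The example grammar, split into phrasal rules and a lexicon,
# represented as plain Python data structures (one possible encoding).

GRAMMAR_RULES = [
    ("S",  ("NP", "VP")),
    ("VP", ("V", "NP")),
    ("NP", ("NAME",)),
    ("NP", ("ART", "N")),
]

# Pre-terminal (lexical) symbols map each word to its categories.
LEXICON = {
    "John": ["NAME"],
    "ate":  ["V"],
    "the":  ["ART"],
    "cat":  ["N"],
}

# Derived sets: non-terminals N, terminals T, and the alphabet A = N U T.
NONTERMINALS = ({lhs for lhs, _ in GRAMMAR_RULES}
                | {c for cats in LEXICON.values() for c in cats})
TERMINALS = set(LEXICON)
ALPHABET = NONTERMINALS | TERMINALS
```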
Types of grammars:
unrestricted grammars
context-sensitive grammars
context-free grammars
regular grammars.
Together these form the Chomsky hierarchy of grammars. The four types of grammar differ in the
type of rewriting rule alpha -> beta that is allowed:
unrestricted grammar. No restrictions on the form that the rules can take. Unrestricted grammars
are not widely used: their extreme power makes them difficult to use;
context-sensitive grammar, or transformational grammar. The length of the string alpha on the
left-hand side of any rule must be less than or equal to the length of the string beta on the right-
hand side of the rule. Equivalently, all the productions must be of the form lambda A
rho -> lambda alpha rho, where lambda and rho are arbitrary (possibly null) strings. lambda and
rho are thought of as the left and right context in which the non-terminal symbol A can be
rewritten as the non-null symbol-string alpha; hence the term context-sensitive grammar. Context-
sensitive production rules can be used, for example, for transforming an active sentence into the
corresponding passive sentence.
context-free grammar, or phrase structure grammar. All rules must be of the form A -> alpha,
where A is a nonterminal symbol and alpha is an arbitrary string of symbols.
regular grammar, or right-linear grammar. All rules take one of two forms: either A -> t or
A -> t N, where A and N are non-terminal symbols and t is a terminal symbol (a member of the
vocabulary). Regular grammar rules are not powerful enough to conveniently describe natural
languages (or even programming languages). They can sometimes be used to describe portions of
languages, and have the advantage that they lead to fast parsing.
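To see why regular rules lead to fast parsing: a right-linear grammar can be recognised in a single left-to-right scan, keeping only the current non-terminal as state. A sketch in Python, using our own toy grammar S -> a S | b (not an example from the text):

```python
# Recognising the language of the right-linear grammar
#   S -> a S | b
# (any number of a's followed by a single b) in one pass.
# The only parsing state needed is the current non-terminal.

RULES = {
    # (nonterminal, terminal) -> next nonterminal, or None when the
    # rule has the form A -> t (the derivation terminates here).
    ("S", "a"): "S",
    ("S", "b"): None,
}

def accepts(string):
    state = "S"                      # start symbol
    for ch in string:
        if state is None or (state, ch) not in RULES:
            return False             # no applicable rule
        state = RULES[(state, ch)]
    return state is None             # must end on a terminating rule
```

Each input symbol is examined exactly once, so recognition time is linear in the length of the string: this is the finite-automaton behaviour that makes regular grammars fast.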
Since the restrictions which define the grammar types apply to the rules, it makes sense to talk of
unrestricted, context-sensitive, context-free, and regular rules.
A grammar generates sentences by rewriting. Starting from the start symbol S:
choose any rule whose LHS occurs in the current string (in a CFG, the LHS must be a non-terminal
symbol);
replace that occurrence of the LHS with the RHS of the rule, producing a new current string.
Repeat until there are no non-terminals remaining in the current string. The current string is then a
sentence in the language generated by the grammar. (Before this, it is termed a sentential form.) E.g.:
Current string         Symbol rewritten
S
=> NP VP               S
=> NAME VP             NP
=> John VP             NAME
=> John V NP           VP
=> John ate NP         V
=> John ate ART N      NP
=> John ate the N      ART
=> John ate the cat    N
Parsing might be the reverse of this process (doing the steps shown above in reverse would constitute a
bottom-up right-to-left parse of John ate the cat.)
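The derivation above can be replayed mechanically. A sketch in Python: the rule sequence below is the one used in the table, and each step rewrites the leftmost occurrence of the rule's LHS:

```python
# Replaying the leftmost derivation of "John ate the cat".
# Each step replaces the leftmost occurrence of the rule's LHS
# with the rule's RHS, exactly as in the table above.

STEPS = [
    ("S",    ["NP", "VP"]),
    ("NP",   ["NAME"]),
    ("NAME", ["John"]),
    ("VP",   ["V", "NP"]),
    ("V",    ["ate"]),
    ("NP",   ["ART", "N"]),
    ("ART",  ["the"]),
    ("N",    ["cat"]),
]

def derive(start, steps):
    string = [start]
    for lhs, rhs in steps:
        i = string.index(lhs)        # leftmost occurrence of the LHS
        string[i:i + 1] = rhs        # splice in the RHS
    return string

# derive("S", STEPS) yields the terminal string John ate the cat.
```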
S -> NP VP
NP -> ART N | NAME
PP -> PREP NP
VP -> V | V NP | V NP PP | V PP
Bottom-up parsing
Chart Parsing
(Section 3.4 of Allen)
The chart is a record of all the substructures (like past the barn) that have ever been built during the
parse. A chart is sometimes also called a well-formed substring table. Actual charts get complex rapidly.
An attempt to parse an utterance as a sentence may fail, but all is not lost: partial analyses, such as an
analysis of plums as an NP, remain on the chart. Successful parsing of the entire utterance as any kind
of structure can be useful.
The algorithm constructs (phrasal or lexical) constituents of a sentence. We shall use the sentence the
green fly flies as an example in describing a NL-oriented parser similar to Earley's algorithm. The
sentence is notionally annotated with positions: our sentence becomes 0the1green2fly3flies4.
In terms of this notation, the parsing process succeeds if an S (sentence) constituent is found covering
positions 0 to 4.
Points (1) to (8) below do not completely specify the order in which parsing steps are carried out: one
reasonable order is to scan a word (as in (2)) and then perform all possible parsing steps as specified in
(3) - (6) before scanning another word. Parsing is completed when the last word has been read and all
possible subsequent parsing steps have been performed.
Parser operations:
(0) The algorithm operates on two data structures: the active chart - a collection of active arcs (see (3)
below) and the constituents (see (2) and (5)). Both are initially empty.
(1) The grammar is considered to include lexical insertion rules: for example, if fly is a word in the
lexicon/vocabulary being used, and if its lexical entry includes the fact that fly may be a N or a V, then
rules of the form N -> fly and V -> fly are considered to be part of the grammar.
(2) As a word (like fly) is scanned, constituents corresponding to its lexical categories are created: in
our example sentence, N1: N -> fly FROM 2 TO 3 and V1: V -> fly FROM 2 TO 3.
(3) If the grammar contains a rule like NP -> ART ADJ N, and a constituent like ART1: ART -> the
FROM m TO n has been found, then an active arc ARC1: NP -> ART1 * ADJ N FROM m TO n
is added to the active chart. (In our example sentence, m would be 0 and n would be 1.) The "*" in an
active arc marks the boundary between found constituents and constituents not (yet) found.
(4) Advancing the "*": If the active chart has an active arc like:
ARC1: NP -> ART1 * ADJ N FROM 0 TO 1
and there is a constituent in the chart of type ADJ (i.e. the first item after the *), say
ADJ1: ADJ -> green FROM 1 TO 2
such that the FROM position in the constituent matches the TO position in the active arc, then the "*"
can be advanced, creating a new active arc:
ARC2: NP -> ART1 ADJ1 * N FROM 0 TO 2
(5) When the "*" advances past the last symbol of an active arc (i.e. every constituent on the RHS has
been found), a new phrasal constituent is created covering the arc's whole span: continuing the
example, once N1: N -> fly FROM 2 TO 3 is found, the arc NP -> ART1 ADJ1 N1 * FROM 0 TO 3
is complete, and the constituent NP1: NP -> ART1 ADJ1 N1 FROM 0 TO 3 is created.
(6) Both lexical and phrasal constituents can be used in steps 3 and 4: e.g. if the grammar contains a rule
S -> NP VP, then as soon as the constituent NP1 discussed in step 5 is created, it will be possible to
make a new active arc S -> NP1 * VP FROM 0 TO 3.
(7) When subsequent constituents are created, they would have names like NP2, NP3, ..., ADJ2, ADJ3, ...
and so on.
(8) The aim of parsing is to get phrasal constituents (normally of type S) whose FROM is 0 and whose
TO is the length of the sentence. There may be several such constituents.
1. S -> NP VP
2. NP -> ART ADJ N
3. NP -> ART N
4. NP -> ADJ N
5. VP -> AUX V NP
6. VP -> V NP
Lexicon:
the... ART
large... ADJ
can... AUX, N, V
hold... N, V
water... N, V
Steps in Parsing:
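To make operations (0)-(8) concrete, here is a minimal bottom-up chart parser in Python for the grammar and lexicon above. The sentence the large can can hold the water is our assumption about the intended example (the text does not give it), and all identifiers are our own. An active arc is stored as (LHS, found, needed, FROM, TO), with the "*" sitting between found and needed:

```python
# A sketch of the chart parser described in points (0)-(8).
# An active arc (lhs, found, needed, frm, to) stands for the dotted
# rule lhs -> found * needed, spanning positions frm..to.

GRAMMAR = [
    ("S",  ["NP", "VP"]),
    ("NP", ["ART", "ADJ", "N"]),
    ("NP", ["ART", "N"]),
    ("NP", ["ADJ", "N"]),
    ("VP", ["AUX", "V", "NP"]),
    ("VP", ["V", "NP"]),
]

LEXICON = {
    "the":   ["ART"],
    "large": ["ADJ"],
    "can":   ["AUX", "N", "V"],
    "hold":  ["N", "V"],
    "water": ["N", "V"],
}

def parse(words):
    constituents = []          # completed constituents: (category, FROM, TO)
    arcs = []                  # the active chart
    agenda = []

    def add_constituent(c):
        if c not in constituents:
            constituents.append(c)
            agenda.append(c)

    for i, word in enumerate(words):
        # (2) scan: one lexical constituent per category of the word
        for cat in LEXICON[word]:
            add_constituent((cat, i, i + 1))
        # perform all possible steps (3)-(6) before the next word
        while agenda:
            cat, frm, to = agenda.pop()
            # (4) advance the "*" over arcs whose next needed symbol
            # matches this constituent and whose TO equals its FROM
            for lhs, found, needed, a_frm, a_to in list(arcs):
                if needed and needed[0] == cat and a_to == frm:
                    if needed[1:]:
                        arc = (lhs, found + [cat], needed[1:], a_frm, to)
                        if arc not in arcs:
                            arcs.append(arc)
                    else:
                        # (5) arc complete: a new phrasal constituent
                        add_constituent((lhs, a_frm, to))
            # (3) start a new active arc for each rule beginning with cat
            for lhs, rhs in GRAMMAR:
                if rhs[0] == cat:
                    if len(rhs) == 1:      # unit rule: immediately complete
                        add_constituent((lhs, frm, to))
                    else:
                        arc = (lhs, [cat], rhs[1:], frm, to)
                        if arc not in arcs:
                            arcs.append(arc)
    return constituents

chart = parse("the large can can hold the water".split())
# (8) success: an S constituent FROM 0 TO the length of the sentence
```

Among the many constituents on the final chart are NP (the large can) from 0 to 3, VP (can hold the water) from 3 to 7, and an S spanning the whole sentence from 0 to 7.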
The output of a parse may also be represented as a structure recording grammatical roles, e.g. for
Jack found a dollar:
s(subj(np(name(jack))),
  mainv(find),
  tense(past),
  obj(np(art(a), head(dollar))))
PP attachment
The boy saw the man on the hill with the telescope
The two visual interpretations correspond to two different parses, coming from different grammar rules
(VP -> V NP PP and NP -> NP PP): in one, the PP with the telescope attaches to the VP (the
telescope is the instrument of seeing); in the other, it attaches to the NP (the man on the hill has the
telescope).
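The two parses can be written down as nested (category, children...) tuples. The tree shapes below follow the two rules VP -> V NP PP and NP -> NP PP, though the internal bracketing of the noun phrases is our own simplification:

```python
# The two parses of the ambiguous sentence, as nested tuples of the
# form (category, child, child, ...), where a leaf is a bare word.

# Attachment to the VP: the seeing was done with the telescope.
vp_attach = ("S",
    ("NP", "The", "boy"),
    ("VP", ("V", "saw"),
           ("NP", ("NP", "the", "man"),
                  ("PP", "on", "the", "hill")),
           ("PP", "with", "the", "telescope")))

# Attachment to the NP: the man on the hill has the telescope.
np_attach = ("S",
    ("NP", "The", "boy"),
    ("VP", ("V", "saw"),
           ("NP", ("NP", ("NP", "the", "man"),
                         ("PP", "on", "the", "hill")),
                  ("PP", "with", "the", "telescope"))))

def leaves(tree):
    """The words at the fringe of a tree, left to right."""
    if isinstance(tree, str):
        return [tree]
    words = []
    for child in tree[1:]:          # tree[0] is the category label
        words += leaves(child)
    return words

# Both trees cover exactly the same words; only the structure differs.
```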
- it is reasonable to ask for syntactically correct programs, but unrealistic to ask for syntactically
correct NL. Written NL material is sometimes correct, but spoken utterances are rarely
grammatical. NL systems must be syntactically and semantically robust;
- some approaches have sought to be semantics-driven, to avoid the problem of how to deal with
syntactically ill-formed text. However, some syntax is essential - else how do we distinguish
between Cyril loves Audrey and Audrey loves Cyril?
Summary: Grammars and Parsing
There are many approaches to parsing and many grammatical formalisms. Some problems in deciding
the structure of a sentence (such as PP attachment) turn out not to be decidable at the syntactic level
alone. We have concentrated on a bottom-up chart parser based on a context-free grammar.