CFG

This document discusses context-free grammars and their components. A context-free grammar consists of a finite set of terminals, non-terminals, a start symbol, and production rules. Production rules take the form of a non-terminal symbol on the left-hand side and a string of terminals and/or non-terminals on the right-hand side. Derivations in a context-free grammar involve replacing the left-hand side of a rule with its right-hand side. A derivation tree can represent a derivation in a compact way. The language generated by a context-free grammar is the set of all strings that can be derived from the start symbol.

Context-free grammars

A context-free grammar (CFG) is a four-tuple ⟨Σ, V, S, P⟩, where:
Σ is a finite, non-empty set of terminals, the alphabet;
V is a finite, non-empty set of grammar variables (categories, or non-terminal symbols), such that V ∩ Σ = ∅;
S ∈ V is the start symbol;
P is a finite set of production rules, each of the form A → α, where A ∈ V and α ∈ (V ∪ Σ)*.
For a rule A → α, A is the rule's head and α is its body.
Context-free grammars
Example: CFG example
Σ = {the, cat, in, hat}
V = {D, N, P, NP, PP}
The start symbol is NP
The rules:
D → the      NP → D N
N → cat      PP → P NP
N → hat      NP → NP PP
P → in
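The four components of such a grammar can be written down directly in Python. The following is a minimal sketch (all names, such as PRODUCTIONS, are illustrative, not from any library): each rule body is a tuple of symbols, and each head maps to the list of its bodies.

```python
# Illustrative encoding of the example CFG (names are our own choices).
TERMINALS = {"the", "cat", "in", "hat"}   # the alphabet Sigma
VARIABLES = {"D", "N", "P", "NP", "PP"}   # the non-terminals V
START = "NP"

# Each production A -> body is stored as head -> list of bodies,
# where a body is a tuple of terminals and/or non-terminals.
PRODUCTIONS = {
    "D":  [("the",)],
    "N":  [("cat",), ("hat",)],
    "P":  [("in",)],
    "NP": [("D", "N"), ("NP", "PP")],
    "PP": [("P", "NP")],
}

# Sanity checks mirroring the definition: V and Sigma are disjoint,
# the start symbol is a variable, heads are variables, and every body
# symbol is a terminal or a variable.
assert VARIABLES.isdisjoint(TERMINALS)
assert START in VARIABLES
for head, bodies in PRODUCTIONS.items():
    assert head in VARIABLES
    for body in bodies:
        assert all(s in VARIABLES or s in TERMINALS for s in body)
```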
Context-free grammars: language
Each non-terminal symbol in a grammar denotes a language.
A rule such as N → cat implies that the language denoted by the
non-terminal N includes the alphabet symbol cat.
The symbol cat here is a single, atomic alphabet symbol, and not
a string of symbols: the alphabet of this example consists of
natural language words, not of natural language letters.
For a more complex rule such as NP → D N, the language
denoted by NP contains the concatenation of the language
denoted by D with that denoted by N: L(NP) ⊇ L(D) · L(N).
Matters become more complicated when we consider recursive rules
such as NP → NP PP.
Context-free grammars: derivation
Given a grammar G = ⟨V, Σ, P, S⟩, we define the set of forms to
be (V ∪ Σ)*: the set of all sequences of terminal and non-terminal
symbols.
Derivation is a relation that holds between two forms, each a
sequence of grammar symbols.
A form α derives a form β, denoted by α ⇒ β, if and only if
α = γₗ A γᵣ and β = γₗ δ γᵣ, and A → δ is a rule in P.
A is called the selected symbol. The rule A → δ is said to be
applicable to α.
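The one-step derivation relation can be sketched directly in Python: scan the form for an occurrence of a rule's head and splice the body in, keeping the left and right contexts intact. (A minimal sketch; the function name and grammar encoding are our own, continuing the dictionary representation above.)

```python
def derives_in_one_step(alpha, beta, productions):
    """Return True iff the form alpha derives beta in one step.

    Forms are tuples of symbols; `productions` maps each head A to a
    list of bodies (tuples), one per rule A -> body.
    """
    for i, symbol in enumerate(alpha):
        # alpha[i] is the selected symbol; try every rule headed by it.
        for body in productions.get(symbol, []):
            # Replace the selected symbol by the rule body, preserving
            # the left context alpha[:i] and the right context alpha[i+1:].
            if alpha[:i] + body + alpha[i + 1:] == beta:
                return True
    return False

P = {"NP": [("D", "N")], "D": [("the",)], "N": [("cat",), ("hat",)]}
assert derives_in_one_step(("NP",), ("D", "N"), P)
assert derives_in_one_step(("D", "N"), ("the", "N"), P)
# A form of terminals only contains no head of any rule, so it derives nothing:
assert not derives_in_one_step(("the", "cat"), ("the", "hat"), P)
```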
Derivation
Example: Forms
The set of non-terminals of G is V = {D, N, P, NP, PP} and the
set of terminals is Σ = {the, cat, in, hat}.
The set of forms therefore contains all the (infinitely many)
sequences of elements from V and Σ, such as ε, NP,
D cat P D hat, D N, the cat in the hat, etc.
Derivation
Example: Derivation
Let us start with a simple form, NP. Observe that it can be
written as γₗ NP γᵣ, where both γₗ and γᵣ are empty. Observe also
that NP is the head of some grammar rule: the rule NP → D N.
Therefore, the form is a good candidate for derivation: if we replace
the selected symbol NP with the body of the rule, while preserving
its environment, we get γₗ D N γᵣ = D N. Therefore, NP ⇒ D N.
Derivation
Example: Derivation
We now apply the same process to D N. This time the selected
symbol is D (we could have selected N, of course). The left context
is again empty, while the right context is γᵣ = N. As there exists
a grammar rule whose head is D, namely D → the, we can replace
the rule's head by its body, preserving the context, and obtain the
form the N. Hence D N ⇒ the N.
Derivation
Example: Derivation
Given the form the N, there is exactly one non-terminal that we
can select, namely N. However, there are two rules that are headed
by N: N → cat and N → hat. We can select either of these rules
to show that both the N ⇒ the cat and the N ⇒ the hat.
Since the form the cat consists of terminal symbols only, no non-
terminal can be selected and hence it derives no form.
Extended derivation
Write α ⇒G β for derivation in G; the subscript is omitted below
when the grammar is clear from the context.
α ⇒ᵏ β if α derives β in k steps:
α ⇒ α₁ ⇒ α₂ ⇒ … ⇒ αₖ and αₖ = β.
The reflexive-transitive closure of ⇒ is ⇒*:
α ⇒* β if α ⇒ᵏ β for some k ≥ 0.
A G-derivation is a sequence of forms α₁, …, αₙ, such that for
every i, 1 ≤ i < n, αᵢ ⇒ αᵢ₊₁.
Extended derivation: example
Example: Derivation
(1) NP ⇒ D N
(2) D N ⇒ the N
(3) the N ⇒ the cat
Extended derivation: example
Example: Derivation
Therefore, we trivially have:
(4) NP ⇒* D N
(5) D N ⇒* the N
(6) the N ⇒* the cat
From (2) and (6) we get
(7) D N ⇒* the cat
and from (1) and (7) we get
(8) NP ⇒* the cat
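The reflexive-transitive closure ⇒* can be spot-checked by a bounded breadth-first search over forms: repeatedly apply every applicable rule at every position. This is a sketch under an explicit step bound (the set of forms is infinite, so unbounded search need not terminate); the function name and grammar encoding are our own.

```python
from collections import deque

def derives_star(alpha, beta, productions, max_steps=10):
    """Bounded check of alpha =>* beta by BFS over forms.

    Forms are tuples of symbols; `productions` maps each head to its
    list of bodies.  The search explores at most `max_steps` derivation
    steps from alpha.
    """
    frontier = deque([(tuple(alpha), 0)])
    seen = {tuple(alpha)}
    while frontier:
        form, steps = frontier.popleft()
        if form == tuple(beta):
            return True
        if steps == max_steps:
            continue
        for i, symbol in enumerate(form):
            for body in productions.get(symbol, []):
                successor = form[:i] + body + form[i + 1:]
                if successor not in seen:
                    seen.add(successor)
                    frontier.append((successor, steps + 1))
    return False

P = {"NP": [("D", "N")], "D": [("the",)], "N": [("cat",), ("hat",)]}
assert derives_star(("NP",), ("the", "cat"), P)   # NP =>* the cat, as above
assert derives_star(("NP",), ("NP",), P)          # =>* is reflexive (k = 0)
assert not derives_star(("NP",), ("cat", "the"), P)
```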
Languages
A form α is a sentential form of a grammar G iff S ⇒* α, i.e., it
can be derived in G from the start symbol.
The (formal) language generated by a grammar G with respect to
a category name (non-terminal) A is L_A(G) = {w ∈ Σ* | A ⇒* w}. The
language generated by the grammar is L(G) = L_S(G).
A language that can be generated by some CFG is a context-free
language, and the class of context-free languages is the set of
languages every member of which can be generated by some CFG.
If no CFG can generate a language L, L is said to be
trans-context-free.
Language of a grammar
Example: Language
For the example grammar (with NP the start symbol):
D → the      NP → D N
N → cat      PP → P NP
N → hat      NP → NP PP
P → in
it is fairly easy to see that L(D) = {the}.
Similarly, L(P) = {in} and L(N) = {cat, hat}.
Language of a grammar
Example: Language
It is more difficult to define the languages denoted by the non-
terminals NP and PP, although it should be straightforward that
the latter is obtained by concatenating {in} with the former.
Proposition: L(NP) is the denotation of the regular expression
the (cat + hat) (in the (cat + hat))*
Language: a formal example Gₑ
Example: Language
S → Va S Vb
S → ε
Va → a
Vb → b
L(Gₑ) = {aⁿbⁿ | n ≥ 0}.
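The claim L(Gₑ) = {aⁿbⁿ | n ≥ 0} can be spot-checked by exhaustively generating every terminal string of bounded length derivable from S. The sketch below performs leftmost derivations only, which suffices for collecting the generated strings, and prunes any form that already contains more terminals than the length bound; the ε-rule is encoded as an empty body. (All names are illustrative.)

```python
def generate_strings(productions, start, max_len=6):
    """Collect all terminal strings of length <= max_len derivable from start."""
    results = set()
    stack = [(start,)]
    seen = {(start,)}
    while stack:
        form = stack.pop()
        heads = [i for i, s in enumerate(form) if s in productions]
        if not heads:
            # No non-terminal left: the form is a terminal string.
            results.add("".join(form))
            continue
        i = heads[0]  # expand the leftmost non-terminal (leftmost derivation)
        for body in productions[form[i]]:
            successor = form[:i] + body + form[i + 1:]
            # Prune forms that already contain more terminals than max_len.
            if sum(1 for s in successor if s not in productions) <= max_len:
                if successor not in seen:
                    seen.add(successor)
                    stack.append(successor)
    return results

# G_e:  S -> Va S Vb | eps,  Va -> a,  Vb -> b  (eps is the empty body)
G_e = {"S": [("Va", "S", "Vb"), ()], "Va": [("a",)], "Vb": [("b",)]}
assert generate_strings(G_e, "S") == {"", "ab", "aabb", "aaabbb"}
```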
Recursion
The language L(Gₑ) is infinite: it includes an infinite number of
words; Gₑ is a finite grammar.
To be able to produce infinitely many words with a finite number
of rules, a grammar must be recursive: there must be at least one
rule whose body contains a symbol, from which the head of the
rule can be derived.
Put formally, a grammar ⟨Σ, V, S, P⟩ is recursive if there exists a
chain of rules, p₁, …, pₙ ∈ P, such that for every 1 ≤ i < n, the
head of pᵢ₊₁ occurs in the body of pᵢ, and the head of p₁ occurs in
the body of pₙ.
In Gₑ, the recursion is simple: the chain of rules is of length 1,
namely the rule S → Va S Vb is in itself recursive.
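The chain-of-rules definition amounts to a cycle in the graph that has an edge from A to B whenever B occurs in the body of some rule headed by A; a grammar is recursive iff some non-terminal can reach itself in that graph. A sketch of that check (names are our own):

```python
def is_recursive(productions):
    """Return True iff some chain of rules loops back to its first head."""
    # Edge A -> B whenever the non-terminal B occurs in a body of A.
    edges = {a: set() for a in productions}
    for head, bodies in productions.items():
        for body in bodies:
            for s in body:
                if s in productions:
                    edges[head].add(s)

    def reaches(src, dst):
        # Iterative DFS: can we get from src back to dst via one or more edges?
        stack, visited = [src], set()
        while stack:
            v = stack.pop()
            for u in edges[v]:
                if u == dst:
                    return True
                if u not in visited:
                    visited.add(u)
                    stack.append(u)
        return False

    return any(reaches(a, a) for a in productions)

G_e = {"S": [("Va", "S", "Vb"), ()], "Va": [("a",)], "Vb": [("b",)]}
assert is_recursive(G_e)              # S occurs in the body of its own rule
assert not is_recursive({"S": [("a",)]})  # a finite, non-recursive grammar
```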
Derivation tree
Sometimes derivations provide more information than is actually
needed. In particular, sometimes two derivations of the same string
differ not in the rules that were applied but only in the order in
which they were applied.
Starting with the form NP it is possible to derive the string the
cat in two ways:
(1) NP ⇒ D N ⇒ D cat ⇒ the cat
(2) NP ⇒ D N ⇒ the N ⇒ the cat
Since both derivations use the same rules to derive the same string,
it is sometimes useful to collapse such equivalent derivations
into one. To this end the notion of derivation trees is introduced.
Derivation tree
A derivation tree (sometimes called parse tree, or simply tree) is a
visual aid in depicting derivations, and a means for imposing
structure on a grammatical string.
Trees consist of vertices and branches; a designated vertex, the
root of the tree, is depicted on the top. Then, branches are simply
connections between two vertices.
Intuitively, trees are depicted upside down, since their root is at
the top and their leaves are at the bottom.
Derivation tree
Example: Derivation tree
An example of a derivation tree for the string the cat in the hat,
shown here in bracketed form:
[NP [NP [D the] [N cat]] [PP [P in] [NP [D the] [N hat]]]]
Derivation tree
Formally, a tree consists of a finite set of vertices and a finite set of
branches (or arcs), each of which is an ordered pair of vertices.
In addition, a tree has a designated vertex, the root, which has two
properties: it is not the target of any arc, and every other vertex is
accessible from it (by following one or more branches).
When talking about trees we sometimes use family notation: if a
vertex v has a branch leaving it which leads to some vertex u, then
we say that v is the mother of u and u is the daughter, or child, of
v. If u has two daughters, we refer to them as sisters.
Derivation trees
Derivation trees are defined with respect to some grammar G, and
must obey the following conditions:
1. every vertex has a label, which is either a terminal symbol, a
non-terminal symbol or ε;
2. the label of the root is the start symbol;
3. if a vertex v has an outgoing branch, its label must be a
non-terminal symbol, the head of some grammar rule; and the
elements in the body of the same rule must be the labels of the
children of v, in the same order;
4. if a vertex is labeled ε, it is the only child of its mother.
Derivation trees
A leaf is a vertex with no outgoing branches.
A tree induces a natural left-to-right order on its leaves; when
read from left to right, the sequence of leaves is called the frontier,
or yield of the tree.
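The yield can be computed by reading the leaves off recursively, left to right. The sketch below uses an illustrative encoding of our own: a tree is either a terminal string (a leaf) or a (label, children) pair.

```python
def tree_yield(tree):
    """Return the frontier (yield) of a derivation tree, left to right.

    A tree is either a terminal string (a leaf) or a (label, children)
    pair, where children is a list of trees.
    """
    if isinstance(tree, str):
        return [tree]
    _label, children = tree
    leaves = []
    for child in children:
        leaves.extend(tree_yield(child))
    return leaves

# The derivation tree for "the cat in the hat" from the example:
t = ("NP",
     [("NP", [("D", ["the"]), ("N", ["cat"])]),
      ("PP", [("P", ["in"]),
              ("NP", [("D", ["the"]), ("N", ["hat"])])])])
assert tree_yield(t) == ["the", "cat", "in", "the", "hat"]
```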
Correspondence between trees and derivations
Derivation trees correspond very closely to derivations.
For a form α, a non-terminal symbol A derives α if and only if α is
the yield of some parse tree whose root is A.
Sometimes there exist different derivations of the same string that
correspond to a single tree. In fact, the tree representation
collapses exactly those derivations that differ from each other only
in the order in which rules are applied.
Correspondence between trees and derivations
Consider again the tree for the cat in the hat, in bracketed form:
[NP [NP [D the] [N cat]] [PP [P in] [NP [D the] [N hat]]]]
Each non-leaf vertex in the tree corresponds to some grammar rule
(since it must be labeled by the head of some rule, and its children
must be labeled by the body of the same rule).
Correspondence between trees and derivations
This tree represents the following derivations (among others):
(1) NP ⇒ NP PP ⇒ D N PP ⇒ D N P NP
⇒ D N P D N ⇒ the N P D N
⇒ the cat P D N ⇒ the cat in D N
⇒ the cat in the N ⇒ the cat in the hat
(2) NP ⇒ NP PP ⇒ D N PP ⇒ the N PP
⇒ the cat PP ⇒ the cat P NP
⇒ the cat in NP ⇒ the cat in D N
⇒ the cat in the N ⇒ the cat in the hat
(3) NP ⇒ NP PP ⇒ NP P NP ⇒ NP P D N
⇒ NP P D hat ⇒ NP P the hat
⇒ NP in the hat ⇒ D N in the hat
⇒ D cat in the hat ⇒ the cat in the hat
Correspondence between trees and derivations
While exactly the same rules are applied in each derivation (the
rules are uniquely determined by the tree), they are applied in
different orders. In particular, derivation (2) is a leftmost
derivation: in every step the leftmost non-terminal symbol of a
form is expanded. Similarly, derivation (3) is rightmost.
Ambiguity
Sometimes, however, different derivations (of the same string!)
correspond to different trees.
This can happen only when the derivations differ in the rules which
they apply.
When more than one tree exists for some string, we say that the
string is ambiguous.
Ambiguity is a major problem when grammars are used for certain
formal languages, in particular programming languages. But for
natural languages, ambiguity is unavoidable as it corresponds to
properties of the natural language itself.
Ambiguity: example
Consider again the example grammar and the following string:
the cat in the hat in the hat
Intuitively, there can be (at least) two readings for this string: one
in which a certain cat wears a hat-in-a-hat, and one in which a
certain cat-in-a-hat is inside a hat:
((the cat in the hat) in the hat)
(the cat in (the hat in the hat))
This distinction in intuitive meaning is reflected in the grammar,
and hence two different derivation trees, corresponding to the two
readings, are available for this string:
Ambiguity: example
The first reading, ((the cat in the hat) in the hat), in bracketed form:
[NP [NP [NP [D the] [N cat]] [PP [P in] [NP [D the] [N hat]]]]
    [PP [P in] [NP [D the] [N hat]]]]
Ambiguity: example
The second reading, (the cat in (the hat in the hat)), in bracketed form:
[NP [NP [D the] [N cat]]
    [PP [P in] [NP [NP [D the] [N hat]]
                   [PP [P in] [NP [D the] [N hat]]]]]]
Ambiguity: example
Using linguistic terminology, in the first tree the second occurrence
of the prepositional phrase in the hat modifies the noun phrase the
cat in the hat, whereas in the second tree it only modifies the (first
occurrence of the) noun phrase the hat. This situation is known as
syntactic or structural ambiguity.
Grammar equivalence
It is common in formal language theory to relate different
grammars that generate the same language by an equivalence
relation:
Two grammars G₁ and G₂ (over the same alphabet Σ) are
equivalent (denoted G₁ ≡ G₂) iff L(G₁) = L(G₂).
We refer to this relation as weak equivalence, as it only relates the
generated languages. Equivalent grammars may attribute totally
different syntactic structures to members of their (common)
languages.
Grammar equivalence
Example: Equivalent grammars, different trees
Following are two different tree structures that are attributed to the
string aabb by the grammars Gₑ and G_f, respectively. Under Gₑ,
in bracketed form:
[S [Va a] [S [Va a] [S ε] [Vb b]] [Vb b]]
The grammar G_f assigns a different structure to the same string.
Grammar equivalence
Example: Structural ambiguity
A grammar, G_arith, for simple arithmetic expressions:
S → a | b | c | S + S | S × S
Two different trees can be associated by G_arith with the string
a + b × c, in bracketed form:
[S [S a] + [S [S b] × [S c]]]    and    [S [S [S a] + [S b]] × [S c]]
Grammar equivalence
The weak equivalence relation is stated in terms of the generated
language. Consequently, equivalent grammars do not have to be
described in the same formalism to be equivalent. We will
later see how grammars, specified in different formalisms, can be
compared.
Normal form
It is convenient to divide grammar rules into two classes: one that
contains only phrasal rules of the form A → α, where α ∈ V*, and
another that contains only terminal rules of the form B → σ, where
σ ∈ Σ. It turns out that every CFG is equivalent to some CFG of
this form.
Normal form
A grammar G is in phrasal/terminal normal form iff for every
production A → α of G, either α ∈ V* or α ∈ Σ. Productions of
the form A → σ are called terminal rules, and A is said to be a
pre-terminal category, the lexical entry of σ. Productions of the
form A → α, where α ∈ V*, are called phrasal rules. Furthermore,
every category is either pre-terminal or phrasal, but not both. For
a phrasal rule with α = A₁ … Aₙ, w = w₁ … wₙ, w ∈ L_A(G) and
wᵢ ∈ L_{Aᵢ}(G) for i = 1, …, n, we say that w is a phrase of category
A, and each wᵢ is a sub-phrase (of w) of category Aᵢ. A
sub-phrase wᵢ of w is also called a constituent of w.
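The normal-form condition on rule bodies is easy to check mechanically: every body must be either a string of non-terminals or a single terminal. A sketch (the "either pre-terminal or phrasal, but not both" condition on categories is left out for brevity; all names are our own):

```python
def in_normal_form(productions, terminals):
    """Check the phrasal/terminal condition on every rule body:
    each body is a (possibly empty) string of non-terminals,
    or a single terminal."""
    for head, bodies in productions.items():
        for body in bodies:
            phrasal = all(s in productions for s in body)
            terminal = len(body) == 1 and body[0] in terminals
            if not (phrasal or terminal):
                return False
    return True

SIGMA = {"the", "cat", "in", "hat"}
G = {"D": [("the",)], "N": [("cat",), ("hat",)], "P": [("in",)],
     "NP": [("D", "N"), ("NP", "PP")], "PP": [("P", "NP")]}
assert in_normal_form(G, SIGMA)
# A rule mixing a terminal and a non-terminal in one body violates it:
assert not in_normal_form({"NP": [("the", "N")], "N": [("cat",)]}, SIGMA)
```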
Context-free grammars for natural languages
A context-free grammar for English sentences: G = ⟨V, Σ, P, S⟩
where V = {D, N, P, NP, PP, V, VP, S}, Σ = {the, cat, in, hat,
sleeps, smile, loves, saw}, the start symbol is S and P is the
following set of rules:
S → NP VP      D → the
NP → D N       N → cat
NP → NP PP     N → hat
PP → P NP      V → sleeps
VP → V         P → in
VP → VP NP     V → smile
VP → VP PP     V → loves
               V → saw
Context-free grammars for natural languages
The augmented grammar can derive strings such as the cat sleeps
or the cat in the hat saw the hat.
A derivation tree for the cat sleeps, in bracketed form:
[S [NP [D the] [N cat]] [VP [V sleeps]]]
Context-free grammars for natural languages
There are two major problems with this grammar.
1. It ignores the valence of verbs: there is no distinction among
subcategories of verbs, and an intransitive verb such as sleep
might occur with a noun phrase complement, while a
transitive verb such as love might occur without one. In such
a case we say that the grammar overgenerates: it generates
strings that are not in the intended language.
2. There is no treatment of subject-verb agreement, so that a
singular subject such as the cat might be followed by a plural
form of verb such as smile. This is another case of
overgeneration.
Both problems are easy to solve.
Verb valence
To account for valence, we can replace the non-terminal symbol V
by a set of symbols: Vtrans, Vintrans, Vditrans etc. We must also
change the grammar rules accordingly:
VP → Vintrans         Vintrans → sleeps
VP → Vtrans NP        Vintrans → smile
VP → Vditrans NP PP   Vtrans → loves
                      Vditrans → give
Agreement
To account for agreement, we can again extend the set of
non-terminal symbols, such that categories that must agree reflect,
in the non-terminal that is assigned for them, the features on which
they agree. In the very simple case of English, it is sufficient to
multiply the set of nominal and verbal categories, so that we
get Dsg, Dpl, Nsg, Npl, NPsg, NPpl, Vsg, Vpl, VPsg, VPpl etc.
We must also change the set of rules accordingly:
Agreement
Nsg → cat     Npl → cats
Nsg → hat     Npl → hats
P → in
Vsg → sleeps  Vpl → sleep
Vsg → smiles  Vpl → smile
Vsg → loves   Vpl → love
Vsg → saw     Vpl → saw
Dsg → a       Dpl → many
Agreement
S → NPsg VPsg      S → NPpl VPpl
NPsg → Dsg Nsg     NPpl → Dpl Npl
NPsg → NPsg PP     NPpl → NPpl PP
PP → P NP
VPsg → Vsg         VPpl → Vpl
VPsg → VPsg NP     VPpl → VPpl NP
VPsg → VPsg PP     VPpl → VPpl PP
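That the multiplied categories really block agreement violations can be checked with a small, bounded membership test. The sketch below encodes only the core S, NP, VP and lexical rules (the PP rules are omitted for brevity) and relies on every rule body being non-empty, so any form longer than the target string can be pruned; all names are our own.

```python
def generates(productions, start, words):
    """Bounded check of whether `words` is derivable from `start`.

    Assumes every rule body is non-empty, so each symbol yields at
    least one terminal and forms longer than the target are dead ends.
    """
    target = tuple(words)
    stack, seen = [(start,)], {(start,)}
    while stack:
        form = stack.pop()
        if form == target:
            return True
        if len(form) > len(target):
            continue  # every symbol yields >= 1 terminal: too long already
        nts = [i for i, s in enumerate(form) if s in productions]
        if not nts:
            continue  # all terminals, but not equal to the target
        i = nts[0]    # expand the leftmost non-terminal
        for body in productions[form[i]]:
            successor = form[:i] + body + form[i + 1:]
            if successor not in seen:
                seen.add(successor)
                stack.append(successor)
    return False

# Core agreement fragment (PP rules left out for this demonstration):
AGR = {
    "S":    [("NPsg", "VPsg"), ("NPpl", "VPpl")],
    "NPsg": [("Dsg", "Nsg")],   "NPpl": [("Dpl", "Npl")],
    "VPsg": [("Vsg",)],         "VPpl": [("Vpl",)],
    "Dsg":  [("a",)],           "Dpl":  [("many",)],
    "Nsg":  [("cat",)],         "Npl":  [("cats",)],
    "Vsg":  [("sleeps",)],      "Vpl":  [("sleep",)],
}
assert generates(AGR, "S", "a cat sleeps".split())
assert generates(AGR, "S", "many cats sleep".split())
assert not generates(AGR, "S", "a cat sleep".split())      # blocked
assert not generates(AGR, "S", "many cats sleeps".split()) # blocked
```

The original, agreement-free grammar would have derived all four strings; the multiplied categories rule out the mismatched ones.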
Context-free grammars for natural languages
Context-free grammars can be used for a variety of syntactic
constructions, including some non-trivial phenomena such as
unbounded dependencies, extraction, extraposition etc.
However, some (formal) languages are not context-free, and
therefore there are certain sets of strings that cannot be generated
by context-free grammars.
The interesting question, of course, involves natural languages: are
there natural languages that are not context-free? Are context-free
grammars sufficient for generating every natural language?
