l5 CFG
Context-Free Grammars – p. 1
Introduction
We will introduce a class of languages larger than
the regular languages, called the context-free languages.
These languages have a natural, recursive formal
representation called a context-free grammar (CFG).
CFGs play a central role in compiler technology.
More recently, they have been used to describe document
formats in XML (Extensible Markup Language).
CFGs are equivalent in expressive power to a class of
automata called pushdown automata.
An Informal Example
Consider the language of palindromes, where a
palindrome is a string that reads the same forwards
and backwards: $L_p = \{w \mid w = w^R\}$.
There is a natural, recursive definition for $L_p$:
basis: ε, 0 and 1 are in $L_p$.
induction: if w is in $L_p$, so are 0w0 and 1w1.
A context-free grammar is a formal notation to
express such a recursive definition for languages.
Variables are used to represent classes of strings and
the relations between variables are specified.
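The recursive definition of the palindromes can be turned directly into a membership test. The following is a minimal sketch, assuming strings over the alphabet {0, 1}; the function name `in_lp` is chosen here for illustration only.

```python
def in_lp(w: str) -> bool:
    """Decide membership in L_p by following the recursive definition."""
    # Basis: the empty string, "0" and "1" are palindromes.
    if w in ("", "0", "1"):
        return True
    # Induction: w must have the form 0x0 or 1x1 with x in L_p.
    if len(w) >= 2 and w[0] == w[-1] and w[0] in "01":
        return in_lp(w[1:-1])
    return False


print(in_lp("0110"), in_lp("011"))
```

Each recursive call strips one matching pair of outer symbols, which mirrors one use of the induction rule.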
CFG for the Palindromes
For $L_p$, we need only one variable S to represent the
set of palindromes.
With S, the set of rules P of the CFG is
\[
P:\quad
\begin{aligned}
S &\to \varepsilon \\
S &\to 0 \\
S &\to 1 \\
S &\to 0S0 \\
S &\to 1S1
\end{aligned}
\]
Productions of CFG
Each rule in a CFG consists of a head, an arrow and
a body, in the form head → body.
The head must be a single variable.
The body is a string of zero or more terminals
and variables. It represents one way to form
strings in the language of the head variable.
The notation for productions can be more compact.
We can group all productions headed by variable A,
and call them A-productions. If the bodies are
$\alpha_1, \ldots, \alpha_n$, we may write the A-productions as
\[ A \to \alpha_1 \mid \alpha_2 \mid \cdots \mid \alpha_n \]
Derivations
There are two basic methods to use the productions:
recursive inference, where we use the rules to go
from body to head.
derivations, where we use the rules to go from
head to body.
The language of a CFG is the set of all terminal
strings that can be obtained by derivations, starting
with the start symbol S.
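Recursive inference can be pictured as a bottom-up computation that keeps going from bodies to heads until nothing new is found. The sketch below is one way to do this under a length bound (the bound `max_len` and the dictionary encoding of the rules are assumptions of this example, introduced so that the loop terminates).

```python
from itertools import product

RULES = {"S": [(), ("0",), ("1",), ("0", "S", "0"), ("1", "S", "1")]}


def infer(rules, max_len):
    """Recursive inference: go from bodies to heads until nothing new appears."""
    known = {head: set() for head in rules}  # strings inferred for each variable
    changed = True
    while changed:
        changed = False
        for head, bodies in rules.items():
            for body in bodies:
                # A terminal contributes itself; a variable contributes
                # any string already inferred to be in its language.
                choices = [sorted(known[x]) if x in rules else [x] for x in body]
                for parts in product(*choices):
                    w = "".join(parts)
                    if len(w) <= max_len and w not in known[head]:
                        known[head].add(w)
                        changed = True
    return known


print(sorted(infer(RULES, 3)["S"], key=lambda s: (len(s), s)))
```

For the palindrome rules and bound 3 this collects the nine palindromes over {0, 1} of length at most 3.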
Extending Derivation Rules
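For convenience, we define a new symbol $\Rightarrow_G$. If
$A \to \gamma$ is a production of G, then
$\alpha A \beta \Rightarrow_G \alpha \gamma \beta$. (G is
often omitted in the notation if it is obvious.)
We may extend the $\Rightarrow$ relationship to represent zero,
one or more derivation steps, by $\Rightarrow^*$. More precisely,
basis: $\alpha \Rightarrow^* \alpha$ for any $\alpha \in (V \cup T)^*$.
induction: if $\alpha \Rightarrow^* \beta$ and $\beta \Rightarrow \gamma$, then $\alpha \Rightarrow^* \gamma$.
Put another way, if $\alpha \Rightarrow^* \beta$, then there exists a
sequence $\gamma_1, \ldots, \gamma_n$ such that $\alpha = \gamma_1$, $\beta = \gamma_n$ and
$\gamma_i \Rightarrow \gamma_{i+1}$ for all $i = 1, \ldots, n - 1$.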
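The two relations can be sketched in code as follows. Sentential forms are represented as tuples of symbols, `one_step` computes every form reachable in a single derivation step, and `derives` checks the starred relation up to a step limit. The limit `max_steps` is an assumption of this sketch, since the relation itself places no bound on the number of steps.

```python
RULES = {"S": [(), ("0",), ("1",), ("0", "S", "0"), ("1", "S", "1")]}


def one_step(form, rules):
    """All forms beta with form => beta: rewrite one variable occurrence."""
    results = set()
    for i, symbol in enumerate(form):
        for body in rules.get(symbol, []):
            results.add(form[:i] + body + form[i + 1:])
    return results


def derives(alpha, beta, rules, max_steps):
    """Bounded check of alpha =>* beta using at most max_steps steps."""
    frontier = {alpha}
    for _ in range(max_steps + 1):
        if beta in frontier:  # zero steps is allowed (the basis)
            return True
        frontier = {g for f in frontier for g in one_step(f, rules)}
    return False


print(derives(("S",), ("0", "1", "0"), RULES, 2))
```

Here S derives 010 in two steps, through the intermediate form 0S0.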
Leftmost and Rightmost Derivations
To turn a variable into a string of terminals, the
leftmost derivation requires that in each step, we
replace the leftmost variable by one of its bodies.
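This is denoted by $\Rightarrow_{lm}$ or $\Rightarrow^*_{lm}$. Similarly,
$\Rightarrow_{rm}$ and $\Rightarrow^*_{rm}$ denote the rightmost derivation.
Any derivation of a terminal string w has an equivalent
leftmost (and rightmost) derivation. That is,
$A \Rightarrow^* w$ iff $A \Rightarrow^*_{lm} w$ (and $A \Rightarrow^*_{rm} w$).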
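A leftmost derivation is fully determined by the sequence of bodies chosen, because the variable to replace is always the leftmost one. The sketch below illustrates this with a small expression grammar that is assumed here for the example (E → E+E | E*E | (E) | a); the function names are likewise illustrative.

```python
RULES = {"E": [("E", "+", "E"), ("E", "*", "E"), ("(", "E", ")"), ("a",)]}


def leftmost_step(form, body, rules):
    """Replace the leftmost variable of form by the given body."""
    for i, symbol in enumerate(form):
        if symbol in rules:
            if body not in rules[symbol]:
                raise ValueError(f"{symbol} -> {body} is not a production")
            return form[:i] + body + form[i + 1:]
    raise ValueError("no variable left to replace")


def leftmost_derivation(start, bodies, rules):
    """Apply the chosen bodies one after another, always at the leftmost variable."""
    forms = [(start,)]
    for body in bodies:
        forms.append(leftmost_step(forms[-1], body, rules))
    return forms


choices = [("E", "*", "E"), ("a",), ("(", "E", ")"), ("E", "+", "E"), ("a",), ("a",)]
for form in leftmost_derivation("E", choices, RULES):
    print("".join(form))
```

A rightmost derivation could be sketched the same way by scanning the form from the right.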
The Language of a Grammar
Let G = (V, T, P, S) be a CFG. The language of G,
denoted by L(G), is the set of terminal strings that
have derivations from the start symbol. That is,
$L(G) = \{w \in T^* \mid S \Rightarrow^*_G w\}$.
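For a small grammar, the strings of L(G) up to a given length can be listed by searching through the sentential forms reachable from the start symbol. The following sketch explores leftmost derivations only. It assumes that pruning by the number of terminals is enough to make the search finite, which holds for the palindrome grammar used here but not for every CFG (for example, one where variables can multiply without producing terminals).

```python
from collections import deque

RULES = {"S": [(), ("0",), ("1",), ("0", "S", "0"), ("1", "S", "1")]}


def language_up_to(rules, start, max_len):
    """Terminal strings w with |w| <= max_len such that start =>* w."""
    seen = {(start,)}
    queue = deque(seen)
    words = set()
    while queue:
        form = queue.popleft()
        # Find the leftmost variable, if any.
        index = next((i for i, x in enumerate(form) if x in rules), None)
        if index is None:
            words.add("".join(form))  # no variable left: a terminal string
            continue
        for body in rules[form[index]]:
            new = form[:index] + body + form[index + 1:]
            terminals = sum(1 for x in new if x not in rules)
            # Prune: terminals never disappear, so longer forms are useless.
            if terminals <= max_len and new not in seen:
                seen.add(new)
                queue.append(new)
    return words


print(sorted(language_up_to(RULES, "S", 3), key=lambda s: (len(s), s)))
```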
Sentential Forms
Derivations from the start symbol produce strings
called “sentential forms”. That is, any $\alpha \in (V \cup T)^*$
is a sentential form if $S \Rightarrow^* \alpha$.
Note that L(G) is the set of sentential forms in $T^*$.
For example, for the CFG from Fig 5.2,
E ⇒ E ∗ E ⇒ E ∗ (E) ⇒ E ∗ (E + E)
So E ∗ E, E ∗ (E) and E ∗ (E + E) are all
sentential forms.
Parse Trees
The derivation of a sentence can be represented by a
(parse) tree. It shows clearly how the symbols of a
terminal string are grouped into substrings, each of
which belongs to a variable in the grammar.
When used in a compiler, this “parse tree” is the data
structure to represent the source program. It enables
a natural translation of source code to the executable.
The matter of ambiguity will also be studied, where
a terminal string can have more than one parse tree.
Construction of Parse Trees
A parse tree for grammar G = (V, T, P, S) satisfies
the following conditions:
Each interior node is labeled by a variable.
Each leaf is labeled by a variable, a terminal, or ε. If
it is ε, then it must be the only child of its parent.
An interior node labeled A can have its children labeled
by $X_1, \ldots, X_k$ from left to right only if
\[ A \to X_1 X_2 \cdots X_k \]
is a production in P.
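These conditions can be checked mechanically. The sketch below uses a hypothetical `Node` class and the palindrome rules; the names `Node`, `EPSILON` and `is_parse_tree` are introduced only for this illustration.

```python
from dataclasses import dataclass, field

EPSILON = "ε"
RULES = {"S": [(), ("0",), ("1",), ("0", "S", "0"), ("1", "S", "1")]}


@dataclass
class Node:
    label: str
    children: list = field(default_factory=list)


def is_parse_tree(node, rules):
    """Check the parse tree conditions for the subtree rooted at node."""
    if not node.children:
        return True  # a leaf may be a variable, a terminal, or epsilon
    if node.label not in rules:
        return False  # interior nodes must be variables
    labels = tuple(child.label for child in node.children)
    if labels == (EPSILON,):
        body = ()  # epsilon is the only child: the body is empty
    elif EPSILON in labels:
        return False  # epsilon may not have siblings
    else:
        body = labels
    # The children must spell a body of the node's variable.
    return body in rules[node.label] and all(
        is_parse_tree(child, rules) for child in node.children
    )


good = Node("S", [Node("0"), Node("S", [Node("1")]), Node("0")])
bad = Node("S", [Node("0"), Node("1")])
print(is_parse_tree(good, RULES), is_parse_tree(bad, RULES))
```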
The Yield of a Parse Tree
If we read the leaves of a parse tree from left to
right, we get a string called the yield of the tree.
It is always a string that can be derived from the root variable.
The yields of those trees with S as root and only terminal
symbols or ε as leaves are the strings in the language of
the underlying grammar.
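Computing the yield is a left-to-right traversal of the leaves. A minimal sketch, assuming a hypothetical `Node` class with a label and a list of children, and assuming ε leaves contribute the empty string:

```python
from dataclasses import dataclass, field

EPSILON = "ε"


@dataclass
class Node:
    label: str
    children: list = field(default_factory=list)


def tree_yield(node):
    """Concatenate the leaf labels from left to right; ε contributes nothing."""
    if not node.children:
        return "" if node.label == EPSILON else node.label
    return "".join(tree_yield(child) for child in node.children)


# Tree for S => 0S0 => 01S10 => 0110 in the palindrome grammar.
tree = Node("S", [
    Node("0"),
    Node("S", [Node("1"), Node("S", [Node(EPSILON)]), Node("1")]),
    Node("0"),
])
print(tree_yield(tree))  # prints 0110
```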
Inference, Derivation and Parse Tree
Given G = (V, T, P, S), the following are equivalent:
1. The recursive inference determines that $w \in L(A)$,
the language of variable A.
2. $A \Rightarrow^* w$.
3. $A \Rightarrow^*_{lm} w$.
4. $A \Rightarrow^*_{rm} w$.
5. There is a parse tree with root A and yield w.
Except for the recursive inference, all conditions are also
equivalent when w is a string with some variables.
From Inference to Tree
We will follow Figure 5.7 to show the equivalence.
First, (1) ⇒ (5). We prove this by induction on the
number of steps used to infer $w \in L(A)$.
basis: one step. Only the basis of the inference
has been used, so there must be a production
$A \to w$. The tree is trivial: root A with the symbols
of w as its children (or a single leaf ε if w = ε).
induction: n + 1 steps. Suppose the last step of the
inference for $w \in L(A)$ uses the production
$A \to X_1 \cdots X_k$. We can break w into $w_1 \cdots w_k$, where
$w_i = X_i$ if $X_i$ is a terminal, and $w_i \in L(X_i)$ if $X_i$
is a variable, in which case there is a parse tree with
root $X_i$ and yield $w_i$ by the induction hypothesis.
Connecting these parse trees under a new root A gives
the parse tree for w.
From Tree to Derivation
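The second step in showing equivalence is to
construct a leftmost derivation from a parse tree.
(5) ⇒ (3) is shown by induction on the tree height.
basis: height 1. The tree has root A whose children
spell the terminal string w, so $A \to w$ is a
production and $A \Rightarrow_{lm} w$.
induction: height n + 1. The root A has
children $X_1, \ldots, X_k$ from left to right, so
$A \to X_1 \cdots X_k$ is a production. We can
partition the string w into $w_1 \cdots w_k$, where
$w_i = X_i$ if $X_i$ is a terminal, and
$X_i \Rightarrow^*_{lm} w_i$ by the induction hypothesis if $X_i$
is a variable. Starting from $A \Rightarrow_{lm} X_1 \cdots X_k$ and
applying these derivations for each $X_i$, $i = 1, \ldots, k$,
from left to right, we have the leftmost
derivation $A \Rightarrow^*_{lm} w$.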
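The construction in this proof can be carried out on a concrete tree: keep the current sentential form as a list of tree nodes and repeatedly expand the leftmost node that still has children. The sketch below assumes a hypothetical `Node` class and a tree whose leaves are all terminals or ε, as in the statement being proved.

```python
from dataclasses import dataclass, field

EPSILON = "ε"


@dataclass
class Node:
    label: str
    children: list = field(default_factory=list)


def leftmost_derivation(root):
    """List the sentential forms of the leftmost derivation encoded by a tree."""
    current = [root]
    forms = [root.label]
    while True:
        # The leftmost interior node is the leftmost unexpanded variable.
        index = next((i for i, n in enumerate(current) if n.children), None)
        if index is None:
            return forms
        # Replace the node by its children; an ε leaf stands for the empty body.
        children = [c for c in current[index].children if c.label != EPSILON]
        current = current[:index] + children + current[index + 1:]
        forms.append("".join(n.label for n in current) or EPSILON)


# Tree for a+a in an assumed grammar with productions E -> E+E and E -> a.
tree = Node("E", [Node("E", [Node("a")]), Node("+"), Node("E", [Node("a")])])
print(" ⇒ ".join(leftmost_derivation(tree)))
```

Expanding the rightmost interior node instead would give the corresponding rightmost derivation.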
From Derivation to Recursive Inference
Finally (2) ⇒ (1), and the cycle is completed. Note
(3) ⇒ (2) is trivial, and (4) is handled symmetrically to (3).
The induction is on the length of the derivation $A \Rightarrow^* w$.
basis: one step. $A \to w$ must be a production,
and $w \in L(A)$ will be concluded by inference.
induction: n + 1 steps. Singling out the first
derivation step, we can write $A \Rightarrow X_1 \cdots X_k \Rightarrow^* w$.
We can break w into $w_1 \cdots w_k$, where $w_i = X_i$ if
$X_i$ is a terminal and $X_i \Rightarrow^* w_i$ in at most n steps
if $X_i$ is a variable.
By the induction hypothesis, $w_i \in L(X_i)$ is
concluded by the recursive inference. Then, using the
production $A \to X_1 \cdots X_k$,
recursive inference concludes $w \in L(A)$.
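As a small worked instance of the induction step, assume the palindrome grammar with productions S → ε | 0 | 1 | 0S0 | 1S1 and the two-step derivation of 010. The derivation is split after its first step, and the inference is rebuilt from the pieces:

```latex
\begin{align*}
& S \Rightarrow 0S0 \Rightarrow 010
  && \text{(a derivation of length 2)} \\
& X_1 X_2 X_3 = 0\,S\,0, \quad w = w_1 w_2 w_3 = 0 \cdot 1 \cdot 0 \\
& w_1 = X_1 = 0, \quad w_3 = X_3 = 0
  && \text{(terminals)} \\
& S \Rightarrow 1, \ \text{so } w_2 = 1 \in L(S)
  && \text{(basis, via } S \to 1\text{)} \\
& 010 \in L(S)
  && \text{(inference using } S \to 0S0\text{)}
\end{align*}
```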
Applications of CFG
Grammars are used to describe programming
languages. There is a mechanical way to turn a
language description given as a CFG into a parser.
This parser is used in a compiler to recognize the
structure of a source program and to represent that
structure as a parse tree.
Grammars are used in XML for document type
definitions (DTDs), which describe the allowable tags
and the ways to use these tags.
Ambiguous Grammars
If some string w in L(G) has more than one parse
tree, then G is called an ambiguous grammar. There is
no algorithm to decide whether a CFG is ambiguous;
the problem is undecidable.
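Although ambiguity of a grammar cannot be decided in general, the number of parse trees of one given string can be counted. The sketch below does this for a small expression grammar assumed for the example (E → E+E | E*E | a). It relies on the grammar having no ε-productions and no cycles of unit productions, so that every symbol of a body covers at least one character and the recursion always moves to shorter substrings.

```python
from functools import lru_cache

RULES = {"E": [("E", "+", "E"), ("E", "*", "E"), ("a",)]}


def count_parse_trees(rules, start, w):
    """Count parse trees with root `start` and yield w (no ε-productions assumed)."""

    @lru_cache(maxsize=None)
    def trees(symbol, i, j):
        # Number of parse trees with root `symbol` and yield w[i:j].
        if symbol not in rules:  # terminal: must match the substring exactly
            return 1 if w[i:j] == symbol else 0
        return sum(ways(body, i, j) for body in rules[symbol])

    @lru_cache(maxsize=None)
    def ways(body, i, j):
        # Number of ways the symbols of `body` can jointly yield w[i:j].
        if not body:
            return 1 if i == j else 0
        rest = len(body) - 1  # each remaining symbol needs at least one character
        return sum(
            trees(body[0], i, k) * ways(body[1:], k, j)
            for k in range(i + 1, j - rest + 1)
        )

    return trees(start, 0, len(w))


print(count_parse_trees(RULES, "E", "a+a*a"))  # prints 2
```

The string a+a*a has two parse trees in this grammar, one grouping a+a first and one grouping a*a first, so the grammar is ambiguous.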
For some ambiguous G, it may be possible to
redesign the grammar to make the parse tree unique
for every string in L(G), i.e., to create an equivalent
unambiguous grammar.
However, for some CFLs no equivalent unambiguous
grammar exists. Such a CFL is called inherently
ambiguous, or simply ambiguous.
Ambiguous Languages
Given a CFL L, if every CFG G with L(G) = L is
ambiguous, then L is inherently ambiguous.
Here we give an inherently ambiguous CFL. Let
\[ L = \{a^n b^n c^m d^m \mid n \ge 1, m \ge 1\} \cup \{a^n b^m c^m d^n \mid n \ge 1, m \ge 1\} \]
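The two parts of L overlap: every string of the form a^n b^n c^n d^n satisfies both conditions. These shared strings are where the ambiguity comes from (the proof that no grammar can avoid it is not given here). The sketch below only tests membership in each part; the helper names are chosen for this illustration.

```python
import re


def decompose(w):
    """Return block lengths (i, j, k, l) if w = a^i b^j c^k d^l with all >= 1."""
    match = re.fullmatch(r"(a+)(b+)(c+)(d+)", w)
    return tuple(len(g) for g in match.groups()) if match else None


def in_first(w):
    """w = a^n b^n c^m d^m for some n, m >= 1."""
    blocks = decompose(w)
    return blocks is not None and blocks[0] == blocks[1] and blocks[2] == blocks[3]


def in_second(w):
    """w = a^n b^m c^m d^n for some n, m >= 1."""
    blocks = decompose(w)
    return blocks is not None and blocks[0] == blocks[3] and blocks[1] == blocks[2]


def in_L(w):
    return in_first(w) or in_second(w)


for w in ("aabbcd", "abbccd", "aabbccdd", "abcdd"):
    print(w, in_first(w), in_second(w))
```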