Parse Trees, Ambiguity, and Chomsky Normal Form
In this lecture we will discuss a few important notions connected with context-
free grammars, including parse trees, ambiguity, and a special form for context-free
grammars known as Chomsky normal form.
We will begin by recalling the following CFG from the previous lecture:

S → 0S1S | 1S0S | ε    (8.1)

Again we will call this CFG G, and as we proved last time we have
L(G) = { w ∈ Σ* : |w|0 = |w|1 },    (8.2)
where Σ = {0, 1} is the binary alphabet and |w|0 and |w|1 denote the number of
times the symbols 0 and 1 appear in w, respectively.
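For instance, one might check membership in L(G) with a small Python sketch along the following lines (the function name in_L is just for illustration):

    def in_L(w):
        # w is a string over the alphabet {0, 1}; |w|0 and |w|1 are simply
        # the counts of 0s and 1s appearing in w.
        return w.count("0") == w.count("1")

    print(in_L("0101"), in_L("0111"))   # True False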
Left-most derivations
Here is an example of a derivation of the string 0101 by this CFG:

S ⇒ 0S1S ⇒ 01S0S1S ⇒ 010S1S ⇒ 0101S ⇒ 0101    (8.3)
This is an example of a left-most derivation, which means that it is always the left-
most variable that gets replaced at each step. For the first step there is only one
variable that can possibly be replaced; this is true both in this example and in general. For the second step, however, one could choose to replace either of the occurrences of the variable S, and in the derivation above it is the left-most occurrence that gets replaced: applying the rule S → 1S0S to that occurrence yields 01S0S1S.
The same is true for every other step: always we choose the left-most variable
occurrence to replace, and that is why we call this a left-most derivation. The same
terminology is used in general, for any context-free grammar.
If you think about it for a moment, you will quickly realize that every string
that can be generated by a particular context-free grammar can also be generated
by that same grammar using a left-most derivation. This is because there is no “interaction” among multiple variables and/or symbols in any context-free grammar
derivation; if we know which rule is used to substitute each variable, then it does
not matter what order the variable occurrences are substituted, so you might as
well always take care of the left-most variable during each step.
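To make this concrete, here is a small Python sketch of carrying out the left-most derivation (8.3) mechanically. The encoding of the CFG (8.1) as a dictionary from variables to right-hand sides, with "" standing for ε, is just a convenient assumption for the sake of illustration.

    rules = {"S": ["0S1S", "1S0S", ""]}     # the CFG (8.1); "" stands for ε

    def leftmost_step(form, rhs):
        # Replace the left-most variable occurrence in `form` by `rhs`.
        i = next(i for i, c in enumerate(form) if c in rules)
        return form[:i] + rhs + form[i + 1:]

    form = "S"
    for rhs in ["0S1S", "1S0S", "", "", ""]:    # the rule choices made in (8.3)
        form = leftmost_step(form, rhs)
        print(form)
    # prints 0S1S, 01S0S1S, 010S1S, 0101S, and finally 0101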
We could also define the notion of a right-most derivation, in which the right-
most variable occurrence is always evaluated first, but there is not really anything
important about right-most derivations that is not already represented by the notion of a left-most derivation, at least from the viewpoint of this course. For this
reason, we will not have any reason to discuss right-most derivations further.
Parse trees
With any derivation of a string by a context-free grammar we may associate a tree,
called a parse tree, according to the following rules:
1. We have one node of the tree for each new occurrence of either a variable, a
symbol, or an ε in the derivation, with the root node of the tree corresponding
to the start variable. We only have nodes labeled ε when rules of the form
V → ε are applied.
2. Each node corresponding to a symbol or an ε is a leaf node (having no children), while each node corresponding to a variable has one child for each symbol, variable, or ε with which it is replaced. The children of each variable node
are ordered in the same way as the symbols and variables in the rule used to
replace that variable.
For example, the derivation (8.3) yields the parse tree illustrated in Figure 8.1.
Figure 8.1: The parse tree corresponding to the derivation (8.3) of the string 0101.
There is a one-to-one and onto correspondence between parse trees and left-
most derivations, meaning that every parse tree uniquely determines a left-most
derivation and each left-most derivation uniquely determines a parse tree.
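To make the correspondence concrete, here is a small Python sketch in which a parse tree is represented as a nested (label, children) pair; this particular representation is only an assumption made for illustration. The tree below is the one depicted in Figure 8.1, and reading its leaves from left to right recovers the string 0101.

    def leaf(label):
        return (label, ())

    # The parse tree of Figure 8.1 for the derivation (8.3) of the string 0101.
    tree = ("S", (leaf("0"),
                  ("S", (leaf("1"), ("S", (leaf("ε"),)), leaf("0"), ("S", (leaf("ε"),)))),
                  leaf("1"),
                  ("S", (leaf("ε"),))))

    def yield_of(tree):
        # The string the tree generates: its leaf labels read from left to
        # right, with ε-leaves contributing nothing.
        label, children = tree
        if not children:
            return "" if label == "ε" else label
        return "".join(yield_of(child) for child in children)

    print(yield_of(tree))   # 0101

Visiting the variable nodes of a parse tree in depth-first, left-to-right order gives exactly the order in which they are replaced in the corresponding left-most derivation.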
8.2 Ambiguity
Sometimes a context-free grammar will allow multiple parse trees (or, equivalently,
multiple left-most derivations) for some strings in the language that it generates.
For example, a left-most derivation of the string 0101 by the CFG (8.1) that is different from (8.3) is

S ⇒ 0S1S ⇒ 01S ⇒ 010S1S ⇒ 0101S ⇒ 0101,    (8.5)

and the parse tree corresponding to this derivation is shown in Figure 8.2.
Figure 8.2: The parse tree corresponding to the derivation (8.5) of the string 0101.
A CFG that allows two or more parse trees for the same string is said to be ambiguous, so the CFG (8.1) is ambiguous. It is possible, however, to come up with a different CFG for the language L(G) described in (8.2) that, unlike the CFG (8.1), is unambiguous. Here is such a CFG:
S → 0X1S | 1Y0S | ε
X → 0X1X | ε    (8.6)
Y → 1Y0Y | ε
We will not take the time to go through a proof that this CFG is unambiguous, but
if you think about it for a few moments you should be able to convince yourself
that it is unambiguous. The variable X generates strings having the same number
of 0s and 1s, where the number of 1s never exceeds the number of 0s when you
read from left to right, and the variable Y is similar except the role of the 0s and 1s
is reversed. If you try to generate a particular string by a left-most derivation with
this CFG, you will never have more than one option as to which rule to apply.
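One way to see the difference between the CFGs (8.1) and (8.6) concretely is to count left-most derivations by brute force. The following Python sketch does this for grammars whose variables and symbols are single characters, encoded as a dictionary from variables to lists of right-hand sides with "" standing for ε; this encoding, and the simple pruning used (which suffices for the grammars considered here, though not for arbitrary CFGs), are assumptions made for illustration.

    def count_leftmost_derivations(rules, start, target):
        # Count the distinct left-most derivations of `target` from `start`.
        count = 0
        stack = [start]
        while stack:
            form = stack.pop()
            i = next((j for j, c in enumerate(form) if c in rules), None)
            if i is None:                           # no variables remain
                count += form == target
                continue
            if not target.startswith(form[:i]):     # terminal prefix must match
                continue
            if sum(c not in rules for c in form) > len(target):
                continue                            # too many terminals already
            for rhs in rules[form[i]]:
                stack.append(form[:i] + rhs + form[i + 1:])
        return count

    g1 = {"S": ["0S1S", "1S0S", ""]}                    # the CFG (8.1)
    g2 = {"S": ["0X1S", "1Y0S", ""],                    # the CFG (8.6)
          "X": ["0X1X", ""],
          "Y": ["1Y0Y", ""]}
    print(count_leftmost_derivations(g1, "S", "0101"))  # 2
    print(count_leftmost_derivations(g2, "S", "0101"))  # 1

The two derivations counted for the CFG (8.1) are exactly (8.3) and (8.5).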
Here is another example of how an ambiguous CFG can be modified to make it
unambiguous. Let us define an alphabet
Σ = { a, b, +, ∗, (, ) }    (8.7)

and consider the following CFG for arithmetic expressions over this alphabet:

S → S + S | S ∗ S | (S) | a | b    (8.8)
For example, the string (a + b) ∗ a + b is generated by this CFG; one left-most derivation of it is

S ⇒ S ∗ S ⇒ (S) ∗ S ⇒ (S + S) ∗ S ⇒ (a + S) ∗ S ⇒ (a + b) ∗ S
  ⇒ (a + b) ∗ S + S ⇒ (a + b) ∗ a + S ⇒ (a + b) ∗ a + b,

and the parse tree corresponding to this derivation is shown in Figure 8.3.

Figure 8.3: A parse tree corresponding to the left-most derivation above of the string (a + b) ∗ a + b.

You can of course imagine a more complex version of this grammar
allowing for other arithmetic operations, variables, and so on, but we will stick to
the grammar in (8.8) for the sake of simplicity.
The CFG (8.8) is ambiguous. For instance, a different (left-most) derivation for
the same string (a + b) ∗ a + b as before is

S ⇒ S + S ⇒ S ∗ S + S ⇒ (S) ∗ S + S
  ⇒ (S + S) ∗ S + S ⇒ (a + S) ∗ S + S ⇒ (a + b) ∗ S + S    (8.10)
  ⇒ (a + b) ∗ a + S ⇒ (a + b) ∗ a + b,
and the parse tree for this derivation is shown in Figure 8.4.
Notice that the parse tree illustrated in Figure 8.4 is appealing because it actually carries the meaning of the expression (a + b) ∗ a + b, in the sense that the tree
structure properly captures the order in which the operations should be applied
according to the standard order of precedence for arithmetic operations. In contrast, the parse tree shown in Figure 8.3 seems to represent what the expression (a + b) ∗ a + b would evaluate to if we lived in a society where addition was given higher precedence than multiplication.
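Using the derivation-counting sketch from the previous section (with ∗ written as the ASCII character *), one can likewise check that this string has exactly two left-most derivations with respect to the CFG (8.8), corresponding to the two parse trees shown in Figures 8.3 and 8.4:

    g8 = {"S": ["S+S", "S*S", "(S)", "a", "b"]}              # the CFG (8.8)
    print(count_leftmost_derivations(g8, "S", "(a+b)*a+b"))  # 2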
The ambiguity of the grammar (8.8), along with the fact that parse trees may
not represent the meaning of an arithmetic expression in the sense just described,
is a problem in some settings. For example, if we were designing a compiler and
wanted a part of it to represent arithmetic expressions (presumably allowing much
more complicated ones than our grammar from above allows), a CFG along the
lines of (8.8) would be completely inadequate.
We can, however, come up with a new CFG for the same language that is much better, in the sense that it is unambiguous and properly captures the meaning of arithmetic expressions. Here is such a CFG:

S → S + T | T
T → T ∗ F | F
F → I | (S)    (8.11)
I → a | b
Figure 8.4: The parse tree corresponding to the derivation (8.10) of the string (a + b) ∗ a + b.
Figure 8.5: Unique parse tree for (a + b) ∗ a + b for the CFG (8.11).
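As an illustration of how the structure of the CFG (8.11) encodes the standard order of precedence, here is a small recursive-descent evaluator sketch in Python, with one parsing function per variable of the grammar. The left-recursive rules S → S + T and T → T ∗ F are handled with loops, ∗ is written as the ASCII character *, and the numeric values chosen for the symbols a and b are an arbitrary assumption, used only so that expressions evaluate to numbers.

    def evaluate(expr, values=None):
        values = values if values is not None else {"a": 2, "b": 3}
        tokens = list(expr.replace(" ", ""))
        pos = 0

        def peek():
            return tokens[pos] if pos < len(tokens) else None

        def eat(tok):
            nonlocal pos
            assert peek() == tok, f"expected {tok!r} at position {pos}"
            pos += 1

        def parse_S():                      # S -> S + T | T
            value = parse_T()
            while peek() == "+":
                eat("+")
                value += parse_T()
            return value

        def parse_T():                      # T -> T * F | F
            value = parse_F()
            while peek() == "*":
                eat("*")
                value *= parse_F()
            return value

        def parse_F():                      # F -> I | ( S )
            if peek() == "(":
                eat("(")
                value = parse_S()
                eat(")")
                return value
            return parse_I()

        def parse_I():                      # I -> a | b
            tok = peek()
            eat(tok)
            return values[tok]

        result = parse_S()
        assert pos == len(tokens), "unexpected trailing input"
        return result

    print(evaluate("(a+b)*a+b"))            # ((2 + 3) * 2) + 3 = 13

Evaluating (a + b) ∗ a + b this way groups the expression as ((a + b) ∗ a) + b, which is exactly the grouping exhibited by the parse tree in Figure 8.5.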
Some context-free languages are inherently ambiguous, meaning that every CFG generating them is ambiguous. A standard example is the language

{0ⁿ1ᵐ2ᵐ : n, m ∈ ℕ} ∪ {0ⁿ1ⁿ2ᵐ : n, m ∈ ℕ}.

We will not prove that this language is inherently ambiguous, but the intuition is that no matter what CFG you come up with for this language, the string 0ⁿ1ⁿ2ⁿ will always have multiple parse trees for some sufficiently large natural number n.
Ambiguity can also be far more dramatic than this. For example, a CFG having just the two rules S → SS and S → ε simply generates the language {ε}, but it is obviously ambiguous, and even worse it has infinitely many parse trees (which of course can be arbitrarily large) for the string ε.
8.3 Chomsky normal form

A CFG is said to be in Chomsky normal form if every one of its rules has one of the following three forms:

1. X → YZ, where X, Y, and Z are variables and neither Y nor Z is the start variable;
2. X → a, where X is a variable and a is a symbol;
3. S → ε, where S is the start variable.

In particular, the start variable never appears on the right-hand side of any rule.
Figure 8.6: A hypothetical example of a parse tree for a CFG in Chomsky normal
form.
Figure 8.7: The unique parse tree for ε for a CFG in Chomsky normal form, assuming it includes the rule S → ε.
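Whether a given CFG is in Chomsky normal form can be checked mechanically. In the following Python sketch, a grammar is encoded (as an assumption made for illustration) as a dictionary mapping each variable to a set of right-hand sides, where a right-hand side is a tuple of symbols and the empty tuple stands for ε.

    def is_chomsky_normal_form(rules, start):
        # A symbol counts as a variable exactly when it is a key of `rules`.
        for A, rhss in rules.items():
            for rhs in rhss:
                ok = ((len(rhs) == 2 and                              # X -> YZ, with Y
                       all(s in rules and s != start for s in rhs))   # and Z not the start
                      or (len(rhs) == 1 and rhs[0] not in rules)      # X -> a
                      or (rhs == () and A == start))                  # S -> ε (start only)
                if not ok:
                    return False
        return True

    g = {"S": {("A", "B"), ()}, "A": {("A", "B"), ("0",)}, "B": {("1",)}}
    print(is_chomsky_normal_form(g, "S"))                             # True
    print(is_chomsky_normal_form({"S": {("0", "S", "1"), ()}}, "S"))  # False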
Theorem 8.2 states that every context-free language is generated by some CFG in Chomsky normal form. The usual way to prove this theorem is through a construction that converts an arbitrary CFG G into a CFG H in Chomsky normal form for which L(H) = L(G). The conversion is, in fact, fairly straightforward; a summary of the steps one may perform to do this conversion for an arbitrary CFG G = (V, Σ, R, S) appears below, and a short code sketch of the same steps follows the example. To illustrate how these steps work, let us start with the following CFG, which generates the balanced parentheses language BAL from the previous lecture:
S → (S)S | ε    (8.14)
1. Introduce a new start variable S0, along with the rule S0 → S, so that the new start variable does not appear on the right-hand side of any rule. Applying this step to the CFG (8.14) yields the following CFG:

S0 → S
S → (S)S | ε    (8.15)
2. Replace every symbol appearing on the right-hand side of a rule whose right-hand side has length two or more with a new variable, adding a rule stating that this new variable generates that symbol. Transforming the CFG (8.15) in this way, using the new variables L and R for the symbols ( and ), results in the following CFG:

S0 → S
S → LSRS | ε
L → (    (8.16)
R → )
3. Break up every rule whose right-hand side has length three or more into a sequence of rules whose right-hand sides have length two. That is, a rule of the form X → Y1Y2 · · · Ym (for m ≥ 3) is replaced by the rules

X → Y1 Z2
Z2 → Y2 Z3
  ⋮    (8.17)
Zm−2 → Ym−2 Zm−1
Zm−1 → Ym−1 Ym

where Z2, . . . , Zm−1 are new auxiliary variables.
Note that we must use separate auxiliary variables for each rule so that there
is no “cross talk” between different rules—so do not reuse the same auxiliary
variables to break up multiple rules.
Transforming the CFG (8.16) in this way results in the following CFG:
S0 → S
S → LZ2 | ε
Z2 → SZ3
Z3 → RS    (8.18)
L → (
R → )
4. Eliminate ε-rules, which are rules of the form X → ε.
Aside from the special case S0 → ε, there is never any need for rules of the form X → ε; you can get the same effect by simply duplicating rules in which X appears on the right-hand side, and directly replacing or not replacing X with ε in each possible combination. You might introduce new ε-rules in this way, but they can be handled recursively; any time a new ε-rule is generated that was already eliminated, it is not added back in.
Transforming the CFG (8.18) in this way results in the following CFG:
S0 → S | ε
S → LZ2
Z2 → SZ3 | Z3
Z3 → RS | R    (8.19)
L → (
R → )
Note that we do end up with the ε-rule S0 → ε, but we do not eliminate this
one because S0 → ε is the special case that we allow as an ε-rule.
5. Eliminate unit rules, which are rules of the form X → Y.
Rules like this are never necessary, and they can be eliminated provided that
we also include the rule X → w in the CFG whenever Y → w appears as a rule.
If you obtain a new unit rule that was already eliminated (or is the unit rule
currently being eliminated), it is not added back in.
Transforming the CFG (8.19) in this way results in the following CFG:
S0 → LZ2 ε
S → LZ2
Z2 → SZ3 RS )
(8.20)
Z3 → RS )
L→(
R→ )
The description above is only meant to give you the basic idea of how the construction works and does not constitute a formal proof of Theorem 8.2. It is possible,
however, to be more formal and precise in describing this construction in order to
obtain a proper proof of Theorem 8.2.
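To tie the five steps together, here is a rough Python sketch of the whole conversion; it is only an illustration of the idea, not a formal proof. The encoding of grammars as dictionaries of tuples, the generated variable names (S0, V_c for a symbol c, and Z2, Z3, and so on), and the assumption that these names do not collide with existing variables are all simplifications. Applied to the CFG (8.14) for BAL, it produces the CFG (8.20), up to the naming of variables.

    from itertools import count

    def to_chomsky_normal_form(rules, start):
        # `rules` maps each variable to a set of right-hand sides; a right-hand
        # side is a tuple of symbols, with () standing for ε. A symbol counts
        # as a variable exactly when it is a key of `rules`.
        rules = {A: set(rhss) for A, rhss in rules.items()}
        fresh = (f"Z{k}" for k in count(2))      # auxiliary names Z2, Z3, ...

        # 1. Introduce a new start variable S0 with the single rule S0 -> start.
        start0 = "S0"
        rules[start0] = {(start,)}

        # 2. In right-hand sides of length >= 2, replace each symbol c by a new
        #    variable V_c and add the rule V_c -> c.
        terminal_var = {}
        for A, rhss in list(rules.items()):
            rules[A] = {tuple(c if c in rules else terminal_var.setdefault(c, f"V_{c}")
                              for c in rhs) if len(rhs) >= 2 else rhs
                        for rhs in rhss}
        for c, V in terminal_var.items():
            rules[V] = {(c,)}

        # 3. Break up right-hand sides of length >= 3 into chains of length-2
        #    rules, using separate auxiliary variables for each rule.
        for A, rhss in list(rules.items()):
            new_rhss = set()
            for rhs in rhss:
                where = new_rhss
                while len(rhs) > 2:
                    Z = next(fresh)
                    where.add((rhs[0], Z))
                    rules[Z] = where = set()
                    rhs = rhs[1:]
                where.add(rhs)
            rules[A] = new_rhss

        # 4. Eliminate ε-rules X -> ε for X other than S0, by duplicating every
        #    rule that mentions X with and without each occurrence of X.
        def without_some(rhs, X):
            # every tuple obtainable from rhs by deleting some occurrences of X
            if not rhs:
                return {()}
            rest = without_some(rhs[1:], X)
            kept = {(rhs[0],) + r for r in rest}
            return kept | rest if rhs[0] == X else kept

        eliminated = set()
        while True:
            X = next((A for A in rules if A != start0 and () in rules[A]), None)
            if X is None:
                break
            rules[X].discard(())
            eliminated.add(X)
            for A, rhss in list(rules.items()):
                rules[A] = {r for rhs in rhss for r in without_some(rhs, X)
                            if r != () or A == start0 or A not in eliminated}

        # 5. Eliminate unit rules X -> Y, adding X -> w whenever Y -> w is a
        #    rule, but never re-adding a unit rule that was already eliminated.
        eliminated_units = set()
        while True:
            unit = next(((A, rhs[0]) for A, rhss in rules.items() for rhs in rhss
                         if len(rhs) == 1 and rhs[0] in rules), None)
            if unit is None:
                break
            A, Y = unit
            rules[A].discard((Y,))
            eliminated_units.add((A, Y))
            for rhs in list(rules[Y]):
                if len(rhs) == 1 and rhs[0] in rules and (A, rhs[0]) in eliminated_units:
                    continue
                rules[A].add(rhs)

        return rules, start0

    # Applying the conversion to the CFG (8.14) for BAL; the result matches
    # (8.20), with V_( and V_) playing the roles of L and R.
    bal = {"S": {("(", "S", ")", "S"), ()}}
    cnf, new_start = to_chomsky_normal_form(bal, "S")
    for A in sorted(cnf):
        print(A, "->", " | ".join(" ".join(rhs) if rhs else "ε" for rhs in sorted(cnf[A])))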
We will make use of the theorem from time to time. In particular, when we are
proving things about context-free languages, it is sometimes extremely helpful to
know that we can always assume that a given context-free language is generated
by a CFG in Chomsky normal form.
Finally, it must be stressed that Chomsky normal form says nothing about ambiguity in general. A CFG in Chomsky normal form may or may not be ambiguous, just as is the case for arbitrary CFGs.