Lecture Notes on
Context-Free Grammars
15-411: Compiler Design
Frank Pfenning
Lecture 7
September 15, 2009
1 Introduction
2 Context-Free Grammars
S −→
S −→ [S]
S −→ S S
The first rule looks somewhat strange, because the right-hand side is the
empty string. To make this more readable, we usually write the empty
string as ε.
A derivation of a sentence w from start symbol S is a sequence S =
α0 −→ α1 −→ · · · −→ αn = w, where w consists only of terminal symbols. In each
step we choose an occurrence of a non-terminal X in αi and a production
X −→ γ and replace the occurrence of X in αi by γ.
We usually label the productions in the grammar so that we can refer to
them by name. In the example above we might write
[emp] S −→ ε
[pars] S −→ [S]
[dup] S −→ S S
Then the following is a derivation of the string [[][]]:
S −→ [S] [pars]
−→ [SS] [dup]
−→ [[S]S] [pars]
−→ [[]S] [emp]
−→ [[][S]] [pars]
−→ [[][]] [emp]
We have labeled each derivation step with the corresponding grammar
production that was used.
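To make the derivation relation concrete, here is a small Python sketch; the representation (grammar strings as Python strings, upper-case characters as non-terminals) and all names are ours, not part of the formal development.

# A sketch, not from the notes: upper-case letters are non-terminals,
# everything else is a terminal symbol.
PRODUCTIONS = {"emp": ("S", ""), "pars": ("S", "[S]"), "dup": ("S", "SS")}

def step(alpha, rule, occurrence=0):
    """Apply the named production to the given occurrence (0-based,
    counted from the left) of its non-terminal in alpha."""
    x, gamma = PRODUCTIONS[rule]
    positions = [i for i, c in enumerate(alpha) if c == x]
    i = positions[occurrence]
    return alpha[:i] + gamma + alpha[i + 1:]

# Replaying the derivation of [[][]] shown above (always the leftmost S):
alpha = "S"
for rule in ["pars", "dup", "pars", "emp", "pars", "emp"]:
    alpha = step(alpha, rule)
assert alpha == "[[][]]"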
Derivations are clearly not unique: when there is more than one
non-terminal we can replace them in any order. In order to avoid
this kind of harmless ambiguity in rule order, we like to construct a parse
tree in which the nodes represent the non-terminals in a string, with the
root being S. In the example above we obtain the following tree:
pars
└── dup
    ├── pars
    │   └── emp
    └── pars
        └── emp
While the tree removes some ambiguity, it turns out the sample grammar
is ambiguous in another way. In fact, there are infinitely many parse
trees of every string in the language. This can be seen by considering the
cycle
S −→ SS −→ S
where the first step is dup and the second is emp, applied either to the first
or second occurrence of S.
A parse tree can also be presented as a deduction of the judgment w : γ,
which expresses that the terminal string w matches the grammar string γ.
The judgment is defined by the following rules, where a stands for an
arbitrary terminal symbol and [r] X −→ β for a grammar production named r:

             w1 : γ1    w2 : γ2                      [r] X −→ β    w : β
 ----- P1    ------------------ P2     ----- P3      ------------------- P4 (r)
 ε : ε          w1 w2 : γ1 γ2          a : a                w : X

For our example we obtain the deduction below, where P2 (twice)
abbreviates two successive applications of P2.

                                       ----- P1
                                       ε : ε
                  ----- P3           --------- P4 (emp)    ----- P3
                  [ : [                ε : S               ] : ]
                  ------------------------------------------------ P2 (twice)
                                   [] : [ S ]
                                   ---------- P4 (pars)      ..
                                     [] : S                [] : S
                                     ------------------------------ P2
                                              [][] : S S
                                              ---------- P4 (dup)
              ----- P3                         [][] : S               ----- P3
              [ : [                                                   ] : ]
              --------------------------------------------------------------- P2 (twice)
                                  [[][]] : [ S ]
                                  -------------- P4 (pars)
                                    [[][]] : S
The one omitted subdeduction is identical to its sibling on the left. We
observe that the labels have the same structure as the parse tree, except that
it is written upside-down. Parse trees are therefore just deduction trees.
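As a sketch of this correspondence, we can represent a deduction as a nested tuple that names its final rule, and check it bottom-up by recomputing each conclusion. The encoding below is ours and merely illustrative.

# Assumed representation, not from the notes: a deduction is a nested
# tuple naming its final rule, e.g. ("P4", "pars", subdeduction).
GRAMMAR_RULES = {"emp": ("S", ""), "pars": ("S", "[S]"), "dup": ("S", "SS")}

def conclude(d):
    """Return the conclusion (w, gamma) of a deduction, raising
    ValueError if some rule is applied incorrectly."""
    if d[0] == "P1":                      # ε : ε
        return ("", "")
    if d[0] == "P3":                      # a : a for a terminal a
        return (d[1], d[1])
    if d[0] == "P2":                      # concatenate two matches
        (w1, g1), (w2, g2) = conclude(d[1]), conclude(d[2])
        return (w1 + w2, g1 + g2)
    if d[0] == "P4":                      # production [r] X -> beta
        x, beta = GRAMMAR_RULES[d[1]]
        w, g = conclude(d[2])
        if g != beta:
            raise ValueError(f"premise {g!r} does not match rule {d[1]!r}")
        return (w, x)
    raise ValueError(f"unknown rule {d[0]!r}")

# The subdeduction of [] : S from the tree above, then the full deduction:
d_unit = ("P4", "pars",
          ("P2", ("P2", ("P3", "["), ("P4", "emp", ("P1",))), ("P3", "]")))
d_full = ("P4", "pars",
          ("P2", ("P2", ("P3", "["), ("P4", "dup", ("P2", d_unit, d_unit))),
           ("P3", "]")))
assert conclude(d_unit) == ("[]", "S") and conclude(d_full) == ("[[][]]", "S")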
4 CYK Parsing
The rules above that formally define when a terminal string matches an
arbitrary string can be used to immediately give an algorithm for parsing.
Assume we are given a grammar with start symbol S and a terminal
string w0 . Start with a database of assertions ε : ε and a : a for every
terminal symbol occurring in w0 . Now arbitrarily apply the given rules in
the following way: if the premises of a rule can be matched against the
database, and the conclusion w : γ is such that w is a substring of w0 and γ
is a string occurring in the grammar, then add w : γ to the database.
We repeat this process until we reach saturation: any further application
of any rule leads to conclusions that are already in the database. We stop at
this point and check whether w0 : S is in the database. If yes, we succeed;
if not, we fail.
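A naive Python rendering of this saturation loop for the matching-parentheses grammar might look as follows; the data representation and all names are ours, since the rules themselves do not prescribe any.

# Naive saturation sketch (assumed representation, not from the notes).
GRAMMAR = {"S": ["", "[S]", "SS"]}        # non-terminal -> right-hand sides

def grammar_strings(grammar):
    """All strings occurring in the grammar: the non-terminals themselves
    and every substring of every right-hand side."""
    strs = set(grammar) | {""}
    for rhss in grammar.values():
        for rhs in rhss:
            strs |= {rhs[i:j] for i in range(len(rhs))
                              for j in range(i, len(rhs) + 1)}
    return strs

def parse(w0, grammar=GRAMMAR, start="S"):
    gstrings = grammar_strings(grammar)
    # Seed database: ε : ε (rule P1) and a : a for terminals in w0 (rule P3).
    db = {("", "")} | {(a, a) for a in w0}
    while True:
        new = set()
        for (w1, g1) in db:               # rule P2, restricted as described:
            for (w2, g2) in db:           # w must be a substring of w0 and
                w, g = w1 + w2, g1 + g2   # g a string occurring in the grammar
                if w in w0 and g in gstrings:
                    new.add((w, g))
        for (w, g) in db:                 # rule P4 for each production X -> g
            for x, rhss in grammar.items():
                if g in rhss:
                    new.add((w, x))
        if new <= db:                     # saturation: nothing new derivable
            return (w0, start) in db
        db |= new

assert parse("[[][]]") and not parse("[[]")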
This process must always terminate, since there are only a fixed number
of substrings of the grammar, and only a fixed number of substrings of the
query string w0 . In fact, only O(n²) terms can ever be derived if the
grammar is fixed and n = |w0 |. Using a meta-complexity result by Ganzinger
and McAllester [GM02] we can obtain the complexity of this algorithm as
the maximum of the size of the saturated database (which is O(n²)) and
the number of so-called prefix firings of the rule. We count this by bounding
the number of ways the premises of each rule can be instantiated, when
working from left to right. The crucial rule is
w1 : γ1    w2 : γ2
------------------ P2
   w1 w2 : γ1 γ2
There are O(n²) substrings, so there are O(n²) ways to match the first premise
against the database. Since w1 w2 is also constrained to be a substring of
w0 , there are only O(n) ways to instantiate the second premise: the
left end of w2 in the input string is determined, but not its right end. This
yields a complexity of O(n³).
The algorithm we have just presented is an abstract form of the Cocke-
Younger-Kasami (CYK) parsing algorithm. It assumes the grammar is in a
normal form, and represents substrings by their indices in the input rather
than directly as strings. However, its general running time is still O(n³).
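For concreteness, here is a sketch of the indexed, normal-form variant in Python. The Chomsky-normal-form grammar for non-empty strings of matching parentheses and all names are ours, not something given in these notes.

from itertools import product

# A CNF grammar (our encoding) for non-empty matching parentheses:
#   S -> A C | A B | S S,   C -> S B,   A -> [,   B -> ]
BINARY = {("A", "C"): {"S"}, ("A", "B"): {"S"},
          ("S", "S"): {"S"}, ("S", "B"): {"C"}}
TERMINAL = {"[": {"A"}, "]": {"B"}}

def cyk(w, start="S"):
    n = len(w)
    # chart[i][j] holds the non-terminals deriving the substring w[i:j];
    # substrings are represented purely by their index pair (i, j).
    chart = [[set() for _ in range(n + 1)] for _ in range(n + 1)]
    for i, a in enumerate(w):
        chart[i][i + 1] = set(TERMINAL.get(a, ()))
    for length in range(2, n + 1):        # O(n) substring lengths
        for i in range(n - length + 1):   # O(n) left endpoints
            j = i + length
            for k in range(i + 1, j):     # O(n) split points: O(n³) overall
                for y, z in product(chart[i][k], chart[k][j]):
                    chart[i][j] |= BINARY.get((y, z), set())
    return start in chart[0][n]

assert cyk("[[][]]") and not cyk("[]][")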
As an example, we return to the algorithm and apply it, using an n-ary
concatenation rule as a short-hand, to parse [[][]] with our grammar of
matching parentheses. We start with three facts that derive from rules P1
and P3 , namely ε : ε, [ : [, and ] : ]. When working forward it is important
to keep in mind that we only infer assertions w : γ where w is a substring
of [[][]] and γ is a string occurring in the grammar.
5 Predictive Parsing
If we want to analyze the input string from left to right, we have to change
the nature of the rule for non-terminals so that it can handle a non-terminal
at the left end of the string:
                    w : γ               [r] X −→ β    w : β γ
 ----- R1        ----------- R2         --------------------- R3 (r)
 ε : ε            a w : a γ                   w : X γ
At this point the rules are entirely linear (each rule has zero or one
premises) and decompose the string left-to-right (we only proceed by
stripping away a terminal symbol a).
Rather than blindly using these rules from the premises to the conclusions
(which wouldn't be analyzing the string from left to right), couldn't
we use them the other way around? Recall that we are starting with a given
goal, namely to derive w0 : S, if possible, or explicitly fail otherwise. Now
could we use the rules in a goal-directed way? The first two rules certainly
do not present a problem, but the third one does: we may not be able to
determine which production to use if there are multiple productions for a
given non-terminal X.
The difficulty then lies in the third rule: how can we decide which
production to use? We can turn the question around: for which grammars can
we always decide which rule to use in the third case?
We return to an example to explore this question. We use a simple
grammar for an expression language similar to the one used in Lab 1. We
use id and num to stand for identifier and number tokens produced by the
lexer.
[assign] S −→ id = E ; S
[return] S −→ return E ;
[plus] E −→ E+E
[times] E −→ E*E
[ident] E −→ id
[number] E −→ num
[parens] E −→ (E)
As an example string, consider x = 3; return x;. After lexing, x and
3 are replaced by the tokens id("x") and num(3), which we write just as id and
num, for short.
If we always guess right, we would construct the following deduction
from the bottom to the top. That is, we start with the last line, either determine
or guess which rule to apply to get the previous line, etc., until we reach ε : ε
or get stuck.
                          ε : ε
                          ; : ;
                       id ; : id ;
                       id ; : E ;                    [ident]
                return id ; : return E ;
                return id ; : S                      [return]
              ; return id ; : ; S
          num ; return id ; : num ; S
          num ; return id ; : E ; S                  [number]
        = num ; return id ; : = E ; S
     id = num ; return id ; : id = E ; S
     id = num ; return id ; : S                      [assign]
This parser (assuming all the guesses are made correctly) evidently
traverses the input string from left to right. It also produces a leftmost
derivation, which we can read off from this deduction by reading the
right-hand sides from bottom to top.
We have labeled the inferences that potentially involved a choice with
the name of the chosen grammar production. If we restrict ourselves
to looking at the first token in the input string on the left, which
ones could we have predicted correctly?
In the last line (the first guess we have to make) we are trying to parse
an S and the first input token is id . There is only one production that would
allow this, namely [assign]. So there is no guess necessary.
In the fourth-to-last line (our second potential choice point), the first
token is num and we are trying to parse an E. It is tempting, but wrong,
to say that this must be the production [number]. For example, the string
num + id also starts with token num, but we must use production [plus].
In fact, no input token can disambiguate expression productions for
us. The problem is that the rules [plus] and [times] are left-recursive, that
is, the right-hand side of the production starts with the non-terminal on the
left-hand side. We can never decide by a token look-ahead which rule to
choose, because any token which can start an expression E could arise via
the [plus] and [times] productions.
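To illustrate what one-token prediction looks like in code, here is a sketch; all names are ours and the code is only illustrative. The S-productions are predicted from the first token exactly as argued above, while for expressions we sidestep the left-recursive [plus] and [times] with the standard iterative rewrite E −→ T ((+ | *) T)*, ignoring precedence; that transformation is not part of the grammar above.

# A predictive-parser sketch (our own code).  Statement productions are
# chosen from one token of lookahead; expressions use a right-iterative
# rewrite of [plus]/[times], since the left-recursive originals cannot
# be predicted, as explained above.  Precedence is ignored here.
def parse_program(tokens):
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def eat(tok):
        nonlocal pos
        if peek() != tok:
            raise SyntaxError(f"expected {tok!r}, found {peek()!r}")
        pos += 1

    def parse_S():
        if peek() == "id":              # only [assign] starts with id
            eat("id"); eat("="); parse_E(); eat(";"); parse_S()
        elif peek() == "return":        # only [return] starts with return
            eat("return"); parse_E(); eat(";")
        else:
            raise SyntaxError(f"no S-production starts with {peek()!r}")

    def parse_E():                      # E -> T (("+" | "*") T)*
        parse_T()
        while peek() in ("+", "*"):
            eat(peek()); parse_T()

    def parse_T():                      # id, num, or a parenthesized E
        if peek() in ("id", "num"):
            eat(peek())
        elif peek() == "(":
            eat("("); parse_E(); eat(")")
        else:
            raise SyntaxError(f"no E-production starts with {peek()!r}")

    parse_S()
    if pos != len(tokens):
        raise SyntaxError(f"trailing input at token {pos}")

# The running example x = 3; return x; after lexing:
parse_program(["id", "=", "num", ";", "return", "id", ";"])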
In the next lecture we develop some techniques for analyzing the
grammar to determine if we can parse by searching for a deduction without
backtracking, if we are permitted some lookahead to make the right choice.
This will also be the key for parser generation, the process of compiling a
grammar specification to a specialized efficient parser.
References
[App98] Andrew W. Appel. Modern Compiler Implementation in ML. Cambridge University Press, Cambridge, England, 1998.
[GM02] Harald Ganzinger and David A. McAllester. Logical algorithms. In Proceedings of the 18th International Conference on Logic Programming (ICLP 2002). Springer LNCS 2401, 2002.