Compiler Design
Contents
2 Lexical Analysis
2.1 The Task of Lexical Analysis
2.2 Regular Expressions and Finite-State Machines
2.2.1 Words and Languages
2.3 Language for the Specification of Lexical Analyzers
2.3.1 Character Classes
2.3.2 Non-recursive Parentheses
2.4 Scanner Generation
2.4.1 Character Classes
2.4.2 An Implementation of the until-Construct
2.4.3 Sequences of Regular Expressions
2.4.4 The Implementation of a Scanner
2.5 The Screener
2.5.1 Scanner States
2.5.2 Recognizing Reserved Words
2.6 Exercises
2.7 Literature
3 Syntactic Analysis
3.1 The Task of Syntactic Analysis
3.2 Foundations
3.2.1 Context-free Grammars
3.2.2 Productivity and Reachability of Nonterminals
3.2.3 Pushdown Automata
3.2.4 The Item-Pushdown Automaton to a Context-Free Grammar
3.2.5 first- and follow-Sets
3.2.6 The Special Case first1 and follow1
3.2.7 Pure Union Problems
References
1 The Structure of Compilers
Our series of books treats the compilation of high-level programming languages into the machine languages of virtual or real computers. Such compilers are large, complex software systems, and realizing large and complex software systems is a difficult task. What is special about compilers such that they can even be implemented as a project accompanying a compiler course? A decomposition of the task into subtasks with clearly defined functionalities and clean interfaces between them makes this possible. For compilers there is a more or less standard conceptual structure, composed of components, each solving a well-defined subtask of the compilation task. The interfaces between the components are representations of the input program.
The compiler structure described in the following is a conceptual structure; that is, it identifies the subtasks of the translation of a source language into a target language and defines interfaces between the components realizing these subtasks. The concrete architecture of the compiler is then derived from this conceptual structure: several components may be combined if the subtasks they realize allow this, and a component may be split into several components if the subtask it realizes is very complex.
A first attempt to structure a compiler decomposes it into three components executing three consecutive phases:
1. The analysis phase, realized by the Frontend. It determines the syntactic structure of the source program and checks whether the static semantic constraints are satisfied. These include the type constraints in languages with static type systems.
2. The optimization and transformation phase, performed by what is often called the Middleend. The syntactically analyzed and semantically checked program is transformed by semantics-preserving transformations. These transformations mostly aim at improving the efficiency of the program by reducing its execution time, its memory consumption, or its energy consumption. They are independent of the target architecture and mostly also independent of the source language.
3. The code generation and machine-dependent optimization phase, performed by the Backend. The program is translated into an equivalent program in the target language. Machine-dependent optimizations may be performed, which exploit peculiarities of the target architecture.
This coarse structure splits the compiler into a first phase that depends on the source language, a third phase that depends only on the target architecture, and a second phase that is mostly independent of both. This structure helps to adapt compiler components to new source languages and to new target architectures.
The following sections present these phases in more detail, decompose them further, and show
them working on a small running example. This book describes the analysis phase of the compiler.
The transformation phase is presented in much detail in the volume Analysis and Transformation.
The volume Code Generation and Machine-oriented Optimization covers code generation for a target
machine.
int a, b;
a = 42;
b = a * a - 7;
[Figure: the analysis part — lexical analysis by the scanner yielding the symbol sequence, screening by the screener yielding the decorated symbol sequence, syntactic analysis by the parser yielding the syntax tree, semantic analysis yielding the decorated syntax tree — is followed by the synthesis part, optimization and code generation, yielding the target program.]
Fig. 1.1. Structure of a compiler together with the program representations during the analysis phase.
Myclass, x, while the character sequences 42, 3.14159 and "HelloWorld!" represent constants. It is worth noting that there are, in principle, arbitrarily many such symbols. However, they can be categorized into finitely many classes. A symbol class consists of symbols that are equivalent as far as the syntactic structure of programs is concerned. Identifiers are an example of such a class. Within this class, there may be subclasses such as type constructors in OCaml or variables in Prolog, which are written with a capital first letter. In the class of constants, int constants can be distinguished from floating-point constants and string constants.
The symbols we have considered so far bear semantic interpretations and need, therefore, to be considered in code generation. However, there are also symbols without semantics. Two symbols need a separator between them if their concatenation would also form a symbol. Such a separator can be a blank, a newline, an indentation, or a sequence of such characters. Such so-called white space can also be inserted into a program to make the structure of the program visible.
Another type of symbol, without meaning for the compiler but helpful for the human reader, is the comment; comments can also be used by software-development tools. A similar type of symbol is the compiler directive (pragma). Such directives may tell the compiler to include particular libraries or to influence the memory management for the program to be compiled.
The sequence of symbols for the example program might look as follows:
Int("int") Sep(" ") Id("a") Com(",") Sep(" ") Id("b") Sem(";") Sep("\n")
Id("a") Bec("=") Intconst("42") Sem(";") Sep("\n")
Id("b") Bec("=") Id("a") Mop("*") Id("a") Aop("-") Intconst("7") Sem(";") Sep("\n")
To increase readability, the sequence was broken into lines according to the original program structure.
Each symbol is represented with its symbol class and the substring representing it in the program. More
information may be added such as the position of the string in the input.
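To make this representation concrete, the following is a minimal Python sketch of how a scanner might represent such symbols; all names here are hypothetical illustrations, not the book's code. Each symbol carries its symbol class, the representing substring, and position information.

from dataclasses import dataclass

@dataclass
class Symbol:
    kind: str    # symbol class, e.g. "Id", "Intconst", "Sem"
    text: str    # the substring representing the symbol in the program
    line: int    # position information, useful for error messages
    col: int

# the beginning of the symbol sequence of the example program:
tokens = [
    Symbol("Int", "int", 1, 1), Symbol("Id", "a", 1, 5),
    Symbol("Com", ",", 1, 6), Symbol("Id", "b", 1, 8),
    Symbol("Sem", ";", 1, 9),
]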
• The non-confirmed part of the prediction starts with a terminal symbol. The top-down parser then compares this symbol with the next input symbol. If they agree, another symbol of the prediction is confirmed. Otherwise, the parser has detected an error.
The top-down parser terminates successfully when the whole input has been predicted and confirmed.
Bottom-up parsers start the syntactic analysis of a given program, and the construction of the parse tree, with the input, that is, the given program. They attempt to discover the syntactic structure of longer and longer prefixes of the input program. To do this, they attempt to replace occurrences of right sides of productions by their left-side nonterminals. Such a replacement is called a reduction. If the parser cannot perform a reduction, it does a shift, that is, it reads the next input symbol. These are the only two actions a bottom-up parser can perform; it is, therefore, called a shift-reduce parser. The analysis terminates successfully when the parser has reduced the input program to the start symbol of the grammar by a sequence of shift and reduce steps.
Most programs that are submitted to a compiler are erroneous; many contain syntax errors. The compiler should, therefore, treat the normal case, namely the erroneous program, adequately. Lexical errors are rather local. Syntax errors, for instance in the parenthesis structure of a program, are often more difficult to diagnose. This chapter covers required and possible reactions to syntax errors by the parser. There are essentially four different types of reaction to syntax errors:
1. The error is localized and reported;
2. The error is diagnosed;
3. The error is corrected;
4. The parser gets back into a state in which it can possibly detect further errors.
The first alternative is absolutely required: later stages of the compiler assume that they are only given syntactically correct programs in the form of syntax trees, and the programmer needs to be informed about syntax errors in his programs. There exist, however, two significant problems. First, further syntax errors can remain undetected in the vicinity of a detected error. Second, the parser detects an error when it has no continuation out of its current configuration under the next input symbol. This is, in general, only the error symptom, not the error itself.
Example 3.1.1 Consider the following erroneous assignment statement:

a = a * (b + c * d ;
                    ↑ error symptom: ')' is missing

There are several potential errors: either there is an extra open parenthesis, or a closing parenthesis is missing after c or after d. These three corrections lead to programs with different meanings. ⊓⊔
For errors of extra or missing parentheses such as {, }, begin, end, if, etc., the position of the error and the position of the error symptom can be far apart. The practically relevant parsing methods, LL(k)- and LR(k)-parsing, presented in the following sections, have the viable-prefix property:
When the parser for a context-free grammar G has analyzed the prefix u of a word without announcing an error, then there exists a word w such that uw is a word of G.
Parsers possessing this property report errors and error symptoms at the earliest possible time. We have said above that, in general, the parser will only discover an error symptom, not the error itself. Still, we will speak of errors in the following. In this sense, the discussed parsers perform the first two of the listed actions: they report and try to diagnose errors. Example 3.1.1 shows that the second action is not easily done. The parser can attempt a diagnosis of the error symptom. It should at least provide the following information:
Section 3.2 presents the theoretical foundations of syntax analysis: context-free grammars with their notion of derivation, and pushdown automata, their acceptors. A special non-deterministic pushdown automaton for a context-free grammar is introduced that recognizes the language defined by the grammar. Deterministic top-down and bottom-up parsers for the grammar are derived from this pushdown automaton.
Sections 3.3 and 3.4 describe top-down and bottom-up syntax analysis, respectively. The corresponding grammar classes are characterized and parser-generation methods are presented. Error handling for both top-down and bottom-up parsers is described in detail.
3.2 Foundations
We have seen that lexical analysis is specified by regular expressions and implemented by finite-state
machines. We will now see that syntax analysis is specified by context-free grammars and implemented
by pushdown automata.
Regular expressions are not sufficient to describe the syntax of programming languages, since they cannot express nested recursion as it occurs in the nesting of expressions, statements, and blocks.
In Sections 3.2.1 and 3.2.3, we introduce the needed notions about context-free grammars and pushdown automata. Readers familiar with these notions can skip them and go directly to Section 3.2.4, where a pushdown automaton is introduced for a context-free grammar that accepts the language defined by that grammar.
3.2.1 Context-free Grammars

Context-free grammars can be used to describe the syntactic structure of programs of a programming
language. The grammar describes what the elementary components of programs are and how pieces of
programs can be composed to form bigger pieces.
Example 3.2.1 A section of a grammar describing a C-like programming language might look as follows:
The nonterminal symbol ⟨stat⟩ generates statements. We use the meta-character | to combine several alternatives for one nonterminal. According to this section of a grammar, a statement is either an if-statement, a while-statement, a do-while-statement, an expression followed by a semicolon, an empty statement, or a sequence of statements in parentheses.
The grammar generates if-statements in which the else-part may be missing. They always start with the keyword if, followed by an expression in parentheses and a statement. This statement may be followed by the keyword else and another statement. Further productions describe how while- and do-while-statements and expressions are constructed. For expressions, only some of the possible alternatives are given explicitly; the other alternatives are indicated by ". . .". ⊓⊔
Formally, a context-free grammar is a quadruple G = (VN, VT, P, S), where VN and VT are disjoint alphabets: VN is the set of nonterminals, VT is the set of terminals, P ⊆ VN × (VN ∪ VT)∗ is the finite set of production rules, and S ∈ VN is the start symbol.
Terminal symbols (in short: terminals) are the symbols from which programs are built. While we spoke of alphabets of characters in the section on lexical analysis, typically ASCII or Unicode characters, we now speak of alphabets of symbols as they are returned by the scanner or the screener. Such symbols are reserved keywords of the language, identifiers, or symbol classes comprising sets of symbols.
The nonterminals of the grammar stand for sets of words that can be generated from them according to the production rules of the grammar. In the grammar of Example 3.2.1, they are enclosed in angle brackets. A production rule (in short: production) (A, α) in the relation P describes a possible replacement: an occurrence of the left side A in a word β = γ1 Aγ2 can be replaced by the right side α ∈ (VT ∪ VN)∗. In the view of a top-down parser, a new word β′ = γ1 αγ2 is produced or derived from the word β. A bottom-up parser interprets the production (A, α) as a replacement of the right side α by the left side A: applying the production to a word β′ = γ1 αγ2 reduces it to the word β = γ1 Aγ2.
We introduce some conventions for talking about context-free grammars G = (VN, VT, P, S). Capital Latin letters from the beginning of the alphabet, e.g. A, B, C, denote nonterminals from VN; capital Latin letters from the end of the alphabet, e.g. X, Y, Z, denote terminals or nonterminals. Small Latin letters from the beginning of the alphabet, e.g. a, b, c, . . ., stand for terminals from VT; small Latin letters from the end of the alphabet, like u, v, w, x, y, z, stand for terminal words, that is, elements of VT∗; small Greek letters such as α, β, γ, ϕ, ψ stand for words from (VT ∪ VN)∗.
The relation P is seen as a set of production rules. Each element (A, α) of this relation is, more intuitively, written as A → α. All productions A → α1, A → α2, . . . , A → αn for a nonterminal A are combined into

A → α1 | α2 | . . . | αn

The α1, α2, . . . , αn are called the alternatives of A.
Example 3.2.2 The two grammars G0 and G1 describe the same language:

G0 :  E → E + T | T
      T → T ∗ F | F
      F → (E) | Id

G1 :  E → E + E | E ∗ E | (E) | Id
⊓⊔
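For the examples that follow, it is convenient to fix a concrete machine representation of grammars. The following Python sketch — a representation chosen here for illustration, not prescribed by the text — encodes G0 and G1, with each production given as a pair of left side and right side:

# grammar G0: unambiguous grammar for arithmetic expressions
G0 = {
    "N": {"E", "T", "F"},                      # nonterminals VN
    "T": {"+", "*", "(", ")", "Id"},           # terminals VT
    "P": [("E", ["E", "+", "T"]), ("E", ["T"]),
          ("T", ["T", "*", "F"]), ("T", ["F"]),
          ("F", ["(", "E", ")"]), ("F", ["Id"])],
    "S": "E",                                  # start symbol
}

# grammar G1: ambiguous grammar for the same language
G1 = {
    "N": {"E"},
    "T": {"+", "*", "(", ")", "Id"},
    "P": [("E", ["E", "+", "E"]), ("E", ["E", "*", "E"]),
          ("E", ["(", "E", ")"]), ("E", ["Id"])],
    "S": "E",
}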
We say that a word ϕ directly produces a word ψ according to G, written ϕ ⇒_G ψ, if ϕ = σAτ and ψ = σατ hold for some words σ, τ and a production A → α ∈ P. A word ϕ produces a word ψ according to G, or ψ is derivable from ϕ according to G, written ϕ ⇒_G^* ψ, if there is a finite sequence ϕ0, ϕ1, . . . , ϕn (n ≥ 0) of words such that ϕ0 = ϕ, ϕn = ψ, and ϕi ⇒_G ϕi+1 for 0 ≤ i < n.
Example 3.2.3 The grammars of Example 3.2.2 have, among others, the derivations

E ⇒_G0 E + T ⇒_G0 T + T ⇒_G0 T ∗ F + T ⇒_G0 T ∗ Id + T ⇒_G0 F ∗ Id + T ⇒_G0 F ∗ Id + F ⇒_G0 Id ∗ Id + F ⇒_G0 Id ∗ Id + Id,

E ⇒_G1 E + E ⇒_G1 E ∗ E + E ⇒_G1 Id ∗ E + E ⇒_G1 Id ∗ E + Id ⇒_G1 Id ∗ Id + Id.

We conclude from these derivations that both E ⇒_G0^* Id ∗ Id + Id and E ⇒_G1^* Id ∗ Id + Id hold. ⊓⊔
The language defined by G is L(G) = {w ∈ VT∗ | S ⇒_G^* w}, the set of terminal words derivable from the start symbol. A word x ∈ L(G) is called a word of G. A word α ∈ (VT ∪ VN)∗ with S ⇒_G^* α is called a sentential form of G.
Example 3.2.4 Let us consider again the grammars of Example 3.2.3. The word Id ∗ Id + Id is a word of both G0 and G1, since E ⇒_G0^* Id ∗ Id + Id as well as E ⇒_G1^* Id ∗ Id + Id hold. ⊓⊔
We omit the index G in ⇒_G when the grammar to which a derivation refers is clear from the context.
The syntactic structure of a program, as it results from syntactic analysis, is the parse tree, which
is an ordered tree, that is, a tree in which the outgoing edges of each node are ordered. The parse tree
describes a set of derivations of the program according to the underlying grammar. It therefore allows us to define the notion of ambiguity and to explain the differences between parsing strategies, see Sections 3.3 and 3.4. Within a compiler, the parse tree serves as the interface to the subsequent compiler phases.
Most approaches to the evaluation of semantic attributes, as they are described in Chapter 4, about
semantic analysis, work on this tree structure.
Let G = (VN, VT, P, S) be a context-free grammar, and let t be an ordered tree whose inner nodes are labeled with symbols from VN and whose leaves are labeled with symbols from VT ∪ {ε}. Then t is a parse tree if the label X of each inner node n of t, together with the sequence of labels X1, . . . , Xk of the children of n in t, has the following properties:
1. X → X1 . . . Xk is a production from P.
2. If X1 . . . Xk = ε, then k = 1; that is, node n has exactly one child, and this child is labeled with ε.
3. If X1 . . . Xk ≠ ε, then Xi ≠ ε for each i.
If the root of t is labeled with nonterminal symbol A, and if the concatenation of the leaf labels yields
the terminal word w we call t a parse tree for nonterminal A and word w according to grammar G. If
the root is labeled with S, the start symbol of the grammar, we just call t a parse tree for w.
Example 3.2.5 Fig. 3.1 shows two parse trees according to grammar G1 of Example 3.2.2 for the word
Id ∗ Id + Id. ⊓⊔
Fig. 3.1. Two syntax trees according to grammar G1 of Example 3.2.2 for the word Id ∗ Id + Id.
A parse tree can be viewed as a representation of derivations in which one abstracts from the order and the direction, derivation or reduction, in which productions were applied. A word of the language is called ambiguous if there exists more than one parse tree for it. Correspondingly, the grammar G is called ambiguous if L(G) contains at least one ambiguous word. A context-free grammar that is not ambiguous is called unambiguous.
Example 3.2.6 The grammar G1 is ambiguous because the word Id ∗ Id + Id has more than one parse tree. The grammar G0, on the other hand, is unambiguous. ⊓⊔
The definition implies that each word x ∈ L(G) has at least one derivation from S. To each derivation of a word x corresponds a parse tree for x; thus, each word x ∈ L(G) has at least one parse tree. Conversely, to each parse tree for a word x corresponds at least one derivation of x, and any such derivation can easily be read off the parse tree.
Example 3.2.7 The word Id + Id has exactly one parse tree according to grammar G1, shown in Fig. 3.2. Two different derivations result, depending on the order in which the nonterminals are replaced:

E ⇒ E + E ⇒ Id + E ⇒ Id + Id
E ⇒ E + E ⇒ E + Id ⇒ Id + Id
⊓⊔
3.2 Foundations 41
Fig. 3.2. The uniquely determined parse tree for the word Id + Id.
In Example 3.2.7 we saw that, even for unambiguous words, several derivations may correspond to one parse tree. This results from the different possibilities to choose a nonterminal in a sentential form for the next application of a production. There are essentially two canonical replacement strategies: replacing the leftmost nonterminal or replacing the rightmost nonterminal. In each case one obtains uniquely determined derivations, namely leftmost and rightmost derivations, respectively.
A derivation ϕ1 ⇒ . . . ⇒ ϕn of ϕ = ϕn from S = ϕ1 is a leftmost derivation of ϕ, denoted S ⇒_lm^* ϕ, if in the derivation step from ϕi to ϕi+1 the leftmost nonterminal of ϕi is replaced, that is, ϕi = uAτ and ϕi+1 = uατ for a word u ∈ VT∗ and a production A → α ∈ P.
Similarly, we call a derivation ϕ1 ⇒ . . . ⇒ ϕn a rightmost derivation of ϕ, denoted S ⇒_rm^* ϕ, if the rightmost nonterminal in ϕi is replaced, that is, ϕi = σAu and ϕi+1 = σαu with u ∈ VT∗ and A → α ∈ P.
A sentential form that occurs in a leftmost derivation (rightmost derivation) is called left sentential
form (right sentential form).
To each parse tree there corresponds exactly one leftmost derivation and exactly one rightmost derivation. Thus, there is exactly one leftmost and one rightmost derivation for each unambiguous word in a language.
Example 3.2.8 The word Id ∗ Id + Id has, according to grammar G1, the leftmost derivations

E ⇒_lm E + E ⇒_lm E ∗ E + E ⇒_lm Id ∗ E + E ⇒_lm Id ∗ Id + E ⇒_lm Id ∗ Id + Id and
E ⇒_lm E ∗ E ⇒_lm Id ∗ E ⇒_lm Id ∗ E + E ⇒_lm Id ∗ Id + E ⇒_lm Id ∗ Id + Id,

and the rightmost derivations

E ⇒_rm E + E ⇒_rm E + Id ⇒_rm E ∗ E + Id ⇒_rm E ∗ Id + Id ⇒_rm Id ∗ Id + Id and
E ⇒_rm E ∗ E ⇒_rm E ∗ E + E ⇒_rm E ∗ E + Id ⇒_rm E ∗ Id + Id ⇒_rm Id ∗ Id + Id.

The unambiguous word Id + Id has the single leftmost derivation

E ⇒_lm E + E ⇒_lm Id + E ⇒_lm Id + Id

and the single rightmost derivation

E ⇒_rm E + E ⇒_rm E + Id ⇒_rm Id + Id.
⊓⊔
In an unambiguous grammar, the leftmost and the rightmost derivation of a word consist of the same productions; the difference is the order of their application. The question is whether one can find sentential forms in both derivations that correspond to each other in the following sense: in both derivations, the same occurrence of a nonterminal is replaced in the next step. The following lemma establishes such a relation.
Lemma 3.1.
1. If S ⇒_lm^* uAϕ holds, then there exists a ψ with ψ ⇒* u such that S ⇒_rm^* ψAv holds for all v with ϕ ⇒* v.
2. If S ⇒_rm^* ψAv holds, then there exists a ϕ with ϕ ⇒* v such that S ⇒_lm^* uAϕ holds for all u with ψ ⇒* u. ⊓⊔
Fig. 3.3 clarifies the relation between ϕ and v on one side and ψ and u on the other side.
Fig. 3.3. Correspondence between leftmost and rightmost derivation.
Context-free grammars that describe programming languages should be unambiguous. If this is the case, there exist exactly one parse tree, one leftmost derivation, and one rightmost derivation for each syntactically correct program.
3.2.2 Productivity and Reachability of Nonterminals

A context-free grammar might have superfluous nonterminals and productions. Eliminating them reduces the size of the grammar but does not change the language. We will now introduce two properties
of nonterminals that characterize them as useful and present methods to compute the subsets of nonter-
minals that have these properties. Grammars from which all nonterminals not having these properties
are removed will be called reduced. We will later always assume that the grammars we deal with are
reduced.
The first required property of useful nonterminals is productivity. A nonterminal X of a context-free grammar G = (VN, VT, P, S) is called productive if there exists a derivation X ⇒_G^* w for a word w ∈ VT∗, or equivalently, if there exists a parse tree whose root is labeled with X.
Example 3.2.9 Consider the grammar G = ({S′, S, X, Y, Z}, {a, b}, P, S′), where P consists of the productions:
S′ → S
S → aXZ | Y
X → bS | aY bY
Y → ba | aZ
Z → aZX
Then Y is productive, and therefore so are X, S and S′. The nonterminal Z, on the other hand, is not productive, since the only production for Z contains an occurrence of Z on its right side. ⊓⊔
A two-level characterization of nonterminal productivity leading to an algorithm to compute it is the
following:
(1) X is productive through production p if and only if X is the left side of p, and if all nonterminals
on the right side of p are productive.
(2) X is productive if X is productive through at least one of its alternatives.
In particular, X is thereby productive if there exists a production X → u ∈ P whose right side u has no
nonterminal occurrences, that is, u ∈ VT∗ . Property (1) describes the dependence of the information for
X on the information about symbols on the right side of the production for X; property (2) indicates
how to combine the information obtained from the different alternatives for X.
We now describe a method that computes for a context-free grammar G the set of all productive nonterminals. The method uses for each production p a counter count[p], which counts the number of occurrences of nonterminals whose productivity is not yet known. When the counter of a production p has decreased to 0, all nonterminals on its right side must be productive; therefore, the left side of p is productive through p. To manage the productions whose counter has decreased to 0, the algorithm uses a worklist W. Further, for each nonterminal X, a list occ[X] of the occurrences of this nonterminal in right sides of productions is maintained:
The call init(p) of the routine init(), whose code is not given here, iterates over the sequence of symbols on the right side of production p. At each occurrence of a nonterminal X, the counter count[p] is incremented and p is added to the list occ[X]. If count[p] = 0 still holds at the end, then init(p) enters production p into the list W. This concludes the initialization.
The main iteration processes the productions in W one by one. For each production p in W, the left side is productive through p and therefore productive. When a nonterminal X is newly discovered to be productive, the algorithm iterates through the list occ[X] of those productions in which X occurs; the counter count[r] is decremented for each production r in this list. The described method is realized by the following algorithm:
...
while (W ≠ []) {
    X ← hd(W); W ← tl(W);
    if (X ∉ productive) {
        productive ← productive ∪ {X};
        forall ((r : A → α) ∈ occ[X]) {
            count[r]--;
            if (count[r] = 0) W ← A :: W;
        }
    }
}
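The following executable Python sketch renders the counter-based worklist method just described, using the production-list representation introduced after Example 3.2.2; it is one possible rendering, not the book's reference implementation.

def productive_nonterminals(productions, nonterminals):
    # count[p]: nonterminal occurrences in the right side of production p
    # that are not yet known to be productive
    count = {}
    occ = {X: [] for X in nonterminals}  # productions in whose right side X occurs
    productive, W = set(), []
    for p, (A, rhs) in enumerate(productions):   # the init() phase
        count[p] = 0
        for X in rhs:
            if X in nonterminals:
                count[p] += 1
                occ[X].append(p)
        if count[p] == 0:                # right side is purely terminal
            W.append(A)
    while W:                             # the main iteration
        X = W.pop()
        if X not in productive:
            productive.add(X)
            for p in occ[X]:
                count[p] -= 1
                if count[p] == 0:
                    W.append(productions[p][0])
    return productive

On the grammar of Example 3.2.9 this returns {S′, S, X, Y} and correctly leaves out Z.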
Let us derive the run time of this algorithm. The initialization phase essentially runs once over the grammar and does a constant amount of work for each symbol. The main iteration enters the left side of each production at most once into the list W and removes it at most once from the list. At the removal of a nonterminal X from W, more than a constant amount of work has to be done only when X has not yet been marked as productive. The effort for such an X is proportional to the length of the list occ[X]. The sum of these lengths is bounded by the overall size of the grammar G. This means that the total effort is linear in the size of the grammar.
To show the correctness of the procedure, we ascertain that it possesses the following properties:
• If X is entered into the set productive in the j-th iteration of the while-loop, then there exists a parse tree for X of height at most j − 1.
• For each parse tree, its root label is entered into W at least once.
The efficient algorithm just presented has relevance beyond its application in compiler construction. With small modifications, it can be used to compute least solutions of Boolean systems of equations, that is, of systems of equations in which the right sides are disjunctions of arbitrary conjunctions of unknowns. In our example, the conjunctions stem from the right sides of the productions, while a disjunction represents the existence of several alternatives for a nonterminal.
The second property of a useful nonterminal is its reachability. We call a nonterminal X reachable in a context-free grammar G = (VN, VT, P, S) if there exists a derivation S ⇒_G^* αXβ.
Example 3.2.10 Consider the grammar G = ({S, U, V, X, Y, Z}, {a, b, c, d}, P, S), where P consists
of the following productions:
S → Y
Y → Y Z | Y a | b
U → V
X → c
V → V d | d
Z → Z X
The set of reachable nonterminals can be computed by the following algorithm, where rhs[X] denotes the set of nonterminals occurring in right sides of productions for X:

set⟨nonterminal⟩ reachable ← ∅;
list⟨nonterminal⟩ W ← S :: [];
nonterminal Y;
while (W ≠ []) {
    X ← hd(W); W ← tl(W);
    if (X ∉ reachable) {
        reachable ← reachable ∪ {X};
        forall (Y ∈ rhs[X]) W ← Y :: W;
    }
}
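In Python, using the same grammar representation as before, the reachability computation is a plain graph traversal from the start symbol; again this is an illustrative sketch:

def reachable_nonterminals(productions, nonterminals, S):
    rhs = {X: set() for X in nonterminals}  # nonterminals in right sides of X
    for A, alpha in productions:
        rhs[A] |= {Y for Y in alpha if Y in nonterminals}
    reachable, W = set(), [S]
    while W:
        X = W.pop()
        if X not in reachable:
            reachable.add(X)
            W.extend(rhs[X])
    return reachable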
To reduce a grammar G, first all non-productive nonterminals are removed from the grammar, together with all productions in which they occur. Only in a second step are the non-reachable nonterminals eliminated, again together with the productions in which they occur. The second step is, therefore, based on the assumption that all remaining nonterminals are productive.
Example 3.2.11 Let us consider again the grammar of Example 3.2.9 with the productions
S′ → S
S → aXZ | Y
X → bS | aY bY
Y → ba | aZ
Z → aZX
The set of productive nonterminals is {S ′ , S, X, Y }, while Z is not productive. To reduce the grammar,
a first step removes all productions in which Z occurs. The resulting set is P1 :
S′ → S
S → Y
X → bS | aY bY
Y → ba
Although X was reachable according to the original set of productions, X is no longer reachable after the first step. The set of reachable nonterminals is VN′ = {S′, S, Y}. By removing all productions whose left side is no longer reachable, the following set is obtained:

S′ → S
S → Y
Y → ba
⊓⊔
We assume in the following that grammars are always reduced.
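Combining the two steps in the order just described gives a compact sketch of grammar reduction; it reuses the two Python functions sketched above:

def reduce_grammar(productions, nonterminals, S):
    # step 1: drop non-productive nonterminals with their productions
    prod = productive_nonterminals(productions, nonterminals)
    P1 = [(A, alpha) for (A, alpha) in productions
          if A in prod and all(X in prod or X not in nonterminals
                               for X in alpha)]
    # step 2: drop productions whose left side is no longer reachable
    reach = reachable_nonterminals(P1, nonterminals, S)
    return [(A, alpha) for (A, alpha) in P1 if A in reach]

Applied to the grammar of Example 3.2.9 (with start symbol S′), this yields exactly the three productions of Example 3.2.11.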
3.2.3 Pushdown Automata

This section treats the automata model corresponding to context-free grammars: pushdown automata.
We need to describe how to realize a compiler component that performs syntax analysis according to
a given context-free grammar. Section 3.2.4 describes such a method. The pushdown automaton con-
structed for a context-free grammar, however, has a problem: it is non-deterministic for most grammars.
In Sections 3.3 and 3.4 we describe how for appropriate subclasses of context-free grammars the thus
constructed pushdown automaton can be modified to become deterministic.
In contrast to the finite-state machines of the preceding chapter, a pushdown automaton has an
unlimited storage capacity. It has a (conceptually) unbounded data structure, the stack, which works
according to a last-in, first-out principle. Fig. 3.4 shows a schematic picture of a pushdown automaton.
The reading head is only allowed to move from left to right, as was the case with finite-state machines.
In contrast to finite-state machines, transitions of the pushdown automaton depend not only on the current state and the next input symbol, but also on some topmost section of the stack. A transition may change this upper section of the stack, and it may consume the next input symbol by moving the reading head one place to the right.
Formally, a pushdown automaton is a tuple P = (Q, VT, ∆, q0, F), where
• Q is a finite set of states,
• VT is the input alphabet,
• q0 ∈ Q is the initial state,
• F ⊆ Q is the set of final states, and
• ∆, the transition relation, is a finite subset of Q+ × (VT ∪ {ε}) × Q∗. It can be seen as a finite partial function from Q+ × (VT ∪ {ε}) into the finite subsets of Q∗.
[Fig. 3.4. Schematic representation of a pushdown automaton: a control, a reading head on the input tape, and a stack.]
Our definition of a pushdown automaton is somewhat unusual in that it does not distinguish between the states of the automaton and its stack symbols; it uses the same alphabet for both. In this way, the topmost stack symbol is interpreted as the current state. The transition relation describes the possible
computation steps of the pushdown automaton. It lists finitely many transitions. Executing the transition
(γ, x, γ ′ ) replaces the upper section γ ∈ Q+ of the stack contents by the new sequence γ ′ ∈ Q∗ of
states and reads x ∈ VT ∪ {ε} in the input. The replaced section of the stack contents has at least the
length 1. A transition that doesn’t inspect the next input symbol is called an ε-transition.
Similarly as for finite-state machines, we introduce the notion of a configuration for pushdown
automata. A configuration encompasses all components that may influence the future behavior of the
automaton. With our kind of pushdown automata these are the stack contents and the remaining input.
Formally, a configuration of the pushdown automaton P is a pair (γ, w) ∈ Q+ × VT∗. In the linear representation, the topmost position of the stack is always at the right end of γ, while the next input symbol is situated at the left end of w. A transition of P is represented through the binary relation ⊢_P between configurations, defined by

(αγ, xw) ⊢_P (αγ′, w)   if (γ, x, γ′) ∈ ∆,

for a suitable α ∈ Q∗. As was the case with finite-state machines, a computation is a sequence of configurations in which a transition exists between each two consecutive members. We write C ⊢_P^n C′ if there exist configurations C1, . . . , Cn+1 such that C1 = C, Cn+1 = C′, and Ci ⊢_P Ci+1 for 1 ≤ i ≤ n. The relations ⊢_P^+ and ⊢_P^* are the transitive and the reflexive-transitive closure of ⊢_P, respectively:

⊢_P^+ = ⋃_{n≥1} ⊢_P^n   and   ⊢_P^* = ⋃_{n≥0} ⊢_P^n
A word w is accepted by the pushdown automaton if there exists at least one computation that goes from the initial configuration (q0, w) to a final configuration, that is, to a configuration (f, ε) with f ∈ F. Such computations are called accepting. For one word there may exist several accepting computations, but also several computations that read only a prefix of w, or that read all of w without reaching a final configuration.
In practice, accepting computations should not be found by trial and error. Therefore, deterministic
pushdown automata are of particular importance.
A pushdown automaton P is called deterministic if the transition relation ∆ has the following property:

(D) If (γ1, x, γ2) and (γ1′, x′, γ2′) are two different transitions in ∆ and γ1′ is a suffix of γ1, then x and x′ are in VT and are different from each other, that is, x ≠ ε ≠ x′ and x ≠ x′.

If the transition relation has property (D), there exists at most one transition out of each configuration.
3.2.4 The Item-Pushdown Automaton to a Context-Free Grammar

In this section, we present a method that constructs for each context-free grammar a pushdown automaton accepting the language defined by the grammar. This automaton is non-deterministic and therefore not directly useful for practical application. However, the LL-parsers of Section 3.3 as well as the LR-parsers of Section 3.4 can be derived from it by appropriate design decisions.
The notion of a context-free item plays a decisive role. Let G = (VN, VT, P, S) be a context-free grammar. A context-free item of G is a triple (A, α, β) with A → αβ ∈ P. This triple is, more intuitively, written as [A → α.β]. The item [A → α.β] describes the situation that, in an attempt to derive a word w from A, a prefix of w has already been derived from α. Hence α is called the history of the item.
An item [A → α.β] with β = ε is called complete. The set of all context-free items of G is denoted by It_G. If ρ is the sequence of items

ρ = [A1 → α1.β1][A2 → α2.β2] . . . [An → αn.βn],

then hist(ρ) denotes the concatenation of the histories of the items of ρ, that is, hist(ρ) = α1 α2 . . . αn.
The item-pushdown automaton P_G to the grammar G, which we assume to be extended by a new start symbol S′ with the production S′ → S, uses the items of the extended grammar as its states. Its initial state is [S′ → .S], its single final state is [S′ → S.], and its transition relation consists of transitions of the following three types, for productions Y → α ∈ P and terminals a ∈ VT:

(E) ([X → β.Y γ], ε, [X → β.Y γ][Y → .α])
(S) ([X → β.aγ], a, [X → βa.γ])
(R) ([X → β.Y γ][Y → α.], ε, [X → βY.γ])

Transitions according to (E) are called expanding transitions, those according to (S) shifting transitions, and those according to (R) reducing transitions. Each sequence of items that occurs as stack contents in a computation of the item-pushdown automaton satisfies the following invariant (I):
(I) If ([S′ → .S], uv) ⊢_PG^* (ρ, v), then hist(ρ) ⇒_G^* u.
This invariant is an essential part of the proof that the item-pushdown automaton P_G only accepts words of G, that is, that L(P_G) ⊆ L(G) holds. We now explain the way the automaton P_G works and at the same time give a proof, by induction over the length of computations, that the invariant (I) holds for each configuration reachable from an initial configuration. Let us first consider the initial configuration for the input w. The initial configuration is ([S′ → .S], w). The word u = ε has already been read, hist([S′ → .S]) = ε, and ε ⇒* ε holds. Therefore, the invariant holds in this configuration.
Let us now consider computations that consist of at least one transition. Let us first assume that the last transition was an expanding transition. Before this transition, a configuration (ρ[X → β.Y γ], v) was reached from the initial configuration ([S′ → .S], uv). This configuration satisfies the invariant (I) by the induction hypothesis, that is, hist(ρ)β ⇒* u holds. The item [X → β.Y γ] as current state suggests deriving a prefix of v from Y. To do this, the automaton non-deterministically selects one of the alternatives for Y. This is described by the transitions according to (E). All the successor configurations (ρ[X → β.Y γ][Y → .α], v) for Y → α ∈ P also satisfy the invariant (I) because

hist(ρ[X → β.Y γ][Y → .α]) = hist(ρ)β ⇒* u.
As the next case, we assume that the last transition was a shifting transition. Before this transition, a configuration (ρ[X → β.aγ], av) was reached from the initial configuration ([S′ → .S], uav). This configuration satisfies the invariant (I) by the induction hypothesis, that is, hist(ρ)β ⇒* u holds. The successor configuration (ρ[X → βa.γ], v) also satisfies the invariant (I) because

hist(ρ[X → βa.γ]) = hist(ρ)βa ⇒* ua.
For the final case, let us assume that the last transition was a reducing transition. Before this transition, a configuration (ρ[X → β.Y γ][Y → α.], v) was reached from the initial configuration ([S′ → .S], uv). This configuration satisfies the invariant (I) by the induction hypothesis, that is, hist(ρ)βα ⇒_G^* u holds. The current state is the complete item [Y → α.]. It is the result of a computation that started with the item [Y → .α] when [X → β.Y γ] was the current state and the alternative Y → α for Y was selected. This alternative has been successfully processed. The successor configuration (ρ[X → βY.γ], v) also satisfies the invariant (I) because hist(ρ)βα ⇒_G^* u implies hist(ρ)βY ⇒_G^* u. ⊓⊔
Taken together, the following theorem holds:
Theorem 3.2.1 For each context-free grammar G, L(P_G) = L(G).

Proof. Let us assume w ∈ L(P_G). Then

([S′ → .S], w) ⊢_PG^* ([S′ → S.], ε).

Because of the invariant (I), which we have already proved, it follows that

S = hist([S′ → S.]) ⇒_G^* w.

Therefore w ∈ L(G). For the other direction, we assume w ∈ L(G). Then S ⇒_G^* w. To prove

([S′ → .S], w) ⊢_PG^* ([S′ → S.], ε),

we show a more general statement, namely that for each derivation A ⇒_G α ⇒_G^* w with A ∈ VN,

(ρ[A → .α], wv) ⊢_PG^* (ρ[A → α.], v)

holds for arbitrary ρ ∈ It_G^* and arbitrary v ∈ VT∗. This general claim can be proved by induction over the length of the derivation A ⇒_G α ⇒_G^* w. ⊓⊔
Example 3.2.12 Consider the grammar G0 of Example 3.2.2, extended by the new start symbol S:

S → E
E → E + T | T
T → T ∗ F | F
F → (E) | Id

The transition relation ∆ of P_G0 is presented in Table 3.1. Table 3.2 shows an accepting computation of P_G0 for the word Id + Id ∗ Id. ⊓⊔
Pushdown automata as such are only acceptors; that is, they decide whether or not an input string is a word of the language. Using a pushdown automaton for syntactic analysis in a compiler requires more than a yes/no answer: the automaton should output the syntactic structure of accepted input words. This output can take one of several forms, for example a parse tree or the sequence of productions as they were applied in a leftmost or rightmost derivation. We therefore extend pushdown automata by a means to produce output.
A pushdown automaton with output is a tuple P = (Q, VT, O, ∆, q0, F), where Q, VT, q0, F are as in a normal pushdown automaton and O is a finite output alphabet. ∆ is now a finite relation between Q+ × (VT ∪ {ε}) and Q∗ × (O ∪ {ε}). A configuration consists of the current stack contents, the remaining input, and the output produced so far; it is an element of Q+ × VT∗ × O∗. At each transition, the automaton may output one symbol from O. If a pushdown automaton with output is used as a parser, its output alphabet consists of the productions of the context-free grammar or of their numbers.
The item-pushdown automaton can be extended by a means to produce output in essentially two different ways. It can output the applied production whenever it performs an expansion; in this case, the overall output of an accepting computation is a leftmost derivation, and a pushdown automaton with this output discipline is called a left-parser. Instead of at expansion, the item-pushdown automaton can output the applied production at each reduction. In this case, it delivers a rightmost derivation, but in reversed order; a pushdown automaton with this output discipline is called a right-parser.
Deterministic Parsers
In Theorem 3.2.1 we proved that the item-pushdown automaton P_G for a context-free grammar G accepts the grammar's language L(G). However, the non-deterministic way of working of this pushdown automaton is unsuitable for practice. The source of non-determinism lies in the transitions of type (E): at expanding transitions, the item-pushdown automaton can choose between several alternatives for a nonterminal. For an unambiguous grammar, at most one of them is the correct choice to derive a prefix of the remaining input; the other alternatives lead sooner or later into dead ends. The item-pushdown automaton can only guess the right alternative.
In Sections 3.3 and 3.4, we describe two different ways to replace this guessing. The LL-parsers of Section 3.3 deterministically choose one alternative for the current nonterminal using a bounded lookahead into the remaining input. For grammars of the class LL(k), a corresponding parser can deterministically select one (E)-transition based on the already consumed input, the nonterminal to be expanded, and the next k input symbols. LL-parsers are left-parsers.
LR-parsers work differently. They delay the decision that LL-parsers take at expansion until reduction. During the whole analysis they pursue in parallel all possible derivations that may lead to a reverse rightmost derivation of the input word. A decision has to be taken only when one of these possibilities signals a reduction. This decision concerns whether to continue shifting or to reduce and, in the latter case, by which production. The basis for this decision is again the current stack contents and a bounded lookahead into the remaining input. LR-parsers signal reductions and are therefore right-parsers. An LR-parser does not exist for every context-free grammar, but only for grammars of the class LR(k), where k again is the number of necessary lookahead symbols.

[Table 3.1. Tabular representation of the transition relation of Example 3.2.12; the middle column shows the consumed input.]
3.2.5 first- and follow-Sets

For a word u ∈ VT∗ and k ≥ 1, let u|k denote u itself if |u| ≤ k, and the prefix of u of length k otherwise. For words u and v we define

u ⊙k v = (uv)|k

This operator is called k-concatenation. We extend both operators to sets of words: for sets L ⊆ VT∗ and L1, L2 ⊆ VT≤k we define

L|k = {u|k | u ∈ L}   and   L1 ⊙k L2 = {u ⊙k v | u ∈ L1, v ∈ L2}
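Transcribed directly into Python, with words represented as tuples of symbols and sets as Python sets, the two operators look as follows; this is a small sketch for experimentation, not optimized code:

def prefix_k(u, k):
    # u|k: u itself if |u| <= k, otherwise the prefix of u of length k
    return u[:k]

def concat_k(L1, L2, k):
    # L1 ⊙k L2, the pointwise extension of k-concatenation to sets
    return {prefix_k(u + v, k) for u in L1 for v in L2}

# example: concat_k({("Id",), ()}, {("+", "Id")}, 1) == {("Id",), ("+",)}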
Let G = (VN, VT, P, S) be a context-free grammar. For k ≥ 1, we define the function firstk : (VN ∪ VT)∗ → 2^(VT≤k) that returns for each word α the set of all prefixes of length k of terminal words that can be derived from α:

firstk(α) = {u|k | α ⇒* u}

Correspondingly, the function followk : VN → 2^(VT≤k ∪ VT≤k−1{#}) returns for a nonterminal X the set of terminal words of length at most k that can directly follow a nonterminal X in a sentential form:

followk(X) = {w | S ⇒* βXγ and w ∈ firstk(γ#)}
The set firstk(X) consists of the k-prefixes of leaf words of all parse trees for X; followk(X) consists of the k-prefixes of the second parts of the leaf words of all upper tree fragments for X (see Fig. 3.5). The following lemma states some properties of k-concatenation and of the firstk-sets.
The proofs for (b), (c), (d), and (e) are trivial. Property (a) is obtained by case distinctions over the lengths of words x ∈ L1, y ∈ L2, z ∈ L3. The proof for (f) uses (e) and the observation that X1 . . . Xn ⇒* u holds if and only if u = u1 . . . un for suitable words ui with Xi ⇒* ui.
Because of property (f), the computation of the set firstk(α) can be reduced to the computation of the sets firstk(X) for single symbols X ∈ VT ∪ VN. Since firstk(a) = {a} holds for a ∈ VT, it suffices to determine the sets firstk(X) for nonterminals X. A word w ∈ VT≤k is in firstk(X) if and only if w is contained in the set firstk(α) for one of the productions X → α ∈ P.
Due to property (f) of Lemma 3.2, the firstk-sets satisfy the equation system (fi):

firstk(X) = ⋃ {firstk(X1) ⊙k . . . ⊙k firstk(Xn) | X → X1 . . . Xn ∈ P},   X ∈ VN   (fi)
Example 3.2.13 Consider the grammar G2 with the productions

0: S → E        3: E′ → +E       6: T′ → ∗T
1: E → T E′     4: T → F T′      7: F → (E)
2: E′ → ε       5: T′ → ε        8: F → Id
⊓⊔
The right sides of the system of equations for the firstk-sets can be represented as expressions consisting of unknowns firstk(Y), Y ∈ VN, and set constants {x}, x ∈ VT ∪ {ε}, built using the operators ⊙k and ∪. Immediately, the following questions arise:
• Does this system of equations always have solutions?
• If yes, which of them corresponds to the firstk-sets?
• How does one compute this solution?
To answer these questions, we first consider systems of equations like (fi) in general and look for an algorithmic approach to solve them. Let x1, . . . , xn be a set of unknowns and

x1 = f1(x1, . . . , xn)
x2 = f2(x1, . . . , xn)
. . .
xn = fn(x1, . . . , xn)
a system of equations to be solved over a domain D. Each fi on the right side denotes a function fi : D^n → D. A solution I∗ of this system of equations associates a value I∗(xi) with each unknown xi such that all equations are satisfied, that is, I∗(xi) = fi(I∗(x1), . . . , I∗(xn)) holds for all i. To find such a solution, one can attempt a simple iteration: assume that D contains a designated start value d0, and initialize all unknowns x1, . . . , xn to this start value d0. Let I(0) be this variable binding. All right sides fi are evaluated in
this variable binding. This might associate each variable xi with a new value. All these new values
form a new variable binding I (1) , in which the right sides are again evaluated, and so on. Let us assume
that a current variable binding I(j) has been computed. The new variable binding I(j+1) is determined by

I(j+1)(xi) = fi(I(j)(x1), . . . , I(j)(xn))

A sequence of variable bindings I(0), I(1), . . . results. If I(j+1) = I(j) holds for some j ≥ 0, then I(j) is a solution of the system; if d0 is a least element of D and all fi are monotonic, this sequence is an ascending chain of variable bindings. If the domain D is finite, there exists a j such that I(j) = I(j+1) holds. This means that the algorithm in fact finds a solution. One can even show that this solution is the least solution. Such a least solution exists even if the complete lattice is not finite and the simple iteration does not terminate; this follows from the fixed-point theorem of Knaster–Tarski, which we treat in detail in the third volume, Compiler Design: Analysis and Transformation.
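The global iteration itself takes only a few lines. The following Python sketch solves a system given as a mapping from variables to right-side functions over the current variable binding; it assumes finite subset domains, so termination is guaranteed:

def least_solution(equations, variables):
    I = {x: frozenset() for x in variables}  # start value: the least element
    while True:
        J = {x: frozenset(f(I)) for x, f in equations.items()}
        if J == I:                            # no change: fixed point reached
            return I
        I = J

# tiny example: x1 = {a} ∪ x2, x2 = x1
eqs = {"x1": lambda I: {"a"} | I["x2"],
       "x2": lambda I: I["x1"]}
# least_solution(eqs, ["x1", "x2"]) binds both x1 and x2 to {"a"}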
Example 3.2.14 Let us apply this algorithm to determine a solution of the system of equations of Example 3.2.13. Initially, all nonterminals are associated with the empty set. The following table shows the words added to the first1-sets in the i-th iteration:

      1     2     3     4     5     6     7     8
S                       Id                (
E                 Id                (
E′    ε                 +
T           Id                (
T′    ε           ∗
F     Id                (

⊓⊔
Theorem 3.2.2 (Correctness of the firstk-sets) Let G = (VN, VT, P, S) be a context-free grammar, let D be the complete lattice of subsets of VT≤k, and let I : VN → D be the least solution of the system of equations (fi). Then I(X) = firstk(X) holds for all X ∈ VN.

Proof. For i ≥ 0, let I(i) be the variable binding after the i-th iteration of the algorithm to find solutions for (fi). One shows by induction over i that I(i)(X) ⊆ firstk(X) holds for all i ≥ 0 and all X ∈ VN. Therefore, I(X) = ⋃_{i≥0} I(i)(X) ⊆ firstk(X) also holds for all X ∈ VN. For the other direction it suffices to show that for each derivation X ⇒_lm^* w there exists an i ≥ 0 with w|k ∈ I(i)(X). This claim is again shown by induction, this time over the length n ≥ 1 of the leftmost derivation. If n = 1, the grammar has a production X → w. We then have I(1)(X) ⊇ {w|k}, and the claim follows with i = 1. If n > 1, there exists a production X → u0 X1 u1 . . . Xm um with u0, . . . , um ∈ VT∗ and X1, . . . , Xm ∈ VN, and leftmost derivations Xj ⇒_lm^* wj, j = 1, . . . , m, all of length less than n, with w = u0 w1 u1 . . . wm um. According to the induction hypothesis, for each j ∈ {1, . . . , m} there exists an ij such that wj|k ∈ I(ij)(Xj) holds. Let i′ be the maximum of these ij. For i = i′ + 1 it holds that

I(i)(X) ⊇ {u0} ⊙k I(i′)(X1) ⊙k {u1} ⊙k . . . ⊙k I(i′)(Xm) ⊙k {um}
       ⊇ {u0} ⊙k {w1|k} ⊙k {u1} ⊙k . . . ⊙k {wm|k} ⊙k {um}
       ⊇ {w|k}   ⊓⊔
For an extended grammar with start symbol S′, the followk-sets satisfy the system of equations (fo):

followk(S′) = {#}
followk(X) = ⋃ {firstk(β) ⊙k followk(Y) | Y → αXβ ∈ P},   X ∈ VN \ {S′}   (fo)
Example 3.2.15 Let us again consider the context-free grammar G2 of Example 3.2.13. To calculate the follow1-sets for the grammar G2, we use the system of equations

follow1(S) = {#}
follow1(E) = follow1(S) ∪ follow1(E′) ∪ {)} ⊙1 follow1(F)
follow1(E′) = follow1(E)
follow1(T) = first1(E′) ⊙1 follow1(E) ∪ follow1(T′)
follow1(T′) = follow1(T)
follow1(F) = first1(T′) ⊙1 follow1(T)
⊓⊔
The system of equations (fo) again has to be solved over a subset lattice. The right sides of the equations are built from constant sets and unknowns by monotonic operators; therefore, (fo) has a solution, which can be computed by global iteration. We now ascertain that this algorithm indeed computes the desired sets.
Theorem 3.2.3 (Correctness of the followk-sets) Let G = (VN, VT, P, S′) be an extended context-free grammar, let D be the complete lattice of subsets of VT^k ∪ VT≤k−1{#}, and let I : VN → D be the least solution of the system of equations (fo). Then I(X) = followk(X) holds for all X ∈ VN. ⊓⊔

The proof is similar to the proof of Theorem 3.2.2 and is left to the reader (Exercise 6).
Example 3.2.16 We consider the system of equations of Example 3.2.15. To compute its least solution, the iteration again starts with the value ∅ for each nonterminal. The words added in the subsequent iterations are shown in the following table:

      1     2     3        4          5     6     7
S     #
E           #                         )
E′                #                         )
T                 +, #                      )
T′                         +, #                   )
F                          ∗, +, #                )

⊓⊔
3.2.6 The Special Case first1 and follow1

The iterative method for computing least solutions of the systems of equations for the firstk- and followk-sets is not very efficient. But even for more efficient methods, the computation of firstk- and followk-sets requires a large effort when k gets larger. Therefore, practical parsers only use lookahead of length k = 1. In this case, the computation of the first- and follow-sets can be performed particularly efficiently. The following lemma is the basis for our further treatment.
According to our assumption, the considered grammars are always reduced; they contain neither non-productive nor unreachable nonterminals. Hence, for all X ∈ VN, both first1(X) and follow1(X) are non-empty. Taken together with Lemma 3.3, this allows us to simplify the transfer functions for first1 and follow1 in such a way that 1-concatenation can (essentially) be replaced by unions. We want to eliminate the case distinction of whether ε is contained in the first1-sets or not. This is done in two steps. In the first step, the set of nonterminals X with ε ∈ first1(X) is determined. In the second step, the ε-free first1-set is determined for each nonterminal X instead of the first1-set. The ε-free first1-sets are defined by

eff(X) = first1(X) \ {ε}
To implement the first step, it helps to exploit the fact that for each nonterminal X,

ε ∈ first1(X) if and only if X ⇒* ε.
Example 3.2.17 Consider the grammar G2 of Example 3.2.13. The set of productions in which no terminal symbol occurs is

0: S → E
1: E → T E′     4: T → F T′
2: E′ → ε       5: T′ → ε

With respect to this set of productions, only the nonterminals E′ and T′ are productive. These two nonterminals are thus the only ε-productive nonterminals of grammar G2. ⊓⊔
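The first step can therefore reuse the productivity algorithm from before: X ⇒* ε holds if and only if X is productive with respect to the terminal-free productions. A Python sketch, building on the productive_nonterminals function sketched earlier:

def nullable_nonterminals(productions, nonterminals):
    # keep only productions whose right sides contain no terminal symbol
    terminal_free = [(A, alpha) for (A, alpha) in productions
                     if all(X in nonterminals for X in alpha)]
    return productive_nonterminals(terminal_free, nonterminals)

For grammar G2 of Example 3.2.13 this returns exactly {E′, T′}, as computed in Example 3.2.17.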
Let us now turn to the second step, the computation of the ε-free first1-sets. Consider a production of the form X → X1 . . . Xm. Its contribution to eff(X) can be written as

⋃ {eff(Xj) | 1 ≤ j ≤ m, X1 . . . Xj−1 ⇒_G^* ε}

where eff is extended to terminal symbols a by eff(a) = {a}.
Example 3.2.18 Consider again the context-free grammar G2 of Example 3.2.13. The following system of equations serves to compute the ε-free first1-sets:

eff(S) = eff(E)
eff(E) = eff(T)
eff(E′) = {+}
eff(T) = eff(F)
eff(T′) = {∗}
eff(F) = {(} ∪ {Id}

All occurrences of the ⊙1-operator have disappeared. Instead, only constant sets, unions, and variables eff(X) appear on the right sides. The least solution is

eff(S) = eff(E) = eff(T) = eff(F) = {(, Id},   eff(E′) = {+},   eff(T′) = {∗}
Nonterminals that occur to the right of terminals do not contribute to the ε-free first1 -sets. It is important
for the correctness of the construction that all nonterminals of the grammar are productive.
The ε-free first1-sets eff(X) can also be used to simplify the system of equations for the computation of the follow1-sets. Consider a production of the form Y → αXX1 . . . Xm. The contribution of this occurrence of X in the right side of Y to the set follow1(X) is

⋃ {eff(Xj) | 1 ≤ j ≤ m, X1 . . . Xj−1 ⇒_G^* ε} ∪ ⋃ {follow1(Y) | X1 . . . Xm ⇒_G^* ε}
If all nonterminals are not only productive but also reachable, the system of equations for the computation of the follow1-sets simplifies to

follow1(S′) = {#}
follow1(X) = ⋃ {eff(Y) | A → αXβY γ ∈ P, β ⇒_G^* ε}
           ∪ ⋃ {follow1(A) | A → αXβ ∈ P, β ⇒_G^* ε},   X ∈ VN \ {S′}
Example 3.2.19 The simplified system of equations for the computation of the follow1-sets of the context-free grammar G2 of Example 3.2.13 becomes

follow1(S) = {#}
follow1(E) = follow1(S) ∪ follow1(E′) ∪ {)}
follow1(E′) = follow1(E)
follow1(T) = {+} ∪ follow1(E) ∪ follow1(T′)
follow1(T′) = follow1(T)
follow1(F) = {∗} ∪ follow1(T)

Again we observe that all occurrences of the operator ⊙1 have been eliminated. Only constant sets and variables follow1(X), combined by the union operator, occur on the right sides of the equations. ⊓⊔
The next section presents a method that very efficiently solves arbitrary systems of equations similar to the simplified systems of equations for the sets eff(X) and follow1(X). We first describe the general method and then apply it to the computation of the first1- and follow1-sets.
3.2.7 Pure Union Problems

We construct a variable-dependency graph for a pure union problem. The nodes of this graph are the variables xi of the system of equations. An edge (xi, xj) exists if and only if the variable xi occurs in the right side of the equation for the variable xj. Fig. 3.6 shows the variable-dependency graph for the system of equations of Example 3.2.20.
Fig. 3.6. The variable-dependency graph for the system of equations of Example 3.2.20.
Let I be the least solution of the system of equations. We observe that I(xi) ⊑ I(xj) must always hold if there exists a path from xi to xj in the variable-dependency graph. In consequence, the values of all variables in each strongly connected component of the variable-dependency graph are the same. We label each variable xi with the least upper bound of all constants that occur on the right side of the equation for xi; let us call this value I0(xi). For all j we have

I(xj) = ⊔ {I0(xi) | xj is reachable from xi}
Example 3.2.21 (Continuation of Example 3.2.20) For the system of equations of Example 3.2.20 we find:

I0(x0) = {a}
I0(x1) = {b}
I0(x2) = {c}
I0(x3) = {c}

It follows:

I(x0) = I0(x0) = {a}
I(x1) = I0(x0) ∪ I0(x1) ∪ I0(x2) ∪ I0(x3) = {a, b, c}
I(x2) = I0(x0) ∪ I0(x1) ∪ I0(x2) ∪ I0(x3) = {a, b, c}
I(x3) = I0(x0) ∪ I0(x1) ∪ I0(x2) ∪ I0(x3) = {a, b, c}
⊓⊔
This observation suggests the following method to compute the least solution I of the system of equations.
First, the strongly connected components of the variable-dependency graph are computed. This
needs a linear number of steps. Then an iteration over the list of strongly connected components is
performed.
One starts with a strongly connected component Q that has no entering edges from other
strongly connected components. The values of all variables xj ∈ Q are
I(xj) = ⊔ { I0(xi) | xi ∈ Q }
and are computed by:
D t ← ⊥;
forall (xi ∈ Q)
    t ← t ⊔ I0(xi);
forall (xi ∈ Q)
    I(xi) ← t;
The run time of both loops is proportional to the number of elements in the strongly connected component
Q. The values of the variables in Q are then propagated along the outgoing edges. Let EQ be the set of
edges (xi, xj) of the variable-dependency graph with xi ∈ Q and xj ∉ Q, that is, the edges leaving Q.
For these edges the following propagation is performed:
forall ((xi, xj) ∈ EQ)
    I0(xj) ← I0(xj) ⊔ I(xi);
The number of steps for the propagation is proportional to the number of edges in EQ.
The strongly connected component Q, together with the set EQ of outgoing edges, is removed from
the graph, and one continues with the next strongly connected component without incoming edges. This
is repeated until no strongly connected component remains. Altogether, we obtain a method that
performs a linear number of ⊔ operations on the complete lattice D.
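The following Python sketch implements this method. It is not from the text: the graph representation (successor lists), the use of Tarjan's algorithm for the strongly connected components, and all names are assumptions of the sketch. Tarjan's algorithm emits the components in reverse topological order, which gives exactly the processing order described above.

import sys

def solve_pure_union(succ, I0):
    # succ[x]: all y with an edge (x, y), i.e. x occurs in the right
    # side of the equation for y; I0[x]: union of the constants in the
    # equation for x. Returns the least solution I as a dict of sets.
    sys.setrecursionlimit(10_000)
    index, low, on_stack, stack, sccs = {}, {}, set(), [], []
    counter = [0]

    def tarjan(v):                       # Tarjan's SCC algorithm
        index[v] = low[v] = counter[0]; counter[0] += 1
        stack.append(v); on_stack.add(v)
        for w in succ.get(v, ()):
            if w not in index:
                tarjan(w); low[v] = min(low[v], low[w])
            elif w in on_stack:
                low[v] = min(low[v], index[w])
        if low[v] == index[v]:           # v is the root of an SCC
            comp = []
            while True:
                w = stack.pop(); on_stack.discard(w); comp.append(w)
                if w == v:
                    break
            sccs.append(comp)            # emitted in reverse topological order

    for v in I0:
        if v not in index:
            tarjan(v)

    I = {x: set(s) for x, s in I0.items()}
    for comp in reversed(sccs):          # topological order: sources first
        t = set()
        for x in comp:                   # join the values inside the SCC
            t |= I[x]
        for x in comp:
            I[x] = set(t)
        for x in comp:                   # propagate along the leaving edges
            for y in succ.get(x, ()):
                if y not in comp:
                    I[y] |= t
    return I

# One edge structure consistent with Fig. 3.6 (an assumption):
succ = {"x0": ["x1"], "x1": ["x2"], "x2": ["x3"], "x3": ["x1"]}
I0 = {"x0": {"a"}, "x1": {"b"}, "x2": {"c"}, "x3": {"c"}}
print(solve_pure_union(succ, I0))   # x1, x2, x3 all map to {a, b, c}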
Example 3.2.22 (Continuation of Example 3.2.20) The dependency graph of the system of equations
of Example 3.2.20 has the strongly connected components Q0 = {x0} and Q1 = {x1, x2, x3}.
For Q0 one obtains the value I0(x0) = {a}. After removal of Q0 and the edge (x0, x1), the new
assignment is:
I0(x1) = {a, b}
I0(x2) = {c}
I0(x3) = {c}
The value of all variables in the strongly connected component Q1 arises as I0(x1) ∪ I0(x2) ∪ I0(x3) =
{a, b, c}. ⊓⊔
The way different parsers work can best be made intuitively clear by observing how they construct
the parse tree to an input word. Top-down parsers start the construction of the parse tree at the root.
In the initial situation, the constructed fragment of the parse tree consists of the root, which is labeled
by the start symbol of the context-free grammar; nothing of the input word w is consumed. In this
situation, one alternative for the start symbol is selected for expansion. The symbols of the right side
of this alternative are attached under the root extending the upper fragment of the parse tree. The next
nonterminal to be considered is the one on the leftmost position. The selection of one alternative for this
nonterminal and the attachment of the right side below the node labeled with the left side is repeated
until the parse tree is complete. By attaching the symbols of the right side of a production, terminal symbols
can appear in the leaf word of a tree fragment. If there is no nonterminal to the left of a terminal symbol
in the leaf word, the top-down parser compares these terminal symbols with the next symbols of the input. If they
agree, the parser consumes these symbols from the input. Otherwise, the parser reports a syntax
error.
Thus, a top-down analysis performs the following two types of actions:
• Selection of an alternative for the current leftmost nonterminal and attachment of the right side of
this alternative to the current tree fragment.
• Comparison of the terminal symbols to the left of the leftmost nonterminal with the remaining input.
Figures 3.7, 3.8, 3.9 and 3.10 show some parse tree fragments for the arithmetic expression Id + Id ∗
Id according to grammar G2 . The selection of alternatives for the nonterminals to be expanded was
cleverly done in such a way as to lead to a successful termination of the analysis.
S → E
E → T E′
E′ → + E | ε
T → F T′
T′ → ∗ T | ε
F → (E) | Id
Fig. 3.7. The first parse-tree fragments of a top-down analysis of the word Id + Id ∗ Id according to grammar G2.
They are constructed without reading any symbol from the input.
Fig. 3.8. The parse-tree fragments after reading the symbol Id and before the terminal symbol + is attached to
the fragment.
Fig. 3.9. The first and the last parse-tree fragment after reading the symbol + and before the second symbol Id appears
in the parse tree.
Fig. 3.10. The parse tree after the reduction for the second occurrence of T′ and the parse tree after reading the
symbol ∗, together with the remaining input.
To derive a deterministic automaton from the item pushdown-automaton PG we equip the automaton
with a bounded lookahead into the remaining input. We fix a natural number k ≥ 1 and allow the item
pushdown-automaton to inspect the first k symbols of the remaining input at each expansion transition to aid
in its decision. If this lookahead of depth k always suffices to select the right alternative, we call the
grammar an LL(k) grammar.
Let us consider a configuration that the item pushdown-automaton PG has reached from an initial
configuration:
([S′ → .S], uv) ⊢*PG (ρ [X → β.Y γ], v)
Because of invariant (I) of Section ??, hist(ρ) β ⇒* u holds.
Let ρ = [X1 → β1.X2γ1] … [Xn → βn.Xn+1γn] be a sequence of items. We call the sequence
fut(ρ) = γn … γ1
the future of ρ. Let δ = fut(ρ). So far, the leftmost derivation S′ ⇒*lm uY γδ has been found. If this
derivation can be extended to a derivation of the terminal word uv, that is, S′ ⇒*lm uY γδ ⇒*lm uv, then
in an LL(k) grammar the alternative to be selected for Y only depends on u, Y and v|k.
Let k ≥ 1 be a natural number. The reduced context-free grammar G is an LL(k) grammar if for
every two leftmost derivations
S ⇒*lm uY α ⇒lm uβα ⇒*lm ux   and   S ⇒*lm uY α ⇒lm uγα ⇒*lm uy
with x|k = y|k it follows that β = γ.
For an LL(k) grammar, the selection of the alternative for the next nonterminal Y in general depends
not only on Y and the next k symbols, but also on the already consumed prefix u of the input. If
this selection does not depend on the already consumed left context u, we call the grammar
strong LL(k).
Example 3.3.1 Let G1 be the context-free grammar with the productions:
The grammar G1 is an LL(1) grammar. If ⟨stat⟩ occurs as leftmost nonterminal in a sentential form,
then the next input symbol determines which alternative must be applied. More precisely, this means that
for two derivations of the form
⟨stat⟩ ⇒*lm w ⟨stat⟩ α ⇒lm w β α ⇒*lm w x
⟨stat⟩ ⇒*lm w ⟨stat⟩ α ⇒lm w γ α ⇒*lm w y
it follows from x|1 = y|1 that β = γ. If, for instance, x|1 = y|1 = if, then β = γ =
if (Id) ⟨stat⟩ else ⟨stat⟩. ⊓⊔
Definition 3.3.1 (simple LL(1)-grammar)
Let G be a context-free grammar without ε-productions. If for each nonterminal N , each of its alterna-
tives begins with a different terminal symbol, then G is called a simple LL(1) grammar. ⊓⊔
This is a first, easily checked criterion for a special case. The grammar G1 of Example 3.3.1 is a simple
LL(1) grammar.
Example 3.3.2 We now add the following productions to the grammar G1 of Example 3.3.1:
⟨stat⟩ → Id = Id; | Id : ⟨stat⟩ | Id(Id);
The extended grammar is no longer a simple LL(1) grammar, since now three alternatives for ⟨stat⟩
begin with the same terminal symbol Id. In any two leftmost derivations
⟨stat⟩ ⇒*lm w ⟨stat⟩ α ⇒lm w Id = Id; α ⇒*lm w x      (β = Id = Id;)
⟨stat⟩ ⇒*lm w ⟨stat⟩ α ⇒lm w Id : ⟨stat⟩ α ⇒*lm w y   (γ = Id : ⟨stat⟩)
⟨stat⟩ ⇒*lm w ⟨stat⟩ α ⇒lm w Id(Id); α ⇒*lm w z       (δ = Id(Id);)
however, the prefixes of length 2 of x, y, and z are pairwise different. And these are indeed the only
critical cases. ⊓⊔
Consider now the grammar G4 with the productions
S → A | B
A → aAb | 0
B → aBbb | 1
Then
L(G4) = {aⁿ0bⁿ | n ≥ 0} ∪ {aⁿ1b²ⁿ | n ≥ 0}
and G4 is no LL(k) grammar for any k ≥ 1. To see this, we consider the two leftmost derivations
S ⇒lm A ⇒*lm aᵏ0bᵏ
S ⇒lm B ⇒*lm aᵏ1b²ᵏ
G4 is not an LL(k) grammar for any k ≥ 1, since for each k ≥ 1 we have (aᵏ0bᵏ)|k = aᵏ = (aᵏ1b²ᵏ)|k, but the
right sides A and B for S are different. In this case, one can even show that for no k ≥ 1 does there exist an
LL(k) grammar for the language L(G4). ⊓⊔
Theorem 3.3.1 The reduced context-free grammar G = (VN, VT, P, S) is an LL(k) grammar if and
only if for each two different productions A → β and A → γ of G:
firstk(βα) ∩ firstk(γα) = ∅  for all α with S ⇒*lm wAα
Proof. To prove the direction '⇒', assume that G is an LL(k) grammar, but that there exists an
x ∈ firstk(βα) ∩ firstk(γα). According to the definition of firstk, and because G is reduced, there exist
derivations
S ⇒*lm uAα ⇒lm uβα ⇒*lm uxy
S ⇒*lm uAα ⇒lm uγα ⇒*lm uxz
where in the case |x| < k it must hold that y = z = ε. Then β ≠ γ implies that G cannot be an LL(k)
grammar—a contradiction to our assumption.
To prove the other direction '⇐', assume that G is not an LL(k) grammar. Then there exist
two leftmost derivations
S ⇒*lm uAα ⇒lm uβα ⇒*lm ux
S ⇒*lm uAα ⇒lm uγα ⇒*lm uy
with x|k = y|k, where A → β and A → γ are different productions. Then the word x|k = y|k is contained
in firstk(βα) ∩ firstk(γα)—a contradiction to the claim of the theorem. ⊓⊔
Theorem 3.3.1 states that in an LL(k) grammar the application of two different productions to a left-sentential
form always leads to different k-prefixes of the remaining input. Theorem 3.3.1 allows us to
derive useful criteria for membership in certain subclasses of LL(k) grammars. The first concerns the
case k = 1.
The set first1(βα) ∩ first1(γα), for all left-sentential forms wAα and any two different alternatives
A → β and A → γ, can be simplified to first1(β) ∩ first1(γ) if neither β nor γ produces the empty word
ε. This is the case if no nonterminal of G is ε-productive.
Theorem 3.3.2 Let G be an ε-free context-free grammar, that is, without productions of the form
X → ε. Then G is an LL(1) grammar if and only if for each nonterminal X with the alternatives
X → α1 | . . . | αn the sets first1 (α1 ), . . . , first1 (αn ) are pairwise disjoint.
In practice, it would be too hard a restriction to forbid ε-productions. Consider the case that one of the
two right sides β or γ produces the empty word. If both β and γ produce the empty word,
G cannot be an LL(1) grammar. Let us, therefore, assume that β ⇒* ε, but that ε cannot be derived
from γ. However, it then holds for all left-sentential forms uAα, u′Aα′ that
first1(βα) ∩ first1(γα′) = first1(βα) ∩ (first1(γ) ⊙1 first1(α′))
                         = first1(βα) ∩ first1(γ)
                         = first1(βα) ∩ first1(γα)
                         = ∅
Theorem 3.3.3 A reduced context-free grammar G is an LL(1) grammar if and only if for each two
different productions A → β and A → γ:
(first1(β) ⊙1 follow1(A)) ∩ (first1(γ) ⊙1 follow1(A)) = ∅  ⊓⊔
The characterization of Theorem 3.3.3 is easily checked, in contrast to the characterization of Theorem
3.3.1. An even more easily checkable formulation is obtained by exploiting properties of 1-concatenation.
Corollary 3.3.3.1 A reduced context-free grammar G is an LL(1) grammar if and only if for all alternatives
A → α1 | … | αn:
1. first1(α1), …, first1(αn) are pairwise disjoint; in particular, at most one of these sets contains ε;
2. ε ∈ first1(αi) implies first1(αj) ∩ follow1(A) = ∅ for all 1 ≤ j ≤ n, j ≠ i. ⊓⊔
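As an illustration of how easily the criterion of Corollary 3.3.3.1 can be checked, consider the following Python sketch. It is not from the text; it assumes the first1-sets of whole right sides and the follow1-sets of the nonterminals have already been computed, with the empty word represented by "".

def is_ll1(alternatives, first1, follow1):
    # alternatives: dict A -> list of right sides alpha (hashable);
    # first1[alpha]: first1-set of alpha; follow1[A]: follow1-set of A.
    for A, alts in alternatives.items():
        seen = set()
        for alpha in alts:
            f = first1[alpha]
            if f & seen:                  # condition 1: pairwise disjoint
                return False
            seen |= f
            if "" in f and any(           # condition 2 for the eps-alternative
                    first1[beta] & follow1[A]
                    for beta in alts if beta is not alpha):
                return False
    return True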
We extend the property of Theorem 3.3.3 to arbitrary lookahead lengths k ≥ 1.
A reduced context-free grammar G = (VN, VT, P, S) is called a strong LL(k) grammar if for each
two different productions A → β and A → γ of a nonterminal A it always holds that
(firstk(β) ⊙k followk(A)) ∩ (firstk(γ) ⊙k followk(A)) = ∅
According to this definition and Theorem 3.3.3, every LL(1) grammar is a strong LL(1) grammar.
However, an LL(k) grammar for k > 1 is not automatically a strong LL(k) grammar. The reason is
that the set followk(A) contains the follow words of all left-sentential forms with occurrences of A. In
contrast, the LL(k) condition only refers to the follow words of one left-sentential form.
Example 3.3.5 Let G be the context-free grammar with the productions
S → aAaa | bAba
A → b | ε
We check:
Case 1: The derivation starts with S ⇒ aAaa. It holds that first2(baa) ∩ first2(aa) = ∅.
Case 2: The derivation starts with S ⇒ bAba. It holds that first2(bba) ∩ first2(ba) = ∅.
Hence G is an LL(2) grammar according to Theorem 3.3.1. However, the grammar G is not a strong
LL(2) grammar, because
(first2(b) ⊙2 follow2(A)) ∩ (first2(ε) ⊙2 follow2(A)) ⊇ {ba} ≠ ∅
In the example, follow2(A) is too undifferentiated because it collects terminal follow words that may
occur in different sentential forms. ⊓⊔
Deterministic parsers that construct the parse tree for the input top-down cannot deal with left-recursive
nonterminals. A nonterminal A of a context-free grammar G is called left-recursive if there exists a
derivation A ⇒⁺ Aβ.
Theorem 3.3.4 Let G be a reduced context-free grammar. G is not an LL(k) grammar for any k ≥ 1
if at least one nonterminal of the grammar G is left-recursive.
Proof. Let X be a left-recursive nonterminal of grammar G. For simplicity, we assume that G has a
production X → Xβ. Since G is reduced, there must also exist another production X → α. If X occurs in a
left-sentential form, that is, S ⇒*lm wXγ, the alternative X → Xβ can be applied arbitrarily often. For
each n ≥ 1 there exists a leftmost derivation
S ⇒*lm wXγ ⇒ⁿlm wXβⁿγ
Let us assume that grammar G were an LL(k) grammar. Theorem 3.3.1 then implies
firstk(Xβⁿ⁺¹γ) ∩ firstk(αβⁿγ) = ∅
Due to X → α we have
firstk(αβⁿ⁺¹γ) ⊆ firstk(Xβⁿ⁺¹γ),
hence also
firstk(αβⁿ⁺¹γ) ∩ firstk(αβⁿγ) = ∅.
If β ⇒* ε holds, we immediately obtain a contradiction. Otherwise, we choose n ≥ k and again obtain
a contradiction. Hence, G cannot be an LL(k) grammar. ⊓⊔
We conclude that no generator of LL(k) parsers can cope with left-recursive grammars. However,
each grammar with left recursion can be transformed into a grammar without left recursion that defines
the same language. Let us assume for simplicity that the grammar G has no ε-productions (see
Exercise ??) and no recursive chain productions, that is, there is no nonterminal A with A ⇒⁺G A. Let
G = (VN, VT, P, S). We construct for G a context-free grammar G′ = (VN′, VT, P′, S) with the same
set VT of terminal symbols, the same start symbol S, a set VN′ of nonterminal symbols
VN′ = VN ∪ {⟨A, B⟩ | A, B ∈ VN},
and a set P′ consisting of the following kinds of productions:
A → aβ⟨A, B⟩      for each production B → aβ ∈ P with a ∈ VT,
⟨A, B⟩ → β⟨A, C⟩  for each production C → Bβ ∈ P,
⟨A, A⟩ → ε        for each A ∈ VN.
Example 3.3.6 Consider the grammar G0 for arithmetic expressions with the productions
E → E + T | T
T → T ∗ F | F
F → (E) | Id
Grammar G0 has three nonterminals and six productions; the transformed grammar G1 needs nine nonterminals
and 15 productions.
The parse tree for Id + Id according to grammar G0 is shown in Fig. 3.11 (a), the one according
to grammar G1 in Fig. 3.11 (b). The latter has a distinctly different structure. Intuitively, the transformed
grammar directly generates the first possible terminal symbol and then collects, in a backward fashion, the
remainders of the right sides that follow the nonterminal on the left side. The nonterminal ⟨A, B⟩
stands for the job to return from B back to A. ⊓⊔
We convince ourselves that the grammar G′ constructed from grammar G has the following properties:
• Grammar G′ has no left-recursive nonterminals.
• For each leftmost derivation A ⇒*G Bγ ⇒G aβγ of G there exists a corresponding leftmost derivation
of A in G′ in which, after the first step, only nonterminals of the form ⟨X, Y⟩ are replaced.
The last property implies, in particular, that the grammars G and G′ are equivalent, i.e., that L(G) = L(G′)
holds.
In some cases, the grammar obtained by removing left recursion is an LL(k) grammar. This is the
case for grammar G0 of Example 3.3.6. We have already seen, however, that the transformation to remove left
recursion also has disadvantages. Let n be the number of nonterminals: the number of nonterminals
as well as the number of productions can increase by a factor of n + 1. For large grammars, it might
not be advisable to perform this transformation manually. A parser generator, however, could do the
transformation automatically and also generate a program that automatically converts parse trees
of the transformed grammar back into parse trees of the original grammar (see Exercise ?? of the next
section). The user would not even see the grammar transformation.
Fig. 3.11. Parse trees for Id + Id according to grammar G0 of Example 3.3.6 and according to the grammar after
removal of left recursion.
Example 3.3.6 illustrates how much the parse tree of a word according to the transformed grammar
can differ from the one according to the original grammar. The operator sits somewhat isolated
between its remotely located operands. An alternative to the elimination of left recursion are grammars
with regular right sides, which we will treat later.
Fig. 3.12. The structure of a parser for strong LL(k) grammars: input tape (read prefix w, lookahead u, remaining
input v), control with parser table M, stack, and output tape.
Fig. 3.12 shows the structure of a parser for strong LL(k) grammars. The prefix w of the input has
already been read. The remaining input starts with a prefix u of length k. The stack contains a sequence of
items of the context-free grammar. The topmost item, the current state Z, determines whether
• to read the next input symbol,
• to test for the successful end of the analysis, or
• to expand the current nonterminal.
Upon expansion, the parser uses the parser table to select the correct alternative for the nonterminal.
The parser table M is a two-dimensional array whose rows are indexed by the nonterminals and whose
columns are indexed by words of length at most k. It represents a selection function
M : VN × (VT ∪ {#})≤k → (VT ∪ VN)* ∪ {error}
which associates with each nonterminal the one of its alternatives that should be applied for the
given lookahead; it signals an error if no alternative exists for the combination of current state
and lookahead. Let [X → β.Y γ] be the topmost item on the stack and u the prefix of length k of the
remaining input. If M[Y, u] = (Y → α), then [Y → .α] becomes the new topmost stack symbol and the
production Y → α is written to the output tape.
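A minimal driver for such a parser, for the case k = 1, could look as follows. This Python sketch is not the parser of the text: it keeps dotted items (left side, right side, dot position) on the stack, represents the table M as a dictionary with absent entries meaning error, assumes terminal and nonterminal names are disjoint, and expects the input token list to end with "#".

def ll1_parse(M, start, tokens):
    # M[(Y, a)]: the right side chosen for nonterminal Y under
    # lookahead a; start: the production S' -> (S,); tokens ends in "#".
    out, i = [start], 0
    stack = [(start[0], start[1], 0)]      # the item [S' -> .S]
    while stack:
        lhs, rhs, d = stack[-1]
        if d == len(rhs):                  # complete item: pop it and
            stack.pop()                    # advance the item below
            if stack:
                l2, r2, d2 = stack.pop()
                stack.append((l2, r2, d2 + 1))
        elif rhs[d] == tokens[i]:          # terminal: read it
            i += 1
            stack[-1] = (lhs, rhs, d + 1)
        elif (rhs[d], tokens[i]) in M:     # nonterminal: expand
            alpha = M[(rhs[d], tokens[i])]
            out.append((rhs[d], alpha))
            stack.append((rhs[d], alpha, 0))
        else:
            raise SyntaxError("unexpected " + repr(tokens[i]))
    if tokens[i] != "#":
        raise SyntaxError("input not fully consumed")
    return out                             # the leftmost derivation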
The table entries in M for a nonterminal Y are determined in the following way: let Y → α1 |
… | αr be the alternatives for Y. For a strong LL(k) grammar, the sets firstk(αi) ⊙k followk(Y) are
disjoint. For each u ∈ firstk(αi) ⊙k followk(Y), the entry M[Y, u] is therefore set to the alternative αi.
All other entries M[Y, u] are set to error. The entry M[Y, u] = error means that the current nonterminal and
the prefix of the remaining input do not fit together, that is, a syntax error has been found. An
error-diagnosis and error-handling routine is started, which will attempt to continue the analysis. Such
approaches will be described in Section ??.
For k = 1, the construction of the parser table is particularly simple: because of Corollary 3.3.3.1, it
works without k-concatenation. Instead, it suffices to test u for membership in one of the sets first1(αi)
and possibly in follow1(Y).
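Correspondingly, the LL(1) table can be computed directly from the first1- and follow1-sets, without 1-concatenation. The following Python sketch uses the same assumed representations as the driver above; the empty word is written "".

def build_ll1_table(alternatives, first1, follow1):
    # M[(Y, a)] = selected right side; absent entries mean error.
    M = {}
    for Y, alts in alternatives.items():
        for alpha in alts:
            f = first1[alpha]
            look = (f - {""}) | (follow1[Y] if "" in f else set())
            for a in look:
                # two alternatives for the same lookahead would violate
                # Corollary 3.3.3.1
                assert (Y, a) not in M, "grammar is not LL(1)"
                M[(Y, a)] = alpha
    return M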
Example 3.3.7 Table 3.3 shows the LL(1) parser table for the grammar of Example 3.3.6 after the
removal of left recursion. Table 3.4 describes the run of the associated parser for input Id ∗ Id#. ⊓⊔
         (            )        +            ∗            Id          #
S        E            error    error        error        E           error
E        (E) ⟨E,F⟩    error    error        error        Id ⟨E,F⟩    error
T        (E) ⟨T,F⟩    error    error        error        Id ⟨T,F⟩    error
F        (E) ⟨F,F⟩    error    error        error        Id ⟨F,F⟩    error
⟨E,F⟩    error        ⟨E,T⟩    ⟨E,T⟩        ⟨E,T⟩        error       ⟨E,T⟩
⟨E,T⟩    error        ⟨E,E⟩    ⟨E,E⟩        ∗ F ⟨E,T⟩    error       ⟨E,E⟩
⟨E,E⟩    error        ε        + T ⟨E,E⟩    error        error       ε
⟨T,F⟩    error        ⟨T,T⟩    ⟨T,T⟩        ⟨T,T⟩        error       ⟨T,T⟩
⟨T,T⟩    error        ε        ε            ∗ F ⟨T,T⟩    error       ε
⟨F,F⟩    error        ε        ε            ε            error       ε
Table 3.3. LL(1) parser table for the grammar of Example 3.3.6 after the removal of left recursion.
Stack                                                                                        Input
[S → .E]                                                                                     Id ∗ Id#
[S → .E][E → .Id ⟨E,F⟩]                                                                      Id ∗ Id#
[S → .E][E → Id . ⟨E,F⟩]                                                                     ∗Id#
[S → .E][E → Id . ⟨E,F⟩][⟨E,F⟩ → . ⟨E,T⟩]                                                    ∗Id#
[S → .E][E → Id . ⟨E,F⟩][⟨E,F⟩ → . ⟨E,T⟩][⟨E,T⟩ → . ∗ F ⟨E,T⟩]                               ∗Id#
[S → .E][E → Id . ⟨E,F⟩][⟨E,F⟩ → . ⟨E,T⟩][⟨E,T⟩ → ∗ .F ⟨E,T⟩]                                Id#
[S → .E][E → Id . ⟨E,F⟩][⟨E,F⟩ → . ⟨E,T⟩][⟨E,T⟩ → ∗ .F ⟨E,T⟩][F → .Id ⟨F,F⟩]                 Id#
[S → .E][E → Id . ⟨E,F⟩][⟨E,F⟩ → . ⟨E,T⟩][⟨E,T⟩ → ∗ .F ⟨E,T⟩][F → Id. ⟨F,F⟩]                 #
[S → .E][E → Id . ⟨E,F⟩][⟨E,F⟩ → . ⟨E,T⟩][⟨E,T⟩ → ∗ .F ⟨E,T⟩][F → Id. ⟨F,F⟩][⟨F,F⟩ → .]      #
[S → .E][E → Id . ⟨E,F⟩][⟨E,F⟩ → . ⟨E,T⟩][⟨E,T⟩ → ∗ .F ⟨E,T⟩][F → Id ⟨F,F⟩ .]                #
[S → .E][E → Id . ⟨E,F⟩][⟨E,F⟩ → . ⟨E,T⟩][⟨E,T⟩ → ∗ F. ⟨E,T⟩]                                #
[S → .E][E → Id . ⟨E,F⟩][⟨E,F⟩ → . ⟨E,T⟩][⟨E,T⟩ → ∗ F. ⟨E,T⟩][⟨E,T⟩ → . ⟨E,E⟩]               #
[S → .E][E → Id . ⟨E,F⟩][⟨E,F⟩ → . ⟨E,T⟩][⟨E,T⟩ → ∗ F. ⟨E,T⟩][⟨E,T⟩ → . ⟨E,E⟩][⟨E,E⟩ → .]    #
[S → .E][E → Id . ⟨E,F⟩][⟨E,F⟩ → . ⟨E,T⟩][⟨E,T⟩ → ∗ F. ⟨E,T⟩][⟨E,T⟩ → ⟨E,E⟩ .]               #
[S → .E][E → Id . ⟨E,F⟩][⟨E,F⟩ → . ⟨E,T⟩][⟨E,T⟩ → ∗ F ⟨E,T⟩.]                                #
[S → .E][E → Id . ⟨E,F⟩][⟨E,F⟩ → ⟨E,T⟩ .]                                                    #
[S → .E][E → Id ⟨E,F⟩ .]                                                                     #
[S → E.]                                                                                     #
Output:
(S → E) (E → Id ⟨E,F⟩) (⟨E,F⟩ → ⟨E,T⟩) (⟨E,T⟩ → ∗ F ⟨E,T⟩) (F → Id ⟨F,F⟩)
(⟨F,F⟩ → ε) (⟨E,T⟩ → ⟨E,E⟩) (⟨E,E⟩ → ε)
Table 3.4. The run of the LL(1) parser of Table 3.3 for the input Id ∗ Id#.
Our construction of LL(k) parsers is only applicable to strong LL(k) grammars. This restriction,
however, is not really severe:
• The case occurring most often in practice is the case k = 1, and each LL(1) grammar is a strong
LL(1) grammar.
• If a lookahead of k > 1 is needed, and the grammar is LL(k) but not strong LL(k), a general
transformation can be applied that converts the grammar into a strong LL(k) grammar accepting
the same language (see Exercise 7).
We therefore do not describe a parsing method for arbitrary LL(k) grammars.
Left-recursive nonterminals destroy the LL property of context-free grammars. Left recursion is mostly
used to describe sequences and lists of syntactic objects, such as parameter lists and sequences of operands
connected by an associative operator. These can also be described by regular expressions. Thus, we want
to offer greater descriptive comfort by admitting regular expressions on the right sides of productions.
A right-regular context-free grammar is a tuple G = (VN, VT, p, S), where VN, VT, and S are, as usual, the
set of nonterminals, the set of terminals, and the start symbol. p : VN → RA is now a function from the
set of nonterminals into the set RA of regular expressions over VN ∪ VT. A pair (X, r) with p(X) = r
is written as X → r.
Example 3.3.8
A right-regular context-free grammar Ge for arithmetic expressions is given by the following function p
('{' and '}' are used as meta-characters to avoid the conflict with the terminal symbols '(' and ')'):
S → E
E → T {{+ | −} T}∗
T → F {{∗ | /} F}∗
F → (E) | id ⊓⊔
Let ⇒*R,lm be the reflexive, transitive closure of the regular leftmost derivation relation ⇒R,lm. The
language defined by G is
L(G) = {w ∈ VT* | S ⇒*R,lm w} ⊓⊔
Example 3.3.9
A regular leftmost derivation for the word id + id ∗ id of grammar Ge of Example 3.3.8 is:
S ⇒R,lm E
  ⇒R,lm T{{+|−}T}∗
  ⇒R,lm F{{∗|/}F}∗{{+|−}T}∗
  ⇒R,lm {(E)|id}{{∗|/}F}∗{{+|−}T}∗
  ⇒R,lm id{{∗|/}F}∗{{+|−}T}∗
  ⇒R,lm id{{+|−}T}∗
  ⇒R,lm id{+|−}T{{+|−}T}∗
  ⇒R,lm id + T{{+|−}T}∗
  ⇒R,lm id + F{{∗|/}F}∗{{+|−}T}∗
  ⇒R,lm id + {(E)|id}{{∗|/}F}∗{{+|−}T}∗
  ⇒R,lm id + id{{∗|/}F}∗{{+|−}T}∗
  ⇒R,lm id + id{∗|/}F{{∗|/}F}∗{{+|−}T}∗
  ⇒R,lm id + id ∗ F{{∗|/}F}∗{{+|−}T}∗
  ⇒R,lm id + id ∗ {(E)|id}{{∗|/}F}∗{{+|−}T}∗
  ⇒R,lm id + id ∗ id{{∗|/}F}∗{{+|−}T}∗
  ⇒R,lm id + id ∗ id{{+|−}T}∗
  ⇒R,lm id + id ∗ id ⊓⊔
Our goal is to develop an RLL parser, that is, a deterministic top-down parser for right-regular
context-free grammars. This is the method of choice for implementing a parser by hand, as long as no
powerful and comfortable tools offer an attractive alternative.
The RLL parser will produce a regular leftmost derivation for any correct input word. Looking at
the definition above makes clear that the case of expansion (a)—a nonterminal is replaced by its only
right side—is no longer critical. Instead, the cases (b), (c), and (d) need to be made deterministic.
We will call a parser for a right-regular context-free grammar an RLL(1) parser if it can decide,
• for each regular left-sentential form w(r1 | … | rn)β, on the right alternative, and
• for each regular left-sentential form w(r)∗β, on the continuation or the termination of the iteration,
based on the next symbol of the remaining input. We transfer some notions to the case of right-regular
context-free grammars.
Definition 3.3.3 (regular subexpression)
Each ri, 1 ≤ i ≤ n, is a direct regular subexpression of (r1 | … | rn) and of (r1 … rn); r is a direct regular
subexpression of (r)∗. r1 is a regular subexpression of r2 if r1 = r2, if r1 is a direct
regular subexpression of r2, or if r1 is a regular subexpression of a direct regular subexpression of r2. ⊓⊔
Definition 3.3.4 (extended context-free item)
A tuple (X, α, β, γ) is an extended context-free item of a right-regular context-free grammar G =
(VN, VT, p, S) if X ∈ VN, α, β, γ ∈ (VN ∪ VT ∪ {(, ), ∗, |, ε})∗, p(X) = βαγ, and α is a regular
subexpression of βαγ. This item is written as [X → β.αγ]. ⊓⊔
Realizing an RLL(1) parser for a right-regular context-free grammar again uses first1- and follow1-sets,
this time of regular subexpressions of the right sides of productions.
The computation of first1- and follow1-sets for right-regular context-free grammars can again be represented
as a pure union problem and can, therefore, be solved efficiently. As in the
conventional case, this starts with the computation of ε-productivity. The equations for ε-productivity
can be defined over the structure of regular expressions. The ε-productivity of a right side transfers to
the nonterminal of its left side.
After ε-productivity is computed, the ε-free first-function can be computed. It is specified by the
following equations:
eff(ε) = ∅
eff(a) = {a}
eff(r∗) = eff(r)
eff(X) = eff(r), if p(X) = r                                          (eff)
eff((r1 | … | rn)) = ⋃ 1≤i≤n eff(ri)
eff((r1 … rn)) = ⋃ { eff(rj) | 1 ≤ j ≤ n, eps(ri) for all 1 ≤ i < j }
Both ε-productivity and the ε-free first-function can thus be defined recursively over the structure of regular
expressions. The first1-set of a regular expression is independent of the context in which it occurs.
This is different for the follow1-set; two different occurrences of a regular (sub-)expression have,
in general, different follow1-sets. In realizing RLL(1) parsers, we are interested in the follow1-sets of
occurrences of regular (sub-)expressions. A particular occurrence of a regular expression in a right side
corresponds to exactly one extended context-free item in which the dot is positioned in front of this regular
expression. The following equations for follow1 assume that concatenations and lists of alternatives are
surrounded on the outside by parentheses, but have no superfluous parentheses inside.
(1) follow1([S′ → .S]) = {#}   (the eof symbol '#' follows after each input word)
(2) follow1([X → ⋯ (r1 | ⋯ | .ri | ⋯ | rn) ⋯]) = follow1([X → ⋯ .(r1 | ⋯ | ri | ⋯ | rn) ⋯])   for 1 ≤ i ≤ n
(3) follow1([X → ⋯ (⋯ .ri ri+1 ⋯) ⋯]) =
    eff(ri+1) ∪ { follow1([X → ⋯ (⋯ ri .ri+1 ⋯) ⋯]) if eps(ri+1) = true, ∅ otherwise }
(4) follow1([X → ⋯ (r1 ⋯ rn−1 .rn) ⋯]) = follow1([X → ⋯ .(r1 ⋯ rn−1 rn) ⋯])   (follow1)
(5) follow1([X → ⋯ (.r)∗ ⋯]) = eff(r) ∪ follow1([X → ⋯ .(r)∗ ⋯])
(6) follow1([X → .r]) = ⋃ follow1([Y → ⋯ .X ⋯]), the union taken over all occurrences of X in right sides
To compute the solutions for eff and follow1 as efficiently as possible, that is, in linear time, these
equation systems need to be brought into the form
f(x) = g(x) ∪ ⋃ { f(y) | x R y }
The RLL(1) parser is a deterministic pushdown automaton. Its parser table M represents a selection
function m : ItG × VT → ItG ∪ {error}. The parser table is consulted whenever a decision has to be taken
by a lookahead into the remaining input. Therefore, M only has rows for
• items in which an alternative needs to be chosen, and
• items in which an iteration needs to be processed;
i.e., the function m is defined for items of the form [X → ⋯ .(r1 | ⋯ | rn) ⋯] and of the form
[X → ⋯ .(r)∗ ⋯].
The RLL(1) parser is started in the initial configuration (# [S′ → .S], w#). The current item, the topmost
on the stack, determines whether the parser table needs to be consulted. If so,
M[ρ, a]—if not error—indicates the next item for the current item ρ and the current input symbol a. If
M[ρ, a] = error, a syntax error has been discovered. In the configuration (# [S′ → S.], #), the parser
accepts the input word.
The other transitions are:
δ([X → ⋯ .a ⋯], a) = [X → ⋯ a. ⋯]
δ([X → ⋯ .Y ⋯], ε) = [X → ⋯ .Y ⋯][Y → .p(Y)]
δ([X → ⋯ .Y ⋯][Y → p(Y).], ε) = [X → ⋯ Y. ⋯]
In addition, there would be some transitions, for example from [X → ⋯ (⋯ |ri.| ⋯) ⋯] to [X →
⋯ (⋯ |ri| ⋯). ⋯], which neither read symbols, nor expand nonterminals, nor reduce to nonterminals.
They can be avoided by modifying the transition function in the following way: if a transition of δ leads
to an item of the form [X → ⋯ (⋯ |ri.| ⋯) ⋯], it is made to lead to the context-free item
[X → ⋯ (⋯ |ri| ⋯). ⋯] instead; if it leads to [X → ⋯ (r.)∗ ⋯], it is made to lead to [X → ⋯ .(r)∗ ⋯];
and from [X → ⋯ .(r1 ⋯ rn) ⋯] it leads directly to [X → ⋯ (.r1 ⋯ rn) ⋯].
We now present the algorithm for the generation of RLL(1) parser tables.
Algorithm RLL(1)-GEN
Input: an RLL(1) grammar G, first1 and follow1 for G.
Output: the parser table M of an RLL(1) parser for G.
Method: For all items of the form [X → ⋯ .(r1 | ⋯ | rn) ⋯] set
M([X → ⋯ .(r1 | ⋯ | rn) ⋯], a) = [X → ⋯ (⋯ |.ri| ⋯) ⋯] for a ∈ first1(ri), and, if in addition
ε ∈ first1(ri), then also for a ∈ follow1([X → ⋯ .(r1 | ⋯ | rn) ⋯]).
For all items of the form [X → ⋯ .(r)∗ ⋯] set
M([X → ⋯ .(r)∗ ⋯], a) = [X → ⋯ (.r)∗ ⋯]   if a ∈ first1(r)
M([X → ⋯ .(r)∗ ⋯], a) = [X → ⋯ (r)∗. ⋯]   if a ∈ follow1([X → ⋯ .(r)∗ ⋯])
Set all entries not yet filled to error.
Alternatively, a recursive descent parser can be written by hand in the programming language of one's choice.
The latter is the implementation method of choice as long as no generator tool is available.
Let a right-regular context-free grammar G = (VN, VT, p, S) with VN = {X0, …, Xn}, S = X0, and
p = {X0 ↦ α0, X1 ↦ α1, …, Xn ↦ αn} be given. We present recursive functions p_progr and
progr that generate a so-called recursive descent parser from the grammar G and the computed first1-
and follow1-sets.
For each production, that is, for each nonterminal X, a procedure with the name X is
generated. The constructors for regular expressions in the right sides are translated into programming-language
constructs such as switch-, while-, and do-while-statements, into checks for terminal symbols,
and into recursive calls of the procedures for nonterminals. The first1- and follow1-sets of occurrences of
regular expressions are needed, for instance, to select the right one of several alternatives. Such an
occurrence of a regular (sub-)expression corresponds exactly to an extended context-free item. The function
progr is, therefore, defined recursively over the structure of the context-free items of the grammar G.
The following function FiFo is used in the case distinction for alternatives: FiFo([X → ⋯ .β ⋯]) =
first1(β) ⊙1 follow1([X → ⋯ .β ⋯]).
struct symbol nextsym;
/* stores the next input symbol in nextsym */
void scan();
/* prints the error message and stops the run of the parser */
void error(String errorMessage);

/* translating the input grammar */
p_progr(X0 → α0);
p_progr(X1 → α1);
...
p_progr(Xn → αn);

void parser() {
    scan();
    X0();
    if (nextsym == "#")
        accept();
    else
        error("# expected");
}

/* For all rules like this ... */
p_progr(X → .α)
/* ... we create an according method like this: */
void X() {
    progr([X → .α]);
}

void progr([X → ⋯ .(α1 | α2 | ⋯ | αk−1 | αk) ⋯]) {
    switch ( ) {
    case (FiFo([X → ⋯ (.α1 | α2 | ⋯ | αk−1 | αk) ⋯]).contains(nextsym)) :
        progr([X → ⋯ (.α1 | α2 | ⋯ | αk−1 | αk) ⋯]);
        break;
    case (FiFo([X → ⋯ (α1 | .α2 | ⋯ | αk−1 | αk) ⋯]).contains(nextsym)) :
        progr([X → ⋯ (α1 | .α2 | ⋯ | αk−1 | αk) ⋯]);
        break;
    ...
    case (FiFo([X → ⋯ (α1 | α2 | ⋯ | .αk−1 | αk) ⋯]).contains(nextsym)) :
        progr([X → ⋯ (α1 | α2 | ⋯ | .αk−1 | αk) ⋯]);
        break;
    default :
        progr([X → ⋯ (α1 | α2 | ⋯ | αk−1 | .αk) ⋯]);
    }
}

void progr([X → ⋯ .(α1 α2 ⋯ αk) ⋯]) {
    progr([X → ⋯ (.α1 α2 ⋯ αk) ⋯]);
    progr([X → ⋯ (α1 .α2 ⋯ αk) ⋯]);
    ...
    progr([X → ⋯ (α1 α2 ⋯ .αk) ⋯]);
}

void progr([X → ⋯ .(α)∗ ⋯]) {
    while (first1(α).contains(nextsym)) {
        progr([X → ⋯ .α ⋯]);
    }
}

void progr([X → ⋯ .(α)+ ⋯]) {
    do {
        progr([X → ⋯ .α ⋯]);
    } while (first1(α).contains(nextsym));
}

void progr([X → ⋯ .ε ⋯]) { }

For a ∈ VT:
void progr([X → ⋯ .a ⋯]) {
    if (nextsym == a)
        scan();
    else
        error("...");
}

For Y ∈ VN, progr([X → ⋯ .Y ⋯]) generates the procedure call Y();
How does such a parser work? Procedure X for a nonterminal X is in charge of recognizing words
for X. When it is called, the first symbol of the word to recognize has already been read by the combination
scanner/screener, that is, by the procedure scan. When procedure X has found a word for X and returns,
it has already read the symbol following that word.
The next section describes several modifications for the handling of syntax errors.
We now present the recursive descent parser for the right-regular context-free grammar Ge for
arithmetic expressions.
Example 3.3.14 (Continuation of Example 3.3.8)
The following parser results from the schematic translation of the extended expression grammar. For
terminal symbols, their string representation is used.
symbol nextsym;
/* returns next input symbol */
symbol scan();
/* prints the error message and stops the run of the parser */
void error(String errorMessage);

void S() {
    E();
}

void E() {
    T();
    while (nextsym == "+" || nextsym == "-") {
        switch (nextsym) {
        case "+" :
            if (nextsym == "+")
                scan();
            else
                error("+ expected");
            break;
        default :
            if (nextsym == "-")
                scan();
            else
                error("- expected");
        }
        T();
    }
}

void T() {
    F();
    while (nextsym == "*" || nextsym == "/") {
        switch (nextsym) {
        case "*" :
            if (nextsym == "*")
                scan();
            else
                error("* expected");
            break;
        default :
            if (nextsym == "/")
                scan();
            else
                error("/ expected");
        }
        F();
    }
}

void F() {
    switch (nextsym) {
    case "(" :
        scan();                  /* consume "(" */
        E();
        if (nextsym == ")")
            scan();
        else
            error(") expected");
        break;
    default :
        if (nextsym == "id")
            scan();
        else
            error("id expected");
    }
}

void parser() {
    scan();
    S();
    if (nextsym == "#")
        accept();
    else
        error("# expected");
}
Some inefficiencies result from the schematic generation of this parser program. A more sophisti-
cated generation scheme will avoid most of these inefficiencies.
Bottom-up parsers read their input like top-down parsers from left to right. They are pushdown automata
that can essentially do two kinds of operations:
• Read the next input symbol (shift), and
• Reduce the right side of a production X → α at the top of the stack by the left side X of the
production (reduce).
Because of these operations they are called shift-reduce parsers. Shift-reduce parsers are right parsers;
they output the application of a production when they perform a reduction. The result of the successful
analysis of an input word is a rightmost derivation in reverse order, because shift-reduce parsers always
reduce at the top of the stack.
A shift-reduce parser must never miss a required reduction, that is, bury it in the stack under a newly
read input symbol. A reduction is required if no rightmost derivation to the start symbol is possible
without it. A right side covered by an input symbol will never reappear at the top of the stack and
can, therefore, never be reduced. A right side at the top of the stack that must be reduced to obtain a
derivation is called a handle.
Not all occurrences of right sides that appear at the top of the stack are handles. Some reductions
performed at the top of the stack lead into dead ends, that is, they cannot be continued to a reversed
rightmost derivation although the input is correct.
Example 3.4.1 Let G0 be again the grammar for arithmetic expressions with the productions:
S → E
E → E+T |T
T → T ∗F |F
F → (E) | Id
Table 3.5 shows a successful bottom-up analysis of the word Id ∗ Id according to G0. The third column lists
actions that were also possible, but would lead into dead ends. In the third step, the parser would miss
a required reduction. In the other two steps, the alternative reductions would lead into dead ends, that
is, not to right sentential forms. ⊓⊔
Table 3.5. A successful analysis of the word Id ∗ Id together with potential dead ends.
Bottom-up parsers construct the parse tree from the bottom up. They start with the leaf word of the
parse tree, the input word, and construct, for ever larger parts of the read input, subtrees of the parse tree
by attaching the subtrees for the right side α of a production X → α below a newly created X-node
upon a reduction by this production. The analysis is successful if a parse tree with root label S, the start
symbol of the grammar, has been constructed for the whole input word.
Fig. 3.13 shows some snapshots during the construction of the parse tree according to the derivation
shown in Table 3.5. The tree on the left contains all nodes that can be created when the input Id has
been read. The sequence of three trees in the middle represents the state before the handle T ∗ F is
being reduced, while the tree on the right shows the complete parse tree.
Fig. 3.13. Construction of the parse tree after reading the first symbol, Id, together with the remaining input, before
the reduction of the handle T ∗ F, and the complete parse tree.
Reading a terminal symbol a in char(G) corresponds to a shift transition of the item pushdown-automaton
under a. The ε-transitions of char(G) correspond to the expansion transitions of the item
pushdown-automaton. When char(G) reaches a final state [X → α.], PG undertakes the following
actions: it removes the item [X → α.] on top of its stack and makes a transition under X from the new
state that has appeared on top of the stack. This is a reduction move of the item pushdown-automaton
PG.
Example 3.4.2 Let G0 again be the grammar for arithmetic expressions with the productions
S → E
E → E+T |T
T → T ∗F |F
F → (E) | Id
Fig. 3.14. The characteristic finite-state machine char(G0) for the grammar G0.
The following theorem clarifies the exact relation between the characteristic finite-state machine and
the item pushdown-automaton:
Theorem 3.4.1 Let G be a context-free grammar and γ ∈ (VT ∪ VN)∗. The following three statements
are equivalent:
1. There exists a computation ([S′ → .S], γ) ⊢*char(G) ([A → α.β], ε) of the characteristic finite-state
machine char(G).
2. There exists a computation (ρ [A → α.β], w) ⊢*PG ([S′ → S.], ε) of the item pushdown-automaton
PG such that γ = hist(ρ) α holds.
3. There exists a rightmost derivation S′ ⇒*rm γ′Aw ⇒rm γ′αβw with γ = γ′α. ⊓⊔
The equivalence of statements (1) and (2) means that the words that lead to an item of the characteristic
finite-state machine char(G) are exactly the histories of stack contents of the item pushdown-automaton
PG whose topmost symbol is this item and from which PG can reach one of its final states given an
appropriate input w. The equivalence of statements (2) and (3) means that an accepting computation of
the item pushdown-automaton for an input word w that starts with a stack contents ρ corresponds to a
rightmost derivation that leads to a sentential form αw, where α is the history of the stack contents ρ.
We introduce some terminology before we prove Theorem 3.4.1. For a rightmost derivation
S′ ⇒*rm γ′Av ⇒rm γ′αv and a production A → α, we call α the handle of the right sentential form γ′αv.
If the right side is α = α′β, then the prefix γ = γ′α′ is called a reliable prefix of G for the item [A → α′.β].
The item [A → α′.β] is then valid for γ. Theorem 3.4.1 thus means that the set of words under which the
characteristic finite-state machine reaches an item [A → α′.β] is exactly the set of reliable prefixes for
this item.
Example 3.4.3 For the grammar G0 we have:
We now consider this rightmost derivation in the direction of reduction, that is, in the direction in which
a bottom-up parser constructs it. First, x is reduced to γ in a number of steps, then u to α, then v to
β. The valid item [A → α.β] for the reliable prefix γα describes the analysis situation in which the
reduction of u to α has already been done, while the reduction of v to β has not yet started. A possible
long-range goal in this situation is the application of the production X → αβ.
84 3 Syntactic Analysis
We come back to the question of which language is accepted by the characteristic finite-state machine
of PG. Theorem 3.4.1 says that char(G) goes, under a reliable prefix, into a state that is an item valid for this
prefix. Final states, i.e., complete items, are only valid for reliable prefixes at whose right ends a reduction is
possible.
Proof of Theorem 3.4.1. We give a circular proof (1) ⇒ (2) ⇒ (3) ⇒ (1). Let us first assume that
([S′ → .S], γ) ⊢*char(G) ([A → α.β], ε). By induction over the number n of ε-transitions we construct a
rightmost derivation S′ ⇒*rm γ′Aw ⇒rm γ′αβw with γ = γ′α.
If n = 0, then γ = ε and [A → α.β] = [S′ → .S]. Since S′ ⇒*rm S′ holds, the claim is fulfilled in
this case. If n > 0, we consider the last ε-transition. The computation of the characteristic automaton
can then be decomposed into
([S′ → .S], γ′α) ⊢*char(G) ([X → α′.Aβ′], α) ⊢char(G) ([A → .αβ], α) ⊢*char(G) ([A → α.β], ε)
where γ = γ′α. By the induction hypothesis, there exists a rightmost derivation S′ ⇒*rm γ′′Xw′ ⇒rm γ′′α′Aβ′w′
with γ′ = γ′′α′. Since the grammar G is reduced, there also exists a rightmost derivation β′ ⇒*rm v.
Therefore we have
S′ ⇒*rm γ′Avw′ ⇒rm γ′αβvw′
for Xn = A. By induction on n it follows that (ρ, vw) ⊢*PG ([S′ → S.], ε) holds for
In Chapter 2, we presented an algorithm that takes a non-deterministic finite-state machine and constructs
an equivalent deterministic finite-state machine. This deterministic finite-state machine pursues
in parallel all paths that the non-deterministic automaton could take for a given input. Its states
are sets of states of the non-deterministic automaton. This subset construction is now applied to the
characteristic finite-state machine char(G) of a context-free grammar G. The resulting deterministic
finite-state machine is called the canonical LR(0) automaton for G and is denoted by LR0(G).
Example 3.4.5 The canonical LR(0) automaton for the context-free grammar G0 of Example 3.2.2
on page 39 is obtained by applying the subset construction to the characteristic finite-state
machine char(G0) of Fig. 3.14 on page 82. It is shown in Fig. 3.15 on page 85. Its states are:
S0 = {[S → .E], [E → .E + T], [E → .T], [T → .T ∗ F], [T → .F], [F → .(E)], [F → .Id]}
S1 = {[S → E.], [E → E. + T]}
S2 = {[E → T.], [T → T. ∗ F]}
S3 = {[T → F.]}
S4 = {[F → (.E)], [E → .E + T], [E → .T], [T → .T ∗ F], [T → .F], [F → .(E)], [F → .Id]}
S5 = {[F → Id.]}
S6 = {[E → E + .T], [T → .T ∗ F], [T → .F], [F → .(E)], [F → .Id]}
S7 = {[T → T ∗ .F], [F → .(E)], [F → .Id]}
S8 = {[F → (E.)], [E → E. + T]}
S9 = {[E → E + T.], [T → T. ∗ F]}
S10 = {[T → T ∗ F.]}
S11 = {[F → (E).]}
S12 = ∅
⊓⊔
Fig. 3.15. The transition diagram of the canonical LR(0) automaton for the grammar G0, obtained from the characteristic
finite-state machine char(G0) in Fig. 3.14. The error state S12 = ∅ and all transitions into it are omitted.
The canonical LR(0) automaton LR0(G) for a context-free grammar G has some interesting properties.
Let LR0(G) = (QG, VT ∪ VN, ∆G, qG,0, FG), and let ∆*G : QG × (VT ∪ VN)∗ → QG be the lifting
of the transition function ∆G from symbols to words. We then have:
1. ∆*G(qG,0, γ) is the set of all items in IG for which γ is a reliable prefix.
2. L(LR0(G)) is the set of all reliable prefixes for complete items [A → α.] ∈ IG.
Reliable prefixes are prefixes of right-sentential forms, as they occur during the reduction of an input
word, that do not extend beyond the handle: a reduction that again leads to a right-sentential form can
only happen at the right end of such a sentential form. An item valid for a reliable prefix describes one
possible interpretation of the current analysis situation.
Example 3.4.6 E + F is a reliable prefix for the grammar G0. The state ∆*G0(S0, E + F) = S3 is also
reached by the following reliable prefixes:
F, (F, ((F, (((F, …
T ∗ (F, T ∗ ((F, T ∗ (((F, …
E + F, E + (F, E + ((F, …
The state S6 in the canonical LR(0) automaton for G0 contains all valid items for the reliable prefix
E+, namely the items
[E → E + .T], [T → .T ∗ F], [T → .F], [F → .Id], [F → .(E)].
⊓⊔
The canonical LR(0) automaton LR0(G) for a context-free grammar G is a deterministic finite-state
machine that accepts the set of reliable prefixes for complete items. In this way, it identifies positions
for reduction and, therefore, lends itself to the construction of a right parser. Instead of items (as
the item-pushdown automaton does), this parser stores on its stack states of the canonical LR(0) automaton,
that is, sets of items. The underlying pushdown automaton P0 is defined as the tuple P0 = (QG ∪
{f}, VT, ∆0, qG,0, {f}). Its set of states is the set QG of states of the canonical LR(0) automaton
LR0(G), extended by a new state f, the final state. The initial state of P0 is identical to the initial state
qG,0 of LR0(G). The transition relation ∆0 consists of the following kinds of transitions:
Read: (q, a, q ∆G(q, a)) ∈ ∆0 if ∆G(q, a) ≠ ∅. This transition reads the next input symbol a and
pushes the successor state of q under a onto the stack. It can only be taken if at least one item of the
form [X → α.aβ] is contained in q.
Reduce: (qq1 … qn, ε, q ∆G(q, X)) ∈ ∆0 if [X → α.] ∈ qn holds with |α| = n. The complete item
[X → α.] in the topmost stack entry signals a potential reduction. As many entries are removed
from the top of the stack as the length of the right side indicates. After that, the X-successor of the
new topmost stack entry is pushed onto the stack.
Fig. 3.16 shows the part of the transition diagram of an LR(0) automaton LR0(G) that illustrates
this situation. The α-path in the transition diagram corresponds to |α| entries on top of the stack.
These entries are removed at a reduction. The new current state, previously below the removed
entries, has a transition under X, which is now taken.
Finish: (qG,0 q, ε, f) ∈ ∆0 if [S′ → S.] ∈ q. This transition corresponds to the reduction transition for
the production S′ → S. The property [S′ → S.] ∈ q signals that a word was successfully reduced to
the start symbol. This transition empties the stack and pushes the final state f.
The special case [X → . ] merits extra consideration. According to our description, |ε| = 0 topmost
stack entries are removed from the stack upon this reduction, a transition from the (new and
old) current state q under X is taken, and the state ∆G(q, X) is pushed onto the stack.
This transition is possible since, by construction, together with an item [⋯ → ⋯ .X ⋯] the state q also
contains the item [X → .α] for each right side α of the nonterminal X. In the special case of an
ε-production, the current state q thus contains, together with the item [⋯ → ⋯ .X ⋯], also the complete
item [X → . ]. This reduction transition increases the length of the stack.
The construction of LR0(G) guarantees that for each non-initial and non-final state q there exists
exactly one entry symbol under which the automaton can make a transition into q. A stack contents
q0, …, qn with q0 = qG,0 corresponds, therefore, to a uniquely determined word α = X1 … Xn ∈
(VT ∪ VN)∗ for which ∆G(qi, Xi+1) = qi+1 holds. This word α is a reliable prefix, and qn is the set
of all items valid for α.
Fig. 3.16. Part of the transition diagram of an LR(0) automaton: the α-path from a state containing [X → .α]
to the state containing the complete item [X → α.], together with the X-transition that is taken after the reduction.
The pushdown automaton P0 just constructed is not necessarily deterministic. There are two kinds
of conflicts that cause non-determinism:
shift-reduce conflict: a state q allows a read transition under a symbol a ∈ VT as well as a reduce or
finish transition;
reduce-reduce conflict: a state q permits reduction transitions according to two different productions.
In the first case, the state q contains at least one item [X → α.aβ] and at least one complete item
[Y → γ.]; in the second case, q contains two different complete items [Y → α.], [Z → β.]. A state q
of the LR(0) automaton with one of these properties is called LR(0)-inadequate. Otherwise, we call q
LR(0)-adequate. The following holds:
Lemma 3.4 For an LR(0)-adequate state q there are three possibilities:
1. The state q contains no complete item.
2. The state q consists of exactly one complete item [A → α.].
3. The state q contains exactly one complete item [A → . ], and all non-complete items in q are of the
form [X → α.Y β], where all rightmost derivations for Y that lead to a terminal word are of the
form
Y ⇒*rm Aw ⇒rm w
for some w ∈ VT∗. ⊓⊔
Inadequate states of the canonical LR(0) automaton make the pushdown automaton P0 non-deterministic.
We obtain deterministic parsers by permitting the parser to look ahead into the remaining input in order to select
the correct action in inadequate states.
Example 3.4.7 The states S1, S2 and S9 of the canonical LR(0) automaton in Fig. 3.15 are LR(0)-inadequate.
In state S1, the parser can reduce the right side E to the left side S (complete item [S → E.]),
and it can read the terminal symbol + in the input (item [E → E. + T]). In state S2, the parser can
reduce the right side T to E (complete item [E → T.]), and it can read the terminal symbol ∗ (item
[T → T. ∗ F]). In state S9, finally, the parser can reduce the right side E + T to E (complete item
[E → E + T.]), and it can read the terminal symbol ∗ (item [T → T. ∗ F]). ⊓⊔
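Detecting LR(0)-inadequate states mechanically is straightforward once the item sets are available. The following Python sketch is our own illustration; items are triples (left side, right side, dot position), and all names are assumptions of the sketch.

def lr0_conflicts(state, terminals):
    # Returns the set of conflict kinds of the state; the state is
    # LR(0)-inadequate iff the result is non-empty.
    complete = [it for it in state if it[2] == len(it[1])]
    shift = any(it[2] < len(it[1]) and it[1][it[2]] in terminals
                for it in state)
    kinds = set()
    if len(complete) > 1:
        kinds.add("reduce-reduce")
    if complete and shift:
        kinds.add("shift-reduce")
    return kinds

# State S1 of the canonical LR(0) automaton for G0:
S1 = {("S", ("E",), 1), ("E", ("E", "+", "T"), 1)}
print(lr0_conflicts(S1, {"+", "*", "(", ")", "Id"}))  # {'shift-reduce'}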
The canonical LR(0) automaton LR0(G) for a context-free grammar G need not be derived via
the construction of the characteristic finite-state machine char(G) and the subset construction. It can
also be constructed directly from G. The construction uses a function ∆G,ε that adds to each set q of items
all items that are reachable by ε-transitions of the characteristic finite-state machine. The set ∆G,ε(q)
is the least solution of the equation
q′ = q ∪ {[X → .α] | [A → β.Xγ] ∈ q′, X → α ∈ P}
and can be computed by the following function closure():
set⟨item⟩ closure(set⟨item⟩ q) {
    set⟨item⟩ result ← q;
    list⟨item⟩ W ← list_of(q);
    symbol X; string⟨symbol⟩ α;
    while (W ≠ []) {
        item i ← hd(W); W ← tl(W);
        switch (i) {
        case [_ → _ .X _] :  forall (α : (X → α) ∈ P)
            if ([X → .α] ∉ result) {
                result ← result ∪ {[X → .α]};
                W ← [X → .α] :: W;
            }
        default :  break;
        }
    }
    return result;
}
where V is the set of symbols, V = VT ∪ VN. The set QG of states and the transition relation ∆G are
computed by first constructing the initial state qG,0 = ∆G,ε({[S′ → .S]}) and then adding successor
states and transitions until all successor states are already contained in the set of constructed states. For the
implementation, we specialize the function nextState() of the subset construction:
As in the subset construction, the set of states states and the set of transitions trans can be computed
iteratively:
list⟨set⟨item⟩⟩ W;
set⟨item⟩ q0 ← closure({[S′ → .S]});
states ← {q0}; W ← [q0];
trans ← ∅;
set⟨item⟩ q, q′;
while (W ≠ []) {
    q ← hd(W); W ← tl(W);
    forall (symbol X) {
        q′ ← nextState(q, X);
        trans ← trans ∪ {(q, X, q′)};
        if (q′ ∉ states) {
            states ← states ∪ {q′};
            W ← q′ :: W;
        }
    }
}
Consider again the grammar G4 with the productions S → A | B, A → aAb | 0, and B → aBbb | 1,
where L(G4) = {aⁿ0bⁿ | n ≥ 0} ∪ {aⁿ1b²ⁿ | n ≥ 0}. We know already that G4 is not an LL(k)
grammar for any k ≥ 1. Grammar G4 is an LR(0) grammar, though.
The right sentential forms of G4 have the form
for n ≥ 0; the handles are always underlined. Two different possibilities to reduce exist only in the
case of the right sentential forms aⁿaAbbⁿ and aⁿaBbbb²ⁿ. One could reduce aⁿaAbbⁿ to aⁿAbⁿ and to
aⁿaSbbⁿ. The first choice belongs to the rightmost derivation
S ⇒*rm aⁿAbⁿ ⇒rm aⁿaAbbⁿ
the second to no rightmost derivation. The prefix aⁿ of aⁿAbⁿ uniquely determines whether A is the
handle, namely in the case n = 0, or whether aAb is the handle, namely in the case n > 0. The right
sentential forms aⁿBb²ⁿ are handled analogously. ⊓⊔
The grammar G1 with the productions
S → aAc
A → Abb | b
generates the language L(G1) = {ab²ⁿ⁺¹c | n ≥ 0}. The grammar G2 with the productions
S → aAc
A → bbA | b
and the language L(G2) = L(G1) is an LR(1) grammar. The critical right sentential forms have the
form abⁿw. If w|1 = b, the handle lies in w; if w|1 = c, the last b in bⁿ forms the handle. ⊓⊔
Example 3.4.11 The grammar G3 with the productions
S → aAc
A → bAb | b
and the language L(G3) = L(G1) is not an LR(k) grammar for any k ≥ 0. To see this, let k be arbitrary
but fixed, and consider the two rightmost derivations
S ⇒*rm abⁿAbⁿc ⇒rm abⁿbbⁿc
S ⇒*rm abⁿ⁺¹Abⁿ⁺¹c ⇒rm abⁿ⁺¹bbⁿ⁺¹c
with n ≥ k. With the names introduced in the definition of LR(k) grammars, we have α = abⁿ, β =
b, γ = abⁿ⁺¹, w = bⁿc, and y = bⁿ⁺²c. Here w|k = y|k = bᵏ. Then α ≠ γ implies that G3 can be no LR(k)
grammar. ⊓⊔
The following theorem clarifies the relation between the definition of an LR(0) grammar and the properties
of the canonical LR(0) automaton.
Theorem 3.4.2 A context-free grammar G is an LR(0) grammar if and only if the canonical LR(0)
automaton for G has no LR(0)-inadequate states.
Proof. '⇒': Let G be an LR(0) grammar, and assume that the canonical LR(0) automaton
LR0(G) has an LR(0)-inadequate state p.
Case 1: The state p has a reduce-reduce conflict, that is, p contains two different items [X → β.], [Y → δ.].
Associated with the state p is a nonempty set of reliable prefixes. Let γ = γ′β be such a reliable
prefix. Because both items are valid for γ, there exist rightmost derivations
S′ ⇒*rm γ′Xw ⇒rm γ′βw   and
S′ ⇒*rm νY y ⇒rm νδy   with νδ = γ′β = γ
If β′ ∈ VT∗, we immediately obtain a contradiction. Otherwise, there exists a rightmost derivation
α ⇒*rm v1Xv3 ⇒rm v1v2v3
To show is that α = γ, X = Y, and x = y hold. Let p be the state of the canonical LR(0) automaton
after reading αβ. Then p contains all items valid for αβ. By assumption, p is LR(0)-adequate.
We distinguish two cases:
Case 1: β ≠ ε. By Lemma 3.4, p = {[X → β.]}, that is, [X → β.] is the only valid item for
αβ. It follows that α = γ, X = Y, and x = y must hold.
Case 2: β = ε. Assume that the second rightmost derivation contradicted the LR(0) condition.
Then there is a further item [X → δ.Y′η] ∈ p such that α = α′δ. The last application of a
production in the lower rightmost derivation is the last application of a production in a terminal
rightmost derivation for Y′. By Lemma 3.4 it follows that the lower derivation is given by
S′ ⇒*rm α′δY′w ⇒*rm α′δXvw ⇒rm α′δvw
with x = (w#)|k. A context-free item [A → α.β] can be understood as an LR(0) item extended
by the lookahead ε.
Example 3.4.12 Consider again grammar G0. We have:
(1) [E → E + .T, )] and [E → E + .T, +] are valid LR(1) items for the prefix (E+.
Observation (2) follows since the subword E∗ can occur in no right sentential form. ⊓⊔
The following theorem gives a characterization of the LR(k)-property based on valid LR(k)-items.
Theorem 3.4.3 Let G be a context-free grammar. For a reliable prefix γ, let It(γ) be the set of LR(k)-items of G that are valid for γ. The grammar G is an LR(k)-grammar if and only if for all reliable prefixes γ and all LR(k)-items [A → α., x] ∈ It(γ) the following two conditions hold:

1. If there is another LR(k)-item [X → δ., y] ∈ It(γ), then x ≠ y.
2. If there is another LR(k)-item [X → δ.aβ, y] ∈ It(γ), then x ∉ firstk(aβ) ⊙k {y}. ⊓⊔
Theorem 3.4.3 suggests defining LR(k)-adequate and LR(k)-inadequate sets of items also for k > 0. Let I be a set of LR(k)-items. I has a reduce-reduce-conflict if there are two different LR(k)-items [X → α., x], [Y → β., y] ∈ I with x = y. I has a shift-reduce-conflict if there are LR(k)-items [X → α.aβ, x], [Y → γ., y] ∈ I with y ∈ firstk(aβ) ⊙k {x}. The set I is called LR(k)-inadequate if it has a reduce-reduce- or a shift-reduce-conflict, and LR(k)-adequate otherwise.
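For k = 1 these two conditions can be checked directly on a set of items. The following Python sketch is illustrative only: it assumes LR(1)-items are represented as triples (production, dot, lookahead) with production = (lhs, rhs), and it assumes a predicate is_terminal(); neither representation is from the text.

    def lr1_inadequate(I):
        # Complete items [X -> alpha., x]: the dot stands at the end.
        complete = [(p, x) for (p, d, x) in I if d == len(p[1])]
        # reduce-reduce-conflict: two different complete items, equal lookahead
        for i, (p, x) in enumerate(complete):
            for (q, y) in complete[i + 1:]:
                if p != q and x == y:
                    return True
        # shift-reduce-conflict: for k = 1, first1(a beta) (.)1 {y} = {a}, so
        # it suffices to compare the lookaheads of complete items with the
        # terminals occurring directly behind a dot
        shifts = {p[1][d] for (p, d, x) in I
                  if d < len(p[1]) and is_terminal(p[1][d])}
        return any(x in shifts for (_, x) in complete)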
A second table, the goto-table, contains the representation of the transition function of the canonical LR(k)-automaton LRk(G). It is consulted after a shift-action or a reduce-action to determine the new state on top of the stack. Upon a shift, it computes the transition under the read symbol out of the current state. Upon a reduction by X → α, it gives the transition under X out of the state underneath those stack symbols that belong to α. These two tables for k = 1 are shown in Fig. 3.17.
The LR(k) parser for a grammar G needs a program that interprets the action- and goto-tables, the driver. Again, we consider the case k = 1. This is, in principle, sufficient, because for each language that has an LR(k)-grammar, and therefore also an LR(k) parser, one can construct an LR(1)-grammar and consequently also an LR(1) parser. Let us assume that the set of states of the LR(1) parser is Q. One such driver program then is:
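One possible such driver is given here as a minimal Python sketch; it assumes the action-table maps pairs (state, symbol) to 'shift', ('reduce', X, alpha), 'accept', or 'error', and the goto-table maps pairs (state, symbol) to successor states. These representations are assumptions of the sketch, not the book's listing.

    def lr1_driver(action, goto, q0, tokens):
        # The stack holds states of the parser; '#' marks the end of the input.
        stack = [q0]
        inp = list(tokens) + ['#']
        pos = 0
        while True:
            q, a = stack[-1], inp[pos]
            act = action.get((q, a), 'error')
            if act == 'error':
                return False                        # syntax error
            if act == 'accept':
                return True                         # input accepted
            if act == 'shift':
                stack.append(goto[(q, a)])          # goto-entry under terminal a
                pos += 1
            else:                                   # act = ('reduce', X, alpha)
                _, X, alpha = act
                if alpha:                           # pop |alpha| states
                    del stack[-len(alpha):]
                stack.append(goto[(stack[-1], X)])  # transition under X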
Fig. 3.17. Schematic representation of the action- and goto-tables of an LR(1) parser with set of states Q. The action-table is indexed by states from Q and symbols from VT ∪ {#}; the goto-table is indexed by states from Q and symbols from VN ∪ VT.
The LR(1) parser is based on the canonical LR(1)-automaton LR1(G). Its states, therefore, are sets of LR(1)-items. We construct the canonical LR(1)-automaton in much the same way as we constructed the canonical LR(0) automaton. The only difference is that LR(1)-items are used instead of LR(0)-items. This means that the lookahead symbols need to be computed when the closure of a set q of LR(1)-items under ε-transitions is formed. This closure is the least solution of a corresponding system of set equations; it is computed by the following function:
set⟨item1⟩ closure(set⟨item1⟩ q) {
	set⟨item1⟩ result ← q;
	list⟨item1⟩ W ← list_of(q);	// worklist of items still to be processed
	nonterminal X; string⟨symbol⟩ α, β; terminal x, y;
	while (W ≠ []) {
		item1 i ← hd(W); W ← tl(W);
		switch (i) {
		case [_ → _.Xβ, x] :	// nonterminal X behind the dot
			forall (α : (X → α) ∈ P)
				forall (y ∈ first1(β) ⊙1 {x})	// lookaheads of the new items
					if ([X → .α, y] ∉ result) {
						result ← result ∪ {[X → .α, y]};
						W ← [X → .α, y] :: W;
					}
		default : break;
		}
	}
	return result;
}
where V is the set of all symbols, V = VT ∪ VN. The initial state q0 of LR1(G) is closure({[S′ → .S, #]}).
We need a function nextState() that computes the successor state to a given set q of LR(1)-items and a symbol X ∈ V = VN ∪ VT. The corresponding function for the construction of LR0(G) needs to be extended to also compute the lookahead symbols:
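A Python sketch of this extended function, using the same illustrative item representation as above and taking the closure computation as a parameter, could read:

    def next_state(q, X, closure):
        # Move the dot over X in every item that has X directly behind its
        # dot; the lookahead is carried along unchanged. Then form the closure.
        kernel = {(p, d + 1, x) for (p, d, x) in q
                  if d < len(p[1]) and p[1][d] == X}
        return closure(kernel)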
The set of states and the transition relation of the canonical LR(1)-automaton are computed in analogy to the canonical LR(0)-automaton. The generator starts with the initial state and an empty set of transitions and adds successor states until all successor states are already contained in the set of computed states. The transition function of the canonical LR(1)-automaton gives the goto-table of the LR(1) parser.
Let us turn to the construction of the action-table of the LR(1) parser. No reduce-reduce-conflict exists in a state q of the canonical LR(1)-automaton with complete LR(1)-items [X → α., x], [Y → β., y] if x ≠ y. If the LR(1) parser is in state q, it will decide to reduce with the production whose lookahead symbol is the next input symbol. If state q contains at the same time a complete LR(1)-item [X → α., x] and an LR(1)-item [Y → β.aγ, y], it still has no shift-reduce-conflict if a ≠ x. In state q the generated parser will reduce if the next input symbol is x, and shift if it is a. Therefore, the action-table can be computed by the following iteration:
forall (state q) {
	forall (terminal x) action[q, x] ← error;	// default: error entries
	forall ([X → α.β, x] ∈ q)
		if (β = ε)	// complete item: accept for the start production, else reduce
			if (X = S′ ∧ α = S ∧ x = #) action[q, #] ← accept;
			else action[q, x] ← reduce(X → α);
		else if (β = aβ′) action[q, a] ← shift;	// terminal a behind the dot
}
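The same iteration, rendered as an illustrative Python sketch compatible with the driver sketched above (start_prod stands for the production S′ → S; this name and the predicate is_terminal are assumptions):

    def build_action_table(states, start_prod):
        action = {}
        for q in states:
            for (p, d, x) in q:
                X, rhs = p
                if d == len(rhs):                   # complete item
                    if p == start_prod and x == '#':
                        action[(q, '#')] = 'accept'
                    else:
                        action[(q, x)] = ('reduce', X, rhs)
                elif is_terminal(rhs[d]):           # terminal behind the dot
                    action[(q, rhs[d])] = 'shift'
        return action

Entries that remain unset correspond to error; the driver above treats missing entries exactly this way.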
Example 3.4.13 We consider some states of the canonical LR(1)-automaton for the context-free grammar G0. The numbering of states is the same as in Fig. 3.15. To make the representation of sets S of LR(1)-items more readable, all lookahead symbols in LR(1)-items from S with the same kernel [A → α.β] are collected in one lookahead set

L = {x | [A → α.β, x] ∈ S}
S2′ = nextState(S1′, T)
    = { [E → T., {#, +}],
        [T → T.∗F, {#, +, ∗}] }
After the extension by lookahead symbols, the states S1, S2 and S9, which were LR(0)-inadequate, no longer have conflicts. In state S1′ the next input symbol + indicates to shift, while the next input symbol # indicates to reduce. In state S2′ the lookahead symbol ∗ indicates to shift, # and + to reduce; similarly in state S9′.
Table 3.6 shows the rows of the action-table of the canonical LR(1) parser for the grammar G0 that belong to the states S0′, S1′, S2′, S6′ and S9′. ⊓⊔
The set of states of LR(1) parsers can become quite large. Therefore, LR analysis methods are often employed that are not as powerful as canonical LR parsers, but have fewer states. Two such LR analysis methods are the SLR(1)- (simple LR) and the LALR(1)- (lookahead LR) method. Each SLR(1) parser is a special LALR(1) parser, and each grammar that has an LALR(1) parser is an LR(1)-grammar.

Table 3.6. Some rows of the action-table of the canonical LR(1) parser for G0. s stands for shift, r(i) for reduce by production i, acc for accept. All empty entries represent error.
The starting point of the construction of SLR(1)- and LALR(1) parsers is the canonical LR(0)
automaton LR0 (G). The set Q of states and the goto-table for these parsers are the set of states and
the goto-table of the corresponding LR(0) parser. Lookahead is used to resolve conflicts in the states
in Q. Let q ∈ Q be a state of the canonical LR(0) automaton and [X → α.β] an item in q. We denote
by λ(q, [X → α.β]) the lookahead set that is added to the item [X → α.β] in q. The SLR(1)-method differs from the LALR(1)-method in the definition of the function

λ : Q × IG → 2^{VT ∪ {#}}
Relative to such a function λ, a state q of LR0(G) has a reduce-reduce-conflict if it contains different complete items [X → α.], [Y → β.] ∈ q with

λ(q, [X → α.]) ∩ λ(q, [Y → β.]) ≠ ∅

Relative to λ, q has a shift-reduce-conflict if it contains items [X → α.aβ], [Y → γ.] ∈ q with a ∈ λ(q, [Y → γ.]).
If no state of the canonical LR(0) automaton has a conflict, the lookahead sets λ(q, [X → α.]) suffice to construct the action-table.
In SLR(1) parsers, the lookahead sets for items are independent of the states in which they occur; the lookahead only depends on the left side of the production in the item:

λS(q, [X → α.]) = {a ∈ VT ∪ {#} | S′# =⇒*rm γXaw} = follow1(X)

for all states q with [X → α.] ∈ q. A state q of the canonical LR(0) automaton is called SLR(1)-inadequate if it contains conflicts with respect to the function λS. G is an SLR(1)-grammar if there are no SLR(1)-inadequate states.
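To illustrate, the SLR(1) check can be sketched in Python on top of LR(0)-items represented as pairs (production, dot); the dictionary follow1, mapping each nonterminal to its follow1-set, and the predicate is_terminal are assumed to be precomputed and are not from the text:

    def slr1_inadequate(q, follow1):
        # Complete items reduce under the follow1-set of their left side.
        complete = [p for (p, d) in q if d == len(p[1])]
        # reduce-reduce-conflict: overlapping follow1-sets of two different
        # complete items
        for i, p in enumerate(complete):
            for r in complete[i + 1:]:
                if p != r and follow1[p[0]] & follow1[r[0]]:
                    return True
        # shift-reduce-conflict: a terminal behind some dot that also occurs
        # in the follow1-set of the left side of a complete item
        shifts = {p[1][d] for (p, d) in q
                  if d < len(p[1]) and is_terminal(p[1][d])}
        return any(shifts & follow1[p[0]] for p in complete)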
Example 3.4.14 We consider again grammar G0 of Example 3.4.1. Its canonical LR(0) automaton LR0(G0) has the inadequate states S1, S2 and S9. We extend the complete items in the states by the follow1-sets of their left sides in order to represent the function λS in a readable way. Since follow1(S) = {#} and follow1(E) = {#, +, )}, we obtain:

S1′′ = { [S → E., {#}],	conflict eliminated,
         [E → E. + T] }	since + ∉ {#}
The LALR(1)-method assigns state-dependent lookahead sets:

λL(q, [X → α.]) = {a ∈ VT ∪ {#} | S′# =⇒*rm βXaw with ∆G(q0, βα) = q}

Here, q0 is the initial state, and ∆G is the transition function of the canonical LR(0) automaton LR0(G). In λL(q, [X → α.]) only those terminal symbols are contained that can follow X in a right sentential form βXaw such that βα drives the canonical LR(0) automaton into the state q. We call a state q of the canonical LR(0) automaton LALR(1)-inadequate if it contains conflicts with respect to the function λL. The grammar G is an LALR(1)-grammar if the canonical LR(0) automaton has no LALR(1)-inadequate states.
There always exists an LALR(1) parser for an LALR(1)-grammar. The definition of the function λL, however, is not constructive, since sets of right sentential forms appear in it that are in general infinite. The sets λL(q, [A → α.β]) can be characterized as the least solution of the following system of equations:

λL(q0, [S′ → .S]) ⊇ {#}
λL(q, [A → αX.β]) ⊇ λL(p, [A → α.Xβ])	if ∆G(p, X) = q
λL(q, [A → .α]) ⊇ first1(β) ⊙1 λL(q, [X → γ.Aβ])	if [X → γ.Aβ] ∈ q
The system of equations describes how sets of successor symbols of items in states originate. The first
equation says that only # can follow the start symbol S ′ . The second class of equations describes that
the follow symbols of an item [A → αX.β] in a state q result from the follow symbols after the dot in
an item [A → α.Xβ] in states p from which one can reach q by reading X. The third class of equations
formalizes that the follow symbols of an item [A → .α] in a state q result from the follow symbols of
occurrences of A in items in q after the dot, that is, from sets first1 (β) ⊙1 λL (q, [X → γ.Aβ]) for items
[X → γ.Aβ] in q.
The system of equations for the sets λL(q, [A → α.β]) over the finite subset lattice 2^{VT ∪ {#}} can be solved by the iterative method for the computation of least solutions. Considering which nonterminals may produce ε allows us to replace the occurrences of 1-concatenation by unions. We thus obtain an equivalent pure union problem that can be solved by the efficient method of Section 3.2.7.
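A naive round-robin iteration for this system can be sketched in Python as follows; states are assumed to be frozensets of LR(0)-items (production, dot), and delta, productions (right sides per nonterminal), first1_concat (computing first1(β) ⊙1 L) and start_item are assumed helpers, not from the text:

    def compute_lambda_L(states, delta, nonterminals, productions,
                         first1_concat, q0, start_item):
        # lam maps (state, item) to the current approximation of its lookahead set
        lam = {(q, it): set() for q in states for it in q}
        lam[(q0, start_item)] |= {'#'}      # first equation: only # follows S'
        changed = True
        while changed:                      # iterate up to the least fixpoint
            changed = False
            for p in states:
                for ((A, rhs), d) in p:
                    if d == len(rhs):
                        continue
                    Y = rhs[d]
                    L = lam[(p, ((A, rhs), d))]
                    # second class: push L along the transition under Y
                    q = delta[(p, Y)]
                    tgt = lam[(q, ((A, rhs), d + 1))]
                    if not L <= tgt:
                        tgt |= L; changed = True
                    # third class: seed the closure items [Y -> .alpha] in p
                    if Y in nonterminals:
                        F = first1_concat(rhs[d + 1:], L)
                        for alpha in productions[Y]:
                            tgt = lam[(p, ((Y, alpha), 0))]
                            if not F <= tgt:
                                tgt |= F; changed = True
        return lam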
LALR(1) parsers can be constructed in the following, not very efficient way: One constructs a canonical LR(1) parser. If its states have no conflicts, states p and q are merged into a new state p′ whenever the cores of the items in p are the same as the cores of the items in q, that is, whenever the two sets of items differ only in their lookahead sets. The lookahead sets in the new state p′ are obtained as the union of the lookahead sets of items with the same core. The grammar is an LALR(1)-grammar if the new states have no conflicts.
A further possibility consists in a modification of Algorithm LR(1)-GEN. The conditional statement

if q′ not in Q then Q := Q ∪ {q′} fi;

is replaced by

if there exists q′′ in Q with samecore(q′, q′′) then merge(Q, q′, q′′) fi;

where

function samecore(p, p′ : set of item) : bool;
	if set of cores of p = set of cores of p′
	then return(true)
	else return(false)
	fi;
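With LR(1)-items again as illustrative triples (production, dot, lookahead), the two operations can be sketched in Python as follows; redirecting the transitions of the partially built automaton is omitted from the sketch:

    def cores(p):
        # The core of an LR(1)-item is the underlying LR(0)-item.
        return {(prod, d) for (prod, d, x) in p}

    def samecore(p1, p2):
        return cores(p1) == cores(p2)

    def merge(Q, q1, q2):
        # Merging unites the two item sets; items with the same core
        # thereby get the union of their lookahead sets.
        Q.discard(q1); Q.discard(q2)
        Q.add(frozenset(q1 | q2))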
Example 3.4.15 The following grammar, taken from [ASU86], describes a simplified version of the C assignment statement:

S′ → S
S → L = R | R
L → ∗R | Id
R → L

This grammar is not an SLR(1)-grammar, but it is an LALR(1)-grammar. The states of the canonical LR(0) automaton are given by:
S0 = { [S′ → .S], [S → .L = R], [S → .R], [L → .∗R], [L → .Id], [R → .L] }
S1 = { [S′ → S.] }
S2 = { [S → L. = R], [R → L.] }
S3 = { [S → R.] }
S4 = { [L → ∗.R], [R → .L], [L → .∗R], [L → .Id] }
S5 = { [L → Id.] }
S6 = { [S → L = .R], [R → .L], [L → .∗R], [L → .Id] }
S7 = { [L → ∗R.] }
S8 = { [R → L.] }
S9 = { [S → L = R.] }
State S2 is the only LR(0)-inadequate state. We have follow1(R) = {#, =}. This lookahead set for the item [R → L.] does not suffice to resolve the shift-reduce-conflict in S2, since the next input symbol = is contained in it. Therefore, the grammar is not an SLR(1)-grammar.
The grammar is, however, an LALR(1)-grammar. The transition diagram of its LALR(1) parser is shown in Fig. 3.18. To increase readability, the lookahead sets λL(q, [A → α.β]) are directly associated with the item [A → α.β] of state q. In state S2, the item [R → L.] now has the lookahead set {#}. The conflict is resolved since this set does not contain the next input symbol =. ⊓⊔