
Reinhard Wilhelm, Helmut Seidl, Sebastian Hack

Compiler Design

Syntactic and Semantic Analysis

November 11, 2011

Springer
Contents

1 The Structure of Compilers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1


1.1 Subtasks of compilation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.2 Lexical Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.3 The Screener . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.4 Syntactic Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.5 Semantic Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.6 Machine-Independent Optimization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.7 Memory Allocation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
1.8 Generation of the Target Program . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
1.9 Specification and Generation of Compiler Components . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
1.10 Literature . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9

2 Lexical Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
2.1 The Task of Lexical Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
2.2 Regular Expressions and Finite-State Machines . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
2.2.1 Words and Languages . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
2.3 Language for the Specification of Lexical Analyzers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
2.3.1 Character classes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
2.3.2 Non-recursive Parentheses . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
2.4 Scanner Generation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
2.4.1 Character Classes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
2.4.2 An Implementation of the until-Construct . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
2.4.3 Sequences of regular expressions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
2.4.4 The Implementation of a Scanner . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
2.5 The Screener . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
2.5.1 Scanner States . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
2.5.2 Recognizing Reserved Words . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
2.6 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
2.7 Literature . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34

3 Syntactic Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
3.1 The Task of Syntactic Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
3.2 Foundations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
3.2.1 Context-free Grammars . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
3.2.2 Productivity and Reachability of Nonterminals . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
3.2.3 Pushdown Automata . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
3.2.4 The Item-Pushdown Automaton to a Context-Free Grammar . . . . . . . . . . . . . . . . 47
3.2.5 first- and follow-Sets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
3.2.6 The Special Case first₁ and follow₁ . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
3.2.7 Pure Union Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58

3.3 Top-down-Syntax Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60


3.3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
3.3.2 LL(k): Definition, Examples, and Properties . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
3.3.3 Left Recursion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
3.3.4 Strong LL(k) Parsers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
3.3.5 LL Parsers for Right-regular Context-free Grammars . . . . . . . . . . . . . . . . . . . . . . 71
3.4 Bottom-up Syntax Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79
3.4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79
3.4.2 LR(k) Parsers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80
3.4.3 LR(k): Definition, Properties, and Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89
3.4.4 Error Handling in LR Parsers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98
3.5 Literature . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 104
3.6 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 104

4 Semantic Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109


4.1 The Task of Semantic Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109
4.1.1 Scope and Visibility Rules . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 113
4.1.2 Checking the Context Conditions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 117
4.1.3 Overloading of Identifiers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 121
4.2 Type Inference . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 124
4.3 Attribute Grammars . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 143
4.3.1 The Semantics of an Attribute Grammar . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 146
4.3.2 Some Attribute Grammars . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 147
4.4 The Generation of Attribute Evaluators . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 153
4.4.1 Demand-Driven Evaluation of Attributes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 153
4.4.2 Static Precomputations for Attribute Evaluators . . . . . . . . . . . . . . . . . . . . . . . . . 154
4.4.3 Visit-Oriented Attribute Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 160
4.4.4 Parser-Directed Attribute Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 164
4.5 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 170
4.6 Literature . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 171


References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 173
1
The Structure of Compilers

Our series of books treats the compilation of higher programming languages into the machine lan-
guages of virtual or real computers. Such compilers are large, complex software systems. Realizing
large and complex software systems is a difficult task. What is so special about compilers that they
can even be implemented as a project accompanying a compiler course? What makes this possible is a
decomposition of the task into subtasks with clearly defined functionalities and clean interfaces between
them. For compilers there is, in fact, a more or less standard conceptual structure composed of
components, each solving a well-defined subtask of the compilation task. The interfaces between the
components are representations of the input program.
The compiler structure described in the following is a conceptual structure, i.e., it identifies the
subtasks of the translation of a source language into a target language and defines interfaces between
the components realizing the subtasks. The concrete architecture of the compiler is then derived from
this conceptual structure. Several components might be combined if the realized subtasks allow this.
But a component may also be split into several components if the realized subtask is very complex.
A first attempt to structure a compiler decomposes it into three components executing three consec-
utive phases:
1. The analysis phase, realized by the Frontend. It determines the syntactic structure of the source
program and checks whether the static semantic constraints are satisfied. The latter contain the
type constraints in languages with static type systems.
2. The optimization and transformation phase, performed by what is often called the Middleend. The
syntactically analysed and semantically checked program is transformed by semantics-preserving
transformations. These transformations mostly aim at improving the efficiency of the program by
reducing the execution time, the memory consumption, or the consumed energy. These transforma-
tions are independent of the target architecture and mostly also independent of the source language.

3. The code generation and the machine-dependent optimization phase, performed by the Backend.
The program is being translated into an equivalent program in the target language. Machine-
dependent optimizations might be performed, which exploit peculiarities of the target architecture.

This coarse structure splits the compiler into a first phase, which depends on the source language, a third
phase, which depends only on the target architecture, and a second phase, which is mostly independent
of both. This structure helps to adapt compiler components to new source languages and to new target
architectures.
The following sections present these phases in more detail, decompose them further, and show
them working on a small running example. This book describes the analysis phase of the compiler.
The transformation phase is presented in much detail in the volume Analysis and Transformation.
The volume Code Generation and Machine-oriented Optimization covers code generation for a target
machine.

1.1 Subtasks of compilation


Fig. 1.1 shows a conceptual compiler structure. Compilation is decomposed into a sequence of phases.
Since this volume is concerned with the analysis phase, that phase is further split into subtasks.
Each component realizing such a subtask receives a representation of the program as input and delivers
another representation as output. The format of the output representation may be different, e.g. when
translating a symbol sequence into a tree, or it may be the same. In the latter case, the representation
will in general be augmented with newly computed information. The subtasks are represented by boxes
labeled with the name of the subtask and maybe with the name of the module realizing this subtask.
We now walk through the sequence of subtasks step by step, characterize their job, and describe the
change in program representation. As a running example we consider the following program fragment:

int a, b;
a = 42;
b = a * a - 7;

where '=' denotes the assignment operator.

[Figure: the analysis phase transforms the source program as a character sequence via lexical analysis (scanner) into a symbol sequence, via screening (screener) into a decorated symbol sequence, via syntactic analysis (parser) into a syntax tree, and via semantic analysis into a decorated syntax tree; the synthesis phases optimization and code generation then produce the target program.]
Fig. 1.1. Structure of a compiler together with the program representations during the analysis phase.

1.2 Lexical Analysis


The component performing lexical analysis of source programs is often called the scanner. This
component reads the source program, represented as a sequence of characters, mostly from a file. It
decomposes this sequence of characters into a sequence of lexical units of the programming language.
These lexical units are called symbols. Typical lexical units are keywords such as if, else, while or
switch and special characters and character combinations such as =, ==, !=, <=, >=, <, >, (, ), [, ], {, }
or comma and semicolon. These need to be recognized and converted into corresponding internal
representations. The same holds for reserved identifiers such as names of basic types int, float, double,
char, bool or string, etc. Further symbols are identifiers and constants. Examples for identifiers are value42, abc,

Myclass, x, while the character sequences 42, 3.14159 and "HelloWorld!" represent constants. Something
special to note is that there are, in principle, arbitrarily many such symbols. However, they can be
categorized into finitely many classes. A symbol class consists of symbols that are equivalent as far as
the syntactic structure of programs is concerned. Identifiers are an example of such a class. Within this
class, there may be subclasses such as type constructors in OCaml or variables in Prolog, which are
written in capital letters. In the class of constants, int-constants can be distinguished from floating-point
constants and string-constants.
The symbols we have considered so far bear semantic interpretations and need, therefore, to be
considered in code generation. However, there are symbols without semantics. Two symbols need a
separator between them if their concatenation would also form a symbol. Such a separator can be a
blank, a newline, an indentation, or a sequence of such characters. Such so-called white space can also
be inserted into a program to make the structure of the program visible.
Another type of symbol, without meaning for the compiler but helpful for the human reader, is the
comment; comments can also be used by software development tools. A similar type of symbols are compiler
directives (pragmas). Such directives may tell the compiler to include particular libraries or influence
the memory management for the program to be compiled.
The sequence of symbols for the example program might look as follows:

Int("int") Sep(" ") Id("a") Com(",") Sep(" ") Id("b") Sem(";") Sep("\n")
Id("a") Bec("=") Intconst("42") Sem(";") Sep("\n")
Id("b") Bec("=") Id("a") Mop("*") Id("a") Aop("-") Intconst("7") Sem(";") Sep("\n")

To increase readability, the sequence was broken into lines according to the original program structure.
Each symbol is represented with its symbol class and the substring representing it in the program. More
information may be added such as the position of the string in the input.
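
The decomposition into symbols can be made concrete by a small Python sketch. This is our own illustration, not the book's generator-based construction of Chapter 2: each symbol class of the example is described by a regular expression, and the scanner repeatedly matches at the current position. The concrete regular expressions are simplifying assumptions for the running example only.

import re

# A minimal scanner core: each symbol class is matched by a regular
# expression; the alternatives are tried in the given order at the
# current input position.
TOKEN_SPEC = [
    ("Int",      r"int\b"),                  # reserved type name
    ("Intconst", r"[0-9]+"),                 # integer constants
    ("Id",       r"[A-Za-z_][A-Za-z0-9_]*"), # identifiers
    ("Bec",      r"="),                      # assignment operator
    ("Mop",      r"\*"),                     # multiplicative operator
    ("Aop",      r"-|\+"),                   # additive operator
    ("Com",      r","),
    ("Sem",      r";"),
    ("Sep",      r"[ \t\n]+"),               # white space as separator symbols
]
SCANNER = re.compile("|".join(f"(?P<{n}>{p})" for n, p in TOKEN_SPEC))

def scan(program):
    """Yield (symbol class, substring) pairs for the input program."""
    pos = 0
    while pos < len(program):
        m = SCANNER.match(program, pos)
        if m is None:
            raise ValueError(f"lexical error at position {pos}")
        yield m.lastgroup, m.group()
        pos = m.end()

print(list(scan("int a, b;\na = 42;\nb = a * a - 7;\n")))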

1.3 The Screener


The scanner delivers a sequence of symbols to the screener. These are substrings of the program text
labeled with their symbol classes. It is the task of the screener to further process this sequence. Some
symbols it will eliminate since they have served their purpose as separators. Others it will transform into
a different representation. More precisely, it will perform the following actions, specific to different
symbol classes:
Reserved symbols: These are typically identifiers, but have a special meaning in the programming
language, e.g., begin, end, var, int, etc.
Separators and comments: Sequences of blanks and newlines serve as separators between symbols.
They are not needed for further processing of the program and can therefore be removed. An
exception to this rule are some functional languages, e.g. Haskell, where indentation is used to
express program nesting. Comments will also not be needed later and can be removed.
Pragmas: Compiler directives (pragmas) are not part of the program. They are passed on separately
to the compiler.
Other types of symbols are typically preserved, but their textual representation may be converted into
some more efficient internal representation.
Constants: The sequence of digits representing a number constant is converted to a binary
representation. String-constants are stored in an allocated object. In Java implementations, these
objects are stored in a dedicated data structure, the String Pool. The String Pool is available to the
program at run-time.
Identifiers: Compilers usually do not work with identifiers represented as string objects. This
representation would be too inefficient. Rather, identifiers are coded as unique numbers. The compiler
needs to be able to access the external representation of identifiers, though. For this purpose, the
identifiers are kept in a data structure that can be efficiently addressed by their codes.
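
The coding of identifiers can be sketched as follows in Python; the class name and its interface are our own assumptions, chosen only to illustrate the two directions of the mapping.

# A sketch of the screener's identifier coding: each identifier string is
# mapped to a unique number; the table allows recovering the external
# representation from the code.
class IdentifierTable:
    def __init__(self):
        self._code = {}    # string -> unique number
        self._name = []    # unique number -> string

    def intern(self, name):
        if name not in self._code:
            self._code[name] = len(self._name)
            self._name.append(name)
        return self._code[name]

    def external(self, code):
        return self._name[code]

ids = IdentifierTable()
assert ids.intern("a") == 0
assert ids.intern("b") == 1
assert ids.intern("a") == 0          # same identifier, same code
assert ids.external(1) == "b"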
3
Syntactic Analysis

3.1 The Task of Syntactic Analysis


The parser realizes the syntactic analysis of programs. Its input is a sequence of symbols as produced
by the combination of scanner and screener. It is its job to identify the syntactic structure in this sequence
of symbols, that is, the composition of syntactic units from other units.
Syntactic units in imperative languages are variables, expressions, declarations, statements and
sequences of statements. Functional languages have variables, expressions, patterns, definitions and
declarations. Logic languages such as Prolog have variables, terms, goals, and clauses.
The parser represents the syntactic structure of the input program in a data structure that allows
the subsequent phases of the compiler to access the individual program components. One possible
representation is the parse tree. The parse tree may later be decorated with more information about the
program. Transformation may be applied to it, and code for a target machine can be generated from it.
For some languages, the compilation task is so simple that programs can be translated in one pass
over the program text. In this case, the parser can avoid the construction of the intermediate represen-
tation. The parser then acts as the main function, calling routines for semantic analysis and for code generation.
Many programs that are presented to a compiler contain errors, many of them syntax errors. Syntax
errors consist in violations of the rules for forming valid programs. The compiler is expected to ade-
quately react to errors. It should at least attempt to locate the error precisely. However, often only the
localization of the error symptom is possible, not the localization of the error itself. The error symptom
is the position where no continuation of the syntactic analysis is possible. The compiler should not give
up after the first error found, but continue to analyze the rest of the program and maybe detect more
errors.
The syntactic structure of the programs written in some programming language can be described by
a context-free grammar. There exist methods to automatically generate a parser from such a description.
For efficiency and unambiguity reasons, parsing methods are often restricted to deterministically ana-
lyzable context-free languages. For these, several automatic methods for parser generation exist. The
parsing methods used in practice fall into two categories, top-down- and bottom-up-parsing methods.
Both read the input from left to right. The differences in the way they work can be best made clear by
regarding how they construct parse trees.
Top-down parsers start the syntactic analysis of a given program and the construction of the parse
tree with the start symbol of the grammar and with the root of the parse tree labelled with that symbol.
Top-down parsers are called predictive parsers since they make predictions about what they expect to find
next in the input. They then attempt to verify the prediction by comparing it with the remaining input.
The first prediction is the start symbol of the grammar. It says that the parser expects to find a word for
the start symbol. Let us assume that a prefix of the prediction is already confirmed. Then there are two
cases:
• The non-confirmed part of the prediction starts with a nonterminal. The top-down parser will then
refine its prediction by selecting one of the alternatives of this nonterminal.

• The non-confirmed part of the prediction starts with a terminal symbol. The top-down parser will
then compare this with the next input symbol. If they agree it means that another symbol of the
prediction is confirmed. Otherwise, the parser has detected an error.
The top-down parser terminates successfully when the whole input has been predicted and confirmed.
Bottom-up parsers start the syntactic analysis of a given program and the construction of the parse
tree with the input, that is, the given program. They attempt to discover the syntactic structure of longer
and longer prefixes of the input program. To do this they attempt to replace occurrences of right sides
of productions by their left-side nonterminals. Such a replacement is called a reduction. If the parser
cannot perform a reduction, it does a shift, that is, it reads the next input symbol. These are the only
two actions a bottom-up parser can perform. It is, therefore, called a shift-reduce parser. The analysis
terminates successfully when the parser has reduced the input program by a sequence of shift and
reduce steps to the start symbol of the grammar.
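
The two actions can be made concrete by replaying a fixed action sequence in Python. The toy grammar E → E + E | Id and the action sequence are our own assumptions for illustration; a real shift-reduce parser computes the actions itself (see Section 3.4).

# 'shift' reads the next input symbol onto the stack; ("reduce", A, k)
# replaces the topmost k stack symbols (a production's right side) by the
# left-side nonterminal A.
def run(actions, tokens):
    stack, rest = [], list(tokens)
    for act in actions:
        if act == "shift":
            stack.append(rest.pop(0))
        else:
            _, lhs, k = act              # reduce: pop right side, push left side
            stack[len(stack) - k:] = [lhs]
        print(stack, rest)
    return stack == ["E"] and not rest

ok = run(
    ["shift", ("reduce", "E", 1),        # Id        => E
     "shift", "shift",                   # read '+' and the second Id
     ("reduce", "E", 1),                 # Id        => E
     ("reduce", "E", 3)],                # E + E     => E
    ["Id", "+", "Id"])
assert ok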

The Treatment of Syntax Errors

Most programs that are submitted to a compiler are erroneous. Many contain syntax errors. The
compiler should, therefore, treat the normal case, namely the erroneous program, adequately. Lexical errors
are rather local. Syntax errors, for instance in the parenthesis structure of a program, are often difficult
to diagnose. This chapter covers required and possible reactions to syntax errors by the parser. There
are essentially four different types of reaction to syntax errors:
1. The error is localized and reported;
2. The error is diagnosed;
3. The error is corrected;
4. The parser gets back into a state in which it can possibly detect further errors.
The first alternative is absolutely required. Later stages of the compiler assume that they are only given
syntactically correct programs in the form of syntax trees. The programmer needs to be informed about
syntax errors in his programs. There exist, however, two significant problems: Firstly, further syntax
errors can remain undetected in the vicinity of a detected error. Second, the parser detects an error when
it has no continuation out of its current configuration under the next input symbol. This is, in general,
only the error symptom, not the error itself.
Example 3.1.1 Consider the following erroneous assignment statement:

a = a ∗ (b + c ∗ d ;

Error symptom: ')' is missing.

There are several potential errors: either there is an extra opening parenthesis, or a closing parenthesis is
missing after c or after d. These three corrections lead to programs with different meanings. ⊓⊔
For errors of extra or missing parentheses such as {, }, begin, end, if, etc., the position of the
error and the position of the error symptom can be far apart. The practically relevant parsing methods,
LL(k)- and LR(k)-parsing, presented in the following sections, have the viable-prefix property:
When the parser for a context-free grammar G has analyzed the prefix u of a word without announcing
an error, then there exists a word w such that uw is a word of G.
Parsers possessing this property report errors and error symptoms at the earliest possible time. We have
said above that, in general, the parser will only discover an error symptom, not the error itself. Still,
we will speak of errors in the following. In this sense, the discussed parsers perform the first two listed
actions: they report and try to diagnose errors.
Example 3.1.1 shows that the second action is not easily done.
The parser can attempt a diagnosis of the error symptom. It should at least provide the following
information:

• the position of the error in the program,


• a description of the parser configuration, i.e., the current state, the expected symbol, and the found
symbol.
For the third listed action, the correction of an error, the parser would need to know the intention of
the programmer. This is, in general, impossible. Somewhat more realistic is the search for a globally
optimal error correction. To realize this, the parser is given the capability to insert or delete symbols in
the input word. The globally optimal error correction for an erroneous input word w is a word w′ that
is obtained from w by a minimal number of such insertions and deletions. Such methods have been
proposed in the literature, but have not been used in practice due to the necessary effort.
Instead, most parsers do only local corrections to have the parser move from the error configuration
to a new configuration in which it can at least read the next input symbol. This prevents the parser from
going into an endless loop while trying to repair an error.

The Structure of this Chapter

Section 3.2 presents the theoretical foundations of syntax analysis, context-free grammars and their
notion of derivation and pushdown automata, their acceptors. A special non-deterministic pushdown
automaton for a context-free grammar is introduced that recognizes the language defined by the gram-
mar. Deterministic top-down and bottom-up parsers for the grammar are derived from this pushdown
automaton.
Sections 3.3 and 3.4 describe top-down- and bottom-up syntax analysis. The corresponding gram-
mar classes are characterized and parser-generation methods are presented. Error handling for both
top-down and bottom-up parsers is described in detail.

3.2 Foundations
We have seen that lexical analysis is specified by regular expressions and implemented by finite-state
machines. We will now see that syntax analysis is specified by context-free grammars and implemented
by pushdown automata.
Regular expressions are not sufficient to describe the syntax of programming languages since they
cannot express nested recursion as it occurs in the nesting of expressions, statements, and blocks.
In Sections 3.2.1 and 3.2.3, we introduce the needed notions about context-free grammars and
pushdown automata. Readers familiar with these notions can skip them and go directly to Section
3.2.4. In Section 3.2.4, a pushdown automaton is introduced for a context-free grammar that accepts
the language defined by that grammar.

3.2.1 Context-free Grammars

Context-free grammars can be used to describe the syntactic structure of programs of a programming
language. The grammar describes what the elementary components of programs are and how pieces of
programs can be composed to form bigger pieces.
Example 3.2.1 A section of a grammar describing a C-like programming language might look as
follows:

⟨stat⟩ → ⟨if_stat⟩ |
        ⟨while_stat⟩ |
        ⟨do_while_stat⟩ |
        ⟨exp⟩ ; |
        ; |
        { ⟨stats⟩ }
⟨if_stat⟩ → if ( ⟨exp⟩ ) ⟨stat⟩ else ⟨stat⟩ |
        if ( ⟨exp⟩ ) ⟨stat⟩
⟨while_stat⟩ → while ( ⟨exp⟩ ) ⟨stat⟩
⟨do_while_stat⟩ → do ⟨stat⟩ while ( ⟨exp⟩ ) ;
⟨exp⟩ → ⟨assign⟩ |
        ⟨call⟩ |
        Id |
        ...
⟨call⟩ → Id ( ⟨exps⟩ ) |
        ⟨exp⟩ ( )
⟨assign⟩ → Id '=' ⟨exp⟩
⟨stats⟩ → ⟨stat⟩ |
        ⟨stats⟩ ⟨stat⟩
⟨exps⟩ → ⟨exp⟩ |
        ⟨exps⟩ , ⟨exp⟩

The nonterminal symbol ⟨stat⟩ generates statements. We use the meta-character | to combine
several alternatives for one nonterminal.
According to this section of a grammar, a statement is either an if-statement, a while-statement,
a do-while-statement, an expression followed by a semicolon, an empty statement, or a sequence of
statements in braces.
The grammar describes if-statements in which the else-part may be missing. They always start with
the keyword if, followed by an expression in parentheses, and a statement. This statement may be
followed by the keyword else and another statement. Further productions describe how while- and
do-while-statements and expressions are constructed. For expressions, only some possible alternatives
are explicitly given. Other alternatives are indicated by '...'. ⊓⊔

Formally, a context-free grammar is a quadruple G = (VN , VT , P, S), where VN , VT are disjoint
alphabets, VN is the set of nonterminals, VT is the set of terminals, P ⊆ VN × (VN ∪ VT )∗ is the finite
set of production rules, and S ∈ VN is the start symbol.
Terminal symbols (in short: terminals) are the symbols from which programs are built. While we
spoke of alphabets of characters in the section on lexical analysis, typically ASCII or Unicode
characters, we now speak of alphabets of symbols as they are returned from the scanner or the screener.
Such symbols are reserved keywords of the language, identifiers, or symbol classes comprising sets of
symbols.
The nonterminals of the grammar stand for sets of words that can be generated from them according
to the production rules of the grammar. In the example grammar 3.2.1, they are enclosed in angle brack-
ets. A production rule (in short: production) (A, α) in the relation P describes possible replacements:
an occurrence of the left side A in a word β = γ1 Aγ2 can be replaced by the right side α ∈ (VT ∪VN )∗ .
In the view of a top-down parser, a new word β ′ = γ1 αγ2 is produced or derived from the word β.
A bottom-up parser interprets the production (A, α) as a replacement of the right side α by the left
side A. Applying the production to a word β ′ = γ1 αγ2 reduces this to the word β = γ1 Aγ2 .
We introduce some conventions to talk about context-free grammars G = (VN , VT , P, S). Capital
latin letters from the beginning of the alphabet, e.g. A, B, C are used to denote nonterminals from VN ;
capital latin letters from the end of the alphabet, e.g. X, Y, Z denote terminals or nonterminals. Small
latin letters from the beginning of the alphabet, e.g. a, b, c, . . ., stand for terminals from VT ; small latin

letters from the end of the alphabet, like u, v, w, x, y, z, stand for terminal words, that is, elements from
VT∗ ; small greek letters such as α, β, γ, ϕ, ψ stand for words from (VT ∪ VN )∗ .
The relation P is seen as a set of production rules. Each element (A, α) of this relation is, more
intuitively, written as A → α. All productions A → α1 , A → α2 , . . . , A → αn for a nonterminal A
are combined to

A → α1 | α2 | … | αn.

The α1 , α2 , …, αn are called the alternatives of A.
Example 3.2.2 The two grammars G0 and G1 describe the same language:

G0 = ({E, T, F }, {+, ∗, (, ), Id}, P0 , E) where P0 is given by:


E → E + T | T,
T → T ∗ F | F,
F → (E) | Id
G1 = ({E}, {+, ∗, (, ), Id}, P1 , E) where P1 is given by:
E → E + E | E ∗ E | (E) | Id



We say that a word ϕ directly produces a word ψ according to G, written as ϕ =⇒_G ψ, if ϕ = σAτ
and ψ = σατ hold for some words σ, τ and a production A → α ∈ P. A word ϕ produces a word ψ
according to G, or ψ is derivable from ϕ according to G, written as ϕ =⇒*_G ψ, if there is a finite
sequence ϕ0 , ϕ1 , …, ϕn (n ≥ 0) of words such that

ϕ = ϕ0 , ψ = ϕn and ϕi =⇒_G ϕi+1 for all 0 ≤ i < n.

The sequence ϕ0 , ϕ1 , …, ϕn is called a derivation of ψ from ϕ according to G. A derivation of length
n is written as ϕ =⇒ⁿ_G ψ. The relation =⇒*_G denotes the reflexive and transitive closure of =⇒_G .

Example 3.2.3 The grammars of Example 3.2.2 have, among others, the derivations

E =⇒_G0 E + T =⇒_G0 T + T =⇒_G0 T ∗ F + T =⇒_G0 T ∗ Id + T =⇒_G0 F ∗ Id + T
  =⇒_G0 F ∗ Id + F =⇒_G0 Id ∗ Id + F =⇒_G0 Id ∗ Id + Id,

E =⇒_G1 E + E =⇒_G1 E ∗ E + E =⇒_G1 Id ∗ E + E =⇒_G1 Id ∗ E + Id =⇒_G1 Id ∗ Id + Id.

We conclude from these derivations that E =⇒*_G0 Id ∗ Id + Id holds as well as E =⇒*_G1 Id ∗ Id + Id. ⊓⊔
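
For experimenting with derivations, grammar G0 can be represented as data. The following Python sketch is our own representation, with words as tuples of symbol names; it tests the direct-derivation relation just defined.

# Grammar G0 as a mapping from nonterminals to their alternatives, and a
# test whether phi directly produces psi according to G0, i.e. whether psi
# results from phi by replacing one occurrence of a left side by one of
# its alternatives.
G0 = {
    "E": [("E", "+", "T"), ("T",)],
    "T": [("T", "*", "F"), ("F",)],
    "F": [("(", "E", ")"), ("Id",)],
}

def directly_produces(phi, psi):
    for pos, symbol in enumerate(phi):
        for alpha in G0.get(symbol, []):
            if phi[:pos] + alpha + phi[pos + 1:] == psi:
                return True
    return False

# The first two steps of the G0-derivation above:
assert directly_produces(("E",), ("E", "+", "T"))
assert directly_produces(("E", "+", "T"), ("T", "+", "T"))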

The language defined by a context-free grammar G = (VN , VT , P, S) is the set

L(G) = {u ∈ VT∗ | S =⇒*_G u}.

A word x ∈ L(G) is called a word of G. A word α ∈ (VT ∪ VN )∗ with S =⇒*_G α is called a sentential
form of G.

Example 3.2.4 Let us consider again the grammars of Example 3.2.2. The word Id ∗ Id + Id is a word
of both G0 and G1 , since E =⇒*_G0 Id ∗ Id + Id as well as E =⇒*_G1 Id ∗ Id + Id hold. ⊓⊔

We omit the index G in =⇒_G when the grammar to which derivations refer is clear from the context.
The syntactic structure of a program, as it results from syntactic analysis, is the parse tree, which
is an ordered tree, that is, a tree in which the outgoing edges of each node are ordered. The parse tree
describes a set of derivations of the program according to the underlying grammar. It, therefore, allows

to define the notion of ambiguity and to explain the differences between parsing strategies, see Sections
3.3 and 3.4. Within a compiler, the parse tree serves as the interface to the subsequent compiler phases.
Most approaches to the evaluation of semantic attributes, as they are described in Chapter 4, about
semantic analysis, work on this tree structure.
Let G = (VN , VT , P, S) be a context-free grammar. Let t be an ordered tree whose inner nodes are
labeled with symbols from VN and whose leaves are labeled with symbols from VT ∪ {ε}. t is a parse
tree if the label X of each inner node n of t together with the sequence of labels X1 , . . . , Xk of the
children of n in t has the following properties:
1. X → X1 … Xk is a production from P .
2. If X1 … Xk = ε, then k = 1, that is, node n has exactly one child and this child is labeled with ε.
3. If X1 … Xk ≠ ε, then Xi ≠ ε for each i.
If the root of t is labeled with nonterminal symbol A, and if the concatenation of the leaf labels yields
the terminal word w we call t a parse tree for nonterminal A and word w according to grammar G. If
the root is labeled with S, the start symbol of the grammar, we just call t a parse tree for w.
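
The three properties can be checked mechanically. The following Python sketch uses our own tree representation (a pair of label and child list for inner nodes, a string for leaves, "" for ε) and assumes productions are given as in the G0 sketch above.

# Check the three parse-tree properties for a tree over a production set.
def is_parse_tree(tree, productions):
    if isinstance(tree, str):                 # a leaf: terminal or epsilon
        return True
    label, children = tree
    labels = tuple(c[0] if isinstance(c, tuple) else c for c in children)
    if labels == ("",):                       # property 2: epsilon leaf is an only child
        return () in productions.get(label, [])
    if "" in labels:                          # property 3: no epsilon among siblings
        return False
    if labels not in productions.get(label, []):
        return False                          # property 1: X -> X1...Xk must be a production
    return all(is_parse_tree(c, productions) for c in children)

# A parse tree for Id + Id according to G0:
t = ("E", [("E", [("T", [("F", ["Id"])])]), "+", ("T", [("F", ["Id"])])])
assert is_parse_tree(t, G0)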
Example 3.2.5 Fig. 3.1 shows two parse trees according to grammar G1 of Example 3.2.2 for the word
Id ∗ Id + Id. ⊓⊔

[Tree diagrams: one parse tree groups Id ∗ Id first, the other groups Id + Id first.]

Fig. 3.1. Two syntax trees according to grammar G1 of Example 3.2.2 for the word Id ∗ Id + Id.

A syntax tree can be viewed as a representation of derivations where one abstracts from the order and
the direction, derivation or reduction, in which productions were applied. A word of the language is
called ambiguous if there exists more than one parse tree for it. Correspondingly, the grammar G is
called ambiguous, if L(G) contains at least one ambiguous word. A context-free grammar that is not
ambiguous is called non-ambiguous.
Example 3.2.6 The grammar G1 is ambiguous because the word Id ∗ Id + Id has more than one parse
tree. The grammar G0 , on the other hand, is non-ambiguous. ⊓⊔

The definition implies that each word x ∈ L(G) has at least one derivation from S. To each derivation
for a word x corresponds a parse tree for x. Thus, each word x ∈ L(G) has at least one parse tree.
On the other hand, to each parse tree for a word x corresponds at least one derivation for x. Any such
derivation can be easily read off the parse tree.
Example 3.2.7 The word Id + Id has exactly one parse tree according to grammar G1 , shown in Fig. 3.2. Two
different derivations result depending on the order in which the nonterminals are replaced:

E ⇒ E + E ⇒ Id + E ⇒ Id + Id
E ⇒ E + E ⇒ E + Id ⇒ Id + Id





Fig. 3.2. The uniquely determined parse tree for the word Id + Id.

In Example 3.2.7 we saw that, even with non-ambiguous words, several derivations may correspond
to one parse tree. This results from the different possibilities to choose a nonterminal in a sentential
form for the next application of a production. One can choose essentially two different canonical
replacement strategies, replacing the leftmost nonterminal or the rightmost nonterminal. In each case
one obtains uniquely determined derivations, namely leftmost and rightmost derivations, resp.
A derivation ϕ1 =⇒ … =⇒ ϕn of ϕ = ϕn from S = ϕ1 is a leftmost derivation of ϕ, denoted
as S =⇒*_lm ϕ, if in the derivation step from ϕi to ϕi+1 the leftmost nonterminal of ϕi is replaced,
i.e., ϕi = uAτ, ϕi+1 = uατ for a word u ∈ VT∗ and a production A → α ∈ P.
Similarly, we call a derivation ϕ1 =⇒ … =⇒ ϕn a rightmost derivation of ϕ, denoted by
S =⇒*_rm ϕ, if the rightmost nonterminal in ϕi is replaced, i.e., ϕi = σAu, ϕi+1 = σαu with u ∈ VT∗
and A → α ∈ P.
A sentential form that occurs in a leftmost derivation (rightmost derivation) is called a left sentential
form (right sentential form).
To each parse tree for S there exists exactly one leftmost derivation and exactly one rightmost
derivation. Thus, there is exactly one leftmost and one rightmost derivation for each unambiguous word
in a language.
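
That each parse tree determines exactly one leftmost derivation can also be seen constructively: expanding, in each step, the leftmost unexpanded node of the tree yields exactly that derivation. A Python sketch in the tree representation used above:

# Read off the leftmost derivation from a parse tree: in the current
# sentential form, the first subtree (tuple) is the leftmost nonterminal;
# replacing it by its children applies the production at that node.
def leftmost_derivation(tree):
    sentential = [tree]
    labels = lambda ns: tuple(n[0] if isinstance(n, tuple) else n for n in ns)
    steps = [labels(sentential)]
    while any(isinstance(n, tuple) for n in sentential):
        i = next(k for k, n in enumerate(sentential) if isinstance(n, tuple))
        _, children = sentential[i]
        sentential[i:i + 1] = children
        steps.append(labels(sentential))
    return steps

d = leftmost_derivation(t)               # the tree for Id + Id from above
assert d[0] == ("E",) and d[-1] == ("Id", "+", "Id")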
Example 3.2.8 The word Id ∗ Id + Id has, according to grammar G1 , the leftmost derivations

E =⇒_lm E + E =⇒_lm E ∗ E + E =⇒_lm Id ∗ E + E =⇒_lm Id ∗ Id + E =⇒_lm Id ∗ Id + Id and
E =⇒_lm E ∗ E =⇒_lm Id ∗ E =⇒_lm Id ∗ E + E =⇒_lm Id ∗ Id + E =⇒_lm Id ∗ Id + Id.

It has the rightmost derivations

E =⇒_rm E + E =⇒_rm E + Id =⇒_rm E ∗ E + Id =⇒_rm E ∗ Id + Id =⇒_rm Id ∗ Id + Id and
E =⇒_rm E ∗ E =⇒_rm E ∗ E + E =⇒_rm E ∗ E + Id =⇒_rm E ∗ Id + Id =⇒_rm Id ∗ Id + Id.

The word Id + Id has, according to G1 , only one leftmost derivation, namely

E =⇒_lm E + E =⇒_lm Id + E =⇒_lm Id + Id

and one rightmost derivation, namely

E =⇒_rm E + E =⇒_rm E + Id =⇒_rm Id + Id. ⊓⊔



In an unambiguous grammar, the leftmost and the rightmost derivation of a word consist of the same
productions; the difference is the order of their application. The question is whether one can find sentential
forms in both derivations that correspond to each other in the following way: in both derivations, the
same occurrence of a nonterminal will be replaced in the next step.
The following lemma establishes such a relation.
Lemma 3.1.
1. If S =⇒*_lm uAϕ holds, then there exists a ψ with ψ =⇒* u, such that S =⇒*_rm ψAv holds for
all v with ϕ =⇒* v.
2. If S =⇒*_rm ψAv holds, then there exists a ϕ with ϕ =⇒* v, such that S =⇒*_lm uAϕ holds for
all u with ψ =⇒* u. ⊓⊔

Fig. 3.3 clarifies the relation between ϕ and v on one side and ψ and u on the other side.

[Diagram: a parse tree with an inner node labeled A; ψ stands for the part to the left of A, which derives u, and ϕ for the part to the right of A, which derives v.]
Fig. 3.3. Correspondence between leftmost and rightmost derivation.

Context-free grammars that describe programming languages should be unambiguous. If this is the
case, then there exists exactly one parse tree, one leftmost derivation, and one rightmost derivation for
each syntactically correct program.

3.2.2 Productivity and Reachability of Nonterminals

A context-free grammar might have superfluous nonterminals and productions. Eliminating them re-
duces the size of the grammar, but doesn’t change the language. We will now introduce two properties
of nonterminals that characterize them as useful and present methods to compute the subsets of nonter-
minals that have these properties. Grammars from which all nonterminals not having these properties
are removed will be called reduced. We will later always assume that the grammars we deal with are
reduced.
The first required property of useful nonterminals is productivity. A nonterminal X of a context-free
grammar G = (VN , VT , P, S) is called productive if there exists a derivation X =⇒*_G w for a word
w ∈ VT∗ , or equivalently, if there exists a parse tree whose root is labeled with X.
Example 3.2.9 Consider the grammar G = ({S ′ , S, X, Y, Z}, {a, b}, P, S ′ ), where P consists of the
productions:
S′ → S
S → aXZ | Y
X → bS | aY bY
Y → ba | aZ
Z → aZX
Then Y is productive, and therefore so are X, S and S ′ . The nonterminal Z, on the other hand, is not
productive since the only production for Z contains an occurrence of Z on its right side. ⊓⊔

A two-level characterization of nonterminal productivity leading to an algorithm to compute it is the
following:

(1) X is productive through production p if and only if X is the left side of p, and if all nonterminals
on the right side of p are productive.
(2) X is productive if X is productive through at least one of its alternatives.
In particular, X is thereby productive if there exists a production X → u ∈ P whose right side u has no
nonterminal occurrences, that is, u ∈ VT∗ . Property (1) describes the dependence of the information for
X on the information about symbols on the right side of the production for X; property (2) indicates
how to combine the information obtained from the different alternatives for X.
We describe a method that computes for a context-free grammar G the set of all productive
nonterminals. The method uses for each production p a counter count[p], which counts the number of
occurrences of nonterminals whose productivity is not yet known. When the counter of a production p
has decreased to 0, all nonterminals on its right side must be productive. Therefore, the left side of
p is also productive through p. To manage the nonterminals newly recognized as productive, the algorithm
uses a worklist W .
Further, for each nonterminal X a list occ[X] of occurrences of this nonterminal in right sides of
productions is maintained:

set⟨nonterminal⟩ productive ← ∅;           // result set
int count[production];                     // counter for each production
list⟨nonterminal⟩ W ← [ ];                 // worklist
list⟨production⟩ occ[nonterminal];         // occurrences in right sides

forall (nonterminal X) occ[X] ← [ ];       // initialization
forall (production p) {
        count[p] ← 0;
        init(p);
}
...
...

The call init(p) of the routine init() for a production p, whose code we do not show, iterates over the
sequence of symbols on the right side of p. At each occurrence of a nonterminal X the counter count[p]
is incremented, and p is added to the list occ[X]. If at the end count[p] = 0 still holds, then init(p) enters
the left side of production p into the list W . This concludes the initialization.
The main iteration processes the productions in W one by one. For each production p in W , the
left side is productive through p and therefore productive. When, on the other hand, a nonterminal X is
newly discovered as productive, the algorithm iterates through the list occ[X] of those productions in
which X occurs. The counter count[r] is decremented for each production r in this list. The described
method is realized by the following algorithm:

...
while (W ≠ [ ]) {
        X ← hd(W ); W ← tl(W );
        if (X ∉ productive) {
                productive ← productive ∪ {X};
                forall ((r : A → α) ∈ occ[X]) {
                        count[r]−−;
                        if (count[r] = 0) W ← A :: W ;
                }        // end of forall
        }                // end of if
}                        // end of while

Let us derive the run time of this algorithm. The initialization phase essentially runs once over the
grammar and does a constant amount of work for each symbol. The main iteration through the worklist
44 3 Syntactic Analysis

enters the left side of each production once into the list W and so removes it at most once from the list.
At the removal of a nonterminal X from W , more than a constant amount of work has to be done only
when X has not yet been marked as productive. The effort for such an X is proportional to the length of
the list occ[X]. The sum of these lengths is bounded by the overall size of the grammar G. This means
that the total effort is linear in the size of the grammar.
To show the correctness of the procedure, we ascertain that it possesses the following properties:
• If X is entered into the set productive in the j-th iteration of the while-loop, then there exists a parse
tree for X of height at most j − 1.
• For each parse tree, the label of its root is eventually entered into W .
The efficient algorithm just presented has relevance beyond its application in compiler construction.
It can be used with small modifications to compute least solutions of Boolean systems of equations,
that is of systems of equations, in which the right sides are disjunctions of arbitrary conjunctions of
unknowns. In our example, the conjunctions stem from the right sides while a disjunction represents
the existence of different alternatives for a nonterminal.
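
For concreteness, here is a runnable Python version of the algorithm above; it is a sketch under the assumption that productions are given as (left side, right side) pairs, with right sides as tuples of symbols.

def productive_nonterminals(productions):
    nonterminals = {A for A, _ in productions}
    count = []                       # per production: not-yet-productive occurrences
    occ = {A: [] for A in nonterminals}
    W = []
    for i, (A, alpha) in enumerate(productions):   # initialization, i.e. init(p)
        count.append(sum(1 for X in alpha if X in nonterminals))
        for X in alpha:
            if X in nonterminals:
                occ[X].append(i)
        if count[i] == 0:
            W.append(A)              # A is productive through this production
    productive = set()
    while W:                         # main iteration over the worklist
        X = W.pop()
        if X not in productive:
            productive.add(X)
            for i in occ[X]:
                count[i] -= 1
                if count[i] == 0:
                    W.append(productions[i][0])
    return productive

# The grammar of Example 3.2.9:
P = [("S'", ("S",)), ("S", ("a", "X", "Z")), ("S", ("Y",)),
     ("X", ("b", "S")), ("X", ("a", "Y", "b", "Y")),
     ("Y", ("b", "a")), ("Y", ("a", "Z")), ("Z", ("a", "Z", "X"))]
assert productive_nonterminals(P) == {"S'", "S", "X", "Y"}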
The second property of a useful nonterminal is its reachability. We call a nonterminal X reachable in
a context-free grammar G = (VN , VT , P, S) if there exists a derivation S =⇒*_G αXβ.

Example 3.2.10 Consider the grammar G = ({S, U, V, X, Y, Z}, {a, b, c, d}, P, S), where P consists
of the following productions:

S → Y                     X → c
Y → Y Z | Y a | b         V → V d | d
U → V                     Z → ZX

The nonterminals S, Y, Z and X are reachable, while U and V are not. ⊓⊔



Reachability can also be characterized in a two-level definition that leads to an algorithm:
(1) If a nonterminal X is reachable and X → α ∈ P , then each nonterminal occurring in the right side
α is reachable through this occurrence.
(2) A nonterminal is reachable if it is reachable through at least one of its occurrences.
(3) The start symbol S is always reachable.
Let rhs[X] for a nonterminal X be the set of all nonterminals that occur in the right side of productions
with left side X. These sets can be computed in linear time. The set reachable of reachable nonterminals
of a grammar can be computed by:

set⟨nonterminal⟩ reachable ← ∅;
list⟨nonterminal⟩ W ← S :: [ ];
nonterminal Y ;
while (W ≠ [ ]) {
        X ← hd(W ); W ← tl(W );
        if (X ∉ reachable) {
                reachable ← reachable ∪ {X};
                forall (Y ∈ rhs[X]) W ← Y :: W ;
        }
}

To reduce a grammar G, first all non-productive nonterminals are removed from the grammar together
with all productions in which they occur. Only in a second step are the non-reachable nonterminals
eliminated, also together with the productions in which they occur. This second step is, therefore, based
on the assumption that all remaining nonterminals are productive.
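
The two-step reduction can be sketched in Python on top of the function productive_nonterminals from above; the representation of productions is the same assumption as there.

def reduce_grammar(productions, start):
    productive = productive_nonterminals(productions)
    nonterminals = {A for A, _ in productions}
    # Step 1: keep only productions over productive symbols.
    P1 = [(A, alpha) for A, alpha in productions
          if A in productive
          and all(X in productive for X in alpha if X in nonterminals)]
    # Step 2: restrict to the nonterminals reachable from the start symbol.
    reachable, W = set(), [start]
    while W:
        X = W.pop()
        if X not in reachable:
            reachable.add(X)
            for A, alpha in P1:
                if A == X:
                    W.extend(Y for Y in alpha if Y in nonterminals)
    return [(A, alpha) for A, alpha in P1 if A in reachable]

# Applied to Example 3.2.9, only S' -> S, S -> Y, Y -> ba remain:
assert reduce_grammar(P, "S'") == [("S'", ("S",)), ("S", ("Y",)), ("Y", ("b", "a"))]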

Example 3.2.11 Let us consider again the grammar of Example 3.2.9 with the productions

S′ → S
S → aXZ | Y
X → bS | aY bY
Y → ba | aZ
Z → aZX

The set of productive nonterminals is {S ′ , S, X, Y }, while Z is not productive. To reduce the grammar,
a first step removes all productions in which Z occurs. The resulting set is P1 :

S′ → S
S → Y
X → bS | aY bY
Y → ba

Although X was reachable according to the original set of productions, X is no longer reachable after
the first step. The set of reachable nonterminals is VN′ = {S ′ , S, Y }. By removing all productions whose
left side is no longer reachable, the following set is obtained:

S′ → S
S → Y
Y → ba



We assume in the following that grammars are always reduced.

3.2.3 Pushdown Automata

This section treats the automata model corresponding to context-free grammars, pushdown automata.
We need to describe how to realize a compiler component that performs syntax analysis according to
a given context-free grammar. Section 3.2.4 describes such a method. The pushdown automaton con-
structed for a context-free grammar, however, has a problem: it is non-deterministic for most grammars.
In Sections 3.3 and 3.4 we describe how for appropriate subclasses of context-free grammars the thus
constructed pushdown automaton can be modified to become deterministic.
In contrast to the finite-state machines of the preceding chapter, a pushdown automaton has an
unlimited storage capacity. It has a (conceptually) unbounded data structure, the stack, which works
according to a last-in, first-out principle. Fig. 3.4 shows a schematic picture of a pushdown automaton.
The reading head is only allowed to move from left to right, as was the case with finite-state machines.
In contrast to finite-state machines, transitions of the pushdown automaton not only depend on the
actual state and the next input symbol, but also on some topmost section of the stack. A transition may
change this upper section of the stack and it may consume the next input symbol by moving the reading
head one place to the right.
Formally, a pushdown automaton is a tuple P = (Q, VT , ∆, q0 , F ), where
• Q is a finite set of states,
• VT is the input alphabet,
• q0 ∈ Q is the initial state,
• F ⊆ Q is the set of final states, and
• ∆ is a finite subset of Q+ × (VT ∪ {ε}) × Q∗ , the transition relation. The transition relation ∆ can
be seen as a finite partial function from Q+ × (VT ∪ {ε}) into the finite subsets of Q∗ .

[Figure: a reading head moves from left to right over the input tape; the finite control reads and rewrites the topmost section of the stack.]
Fig. 3.4. Schematic representation of a pushdown automaton

Our definition of a pushdown automaton is somewhat unusual in that it makes no distinction between
the states of the automaton and its stack symbols. It uses the same alphabet for both. In this way, the
topmost stack symbol is interpreted as the current state. The transition relation describes the possible
computation steps of the pushdown automaton. It lists finitely many transitions. Executing the transition
(γ, x, γ ′ ) replaces the upper section γ ∈ Q+ of the stack contents by the new sequence γ ′ ∈ Q∗ of
states and reads x ∈ VT ∪ {ε} in the input. The replaced section of the stack contents has at least the
length 1. A transition that doesn’t inspect the next input symbol is called an ε-transition.
Similarly as for finite-state machines, we introduce the notion of a configuration for pushdown
automata. A configuration encompasses all components that may influence the future behavior of the
automaton. With our kind of pushdown automata these are the stack contents and the remaining input.
Formally, a configuration of the pushdown automaton P is a pair (γ, w) ∈ Q+ × VT∗ . In the linear
representation the topmost position of the stack is always at the right end of γ while the next input
symbol is situated at the left end of w. A transition of P is represented through the binary relation ⊢P
between configurations. This relation is defined by:

(γ, w) ⊢P (γ ′ , w′ ), if γ = αβ, γ ′ = αβ ′ , w = xw′ and (β, x, β ′ ) ∈ ∆

for a suitable α ∈ Q∗ . As was the case with finite-state machines, a computation is a sequence of
configurations, where a transition exists between each two consecutive members. We write C ⊢Pⁿ C ′
if there exist configurations C1 , . . . , Cn+1 such that C1 = C, Cn+1 = C ′ and Ci ⊢P Ci+1 holds for
1 ≤ i ≤ n. The relations ⊢P⁺ and ⊢P* are the transitive and the reflexive-transitive closure of ⊢P ,
resp. We have:

⊢P⁺ = ⋃_{n≥1} ⊢Pⁿ   and   ⊢P* = ⋃_{n≥0} ⊢Pⁿ

A configuration (q0 , w) for a w ∈ VT∗ is called an initial configuration, and (q, ε) for a q ∈ F a final
configuration of the pushdown automaton P . A word w ∈ VT∗ is accepted by a pushdown automaton
P if (q0 , w) ⊢P* (q, ε) holds for some q ∈ F . The language L(P ) of the pushdown automaton P is the
set of words accepted by P :

L(P ) = {w ∈ VT∗ | ∃f ∈ F : (q0 , w) ⊢P* (f, ε)}

This means, a word w is accepted by a pushdown automaton if there exists at least one computation
that goes from an initial configuration (q0 , w) to a final configuration. Such computations are called
accepting. Several accepting computations may exist for one word, but also several computations that
can only read a prefix of a word w or that can read w, but don’t reach a final configuration.
In practice, accepting computations should not be found by trial and error. Therefore, deterministic
pushdown automata are of particular importance.
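
Acceptance as just defined can be implemented directly as an exhaustive search over all computations. The following Python sketch (states as single characters, the stack as a string with its top at the right end, and a depth bound as a crude guard against runaway ε-computations) is our own illustration of the definitions, precisely the trial and error that deterministic pushdown automata avoid.

# delta is a list of transitions (gamma, x, gamma2) with x a terminal
# symbol or "" for an epsilon-transition.
def accepts(delta, q0, final, word):
    def search(stack, rest, depth):
        if depth == 0:
            return False                     # cut off runaway epsilon computations
        if stack in final and rest == "":
            return True                      # final configuration reached
        for gamma, x, gamma2 in delta:
            if stack.endswith(gamma) and rest.startswith(x):
                new_stack = stack[:len(stack) - len(gamma)] + gamma2
                if new_stack and search(new_stack, rest[len(x):], depth - 1):
                    return True
        return False
    return search(q0, word, 2 * len(word) + 20)

# A pushdown automaton for the language {a^n b^n | n >= 1}:
delta = [("q", "a", "qA"),    # the first a
         ("A", "a", "AA"),    # each further a pushes a counter state
         ("AA", "b", "A"),    # a b pops one counter state
         ("qA", "b", "f")]    # the last b leads into the final state
assert accepts(delta, "q", {"f"}, "aaabbb")
assert not accepts(delta, "q", {"f"}, "aabbb")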

A pushdown automaton P is called deterministic, if the transition relation ∆ has the following
property:
(D) If (γ1 , x, γ2 ), (γ1′ , x′ , γ2′ ) are two different transitions in ∆ and γ1′ is a suffix of γ1 then x and x′
are in Σ and are different from each other, that is, x 6= ε 6= x′ and x 6= x′ .
If the transition relation has the property (D) there exists at most one transition out of each configura-
tion.

3.2.4 The Item-Pushdown Automaton to a Context-Free Grammar

In this section, we meet a method that constructs for each context-free grammar a pushdown automaton
that accepts the language defined by the grammar. This automaton is non-deterministic and therefore
not overly useful for a practical application. However, we can derive the LL-parsers of Section 3.3, as
well as the LR-parsers of Section 3.4 by appropriate design decisions.
The notion of context-free item plays a decisive role. Let G = (VN , VT , P, S) be a context-free
grammar. A context-free item of G is a triple (A, α, β) with A → αβ ∈ P . This triple is, more
intuitively, written as [A → α.β]. The item [A → α.β] describes the situation that in an attempt to
derive a word w from A a prefix of w has already been derived from α. α is therefore called the history
of the item.
An item [A → α.β] with β = ε is called complete. The set of all context-free items of G is denoted
by ItG . If ρ is the sequence of items

ρ = [A1 → α1 .β1 ][A2 → α2 .β2 ] . . . [An → αn .βn ]

then hist(ρ) denotes the concatenation of the histories of the items of ρ, i.e.,

hist(ρ) = α1 α2 . . . αn .

We now describe how to construct the item-pushdown automaton to a context-free grammar G =
(VN , VT , P, S). The items of the grammar act as its states and, therefore, also as stack symbols. The
current state is the item whose right side the automaton is just processing. Below this state in the stack
are the items whose right sides the automaton has begun, but not yet finished, processing.
Before we show how to construct the item-pushdown automaton to a grammar, we extend the
grammar G in such a way that termination of the pushdown automaton can be recognized by looking
at the current state. If S is the start symbol of the grammar, the candidates for final states of the item-pushdown
automaton are all complete items [S → α.] of the grammar. If S also occurs on the right side of a
production, such complete items can occur on the stack without the automaton being ready to terminate,
since below them there may be incomplete items. We, therefore, extend the grammar G by a new start symbol S ′ ,
which does not occur in any right side. For S ′ we add the production S ′ → S to the set of productions
of G. As initial state of the item-pushdown automaton for the extended grammar we choose the item
[S ′ → .S], and as the single final state the complete item [S ′ → S.]. The item-pushdown automaton to the
grammar G is the pushdown automaton

PG = (ItG , VT , ∆, [S ′ → .S], {[S ′ → S.]})

where the transition relation ∆ has three types of transitions:

(E) ∆([X → β.Y γ], ε) = {[X → β.Y γ][Y → .α] | Y → α ∈ P }


(S) ∆([X → β.aγ], a) = {[X → βa.γ]}
(R) ∆([X → β.Y γ][Y → α.], ε) = {[X → βY.γ]}.

Transitions according to (E) are called expanding transitions, those according to (S) shifting transi-
tions and those according to (R) reducing transitions.
Each sequence of items that occurs as stack contents in the computation of an item-pushdown
automaton satisfies the following invariant (I):
(I) If ([S ′ → .S], uv) ⊢∗PG (ρ, v), then hist(ρ) =⇒∗G u.

This invariant is an essential part of the proof that the item-pushdown automaton PG accepts only words
of G, that is, that L(PG ) ⊆ L(G) holds. We now explain how the automaton PG works and at the
same time prove, by induction over the length of computations, that the invariant (I) holds for
each configuration reachable from an initial configuration. Let us first consider the initial configuration
for the input w. The initial configuration is ([S ′ → .S], w). The word u = ε has already been read,
hist([S ′ → .S]) = ε, and ε =⇒∗G ε holds. Therefore, the invariant holds in this configuration.
Let us now consider computations that consist of at least one transition. Assume first that the
last transition was an expanding transition. Before this transition, a configuration (ρ[X → β.Y γ], v)
was reached from the initial configuration ([S ′ → .S], uv).
This configuration satisfies the invariant (I) by the induction hypothesis, i.e., hist(ρ)β =⇒∗G u holds.
The item [X → β.Y γ] as current state suggests deriving a prefix of v from Y . To do this, the automaton
non-deterministically selects one of the alternatives for Y . This is described by the transitions
according to (E). All the successor configurations (ρ[X → β.Y γ][Y → .α], v) for Y → α ∈ P also
satisfy the invariant (I) because

hist(ρ[X → β.Y γ][Y → .α]) = hist(ρ)β =⇒∗G u .

Next, assume that the last transition was a shifting transition. Before this transition, a configuration
(ρ[X → β.aγ], av) was reached from the initial configuration ([S ′ → .S], uav). This
configuration satisfies the invariant (I) by the induction hypothesis, that is, hist(ρ)β =⇒∗G u holds. The
successor configuration (ρ[X → βa.γ], v) also satisfies the invariant (I) because

hist(ρ[X → βa.γ]) = hist(ρ)βa =⇒∗G ua .

For the final case, assume that the last transition was a reducing transition. Before this transition,
a configuration (ρ[X → β.Y γ][Y → α.], v) was reached from the initial configuration ([S ′ → .S], uv).
This configuration satisfies the invariant (I) by the induction hypothesis, that is, hist(ρ)βα =⇒∗G u
holds. The current state is the complete item [Y → α.]. It is the result of a computation that started with
the item [Y → .α], when [X → β.Y γ] was the current state and the alternative Y → α for Y was
selected. This alternative has been successfully processed. The successor configuration (ρ[X → βY.γ], v) also
satisfies the invariant (I) because hist(ρ)βα =⇒∗G u implies hist(ρ)βY =⇒∗G u. ⊓⊔
Taken together, the following theorem holds:
Theorem 3.2.1 For each context-free grammar G, L(PG ) = L(G).
Proof. Let us first assume w ∈ L(PG ). We then have

([S ′ → .S], w) ⊢∗PG ([S ′ → S.], ε) .

Because of the invariant (I), which we have already proved, it follows that

S = hist([S ′ → S.]) =⇒∗G w

Therefore w ∈ L(G). For the other direction, we assume w ∈ L(G). We then have S =⇒∗G w. To prove

([S ′ → .S], w) ⊢∗PG ([S ′ → S.], ε)

we show a more general statement, namely that for each derivation A =⇒G α =⇒∗G w with A ∈ VN ,

(ρ[A → .α], wv) ⊢∗PG (ρ[A → α.], v)

for arbitrary ρ ∈ It∗G and arbitrary v ∈ VT∗ . This general claim can be proved by induction over the
length of the derivation A =⇒G α =⇒∗G w. ⊓⊔

Example 3.2.12 Let G′ = ({S, E, T, F }, {+, ∗, (, ), Id}, P ′ , S) be the extension of grammar G0 by


the new start symbol S. The set of productions P ′ is given by

S → E
E → E+T |T
T → T ∗F |F
F → (E) | Id

The transition relation ∆ of PG0 is presented in Table 3.1. Table 3.2 shows an accepting computation
of PG0 for the word Id + Id ∗ Id. ⊓⊔

Pushdown Automata with Output

Pushdown automata as such are only acceptors; that is, they decide whether or not an input string is
a word of the language. Using a pushdown automaton for syntactic analysis in a compiler requires
more than such a yes/no answer: the automaton should output the syntactic structure of each accepted input
word. This output can take one of several forms, for example a parse tree, or the sequence of productions as
they are applied in a leftmost or rightmost derivation. We therefore extend pushdown automata by a means to
produce output.
A pushdown automaton with output is a tuple P = (Q, VT , O, ∆, q0 , F ), where Q, VT , q0 , F are
the same as with a normal pushdown automaton and O is a finite output alphabet. ∆ is a finite relation
between Q+ × (VT ∪ {ε}) and Q∗ × (O ∪ {ε}). A configuration consists of the actual stack content,
the remaining input, and the already produced output. It is an element of Q+ × VT∗ × O∗ .
At each transition, the automaton can output one symbol from O. If a pushdown automaton with
output is used as a parser its output alphabet consists of the productions of the context-free grammar or
their numbers.
The item-pushdown automaton can be extended by a means to produce output in essentially two
different ways. It can output the applied production whenever it performs an expansion. In this case,
the overall output of an accepting computation is a leftmost derivation. A pushdown automaton with
this output discipline is called a left-parser.
Instead of at each expansion, the item-pushdown automaton can output the applied production at each
reduction. In this case, it delivers a rightmost derivation, but in reversed order. A pushdown automaton
using such an output discipline is called a right-parser.
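A left-parser can be obtained from the recognizer sketched after Theorem 3.2.1 with a minimal change (again our own hedged sketch, not the book's code): each configuration additionally carries the output produced so far, and every (E)-transition appends the production it applies. An accepting computation then returns a leftmost derivation; outputting at (R)-transitions instead would yield a right-parser.

from collections import deque

def item_pda_leftparse(productions, start, word, max_steps=1_000_000):
    word = tuple(word)
    init = (start, (), tuple(productions[start][0]))
    queue, seen = deque([((init,), word, ())]), set()
    while queue and max_steps > 0:
        max_steps -= 1
        stack, rest, out = queue.popleft()
        if (stack, rest) in seen:        # keep one output per configuration
            continue
        seen.add((stack, rest))
        lhs, done, todo = stack[-1]
        if len(stack) == 1 and lhs == start and not todo and not rest:
            return out                   # a leftmost derivation
        if todo:
            sym = todo[0]
            if sym in productions:       # (E): emit the chosen production
                for alt in productions[sym]:
                    alt = tuple(alt)
                    queue.append((stack + ((sym, (), alt),), rest,
                                  out + ((sym, alt),)))
            elif rest and rest[0] == sym:            # (S): no output
                queue.append((stack[:-1] + ((lhs, done + (sym,), todo[1:]),),
                              rest[1:], out))
        elif len(stack) > 1:             # (R): a right-parser would emit here
            blhs, bdone, btodo = stack[-2]
            queue.append((stack[:-2] + ((blhs, bdone + (lhs,), btodo[1:]),),
                          rest, out))
    return None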

Deterministic Parsers

In Theorem 3.2.1 we proved that the item-pushdown automaton PG to a context-free grammar G accepts
the grammar's language L(G). However, the non-deterministic way in which the pushdown
automaton works makes it unsuitable for practical use. The source of non-determinism lies in the transitions of type (E):
at expanding transitions, the item-pushdown automaton can choose between several alternatives for a nonterminal.
For an unambiguous grammar, at most one of them is the correct choice to derive a prefix
of the remaining input; the other alternatives lead sooner or later into dead ends. The item-pushdown
automaton can only guess the right alternative.
In Sections 3.3 and 3.4, we describe two different ways to replace guessing. The LL-parsers of Section 3.3
deterministically choose one alternative for the current nonterminal using a bounded lookahead
into the remaining input. For grammars of the class LL(k), a corresponding parser can deterministically
select one (E)-transition based on the already consumed input, the nonterminal to be expanded, and the
next k input symbols. LL-parsers are left-parsers.
LR-parsers work differently. They delay the decision that LL-parsers take at expansion until
reduction. During the analysis they pursue, in parallel, all possible derivations that may lead
to a reverse rightmost derivation for the input word. A decision has to be taken only when one of these
possibilities signals a reduction. This decision concerns whether to continue shifting or to reduce, and
in the latter case, by which production. The basis for this decision is again the current stack contents and

top of the stack input new top of the stack


[S → .E] ε [S → .E][E → .E + T ]
[S → .E] ε [S → .E][E → .T ]
[E → .E + T ] ε [E → .E + T ][E → .E + T ]
[E → .E + T ] ε [E → .E + T ][E → .T ]
[F → (.E)] ε [F → (.E)][E → .E + T ]
[F → (.E)] ε [F → (.E)][E → .T ]
[E → .T ] ε [E → .T ][T → .T ∗ F ]
[E → .T ] ε [E → .T ][T → .F ]
[T → .T ∗ F ] ε [T → .T ∗ F ][T → .T ∗ F ]
[T → .T ∗ F ] ε [T → .T ∗ F ][T → .F ]
[E → E + .T ] ε [E → E + .T ][T → .T ∗ F ]
[E → E + .T ] ε [E → E + .T ][T → .F ]
[T → .F ] ε [T → .F ][F → .(E)]
[T → .F ] ε [T → .F ][F → .Id]
[T → T ∗ .F ] ε [T → T ∗ .F ][F → .(E)]
[T → T ∗ .F ] ε [T → T ∗ .F ][F → .Id]
[F → .(E)] ( [F → (.E)]
[F → .Id] Id [F → Id.]
[F → (E.)] ) [F → (E).]
[E → E. + T ] + [E → E + .T ]
[T → T. ∗ F ] ∗ [T → T ∗ .F ]
[T → .F ][F → Id.] ε [T → F.]
[T → T ∗ .F ][F → Id.] ε [T → T ∗ F.]
[T → .F ][F → (E).] ε [T → F.]
[T → T ∗ .F ][F → (E).] ε [T → T ∗ F.]
[T → .T ∗ F ][T → F.] ε [T → T. ∗ F ]
[E → .T ][T → F.] ε [E → T.]
[E → E + .T ][T → F.] ε [E → E + T.]
[E → E + .T ][T → T ∗ F.] ε [E → E + T.]
[T → .T ∗ F ][T → T ∗ F.] ε [T → T. ∗ F ]
[E → .T ][T → T ∗ F.] ε [E → T.]
[F → (.E)][E → T.] ε [F → (E.)]
[F → (.E)][E → E + T.] ε [F → (E.)]
[E → .E + T ][E → T.] ε [E → E. + T ]
[E → .E + T ][E → E + T.] ε [E → E. + T ]
[S → .E][E → T.] ε [S → E.]
[S → .E][E → E + T.] ε [S → E.]

Table 3.1. Tabular representation of the transition relation of Example 3.2.12. The middle column shows the
consumed input.

stack contents remaining input


[S → .E] Id + Id ∗ Id
[S → .E][E → .E + T ] Id + Id ∗ Id
[S → .E][E → .E + T ][E → .T ] Id + Id ∗ Id
[S → .E][E → .E + T ][E → .T ][T → .F ] Id + Id ∗ Id
[S → .E][E → .E + T ][E → .T ][T → .F ][F → .Id] Id + Id ∗ Id
[S → .E][E → .E + T ][E → .T ][T → .F ][F → Id.] +Id ∗ Id
[S → .E][E → .E + T ][E → .T ][T → F.] +Id ∗ Id
[S → .E][E → .E + T ][E → T.] +Id ∗ Id
[S → .E][E → E. + T ] +Id ∗ Id
[S → .E][E → E + .T ] Id ∗ Id
[S → .E][E → E + .T ][T → .T ∗ F ] Id ∗ Id
[S → .E][E → E + .T ][T → .T ∗ F ][T → .F ] Id ∗ Id
[S → .E][E → E + .T ][T → .T ∗ F ][T → .F ][F → .Id] Id ∗ Id
[S → .E][E → E + .T ][T → .T ∗ F ][T → .F ][F → Id.] ∗Id
[S → .E][E → E + .T ][T → .T ∗ F ][T → F.] ∗Id
[S → .E][E → E + .T ][T → T. ∗ F ] ∗Id
[S → .E][E → E + .T ][T → T ∗ .F ] Id
[S → .E][E → E + .T ][T → T ∗ .F ][F → .Id] Id
[S → .E][E → E + .T ][T → T ∗ .F ][F → Id.]
[S → .E][E → E + .T ][T → T ∗ F.]
[S → .E][E → E + T.]
[S → E.]

Table 3.2. The accepting computation of PG for the word Id + Id ∗ Id.

a bounded lookahead into the remaining input. LR-parsers signal the applied productions at reductions and
are therefore right-parsers. An LR-parser does not exist for every context-free grammar, but only for grammars of
the class LR(k), where k again is the number of necessary lookahead symbols.

3.2.5 first- and follow-Sets

Let us consider the item-pushdown automaton PG to a context-free grammar G when it performs
an expansion, that is, an (E)-transition. Just before such a transition, PG is in a state of the form
[X → α.Y β]. In this state, the pushdown automaton PG must non-deterministically select one of the
alternatives Y → α1 | . . . | αn for the nonterminal Y . A good aid for this selection is the knowledge of
the sets of words that can be produced from the different alternatives. If the beginning of the remaining
input matches only words in the set of words derivable from one alternative Y → αi , this alternative is
to be selected. If some of the alternatives also produce short words or even ε, the set of words that may
follow Y becomes relevant.
It is wise to consider only prefixes of such words of a given length k, since the sets of words that
can be derived from an alternative are in general infinite, whereas the sets of their prefixes are finite. A
generated parser bases its decisions on a comparison of the prefix of length k of the remaining input with
the elements of these precomputed sets. For this purpose, we introduce the two functions firstk and
followk , which associate these sets with words over (VN ∪ VT )∗ and with nonterminals, respectively. For an
alphabet VT , we write VT≤k for the union of the sets VTi for 0 ≤ i ≤ k, and VT,#≤k for VT≤k ∪ (VT≤k−1 {#}),
where # is a symbol that is not contained in VT . Like the EOF symbol, eof, it marks the end of a word.
Let w = a1 . . . an be a word with ai ∈ VT for 1 ≤ i ≤ n, n ≥ 0. For k ≥ 0, we define the k-prefix of w by
w|k = a1 . . . an if n ≤ k, and w|k = a1 . . . ak otherwise.

Further, we introduce the operator ⊙k : VT∗ × VT∗ → VT≤k defined by

u ⊙k v = (uv)|k

This operator is called k-concatenation. We extend both operators to sets of words. For sets L ⊆ VT∗
and L1 , L2 ⊆ VT≤k we define

L|k = {w|k | w ∈ L} and L1 ⊙k L2 = {x ⊙k y | x ∈ L1 , y ∈ L2 } .
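Both operators are easily made executable. The following lines are our own illustrative sketch (all names are assumptions); words are represented as tuples of terminal symbols, so that multi-character terminals such as Id remain atomic.

def prefix(w, k):
    # the k-prefix w|k of a word w (a tuple of terminal symbols)
    return w[:k]

def kconcat(L1, L2, k):
    # k-concatenation extended to sets: {(uv)|k | u in L1, v in L2}
    return {prefix(u + v, k) for u in L1 for v in L2}

# For example, with k = 2:
assert kconcat({("a",), ("a", "b")}, {("b",)}, 2) == {("a", "b")}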

Let G = (VN , VT , P, S) be a context-free grammar. For k ≥ 1, we define the function firstk : (VN ∪
VT )∗ → 2^{VT≤k} that returns, for each word α, the set of k-prefixes of the terminal words that can
be derived from α:

firstk (α) = {u|k | α =⇒∗ u, u ∈ VT∗ }

Correspondingly, the function followk : VN → 2^{VT,#≤k} returns, for a nonterminal X, the set of terminal
words of length at most k that can directly follow X in a sentential form:

followk (X) = {w | S =⇒∗ βXγ and w ∈ firstk (γ#)}

The set firstk (X) consists of the k-prefixes of the leaf words of all parse trees for X; followk (X) consists of
the k-prefixes of the second part of the leaf words of all upper tree fragments for X (see Fig. 3.5).

Fig. 3.5. firstk and followk in a parse tree

The following lemma describes some properties of k-concatenation and of the function firstk .

Lemma 3.2. Let k ≥ 1, and let L1 , L2 , L3 ⊆ VT≤k be given. We then have:

(a) L1 ⊙k (L2 ⊙k L3 ) = (L1 ⊙k L2 ) ⊙k L3
(b) L1 ⊙k {ε} = {ε} ⊙k L1 = L1 |k
(c) L1 ⊙k L2 = ∅ iff L1 = ∅ ∨ L2 = ∅
(d) ε ∈ L1 ⊙k L2 iff ε ∈ L1 ∧ ε ∈ L2
(e) (L1 L2 )|k = L1 |k ⊙k L2 |k
(f) firstk (X1 . . . Xn ) = firstk (X1 ) ⊙k . . . ⊙k firstk (Xn )
    for X1 , . . . , Xn ∈ (VT ∪ VN )

The proofs of (b), (c), (d), and (e) are trivial. (a) is obtained by case distinction over the lengths of
words x ∈ L1 , y ∈ L2 , z ∈ L3 . The proof of (f) uses (e) and the observation that X1 . . . Xn =⇒∗ u holds
if and only if u = u1 . . . un for suitable words ui with Xi =⇒∗ ui .
Because of property (f), the computation of the set firstk (α) can be reduced to the computation
of the sets firstk (X) for single symbols X ∈ VT ∪ VN . Since firstk (a) = {a} holds for a ∈ VT , it suffices
to determine the sets firstk (X) for nonterminals X. A word w ∈ VT≤k is in firstk (X) if and only if w is
contained in the set firstk (α) for one of the productions X → α ∈ P .
Due to property (f) of Lemma 3.2, the firstk -sets therefore satisfy the equation system (fi):

firstk (X) = ⋃ {firstk (X1 ) ⊙k . . . ⊙k firstk (Xn ) | X → X1 . . . Xn ∈ P } , X ∈ VN   (fi)

Example 3.2.13 Let G2 be the context-free grammar with the productions:

0: S → E 3 : E ′ → +E 6 : T ′ → ∗T
1 : E → T E′ 4 : T → FT′ 7 : F → (E)
2 : E′ → ε 5 : T′ → ε 8 : F → Id

G2 generates the same language of arithmetic expressions as G0 and G1 . For the computation of the
first1 -sets we obtain the following system of equations:

first1 (S) = first1 (E)


first1 (E) = first1 (T ) ⊙1 first1 (E ′ )
first1 (E ′ ) = {ε} ∪ {+} ⊙1 first1 (E)
first1 (T ) = first1 (F ) ⊙1 first1 (T ′ )
first1 (T ′ ) = {ε} ∪ {∗} ⊙1 first1 (T )
first1 (F ) = {Id} ∪ {(} ⊙1 first1 (E) ⊙1 {)}



The right sides of the equations for the firstk -sets are expressions built from the unknowns
firstk (Y ), Y ∈ VN , and the set constants {x}, x ∈ VT ∪ {ε}, using the operators
⊙k and ∪. The following questions arise immediately:
• Does this system of equations always have solutions?
• If yes, which is the one corresponding to the firstk -sets?
• How does one compute this solution?
To answer these questions, we first consider systems of equations like (fi) in general and look for an
algorithmic approach to solving them. Let x1 , . . . , xn be a set of unknowns and

x1 = f1 (x1 , . . . , xn )
x2 = f2 (x1 , . . . , xn )
..
.
xn = fn (x1 , . . . , xn )

a system of equations to be solved over a domain D. Each fi on the right side denotes a function
fi : Dn → D. A solution I ∗ of this system of equations associates a value I ∗ (xi ) with each unknown
xi such that all equations are satisfied, that is

I ∗ (xi ) = fi (I ∗ (x1 ), . . . , I ∗ (xn ))

holds for all i = 1, . . . , n.


Let us assume that D contains a distinguished element d0 that offers itself as a start value for the
computation of a solution. A simple idea for determining a solution consists in setting all the unknowns

x1 , . . . , xn to this start value d0 . Let I (0) be this variable binding. All right sides fi are evaluated in
this variable binding. This may associate each variable xi with a new value. All these new values together
form a new variable binding I (1) , in which the right sides are evaluated again, and so on. Assume
that a current variable binding I (j) has been computed. The new variable binding I (j+1) is determined
by:
I (j+1) (xi ) = fi (I (j) (x1 ), . . . , I (j) (xn ))
A sequence of variable bindings I (0) , I (1) , . . . results. If I (j+1) = I (j) holds for some j ≥ 0, then

I (j) (xi ) = fi (I (j) (x1 ), . . . , I (j) (xn )) (i = 1, . . . , n).

Therefore, I (j) = I ∗ is a solution.


Without further assumptions, it is unclear whether a j with I (j+1) = I (j) is ever reached. In the
special cases considered in this volume, we can guarantee that this procedure converges not merely to
some solution, but to the desired solution. This is based on properties of the domains D that occur
in our applications.
• There always exists a partial order on the domain D, represented by the symbol ⊑. In the case of
the firstk -sets, the set D consists of all subsets of the finite base set VT≤k of terminal words of length
at most k. The partial order on this domain is the subset relation.
• D contains a uniquely determined least element with which the iteration can start. This element is
denoted by ⊥ (bottom). In the case of the firstk -sets, this least element is the empty set.
• For each subset Y ⊆ D, there exists a least upper bound ⊔Y with respect to the relation ⊑. In the case of
the firstk -sets, the least upper bound of a set of sets is their union. Partial orders with this
property are called complete lattices.
Furthermore, all functions fi are monotonic, that is, they respect the order ⊑ of their arguments. In
the case of the firstk -sets this holds because the right sides of the equations are built from the opera-
tors union and k-concatenation, which are both monotonic and because the composition of monotonic
functions is again monotonic.
If the algorithm is started with d0 = ⊥, then I (0) ⊑ I (1) holds. Here, one variable binding is less
than or equal to another if this holds for the values of all variables. The monotonicity
of the functions fi implies, by induction, that the algorithm produces an ascending sequence

I (0) ⊑ I (1) ⊑ I (2) ⊑ . . . ⊑ I (k) ⊑ . . .

of variable bindings. If the domain D is finite, there exists a j, such that I (j) = I (j+1) holds. This means
that the algorithm in fact finds a solution. One can even show that this solution is the least solution.
Such a least solution does even exist if the complete lattice is not finite, and if the simple iteration does
not terminate. This follows from the fixed-point theorem of Knaster-Tarski, which we treat in detail in
the third volume Compiler Design: Analysis and Transformation.
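The global iteration is only a few lines of code. The following sketch is our own illustration (not the book's; all names are assumptions): each right side fi is represented as a function of the current variable binding, and the iteration starts at the least binding and runs until stability, which is guaranteed for finite lattices with monotonic right sides. As a usage example, it solves the first1 -equations of Example 3.2.13, with ε represented as the empty tuple.

def solve(equations, bottom=frozenset()):
    # equations: maps each unknown to a function of the current binding
    binding = {x: bottom for x in equations}
    while True:
        new = {x: f(binding) for x, f in equations.items()}
        if new == binding:
            return binding      # stable: the least solution is reached
        binding = new

def c1(L1, L2):                 # 1-concatenation on sets of tuples
    return {(u + v)[:1] for u in L1 for v in L2}

EPS = ()                        # the empty word
eqs = {                         # the first1-equations of Example 3.2.13
    "S":  lambda I: I["E"],
    "E":  lambda I: c1(I["T"], I["E'"]),
    "E'": lambda I: {EPS} | c1({("+",)}, I["E"]),
    "T":  lambda I: c1(I["F"], I["T'"]),
    "T'": lambda I: {EPS} | c1({("*",)}, I["T"]),
    "F":  lambda I: {("Id",)} | c1(c1({("(",)}, I["E"]), {(")",)}),
}
# yields first1(S) = first1(E) = first1(T) = first1(F) = {Id, (},
# first1(E') = {eps, +}, first1(T') = {eps, *}
print(solve(eqs))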
Example 3.2.14 Let us apply this algorithm to determine a solution of the system of equations of
Example 3.2.13. Initially, all nonterminals are associated with the empty set. The following table shows
the words added to the first1 -sets in the i-th iteration:

      1     2     3     4     5     6     7     8
S                       Id                (
E                 Id                (
E′    ε                 +
T           Id                (
T′    ε           ∗
F     Id                (

The following result is obtained:



first1 (S) = {Id, (} first1 (T ) = {Id, (}


first1 (E) = {Id, (} first1 (T ′ ) = {ε, ∗}
first1 (E ′ ) = {ε, +} first1 (F ) = {Id, (}


It suffices to show that all right sides are monotonic and that the domain is finite to guarantee the
applicability of the iterative algorithm for a given system of equations over a complete lattice.
The following theorem makes sure that the least solution of the system of equations (fi) indeed
characterizes the firstk -sets.
Theorem 3.2.2 (Correctness of the firstk -sets) Let G = (VN , VT , P, S) be a context-free grammar,
D the complete lattice of the subsets of VT≤k , and I : VN → D be the least solution of the system of
equations (fi). We then have:

I(X) = firstk (X) for all X ∈ VN

Proof. For i ≥ 0, let I (i) be the variable binding after the i-th iteration of the algorithm for finding
solutions of (fi). One shows by induction over i that I (i) (X) ⊆ firstk (X) holds for all i ≥ 0 and
all X ∈ VN . Therefore, also I(X) = ⋃i≥0 I (i) (X) ⊆ firstk (X) holds for all X ∈ VN . For the other
direction, it suffices to show that for each derivation X =⇒∗lm w there exists an i ≥ 0 with w|k ∈ I (i) (X).
This claim is again shown by induction, this time over the length n ≥ 1 of the leftmost
derivation. If n = 1, the grammar has a production X → w. We then have

I (1) (X) ⊇ firstk (w) = {w|k }

and the claim follows with i = 1. If n > 1, there exists a production X → u0 X1 u1 . . . Xm um with
u0 , . . . , um ∈ VT∗ and X1 , . . . , Xm ∈ VN , and leftmost derivations Xj =⇒∗lm wj , j = 1, . . . , m, which all
have length less than n, with w = u0 w1 u1 . . . wm um . According to the induction hypothesis, for
each j ∈ {1, . . . , m} there exists an ij such that wj |k ∈ I (ij ) (Xj ) holds. Let i′ be the maximum of
these ij . For i = i′ + 1 it holds that

I (i) (X) ⊇ {u0 } ⊙k I (i′) (X1 ) ⊙k {u1 } . . . ⊙k I (i′) (Xm ) ⊙k {um }
        ⊇ {u0 } ⊙k {w1 |k } ⊙k {u1 } . . . ⊙k {wm |k } ⊙k {um }
        ⊇ {w|k }

The claim follows. ⊓⊔


Computing least solutions of systems of equations, or similarly of systems of inequalities, over complete
lattices is a problem that also appears in the computation of program invariants, which are used
to show the applicability of program transformations meant to increase the efficiency of programs.
Such analyses and transformations are presented in the volume Compiler Design: Analysis and Transformation.
The global iterative approach just sketched is not necessarily the best method to solve systems
of equations. In the volume Compiler Design: Analysis and Transformation we describe more
efficient methods.
Let us now consider how to compute followk -sets for an extended context-free grammar G. Again,
we start with an adequate recursive property. For a word w ∈ VTk ∪ VT≤k−1 {#}, w ∈ followk (X) holds
if
(1) X = S ′ is the start symbol of the grammar and w = #, or
(2) there exists a production Y → αXβ in G such that w ∈ firstk (β) ⊙k followk (Y ).
The sets followk (X) therefore satisfy the following system of equations (fo):

followk (S ′ ) = {#}
followk (X) = ⋃ {firstk (β) ⊙k followk (Y ) | Y → αXβ ∈ P } ,  X ∈ VN , X ≠ S ′   (fo)

Example 3.2.15 Let us again consider the context-free grammar G2 of Example 3.2.13. To calculate
the follow1 -sets for the grammar G2 we use the system of equations:

follow1 (S) = {#}


follow1 (E) = follow1 (S) ∪ follow1 (E ′ ) ∪ {)} ⊙1 follow1 (F )
follow1 (E ′ ) = follow1 (E)
follow1 (T ) = {ε, +} ⊙1 follow1 (E) ∪ follow1 (T ′ )
follow1 (T ′ ) = follow1 (T )
follow1 (F ) = {ε, ∗} ⊙1 follow1 (T )



The system of equations (fo) again has to be solved over a subset lattice. The right sides of the equations
are built from constant sets and unknowns by monotonic operators. Therefore, (fo) has a least solution, which
can be computed by global iteration. We want to ascertain that this algorithm indeed computes the desired
sets.
Theorem 3.2.3 (Correctness of the followk -sets) Let G = (VN , VT , P, S ′ ) be an extended context-
free grammar, D be the complete lattice of subsets of VTk ∪ VT≤k−1 {#}, and I : VN → D be the least
solution of the system of equations (fo). We then have:

I(X) = followk (X) for all X ∈ VN



The proof is similar to the proof of Theorem 3.2.2 and is left to the reader (Exercise 6).
Example 3.2.16 We consider the system of equations of Example 3.2.15. To compute the solution
the iteration again starts with the value ∅ for each nonterminal. The words added in the subsequent
iterations are shown in the following table:

      1     2     3        4          5     6     7
S     #
E           #                               )
E′                #                               )
T                 +, #                            )
T′                         +, #                         )
F                          ∗, +, #                      )

Altogether we obtain the following sets:

follow1 (S) = {#} follow1 (T ) = {+, #, )}


follow1 (E) = {#, )} follow1 (T ′ ) = {+, #, )}
follow1 (E ′ ) = {#, )} follow1 (F ) = {∗, +, #, )}
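These sets can be computed with the same global iteration as before. The following sketch is our own illustration (it repeats the solve and c1 helpers from the sketch in Section 3.2.5 to be self-contained) and encodes the follow1 -system of Example 3.2.15; running it reproduces exactly the sets above.

def solve(equations, bottom=frozenset()):
    binding = {x: bottom for x in equations}
    while True:
        new = {x: f(binding) for x, f in equations.items()}
        if new == binding:
            return binding
        binding = new

def c1(L1, L2):                 # 1-concatenation on sets of tuples
    return {(u + v)[:1] for u in L1 for v in L2}

EPS = ()
eqs = {                         # the follow1-equations of Example 3.2.15
    "S":  lambda I: {("#",)},
    "E":  lambda I: I["S"] | I["E'"] | c1({(")",)}, I["F"]),
    "E'": lambda I: I["E"],
    "T":  lambda I: c1({EPS, ("+",)}, I["E"]) | I["T'"],
    "T'": lambda I: I["T"],
    "F":  lambda I: c1({EPS, ("*",)}, I["T"]),
}
# yields, e.g., follow1(F) = {*, +, #, )}
print(solve(eqs))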


3.2.6 The Special Case first1 and follow1

The iterative method for computing least solutions of the systems of equations for the first1 - and
follow1 -sets is not very efficient. But even with more efficient methods, the computation of firstk - and
followk -sets requires considerable effort when k gets larger. Therefore, practical parsers only use lookahead of
length k = 1. In this case, the computation of the first- and follow-sets can be performed particularly
efficiently. The following lemma is the basis for our further treatment.

Lemma 3.3. Let L1 , L2 ⊆ VT≤1 be non-empty languages. We then have:

L1 ⊙1 L2 = L1            if L2 ≠ ∅ and ε ∉ L1
L1 ⊙1 L2 = (L1 \{ε}) ∪ L2  if L2 ≠ ∅ and ε ∈ L1

According to our assumptions, the grammars we consider are always reduced; they therefore contain
neither non-productive nor unreachable nonterminals. Hence, for all X ∈ VN , both first1 (X) and
follow1 (X) are non-empty. Together with Lemma 3.3, this allows us to simplify the transfer
functions for first1 and follow1 in such a way that 1-concatenation can (essentially) be replaced by
unions. We want to eliminate the case distinction of whether ε is contained in the first1 -sets or not. This
is done in two steps: in the first step, the set of nonterminals X with ε ∈ first1 (X) is determined.
In the second step, the ε-free first1 -set is determined for each nonterminal X instead of the first1 -set.
The ε-free first1 -sets are defined by

eff(X) = first1 (X)\{ε} = {w|1 | X =⇒∗G w, w ≠ ε}

To implement the first step, it helps to exploit that for each nonterminal X,

ε ∈ first1 (X) if and only if X =⇒∗ ε

Example 3.2.17 Consider the grammar G2 of Example 3.2.13. The set of productions in which no
terminal symbol occurs is

0: S → E
1 : E → T E′ 4 : T → FT′
2 : E′ → ε 5 : T′ → ε

With respect to this set of productions only the nonterminals E ′ and T ′ are productive. These two
nonterminals are, thus, the only ε-productive nonterminals of grammar G2 . ⊓⊔
Let us now turn to the second step, the computation of the ε-free first1 -sets. Consider a production of
the form X → X1 . . . Xm . Its contribution to eff(X) can be written as

⋃ {eff(Xj ) | X1 . . . Xj−1 =⇒∗G ε}

Altogether, we obtain the system of equations (eff):

eff(X) = ⋃ {eff(Y ) | X → αY β ∈ P, α =⇒∗G ε} ,  X ∈ VN   (eff)
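Both steps are easily implemented; the following sketch is our own illustration (names are assumptions). Step 1 computes the set of ε-productive nonterminals by a small fixpoint iteration; step 2 extracts, for each nonterminal, the constants and the unknowns of its (eff)-equation. The result is a pure union problem in the sense of Section 3.2.7 and can be handed to the solver sketched there.

def nullable(productions):
    # step 1: all X with X =>* eps, by least fixpoint iteration
    null, changed = set(), True
    while changed:
        changed = False
        for A, alts in productions.items():
            if A not in null and any(all(s in null for s in alt) for alt in alts):
                null.add(A)
                changed = True
    return null

def eff_equations(productions, terminals):
    # step 2: for X -> alpha Y beta with alpha =>* eps, eff(X) collects
    # {a} if Y is a terminal a, and the unknown eff(Y) if Y is a nonterminal
    null = nullable(productions)
    consts = {A: set() for A in productions}
    deps = {A: set() for A in productions}
    for A, alts in productions.items():
        for alt in alts:
            for Y in alt:
                if Y in terminals:
                    consts[A].add(Y)   # a terminal ends the eps-prefix
                    break
                deps[A].add(Y)
                if Y not in null:
                    break              # Y cannot derive eps; stop here
    return consts, deps

# Applied to G2 of Example 3.2.13, this reproduces the equations of
# Example 3.2.18 below.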

Example 3.2.18 Consider again the context-free grammar G2 of Example 3.2.13. The following sys-
tem of equations serves to compute the ε-free first1 -sets.

eff(S) = eff(E) eff(T ) = eff(F )


eff(E) = eff(T ) eff(T ′ ) = ∅ ∪ {∗}
eff(E ′ ) = ∅ ∪ {+} eff(F ) = {Id} ∪ {(}

All occurrences of the ⊙1 -operator have disappeared. Instead, only constant sets, unions and variables
eff(X) appear on the right sides. The least solution is

eff(S) = {Id, (} eff(T ) = {Id, (}


eff(E) = {Id, (} eff(T ′ ) = {∗}
eff(E ′ ) = {+} eff(F ) = {Id, (}



Nonterminals that occur to the right of terminals do not contribute to the ε-free first1 -sets. It is important
for the correctness of the construction that all nonterminals of the grammar are productive.
The ε-free first1 -sets eff(X) can also be used to simplify the system of equations for the computa-
tion of the follow1 -sets. Consider a production of the form Y → αXX1 . . . Xm . The contribution of
the occurrence of X in the right side of Y to the set follow1 (X) is
⋃ {eff(Xj ) | X1 . . . Xj−1 =⇒∗G ε} ∪ {follow1 (Y ) | X1 . . . Xm =⇒∗G ε}

If all nonterminals are not only productive but also reachable, the system of equations for the computation
of the follow1 -sets simplifies to

follow1 (S ′ ) = {#}
follow1 (X) = ⋃ {eff(Y ) | A → αXβY γ ∈ P, β =⇒∗G ε}
            ∪ ⋃ {follow1 (A) | A → αXβ ∈ P, β =⇒∗G ε} ,  X ∈ VN \{S ′ }

Example 3.2.19 The simplified system of equations for the computation of the follow1 -sets of the
context-free grammar G2 of Example 3.2.13 becomes

follow1 (S) = {#}


follow1 (E) = follow1 (S) ∪ follow1 (E ′ ) ∪ {)}
follow1 (E ′ ) = follow1 (E)
follow1 (T ) = {+} ∪ follow1 (E) ∪ follow1 (T ′ )
follow1 (T ′ ) = follow1 (T )
follow1 (F ) = {∗} ∪ follow1 (T )

Again we observe that all occurrences of the operator ⊙1 have been eliminated. Only constant sets and
variables follow1 (X), combined by the union operator, occur on the right sides of the equations. ⊓⊔

The next section presents a method that very efficiently solves arbitrary systems of equations that are
similar to the simplified systems of equations for the sets eff(X) and follow1 (X). We first describe the
general method and then apply it to the computation of the first1 - and follow1 -sets.

3.2.7 Pure Union Problems

Let us assume we have a system of equations

xi = ei ,  i = 1, . . . , n

over an arbitrary complete lattice D, where the expressions ei on the right sides are built only
from constants in D, variables xj , and applications of the operator ⊔ (the least upper bound of
the complete lattice D). The problem is to determine the least solution of this system of
equations efficiently. Such a problem is called a pure union problem.
The computation of the set of reachable nonterminals of a context-free grammar is a pure union
problem over the Boolean lattice B = {false, true}. The problems of computing the ε-free first1 -sets and
the follow1 -sets for a reduced context-free grammar are pure union problems as well. In these cases, the complete
lattices are 2^VT and 2^(VT ∪{#}), ordered by the subset relation.
Example 3.2.20 As a running example, we consider the subset lattice D = 2^{a,b,c} together with the
system of equations
x0 = {a}
x1 = {b} ∪ x0 ∪ x3
x2 = {c} ∪ x1
x3 = {c} ∪ x2 ∪ x3



To a pure union problem we construct a variable-dependency graph. The nodes of this graph are the
variables xi of the system of equations. An edge (xi , xj ) exists if and only if the variable xi occurs
in the right side of the equation for xj . Fig. 3.6 shows the variable-dependency graph for the system of
equations of Example 3.2.20.

Fig. 3.6. The variable-dependency graph for the system of equations of Example 3.2.20.

Let I be the least solution of the system of equations. We observe that I(xi ) ⊑ I(xj ) must always
hold if there exists a path from xi to xj in the variable-dependency graph. In consequence, the values
of all variables in each strongly connected component of the variable-dependency graph are the same.
We label each variable xi with the least upper bound of all constants that occur in the right side of the
equation for xi ; let us call this value I0 (xi ). We then have, for all j,
I(xj ) = ⊔ {I0 (xi ) | xj is reachable from xi }
Example 3.2.21 (Continuation of Example 3.2.20)
For the system of equations of Example 3.2.20 we find:

I0 (x0 ) = {a}
I0 (x1 ) = {b}
I0 (x2 ) = {c}
I0 (x3 ) = {c}
It follows:
I(x0 ) = I0 (x0 ) = {a}
I(x1 ) = I0 (x0 ) ∪ I0 (x1 ) ∪ I0 (x2 ) ∪ I0 (x3 ) = {a, b, c}
I(x2 ) = I0 (x0 ) ∪ I0 (x1 ) ∪ I0 (x2 ) ∪ I0 (x3 ) = {a, b, c}
I(x3 ) = I0 (x0 ) ∪ I0 (x1 ) ∪ I0 (x2 ) ∪ I0 (x3 ) = {a, b, c}


This observation suggests the following method for computing the least solution I of the system of equations.
First, the strongly connected components of the variable-dependency graph are computed; this
needs a linear number of steps. Then an iteration over the list of strongly connected components is
performed.
One starts with a strongly connected component Q that has no incoming edges from other
strongly connected components. The values of all variables xj ∈ Q are

I(xj ) = ⊔ {I0 (xi ) | xi ∈ Q}

The values I(xj ) can be computed by the two loops:

t ← ⊥;
forall (xi ∈ Q)
    t ← t ⊔ I0 (xi );
forall (xi ∈ Q)
    I(xi ) ← t;

The run time of both loops is proportional to the number of elements in the strongly connected component
Q. The values of the variables in Q are then propagated along the outgoing edges. Let EQ be the set of
edges (xi , xj ) of the variable-dependency graph with xi ∈ Q and xj ∉ Q, that is, the edges leaving Q.
For EQ one performs:
forall ((xi , xj ) ∈ EQ )
    I0 (xj ) ← I0 (xj ) ⊔ I(xi );
The number of steps for this propagation is proportional to the number of edges in EQ .
The strongly connected component Q, together with the set EQ of outgoing edges, is then removed from
the graph, and one continues with the next strongly connected component without incoming edges. This
is repeated until no strongly connected component remains. Altogether, we obtain a method that
performs a linear number of ⊔-operations on the complete lattice D.
Example 3.2.22 (Continuation of Example 3.2.20) The dependency graph of the system of equations
of Example 3.2.20 has the strongly-connected components

Q0 = {x0 } and Q1 = {x1 , x2 , x3 } .

For Q0 one obtains the value I(x0 ) = I0 (x0 ) = {a}. After removal of Q0 and the edge (x0 , x1 ), the new
assignment is:
I0 (x1 ) = {a, b}
I0 (x2 ) = {c}
I0 (x3 ) = {c}
The values of all variables in the strongly connected component Q1 arise as I0 (x1 ) ∪ I0 (x2 ) ∪ I0 (x3 ) =
{a, b, c}. ⊓⊔
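The complete method fits into a few dozen lines of code. The following Python sketch is our own illustration (not the book's): it computes the strongly connected components with Tarjan's algorithm, which emits them sinks-first, and then propagates the values in reversed, that is, topological, order. Applied to Example 3.2.20 it returns the solution computed above.

def solve_pure_union(consts, deps):
    # consts[x]: the union I0(x) of the constants in the right side of x;
    # deps[x]: the variables occurring in the right side of x,
    # i.e., there is an edge y -> x for every y in deps[x].
    succ = {x: [] for x in consts}
    for x, ys in deps.items():
        for y in ys:
            succ[y].append(x)

    index, low, on_stack, stack, sccs = {}, {}, set(), [], []
    counter = [0]
    def dfs(v):                      # Tarjan's SCC algorithm
        index[v] = low[v] = counter[0]; counter[0] += 1
        stack.append(v); on_stack.add(v)
        for w in succ[v]:
            if w not in index:
                dfs(w); low[v] = min(low[v], low[w])
            elif w in on_stack:
                low[v] = min(low[v], index[w])
        if low[v] == index[v]:       # v is the root of a component
            scc = []
            while True:
                w = stack.pop(); on_stack.discard(w); scc.append(w)
                if w == v:
                    break
            sccs.append(scc)
    for v in consts:
        if v not in index:
            dfs(v)

    value = {x: set(consts[x]) for x in consts}
    for scc in reversed(sccs):       # process components sources-first
        members = set(scc)
        t = set()
        for x in scc:                # least upper bound within the component
            t |= value[x]
        for x in scc:
            value[x] = t
        for x in scc:                # propagate along the leaving edges
            for w in succ[x]:
                if w not in members:
                    value[w] |= t
    return value

consts = {"x0": {"a"}, "x1": {"b"}, "x2": {"c"}, "x3": {"c"}}
deps = {"x0": set(), "x1": {"x0", "x3"}, "x2": {"x1"}, "x3": {"x2", "x3"}}
print(solve_pure_union(consts, deps))   # x1, x2, x3 all map to {a, b, c}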

3.3 Top-down-Syntax Analysis


3.3.1 Introduction

The ways different parsers work can best be made intuitively clear by observing how they construct
the parse tree for an input word. Top-down parsers start the construction of the parse tree at the root.
In the initial situation, the constructed fragment of the parse tree consists only of the root, which is labeled
with the start symbol of the context-free grammar; nothing of the input word w is consumed. In this
situation, one alternative for the start symbol is selected for expansion. The symbols of the right side
of this alternative are attached under the root, extending the upper fragment of the parse tree. The next
nonterminal to be considered is the one in the leftmost position. The selection of one alternative for this
nonterminal and the attachment of its right side below the node labeled with the left side are repeated
until the parse tree is complete. By attaching symbols of the right sides of productions, terminal symbols
can appear in the leaf word of a tree fragment. If there is no nonterminal to the left of a terminal symbol
in the leaf word, the top-down parser compares it with the next symbol in the input. If they
agree, the parser consumes the symbol from the input; otherwise, the parser reports a syntax
error.
Thus, a top-down analysis performs the following two types of actions:
• Selection of an alternative for the current leftmost nonterminal and attachment of the right side of
the production to the current tree fragment.
• Comparison of the terminal symbols to the left of the leftmost nonterminal with the remaining input.
Figures 3.7, 3.8, 3.9, and 3.10 show some parse-tree fragments for the arithmetic expression Id + Id ∗
Id according to grammar G2 . The selection of alternatives for the nonterminals to be expanded was
cleverly done in such a way as to lead to a successful termination of the analysis.

Fig. 3.7. The first parse-tree fragments of a top-down analysis of the word Id + Id ∗ Id according to grammar G2
(S → E, E → T E′ , E′ → +E | ε, T → F T′ , T′ → ∗T | ε, F → (E) | Id). They are constructed without reading
any symbol from the input.

Fig. 3.8. The parse-tree fragments after reading the symbol Id and before the terminal symbol + is attached to
the fragment.

Fig. 3.9. The first and the last parse tree after reading the symbol + and before the second symbol Id appears
in the parse tree.

Fig. 3.10. The parse tree after the reduction for the second occurrence of T′ and the parse tree after reading the
symbol ∗, together with the remaining input.

3.3.2 LL(k): Definition, Examples, and Properties

The item-pushdown automaton PG to a context-free grammar G works in principle like a top-down
parser; its (E)-transitions predict which alternative to select for the current nonterminal in order
to derive the input word. The trouble is that the item-pushdown automaton PG takes this decision in
a non-deterministic way. The non-determinism stems from the (E)-transitions: if [X → β.Y γ] is the
current state and Y has the alternatives Y → α1 | . . . | αn , there are n transitions

∆([X → β.Y γ], ε) = {[X → β.Y γ][Y → .αi ] | 1 ≤ i ≤ n}

To derive a deterministic automaton from the item-pushdown automaton PG , we equip the automaton
with a bounded lookahead into the remaining input. We fix a natural number k ≥ 1 and allow the item-
pushdown automaton to inspect the first k symbols of the remaining input at each (E)-transition to aid
in its decision. If this lookahead of depth k always suffices to select the right alternative, we call the
grammar an LL(k) grammar.
Let us consider a configuration that the item-pushdown automaton PG has reached from an initial
configuration:

([S ′ → .S], uv) ⊢∗PG (ρ[X → β.Y γ], v)

Because of invariant (I) of Section 3.2.4, hist(ρ)β =⇒∗G u holds.
Let ρ = [X1 → β1 .X2 γ1 ] . . . [Xn → βn .Xn+1 γn ] be a sequence of items. We call the sequence

fut(ρ) = γn . . . γ1

the future of ρ. Let δ = fut(ρ). So far, the leftmost derivation S ′ =⇒∗lm uY γδ has been found. If this
derivation can be extended to a derivation of the terminal word uv, that is, S ′ =⇒∗lm uY γδ =⇒∗lm uv, then
for an LL(k) grammar the alternative to be selected for Y only depends on u, Y , and v|k .
Let k ≥ 1 be a natural number. The reduced context-free grammar G is an LL(k) grammar if for
every two leftmost derivations

S =⇒∗lm uY α =⇒lm uβα =⇒∗lm ux   and   S =⇒∗lm uY α =⇒lm uγα =⇒∗lm uy

x|k = y|k implies β = γ.



For an LL(k) grammar, the selection of the alternative for the next nonterminal Y in general depends
not only on Y and the next k symbols, but also on the already consumed prefix u of the input. If
this selection does not depend on the already consumed left context u, we call the grammar
strong LL(k).
Example 3.3.1 Let G1 be the context-free grammar with the productions:

⟨stat⟩ → if (Id) ⟨stat⟩ else ⟨stat⟩ |
         while (Id) ⟨stat⟩ |
         { ⟨stats⟩ } |
         Id ′=′ Id;
⟨stats⟩ → ⟨stat⟩ ⟨stats⟩ |
          ε

The grammar G1 is an LL(1) grammar. If ⟨stat⟩ occurs as leftmost nonterminal in a sentential form,
then the next input symbol determines which alternative must be applied. More precisely, this means that
for two derivations of the form

⟨stat⟩ =⇒∗lm w ⟨stat⟩ α =⇒lm w β α =⇒∗lm w x
⟨stat⟩ =⇒∗lm w ⟨stat⟩ α =⇒lm w γ α =⇒∗lm w y

it follows from x|1 = y|1 that β = γ. If, for instance, x|1 = y|1 = if, then β = γ =
if (Id) ⟨stat⟩ else ⟨stat⟩. ⊓⊔

Definition 3.3.1 (simple LL(1)-grammar)
Let G be a context-free grammar without ε-productions. If for each nonterminal N , each of its alterna-
tives begins with a different terminal symbol, then G is called a simple LL(1) grammar. ⊓⊔
This is a first, easily checked criterion for a special case. The grammar G1 of Example 3.3.1 is a simple
LL(1) grammar.
Example 3.3.2 We now add the following productions to the grammar G1 of Example 3.3.1:

⟨stat⟩ → Id : ⟨stat⟩ |   // labeled statement
         Id (Id);        // procedure call

The grammar G2 thus obtained is no longer an LL(1) grammar, because

⟨stat⟩ =⇒∗lm w ⟨stat⟩ α =⇒lm w Id ′=′ Id; α =⇒∗lm w x     (β = Id ′=′ Id;)
⟨stat⟩ =⇒∗lm w ⟨stat⟩ α =⇒lm w Id : ⟨stat⟩ α =⇒∗lm w y    (γ = Id : ⟨stat⟩)
⟨stat⟩ =⇒∗lm w ⟨stat⟩ α =⇒lm w Id (Id); α =⇒∗lm w z       (δ = Id (Id);)

with x|1 = y|1 = z|1 = Id, but β, γ, δ pairwise different.

However, G2 is an LL(2) grammar. For the three leftmost derivations given above, the 2-prefixes

x|2 = Id ′=′     y|2 = Id :     z|2 = Id (

are pairwise different, and these are indeed the only critical cases. ⊓⊔


Example 3.3.3 G3 possesses the productions

⟨stat⟩ → if (⟨var⟩) ⟨stat⟩ else ⟨stat⟩ |
         while (⟨var⟩) ⟨stat⟩ |
         { ⟨stats⟩ } |
         ⟨var⟩ ′=′ ⟨var⟩; |
         ⟨var⟩;
⟨stats⟩ → ⟨stat⟩ ⟨stats⟩ |
          ε
⟨var⟩ → Id |
        Id() |
        Id(⟨vars⟩)
⟨vars⟩ → ⟨var⟩, ⟨vars⟩ |
         ⟨var⟩

The grammar G3 is an LL(k) grammar for no k ≥ 1. To derive a contradiction, assume G3 were an
LL(k) grammar for some k ≥ 1. Consider the two derivations

⟨stat⟩ =⇒lm β =⇒∗lm x     and     ⟨stat⟩ =⇒lm γ =⇒∗lm y

with

x = Id (Id, Id, . . . , Id) ′=′ Id;     and     y = Id (Id, Id, . . . , Id);

where the parentheses enclose k occurrences of Id each. We have x|k = y|k , but

β = ⟨var⟩ ′=′ ⟨var⟩;     and     γ = ⟨var⟩;

and therefore β ≠ γ. ⊓⊔

There exists, however, an LL(2) grammar for the language L(G3 ) of grammar G3 , which can be obtained
from G3 by factorization. Critical in G3 are the productions for assignment and procedure call.
Factorization introduces sharing of the common prefix of those productions; a new nonterminal symbol
follows this common prefix, and the different continuations can be derived from this new nonterminal. The
productions
⟨stat⟩ → ⟨var⟩ ′=′ ⟨var⟩; | ⟨var⟩;
are replaced by
⟨stat⟩ → ⟨var⟩ Z
Z → ′=′ ⟨var⟩; | ;
Now an LL(1) parser can decide between the critical alternatives for Z using the next symbols ′=′ and ';'.
Example 3.3.4 Let G4 = ({S, A, B}, {0, 1, a, b}, P4 , S), where the set P4 of productions is given by

S → A | B
A → aAb | 0
B → aBbb | 1

Then
L(G4 ) = {a^n 0 b^n | n ≥ 0} ∪ {a^n 1 b^2n | n ≥ 0}
and G4 is not an LL(k) grammar for any k ≥ 1. To see this, we consider, for each k ≥ 1, the two leftmost
derivations

S =⇒lm A =⇒∗lm a^k 0 b^k
S =⇒lm B =⇒∗lm a^k 1 b^2k

We have (a^k 0 b^k )|k = a^k = (a^k 1 b^2k )|k , but the right sides A and B for S are different. In this case,
one can even show that there exists no LL(k) grammar for the language L(G4 ) for any k ≥ 1. ⊓⊔
Theorem 3.3.1 The reduced context-free grammar G = (VN , VT , P, S) is an LL(k) grammar if and
only if for each two different productions A → β and A → γ of G:

firstk (βα) ∩ firstk (γα) = ∅   for all α with S =⇒∗lm wAα

Proof. For the direction "⇒", assume that G is an LL(k) grammar, but that there exists an
x ∈ firstk (βα) ∩ firstk (γα). According to the definition of firstk , and because G is reduced, there exist
derivations

S =⇒∗lm uAα =⇒lm uβα =⇒∗lm uxy
S =⇒∗lm uAα =⇒lm uγα =⇒∗lm uxz

where in the case |x| < k it must hold that y = z = ε. Since β ≠ γ, G cannot be an LL(k)
grammar, a contradiction to our assumption.
For the direction "⇐", assume that G is not an LL(k) grammar. Then there exist
two leftmost derivations

S =⇒∗lm uAα =⇒lm uβα =⇒∗lm ux
S =⇒∗lm uAα =⇒lm uγα =⇒∗lm uy

with x|k = y|k , where A → β and A → γ are different productions. Then the word x|k = y|k is contained
in firstk (βα) ∩ firstk (γα), a contradiction to the condition of the theorem. ⊓⊔

Theorem 3.3.1 states that in an LL(k) grammar the application of two different productions to a left-
sentential form always leads to different k-prefixes of the remaining input. The theorem allows us to
derive useful criteria for membership in certain subclasses of LL(k) grammars. The first concerns the
case k = 1.
The condition first1 (βα) ∩ first1 (γα) = ∅ for all left-sentential forms wAα and any two different alternatives
A → β and A → γ can be simplified to first1 (β) ∩ first1 (γ) = ∅ if neither β nor γ produces the empty word
ε. This is the case if no nonterminal of G is ε-productive.
Theorem 3.3.2 Let G be an ε-free context-free grammar, that is, without productions of the form
X → ε. Then G is an LL(1) grammar if and only if for each nonterminal X with the alternatives
X → α1 | . . . | αn the sets first1 (α1 ), . . . , first1 (αn ) are pairwise disjoint.
In practice, forbidding ε-productions would be too hard a restriction. Consider the case that one of the
two right sides β or γ produces the empty word. If both β and γ produce the empty word,
G cannot be an LL(1) grammar. Let us therefore assume that β =⇒∗ ε, but that ε cannot be derived
from γ. Then, for all left-sentential forms uAα and u′ Aα′ :

first1 (βα) ∩ first1 (γα′ ) = first1 (βα) ∩ (first1 (γ) ⊙1 first1 (α′ ))
                           = first1 (βα) ∩ first1 (γ)
                           = first1 (βα) ∩ first1 (γα)
                           = ∅

This implies that

first1 (β) ⊙1 follow1 (A) ∩ first1 (γ) ⊙1 follow1 (A)
  = ⋃ {first1 (βα) | S =⇒∗lm uAα} ∩ ⋃ {first1 (γα′ ) | S =⇒∗lm u′ Aα′ }
  = ∅

We hereby obtain the following theorem:



Theorem 3.3.3 A reduced context-free grammar G is an LL(1) grammar if and only if for each two
different productions A → β and A → γ the following holds:

first1 (β) ⊙1 follow1 (A) ∩ first1 (γ) ⊙1 follow1 (A) = ∅ .



The characterization of Theorem 3.3.3, in contrast to that of Theorem 3.3.1, is easily checked. An
even more easily checkable formulation is obtained by exploiting the properties of 1-concatenation.
Corollary 3.3.3.1 A reduced context-free grammar G is an LL(1) grammar if and only if for all alternatives
A → α1 | . . . | αn the following holds:
1. first1 (α1 ), . . . , first1 (αn ) are pairwise disjoint; in particular, at most one of these sets contains ε;
2. ε ∈ first1 (αi ) implies first1 (αj ) ∩ follow1 (A) = ∅ for all 1 ≤ j ≤ n, j ≠ i. ⊓⊔
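Corollary 3.3.3.1 translates directly into a check. The sketch below is our own illustration; it assumes first1- and follow1-sets precomputed, for example by the iterations of Section 3.2.5, with ε represented as the empty tuple.

def c1(L1, L2):                              # 1-concatenation, as before
    return {(u + v)[:1] for u in L1 for v in L2}

def first1_of_word(alpha, first1, terminals):
    acc = {()}                               # first1(eps) = {eps}
    for X in alpha:
        acc = c1(acc, {(X,)} if X in terminals else first1[X])
    return acc

def is_ll1(productions, first1, follow1, terminals):
    # tests both conditions of Corollary 3.3.3.1 for every nonterminal
    for A, alts in productions.items():
        firsts = [first1_of_word(a, first1, terminals) for a in alts]
        for i in range(len(alts)):
            for j in range(len(alts)):
                if i < j and firsts[i] & firsts[j]:
                    return False             # condition 1 violated
                if i != j and () in firsts[i] and firsts[j] & follow1[A]:
                    return False             # condition 2 violated
    return True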
We extend the property of Theorem 3.3.3 to arbitrary lengths k ≥ 1 of lookahead.
A reduced context-free grammar G = (VN , VT , P, S) is called a strong LL(k) grammar if for each
two different productions A → β and A → γ of a nonterminal A,

firstk (β) ⊙k followk (A) ∩ firstk (γ) ⊙k followk (A) = ∅.

According to this definition and Theorem 3.3.3, every LL(1) grammar is a strong LL(1) grammar.
However, an LL(k) grammar for k > 1 is not automatically a strong LL(k) grammar. The reason is
that the set followk (A) contains the follow words of all left-sentential forms with occurrences of A. In
contrast, the LL(k) condition only refers to the follow words of one left-sentential form.
Example 3.3.5 Let G be the context-free grammar with the productions

S → aAaa | bAba     A → b | ε

We check:
Case 1: The derivation starts with S ⇒ aAaa. It holds that first2 (baa) ∩ first2 (aa) = ∅.
Case 2: The derivation starts with S ⇒ bAba. It holds that first2 (bba) ∩ first2 (ba) = ∅.
Hence, G is an LL(2) grammar according to Theorem 3.3.1. However, the grammar G is not a strong
LL(2) grammar, because

first2 (b) ⊙2 follow2 (A) ∩ first2 (ε) ⊙2 follow2 (A)
  = {b} ⊙2 {aa, ba} ∩ {ε} ⊙2 {aa, ba}
  = {ba, bb} ∩ {aa, ba}
  = {ba}

In this example, follow2 (A) is too coarse, because it collects terminal follow words that occur
in different sentential forms. ⊓⊔

3.3.3 Left Recursion

Deterministic parsers that construct the parse tree for the input top-down cannot deal with left-recursive
nonterminals. A nonterminal A of a context-free grammar G is called left recursive if there exists a
derivation A =⇒⁺ Aβ.

Theorem 3.3.4 Let G be a reduced context-free grammar. G is not an LL(k) grammar for any k ≥ 1
if at least one nonterminal of the grammar G is left recursive.

Proof. Let X be a left-recursive nonterminal of grammar G. For simplicity, we assume that G has a
production X → Xβ. Since G is reduced, there must also exist another production X → α. If X occurs in a
left-sentential form, that is, S =⇒∗lm uXγ, the alternative X → Xβ can be applied arbitrarily often. For
each n ≥ 1 there exists a leftmost derivation

S =⇒∗lm uXγ =⇒∗lm uXβ^n γ .

Let us assume that grammar G is an LL(k) grammar. Theorem 3.3.1 then implies

firstk (Xβ^{n+1} γ) ∩ firstk (αβ^n γ) = ∅.

Due to X → α we have
firstk (αβ^{n+1} γ) ⊆ firstk (Xβ^{n+1} γ),
hence also
firstk (αβ^{n+1} γ) ∩ firstk (αβ^n γ) = ∅.

If β =⇒∗ ε holds, we immediately obtain a contradiction. Otherwise, we choose n ≥ k and again obtain
a contradiction. Hence, G cannot be an LL(k) grammar. ⊓⊔

We conclude that no generator of LL(k) parsers can cope with left-recursive grammars. However,
each grammar with left recursion can be transformed into a grammar without left recursion that defines
the same language. Let us assume for simplicity that the grammar G has no ε-productions (see
Exercise ??) and no recursive chain productions, that is, no nonterminal A with A =⇒⁺G A. Let
G = (VN , VT , P, S). We construct for G a context-free grammar G′ = (VN′ , VT , P ′ , S) with the same
set VT of terminal symbols, the same start symbol S, the set VN′ of nonterminal symbols

VN′ = VN ∪ {⟨A, B⟩ | A, B ∈ VN },

and a set of productions P ′ constructed as follows (a code sketch of this transformation follows the list):

• If B → aβ ∈ P for a terminal symbol a ∈ VT , then A → aβ ⟨A, B⟩ ∈ P ′ for each A ∈ VN ;
• If C → Bβ ∈ P , then ⟨A, B⟩ → β ⟨A, C⟩ ∈ P ′ for each A ∈ VN ;
• Finally, ⟨A, A⟩ → ε ∈ P ′ for all A ∈ VN .
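The three rules can be implemented almost literally; the following sketch is our own illustration (names are assumptions). It represents the fresh nonterminals ⟨A, B⟩ as Python pairs and, as in Example 3.3.6 below, leaves the removal of non-productive and unreachable nonterminals to a separate cleanup pass.

def remove_left_recursion(productions, terminals):
    # productions: maps each nonterminal to its alternatives (tuples of
    # symbols); the fresh nonterminal <A,B> is represented as the pair (A, B).
    nts = set(productions)
    new = {}
    def add(lhs, rhs):
        new.setdefault(lhs, []).append(tuple(rhs))
    for A in nts:
        for B in nts:
            for rhs in productions[B]:
                if rhs and rhs[0] in terminals:       # B -> a beta
                    add(A, rhs + ((A, B),))           # A -> a beta <A,B>
    for C in nts:
        for rhs in productions[C]:
            if rhs and rhs[0] in nts:                 # C -> B beta
                B = rhs[0]
                for A in nts:
                    add((A, B), rhs[1:] + ((A, C),))  # <A,B> -> beta <A,C>
    for A in nts:
        add((A, A), ())                               # <A,A> -> eps
    return new

G0 = {"E": [("E", "+", "T"), ("T",)],
      "T": [("T", "*", "F"), ("F",)],
      "F": [("(", "E", ")"), ("Id",)]}
# Gprime contains, besides useless productions, those shown in Example 3.3.6.
Gprime = remove_left_recursion(G0, {"+", "*", "(", ")", "Id"})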
Example 3.3.6 For the grammar G0 with the productions

E → E + T | T
T → T ∗ F | F
F → (E) | Id

we obtain, after removal of non-productive nonterminals,

E → (E) ⟨E, F⟩ | Id ⟨E, F⟩
⟨E, F⟩ → ⟨E, T⟩
⟨E, T⟩ → ∗ F ⟨E, T⟩ | ⟨E, E⟩
⟨E, E⟩ → + T ⟨E, E⟩ | ε
T → (E) ⟨T, F⟩ | Id ⟨T, F⟩
⟨T, F⟩ → ⟨T, T⟩
⟨T, T⟩ → ∗ F ⟨T, T⟩ | ε
F → (E) ⟨F, F⟩ | Id ⟨F, F⟩
⟨F, F⟩ → ε

Grammar G0 has three nonterminals and six productions; the transformed grammar G1 needs nine nonterminals and
15 productions.

The parse tree for Id + Id according to grammar G0 is shown in Fig. 3.11 (a), the one according
to grammar G1 in Fig. 3.11 (b). The latter has a decidedly different structure. Intuitively, the transformed
grammar generates the first possible terminal symbol directly and then collects, in a backward fashion, the
remainders of the right sides that follow the left-side nonterminal symbol. The nonterminal ⟨A, B⟩
stands for the task of returning from B back to A. ⊓⊔
We convince ourselves that the grammar G′ constructed from grammar G has the following properties:
• Grammar G′ has no left-recursive nonterminals.
• There exists a leftmost derivation

A =⇒∗G Bγ =⇒G aβγ

if and only if there exists a rightmost derivation

A =⇒G′ aβ ⟨A, B⟩ =⇒∗G′ aβγ ⟨A, A⟩

in which, after the first step, only nonterminals of the form ⟨X, Y⟩ are replaced.
The last property implies, in particular, that the grammars G and G′ are equivalent, i.e., that L(G) = L(G′ )
holds.
In some cases, the grammar obtained by removing left recursion is an LL(k) grammar; this is the
case for grammar G0 of Example 3.3.6. However, as we have seen, the transformation to remove left
recursion also has disadvantages. Let n be the number of nonterminals: both the number of nonterminals
and the number of productions can increase by a factor of n + 1. For large grammars, it is therefore
not advisable to perform this transformation manually. A parser generator, however, could do the
transformation automatically and also generate a program that converts parse trees
of the transformed grammar back into parse trees of the original grammar (see Exercise ?? of the next
section). The user would not even see the grammar transformation.

Fig. 3.11. Parse trees for Id + Id according to grammar G0 of Example 3.3.6 (a) and according to the grammar
after removal of left recursion (b).

Example 3.3.6 illustrates how much the parse tree of a word according to the transformed grammar
can differ from its parse tree according to the original grammar. The operator sits somewhat isolated
between its remotely located operands. An alternative to the elimination of left recursion is offered by grammars
with regular right sides, which we treat later.

3.3.4 Strong LL(k) Parsers

Fig. 3.12. Schematic representation of a strong LL(k) parser (input tape with the consumed prefix w and the
lookahead u, stack, parser table M, control, and output tape).

Fig. 3.12 shows the structure of a parser for strong LL(k) grammars. The prefix w of the input has
already been read. The remaining input starts with a prefix u of length k. The stack contains a sequence of
items of the context-free grammar. The topmost item, the current state Z, determines whether
• to read the next input symbol,
• to test for the successful end of the analysis, or
• to expand the current nonterminal.
Upon expansion, the parser uses the parser table to select the correct alternative for the nonterminal.
The parser table M is a two-dimensional array whose rows are indexed by the nonterminals and whose
columns are indexed by words of length at most k. It represents a selection function

M : VN × VT,#≤k → (VT ∪ VN )∗ ∪ {error}

which associates each nonterminal with the one of its alternatives that should be applied for the
given lookahead; it signals an error if no alternative exists for the combination of current state
and lookahead. Let [X → β.Y γ] be the topmost item on the stack and u the prefix of length k of the
remaining input. If M [Y, u] = (Y → α), then [Y → .α] becomes the new topmost stack symbol, and the
production Y → α is written to the output tape.
The table entries in M for a nonterminal Y are determined in the following way. Let Y → α1 |
. . . | αr be the alternatives for Y . For a strong LL(k) grammar, the sets firstk (αi ) ⊙k followk (Y ) are
pairwise disjoint. For each u ∈ firstk (α1 ) ⊙k followk (Y ) ∪ . . . ∪ firstk (αr ) ⊙k followk (Y ), one therefore sets

M [Y, u] ← αi   if and only if   u ∈ firstk (αi ) ⊙k followk (Y )

For all other u, M [Y, u] is set to error. The entry M [Y, u] = error means that the current nonterminal and
the prefix of the remaining input do not go together, that is, a syntax error has been found. An
error-diagnosis and error-handling routine is started, which attempts to continue the analysis. Such
approaches are described in Section ??.
For k = 1, the construction of the parser table is particularly simple. Because of Corollary 3.3.3.1, it
works without k-concatenation. Instead, it suffices to test u for membership in one of the sets first1 (αi )
and maybe in follow1 (Y ).
Example 3.3.7 Table 3.3 is the LL(1) parser table for the grammar of Example 3.3.6. Table 3.4
describes the run of the associated parser for the input Id ∗ Id#. ⊓⊔

         (            )       +            ∗            Id           #
S        E            error   error        error        E            error
E        (E) ⟨E,F⟩    error   error        error        Id ⟨E,F⟩     error
T        (E) ⟨T,F⟩    error   error        error        Id ⟨T,F⟩     error
F        (E) ⟨F,F⟩    error   error        error        Id ⟨F,F⟩     error
⟨E,F⟩    error        ⟨E,T⟩   ⟨E,T⟩        ⟨E,T⟩        error        ⟨E,T⟩
⟨E,T⟩    error        ⟨E,E⟩   ⟨E,E⟩        ∗ F ⟨E,T⟩    error        ⟨E,E⟩
⟨E,E⟩    error        ε       + T ⟨E,E⟩    error        error        ε
⟨T,F⟩    error        ⟨T,T⟩   ⟨T,T⟩        ⟨T,T⟩        error        ⟨T,T⟩
⟨T,T⟩    error        ε       ε            ∗ F ⟨T,T⟩    error        ε
⟨F,F⟩    error        ε       ε            ε            error        ε

Table 3.3. LL(1) parser table for the grammar of Example 3.3.6.

Stack                                                                              Input
[S → .E]                                                                           Id ∗ Id#
[S → .E][E → .Id ⟨E,F⟩]                                                            Id ∗ Id#
[S → .E][E → Id.⟨E,F⟩]                                                             ∗Id#
[S → .E][E → Id.⟨E,F⟩][⟨E,F⟩ → .⟨E,T⟩]                                             ∗Id#
[S → .E][E → Id.⟨E,F⟩][⟨E,F⟩ → .⟨E,T⟩][⟨E,T⟩ → . ∗ F ⟨E,T⟩]                        ∗Id#
[S → .E][E → Id.⟨E,F⟩][⟨E,F⟩ → .⟨E,T⟩][⟨E,T⟩ → ∗ .F ⟨E,T⟩]                         Id#
[S → .E][E → Id.⟨E,F⟩][⟨E,F⟩ → .⟨E,T⟩][⟨E,T⟩ → ∗ .F ⟨E,T⟩][F → .Id ⟨F,F⟩]          Id#
[S → .E][E → Id.⟨E,F⟩][⟨E,F⟩ → .⟨E,T⟩][⟨E,T⟩ → ∗ .F ⟨E,T⟩][F → Id.⟨F,F⟩]           #
[S → .E][E → Id.⟨E,F⟩][⟨E,F⟩ → .⟨E,T⟩][⟨E,T⟩ → ∗ .F ⟨E,T⟩][F → Id.⟨F,F⟩][⟨F,F⟩ → .]  #
[S → .E][E → Id.⟨E,F⟩][⟨E,F⟩ → .⟨E,T⟩][⟨E,T⟩ → ∗ .F ⟨E,T⟩][F → Id ⟨F,F⟩.]          #
[S → .E][E → Id.⟨E,F⟩][⟨E,F⟩ → .⟨E,T⟩][⟨E,T⟩ → ∗ F.⟨E,T⟩]                          #
[S → .E][E → Id.⟨E,F⟩][⟨E,F⟩ → .⟨E,T⟩][⟨E,T⟩ → ∗ F.⟨E,T⟩][⟨E,T⟩ → .⟨E,E⟩]          #
[S → .E][E → Id.⟨E,F⟩][⟨E,F⟩ → .⟨E,T⟩][⟨E,T⟩ → ∗ F.⟨E,T⟩][⟨E,T⟩ → .⟨E,E⟩][⟨E,E⟩ → .]  #
[S → .E][E → Id.⟨E,F⟩][⟨E,F⟩ → .⟨E,T⟩][⟨E,T⟩ → ∗ F.⟨E,T⟩][⟨E,T⟩ → ⟨E,E⟩.]          #
[S → .E][E → Id.⟨E,F⟩][⟨E,F⟩ → .⟨E,T⟩][⟨E,T⟩ → ∗ F ⟨E,T⟩.]                         #
[S → .E][E → Id.⟨E,F⟩][⟨E,F⟩ → ⟨E,T⟩.]                                             #
[S → .E][E → Id ⟨E,F⟩.]                                                            #
[S → E.]                                                                           #

Output:
(S → E) (E → Id ⟨E,F⟩) (⟨E,F⟩ → ⟨E,T⟩) (⟨E,T⟩ → ∗ F ⟨E,T⟩) (F → Id ⟨F,F⟩)
(⟨F,F⟩ → ε) (⟨E,T⟩ → ⟨E,E⟩) (⟨E,E⟩ → ε)

Table 3.4. Parser run for the input Id ∗ Id#.
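The driver loop of such a parser is equally simple in code. The following sketch is our own (all names are assumptions) and simplifies the item-based machine: the stack holds grammar symbols instead of items, which produces the same leftmost derivation. For brevity, the hard-wired table is the LL(1) table of grammar G2 from Example 3.2.13, not the item-based Table 3.3.

def ll1_parse(table, terminals, start, tokens):
    tokens = list(tokens) + ["#"]
    stack, pos, output = [start], 0, []
    while stack:
        top = stack.pop()
        if top in terminals:                   # compare and consume
            if tokens[pos] != top:
                raise SyntaxError(f"expected {top}, saw {tokens[pos]}")
            pos += 1
        else:                                  # expand via the table
            rhs = table.get((top, tokens[pos]))
            if rhs is None:
                raise SyntaxError(f"no entry for ({top}, {tokens[pos]})")
            output.append((top, rhs))          # emit the chosen production
            stack.extend(reversed(rhs))        # leftmost symbol on top
    if tokens[pos] != "#":
        raise SyntaxError("input not fully consumed")
    return output

TERMINALS = {"Id", "+", "*", "(", ")", "#"}
TABLE = {                                      # LL(1) table for G2
    ("S", "Id"): ["E"],        ("S", "("): ["E"],
    ("E", "Id"): ["T", "E'"],  ("E", "("): ["T", "E'"],
    ("E'", "+"): ["+", "E"],   ("E'", ")"): [], ("E'", "#"): [],
    ("T", "Id"): ["F", "T'"],  ("T", "("): ["F", "T'"],
    ("T'", "*"): ["*", "T"],   ("T'", "+"): [], ("T'", ")"): [], ("T'", "#"): [],
    ("F", "Id"): ["Id"],       ("F", "("): ["(", "E", ")"],
}
for prod in ll1_parse(TABLE, TERMINALS, "S", ["Id", "*", "Id"]):
    print(prod)                                # a leftmost derivation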

Our construction of LL(k) parsers is only applicable to strong LL(k) grammars. This restriction,
however, is not really severe:

• The case occurring most often in practice is k = 1, and each LL(1) grammar is a strong
LL(1) grammar.
• If a lookahead k > 1 is needed and the grammar is LL(k) but not strong LL(k), a general
transformation can be applied that converts the grammar into a strong LL(k) grammar accepting
the same language (see Exercise 7).

We therefore do not describe a parsing method for arbitrary LL(k) grammars.

3.3.5 LL Parsers for Right-regular Context-free Grammars

Left-recursive nonterminals destroy the LL property of context-free grammars. Left recursion is mostly
used to describe sequences and lists of syntactic objects, such as parameter lists and sequences of operands
connected by an associative operator. These can also be described by regular expressions. We therefore want
to offer the greatest descriptive comfort by admitting regular expressions on the right sides of productions.

A right-regular context-free grammar is a tuple G = (VN , VT , p, S), where VN , VT , S are, as usual, the
set of nonterminals, the set of terminals, and the start symbol, while p : VN → RA is a function from the
set of nonterminals into the set RA of regular expressions over VN ∪ VT . A pair (X, r) with p(X) = r
is written as X → r.
Example 3.3.8
A right-regular context-free grammar for arithmetic expressions is

Ge = ({S, E, T, F }, {id, (, ), +, −, ∗, /}, p, S),

where p is the following function ('{' and '}' are used as meta-characters to avoid confusion with the
terminal symbols '(' and ')'):
S →E
E → T {{+ | −} T }∗
T → F {{∗ | /} F }∗
F → (E) | id ⊓ ⊔

Definition 3.3.2 (regular derivation)
Let G be a right-regular context-free grammar. The relation =⇒R,lm on RA, directly derives
leftmost-regular, is defined by:

(a) w X β =⇒R,lm w α β                   with α = p(X)
(b) w (r1 | . . . | rn ) β =⇒R,lm w ri β   for 1 ≤ i ≤ n
(c) w (r)∗ β =⇒R,lm w β
(d) w (r)∗ β =⇒R,lm w r (r)∗ β

Let =⇒∗R,lm be the reflexive, transitive closure of =⇒R,lm . The language defined by G is
L(G) = {w ∈ VT∗ | S =⇒∗R,lm w}. ⊓⊔

Example 3.3.9
A regular leftmost derivation for the word id + id ∗ id of grammar Ge of Example 3.3.8 is (every step
is a step of =⇒R,lm ):

S =⇒ E =⇒ T {{+|−}T }∗
      =⇒ F {{∗|/}F }∗{{+|−}T }∗
      =⇒ {(E)|id}{{∗|/}F }∗{{+|−}T }∗
      =⇒ id{{∗|/}F }∗{{+|−}T }∗
      =⇒ id{{+|−}T }∗
      =⇒ id{+|−}T {{+|−}T }∗
      =⇒ id + T {{+|−}T }∗
      =⇒ id + F {{∗|/}F }∗{{+|−}T }∗
      =⇒ id + {(E)|id}{{∗|/}F }∗{{+|−}T }∗
      =⇒ id + id{{∗|/}F }∗{{+|−}T }∗
      =⇒ id + id{∗|/}F {{∗|/}F }∗{{+|−}T }∗
      =⇒ id + id ∗ F {{∗|/}F }∗{{+|−}T }∗
      =⇒ id + id ∗ {(E)|id}{{∗|/}F }∗{{+|−}T }∗
      =⇒ id + id ∗ id{{∗|/}F }∗{{+|−}T }∗
      =⇒ id + id ∗ id{{+|−}T }∗
      =⇒ id + id ∗ id ⊓⊔
Our goal is to develop an RLL parser, that is, a deterministic top-down parser for right-regular
context-free grammars. This is the method of choice for implementing a parser by hand as long as no
powerful and comfortable tools offer an attractive alternative.
The RLL parser will produce a regular leftmost derivation for any correct input word. Looking at
the definition above makes clear that the case of expansion (a)—a nonterminal is replaced by its only
right side—is no longer critical. Instead, the cases (b), (c), and (d) need to be made deterministic.
We call a parser for a right-regular context-free grammar an RLL(1) parser if, based on the next
symbol of the remaining input, it
• can decide on the correct alternative for each regular left-sentential form w(r1 | . . . | rn )β, and
• can decide on the continuation or the termination of the iteration for each regular left-sentential
  form w(r)∗ β.
We transfer some notions to the case of right-regular context-free grammars.
Definition 3.3.3 (regular subexpression)
Each ri , 1 ≤ i ≤ n, is a direct regular subexpression of (r1 | . . . | rn ) and of (r1 . . . rn ); r is a direct
regular subexpression of (r)∗ and of (r)+ . r1 is a regular subexpression of r2 if r1 = r2 , or if r1 is a direct
regular subexpression of r2 , or if r1 is a regular subexpression of a direct regular subexpression of r2 . ⊓⊔
Definition 3.3.4 (extended context-free item)
A tuple (X, α, β, γ) is an extended context-free item of a right-regular context-free grammar G =
(VN , VT , p, S) if X ∈ VN , α, β, γ ∈ (VN ∪ VT ∪ {(, ), ∗ , |, ε})∗ , p(X) = βαγ, and α is a regular
subexpression of βαγ. This item is written as [X → β.αγ]. ⊓⊔
The realization of an RLL(1) parser for a right-regular context-free grammar again uses first1 - and follow1 -sets,
this time of regular subexpressions of the right sides of productions.

first1 - and follow1 -Computation for Right-regular Context-free Grammars

The computation of first1 - and follow1 -sets for right-regular context-free grammars can again be
represented as a pure union-problem and can, therefore, be solved efficiently. As in the conventional
case, it starts with the computation of ε-productivity. The equations for ε-productivity are defined
over the structure of regular expressions. The ε-productivity of a right side transfers to the nonterminal
of its left side.

eps(a)                 = false, for a ∈ VT
eps(ε)                 = true
eps(r∗ )               = true
eps(X)                 = eps(r), if p(X) = r for X ∈ VN                        (eps)
eps((r1 | . . . |rn ))   = eps(r1 ) ∨ . . . ∨ eps(rn )
eps((r1 . . . rn ))      = eps(r1 ) ∧ . . . ∧ eps(rn )

Example 3.3.10 (Continuation of Example 3.3.8)
For all nonterminals X of Ge , eps(X) = false. ⊓⊔
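
The eps-equations can be solved by a straightforward least fixed-point iteration. The following Java sketch is one possible rendering; the Regex representation (Sym, Eps, Alt, Seq, Star) is an assumption made purely for illustration.

import java.util.*;

/* Sketch: least fixed-point computation of eps over regular right sides. */
interface Regex {}
record Sym(String name) implements Regex {}   // terminal or nonterminal occurrence
record Eps() implements Regex {}
record Alt(List<Regex> alts) implements Regex {}
record Seq(List<Regex> factors) implements Regex {}
record Star(Regex body) implements Regex {}

class EpsSolver {
    final Map<String, Regex> p;                        // productions X -> r
    final Map<String, Boolean> eps = new HashMap<>();  // current approximation

    EpsSolver(Map<String, Regex> p) { this.p = p; }

    boolean eval(Regex r) {
        if (r instanceof Eps || r instanceof Star) return true;
        if (r instanceof Sym s)        // terminals are never eps-productive,
            return eps.getOrDefault(s.name(), false);  // hence the default false
        if (r instanceof Alt a) return a.alts().stream().anyMatch(this::eval);
        return ((Seq) r).factors().stream().allMatch(this::eval);
    }

    void solve() {                     // iterate until no eps(X) changes any more
        boolean changed = true;
        while (changed) {
            changed = false;
            for (var entry : p.entrySet())
                if (eval(entry.getValue()) && !eps.getOrDefault(entry.getKey(), false)) {
                    eps.put(entry.getKey(), true);
                    changed = true;
                }
        }
    }
}
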

After ε-productivity is computed, the ε-free first-function can be computed. This is specified by the
following equations:
eff(ε)                 = ∅
eff(a)                 = {a}
eff(r∗ )               = eff(r)
eff(X)                 = eff(r), if p(X) = r                                   (eff)
eff((r1 | . . . |rn ))   = eff(r1 ) ∪ . . . ∪ eff(rn )
eff((r1 . . . rn ))      = ⋃ {eff(rj ) | 1 ≤ j ≤ n and eps(ri ) = true for all 1 ≤ i < j}

Example 3.3.11 (Continuation of Example 3.3.8)
The eff-sets and, therefore, also the first1 -sets for the nonterminals of grammar Ge are
first1 (S) = first1 (E) = first1 (T ) = first1 (F ) = {(, id} ⊓⊔
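
Building on the EpsSolver sketch above, the eff-equations can be evaluated in the same fixed-point style. Again, this is a hypothetical illustration, reusing the assumed Regex types, rather than a fixed design.

import java.util.*;

/* Sketch: eff over the Regex types of the previous sketch. eval computes the
   right-hand sides of the eff-equations relative to the current approximation
   stored in eff; iterating it as in EpsSolver.solve() yields the least solution. */
class EffSolver {
    final Map<String, Regex> p;
    final Set<String> terminals;
    final EpsSolver epsSolver;                           // assumed already solved
    final Map<String, Set<String>> eff = new HashMap<>();

    EffSolver(Map<String, Regex> p, Set<String> terminals, EpsSolver epsSolver) {
        this.p = p; this.terminals = terminals; this.epsSolver = epsSolver;
    }

    Set<String> eval(Regex r) {
        if (r instanceof Eps) return Set.of();
        if (r instanceof Star st) return eval(st.body());
        if (r instanceof Sym s)
            return terminals.contains(s.name())
                 ? Set.of(s.name())                      // eff(a) = {a}
                 : eff.getOrDefault(s.name(), Set.of()); // eff(X), current value
        Set<String> u = new HashSet<>();
        if (r instanceof Alt a) {
            for (Regex ri : a.alts()) u.addAll(eval(ri)); // union over alternatives
        } else {
            for (Regex rj : ((Seq) r).factors()) {        // take eff(rj) as long as
                u.addAll(eval(rj));                       // all earlier factors are
                if (!epsSolver.eval(rj)) break;           // epsilon-productive
            }
        }
        return u;
    }
}
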

ε-productivity and the ε-free first-function can thus be defined recursively over the structure of regular
expressions. The first1 -set of a regular expression is independent of the context in which it occurs.
This is different for the follow1 -set; two different occurrences of a regular (sub-) expression have
in general different follow1 -sets. In realizing RLL(1) parsers, we are interested in the follow1 -sets of
occurrences of regular (sub-) expressions. A particular occurrence of a regular expression in a right side
corresponds to exactly one extended regular item in which the dot is positioned in front of this regular
expression. The following equations for follow1 assume that concatenations and lists of alternatives are
surrounded on the outside by parentheses, but have no superfluous parentheses inside.

(1) follow1 ([S ′ → .S]) = {#}        (the eof symbol '#' follows each input word)
(2) follow1 ([X → · · · (r1 | · · · |.ri | · · · |rn ) · · · ]) =
        follow1 ([X → · · · .(r1 | · · · |ri | · · · |rn ) · · · ])    for 1 ≤ i ≤ n
(3) follow1 ([X → · · · (· · · .ri ri+1 · · · ) · · · ]) =
        eff(ri+1 ) ∪ { follow1 ([X → · · · (· · · ri .ri+1 · · · ) · · · ])   if eps(ri+1 ) = true
                     { ∅                                                      otherwise
(4) follow1 ([X → · · · (r1 · · · rn−1 .rn ) · · · ]) =                              (follow1 )
        follow1 ([X → · · · .(r1 · · · rn−1 rn ) · · · ])
(5) follow1 ([X → · · · (.r)∗ · · · ]) =
        eff(r) ∪ follow1 ([X → · · · .(r)∗ · · · ])
(6) follow1 ([X → .r]) = ⋃ follow1 ([Y → · · · .X · · · ]), the union taken over all
        occurrences of X in right sides

Example 3.3.12 (Continuation of Example 3.3.8)
The follow1 -sets for some items of grammar Ge are:

follow1 ([S → .E]) = {#}

follow1 ([E → T.{{+|−}T }∗])
    =(4) follow1 ([E → .T {{+|−}T }∗])
    =(6) follow1 ([S → .E]) ∪ follow1 ([F → (.E)])
    =(3) {#} ∪ {)} = {), #}

follow1 ([T → F.{{∗|/}F }∗]) = {+, −, ), #} ⊓⊔

To compute the solutions for eff and follow1 as efficiently as possible, that is, in linear time, these
equation systems need to be brought into the form

f (X) = g(X) ∪ ⋃ {f (Y ) | X R Y }

with a known set-valued function g and a binary relation R.
In the computation of eff, the base set of R, and thus the set of nodes of the directed graph induced
by R, is the set of regular (sub-) expressions occurring in the productions. A directed edge from X to Y
exists if and only if either Y is a direct subexpression of X that contributes to the first1 -set of X, or X is
a nonterminal (occurrence) and Y its right side. The function g is non-empty only for terminal symbols.
In the computation of follow1 , the base set is the set of extended items, and the relation R associates
with an item i those items j that contribute to the follow1 -set of i. The function g is defined using the
already computed eff-sets.
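
As an illustration of this form, the following Java sketch computes the least solution of such a pure union problem by naive iteration; a linear-time solver would additionally collapse the strongly connected components of R first. The names and the representation are assumptions.

import java.util.*;

/* Sketch: least solution of f(X) = g(X) ∪ ⋃{ f(Y) | X R Y }.
   'edges' encodes R as a successor map: X R Y iff edges.get(X) contains Y. */
class UnionProblemSolver<N, V> {
    final Map<N, Set<V>> f = new HashMap<>();
    final Map<N, Set<N>> edges;

    UnionProblemSolver(Map<N, Set<V>> g, Map<N, Set<N>> edges) {
        g.forEach((x, s) -> f.put(x, new HashSet<>(s)));  // start from f(X) = g(X)
        this.edges = edges;
    }

    void solve() {
        boolean changed = true;
        while (changed) {                                 // propagate until stable
            changed = false;
            for (N x : f.keySet())
                for (N y : edges.getOrDefault(x, Set.of()))
                    if (!y.equals(x) && f.get(x).addAll(f.getOrDefault(y, Set.of())))
                        changed = true;
        }
    }
}
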
Definition 3.3.5 (RLL(1)-grammar)
A right-regular context-free grammar G = (VN , VT , p, S) is called an RLL(1) grammar if for all
extended context-free items [X → · · · .(r1 | · · · |rn ) · · · ]:

(first1 (ri ) ⊕1 follow1 ([X → · · · .(r1 | · · · |rn ) · · · ])) ∩
(first1 (rj ) ⊕1 follow1 ([X → · · · .(r1 | · · · |rn ) · · · ])) = ∅   for all i ≠ j,

and for all extended context-free items [X → · · · .(r)∗ · · · ]:

first1 (r) ∩ follow1 ([X → · · · .(r)∗ · · · ]) = ∅   and   eps(r) = false. ⊓⊔
Once the first1 - and follow1 -sets for a right-regular context-free grammar are computed, and the
check for the RLL(1)-property has been successful, an RLL(1) parser for the grammar can be generated.
Two different representations are popular. The first consists of a driver, fixed for all grammars, and a
table specifically generated for each grammar. The driver indexes the table with the actual item and the
next input symbol, more precisely, some integer codes for these two. The selected entry in the table
indicates the next item or signals a syntax error.
The second representation is a program. This program consists essentially of a set of mutually
recursive procedures, one per nonterminal. The procedure for nonterminal X is in charge of analyzing
words for X. We first introduce the table version of RLL(1) parsers.

RLL(1) Parser for Right-regular Context-free Grammars (Table Version)

The RLL(1) parser is a deterministic pushdown automaton. The parser table M represents a selection
function m : ItG × VT → ItG ∪ {error}. The parser table is consulted when a decision has to be taken
by looking ahead into the remaining input. Therefore, M has only rows for
• items in which an alternative needs to be chosen, and
• items in which an iteration needs to be processed;
i.e., the function m is defined for items of the form [X → · · · .(r1 | · · · |rn ) · · · ] and of the form
[X → · · · .(r)∗ · · · ].

The RLL(1) parser is started in the initial configuration (# [S ′ → .S], w#). The actual item, the topmost
on the stack, determines whether the parser table must be consulted. If so, M [ρ, a] – if different from
error – indicates the next item for the actual item ρ and the actual input symbol a. If M [ρ, a] = error,
a syntax error has been discovered. In the configuration (# [S ′ → S.], #), the parser accepts the input word.
The other transitions are:

δ([X → · · · .a · · · ], a) = [X → · · · a. · · · ]
δ([X → · · · .Y · · · ], ε) = [X → · · · .Y · · · ][Y → .p(Y )]
δ([X → · · · .Y · · · ][Y → p(Y ).], ε) = [X → · · · Y. · · · ]

In addition, there would be some transitions, for example from [X → · · · (· · · |ri .| · · · ) · · · ] to [X →
· · · (· · · |ri | · · · ). · · · ], which neither read symbols, nor expand nonterminals, nor reduce to nontermi-
nals. They can be avoided by modifying the transition function in the following way:

(1) [X → · · · (· · · |ri .| · · · ) · · · ]   ⇒   (2) [X → · · · (· · · |ri | · · · ). · · · ]
(3) [X → · · · (r.)∗ · · · ]                ⇒   (4) [X → · · · .(r)∗ · · · ]
(5) [X → · · · .(r1 · · · rn ) · · · ]       ⇒   (6) [X → · · · (.r1 · · · rn ) · · · ]

If a transition of δ leads to an item of form (1), it is made to lead to the item of form (2) instead; if it
leads to (3), it is made to lead to (4); and from (5) it leads directly to (6).
We now present the algorithm for the generation of RLL(1) parser tables.
Algorithm RLL(1)-GEN
Input: an RLL(1)-grammar G, first1 and follow1 for G.
Output: the parser table M of an RLL(1) parser for G.
Method: For all items of the form [X → · · · .(r1 | · · · |rn ) · · · ] set
M ([X → · · · .(r1 | · · · |rn ) · · · ], a) = [X → · · · (· · · |.ri | · · · ) · · · ] for a ∈ first1 (ri ), and, if in
addition ε ∈ first1 (ri ), then also for a ∈ follow1 ([X → · · · .(r1 | · · · |rn ) · · · ]).
For all items of the form [X → · · · .(r)∗ · · · ] set

M ([X → · · · .(r)∗ · · · ], a) = { [X → · · · (.r)∗ · · · ]   if a ∈ first1 (r)
                                { [X → · · · (r)∗ . · · · ]   if a ∈ follow1 ([X → · · · .(r)∗ · · · ])

Set all not yet filled entries to error.
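
A compact rendering of RLL(1)-GEN in Java might look as follows; Item stands for an extended context-free item, and all first1 /follow1 arguments are assumed to be precomputed. This is a sketch under these assumptions, not a complete generator.

import java.util.*;

/* Sketch of RLL(1)-GEN: filling the table M for the two kinds of decision items.
   Absent entries represent error. */
class Rll1TableGen {
    final Map<Item, Map<String, Item>> M = new HashMap<>();

    /* altItem = [X -> ...(r1|...|rn)...]; dottedAlts.get(i) = [X -> ...(..|.ri|..)...] */
    void enterAlternatives(Item altItem, List<Item> dottedAlts,
                           List<Set<String>> first1OfAlts, Set<String> follow1OfAltItem) {
        for (int i = 0; i < dottedAlts.size(); i++) {
            Set<String> sel = new HashSet<>(first1OfAlts.get(i));
            if (sel.remove("")) sel.addAll(follow1OfAltItem);  // "" encodes epsilon
            for (String a : sel) enter(altItem, a, dottedAlts.get(i));
        }
    }

    /* starItem = [X -> ....(r)*...]; enterBody = [X -> ...(.r)*...];
       leaveLoop = [X -> ...(r)*....] */
    void enterIteration(Item starItem, Item enterBody, Item leaveLoop,
                        Set<String> first1OfBody, Set<String> follow1OfStarItem) {
        for (String a : first1OfBody) enter(starItem, a, enterBody);
        for (String a : follow1OfStarItem) enter(starItem, a, leaveLoop);
    }

    private void enter(Item q, String a, Item next) {
        if (M.computeIfAbsent(q, k -> new HashMap<>()).put(a, next) != null)
            throw new IllegalStateException("grammar is not RLL(1)");
    }
}

record Item(String display) {}  // placeholder for an extended context-free item
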

Example 3.3.13 (Continuation of Example 3.3.8)
The parser table for grammar Ge (rows and columns are exchanged for layout reasons):

      [E → T.{{+|−}T }∗]       [T → F.{{∗|/}F }∗]

+     [E → T {{. + |−}T }∗]    [T → F {{∗|/}F }∗.]
−     [E → T {{+|.−}T }∗]      [T → F {{∗|/}F }∗.]
#     [E → T {{+|−}T }∗.]      [T → F {{∗|/}F }∗.]
)     [E → T {{+|−}T }∗.]      [T → F {{∗|/}F }∗.]
∗     error                    [T → F {{. ∗ |/}F }∗]
/     error                    [T → F {{∗|./}F }∗]

Note that the construction of the table uses compression; from the item [E → T.{{+|−}T }∗] a direct
transition under + into the item [E → T {{. + |−}T }∗] was entered, and analogously for − and for
the item [T → F.{{∗|/}F }∗] under ∗ and /. Thereby, all items of the form [E → T {.{+|−}T }∗]
and [T → F {.{∗|/}F }∗] can be eliminated, and at compile time, the corresponding transitions are
saved. ⊓⊔

Recursive descent RLL(1) Parser (Program Version)


A popular implementation method for RLL(1) parsers is in the form of a program. This implementation
can be automatically generated from an RLL(1)-grammar and its first1 - and follow1 -sets, but it can also
be written in the programming language of one's choice. The latter is the implementation method of choice
as long as no generator tool is available.
Let a right-regular context-free grammar G = (VN , VT , p, S) with VN = {X0 , . . . , Xn }, S = X0 ,
p = {X0 ↦ α0 , X1 ↦ α1 , . . . , Xn ↦ αn } be given. We present recursive functions p_progr and
progr that generate a so-called recursive-descent parser from the grammar G and the computed first1 -
and follow1 -sets.
For each production, that is, for each nonterminal X, a procedure with the name X is
generated. The constructors for regular expressions on the right sides are translated into programming-
language constructs such as switch-, while-, and do-while-statements, into checks for terminal symbols,
and into recursive calls of the procedures for nonterminals. The first1 - and follow1 -sets of occurrences of
regular expressions are needed, for instance, to select the right one of several alternatives. Such an oc-
currence of a regular (sub-) expression corresponds exactly to an extended context-free item. The func-
tion progr is, therefore, defined recursively over the structure of context-free items of the grammar G.
The following function FiFo is used in the case distinction for alternatives: FiFo([X → · · · .β · · · ]) =
first1 (β) ⊕1 follow1 ([X → · · · .β · · · ]).
struct symbol nextsym;

/* Stores the next input symbol in nextsym */
void scan();

/* Prints the error message and
   stops the run of the parser */
void error(String errorMessage);

/* Announces the end of the analysis and
   stops the run of the parser */
void accept();

/* Translating the input grammar */
p_progr(X0 → α0);
p_progr(X1 → α1);
    ...
p_progr(Xn → αn);

void parser() {
    scan();
    X0();
    if (nextsym == "#")
        accept();
    else
        error("...");
}

/* For all rules like this ... */
p_progr(X → α)

/* ... we create a corresponding method like this. */
void X() {
    progr([X → .α]);
}

void progr([X → · · · .(α1 |α2 | · · · |αk−1 |αk ) · · · ]) {
    switch () {
        case (FiFo([X → · · · (.α1 |α2 | · · · |αk−1 |αk ) · · · ]).contains(nextsym)):
            progr([X → · · · (.α1 |α2 | · · · |αk−1 |αk ) · · · ]);
            break;
        case (FiFo([X → · · · (α1 |.α2 | · · · |αk−1 |αk ) · · · ]).contains(nextsym)):
            progr([X → · · · (α1 |.α2 | · · · |αk−1 |αk ) · · · ]);
            break;
            ...
        case (FiFo([X → · · · (α1 |α2 | · · · |.αk−1 |αk ) · · · ]).contains(nextsym)):
            progr([X → · · · (α1 |α2 | · · · |.αk−1 |αk ) · · · ]);
            break;
        default:
            progr([X → · · · (α1 |α2 | · · · |αk−1 |.αk ) · · · ]);
    }
}

void progr([X → · · · .(α1 α2 · · · αk ) · · · ]) {
    progr([X → · · · (.α1 α2 · · · αk ) · · · ]);
    progr([X → · · · (α1 .α2 · · · αk ) · · · ]);
        ...
    progr([X → · · · (α1 α2 · · · .αk ) · · · ]);
}

void progr([X → · · · .(α)∗ · · · ]) {
    while (first1 (α).contains(nextsym)) {
        progr([X → · · · .α · · · ]);
    }
}

void progr([X → · · · .(α)+ · · · ]) {
    do {
        progr([X → · · · .α · · · ]);
    } while (first1 (α).contains(nextsym));
}

void progr([X → · · · .ε · · · ]) {}

For a ∈ VT :
void progr([X → · · · .a · · · ]) {
    if (nextsym == a)
        scan();
    else
        error("...");
}

For Y ∈ VN :
void progr([X → · · · .Y · · · ]) = void Y()
How does such a parser work? Procedure X for a nonterminal X is in charge of recognizing words
for X. When it is called, the first symbol of the word to recognize has already been read by the combi-
nation scanner/screener, that is, by the procedure scan. When procedure X has found a word for X and
returns, it has already read the symbol following the found word.
The next section describes several modifications for the handling of syntax errors.
We now present the recursive-descent parser for the right-regular context-free grammar Ge for
arithmetic expressions.
Example 3.3.14 (Continuation of Example 3.3.8)
The following parser results from the schematic translation of the extended expression grammar. For
terminal symbols, their string representation is used.
symbol nextsym;

/* Returns the next input symbol */
symbol scan();

/* Prints the error message and
   stops the run of the parser */
void error(String errorMessage);

/* Announces the end of the analysis and
   stops the run of the parser */
void accept();

void S() {
    E();
}

void E() {
    T();
    while (nextsym == "+" || nextsym == "−") {
        switch (nextsym) {
            case "+":
                if (nextsym == "+")
                    scan();
                else
                    error("+ expected");
                break;
            default:
                if (nextsym == "−")
                    scan();
                else
                    error("− expected");
        }
        T();
    }
}

void T() {
    F();
    while (nextsym == "∗" || nextsym == "/") {
        switch (nextsym) {
            case "∗":
                if (nextsym == "∗")
                    scan();
                else
                    error("∗ expected");
                break;
            default:
                if (nextsym == "/")
                    scan();
                else
                    error("/ expected");
        }
        F();
    }
}

void F() {
    switch (nextsym) {
        case "(":
            scan();         /* read the opening parenthesis */
            E();
            if (nextsym == ")")
                scan();
            else
                error(") expected");
            break;          /* do not fall through to the id case */
        default:
            if (nextsym == "id")
                scan();
            else
                error("id expected");
    }
}

void parser() {
    scan();
    S();
    if (nextsym == "#")
        accept();
    else
        error("# expected");
}
Some inefficiencies result from the schematic generation of this parser program. A more sophisti-
cated generation scheme will avoid most of these inefficiencies.

3.4 Bottom-up Syntax Analysis


3.4.1 Introduction

Bottom-up parsers read their input like top-down parsers from left to right. They are pushdown automata
that can essentially do two kinds of operations:
• Read the next input symbol (shift), and
• Reduce the right side of a production X → α at the top of the stack by the left side X of the
production (reduce).
Because of these operations they are called shift-reduce parsers. Shift-reduce parsers are right parsers;
they output the application of a production when they perform a reduction. The result of the successful
analysis of an input word is a rightmost derivation in reverse order, because shift-reduce parsers always
reduce at the top of the stack.
A shift-reduce parser must never miss a required reduction, that is, cover it in the stack by a newly
read input symbol. A reduction is required, if no rightmost derivation to the start symbol is possible
without it. A right side covered by an input symbol will never reappear at the top of the stack and
can, therefore, never be reduced. A right side at the top of the stack that must be reduced to obtain a
derivation is called a handle.
Not all occurrences of right sides that appear at the top of the stack are handles. Some reductions
performed at the top of the stack lead into dead ends, that is, they cannot be continued to a reverse
rightmost derivation although the input is correct.
Example 3.4.1 Let G0 be again the grammar for arithmetic expressions with the productions:
S → E
E → E+T |T
T → T ∗F |F
F → (E) | Id
Table 3.5 shows a successful bottom-up analysis of the word Id ∗ Id of G0 . The third column lists
actions that were also possible, but would lead into dead ends. In the third step, the parser would miss
a required reduction. In the other two steps, the alternative reductions would lead into dead ends, that
is, not to right sentential forms. ⊓⊔

Stack       Input       Erroneous alternative actions

            Id ∗ Id
Id          ∗ Id
F           ∗ Id        reading of ∗ misses a required reduction
T           ∗ Id        reduction of T to E leads into a dead end
T ∗         Id
T ∗ Id
T ∗ F                   reduction of F to T leads into a dead end
T
E
S

Table 3.5. A successful analysis of the word Id ∗ Id together with potential dead ends.

Bottom-up parsers construct the parse tree from the bottom up. They start with the leaf word of the
parse tree, the input word, and construct subtrees of the parse tree for ever larger parts of the read input:
upon a reduction by a production X → α, they attach the subtrees for the right side α below a newly
created node labeled X. The analysis is successful if a parse tree with root label S, the start
symbol of the grammar, has been constructed for the whole input word.
Fig. 3.13 shows some snapshots during the construction of the parse tree according to the derivation
shown in Table 3.5. The tree on the left contains all nodes that can be created when the input Id has
been read. The sequence of three trees in the middle represents the state before the handle T ∗ F is
reduced, while the tree on the right shows the complete parse tree.

3.4.2 LR(k) Parsers


This section presents the most powerful deterministic parsing method that works bottom-up: LR(k)
analysis. The letter L says that the parsers of this class read their input from left to right, the R
characterizes them as right parsers, and k is the length of the considered lookahead.


Fig. 3.13. Construction of the parse tree after reading the first symbol, Id, together with the remaining input, before
the reduction of the handle T ∗ F , and the complete parse tree.



We start again with the item-pushdown automaton PG for a context-free grammar G and transform
it into a shift-reduce parser. Let us look back at what we did in the case of top-down analysis. Sets of
lookahead words were computed from the grammar, which were used to select the right alternative for
a nonterminal at expansion transitions of PG . So, the LL(k) parser decides about the alternative for
a nonterminal at the earliest possible time, when the nonterminal has to be expanded. LR(k) parsers
follow a different strategy; they pursue all possibilities to expand and to read in parallel.
A decision has to be taken when one of the possibilities to continue asks for a reduction. What is
there to decide? There could be several productions by which to reduce, and a shift could be possible
in addition to a reduction. The parser uses the next k symbols to take its decision.
In this section, first an LR(0) parser is developed, which does not yet consider any lookahead.
Section 3.4.3 presents the canonical LR(k) parser as well as less powerful variants of LR(k),
which are often powerful enough for practice. Finally, Section 3.4.4 describes an error
recovery method for LR(k). Note that all context-free grammars are assumed to be reduced, that is,
freed of non-productive and unreachable nonterminals, and extended by a new start symbol.

The Characteristic Finite-state Machine of a Context-free Grammar

We attempt to represent PG by a non-deterministic finite-state machine, its characteristic finite-state
machine char(G). Since PG is a pushdown automaton, this cannot work without additions: a specifica-
tion of actions on the stack is necessary. These are associated with some states and some transitions of
char(G).
Our goal is to arrive at a pushdown automaton that pursues all potential expansion and read tran-
sitions of the item pushdown-automaton in parallel and decides only at a reduction which production
to select. We define the characteristic finite-state machine char(G) for a reduced context-free
grammar G. The states of the characteristic finite-state machine char(G) are the items [A → α.β] of the
grammar G, that is, the states of the item pushdown-automaton PG . The set of input symbols of the
characteristic finite-state machine char(G) is VT ∪ VN ; its initial state is the start item [S ′ → .S] of the
item pushdown-automaton PG . The final states of the characteristic finite-state machine are the complete
items [X → α.]. Such a final state signals that the word just read corresponds to a stack contents of
the item pushdown-automaton in which a reduction with the production X → α can be performed. The
transition relation ∆ of the characteristic finite-state machine consists of the transitions:

([X → α.Y β], ε, [Y → .γ])     for X → αY β ∈ P,  Y → γ ∈ P
([X → α.Y β], Y, [X → αY.β])   for X → αY β ∈ P,  Y ∈ VN ∪ VT

Reading a terminal symbol a in char(G) corresponds to a shift transition of the item pushdown-
automaton under a. The ε transitions of char(G) correspond to the expansion transitions of the item
pushdown-automaton. When char(G) reaches a final state [X → α.], PG undertakes the following
actions: it removes the item [X → α.] on top of its stack and makes a transition under X from the new
state that then appears on top of the stack. This is a reduction move of the item pushdown-automaton
PG .
Example 3.4.2 Let G0 again be the grammar for arithmetic expressions with the productions

S → E
E → E+T |T
T → T ∗F |F
F → (E) | Id

Fig. 3.14 shows the characteristic finite-state machine to grammar G0 . ⊓



Fig. 3.14. The characteristic finite-state machine char(G0 ) for the grammar G0 .

The following theorem clarifies the exact relation between the characteristic finite-state machine and
the item pushdown automaton:
Theorem 3.4.1 Let G be a context-free grammar and γ ∈ (VT ∪ VN )∗ . The following three statements
are equivalent:

1. There exists a computation ([S ′ → .S], γ) ⊢char(G) ([A → α.β], ε) of the characteristic finite-state
machine char(G).

2. There exists a computation (ρ [A → α.β], w) ⊢P ([S ′ → S.], ε) of the item pushdown-automaton
G
PG such that γ = hist(ρ) α holds.

3. There exists a rightmost derivation S ′ =⇒ γ ′ Aw =⇒ γ ′ αβw with γ = γ ′ α. ⊓⊔
rm rm

The equivalence of statements (1) and (2) means that words that lead to an item of the characteristic
finite-state machine char(G) are exactly the histories of stack contents of the item pushdown-automaton
PG whose topmost symbol is this item and from which PG can reach one of its final states assuming
appropriate input w. The equivalence of statements (2) and (3) means that an accepting computation of
3.4 Bottom-up Syntax Analysis 83

the item pushdown-automaton for an input word w that starts with a stack contents ρ corresponds to a
rightmost derivation that leads to a sentential form αw where α is the history of the stack contents ρ.
We introduce some terminology before we prove Theorem 3.4.1. For a rightmost derivation
S ′ =⇒∗rm γ ′ Av =⇒rm γ ′ αv and a production A → α, we call α the handle of the right sentential form γ ′ αv.
If the right side is decomposed as α = α′ β, the prefix γ = γ ′ α′ is called a reliable prefix of G for the item [A → α′ .β].
The item [A → α′ .β] is then valid for γ. Theorem 3.4.1 thus means that the set of words under which the
characteristic finite-state machine reaches an item [A → α′ .β] is exactly the set of reliable prefixes for
this item.
Example 3.4.3 For the grammar G0 we have:

right sentential form   handle   reliable prefixes    reason
E+F                     F        E, E+, E + F         S =⇒rm E =⇒rm E + T =⇒rm E + F
T ∗ Id                  Id       T, T ∗, T ∗ Id       S =⇒∗rm T ∗ F =⇒rm T ∗ Id


In a non-ambiguous grammar, the handle of a right sentential form is the uniquely determined word that
the bottom-up parser should replace by a nonterminal in the next reduction step to arrive at a rightmost
derivation. A reliable prefix is a prefix of a right sentential form that does not properly extend beyond
the handle.
Example 3.4.4 We give two reliable prefixes of G0 and some items that are valid for them.

reliable prefix   valid item        reason
E+                [E → E + .T]      S =⇒∗rm E =⇒rm E + T
                  [T → .F]          S =⇒∗rm E + T =⇒rm E + F
                  [F → .Id]         S =⇒∗rm E + F =⇒rm E + Id
(E + (            [F → (.E)]        S =⇒∗rm (E + F) =⇒rm (E + (E))
                  [T → .F]          S =⇒∗rm (E + (T)) =⇒rm (E + (F))
                  [F → .Id]         S =⇒∗rm (E + (F)) =⇒rm (E + (Id))


If, in the attempt to construct a rightmost derivation for a word, the prefix u of the word has been reduced
to a reliable prefix γ, then each item [X → α.β] that is valid for γ describes one possible interpretation of
the analysis situation. Thus, there is a rightmost derivation in which γ is a prefix of a right sentential form
and X → αβ is one of the possibly just processed productions. All such productions are candidates for
later reductions.
Consider the rightmost derivation

S ′ =⇒∗rm γAw =⇒rm γαβw

It should be extended to a rightmost derivation of a terminal word. This requires that

1. β is derived to a terminal word v, and after that,
2. α is derived to a terminal word u.

Altogether,

S ′ =⇒∗rm γAw =⇒rm γαβw =⇒∗rm γαvw =⇒∗rm γuvw =⇒∗rm xuvw

We now consider this rightmost derivation in the direction of reduction, that is, in the direction in which
a bottom-up parser constructs it. First, x is reduced to γ in a number of steps, then u to α, then v to
β. The valid item [A → α.β] for the reliable prefix γα describes the analysis situation in which the
reduction of u to α has already been done, while the reduction of v to β has not yet started. A possible
long-range goal in this situation is the application of the production A → αβ.

We come back to the question which language is accepted by the characteristic finite-state machine
char(G). Theorem 3.4.1 says that char(G) goes under a reliable prefix into a state that is a valid item for this
prefix. Final states, i.e., complete items, are only valid for reliable prefixes where a reduction is possible
at their right ends.
Proof of Theorem 3.4.1. We give a circular proof (1) ⇒ (3) ⇒ (2) ⇒ (1). Let us first assume that
([S ′ → .S], γ) ⊢∗char(G) ([A → α.β], ε). By induction over the number n of ε transitions in this computation
we construct a rightmost derivation S ′ =⇒∗rm γ ′ Aw =⇒rm γ ′ αβw with γ = γ ′ α.
If n = 0, then γ = ε and [A → α.β] = [S ′ → .S]. Since S ′ =⇒∗rm S ′ holds, the claim holds in
this case. If n > 0, we consider the last ε transition. The computation of the characteristic automaton
can then be decomposed into

([S ′ → .S], γ) ⊢∗char(G) ([X → α′ .Aβ ′ ], α) ⊢char(G) ([A → .αβ], α) ⊢∗char(G) ([A → α.β], ε)

where γ = γ ′ α for the prefix γ ′ read before the last ε transition. By the induction hypothesis, there is a
rightmost derivation S ′ =⇒∗rm γ ′′ Xw′ =⇒rm γ ′′ α′ Aβ ′ w′ with γ ′ = γ ′′ α′ . Since the grammar G is reduced,
there is also a rightmost derivation β ′ =⇒∗rm v. Therefore we have

S ′ =⇒∗rm γ ′ Avw′ =⇒rm γ ′ αβw

with w = vw′ . This proves the direction (1) ⇒ (3).
Assume now that we have a rightmost derivation S ′ =⇒∗rm γ ′ Aw =⇒rm γ ′ αβw. This derivation can
be decomposed into

S ′ =⇒rm α1 X1 β1 =⇒∗rm α1 X1 v1 =⇒rm . . . =⇒∗rm (α1 . . . αn )Xn (vn . . . v1 ) =⇒rm (α1 . . . αn )αβ(vn . . . v1 )

with Xn = A. By induction on n it follows that (ρ [A → α.β], w) ⊢∗PG ([S ′ → S.], ε) holds for

ρ = [S ′ → α1 .X1 β1 ] . . . [Xn−1 → αn .Xn βn ]   and   w = v vn . . . v1 ,

provided that β =⇒∗rm v, α1 = β1 = ε, and X1 = S. This yields the implication (3) ⇒ (2).
For the last implication, consider a stack contents ρ [A → α.β] with (ρ [A → α.β], w) ⊢∗PG ([S ′ →
S.], ε). We first convince ourselves, by induction over the number of transitions in such a computation,
that ρ necessarily is of the form

ρ = [S ′ → α1 .X1 β1 ] . . . [Xn−1 → αn .Xn βn ]

for some n ≥ 0 and Xn = A. By induction on n it then follows that ([S ′ → .S], γ) ⊢∗char(G) ([A →
α.β], ε) holds for γ = α1 . . . αn α. Since γ = hist(ρ) α, claim (1) also holds. This completes the
proof. ⊓⊔

The Canonical LR(0) Automaton

In Chapter 2, we presented an algorithm which takes a non-deterministic finite-state machine and con-
structs an equivalent deterministic finite-state machine. This deterministic finite-state machine pursues
all paths in parallel which the non-deterministic automaton could take for a given input. Its states
are sets of states of the non-deterministic automaton. This subset construction is now applied to the
characteristic finite-state machine char(G) of a context-free grammar G. The resulting deterministic
finite-state machine is called the canonical LR(0) automaton for G; we denote it by LR0 (G).

Example 3.4.5 The canonical LR(0) automaton for the context-free grammar G0 of Example 3.2.2
on page 39 is obtained by the application of the subset construction to the characteristic finite-state
machine char(G0 ) of Fig. 3.14 on page 82. It is shown in Fig. 3.15 on page 85. Its states are:

S0  = { [S → .E], [E → .E + T ], [E → .T ], [T → .T ∗ F ], [T → .F ], [F → .(E)], [F → .Id] }
S1  = { [S → E.], [E → E. + T ] }
S2  = { [E → T.], [T → T. ∗ F ] }
S3  = { [T → F.] }
S4  = { [F → (.E)], [E → .E + T ], [E → .T ], [T → .T ∗ F ], [T → .F ], [F → .(E)], [F → .Id] }
S5  = { [F → Id.] }
S6  = { [E → E + .T ], [T → .T ∗ F ], [T → .F ], [F → .(E)], [F → .Id] }
S7  = { [T → T ∗ .F ], [F → .(E)], [F → .Id] }
S8  = { [F → (E.)], [E → E. + T ] }
S9  = { [E → E + T.], [T → T. ∗ F ] }
S10 = { [T → T ∗ F.] }
S11 = { [F → (E).] }
S12 = ∅



Fig. 3.15. The transition diagram of the canonical LR(0) automaton for the grammar G0 , obtained from the
characteristic finite-state machine char(G0 ) in Fig. 3.14. The error state S12 = ∅ and all transitions into it are left out.

The canonical LR(0) automaton LR0 (G) to a context-free grammar G has some interesting properties.
Let LR0 (G) = (QG , VT ∪ VN , ∆G , qG,0 , FG ), and let ∆∗G : QG × (VT ∪ VN )∗ → QG be the lifting
of the transition function ∆G from symbols to words. We then have:
1. ∆∗G (qG,0 , γ) is the set of all items in IG for which γ is a reliable prefix.
2. L(LR0 (G)) is the set of all reliable prefixes for complete items [A → α.] ∈ IG .
Reliable prefixes are prefixes of right sentential forms as they occur during the reduction of an input
word. A reduction that again leads to a right sentential form can only happen at the right end of such a
sentential form. An item valid for a reliable prefix describes one possible interpretation of the actual
analysis situation.

Example 3.4.6 E + F is a reliable prefix for the grammar G0 . The state ∆∗G0 (S0 , E + F ) = S3 is also
reached by the following reliable prefixes:

F ,        (F ,        ((F ,        (((F ,        ...
T ∗ (F ,   T ∗ ((F ,   T ∗ (((F ,   ...
E + F ,    E + (F ,    E + ((F ,    ...

The state S6 of the canonical LR(0) automaton for G0 contains all items valid for the reliable prefix
E+, namely the items

[E → E + .T ], [T → .T ∗ F ], [T → .F ], [F → .Id], [F → .(E)],

since E+ is a prefix of the right sentential forms occurring in the derivation

S =⇒rm E =⇒rm E + T =⇒rm E + F =⇒rm E + Id

Valid for E+ are, for instance, [E → E + .T ] (from E + T ), [T → .F ] (from E + F ), and
[F → .Id] (from E + Id).


The canonical LR(0) automaton LR0 (G) for a context-free grammar G is a deterministic finite-state
machine that accepts the set of reliable prefixes for complete items. In this way, it identifies positions
for reduction and, therefore, lends itself to the construction of a right parser. Instead of items (as
the item-pushdown automaton does), this parser stores on its stack states of the canonical LR(0) automaton,
that is, sets of items. The underlying pushdown automaton P0 is defined as the tuple P0 = (QG ∪
{f }, VT , ∆0 , qG,0 , {f }). The set of states is the set QG of states of the canonical LR(0) automaton
LR0 (G), extended by a new state f , the final state. The initial state of P0 is identical to the initial state
qG,0 of LR0 (G). The transition relation ∆0 consists of the following kinds of transitions:
Read: (q, a, q ∆G (q, a)) ∈ ∆0 if ∆G (q, a) ≠ ∅. This transition reads the next input symbol a and
   pushes the successor state under a onto the stack. It can only be taken if at least one item of the
   form [X → α.aβ] is contained in q.
Reduce: (qq1 . . . qn , ε, q ∆G (q, X)) ∈ ∆0 if [X → α.] ∈ qn holds with |α| = n. The complete item
   [X → α.] in the topmost stack entry signals a potential reduction. As many entries are removed
   from the top of the stack as the length of the right side indicates. After that, the X successor of the
   new topmost stack entry is pushed onto the stack.
   Fig. 3.16 shows the part of the transition diagram of a canonical LR(0) automaton LR0 (G) that demonstrates
   this situation. The α path in the transition diagram corresponds to |α| entries on top of the stack.
   These entries are removed upon reduction. The new actual state, previously below these removed
   entries, has a transition under X, which is now taken.
Finish: (qG,0 q, ε, f ) ∈ ∆0 if [S ′ → S.] ∈ q. This transition is the reduction transition for the production
   S ′ → S. The property [S ′ → S.] ∈ q signals that a word has been successfully reduced to the start
   symbol. This transition empties the stack and inserts the final state f .
The special case [X → . ] merits extra consideration. According to our description, |ε| = 0 topmost
stack entries are removed from the stack upon this reduction, a transition from the (unchanged) actual
state q under X is taken, and the state ∆G (q, X) is pushed onto the stack.
This transition is possible since, by construction, together with an item [· · · → · · · .X · · · ] the state q
also contains the item [X → .α] for each right side α of nonterminal X. In the special case of an
ε production, the actual state q thus contains, together with the item [· · · → · · · .X · · · ], also the complete
item [X → . ]. This latter reduction transition extends the length of the stack.
The construction of LR0 (G) guarantees that for each non-initial and non-final state q there exists
exactly one entry symbol under which the automaton can make a transition into q. A stack contents
q0 . . . qn with q0 = qG,0 corresponds, therefore, to a uniquely determined word α = X1 . . . Xn ∈
(VT ∪ VN )∗ for which ∆G (qi , Xi+1 ) = qi+1 holds. This word α is a reliable prefix, and qn is the set
of all items valid for α.


Fig. 3.16. Part of the transition diagram of a canonical LR(0) automaton.
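
The following Java sketch shows how P0 can be driven over a stack of states, implementing the read, reduce, and finish transitions just described. The transition function delta, the map reductionOf, and the start-symbol name "S'" are assumptions for illustration; the sketch presumes that each state determines its action uniquely, which — as discussed next — holds exactly when there are no inadequate states.

import java.util.*;

/* Sketch of the LR(0) parser P0: states are pushed on a stack; a state with a
   (unique) complete item triggers a reduction, otherwise we try to shift. */
class Lr0Driver {
    final Map<Integer, Map<String, Integer>> delta; // state x symbol -> state
    final Map<Integer, Rule> reductionOf;           // complete item of a state, if any
    final int startState;

    Lr0Driver(Map<Integer, Map<String, Integer>> delta,
              Map<Integer, Rule> reductionOf, int startState) {
        this.delta = delta; this.reductionOf = reductionOf; this.startState = startState;
    }

    boolean parse(List<String> input) {
        Deque<Integer> stack = new ArrayDeque<>();
        stack.push(startState);
        int pos = 0;
        while (true) {
            Rule r = reductionOf.get(stack.peek());
            if (r != null) {
                if (r.lhs().equals("S'"))                  // finish transition
                    return pos == input.size();
                for (int i = 0; i < r.rhsLength(); i++)    // reduce: pop |alpha|
                    stack.pop();                           // entries, then goto
                Integer next = delta.getOrDefault(stack.peek(), Map.of()).get(r.lhs());
                if (next == null) return false;
                stack.push(next);
            } else if (pos < input.size()) {               // read transition
                Integer next = delta.getOrDefault(stack.peek(), Map.of())
                                    .get(input.get(pos));
                if (next == null) return false;            // syntax error
                stack.push(next); pos++;
            } else return false;
        }
    }
}

record Rule(String lhs, int rhsLength) {}
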

The pushdown automaton P0 just constructed is not necessarily deterministic. There are two kinds
of conflicts that cause non-determinism:
shift-reduce conflict: a state q allows a read transition under a symbol a ∈ VT as well as a reduce or
finish transition, and
reduce-reduce conflict: a state q permits reduction transitions according to two different productions.
In the first case, the actual state contains at least one item [X → α.aβ] and at least one complete item
[Y → γ.]; in the second case, q contains two different complete items [Y → α.], [Z → β.]. A state q
of the LR(0) automaton with one of these properties is called LR(0) inadequate. Otherwise, we call q
LR(0) adequate. The following holds:
Lemma 3.4. For an LR(0) adequate state q there are three possibilities:
1. The state q contains no complete item.
2. The state q consists of exactly one complete item [A → α.].
3. The state q contains exactly one complete item [A → . ], and all non-complete items in q are of the
   form [X → α.Y β], where all rightmost derivations for Y that lead to a terminal word are of the
   form

Y =⇒∗rm Aw =⇒rm w

for some w ∈ VT∗ . ⊓⊔
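
A test for these three possibilities — equivalently, for the absence of the two kinds of conflicts — can be sketched in Java as follows; the representation Item0 and the terminal test are assumptions made for illustration.

import java.util.*;

/* Sketch: classify a state (set of items) of the canonical LR(0) automaton. */
record Item0(String lhs, List<String> rhs, int dot) {
    boolean complete() { return dot == rhs.size(); }
}

class Lr0StateCheck {
    static String classify(Set<Item0> state, Set<String> terminals) {
        long complete = state.stream().filter(Item0::complete).count();
        if (complete >= 2)
            return "LR(0)-inadequate: reduce-reduce conflict";
        boolean canShift = state.stream().anyMatch(i ->
            !i.complete() && terminals.contains(i.rhs().get(i.dot())));
        if (complete == 1 && canShift)
            return "LR(0)-inadequate: shift-reduce conflict";
        return "LR(0)-adequate";   // one of the three cases of the lemma
    }
}
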

Inadequate states of the canonical LR(0) automaton make the pushdown automaton P0 non-deterministic.
We obtain deterministic parsers by permitting the parser to look ahead into the remaining input in order
to select the correct action in inadequate states.
Example 3.4.7 The states S1 , S2 and S9 of the canonical LR(0) automaton in Fig. 3.15 are LR(0)
inadequate. In state S1 , the parser can reduce the right side E to the left side S (complete item [S → E.])
and it can read the terminal symbol + in the input (item [E → E. + T ]). In state S2 the parser can
reduce the right side T to E (complete item [E → T.]) and it can read the terminal symbol ∗ (item
[T → T. ∗ F ]). In state S9 finally, the parser can reduce the right side E + T to E (complete item
[E → E + T.]), and it can read the terminal symbol ∗ (item [T → T. ∗ F ]). ⊓ ⊔

Direct Construction of the Canonical LR(0) Automaton

The canonical LR(0) automaton LR0 (G) for a context-free grammar G need not be derived through
the construction of the characteristic finite-state machine char(G) and the subset construction. It can
also be constructed directly from G. The construction uses a function ∆G,ε that adds to each set q of items
all items that are reachable by ε transitions of the characteristic finite-state machine. The set ∆G,ε (q)
is the least solution of the equation

I = q ∪ {[A → .γ] | A → γ ∈ P, ∃ [X → α.Aβ] ∈ I}

Similar to the function closure() of the subset construction, it can be computed by
88 3 Syntactic Analysis

set⟨item⟩ closure(set⟨item⟩ q) {
    set⟨item⟩ result ← q;
    list⟨item⟩ W ← list_of(q);
    symbol X; string⟨symbol⟩ α;
    while (W ≠ []) {
        item i ← hd(W ); W ← tl(W );
        switch (i) {
            case [_ → _ .X _] :  forall (α : (X → α) ∈ P )
                                     if ([X → .α] ∉ result) {
                                         result ← result ∪ {[X → .α]};
                                         W ← [X → .α] :: W ;
                                     }
            default : break;
        }
    }
    return result;
}

where V is the set of symbols V = VT ∪ VN . The set QG of states and the transition relation ∆G are
computed by first constructing the initial state qG,0 = ∆G,ε ({[S ′ → .S]}) and then adding successor
states and transitions until all successor states are already in the set of constructed states. To implement
it we specialize the function nextState() of the subset construction:

set⟨item⟩ nextState(set⟨item⟩ q, symbol X) {
    set⟨item⟩ q′ ← ∅;
    nonterminal A; string⟨symbol⟩ α, β;
    forall (A, α, β : ([A → α.Xβ] ∈ q))
        q′ ← q′ ∪ {[A → αX.β]};
    return closure(q′ );
}

As in the subset construction, the set of states states and the set of transitions trans can be computed
iteratively:

list⟨set⟨item⟩⟩ W ;
set⟨item⟩ q0 ← closure({[S ′ → .S]});
states ← {q0 }; W ← [q0 ];
trans ← ∅;
set⟨item⟩ q, q′ ;
while (W ≠ []) {
    q ← hd(W ); W ← tl(W );
    forall (symbol X) {
        q′ ← nextState(q, X);
        trans ← trans ∪ {(q, X, q′ )};
        if (q′ ∉ states) {
            states ← states ∪ {q′ };
            W ← q′ :: W ;
        }
    }
}

3.4.3 LR(k): Definition, Properties, and Examples

We call a context-free grammar G an LR(k)-grammar if in each of its rightmost derivations S ′ =
α0 =⇒rm α1 =⇒rm α2 =⇒rm · · · =⇒rm αm = v and each right sentential form αi occurring in the derivation

• the handle can be localized, and
• the production to be applied can be determined

by considering αi from the left to at most k symbols following the handle. In an LR(k)-grammar,
the decomposition of αi into γβw and the determination of X → β, such that αi−1 = γXw holds, are
uniquely determined by γβ and w|k . Formally, we call G an LR(k)-grammar if

S ′ =⇒∗rm αXw =⇒rm αβw   and
S ′ =⇒∗rm γY x =⇒rm αβy   and
w|k = y|k

imply α = γ ∧ X = Y ∧ x = y.

Example 3.4.8 Let G be the grammar with the productions

S→A|B A → aAb | 0 B → aBbb | 1

Then L(G) = {a^n 0 b^n | n ≥ 0} ∪ {a^n 1 b^{2n} | n ≥ 0}. We know already that G is an LL(k)-grammar
for no k ≥ 1. Grammar G is an LR(0)-grammar, though.
The right sentential forms of G have the form

S,  A,  B,  a^n aAb b^n,  a^n aBbb b^{2n},  a^n a0b b^n,  a^n a1bb b^{2n}

for n ≥ 0. Two different possibilities to reduce exist only in the case of the right sentential forms
a^n aAb b^n and a^n aBbb b^{2n}. One could reduce a^n aAb b^n to a^n A b^n and to a^n aSb b^n.
The first choice belongs to the rightmost derivation

S =⇒∗rm a^n A b^n =⇒rm a^n aAb b^n

the second to no rightmost derivation. The prefix a^n of a^n A b^n uniquely determines whether A is the
handle, namely in the case n = 0, or whether aAb is the handle, namely in the case n > 0. The right
sentential forms a^n B b^{2n} are handled analogously. ⊓⊔

Example 3.4.9 The grammar G1 with the productions

S → aAc       A → Abb | b

and the language L(G1 ) = {a b^{2n+1} c | n ≥ 0} is an LR(0)-grammar. In a right sentential form
aAbb b^{2n} c, only the reduction to aA b^{2n} c is possible as part of a rightmost derivation. The prefix aAbb
uniquely determines this. For the right sentential form ab b^{2n} c, b is the handle, and the prefix ab uniquely
determines this. ⊓⊔
Example 3.4.10 The grammar G2 with the productions

S → aAc       A → bbA | b

and the language L(G2 ) = L(G1 ) is an LR(1)-grammar. The critical right sentential forms have the
form a b^n w. If w|1 = b, the handle lies in w; if w|1 = c, the last b in b^n forms the handle. ⊓⊔

Example 3.4.11 The grammar G3 with the productions

S → aAc       A → bAb | b

and the language L(G3 ) = L(G1 ) is not an LR(k)-grammar for any k ≥ 0. For, let k be arbitrary, but
fixed. Consider the two rightmost derivations

S =⇒∗rm a b^n A b^n c =⇒rm a b^n b b^n c
S =⇒∗rm a b^{n+1} A b^{n+1} c =⇒rm a b^{n+1} b b^{n+1} c

with n ≥ k. With the names introduced in the definition of LR(k)-grammars, we have α = a b^n, β =
b, γ = a b^{n+1}, w = b^n c, y = b^{n+2} c. Here w|k = y|k = b^k. α ≠ γ implies that G3 cannot be an
LR(k)-grammar. ⊓⊔
The following theorem clarifies the relation between the definition of LR(0)-grammars and the proper-
ties of the canonical LR(0) automaton.
Theorem 3.4.2 A context-free grammar G is an LR(0)-grammar if and only if the canonical LR(0)
automaton for G has no LR(0)-inadequate states.
Proof: ”⇒” Let G be an LR(0)-grammar, and assume that the canonical LR(0) automaton
LR0 (G) has an LR(0)-inadequate state p.
Case 1: The state p has a reduce-reduce-conflict, i.e., p contains two different items [X → β.], [Y → δ.].
Associated with state p is a non-empty set of reliable prefixes. Let γ = γ ′ β be such a reliable
prefix. Since both items are valid for γ, there are rightmost derivations

S ′ =⇒∗rm γ ′ Xw =⇒rm γ ′ βw   and
S ′ =⇒∗rm νY y =⇒rm νδy        with νδ = γ ′ β = γ

This, however, is a contradiction to the LR(0)-property.
Case 2: The state p has a shift-reduce-conflict, i.e., p contains items [X → β.] and [Y → δ.aα]. Let γ
be a reliable prefix for both items. Since both items are valid for γ, there are rightmost derivations

S ′ =⇒∗rm γ ′ Xw =⇒rm γ ′ βw    and
S ′ =⇒∗rm νY y =⇒rm νδaαy      with νδ = γ ′ β = γ

If α ∈ VT∗ , we immediately obtain a contradiction. Otherwise, there is a rightmost derivation

α =⇒∗rm v1 Xv3 =⇒rm v1 v2 v3

Since y ≠ av1 v2 v3 y holds, the LR(0)-property is violated.

”⇐” Assume that the canonical LR(0) automaton LR0 (G) has no LR(0)-inadequate
states. Consider the two rightmost derivations:

S ′ =⇒∗rm αXw =⇒rm αβw
S ′ =⇒∗rm γY x =⇒rm αβy

We have to show that α = γ, X = Y, and x = y hold. Let p be the state of the canonical LR(0) automaton
after reading αβ. Then p contains all items valid for αβ. By assumption, p is LR(0)-adequate.
We distinguish two cases:
Case 1: β ≠ ε. By Lemma 3.4, p = {[X → β.]}, i.e., [X → β.] is the only valid item for
αβ. It follows that α = γ, X = Y, and x = y.
Case 2: β = ε. Assume that the second rightmost derivation contradicted the LR(0)-condition.
Then there is a further item [X ′ → δ.Y ′ η] ∈ p such that α = α′ δ. The last application of a
production in the lower rightmost derivation is the last application of a production in a ter-
minal rightmost derivation for Y ′ . By Lemma 3.4 it follows that the lower derivation is given
by:

S ′ =⇒∗rm α′ δY ′ w =⇒∗rm α′ δXvw =⇒rm α′ δvw

where y = vw. Thus α = α′ δ = γ, Y = X, and x = vw = y – in contradiction to our
assumption. ⊓⊔
Let us conclude. We have seen how to construct the canonical LR(0) automaton LR0 (G) for a given context-
free grammar G. This can be done either directly or through the characteristic finite-state machine
char(G). From the deterministic finite-state machine LR0 (G) one can construct a pushdown automaton
P0 . This pushdown automaton P0 is deterministic if LR0 (G) does not contain LR(0)-inadequate states.
Theorem 3.4.2 states that this is exactly the case if the grammar G is an LR(0)-grammar. We have thereby
obtained a method to generate parsers for LR(0)-grammars.
In real life, LR(0)-grammars are rather rare. Often a lookahead of length k > 0 must be used
to select between the different choices in a parsing situation. In an LR(0) parser, the actual state de-
termines what the next action is, independently of the next input symbols. LR(k) parsers for k > 0
also have states consisting of sets of items. A different kind of item is used, though, so-called LR(k)-
items. LR(k)-items are context-free items extended by lookahead words. An LR(k)-item is of the
form i = [A → α.β, x] for a production A → αβ of G and a word x ∈ (VT^k ∪ VT^{<k} #). The context-
free item [A → α.β] is called the core, the word x the lookahead of the LR(k)-item i. The set of
LR(k)-items of grammar G is written as IG,k . The LR(k)-item [A → α.β, x] is valid for a reliable
prefix γ = γ ′ α if there exists a rightmost derivation

S ′ # =⇒∗rm γ ′ Aw# =⇒rm γ ′ αβw#

with x = (w#)|k . A context-free item [A → α.β] can be understood as an LR(0)-item extended
by the lookahead ε.
Example 3.4.12 Consider again grammar G0 . We have:

(1) [E → E + .T, )] and [E → E + .T, +] are valid LR(1)-items for the reliable prefix (E+.
(2) [E → T., ∗] is not a valid LR(1)-item for any reliable prefix.

To see observation (1), consider the two rightmost derivations:

S ′ =⇒∗rm (E) =⇒rm (E + T )
S ′ =⇒∗rm (E + Id) =⇒rm (E + T + Id)

Observation (2) follows since the subword E∗ can occur in no right sentential form. ⊓⊔


The following theorem gives a characterization of the LR(k)-property based on valid LR(k)-items.
Theorem 3.4.3 Let G be a context-free grammar. For a reliable prefix γ, let It(γ) be the set of LR(k)-
items of G that are valid for γ.
The grammar G is an LR(k)-grammar if and only if for all reliable prefixes γ and all LR(k)-items
[A → α., x] ∈ It(γ):
1. if there is another LR(k)-item [X → δ., y] ∈ It(γ), then x ≠ y;
2. if there is another LR(k)-item [X → δ.aβ, y] ∈ It(γ), then x ∉ firstk (aβ) ⊙k {y}. ⊓⊔

Theorem 3.4.3 suggests defining LR(k)-adequate and LR(k)-inadequate sets of items also for
k > 0. Let I be a set of LR(k)-items. I has a reduce-reduce-conflict if there are different LR(k)-items
[X → α., x], [Y → β., y] ∈ I with x = y. I has a shift-reduce-conflict if there are LR(k)-items
[X → α.aβ, x], [Y → γ., y] ∈ I with

y ∈ {a} ⊙k firstk (β) ⊙k {x}

For k = 1 this condition simplifies to y = a.
The set I is called LR(k)-inadequate if it has a reduce-reduce- or a shift-reduce-conflict. Other-
wise, we call it LR(k)-adequate.
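
For k = 1, the two conflict conditions can be checked directly on a set of LR(1)-items, as the following Java sketch illustrates; the representation Item1 is again an assumption made for illustration.

import java.util.*;

/* Sketch: conflict test for a set of LR(1)-items (k = 1). */
record Item1(String lhs, List<String> rhs, int dot, String lookahead) {
    boolean complete() { return dot == rhs.size(); }
}

class Lr1StateCheck {
    static boolean isInadequate(Set<Item1> state, Set<String> terminals) {
        Set<String> reduceLookaheads = new HashSet<>();
        for (Item1 i : state)
            if (i.complete() && !reduceLookaheads.add(i.lookahead()))
                return true;                    // reduce-reduce conflict: x = y
        for (Item1 i : state)
            if (!i.complete()) {
                String a = i.rhs().get(i.dot());
                // shift-reduce conflict: for k = 1 the condition
                // y ∈ {a} ⊙1 first1(β) ⊙1 {x} simplifies to y = a
                if (terminals.contains(a) && reduceLookaheads.contains(a))
                    return true;
            }
        return false;
    }
}
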
The LR(k)-property means that, when reading a right sentential form, a candidate for a reduction
together with the production to be applied can be uniquely determined with the help of the associated
reliable prefixes and the next k symbols of the input. However, tabulating all combinations of
reliable prefixes with words of length k is infeasible since, in general, there are infinitely
many reliable prefixes. In analogy to our way of dealing with LR(0)-grammars, one constructs a
canonical LR(k)-automaton. The canonical LR(k)-automaton LRk (G) is a deterministic finite-state
machine. Its states are sets of LR(k)-items. For each reliable prefix γ, the deterministic finite-state
machine LRk (G) determines the set of LR(k)-items that are valid for γ. Theorem 3.4.3 helps us in
our derivation: it says that for an LR(k)-grammar, the set of LR(k)-items valid for γ, together with the
lookahead, uniquely determines whether to reduce in the next step, and if so, by which production.
In much the same way as the LR(0) parser stores states of the canonical LR(0) automaton on its
stack, the LR(k) parser stores states of the canonical LR(k)-automaton on its stack. The selection of
the right one of several possible actions of the LR(k) parser is controlled by the action-table. This table
contains, for each combination of state and lookahead, one of the following entries:

shift: read the next input symbol;
reduce(X → α): reduce by production X → α;
error: report an error;
accept: announce the successful end of the parser run.

A second table, the goto-table, contains the representation of the transition function of the canonical
LR(k)-automaton LRk (G). It is consulted after a shift-action or a reduce-action to determine the new
state on top of the stack. Upon a shift, it delivers the transition under the read symbol out of the actual
state. Upon a reduction by X → α, it delivers the transition under X out of the state underneath those
stack symbols that belong to α. These two tables for k = 1 are shown in Fig. 3.17.
The LR(k) parser for a grammar G needs a program that interprets the action- and goto-tables, the
driver. Again, we consider the case k = 1. This is, in principle, sufficient because for each language
that has an LR(k)-grammar, and therefore also an LR(k) parser, one can construct an LR(1)-grammar
and consequently also an LR(1) parser. Let us assume that the set of states of the LR(1) parser is
Q. One such driver program is:


Fig. 3.17. Schematic representation of action- and goto-table of an LR(1) parser with set of states Q.

list⟨state⟩ stack ← [q0 ];
terminal buffer ← scan();
state q; nonterminal X; string⟨symbol⟩ α;
while (true) {
    q ← hd(stack);
    switch (action[q, buffer ]) {
        case shift :           stack ← goto[q, buffer ] :: stack;
                               buffer ← scan();
                               break;
        case reduce(X → α) :   output(X → α);
                               stack ← tl(|α|, stack); q ← hd(stack);
                               stack ← goto[q, X] :: stack;
                               break;
        case accept :          stack ← f :: tl(2, stack);
                               return accept;
        case error :           output(″. . .″); goto err;
    }
}
The function list⟨state⟩ tl(int n, list⟨state⟩ s) returns the list s with the topmost n elements removed.
As in the driver program for LL(1) parsers, in the case of an error, a jump is made to a label err at
which the code for error handling is found.
We present three approaches to construct an LR(1) parser for a context-free grammar G. The most
general method is the canonical LR(1)-method. For each LR(1)-grammar G there exists a canonical
LR(1) parser. The number of states of this parser can be large. Therefore, other methods were proposed
that have state sets of the size of the LR(0) automaton. Of these we consider the SLR(1)- and the
LALR(1)-method.
The described driver program for LR(1) parsers works for all three parsing methods; the driver in-
terprets the action- and goto-tables, but their contents are computed in different ways. In consequence,
the actions for some combinations of state and lookahead may be different.

Construction of an LR(1) Parser

The LR(1) parser is based on the canonical LR(1)-automaton LR1 (G). Its states, therefore, are sets of
LR(1)-items. We construct the canonical LR(1)-automaton much in the same way as we constructed
the canonical LR(0) automaton. The only difference is that LR(1)-items are used instead of LR(0)-
items. This means that the lookahead symbols need to be computed when the closure of a set q of
LR(1)-items under ε-transitions is formed. This set is the least solution of the equation

I = q ∪ {[A → .γ, y] | A → γ ∈ P, ∃ [X → α.Aβ, x] ∈ I : y ∈ first1 (β) ⊙1 {x}}

It is computed by the following function

set⟨item1 ⟩ closure(set⟨item1 ⟩ q) {
    set⟨item1 ⟩ result ← q;
    list⟨item1 ⟩ W ← list_of(q);
    nonterminal X; string⟨symbol⟩ α, β; terminal x, y;
    while (W ≠ []) {
        item1 i ← hd(W ); W ← tl(W );
        switch (i) {
            case [_ → _ .Xβ, x] :
                forall (α : (X → α) ∈ P )
                    forall (y ∈ first1 (β) ⊙1 {x})
                        if ([X → .α, y] ∉ result) {
                            result ← result ∪ {[X → .α, y]};
                            W ← [X → .α, y] :: W ;
                        }
            default : break;
        }
    }
    return result;
}

where V is the set of all symbols, V = VT ∪ VN . The initial state q0 of LR1 (G) is

q0 = closure({[S ′ → .S, #]})
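
The following Python sketch implements this closure computation for the running grammar G0. The item representation (lhs, rhs, dot, lookahead) and all helper names are assumptions of the sketch; since G0 has no ε-productions, first1 (β) ⊙1 {x} simplifies to first1 of the first symbol of β, or {x} if β is empty.

GRAMMAR = {                                 # G0: S -> E, E -> E+T | T,
    "S": [("E",)],                          #     T -> T*F | F, F -> (E) | Id
    "E": [("E", "+", "T"), ("T",)],
    "T": [("T", "*", "F"), ("F",)],
    "F": [("(", "E", ")"), ("Id",)],
}
TERMINALS = {"+", "*", "(", ")", "Id"}

def compute_first1():
    # fixed-point computation of first1 for all grammar symbols
    first = {t: {t} for t in TERMINALS}
    first.update({A: set() for A in GRAMMAR})
    changed = True
    while changed:
        changed = False
        for A, rhss in GRAMMAR.items():
            for rhs in rhss:                # no epsilon-productions in G0
                new = first[rhs[0]] - first[A]
                if new:
                    first[A] |= new
                    changed = True
    return first

FIRST1 = compute_first1()

def first1_concat(beta, x):
    # first1(beta) ⊙1 {x}: the lookahead x survives only if beta is empty
    return FIRST1[beta[0]] if beta else {x}

def closure(q):
    # add [A -> .gamma, y] for each [X -> alpha.A beta, x] until stable
    result, worklist = set(q), list(q)
    while worklist:
        lhs, rhs, dot, x = worklist.pop()
        if dot < len(rhs) and rhs[dot] in GRAMMAR:   # dot before a nonterminal
            A, beta = rhs[dot], rhs[dot + 1:]
            for gamma in GRAMMAR[A]:
                for y in first1_concat(beta, x):
                    item = (A, gamma, 0, y)
                    if item not in result:
                        result.add(item)
                        worklist.append(item)
    return result

q0 = closure({("S", ("E",), 0, "#")})       # the initial state of LR1(G0)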

We need a function nextState() that computes the successor state for a given set q of LR(1)-items and
a symbol X ∈ V = VN ∪ VT . The corresponding function for the construction of LR0 (G) needs to be
extended to also compute the lookahead symbols:

set⟨item1 ⟩ nextState(set⟨item1 ⟩ q, symbol X) {
    set⟨item1 ⟩ q ′ ← ∅;
    nonterminal A; string⟨symbol⟩ α, β; terminal x;
    forall (A, α, β, x : ([A → α.Xβ, x] ∈ q))
        q ′ ← q ′ ∪ {[A → αX.β, x]};
    return closure(q ′ );
}

The set of states and the transition relation of the canonical LR(1)-automaton are computed in analogy
to the canonical LR(0)-automaton. The generator starts with the initial state and an empty set of tran-
sitions and adds successor states until all successor states are already contained in the set of computed
states. The transition function of the canonical LR(1)-automaton gives the goto-table of the LR(1)
parser.
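
A sketch of this worklist construction in Python, continuing the item representation used above; closure is passed in as a parameter so that the fragment stands on its own, and symbols is assumed to be VT ∪ VN. All names are illustrative.

def next_state(q, X, closure):
    # move the dot over X in all items of q that expect X, then close
    kernel = {(lhs, rhs, dot + 1, la)
              for (lhs, rhs, dot, la) in q
              if dot < len(rhs) and rhs[dot] == X}
    return frozenset(closure(kernel))

def build_lr1_automaton(start_item, symbols, closure):
    # start from closure({start_item}); add successor states until no
    # new states arise; delta doubles as the goto-table
    q0 = frozenset(closure({start_item}))
    states, delta, worklist = {q0}, {}, [q0]
    while worklist:
        q = worklist.pop()
        for X in symbols:
            q_next = next_state(q, X, closure)
            if not q_next:
                continue                    # no transition under X
            delta[(q, X)] = q_next
            if q_next not in states:
                states.add(q_next)
                worklist.append(q_next)
    return states, delta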
Let us turn to the construction of the action-table of the LR(1) parser. No reduce-reduce-conflict
exists in a state q of the canonical LR(1)-automaton with complete LR(1)-items [X → α., x], [Y →
β., y] if x ≠ y. If the LR(1) parser is in state q, it will decide to reduce with the production whose
lookahead symbol is the next input symbol. If state q contains at the same time a complete LR(1)-item
[X → α., x] and an LR(1)-item [Y → β.aγ, y], it still has no shift-reduce-conflict if a ≠ x. In state
q the generated parser will reduce if the next input symbol is x and shift if it is a. Therefore, the
action-table can be computed by the following iteration:

forall (state q) {
forall (terminal x) action[q, x] ← error ;
forall ([X → α.β, x] ∈ q)
if (β = ε)
if (X = S ′ ∧ α = S ∧ x = #) action[q, #] ← accept ;
else action[q, x] ← reduce(X → α);
else if (β = aβ ′ ) action[q, a] ← shift ;
}
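
As a sketch, the same iteration in Python, with the item and action encodings assumed above; states maps a state to its set of LR(1)-items, and a raised exception signals that G is not an LR(1)-grammar.

def build_action_table(states, terminals, start_lhs):
    action = {}                            # missing entries represent error
    for q, items in states.items():
        for (lhs, rhs, dot, x) in items:
            if dot == len(rhs):            # complete item: accept or reduce
                if lhs == start_lhs and x == "#":
                    entry, key = ("accept",), (q, "#")
                else:
                    entry, key = ("reduce", lhs, len(rhs)), (q, x)
            elif rhs[dot] in terminals:    # dot before a terminal: shift
                entry, key = ("shift",), (q, rhs[dot])
            else:
                continue                   # dot before a nonterminal
            if action.setdefault(key, entry) != entry:
                raise ValueError("conflict in state %r on %r" % (q, key[1]))
    return action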

Example 3.4.13 We consider some states of the canonical LR(1)-automaton for the context-free gram-
mar G0 . The numbering of states is the same as in Fig. 3.15. To make the representation of a set S
of LR(1)-items more readable, all lookahead symbols in LR(1)-items from S with the same kernel
[A → α.β] are collected in one lookahead set
L = {x | [A → α.β, x] ∈ S}

We represent subsets {[A → α.β, x] | x ∈ L} as [A → α.β, L] and obtain

S0′ = closure({[S → .E, {#}]})
    = { [S → .E, {#}], [E → .E + T, {#, +}], [E → .T, {#, +}],
        [T → .T ∗ F, {#, +, ∗}], [T → .F, {#, +, ∗}],
        [F → .(E), {#, +, ∗}], [F → .Id, {#, +, ∗}] }

S1′ = nextState(S0′ , E)
    = { [S → E., {#}], [E → E. + T, {#, +}] }

S2′ = nextState(S0′ , T )
    = { [E → T., {#, +}], [T → T. ∗ F, {#, +, ∗}] }

S6′ = nextState(S1′ , +)
    = { [E → E + .T, {#, +}], [T → .T ∗ F, {#, +, ∗}], [T → .F, {#, +, ∗}],
        [F → .(E), {#, +, ∗}], [F → .Id, {#, +, ∗}] }

S9′ = nextState(S6′ , T )
    = { [E → E + T., {#, +}], [T → T. ∗ F, {#, +, ∗}] }

After the extension by lookahead symbols, the states S1 , S2 , and S9 , which were LR(0)-inadequate,
no longer have conflicts. In state S1′ the next input symbol + indicates a shift, the next input symbol
# a reduction. In state S2′ the lookahead symbol ∗ indicates a shift, # and + a reduction; similarly
in state S9′ .
Table 3.6 shows the rows of the action-table of the canonical LR(1) parser for the grammar G0
that belong to the states S0′ , S1′ , S2′ , S6′ , and S9′ . ⊓⊔

The numbering of the productions used:

1: S → E    2: E → E + T    3: E → T    4: T → T ∗ F    5: T → F    6: F → (E)    7: F → Id

        Id    (     )     ∗      +      #
S0′     s     s
S1′                              s      acc
S2′                       s      r(3)   r(3)
S6′     s     s
S9′                       s      r(2)   r(2)

Table 3.6. Some rows of the action-table of the canonical LR(1) parser for G0 . s stands for shift, r(i) for reduce
by production i, acc for accept. All empty entries represent error.

SLR(1) and LALR(1) Parsers

The set of states of LR(1) parsers can become quite large. Therefore, LR analysis methods are often
employed that are not as powerful as canonical LR parsers but have fewer states. Two such LR analysis
methods are the SLR(1)- (simple LR) and the LALR(1)- (lookahead LR) method. Each SLR(1) parser
is a special LALR(1) parser, and each grammar that has an LALR(1) parser is an LR(1)-grammar.
The starting point of the construction of SLR(1)- and LALR(1) parsers is the canonical LR(0)
automaton LR0 (G). The set Q of states and the goto-table for these parsers are the set of states and
the goto-table of the corresponding LR(0) parser. Lookahead is used to resolve conflicts in the states
in Q. Let q ∈ Q be a state of the canonical LR(0) automaton and [X → α.β] an item in q. We denote
by λ(q, [X → α.β]) the lookahead set that is added to the item [X → α.β] in q. The SLR(1)-method
is different from the LALR(1)-method in the definition of the function
λ : Q × IG → 2^(VT ∪ {#})
Relative to such a function λ, the state q of LR0 (G) has a reduce-reduce-conflict if it has different
complete items [X → α.], [Y → β.] ∈ q with
λ(q, [X → α.]) ∩ λ(q, [Y → β.]) ≠ ∅
Relative to λ, q has a shift-reduce-conflict if it has items [X → α.aβ], [Y → γ.] ∈ q with a ∈
λ(q, [Y → γ.]).
If no state of the canonical LR(0) automaton has a conflict, the lookahead sets λ(q, [X → α.]) suffice
to construct the action-table.
In SLR(1) parsers, the lookahead sets for items are independent of the states in which they occur;
the lookahead only depends on the left side of the production in the item:

λS (q, [X → α.β]) = {a ∈ VT ∪ {#} | S ′ # ⇒∗rm γXaw} = follow1 (X)

for all states q with [X → α.β] ∈ q. A state q of the canonical LR(0) automaton is called SLR(1)-
inadequate if it contains conflicts with respect to the function λS . G is an SLR(1)-grammar if there
are no SLR(1)-inadequate states.
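
A small Python sketch of this adequacy check for a single state of the canonical LR(0) automaton; LR(0)-items are tuples (lhs, rhs, dot), follow maps each nonterminal to its follow1-set (its computation is assumed), and a nonempty result means the state is SLR(1)-inadequate.

def slr1_conflicts(state, follow, terminals):
    shift_symbols = {rhs[dot] for (lhs, rhs, dot) in state
                     if dot < len(rhs) and rhs[dot] in terminals}
    reduce_sets = [follow[lhs] for (lhs, rhs, dot) in state
                   if dot == len(rhs)]
    conflicts = set()
    for i, f in enumerate(reduce_sets):
        conflicts |= f & shift_symbols       # shift-reduce overlaps
        for g in reduce_sets[i + 1:]:
            conflicts |= f & g               # reduce-reduce overlaps
    return conflicts

For the state S1 of LR0 (G0 ), for instance, the shift symbols are {+} and the only reduce set is follow1 (S) = {#}, so the call returns the empty set and the conflict is resolved.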
Example 3.4.14 We consider again grammar G0 of Example 3.4.1. Its canonical LR(0) automaton
LR0 (G0 ) has the inadequate states S1 , S2 and S9 . We extend the complete items in the states by the
follow1 -sets of their left sides to represent the function λS in a readable way. Since follow1 (S) = {#}
and follow1 (E) = {#, +, )} we obtain:
S1′′ = { [S → E., {#}],
         [E → E. + T ] }            conflict eliminated, since + ∉ {#}

S2′′ = { [E → T., {#, +, )}],
         [T → T. ∗ F ] }            conflict eliminated, since ∗ ∉ {#, +, )}

S9′′ = { [E → E + T., {#, +, )}],
         [T → T. ∗ F ] }            conflict eliminated, since ∗ ∉ {#, +, )}

So, G0 is an SLR(1)-grammar, and it has an SLR(1) parser. ⊓⊔



The set follow1 (X) collects all symbols that can follow the nonterminal X in a sentential form of the
grammar. Only the follow1 -sets are used to resolve conflicts in the construction of an SLR(1) parser. In
many cases this is not sufficient. More conflicts can be resolved if the state is taken into consideration
in which the complete item [X → α.] occurs. The most precise lookahead set that considers the state is
defined by:

λL (q, [X → α.β]) = {a ∈ VT ∪ {#} | S ′ # ⇒∗rm γXaw ∧ ∆∗G (q0 , γα) = q}

Here, q0 is the initial state, and ∆G is the transition function of the canonical LR(0) automaton LR0 (G).
In λL (q, [X → α.]) only those terminal symbols are contained that can follow X in a right sentential
form γXaw such that γα drives the canonical LR(0) automaton into the state q. We call a state q of the
canonical LR(0) automaton LALR(1)-inadequate if it contains conflicts with respect to the function
λL . The grammar G is an LALR(1)-grammar if the canonical LR(0) automaton has no LALR(1)-
inadequate states.
There always exists an LALR(1) parser for an LALR(1)-grammar. The definition of the function
λL , however, is not constructive, since sets of right sentential forms appear in it that are in general
infinite. The sets λL (q, [A → α.β]) can be characterized as the least solution of the following system
of equations:

λL (q0 , [S ′ → .S]) = {#}

λL (q, [A → αX.β]) = ⋃ {λL (p, [A → α.Xβ]) | ∆G (p, X) = q} ,   X ∈ (VT ∪ VN )

λL (q, [A → .α]) = ⋃ {first1 (β) ⊙1 λL (q, [X → γ.Aβ]) | [X → γ.Aβ] ∈ q}

The system of equations describes how sets of successor symbols of items in states originate. The first
equation says that only # can follow the start symbol S ′ . The second class of equations describes that
the follow symbols of an item [A → αX.β] in a state q result from the follow symbols after the dot in
an item [A → α.Xβ] in states p from which one can reach q by reading X. The third class of equations
formalizes that the follow symbols of an item [A → .α] in a state q result from the follow symbols of
occurrences of A in items in q after the dot, that is, from sets first1 (β) ⊙1 λL (q, [X → γ.Aβ]) for items
[X → γ.Aβ] in q.
The system of equations for the sets λL (q, [A → α.β]) over the finite subset lattice 2^(VT ∪ {#}) can be
solved by the iterative method for the computation of least solutions. Considering which nonterminals
may produce ε allows us to replace the occurrences of 1-concatenation by unions. We thus obtain an
equivalent pure union problem that can be solved by the efficient method of Section 3.2.7.
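
A direct fixed-point iteration over these equations can be sketched in Python as follows; states maps each LR(0) state to its set of items (lhs, rhs, dot), delta maps (state, symbol) to the successor state, and first1_concat(beta, L) computes first1 (β) ⊙1 L for a lookahead set L. All names are assumptions of the sketch.

def lalr_lookaheads(states, delta, first1_concat, q0, start_item):
    lam = {(q, it): set() for q, items in states.items() for it in items}
    lam[(q0, start_item)] = {"#"}       # lambda_L(q0, [S' -> .S]) = {#}
    changed = True
    while changed:
        changed = False

        def add(key, symbols):
            nonlocal changed
            new = symbols - lam[key]
            if new:
                lam[key] |= new
                changed = True

        # second class of equations: transport lookaheads along transitions
        for (p, X), q in delta.items():
            for (lhs, rhs, dot) in states[p]:
                if dot < len(rhs) and rhs[dot] == X:
                    add((q, (lhs, rhs, dot + 1)), lam[(p, (lhs, rhs, dot))])

        # third class: closure items [A -> .alpha] inherit lookaheads from
        # the items in the same state with the dot before A
        for q, items in states.items():
            for (lhs, rhs, dot) in items:
                if dot < len(rhs):
                    A, beta = rhs[dot], rhs[dot + 1:]
                    las = first1_concat(beta, lam[(q, (lhs, rhs, dot))])
                    for (l2, r2, d2) in items:
                        if l2 == A and d2 == 0:
                            add((q, (l2, r2, 0)), las)
    return lam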
LALR(1) parsers can be constructed in the following, not very efficient way: One constructs a
canonical LR(1) parser. If its states have no conflicts, states p and q are merged into a new state
p′ whenever the cores of the items in p are the same as the cores of the items in q, that is, whenever
the two sets of items differ only in their lookahead sets. The lookahead sets in the new state p′ are
obtained as the union of the lookahead sets of items with the same core. The grammar is an
LALR(1)-grammar if the new states have no conflicts.
A further possibility consists in a modification of Algorithm LR(1)-GEN. The conditional state-
ment
    if q ′ not in Q then Q := Q ∪ {q ′ } fi;
is replaced by
    if there exists q ′′ in Q with samecore(q ′ , q ′′ ) then merge(Q, q ′ , q ′′ ) fi;
where
    function samecore(p, p′ : set of item) : bool;
        if set of cores of p = set of cores of p′
        then return(true)
        else return(false)
        fi;

    proc merge(Q : set of set of item, p, p′ : set of item);
        Q := (Q \ {p′ }) ∪ { {[X → α.β, L1 ∪ L2 ] | [X → α.β, L1 ] ∈ p and [X → α.β, L2 ] ∈ p′ } };
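
In Python, the merging step can be sketched as follows, with states represented as frozensets of items (lhs, rhs, dot, lookahead); the names are illustrative.

def core(state):
    # the core of an LR(1) state: its items stripped of the lookaheads
    return frozenset((lhs, rhs, dot) for (lhs, rhs, dot, _) in state)

def merge_same_core(states):
    # group LR(1) states by core and unite their items; uniting items with
    # equal core amounts to uniting their lookahead sets
    by_core = {}
    for q in states:
        by_core.setdefault(core(q), set()).update(q)
    return [frozenset(items) for items in by_core.values()]

The result is the candidate set of LALR(1) states; the grammar is an LALR(1)-grammar only if none of the merged states contains a conflict.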

Example 3.4.15 The following grammar taken from [ASU86] describes a simplified version of the C
assignment statement:
S′ → S
S → L=R|R
L → ∗R | Id
R → L
This grammar is not an SLR(1)-grammar, but it is an LALR(1)-grammar. The states of the canonical
LR(0) automaton are given by:
S0 = { [S ′ → .S], [S → .L = R], [S → .R], [L → . ∗ R], [L → .Id], [R → .L] }
S1 = { [S ′ → S.] }
S2 = { [S → L. = R], [R → L.] }
S3 = { [S → R.] }
S4 = { [L → ∗.R], [R → .L], [L → . ∗ R], [L → .Id] }
S5 = { [L → Id.] }
S6 = { [S → L = .R], [R → .L], [L → . ∗ R], [L → .Id] }
S7 = { [L → ∗R.] }
S8 = { [R → L.] }
S9 = { [S → L = R.] }
State S2 is the only LR(0)-inadequate state. We have follow1 (R) = {#, =}. This lookahead set for
the item [R → L.] is not sufficient to resolve the shift-reduce-conflict in S2 since the next input symbol
= is in the lookahead set. Therefore, the grammar is not an SLR(1)-grammar.
The grammar, however, is an LALR(1)-grammar. The transition diagram of its LALR(1) parser
is shown in Fig. 3.18. To increase readability, the lookahead sets λL (q, [A → α.β]) are directly
associated with the item [A → α.β] of state q. In state S2 , the item [R → L.] now has the lookahead
set {#}. The conflict is resolved since this set does not contain the next input symbol =. ⊓⊔

3.4.4 Error Handling in LR Parsers

LR parsers, like LL parsers, have the viable-prefix property. This means that every prefix of the input
that an LR parser has analyzed without error can be extended to a correct input word, a sentence of the
language. If an LR parser in some configuration meets an input symbol a with action[q, a] = error ,
this is the earliest possible situation in which an error can be detected. We call this configuration the
error configuration and q the error state of this configuration. For LR parsers, too, there is a spectrum
of error-handling methods:
• Forward error handling. Modifications are made to the remaining input, but not to the parser
stack.
• Backward error handling. Modifications are also made to the parser stack.
Assume that q is the current state and a the next symbol in the input. Possible corrections are a
generalized action shift(βa) for an item [A → α.βaγ] from q, a reduce for incomplete items from q,
or skip:
• The correction shift(βa) assumes that the subword corresponding to β is missing from the input.
It therefore pushes the states through which the item-pushdown automaton passes when reading the
symbol sequence β starting from q. Then the symbol a is read and the corresponding shift-transition
of the parser is executed.
