Info F 403
PEREZ
INTRODUCTION TO LANGUAGE THEORY AND COMPILING
Typeset under LaTeX using the Tufte-LaTeX document class.
The authors would like to thank the following persons for proofreading (part of) those notes and making meaningful suggestions and comments:
• Mourad Akandouch
• Benoît Haal
• Zakaria Jamiai
• Franklin Laureles
• Lucas Lefèvre
• Benjamin Monmege
• Marie Van den Bogaard.
1 Introduction 9
1.1 What is a language? 9
1.2 Formal languages 11
1.3 Application: compiler design 15
1.4 Operations on words and languages 29
3 Grammars 73
3.1 The limits of regular languages and some intuitions 73
3.2 Syntax and semantics 76
3.3 The Chomsky hierarchy 78
3.4 Exercises 84
B Bibliography 217
List of Figures
4.1 A finite automaton accepting (01)∗ . The labels of the nodes represent
the automaton’s memory: it remembers the last bit read if any. 87
4.2 Recognising a palindrome using a stack. 88
4.3 An intuition of a pushdown automaton that recognises L pal# . The key-
words Push, Pop and Top have their usual meaning. The edge labelled
by empty can be taken only when the stack is empty. Note that, in-
stead of pushing q 0 or q 1 , one could simply store 0 and 1’s on the
stack. 90
4.4 The grammar G Exp to generate expressions. 90
4.5 A derivation tree for the word Id + Id ∗ Id. 92
4.6 Another derivation tree for the word Id + Id ∗ Id. 92
4.7 The derivation tree for Id+Id∗Id of Figure 4.5 with a top-down traversal
indicated by the position of each node in the sequence 93
4.8 A CFG which is not in CNF. 95
4.9 A CFG in CNF that corresponds to the CFG in Figure 4.8. 95
4.10 An example CNF grammar generating a+b. 97
4.11 An example PDA recognising L pal# (by accepting state). 101
4.12 An example PDA recognising L pal# (by empty stack). 104
4.13 A non-deterministic PDA recognising L pal (by accepting state). 104
4.14 A PDA accepting (by empty stack) arithmetic expressions with + and ∗ operators only. 108
4.15 An illustration of the construction that turns a PDA accepting by empty stack into a PDA accepting the same language by final state. 111
4.16 An illustration of the construction that turns a PDA accepting by final state into a PDA accepting the same language by empty stack. Transitions labelled by ε, γ/ε represent all possible transitions for all possible γ ∈ Γ. 113
4.17 A grammar with an unproductive variable (A). 119
4.18 A grammar with an unreachable symbol (B). 119
6.24 A grammar that is LL(2) and SLR(1) but neither LL(1) nor LR(0). 205
1 Introduction
[. . . ]
The words, their pronunciation, and the methods of combining them used
and understood by a community.
Example 1.1. Listing 1.1 shows a syntactically correct C program³. Deleting the semi-colon from the end of line 5 triggers a compiler syntax error. Observe that, although we remove the semi-colon from line 5, the error is reported on line 6. Indeed, the compiler 'realises' that the semi-colon is missing only when it reads the return statement on line 6.
X += Y
X := X+Y
ADD Y TO X GIVING X
2. offer a clear structure, so that the code is easily readable and maintainable (for instance, the use of functions, blocks, and so forth).
Let us start with several basic definitions. We first give the definitions, then
comment on them.
{a, b, c, d , e, f , g , h, i , j , k, l , m, n, o, p, q, r , s, t , u, v, w, x, y, z}
{a, b, . . . , z, A, B, . . . , Z, 0, 1, . . . , 9, _}
of all characters which are allowed in C variable names. Then, all words on ΣC that do not begin with a digit and are not C keywords are valid variable names in C (assuming no limit is imposed on the length of variable names). M
1. The set L Cid of all non-empty words on ΣC (see example 1.7) that do
not begin with a digit, is a language. It contains all valid C identifiers
(variable names, function names, etc) and all C keywords (for, while,
etc).
2. The set L odd of all non-empty words on {0, 1} that end with a 1 is a lan-
guage. It contains all the binary encodings of odd numbers.
4. The set L alg of all algebraic expressions that use only the x variable, the + and ∗ operators and parentheses, and which are well-parenthesised, is a language on the alphabet Σ = {(, ), x, +, ∗}. For instance ((x+x)∗x)+x belongs to this language, while )(x + x does not, although it is a word on Σ.
All these examples are more or less related to the field of compiler de-
sign, but we will provide examples from other fields of application later.
Being able to answer such a question in general (i.e., for all languages
L) seems to solve meaningful questions. Let us come back to our examples
to illustrate this. In the first case, testing whether w ∈ L Cid allows one to
check that w is a valid C identifier. Testing whether a binary number be-
longs to L odd allows one to check whether it is odd or even. The member-
ship problem for L alg and L C amounts to checking the syntax of expressions and C programs respectively, a task that is important when compiling. Observe that all these criteria are purely syntactical. On the other hand, the last example, L Cterm, seems more complex, because the criterion for a word w to belong to L Cterm is that w is a string encoding a terminating C program, i.e., a semantic criterion; yet the definition of L Cterm makes perfect sense and is mathematically sound.
Of course, we are particularly interested in solving the membership problem automatically. What we mean here is that, given a language L, we want to build a program that reads any word w on its input, and returns 'yes' iff w ∈ L.
What we will do mainly in these notes is to develop formal tools to pro-
vide an answer to that question. Let us already try and build some in-
tuitions by highlighting characteristics of programs that would recognise
each of those languages. In all the cases, we assume that the word for
which we want to test membership is read character by character, from left
to right, i.e., if w = w 1 w 2 · · · w n , then the program will first read w 1 , then
w 2 , and so forth up to w n . When w n is read, the program must output its
answer.
2. In the case of L odd , the program must only check that all characters it
receives are 0’s and 1’s, and that the last one is 1. This does not even
require any memory.
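To make this intuition concrete, here is a small C sketch of a recogniser for L odd (this code is our own illustration, not part of the notes): it scans the word from left to right, rejects as soon as it sees a character other than 0 or 1, and answers according to the last character read.

#include <stdio.h>

/* Sketch of a recogniser for L_odd: reads the word left to right and only
   needs to know whether every character so far was 0 or 1, and what the
   character currently under the reading head is. */
static int in_L_odd(const char *w) {
    char last = '\0';
    for (; *w; w++) {
        if (*w != '0' && *w != '1')
            return 0;              /* not a word over {0, 1} */
        last = *w;
    }
    return last == '1';            /* non-empty and ends with a 1 */
}

int main(void) {
    const char *words[] = { "1011", "10", "", "102" };
    for (int i = 0; i < 4; i++)
        printf("%-5s %s\n", words[i],
               in_L_odd(words[i]) ? "in L_odd" : "not in L_odd");
    return 0;
}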
In these notes, we will present the mathematical and practical tools that
are necessary to attack those questions.
It should now be clear from the previous examples that a first-class application of formal language theory is the design of compilers. Remark that here, we are using the term 'language' with the same meaning as in 'programming language', and not in the formal sense of Definition 1.8.
A compiler¹⁰ is a program that processes programs and translates a program P s (the source program, or source code) written in a language L s (the source language) into an equivalent program P t (the target program) written in a language L t (the target language). The compiler, being a program itself, might be written in a third language. As an example, gfortran¹¹ is a compiler that translates FORTRAN code into (for instance) Intel i386 machine code, and is written in C.
10 A. Aho, M. Lam, R. Sethi, and J. Ullman. Compilers: Principles, Techniques, & Tools. Addison-Wesley series in computer science. Pearson/Addison Wesley, 2007.
11 See the gfortran home page at: https://gcc.gnu.org/fortran/.
low). Syntax errors are detected and reported during this phase. Finally, the analysis part usually contains a semantic analysis, that performs, a.o., type checking, and reports typing errors.
1.3.2 Scanning
return 0 ;
}
Observe how white spaces are ignored in this example. Here, we use the term 'white space' in a broad sense: it also includes tabulation characters or end-of-line. Those white space symbols are relevant to the compiler only to separate successive sub-strings¹³. Indeed, the two following code excerpts have the same effect:
13 This is the case in most programming languages. A notable exception is the (prank) programming language Whitespace, where 'Any non white space characters are ignored; only spaces, tabs and newlines are considered syntax'. See http://compsoc.dur.ac.uk/whitespace/.
int i = 5 ;

int
i
=
5
;
However, the scanner does not only split the input in a sequence of sub-strings as illustrated above, but also performs a preliminary analysis of those sub-strings and determines their type. For instance, what matters about the j sub-string in lines 4 and 5 is not the j character, but rather (1) the fact that j is identified as a variable identifier and (2) the fact that the same identifier occurs in lines 4 and 5 but not in other lines where variable identifiers appear (indeed, replacing the two occurrences of j in those lines by LukeIAmYourFather, or any other legal variable name, will yield compiled code with exactly the same effect). Also, reserved keywords (while, if,. . . ), operators (=, <=, !=,. . . ) and special symbols ({, ;, . . . ) can be identified as such.
To sum up, the role of the lexical analyser is not only to split the input into a sequence of sub-strings, but to relate each of those sub-strings to its lexical unit. A lexical unit is an abstract family of sub-strings, or, in other words, a language, that corresponds to a particular feature of the language. The definition of lexical units is a bit arbitrary and depends on the next steps of the compiling process. For instance, for the C language, we could have as lexical units:
• identifiers
• keywords
• ...
or, as an alternative list of lexical units:
• identifiers,
• ...
where each keyword is its own lexical unit. It should be clear that each
lexical unit in the lists above corresponds to a set of words, i.e., a language.
It is common practice to associate a unique symbolic name to each lexical
unit, for instance a natural number, or a name such as ‘identifier’. As we
are about to see, those values will constitute a part of the scanner’s return
values.
Definition 1.12 (Token). A token is a pair (id, att), where id is the iden-
tifier of a lexical unit, and att is an attribute, i.e. an additional piece of
information about the token. M
Tokens are what the scanner actually returns and provides to the next
step of the compiling process. The attribute part of the token is optional:
it can be used to provide more information about the token, but this is
sometimes not needed.
A typical use of the attribute occurs when the matched lexeme is an
identifier. In this case, the scanner must check whether this identifier has
been matched before, and, if it is the case, to return a piece of information
that links all occurrences of the same identifier throughout the code. The
scanner achieves this by maintaining, at all times, a so-called symbol table.
Roughly speaking the symbol table records, at all times, all the identi-
fiers that the scanner has met so far. Whenever the scanner matches a new
lexeme which is an identifier, it looks it up in the symbol table. If the lex-
eme is not found, the scanner inserts it in the table. Then, the index of the
lexeme in the table can be used as a unique symbolic name for this lex-
eme, which can be put in the attribute part of the token that the scanner
returns.
1 int i = 5;
2 int j = 3 ;
3 i = 9 ;
Initially, the symbol table is empty. When the lexeme i in line 1 is matched,
the scanner inserts it into the first entry (index 0) of the symbol table, and
returns the token (identifier, 0). Here, identifier denotes the symbolic name
for identifiers, and we have used the index of the lexeme in the symbol
table as the attribute of the token, which is a unique symbolic name for the
identifier i. When j in line 2 is matched, it is inserted into entry number 1
of the symbol table, and the token (identifier, 1) is now returned. So far, the
symbol table has the following content:
index lexeme
0 i
1 j
for instance, a variable declaration from a variable use. When the symbol table must accommodate scoping, we will defer its creation to the parser (see next section). The technique we have described so far, however, works well when no scoping is required, for instance in simple programming languages.
Let us mention another possible use of the symbol table: it can also be exploited to match keywords, and to prevent keywords from being used as identifiers. This is achieved by initialising the symbol table with all possible keywords in the first entries of the table. This allows one to treat keywords in a similar fashion to identifiers, which often makes the scanner easier to implement.
Example 1.14. For instance, assume we are building a scanner for a lan-
guage with three keywords: while, for and if. We initialise the symbol
table this way:
index lexeme
0 while
1 for
2 if
Thus, keywords are present in lines 0–2 of the table, and identifiers will be
inserted in the following lines. Assume the scanner matches the lexeme
abc. It will be compared to all lines in the symbol table, and inserted since
it is not present:
index lexeme
0 while
1 for
2 if
3 abc
The scanner returns (identifier, 3), which means that a genuine identifier
has been matched, since the index 3 is not among the lines 0–2 that are de-
voted to keywords. Now, assume the scanner matches for, which could be
an identifier since the lexeme contains only letters. The scanner will find
this lexeme in line 1 of the symbol table and return: (identifier, 1). Since
now the attribute of the token is ≤ 2, the parser can identify this token as
the keyword for. M
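A minimal C sketch of the table used in Example 1.14 could look as follows (this is only an illustration under simple assumptions: a fixed-size array, linear search, and the three keywords of the example pre-loaded in entries 0 to 2).

#include <stdio.h>
#include <string.h>

/* Sketch of a symbol table pre-loaded with keywords (Example 1.14).
   Sizes are illustrative; no overflow check is performed in this sketch. */
#define MAX_ENTRIES 128
#define NB_KEYWORDS 3

static const char *table[MAX_ENTRIES] = { "while", "for", "if" };
static int nb_entries = NB_KEYWORDS;

/* Returns the index of the lexeme, inserting it first if it is unknown.
   An index < NB_KEYWORDS means the lexeme is actually a keyword. */
static int lookup_or_insert(const char *lexeme) {
    for (int i = 0; i < nb_entries; i++)
        if (strcmp(table[i], lexeme) == 0)
            return i;
    table[nb_entries] = lexeme;
    return nb_entries++;
}

int main(void) {
    printf("abc -> (identifier, %d)\n", lookup_or_insert("abc")); /* 3          */
    printf("for -> (identifier, %d)\n", lookup_or_insert("for")); /* 1: keyword */
    printf("abc -> (identifier, %d)\n", lookup_or_insert("abc")); /* 3 again    */
    return 0;
}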
The scanner is the first part of the compiling process. Its role is to split the input into a sequence of lexemes (Definition 1.11) that are associated with lexical units, and to return a sequence of tokens (Definition 1.12). It can be responsible for inserting identifiers into the symbol table, which contains all identifiers matched so far, and possibly all keywords. The symbol table is thus used as a communication medium between the different compiling phases.
Now that we have a clear view of what the scanner should do, it re-
mains to explain how to do it. Namely, we need to answer the following
questions:
1. How to specify lexical units? So far, we have used vague English descrip-
tions, like: ‘all words starting with a letter and followed by an arbitrary
number of letters and digits’. This is clearly not satisfactory. In Chap-
ter 2, we will introduce regular expressions to this end.
1.3.3 Parsing
It should now be clear that the duty of the scanner is to perform a local analysis of the code: to match a lexeme against a lexical unit, the scanner analyses a sequence of contiguous characters in the code. Such a local analysis is not sufficient to analyse all features of programming languages. Take, for instance, the matching of parentheses in arithmetic expressions, like:
( ( x + y ) * 3 )
Checking that the first (opening) parenthesis matches the last (closing) one clearly requires a global view on the piece of code under analysis.
Building (and, to some extent, analysing) such a global and abstract representation of the code is the task of the parser. To help us build an intuition of what such an abstract representation could be, let us consider the restricted case of arithmetic expressions that can contain: (1) parentheses (possibly nested); (2) identifiers (i.e., variable names); (3) natural numbers; (4) the +, -, / and * binary operators; (5) the - unary operator.
A binary operator is one that has two arguments, such as the '−' operator in 5 − 3, where the two arguments are 5 and 3. A unary operator has only one argument, such as '−' in the expression −5.
A textual description of what 'a syntactically correct expression' is could be given in an inductive fashion. For instance, a word is a correct expression if and only if it has one of the following forms:
The most usual object that parsers build is the abstract syntax tree (AST
for short). As the name indicates, this object is a tree that reflects the nest-
ing of the different programming constructs. As an example, the AST of
the expression i+y*5 could be:
[Diagram: an AST with a '+' node at the root; its left child is an Id node for i, and its right child is a '*' node whose children are an Id node for y and a Cst node for 5.]
Similarly, the AST of the following C while loop:
1 while(x>5) {
2 x = x-1 ;
3 y = 2*y ;
4 }
could be:
[Diagram: an AST with a Prog node at the root; its Statement child is a While node with a Cond subtree for the comparison x > 5 and a Block subtree containing, in sequence, the two assignment statements x = x-1 and y = 2*y.]
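Concretely, such a tree can be represented by a small data structure. The following C sketch (our own illustration, with hypothetical node kinds and field names) builds the AST of the expression i + y * 5 discussed above and prints it back in a fully parenthesised form.

#include <stdio.h>
#include <stdlib.h>

/* Sketch of an AST for i + y * 5: each node stores a kind, an optional
   name (for identifiers), an optional value (for constants) and children. */
typedef enum { N_ADD, N_MUL, N_ID, N_CST } kind_t;

typedef struct node {
    kind_t kind;
    const char *name;              /* used by N_ID  */
    int value;                     /* used by N_CST */
    struct node *left, *right;
} node_t;

static node_t *mk(kind_t k, const char *name, int value, node_t *l, node_t *r) {
    node_t *n = malloc(sizeof *n);
    n->kind = k; n->name = name; n->value = value; n->left = l; n->right = r;
    return n;
}

static void print(const node_t *n) {   /* prints a parenthesised form */
    switch (n->kind) {
    case N_ID:  printf("%s", n->name); break;
    case N_CST: printf("%d", n->value); break;
    case N_ADD: printf("("); print(n->left); printf(" + "); print(n->right); printf(")"); break;
    case N_MUL: printf("("); print(n->left); printf(" * "); print(n->right); printf(")"); break;
    }
}

int main(void) {
    /* '+' at the root, '*' as its right child, as in the diagram above. */
    node_t *ast = mk(N_ADD, NULL, 0,
                     mk(N_ID, "i", 0, NULL, NULL),
                     mk(N_MUL, NULL, 0,
                        mk(N_ID, "y", 0, NULL, NULL),
                        mk(N_CST, NULL, 5, NULL, NULL)));
    print(ast);
    printf("\n");                      /* prints (i + (y * 5)) */
    return 0;
}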
The parser and the symbol table In the previous section, we have dis-
cussed the creation of symbol table entries by the scanner, which is a tech-
nique that works fine when the compiler must not handle scoping, as in
Example 1.13. However, in a realistic example such as 1.1, the scope of
variables must clearly be taken into account. Indeed, the i variable used
in line 5 is not the same as the one declared in line 2, because the latter
occurs in the global scope, while the former lies in the block containing
the code of function f().
To cope with the scoping of identifier names, the compiler can manage several symbol tables, one for each scope, each containing all the identifiers from that scope. Since the scanner has no global view on the code and can hardly detect scopes, we ask the parser to populate these symbol tables. All the scanner does is to return information indicating that it has recognised an identifier, together with the name of the identifier.
All those symbol tables are arranged in a tree, in order to reflect the nesting of scopes (see example below). When the parser obtains from the scanner a token corresponding to an identifier whose name is v, it looks up v in several symbol tables: first, the current symbol table, then—if the identifier has not been found—its father, and so forth, up to the root that corresponds to the symbol table of the global scope. This is illustrated in the following example:
Example 1.15. Let us come back once again to the code excerpt of List-
ing 1.1. Initially, an empty symbol table T0 is created for the global scope.
Then, the parsing goes, a.o. through the following steps (we focus on the
handling of identifiers):
1. When the lexeme i is matched on line 2, the scanner returns the name
i to the parser, which looks it up in the current (and only) symbol table
T0 . Since T0 is still empty, i is inserted into T0 :
T0
Current index lexeme
0 i
2. Then, when reaching line 4, the parser detects the declaration of a new
function, and thus creates a new scope. Concretely, this amounts to
creating a new symbol table T1 , which is inserted in the tree of symbol
tables as a son of T0 :
T0
index lexeme
0 i
T1
Current
index lexeme
T0
index lexeme
0 i
T1
Current index lexeme
0 j
4. Next, moving to line 5, the parser first inserts a fresh i variable, but this time in T1 . As we will see, it will take precedence over variable i in T0 within the scope of function f:
T0
index lexeme
0 i
T1
index lexeme
Current
0 j
1 i
5. On the same line, the parser detects the use of variable j. It first looks
up j in T1 (which is the current symbol table) and finds it. Thus, this
occurrence of j will be identified with (T1 , 0).
6. Then, the parser finishes parsing f, and detects the use of a variable named i in line 6. It first looks it up in T1 , and finds an occurrence of i. Thus, it identifies this i with (T1 , 1), i.e., the same i as in line 5.
7. When leaving the scope of f, the parser changes the current symbol
table to T0 . Note however that T1 is kept in memory for the next steps
of compiling.
8. In line 9, the parser detects a new scope and inserts a new symbol table
T2 as a child of T0 , since it is the current table. T2 becomes the current
table:
T0
index lexeme
0 i
T1 T2
Current
index lexeme index lexeme
0 j
1 i
9. Finally, in line 11, the parser detects the use of a variable called i, and
looks it up in the tree, starting from the current symbol table T2 (which
contains no entry), then moving up the tree towards the root. Variable i
is eventually found in the root T0 , so it is correctly identified with (T0 , 0),
i.e., the one which is declared in line 2.
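The tree of symbol tables of Example 1.15 can be sketched in a few lines of C (again, an illustration under simple assumptions: fixed-size tables, linear search, and a father pointer per table). Lookup simply walks up towards the root, which models the global scope.

#include <stdio.h>
#include <string.h>

#define MAX_SYMS 32

typedef struct table {
    const char *lexemes[MAX_SYMS];
    int nb;
    struct table *father;          /* enclosing scope, NULL for the root */
} table_t;

static int insert(table_t *t, const char *lexeme) {
    t->lexemes[t->nb] = lexeme;
    return t->nb++;
}

/* Returns the table in which the lexeme is found (its index is written
   through *index), or NULL if no enclosing scope declares it. */
static table_t *lookup(table_t *t, const char *lexeme, int *index) {
    for (; t != NULL; t = t->father)
        for (int i = 0; i < t->nb; i++)
            if (strcmp(t->lexemes[i], lexeme) == 0) { *index = i; return t; }
    return NULL;
}

int main(void) {
    table_t T0 = { .nb = 0, .father = NULL };   /* global scope */
    table_t T1 = { .nb = 0, .father = &T0 };    /* scope of f() */
    int idx;

    insert(&T0, "i");              /* global i                       */
    insert(&T1, "j");              /* j in the scope of f            */
    insert(&T1, "i");              /* local i, hiding the global one */

    lookup(&T1, "i", &idx);
    printf("i inside f(): (T1, %d)\n", idx);    /* (T1, 1) */
    lookup(&T0, "i", &idx);
    printf("i globally:   (T0, %d)\n", idx);    /* (T0, 0) */
    return 0;
}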
As for the scanner, we can now identify several questions that we must
solve in order to build parsers:
1. We have said that the parser must check the syntax of its input, but how
do we specify this syntax? We will see, in Chapter 3 that grammars can
be used for that purpose, just as we have used regular expressions in the
case of scanners.
2. How can we build a parser from a given grammar, and what kind of
machine will it look like? In Chapter 4, we will see that pushdown au-
tomata—an extension of the finite automata from Chapter 2—are ab-
stract machines that can be used to formalise and build parsers. We will
review, in Chapter 5 and Chapter 6, several techniques to build parsers
that are efficient in practice.
1.3.4 Semantic analysis
Now that the parser has checked that the syntax of the input code is correct, and has built an abstract representation of this code, it is time to start analysing what the code means, in order to prepare its translation. This is the aim of semantic analysis. Of course, semantic analysis is highly dependent on the input language; this is why we will stay very general when introducing it. Yet, we can identify several essential points to deal with: scoping, typing, and control flow.
Scoping During the semantic analysis phase, the compiler can analyse the links that exist between the declaration(s) of a name (if any) and the uses of this name throughout the code. For instance, the—syntactically correct—code in Figure 1.1 could raise an error during semantic analysis, because variable i is used undeclared (although the name i is declared later as an integer variable). Observe that the control of the scoping can already be performed during parsing, thanks to the symbol table (see previous section).
1 int main() {
2 i = 3 ;
3 int i ;
4 }
Figure 1.1: A syntactically correct excerpt of C code that raises an error during semantic analysis.
Type checking and type control Each name (variable name, function name, type name,. . . ) in a program is associated with a data type (or, simply, a type) that describes uniquely how this name can be manipulated. During semantic analysis, the compiler determines (if possible) the type of each expression, and checks that the operations on those expressions are consistent with their types. Figure 1.2 shows the typical problems that can occur when compiling an assignment in C. The assignment in line 9 is not problematic, because the type of the right-hand side of the assignment is int, which is the same as the type of the variable j. Indeed, the sum of the int variable i and the int constant 4 is an int. The second assignment (line 10) raises a warning (i.e., a non-blocking error): the right-hand side is a pointer, but assigning a pointer to an int is allowed in C, and a conversion is implicitly applied by the compiler. The last assignment (line 11) raises an error: the type of the right-hand side is struct S, and the compiler does not know how to convert such an object to an int.
1 struct S {int i;} ;
2 int main() {
3 int i, j ;
4 struct S s ;
5 struct S * p = &s ;
6
7 i = 3 ;
8
9 j = i + 4 ;
10 j = p ;
11 j = s ;
12 }
Figure 1.2: Three syntactically correct assignments with different behaviours of the semantic analyser. The first (line 9) is not problematic. The second (line 10) raises a warning because a pointer is cast to an integer. The last (line 11) is not allowed: no conversion is possible.
In order to manage types, the compiler can add information to the AST that has been built during parsing. This operation of adding information to a tree is called 'decoration'¹⁶. As an example, Figure 1.3 displays the decorated AST of the C statement x = sum * 1.5, where x and sum are integer variables. Since one of the terms of the sum * 1.5 product is a float, the compiler assigns this type to the expression. This allows it to detect that the result will need to be truncated when copying it to x (and perhaps to raise a warning, depending on the compiler and its options). Of course, such information will be crucial when generating the target code for this assignment.
16 Which suggests that ASTs are most probably Christmas trees. . .
[Figure 1.3: the decorated AST of x = sum * 1.5; the '=' node is decorated with type int, the Id node for x with int, and the '*' node with float.]
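The following C sketch (our own illustration, with hypothetical type and node names) shows one simple way to decorate such an AST: every node carries a type field, and a binary node is given type float as soon as one of its children has that type.

#include <stdio.h>

typedef enum { T_INT, T_FLOAT } type_t;

typedef struct node {
    const char *label;
    type_t type;
    struct node *left, *right;
} node_t;

/* Decorates a binary node ('*', '+', ...) from its already decorated children. */
static void decorate_binop(node_t *n) {
    n->type = (n->left->type == T_FLOAT || n->right->type == T_FLOAT)
                  ? T_FLOAT : T_INT;
}

int main(void) {
    node_t sum = { "Id sum", T_INT,   NULL,  NULL };
    node_t cst = { "Cst 1.5", T_FLOAT, NULL,  NULL };
    node_t mul = { "*",       T_INT,   &sum,  &cst };
    node_t x   = { "Id x",    T_INT,   NULL,  NULL };

    decorate_binop(&mul);
    printf("type of %s: %s\n", mul.label, mul.type == T_FLOAT ? "float" : "int");
    if (x.type == T_INT && mul.type == T_FLOAT)
        printf("warning: float value truncated when assigned to x\n");
    return 0;
}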
Control flow The term 'control flow' refers to the order in which the instructions of the program are executed. [. . . ] The edges of the AST are shown in gray. Additional edges (dashed) are first inserted in the AST to represent the control flow of the if. Similar treatment is applied recursively in the B , E , T and N sub-trees. Then, the edges of the AST are removed, and one obtains the typical diamond shape of the CFG of an if statement.
[Diagram: the construction of the control-flow graph of an if-then-else statement from its AST, with sub-trees B, T, E and N.]
contains all the necessary information for the synthesis phase of the compiler.
1.3.5 Synthesis
Code optimisation Before the output code is actually generated, the compiler might perform several optimisations on the code. Typical optimisations include, but are not limited to:
Control flow optimisation Modifies the control flow graph in order to make the resulting code more efficient. Figure 1.6 shows an example. In the first version of the loop, the conditional x>2 is tested at each iteration of the loop. However, the loop does not modify x, so this condition will be either always true along the loop, or always false. The second version of the loop is therefore more efficient.
Loop optimisation Consists in making loops more efficient, for instance by unravelling them when they are executed a constant number of times.
Constant propagation The compiler can try and detect variables that keep a constant value, and replace the occurrences of these variables by those constant values, thereby avoiding unnecessary and costly memory accesses.
1 for(int i=0; i<n; ++i) {
2 if (x > 2)
3 printf("%d", i) ;
4 else
5 printf("%d", i+1) ;
6 }
Becomes:
1 if (x>2) {
2 for(int i=0; i<n; ++i) {
3 printf("%d", i) ;
4 }
5 } else {
6 for(int i=0; i<n; ++i) {
7 printf("%d", i+1) ;
8 }
9 }
Figure 1.6: An example of control flow optimisation. The second code excerpt guarantees to test the condition x > 2 only once.
1 int a, b, c ;
2 if (a > b)
3 c=1 ;
4 else
5 c=2 ;
As can be seen from this example, the LLVM intermediate language is pretty close to a classical machine language, with very low-level instructions such as icmp sgt i32 to compare two integers on 32 bits, br i1 for a conditional jump, or br for an unconditional jump. But this language allows one to use as many virtual registers (whose names begin with %) as desired. It is thus easier to generate LLVM intermediate language than machine language for a machine with a fixed (and limited) number of registers. M
w ·v = w1 w2 · · · wn v1 v2 · · · vℓ
We can lift the concatenation operator to sets of words, i.e., languages. Intuitively, the concatenation of two languages is a new language that contains all the words obtained by concatenating one word from the former language with one word from the latter:
Definition 1.18 (Concatenation of languages). Let L 1 and L 2 be two languages. Then, their concatenation, denoted L 1 · L 2 , is the language:
L 1 · L 2 = {w 1 · w 2 | w 1 ∈ L 1 and w 2 ∈ L 2 }
M
By reading this definition carefully, one realises that the empty language ∅ is not a neutral element for language concatenation. Indeed, assume L 1 = ∅, and consider L 1 · L 2 . For a word w to belong to L 1 · L 2 , it must have a prefix which is a word of L 1 . However, there is no word in L 1 , so no word belongs to L 1 · L 2 ; that is, L 1 · L 2 = ∅. However, {ε} is a neutral element: L · {ε} = {ε} · L = L for all languages L.
For example, if L 1 = {I love , I hate }, and L 2 = {compilers, chocolate}, then L 1 · L 2 = {I love compilers, I love chocolate, I hate compilers, I hate chocolate}.
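For finite languages, Definition 1.18 can be transcribed almost literally into code. The following C sketch (our own illustration) enumerates L 1 · L 2 for the two languages of the example by pairing every word of L 1 with every word of L 2.

#include <stdio.h>

int main(void) {
    const char *L1[] = { "I love", "I hate" };
    const char *L2[] = { "compilers", "chocolate" };

    for (int i = 0; i < 2; i++)
        for (int j = 0; j < 2; j++)
            /* one word of L1 · L2 per line; the format string supplies
               the separating space that the example words carry */
            printf("%s %s\n", L1[i], L2[j]);
    return 0;
}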
On top of language concatenation, we can introduce several other no-
tations:
1. For all languages L, for all natural numbers n, L n is the language containing all words obtained by taking n words from L and concatenating them:
L n = {w 1 w 2 · · · w n | for all 1 ≤ i ≤ n : w i ∈ L}
For example, if L = {a, b}, then L 3 = {aaa, aab, aba, baa, abb, bab, bba, bbb}.
2. The Kleene closure (or Kleene star) of L, denoted L ∗ , is the language of all words obtained by concatenating any number (possibly zero) of words from L.
For example, {a}∗ = {ε, a, aa, aaa, aaaa, . . .}. Observe that ε ∈ L ∗ for all languages L, and that L ∗ is necessarily an infinite language, except for the cases where L = {ε} and L = ∅, since then L ∗ = {ε}.
The reader should now be convinced of the importance of language theory in computer science, and in particular for compiler design. Our main objective for now will be to study formal tools to:
(1) define, using a finite syntax, languages that are potentially infinite; and
(2) manipulate those languages (for instance, combine them using classi-
cal set operations such as union or intersection). In particular, we want
to be able to answer the membership problem, or, in other words, to be
able to tell in an automatic way whether a given word belongs to a given
language or not.
2 All things regular. . .
In this chapter, we will study the class of regular languages¹. Regular languages form one of the most basic classes of languages, yet they contain many useful languages, such as the one we have used to define (most) legal C identifiers and keywords:
1 In francophone Belgium, regular languages are called « langages réguliers », while in France, they are called « langages rationnels », probably a much better translation.
ℓ = a + b + c + d + e + f + g + h + i + j + k + l + m + n + o + p + q + r + s + t + u + v + w + x + y + z + A + B + C + D + E + F + G + H + I + J + K + L + M + N + O + P + Q + R + S + T + U + V + W + X + Y + Z + _
d = 1 + 2 + 3 + 4 + 5 + 6 + 7 + 8 + 9 + 0
ℓ · (ℓ + d )∗
where · denotes concatenation, + denotes an alternative and ∗ denotes the Kleene closure (see Section 1.4). In other words, the above regular expression must be interpreted as: « a character matching ℓ (i.e., a non-digit character) followed by any number of characters matching either ℓ or d ».
1. either L = ∅;
2. or L = {ε};
3. or L = {a} for some a ∈ Σ;
4. or L = L 1 ∪ L 2 ;
5. or L = L 1 · L 2 ;
6. or L = L 1 ∗
Example 2.2.
2. The language of all binary words (thus on the alphabet {0, 1}) is regular. Indeed, this language can be defined as:
{0, 1}∗ = ({0} ∪ {1})∗
3. The language of all well-parenthesised words over Σ = {(, )}, on the other hand, is not regular (this can be proved formally). Intuitively, the definition of regular languages does not allow one to discriminate between words
Let us now introduce several formal tools to deal with regular languages.
The first tool we will consider for regular languages is the regular expression. Regular expressions are a kind of algebraic characterisation of regular languages. To define regular expressions, we need to define two things: their syntax (i.e., which regular expressions can we write?), and their semantics (i.e., what is the meaning of a given regular expression, in terms of regular languages?). These definitions follow closely that of regular languages:
For the more mathematically inclined readers, regular expressions form a so-called Kleene algebra, i.e., an idempotent semi-ring, see: Dexter Kozen. On Kleene algebras and closed semirings. In Mathematical foundations of computer science, Proceedings of the 15th Symposium, MFCS '90, Banská Bystrica/Czech. 1990, volume 452 of Lecture notes in computer science, pages 26–47, 1990. URL http://www.cs.cornell.edu/~kozen/Papers/kacs.pdf
Definition 2.3 (Regular expressions). Given a finite alphabet Σ, the following are regular expressions on Σ:
1. The constant ∅. It denotes the language L(∅) = ∅.
2. The constant ε. It denotes the language L(ε) = {ε}.
Theorem 2.1. For all regular languages L, there is a regular expression r s.t.
L(r ) = L. For all regular expressions r , L(r ) is a regular language.
Observe that the language L(r ) associated to each regular expression r is unique, while there can be several regular expressions that denote the same language. For instance a and a + a both denote the language {a}, i.e. L(a) = L(a + a) = {a}.
Actually, finding a minimal regular expression to denote a given regular language L is not an easy problem, since the problem of determining whether two given regular expressions r 1 and r 2 accept the same language (i.e., L(r 1 ) = L(r 2 )) is a PSPACE-complete problem, see: L. J. Stockmeyer and A. R. Meyer. Word problems requiring exponential time (preliminary report). In Proceedings of the Fifth Annual ACM Symposium on Theory of Computing, STOC '73, pages 1–9, New York, NY, USA, 1973. ACM. DOI: 10.1145/800125.804029.
2.2.1 Extended regular expressions
Regular expressions are widely used in practice, in particular by many Unix applications. They can be used, for instance, to look for specific files using the ls command. As an example, the following command lists all the file names in the current directory (thanks to the command find .) and filters them using the grep tool, following the pattern ^..g.*\.tex which is given as an extended regular expression.
The pattern asks to select only the filenames that have a g in the third position, and have .tex as extension. As can be seen from this example, the syntax of Unix regular expressions (called extended regular expressions) departs significantly from Definition 2.3. This is not surprising, since Definition 2.3 has been introduced mainly for theoretical purposes. On the other hand, the syntax of extended regular expressions (see Table 2.1) is probably better fitted for practical purposes. Still, all languages that are definable by extended regular expressions are regular, which means these new constructs do not alter the expressiveness.
The difference between the two syntaxes can be confusing: the + denotes the alternative in 'classical' regular expressions, and thus corresponds to | in extended regular expressions. On the other hand, + in extended regular expressions is the repetition, i.e., it corresponds to r · r ∗ in 'classical' regular expressions. . .
E.R.E. Semantics
x the character x
. any character, except the ‘newline’ special character
"x" the character x, even if x is an operator. For instance "." is the character . and not ‘any
character’.
\x the character x, even if x is an operator (for instance \. is the . character)
[xy] either x or y
[a-z] any character in the range a, b,. . . ,z. Other ranges can be used, like 1-5 or D-X, for instance
[^x] any character but x
^x an x at the beginning of a line
x$ an x at the end of a line
x? an optional x
x* the concatenation of any number of x’s (Kleene closure)
x+ the concatenation of any strictly positive number of x’s
x{m,n} the concatenation of k x's, where m ≤ k ≤ n
x|y either x or y
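Extended regular expressions are also directly usable from C on Unix systems, through the POSIX regex functions. The following sketch (our own example; the tested filenames are made up) compiles the pattern from the text and matches a few strings against it.

#include <regex.h>
#include <stdio.h>

int main(void) {
    regex_t re;
    /* The pattern of the example above: a 'g' in third position and a
       .tex extension ("\\." in a C string literal is a literal dot). */
    if (regcomp(&re, "^..g.*\\.tex", REG_EXTENDED | REG_NOSUB) != 0) {
        fprintf(stderr, "bad pattern\n");
        return 1;
    }
    const char *names[] = { "bugreport.tex", "logbook.tex", "notes.txt" };
    for (int i = 0; i < 3; i++) {
        int status = regexec(&re, names[i], 0, NULL, 0);
        printf("%-15s %s\n", names[i],
               status == 0 ? "matches" : "does not match");
    }
    regfree(&re);
    return 0;
}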
• The aim of the machine is to discriminate between words that are in a given language, and words that are not. The automaton does so by either accepting or rejecting input words. At all times, the machine produces a binary (yes/no) output, indicating whether the word prefix read so far is accepted or not by the machine.
[Figure 2.1: An illustration of a finite automaton, showing the input tape (with content lldl), the reading head, the yes/no output, and the control with states q 1 , q 2 and q 3 .]
Figure 2.1 is an illustration of those concepts. It displays the input tape (with content lldl), the reading head, and the output. The content of the rectangular box represents the different possible states of the automaton, by means of circles (in this case, the states are called q 1 , q 2 and q 3 ) and the possible state changes, by means of labeled arrows between states. In this example, for instance, reading an l on the input tape when in state q 1 moves the current state to q 2 , and so forth. In addition, we need to indicate:
should be either accepting (in which case the output is 'yes') or rejecting (the output is 'no'). We will display accepting states as nodes with a double border. In this case, q 2 is the only accepting state.
From the intuitions sketched above, it should be clear that the behaviour of the automaton will depend only on its states and on the possible changes between those states, i.e., what is depicted inside the rectangular box in Figure 2.1. So, in the next illustrations of finite automata, we will restrict ourselves to this part, that is, we will display the automaton of Figure 2.1 as in Figure 2.2.
[Figure 2.2: We can represent finite automata more compactly by focusing on the 'control', i.e., the states and transitions: q 1 moves to q 2 on l and to q 3 on d; q 2 loops on l and d; q 3 loops on l and d.]
Since an automaton either accepts or rejects any word, it also implicitly defines a language, which contains all the words the automaton accepts. It is easy to check that the language defined by the automaton in Figure 2.2 is exactly the language of the regular expression l · (l + d)∗ (assuming the input alphabet is Σ = {l, d}). Indeed:
1. When running with a word starting by an l on the input tape, the au-
tomaton first moves from q 1 to q 2 , which is accepting and where it will
stay up to the end of its execution. So all words starting by an l will be
accepted.
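The control of Figure 2.2 translates almost line by line into code. The following C sketch (our own illustration) simulates that automaton; as an assumption not made in the figure, we map any letter or underscore to the abstract symbol l and any digit to d, so that the program directly tests candidate C identifiers.

#include <ctype.h>
#include <stdio.h>

/* Hand-coded DFA following Figure 2.2: q2 is the only accepting state,
   q3 is a trap state. */
enum state { Q1, Q2, Q3 };

static int accepts(const char *w) {
    enum state q = Q1;
    for (; *w; w++) {
        int is_l = isalpha((unsigned char)*w) || *w == '_';
        int is_d = isdigit((unsigned char)*w);
        if (!is_l && !is_d) return 0;          /* not a word over {l, d} */
        switch (q) {
        case Q1: q = is_l ? Q2 : Q3; break;    /* l goes to q2, d to q3  */
        case Q2: q = Q2; break;                /* q2 loops on l and d    */
        case Q3: q = Q3; break;                /* trap state             */
        }
    }
    return q == Q2;
}

int main(void) {
    const char *words[] = { "x27", "2pac", "_tmp1", "" };
    for (int i = 0; i < 4; i++)
        printf("%-6s %s\n", words[i],
               accepts(words[i]) ? "accepted" : "rejected");
    return 0;
}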
2.3.2 Syntax
A = 〈Q, Σ, δ, q 0 , F 〉
where:
1. Q = {q 1 , q 2 , q 3 };
2. Σ = {l, d};
3. q 0 = q 1 ;
4. F = {q 2 };
There are two important features of Definition 2.5 that one should observe. First, the co-domain of the transition function is a set of states. Observe that in the example of Figure 2.2, the function always returns either a singleton or the empty set. However, we can also build automata like the automaton of Figure 2.3, where δ(q 0 , a) = {q 1 , q 2 }.
[Figure 2.3: A non-deterministic finite automaton, with states q 0 , q 1 , q 2 , q 3 and q 4 ; from q 0 , reading an a can lead to either q 1 or q 2 .]
In this example, there are several possible executions of the automaton on the input word ab: the word can be read by an execution visiting q 0 , q 1 , then q 3 , or an execution visiting q 0 , q 2 and q 4 . This phenomenon is called non-determinism, and the automaton of Figure 2.3 is said to be non-deterministic. Non-determinism raises several natural questions:
1. How do we determine the output of the automaton, when there are several possible runs on the same word, that do not all end in an accepting state? This occurs with the word ab and the automaton in Figure 2.3. The rule is that, in non-deterministic automata, there must exist one run that accepts for the word to be accepted.
The second important feature of Definition 2.5 is the fact that some transitions can be labeled by the empty word ε. This is called a 'spontaneous move' and allows the automaton to change its current state without reading any character on the input (hence, without moving its reading head). Again, spontaneous moves depart radically from our intuition of an algorithm, yet they can be useful for modeling purposes. For instance, suppose we want to build an automaton for the language composed of all words that start with a (possibly empty) sequence of a's, followed by a (possibly empty) sequence of b's. One natural way to do it would be to start by building two automata for those two parts of the words in the language:
[Diagram: a state q 0 with a self-loop labeled a, and a state q 1 with a self-loop labeled b.]
then, add a spontaneous move between those states, to allow the automaton to move from the 'sequence of a's' part to the 'sequence of b's' part:
[Diagram: the same two states, with an additional ε-labeled transition from q 0 to q 1 .]
M
For instance, the automaton in Figure 2.2 is a DFA, hence also an NFA
and an ε-NFA. The automaton in Figure 2.3 is an NFA, hence also an ε-NFA,
but is not a DFA. The automaton in Figure 2.5 is an ε-NFA, but neither an
NFA, nor a DFA.
2.3.3 Semantics
Intuitively, a pair ⟨q, w⟩ completely characterises the current 'configuration' of the automaton: its current state is q and the word w remains on the input (in other words, the reading head is currently on the first character of w, if w ̸= ε; or at the end of the tape, if w = ε). Then, using the transition relation, we can define how an automaton changes its current configuration:
In the rest of these notes, we will often omit the subscript on the opera-
tor, when the ε-NFA is clear from the context, and write (q 1 , w 1 ) ⊢(q 2 , w 2 )
instead. Very often, we will consider sequences of configurations (q 1 , w 1 ),
(q 2 , w 2 ), . . . , (q n , w n ) s.t. (q i , w i ) ⊢(q i +1 , w i +1 ) for all 1 ≤ i ≤ n − 1. Such se-
quences are called runs of the automaton (on the word w 1 , which is the
word in the first configuration of the run). We say that a run (q 1 , w 1 ),
(q 2 , w 2 ), . . . , (q n , w n ) is accepting iff its last configuration (q n , w n ) is ac-
cepting; and we say that it is initialised iff its first configuration is initial.
Since ⊢ A is a binary relation, we use the classical ⊢∗ A notation to denote its reflexive and transitive closure. Then, we can define the accepted language of an automaton:
It is easy to check that the first configuration of the run is initial, that the
last is accepting. Hence, w = aab is accepted.
On the other hand, the maximal run that can be built on the word w ′ =
ba is:
(q 0 , ba), (q 1 , ba), (q 1 , a)
ample, the tree of possible runs of the automaton in Figure 2.3 is shown in Figure 2.6.
[Figure 2.6: the tree of runs on the word ab: below the root, the configurations (q 1 , b) and (q 2 , b), and below them (q 3 , ε) and (q 4 , ε).]
So far, we have reviewed two families of models for defining and manipulating languages: regular expressions, on the one hand, and finite automata, on the other hand. We know that regular expressions define exactly the class of regular languages (see Definition 2.1 and Theorem 2.1), but what about the expressive power of the three different classes of automata we have introduced? Obviously, DFAs cannot be more expressive than NFAs, which cannot be more expressive than ε-NFAs, by definition. We have already seen at least one example of an automaton that recognises the same language as a given regular expression (see Figure 2.5), but can this be generalised?
The 'expressive power' of a model is a term often used to speak about the class of languages that the model can define. One can thus speak about the expressive power of regular expressions (i.e., the regular languages), or the expressive power of finite automata, and compare them. . .
It turns out that the expressive power of all three classes of finite automata is exactly the same, and equals that of regular expressions, that is, the regular languages. This result is due to Stephen Kleene⁵:
5 Stephen C. Kleene. Representation of events in nerve nets and finite automata. Technical Report RM-704, The RAND Corporation, 1951. URL http://minicomplexity.org/pubr.php?t=2&id=2
Theorem 2.2 (Kleene's theorem). For every regular language L, there is a DFA A such that L(A) = L. Conversely, for all ε-NFAs A, L(A) is regular.
Our formulation of the theorem might seem restrictive, but one must always bear in mind that DFAs are a special case of NFAs, which are, in turn, a special case of ε-NFAs. Hence, 'For every regular language L, there is a DFA A such that L(A) = L' entails that there is also an NFA and an ε-NFA recognising L (actually, the DFA A can serve for that purpose). Conversely, 'for all ε-NFAs A: L(A) is regular' implies that the languages of all NFAs and DFAs are also regular!
In other words, all finite automata recognise regular languages and all regular languages are recognised by a finite automaton. To establish this result, we will give constructions that convert finite automata into regular expressions and vice-versa. More precisely, we will give algorithms to:
1. Convert any regular expression into an ε-NFA defining the same language.
2. Convert any ε-NFA into a DFA accepting the same language. This is called 'determinising' the ε-NFA as it somehow turns it into a deterministic version. Observe that this method can be applied, in particular, to any NFA.
3. Convert any DFA into a regular expression defining the same language.
This set of transformations is summarised in Figure 2.7. Together with Theorem 2.1, those transformations allow us to conclude that finite automata recognise exactly regular languages.
[Figure 2.7: The set of transformations used to prove Theorem 2.2, with the section numbers where they are introduced: regular expression to ε-NFA (2.4.1), ε-NFA to DFA (2.4.2), DFA to regular expression (2.4.3); NFAs are handled as a special case of ε-NFAs.]
regular expressions r , an ε-NFA A r s.t. (i) L(A r ) = L(r ); and (ii) the (necessarily unique) initial state of A r is called q ri , and A r has exactly one final state that we denote q rf . Moreover, no transition enters q ri , nor leaves q rf .
K. Thompson. Regular expression search algorithm. Communications of the ACM, 11(6):419–422, 1968. DOI: 10.1145/363347.363387.
7 R. McNaughton and H. Yamada. Regular expressions and state graphs for automata. Electronic Computers, IRE Transactions on, EC-9(1):39–47, March 1960. ISSN 0367-9950. DOI: 10.1109/TEC.1960.5221603.
Base cases Building ε-NFAs that accept the base cases of regular expressions is easy, as shown in the following table:
[Table: for r = ∅, the ε-NFA A ∅ consists of an initial state and a final state with no transition between them; for r = ε, A ε has an ε-labelled transition from its initial to its final state; for r = a ∈ Σ, A a has an a-labelled transition from its initial to its final state.]
Observe that we could have given simpler constructions. For instance, A ε could have been made up of only one (initial and accepting) state. However, the construction we present has the benefit to keep initial and final states separated, and is therefore more systematic.
Inductive case For the inductive case, we assume ε-NFAs A r 1 and A r 2 are already known for two regular expressions r 1 and r 2 . We treat the disjunction, concatenation and Kleene closure as follows:
[Table: diagrams of the ε-NFAs for r 1 + r 2 , r 1 · r 2 and r 1 ∗ . Each is obtained by combining A r 1 (and, where relevant, A r 2 ) with fresh initial and final states, linked to the initial and final states of the sub-automata by ε-labelled transitions; for r 1 ∗ , additional ε-transitions allow the automaton to bypass A r 1 entirely or to loop back into it.]
Example 2.12. Let us consider the regular expression l · (l + d)∗ on the alphabet Σ = {l, d}. Following the construction, we start with the base cases:
[Diagrams: the ε-NFAs A l and A d , each with a single l- (respectively d-) labelled transition from the initial to the final state.]
[Diagrams: the ε-NFAs A l+d and A (l+d)∗ , obtained by the inductive constructions for disjunction and Kleene closure.]
[Figure 2.8: the resulting ε-NFA A l·(l+d)∗ , with states q 1 , . . . , q 10 : q 1 reaches q 2 by reading l, q 5 reaches q 6 by reading l, q 7 reaches q 8 by reading d, and all other transitions are labelled by ε.]
δ(S, a) = ⋃ q∈S δ(q, a)
In other words, computing δ(S, a) amounts to computing all the states that the automaton can reach from any state q ∈ S, by reading an a. Then:
i ∈ N, let εclosure i (q) be defined as follows:
εclosure i (q) = {q} if i = 0
εclosure i (q) = δ(εclosure i−1 (q), ε) ∪ εclosure i−1 (q) otherwise
The definition might seem hard to read, but the intuition is really easy: εclosure i (q) is the set of states that A can reach from q by following at most i transitions labeled by ε.
Then, for all q ∈ Q: εclosure(q) = εclosure K (q), where K is the least value at which the sequence stabilises, i.e., the least K s.t. εclosure K (q) = εclosure K+1 (q).
Example 2.14. Let us consider the ε-NFA in Figure 2.8, and let us compute εclosure(q 6 ). We compute εclosure i (q 6 ) for i = 0, 1, . . . up to stabilisation:
εclosure 0 (q 6 ) = {q 6 }
εclosure 1 (q 6 ) = δ({q 6 }, ε) ∪ {q 6 } = {q 6 , q 9 }
εclosure 2 (q 6 ) = δ({q 6 , q 9 }, ε) ∪ {q 6 , q 9 } = {q 4 , q 9 , q 10 } ∪ {q 6 , q 9 } = {q 4 , q 6 , q 9 , q 10 }
εclosure 3 (q 6 ) = δ({q 4 , q 6 , q 9 , q 10 }, ε) ∪ {q 4 , q 6 , q 9 , q 10 } = {q 4 , q 5 , q 7 , q 9 , q 10 } ∪ {q 4 , q 6 , q 9 , q 10 } = {q 4 , q 5 , q 6 , q 7 , q 9 , q 10 }
εclosure 4 (q 6 ) = δ({q 4 , q 5 , q 6 , q 7 , q 9 , q 10 }, ε) ∪ {q 4 , q 5 , q 6 , q 7 , q 9 , q 10 } = {q 4 , q 5 , q 6 , q 7 , q 9 , q 10 } = εclosure 3 (q 6 )
Hence, εclosure(q 6 ) = {q 4 , q 5 , q 6 , q 7 , q 9 , q 10 }.
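This fixed-point computation is easy to program. The C sketch below (our own illustration) encodes a set of states as a bit vector, one bit per state; the ε-transition table is our reading of the automaton of Figure 2.8 and is therefore an assumption of this sketch, not part of the notes.

#include <stdio.h>

#define N 10
typedef unsigned int stateset;           /* bit i-1 set  <=>  q_i in the set */
#define BIT(i) (1u << ((i) - 1))

/* eps[i-1] is δ(q_i, ε), as we read it off Figure 2.8 (assumption). */
static const stateset eps[N] = {
    /* q1 */ 0,
    /* q2 */ BIT(3),
    /* q3 */ BIT(4) | BIT(10),
    /* q4 */ BIT(5) | BIT(7),
    /* q5 */ 0,
    /* q6 */ BIT(9),
    /* q7 */ 0,
    /* q8 */ BIT(9),
    /* q9 */ BIT(4) | BIT(10),
    /* q10 */ 0,
};

static stateset eclosure(stateset s) {
    stateset prev;
    do {                                 /* iterate until stabilisation */
        prev = s;
        for (int i = 1; i <= N; i++)
            if (s & BIT(i))
                s |= eps[i - 1];
    } while (s != prev);
    return s;
}

int main(void) {
    stateset c = eclosure(BIT(6));       /* εclosure(q6) */
    printf("eclosure(q6) = {");
    for (int i = 1; i <= N; i++)
        if (c & BIT(i)) printf(" q%d", i);
    printf(" }\n");                      /* { q4 q5 q6 q7 q9 q10 } */
    return 0;
}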
Determinisation of ε-NFAs
The DFA D = ⟨Q D , Σ, δ D , q 0 D , F D ⟩ obtained from the ε-NFA A = ⟨Q A , Σ, δ A , q 0 A , F A ⟩ is defined by:
1. Q D = 2^(Q A) (the set of subsets of Q A );
2. q 0 D = εclosure(q 0 A );
3. F D = {S ∈ Q D | S ∩ F A ̸= ∅};
4. for all S ∈ Q D and a ∈ Σ: δ D (S, a) = εclosure(δ A (S, a)).
1. As expected, the set of states of the DFA is the set of subsets of the ε-
NFAs states.
2. The initial state of the DFA is the set of states the NFA can reach from
its own initial state q 0A by reading only ε-labeled transitions. Thus, q 0D
is the set of states in which the ε-NFA can be before reading any letter.
3. A state of the DFA is accepting iff it contains at least one accepting state
of the ε-NFA. This is coherent with the intuition that at least one execu-
tion of the ε-NFA must accept for the word to be accepted.
4. The transition function consists in: first reading a letter, then following
as many ε-labeled transitions as possible.
Although we will not present the details here⁸, one can show that the DFA obtained from any ε-NFA by the above construction preserves the accepted language of the ε-NFA:
8 The interested reader can find a proof in: John E. Hopcroft, Rajeev Motwani, and Jeffrey D. Ullman. Introduction to Automata Theory, Languages, and Computation (3rd Edition). Addison-Wesley Longman Publishing Co., Inc., Boston, MA, USA, 2006. ISBN 0321455363.
Theorem 2.3 (Determinisation of ε-NFAs). For all ε-NFAs A, the DFA D obtained by determinising A accepts the same language as A: L(A) = L(D).
δ̂(q, ε) = εclosure(q)
L(A) = {w ∈ Σ∗ | δ̂ A (q 0 A , w) ∩ F A ̸= ∅}
Then, to prove that L(D) = L(A), it is sufficient to check that, for all words w, the set δ̂ A (q 0 , w) is exactly the state which is reached by D when reading w from its initial state. This can be established by induction on the length of w, which is easy because of the inductive definition of δ̂ A .
Example 2.15. Let us consider again the ε-NFA in Figure 2.8, and let us build its deterministic counterpart D = ⟨Q D , Σ, δ D , q 0 D , F D ⟩. We start from the initial state S 1 = q 0 D = εclosure(q 1 ) = {q 1 }.
δ D (S 1 , l) = εclosure(δ A (S 1 , l)) = εclosure({q 2 }) = {q 2 , q 3 , q 4 , q 5 , q 7 , q 10 }
δ D (S 1 , d) = εclosure(∅) = ∅
εclosure(δ A (S 2 , l)) = εclosure({q 6 }) = {q 4 , q 5 , q 6 , q 7 , q 9 , q 10 }
εclosure(δ A (S 2 , d)) = εclosure({q 8 }) = {q 4 , q 5 , q 7 , q 8 , q 9 , q 10 }
εclosure(δ A (∅, d)) = ∅
εclosure(δ A (∅, l)) = ∅
Now, from S 3 :
εclosure(δ A (S 3 , d)) = εclosure({q 8 }) = S 4
εclosure(δ A (S 3 , l)) = εclosure({q 6 }) = S 3
Finally, from S 4 :
εclosure(δ A (S 4 , d)) = εclosure({q 8 }) = S 4
εclosure(δ A (S 4 , l)) = εclosure({q 6 }) = S 3
The resulting DFA is depicted in Figure 2.9. Actually, this figure shows the part of the DFA which is reachable from the initial state (since we have built the states iteratively from the initial state). Indeed, a state like {q 1 , q 10 } also exists in the DFA, but is not reachable.
[Figure 2.9: the reachable part of the DFA obtained by determinising the ε-NFA of Figure 2.8.]
Observe that the determinisation process does not always yield a minimal automaton (in this case, the states S 2 and S 3 could be 'merged'). We will review, in Section 2.5, a technique for minimising DFAs.
M
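The whole construction, restricted to the reachable states, can be programmed as a small worklist algorithm. The C sketch below (our own, self-contained illustration) reuses the bit-vector encoding of the previous sketch; the ε-, l- and d-transition tables are again our reading of Figure 2.8 and are assumptions of this sketch, as is the fact that q 10 is the only accepting state of the ε-NFA.

#include <stdio.h>

#define N 10
typedef unsigned int stateset;
#define BIT(i) (1u << ((i) - 1))

static const stateset eps[N]  = {            /* δ(q_i, ε) */
    0, BIT(3), BIT(4) | BIT(10), BIT(5) | BIT(7), 0,
    BIT(9), 0, BIT(9), BIT(4) | BIT(10), 0 };
static const stateset on_l[N] = {            /* δ(q_i, l) */
    BIT(2), 0, 0, 0, BIT(6), 0, 0, 0, 0, 0 };
static const stateset on_d[N] = {            /* δ(q_i, d) */
    0, 0, 0, 0, 0, 0, BIT(8), 0, 0, 0 };

static stateset eclosure(stateset s) {
    stateset prev;
    do {
        prev = s;
        for (int i = 1; i <= N; i++)
            if (s & BIT(i)) s |= eps[i - 1];
    } while (s != prev);
    return s;
}

static stateset step(stateset s, const stateset delta[N]) {
    stateset t = 0;
    for (int i = 1; i <= N; i++)
        if (s & BIT(i)) t |= delta[i - 1];
    return eclosure(t);                      /* read a letter, then follow ε's */
}

static void print_set(stateset s) {
    printf("{");
    for (int i = 1; i <= N; i++)
        if (s & BIT(i)) printf(" q%d", i);
    printf(" }");
}

int main(void) {
    stateset seen[1 << N];
    int nb = 0;
    seen[nb++] = eclosure(BIT(1));           /* q0_D = εclosure(q1) = S1 */

    for (int k = 0; k < nb; k++) {           /* explore reachable DFA states */
        stateset succ[2] = { step(seen[k], on_l), step(seen[k], on_d) };
        const char letters[2] = { 'l', 'd' };
        for (int a = 0; a < 2; a++) {
            int known = 0;
            for (int j = 0; j < nb; j++) if (seen[j] == succ[a]) known = 1;
            if (!known) seen[nb++] = succ[a];
            print_set(seen[k]); printf(" --%c--> ", letters[a]);
            print_set(succ[a]);
            printf("%s\n", (succ[a] & BIT(10)) ? "   (accepting)" : "");
        }
    }
    return 0;
}

Running it reproduces the transitions computed in Example 2.15, including the unreachable-free 'trash' state ∅.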
Size of the determinised automaton Since the set of states of the DFA D obtained by the above construction is 2^(Q A) (where Q A is the set of states of the original ε-NFA A), D could be, in theory, exponentially larger than A. However, on the previous example, A l·(l+d)∗ has ten states, while its corresponding DFA (Figure 2.9) has 'only' four states reachable (instead of 1024). So, even if the DFA has many states, most of them are not reachable and their construction can thus easily be avoided.
Is it always going to be the case? The answer, unfortunately, is 'no': we will exhibit an infinite family of languages L n (for all n ≥ 1) s.t. (i) for all n ≥ 1, there is an ε-NFA A n that recognises L n , and the size of the A n 's grows linearly with n; and (ii) letting D n be any deterministic automaton recognising L n (for all n), the size of the D n 's grows exponentially with n. Observe that the above statement is rather strong: whatever the deterministic automaton D n we choose to recognise L n , this automaton is bound to have a number of states which is exponential in n. Thus, there is no hope to obtain a determinisation procedure that always produces a DFA that is polynomial in the size of the original ε-NFA (and this holds in particular for the determinisation procedure we have given above).
The languages L n are those of binary words that contain at least two 1's separated by n characters.
[Figure 2.10: The family of ε-NFAs A n (n ≥ 1) s.t. for all n ≥ 1: L(A n ) = L n . The automaton has an initial state q i looping on 0 and 1, a chain of states q 0 , q 1 , . . . , q n linked by n transitions labeled 0, 1, and an accepting state q a looping on 0 and 1; q i moves to q 0 on reading a 1, and q n moves to q a on reading a 1.]
Observe that the only non-deterministic choice of A n occurs in q i : when
reading a 1, the automaton can either stay in q i , or move to q 0 . If it decides
for the latter, it will accept only if this 1 is followed n +1 characters later by
another 1. In some sense, each time the automaton sees a 1 in state q i , it
must guess whether this 1 will be followed n+1 characters later by another
1, in which case it moves to q 0 . The purpose of the states q 0 , q 1 ,. . . , q n is
to check that this guess was correct.
Finally, it is easy to see that, for all n ≥ 1, A n has n + 3 states, so the size
of the A n automata grows indeed linearly wrt n.
Now, let us argue that the size of deterministic automata D n that accept L n grows exponentially wrt n. To support our discussion, we consider the automaton A 1 :
[Diagram: the automaton A 1 , with states q i , q 0 , q 1 and q a .]
• If the character is 0, then, the automaton must only update its memory,
by, again, copying the value of b 1 to b 0 , and letting b 1 = 0.
Thus, the automaton clearly needs those two bits b 0 and b 1 of memory. There are 2² = 4 possible memory values, which are encoded in the states of the DFA. Hence, D 1 must have at least 4 states. This reasoning generalises to any n, letting the number of memory bits increase with n: for all n ≥ 1, the automaton needs n + 1 bits of memory. So, any DFA D n recognising L n must have at least 2^(n+1) states.
As a matter of fact, the automaton D 1 obtained by determinising A 1 (using the procedure of Section 2.4.2) is displayed in Figure 2.12. The four states encoding the memory are the four non-accepting states. The gray labels show the values of the two memory bits associated to those states—of course, this intuition is valid only after D 1 has read at least 2 characters. Clearly, this automaton could be made simpler, but only by 'merging' the accepting states: it is not possible to reduce the number of non-accepting states without changing the language of the automaton.
[Figure 2.12: the DFA D 1 . Its non-accepting states are {q i }, {q i , q 0 }, {q i , q 1 } and {q i , q 0 , q 1 }, labeled with the corresponding values of b 0 and b 1 ; its accepting states are {q i , q 0 , q a }, {q i , q 1 , q a }, {q i , q 0 , q 1 , q a } and {q i , q a }.]
[Diagram: the automaton before and after removing state q 1 : q 0 reaches q 1 by reading a, q 1 reaches q 2 by reading c, and q 0 also reaches q 2 directly by reading b; after the removal of q 1 , the single transition from q 0 to q 2 is labeled b + (a · c).]
It is easy to check that the latter automaton (i.e., without state q 1 ) accepts the same language as the former.
Let us now generalise the idea sketched in this example. Assume we
want to remove some state q of an ε-NFA. Let p 1 , p 2 ,. . . , p n denote the
predecessors of q, i.e., all states p i s.t. q ∈ δ(p i , a), for some a ∈ Σ ∪ {ε}. Let
us further denote by s 1 , s 2 ,. . . , s ℓ the successors of q, i.e. all states s i s.t.
s i ∈ δ(q, a) for some a ∈ Σ ∪ {ε}. Obviously, the removal of q will affect all transitions from some p i to q, and all transitions from q to some s i . But it might also affect some transitions from some p i to some s j , as in the above example. In this case, q 0 is a predecessor of q = q 1 ; q 2 is a successor; and we 'report' the information from the two deleted transitions to the direct transition from q 0 to q 2 . So, in general, the states and transitions we need to consider when deleting state q are as depicted in Figure 2.13 (left). Observe that a state could be at the same time a successor and a predecessor of q, but this is not a problem for our technique.
Observe that we assumed two important things in this figure:
[Figure 2.13: On the left, the states and transitions involved when deleting state q: each predecessor p i reaches q by a transition labeled r i,q , q has a self-loop labeled r q,q , each successor s j is reached from q by a transition labeled r q,j , and there may already be a direct transition from p i to s j labeled r i,j . On the right, the automaton after deleting q: the transition from p i to s j is now labeled r i,j + r i,q · r q,q ∗ · r q,j .]
ular expression accepting the same language is as follows. For each ac- [. . . ]
(r 0,0 + r 0,f · r f,f ∗ · r f,0 )∗ · r 0,f · r f,f ∗
So, for each accepting state q f , we can now compute a regular expres-
sion r q f that accepts all the words A accepts by a run ending in q f . How-
ever, the language of A is exactly the set of all words that A accepts by a run
ending in either of the accepting states. Then, assuming that the set of ac-
cepting states of A is F = {q 1f , q 2f , . . . q nf }, we obtain the regular expression
corresponding to A as:
r q 1f + r q 2f + · · · + r q nf
[Diagrams: the automaton under consideration, with states q 0 , q 1 , q 2 and q 3 (transitions labeled 0, 1 and ε), and the automaton obtained after eliminating q 1 , in which q 0 reaches q 2 by a transition labeled 1 + 0 · 0.]
(1 · 1 + 0 · 0 · 1)∗ · (1 + 0 · 0)
[Diagrams: the intermediate automata obtained while eliminating states on the way to the accepting state q 3 .]
(∅ + (1 + 0 · 0) · ε)
and:
(∅ + (1 + 0 · 0) · ε) = ((1 + 0 · 0) · ε) = 1 + 0 · 0
Then, putting everything together (and taking into account that the duplicate q 0 is a single state), we obtain the automaton A q 3 :
[Diagram: the automaton A q 3 , in which q 0 has a self-loop labeled 1 · 1 + 0 · 0 · 1 and reaches q 3 by a transition labeled 1 + 0 · 0, and q 3 has a self-loop labeled 1.]
(1 · 1 + 0 · 0 · 1)∗ · (1 + 0 · 0) · 1∗
((1 · 1 + 0 · 0 · 1)∗ · (1 + 0 · 0)) + ((1 · 1 + 0 · 0 · 1)∗ · (1 + 0 · 0) · 1∗)
As we have seen in Section 2.4.2 (see Figure 2.9), there can be several DFAs
accepting the same language, and some of them might be larger than the
others. It is thus natural to look for a minimal DFA accepting a given regu-
lar language, and to wonder whether there can be several different minimal
DFAs accepting the same language.
Answers to those questions are provided by a central theorem of automata
theory, which was established in 1958 by Myhill and Nerode¹². To avoid
technicalities which are out of the scope of those notes, we will not state the
theorem, but rather one of its consequences:

¹² A. Nerode. Linear automaton transformations. Proceedings of the American Mathematical Society, 9(4):541–544, 1958. ISSN 00029939. URL http://www.jstor.org/stable/2033204.
Corollary 2.4 (Consequence of the Myhill-Nerode theorem). For all regu-
lar languages L, there is a unique minimal DFA accepting L. This DFA can
be computed from any DFA accepting L.
[Figure 2.15: the example DFA, with states q_0, ..., q_6 and transitions labelled a and b.]
In other words, L(A, q) is the language A would accept if its initial state
were q instead of q 0 . Then, we can characterise states that we will be able
to merge. Those states are said to be equivalent:
the minimal DFA accepting L(A) is B = ⟨Q_B, Σ, δ_B, q_0^B, F_B⟩, where:

1. Q_B = {[q] | q ∈ Q_A};

2. δ_B([q], a) = [δ_A(q, a)] for all [q] ∈ Q_B and a ∈ Σ;

3. q_0^B = [q_0^A];

4. F_B = {[q] | q ∈ F_A}.
Example 2.19. Let us consider the example in Figure 2.15. Here are the
languages accepted by the different states (denoted as regular expressions):
q0 a · b · a · a∗ + b · a · a · a∗
q1 b · a · a∗
q2 a · a · a∗
q3 a · a∗
q4 a · a∗
q5 a∗
q6 a∗
Hence, the equivalence classes (and also the states of the minimal DFA) are:
[q 0 ] = {q 0 }
[q 1 ] = {q 1 }
[q 2 ] = {q 2 }
[q 3 ] = [q 4 ] = {q 3 , q 4 }
[q 5 ] = [q 6 ] = {q 5 , q 6 }
[Figure 2.16: the minimal DFA, whose states are the equivalence classes [q_0], [q_1], [q_2], [q_3] and [q_5], with transitions labelled a and b.]
2. If two states q 1 and q 2 are equivalent, then it must be the case that, for
all letters a: δ(q 1 , a) ≡ δ(q 2 , a). That is, reading the same letter from two
equivalent states yields necessarily equivalent states.
This can be shown by contradiction. Assume q 1 ≡ q 2 but δ(q 1 , a) ̸≡
δ(q 2 , a) for some letter a. Since δ(q 1 , a) ̸≡ δ(q 2 , a), the language ac-
cepted from δ(q 1 , a) must be different from the language accepted from
δ(q 2 , a), by definition of the equivalence relation (Definition 2.17). Hence,
there is at least one word w that differentiates these two languages.
Without loss of generality, let us assume that w can be accepted from
δ(q 1 , a) but not from δ(q 2 , a). Since we consider DFAs, we conclude
that a · w ∈ L(A, q 1 ), but that a · w ̸∈ L(A, q 2 ). Hence, it is not possible
that q 1 ≡ q 2 .
equivalent (or, in other words, as long as they have not been proved to be
non-equivalent). Initially, all final states are in relation with each other,
and all non-final states are too. However, no final state is in relation with a
non-final one, since we know for sure that final and non-final states cannot
be equivalent. Observe that, once the algorithm has declared that q_i ̸∼ q_j,
we are sure that q_i ̸≡ q_j. However, q_i ∼ q_j does not imply that q_i ≡ q_j:
the fact that q_i ∼ q_j only represents the current belief of the algorithm, and
it could be revised later.

The current state of the relation is stored in a matrix P indexed by the
states (in both dimensions). We let P[q_i, q_j] = 1 iff q_i ∼ q_j. Since the
relation is symmetrical and reflexive, there are only 1's on the diagonal and
the matrix is symmetrical, so we keep only the (strictly) upper triangular
part of the matrix. For instance:
q2 q3
P: 0 1 q1
0 q2
Then, at each step, the algorithm looks for a pair of states (q_i, q_j) and a
letter a ∈ Σ such that:

• q_i ≠ q_j;
• P[q_i, q_j] = 1; and
• P[δ(q_i, a), δ(q_j, a)] = 0.

Because P[δ(q_i, a), δ(q_j, a)] = 0, we know for sure that δ(q_i, a) ̸≡ δ(q_j, a).
Hence, as discussed above, it is not possible that q_i ≡ q_j, and so we put a
0 in P[q_i, q_j]. We go on like that as long as we can update some cells of the
matrix. Algorithm 1 presents this procedure.
Obviously, the algorithm terminates after having updated all cells of
the matrix in the worst case. It is easy to check that it runs in polynomial
time¹³. One can prove¹⁴ that, upon termination, this algorithm computes
exactly the relation ≡ that we are looking for:

¹³ Actually in O(n^5), which is not very good. A more clever implementation allows one to achieve O(n log(n)).
¹⁴ John E. Hopcroft, Rajeev Motwani, and Jeffrey D. Ullman. Introduction to Automata Theory, Languages, and Computation (3rd Edition). Addison-Wesley Longman Publishing Co., Inc., Boston, MA, USA, 2006. ISBN 0321455363.

Proposition 2.5. The refinement algorithm always terminates. Upon termination, q_i ≡ q_j iff P[q_i, q_j] = 1, for all pairs of states (q_i, q_j).

Example 2.20. Let us apply Algorithm 1 to the example in Figure 2.15. Remember that, following our convention, we have not shown, in the figure, a sink state q_s to which the automaton goes every time a transition is not represented explicitly (for instance, δ(q_1, a) = q_s). In the algorithm, however, we must take this state explicitly into account. So, we start with:
q1 q2 q3 q4 q5 q6 qs
1 1 1 1 0 0 1 q0
1 1 1 0 0 1 q1
1 1 0 0 1 q2
1 0 0 1 q3
0 0 1 q4
1 0 q5
0 q6
Input: A DFA A = ⟨Q = {q_1, ..., q_n}, Σ, δ, q_0, F⟩.
foreach 1 ≤ i ≤ n do
    foreach i < j ≤ n do
        if q_i ∈ F ↔ q_j ∈ F then
            P[q_i, q_j] ← 1 ;
        else
            P[q_i, q_j] ← 0 ;
Boolean finished ← 0 ;
while ¬ finished do
    finished ← 1 ;
    foreach 1 ≤ i ≤ n do
        foreach i < j ≤ n do
            if P[q_i, q_j] = 1 then
                foreach a ∈ Σ do
                    if P[δ(q_i, a), δ(q_j, a)] = 0 then
                        P[q_i, q_j] ← 0 ;
                        finished ← 0 ;
return P ;
Algorithm 1: The algorithm to compute the matrix encoding the equiv-
alence classes of ≡.
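The following is a short Python transcription of Algorithm 1. It is only an illustrative sketch: the representation of the DFA by a list of states, a complete transition function delta given as a dictionary, and a set of accepting states is an assumption of ours.

def equivalence_table(states, alphabet, delta, accepting):
    # P[(qi, qj)] = 1 means 'qi and qj are still believed to be equivalent'.
    P = {}
    for i, qi in enumerate(states):
        for qj in states[i + 1:]:
            P[(qi, qj)] = 1 if (qi in accepting) == (qj in accepting) else 0

    def entry(p, q):                      # symmetric/reflexive access to P
        if p == q:
            return 1
        return P[(p, q)] if (p, q) in P else P[(q, p)]

    finished = False
    while not finished:
        finished = True
        for (qi, qj), value in P.items():
            if value == 1:
                for a in alphabet:
                    if entry(delta[(qi, a)], delta[(qj, a)]) == 0:
                        P[(qi, qj)] = 0
                        finished = False
                        break
    return P                              # q_i ≡ q_j iff P[(q_i, q_j)] = 1

Applied to the DFA of Figure 2.15 (completed with its sink state q_s), this computes the final matrix obtained by hand in Example 2.20.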
and updates the q_0 line accordingly. Then, the algorithm processes the q_1
line, discovers that q_1 can be equivalent neither to q_3 nor to q_4, and updates
the q_1 line too. Then, for similar reasons, it discovers that q_2 is equivalent
neither to q_3 nor to q_4. It also finds that neither q_3 nor q_4 can be
equivalent to q_s. At the end of the first iteration of the while loop,
the matrix P is thus as follows:
q1 q2 q3 q4 q5 q6 qs
1 1 0 0 0 0 1 q0
1 0 0 0 0 1 q1
0 0 0 0 1 q2
1 0 0 0 q3
0 0 0 q4
1 0 q5
0 q6
At the end of the second iteration of the while loop, the matrix P is thus as
follows:
q1 q2 q3 q4 q5 q6 qs
0 0 0 0 0 0 1 q0
0 0 0 0 0 0 q1
0 0 0 0 0 q2
1 0 0 0 q3
0 0 0 q4
1 0 q5
0 q6
Finally, during the third and last iteration of the while loop, the al-
gorithm discovers that q 0 is not equivalent to q s because δ(q 0 , a) = q 1 ,
δ(q s , a) = q s , but P [q 1 , q s ] = 0. Hence, the final matrix is:
q1 q2 q3 q4 q5 q6 qs
0 0 0 0 0 0 0 q0
0 0 0 0 0 0 q1
0 0 0 0 0 q2
1 0 0 0 q3
0 0 0 q4
1 0 q5
0 q6
which indeed corresponds to the equivalence classes used to build the au-
tomaton in Figure 2.16. M
However, the automaton obtained from the one in Figure 2.17 by having q_2
only as accepting state accepts a⁺ ≠ ∅.

[Figure 2.17: Swapping accepting and non-accepting states does not complement non-deterministic automata. The automaton has states q_0, q_1 and q_2, with a-labelled transitions out of q_0 to both q_1 and q_2.]

From this example, it is clear that the problem comes from the non-determinism.
A DFA, however, has exactly one execution on each word w, and that
execution ends in an accepting state iff w is accepted. So, swapping accepting
and non-accepting states of a DFA A, and keeping the rest of the automaton
identical, yields an automaton A̅ accepting the complement of A's language.
On each word w, the sequence of states traversed by A̅ will be the same as in
A; only the last state of the run will be accepting in A̅ iff it is rejecting in A.
To sum up, for a DFA A = ⟨Q, Σ, δ, q_0, F⟩, we let:

A̅ = ⟨Q, Σ, δ, q_0, Q \ F⟩
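In code, the construction is immediate. The sketch below (an assumed tuple representation, not taken from the notes) simply swaps accepting and non-accepting states of a complete DFA:

def complement(dfa):
    # dfa = (states, alphabet, delta, q0, accepting); states and accepting
    # are Python sets, and delta must be complete.
    states, alphabet, delta, q0, accepting = dfa
    return (states, alphabet, delta, q0, states - accepting)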
[Diagram: the two ε-NFAs A_1 (states q_1, q_2, q_3) and A_2 (states q_4, q_5, q_6, q_7) used in the example, with transitions labelled a, b and ε.]
Obviously, the initial state of A will be (q 1 , q 4 ), since they are the respective
initial states of A 1 and A 2 . From this state, we can consider several options,
for the transition function:
1. Either the input word begins with an a. A_1 can read this a, but A_2 cannot,
   since there is no a-labelled transition from q_4, A_2's current state.
   Thus, there is no a-labelled transition from (q_1, q_4) in A.
2. Or, the input word begins with a b. Both automata can read this letter:
A 1 will move either to q 2 or to q 3 , and A 2 will move to q 6 . Hence, in A,
(q 1 , q 4 ) has two b-labeled successors: (q 2 , q 6 ) and (q 3 , q 6 ). Only (q 2 , q 6 )
is accepting, since q 2 and q 6 are both accepting.
3. Or, one of the automata (in this case, A_2) makes a spontaneous move,
   while the other (A_1) is left unchanged. This is possible because there
   is an ε-labelled transition in A_2; in that case, A makes the corresponding
   move without reading any letter, and the A_1 component of the state is
   left unchanged.

[Diagram: part of the resulting automaton A, with initial state (q_1, q_4), b-labelled transitions to (q_2, q_6) and (q_3, q_6), and further states (q_1, q_5) and (q_1, q_7) connected by a- and ε-labelled transitions.]
3. q_0 = (q_0^1, q_0^2);

4. F = F_1 × F_2.
Finally, observe that the construction for intersection can easily be modified
to obtain an alternative algorithm to compute the union of two ε-NFAs,
simply by letting the set of final states be {(q_1, q_2) | q_1 ∈ F_1 or q_2 ∈ F_2}.
The advantage of this construction is that it produces a DFA when applied
to two DFAs A_1 and A_2, unlike the previous one, which needs ε-transitions.
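The product construction itself can be sketched as follows in Python (again an illustrative sketch under assumed representations: two complete DFAs given as tuples, as in the previous sketch); the mode parameter selects the set of final states for intersection or union:

def product(dfa1, dfa2, mode="intersection"):
    Q1, sigma, d1, q01, F1 = dfa1
    Q2, _, d2, q02, F2 = dfa2
    states = {(p, q) for p in Q1 for q in Q2}
    delta = {((p, q), a): (d1[(p, a)], d2[(q, a)])
             for (p, q) in states for a in sigma}
    if mode == "intersection":
        final = {(p, q) for (p, q) in states if p in F1 and q in F2}
    else:                                  # union
        final = {(p, q) for (p, q) in states if p in F1 or q in F2}
    return (states, sigma, delta, (q01, q02), final)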
Emptiness Clearly, an ε-NFA accepts a word iff there exists a path in the
automaton from the initial state to any accepting state, whatever the word
recognised along the path. Thus, testing for emptiness of ε-NFAs boils
down to a graph problem, which can be solved by classical algorithms
such as breadth- or depth-first search. Algorithm 2 shows a variation of
the breadth-first search to check for emptiness of ε-NFAs. At all times, it
maintains a set Passed of states that it has already visited and a set Frontier
containing all the states that have been visited for the first time at the pre-
vious iteration of the While loop. Each iteration of this loop consists in
computing all the successors of the states in Frontier, adding the newly
discovered states to Passed, and using them as the next Frontier.
Input: An ε-NFA A = ⟨Q, Σ, δ, q_0, F⟩
[. . . ]
    Frontier ← NewFrontier ;
return Passed ∩ F = ∅ ;
Algorithm 2: Checking for emptiness of ε-NFAs.
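A Python sketch of this breadth-first exploration is given below (not from the notes; the ε-NFA is assumed to be given by a transition function delta mapping a pair (state, letter), with the empty string standing for ε, to a set of successor states):

def is_empty(states, alphabet, delta, q0, accepting):
    passed = set()
    frontier = {q0}
    while frontier:
        new_frontier = set()
        for q in frontier:
            for a in list(alphabet) + [""]:        # "" stands for ε
                new_frontier |= delta.get((q, a), set())
        passed |= frontier
        frontier = new_frontier - passed
    return passed.isdisjoint(accepting)            # True iff L(A) = ∅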
L(A_1) ⊆ L(A_2)
iff
L(A_1) ∩ L(A̅_2) = ∅

Indeed, if L(A_1) ⊆ L(A_2), then no word w ∈ L(A_1) belongs to L(A̅_2)
(otherwise, it would be rejected by A_2). Hence, there is certainly no
intersection between L(A_1) and L(A̅_2). On the other hand, if L(A_1) ∩ L(A̅_2)
is empty, this means that there is no word w which is (i) accepted by A_1 and
(ii) rejected by A_2. Thus, all words that are accepted by A_1 are also
accepted by A_2, hence L(A_1) ⊆ L(A_2).

In practice, however, the operations (determinising and complementing A_2,
computing the intersection and checking whether it is empty) can be carried
out on-the-fly: this allows one to stop the algorithm (and potentially avoid a
costly determinisation) as soon as a word accepted by A_1 and rejected by A_2
is found. This on-the-fly algorithm will not be detailed here. It allows one to
prove that the language inclusion problem belongs to PSPACE.

Checking whether L(A_1) ∩ L(A̅_2) = ∅ can be done by using the techniques we
have described above: by first building the automaton A = A_1 ∩ A̅_2, then
checking whether L(A) = ∅ using Algorithm 2.
Equality testing   To test whether L(A_1) = L(A_2), for two ε-NFAs A_1 and A_2,
one can simply check whether L(A_1) ⊆ L(A_2) and L(A_2) ⊆ L(A_1). Again, an
efficient, on-the-fly version of this algorithm is preferable in practice.
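Combining the previous sketches (complement, product and is_empty are the hypothetical helper names introduced above, not functions from the notes), the inclusion and equality tests can be written as:

def included(dfa1, dfa2):
    # L(A1) ⊆ L(A2) iff L(A1) ∩ L(complement(A2)) = ∅
    Q, sigma, delta, q0, F = product(dfa1, complement(dfa2), mode="intersection")
    nfa_delta = {key: {delta[key]} for key in delta}   # view the DFA as an NFA
    return is_empty(Q, sigma, nfa_delta, q0, F)

def equal(dfa1, dfa2):
    return included(dfa1, dfa2) and included(dfa2, dfa1)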
2.7 Exercises
Exercise 2.2. Prove that any finite language is regular. Is the language L =
{0^n 1^n | n ∈ ℕ} regular? Give an intuition of why or why not.
2. the set of strings whose 3rd symbol, counted from the end of the string,
is a 1;
3. the set of strings where each pair of zeroes is directly followed by a pair
of ones;
[Diagram: a finite automaton over {0, 1}, with states p, q, r, s and t, given for this exercise.]
[Diagram: an ε-NFA with states p, q and r, and transitions labelled a, b, c and ε, given for this exercise.]
[Diagrams: two finite automata over {0, 1}, each with states q_1, q_2 and q_3, given for the following exercises.]
1. 01∗
2. (0 + 1)01
3. 00(0 + 1)∗
For the next exercises, you are asked to provide regular expressions using
the 'extended regular expression' format (see Section 2.2.1), that is used in
practice. You can test your answers using the regular expression library¹⁶
re in Python, with its re.search(pattern, string, flags=0) method.
The method receives an extended regular expression as the pattern, and
returns a Match object indicating the first substring of string (if any)
that matches the pattern. For example:

¹⁶ Python Software Foundation. re – Regular expression operations. https://docs.python.org/3/library/re.html. Online: accessed on April 12th, 2023.
1 >>> import re
2 >>> re.search("(a|b|c)+","abcbcab")
3 <re.Match object; span=(0, 7), match=’abcbcab’>
4 >>> re.search("(a|b|c)+","abcdef")
5 <re.Match object; span=(0, 3), match=’abc’>
6 >>> re.search("(a|b|c)+","decbaf")
7 <re.Match object; span=(2, 5), match=’cba’>
8 >>> re.search("(a|b|c)+","def")
Observe that the last call returns nothing because no match was possible.
Exercise 2.9. Give an extended regular expression (ERE) that accepts any
sequence of 5 characters, including the newline character \n.
Exercise 2.10. Give an ERE that accepts any string starting with an arbi-
trary number of \ followed by any number of *.
Exercise 2.11. UNIX-like shells (such as bash) allow the user to write batch
files in which comments can be added. A line is defined to be a comment
if it starts with a # sign. What ERE accepts such comments?
For example, the following strings are valid numbers in scientific notation:
42, 66.4E-5, 8E17
Exercise 2.13. Design an ERE that accepts ‘correct’ sentences that fulfill
the following criteria: (i) no prepending/appending spaces; (ii) the first
word must start with a capital letter; (iii) the sentence must end with a
dot .; (iv) the phrase must be made of one or more words (made of the
characters a...z and A...Z) separated by a single space; (v) there must
be one sentence per line; and (vi) punctuation signs other than the dot are
not allowed.
Exercise 2.14. Give an ERE that accepts old school DOS-style filenames
respecting the following criteria. First, each filename starts with 8 characters
(among a...z, A...Z and _), and the first five characters must be abcde.
Next, each filename has an extension which is .ext. Finally, the ERE must
accept the filename only (i.e., without the extension)!

For example, on abcdeLOL.ext, the ERE must accept abcdeLOL.
[Diagram: a finite automaton over {0, 1}, with states q_0, ..., q_5, given for this exercise.]
3. For each state q of the resulting DFA, give, as a regular expression, the
language L q that the automaton would accept if q were the initial state.
2. what is the relationship between L(A) and L(A′)? Do they have a non-empty
   intersection? Is it the case that L(A) is the complement of L(A′)?
   (The method to complement a finite automaton is found in Section 2.6.)

3. Apply the systematic method to compute the complement of a finite
   automaton, and check that the result indeed accepts the complement
   of L(A).
Exercise 2.17. Here are two finite automata A and B on the singleton al-
phabet {,}:
[Diagrams: automaton A, with states q_0, q_1 and q_2, and automaton B, with states q_0 and q_1; all transitions are labelled by the comma symbol.]
For this part of the exercises, we will rely on the scanner generator JFlex.
A scanner is a program that reads text on the standard input and prints it
on standard output after applying operations. For example, a filter that re-
places all a with b and that receives abracadabra on input would output
bbrbcbdbbrb. Then, JFlex is a tool that generates such a scanner based
on a set of regular expressions that specify which part of the input should
be matched and modified. To recognise these regular expressions, JFlex is
based on the theory of finite automata that we have studied. The generated
scanner is, in fact, a Java function.

JFlex can be downloaded from http://jflex.de and a user manual is
at http://jflex.de/manual.html.

We start with a very short explanation of the tool.
1. the first part is the user code. It can contain any Java code, that will be
added at the beginning of the generated scanner.
• %class Name to tell JFlex to produce a scanner inside the class called
  Name;
Then, some extra Java code included between %{ and %} can be gener-
ated. It will be copied verbatim inside the generated Java class (contrary
to the code of the first part which appears outside of the class).
Finally, some ERE can be defined. They will be used as macros in part 3
of the file to enhance readability. For example:
3. the third part contains the core of the scanner. It is a series of rules
that associate actions (in terms of Java code) to the regular expressions.
Each rule is of the form:
1 Regex {Action}
where:
• yytext() is the actual string that was matched by the regular expression.
Then, the rules can be associated to states, and are active only in these
states. A state can be ‘activated’ using the function yybegin(S) in the code
(where S is the name of the state to activate). Here is an example:
1 %%
2 %xstate YYINITIAL, PRINT;
3 %%
4 <YYINITIAL> {
5 "print" {yybegin(PRINT);}
6 }
7 <PRINT> {
8 ";" {yybegin(YYINITIAL);}
9 . {System.out.println(yytext());}
10 }
which creates the file Lexer.java containing the Lexer class (the %class
option can be used to change this);
2. compile the code into a class file: javac Lexer.java which creates
Lexer.class;
Here are now some exercises to get you familiar with JFlex:
Exercise 2.18. Write a scanner that outputs its input file with line numbers
in front of every line.
Exercise 2.20. Write a scanner that only shows the content of comments
in the input file. Such comments are enclosed within curly braces { }.
You can assume that the input file does not contain curly braces inside
comments.
Exercise 2.21. Write a scanner that transforms the input text as follows.
It should replace the word compiler by nope if the line starts with an a;
by ??? if it starts with a b and by !!! if it starts with a c.
Exercise 2.22. Write a lexical analysis function that recognises the follow-
ing tokens:
Each call to the function must seek the next token on the input. Every time
a token is found, your function must output a message of the form TOKEN
NAME: token (for example: C99VAR: myvariable) and return an object
Symbol containing the token type, its value and its position (line and col-
umn). Templates for the Symbol and LexicalUnit classes are provided on
the Université Virtuelle.
3 Grammars
G R A M M A R S A R E T H E T O O L W E W I L L U S E T O S P E C I F Y T H E F U L L S Y N TA X
O F P R O G R A M M I N G L A N G UA G E S . They are also the basic building block
of the systematic construction technique of parsers that we will discuss in
Chapter 5 and Chapter 6. Before giving the formal syntax and semantics
of grammars, we start with a discussion on the limits of regular languages,
to motivate the need for other, more expressive, formalisms.
cause the automaton always ends up in state q 1 after reading the first
letter, regardless of the letter. As a consequence the automaton accepts
{ab, cd, ad, bc}. Of course, the automaton in Figure 3.2 now accepts the
right language, because states q 1 and q 1′ act as a memory: when the au-
tomaton reaches q 1 , it has recorded that the first letter was an a; and when
Now that we have a good intuition of what ‘memory’ means for a finite
automaton, let us prove formally that L () is indeed not regular. Our proof
strategy will be by contradiction: we will assume that L () is regular, and
hence, that there exists an ε-NFA A () that recognises it (by Theorem 2.2).
Then, we will exploit the fact that this hypothetical A () has finitely many
states to derive a contradiction. Assuming A () has n states, we will select
a word from L () which is ‘very long’: this means that the word will be long
enough to guarantee that, when the automaton accepts it, an accepting
run necessarily visits twice the same state. Coming back to our intuition
that each state of a finite automaton represents a possible memory con-
tent, this means that, after reading two different prefixes w_1 and w_2 (that
contain a different number of pending open parentheses), the automaton has
memorised the exact same knowledge about those prefixes². However, since
w_1 and w_2 contain a different number of pending open parentheses, the
behaviour of the automaton should be different after reading these two
prefixes. Unfortunately, since the automaton is in the same state in both
cases, there will be at least one common execution after w_1 and w_2, i.e.,
by lack of memory, the automaton gets mixed up, and accepts words that
are not part of L_(). Let us now formalise this intuition.

² Just as in the example of Figure 3.1, where the automaton has the same knowledge when it reads either a or c as the first letter.
w = (^n )^n

i.e., n opening parentheses followed by n closing ones. Since A_() accepts w,
it admits an accepting run on w. While reading the n opening parentheses, this
run visits n + 1 states; since A_() has only n states, at least two of them
must be equal, say q_k = q_ℓ with k < ℓ:

q_0 –(→ · · · –(→ q_k –(→ · · · –(→ q_ℓ –(→ · · · –(→ q_n –)→ q_{n+1} –)→ · · · –)→ q_{2n}

where the portion of the run between q_k and q_ℓ, which reads only opening
parentheses, forms a loop (since q_k = q_ℓ). Hence, the run obtained by
repeating the loop twice is also an accepting run:

q_0 –(→ · · · –(→ q_k –(→ · · · –(→ q_k (= q_ℓ) –(→ · · · –(→ q_ℓ –(→ · · · –(→ q_n –)→ q_{n+1} –)→ · · · –)→ q_{2n}
w′ = (^m )^n

with m = n + (ℓ − k) > n opening parentheses and n closing ones. Since m ≠ n,
w′ ∉ L_(), which contradicts the assumption that A_() recognises L_().
This Theorem clearly shows that there are interesting and (arguably)
simple languages that are not regular (hence, they cannot be specified by
means of a regular expression, nor recognised by a finite automaton). This
motivates the introduction of grammars, which are a much more powerful
formalism for specifying languages (in the next chapter, we will study ex-
tensions of automata to handle more languages than finite automata can
do).
4. An identifier Id is an expression;
Exp =⇒(2) Exp ∗ Exp =⇒(4) Exp ∗ Id =⇒(3) (Exp) ∗ Id =⇒(1) (Exp + Exp) ∗ Id =⇒(4) (Exp + Id) ∗ Id =⇒(5) (Cst + Id) ∗ Id
Note that the initial sequence of symbols Exp contains only one vari-
able; that the last sequence (Cst + Id) ∗ Id is actually a word (it contains
only symbols from the language’s alphabet); and that all the intermedi-
ary sequences contain symbols from the alphabet, and variables (that are
eventually eliminated by rewriting). Let us now formalise these notions.
– α ∈ (V ∪ T )∗ V (V ∪ T )∗ and
– β ∈ (V ∪ T )∗
Example 3.2. Formally, the grammar that defines expressions is the tuple
G_Exp = ⟨{Exp}, {Id, Cst, +, ∗, (, )}, P, Exp⟩ with:

P = { Exp → Exp + Exp,
      Exp → Exp ∗ Exp,
      Exp → (Exp),
      Exp → Id,
      Exp → Cst }
Then, we use rules 6 and 8 to move the latter A two positions to the left:
ABC ABC =⇒(8) AB AC BC =⇒(6) A ABC BC
It is easy to generalise this example to any word on {a, b, c} with the same
number of a’s, b’s and c’s, and to check that only those words can be gen-
erated by the grammar. M
L(G) = {w ∈ T^∗ | S ⇒*_G w}
This definition calls for several comments. First, let us give some exam-
ples of grammars:
Example 3.8.
(1) S → Sa
(2) → ε
is left-regular and L(G a ∗ ) = {a}∗ . Observe that replacing the first rule by
S→aS yields a right-regular grammar that accepts the same language.
3. The grammar G Exp of Example 3.2 is context-free but not regular, be-
cause, for instance, of rule Exp→Exp + Exp .
Then, let us discuss the name ‘hierarchy’. This term seems to imply that
the classes of grammars are contained into each other, in other words that
each class 3 grammar is a class 2 grammar, each class 2 grammar is a class
1 grammar, and each class 1 grammar is a class 0 grammar. Unfortunately
this is not the case, as shown by the next example:
Example 3.9. Consider the grammar with the two following rules:
(1) S → S′
(2) S ′
→ ε
Obviously, this grammar is regular (class 3) because the first rule is of
the form S→wS ′ , with w = ε ∈ T ∗ ; and the latter is of the form S ′ →w
with w = ε ∈ T ∗ again. It is also context-free (class 2) because the left-
hand sides of all its rules are made up of only one variable. Of course, it
is also unrestricted (class 0). But it is clearly not context-sensitive (class 1)
because of the rule S ′ →ε where S ′ ̸= S.
However, observe that the class 1 grammar ⟨{S}, ∅, {S → ε}, S⟩ generates the
same language, namely {ε}.
So, while the Chomsky hierarchy does not form a hierarchy of gram-
mars, it defines a hierarchy of languages. To establish this, let us begin
with a few more definitions:
And, similarly:
Proof. (Sketch) We will not prove the theorem in great detail, but we will
highlight the main arguments behind each of those relations.

1. L_3 = Reg. We first show that each DFA A can be converted into a grammar
   G_A s.t. L(A) = L(G_A). The intuition of the construction is as follows.
   The set of variables of G_A is the set of states of A. We will build the set
   of rules in such a way that all sentential forms are of the form w q, where
   w ∈ Σ^∗ is the prefix read so far by the automaton, and q is the current
   state. Then, for each transition from some q_1 to some q_2 labelled by a,
   the grammar contains a rule q_1 → a q_2. Finally, when an accepting state
   is reached, we must be able to get rid of the variable from the sentential
   form, so we have rules of the form q → ε for all q ∈ F. Fig. 3.3 shows an
   example of this transformation.

[Figure 3.3: An example of a DFA (over {a, b}, with states q_0, q_1 and q_2) and its corresponding right-regular grammar:
(1) q_0 → a q_1    (2) q_0 → b q_2
(3) q_1 → b q_1    (4) q_1 → a q_2    (5) q_1 → ε
(6) q_2 → a q_2    (7) q_2 → b q_2 ]
Formally, G_A = ⟨Q, Σ, P, q_0⟩, where:

P = {q → a q′ | δ(q, a) = q′} ∪ {q → ε | q ∈ F}
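As an illustration, this construction can be written in a few lines of Python (an informal sketch with an assumed representation: states and letters are strings, delta is a dictionary, and a rule is a pair (left-hand side, right-hand side)):

def dfa_to_grammar(states, alphabet, delta, q0, accepting):
    rules = [(q, a + q2) for ((q, a), q2) in delta.items()]   # q -> a q'
    rules += [(q, "") for q in accepting]                     # q -> ε
    return {"variables": set(states), "terminals": set(alphabet),
            "rules": rules, "start": q0}

# On the DFA of Figure 3.3, this produces rules such as ("q0", "aq1"),
# ("q1", "bq1") and ("q1", ""), i.e. the grammar shown in the figure.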
(1) S → SS
(2) → (S)
(3) → ε
5. CFL ⊊ L 1 . To prove this inclusion, we must first show that all con-
text free languages (which can be defined by a context free grammar,
since L 2 = CFL) are also context-sensitive languages. Unfortunately,
as shown by Example 3.9 above, not all context-free grammars are context-
sensitive, so we cannot use a simple and direct syntactic argument as
we did when proving that all regular languages are also context-free.
Observe that, in a context-free grammar, a rule of the form α → β that
violates the property |α| ≤ |β| is necessarily a rule of the form A → ε.
Indeed, in a CFG we have |α| = 1, so |α| > |β| implies |β| = 0, hence β = ε.
P ′ = P \ {A → ε}
(b) Find in P′ all the rules of the form B → β where β contains A, and,
    for each of those rules, add to P′ all the rules B → β′, where β′ has
    been obtained by removing all the A symbols⁷ from β.

⁷ Or, in other words, replacing all A's by ε.
P′′′ = { A → β_1 S′ β_2 S′ · · · S′ β_n | A → β_1 S β_2 S · · · S β_n ∈ P′′ with β_i ∈ (T ∪ V \ {S})^∗ }
Figure 3.6 illustrates the construction. It is easy to check that the resulting
grammar has the same language as the original one, and that it now respects
the syntax of context-sensitive grammars.

This shows that CFL ⊆ L_1. Now, to prove that the inclusion is strict, we
need to exhibit a language which is context-sensitive but not context-free.
It is the case of the language L(G_ABC) generated by the grammar given in
Example 3.3. Clearly, this language is context-sensitive since it is generated
by a context-sensitive grammar. We will not prove here that this language
is not context-free. Suffice it to say that this can be proved by techniques
similar to those we have used to show that L_() is not regular (proof of
Theorem 3.1). The interested reader should refer to the so-called 'pumping
lemmata' for regular and context-free languages, which are general techniques
allowing one to prove that a language is not regular and not context-free
respectively⁸.

6. L_0 = RE. This equality holds by Definition 3.12.

7. CSL ⊊ L_0. The fact that CSL ⊆ L_0 holds by definition: all grammars belong
   to class 0, so all CSLs, which can be defined by a context-sensitive grammar
   (by definition), can be defined by a class 0 grammar. Showing that the
   inclusion is strict requires techniques that are beyond the scope of this
   course, so we will not prove it here.

[Figure 3.6: Removing all occurrences of the start symbol S from right-hand sides of rules, while preserving the language of the original grammar.
Original grammar: (1) S → ε; (2) S → aS; (3) S → A; (4) A → aS.
After step (b): (1) S → ε; (2) S → aS; (3) S → A; (4) S′ → aS; (5) S′ → A; (6) A → aS.
After step (c): (1) S → ε; (2) S → aS; (3) S → a; (4) S → A; (5) S′ → aS; (6) S′ → a; (7) S′ → A; (8) A → aS; (9) A → a.
Final grammar: (1) S → ε; (2) S → aS′; (3) S → a; (4) S → A; (5) S′ → aS′; (6) S′ → a; (7) S′ → A; (8) A → aS′; (9) A → a.]

⁸ John E. Hopcroft, Rajeev Motwani, and Jeffrey D. Ullman. Introduction to Automata Theory, Languages, and Computation (3rd Edition). Addison-Wesley Longman Publishing Co., Inc., Boston, MA, USA, 2006. ISBN 0321455363.
3.4 Exercises
3.4.2 Grammars
Then, give a derivation of the word 1110 according to the first grammar;
a derivation of the word ∗ + a + aa ∗ aa according to the second grammar
G 2 and a derivation of the word abc abc produced by grammar G 3 .
(a) baS b;
(b) bB AB b;
(c) baabaab.
{a^n b^m c^ℓ | n + m = ℓ}

{a^m b^n c^m d^n | m ≥ 1, n ≥ 1}.
Exercise 3.7. Give a context-free grammar that generates all the arithmetic
expressions on the alphabet {(, ), +, ., 0, 1} that evaluate to 2. Problem
taken from Niwińsky and Rytter⁹.

Hint: start by generating all expressions that evaluate to 0, then to 1, then
to 2.

⁹ Damian Niwińsky and Wojciech Rytter. 200 Problems in Formal Languages and Automata Theory. University of Warsaw, 2017.
4 All things context free. . .
C O N T E X T- F R E E L A N G UA G E S A R E T H E S E C O N D I M P O RTA N T C L A S S O F
L A N G UA G E S T H AT W E W I L L C O N S I D E R I N T H I S C O U R S E , after regular
languages. This chapter will be, in some sense, the ‘context-free’ analo-
gous to Chapter 2, where we introduced and studied regular languages.
Let us first summarise quickly what we have learned so far about CFLs
and their relationship to regular languages. First, recall, from the Chomsky
hierarchy (Definition 3.7) that regular languages are all CFLs and that the
containment is strict: for instance, the Dyck language¹ L_() is a CFL which is
not regular (see Theorem 3.1). Moreover, we already know several formal
tools to deal with regular languages and CFLs, as summarised in the table
below:

¹ Recall that this language contains all well-parenthesised words on {(, )}, i.e., a parenthesis is closed only if it has been opened before, and, at the end of the word, all opened parentheses are eventually closed.
                 Reg                                      CFL
Specification    Regular expressions, regular grammars    Context-free grammars
Automaton        DFA, NFA, ε-NFA                          ??
As can be seen, one cell is still empty in this table: which automaton
model allows us to characterise the class of CFLs, just as finite automata
correspond to regular languages? We will answer this question in Sec-
tion 4.2, but let us already try and build some intuitions.
As explained in Chapter 2, finite automata (whether they are determin-
istic or not) can be regarded as a model of programs that have access to a
finite amount of memory. This allows them to recognise simple languages
such as (01)∗ , for instance, because the only piece of information the pro-
gram must ‘remember’ (in this example) is the last bit that has been read,
0
in order to check that the next one is indeed different. So, one bit of mem-
1 0
ory is sufficient for (01)∗ , which explains why the automaton accepting it
1
has two states (see Figure 4.1). Figure 4.1: A finite automaton accepting
Now, let us consider a typical CFL, which is the language of all palin- (01)∗ . The labels of the nodes represent
the automaton’s memory: it remembers
dromes on {0, 1} (where the two parts of the palindromes are separated by
the last bit read if any.
#). Formally, we consider the language:
We will prove later that L pal# is indeed a CFL (and is not regular). Let us
admit this fact for now, and let us understand why this language cannot
be recognised by a finite automaton. Continuing the intuition we have
sketched above, a program recognising L pal# must, when reading a word
of the form w#w ′ :
It should be clear that, since the length of the prefix w is not bounded (in
the definition of L pal# ), such a program needs an unbounded amount of
memory, to store this prefix w. Since we are considering automata that
read the input word from the left to the right, in one pass, and cannot
modify the input, one can hardly imagine that a program (or an automa-
ton) could recognise L pal# using only a finite amount of memory.
So, to recognise CFLs, we need to extend finite automata with some
form of unbounded memory. In the case of L pal# , this memory can be
restricted to be a stack. Indeed, our program can be rewritten as follows (a
small Python sketch of this procedure is given after the list). Recall that a
stack is a data structure where elements are stored as a sequence, where only
the last inserted symbol can be accessed (it is called the top of the stack),
and that can be modified only by appending symbols to the end of the sequence
(a push to the stack), or by deleting the last symbol if it exists (a pop from
the stack). Therefore, a stack is often referred to as a LIFO (an acronym of
'Last In First Out'), because the first element that will be read from the
stack is the last one that has been written.

1. read the prefix w up to the occurrence of #, letter by letter, pushing
   each letter on the stack;

2. skip the symbol #;

3. read the suffix w′, letter by letter. Compare each letter from the input
   to the top of the stack. If they differ, or if the stack is empty, reject the
   word; otherwise pop the letter.

4. If the whole suffix has been read and the stack is empty, accept the
   word, otherwise reject.
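Here is a direct Python transcription of these four steps (an informal sketch, using a list as the stack; it is not part of the formal development):

def accepts_pal_sharp(word):
    stack = []
    i = 0
    while i < len(word) and word[i] != "#":   # step 1: push the prefix
        stack.append(word[i])
        i += 1
    if i == len(word):                        # no '#': reject
        return False
    i += 1                                    # step 2: skip '#'
    while i < len(word):                      # step 3: match the suffix
        if not stack or stack[-1] != word[i]:
            return False
        stack.pop()
        i += 1
    return not stack                          # step 4: accept iff stack empty

print(accepts_pal_sharp("010#010"))   # True
print(accepts_pal_sharp("01#00"))     # False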
[Figure 4.2: Recognising a palindrome using a stack: the successive stack contents while reading a word of L_pal#, growing while the prefix is pushed and shrinking while the suffix is matched against the top of the stack.]
So, in Section 4.2, we will extend finite automata by means of a stack,
allowing the automaton to perform one operation on the stack at each
transition (in addition to reading an input letter). We will formally study
this new model, called pushdown automata² (PDA for short), and show that
they recognise exactly the class of CFLs, just as finite automata recognise
regular languages.

² Pushdown is a synonym of stack.
In order to prove this last result, we will present a formal connection be-
tween PDAs and CFLs. Again, let us sketch some intuitions, by considering
the grammar:
(1) S → 0S 0
(2) → 1S 1
(3) → #
that generates exactly L pal# . We can check against Definition 3.7 that this
grammar is indeed context-free. It is easy to turn such a grammar into a
recursive program that recognises L_pal#, by regarding each variable in the
grammar as a recursive function:
def S():
    n = read_next_character()

    if n == '#':
        return True

    if n == '0':
        r = S()
        if not r: return False
        n = read_next_character()
        if n == '0': return True
        else: return False

    if n == '1':
        r = S()
        if not r: return False
        n = read_next_character()
        if n == '1': return True
        else: return False
S→0S 0
Read a 0; then check that it is followed by a palindrome, i.e. a word that can
be generated by S; and finally read a matching 0.
be captured by a PDA, as sketched in Figure 4.3. Each time the function
performs a recursive call, the PDA pushes information about its current state
(that encodes the value of the local variables) to the stack and moves to its
initial state. At each return, the PDA pops the top of the stack, to recover
the value of the local variables, and uses it to move to the state that models
the state of the function after the recursive call.

There are many different syntaxes for PDAs. The one we are using in this
example aims at illustrating easily the intuitions of this introduction. Note
that the formal syntax we will use in the rest of the chapter will differ
slightly.
[Figure 4.3: sketch of a PDA recognising L_pal#, with states q_0 and q_0′; the transitions use the stack operations Push, Pop and Top to simulate the recursive calls.]
Let us start by discussing some formal tools that are useful when dealing
with context-free grammars (hence, also regular grammars). Some of them
(such as the notion of derivation) have already been introduced in Chap-
ter 3, but will be specialised for CFGs.
[Figure 4.4: The grammar G_Exp to generate expressions:
(1) Exp → Exp + Exp   (2) Exp → Exp ∗ Exp   (3) Exp → (Exp)   (4) Exp → Id   (5) Exp → Cst ]

4.1.1 Derivation

We have already discussed the notion of derivation in the previous chapter,
see Definition 3.4 sqq. and the intuition given before. Let us consider
again the grammar G Exp , given in Figure 4.4, and let us consider the word
Id + Id ∗ Id which can be generated by this grammar. Indeed, the following
is a possible derivation of G Exp for this word (where we have underlined
the variable which is rewritten at each rule application):
Exp =⇒(2) Exp ∗ Exp =⇒(1) Exp + Exp ∗ Exp =⇒(4) Exp + Id ∗ Exp =⇒(4) Exp + Id ∗ Id =⇒(4) Id + Id ∗ Id    (4.1)
We can already observe that, since G Exp is context-free, all the deriva-
tions it generates have a particular shape: they consist, at each step, in
replacing one variable by a word over (Σ ∪V )∗ . This peculiarity of CFGs al-
lows us to define new notions: the leftmost and rightmost derivations, the
derivation trees and the notion of ambiguity.
Exp =⇒(2) Exp ∗ Exp =⇒(4) Exp ∗ Id =⇒(1) Exp + Exp ∗ Id =⇒(4) Exp + Id ∗ Id =⇒(4) Id + Id ∗ Id    (4.2)
Exp =⇒(2) Exp ∗ Exp =⇒(1) Exp + Exp ∗ Exp =⇒(4) Id + Exp ∗ Exp =⇒(4) Id + Id ∗ Exp =⇒(4) Id + Id ∗ Id    (4.3)
Derivation tree Another way to prove that a given word belongs to the
language of a grammar is to exhibit a derivation tree for this word. The
idea behind the derivation tree is similar to the intuition we have given in
the introduction that a rule like S → α1 B α2 can be interpreted as ‘match
α1 ; then a string that can be generated by B ; then α2 ’, which suggests a
recursive definition of the acceptance of a word. Such a recursive view can
easily be expressed by means of a tree:
S
α1 B α2
This tree can then be completed up to the point where the leaves con-
tain only terminals.
Example 4.2. Let us illustrate this idea by a more concrete example. Consider
again the grammar G_Exp in Figure 4.4, and the word Id + Id ∗ Id. Then, a
derivation tree of this word is given in Figure 4.5. M

[Figure 4.5: A derivation tree for the word Id + Id ∗ Id: the root Exp has children Exp, +, Exp; the first child derives Id, and the last child has children Exp, ∗, Exp, each deriving Id.]

From this example, it is easy to determine the characteristics of a derivation
tree for a word w. Its root must be labelled by the start symbol S of the
grammar; the children of each node must correspond to the right-hand side of a
rule whose left-hand side is the label of the node; and the sequence of leaves
from left to right must be the word w. Here is a more formal definition:
1 ≤ i ≤ k.
As we will see in the next chapter, derivation trees are a most important
tool in compiler construction. Very often, the output of the parser⁴ will
be a derivation tree (possibly with some extra information, as we will see
later). Indeed, the structure of the derivation tree provides us with more
information on the structure of the input word than a derivation does. For
example, the derivation tree in Figure 4.5 reveals a possible structure for
the expression Id + Id ∗ Id, and suggests that it should be understood as the
sum of Id and Id ∗ Id. In other words, the structure of the tree suggests that
the semantics of the expression corresponds to that of Id + (Id ∗ Id), which
indeed matches the priority of the arithmetic operators. Such information
will clearly be important for the synthesis phase of compiling, where the
executable code corresponding to the expression will be created.

⁴ See Section 1.3.1 for the different phases of the compiling process and their relative connections.
Ambiguities   We have seen before that a given word might be derived using
several different derivations (which has prompted us to introduce the notions
of leftmost and rightmost derivations). The question is now: 'based only on
the grammar G_Exp (Figure 4.4), which derivation tree should be generated for
Id + Id ∗ Id?' Unfortunately, the answer is 'we can't!', because the grammar
does not contain any information about the priority of the operators. In
other words, the grammar is intrinsically ambiguous:
Example 4.6. As witnessed by Figure 4.5 and Figure 4.6, grammar G Exp
(Figure 4.4) is ambiguous. M
For the reason explained above, ambiguous grammars will be a big is-
sue when generating parsers. We will see, in Section 4.4, techniques to
turn an ambiguous grammar into a non-ambiguous one, that accepts the
same language, by taking into consideration extra information such as the
priority and associativity of the operators.
where:
• B and C are any variable different from the start symbol: {B ,C } ⊆ V \{S};
and
Roughly speaking, all the rules in the grammar either replace one vari-
able A by a sequence of exactly two variables BC (that are different from
the start symbol); or replace a variable A by a single non-empty termi-
nal a. The only exception to this rule is that the start symbol S can gener-
ate ε: this is necessary to ensure that the empty word can be accepted by
a grammar in CNF (and this also explains why we do not allow S to occur
in any right-hand part).
As announced above, we can build, from all CFGs an equivalent one
which is in CNF:
Theorem 4.1. For all CFGs G, we can build, in polynomial time, a CFG G ′
in CNF s.t. L(G) = L(G ′ ).
S → V1 B        V1 → a

S → V1 V2       V2 → BB

and B → aBc becomes:

B → V1 V3       V3 → BV4        V4 → c.

The final grammar, which is now in CNF, is given in Figure 4.9. M

[Figure 4.9: A CFG in CNF that corresponds to the CFG in Figure 4.8:
(1) S → V1 B   (2) S → V1 V2   (3) V2 → BB   (4) B → V1 V3
(5) V3 → BV4   (6) V1 → a      (7) V4 → c    (8) B → d ]

Checking language membership of CFGs   Equipped with the notion of
CNF, we can now describe a polynomial-time algorithm to check whether
a given word w belongs to the language of a given CFG G = 〈V , T , P , S〉,
that we assume to be in CNF (recall that, if the given CFG is not in CNF, we
can obtain an equivalent CFG which is in CNF, in polynomial time—see
Theorem 4.1).
Let us first sketch some intuitions. We observe that, thanks to the spe-
cial syntax of production rules in CNF grammars, checking membership is
particularly easy for words of length 0 or 1. Indeed:
• The only word of length 0 is w = ε. Since S → ε is the only rule that can
appear with an ε on the right-hand side, we conclude that ε ∈ L(G) iff
S → ε appears in P .
The second item of the above reasoning can be generalised: we can check
whether any variable A ∈ V can generate a word w = a of length 1 simply
by checking whether the grammar has a rule A → a or not.
Now, let us turn our attention to the following more general problem:
checking whether some given variable A can generate a given word w =
w 1 w 2 · · · w n for n ≥ 2? Clearly, if we can answer that question, then we will
be able to answer the membership problem, simply by letting A = S. We
will base our reasoning on an intuition that we have already given: that
a rule A → α of a CFG can be regarded as a recursive function A which
performs a series of calls as given by α. In the setting of CNF grammars,
all ‘recursive’ rules are of the form A → BC , hence they correspond to two
successive ‘recursive’ calls. In other words, a variable A can generate w iff
we can find a rule A → BC and we can split w into two non-empty sub-
words u and v (i.e., w = uv) s.t.: (i) B generates u; and (ii) C generates v.
Observe that this recursive definition is sound since u and v are both non-
empty, and thus have necessarily a length which is strictly smaller than
n. So eventually, this definition will amount to check whether all charac-
ters of the word w can be generated by some variables, which can be done
easily as we have observed above.
Clearly, this discussion suggests a recursive procedure for checking mem-
bership of a word w to L(G). However, such a procedure could run in ex-
ponential time. In order to avoid this, we will rely on the very general idea
of dynamic programming. In our case, dynamic programming consists in Dynamic programming is a general
algorithmic technique, defined by
storing in a (quadratic) table the result of the procedure when called on all
Wikipedia as ‘a method for solving a
the possible sub-strings of w. By filling this table following a smart order, complex problem by breaking it down into
we will manage to keep the computing time polynomial. Basically, we will a collection of simpler subproblems, solv-
ing each of those subproblems just once,
first fill in the table for the shortest substrings of w (i.e., the individual let- and storing their solutions - ideally, us-
ters), then use this information to deduce whether we can accept longer ing a memory-based data structure’. It
has been introduced by Richard B E L L -
and longer substrings. . . up to the whole word w itself.
M A N in the forties. This technique occurs
More precisely, we will build a table Tab of dimension n × n s.t. each in many classical algorithms such as the
B E L L M A N - F O R D and F L OY D - W A R S H A L L
cell Tab(i , j ) will contain the list of all variables that can generate the sub- algorithms to compute shortest paths in a
word w i · · · w j . Formally, A ∈ Tab(i , j ) iff A ⇒∗ w i · · · w j . When this table graph.
if w = ε then
    if S → ε ∈ P then return True ;
    else return False ;
foreach 1 ≤ i ≤ n do
    Tab(i, i) ← {A | A → w_i ∈ P} ;
foreach 1 ≤ ℓ ≤ n do
    foreach 1 ≤ i ≤ n − ℓ + 1 do
        j ← i + ℓ − 1 ;
        foreach i ≤ k ≤ j − 1 do
            foreach rule A → BC ∈ P do
                if B ∈ Tab(i, k) and C ∈ Tab(k + 1, j) then
                    Add A to Tab(i, j) ;
After the cells for the sub-words of length 1 and 2 have been filled, the table is:

      1      2      3      4
1     A, X   X, Y
2            A, X   X, Y
3                   A, X   S
4                          B

and, once completely filled, it becomes:

      1      2      3      4
1     A, X   X, Y   X, X   S
2            A, X   X, Y   S
3                   A, X   S
4                          B
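The table-filling algorithm above is easy to implement. The following Python sketch (with an assumed representation of the CNF grammar: rules A → BC as triples, rules A → a as pairs, and a Boolean telling whether S → ε belongs to P) returns True iff the word belongs to the language:

def cyk(word, start, binary_rules, terminal_rules, has_start_to_epsilon=False):
    n = len(word)
    if n == 0:
        return has_start_to_epsilon
    # Tab[i][j] is the set of variables generating word[i..j] (0-based, inclusive)
    Tab = [[set() for _ in range(n)] for _ in range(n)]
    for i, a in enumerate(word):
        Tab[i][i] = {A for (A, b) in terminal_rules if b == a}
    for length in range(2, n + 1):
        for i in range(n - length + 1):
            j = i + length - 1
            for k in range(i, j):
                for (A, B, C) in binary_rules:
                    if B in Tab[i][k] and C in Tab[k + 1][j]:
                        Tab[i][j].add(A)
    return start in Tab[0][n - 1]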
Let us now define formally the notion of pushdown automaton (PDA for
short) that we have described informally in the introduction of this chap-
ter. Remember that, in essence, a PDA is a finite state automaton aug-
mented with a stack that serves as a memory: at each transition, the au-
tomaton can test the value on the top of the stack, and modify (push, pop)
this top of stack.
Syntax The formal definition of the syntax of PDA clearly shows that they
are an extension of finite automata:
it takes, as input, the current state and the next symbol on the input (as
in a finite automaton), but also a stack symbol, which is meant to be the
symbol on the top of the stack. It outputs a set of pairs of the form (q, w),
where q is a destination state (as in a non-deterministic finite automaton)
and w is a word from Γ∗ that will replace the symbol on the top of the stack
after the transition has been fired. So, intuitively:
δ(q, a, b) = {(q 1 , γ1 ), . . . , (q n , γn )}
means:
When in state q, reading a from the input, and having b on the top of the
stack, choose non-deterministically a pair (q_i, γ_i), move to q_i, and replace b
on the top of the stack by γ_i (where the leftmost letter of γ_i goes on the top).
Before making this definition of the transition function more formal, let
us give an example of PDA following this syntax. Such an example can be found
in Figure 4.11. This syntax might look hard to read, and is indeed way less
intuitive than the classical 'pop' and 'push'; it allows, however, a very
clean definition of the semantics of PDAs, as we will see later. In order to
depict the transition relation, we draw an arrow between q and q′, labelled by
a, b/w iff (q′, w) ∈ δ(q, a, b), i.e. we can go from q to q′ while reading an
a on the input, seeing a b on the top of the stack, and replacing this b by w.
In other words, in this example, we have:
1. Q = {q 0 , q 1 , q 2 };
2. Σ = {0, 1, #};
3. Γ = {0, 1, Z0 };
4. F = {q 2 };
δ(q_1, i, t):

    i \ t    0            1            Z_0
    0        {(q_1, ε)}   ∅            ∅
    1        ∅            {(q_1, ε)}   ∅
    #        ∅            ∅            ∅
    ε        ∅            ∅            {(q_2, Z_0)}

and δ(q_2, i, t) = ∅ for all i ∈ Σ, t ∈ Γ.
With the intuitive definition of the semantics we have sketched, one can
understand that the self-loop on state q 0 consists in pushing all 0 and 1
read from the input. Indeed, whenever a 0 is read, the automaton system-
atically tests for all possible characters x (that can be either 0 or 1 or Z0 )
on the top of the stack, and replaces it by 0x, which amounts to pushing
a 0 (and symmetrically when a 1 is read). The PDA moves from state q 0 to
q 1 only when a # is read on the input, and does not modify the stack in this
case. Then, the self-loop on q 1 consists, when reading a 0, in checking that
the top of the stack is indeed a 0 too, and replacing it by ε, i.e., popping the
⟨q, w, γ⟩ ∈ Q × Σ^∗ × Γ^∗. M

In particular, the initial configuration when reading w is ⟨q_0, w, Z_0⟩,
i.e., initially, the current state is q_0, w is on the input and Z_0 is on the
stack.
Configuration change Let us now formally define how a PDA can move
from one configuration to another. As for finite automata, we use the
c ⊢P c ′ notation to denote the fact that the PDA P can move from config-
uration c to configuration c ′ . Thanks to the syntax introduced in Defini-
tion 4.11, the definition of ⊢ is very simple:
⟨q, aw, Xβ⟩ ⊢_P ⟨q′, w, αβ⟩   whenever   (q′, α) ∈ δ(q, a, X)

where a is a single input letter, or ε; and X is a stack symbol. Let us assume
for the moment that a ≠ ε. Thus, in the original configuration ⟨q, aw, Xβ⟩,
the transition consists in: (i) changing the state from q to q′; (ii) reading
the a at the beginning of the input, hence only w remains; and (iii) popping
the X from the stack and pushing α instead. Thus, the resulting configuration
is indeed ⟨q′, w, αβ⟩. We can now check that the intuition still works when
a = ε. Indeed, in this case, aw = ε · w = w, and the input word is thus not
modified by the transition.

Note that, when the PDA is clear from the context, we might omit the
subscript on ⊢. We also denote by ⊢*_P (or, simply, ⊢*) the reflexive and
transitive closure of ⊢.
Example 4.14. Consider for instance the PDA in Figure 4.11, and the in-
put word 01#10 ∈ L pal# . The initial configuration on this input word is
(q 0 , 01#10, Z0 ). One can check against the definition of ⊢ that:
(q 0 , 01#10, Z0 ) ⊢(q 0 , 1#10, 0 Z0 ) ⊢(q 0 , #10, 10 Z0 ) ⊢(q 1 , 10, 10 Z0 ) ⊢(q 1 , 0, 0 Z0 ) ⊢(q 1 , ε, Z0 ) ⊢(q 2 , ε, Z0 ).
(q 0 , 01#10, Z0 ) ⊢* (q 2 , ε, Z0 )
Accepted language Equipped with this notion, we can now define which
words are accepted by a PDA. As said above, we will actually define two
notions of accepted languages. Indeed, one very natural notion of accep-
tance for PDAs is obtained by adapting the definition we have adopted for
finite automata: a word w is accepted iff there is at least one run reading
this word and reaching a final state. However, as shown by the example in
Figure 4.11, another natural notion of acceptance for PDAs is to accept a
word when the stack is empty. Intuitively, in many cases, the stack is used
ALL THINGS CONTEXT FREE. . . 103
as a memory to store some sort of input that still must be treated, so, it is
reasonable to accept a word as soon as this pending input is empty. These
two notions are captured by the following definition:
In other words:

1. A word w is in L(P) (i.e., it is accepted by final state) iff, from the
   initial configuration ⟨q_0, w, Z_0⟩ where w is on the input, one can find an
   execution reaching a configuration ⟨q, ε, γ⟩ where w has been read entirely
   and the current state q is accepting (q ∈ F). Observe that the stack does
   not need to be empty for w to be accepted (γ is any word in Γ^∗).

2. A word w is in N(P) (i.e., it is accepted by empty stack) iff, from the
   initial configuration ⟨q_0, w, Z_0⟩, one can find an execution reaching a
   configuration ⟨q, ε, ε⟩ where w has been read entirely and the stack is
   empty (observe that, in this case, the current state q does not need to be
   final: q ∈ Q).
(q 0 , 01#10, Z0 ) ⊢* (q 2 , ε, Z0 )
we deduce that 01#10 ∈ L(P ), where P is the PDA in Figure 4.11, because,
q 2 is an accepting state, and the input is empty in the last configuration.
Observe that the stack is not empty, but this is not necessary for a word to
be in L(P ).
Now consider a PDA P ′ obtained from P by deleting state q 2 , and adding,
on q 1 a self-loop transition labelled by ε, Z0 /ε, i.e. a transition that emp-
ties the stack once the Z0 symbol occurs on the top; and where there are no
more accepting states. This PDA is shown in Figure 4.12. Then, we have:
(q 0 , 01#10, Z0 ) ⊢*P’ (q 1 , ε, ε)
i.e., there is an execution of the PDA that reaches (q 1 , ε, ε) where the stack
is empty (but where q 1 is not accepting). This entails that: 01#10 ∈ N (P ′ ).
Observe, however, that 01#10 ∉ N(P), because P never empties its stack; and
that 01#10 ∉ L(P′), because P′ has no accepting state. That is, we can show
that L(P) = N(P′) = L_pal#, and that L(P′) = N(P) = ∅. M
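The two acceptance conditions are easy to experiment with. The sketch below (an assumed representation of ours: delta maps a triple (state, letter or '', stack symbol) to a set of pairs (new state, string pushed in place of the top, leftmost letter on top), and stack symbols are single characters) explores the reachable configurations breadth-first. This naive exploration is only meant for the small PDAs of this chapter: in general it may not terminate, since a PDA may have ε-transitions that keep pushing symbols.

from collections import deque

def accepts(delta, q0, Z0, accepting, word, by_empty_stack=False):
    # a configuration is (state, remaining input, stack as a string, top first)
    frontier = deque([(q0, word, Z0)])
    seen = set()
    while frontier:
        q, w, stack = frontier.popleft()
        if (q, w, stack) in seen:
            continue
        seen.add((q, w, stack))
        if w == "" and (stack == "" if by_empty_stack else q in accepting):
            return True                   # accepted by empty stack or by final state
        if stack == "":
            continue                      # no transition can fire on an empty stack
        X, rest = stack[0], stack[1:]
        for a in ([w[0]] if w else []) + [""]:     # read one letter, or ε
            for (q2, alpha) in delta.get((q, a, X), set()):
                frontier.append((q2, w[1:] if a else w, alpha + rest))
    return False

For instance, encoding the PDA of Figure 4.11 in this form (with, say, 'Z' playing the role of Z_0), one can check that 01#10 is accepted by final state.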
When considering finite state automata, we have shown that ε-NFAs and
DFAs are equivalent in the sense that any ε-NFA can always be turned into
a DFA that accepts the same language. Is it the case for PDAs? Unfor-
tunately, the answer is ‘no’: there are some non-deterministic PDAs that
cannot be turned into an equivalent deterministic one. Let us give an ex-
ample.
Consider the language L pal , which is obtained from L pal# by deleting the
# in all words. Formally:
Moreover, in the latter case, the only possible next configuration change
is:
i.e., state q 2 is reached without modification of the stack, and the input is
not empty. From this configuration (q 2 , 0110, Z0 ), no other configuration
is reachable, hence, this run of the automaton will not accept the word
0110. Nevertheless, 0110 is accepted by the PDA in Figure 4.13, with the
next sequence of transitions (that corresponds, as expected, to pushing
the prefix 01 and popping the suffix 10):
(q 0 , 0110, Z0 ) ⊢(q 0 , 110, 0 Z0 ) ⊢(q 0 , 10, 10 Z0 ) ⊢(q 1 , 10, 10 Z0 ) ⊢(q 1 , 0, 0 Z0 ) ⊢(q 1 , ε, Z0 ) ⊢(q 2 , ε, Z0 ).
Proof. (Sketch) A full proof is beyond the scope of these lecture notes, so
we only give the intuition. Consider two words w 1 and w 2 in L pal that
share a common prefix. For instance w 1 = 0110 and w 2 = 011110. If
there is a DPDA that accepts L pal , then it reaches the same configuration
after reading the common prefix 01, and performs the same configura-
tion change when reading the next 1. However, the behaviour of the PDA
should differ when reading w 1 and w 2 , since in the case of w 1 the next 1
belongs to the suffix of the word, and the PDA should check that this suffix
is the mirror image of the prefix; while in the case of w 2 the 1 still belongs
to the prefix.
As a consequence, the class of languages that can be accepted by a DPDA is
strictly included in the class of languages that can be accepted by a PDA.
Indeed, every DPDA is a PDA (just as every DFA is also an ε-NFA, even if the
names of those classes provide the opposite intuition. . .), which proves the
inclusion; and we have just identified a language that separates these two
classes. We believe it is important to stress this result, since it departs
from what we have observed with finite automata, where DFAs accept the same
languages as NFAs and ε-NFAs, i.e. the regular languages. We will come back
to this at the end of the section.
Now that we have characterised pushdown automata, let us study the class
of languages they define, exactly as we have done when we have proved
that finite automata accept the class of regular languages (Kleene’s theo-
rem). As sketched above, PDAs accept the class of context free languages
(CFL), that we have defined so far as the class of languages that are defined
by context-free grammars (CFGs):
Theorem 4.3. For all PDAs P , L(P ) and N (P ) are both context-free lan-
guages. Conversely, for all CFLs L, there are PDAs P and P ′ s.t. L(P ) =
N (P ′ ) = L.
1. First, we will show that all CFLs can be accepted by a PDA that accepts
on empty stack, by giving a translation from CFGs to PDAs. Hence, we
show that for all CFGs G, there is a PDA P s.t. N (P ) = L(G);
2. Second, we will show the reverse direction: for all PDAs P , we can build
a CFG G P s.t. L(G P ) = N (P ), i.e. N (P ) is a CFL. Together with point 1,
this shows that the class of languages accepted by empty stack by a PDA
is the class of CFLs. It remains to show that this holds also for PDAs that
accept with final states;
3. Third, we will show that we can convert any PDA accepting by empty
stack into a PDA accepting the same language by accepting state, i.e.
for all PDAs P , there is a PDA P ′ s.t. N (P ) = L(P ′ );
4. Finally, we will show that for all PDAs P , there is a PDA P ′ s.t. L(P ) =
N (P ′ ).
From CFGs to PDAs accepting by empty stack. For this first point, we will only show an example that should be sufficient to convince the reader of the validity of the following lemma:

Lemma 4.4. For all CFGs G, there is a PDA P s.t. N(P) = L(G).
Consider the grammar of Figure 4.4 and the following leftmost derivation of the word Id + Id ∗ Id (applying rules 2, 1, 4, 4 and 4, respectively):

Exp ⇒ Exp ∗ Exp ⇒ Exp + Exp ∗ Exp ⇒ Id + Exp ∗ Exp ⇒ Id + Id ∗ Exp ⇒ Id + Id ∗ Id.

The PDA simulates such a derivation on its stack: the first three stack contents (written with the top of the stack first) are Exp; then Exp ∗ Exp; and then Exp + Exp ∗ Exp.
So, now, the question is: how does the PDA update the stack? There are two possibilities to consider:

1. Either the symbol on the top of the stack is a variable V of the grammar, as in the three pictures above. In this case, the PDA must ‘simulate’ the derivation by finding a rule V → α in the grammar, popping this variable V and pushing, instead, the right-hand side α. This is what happened in our example above, where the two steps correspond to applying the rules Exp → Exp ∗ Exp and Exp → Exp + Exp, respectively. Observe that these actions can be implemented in a non-deterministic PDA with a single state q: for each rule V → α of the grammar, we have an element (q, α) in δ(q, ε, V), i.e. we add a transition that does not change the state, does not read from the input, but checks that V is on the top of the stack and replaces it by α.

Non-determinism is crucial here: if there are two rules of the form V → α1 and V → α2, the PDA must ‘guess’ which one to apply when seeing V on the top of the stack. The whole point of Chapter 5 and Chapter 6 will be to build a PDA that can make the ‘right choices’ deterministically, in order to obtain a program that can be implemented.
2. Or there is, on the top of the stack, a terminal. In this case, we cannot apply any grammar rule, and we cannot access the other variables that could be deeper in the stack. This is what occurs if we further apply the rule Exp → Id from the third stack above: we obtain the stack content Id + Exp ∗ Exp (top first), where the Id on the top of the stack ‘hides’ the Exp variables. However, in this case, we are sure that the word which will be generated by the derivation we are currently simulating will start by Id +, i.e. the two terminals which are on the top of the stack. So, we can check that these two terminals are indeed the first two letters of the input. If it is not the case, then, clearly, the derivation we are currently simulating will not allow us to recognise the input word. So, the PDA will not be able to execute any further step, and will not reach an accepting state. On the other hand, if Id and + are the two next characters on the input, then they can safely be read from the input and popped. This can be achieved by PDA transitions (still assuming a PDA with a single state q): for all a ∈ T, we have in δ(q, a, a) an element (q, ε), i.e. we add a transition that does not change the state, but reads a character a from the input, provided that a is also present on the top of the stack, and pops it. Eventually, if the input word is accepted, this will empty the stack, so we obtain a PDA that accepts the language of the grammar by empty stack.

To finish with our example, let us depict the PDA obtained from the grammar in Figure 4.4, using the technique described above. It is shown in Figure 4.14: a single state q with self-loops labelled ε, Exp/Exp + Exp; ε, Exp/Exp ∗ Exp; ε, Exp/(Exp); ε, Exp/Id; ε, Exp/Cst; Id, Id/ε; Cst, Cst/ε; +, +/ε; ∗, ∗/ε; (, (/ε; and ), )/ε.

Figure 4.14: A PDA accepting (by empty stack) arithmetic expressions with + and ∗ operators only.

Moreover, one possible execution of this PDA on the input string Id + Id ∗ Id is:
(q, Id + Id ∗ Id, Exp) ⊢(q, Id + Id ∗ Id, Exp ∗ Exp) ⊢(q, Id + Id ∗ Id, Exp + Exp ∗ Exp) ⊢
(q, Id + Id ∗ Id, Id + Exp ∗ Exp) ⊢(q, +Id ∗ Id, +Exp ∗ Exp) ⊢(q, Id ∗ Id, Exp ∗ Exp) ⊢
(q, Id ∗ Id, Id ∗ Exp) ⊢(q, ∗Id, ∗Exp) ⊢(q, Id, Exp) ⊢(q, Id, Id) ⊢(q, ε, ε)
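The construction just described is easy to implement. The sketch below (Python; the encoding of the grammar of Figure 4.4 and all function names are our own) builds the produce and match transitions of the single-state PDA and searches for a run that accepts by empty stack:

```python
GRAMMAR = {  # our encoding of the expression grammar of Figure 4.4
    'Exp': [['Exp', '+', 'Exp'], ['Exp', '*', 'Exp'], ['(', 'Exp', ')'],
            ['Id'], ['Cst']],
}
TERMINALS = {'+', '*', '(', ')', 'Id', 'Cst'}

def pda_from_cfg(grammar, terminals):
    """delta maps (read symbol or '', top of stack) to the list of
       right-hand sides pushed in place of the top symbol."""
    delta = {('', var): [list(rhs) for rhs in rhss]    # produce transitions
             for var, rhss in grammar.items()}
    for t in terminals:                                # match transitions
        delta[(t, t)] = [[]]
    return delta

def accepts_empty_stack(delta, tokens, start_symbol):
    """Search for a run that empties the stack and consumes the whole input.
       Pruning assumption: no epsilon rules, so every stack symbol still has
       to consume at least one input token."""
    def search(rest, stack):
        if len(stack) > len(rest):
            return False                               # prune
        if not stack:
            return not rest                            # empty stack: accept iff input consumed
        top, tail = stack[0], stack[1:]
        for push in delta.get(('', top), []):          # produce
            if search(rest, tuple(push) + tail):
                return True
        for push in (delta.get((rest[0], top), []) if rest else []):  # match
            if search(rest[1:], tuple(push) + tail):
                return True
        return False
    return search(tuple(tokens), (start_symbol,))

delta = pda_from_cfg(GRAMMAR, TERMINALS)
print(accepts_empty_stack(delta, ['Id', '+', 'Id', '*', 'Id'], 'Exp'))  # True
print(accepts_empty_stack(delta, ['Id', '+'], 'Exp'))                   # False
```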
From PDA with empty stack to CFG. In this case, again, we will restrict ourselves to presenting an example of translation from a PDA that accepts by empty stack to a CFG that generates the same language. The construction is quite technical but, fortunately, the intuitions behind it are quite simple. For our example, we will consider again the PDA in Figure 4.12 recognising L_pal# by empty stack. Recall that, in a PDA, the execution starts with the Z0 symbol on the top of the stack (so, initially, the stack
is not empty), and that the ultimate goal of the PDA is to empty its stack to
accept a word. This is why the variables in the CFG we will build are of the
form:
[pγq]
where: q and p are two (possibly equal) states of the PDA; γ ∈ Γ is a pos-
sible stack symbol; and the intuitive meaning of the variable is that the set
of words that can be generated from [pγq] is exactly the set of all words that
are accepted by the PDA when: (i) it starts its execution in state p with γ as
the content of the stack; and (ii) it ends its execution in state q. In other
words:
[pγq] ⇒∗ w   iff   (p, w, γ) ⊢∗ (q, ε, ε).
The start symbol of the grammar is a fresh variable S, for which we add the two rules:

S → [q0 Z0 q0]
S → [q0 Z0 q1].

Indeed, for a word to be accepted by the PDA in Figure 4.12, the PDA must start its execution in q0, with Z0 as the only symbol on the stack, and end its execution with an empty stack, either in q0 or in q1. There are no other possibilities since q0 and q1 are the only two states in the PDA.
Now, let us see how we can add to our grammar the rules that have vari-
ables of the form [pγq] as the left-hand side. Let us consider the variable
[q 0 0q 1 ]. So, we need to understand what are the words that the PDA could
accept by an execution that: (i) starts in q 0 with 0 as the only content of
the stack; and (ii) ends in q 1 with an empty stack. For that purpose, we
look at all the possible transitions that can occur from q 0 when 0 is on the
top of the stack. There are three possibilities:
• Either the PDA reads a 1 on the input. In that case, it will push this new
1 to the top of the stack, and stay in q0 . This means that an execution
that should empty the stack must, after this transition, pop two symbols
from the stack: first the 1, then the 0 that was already on the stack. In-
between these two pops, the PDA might either stay in q 0 or move to q 1 .
Thus, we add to our grammar two rules:
[q 0 0q 1 ] → 1[q 0 1q 0 ][q 0 0q 1 ]
[q 0 0q 1 ] → 1[q 0 1q 1 ][q 1 0q 1 ].
Intuitively, the first rule says: ‘we want the PDA to read a word from a
configuration where q 0 is the current state and 0 is on the top of the
stack, and we want that after this word is read, the PDA reaches q 1 and
the 0 on the top of the stack has been popped ([q 0 0q 1 ]). Then, one way
to do that is to read a 1 from the input (which will push the 1 on the top
of the stack), then continue the execution by reaching q 0 again after
having removed that 1 from the top of the stack ([q 0 1q 0 ]), then con-
tinue again the execution until reaching q 1 after which the 0 on the top
of the stack has been popped ([q 0 0q 1 ]). The second rule says essentially
the same, except that now the intermediary state is q 1 .
• Another possibility is that the PDA reads a 0 from the input. Symmetri-
cally to the previous case, we have the two following rules in the gram-
mar:
[q 0 0q 1 ] → 0[q 0 0q 0 ][q 0 0q 1 ]
[q 0 0q 1 ] → 0[q 0 0q 1 ][q 1 0q 1 ].
• Finally, one possibility is that the PDA reads a # from the input. In this
case, it will not modify the stack (so the 0 that we want eventually to
pop will remain), and it will move to q 1 . Thus, the only rule we have in
this case is:
[q 0 0q 1 ] → #[q 1 0q 1 ].
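The construction illustrated by these rules can be stated in full generality: for every transition that, from state p, reads a (or ε), pops X and pushes Y1 · · · Yk while moving to state r, and for every choice of intermediate states r1, . . . , rk, we add the rule [p X rk] → a [r Y1 r1][r1 Y2 r2] · · · [rk−1 Yk rk]. A sketch in Python (the encoding of the PDA of Figure 4.12 below is our own reading of it, and so are the names):

```python
from itertools import product

def pda_to_cfg(states, delta, q0, z0):
    """Triple construction: delta maps (state, read, top) -> set of
       (next state, tuple of pushed symbols).  Returns grammar rules as
       (lhs, rhs) pairs, where variables are triples (p, gamma, q)."""
    rules = []
    for q in states:                                   # start rules
        rules.append(('S', [(q0, z0, q)]))
    for (p, a, X), moves in delta.items():
        for (r, pushed) in moves:
            k = len(pushed)
            if k == 0:
                rules.append(((p, X, r), [a] if a else []))
                continue
            # choose the states reached after popping each pushed symbol
            for qs in product(states, repeat=k):
                rhs, prev = ([a] if a else []), r
                for i, Y in enumerate(pushed):
                    rhs.append((prev, Y, qs[i]))
                    prev = qs[i]
                rules.append(((p, X, qs[-1]), rhs))
    return rules

# A transition relation in the spirit of Figure 4.12 (L_pal# by empty stack):
STATES = ['q0', 'q1']
DELTA = {
    ('q0', '0', 'Z'): {('q0', ('0', 'Z'))}, ('q0', '1', 'Z'): {('q0', ('1', 'Z'))},
    ('q0', '0', '0'): {('q0', ('0', '0'))}, ('q0', '1', '0'): {('q0', ('1', '0'))},
    ('q0', '0', '1'): {('q0', ('0', '1'))}, ('q0', '1', '1'): {('q0', ('1', '1'))},
    ('q0', '#', 'Z'): {('q1', ('Z',))}, ('q0', '#', '0'): {('q1', ('0',))},
    ('q0', '#', '1'): {('q1', ('1',))},
    ('q1', '0', '0'): {('q1', ())}, ('q1', '1', '1'): {('q1', ())},
    ('q1', '', 'Z'): {('q1', ())},
}

for lhs, rhs in pda_to_cfg(STATES, DELTA, 'q0', 'Z'):
    if lhs == ('q0', '0', 'q1'):                       # the rules discussed above
        print(lhs, '->', rhs)
```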
From PDA with empty stack to PDA with accepting states. For the third step of our proof, we show how we can transform a PDA P that accepts some language N(P) by empty stack into a PDA P′ that accepts the same language by accepting state.

Lemma 4.5. For all PDAs P, there is a PDA P′ s.t. N(P) = L(P′).

Proof. We sketch the construction of P′ from P. It is illustrated in Figure 4.15: a fresh initial state q0′ moves to q0 with the transition ε, X0/Z0 X0, and every state has a transition ε, X0/ε to a fresh accepting state qf. Let us assume that P = ⟨Q, Σ, Γ, δ, q0, Z0, F⟩. Then, we build P′ = ⟨Q ∪ {q0′, qf}, Σ, Γ ∪ {X0}, δ′, q0′, X0, F′⟩, where X0 is a fresh stack symbol, q0′ and qf are fresh states, δ′(q0′, ε, X0) = {(q0, Z0 X0)}, and:

Figure 4.15: An illustration of the construction that turns a PDA accepting by empty stack into a PDA accepting the same language by final state.
• for all states q ∈ Q: δ′(q, ε, X0) = {(qf, ε)}, and δ′(q, a, X0) = ∅ for all a ∈ Σ. Otherwise, δ′ coincides with δ. That is, we add transitions to the accepting state only when X0 occurs on the top of the stack;

• F′ = {qf}.
One can then check that, for all states q ∈ Q and all words w and w′, we have, in P:

(q0, w, Z0) ⊢∗P (q, w′, ε)

iff we have, in P′:

(q0′, w, X0) ⊢P′ (q0, w, Z0 X0) ⊢∗P′ (q, w′, X0).

That is, P′ can ‘simulate’ the execution of P with the additional X0 at the bottom of the stack. Then, from (q, w′, X0), P′ can move to the accepting state:

(q, w′, X0) ⊢P′ (qf, w′, ε).

Hence, P accepts w (by empty stack) iff P′ does (by accepting state), i.e., N(P) = L(P′).
From PDA with accepting state to PDA with empty stack. We close the loop by showing how we can convert a PDA accepting some language L(P) by accepting state into one accepting the same language by empty stack.

Lemma 4.6. For all PDAs P, there is a PDA P′ s.t. L(P) = N(P′).

Proof. (Sketch) The construction is very similar to the one we have used in the previous proof; it is illustrated in Figure 4.16. Given a PDA P = ⟨Q, Σ, Γ, δ, q0, Z0, F⟩ that accepts some language L(P) by accepting state, we build a PDA P′ that ‘simulates’ the executions of P, again with a fresh symbol X0 kept at the bottom of the stack (a fresh initial state q0′ moves to q0 with the transition ε, X0/Z0 X0). In addition, from every accepting state of P, the PDA P′ can move, using ε-transitions, to a fresh state in which it pops the whole content of its stack (X0 included). Thus, any accepting execution

(q0, w, Z0) ⊢∗P (q, ε, x) with q ∈ F

of P can be simulated by P′ and extended into an execution that ends in a configuration where the stack is empty, i.e. where this last configuration is accepting for the ‘empty stack’ condition. Hence, w is in L(P) iff w is in N(P′).
That is, all the rules from both grammars are added to G; and the extra
rules allow the grammar G to ‘choose’ between L1 and L2. More precisely, if
a word is generated from S (i.e. S ⇒∗ w), then it is necessarily generated
either from S 1 (i.e., the derivation is actually S ⇒ S 1 ⇒∗ w) or from S 2
(i.e., the derivation is actually S ⇒ S 2 ⇒∗ w). Thus, w belongs either to
L 1 or to L 2 , i.e. w ∈ L 1 ∪ L 2 . We have just shown that L(G) ⊆ L 1 ∪ L 2 .
Symmetrically, if w ∈ L1 ∪ L2, then either w ∈ L1 or w ∈ L2. In the former case, w ∈ L1 implies S1 ⇒∗G1 w, hence S ⇒G S1 ⇒∗G w and thus w ∈ L(G). In the latter case, w ∈ L2 implies S2 ⇒∗G2 w, hence S ⇒G S2 ⇒∗G w and thus w ∈ L(G), again. This shows that L1 ∪ L2 ⊆ L(G). Together with L(G) ⊆ L1 ∪ L2, this implies that L(G) = L1 ∪ L2.
Similar constructions handle concatenation and Kleene star: for concatenation, we let

G′ = ⟨V1 ∪ V2 ∪ {S′}, T1 ∪ T2, P′, S′⟩, with P′ = P1 ∪ P2 ∪ {S′ → S1 S2};

and for the Kleene star of L1, we let

G″ = ⟨V1 ∪ {S″}, T1, P″, S″⟩, with P″ = P1 ∪ {S″ → S1 S″, S″ → ε}.
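These constructions are purely syntactic, so they are straightforward to implement. A minimal sketch in Python, with a grammar represented as a tuple (variables, terminals, rules, start); the representation and the names are ours:

```python
def union(g1, g2):
    """Assumes the variable sets of g1 and g2 are disjoint (rename if needed)."""
    (v1, t1, p1, s1), (v2, t2, p2, s2) = g1, g2
    s = 'S_union'
    return (v1 | v2 | {s}, t1 | t2,
            p1 + p2 + [(s, [s1]), (s, [s2])], s)

def concatenation(g1, g2):
    (v1, t1, p1, s1), (v2, t2, p2, s2) = g1, g2
    s = 'S_concat'
    return (v1 | v2 | {s}, t1 | t2, p1 + p2 + [(s, [s1, s2])], s)

def star(g1):
    (v1, t1, p1, s1) = g1
    s = 'S_star'
    return (v1 | {s}, t1, p1 + [(s, [s1, s]), (s, [])], s)

# Example: {a^n b^n | n >= 0} concatenated with c*
G_ANBN = ({'A'}, {'a', 'b'}, [('A', ['a', 'A', 'b']), ('A', [])], 'A')
G_CSTAR = ({'C'}, {'c'}, [('C', ['c', 'C']), ('C', [])], 'C')
G_L1 = concatenation(G_ANBN, G_CSTAR)   # generates the language L1 discussed below
print(G_L1)
```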
Now, using this result, let us prove that the intersection of two CFLs
might not be a CFL.
L 1 = {an bn ck | k, n ≥ 0}
L 2 = {ak bn cn | k, n ≥ 0}.
So, L 1 contains all words of the form a · · · ab · · · bc · · · c s.t. the number of a’s
is equal to the number of b’s, and L 2 all the words of the same form s.t. the
number of b’s is equal to the number of c’s.
It is easy to check that L1 and L2 are CFLs. A PDA accepting L1 would push on the stack all the a’s it reads, then check that the number of b’s is equal by popping from the stack, and finally read c’s without constraint. Symmetrically, a PDA accepting L2 would read a prefix of a’s, push when reading the b’s, then pop when reading the c’s.14

So, L1 and L2 are both CFLs, but clearly L1 ∩ L2 = L_abc, which is not a CFL by Theorem 4.9.

14. Another argument to show that L1 and L2 are CFLs is to observe that L1 is the concatenation of {aⁿbⁿ | n ≥ 0} with c∗, while L2 is the concatenation of a∗ with {bⁿcⁿ | n ≥ 0}. Clearly, all these languages are CFLs (in particular, a∗ and c∗ are regular), so their concatenation is also a CFL, by Theorem 4.8.
Clearly, since L1 and L2 are CFLs, this language is a CFL too: by hypothesis, the complements of L1 and of L2 are CFLs; so their union is a CFL too by Theorem 4.8; hence, its complement is a CFL. However, classical set theory tells us that:

$L_1 \cap L_2 = \overline{\overline{L_1} \cup \overline{L_2}}$.
Thus, we conclude that, if the complement of any CFL L were a CFL too,
then the intersection of any pair of CFLs would be a CFL, which we know
is not the case by Theorem 4.9. Contradiction.
4.4.1 Factoring

Whenever a grammar contains several rules with the same left-hand side whose right-hand sides share a common prefix α:

V → αβ1
V → αβ2
..
.
V → αβn,

we replace them by:

V → αV′
V′ → β1
V′ → β2
..
.
V′ → βn,

where V′ is a fresh variable. This process can be iterated until there are no more rules to factor.
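One step of this factoring transformation can be sketched as follows (Python; the rule representation, the fresh-name scheme and the example rules for the conditional statement are ours). The simplified version below factors on a common first symbol only; iterating it also handles longer common prefixes:

```python
from collections import defaultdict
from itertools import count

fresh = count()

def factor_once(rules):
    """One factoring step: for each variable, group the rules sharing a
       common first symbol and factor that symbol out.
       Assumes the rule list contains no duplicate rules."""
    by_lhs = defaultdict(list)
    for lhs, rhs in rules:
        by_lhs[lhs].append(rhs)
    new_rules = []
    for lhs, rhss in by_lhs.items():
        groups = defaultdict(list)
        for rhs in rhss:
            key = rhs[0] if rhs else None      # group by the first symbol
            groups[key].append(rhs)
        for key, group in groups.items():
            if key is None or len(group) == 1:
                new_rules += [(lhs, rhs) for rhs in group]
            else:
                v = f"{lhs}'{next(fresh)}"     # fresh variable
                new_rules.append((lhs, [key, v]))
                new_rules += [(v, rhs[1:]) for rhs in group]
    return new_rules

rules = [('If', ['if', 'Cond', 'then', 'Code', 'fi']),
         ('If', ['if', 'Cond', 'then', 'Code', 'else', 'Code', 'fi'])]
print(factor_once(rules))
```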
To remove indirect left-recursion, assume that the grammar contains a rule of the form:

A → Bα
where A and B are variables, and where all rules with B as left-hand side
are:
B → β1
B → β2
..
.
B → βn ,
we replace A → B α by:
A → β1 α
A → β2 α
..
.
A → βn α.
Clearly, this preserves the language of the grammar. We repeat this trans-
formation until there is no indirect left-recursion left in the grammar.
Let us now turn to direct left-recursion. Assume that:

V → V α1
V →V α2
..
.
V →V αn
is the set of all direct left-recursive rules with V as the left-hand side. Fur-
ther assume that:
V → β1
V → β2
..
.
V → βm
is the set of all other (non-left-recursive) rules that have V as the left-hand
side. Observe that a word which is generated from V is necessarily of the form:

w w′1 w′2 · · · w′k
where w is generated from one of the βi ’s, and each w i′ is generated from
one of the α j ’s. Following this intuition, we replace all those rules by:
V → β1 V ′
V → β2 V ′
..
.
V → βm V ′
V ′ → α1 V ′
V ′ → α2 V ′
..
.
V ′ → αn V ′
V ′ → ε.
As can be seen, V ′ is now the recursive variable, but we have used right-
recursion to generate the sequence of αi ’s.
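This transformation is also easy to implement for a given variable V. A sketch in Python (representation and names are ours); applied to the two left-recursive variables of the expression grammar, it produces exactly the right-recursive shape of rules shown above:

```python
def remove_direct_left_recursion(rules, v):
    """Replace the direct left-recursive rules of variable v (v -> v alpha_i)
       and the remaining rules (v -> beta_j) by the right-recursive rules
       described above, using a fresh variable v'."""
    alphas = [rhs[1:] for lhs, rhs in rules if lhs == v and rhs[:1] == [v]]
    betas = [rhs for lhs, rhs in rules if lhs == v and rhs[:1] != [v]]
    others = [(lhs, rhs) for lhs, rhs in rules if lhs != v]
    if not alphas:
        return rules
    v2 = v + "'"
    new = [(v, beta + [v2]) for beta in betas]
    new += [(v2, alpha + [v2]) for alpha in alphas]
    new += [(v2, [])]                      # v' -> epsilon
    return others + new

rules = [('Exp', ['Exp', '+', 'Prod']), ('Exp', ['Prod']),
         ('Prod', ['Prod', '*', 'Atom']), ('Prod', ['Atom'])]
rules = remove_direct_left_recursion(rules, 'Exp')
rules = remove_direct_left_recursion(rules, 'Prod')
print(rules)
```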
Clearly, any derivation that starts by S ⇒ A will not allow one to produce any word, because all sentential forms derived from one containing an A will also contain an A, that can never be eliminated. In other words, the variable A is recursive, but there is no way to stop the recursion. This means that A is useless in this grammar (it will never allow to produce any word). So, we can safely remove rule 3 from the grammar without modifying its language. But then, we can also remove rule 2, and the grammar becomes:

(1) S → a

Figure 4.17: A grammar with an unproductive variable (A).
Example 4.22. Our second example shows a case of a symbol that is productive but is nonetheless useless because no sentential form obtained from the start symbol will ever contain it. Consider the grammar in Figure 4.18:

(1) S → A
(2) A → a
(3) B → b

Figure 4.18: A grammar with an unreachable symbol (B).

In this case, variable B is productive because B ⇒∗ b, but it can never be ‘reached’ in any sentential form produced from S. Remark that it is also
the case with terminal b, which occurs only in rule 3 (whereas all terminals are necessarily productive).

Observe that symbols can become unreachable because some rules have been removed due to the removal of unproductive symbols.

Definition 4.23. Let G = 〈V, T, P, S〉 be a grammar. A symbol X ∈ V ∪ T is unreachable iff there is no sentential form of G that contains an X, i.e. there is no derivation of the form S ⇒∗G α1 X α2.
Input: A CFG G = 〈V, T, P, S〉
Output: The set Prod ⊆ V ∪ T of productive symbols
Prod ← T ;
Prec ← ∅ ;
while Prec ≠ Prod do
    Prec ← Prod ;
    foreach A → α ∈ P do
        if α ∈ Prod∗ then
            Prod ← Prod ∪ {A} ;
return Prod ;
Algorithm 4: The algorithm to compute productive symbols in a CFG.
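Algorithm 4 translates almost line for line into code. A sketch in Python (the grammar encoding is ours; the example is the four-rule grammar with an unproductive variable A discussed further below):

```python
def productive_symbols(variables, terminals, rules):
    """Fixed-point computation of the productive symbols (Algorithm 4).
       rules is a list of (lhs, rhs) pairs, rhs being a list of symbols."""
    prod = set(terminals)
    prev = None
    while prev != prod:
        prev = set(prod)
        for lhs, rhs in rules:
            if all(symbol in prod for symbol in rhs):   # rhs in Prod*
                prod.add(lhs)
    return prod

VARIABLES = {'S', 'A', 'B'}
TERMINALS = {'a', 'b'}
RULES = [('S', ['a']), ('S', ['A']), ('A', ['A', 'B']), ('B', ['b'])]
print(productive_symbols(VARIABLES, TERMINALS, RULES))   # {'a', 'b', 'S', 'B'}
```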
Once the set Prod of productive symbols has been computed, removing unproductive symbols from G = 〈V, T, P, S〉 yields G′ = 〈V′, T, P′, S〉, where V′ = (Prod ∩ V) ∪ {S}, and P′ contains all the rules of the form A → α ∈ P s.t. α ∈ Prod∗, i.e., α contains only productive symbols.

Observe that we keep S in V′ even when S is not productive, because the syntax of grammars requests that V always contains at least the start symbol. However, if S is unproductive, the set of rules P′ will not contain any rule of the form S → α.

Unreachable symbols. Next, let us devise an algorithm to compute reachable symbols. We follow the same kind of inductive reasoning as in the previous case; the resulting procedure is shown in Algorithm 5.
Reach ← {S} ;
Prec ← ∅ ;
while Prec ≠ Reach do
    Prec ← Reach ;
    foreach A → α ∈ P do
        if A ∈ Reach then
            Add to Reach all symbols occurring in α ;
return Reach ;
Algorithm 5: The algorithm to compute reachable symbols in a CFG.
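Algorithm 5 is just as direct. A Python sketch (same representation as in the previous sketch, repeated so that the snippet stands alone):

```python
def reachable_symbols(start, rules):
    """Fixed-point computation of the reachable symbols (Algorithm 5)."""
    reach = {start}
    prev = None
    while prev != reach:
        prev = set(reach)
        for lhs, rhs in rules:
            if lhs in reach:
                reach.update(rhs)
    return reach

RULES = [('S', ['a']), ('S', ['A']), ('A', ['A', 'B']), ('B', ['b'])]
print(reachable_symbols('S', RULES))   # {'S', 'a', 'A', 'B'}
```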
Removing all useless symbols Finally, let us show how we can combine
the two operations described above to obtain a grammar that contains no
useless symbols. Consider the following grammar:
(1) S → a
(2) → A
(3) A → AB
(4) B → b
where, A is unproductive; B is productive; and both A and B are reachable.
Removing A from the grammar as well as rules 2 and 3 yields the grammar:
(1) S → a
(2) B → b
that indeed contains only productive symbols, but where B is not reach-
able anymore (indeed, it was reachable ‘through’ A which has been re-
moved). We conclude that removing unproductive symbols can create un-
reachable symbols: after removing unproductive symbols, we will need to
run the algorithm to remove unreachable symbols.
Does the reverse hold? That is, is it possible that removing unreachable symbols makes some symbols unproductive? We will argue that this is not possible, by contradiction. Assume some variable A in a grammar G which is productive (i.e., A ⇒∗G w for some w ∈ T∗), and assume we remove the
unreachable symbols from the grammar, and that, after this removal, the resulting grammar G′ still contains A, which is now unproductive, i.e. there is no w s.t. A ⇒∗G′ w. Clearly, if A cannot produce any word in G′, while
After that, all variables are guaranteed to be productive, and all symbols to
be reachable.
(1) S → CC a
(2) B → b
(3) C → c
Now, we compute the reachable symbols in G′. We start with:

Reach = {S}.

Then, rule 1 allows us to add C and a to Reach. At that point, we reach a fixed point: there is no rule of the grammar that allows us to reach either B or b from S or C, so {S, C, a} is exactly the set of reachable symbols. The resulting grammar is G″ = 〈V″, T″, P″, S〉, where V″ = {S, C}, T″ = {a, c} and P″ contains rules 1 and 3 only.
[Two derivation trees for the word Id + Id ∗ Id.]
sum of the first identifier, on the one hand; and of the product of the sec-
ond and third identifiers, on the other hand. Symbolically, this expression
is thus equivalent to Id + (Id ∗ Id), which is indeed the right priority.
However, ambiguities occur even when operator priority doesn’t play
any role. Consider for instance the word Id + Id + Id. In this case, the two
following derivation trees are possible:
[The two possible derivation trees for the word Id + Id + Id.]
Now, the tree that we want to obtain is the one on the right, because it
corresponds to the expression (Id + Id) + Id, which is indeed the correct as-
sociativity for the + operator (the associativity is on the left).
Modifying the grammar Let us now modify the grammar to lift these am-
biguities and make sure that the only derivation trees that will be returned
by the parser are those that enforce the priority and associativity of the op-
erators. We discuss priority first. Intuitively, for the priority of the ∗ and
+ operators to be enforced, an expression must be a sum of products of
atoms, where an atom is a basic building block, i.e. either an Id or a Cst.
For instance, an expression like Id ∗ Id + Id ∗ Id must be regarded as the sum
of the two products Id ∗ Id, which means that we will compute the values
of those products first, then take their sum. This can be reflected in the
grammar, by introducing fresh variables corresponding to the concepts of
‘products’ and ‘atoms’, and by modifying the rules in order to enforce a hier-
archy between these concepts. In the case of the grammar of Figure 4.19,
we would first introduce rules:
Atom → Id
→ Cst
Using the same canvas, we define an Exp as a sum of products, and we obtain the grammar:

(1) Exp → Exp + Prod
(2) → Prod
(3) Prod → Prod ∗ Atom
(4) → Atom
(5) Atom → Cst
(6) → Id

Observe that the resulting grammar is left-recursive (and we will see hereinafter that this left-recursion is crucial). However, we can use the techniques of Section 4.4 to remove this left-recursion, if need be.
Let us check that this new grammar is not ambiguous, and that the derivation trees indeed respect the priority of operators. Consider again the word Id + Id ∗ Id. Its (unique) derivation tree is the one shown in Figure 4.20. Clearly, this tree respects the priority of the operators.

Now, let us consider the word Id + Id + Id. Its derivation tree is given in Figure 4.21. Since we have used left-recursion in the rules associated to Exp and Prod, the left associativity of the operator is naturally respected.

Figure 4.20: The derivation tree of Id∗Id+Id taking into account the priority of the operators.

Unary minus and parenthesis. Let us now consider a more complex, yet typical example, where we allow the use of parentheses and of the unary minus (in addition to the − and / operators that were missing in the previous grammar):
Atom → −Atom
→ Id
→ Cst.
We handle (Exp) similarly. Indeed, the parentheses mean that the expression must be considered as a basic building block, and the priority of the operators within the parentheses must not interfere with the operators outside of them. So, we obtain the grammar:

Exp → Exp + Prod
    → Exp − Prod
    → Prod
Prod → Prod ∗ Atom
    → Prod / Atom
    → Atom
Atom → −Atom
    → Cst
    → Id
    → (Exp)
4.5 Exercises
Exercise 4.1. Give a PDA that accepts the language containing all words
of the form w w R where w is any given word on the alphabet Σ = {a, b}
and w R is the mirror image of w. Test your automaton on the input word
abaaaaba, by giving an accepting run of your automaton on this word.
Does your automaton accept by empty stack or by accepting state?
(1) S → a
(2) → A
(3) A → AB
(4) B → b
(1) S → A
(2) → B
(3) A → aB
(4) → bS
(5) → b
(6) B → AB
(7) → Ba
(8) C → AS
(9) → b
P A R S I N G I S T H E S E C O N D S T E P O F T H E C O M P I L I N G P R O C E S S . During
this stage, the compiler analyses the syntax of the input program to check
its correctness. Just as we have formalised the scanning step using finite
automata, we will rely on pushdown automata to define rigorously what a
parser does.
More precisely, in this chapter, we will define a first major family of
parsers, namely the top-down parsers. In the next chapter, we will study
a different family of parsers, the bottom-up parsers. As their names indicate, these parsers work in two completely different, and actually opposite, ways: a top-down parser tries to build a derivation tree for the input string starting from the root, applying the grammar rules one after the other, until the sequence of tree leaves forms the desired string. On the other hand, a bottom-up parser builds a derivation tree starting from the leaves (i.e., the input string), following the derivation rules of the grammar backwards, until it manages to reach the root of the tree. We will see that these two
paradigms have their own merits. Top-down parsers are perhaps more in-
tuitive, but bottom-up parsers are more powerful. Historically, compilers
such as gcc were written by hacking the code of an automatically generated bottom-up parser. Nowadays, recent versions of gcc or clang use a hand-written top-down parser.¹

¹ GCC wiki: new C parser. https://gcc.gnu.org/wiki/New_C_Parser, 2008. Online: accessed on December, 29th, 2015; and CLang: features and goals. http://clang.llvm.org/features.html. Online: accessed on December, 29th, 2015.

For these two main families of parsers, we will present techniques that allow one (when possible) to build parsers automatically from grammars, which is exactly what we need in the framework of compiler design.
We have already explained the main ideas behind top-down parsing when
showing how we can turn any CFG into a PDA that accepts the same lan-
guage by empty stack: see Section 4.2.3, and, in particular, the discussion
of Lemma 4.4, that we recall now. Consider the grammar for arithmetic
expressions in Figure 4.4. Then, a PDA that accepts (by empty stack) the
language of the grammar in Figure 4.4 is the following (where the initial
symbol on the stack is the start symbol of the grammar, namely Exp):
[A single state q with self-loops labelled: ε, Exp/Exp + Exp; ε, Exp/Exp ∗ Exp; ε, Exp/(Exp); ε, Exp/Id; ε, Exp/Cst; Id, Id/ε; Cst, Cst/ε; +, +/ε; ∗, ∗/ε; (, (/ε; and ), )/ε.]
Assume that we want to recognise the word Id + Id ∗ Id, which admits the following leftmost derivation (applying rules 2, 1, 4, 4 and 4, respectively):

Exp ⇒ Exp ∗ Exp ⇒ Exp + Exp ∗ Exp ⇒ Id + Exp ∗ Exp ⇒ Id + Id ∗ Exp ⇒ Id + Id ∗ Id
Then, the PDA given above will start its execution with the start symbol
Exp of the grammar on the top of its stack, i.e., the execution will start in
configuration:
(q, Id + Id ∗ Id, Exp).
The PDA simulates the first step of the derivation by applying the rule Exp → Exp ∗ Exp, which consists in popping the left-hand side of the rule and pushing the right-hand side, yielding the new configuration:

(q, Id + Id ∗ Id, Exp ∗ Exp).

Performing twice the same kind of operation with the rules Exp → Exp + Exp and Exp → Id, the PDA reaches the configuration:

(q, Id + Id ∗ Id, Id + Exp ∗ Exp),

where a terminal (Id) is now on the top of the stack. At this point, the PDA can check that the same terminal is on the input, consume it, and pop the terminal. This can be performed twice, and we obtain the new configuration:
(q, Id ∗ Id, Exp ∗ Exp).
The simulation of the derivation by the PDA goes on like that up to the
point where the stack is empty and the whole input has been read.
Let us now formalise these ideas, and show how we can build a PDA ac-
cepting, by empty stack, the language of a given CFG. Let G = 〈V , T , P , S〉
be a CFG. We build a PDA PG with a single state:

PG = 〈{q}, T, V ∪ T, δ, q, S, ∅〉,

where the transition function δ is defined as follows:

1. for all rules A → α ∈ P: (q, α) ∈ δ(q, ε, A). That is, for each rule of the grammar, there is a transition that pops the left-hand side A from the top of the stack and pushes the right-hand side α instead, without reading the input. This operation is called a produce (of rule A → α);
2. for all a ∈ T : δ(q, a, a) = {(q, ε)}. That is, for all terminals a of the gram-
mar, there is a transition that reads a from the input and pops a from
the stack. This operation is called a match (of terminal a).
3. in all other cases that have not been covered above, the transition function returns ∅.
Lemma 5.1. For all CFGs G, the PDA PG is s.t. L(G) = N (PG ).
Proof. (Sketch) The proof can be done by showing that: (i) for all words
w ∈ L(G), the leftmost derivation producing w can be simulated by an
accepting run of PG ; and that (ii) all accepting runs of PG (accepting a
word w) can be mapped to a leftmost derivation of G that produces w.
These two points are easily established by induction (on the lengths of the
derivation and of the run, respectively).
Figure 5.1: A trivial grammar (with the two rules S → a and S → b) and its corresponding (non-deterministic) parser (where the initial stack symbol is S): a single state q with self-loops labelled ε, S/a; ε, S/b; a, a/ε; and b, b/ε.

5.2 Predictive parsers

The rest of this chapter is devoted to identifying classes of grammars for which a deterministic parser can be achieved, if we allow this parser to make use of some extra information, that we call a look-ahead. This look-ahead is parametrised by a natural number k and consists of the k next characters on the input, that the parser can now take into account, without actually reading them, to decide which transition to take. Parsers that make use of a look-ahead are called predictive parsers.

It is important not to confuse a look-ahead and a read on the input. The look-ahead allows the parser to have a view on the future of the input, without modifying it, while a read modifies the input—the characters read on the input cannot be recovered.
In order to formalise these ideas, let us now extend the definition of PDAs by modifying their transition relation so that it can take into account the k next characters on the input (for a look-ahead of k characters). This means that the successors of a configuration will now be computed on the basis of the current state, the current stack content (as in the case of ‘regular’ PDAs), but also the k first characters on the input.

Definition 5.1 (k-look-ahead PDA). A pushdown automaton with k characters of look-ahead (k-LPDA for short) is a tuple 〈Q, Σ, Γ, δ, q0, Z0, F〉, where all the components are defined as for PDAs (see Definition 4.11), except for the transition function δ, which maps Q × (Σ ∪ {ε}) × Γ × Σ≤k to 2^(Q×Γ∗), where:

Σ≤k = Σ0 ∪ Σ1 ∪ · · · ∪ Σk.

Observe that the only difference between this definition and that of PDAs is that the transition function has a fourth parameter, which constitutes the look-ahead. This look-ahead is a word of at most k characters, since we have no guarantee that there are always k characters (or more) remaining on the input.

Let us now define formally the new semantics that takes into account the look-ahead. We lift the notion of configuration from PDAs to k-LPDAs: Definition 4.12 carries on to k-LPDAs. The notion of configuration change, however, must be adapted: for all

• X ∈ Γ;
• a ∈ Σ ∪ {ε};
• u ∈ Σ≤k−1;
• v ∈ Σ∗; and
• (q′, α) ∈ δ(q, a, X, au) (where either |au| = k or v = ε, i.e. au is the current look-ahead),

we have the configuration change:

〈q, auv, Xβ〉 ⊢P 〈q′, uv, αβ〉.
Example 5.3. Let us consider again the trivial grammar in Figure 5.1. We
can now extend its corresponding PDA with a look-ahead of one charac-
ter, to obtain a deterministic LPDA. To this end, we consider the following
transition function:
• δ(q, ε, S, a) = {(q, a)} for the produce of S → a. This produce is performed only when a is the next character on the input;

• δ(q, ε, S, b) = {(q, b)} for the produce of S → b. Again, the produce is performed only when b is the next character on the input;

• δ(q, a, a, a) = {(q, ε)} and δ(q, b, b, b) = {(q, ε)} for the matches of a and b, respectively.

The resulting LPDA has a single state q with self-loops labelled a : ε, S/a; b : ε, S/b; a : a, a/ε; and b : b, b/ε.
Observe that this LPDA is now deterministic (and can thus be implemented
as a computer program in a straightforward way). Observe further that the
look-ahead allows the LPDA to query the next character on the input with-
out reading it: once again, a transition labeled by ‘a : ε, S/a’ means that
the automaton checks that the next character on the input is a, but does
not consume it (as indicated by the ε), hence does not modify the input in
the next configuration. On the other hand, a transition labeled by a : a, a/ε
not only checks that there is an a on the input but also reads it. M
end of the day, k-LPDAs are nothing more than a more convenient syntax
for PDAs (but a very convenient one, as we will see!).
Proposition 5.2. For all k-LPDAs P , we can build a PDA P ′ s.t. L(P ) = L(P ′ ).
Proof. We will present the proof for the case where k = 1, the ideas are
easy to generalise to k > 1. Let us first sketch the intuition of the proof, by
explaining how executions of P ′ correspond to executions of P , and vice
versa. First, we let the set of states of P′ be pairs of the form (q, a), where q is a state of P and a ∈ Σ is a single letter that represents the current look-ahead. Note that this look-ahead will not be computed by P′, but rather guessed using non-determinism, and then checked afterwards. Thus, intuitively, when P′ is in (q, a), this corresponds to P being in state q and having guessed that a is the first character on the input. Following this idea, we have to define the transition function of P′ so that it properly updates the look-ahead contained in its states, and checks the validity of the non-deterministic guesses, in order to keep P′ synchronised with P.

To complement this intuition, recall that a finite automaton can be regarded as a program that uses a finite amount of memory only. This memory is encoded in the states of the automaton. This is the same idea that we use here. Since the look-ahead is bounded by k, and since the alphabet is finite, there are only finitely many possible values for the look-ahead, which can thus be stored in the states, and then queried and updated when need be, using non-determinism.
Initially, P′ simply jumps to a state of the form (q0, x) for some x ∈ Σ ∪ {ε} (thus, the x is the first guess performed by P′). Then, Figure 5.2 shows the rest of the construction. If, from state q1, P can read some character x ∈ Σ (hence, the look-ahead is necessarily equal to x, otherwise the transition cannot be taken), then, in P′, we can ‘simulate’ this transition from (q1, x) only (because the look-ahead must be x). Since the corresponding transition of P′ does read an x, we are certain that the guess was correct. The different possible successors correspond to the different possible guesses for the next look-ahead. A special case is displayed at the bottom of the figure and occurs when P reads ε from the input. In this case, the look-ahead could be non-empty (otherwise, the look-ahead would have no interest!), and must not be updated in the state of P′.

Figure 5.2 (sketch): a transition q1 −[x : x, X/γ]→ q2 of P becomes, in P′, the transitions (q1, x) −[x, X/γ]→ (q2, a), (q2, b), (q2, ε) (one per possible guess of the next look-ahead); a transition q1 −[y : ε, X/γ]→ q2 of P becomes (q1, y) −[ε, X/γ]→ (q2, y).
q1 q2
Let us formalise this. Given an LPDA P = Q, Σ, Γ, δ, q 0 , Z0 , F , we build
®
a PDA P ′ = Q ′ , Σ, Γ, δ′ , q 0′ , Z0 , F ′ where:
®
becomes:
ε, X /γ
• Q ′ = Q × Σ ∪ {ε} ⊎ {q 0′ } (thus, q 0′ is a fresh initial state);
¡ ¢
q1 , y q2 , y
4. δ′ (q, x), y, X = ; in all the cases that have not been specified above.
¡ ¢
〈q0, w0, Z0〉 ⊢P 〈q1, w1, γ1〉 ⊢P · · · ⊢P 〈qk, wk, γk〉,
That is, it is the execution obtained when P ′ always ‘guesses’ correctly the
next character on the input. It is easy to check that this is indeed an ex-
ecution of P ′ (see the definition of δ′ above), which is accepting because
(q k , a k ) is a final state when q k is.
On the other hand, if
〈q0, w0, Z0〉 ⊢P 〈q1, w1, γ1〉 ⊢P · · · ⊢P 〈qℓ, wℓ, γℓ〉
then:
Theorem 5.3. For all k, the class of languages accepted by k-LPDAs is the
class of CFLs.
Proof. Since, for all k, we can translate any k-LPDA into an equivalent PDA
(see Proposition 5.2), the class of k-LPDAs accepts no more than the CFLs.
On the other hand, all PDAs P can trivially be translated into an equivalent
k-LPDA P ′ (for all k): it suffices to define the transition relation of P ′ in
such a way that it ignores the look-ahead (or, in other words, such that it
performs the same actions on the input and on the stack for all possible
values of the look-ahead).
Now that we have k-LPDAs at our disposal, let us show how to trans-
form, when possible, and in a systematic way, CFGs into deterministic
k-LPDAs that we will be able to translate easily into programs. To this end,
we need to introduce some extra definitions.
(a) Similarly to the case of rule 1, it is easy to see that all words pro-
duced from B will start by a b, by rule 4. So, rule 2 should be applied
only when a b is on the input.
(b) The case of C is more complicated, as C can produce either c or ε.
In the former case, we expect c to be the first next character on the
input to apply rule 3. In the latter case, the derivation is A ⇒ C dd ⇒
dd, so we expect a d as the next character on the input. We conclude
that all derivations starting by rule 3 will produce words that start by
c or d only.
The case of variable A is thus summarised in Table 5.1, which gives, for each look-ahead, the rule to apply when A is on the top of the stack.
To obtain this information, we have computed, for each rule of the form
A→α , the set of all the possible first characters of words that can be de-
rived from α. Indeed, in the case of A→aaa , all words derived from aaa
start by an a; in the case of A→B bb , all words derived from B bb start by
b; in the case of A→C dd , all words derived from C dd start either by c
or by d. This captures the intuition behind the First1 sets, i.e., as we will see hereinafter, First1(aaa) = {a}, First1(Baa) = {b} and First1(Cdd) = {c, d}.
We already have the intuition that computing those sets must some-
times be done recursively: for instance First1 (B aa) is equal to First1 (B ),
because B is the first symbol in B aa.
Instead, we must consider the context in which a C can occur, and what
are the characters that could follow it. The only place where a C occurs
is in rule 3. From this rule we can deduce that all words generated from
C will necessarily be followed by a d. This is better understood by visual-
ising the derivation tree of dd, i.e. the only word that one can generate
from A by applying rule 6, as shown in Figure 5.3.
[Figure 5.3: the derivation tree of the word dd, in which C generates ε; the d that follows C in the rule is what we expect on the input, i.e. d ∈ Follow1(C).]
We can now complete Table 5.1 and obtain Table 5.2, which tells us ex-
actly which rule to apply for each possible symbol on the top of stack and
next character on the input. Since there is at most one rule in each cell of
the table, we have now a deterministic parser at our disposal. M
Let us now formalise properly the notions of Firstk and Followk.
Firstk(α) = { w ∈ T∗ | α ⇒∗ wx for some x, and either |w| = k, or |w| < k and x = ε }.

A word w is thus in Firstk(α) iff: (i) it is the prefix of some sentential form generated from α (i.e., α ⇒∗ wx for some x); and (ii) it has the right size: either it contains exactly k characters (|w| = k), or it contains fewer than k characters, but this can occur only if we cannot make this prefix any longer, which implies that x = ε (after all, there is no reason that all sentential forms generated from α contain at least k characters).
Example 5.7. Let us consider the grammar in Figure 5.4, which generates
expressions. Remember that this is the grammar that we have obtained
after taking into account the priority of the operators and removing left-
recursion (see the last pages of Chapter 4). Observe that we have added a
rule S→Exp$ to the grammar to make sure that all strings end with the
marker $. This will actually make our life easier when computing Follow (1) S → Exp$
(2) Exp → Prod Exp′
sets. Let us start by considering some values of First sets: (3) Exp′ → +Prod Exp′
© ª (4) → −Prod Exp′
• First(Atom) = −, Cst, Id, ( ; (5) → ε
(6) Prod → Atom Prod′
• First2 (Atom) = {−−, −Cst, −Id, −(, (−, ((, (Cst, (Id, Cst, Id}; (7) Prod′ → ∗Atom Prod′
(8) → /Atom Prod′
• First Prod′ = ∗, /, ε ;
¡ ¢ © ª
(9) → ε
(10) Atom → −Atom
• What is the value of First2 Prod′ ? We see that Prod′ produces either a
¡ ¢
(11) → Cst
string starting with ∗ and followed by some string generated by Atom; or (12) → Id
(13) → (Exp)
a string starting with / and followed by some string generated by Atom;
Figure 5.4: The grammar generating ex-
or ε. So, we can rely on First(Atom) to characterise First2 Prod′ , and we
¡ ¢
pressions (followed by $ as an end-of-
find: string marker), where we have taken into
account the priority of the operators, and
First2 Prod′ = ∗ · First1 (Atom) ∪ / · First1 (Atom) ∪ ε
¡ ¢ © ª © ª © ª
removed left-recursion.
at the end of a string generated by Exp, and all strings generated by Exp are followed by a $ or by a ) in the final output, so Follow(Exp′) = {$, )}.

these two symbols are in Follow(Prod); (ii) or be the empty word. In this latter case, the Follow of Prod will be the Follow of Exp′. This is sketched in Figure 5.5. Here, Prod eventually generates some string α, and Exp′ generates ε. Then, clearly, the generated word is α · ε · $ = α$ (this can be seen by inspecting the tree’s leaves). This shows that $ can indeed immediately follow a string (α) generated by Prod. This reasoning holds for all symbols in Follow(Exp′) = Follow(Exp) = {$, )}. We conclude that:

Follow(Prod) = {+, −, $, )}.

Figure 5.5: An example showing that Follow(Prod) contains $ in a case where Exp′ generates ε. (The tree shows Exp $ at the top, Exp with children Prod and Exp′, Prod eventually generating α, and Exp′ generating ε.)
foreach a ∈ T do
    Firstk(a) ← {a} ;
foreach A ∈ V do
    Firstk(A) ← ∅ ;
repeat
    foreach A → X1 X2 · · · Xn ∈ P do
        Firstk(A) ← Firstk(A) ∪ (Firstk(X1) ⊙k Firstk(X2) ⊙k · · · ⊙k Firstk(Xn)) ;
until no Firstk(A) changes ;
Algorithm 6: The algorithm to compute the Firstk sets of a CFG.
1. During the first iteration, we go through all rules, and update the First
of their left-hand side:
(a) For the rule S → Exp$, we get the opportunity to update First(S). We compute:

First1(Exp) ⊙1 First1($) = ∅ ⊙1 {$} = ∅.

(Remember that ∅ · L = ∅ for all languages L. So ∅ ⊙k L = ∅ for all languages L and all k.)
So, this first rule does not allow us to infer more information on
First(S) at that point, because we have not computed any word from
First(Exp) yet.
(b) Actually, at this step of computations, all rules that contain at least
one variable in the right-hand side will yield a similar result, because
all the First’s are still empty.
(c) However, rules Exp′ → ε, Prod′ → ε, Atom → Cst and Atom → Id allow us to update First(Exp′), First(Prod′) and First(Atom), respectively.
X First1 (X )
Exp′ ε
Prod′ ε
Atom Cst, Id
2. During the second iteration, we will discover new values of First1 sets. Thanks to Prod → Atom Prod′, we add to First1(Prod) the elements from:

First1(Atom) ⊙1 First1(Prod′) = {Cst, Id} ⊙1 {ε} = {Cst, Id};

and, thanks to the rules Prod′ → ∗Atom Prod′ and Prod′ → /Atom Prod′, we add to First1(Prod′) the elements from:

First1(∗) ⊙1 First1(Atom) ⊙1 First1(Prod′) = {∗} ⊙1 {Cst, Id} ⊙1 {ε} = {∗},

and from:

First1(/) ⊙1 First1(Atom) ⊙1 First1(Prod′) = {/} ⊙1 {Cst, Id} ⊙1 {ε} = {/};
X First1 (X )
Exp′ ε
Prod Cst, Id
Prod′ ∗, /, ε
Atom −, Cst, Id
X First1 (X )
S −, Cst, Id, (
Exp −, Cst, Id, (
Exp′ +, −, ε
Prod −, Cst, Id, (
Prod′ ∗, /, ε
Atom −, Cst, Id, (
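For k = 1, the truncated concatenation ⊙1 simply keeps the first symbol of each concatenation (or ε when both arguments contribute ε), and Algorithm 6 becomes a few lines of code. The sketch below (Python; our encoding, with ε represented by the empty string) recomputes the First1 sets of the expression grammar and can be checked against the table above:

```python
def concat1(l1, l2):
    """The truncated concatenation ⊙1 for k = 1: keep only the first symbol."""
    return {(u + v)[:1] for u in l1 for v in l2}

def first1(variables, terminals, rules):
    first = {t: {t} for t in terminals}
    first.update({v: set() for v in variables})
    changed = True
    while changed:
        changed = False
        for lhs, rhs in rules:
            acc = {''}                       # First1 of the empty prefix is {ε}
            for symbol in rhs:
                acc = concat1(acc, first[symbol])
            if not acc <= first[lhs]:
                first[lhs] |= acc
                changed = True
    return first

VARIABLES = {'S', 'Exp', "Exp'", 'Prod', "Prod'", 'Atom'}
TERMINALS = {'$', '+', '-', '*', '/', 'Cst', 'Id', '(', ')'}
RULES = [('S', ['Exp', '$']), ('Exp', ['Prod', "Exp'"]),
         ("Exp'", ['+', 'Prod', "Exp'"]), ("Exp'", ['-', 'Prod', "Exp'"]),
         ("Exp'", []), ('Prod', ['Atom', "Prod'"]),
         ("Prod'", ['*', 'Atom', "Prod'"]), ("Prod'", ['/', 'Atom', "Prod'"]),
         ("Prod'", []), ('Atom', ['-', 'Atom']), ('Atom', ['Cst']),
         ('Atom', ['Id']), ('Atom', ['(', 'Exp', ')'])]

F = first1(VARIABLES, TERMINALS, RULES)
print(F['Atom'])   # expected: {'-', 'Cst', 'Id', '('}
print(F["Exp'"])   # expected: {'+', '-', ''}  (where '' stands for ε)
```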
Let us now turn our attention to the computation of Followk (X ) for all vari-
ables X of a CFG. The algorithm is given in Algorithm 7, and is, again, a
greedy algorithm that grows the sets Followk (X ) up to stabilisation. To do
so, we rely on the following intuition: every time we have a rule of the form:
A → αB β,
(i.e., a rule that contains variable B in its right-hand side), we can poten-
tially add more information to Followk (B ). Indeed, a string generated by
B can be followed by a string generated by β, so we can use Firstk(β).
Input: A CFG G = 〈V, T, P, S〉
Output: The sets Followk(X) for all X ∈ V
foreach A ∈ V \ {S} do
    Followk(A) ← ∅ ;
Followk(S) ← {ε} ;
repeat
    foreach A → αBβ ∈ P (with B ∈ V and α, β ∈ (V ∪ T)∗) do
        Followk(B) ← Followk(B) ∪ (Firstk(β) ⊙k Followk(A)) ;
until no Followk(X) changes ;
Algorithm 7: The algorithm to compute the Followk sets of a CFG.
Example 5.9. Let us consider again the grammar from Figure 5.4, and let
us apply Algorithm 7 to it, for k = 1.
0. Initially, we have Follow1(X) = ∅ for all variables X, except Follow1(S), which is equal to {ε}.

Observe that initialising Follow1(S) to {ε} is crucial here. Otherwise, if we initialise Follow(X) to ∅ for all variables X, then the algorithm would terminate after one iteration with Followk(X) = ∅ for all X, since the expression Firstk(β) ⊙k Followk(A) = Firstk(β) ⊙k ∅ is then always empty.

1. During the first iteration, the algorithm adds $ to Follow1(Exp), thanks to the rule S → Exp$. Indeed, this corresponds, in the algorithm, to having A = S, B = Exp, α = ε and β = $. So the string:

First1($) ⊙1 Follow1(S) = {$} ⊙1 {ε} = {$}

is added to Follow1(Exp). Similarly, the rule Exp → Prod Exp′ (with A = Exp, B = Prod and β = Exp′) adds to Follow1(Prod) the set:

First1(Exp′) ⊙1 Follow1(Exp) = {+, −, ε} ⊙1 {$} = {+, −, $};

Etc. . .
X Follow1 (X )
S ε
Exp $, )
Exp′ $, )
Prod +, −, $, )
Prod′ +, −, $, )
Atom ∗, /, +, −, $, )
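Algorithm 7 can be implemented in the same style. In the sketch below (Python; our encoding, with ε represented by the empty string), the First1 sets are not recomputed but hard-coded from the table obtained above, so that the snippet stands alone:

```python
def concat1(l1, l2):
    return {(u + v)[:1] for u in l1 for v in l2}

def follow1(variables, rules, first, start):
    """Fixed-point computation of Follow1 (Algorithm 7), given the First1 sets."""
    follow = {v: set() for v in variables}
    follow[start] = {''}                    # '' stands for ε
    changed = True
    while changed:
        changed = False
        for lhs, rhs in rules:
            for i, b in enumerate(rhs):
                if b not in variables:
                    continue
                acc = {''}
                for symbol in rhs[i + 1:]:  # First1(beta)
                    acc = concat1(acc, first[symbol])
                new = concat1(acc, follow[lhs])
                if not new <= follow[b]:
                    follow[b] |= new
                    changed = True
    return follow

VARIABLES = {'S', 'Exp', "Exp'", 'Prod', "Prod'", 'Atom'}
RULES = [('S', ['Exp', '$']), ('Exp', ['Prod', "Exp'"]),
         ("Exp'", ['+', 'Prod', "Exp'"]), ("Exp'", ['-', 'Prod', "Exp'"]),
         ("Exp'", []), ('Prod', ['Atom', "Prod'"]),
         ("Prod'", ['*', 'Atom', "Prod'"]), ("Prod'", ['/', 'Atom', "Prod'"]),
         ("Prod'", []), ('Atom', ['-', 'Atom']), ('Atom', ['Cst']),
         ('Atom', ['Id']), ('Atom', ['(', 'Exp', ')'])]
# First1 sets of the grammar, taken from the table computed above:
FIRST1 = {'$': {'$'}, '+': {'+'}, '-': {'-'}, '*': {'*'}, '/': {'/'},
          'Cst': {'Cst'}, 'Id': {'Id'}, '(': {'('}, ')': {')'},
          'S': {'-', 'Cst', 'Id', '('}, 'Exp': {'-', 'Cst', 'Id', '('},
          "Exp'": {'+', '-', ''}, 'Prod': {'-', 'Cst', 'Id', '('},
          "Prod'": {'*', '/', ''}, 'Atom': {'-', 'Cst', 'Id', '('}}

F = follow1(VARIABLES, RULES, FIRST1, 'S')
print(F['Prod'])   # expected: {'+', '-', '$', ')'}
```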
Using the tools we have just defined (First and Follow sets), we can now
identify classes of CFGs for which the predictive parsers using k characters
of look-ahead (as sketched above) will be deterministic. Those grammars
are called LL(k) grammars, where ‘LL’ stands for ‘Left scanning, Left Pars-
ing’, because the input string is read (scanned) from the left to the right;
and the parser builds a leftmost derivation when successfully recognising
the input word. This class of grammars was first introduced by Lewis and Stearns in 1968⁴, with further important refinements by Rosenkrantz and Stearns in 1970⁵ (and many others afterwards. . . ).

⁴ P. M. Lewis, II and R. E. Stearns. Syntax-directed transduction. J. ACM, 15(3):465–488, July 1968. ISSN 0004-5411. DOI: 10.1145/321466.321477
⁵ D. J. Rosenkrantz and R. E. Stearns. Properties of deterministic top-down grammars. Information and Computation (formerly known as Information and Control), 17(3):226–256, 1970. ISSN 0019-9958. DOI: 10.1016/S0019-9958(70)90446-8

What are the conditions we need to impose on the derivations of a grammar to make sure that its corresponding parser will be deterministic when it has access to k characters of look-ahead? As we have seen already, the only possible source of non-determinism in the parser stems from the produces, more specifically, when the grammar contains at least two rules of the form A → α1 and A → α2. Given that these two rules exist, let us now pinpoint a situation that will confuse a parser that has access to k characters of look-ahead only. Such a situation occurs if, at some point in a derivation of the grammar, A is the leftmost symbol (hence, it is on the top of the stack), and the parser must decide whether to apply A → α1 or A → α2, but
1. First, since we want to have A as the leftmost symbol at some point, the
derivation prefix is:
S ⇒∗ w Aγ
with w ∈ T ∗ and γ ∈ (V ∪ T )∗ .
2. Then, let us assume that in this first derivation, the right choice is to
apply A → α1 , i.e.,
w Aγ ⇒ wα1 γ.
S ⇒∗ w Aγ ⇒ wα1 γ ⇒∗ w x 1 .
Thus, this first derivation generates the word w x 1 . Let us consider again
the moment in the derivation when the sentential form was w Aγ and the
parser had to decide to apply A → α1 . As we have already remarked, A is,
at that point, on the top of the stack, and w has already been read from the
input. Hence, at that point, the string that remains on the input is x 1 , and
all the parser ‘sees’ is Firstk (x 1 ).
Then, it is easy to build a second derivation that will confuse the parser.
Assume that, in the grammar we have a derivation of the form:
S ⇒∗ w Aγ ⇒ wα2 γ ⇒∗ w x 2 .
Observe that, now, the right choice to derive A is A → α2 , and, when the
parser must take this choice, it ‘sees’ a look-ahead of Firstk(x2). So, the parser will be able to make the right decision regarding the derivation of A iff the look-ahead it has at its disposal is sufficient to tell those two situations apart, i.e.:

Firstk(x1) ≠ Firstk(x2).

Remember that x1 and x2 are words, so their Firstk is a singleton containing one string of length at most k. This is why we can write Firstk(x1) ≠ Firstk(x2) instead of Firstk(x1) ∩ Firstk(x2) = ∅, for instance.

The definition⁶ of LL(k) grammar is based on these intuitions: it says that, whenever a pathological situation such as the one described above occurs (the two derivations and Firstk(x1) = Firstk(x2)), then, we must have α1 = α2; which means that there is actually no choice to be made in the grammar. Otherwise, the parser would not be able to take a decision and the grammar would not be LL(k):

⁶ P. M. Lewis, II and R. E. Stearns. Syntax-directed transduction. J. ACM, 15(3):465–488, July 1968. ISSN 0004-5411. DOI: 10.1145/321466.321477

Observe that, if a grammar is LL(k) for some k, then it is also LL(k′) for all k′ ≥ k. This is coherent with our intuition that LL(k) means ‘k characters of look-ahead are sufficient’.

Definition 5.10 (LL(k) CFGs). A CFG 〈P, T, V, S〉 is LL(k) iff for all pairs of derivations:

S ⇒∗ wAγ ⇒ wα1γ ⇒∗ wx1
S ⇒∗ wAγ ⇒ wα2γ ⇒∗ wx2

s.t. Firstk(x1) = Firstk(x2), we have α1 = α2.
S ⇒ a A a ⇒ aba
S ⇒ a A a ⇒ aa
S ⇒ b AB a ⇒ bbB a ⇒ bbba
S ⇒ b AB a ⇒ bbB a ⇒ bbca
S ⇒ b AB a ⇒ bB a ⇒ bba
S ⇒ b AB a ⇒ bB a ⇒ bca
S ⇒ b AB a ⇒ b A ba ⇒ bbba
S ⇒ b AB a ⇒ b A ca ⇒ bbca
S ⇒ b AB a ⇒ b A ba ⇒ bba
S ⇒ b AB a ⇒ b A ca ⇒ bca.
S ⇒ b AB a ⇒ bbB a ⇒ bbba
and
S ⇒ b AB a ⇒ bB a ⇒ bba
which both read the sentential form b AB a after one step, correspond-
ing to
w = b and γ = B a
in the definition. In the former derivation, one applies A → b, while in
the latter derivation, A → ε is used, corresponding to:
α1 = b and α2 = ε
in the definition. The resulting words are respectively bbba and bba,
corresponding to:
x 1 = bba and x 2 = ba
in the definition (since w = b), so we have:
This example shows well that, while the definition of LL(k) grammar
makes perfect sense and captures our intuition of what an LL(k) grammar
should be, it is of limited use in practice when one wants to check whether
a grammar is LL(k) or not. Indeed, Definition 5.10 requires to check all
possible pairs of derivations in the grammar, and there can be infinitely
many such pairs. Instead, we will now identify a stronger condition that
we will be able to test, and that will still be relevant in practice.
Example 5.13. Let us consider the grammar for arithmetic expression from
Figure 5.4, and let us check that it is a strong LL(1) grammar. To this end,
we can rely on the computation of the First and Follow sets from Exam-
ple 5.8 and Example 5.9. To apply Definition 5.12, we need to consider all
the group of rules that have the same left-hand side. There are three such
groups:
1. The three rules that have Exp′ as left-hand side are: Exp′ → +ProdExp′ ,
Exp′ → −ProdExp′ , and Exp′ → ε. Hence, we check that there is no com-
mon element between the three following sets:

First(+ProdExp′Follow(Exp′)) = {+}
First(−ProdExp′Follow(Exp′)) = {−}
First(εFollow(Exp′)) = Follow(Exp′) = {$, )}.
This is indeed the case. Intuitively, this means that, when Exp′ is on the
top of the parser’s stack, it can determine which rule to apply basing
its decision on a single character look-ahead: when the look-ahed is +,
apply the first rule; when the look-ahead is −, apply the second; and
apply the last only when the look-ahead is $ or ).
2. For the three rules that have Prod′ as the left-hand side, we check similarly that there is no common element between:

First(∗AtomProd′Follow(Prod′)) = {∗}
First(/AtomProd′Follow(Prod′)) = {/}
First(εFollow(Prod′)) = Follow(Prod′) = {$, +, −, )}.
3. Finally, for the four rules that have Atom as the left-hand side, we con-
sider the four sets:

First(−AtomFollow(Atom)) = {−}
First(CstFollow(Atom)) = {Cst}
First(IdFollow(Atom)) = {Id}
First((Exp)Follow(Atom)) = {(},

which are pairwise disjoint. Hence, the grammar is indeed strong LL(1).
The name strong LL(k) suggests that the conditions of Definition 5.12
are stronger than those of Definition 5.10. This is indeed the case: all
strong LL(k) grammars are LL(k) grammars; however, the converse is, in
general, not true, as shown by the next example:
Example 5.14. Let us consider again the grammar given in Example 5.11,
which is LL(2), and let us show that it is not strong LL(2). Indeed, if we
consider the two rules A → b and A → ε, we have:
and:
Since these two sets both contain ba, the grammar is not strong LL(2). M
Theorem 5.4.
1. For all k ≥ 1, for all CFG G: if G is strong LL(k), then it is also LL(k).
2. For all k ≥ 2, there is a CFG G which is LL(k) but not strong LL(k).
3. However, all LL(1) grammars are also strong LL(1), i.e. the classes of
LL(1) and strong LL(1) grammars coincide.
Proof. (Sketch) Points 1. and 3. can be derived from Definition 5.10 and
Definition 5.12. Point 2 stems from Example 5.14 that can be generalised
to any k ≥ 2.
2. hence Followk+1 (S) = {ak b, c} as well, by the first rule of the grammar.
3. However, Firstk+1 (S) = {ε, ak+1 }. Indeed, the recursion in the first rule
of the grammar will produce an arbitrarily long prefix of a’s, containing
at least one a, which will be followed by k more a′ s produced by A.
So, we can already conclude that G k is strong LL(k + 1), hence, it is also
LL(k + 1). Using similar arguments, we can show that G k is not strong
LL(k). Unfortunately, this does not imply that G k is not LL(k), and this
needs to be proved with other arguments (this is the most involved part of the proof from the original article).
So we conclude that:
Theorem 5.5. The family of LL(k) grammars (for all k ≥ 0) forms a strict
hierarchy:

LL(0) gram. ⊊ LL(1) gram. ⊊ LL(2) gram. ⊊ · · · ⊊ LL(k) gram. ⊊ LL(k + 1) gram. ⊊ · · ·
Observe that the definitions we have given so far (LL(k), strong LL(k)) are concerned with grammars⁹, but do not speak explicitly about the languages those grammars define.

⁹ Actually, the definition of strong LL(k) is a purely syntactical condition on grammars.

We have already seen, in Example 5.11, that there
is at least one grammar which is LL(2) but not LL(1) (hence, not strong
LL(1)). However, we also know that there are potentially several different
grammars to define the same language. So, instead of considering classes
of LL(k) grammars, one could naturally define LL(k) languages:
Now, we can compare classes of LL(k) languages. Obviously, all LL(k) languages are also LL(k + 1) languages, for all k. Is the converse true? Let us consider again the grammar from Example 5.11, which is LL(2) but not LL(1), and let us check whether there is another grammar that generates the same language. For this grammar, the answer is trivially ‘yes’, since the grammar generates the finite language:

The intuition is always the same: if a parser can recognise a language with k characters of look-ahead, it can also do so with k + 1 characters of look-ahead.
LL(0) lang. ⊊ LL(1) lang. ⊊ LL(2) lang. ⊊ · · · ⊊ LL(k) lang. ⊊ LL(k + 1) lang. ⊊ · · ·
Relationship with DCFL. Finally, since the point of considering LL(k) languages is to obtain deterministic parsers, one can wonder how LL(k) languages compare to DCFLs. Clearly, we have¹⁰:

For all k ≥ 0: LL(k) lang. ⊊ DCFL

¹⁰ R. Kurki-Suonio. Notes on top-down languages. BIT Numerical Mathematics, 9(3):225–238, 1969. ISSN 1572-9125. DOI: 10.1007/BF01946814

Indeed, each LL(k) language is recognised by an LL(k) parser which is a deterministic PDA, so all those languages are deterministic CFLs. The containment needs to be strict since LL(k) lang. ⊊ LL(k + 1) lang. for all k.
One can actually prove a further result: even the (infinite) union of all
LL(k) lang. is still not sufficient to cover all DCFLs. This can be proved by considering the language that is obtained by the union of the regular language {aⁿ | n ≥ 0} and the CFL {aⁿbⁿ | n ≥ 0}. Indeed, we can show that the language L = {aⁿ | n ≥ 0} ∪ {aⁿbⁿ | n ≥ 0} is a DCFL, but is not an LL(k) language for any k.
(Sketch). One can easily build a DPDA that accepts L by accepting state.
This DPDA pushes all the a’s it reads on the stack. This will be done by a self-
loop on an accepting state, so all the words of the form an are accepted. If
a b is read from this state, the DPDA moves deterministically to another
state where it will read all the b’s and check that there are as many b’s as a’s
by emptying the stack. When the stack becomes empty, the DPDA moves
to an accepting state. So, in this last state, all words of the form an bn will
be accepted.
However, L cannot be LL(k) for any k. Assume it is the case for some
value k. Then, consider the two words ak and ak bk . We can derive a con-
tradiction from the definition of LL(k) grammars. In its initial state, our
hypothetical parser will perform the same action, since the look-ahead ak
is the same. However, it is clear that there must be two different deriva-
tions: in the first case, only a’s must be generated from the sentential form
that is being built; while in the second case, the sentential form must contain symbols that will eventually generate the b’s (as many as there are a’s).
and that:

∪k≥0 LL(k) lang. ⊊ DCFL.
Equipped with this general theory, we are now ready to discuss the con-
struction of deterministic top-down parsers for a large and practical class
of grammars, namely the LL(1) grammars. Those parsers will thus be called LL(1) parsers.
As we have seen before, not all grammars are LL(1), and some languages
cannot be defined by an LL(1) grammar. However, for practical matters,
when one wants to generate a parser for a typical programming language,
obtaining an LL(1) grammar for that language is feasible. Here are the typical obstacles to the LL(1) property that can easily be alleviated with the techniques we have seen so far:
Left-recursion. Assume that the grammar contains the two rules:

(1) S → Sα
(2) → β

where β is a string of terminals. Then, this grammar is obviously not LL(1), since the parser cannot decide which rule to apply when S is on the top of the stack and First(β) is seen on the input. However, removing the left-recursion with the technique of Section 4.4 yields the grammar:

(1) S → βS′
(2) S′ → αS′
(3) → ε

which is now LL(1).
Common prefixes Another source of trouble is when two rules share the
same left-hand side, and a common prefix on their right-hand side,
such as in:
(1) [if] → if [Cond] then [Code] fi
(2) [if] → if [Cond] then [Code] else [Code] fi
...
Here, if the parser sees variable [if ] on the top of the stack, and symbol
if on the input, it cannot decide which rule to apply, so the grammar
is not LL(1). However, factoring (see Section 4.4) solves this issue:
Now, let us assume that we have a proper LL(1) grammar to describe the
language we are interested in parsing, and let us describe the construction
of its associated LL(1) parser.
The core of the construction will be the building of the so-called ‘action
table’, which describes what actions the parser must perform (either pro-
duce or match), depending on the look-ahead and the top of the stack.
We have already sketched an example of such a table at the beginning of
Section 5.3. This table describes completely the behaviour of the parser,
so, from now on, we will describe a parser with look-ahead by this means
only, hiding the fact that the parser is actually a PDA12 . Here is a more for- 12
Actually, a PDA with a single state, which
is thus irrelevant. Also, we will hide the fact
mal definition of the action table:
that there is always a transition that can
pop the Z0 symbol to reach an accepting
Definition 5.17 (LL(1) action table). Let G = 〈P , T ,V , S〉 be a CFG. Let us
configuration.
assume that:
– Accept, denoting that the string read so far is accepted. This occurs
only in cell M [$, $], i.e., when $ is on the top of the stack and also the
next symbol on the input. In terms of PDA, this consists in reading
$, and popping it, to reach an accepting configuration (provided that
no characters are left on the input); or
– Error, denoting the fact that the parser has discovered an error and
cannot go on with the construction of a derivation. The input should
be rejected.
Before explaining how to build such a table in a systematic way, we present a complete example of such a table, and the execution of the parser on example input strings.

Figure 5.7: The grammar generating expressions (followed by $ as an end-of-string marker). This is the same grammar as in Figure 5.4, reproduced here for readability:
(1) S → Exp$
(2) Exp → Prod Exp′
(3) Exp′ → +Prod Exp′
(4) → −Prod Exp′
(5) → ε
(6) Prod → Atom Prod′
(7) Prod′ → ∗Atom Prod′
(8) → /Atom Prod′
(9) → ε
(10) Atom → −Atom
(11) → Cst
(12) → Id
(13) → (Exp)

Example 5.18. Let us consider once again the grammar for arithmetic expressions (Figure 5.4), which we reproduce in Figure 5.7 to enhance readability. Its action table is as follows (where M, A and empty cells denote ‘Match’, ‘Accept’ and ‘Error’, respectively):
152 I N T R O D U C T I O N T O L A N G UA G E T H E O RY A N D C O M P I L I N G
    M     |  $    +    −    ∗    /    Cst   Id    (    )
    S     |            1              1     1     1
    Exp   |            2              2     2     2
    Exp′  |  5    3    4                               5
    Prod  |            6              6     6     6
    Prod′ |  9    9    9    7    8                     9
    Atom  |            10             11    12    13
    $     |  A
    +     |       M
    −     |            M
    ∗     |                 M
    /     |                      M
    Cst   |                           M
    Id    |                                 M
    (     |                                       M
    )     |                                            M
Note that the bottom half of the table is not very informative: it just tells
us that we should match terminals when they occur at the top of the stack.
This is not surprising: non-determinism can occur only because of the
‘Produce’ actions. So, in the rest of these notes, we will not show that part
of the table anymore.
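Before looking at a concrete run, here is a minimal Python sketch (ours) of the loop that such an action table drives. The encoding of the table is an assumption: it maps a pair (top-of-stack symbol, look-ahead) to 'match', 'accept', a triple ('produce', rule number, right-hand side), or nothing (an error); it is not the notation used in these notes.

    def ll1_parse(table, start, word):
        """Run the table-driven LL(1) parser on `word` (a list of terminals ending with '$')."""
        stack = [start]                              # parser stack, top of stack at the end
        pos = 0                                      # position of the look-ahead in the input
        while stack:
            top, lookahead = stack[-1], word[pos]
            action = table.get((top, lookahead))
            if action is None:                       # empty cell: error
                return False
            if action == 'accept':                   # only in cell ($, $)
                return pos == len(word) - 1
            if action == 'match':                    # pop the terminal and advance
                stack.pop()
                pos += 1
            else:                                    # ('produce', i, rhs): replace top by rhs
                _, _, rhs = action
                stack.pop()
                stack.extend(reversed(rhs))          # push rhs so its first symbol is on top
        return False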
Now, let us consider the input word Id + Id ∗ Id, which is accepted by the
grammar, and let us build the corresponding run.

1. Initially, the stack contains only the start symbol S and the look-ahead is
   Id. We look up M[S, Id], which tells us to Produce rule (1): S is replaced by
   Exp$ on the stack (with Exp on top), and the input is not modified.

2. Then, we look up M[Exp, Id], and Produce rule (2). This replaces Exp
   by Prod Exp′ on the stack (with Prod on top, thus) and does not modify
   the input. The parsing continues accordingly, for a couple of steps (the
   stacks are written with the top on the left, next to the remaining input):
    Stack (top on the left)      Remaining input      Action
    S                            Id + Id ∗ Id$        Produce (1)
    Exp $                        Id + Id ∗ Id$        Produce (2)
    Prod Exp′ $                  Id + Id ∗ Id$        Produce (6)
    Atom Prod′ Exp′ $            Id + Id ∗ Id$        Produce (12)
    Id Prod′ Exp′ $              Id + Id ∗ Id$
3. At that point, the terminal Id is present both on the top of the stack and
on the input, so a Match occurs, which modifies the input:
    Stack (top on the left)      Remaining input      Action
    Id Prod′ Exp′ $              Id + Id ∗ Id$        Match
    Prod′ Exp′ $                 +Id ∗ Id$
4. Then, the parsing goes on. . . Observe that the next produce consists in
popping the Prod′ variable from the top of the stack (i.e., applying rule
Prod′ → ε):
    Stack (top on the left)      Remaining input      Action
    Prod′ Exp′ $                 +Id ∗ Id$            Produce (9)
    Exp′ $                       +Id ∗ Id$            Produce (3)
    + Prod Exp′ $                +Id ∗ Id$            Match
    Prod Exp′ $                  Id ∗ Id$             Produce (6)
    Atom Prod′ Exp′ $            Id ∗ Id$             Produce (12)
    Id Prod′ Exp′ $              Id ∗ Id$             Match
    Prod′ Exp′ $                 ∗Id$                 Produce (7)
    ∗ Atom Prod′ Exp′ $          ∗Id$                 Match
    Atom Prod′ Exp′ $            Id$                  Produce (12)
    Id Prod′ Exp′ $              Id$                  Match
    Prod′ Exp′ $                 $                    Produce (9)
    Exp′ $                       $                    Produce (5)
    $                            $                    Accept!
Now, let us consider the word Id(Id)$ which is not syntactically correct
(it is not accepted by the grammar).
1. The corresponding run starts as in the case of Id + Id ∗ Id$, until the mo-
ment where the Id symbol on the top of the stack has to be matched:
    Stack (top on the left)      Remaining input      Action
    S                            Id(Id)$              Produce (1)
    Exp $                        Id(Id)$              Produce (2)
    Prod Exp′ $                  Id(Id)$              Produce (6)
    Atom Prod′ Exp′ $            Id(Id)$              Produce (12)
    Id Prod′ Exp′ $              Id(Id)$              Match
    Prod′ Exp′ $                 (Id)$
2. At this point, the action table returns an error (i.e., M[Prod′, (] = Error,
   indicated by a blank cell in the table above), and the parsing stops.

So the word Id(Id)$ is not accepted by the parser, nor by the grammar.

Observe that when the error is detected, the information in row Prod′
of the action table could be used to give some feedback to the user, by
telling him which symbol could have been correct at that point of the pars-
ing (without any guarantee that the parsing could have continued even if that
symbol were present). For example, an error message in this case could
have been:

    Error, unexpected symbol (. I was expecting $, +, −, ∗, / or ).

One could also imagine that the compiler skips the error at that point and
tries to re-synchronise, i.e., tries to compile the remainder of the input (in
this case (Id)$, which is correct) in order to inform the user of potential
further errors in the input. Error reporting and re-synchronisation are beyond
the scope of these notes. The interested reader can refer to the 'Dragon book':
A. Aho, M. Lam, R. Sethi, and J. Ullman. Compilers: Principles, Techniques, &
Tools. Addison-Wesley series in computer science. Pearson/Addison Wesley, 2007. M
Algorithm to build the action table Let us now formalise these ideas. Al-
gorithm 8 presents the construction of the LL(1) table. The algorithm starts
by initialising all the cells M[A, a] (where A is a variable and a a terminal)
to the empty set. These are all the cells that can potentially contain one or
several 'Produce' actions. Then all cells M[a, b], where a and b are termi-
nals, are populated: they are all initialised to the empty set, except for the
cells M[a, a] (with a ≠ $) that are initialised to a 'Match', and M[$, $] that
contains the 'Accept' action.

After this initialisation, all rules A → α are taken into account: all sym-
bols a from First(α Follow(A)) are computed, and the number i of the rule
A → α is added to the corresponding cell M[A, a]. Since the algorithm adds
the rule number to the cell, each cell can contain several rule numbers,
thereby allowing the detection of potential conflicts.

It is useful to compare the way the algorithm fills in the table with the
definition of strong LL(1) grammars. One can then see that there will be
no conflict in the table built by the algorithm iff the grammar is strong LL(1),
i.e. LL(1), since these two notions are equivalent.
Algorithm 8: Systematic construction of the LL(1) table of a CFG (pseudo-code
not reproduced here).
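The table construction of Algorithm 8 can be prototyped in a few lines of Python. The sketch below is ours: it assumes that the First and Follow sets are available through functions first_of (returning the First set of a sequence of symbols, with '' standing for ε) and follow, and that rules are given as triples (number, left-hand side, right-hand side).

    def build_ll1_table(rules, variables, terminals, first_of, follow):
        """Sketch of Algorithm 8: build the LL(1) action table M as a dict of sets."""
        M = {(x, a): set() for x in list(variables) + list(terminals) for a in terminals}
        for a in terminals:
            if a != '$':
                M[(a, a)] = {'Match'}            # match a terminal against itself
        M[('$', '$')] = {'Accept'}
        for number, A, alpha in rules:
            fs = first_of(alpha)                 # First(alpha); may contain '' (epsilon)
            look = (fs - {''}) | (follow(A) if '' in fs else set())
            for a in look:                       # a in First(alpha Follow(A))
                M[(A, a)].add(number)            # several numbers in one cell = conflict
        return M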
and:
S → B c,
because this would yield two (or more) different implementations for the
same function corresponding to S.
However, in order to resolve this non-determinism, we can rely on the
LL(1) techniques we have presented throughout this chapter. In the exam-
ple above, let us assume that:
def S():
    n = get_next_character()   # This is a look-ahead (the character is not consumed)

    # If the look-ahead is in First(A b Follow(S))
    if n == 'a1' or n == 'a2' or ... or n == 'an':
        read_next_character()  # Discard the look-ahead
        r = A()
        if not r: return False
        n = read_next_character()
        if n == 'b': return True
        else: return False

    # If the look-ahead is in First(B c Follow(S))
    if n == 'b1' or n == 'b2' or ... or n == 'bn':
        read_next_character()  # Discard the look-ahead
        r = B()
        if not r: return False
        n = read_next_character()
        if n == 'c': return True
        else: return False

    return False               # The look-ahead matches neither rule: error
5.7 Exercises
IN THIS SECTION, WE WILL CONSIDER A COMPLETELY DIFFERENT FAMILY OF
PARSERS, WHICH ARE CALLED BOTTOM-UP PARSERS. As sketched
already in the introduction of the previous chapter, bottom-up parsers
build a derivation tree starting from the leaves (i.e., the input string), and
follow backwards the derivation rules of the grammar, until they manage
to reach the root of the tree. Those parsers are generally regarded as more
powerful than their top-down counterparts (we will give formal elements
to support this claim in Section 6.9). As such, automatic parser generators
such as yacc¹, bison² and cup³ implement bottom-up parsers.

¹ Stephen C. Johnson. Yacc: Yet another compiler-compiler. Technical report,
  AT&T Bell Laboratories, 1975. Readable online at http://dinosaur.compilertools.net/yacc/
² Gnu bison. https://www.gnu.org/software/bison/. Online: accessed on December, 29th, 2015
³ Cup: Construction of useful parsers. http://www2.cs.tum.edu/projects/cup/. Online:
  accessed on December, 29th, 2015

6.1 Principle of bottom-up parsing

Recall the two main actions that a top-down parser can perform:

1. the Produce, which consists in replacing, on the top of the stack, the
   left-hand side A by the right-hand side α of some production rule A→α
   of the grammar; and

2. the Match, that consists in reading from the input some character a,
   which is at the same time popped from the top of the stack.
Such top-down parsers start their execution with the start symbol S on the
stack and accept with the empty stack. Doing so, they unravel a parse tree
for the input string from the top to the bottom, and produce a leftmost
derivation.

Bottom-up parsers, on the other hand, work in a completely reverse way.
As their name suggests, they build the parse tree from the bottom to the
top. As such, they are often regarded as more efficient, since they deduce
the nodes of the parse tree based on the actual input, whose elements
they recognise as being generated by the grammar; contrary to top-down
parsers that have to start from the start symbol and find a way to obtain
the input by applying the proper grammar rules.

For these notes, we will consider the most prominent class of bottom-
up parsers, which are those based on two main actions: the Shift and the
Reduce, as we are about to explain. Those parsers (in particular the
LR(k) parsers that we are about to study) have been introduced⁴ by Don-
ald E. KNUTH⁵ in 1965, in a paper where he generalises previous works
from other prominent computer scientists such as Robert FLOYD⁶.

In order to build the parse tree from the leaves to the root, Shift-Reduce
bottom-up parsers proceed as follows:

⁴ Donald E. Knuth. On the translation of languages from left to right. Information
  and Computation (formerly known as Information and Control), 8:607–639, 1965.
  DOI: 10.1016/S0019-9958(65)90426-2
⁵ Prominent American computer scientist, born in 1938, former Stanford and CalTech
  professor, recipient of the TURING award in 1974. He is the author of the famous
  series of books The Art of Computer Programming, and of the TEX typesetting
  language, which has later been extended to LATEX by Leslie LAMPORT.
⁶ Robert W. FLOYD, American computer scientist (1936–2001). Recipient of the
  TURING award in 1978, and well-known for several contributions to algorithmics,
  including the FLOYD-WARSHALL algorithm.
• First, they rely on the Shift action to move terminals from the input to
  the top of the stack. The picture hereunder illustrates a shift, where the
  terminal a is read from the input and pushed on the top of the stack:

      stack (top on the left): A b      input: abc      (before the Shift)
      stack (top on the left): a A b    input: bc       (after the Shift)

• Second, they apply the grammar rules in reverse. The parser looks for
  a so-called handle on the top of the stack, i.e. the right-hand side α of
  some grammar rule A → α. When such a handle is present (in mirror
  image) on the top of the stack, the parser can perform a Reduce. A Re-
  duce amounts to popping the handle α, and pushing A instead. This
  way, the parser unravels a parse tree from the bottom to the top (hence
  the name bottom-up parser). The picture hereunder illustrates a Re-
  duce of the rule B → A a:

      stack (top on the left): a A b    input: bc       (before the Reduce)
      stack (top on the left): B b      input: bc       (after the Reduce)

  Observe that the handle A a appears with the rightmost character on the
  top of the stack. That is, the handle is reversed with respect to what would
  have been pushed to the stack by a Produce of the same rule in a top-down
  parser. This is because the characters that have produced the variable A
  (through other Reduces, presumably) have been read on the input before
  the a, so the A has been pushed to the stack before the a and is thus under
  the a in the stack.
• Finally, the aim of the parser is not to empty the stack (by matching all
the terminals), but rather to end up with only the start symbol S on the
stack (and, of course, an empty input). This means that a derivation has
been produced for the whole string, but in the reverse order (since rules
have been applied in the reverse order too when doing the Reduces).
Actually, the bottom-up parser builds a right-most derivation in reverse
order.
Example 6.1. Let us consider once again the grammar for arithmetic ex-
pressions of Figure 4.4, and let us consider the string Id + Id ∗ Id, which
is accepted by the grammar. One possible⁷ rightmost derivation for this
string is as follows:

    Exp ⇒ Exp ∗ Exp ⇒ Exp ∗ Id ⇒ Exp + Exp ∗ Id ⇒ Exp + Id ∗ Id ⇒ Id + Id ∗ Id,   (6.1)

applying, in this order, rules 2, 4, 1, 4 and 4.

⁷ Recall that this grammar is ambiguous, so several rightmost derivations are possible.

The parser starts with an empty stack and the whole string Id + Id ∗ Id on the
input. From this configuration, the parser can only shift the first character on the
input:

    stack (top on the left): Id      remaining input: +Id ∗ Id
Now, the top of the stack constitutes a handle for the rule Exp→Id. Remark
that, at this point, the parser can decide either to Reduce Exp→Id or to
Shift another character. The former choice is the correct one. Indeed, if we
shift a + on top of the Id, it means we will need to find a rule whose right-
hand part contains Id followed by some other symbols, but there is no such rule
in our grammar. The Reduce of Exp→Id yields:

    stack (top on the left): Exp      remaining input: +Id ∗ Id

Observe that, in the rightmost derivation (6.1), the last rule applied is indeed
Exp→Id. So, this is coherent with our claim that the parser builds the rightmost
derivation in a reverse manner. Now, the parser can shift two more sym-
bols, and reduce the Id that ends up on the top of the stack:

    stack (top on the left)      remaining input      action
    Exp                          +Id ∗ Id             Shift
    + Exp                        Id ∗ Id              Shift
    Id + Exp                     ∗Id                  Reduce (4)
    Exp + Exp                    ∗Id
Next, following the rightmost derivation, Exp + Exp forms a handle of the
rule Exp→Exp + Exp, which can be reduced⁸. Then, the parser can shift
twice, and finally reduce Exp→Id, then Exp→Exp ∗ Exp:

    stack (top on the left)      remaining input      action
    Exp + Exp                    ∗Id                  Reduce (1)
    Exp                          ∗Id                  Shift
    ∗ Exp                        Id                   Shift
    Id ∗ Exp                     ε                    Reduce (4)
    Exp ∗ Exp                    ε                    Reduce (2)
    Exp                          ε

⁸ Formally, this must be performed, in the PDA, by three transitions in a row: one to
  pop the first Exp, one to pop the +, and one to replace the last Exp by another Exp.
  This last transition can be skipped, however, since we are in the special case where
  the last symbol that must be popped is also the one that has to be pushed.
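The mechanics of Shift and Reduce are easy to mimic with a plain Python list used as a stack. The sketch below (ours) simply replays the sequence of actions of the run above on Id + Id ∗ Id; it does not address how a deterministic parser would choose these actions, which is the topic of the rest of this chapter.

    # The three rules of the grammar of Figure 4.4 used in this run (numbering as in the text).
    RULES = {1: ('Exp', ['Exp', '+', 'Exp']),
             2: ('Exp', ['Exp', '*', 'Exp']),
             4: ('Exp', ['Id'])}

    def shift(stack, inp):
        stack.append(inp.pop(0))                 # move the next terminal onto the stack

    def reduce_by(stack, rule):
        lhs, rhs = RULES[rule]
        assert stack[-len(rhs):] == rhs          # the handle must be on the top of the stack
        del stack[-len(rhs):]                    # pop the handle...
        stack.append(lhs)                        # ...and push the left-hand side instead

    stack, inp = [], ['Id', '+', 'Id', '*', 'Id']
    for action in [('S',), ('R', 4), ('S',), ('S',), ('R', 4), ('R', 1),
                   ('S',), ('S',), ('R', 4), ('R', 2)]:
        shift(stack, inp) if action[0] == 'S' else reduce_by(stack, action[1])
    print(stack, inp)                            # ['Exp'], []  -- only the start symbol remains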
to slightly alter the behaviour of the basic parser that we are presenting
now. Nevertheless, we believe it is a good exercise to show that the actions
we have described above can actually be implemented in a PDA.
The parser we are about to describe is unfortunately not as simple as
the one-state top-down parser we had obtained in Section 5.1.1. This is
due to the fact that a Reduce entails a sequence of pops from the stack,
which cannot be performed by a single transition. Hence, we will need to
introduce intermediary states. More precisely, for each rule of the form
A→α1 · · · αn (where the αi ’s are individual variables or terminals), we will
have n states, that we call (A, α1 · · · α j ) for 1 ≤ j ≤ n − 1 and (A, ε). Intu-
itively, the PDA reaches (A, α1 · · · α j ) iff: (i) it is in the middle of the reduc-
tion of A→α1 · · · αn ; and (ii) it has already popped characters αn , αn−1 ,. . . ,
α j +1 from the stack. Thus, to finish the reduction from this state, the PDA
must still: (i) pop α j , α j −1 ,. . . , α1 (in this order); and (ii) push A. For ex-
ample, if the grammar contains rule A→bc , then, the parser can, from its
initial state:
1. take a transition that pops c (the symbol on the top of the stack, since the
   handle bc appears reversed) and moves to state (A, b); then,

2. from (A, b), take a transition that pops b and moves to (A, ε); and finally,

3. from (A, ε), take a transition that pushes A, and moves back to the initial
   state of the PDA.
In addition to these transitions, we also need transitions to perform Shifts,
and one transition to move to a dedicated state called q_a when the start
symbol of the grammar occurs on the top of the stack. Note that this state
is not accepting (we build again a PDA that accepts on empty stack), but
its aim is to check that, at this point, the only symbol left on the stack is Z0,
which means that all the symbols on the stack have indeed been reduced
to S.

Recall that a PDA with accepting states accepts only when the accepting
state is reached and the input is empty. There is thus no problem in
jumping non-deterministically to the accepting state whenever the start
symbol occurs on the top of the stack.
Formally, from a CFG G = 〈V, T, P, S〉, we build a PDA P′_G as follows:

    P′_G = 〈Q, T, V ∪ T, δ, q_i, Z0, ∅〉,

where:

    Q = {q_i, q_a} ∪ {(A, ε) | A ∈ V}
        ∪ {(A, α1 · · · αj) | A → α1 · · · αn ∈ P ∧ 1 ≤ j ≤ n − 1}.

That is, we have one initial state q_i, the dedicated state q_a, and the
intermediary states for the reductions, as announced;
Intuitively, there are self-loops on the initial state that Shift any sym-
bol on the input to the stack; when the start symbol S occurs on the
top of the stack, the parser can move to the accepting state (Accept
action); and the parser can decide at any moment to start a Reduce,
provided that the top of the stack is the right-most character in the
handle.
(b) For all states of the form (A, α) with α ≠ ε, and for all s ∈ V ∪ T ∪ {Z0}:

        δ((A, αs), ε, s) = {((A, α), ε)}
        δ((A, α), a, s) = ∅ in all other cases.

(c) For all states of the form (A, ε), and for all s ∈ V ∪ T ∪ {Z0}:

        δ((A, ε), ε, s) = {(q_i, As)}    (the transition that pushes A and returns to q_i)
        δ((A, ε), a, s) = ∅ for all a ∈ T.

(d) Finally:

        δ(q_a, ε, Z0) = {(q_a, ε)}
        δ(q_a, a, s) = ∅ in all other cases.
Lemma 6.1. For all CFGs G, the PDA PG′ is s.t. L(G) = N (PG′ ).
Sketch. The proof is done in two steps. First, one shows that all words w
accepted by the grammar G correspond to an accepting run of PG′ , by in-
duction on the length of a (reversed) rightmost derivation producing w
in G. Second, one shows that all accepting runs of PG′ on some word w can
be translated to a rightmost derivation in G (by induction on the length of
the run).
Reduce-Reduce conflicts: such conflicts occur when the top of the stack is
    a handle of two different rules, and the parser cannot decide which
    rule to reduce;

Shift-Reduce conflicts: such conflicts occur when the top of the stack con-
    stitutes a handle of some rule, but the parser cannot determine whether
    it should continue shifting or whether it should reduce now.
Let us first focus on Shift-Reduce conflicts. One (but not the only one)
of the difficulties we need to overcome to get rid of such conflicts is to
determine when shifting new symbols might still produce a handle. This
has been illustrated in Example 6.1. After the first Shift, the configuration
reached by the parser is as shown in Figure 6.1. In this configuration, the
non-deterministic parser can either Reduce rule Exp→Id , or Shift the +.
However, as we have already argued, shifting a + in this configuration will
not yield an accepting run, as there is no other handle than the one of Id
Sketch. The proof is by induction on the length of the run. Clearly, the
property holds in the initial configuration, since in this case γ = ε, hence
γR w = w, and w is an accepted word of the grammar (as the run is accept-
ing).
Next, if the property holds on some configuration 〈q_i, w_1, γ_1 Z0〉, then [. . . ]

    w_1 = w_2,    γ_1^R = βα,    γ_2^R = βA

[. . . ] of P′_G s.t. p = γ_j^R for some j s.t. q_j = q_i (i.e. P′_G is in the initial state at step
j and not in one of the intermediary states used for the Reduce). M
Observe that the set of viable prefixes of a grammar can be infinite, and
actually constitutes a language on V ∪ T. If we consider once again the
grammar of Example 6.1, we can see that all words of the form

    Exp + Exp + · · · + Exp    (n times)

are viable prefixes of the grammar. Indeed, for every such word, one can
build a derivation obtained by applying n times the first grammar rule on
the rightmost Exp of the sentential form, which eventually derives the word

    Id + Id + · · · + Id    (n times).
Example 6.4. Consider the grammar in Figure 6.2. The set of viable pre-
fixes of this grammar is:

    {ε, A, A$, a, aC, ac, aC D, aC d, ab}.

In particular, observe that acD is not a viable prefix, because the parser
has missed the handle c of C→c. Hence, we cannot reduce the rule
A→aC D. Instead, the parser had to reduce the c into a C before shifting
and reducing the d. M

Figure 6.2: An example grammar to demonstrate the notion of viable prefix.

    (1) S → A$
    (2) A → aC D
    (3)   → ab
    (4) C → c
    (5) D → d

Observe that this grammar is not LL(1), because of the two rules having A
as the left-hand side. Nevertheless, we will manage to parse it with a
bottom-up parser without any look-ahead.

Now, let us see how to build the CFSM (the characteristic finite state machine,
a finite automaton that accepts exactly the viable prefixes). If we consider
the first rule of the grammar S→A$, we see immediately that A and A$ are
viable prefixes. So, we could start building our CFSM by having a three-state
automaton of the form:

    ○ ──A── ○ ──$── ○
However, to track the progress of the automaton along the rule S→A$,
we will associate each state with so-called items. An item is simply a gram-
mar rule where we have inserted a • at some point in the right-hand part,
in order to mark the current progress of the CFSM (and, as we will see later,
of the bottom-up parser). For the rule S→A$, the possible items are
S→•A$, S→A•$ and S→A$•.

So, intuitively the item S→A•$ means: (i) that the automaton is trying
to recognise the handle A$ of the rule S→A$, (ii) that it has recognised an
A so far, and (iii) that it still expects to read a $ to complete the handle (and
hence, complete the viable prefix). Using this notation, our automaton
becomes:

    0: S→•A$   ──A──   1: S→A•$   ──$──   2: S→A$•
Observe that we have slightly altered our conventions for depicting au-
tomata in this figure (in order to reflect common practice of the literature).
First, we do not mark explicitly the states as ‘accepting’ since they are all
accepting anyway. Indeed, the prefix of a viable prefix is itself a viable pre-
fix. Second, our states are now divided into two parts: we will understand
why in a moment. Finally, we are now numbering the states, to be able to
refer to them easily (see the bold numbers on top left of states).
Note that, for the moment, our automaton contains only one item per
state. In general, states of the CFSM will be sets of items. Observe that, for
a given grammar, there are finitely many items, because there are finitely
many rules, which all have a finite right-hand side. So the CFSM is guar-
anteed to be finite.

The fact that states of the CFSM are sets of items should not be surpris-
ing: we are building a deterministic automaton, but there can be non-
determinism in the grammar. This is reminiscent of the subset construction
technique to determinise finite automata from Section 2.4.2.

Clearly, this automaton is not sufficient to accept all viable prefixes. In-
deed, S ⇒ A$ ⇒ aC D$, for instance, is the beginning of a possible derivation in
our grammar, so we should also accept the viable prefixes a, aC, etc.
How can we obtain this? We need to incorporate in our CFSM the fact
that A can be derived as aC D, but where do we add this information? If we
observe the initial state, it contains the item S→•A$, in which the • sym-
bol is immediately followed by the variable A. This is a sign that we need
to add more information to the initial state: when the automaton tries to
recognise a viable prefix generated using rule S→A$, it might need to
read an a, because the rule A→aC D might be applied next. Thus, we will
add to the initial state two new items: A→•aC D and A→•ab.

    State 0 now contains S→•A$ (its kernel) together with A→•aC D and
    A→•ab; its A- and $-successors are states 1 (S→A•$) and 2 (S→A$•),
    as before.
The operation that consists in looking, in a state, for all the items of the
form A→α1•Bα2, where B is a variable, and adding to the state all the
items of the form B→•α is called the closure operation. It needs to be
applied to all states¹⁰ of the CFSM (possibly several times, until no more
items can be added).

¹⁰ The closure operation does not add items to states 1 and 2, because the • is not
   followed by a variable in the corresponding items.

In our depiction, the items that result from the closure operation ap-
pear in the lower part of the states. The items from the top part of the
states form the kernel of the state. In our depiction we keep them sep-
arated because it makes it easier to identify states, since the closure of a
given kernel will always be the same.
This new information in the initial state allows us to complete our au-
tomaton. Since the state now contains items A→•aC D and A→•ab, the
automaton can progress in the recognition of a viable prefix by reading an a
from the initial state. This will yield a new state where the automaton
has progressed in both items, so the kernel of the new state will contain
both A→a•C D and A→a•b. This means that we don't know, at this point,
whether the handle that will be read will be aC D or ab, but both are still
possible so far. (Again, compare with the subset construction for determinising
finite automata.) In addition, the closure operation will add the item C→•c
to the state, because the • is directly followed by a C in A→a•C D:
    State 3 (kernel: A→a•C D, A→a•b; closure: C→•c) is the a-successor of state 0.

Applying the closure and successor computations repeatedly, until no new state
appears, we obtain the complete CFSM of the grammar (Figure 6.3):

    0: S→•A$, A→•aC D, A→•ab      with successors  A → 1, a → 3
    1: S→A•$                       with successor   $ → 2
    2: S→A$•
    3: A→a•C D, A→a•b, C→•c        with successors  C → 4, b → 6, c → 7
    4: A→aC•D, D→•d                with successors  D → 5, d → 8
    5: A→aC D•
    6: A→ab•
    7: C→c•
    8: D→d•
    Closure ← I
    repeat
        PrecClosure ← Closure
        foreach A→α1•Bα2 ∈ Closure s.t. B ∈ V do
            Closure ← Closure ∪ {B→•β | B→β ∈ P}
    until PrecClosure = Closure
    return Closure
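The closure and successor operations translate directly into Python. The sketch below is ours; it represents an item as a pair ((lhs, rhs), dot) where rhs is a tuple of symbols, and it assumes that grammar maps every variable to the list of its right-hand sides.

    def closure(items, grammar):
        """LR(0) closure: repeatedly add B -> .beta for every item whose dot is before a variable B."""
        result = set(items)
        changed = True
        while changed:
            changed = False
            for (lhs, rhs), dot in list(result):
                if dot < len(rhs) and rhs[dot] in grammar:       # the dot is in front of a variable B
                    for beta in grammar[rhs[dot]]:
                        item = ((rhs[dot], beta), 0)
                        if item not in result:
                            result.add(item)
                            changed = True
        return frozenset(result)

    def cfsm_succ(items, symbol):
        """Successor kernel: move the dot over `symbol` (the closure is applied afterwards)."""
        return {((lhs, rhs), dot + 1)
                for (lhs, rhs), dot in items
                if dot < len(rhs) and rhs[dot] == symbol}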
The transitions of the CFSM are obtained by moving the • over the symbol read by the
automaton. Formally:

    CFSMSucc(I, a) = {A → α1 a•α2 | A → α1•aα2 ∈ I}.
where:

1. the set of states Q is the set of all sets of G's items: Q = 2^{Items(G)};
Now that we have the CFSM in our toolbox, how can we exploit it to
make our bottom-up parsers deterministic? In the next sections, we will
introduce the LR(k) parsers, which are a family of deterministic bottom-
up parsers that use k characters of look-ahead. Similarly to the case of
LL(k), the 'LR' in LR(k) stands for Left-to-right scanning of the input,
producing a Rightmost derivation (in reverse).
We start with the LR(0) parsers, which use no look-ahead. This might
sound surprising, since it is hard to imagine that a top-down parser would
be able to parse anything meaningful without look-ahead. As a matter of
fact, a grammar is LL(0) iff it does not contain two rules with the same
left-hand side, which is a very strong restriction, since such grammars
would accept only one word at most! Nevertheless, LR parsers are usu-
ally regarded as more powerful than LL parsers, and there are non-trivial
grammars (such as the one in Example 6.4) that can be parsed by an LR(0)
parser, as shown by the next example, which will help us build our intu-
ition.
Example 6.7. Let us consider again the grammar G of Example 6.4, and
the word acd$ which is accepted by this grammar. How does CFSM(G), as
given in Figure 6.3, help us parse acd$ in a deterministic fashion?

Let us observe the run of the bottom-up parser on acd$. Initially, the
stack is empty¹¹, and the only thing the parser can do is to shift a (since
there is no rule which has ε as its right-hand side, no Reduce is possible).
To decide between Shift and Reduce, we run the CFSM on the stack content.
In the first configuration, the stack was empty, so the CFSM reaches state 0.
Observe that in this state, there are no items of the form A→α•, which
indicates that no handle has been read completely. This is coherent with
our decision of Shifting and not Reducing at this point. In the second
configuration, the CFSM reaches state 3. Again, the items contained in this
state indicate that a Shift must occur, and the parser reaches a configuration
where the stack contains (from the bottom to the top) ac, which moves the
CFSM to state 7. We depict this as follows, indicating the current CFSM state
next to the input:

    stack (top on the left)    remaining input    CFSM state    action
    (empty)                    acd$               0             Shift
    a                          cd$                3             Shift
    c a                        d$                 7

In state 7, the item C→c• indicates that the handle c of rule (4) is on the top
of the stack, so a Reduce of rule 4 occurs: c is popped, C is pushed, and the
CFSM now reaches state 4:

    c a                        d$                 7             Reduce 4
    C a                        d$                 4

¹¹ The stack contains only the Z0 character, that we will not depict in this example in
   order to keep it readable.
Following the same reasoning, we now Shift the d, and the CFSM reaches
state 8 from which rule 5 must be reduced (item D→d• ). After this reduce,
the CFSM is in state 5, where a Reduce of rule 2 occurs, which moves the
CFSM to state 1, as the stack now contains A only:
    stack (top on the left)    remaining input    CFSM state    action
    C a                        d$                 4             Shift
    d C a                      $                  8             Reduce 5
    D C a                      $                  5             Reduce 2
    A                          $                  1
Finally, the parser shifts the remaining character $, which moves the
CFSM to state 2. In this state, the parser reduces rule 1 which leaves only
the start symbol S on the top of the stack. This amounts to accepting the
input string, and is denoted by the Accept action. The full run is displayed
in Figure 6.4 (top of the figure). M
Figure 6.4: The full run of the parser on acd$. Top: with grammar symbols on the
stack (the corresponding CFSM state is indicated in the third column). Bottom:
with CFSM states on the stack.

    stack (top on the left)    remaining input    CFSM state    action
    (empty)                    acd$               0             Shift
    a                          cd$                3             Shift
    c a                        d$                 7             Reduce 4
    C a                        d$                 4             Shift
    d C a                      $                  8             Reduce 5
    D C a                      $                  5             Reduce 2
    A                          $                  1             Shift
    $ A                        ε                  2             Accept!

    stack of states (bottom on the left)    remaining input    action
    0                                       acd$               Shift
    0 3                                     cd$                Shift
    0 3 7                                   d$                 Reduce 4
    0 3 4                                   d$                 Shift
    0 3 4 8                                 $                  Reduce 5
    0 3 4 5                                 $                  Reduce 2
    0 1                                     $                  Shift
    0 1 2                                   ε                  Accept!

In this example, the CFSM state reached on the (mirror of the) stack content is
the only information the parser needs to decide what to do:
• Shift whenever the current state does not contain an item of the form
A→α• because we have not yet shifted a complete handle on the stack;
and
• Reduce rule A→α whenever the current state contains the item A→α•
(except if this is a state where an Accept is performed).
Now, since the CFSM state is the only meaningful information for de-
termining the behaviour of the parser, one could wonder what is the point
of actually pushing symbols on the stack. It turns out that we can, instead,
push the sequence of CFSM states. To better understand this, consider the
excerpt of the run above:

    stack (top on the left)    remaining input    CFSM state    action
    C a                        d$                 4             Shift
    d C a                      $                  8             Reduce 5
    D C a                      $                  5

In this example, the CFSM state is 4 when the stack content is C a, then
a Shift occurs, which moves the CFSM to state 8. To obtain this state 8,
we only need to know that we were previously in state 4 and that we have
read symbol d. Then, a Reduce occurs that pops one character (because
the right-hand side of rule 5 has length 1), pushes a D instead, and moves
the CFSM to state 5. It is here crucial to observe that, in order to determine
that the CFSM ends up in state 5, all we need to know is:

1. the current state of the CFSM before the reduced handle (here: d) was
   on the stack (here this state was 4); and

2. the variable that is pushed by the Reduce (here: D).

Hence, to determine that the CFSM reaches state 5 when the stack contains DC a,
one does not need to re-run the CFSM on the whole stack content from scratch:
instead, we can remember on the stack the state reached with stack content C a
(here, state 4), then look for the (unique) successor of 4 by letter D. This holds
because the CFSM is deterministic.

So, in practice, the run of the LR(0) parser on acd is the one displayed
in Figure 6.4 (bottom). Observe that, since the stack now contains the initial
state of the CFSM from the beginning, we do not even need the Z0 symbol
anymore. Or, in other words, we let Z0 = 0, and the Accept will pop 0 to empty
the stack. Let us now formalise these ideas.

The action table of the LR(0) parser associates one or several actions to
each state:
• The non-empty states of CFSM(G) (i.e. the elements of Q \ ∅) are in-
  dexed from 0 to m − 1, so that Q = {0, 1, . . . , m − 1, ∅}.

Then, the LR(0) action table is a table M with |Q| = m + 1 lines s.t., for all
0 ≤ i ≤ m, M[i] contains a set of actions. (As in the case of LL(1), our defini-
tion allows for several actions in a given cell, but the parser will be determin-
istic only when there is exactly one action per cell. When we depict the action
tables, we often denote the Shift, the Accept and the Error by S, A and an
empty cell, respectively.) The actions can be:

• either an integer 1 ≤ j ≤ n denoting a Reduce of rule number j;

• or Shift denoting that the parser must Shift the next character of the
  input to the stack;

• or Accept denoting that the string read so far is accepted and belongs to
  L(G);

• or Error denoting that the string must be rejected. The only state which
  is associated with this action (in the case of LR(0)) is ∅.
To build this action table, we look for items of the form A→α• in the
states of the CFSM and we associate a Reduce of the corresponding rule to
those states. When the state contains A→α1 •α2 , we associate a Shift to
the state. This is detailed in Algorithm 10. Observe that after the execution
of this algorithm, no cell will be empty, because each state of the CFSM
always contains at least one item that either will satisfy one of the two If’s
of the main loop, or will allow for at least one execution of the innermost
foreach loop.
Now that we have described the construction of the action table of the
LR(0) parser, let us formalise how it runs on a given input string. For the
sake of clarity, we will not describe this parser as a PDA, but rather as an
algorithm that can access a stack S from which it can push and pop. The
parser is given in Algorithm 11 and assumes that the action table contains
exactly one action per cell. The NextSymbol() function reads and returns
the next symbol on the input (i.e. in the word w).
It is easy to check that the algorithm follows exactly the intuitions that
we have described so far. Namely, in the main while loop, the parser checks
the content of the LR(0) table in the cell given by the top of the stack, to ob-
tain the action that must be executed. Then:
• if the action is a Shift, the next character is read and stored in variable c;
and
Algorithm 10: The algorithm to compute the LR(0) action table of a CFG G using
CFSM(G) (pseudo-code not reproduced here).
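Reusing the item representation of the earlier sketch, the LR(0) action table can be prototyped as follows (a sketch, ours; the encoding of actions is an assumption). rule_number maps each rule (lhs, rhs) to its number in the grammar, and start_symbol is the start symbol whose completed item triggers the Accept.

    def lr0_actions(states, rule_number, variables, start_symbol):
        """Sketch of Algorithm 10: one set of actions per CFSM state.
        Shift when some dot is in front of a terminal; Reduce i for a completed
        item of rule i; Accept for the completed item of the start rule."""
        table = {}
        for index, state in enumerate(states):
            actions = set()
            for (lhs, rhs), dot in state:
                if dot == len(rhs):                       # completed item: A -> alpha .
                    actions.add('Accept' if lhs == start_symbol
                                else ('Reduce', rule_number[(lhs, rhs)]))
                elif rhs[dot] not in variables:           # dot in front of a terminal
                    actions.add('Shift')
            table[index] = actions                        # more than one action = conflict
        return table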
Input: The LR(0) action table M of some CFG G and its associated
       CFSM(G) = 〈Q, 0, V ∪ T, δ, F〉; and an input word w.
Output: True iff w ∈ L(G)

    Let S be an empty stack ;
    Push(S, 0) ;                                   // Push of the initial CFSM state
    while M[Top(S)] ≠ Error and M[Top(S)] ≠ Accept do
        if M[Top(S)] = {Shift} then
            c ← NextSymbol() ;
        else if M[Top(S)] = {i} then
            Let A→α be rule number i in G ;
            Pop() |α| times from S ;
            c ← A ;
        /* Compute the next CFSM state and push it */
        nextState ← δ(Top(S), c) ;
        Push(S, nextState) ;
    return True iff M[Top(S)] = {Accept} ;

Algorithm 11: The LR(0) bottom-up parser. This algorithm assumes that
each cell of action table M contains exactly one action.
Finally, the parser computes the next state based on the current state (which
is now on top of stack) and the symbol c that the CFSM must read. This
new state is pushed onto the stack. After all these actions have been per-
formed, the state of the parser has been correctly updated, and it can con-
tinue its execution the same way.

In practice, only the transition relation of the CFSM is needed for the
parser. It can be given by a simple table that returns, for all states s and
terminal a, the successor state of s by a.
Example 6.9. Let us consider the simple grammar in Figure 6.5:

    (1) S′ → S$
    (2) S  → S aS b
    (3)    → ε

Figure 6.5: An LR(0) grammar whose set of viable prefixes is infinite.

This grammar has one recursive rule, and the rule S→ε, which is particularly
interesting here. Indeed, a bottom-up parser parsing this grammar will
necessarily need to Reduce it at some point (it is needed to terminate the
recursion). But reducing this rule has a perhaps unexpected effect: since
the right-hand side is ε, there is nothing to pop from the stack when per-
forming the reduction. So, reducing S→ε amounts to pushing S on the
top of the stack, and this can occur basically at any moment in the run of
the parser. This might give us the impression that we will end up with a
non-deterministic parser, yet our LR(0) parser will be deterministic, even
without look-ahead!
Figure 6.6: The CFSM of the previous grammar. Its language is infinite.

    0: S′→•S$, S→•S aS b, S→•        with successor   S → 1
    1: S′→S•$, S→S•aS b              with successors  $ → 2, a → 5
    2: S′→S$•
    3: S→S aS b•
    4: S→S aS•b, S→S•aS b            with successors  b → 3, a → 5
    5: S→S a•S b, S→•S aS b, S→•     with successor   S → 4
The (LR(0)) CFSM of the grammar is given in Figure 6.6. One can check
that its language is indeed infinite since it contains a loop between states 4
and 5. One can also remark the presence of the item S→• in states 0 and 5,
which triggers the reduce of rule number 3 (i.e., pushing an S on the top of
the stack, as discussed above).

Then, the LR(0) action table is given in Table 6.1. Observe that states 0
and 5, again, prompt a Reduce of S→ε, and this Reduce only. There is
no Shift in these states since there is no outgoing transition labeled by a
terminal.

    Table 6.1: The LR(0) action table of the previous grammar.
    State    Action
    0        3
    1        S
    2        A
    3        2
    4        S
    5        3

Now, let us build the run of the LR(0) parser on aabb$. As always, we
start with the initial state 0 on the stack. In state 0, a Reduce of S→ε must
be performed: nothing is popped, an S is (conceptually) pushed, and the CFSM
moves to state 1, the successor of 0 by S. Then, a Shift of a leads to state 5:

    stack of states (bottom on the left)    remaining input    action
    0                                       aabb$              Reduce 3
    0 1                                     aabb$              Shift
    0 1 5                                   abb$
After that, the run continues as usual. The full run is given in the following
table (in order to save space), where the stack content is displayed from
the bottom to the top:
aabb$ 0 Reduce 3
aabb$ 01 Shift
abb$ 015 Reduce 3
abb$ 0154 Shift
bb$ 01545 Reduce 3
bb$ 015454 Shift
b$ 0154543 Reduce 2
b$ 0154 Shift
$ 01543 Reduce 2
$ 01 Shift
ε 012 Accept
One thing which is remarkable about LR(0) parsers is their ability to parse
grammars that are not entirely trivial, without using any look-ahead, as the
previous examples have clearly shown. However, as it often happens with
life and computing, there is no free lunch, and there are grammars of prac-
tical interest that we cannot parse deterministically with LR(0) parsers! Let
us have a look at such an example. . .
Example 6.10. Consider the grammar in Figure 6.7:

    (1) S    → Exp$
    (2) Exp  → Exp + Prod
    (3)      → Prod
    (4) Prod → Prod ∗ Atom
    (5)      → Atom
    (6) Atom → Id
    (7)      → (Exp)

Figure 6.7: A simple grammar for generating arithmetic expressions. This grammar
is not LR(0). It is also not LL(k) for any k since it contains left-recursive rules.

It is a simplification of the grammar to generate arithmetic expressions that we
have considered several times before (the point of the simplification here is to
keep the example short, but everything we are about to discuss extends to the
grammar with all operators). Notice that this grammar implements the priority
of operators as discussed at the end of Chapter 4.

We claim that this grammar is not LR(0). Let us build its CFSM to check
this. It is given in Figure 6.8. We can immediately spot conflicts in state 1
and in state 12. In state 1 the parser cannot decide between performing a
shift (which will necessarily be of symbol ∗), or reducing Exp → Prod. Simi-
larly, in state 12, there is a conflict between a shift (of ∗ again) and a reduce
of Exp → Exp + Prod. M
Figure 6.8: The (LR(0)) CFSM of the grammar of Figure 6.7 (only the kernels are
listed here; each state is completed by the closure operation):

    0:  S→•Exp$                           with successors  Exp → 4, Prod → 1, Atom → 5, Id → 7, ( → 6
    1:  Exp→Prod•, Prod→Prod•∗Atom        with successor   ∗ → 2
    2:  Prod→Prod∗•Atom                   with successors  Atom → 3, Id → 7, ( → 6
    3:  Prod→Prod ∗ Atom•
    4:  S→Exp•$, Exp→Exp•+Prod            with successors  $ → 9, + → 8
    5:  Prod→Atom•
    6:  Atom→(•Exp)                       with successors  Exp → 10, Prod → 1, Atom → 5, Id → 7, ( → 6
    7:  Atom→Id•
    8:  Exp→Exp+•Prod                     with successors  Prod → 12, Atom → 5, Id → 7, ( → 6
    9:  S→Exp$•
    10: Atom→(Exp•), Exp→Exp•+Prod        with successors  ) → 11, + → 8
    11: Atom→(Exp)•
    12: Exp→Exp + Prod•, Prod→Prod•∗Atom  with successor   ∗ → 2
Algorithm 12: The algorithm to compute the SLR(1) action table of a CFG G using
CFSM(G) (pseudo-code not reproduced here).
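The only difference with the LR(0) construction is that a Reduce of A→α is now recorded only for the look-aheads in Follow(A), and the Accept only for the look-ahead $. A Python sketch (ours), reusing the conventions of the previous sketches and assuming a follow function is available:

    def slr1_actions(states, rule_number, terminals, start_symbol, follow):
        """Sketch of Algorithm 12: the table is now indexed by (state, look-ahead)."""
        table = {(i, a): set() for i in range(len(states)) for a in terminals}
        for index, state in enumerate(states):
            for (lhs, rhs), dot in state:
                if dot == len(rhs):                            # completed item A -> alpha .
                    if lhs == start_symbol:
                        table[(index, '$')].add('Accept')
                    else:
                        for a in follow(lhs):                  # reduce only on Follow(lhs)
                            table[(index, a)].add(('Reduce', rule_number[(lhs, rhs)]))
                elif rhs[dot] in terminals:                    # dot in front of a terminal a
                    table[(index, rhs[dot])].add('Shift')      # shift only on look-ahead a
        return table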
Input: The SLR(1) or LR(1) action table M of some CFG G and its
       associated CFSM(G) = 〈Q, 0, V ∪ T, δ, F〉; and an input word w.
Output: True iff w ∈ L(G)

    Let S be an empty stack ;
    /* Push of the initial CFSM state */
    Push(S, 0) ;
    /* Initialisation of the look-ahead */
    ℓ ← first symbol on the input ;
    while M[Top(S), ℓ] ≠ Error and M[Top(S), ℓ] ≠ Accept do
        if M[Top(S), ℓ] = {Shift} then
            c ← NextSymbol() ;
        else if M[Top(S), ℓ] = {i} then
            Let A→α be rule number i in G ;
            Pop() |α| times from S ;
            c ← A ;
        /* Compute the next CFSM state and push it */
        nextState ← δ(Top(S), c) ;
        Push(S, nextState) ;
        /* Update the look-ahead */
        ℓ ← first symbol on the remaining input ;
    return True iff M[Top(S), ℓ] = {Accept} ;

Algorithm 13: The SLR(1) and LR(1) bottom-up parser. This algorithm
assumes that each cell of action table M contains exactly one action.
Example 6.11. Let us run the SLR(1) parser of our running example (the grammar
of Figure 6.7, with the CFSM of Figure 6.8) on the input word Id + Id ∗ Id$. The
initial configuration contains only state 0 on the stack:

    stack: 0      input: Id + Id ∗ Id$

When the CFSM is in state 0 and the look-ahead is Id, the action table says
to perform a Shift, which leads to state 7. In this new state, and with a new
look-ahead equal to +, the parser must Reduce rule 6, which leads to state
5, and so on. . . The full run is then (stacks of states are written from the bottom
to the top):

    stack            remaining input    action
    0                Id + Id ∗ Id$      Shift
    0 7              +Id ∗ Id$          Reduce 6
    0 5              +Id ∗ Id$          Reduce 5
    0 1              +Id ∗ Id$          Reduce 3
    0 4              +Id ∗ Id$          Shift
    0 4 8            Id ∗ Id$           Shift
    0 4 8 7          ∗Id$               Reduce 6
    0 4 8 5          ∗Id$               Reduce 5
    0 4 8 12         ∗Id$               Shift
    0 4 8 12 2       Id$                Shift
    0 4 8 12 2 7     $                  Reduce 6
    0 4 8 12 2 3     $                  Reduce 4
    0 4 8 12         $                  Reduce 2 (Exp → Exp + Prod)
    0 4              $                  Shift
    0 4 9            ε                  Accept!
• Follow(S) = {$};
Now, let us try to build the CFSM for this grammar. An excerpt is given
in Figure 6.10. On this small excerpt, we can clearly see that there is a
shift/reduce conflict in state 1, if we build an LR(0) parser. Unfortunately,
this conflict persists if we build an SLR(1) parser. Clearly, when the look-
ahead is =, a Shift can be performed. But since = ∈ Follow(R), we would
also perform a Reduce (of R → L) in this case. M
Figure 6.10: An excerpt of the (LR(0)) CFSM for the previous grammar:

    0: S′→•S$, S→•L = R, S→•R, R→•L, L→•∗R, L→•Id     with successors  S → 3, L → 1, R → 2
    1: S→L•= R, R→L•
    2: S→R•
    3: S′→S•$                                          with successor   $ → 4
    4: S′→S$•
ever, from this state, it is not possible to Shift the = and to continue the
run of the parser.
With this discussion, we conclude that, in state 1, an = in the look-
ahead must trigger a Shift; while we must do a Reduce only when $ is
on the look-ahead. By doing so, we actually refined the notion of Follow.
While Follow(R) = {=, $}, this can be regarded as a global Follow, because
it is computed over the whole grammar. With the finer analysis we have
sketched here, we have obtained some sort of local Follow: in state 1, this
local Follow of R contains only $, which allows us to lift the conflict. The
point of this section is to formalise those notions.
Definition 6.13 (LR(k) CFSM item). Given a grammar G = 〈V, T, P, S〉, an
item of G is a rule of the form A → α1 • α2, u, where A → α1α2 ∈ P is a rule
of G and u ∈ T^{≤k}. We denote by LR(k)-Items(G) the set of LR(k) items
of G. M

(Remember that the notation T^{≤k} means ⋃_{i=0}^{k} T^i, i.e., the set of all
strings of length at most k on T.)
Example 6.14. Continuing the example of Figure 6.9 and Figure 6.10, the
second item in state 1 should be R → L•, $. M
The initial state of the LR(k) CFSM is now built from the set of items

    {S → •α, ε | S → α ∈ P}

where P is the set of rules of the grammar. That is, we take the closure of all
the items of the form S → •α extended with the local Follow ε (and where
S is, as usual, the start symbol).
Now, for the closure operation, assume we have an item of the form
A → α1 • Bα2, u. We will perform the closure by creating items based on
all rules of the form B → β. The local Follows will be computed on the ba-
sis of what we expect on the input after B. Since we start from the item
A → α1 • Bα2, u, we expect on the input the k first characters of α2, which
we can complete with u if need be. That is, we will have all the items of
the form B → •β, u′, where u′ ∈ First_k(α2 · u). Algorithm 14 gives the for-
malised algorithm for computing this closure operation:

    Closure ← I
    repeat
        PrecClosure ← Closure
        foreach A→α1•Bα2, u ∈ Closure s.t. B ∈ V do
            Closure ← Closure ∪ {B → •β, u′ | B → β ∈ P and u′ ∈ First_k(α2 · u)}
    until PrecClosure = Closure
    return Closure
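Algorithm 14 can be prototyped directly. The sketch below is ours and restricts itself to k = 1, where First1(α2 · u) is First(α2) without ε, completed with u when α2 can derive ε. Items are triples ((lhs, rhs), dot, u); the helper first_of is assumed to return the First set of a sequence of symbols, with '' standing for ε.

    def lr1_closure(items, grammar, first_of):
        """Closure of a set of LR(1) items, following Algorithm 14 for k = 1."""
        result = set(items)
        changed = True
        while changed:
            changed = False
            for (lhs, rhs), dot, u in list(result):
                if dot < len(rhs) and rhs[dot] in grammar:      # dot in front of a variable B
                    alpha2 = rhs[dot + 1:]
                    fs = first_of(alpha2)                       # First(alpha2); '' means epsilon
                    locals_ = (fs - {''}) | ({u} if '' in fs else set())   # First1(alpha2 . u)
                    for beta in grammar[rhs[dot]]:
                        for v in locals_:
                            item = ((rhs[dot], beta), 0, v)
                            if item not in result:
                                result.add(item)
                                changed = True
        return frozenset(result)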
Finally, the transitions of the CFSM are computed by simply propagat-
ing the local follows. That is, we redefine the successor function as:
    CFSMSucc(I, a) = {A → α1 a • α2, u | A → α1 • aα2, u ∈ I}.
Example 6.15. Let us build on the example of Figure 6.9 and Figure 6.10.
Let us first compute the initial state of the LR(1) CFSM for the grammar of
Figure 6.9. The kernel of the state will be the item S ′ → •S $, ε. Then, we
compute the closure of this item step by step:
    First($ · ε) = First($) = {$}.
• Finally, from this last item, we obtain two new items L → • ∗ R, $ and
L → •Id, $. Observe that the local Follows we have obtained in these
items are different from the one computed before. In order to reduce
the size of the representation of the CFSM, we will often merge items
that differ only by the local Follows into sets of possible local Follows:
we will thus write L → • ∗ R, {=, $} and L → •Id, {=, $}. For the sake of
clarity, we will thus systematically represent the possible local Follows
as sets even when they are singletons.
This state is displayed in Figure 6.11, along with its L-successor, state 1.
This shows that the conflict on state 1 has been lifted, since the local Fol-
low for item R → L• is $. M
Example 6.16. For our second example, we will build the LR(1) CFSM for
the grammar in Figure 6.7. We already know that this grammar is SLR(1),
so we will be able to compare the SLR(1) and LR(1) CFSM for this grammar.
The CFSM is given in Figure 6.12 and Figure 6.13. Observe that the states
in these two figures are very similar, but that the local follows in the items
are different. As an example, compare state 6 in Figure 6.12 and state 16
in Figure 6.13, where the difference lies in the first item only. This clearly
shows that these local follows allow for a finer analysis, but this comes at
the cost of the number of states. M
LR(k) Action table Now that we can compute the LR(k) CFSM, let us see
how we can exploit it in a parser. We will adapt the techniques we have
introduced for SLR(1). There are mainly two cases to consider:
   local Follow that is associated to the item, in order to obtain the look-
   ahead.
2. When a state s contains an item of the form A→α•, u, we need to
   perform a Reduce of the rule A→α. This action will occur in state s
   and when the look-ahead is u. (Remember that in 'A→α•, u', the element
   u is a word of at most k characters. In the CFSM, we adopt a compact
   notation like A→α•, {w1, w2, . . . , wn} to represent the n items A→α•, w1,
   . . . , A→α•, wn.)

   For example, if we rely on the CFSM of Figure 6.12, in state 9, we per-
   form a Reduce when the look-ahead is either $, + or ∗. Note that we do
   not reduce when the look-ahead is ), although ) ∈ Follow(Prod). The
   Reduce will occur on this look-ahead in state 14 (but, in this state, it will
   not occur on the look-ahead $). It is worth comparing this rule for
   performing a Reduce in the case of LR(1) to what we have done for
   SLR(1) (see Algorithm 12): in the case of SLR(1), the look-ahead we use is
   Follow(A); here, we use u, that is, a local Follow instead of a global one.

We can now exploit those ideas to adapt Algorithm 12 to the LR(k) case.
This new algorithm is given in Algorithm 15. Apart from the use of the
local Follows that we have already discussed, the most notable addition to
Algorithm 15 is the use of k characters of look-ahead. This means that the
action table is now indexed by a state q and a word u of length at most k.
(We are indexing by words of length at most k and not exactly k because the
actual look-ahead available on the input might be shorter than k, for example
when we reach the end of the input.)

Example 6.17. We apply Algorithm 15 to the grammar in Figure 6.7. Its
LR(1) CFSM is given in Figure 6.12 and Figure 6.13. Its LR(1) action table is
in Table 6.3. M
Figure 6.12 and Figure 6.13: The LR(1) CFSM for the grammar for arithmetic
expressions of Figure 6.7 (states 0–22; not reproduced in full here). It carries the
same LR(0) items as the CFSM of Figure 6.8, decorated with local Follows; for
instance, states 6 and 16 differ only in the local Follow of their first item. States 6
and 11 are displayed in Figure 6.12, the remaining states in Figure 6.13.
Algorithm 15: The algorithm to compute the LR(k) action table of a CFG G using
CFSM(G) (pseudo-code not reproduced here).
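With LR(1) items at hand, the action-table construction mirrors the SLR(1) sketch given earlier, except that the Reduce of a completed item A→α•, u is recorded only for the look-ahead u (the local Follow). A sketch (ours) for k = 1, using '' (the empty word) as the look-ahead seen at the very end of the input:

    def lr1_actions(states, rule_number, terminals, start_symbol):
        """Sketch of Algorithm 15 for k = 1: Reduce on the local Follow u of each completed item."""
        table = {}
        def add(state, look, action):
            table.setdefault((state, look), set()).add(action)
        for index, state in enumerate(states):
            for (lhs, rhs), dot, u in state:
                if dot == len(rhs):                          # completed item A -> alpha . , u
                    if lhs == start_symbol:
                        add(index, u, 'Accept')              # u is '' for the start rule
                    else:
                        add(index, u, ('Reduce', rule_number[(lhs, rhs)]))
                elif rhs[dot] in terminals:                  # dot in front of a terminal
                    add(index, rhs[dot], 'Shift')
        return table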
Now that we have managed to build the action table of an LR(k) parser,
we can discuss how it runs on a given input word. Actually, the run of
an LR(1) parser will not be different from that of an SLR(1) parser: the
origin of the action table and of the CFSM does not matter for Algorithm 13
to run. All this algorithm needs is an action table indexed by states (for
the rows) and look-aheads (for the columns), and a way to compute the
successor states of the CFSM (which can be given as a table, as well, and
is implicitly represented by the CFSM itself ). While we have restricted our
presentation of Algorithm 13 to the case k = 1 (i.e., an LR(1) parser), it is
straightforward to adapt it to the case where k > 1: it suffices to replace the
look-ahead ℓ with a word of size (at most) k.
We can thus close our discussion on the construction of LR(k) parsers
by giving an example of run.
Example 6.18. We repeat Example 6.11, this time using the action table
of the LR(k) parser, and its associated LR(k) CFSM (see Example 6.15 and
Example 6.17). We consider the input word Id + Id ∗ Id$, which is in the
language of the grammar (see Figure 6.7). As always, we start our run in
the following configuration:
0
Id + Id ∗ Id$
In this configuration, the look-ahead is Id, the current state if 0 and the
action table tells us to perform a Shift, which leads to state 5 in the CFSM,
etc. The full run is then:
S R6 R5 R3 S S 5 R6 4 R5 10 S
→
− −−→ −−→ −−→ →
− 3 →
− 3 −−→ 3 −−→ 3 →
−
5 4 7 1 1 1 1 1
0 0 0 0 0 0 0 0 0
Id + Id ∗ Id$ +Id ∗ Id$ +Id ∗ Id$ +Id ∗ Id$ +Id ∗ Id$ Id ∗ Id$ ∗Id$ ∗Id$ ∗Id$
5 9
8 8 8
10 S 10 R6 10 R4 10 R1 S Accept!
3 →
− 3 −−→ 3 −−→ 3 −−→ →
− 2 −−−−−→
1 1 1 1 1 1
0 0 0 0 0 0
Id$ $ $ $ $ ε ε
As can be seen, this run is very similar to the run of the SLR(1) parser. This should not be a surprise, since both
parsers build the same rightmost derivation.
M
S ⇒∗ γAx ⇒ γαx ⇒ · · ·
1. the current CFSM state is the one which is reached when the CFSM
reads the mirror image of the stack content, i.e. γα. Since the CFSM is
deterministic, we can identify the state with γα (i.e., there will be one
and only one state reached when reading γα, so we can abuse notations
slightly and say that γα ‘is the state’); and
2. the look-ahead contains the k first symbols of what remains on the in-
put (if they exist), i.e. Firstk (x).
Now, let us build another derivation that will create a situation where
the parser is ‘confused’, i.e. a derivation where the parser cannot tell the
difference between the reduce of A → α described above, and another re-
duce. Let us thus assume that we have now two rightmost derivations (we
recall the previous one for the sake of comparison):
S ⇒∗ γAx ⇒ γαx ⇒ · · ·
S ⇒∗ δB y ⇒ δβy ⇒ · · ·
In the latter derivation, the parser must reduce B → β. This occurs when
(δβ)R is on the stack and y remains on the input. So the situation where
the parser gets ‘confused’ between the Reduce of A → α and B → β is when
the CFSM state (or equivalently the stack content) is the same in both
cases, and when the look-aheads are the same. This occurs when:
δβy = γαx ′
    S ⇒∗ γAx ⇒ γαx ⇒∗ w1 x
    S ⇒∗ δB y ⇒ δβy ⇒∗ w2 y

s.t.: (i) δβy = γαx′; and (ii) First_k(x) = First_k(x′), we have: γAx′ = δB y. M
Testing that a given grammar is LR(k) (for a given k) can be challenging
with this definition. The best way is still to build the LR(k) parser and
check that it is deterministic. (While testing whether a grammar is LR(k)
for a given k, for example 'is this grammar LR(5)?', is clearly decidable,
KNUTH shows in his paper cited above that testing whether there exists
some k s.t. a given grammar is LR(k) is an undecidable problem.)

Let us illustrate Definition 6.19 with an example:
Example 6.20. Consider the simple grammar in Figure 6.14:

    (1) S → a
    (2)   → ab

Figure 6.14: A simple grammar which is not LR(0).

It is clearly not LR(0), since there is a Shift/Reduce conflict in state 1 of its CFSM
(see Figure 6.15):

    0: S→•a, S→•ab      with successor  a → 1
    1: S→a•, S→a•b      with successor  b → 2
    2: S→ab•

Figure 6.15: The CFSM of the previous grammar.

Let us show why it does not satisfy Definition 6.19 for k = 0. To do so, we
must find a pair of derivations that do not satisfy the conditions of the def-
inition. In our case, this will be pretty straightforward, since the grammar
has two derivations only:

    S ⇒ a
    S ⇒ ab

Matching these derivations with the notations of Definition 6.19, we ob-
tain:

    A = B = S,    α = a,    β = ab,    γ = δ = ε,    x = y = ε.

(Intuitively, γAx′ is the situation where we have the stack content (γA)^R from
the first derivation but with the input x′, which is a suffix of the remaining input
in the second derivation, and the parser should be able to tell the difference
between these two situations and take the right decision. In our example, when
the parser decides to reduce S → a in the first derivation, it should still do this
even if we add x′ = b to the input, [. . . ])

Thus, δβy = ab. Since γα = a, we have δβy = γαx′ with x′ = b. So,
all the conditions of the definition are satisfied. In particular, note that
    γAx′ = Ab
    δB y = A
    First1(x′) = {b}. M
As we have observed in Figure 6.12 and Figure 6.13, states 6 and 16 only differ
on the local follows. So, a natural idea consists in 'merging' those states, i.e.,
computing the union of the sets of items of those states, and building an
action table and a successor table based on this new automaton. Of course,
by doing so, we are losing some precision: we cannot expect that all LR(k)
grammars will also be LALR(k). But if the resulting parser is still deterministic,
we have obtained a CFSM which has the same size as the LR(0) CFSM, while
using look-aheads for lifting ambiguities.
More formally, we will first explain how to obtain an LALR(k) parser from
the LR(k) CFSM of a given grammar. We start with the definition of the
heart of a CFSM state.
Then, we say that two states of the LR(k) automaton are equivalent if
they have the same heart:
Example 6.23. Let us consider again the LR(1) CFSM of Figure 6.12 and
Figure 6.13. Here is the list of equivalence classes of its states:
[0] = {0}
[1] = {1}
[2] = {2}
[3] = [13] = {3, 13}
[4] = [14] = {4, 14}
[5] = [15] = {5, 15}
[6] = [16] = {6, 16}
[7] = [17] = {7, 17}
[8] = [18] = {8, 18}
[9] = [19] = {9, 19}
[10] = [20] = {10, 20}
[11] = [21] = {11, 21}
[12] = [22] = {12, 22}
Then, based on this equivalence relation, and from the LR(k) CFSM, we
can define the LALR(k) CFSM. To this end, we need to define the new set
of states and the new transition relation of this automaton. (Note that the
set of states of the LALR(k) CFSM is not the set of equivalence classes.
Otherwise, the states of the new automaton would be sets of states of the
LR(k) CFSM, i.e. sets of sets of items; while we want the states of the new
automaton to be sets of items to obtain a CFSM.)

1. For each equivalence class of ≡, we will have one state in the LALR(k)
   CFSM. More precisely, for each equivalence class [q], there will be a
   new state which contains all the items that are in all the states in [q]. In
   other words, the new state (which we denote S_[q]) is:

       S_[q] = ⋃_{q′ ∈ [q]} q′.

2. For the transitions, observe first that whenever an item A → α • β, u
   belongs to some state q of the class, then there are items A → α • β, v
   in all other states q′′ of the equivalence class (i.e., with the same LR(0)
   item, but a different context). This means that the transitions from all
   states q′′ of the equivalence class will be similar in the following sense.
   Whenever there is, in the LR(k) CFSM, a transition from q1, labeled by η
   and going to q2, then there is also a transition labeled by η from all q′1 ∈
   [q1], and this transition leads to a state q′2 which is equivalent to q2 (i.e.
   q′2 ∈ [q2]). It is thus safe to have an η-labeled transition in the LALR(k)
   CFSM from S_[q1] to S_[q2]. Remark that this construction preserves the
   determinism of the automaton.
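The merge itself is a one-liner once the heart of a state is defined. A Python sketch (ours), where LR(1) states are frozensets of items ((lhs, rhs), dot, u) as in the earlier sketches:

    from collections import defaultdict

    def heart(state):
        """The heart of an LR(1) state: its LR(0) items, forgetting the local Follows."""
        return frozenset(((lhs, rhs), dot) for (lhs, rhs), dot, _ in state)

    def lalr_states(lr1_states):
        """Group the LR(1) states by heart and take the union of each group (the LALR states)."""
        groups = defaultdict(set)
        for state in lr1_states:
            groups[heart(state)] |= set(state)
        return [frozenset(items) for items in groups.values()]

The transitions of the LALR(k) CFSM then simply follow the transitions of the LR(k) CFSM, redirected to the merged states, as explained above.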
Example 6.24. Let us illustrate this on our running example. Consider for
example the equivalence class {8, 18}. Observe that the sets of correspond-
ing LR(0) items of these two states are the same, which is not surprising
since this is the definition of the heart:

    {Prod → Prod ∗ •Atom, Atom → •Id, Atom → •(Exp)}.
Based on this discussion, we are now ready to formally define the LALR(k)
CFSM from the LR(k) CFSM:
• q 0′ = S [q0 ] ; and
Once the LALR(k) CFSM is built, we build the action table as in the case
of an LR(k) parser, and the runs of the parser follow this action table simi-
larly.
Figure 6.16: The LALR(1) CFSM for the grammar of Figure 6.7, with states
S_[0], S_[1], . . . , S_[12] (not reproduced in full here; the presentation follows that
of Figure 6.12). For instance, S_[12] contains Exp → Exp + Prod•, {$, +, )} and
Prod → Prod•∗Atom, {$, +, ∗, )}.
Example 6.26. Let us close this discussion by building the LALR(k) parser
for our running example. We start by building the LALR(k) CFSM. It is
given in Figure 6.16. We have deliberately chosen to keep the same presentation
as in Figure 6.12, to allow the reader to compare them.
Then, the action table computed from this CFSM is given in Table 6.4.

We close this section on LALR(k) parsers with two remarks. First of all, since
our goal was to obtain a parser which is more precise than LR(0) but more
compact than LR(k), we should ask ourselves whether this is indeed the case.
As a matter of fact, it is easy to see that the size (in terms of number of
states) of the LALR(k) CFSM is the same as the size of the LR(0) CFSM.
Proposition 6.4. For all CFG, the LALR(k) CFSM has the same number of
states as the LR(0) CFSM.
Proof. This property stems directly from the definition of LALR(k) states.
Let us first consider the states of the LR(k) CFSM, from which the states of
the LALR(k) CFSM are extracted. We can observe that the set of hearts of
the LR(k) CFSM is exactly the set of LR(0) states. This can be seen by com-
paring the respective effects of the closure operation in the cases of LR(0)
(see Algorithm 9) and LR(k) (Algorithm 14). More precisely, assume that,
when applied to an item A→α•β, the LR(0) closure produces A₁→α₁•β₁;
. . . ; Aₙ→αₙ•βₙ. Then, when applied to an item A→α•β, u with the
same heart, the LR(k) closure will produce a set of items A₁→α₁•β₁, u₁;
. . . ; Aₙ→αₙ•βₙ, uₙ, whose hearts are the same as well. Thus, the set of hearts of the
LR(k) CFSM is exactly the set of LR(0) states. Moreover, the set of LALR(k)
states has, by construction, the same size as the set of hearts of the LR(k) CFSM. Hence
our proposition.
Moreover, since the SLR(1) parser is essentially the LR(0) parser with
some look-ahead, we also have that:
Corollary 6.5. For all CFG, the LALR(k) parser has the same number of
states as the SLR(1) parser.
So, from these two results, it seems that we have managed to obtain
a pretty good reduction in the number of states that the LALR(k) parser
must use. Now, the question of the usefulness of LALR(k) remains open,
as we haven’t shown yet that there are some grammars that can be parsed
with LALR(k) but not LR(0) or SLR(1). We will address this in more detail
in Section 6.6. Observe, however, that, in the example we have discussed
above, the obtained LALR(1) parser happens to be the same as the SLR(1)
parser, but this is not always the case.
Now that we have such a wide range of (bottom-up) parsers at our disposal,
one might wonder which one to choose. Clearly, increasing the
size of the look-ahead increases the number of grammars that a parser can
recognise deterministically; but we also know that such an increase has a
cost in the complexity of the parser. So, it seems that choosing a parser
will be a trade-off between its expressive power and its complexity.
As we did in the case of LL(k) grammars (see Section 5.4.2 and Sec-
tion 5.4.3), we will now establish bottom-up hierarchies of grammars and
languages. We will first compare the families of grammars that can be
parsed deterministically with the different bottom-up techniques we have
seen, i.e., establish a syntactic hierarchy. However, the fact that a gram-
mar G is, for instance, LR(k + 1) but not LR(k) does not necessarily mean
that L(G) cannot be recognised by an LR(k) parser, because there might be
an LR(k) grammar G ′ with the same language. . . Thus, in Section 6.8, we
will be comparing the families of languages that can be recognised by the
different parsers that we have at our disposal, i.e. to establish a seman-
tic hierarchy. Finally, in Section 6.9 we will compare the top-down and
bottom-up hierarchies.
We will start this discussion with the syntactic comparison. We have already
defined LL(k), strong LL(k) and LR(k) grammars (see Definition 5.10,
Definition 5.12 and Definition 6.19). We can further define SLR(1) and
LALR(k) grammars:

Definition 6.27 (SLR(1) and LALR(k) grammars). A CFG is SLR(1) iff the
SLR(1) parser generated for this grammar is deterministic. A CFG is LALR(k)
iff the LALR(k) parser generated for this grammar is deterministic. M

(One could discuss whether the LR(k) and LL(k) classes are truly syntactic.
Clearly, the definition of strong LL(k) is purely syntactic since it concerns the rules
of the grammar. The definitions of LL(k) and LR(k) are more semantic since they
constrain the derivations that the grammar generates. However, those definitions do
not constrain the ultimate semantic object that a grammar generates, i.e. its language.
So, it is fair to say that comparing classes of grammars (which are syntactic objects)
is a syntactic comparison, while comparing the languages that these grammars
generate is semantic. We will not venture further into this (pseudo-)philosophical
discussion. . . Let us just quote this beautiful piece of coffee-machine wisdom that
we have overheard during the break of a scientific conference: ‘You know, one man’s
semantic is another man’s syntax’. . . )

Now, let us compare these different families of grammars. One can establish
the following inclusions:

Theorem 6.6. The following (strict) inclusions hold:

LR(0) ⊊ SLR(1) ⊊ LALR(1) ⊊ LR(1) ⊊ LR(2) ⊊ · · · ⊊ LR(k) ⊊ · · ·

Proof. We look at all these inclusions one after the other:
• LR(0) ⊊ SLR(1). We know that all LR(0) grammars are SLR(1) grammars,
since SLR(1) is an extension of LR(0) with a look-ahead. The strict inclusion
stems from Example 6.10, which shows that the grammar in
Figure 6.7 is SLR(1) but not LR(0).
• SLR(1) ⊊ LALR(1). The fact that SLR(1) ⊆ LALR(1) stems from the definitions
of SLR(1) and LALR(1). Both have states with the same hearts,
which are the LR(0) states by construction, but the contexts are more
precise in the case of LALR(1) since they are unions of local Follows.
Hence, SLR(1) does not offer more opportunities to lift conflicts than
LALR(1), so all SLR(1) grammars are also LALR(1).

To separate LALR(1) and SLR(1) strictly, we consider the grammar in
Figure 6.17 (adapted from exercise 20, Chapter 3.5 of SEIDL, WILHELM
and HACK’s book^20):

(1) S → A a
(2)   → b A c
(3)   → d c
(4) A → d

Figure 6.17: An LALR(1) grammar that is not SLR(1) and not LL(1) but LL(2).

(^20 Reinhard Wilhelm, Helmut Seidl, and Sebastian Hack. Compiler Design, Syntactic
and Semantic Analysis. Springer-Verlag, 2013. ISBN 978-3-642-17539-8. DOI:
10.1007/978-3-642-17540-4.)

It is easy to check that it is LALR(1) but not SLR(1). Indeed, the state
that is reached after reading d in the LR(0) CFSM is:

S → d•c
A → d•

and SLR(1) is not able to lift this Shift/Reduce conflict since c ∈ Follow(A).
However, in the LALR(1) CFSM, this state becomes:

S → d•c , {ε}
A → d• , {a}
which lifts the conflict (and no other conflicts exist in the LALR(1) au-
tomaton).
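As a concrete check of the SLR(1) conflict above, the following sketch computes the Follow sets of the grammar of Figure 6.17 with a naive fixed-point iteration (Python; the grammar encoding and helper names are ours, and the computation assumes an ε-free grammar, which is the case here):

grammar = {                      # Figure 6.17, augmented with S' -> S $
    "S'": [["S", "$"]],
    "S":  [["A", "a"], ["b", "A", "c"], ["d", "c"]],
    "A":  [["d"]],
}

def first(sym, seen=frozenset()):
    """First set of a symbol, for an epsilon-free grammar."""
    if sym not in grammar:                      # terminal
        return {sym}
    return set().union(*(first(rhs[0], seen | {sym})
                         for rhs in grammar[sym] if rhs[0] not in seen))

def follow_sets():
    follow = {nt: set() for nt in grammar}
    changed = True
    while changed:
        changed = False
        for lhs, rhss in grammar.items():
            for rhs in rhss:
                for i, sym in enumerate(rhs):
                    if sym not in grammar:
                        continue
                    new = first(rhs[i + 1]) if i + 1 < len(rhs) else follow[lhs]
                    if not new <= follow[sym]:
                        follow[sym] |= new
                        changed = True
    return follow

print(follow_sets()["A"])        # {'a', 'c'}: hence the SLR(1) Shift/Reduce conflict on c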
• LALR(1) ⊊ LR(1). The fact that LALR(1) ⊆ LR(1) stems from the construction
of LALR(1): if a grammar can be parsed deterministically by
an LALR(k) parser, it will also be the case by an LR(k) parser, since the
latter is a refinement of the former. To prove that the inclusion is strict,
all we need is to exhibit an example of a grammar that is LR(1) but not
LALR(1). Such a grammar is given in Figure 6.18 and can be found as
Example 4.58 in the ‘Dragon book^21’.

(^21 A. Aho, M. Lam, R. Sethi, and J. Ullman. Compilers: Principles, Techniques, &
Tools. Addison-Wesley series in computer science. Pearson/Addison Wesley, 2007.)

In particular, building the LR(1) CFSM creates two states that both contain
the items A → c• and B → c•, but with different (swapped) contexts.
Merging them in the LALR(1) CFSM yields the state:

A → c• , {d, e}
B → c• , {d, e}

which exhibits a Reduce/Reduce conflict.
• For all k: LR(k) ⊊ LR(k + 1). The fact that LR(k) ⊆ LR(k + 1) is trivial,
since having a longer look-ahead allows one to parse more grammars
deterministically. Then, for all k ≥ 1, the grammar in Figure 6.19 is
LR(k + 1) but not LR(k). Indeed, after reading a, its LR(k) CFSM reaches
the state:

A → a• , {b^k}
B → a• , {b^k}

which contains a Reduce/Reduce conflict that the k characters of look-ahead
cannot lift, whereas the corresponding state of the LR(k + 1) CFSM is:

A → a• , {b^k c}
B → a• , {b^k d}

in which the look-ahead distinguishes the two possible Reduces.

(1) S → A b^k c
(2)   → B b^k d
(3) A → a
(4) B → a
Figure 6.19: An LR(k + 1) grammar that is not LR(k) and not LL(k + 1).
Finally, we address one last question regarding the hierarchy of LR(k)
languages: since the sequence LR(0), LR(1), . . . , LR(k), . . . is growing, one
could think that each CFG fits snugly in one of those classes. It turns
out that this is not the case. (Recall that an ambiguous CFG is one that
admits at least two different parse trees on a given word, hence, at least
two different rightmost derivations on this word.) Clearly, any ambiguous
grammar will not be
LR(k) since the parser cannot decide which parse tree to use to produce
the rightmost derivation, hence it will necessarily be non-deterministic.
However, there are unambiguous CFG which are not LR(k) for any k. Here
is an example:
b’s is odd. To do so, the grammar relies on the recursive rule A→b A b, as
shown in the following derivation:

S ⇒ a A c ⇒ ab A bc ⇒ abb A bbc ⇒ abbb A bbbc ⇒ abbbbbbbc

(1) S → a A c
(2) A → b A b
(3)   → b
Figure 6.20: A grammar which is not LR(k) for any k.
So, the crucial difficulty here will be for the parser to decide when to Reduce
the ‘middle b’ (the fourth b in the last step of the derivation above) as A.
Assume the word to be accepted is ab^{2n+1}c for some n. Then, when this
particular Reduce must happen, an LR(k) parser ‘sees’ the look-ahead
First_k(b^n c), and needs to ‘see’ the c in order to realise that it is time to
Reduce the b into A. In other words, if the look-ahead is long enough (i.e.
k ≥ n + 1), the parser will ‘see’ the c and can decide to perform the Reduce
at the right moment on that particular word. Otherwise, the look-ahead
will contain only b’s, which does not allow the parser to decide whether to
Reduce or to keep shifting b’s. Unfortunately, the size of the look-ahead
is fixed, and there will always be a word which is too long for this look-ahead
to be sufficient. More precisely, if the size of the look-ahead is k,
the parser will not have enough information to parse deterministically all
words ab^{2n+1}c with k ≤ n. So the grammar cannot be LR(k) for any k (and
it is clearly unambiguous). M
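The core of the argument is easy to see on concrete values. The following small sketch (Python; the helper name is ours) shows what a parser with a k-character look-ahead sees at the moment the ‘middle b’ must be reduced, when the remaining input is b^n c:

def lookahead_at_middle_b(n, k):
    """When the middle b of a b^(2n+1) c is reduced, the remaining input is b^n c;
    a k-look-ahead parser only sees its first k characters."""
    remaining = "b" * n + "c"
    return remaining[:k]

print(lookahead_at_middle_b(3, 4))   # 'bbbc': the c is visible, the Reduce can be decided
print(lookahead_at_middle_b(5, 4))   # 'bbbb': only b's, the word is too long for k = 4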
LR(0) lang. ⊆ SLR(1) lang. ⊆ LALR(1) lang. ⊆ LR(1) lang. ⊆ · · · ⊆ LR(k) lang. ⊆ LR(k + 1) lang. ⊆ · · ·
Most of the results of this section can be found in the seminal paper on
LR(k) by Donald KNUTH^23.

(^23 Donald E. Knuth. On the translation of languages from left to right. Information
and Computation (formerly known as Information and Control), 8:607–639, 1965.
DOI: 10.1016/S0019-9958(65)90426-2.)

We start by observing, as KNUTH does, that
there is at least one language which is an LR(1) language but not an LR(0)
language:
Proposition 6.7. The language L = {a, ε} is an LR(1) language but not an
LR(0) language.
Proof. The fact that L is an LR(1) language can be established by finding
an LR(1) grammar for it. The CFG that contains only the rules S → a and
S → ε is such a grammar. In the initial state of the CFSM, the character of
look-ahead allows the parser to decide between: (i) shifting the a from the
input when it is in the look-ahead; or (ii) reducing ε to S immediately when
the look-ahead is empty (i.e., the end of the input has been reached).
For similar reasons, we can conclude that the language L is not LR(0).
This is a bit more difficult, from the conceptual point of view, since we
need to reason about all possible grammars that can define L and
show that none of them can be LR(0). Assume we have a grammar G
s.t. L(G) = L, and assume we build the LR(0) CFSM of this grammar. Recall
that the CFSM is a deterministic finite state automaton that reads all the vi-
able prefixes of the grammar in order to identify the handles. In the initial
state, the CFSM has read ε which is a handle (as ε is accepted); so, in the
initial state, Accept must be one action that the parser can perform (oth-
erwise, the parser would shift and miss the handle that allows it to accept
ε ∈ L). However, a Shift must also be one of the actions of the parser, in or-
der to obtain the handle a. So, there is necessarily a Shift/Accept conflict in
the initial state of the LR(0) CFSM, that only one character of look-ahead
can lift, as we have seen.
So, clearly, LR(0) ⊊ LR(1): this inclusion is strict. But, surprisingly, such
a strict inclusion does not carry over to the next levels of the hierarchy!
Indeed, KNUTH, again, proves the following beautiful result, linking the
LR(k) grammars for k ≥ 1 and the deterministic PDA that we have studied
in Section 4.2.2.

(The fact that DPDA pop up once again when establishing the expressive powers
of parsers should not be a total surprise to the attentive reader (!) Indeed, the whole
purpose of studying parsers was to find a way to parse CFL in a deterministic way
(thus, by means of a PDA). In some sense, this theorem tells us that we have reached
our Grail with the LR(k) grammars!)

Theorem 6.8. If L is the language of a DPDA, then there is an LR(1) grammar
G that recognises it, i.e. L(G) = L.
We will refer the reader to the previously mentioned paper^24 for the
proof of this theorem, and rather discuss here its implications.

(^24 Donald E. Knuth. On the translation of languages from left to right. Information
and Computation (formerly known as Information and Control), 8:607–639, 1965.
DOI: 10.1016/S0019-9958(65)90426-2.)

First, let us recall that we denote by DCFL the class of languages accepted
by deterministic PDA (DPDA). So, we can rephrase the theorem
above by writing:
DCFL ⊆ LR(1) lang.
However, we already know that all the LR(k) classes of languages can be
recognised by a deterministic PDA^25, by definition! So, we conclude now
that:

for all k ≥ 1 : LR(k) lang. ⊆ DCFL.

(^25 Actually, by a deterministic PDA with look-ahead, but we have seen in Section 5.2.1
that the look-ahead is essentially syntactic sugar and can be incorporated in a regular
(deterministic) PDA.)
Putting everything together, we obtain:
Since ‘all LR(k) parsers can accept the same families of languages’, one can
reasonably wonder why it was necessary to define them all. The motivation
is of course purely practical: even if we can recognise all languages of
DPDA, even with an LR(0) parser (under the light restriction we have seen
above), this does not mean that it is always easy or desirable^27 to obtain an
LR(0) grammar for a given language. So, we need parsers and parser generators
that can exploit broad classes of grammars. Experience shows
that many grammars that one uses in practice fall into the LALR(1) class,
which explains why tools such as yacc^28, bison^29 and cup^30 target this
particular class of grammars.

(^27 After all, grammars are also designed to specify the syntax of computer languages
to human beings who have to program. So they should be readable.)
(^28 Stephen C. Johnson. Yacc: Yet another compiler-compiler. Technical report, AT&T
Bell Laboratories, 1975. Readable online at http://dinosaur.compilertools.net/yacc/.)
(^29 Gnu bison. https://www.gnu.org/software/bison/. Online: accessed on December,
29th, 2015.)
(^30 Cup: Construction of useful parsers. http://www2.cs.tum.edu/projects/cup/.
Online: accessed on December, 29th, 2015.)
6.9 Comparison of the top-down and bottom-up hierarchies

The next step, in our discussion of the different families of context-free
grammars and languages that can be parsed deterministically, is to compare
the bottom-up hierarchies we have established in the two previous
sections, to the results of Section 5.4.2 and Section 5.4.3 that concern the
top-down hierarchy.
Recall also that, from Theorem 6.6, the same holds for the families of LR(k)
grammars, i.e.: LR(0) ⊊ SLR(1) ⊊ LALR(1) ⊊ LR(1) ⊊ LR(2) ⊊ · · · ⊊ LR(k) ⊊
LR(k + 1) ⊊ · · ·
The main result for this comparison consists in checking that all LL(k)
grammars are LR(k), while there are LR(k) grammars that are not LL(k). In
other words: LL(k) ⊊ LR(k) for all k ≥ 0. A history of attempts to establish
this result, together with a comprehensive proof, can be found in a 1982
paper by Anton NIJHOLT^31.

(^31 Anton Nijholt. On the relationship between LL(k) and LR(k) grammars. Information
Processing Letters, 15(3):97–101, 1982. DOI: 10.1016/0020-0190(82)90038-2.
URL https://www.researchgate.net/publication/222460902_On_the_relationship_between_the_LLk_and_LRk_grammars.)

Theorem 6.10. For all k ≥ 0: LL(k) ⊊ LR(k).

Proof. For the proof that LL(k) ⊆ LR(k), we refer the reader to the paper by
NIJHOLT. The argument is rather short and consists in showing that: ‘if
the leftmost derivations of a grammar satisfy the LL(k) conditions, then
the rightmost derivations satisfy the LR(k) conditions’ (see page 98 of the
article).
To prove that the inclusion is strict, one can rely on the grammar of
Figure 6.19, which is in LR(k + 1) but not in LL(k + 1) (for all k ≥ 0) since
First_{k+1}(A b^k c) = First_{k+1}(B b^k d) = {ab^k}.
This result can be seen as the ground for the folklore assertion that
‘bottom-up parsers are more powerful than top-down parsers’ (although
we will provide other arguments to support this in Section 6.9.2). In other
words, when we have ascertained that k characters of look-ahead are suf-
ficient to parse a grammar top-down, we are sure that k characters of look-
ahead will be sufficient for a bottom-up parser as well.
[Figure 6.21: A map of the LL and LR hierarchies of grammars (LR(0), SLR(1), LALR(1), LR(1), LR(k), LR(k + 1) against LL(1), LL(k)). Each region is witnessed by an example grammar, numbered (0) to (17); these witnesses, some of which are obtained from the figures of this chapter by the transformations T1 and T2, are discussed below.]
rules S → a^{k_2} b as well as S → a^{k_2+1} b. Clearly, those rules imply that
G_2 ∉ LL(k_2). However, those new rules do not change the LR(k_1) character
of the grammar, because a and b are fresh symbols. Indeed, if we
build the LR(0) CFSM for grammar G_2, we will obtain, in comparison
with the LR(0) CFSM of G_1:

• k_2 extra states, in which the parser will shift the a’s to the stack; and

• three extra states {S → a^{k_2} b •}, {S → a^{k_2+1} • b} and {S → a^{k_2+1} b •}.
Now, we can discuss the grammars that appear as elements of the sets
in Figure 6.21:
(0) A simple grammar containing only the rule S→a$ for example is
clearly both LL(1) and LR(0). It is actually LL(0) as well.
(1) The grammar from Figure 5.6, when we let k = 0, is an LL(1) grammar
that can be checked to be SLR(1) as well. It is however not LR(0) since
the first two rules introduce a Shift/Reduce conflict in the initial state
of the CFSM.
(2) The grammar in Figure 6.22 is an LL(1) grammar that can be checked
to be LALR(1) as well. It is however not SLR(1) since:

Follow(A) = Follow(B) = {a, b}.

(1) S → A a A b
(2)   → B b B a
(3) A → ε
(4) B → ε
Figure 6.22: A grammar which is LL(1) and LALR(1) but not SLR(1).

(3) The grammar in Figure 6.23 can be checked to be LL(1) but not LALR(1).
This grammar is adapted from an example in the book of Andrew APPEL^32.
The grammar is not LALR(1) because the LR(1) CFSM contains
the two states:

E → A• , {b}
F → A• , {c}

and

E → A• , {c}
F → A• , {b}

whose merge in the LALR(1) CFSM will create a Reduce/Reduce conflict.

(^32 Andrew W. Appel. Modern Compiler Implementation in ML. Cambridge University
Press, 1998. ISBN 0-521-58274-1.)

(1) S → a X
(2)   → E b
(3)   → F c
(4) X → E c
(5)   → F b
(6) E → A
(7) F → A
(8) A → ε
Figure 6.23: A grammar which is LL(1) but not LALR(1).

(4) In general, grammar G_k as found in Figure 5.6, with its parameter set to
k + 1, gives a grammar that is LL(k + 1) but not LR(k).
(5) The grammar we have used at the beginning of the chapter (in Exam-
ple 6.4) is LR(0) but not LL(1) as we have already seen. It is however
still in the LL hierarchy, as it is LL(2).
(6) The grammar in Figure 6.24 is a simple example of a grammar that is
not LL(1) and not LR(0), but which is LL(2) and SLR(1).

(1) S → a b
(2) S → a c
(3) S → a
Figure 6.24: A grammar that is LL(2) and SLR(1) but neither LL(1) nor LR(0).
(7) The grammar in Figure 6.17 has already been shown to be LALR(1) SLR(1) but neither LL(1) nor LR(0).
but not SLR(1). We can also observe that it is not LL(1) because d ∈
First(A a) ∩ First(dc). However, it is LL(2).
(8) The grammar in Figure 6.18 has also been discussed already, and we
know that it is LR(1) but not LALR(1). We can check that it is neither LL(1)
nor LL(2): for example, a look-ahead of ac does not allow one to choose
between S→a A d and S→aB e since both A and B produce c. How-
ever, it is still in the LL hierarchy, as it is LL(3).
(9)–(12) The grammars which have been used in the points (5) to (8) above
can be modified using transformation T1 in order to make them LL(k)
but not LL(k + 1) for any k.
(17) We have already seen that the grammar in Figure 6.19 is LR(k + 1) but
not LR(k). One can also check that this grammar is not LL(k + 1), since
First_{k+1}(A b^k c) = First_{k+1}(B b^k d) = {ab^k}. It is, however, LL(k + 2), so
classes, since we already know that the LR(k) hierarchy of languages ‘col-
lapses’ to correspond to the languages of DPDA. Associating this result
with the hierarchy we have obtained in Section 5.4.3, we obtain:
LL(1) lang. ⊊ LL(2) lang. ⊊ · · · ⊊ LL(k) lang. ⊊ · · · ⊊ LR(1) lang. = LR(2) lang. = · · · = LR(k) lang. = · · · = DCFL
6.10 Exercises
6.10.1 LR(0)
(The algorithms to build a CFSM are found in Section 6.2.)
Exercise 6.1. Build the CFSM corresponding to the following grammar:
(1) S′ → S$
(2) S → aC d
(3) → bD
(4) → Cf
(5) C → eD
(6) → Fg
(7) → CF
(8) F → z
(9) D → y
(The building techniques for an LR(0) parser are in Section 6.3.)
Exercise 6.2. Give the action table of the LR(0) parser on the grammar of
the previous exercise.
(See Section 6.3 again.)
Exercise 6.3. Simulate the run of the LR(0) parser for the grammar of the
previous exercises, on the word aeyzzd$.
6.10.2 SLR(1)
(SLR(1) parsers are covered in Section 6.4.)
Exercise 6.4. Build the SLR(1) parser for the following grammar (i.e., build
the appropriate CFSM and give the SLR(1) action table):
(1) S′ → S$
(2) S → A
(3) A → bB
(4) → a
(5) B → cC
(6) → cC e
(7) C → d Af
Is the above grammar LR(0)? Justify your answer.
6.10.3 LR(k)
(Section 6.5 is devoted to LR(k) parsers.)
Exercise 6.5. Build the LR(1) parser for the following grammar (i.e., build
the appropriate CFSM and give the LR(1) action table):
(1) S′ → S$
(2) S → S aS b
(3) → c
(4) → ε
Is this grammar LR(0)? Is it SLR(1)? Justify your answers.
Exercise 6.6. Simulate the run of the parser you built at the previous ex-
ercise on the word abacb.
6.10.4 LALR(1)
(The definition of LALR(1) parsers in Section 6.6 shows how to build them from LR(1) parsers.)
Exercise 6.7. Build the LALR(1) parser for the grammar of exercise 6.5,
using the LR(1) parser you have built for the same exercise.

(Since LALR(1) parsers can be built from LR(1) parsers, try to come up with
states of an LR(1) parser that would generate a conflict when the LALR(1)
parser is built, and infer a grammar from that.)
Exercise 6.8. Find a grammar which is LR(1) but not LALR(1).
A Some reminders of mathematics
A α alpha
B β beta
Γ γ gamma
∆ δ delta
E ε epsilon
Z ζ zeta
H η eta
Θ θ theta
I ι iota
K κ kappa
Λ λ lambda
M µ mu
N ν nu
Ξ ξ xi
O o omicron
Π π pi
P ρ rho
Σ σ sigma
T τ tau
Y υ upsilon
Φ ϕ phi
X χ chi
Ψ ψ psi
Ω ω omega
For the sake of completeness, we recall the very basic definitions about
sets, although we assume the reader is already pretty familiar with them.
A.2.1 Sets
Example A.2. We can denote the set S containing all natural numbers be-
tween 2 and 7 (included) as:
S = {2, 3, 4, 5, 6, 7}.
Here, we use an enumeration to list all the elements of the set. Although
we have chosen to enumerate the elements in increasing order, we could
have written:
S = {3, 6, 5, 2, 7, 4}
N = {0, 1, 2, 3, . . .}
Z = {. . . , −3, −2, −1, 0, 1, 2, 3, . . .}.
These sets, however, are examples of infinite sets and cannot be completely
enumerated^2. Moreover, even some sets which are finite can be
more conveniently represented using the so-called set-builder notation.
For example, we could define the previous set S like so:

S = {x | 2 ≤ x ≤ 7}.

(^2 In the sense that we cannot write all their elements on a sheet of paper. Note that
there exists a mathematical notion of ‘enumerable set’ for which the natural and
integer numbers are enumerable.)
Similarly, the set of all even natural numbers can be expressed as:

{2 · x | x ∈ N}.
Finally, let us note that the elements of sets can be anything, including:

1. pairs of elements, as in the set
{(x, y) | x ∈ N and y ∈ N and y = 2 · x};

2. other sets.

(Note how we use brackets () for denoting pairs and curly brackets {} for
denoting sets, as is standard practice.)
A.2.2 Relations
Definition A.3 (Cartesian product). Given two sets A and B (which might
be equal), their cartesian product, denoted A × B is the set:
{(a, b) | a ∈ A and b ∈ B }
The cartesian product of Z by itself is the set of all possible integer co-
ordinates in a two-dimensional plane. Instead of writing Z × Z, we rather
write Z². M
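As a small concrete check of the definition, Python’s itertools.product enumerates exactly the pairs of a cartesian product; the two tiny finite sets below are just chosen for the illustration:

from itertools import product

A = {1, 2}
B = {"x", "y"}

# A x B = {(a, b) | a in A and b in B}
print(set(product(A, B)))
# {(1, 'x'), (1, 'y'), (2, 'x'), (2, 'y')}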
Definition A.5 (Binary relation). Given two sets A and B , a binary relation
over A and B is a subset of A × B . M
Since we will only consider binary relations here, we will simply call
them relations. Put simply, a relation over A and B is a set of pairs (a, b)
where a ∈ A and b ∈ B . Let R be such a relation. Instead of writing (a, b) ∈
R, we will adopt the common shorthand aRb, as shown by the following
examples.
is a relation over S² (in this case, both sets A and B from the definition are
equal to S). It defines the ‘smaller than or equal to’ concept over S. So, we
can call this relation ‘≤’, and write (a, b) ∈≤ iff a is smaller than or equal to
b. With the shorthand notation, we write a ≤ b. M
Properties of relations. The notion of relation is, as we have seen, very general.
It can be used to formalise several concepts. For instance, the previous
example shows that the natural notion of order can be formalised as a
binary relation. In order to identify certain relations of interest, we need to
define some properties that relations can have. We start with two different
properties of relations in general.
First, we look at functional relations. As the name indicates, this property
captures the notion of function. Consider for example the function sin.
We know that, for all possible values of x, sin(x) is a unique value. However,
several values of x can be mapped by the function to the same value. For
example, sin(π/2) = sin(5π/2) = 1. Now, if we see the function sin as a relation
containing all the pairs (x, sin(x)), we know that: (i) there can’t be two
pairs (x, y₁) and (x, y₂) with y₁ ≠ y₂, since both y₁ and y₂ must be equal
to sin(x), which is unique; however (ii) there can be several pairs (x₁, y)
and (x₂, y) with x₁ ≠ x₂. For example, (π/2, 1) and (5π/2, 1) both exist in the
relation. This is captured in the following definition:
1. R is reflexive iff, for all a ∈ A: (a, a) ∈ R. That is, all elements are always
put in relation with themselves.

2. R is symmetric iff, for all (a, b) ∈ R: we also have (b, a) ∈ R. That is, every
time a is put into relation with b through R, then b is also put in relation
to a through R.

3. R is antisymmetric iff, for all a, b ∈ A: if (a, b) ∈ R and (b, a) ∈ R, then
a = b. In some sense, the antisymmetric property does not allow symmetry
to happen on distinct elements: every time we have (a, b) and (b, a) in
R, it must happen on the same elements, i.e. a = b.

4. R is transitive iff, whenever (a, b) ∈ R and (b, c) ∈ R: we also have
(a, c) ∈ R. This is the classical definition of transitivity.

(Observe that a relation which is not symmetric is not necessarily antisymmetric
and vice-versa. It is possible that (a, b) ∈ R and (b, a) ∈ R for some pairs (a, b) but
not all (hence, R is not symmetric), and that there exists at least one such pair with
a ≠ b (hence, it is not antisymmetric). Further observe that some authors use the
word ‘total’ instead of ‘strongly connected’. However, this seems to be deprecated,
and we have adopted the modern notion of ‘total relation’, see Definition A.2.2.)
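These four properties are easy to test mechanically on a finite relation. Here is a minimal Python sketch (the function names are ours), applied to the ‘smaller than or equal to’ relation on the set S = {2, . . . , 7} used earlier:

from itertools import product

def is_reflexive(R, A):
    return all((a, a) in R for a in A)

def is_symmetric(R):
    return all((b, a) in R for (a, b) in R)

def is_antisymmetric(R):
    return all(a == b for (a, b) in R if (b, a) in R)

def is_transitive(R):
    return all((a, d) in R for (a, b) in R for (c, d) in R if b == c)

S = set(range(2, 8))
leq = {(a, b) for a, b in product(S, S) if a <= b}

print(is_reflexive(leq, S), is_symmetric(leq),
      is_antisymmetric(leq), is_transitive(leq))
# True False True True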
While these properties might sound very abstract, they are the basic
building blocks that allow one to define the classical concepts that we are
used to manipulate, like partial orders, orders or equivalences, as we are
about to see.
Partial orders and orders. Let’s start with the classical notion of order.
Example A.14. For example, the classical ordering relation ≤ on the inte-
gers is both an order and a partial order. Indeed, if x ≤ y and y ≤ z, then
x ≤ z, so ≤ is transitive. It is also antisymmetric, because if x ≤ y and y ≤ x,
it can only be that x = y. So, ≤ is a partial order. Moreover, x ≤ x for all
integers x, so ≤ is also reflexive. Finally, we can always compare two integers
through ≤, so it is also strongly connected. Hence, ≤ is indeed an order.
Now let’s lift the ≤ relation to pairs of integers: (x₁, y₁) ≤ (x₂, y₂) iff
x₁ ≤ x₂ and y₁ ≤ y₂. In this case, we have elements that are incomparable.
As a concrete example, assume we need to buy a washing machine, and
that we rate the models according to their yearly energy consumption and
their price. So each machine is characterised by (e, p). Assume we have
a machine that consumes 100 kWh and costs 500 euros. We assign it the
pair (100, 500). So, a machine with the pair (80, 400) is clearly better, and
we have (80, 400) ≤ (100, 500). However, a machine with characteristics
(75, 700) is not comparable to our (100, 500) machine: (75, 700) ̸≤ (100, 500)
and (100, 500) ̸≤ (75, 700).
This new relation ≤ is still transitive and antisymmetric, so it is indeed
a partial order. It is also reflexive, but not strongly connected, and is thus
not an order. M
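The washing-machine example can be replayed directly with the component-wise order on pairs; the following sketch (the names and values are simply those of the example above) shows that the second machine improves on the first, while the first and third are incomparable:

def leq_pairs(p, q):
    """Component-wise order: (e1, p1) <= (e2, p2) iff e1 <= e2 and p1 <= p2."""
    return p[0] <= q[0] and p[1] <= q[1]

m1, m2, m3 = (100, 500), (80, 400), (75, 700)   # (energy, price) of three machines
print(leq_pairs(m2, m1))                        # True: m2 is better on both criteria
print(leq_pairs(m3, m1), leq_pairs(m1, m3))     # False False: incomparable elements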
Example A.16. As an example, let us say that fruits are equivalent when
they have the same color (assuming they have only one color). So, for
instance, tomatoes are equivalent to cherries because they are both red;
cherries are also equivalent to strawberries because strawberries are red as
well. So, clearly, strawberries must be equivalent to tomatoes. This shows
why an equivalence relation must be transitive. Of course, if cherries are
the same color as tomatoes, then tomatoes are the same color as cherries
(!) so our relation is symmetric. Finally, tomatoes are the same color as
tomatoes (!!) so our relation is also reflexive. We conclude that our ‘has
the same color’ relation is indeed an equivalence relation.
We can continue this example and see that equivalence relations natu-
rally induce a splitting of the fruits between so-called classes: while toma-
toes, cherries and strawberries are all red; bananas and lemons belong to
their own gang of yellow fruits. All yellow fruits are equivalent to each
other, but no yellow fruit can be equivalent to a red one. This is further
formalised in the next definitions. M
So, the notion of partition consists in ‘splitting’ the whole set A into dif-
ferent subsets, much like we do when we cut a cake. Such a ‘cut’ of the set
can be done through an equivalence relation, when we put all equivalent
elements together in a subset:
Example A.19. One can now check that these definitions match the intuitions
given in the example above. Given the set

A = {tomatoes, cherries, strawberries, lemons, bananas},

the subsets {tomatoes, cherries, strawberries} and {lemons, bananas} are the
two equivalence classes of our ‘has the same color’ relation, and they in-
deed form a partition of A, since all fruits end up in either equivalence
class, and there is no intersection between these classes. M
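Computing the partition induced by an equivalence relation is straightforward when the relation is given by a ‘key’ (here, the colour of each fruit); the following Python sketch simply groups equivalent elements together:

from collections import defaultdict

colour = {"tomato": "red", "cherry": "red", "strawberry": "red",
          "lemon": "yellow", "banana": "yellow"}

# 'Has the same colour' is an equivalence relation; its classes partition the set.
classes = defaultdict(set)
for fruit, c in colour.items():
    classes[c].add(fruit)

print(list(classes.values()))
# [{'strawberry', 'cherry', 'tomato'}, {'banana', 'lemon'}]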
Example A.20. Let us consider the five cities: Antwerp (A), Brussels (B),
Paris (P), New York (NY) and Miami (M). Let us assume we are given some
information about the possibility to travel from one city to the other by
road, as a relation:
R = {(A, B), (B, P), (NY, M)}.
That is, we know there is a road from Antwerp to Brussels, from Brussels
to Paris, and from New York to Miami. Let us now assume we want to know
what are all the possible road connections we can deduce from this infor-
mation. Clearly, if we can go from Antwerp to Brussels and from Brussels
to Paris, then we can also go from Antwerp to Paris, so we can add the pair
(A, P ), but no further pair based on the information which is given to us.
This is exactly the transitive closure of the above relation:
R′ = {(A, B), (B, P), (NY, M), (A, P)}.
Finally, let us note that the notion of transitive closure can be extended
to other properties of relations: for instance, the transitive and reflexive closure
of R is the smallest transitive and reflexive relation R* that contains R, and
so forth.
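The transitive closure of a finite relation can be computed by a naive fixed-point iteration, as the following sketch (run on the relation of Example A.20) illustrates:

def transitive_closure(R):
    """Smallest transitive relation containing R (naive fixed-point iteration)."""
    closure = set(R)
    while True:
        new_pairs = {(a, d) for (a, b) in closure for (c, d) in closure if b == c}
        if new_pairs <= closure:
            return closure
        closure |= new_pairs

R = {("A", "B"), ("B", "P"), ("NY", "M")}
print(transitive_closure(R))
# {('A', 'B'), ('B', 'P'), ('NY', 'M'), ('A', 'P')}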
B Bibliography
Frances E. Allen. Control flow analysis. SIGPLAN Not., 5(7):1–19, July 1970.
ISSN 0362-1340. D O I : 10.1145/390013.808479.
J.A. Brzozowski and E.J. McCluskey, Jr. Signal flow graph techniques
for sequential circuit state diagrams. Electronic Computers, IEEE
Transactions on, EC-12(2):67–76, April 1963. ISSN 0367-7508. D O I :
10.1109/PGEC.1963.263416.
H.W. Fowler, J.B. Sykes, and F.G. Fowler. The Concise Oxford dictionary of
current English. Clarendon Press, 1976.
Gerwin Klein, Steve Rowe, and Régis Décamps. Jflex user’s manual.
https://jflex.de/manual.html, March 2023. Version 1.9.1. Online:
accessed on April, 12th, 2023.
M.O. Rabin and D. Scott. Finite automata and their decision prob-
lems. IBM Journal of Research and Development, 3(2):114–125,
April 1959. ISSN 0018-8646. DOI: 10.1147/rd.32.0114. URL
https://www.researchgate.net/publication/230876408_
Finite_Automata_and_Their_Decision_Problems.
R.M. Ritter. The Oxford Guide to Style. Language Reference Series. Oxford
University Press, 2002.