0% found this document useful (0 votes)
18 views220 pages

Info F 403

Uploaded by

wegiveyouart
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
18 views220 pages

Info F 403

Uploaded by

wegiveyouart
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 220

GILLES GEERAERTS, GUILLERMO A.

PEREZ

INTRODUCTION
TO
L A N G UAG E T H E O RY
AND
COMPILING
Typeset under LATEX using the T U F T E - L A T E X document class.

The authors would like to thank the following persons for proofreading (part of ) those notes and making mean-
ingful suggestions and comments:

• Mourad A K A N D O U C H
• Benoît H A A L
• Zakaria J A M I A I
• Franklin L AU R E L E S
• Lucas L E F È V R E
• Benjamin M O N M E G E
• Marie V A N D E N B O G A A R D .

Fifth version, September 2024


Contents

1 Introduction 9
1.1 What is a language? 9
1.2 Formal languages 11
1.3 Application: compiler design 15
1.4 Operations on words and languages 29

2 All things regular. . . 31


2.1 Regular languages 32
2.2 Regular expressions 33
2.3 Finite automata 36
2.4 Equivalence between automata and regular expressions 41
2.5 Minimisation of DFAs 55
2.6 Operations on regular languages 61
2.7 Exercises 65

3 Grammars 73
3.1 The limits of regular languages and some intuitions 73
3.2 Syntax and semantics 76
3.3 The Chomsky hierarchy 78
3.4 Exercises 84

4 All things context free. . . 87


4.1 Context-free grammars 90
4.2 Pushdown automata 99
4.3 Operations and closure properties of context-free languages 114
4.4 Grammar transformations 116
4.5 Exercises 127
4

5 Top-down parsers 129


5.1 Principle of top-down parsing 129
5.2 Predictive parsers 131
5.3 Firstk and Followk 136
5.4 LL(k) grammars 142
5.5 LL(1) parsers 149
5.6 Implementation using recursive descent 156
5.7 Exercises 158

6 Bottom-up parsers 159


6.1 Principle of bottom-up parsing 159
6.2 The canonical finite state machine 166
6.3 LR(0) parsers 170
6.4 SLR(1) parsers 176
6.5 LR(k) parsers 180
6.6 LALR(k) parsers 191
6.7 The bottom-up hierarchy of grammars 196
6.8 The bottom-up hierarchy of languages 199
6.9 Comparison of the top-down and bottom-up hierarchies 201
6.10Exercises 207

A Some reminders of mathematics 209


A.1 Greek letters 209
A.2 Sets and relations 210

B Bibliography 217
List of Figures

1.1 A syntactically correct excerpt of C code that raises an error during


semantic analysis. 24
1.2 Three syntactically correct assignments with different behaviours of
the semantic analyser. The first (line 9) is not problematic. The second
(line 10) raises a warning because a pointer is cast to an integer. The
last (line 11) is not allowed: no conversion is possible. 25
1.3 A decorated AST with typing information. 25
1.4 A C++ program which is syntactically correct, but contains a semantic
error: the goto bypasses the definition of the i variable, which exists
only in the scope of the for. 25
1.5 The construction of the control flow graph of a typical if statement for
its AST. B stands for the condition of the if; T for the ‘then’ block; E
for the ‘else’ block; and N for the statements that follow the if in the
program. 26
1.6 An example of control flow optimisation. The second code excerpt
guarantees to test the condition x > 2 only once. 27
1.7 An example of a C++ function where the parameter x can be safely
promoted to a reference to avoid a copy of the whole structure. 27

2.1 An illustration of a finite automaton. 36


2.2 We can represent finite automata more compactly by focusing on the
‘control’, i.e., the states and transitions. 37
2.3 A non-deterministic finite automaton. 38
2.4 An automaton recognising L 2 , i.e., the set of all binary words that con-
tain two 1’s separated by exactly 2 characters. 39
2.5 A non-deterministic automaton with spontaneous moves that accepts
a∗ b∗ . 39
2.6 The run-tree of the automaton in Figure 2.3 on the word ab. 41
2.7 The set of transformations used to prove Theorem 2.2, with the sec-
tion numbers where they are introduced. 42
2.8 The ε-NFA built from the regular expression l · (l + d)∗ , using the sys-
tematic method. 44
2.9 The DFA obtained from the ε-NFA A l·(l+d)∗ . 48
2.10The family of ε-NFAs A n (n ≥ 1) s.t. for all n ≥ 1: L(A n ) = L n . 49
2.11The ε-NFA A 1 recognising L 1 . 49
2.12The DFA D 1 obtained from the NFA A 1 . The gray labels show the val-
ues of the memory bits associated to some states. 50
2.13The situation before (left) and after (right) deletion of state q 52
6

2.14The two possible forms for an automaton A q f obtained by eliminating


all states but q 0 and q f , and their corresponding regular expressions.
We obtain the right automaton whenever q 0 = q f . 53
2.15A DFA which is not minimal. 55
2.16A minimal DFA. 57
2.17Swapping accepting and non-accepting states does not complement
non-deterministic automata. 62

3.1 An automaton that ‘forgets’ whether the first letter was an a or a b. 73


3.2 An automaton that ‘remembers’ whether the first letter was an a or
a b. 73
3.3 An example of a DFA and its corresponding right-regular grammar. 80
3.4 An example of a right-regular grammar and its corresponding ε-NFA. 81
3.5 Removing rule of the form V → ε in two steps. Observe that in the re-
sulting grammar, the variable B does not produce any terminal, so we
could also remove the rule S → B b, but what matters is that the lan-
guage of the resulting grammar is the same as the original one. 82
3.6 Removing all occurrences of the start symbol S from right-hand sides
of rules, while preserving the language of the original grammar. 83

4.1 A finite automaton accepting (01)∗ . The labels of the nodes represent
the automaton’s memory: it remembers the last bit read if any. 87
4.2 Recognising a palindrome using a stack. 88
4.3 An intuition of a pushdown automaton that recognises L pal# . The key-
words Push, Pop and Top have their usual meaning. The edge labelled
by empty can be taken only when the stack is empty. Note that, in-
stead of pushing q 0 or q 1 , one could simply store 0 and 1’s on the
stack. 90
4.4 The grammar G Exp to generate expressions. 90
4.5 A derivation tree for the word Id + Id ∗ Id. 92
4.6 Another derivation tree for the word Id + Id ∗ Id. 92
4.7 The derivation tree for Id+Id∗Id of Figure 4.5 with a top-down traversal
indicated by the position of each node in the sequence 93
4.8 A CFG which is not in CNF. 95
4.9 A CFG in CNF that corresponds to the CFG in Figure 4.8. 95
4.10An example CNF grammar generating a+ b. 97
4.11An example PDA recognising L pal# (by accepting state). 101
4.12An example PDA recognising L pal# (by empty stack). 104
4.13An non-deterministic PDA recognising L pal (by accepting state). 104
4.14A PDA accepting (by empty stack) arithmetic expressions with + and
∗ operators only. 108
4.15An Illustration of the construction that turns a PDA accepting by empty
stack into a PDA accepting the same language by final state. 111
4.16An Illustration of the construction that turns a PDA accepting by final
state into a PDA accepting the same language by empty stack. Transi-
tions labelled by ε, γ/ε represent all possible transitions for all possible
γ ∈ Γ. 113
4.17A grammar with an unproductive variable (A). 119
4.18A grammar with an unreachable symbol (B ). 119
7

4.19A simple grammar for arithmetic expressions. 123


4.20The derivation tree of Id ∗ Id + Id taking into account the priority of the
operators. 125
4.21The derivation tree of Id + Id + Id taking into account the associativity
of the operators. 125

5.1 A trivial grammar and its corresponding (non-deterministic) parser


(where the initial stack symbol is S). 131
5.2 Illustration of the transformation of a k-LPDA into an equivalent PDA,
assuming Σ = {a, b}, x ∈ Σ and y ∈ Σ ∪ {ε}. 134
5.3 The derivation tree of dd and the notion of Follow1 . 137
5.4 The grammar generating expressions (followed by $ as an end-of-string
marker), where we have taken into account the priority of the opera-
tors, and removed left-recursion. 138
5.5 An example showing that Follow(Prod) contains $ in a case where Exp′
generates ε. 138
5.6 The family of grammars G k 147
5.7 The grammar generating expressions (followed by $ as an end-of-string
marker). This is the same grammar as in Figure 5.4, reproduced here
for readability. 151
5.8 Which grammars are LL(1)? 158

6.1 A configuration of the bottom-up parser where a Reduce must be per-


formed. 164
6.2 An example grammar 166
6.3 An example CFSM 168
6.4 A run of the LR(0) parser 171
6.5 Infinite set of viable prefixes 175
6.6 A CFSM with infinite language 175
6.7 A simple grammar for arithmetic expressions 176
6.8 The CFSM for the grammar generating expressions. 177
6.9 A grammar which is not SLR(1). 180
6.10An excerpt of the CFSM for the previous grammar. 181
6.11An excerpt of the LR(1) CFSM for the example grammar. 184
6.12The LR(1) CFSM for the grammar for arithmetic expressions, first part.
States 13 through 17 (and their successors) are given in the next fig-
ure. 185
6.13The LR(1) CFSM for the grammar for arithmetic expressions, (contin-
ued). States 6 and 11 are displayed on the previous figure. 186
6.14A simple grammar which is not LR(0). 190
6.15The CFSM of the previous grammar. 190
6.16Example of LALR(k) CFSM 194
6.17An LALR(1) grammar that is not SLR(1) and not LL(1) but LL(2). 197
6.18An LR(1) grammar that is not LALR(1) and not LL(1). 198
6.19An LR(k + 1) grammar that is not LR(k) and not LL(k + 1). 198
6.20A grammar which is not LR(k) for any k. 199
6.21Classes of grammars. 203
6.22An grammar which is LL(1) and LALR(1) but not SLR(1). 204
6.23An grammar which is LL(1) and not LALR(1) 205
8

6.24A grammar that is LL(2) and SLR(1) but neither LL(1) nor LR(0). 205
1 Introduction

1.1 What is a language?

T H E N OT I O N O F L A N G UA G E is obviously most important to humans. It


is beyond the scope of these lecture notes to give an exhaustive definition
of this notion, but, we would like to highlight several features of natural
languages, to help us build intuitions that will be useful when introducing
the basic definitions of formal language theory, the main topic of these
notes.
The Concise Oxford Dictionary1 defines a language as: 1
H.W. Fowler, J.B. Sykes, and F.G. Fowler.
The Concise Oxford dictionary of current
A vocabulary and way of using it prevalent in one or more countries. English. Clarendon Press, 1976

The Merriam-Webster Dictionary2 is more explicit: 2


“Language.” Merriam-Webster.com.
Accessed August 2, 2014. http:
The system of words or signs that people use to express thoughts and feel- //www.merriam-webster.com/
ings to each other dictionary/language.

[. . . ]

The words, their pronunciation, and the methods of combining them used
and understood by a community.

Both definitions explain that a language is built on top of basic building


blocks which are words (the set of all words forming a vocabulary), that
these words must be combined according to a certain system of rules (in
order to form sentences), and that the purpose of these combinations of
words (or sentences) is to carry a certain meaning. Hence, two important
concepts pertaining to languages are the form and the meaning.
The form can be loosely defined as the set of rules that govern the mak-
ing of a sentence in a given language. We will rather refer to form as the
syntax of the language. For a natural language, such as English, it starts
with the alphabet, which is the so-called Latin alphabet, on top of which
spelling rules indicate which words are correct or not. Then, those words
can be used to form sentences according to syntactic and grammatical
rules.
Syntax is not only relevant to natural languages, but also for formal lan- 3
B.W. Kernighan and D.M. Ritchie. The
guages such as programming languages. Syntactically correct programs C Programming Language. Prentice-Hall
only can be run by a computer, and the first duty of a compiler is a syntax software series. Prentice Hall, 1988

check.
Observe that, although we remove
the semi-colon from line 5, the er-
Example 1.1. Listing 1.1 shows a syntactically correct C program3 . Delet-
ror is reported on line 6. Indeed,
ing the semi-colon from the end of line 5 triggers a compiler syntax error: the compiler ‘realises that the semi-colon
is missing only when it reads the return
statement on line 6.
10 I N T R O D U C T I O N T O L A N G UA G E T H E O RY A N D C O M P I L I N G

Listing 1.1: A syntactically correct C program.


1 #include <stdio.h>
2 int i = 5;
3
4 int f(int j) {
5 int i = j ;
6 return i+1 ;
7 }
8
9 int main() {
10 printf("Hello world !") ;
11 printf("%d %d", i, f(i+1)) ;
12 return 0 ;
13 }

C-example.c: In function ’f’:


C-example.c:6: error: expected ’,’ or ’;’ before ’return’

On the other hand, when a sentence is syntactically correct, one can


try and make sense out of it, that is, to attach a meaning to this sentence,
which we refer to as its semantics.
The difference between syntax and semantics is important. In natural
languages, a famous example to illustrate this difference is due to Chom-
sky4 : 4
N. Chomsky. Syntactic Structures. Mou-
ton and Co, The Hague, 1957
Example 1.2. Observe that the following sentence is grammatical in En-
glish:

Colorless green ideas sleep furiously

because it follows the “Subject + Verb + Complement” basic pattern of En-


glish, but is clearly nonsensical. In other words, it is syntactically correct,
but semantically incoherent. M

The contrast between syntax and semantics is perhaps sharper in the


setting of programming languages, where the semantics is supposed to be
non-ambiguous. Indeed, every programmer knows how easy it is to write
a syntactically correct program that, when run, does not perform the task
it was intended for. . . Also, different sentences in different programming
languages will produce the same effect:

Example 1.3. The three following statements, respectively in C, Pascal and


COBOL, all sum the contents of variables X and Y and store the result in X.

X += Y

X := X+Y

ADD Y TO X GIVING X
INTRODUCTION 11

In the case of programming languages, associating a semantics to a


given piece of code amounts to produce machine-executable code that,
when run, has the intended effect of the source code. This is, of course,
the purpose of a compiler. To carry out this task, the compiler must first
analyse the structure (i.e., the syntax) of the code, in order to identify key-
words, variable names, and so forth.
Formalising the syntax and the semantics of languages is an old en-
deavour. Dictionaries and grammars for the English language have ex-
isted since the seventeenth century5 . In France, the Académie Française 5
See http://en.wikipedia.org/wiki/
History_of_English_grammars and
has been founded in 1635 with a well-defined mission6 :
http://en.wikipedia.org/wiki/
La principale fonction de l’Académie sera de travailler, avec tout le soin et Dictionary#English_Dictionaries.
6
toute la diligence possibles, à donner des règles certaines à notre langue et à Article 24 of the status of the Academy:
http://www.academie-francaise.fr/
la rendre pure, éloquente et capable de traiter les arts et les sciences.
linstitution/les-missions
The main mission of the Academy will be to labor with all the care and dili-
gence possible, to give exact rules to our language, to render it capable of
treating the arts and sciences

To this end, the Academy publishes a famous Dictionary7 and provides 7


The writing of the ninth edi-
tion is ongoing. The eight edi-
advices on the good usage of the French language8
tion is available on-line http:
Of course, a non-ambiguous, comprehensive and unique formalisation //atilf.atilf.fr/academie.htm.
8
of a natural language’s syntax and semantics seems impossible. For in- See for instance, the list given on:
http://www.academie-francaise.
stance, these notes try to adhere to the so-called Oxford Spelling 9 where
fr/la-langue-francaise/
analyse and behaviour are correct spellings for words that would other- questions-de-langue.
9
wise be spelled analyze and behavior in the United States. Also, the exact R.M. Ritter. The Oxford Guide to Style.
Language Reference Series. Oxford Univer-
accepted meaning of a single term is subject to local variations. sity Press, 2002

On the other hand, formal and mathematically precise definitions of


the syntax of semantics of programming languages are both desirable, and,
hopefully, feasible. Indeed, the syntax of a programming language should:

1. be simple enough that a programmer does not need to refer to a set of


rules too often when producing code.

2. offer a clear structure, so that the code is easily readable and maintain-
able (for instance, the use of functions, blocks, and so forth).

3. be analysable automatically (by means of a program), otherwise no


compiler, no syntax highlighting tool,. . . would be possible.

Finally, the semantics of a programming language should be clearly de-


fined, and non-ambiguous. Otherwise, the program might not have the
effect that the programmer had in mind when writing the code, or the ma-
chine code produced by different compilers from the same source code
might have different effects when run on the same machine.
This short discussion clearly shows the need for a theory of formal lan-
guages, at least for compiler design, the main application we target in these
notes.

1.2 Formal languages

In this section, we give the basics of formal language theory.


12 I N T R O D U C T I O N T O L A N G UA G E T H E O RY A N D C O M P I L I N G

1.2.1 Basic definitions: alphabet, word, language

Let us start with several basic definitions. We first give the definitions, then
comment on them.

Definition 1.4 (Alphabet). An alphabet is a finite set of symbols. We will


usually denote alphabets by Σ. M

Intuitively, an alphabet is the set of symbols that we are allowed to use


to build words and sentences. Because we expect our formal languages to An intuitive justification to the fi-
nite alphabet is that, in real com-
be processed automatically, it is natural to ask that the alphabet be finite.
puters, each memory cell can hold
only a finite amount of information (8, 32,
Example 1.5. The set 64 bits, for instance).

{a, b, c, d , e, f , g , h, i , j , k, l , m, n, o, p, q, r , s, t , u, v, w, x, y, z}

is an alphabet. The set {$, #, !} is an alphabet too. A classical alphabet in the


setting of computer science is {0, 1}. On the other hand, the set of natural
number N for instance, does not qualify as an alphabet because it is not
finite. M

Definition 1.6 (Word). A word on an alphabet Σ is a finite (and possibly


empty) sequence of symbols from Σ. We use the symbol ε to denote the
empty word, i.e., the empty sequence (that contains no symbol). M

We usually denote words by u, v, or w. We use subscripts to denote the


sequence of symbols making up the word, for instance w = w 1 w 2 · · · w n
indicates that the word w is the sequence of symbols w 1 , w 2 ,. . . , w n . In
these notes, we will often use the term ‘string’ instead of ‘word’, because
a sequence of characters is referred to as a ‘string’ in many programming
languages such as C.

Example 1.7. Let ΣC be the alphabet:

{a, b, . . . , z, A, B, . . . , Z, 0, 1, . . . , 9, _}

of all characters which are allowed in C variable names. Then, all words
on ΣC that do not begin with a number and are not C keywords are valid
variable name in C (assuming no limit is imposed on the length of variable
names). M

We denote by |w| the length of a word w, i.e. the number of characters


that it contains. By convention, |ε| = 0. For all i s.t. 1 ≤ i ≤ n, the word
w 1 · · · w i (made up of the i first characters of w) is called a prefix of w and
the word w i · · · w n is called a suffix of w.
Were we dealing with natural languages, the next definition would prob-
ably be that of a sentence, and would be something like: ‘a finite sequence
of words, separated by spaces and punctuation marks, and ending by a
dot’. However, such a definition would give a special status to certain sym-
bols (the ‘space’ and the punctuation symbols), which we have tried to
avoid by giving a very general notion of alphabet, where ‘space’ and punc-
tuation marks can be considered as regular symbols. Actually, for our pur-
pose, such a distinction is useless, so we can directly define the notion of
language:
INTRODUCTION 13

Definition 1.8 (Language). A language on an alphabet Σ is a (possibly


empty or infinite) set of words on Σ. M
It is easy to confuse ε, {ε} and ;.
Since a language is a set, we denote the empty language by the usual The first is a word that contains no
symbol. The second is a language
empty set symbol ;. which is not empty since it contains the
word ε. The last is a language too (empty).
Example 1.9. Here are some examples of languages showing that this def-
inition allows one to capture several interesting problems:

1. The set L Cid of all non-empty words on ΣC (see example 1.7) that do
not begin with a digit, is a language. It contains all valid C identifiers
(variable names, function names, etc) and all C keywords (for, while,
etc).

2. The set L odd of all non-empty words on {0, 1} that end with a 1 is a lan-
guage. It contains all the binary encodings of odd numbers.

3. Similarly to the previous example, the set L () of all words on Σ = {(, )}


which are well-parenthesised, i.e., s.t. each closing parenthesis matches
a previously open and still pending parenthesis, and each open paren-
thesis is eventually closed. For example (()()) ∈ L () , but neither )(( nor
(() do.
This language is also known as the Dyck language, named after the Ger-
man mathematician Walter V O N DY C K (1856–† 1934). It is mainly of
theoretical interest: we will rely on it several times later to discuss the
kind of formalism we need to recognise languages of expressions that
contain parenthesis, such as the language L alg defined in the next item:

4. The set L alg of all algebraic expressions that use only the x variable, the
+ and ∗ operators and parenthesis, and which are well-parenthesised,
is a language on the alphabet Σ = {(, ), x, +, ∗}. For instance ((x+x)∗x)+x
belongs to this language, while )(x + x does not, although it is a word
on Σ.

5. The set LC of all syntactically correct C programs is a language.

6. The set L Cterm of syntactically C programs that terminate whatever the


input given by the user is a language.

All these examples are more or less related to the field of compiler de-
sign, but we will provide examples from other fields of application later.

1.2.2 The membership problem

Considering these examples, it is clear that a very natural problem is to


test whether a word w belongs to a given language L:

Problem 1.10 (Membership). Given a language L and a word w, say whether


w ∈ L.

Being able to answer such a question in general (i.e., for all languages
L) seems to solve meaningful questions. Let us come back to our examples
14 I N T R O D U C T I O N T O L A N G UA G E T H E O RY A N D C O M P I L I N G

to illustrate this. In the first case, testing whether w ∈ L Cid allows one to
check that w is a valid C identifier. Testing whether a binary number be-
longs to L odd allows one to check whether it is odd or even. The member-
ship problem for L alg and LC amount to checking the syntax of expressions
and C programs respectively, a task that is important when compiling. Ob-
serve that all these criteria are purely syntactical. On the other hand, the
last example, L Cterm seems more complex, because the criterion for a word
w to belong to L Cterm is that w is string encoding a terminating C program,
i.e., a semantic criterion, yet the definition of L Cterm makes perfectly sense
and is mathematically sound.
Of course, we are particularly interested in solving the membership
problem automatically. What we mean there is that, given a language L,
we want to build a program that, reads on its input any word w, and re-
turns ‘yes’ iff w ∈ L.
What we will do mainly in these notes is to develop formal tools to pro-
vide an answer to that question. Let us already try and build some in-
tuitions by highlighting characteristics of programs that would recognise
each of those languages. In all the cases, we assume that the word for
which we want to test membership is read character by character, from left
to right, i.e., if w = w 1 w 2 · · · w n , then the program will first read w 1 , then
w 2 , and so forth up to w n . When w n is read, the program must output its
answer.

1. In the case of L Cid , the program must check that w 1 ∈ {a, b, . . . , z,


A, B , . . . , Z }, then that all subsequent characters are in ΣC . Observe that
this program needs only one bit of memory to operate, as it only needs
to remember whether the character it is currently reading is the first
one or not.

2. In the case of L odd , the program must only check that all characters it
receives are 0’s and 1’s, and that the last one is 1. This does not even
require any memory.

3. The case of L () is a bit more difficult, because reading a single paren-


thesis is not sufficient to tell whether it is correct or not. Now, the pro-
gram needs to remember an information about the parenthesis it has
met along the prefix read so far.

More concretely, to check whether an expression is well-parenthesised,


we can write a program using a counter c that:

• increments c each time it reads an opening parenthesis.

• checks that c > 0 each time it reads a closing parenthesis, decre-


ments c if it is the case and returns ‘no’ otherwise.

• returns ‘yes’ at the end of the word if and only if c = 0.


Observe that the memory needed
to test membership of a word to the
It is easy to check that, at all times, c contains the number of pending languages of cases 1 and 2 is finite,
open parenthesis. Observe that, since we have fixed no bound on the while it is unbounded in case 3. It seems
difficult to make a program to recognise
lengths of the words in L () , the values the program needs to store can be
L () that uses a bounded memory only. In-
arbitrarily large. deed, we will give, in Chapter 3 formal ar-
guments showing that it is not possible: L ()
is a so-called context-free language, while
L Cid and L odd belong to the ‘easier’ class
of regular languages.
INTRODUCTION 15

4. Checking whether a string is a correct algebraic expression is at least


as hard as recognising L () . In addition to checking that the expression
is well-parenthesised, one must also check each operation + or ∗ has
exactly two well-formed operands, which can, in turn, be complex ex-
pressions. This suggests that a recursive approach might be needed,
where the base cases are simple expressions containing only one vari-
able, or one constant. In Chapters 5 and 6, we will develop techniques
to generate such recursive programs (called parsers) that can answer
the membership program for a broad class of languages to which L alg
belongs.

5. Checking the syntax of a program written in a high-level language, such


as C, is part of what a compiler should do, and techniques to do so will
be thoroughly presented in these notes. Let us note however, that the
membership problem for w ∈ LC is at least as hard as the membership
for L Cid , L () and L alg as a C program can obviously contain C identifiers
and arithmetic expressions. Also, we must check that curly brackets ({,
}), used to delimit blocks, match; that each else corresponds to an if,
etc. When we write ‘such a program
does not exist’, we do not mean
6. The last language L Cterm models a very hard question: ‘can we write ‘is not known yet’, but rather that
a program that tests whether any program written in a given language there is a provable mathematical impos-
sibility to the existence of such a pro-
terminate?’ Obviously, such a termination tester would be an invalu- gram. This shows that, as surprising as it
able tool for all developers, who have already struggled with unexpected may seem, there are (natural, meaningful
and well-defined) problems that cannot be
infinite loops. Unfortunately, the answer to that question is negative:
solved by a computer!
such a program does not exist. This is proved formally in a ‘Computabil-
ity and Complexity’ course.

In these notes, we will present the mathematical and practical tools that
are necessary to attack those questions.

1.3 Application: compiler design

It should now be clear from the previous examples that a first class appli-
cation to formal language theory is the design of compilers. Remark that here, we are using
the term ‘language’ with the same
A compiler 10 is a program that processes programs and translates a pro-
meaning as in ‘programming lan-
gram P s (the source program, or source code) written in a language L s (the guage’, and not in the formal sense of Def-
source language) into an equivalent program P t (the target program) writ- inition 1.8.
10
A. Aho, M. Lam, R. Sethi, and Ullman
ten in a language L t (the target language). The compiler, being a program J. Compilers: Principles, Techniques, &
itself, might be written in a third language. As an example, gfortran11 Tools. Addison-Wesley series in computer
science. Pearson/Addison Wesley, 2007
is a compiler that translates FORTRAN code into (for instance) Intel i386
11
See the gfortran home page at:
machine code, and is written in C. https://gcc.gnu.org/fortran/.

1.3.1 Anatomy of a compiler

In order to perform its translation, a compiler proceeds in several steps


that we detail now. Those steps can be split in two successive parts:

1. First, the analysis phase builds an abstract representation of the pro-


gram structure. This phase consists in first performing a lexical analy-
sis, or scanning; then a syntactic analysis, or parsing (more details be-
16 I N T R O D U C T I O N T O L A N G UA G E T H E O RY A N D C O M P I L I N G

low). Syntax errors are detected and reported during this phase. Fi-
nally, the analysis part usually contain a semantic analysis, that per-
forms, a.o., type checking, and reports typing errors.

2. Second, the synthesis phase translates the abstract representation of


the program into the target language. Several optimisations can occur
during this phase.

Let now detail these different steps, by referring to Listing 1.1.

1.3.2 Scanning

When teaching a new programming language to someone, one usually


starts by introducing the basic blocks from which a program is built: iden-
tifiers for variable and function names; (reserved) keywords; special signs
such as {, } or ;, and so forth. The first task of a compiler is thus to split the
input string into a sequence of meaningful sub-strings that will be passed
to the next step of analysis. Roughly speaking, this will be the work of the
scanner. An illustration of the effect of the scanner on the code in List- The scanner is often called ‘lexi-
cal analyser’. In these notes, we
ing 1.1 is shown hereunder12 . Each gray box delineates one of the sub-
will indifferently use both expres-
strings that the scanner should identify: sions. Some authors, however, consider
that scanning and lexical analysis are dif-
ferent processes: scanning consists only
int i = 5 ; in dividing the input into relevant sub-
strings, while lexical analysis also performs
the tokenisation (see hereunder for a defi-
nition of token).
int f ( int j ) { 12
We have skipped the first line
#include <stdio.h> because this
int i = j ;
line is actually never part of the input
return i + 1 ; of the compiler. Before being compiled,
} C code is processed by a pre-processor,
that handles so-called pragma directives,
i.e. those keywords that begin with #.
In particular, handling #include direc-
int main ( ) { tives amounts to replacing them by the
content of the file that is included. See
printf ( "Hello World !" ) ;
https://gcc.gnu.org/onlinedocs/
printf ( "%d %d" , i , f ( i + 1 ) ) ; cpp/Pragmas.html for further reference.

return 0 ;
}

Observe how white spaces are ignored in this example. Here, we use the
term ‘white space’ in a broad sense: it also includes tabulation characters
or end-of-line. Those white space symbols are relevant to the compiler
only to separate successive sub-strings13 . Indeed, the two following code 13
This is the case in most programming
languages. A notable exception is the
excerpts have the same effect:
(prank) programming language Whites-
pace, where ‘Any non white space charac-
int i = 5 ;
ters are ignored; only spaces, tabs and new-
lines are considered syntax’. See http://
compsoc.dur.ac.uk/whitespace/.
int

=
INTRODUCTION 17

5
;

However, the scanner does not only split the input in a sequence of
sub-strings as illustrated above, but also performs a preliminary analysis
of those sub-strings and determine their type. For instance, what matters
about the j sub-string in lines 4 and 5 is not the j character, but rather
(1) the fact that j is identified as a variable identifier and (2) the fact that
the same identifier occurs in lines 4 and 5 but not in other lines where
variable identifiers appear (indeed, replacing the two occurrences of j in
those lines by LukeIAmYourFather, or any other legal variable name will
yield a compiled code with exactly the same effect). Also, reserved key-
words (while, if,. . . ), operators (=, <=, !=,. . . ) and special symbols ({, ;,
. . . ) can be identified as such.
To sum up the role of the lexical analyser is not only to split the input
into a sequence of sub-strings, but to relate each of those sub-strings to its
lexical unit. A lexical unit is an abstract family of sub-strings, or, in other
words, a language, that corresponds to a peculiar feature of the language.
The definition of lexical units is a bit arbitrary and depends on the next
steps of the compiling process. For instance, for the C language, we could
have as lexical units:

• identifiers

• keywords

• ...

or we could refine this list of lexical units and have:

• identifiers,

• the while keyword,

• the for keyword,

• ...

where each keyword is its own lexical unit. It should be clear that each
lexical unit in the lists above corresponds to a set of words, i.e., a language.
It is common practice to associate a unique symbolic name to each lexical
unit, for instance a natural number, or a name such as ‘identifier’. As we
are about to see, those values will constitute a part of the scanner’s return
values.

We are now ready to introduce several definitions that somehow for-


malise the discussion about scanning we have had so far 14 : 14
A. Aho, M. Lam, R. Sethi, and Ullman
J. Compilers: Principles, Techniques, &
Definition 1.11 (Lexeme). A lexeme is an element of a lexical unit. M Tools. Addison-Wesley series in computer
science. Pearson/Addison Wesley, 2007
Recall that a lexical unit is a language, this is why it makes sense to
speak about ‘an element of a lexical unit’. A lexeme is thus one of the sub-
strings of the input that has been recognised (or matched) by the lexical
analyser.
18 I N T R O D U C T I O N T O L A N G UA G E T H E O RY A N D C O M P I L I N G

Definition 1.12 (Token). A token is a pair (id, att), where id is the iden-
tifier of a lexical unit, and att is an attribute, i.e. an additional piece of
information about the token. M

Tokens are what the scanner actually returns and provides to the next
step of the compiling process. The attribute part of the token is optional:
it can be used to provide more information about the token, but this is
sometimes not needed.
A typical use of the attribute occurs when the matched lexeme is an
identifier. In this case, the scanner must check whether this identifier has
been matched before, and, if it is the case, to return a piece of information
that links all occurrences of the same identifier throughout the code. The
scanner achieves this by maintaining, at all times, a so-called symbol table.
Roughly speaking the symbol table records, at all times, all the identi-
fiers that the scanner has met so far. Whenever the scanner matches a new
lexeme which is an identifier, it looks it up in the symbol table. If the lex-
eme is not found, the scanner inserts it in the table. Then, the index of the
lexeme in the table can be used as a unique symbolic name for this lex-
eme, which can be put in the attribute part of the token that the scanner
returns.

Example 1.13. Let us consider the simple code excerpt:

1 int i = 5;
2 int j = 3 ;
3 i = 9 ;

Initially, the symbol table is empty. When the lexeme i in line 1 is matched,
the scanner inserts it into the first entry (index 0) of the symbol table, and
returns the token (identifier, 0). Here, identifier denotes the symbolic name
for identifiers, and we have used the index of the lexeme in the symbol
table as the attribute of the token, which is a unique symbolic name for the
identifier i. When j in line 2 is matched, it is inserted into entry number 1
of the symbol table, and the token (identifier, 1) is now returned. So far, the
symbol table has the following content:

index lexeme
0 i
1 j

Then, when i is matched in line 3, it is found in index 0 of the symbol


table, and the scanner returns again (identifier, 0), thereby indicating to the
parser that this identifier is the same as the one that has been matched in
line 1. The symbol table is not altered by this step. M

Observe that this technique is not completely satisfactory: if we want


to apply it to the example of Listing 1.1, we will run into trouble when
matching the i in line 5, which will be wrongly identified as the identifier
of the same variable as the one declared in line 2. The problem with the
way to handle symbol table we have described above is that it does not
allow to cope with scoping. This is, however, hard to achieve at the level of
the scanner which has no global view on the input code, and cannot tell,
INTRODUCTION 19

for instance, a variable declaration from a variable use. When the symbol
table should accommodate scoping, we will defer its creation to the parser
(see next section). The technique we have described so far, however, works
well when no scoping is required, in simple programming languages for
instance.

Let us mention another possible use of the symbol table: it can also be
exploited to match keywords, and avoid that keywords are used as iden-
tifiers. This is achieved by initialising the symbol table with all possible
keywords in the first entries of the table. This allows one to treat keywords
in a similar fashion to identifiers, which often makes the scanner easier to
implement.

Example 1.14. For instance, assume we are building a scanner for a lan-
guage with three keywords: while, for and if. We initialise the symbol
table this way:

index lexeme
0 while
1 for
2 if

Thus, keywords are present in lines 0–2 of the table, and identifiers will be
inserted in the following lines. Assume the scanner matches the lexeme
abc. It will be compared to all line in the symbol table, and inserted since
it is not present:

index lexeme
0 while
1 for
2 if
3 abc

The scanner returns (identifier, 3), which means that a genuine identifier
has been matched, since the index 3 is not among the lines 0–2 that are de-
voted to keywords. Now, assume the scanner matches for, which could be
an identifier since the lexeme contains only letters. The scanner will find
this lexeme in line 1 of the symbol table and return: (identifier, 1). Since
now the attribute of the token is ≤ 2, the parser can identify this token as
the keyword for. M

The scanner: summing up

The scanner is the first part of the compiling process. Its role is
to split the input into a sequence of lexemes (Definition 1.11) that
are associated to lexical units and returns a sequence of tokens (Def-
inition 1.12). It can be responsible for inserting identifiers into the
symbol table, that contains all identifier matched so far, and possi-
bly all keywords. The symbol table is thus used as a communication
medium between the different compiling phases.
20 I N T R O D U C T I O N T O L A N G UA G E T H E O RY A N D C O M P I L I N G

Now that we have a clear view of what the scanner should do, it re-
mains to explain how to do it. Namely, we need to answer the following
questions:

1. How to specify lexical units? So far, we have used vague English descrip-
tions, like: ‘all words starting with a letter and followed by an arbitrary
number of letters and digits’. This is clearly not satisfactory. In Chap-
ter 2, we will introduce regular expressions to this end.

2. How to build in a systematic ways, programs that match lexemes against


the description of lexical units? In Chapter 2 again, we will introduce
the central notion of finite automaton to this end.

1.3.3 Parsing

It should now be clear that the duty of the scanner is to perform a local
analysis of the code: to match a lexeme against a lexical unit, the scanner
analyses a sequence of contiguous characters in the code. Such a local
analysis is not sufficient to analyse all features of programming language.
That for instance the matching of parenthesis in arithmetic expressions,
like:

( ( x + y ) * 3 )

Checking that the first (opening) parenthesis matches the last (closing)
one clearly requests a global view on the piece of code under analysis.
Building (and, to some extent, analysing) such a global and abstract
representation of the code is the task of the parser. To help us build an A binary operator is one that has
two arguments, such the ‘−’ oper-
intuition of what such an abstract representation could be, let us consider
ator in 5 − 3, where the two argu-
the restricted case of arithmetic expressions that can contain: (1) paren- ments are 5 and 3. A unary operator has
thesis (possibly nested); (2) identifiers (i.e., variable names); (3) natural only one argument, such as ‘−’ in the ex-
pression −5.
numbers; (4) the +, -, / and * binary operators; (5) the - unary operator. A
textual description of what ‘a syntactically correct expression’ is, could be
given in an inductive fashion. For instance, a word is a correct expression
if and only if it has one of the following forms:

1. an identifier. For instance, BeamMeUpScotty is correct;

2. a natural number. For instance 42 is correct;

3. ‘(E )’, where E is a correct expression. For instance, (BeamMeUpScotty)


is correct;

4. ‘−E ’, where E is a correct expression. For instance -42 is correct;

5. E 1 op E 2 , where E 1 and E 2 are correct expressions, and op is either + or


- or * or /. For instance, -42+(BeamMeUpScotty) is correct.

The abstract syntax tree In addition to checking whether a program is


syntactically correct, the parser must also build some kind of formal object
that represents the structure of the input. This object will be exploited by
the next steps of the compiling process.
INTRODUCTION 21

The most usual object that parsers build is the abstract syntax tree (AST
for short). As the name indicates, this object is a tree that reflects the nest-
ing of the different programming constructs. As an example, the AST of
the expression i+y*5 could be:
+

Id *

i Id Cst

y 5
Similary, the AST of the following C while loop:

1 while(x>5) {
2 x = x-1 ;
3 y = 2*y ;
4 }

could be:
Prog

Statement St-tail

While ε

Cond Block

>
Statement St-tail
Id Exp
=
x Cst Statement St-Tail

Id Exp
5 = ε

x -
Id Exp

Id Cst
y *
x 1
Cst Id

2 y

The parser and the symbol table In the previous section, we have dis-
cussed the creation of symbol table entries by the scanner, which is a tech-
nique that works fine when the compiler must not handle scoping, as in
Example 1.13. However, in a realistic example such as 1.1, the scope of
variables must clearly be taken into account. Indeed, the i variable used
in line 5 is not the same as the one declared in line 2, because the latter
occurs in the global scope, while the former lies in the block containing
the code of function f().
22 I N T R O D U C T I O N T O L A N G UA G E T H E O RY A N D C O M P I L I N G

To cope with the scoping of identifier names, the compiler can manage
several symbol tables, one for each scope, that contains all the identifiers
from the scope. Since the scanner has no global view on the code and
can hardly detect scopes, we ask the parser the populate these symbol ta-
bles. All the scanner does is to return an information indicating that it has
recognised an identifier, together with the name of the identifier.
All those symbol tables are arranged in a tree, in order to reflect the
nesting of scopes (see example below). When the parser obtains from the
scanner a token corresponding to an identifier whose name is v, it looks
up v in several symbol tables: first, the current symbol table, then—if the
identifier has not been found—its father, and so forth, up to the root that
corresponds to the symbol table of the global scope. This is illustrated in
the following example:

Example 1.15. Let us come back once again to the code excerpt of List-
ing 1.1. Initially, an empty symbol table T0 is created for the global scope.
Then, the parsing goes, a.o. through the following steps (we focus on the
handling of identifiers):

1. When the lexeme i is matched on line 2, the scanner returns the name
i to the parser, which looks it up in the current (and only) symbol table
T0 . Since T0 is still empty, i is inserted into T0 :

T0
Current index lexeme
0 i

In order to identify uniquely each variable15 , the parser can associate 15


This will clearly be necessary when gen-
erating code.
it with a pair (T , i ), where T is the name of a symbol table, and i is the
line in the symbol table. In this case, our variable i would be identified
by (T0 , 0).

2. Then, when reaching line 4, the parser detects the declaration of a new
function, and thus creates a new scope. Concretely, this amounts to
creating a new symbol table T1 , which is inserted in the tree of symbol
tables as a son of T0 :

T0
index lexeme
0 i

T1
Current
index lexeme

3. Then, the parser can insert the declaration of j (the parameter of f) as


a new variable in T1 :
INTRODUCTION 23

T0
index lexeme
0 i

T1
Current index lexeme
0 j

4. Next, moving to line 5, the parser first inserts a fresh i variable, but this
time in T1 . As we will see, it will predate variable i in T0 within the
scope of function f:

T0
index lexeme
0 i

T1
index lexeme
Current
0 j
1 i

5. On the same line, the parser detects the use of variable j. It first looks
up j in T1 (which is the current symbol table) and finds it. Thus, this
occurrence of j will be identified with (T1 , 0).

6. Then, the parser finishes parsing f, and detects the use of variable named
i in line 6. It first looks up in T1 , and finds an occurrence of i. Thus, it
identifies this i with (T1 , 1), i.e., the same i as in line 5.

7. When leaving the scope of f, the parser changes the current symbol
table to T0 . Note however that T1 is kept in memory for the next steps
of compiling.

8. In line 9, the parser detects a new scope and inserts a new symbol table
T2 as a child of T0 , since it is the current table. T2 becomes the current
table:

T0
index lexeme
0 i

T1 T2
Current
index lexeme index lexeme
0 j
1 i
24 I N T R O D U C T I O N T O L A N G UA G E T H E O RY A N D C O M P I L I N G

9. Finally, in line 11, the parser detects the use of a variable called i, and
looks it up in the tree, starting from the current symbol table T2 (which
contains no entry), then moving up the tree towards the root. Variable i
is eventually found in the root T0 , so it is correctly identified with (T0 , 0),
i.e., the one which is declared in line 2.

The parser: summing up

The parser is the second part of the compiling process. It receives a


sequence of tokens from the scanner. Its role is to check whether this
sequence of tokens respects a given syntax, and, when it is the case, to
produce an abstract representation (such as the abstract syntax tree)
of the program’s structure, that will be used by the next phases of the
compiling process. The parser can also populate the symbol table, in
particular when scoping must be taken into account.

As for the scanner, we can now identify several questions that we must
solve in order to build parsers:

1. We have said that the parser must check the syntax of its input, but how
do we specify this syntax? We will see, in Chapter 3 that grammars can
be used for that purpose, just as we have used regular expressions in the
case of scanners.

2. How can we build a parser from a given grammar, and what kind of
machine will it look like? In Chapter 4, we will see that pushdown au-
tomata—an extension of the finite automata from Chapter 2—are ab-
stract machines that can be used to formalise and build parsers. We will
review, in Chapter 5 and Chapter 6, several techniques to build parsers
that are efficient in practice.

1.3.4 Semantic analysis

Now that the parser has checked that the syntax of the input code is cor-
rect, and has built an abstract representation of this code, it is time to start
analysing what the code means, in order to prepare its translation. This
is the aim of semantic analysis. Of course, semantic analysis is highly de-
pendent on the input language, this is why we will stay very general when
introducing it. Yet, we can identify several essential points to deal with:
scoping, typing, and control flow.

Scoping During the semantic analysis phase, the compiler can analyse
the links that exist between the declaration(s) of a name (if any) and the
uses of this name throughout the code. For instance, the—syntactically
correct—code in Figure 1.1 could raise an error during semantic analysis, 1 int main() {
2 i = 3 ;
because variable i is used undeclared (although the name i is declared 3 int i ;
later as an integer variable). 4 }

Observe that the control of the scoping can already be performed dur- Figure 1.1: A syntactically correct excerpt
ing parsing, thanks to the symbol table (see previous section). of C code that raises an error during se-
mantic analysis.
INTRODUCTION 25

Type checking and type control Each name (variable name, function name,
type name,. . . ) in a program is associated with a data type (or, simply, a
type) that describes uniquely how this name can be manipulated. During
semantic analysis, the compiler determines (if possible) the type of each
expression, and checks that the operations on those expressions are con-
sistent with their types. Figure 1.2 shows the typical problems that can
occur when compiling an assignment in C. The assignment in line 9 is not
problematic, because the type of the right-hand side of the assignment is
1 struct S {int i;} ;
int, which is the same as the type of the variable j. Indeed, the sum of the 2 int main() {
int variable i and the int constant 4 is an int. The second assignment 3 int i, j ;
4 struct S s ;
(line 10) raises a warning (i.e., a non-blocking error): the right-hand side
5 struct S * p = &s ;
is a pointer, but assigning a pointer to an int is allowed in C, and a con- 6
version is implicitly applied by the compiler. The last assignment (line 11) 7 i = 3 ;
8
raises an error: the type of the right-hand side is struct S, and the com- 9 j = i + 4 ;
piler does not know how to convert such an object to an int. 10 j = p ;
11 j = s ;
In order to manage types, the compiler can add information to the AST
12 }
that has been built during parsing. This operation of adding information
Figure 1.2: Three syntactically correct as-
to a tree is called ‘decoration’16 . As an example, Figure 1.3 displays the
signments with different behaviours of the
decorated AST of the C statement x = sum * 1.5, where x and sum are semantic analyser. The first (line 9) is not
integer variables. Since one of the terms of the sum * 1.5 product is a problematic. The second (line 10) raises a
warning because a pointer is cast to an in-
float, the compiler assigns this type to the expression. This allows it to teger. The last (line 11) is not allowed: no
detect that the result will need to be truncated when copying it to x (and conversion is possible.
16
perhaps to raise a warning, depending on the compiler and its options). Of Which suggests that ASTs are most prob-
ably Christmas trees. . .
course, such information will be crucial when generating the target code
for this assignment.

Figure 1.3: A decorated AST with typing in-


Statement, int
formation.

=, int

Id, int Exp, float

x, int *, float

Id Cst

sum, int 1.5, float

Control flow The term ‘control flow’ refers to the order in which the in- 1 #include <iostream>

structions of a program are executed. Conditionals (if), jumps (goto), 2


3 int main () {
loops (for, while,. . . ) and function calls can be used to alter the control 4 for(int i=1;i<10;++i) {
flow of a program, in an intricate way. As a consequence, it is possible to 5 infor: std::cout << i ;
6 std::cout << std::endl ;
write syntactically correct programs that contain semantic error related to 7 }
control flow. 8
For instance, Figure 1.4 shows a C++ program that does not compile17 9 goto infor ;
10 }
because the goto in line 9 jumps inside of the for. However, the i vari-
able exists only inside the scope of the for: it does not exist anymore when Figure 1.4: A C++ program which is syntac-
tically correct, but contains a semantic er-
reaching the goto, so jumping inside the for and executing std::cout << i ror: the goto bypasses the definition of the
makes no sense: what would be the value of i? Other problems can be de- i variable, which exists only in the scope of
the for.
tected by analysing the control flow of a program, for instance, detecting 17
On LLVM 6.0.
26 I N T R O D U C T I O N T O L A N G UA G E T H E O RY A N D C O M P I L I N G

dead code (code that can never be executed), etc.


To analyse the control flow of a program, the semantic analyser can
build an abstract representation thereof, which is the control flow graph,
a notion introduced by Allen in the early seventies18 . A control flow graph 18
Frances E. Allen. Control flow analysis.
SIGPLAN Not., 5(7):1–19, July 1970. ISSN
is a directed graph whose nodes represent the different statements of the
0362-1340. D O I : 10.1145/390013.808479
program, and whose edges represent the possible jumps in the control
flow. So, if a statement s 2 can be executed after a statement s 1 (whether
s 2 is on the line next to the line of s 1 or not), there is an edge from s 1 to s 2 .
It can thus occur that some node has several input edges. A node can also
have more than one output edge in the case where this node corresponds
to a conditional. In the case of an if, for instance, there are two output
edges: one for the ‘then’ block, and one for the ‘else’.
Thus, each path in the graph corresponds to one potential execution
of the program. Of course, some path in the graph will be spurious, and
do not correspond to an actual execution of the program. For example,
in the case of a while, the infinite path that never leaves the while will
be present in the CFG, but it might be the case (hopefully!) that the loop
always terminates19 . 19
Remember from the Computability and
Complexity course that the halting prob-
The control flow graph can be built from the AST. Figure 1.5 illustrates
lem is undecidable, hence detecting such
this construction for a typical if statement of the form if B then T else spurious infinite loops is not possible in
E , and which is followed by additional statements represented by N . The general!

edges of the AST are shown in gray. Additional edges (dashed) are first
inserted in the AST to represent the control flow of the if. Similar treatment
is applied recursively in the B , E , T and N sub-trees. Then, the edges of
the AST are removed, and one obtains the typical diamond shape of the
CFG of an if statement.

statements
B

if-then-else N
yields: T E

B T E N

Figure 1.5: The construction of the control


flow graph of a typical if statement for its
AST. B stands for the condition of the if; T
for the ‘then’ block; E for the ‘else’ block;
Semantic analysis: summing up and N for the statements that follow the if
in the program.
The semantic analysis phase receives the AST built by the parser
and performs a (limited) analysis of the semantics of the input code
represented by this AST: checking scoping of variables, checking for
type consistency and detecting (potentially implicit) type conver-
sions, checking for control flow inconsistency. The output of the se-
mantic analysis phase is a decorated AST (and, potentially, a control
flow graph) that, together with the symbol table built during parsing,
INTRODUCTION 27

contains all the necessary information for the synthesis phase of the
compiler.

1.3.5 Synthesis

S Y N T H E S I S is the phase during which the outcome of the compiler is ac-


tually generated. In most cases, this will be an executable program, writ-
ten in some low-level language such as machine code. However, this is
not always the case: these notes, for instance, have been typeset using
the LATEX 2ε word processing system20 , where one first describes the doc- 20
Leslie Lamport. LATEX: A Document
Preparation System. Addison-Wesley, 1986.
ument using a markup language, and then compile this source into a PDF
ISBN 0-201-15790-X
file.
Again, the actions that should be performed during the synthesis part
is highly dependent on the source and target languages. So, we will only
identify several generalities about synthesis in this section.
The input to the synthesis phase is the decorated AST built during the
previous phase of the compiling process. It can also be accompanied by
a control flow graph, should it have been built during semantic analysis.
They should contain all the necessary information to generate the output
code.

Code optimisation Before the output code is actually generated, the com-
piler might perform several optimisations on the code. Typical optimisa-
1 for(int i=0; i<n; ++i) {
tions include, but are not limited to:
2 if (x > 2)
3 printf("%d", i) ;
Control flow optimisation Modifies the control flow graph in order to make 4 else
the resulting code more efficient. Figure 1.6 shows an example. In the 5 printf("%d", i+1) ;
6 }
first version of the loop, the conditional x>2 is tested at each iteration of
the loop. However, the loop does not modify x, so this condition will be Becomes:

either always true along the loop, or always false. The second version 1 if (x>2) {
of the loop is therefore more efficient. 2 for(int i=0; i<n; ++i) {
3 printf("%d", i) ;
4 } else {
Loop optimisation Consists in making loops more efficient, for instance 5 for(int i=0; i<n; ++i) {
by unravelling them when they are executed a constant number of times. 6 printf("%d", i+1) ;
7 }
8 }
Constant propagation The compiler can try and detect variables that keep
a constant value, and replace the occurrences of these variables by those Figure 1.6: An example of control flow op-
timisation. The second code excerpt guar-
constant values, thereby avoiding unnecessary and costly memory ac- antees to test the condition x > 2 only
cesses. once.

Promotion of parameters to references Function parameters that are passed


by value and are not modified by the function can be transformed into
references, to avoid copies of values when calling the function. For in-
1 struct S {
stance, in Figure 1.7, the parameter x can be promoted to a reference 2 // lots of fields!
(in C++), and the function’s signature becomes f(struct S &x). Then, 3 } ;
4
calling f does not request anymore to copy the whole content of the x 5 f(struct S x) {
structure. 6 // no modification of x
7 }

Figure 1.7: An example of a C++ function


where the parameter x can be safely pro-
moted to a reference to avoid a copy of the
whole structure.
28 I N T R O D U C T I O N T O L A N G UA G E T H E O RY A N D C O M P I L I N G

Intermediate language Because generating efficient machine code is highly


dependent on the target architecture, and might request specialised tech-
niques, it makes sense to introduce an intermediary step in the compiling
process: instead of compiling the high-level language (say, C) into ma-
chine code (say, x86 machine language), the compiler generates code in
an intermediate language, which is close to machine language, but per-
mits the use of some level of abstraction. The intermediate language is
afterwards compiled to the target language. This technique allows one to
separate cleanly the difficulties that are related to the input language, and
those that are linked to the target language.
One of the earliest implementation of this concept is the p-code, intro-
duced by Wirth in 196621 . P-code has been used mainly as an intermedi- 21
Niklaus Wirth and Helmut Weber. EU-
LER: A generalization of ALGOL, and its
ary language for Pascal compilers: the compiler would compile the Pascal
formal definition: Part II. Commun. ACM, 9
program into p-code, which could later be compiled to machine code, or (2):89–99, February 1966. ISSN 0001-0782.
even interpreted at the level of the operating system22 . D O I : 10.1145/365170.365202
22
This was the case with the UCSD p-
This principle is still exploited in the LLVM compiler, which is probably system in 1978, just like Java bytecode is
the standard compiler today on Unix-like platforms23 . In LLVM, the input today interpreted by a virtual machine!
23
code is first translated to the so-called intermediary representation, which From the LLVM website (http://www.
llvm.org): ‘The LLVM Project is a col-
is later compiled into machine code, after several optimisation steps. This lection of modular and reusable compiler
allows one to rapidly develop efficient compilers for new languages: one and toolchain technologies’. Note that, on
many platforms such as MacOS X, calling
simply ought to write the translation to the LLVM intermediary represen-
gcc actually calls LLVM.
tation (which should be easier than performing a direct translation to ma-
chine code thanks to the features of the intermediary language, see below),
and benefit from all the optimisations of the LLVM code generator.

Example 1.16. For instance, consider the following C code excerpt:

1 int a, b, c ;
2 if (a > b)
3 c=1 ;
4 else
5 c=2 ;

Then, a possible LLVM intermediary representation24 would be: 24


This code is not actually LLVM IR since
the assignement %c = 1 is not a valid
1 %tmp = icmp sgt i32 %a, %b LLVM IR instruction. Furthermore, we are
assigning the register %c twice!. We will see
2 br i1 %tmp label %iftru, label %iffls
how to fix this later.
3 iftru: %c = 1
4 br label %end
5 iffls: %c = 2
6 br label %end
7 end:

As can be seen from this example, the LLVM intermediate language is pretty
close to a classical machine language, with very low-level instruction such
as icmp sgt i32 to compare two integers on 32 bits, br i1 for a condi-
tional jump, or br for an unconditional jump. But this language allows one
to use as many virtual registers (whose name begin with %) as desired. It
is thus easier to generate LLVM intermediate language than machine lan-
guage for a machine with a fixed (and limited) number of registers. M
INTRODUCTION 29

1.4 Operations on words and languages

Let us close this introduction by giving some technical preliminaries that


will be used throughout these notes.

1.4.1 Operations on words

We first describe several operators that can be used to combine different


words. They are all variations on the notion of concatenation:

Definition 1.17 (Concatenation of two words). Given two words w = w 1 w 2 · · · w n


and v = v 1 v 2 · · · v ℓ , the concatenation of w and v, denoted w · v, is the
word:

w ·v = w1 w2 · · · wn v1 v2 · · · vℓ

By convention, ε·w = w·ε = w, for all words w. In particular ε·ε = ε. M

For example, aa · bb = aabb and aa · ε = ε · aa = aa. Observe that the


concatenation is an operator on words, which is non-commutative, i.e.,
w 1 ·w 2 ̸= w 2 ·w 1 in general; but associative, i.e. w 1 ·w 2 ·w 3 = (w 1 ·w 2 )·w 3 =
w 1 · (w 2 · w 3 ) for all words w 1 , w 2 , w 3 . The empty word ε is a neutral for
concatenation. Concatenation behaves like a non-
commutative product operator,
Based on this notion, we can introduce another notations. For all natu-
such as matrix multiplication
ral numbers n, w n denotes the word obtained by concatenating n copies for instance. This justifies the ‘power
of w: notation’ w n to denote the concatenation
of n copies of w, just like A n denotes
wn = |w · w ·{z
w · · · w} {z· · · × A} for a matrix A. Observe
|A × A ×
n times
n times that, w n · w m = w n+m , as expected. In
particular, for all n, w n = w n+0 = w n · w 0 ,
By convention, w 0 = ε for all words w.
which explains why w 0 = ε.

1.4.2 Operations on languages

We can lift the concatenation operator to sets of words, i.e., languages. In-
tuitively, the concatenation of two languages is a new language that con-
tains all the words obtained by concatenating one word from the former
language with one word from the latter:
By reading this definition carefully,
Definition 1.18 (Concatenation of languages). Let L 1 and L 2 be two lan- one realises that the empty lan-
guage ; is not a neutral for lan-
guages. Then, their concatenation, denotes L 1 · L 2 is the language: guage concatenation. Indeed, assume
L 1 = ;, and consider L 1 · L 2 . For a word
L1 · L2 = {w 1 · w 2 | w 1 ∈ L 1 and w 2 ∈ L 2 } w to belong to L 1 · L 2 , it must have a pre-
fix which is a word of L 1 . However, there
M is no word in L 1 , so, no word belongs to
L 1 · L 2 ; that is, L 1 · L 2 = ;. However, {ε} is:
For example, if L 1 = {I love , I hate }, and L 2 = {compilers, L · {ε} = {ε} · L = L for all languages L.
chocolate}, then L1 · L2 = {I love compilers, I love chocolate,
I hate compilers, I hate chocolate}.
On top of language concatenation, we can introduce several other no-
tations:

1. For all languages L, for all natural numbers n, L n is the language con-
taining all words obtained by taking n words from L an concatenating
them:

Ln = {w 1 w 2 · · · w n | for all 1 ≤ i ≤ n : w i ∈ L}
30 I N T R O D U C T I O N T O L A N G UA G E T H E O RY A N D C O M P I L I N G

For example, if L = {a, b}, then L 3 = {aaa, aab, aba, baa, abb, bab,
bba, bbb}.

2. For all languages L, the Kleene closure of L, denote L ∗ is the language


containing all words made up of an arbitrary number of concatenations
of words from L:

L∗ = {w 1 w 2 · · · w n | n ≥ 0 and for all 1 ≤ i ≤ n : w i ∈ L}

For example, {a}∗ = {ε, a, aa, aaa, aaaa, . . .}. Observe that ε ∈ L ∗ for all
languages L, and that L ∗ is necessarily an infinite language, except for
the cases where L = {ε} and L = ;, since then L ∗ = {ε}.

3. A variation on the Kleene closure is L + which is the language contain-


ing all words made up of an arbitrary and strictly positive number of
concatenations of words from L:

L+ = {w 1 w 2 · · · w n | n ≥ 1 and for all 1 ≤ i ≤ n : w i ∈ L} The Kleene closure is named after Stephen


Cole K L E E N E , (1909 – † 1994), a prominent
American logician, who is one of the gi-
For example, {a}+ = {a, aa, aaa, . . .}. Observe that {ε}+ = {ε}, so it is ants on the shoulders of whom computer
(tempting but) incorrect to write that L + = L ∗ \ {ε}. . . scientists are standing: he was one of the
founders of computability theory, along
with Kurt G Ö D E L , Alonzo C H U R C H , Alan
The Σ∗ notation T U R I N G , to name a few. . .
Picture source: http://math.library.
wisc.edu/about.html#kleene.
Let Σ be an alphabet. Since any alphabet is a set, we can also regard
Σ as a language, which contains only words of one character. Then,
we can write Σ∗ , which contains all the words (including the empty
one) that are made up of characters from Σ. This notation will be used
very often in the rest of these notes.
2 All Things Regular: Languages, Ex-
pressions. . .

T H E R E A D E R S H O U L D N OW B E C O N V I N C E D O F T H E I M P O RTA N C E O F
L A N G UA G E T H E O RY I N C O M P U T E R S C I E N C E , and in particular for com-
piler design. Our main objective for now will be to study formal tools to:
(1) define, using a finite syntax, languages that are potentially infinite; and
(2) manipulate those languages (for instance, combine them using classi-
cal set operations such as union or intersection). In particular, we want
to be able to answer the membership problem, or, in other words, to be
able to tell in an automatic way whether a given word belongs to a given
language or not.
In this chapter, we will study the class of regular languages1 . Regular 1
In francophone Belgium, regular lan-
guages are called « langages réguliers »,
languages form one of the most basic classes of languages, yet they con-
while in France, they are called « langages
tain many useful languages, such as the one we have used to define (most) rationnels », probably a much better trans-
legal C identifiers and keywords: lation.

LC i d = {all non-empty words on ΣC that do not begin with a digit}

where: ΣC = {a, b, . . . , z, A, B , . . . , Z , 0, 1, . . . , 9, _}.


Regular languages are equipped with different formal tools that allow
us to represent and manipulate them:

• Regular expressions can be used to represent those languages. For ex-


ample, if we let:

ℓ = a+b+c+d+e+f+g+h+i+j+k+l+m+n+o
+p + q + r + s + t + u + v + w + x + y + z + A + B + C
+D + E + F + G + H + I + J + K + L + M + N + O + P
+Q + R + S + T + U + V + W + X + Y + Z + _

then, ℓ is a regular expression that defines « any possible letter or the _


symbol » (in regular expressions, the + symbol must be interpreted as
an « or »). Similarly, let:

d = 1+2+3+4+5+6+7+8+9+0

then, d is a regular expression that defines « any possible digit ». Com-


bining those two expressions into:

ℓ · (ℓ + d )∗

we obtain a new regular expression that denotes exactly LC i d . Here, the


· symbol must be interpreted as concatenation, and the ∗ symbol as
32 I N T R O D U C T I O N T O L A N G UA G E T H E O RY A N D C O M P I L I N G

the Kleene closure (see Section 1.4). In other words, the above regular
expression must be interpreted as: « a character matching ℓ (i.e., a non-
digit character) followed by any number of characters matching either
ℓ or d ».

• Finite automata which are abstract machines designed mainly to an-


swer the membership problem. As such, they can also be used to rep-
resent languages, and can be used to perform operations on those lan-
guages. We will see that there are several types of finite automata, yet
they all correspond to the same class of regular languages.

2.1 Regular languages

The formal definition of regular languages is recursive. It starts from the


most simple languages, and uses the +, · and ∗ operations to build more
complex ones:

Definition 2.1 (Regular languages). Let us fix an alphabet Σ. Then, a lan-


guage L is regular iff:

1. either L = ;;

2. or L = {ε};

3. or L = {a} for some a ∈ Σ;

4. or L = L 1 ∪ L 2 ;

5. or L = L 1 · L 2 ;

6. or L = L ∗1

where L 1 and L 2 are regular languages on Σ. M

Throughout this document, we denote by Reg the set of regular lan-


guages. Here are a few examples to illustrate this definition:

Example 2.2.

1. The language {abc, def} on the alphabet {a, b, c, d, e, f} is regular. In-


deed, {a}, {b} and {c} are all regular, by point 3 of Definition 2.1. Then,
{abc} is also regular by point 5. Using similar arguments, we can show
that {def} is regular. Finally, {abc, def} = {abc}∪{def}, hence, {abc, def}
is regular by point 4.
Remark that those arguments can be used to show that all finite lan-
guages are regular.

2. The language of all binary words (thus on the alphabt {0, 1}) is regular.
Indeed, this language can be defined as:
¢∗
{0, 1}∗ = {0} ∪ {1}
¡

3. The language of all well-parenthesised words over Σ = {(, )}, on the other
hand, is not regular (this can be proved formally). Intuitively, the defini-
tion of regular language does not allow to discriminate between words
ALL THINGS REGULAR. . . 33

on the basis of unbounded counting arguments. However, counting


is unavoidable in this case: one must, at all times, keep track of the
current number of pending open parenthesis to check whether closing
parenthesis are legal or not.
This implies also that the set all syntactically correct C programs is not
a regular language either, as a C program might contain, for instance,
algebraic expressions with an arbitrary depth of parenthesis nesting.
Hence, all the tools we will develop in this section will not be sufficient
to check the full syntax of C programs (or any other « classical » pro-
gramming language).
Note however, that the language of all well-parenthesised words that
have a bounded length, over Σ = {(, )}, (for instance, of words containing
at most 10 characters) is regular, because it is a finite language.

Lets us now introduce several formal tools to deal with regular lan-
guages.

2.2 Regular expressions

The first tool we will consider for regular languages are regular expressions.
Regular expressions are a kind of algebraic characterisation of regular lan-
guages. To define regular expressions, we need to define two things: their For the more mathematically in-
clined readers, regular expressions
syntax (i.e., which regular expressions can we write?), and their semantics
form a so-called Kleene algebra, i.e,
(i.e., what is the meaning of a given regular expression, in terms of regular an idempotent semi-ring, see:
language?). These definitions follow closely that of regular languages: Dexter Kozen. On Kleene algebras and
closed semirings. In Mathematical foun-
Definition 2.3 (Regular expressions). Given a finite alphabet Σ, the follow- dations of computer science, Proceedings
of the 15th Symposium, MFCS ’90, Banská
ing are regular expressions on Σ:
Bystrica/Czech. 1990, volume 452 of Lec-
ture notes in computer science, pages 26–
1. The constant ;. It denotes the language L(;) = ;. 47, 1990. URL http://www.cs.cornell.
edu/~kozen/Papers/kacs.pdf
2. The constant ε. It denotes the language L(ε) = {ε}.

3. All constants a ∈ Σ. Each constant a ∈ Σ denotes the language L(a) =


{a}.

4. All expressions of the form r 1 + r 2 , where r 1 and r 2 are regular expres-


sions on Σ. Each expression r 1 + r 2 denotes the language L(r 1 + r 2 ) =
L(r 1 ) ∪ L(r 2 ).

5. All expressions of the form r 1 · r 2 , where r 1 and r 2 are regular expres-


sions on Σ. Each expression r 1 · r 2 denotes the language L(r 1 · r 2 ) =
L(r 1 ) · L(r 2 ).

6. All expressions of the form r ∗ , where r is a regular expression on Σ.


¢∗
Each expression r ∗ denotes the language L(r ∗ ) = L(r ) .
¡

In addition, parenthesis are allowed in regular expressions to group sub-


expressions (with their usual semantics). M

Sometimes, we will also use the r + notation as a shorthand for r · r ∗ .


¢+
That is, L(r + ) = L(r ) .
¡
34 I N T R O D U C T I O N T O L A N G UA G E T H E O RY A N D C O M P I L I N G

Example 2.4. It is easy to check that the example r = ℓ · (ℓ + d )∗ given in


the introduction of the present chapter follows the definition of regular
expression, and that L(r ) = LC i d . M

Given that the definition of regular expressions and of their semantics


follows closely the definition of regular languages (Definition 2.1), it is easy
to prove that:

Theorem 2.1. For all regular languages L, there is a regular expression r s.t.
L(r ) = L. For all regular expressions r , L(r ) is a regular language.
Actually, finding a minimal regu-
Observe that the language L(r ) associated to each regular expression lar expression to denote a given
regular language L is not an easy
r is unique, while there can be several regular expressions to denote the problem, since the problem of determin-
same language. For instance a and a + a both denote the language {a}, i.e. ing whether two given regular expressions
r 1 and r 2 accept the same language (i.e.,
L(a) = L(a + a) = {a}.
L(r 1 ) = L(r 2 )) is a PSPACE-complete prob-
lem, see:
2.2.1 Extended regular expressions L. J. Stockmeyer and A. R. Meyer. Word
problems requiring exponential time (pre-
Regular expressions are widely used in practice, in particular by many Unix liminary report). In Proceedings of the
Fifth Annual ACM Symposium on The-
applications. They can be used, for instance, to look for specific files us- ory of Computing, STOC ’73, pages 1–9,
ing the ls command. As an example the following command lists all the New York, NY, USA, 1973. ACM. D O I :
10.1145/800125.804029
file names in the current directory (thanks to the command find .) and
filters them using the grep tool, following the pattern ^..g.*\.tex which
is given as an extended regular expression.

1 find . | grep "^..g.*\.tex"

The pattern asks to select only the filenames that have a g in the third po-
sition, and have .tex as extension. As can be seen from this example, The difference between the two
syntaxes can be confusing: the +
the syntax of Unix regular expressions (called extended regular expressions)
denotes the alternative in ‘classi-
departs significantly from Definition 2.3. This is not surprising, since Def- cal’ regular expressions, and thus corre-
inition 2.3 has been introduced mainly for theoretical purpose. On the sponds to | in extended regular expres-
sions. On the other hand + in extended
other hand, the syntax of extended regular expressions (see Table 2.1) is regular expressions is the repetition, i.e., it
probably better fitted for practical purpose. Still, all languages that are corresponds to r + in ‘classical’ regular ex-
pressions. . .
definable by extended regular expressions are regular, which means these
new constructs do not alter the expressiveness.
ALL THINGS REGULAR. . . 35

E.R.E. Semantics
x the character x
. any character, except the ‘newline’ special character
"x" the character x, even if x is an operator. For instance "." is the character . and not ‘any
character’.
\x the character x, even if x is an operator (for instance \. is the . character)
[xy] either x or y
[a-z] any character in the range a, b,. . . ,z. Other ranges can be used, like 1-5 or D-X, for instance
[^x] any character but x
^x an x at the beginning of a line
x$ an x at the end of a line
x? an optional x
x* the concatenation of any number of x’s (Kleene closure)
x+ the concatenation of any strictly positive number of x’s
x{m,n} the concatenation of k numbers of x’s-, where m ≤ k ≤ n.
x|y either x or y

Table 2.1: Extended (Unix) regular expres-


sions.
36 I N T R O D U C T I O N T O L A N G UA G E T H E O RY A N D C O M P I L I N G

2.3 Finite automata

While regular expressions provide us with a compact and (hopefully) read-


able way of specifying regular languages, it is not clear, at the first sight,
how one can use regular expressions to manipulate (automatically) reg-
ular languages. In particular, we would like to obtain, for all regular lan-
guages, some kind of abstract specification of an algorithm that allows us
to answer the membership problem on that language (i.e., does a given
word belong to that language?) Clearly, such algorithms will be important
step stones to build compilers.
Such an abstract model of algorithm2 is given by the notion of finite au- 2
At least, a model of algorithm which is
sufficient for regular language, but might
tomaton. Finite automata have been introduced in the early fifties (1951)
not be sufficient in general. A general
by S. Kleene (see Section 1.4), as a model of biological phenomena, namely, model of algorithm is the Turing machine,
the response of neurons to stimuli3 . A more systematic study of this model as postulated by the Church-Turing the-
sis (see the computability and complexity
from the computational point of view has been done a few years later, in course).
1959, by Rabin and Scott4 . Since then, finite automata have been widely 3
Stephen C. Kleene. Representation
recognised as one of the most fundamental models in computer science. of events in nerve nets and finite au-
tomata. Technical Report RM-704,
The RAND Corporation, 1951. URL
http://minicomplexity.org/pubr.
2.3.1 Intuitions
php?t=2&id=2
4
A finite automaton is an abstract machine with the following features: M.O. Rabin and D. Scott. Finite automata
and their decision problems. IBM Jour-
nal of Research and Development, 3(2):
• The machine reads a word, letter by letter, from the first to the last. This 114–125, April 1959. ISSN 0018-8646.
can be understood by envisioning the input word written on an input D O I : 10.1147/rd.32.0114. URL https://
www.researchgate.net/publication/
tape, that the machine reads cell by cell thanks to a reading head. Each
230876408_Finite_Automata_and_
cell contains one letter of the input. Once the machine has read a letter, Their_Decision_Problems
the reading head moves to the next cell. The tape cannot be rewound.

• At all times, the machine is in a well-defined discrete state. There are


only finitely many such states. The reading of each letter triggers a state
change.

• The aim of the machine is to discriminate between words that are in a tape

given language, and words that are not. The automaton does so by ei-
ther accepting or rejecting input words. At all times, the machine pro-
l l d l
duces a binary (yes/no) output, indicating whether the word prefix read
head
so far is accepted or not by the machine.
q1 l q2
Figure 2.1 is an illustration of those concepts. It displays the input tape
d yes/no
(with content lldl), the reading head, and the output. The content of the
q3 l, d
rectangular box represents the different possible states of the automaton, l, d

by means of circles (in this case, the states are called q 1 , q 2 and q 3 ) and
Figure 2.1: An illustration of a finite au-
the possible state changes, by means of labeled arrows between states. In tomaton.
this example, for instance, reading an l on the input tape when in state
q 1 moves the current state to q 2 , and so forth. In addition, we need to
indicate:

1. In which state the automaton starts its execution. In our case, it is q 1 ,


as indicated by the edge without source state pointing to q 1 .

2. How is the output of the automaton determined at all times? As we


have explained, this output depends only on the current state, so states
ALL THINGS REGULAR. . . 37

should be either accepting (in which case the output is ‘yes’) or reject-
ing (the output is ‘no’). We will display accepting states as nodes with a
double border. In this case, q 2 is the only accepting state.

From the intuitions sketched above, it should be clear that the behaviour
of the automaton will depend only on its states and on the possible changes
between those states, i.e., what is depicted inside the rectangular box in
q1 l q2
Figure 2.1. So, in the next illustrations of finite automata, we will restrict
ourselves to this part, that is, we will display the automaton of Figure 2.1 d
q3 l, d
as in Figure 2.2. l, d

Since an automaton either accepts or rejects any word, it also implicitly Figure 2.2: We can represent finite au-
define a language, which contains all the words the automaton accepts. tomata more compactly by focusing on the
‘control’, i.e., the states and transitions.
It is easy to check that the language defined by automaton in Fig 2.2. is
exactly the language of the regular expression l · (l + d)∗ (assuming the
input alphabet is Σ = {l, d}). Indeed:

1. When running with a word starting by an l on the input tape, the au-
tomaton first moves from q 1 to q 2 , which is accepting and where it will
stay up to the end of its execution. So all words starting by an l will be
accepted.

2. When running on a word that does not start by an l (i.e., starts by a d)


on the input tape, the automaton first moves from q 1 to q 3 , which is
not accepting and where it will stay up to the end of its execution. So,
all words starting by a d will be rejected.

2.3.2 Syntax

Let us now formalise these notions:

Definition 2.5 (Finite automaton). A finite automaton is a tuple:

A = 〈Q, Σ, δ, q 0 , F 〉

where:

1. Q is a finite set of states;

2. Σ is the (finite) input alphabet;

3. δ : Q × (Σ ∪ {ε}) 7→ 2Q is the transition function;

4. q 0 ∈ Q is the initial state;

5. F ⊆ Q is the set of accepting states.

Let is illustrate this definition with the example of Figure 2.2:

Example 2.6. On the example of Figure 2.2, we have:

1. Q = {q 1 , q 2 , q 3 };

2. Σ = {l, d};

3. q 0 = q 1 ;
38 I N T R O D U C T I O N T O L A N G UA G E T H E O RY A N D C O M P I L I N G

4. F = {q 2 };

5. and, finally, the transition function δ is given by:


δ(q 1 , l) = {q 2 } δ(q 1 , d) = {q 3 } δ(q 1 , ε) = ;
δ(q 2 , l) = {q 2 } δ(q 2 , d) = {q 2 } δ(q 2 , ε) = ;
δ(q 3 , l) = {q 3 } δ(q 3 , d) = {q 3 } δ(q 3 , ε) = ;

There are two important features to Definition 2.5 that one should ob- q1 b q3
serve. First, the co-domain of the transition function is a set of states. Ob-
a
serve that in the example of Figure 2.2, the function always returns either c
a singleton or the empty set. However, we can also build automata like the q0
automaton of Figure 2.3, where δ(q 0 , a) = {q 1 , q 2 }. c
In this example, there are several possible executions of the automaton a
on the input word ab: the word can be read by an execution visiting q 0 , q2 q4
b
q 1 , then q 3 , or an execution visiting q 0 , q 2 and q 4 . This phenomenon is
Figure 2.3: A non-deterministic finite au-
called non-determinism, and the automaton of Figure 2.3 is said to be non- tomaton.
deterministic. Non-determinism raises several natural questions:

1. How do we determine the output of the automaton, when there are sev-
eral possible runs on the same word, that do not all end in an accepting
state? This occurs with the word ab and the automaton in Figure 2.3.
The rule is that, in non-deterministic automata, there must exist one
run that accepts for the word to be accepted.

2. What is the point of non-determinism anyway? We have introduced fi-


nite automata as abstract machines that model algorithms. Clearly, an
algorithm, or, at the very least, a computer program, are determinis-
tic, so the existence of non-deterministic automata seems to hurt the
intuition. It turns out that non-determinism is a very helpful tool for
modeling certain kinds of problems. Also, we will see later that all non-
deterministic automata can be converted into an equivalent determin-
istic automaton. A common intuition about non-
deterministic automata is that they
For example, consider the family of languages L n (where n ≥ 1) defined have the capability to ‘guess’ some-
as follows: ‘all binary words that contain two 1’s separated by exactly thing about the future of a word. Assume
the automaton in Figure 2.4 is currently
n characters’. Devising a non-deterministic automaton that recognises in state q 0 , has a word from L 2 on its in-
this language is quite easy: Figure 2.4 shows a non-deterministic au- put tape, and reads a 1. It has thus two
tomaton recognising L 2 . This example can be generalised to any n, by ‘choices’ for its next current state: either
stay in q 0 or move to q 1 . Clearly, the
inserting more states between q 2 and q 3 for instance. Clearly, if the au- latter is a good choice to accept the cur-
tomaton outputs ‘yes’ on some word w, then the execution of the au- rent word only if the 1 that is read is fol-
lowed, three characters ahead, by another
tomaton has gone through all the states, which guarantees that there 1. Since the automaton cannot read ahead
are two 1’s separated by two characters (the 1’s that have been read of its reading head, nor rewind the tape, it
when moving from q 0 to q 1 and from q 3 to q 4 respectively). Conversely, must ‘guess’ correctly whether each 1 will
be followed by another 1 three characters
if a word belongs to L 2 , there is clearly at least one run ending in q 4 that ahead, and any accepting run can be un-
reads it, and so the output of the automaton is ‘yes’ since q 4 is accept- derstood as a ‘correct guess’ from the au-
tomaton.
ing. We will see later (see the paragraph on the size of deterministic
automata in Section 2.4.2) that devising a deterministic automaton for
this language is a bit more tricky.

The second important feature of Definition 2.5 is the fact that some
transitions can be labeled by the empty word ε. This is called a ‘sponta-
ALL THINGS REGULAR. . . 39

0, 1 0, 1 Figure 2.4: An automaton recognising L 2 ,


i.e., the set of all binary words that contain
two 1’s separated by exactly 2 characters.
1 0, 1 0, 1 1
q0 q1 q2 q3 q4

neous move’ and allows the automaton to change its current state with-
out reading any character on the input (hence, without moving its read-
ing head). Again, spontaneous moves depart radically from our intuition
of algorithm, yet they can be useful for modeling purpose. For instance,
suppose we want to build an automaton for the language composed of
all words that start with a (possibly empty) sequence of a’s, followed by a
(possibly empty) sequence of b’s. One natural way to do it would be to start
by building two automata for those two parts of the words in the language:

a b

q0 q1

then, add spontaneous move between those states, to allow the automa-
ton to move from the ‘sequence of a’s’ part to the ‘sequence of b ’s’:

a b

q0 ε q1

and, finally, add the relevant initial and accepting states:

a b Figure 2.5: A non-deterministic automa-


ton with spontaneous moves that accepts
a∗ b∗ .
q0 ε q1

Because not all automata may use non-determinism and spontaneous


moves, we define the following classes of finite automata:

Definition 2.7 (Classes of finite state automata).

1. A non-deterministic finite state automata with ε-transitions (ε-NFA for


short) is a finite state automaton, as in Definition 2.5.
Observe that, in our definition of
2. A non-deterministic finite state automaton (NFA for short) is an ε-NFA DFAs, we request that |δ(q, a)| = 1,
A = Q, Σ, δ, q 0 , F s.t. for all q ∈ Q: δ(q, ε) = ;. In this case, we will for all q and a, i.e., that each state
­ ®
has exactly one successor for each letter.
henceforth assume that the signature of the transition function is δ : However, when depicting DFAs, and in or-
Q × Σ 7→ 2Q . der to keep the figures readable, we will
sometimes omit some transitions that lead
3. A deterministic finite state automaton (DFA for short) is an NFA A = to a sink state, i.e., a state from which noth-
ing can be accepted (like state q 3 in Fig-
Q, Σ, δ, q 0 , F s.t. for all q ∈ Q, for all a ∈ Σ: |δ(q, a)| = 1. In this case, we
­ ®
ure 2.2). Note also that some authors do
will henceforth assume that the signature of the transition function is not ask for the transition function to be
complete and use the weaker constraint
δ : Q × Σ 7→ Q, and that the function is complete♣ .
|δ(q, a)| ≤ 1 in their definition of DFAs.

M
40 I N T R O D U C T I O N T O L A N G UA G E T H E O RY A N D C O M P I L I N G

For instance, the automaton in Figure 2.2 is a DFA, hence also an NFA
and an ε-NFA. The automaton in Figure 2.3 is an NFA, hence also an ε-NFA,
but is not a DFA. The automaton in Figure 2.5 is an ε-NFA, but neither an
NFA, nor a DFA.

2.3.3 Semantics

Let us now define formally the notions of ‘execution’, ‘accepting a word’,


etc that we have discussed informally so far.
Remark that we give these defini-
Definition 2.8 (Configuration of an ε-NFA). A configuration of an ε-NFA tions in the most general case of ε-
NFAs, but, since NFAs and DFAs are
A = Q, Σ, δ, q 0 , F is a pair q, w ∈ Q × Σ∗ , where q is the current state,
­ ® ­ ®
special cases of ε-NFAs, these definitions
and w is the input word suffix that remains to be read. apply to them too.
­ ®
The initial configuration of A on the input word w is q 0 , w .
A configuration q, w is accepting (or final) iff q ∈ F and w = ε. M
­ ®

­ ®
Intuitively, a pair q, w completely characterises the current ‘configu-
ration’ of the automaton: its current state is q and the word w remains on
the input (in other words, the reading head is currently on the first char-
acter of w, if w ̸= ε; or at the end of the tape, if w = ε). Then, using the
transition relation, we can define how an automaton changes its current
configuration:

Definition 2.9 (Configuration change). Let A = Q, Σ, δ, q 0 , F be an ε-


­ ®

NFA, and let (q 1 , w 1 ) and (q 2 , w 2 ) be two configurations of A. Then we


say that (q 2 , w 2 ) is a successor of (q 1 , w 1 ) iff there is a letter a ∈ Σ ∪ {ε} such
that (i) w 1 = a · w 2 and (ii) q 2 ∈ δ(q 1 , a). We denote this successor relation
by:
(q 1 , w 1 ) ⊢A (q 2 , w 2 )

In the rest of these notes, we will often omit the subscript on the opera-
tor, when the ε-NFA is clear from the context, and write (q 1 , w 1 ) ⊢(q 2 , w 2 )
instead. Very often, we will consider sequences of configurations (q 1 , w 1 ),
(q 2 , w 2 ), . . . , (q n , w n ) s.t. (q i , w i ) ⊢(q i +1 , w i +1 ) for all 1 ≤ i ≤ n − 1. Such se-
quences are called runs of the automaton (on the word w 1 , which is the
word in the first configuration of the run). We say that a run (q 1 , w 1 ),
(q 2 , w 2 ), . . . , (q n , w n ) is accepting iff its last configuration (q n , w n ) is ac-
cepting; and we say that it is initialised iff its first configuration is initial.
Since ⊢A is a binary relation, we use the classical ⊢*A notation to denote
its reflexive and transitive closure♣ . Then, we can define the accepted lan-
guage of an automaton:

Definition 2.10 (Accepted language of an ε-NFA). Let A = Q, Σ, δ, q 0 , F


­ ®

be an ε-NFA. Then, its accepted language is:

L(A) = w ∈ Σ∗ | ∃q ∈ F s.t. q 0 , w ⊢*A q, ε


© ­ ® ­ ®ª

In other words, a word w is accepted iff the automaton admits an ini-


tialised and accepting run on w.
ALL THINGS REGULAR. . . 41

Example 2.11. As an example, let us consider again the ε-NFA in Fig-


ure 2.5, and let us check whether it accepts w = aab. The only possible
run, starting in q 0 , of this automaton on w is:

(q 0 , aab), (q 0 , ab), (q 0 , b), (q 1 , b), (q 1 , ε)

It is easy to check that the first configuration of the run is initial, that the
last is accepting. Hence, w = aab is accepted.
On the other hand, the maximal run that can be built on the word w ′ =
ba is:
(q 0 , ba), (q 1 , ba), (q 1 , a)

because there are no a-labeled transitions from q 1 . Hence, w ′ is not ac-


cepted since (q 1 , a) is not accepting. M

Recall that, with non-deterministic automata, several runs are possible


on a single input word. In this case, it is sometimes convenient to repre-
sent all the possible runs by means of a tree, whose nodes are labeled by
configurations, and whose edges correspond to the ⊢ relation. As an ex- (q 0 , ab)

ample, the tree of possible runs of the automaton in Figure 2.3 is shown in (q 1 , b) (q 2 , b)
Figure 2.6.
(q 3 , ε) (q 4 , ε)

Figure 2.6: The run-tree of the automaton


2.4 Equivalence between automata and regular expressions in Figure 2.3 on the word ab.

So far, we have reviewed two families of models for defining and manip-
ulating languages: regular expressions, on the one hand, and finite au-
tomata, on the other hand. We know that regular expressions define ex-
actly the class of regular languages (see definition 2.1 and Theorem 2.1),
but what about the expressive power of the three different classes of au- The ‘expressive power’ of a model
is a term often used to speak about
tomata we have introduced? Obviously, DFAs cannot be more expressive
the class of languages that the
than NFAs, which cannot be more expressive than ε-NFAs, by definition. model can define. One can thus speak
We have already seen at least one example of automaton that recognises about the expressive power of regular ex-
pressions (i.e., the regular languages), or
the same language than a given regular expression (see Figure 2.5), but can the expressive power of finite automata and
this be generalised? compare them. . .
It turns out that the expressive power of all three classes of finite au-
tomata is exactly the same, and equals that of regular expressions, that is,
the regular languages. This result is due to Stephen Kleene5 : 5
Stephen C. Kleene. Representation
of events in nerve nets and finite au-
Theorem 2.2 (Kleene’s theorem). For every regular language L, there is a tomata. Technical Report RM-704,
The RAND Corporation, 1951. URL
DFA A such that L(A) = L. Conversely, for all ε-NFAs A, L(A) is regular.
http://minicomplexity.org/pubr.
php?t=2&id=2
In other words, all finite automata recognise regular languages and all
regular languages are recognised by a finite automaton. To establish this
Our formulation of the theorem
result, we will give constructions that convert finite automata into regular
might seem restrictive, but one
expressions and vice-versa. More precisely, we will give algorithms to: must always bear in mind that
DFAs are a special case of NFAs, which are,
1. Convert any regular expression into an ε-NFA defining the same lan- in turn, a special case of ε-NFAs. Hence,
‘For every regular language L, there is a
guage.
DFA A such that L(A) = L’ entails that there
is also an NFA and an ε-NFA recognising L
2. Convert any ε-NFA into a DFA accepting the same language. This is (actually, the DFA A can serve for that pur-
called ‘determinising’ the ε-NFA as it somehow turns it into a determin- pose). Conversely, ‘for all ε-NFAs A: L(A)
istic version. Observe that this method can be applied, in particular, to is regular’ implies that the language of all
NFAs and DFAs are also regular!
any NFA.
42 I N T R O D U C T I O N T O L A N G UA G E T H E O RY A N D C O M P I L I N G

3. Convert any DFA into a regular expression defining the same language.

Regular
expression
2.4.3 2.4.1
This set of transformations is summarised in Figure 2.7. Together with special
Theorem 2.1, those transformations allow us to conclude that finite au- 2.4.2 NFA case
DFA ε-NFA
tomata recognise exactly regular languages.
2.4.2
Figure 2.7: The set of transformations used
to prove Theorem 2.2, with the section
numbers where they are introduced.

2.4.1 From regular expressions to ε-NFAs

To turn a regular expression into a finite automaton, we will once again


exploit the recursive definition of the syntax of regular expressions. The
construction we are about to describe dates back from the sixties. It is
widely attributed to Thompson6 , who has based his work on a previous 6
Ken Thompson. Programming tech-
niques: Regular expression search algo-
construction by McNaughton and Yamada7 .
rithm. Commun. ACM, 11(6):419–422,
The induction hypothesis of the construction is that it builds, for all June 1968. ISSN 0001-0782. DOI:

regular expressions r , an ε-NFA A r s.t. (i) L(A r ) = L(r ); and (ii) the (neces- 10.1145/363347.363387
7
R. McNaughton and H. Yamada. Regular
sarily unique) initial state of A r is called q ri , and A r has exactly one final expressions and state graphs for automata.
f f
state that we denote q r . Moreover, no transition enter q ri , nor leave q r . Electronic Computers, IRE Transactions on,
EC-9(1):39–47, March 1960. ISSN 0367-
9950. D O I : 10.1109/TEC.1960.5221603

Base cases Building ε-NFAs that accept the base cases of regular expres-
sions is easy, as shown in the following table: Observe that we could have given
simpler constructions. For in-
stance, A ε could have been made
up of only one (initial and accepting) state.
However, the construction we present has
the benefit to keep initial and final states
Regular separated, and is therefore more system-
ε-NFA A r atic.
expression r

i f
; q; q;

ε f
ε q εi qε

a f
a ∈Σ q εi qε

Inductive case For the inductive case, we assume ε-NFAs A r 1 and A r 2 are
already known for two regular expressions r 1 and r 2 . We treat the disjunc-
tion, concatenation and Kleene closure as follows:
ALL THINGS REGULAR. . . 43

Regular
ε-NFA A r
expression r

Ar1
f
q ri 1 qr1
ε ε

r1 + r2 q ri qr
f

ε ε
f
q ri 2 qr2
Ar2

Ar1
f
q ri 1 qr1
ε
ε
r1 · r2 q ri qr
f

ε
f
q ri 2 qr2
Ar2

Ar1 ε

ε f ε f
q ri q ri 1 qr1 qr
r 1∗

Example 2.12. Let us consider the regular expression l · (l + d)∗ on the al-
phabet Σ = {l, d}. Following the construction, we start with the base cases:

l
Al =

d
Ad =

From them, we build the ε-NFA for l + d:

ε ε

A l+d =

ε ε
d

Then, let’s apply the Kleene closure:


44 I N T R O D U C T I O N T O L A N G UA G E T H E O RY A N D C O M P I L I N G

l
ε ε
ε ε
A (l+d)∗ =
ε ε
d

Finally, we apply the construction for the concatenation (with automa-


ton A l we have computed above) and obtain the ε-NFA A l·(l+d)∗ displayed
in Figure 2.8. M

q5 l q6

ε ε
q1 l q2 ε q3 ε q4 q9 ε q 10

ε ε
q7 d q8

Figure 2.8: The ε-NFA built from the regu-


lar expression l ·(l + d)∗ , using the system-
atic method.
2.4.2 From ε-NFAs to DFAs

For many practical purposes (like the building of a parser), non-determin-


istic automata are not acceptable, and a deterministic automaton is nec-
essary. We will now review a technique that converts any ε-NFA into a DFA
accepting the same language.
Let us first sketch the intuition behind the construction, by consider-
ing the ε-NFA in Figure 2.8. Assume the automaton is currently in state
q 3 and the next character on the input is an l. The automaton can read
this l, but must first follow two ε-transitions in order to reach state q 5 .
Then, after reading the l from q 5 , the automaton can follow several other
ε-transitions and end up in any of the following states: q 4 , q 5 , q 6 , q 7 , q 9
or q 10 . From those states, the automaton might continue its execution,
yielding several possible runs on the same word. By definition of ε-NFAs,
only one of those runs needs to be accepting for the automaton to accept
the word.
Intuitively, the DFA D corresponding to a given ε-NFA A will simulate all
the possible executions of A on a given input word, by tracking the possi-
ble states in which A can be at all times. To perform this tracking, the DFA
needs some kind of memory, and we will use the states of the DFA to en-
code this memory. Thus, the states of the DFA D will be subsets of A’s set of
states. Roughly speaking, when the DFA will be in state S = {q 1 , q 2 , . . . , q n }
ALL THINGS REGULAR. . . 45

(where q 1 , . . . q n are states of the ε-NFA) after reading a prefix w ′ , then,


{q 1 , . . . q n } is exactly the set of all states that the ε-NFA can reach by read-
ing the same prefix w ′ .

ε-closure To formalise this intuition, we need several ancillary defini-


tions. Let us first introduce the εclosure function that takes a state q ∈ Q
and returns the set of states that automaton A can reach by reading only ε’s.
The next definition formalises this. In this definition, we extend slightly
the definition of the transition function δ by allowing it to be applied to a
set of states. That is, for a set of states S, and a letter a, we let:

δ(S, a) = δ(q, a)
[
q∈S

In other words, computing δ(S, a) amounts to computing all the states that
the automaton can reach from any state q ∈ S, by reading an a. Then:

Definition 2.13 (ε-closure). Let A = Q, Σ, δ, q 0 , F be an ε-NFA. For all


­ ®

i ∈ N, let εclosurei (q) be defined as follows: The definition might seem hard
( to read, but the intuition is re-
{q} if i = 0 ally easy: εclosurei (q) is the set of
i
εclosure (q) = states that A can reach from q by following
δ(εclosurei −1 (q), ε) ∪ εclosurei −1 (q) otherwise at most i transitions labeled by ε.

Then, for all q ∈ Q: εclosure q = εclosureK (q), where K is the least value
¡ ¢

s.t. εclosureK (q) = εclosureK +1 (q).


M

Example 2.14. Let us consider the ε-NFA in Figure 2.8, and let us compute
εclosure q 6 . We compute εclosurei (q) for i = 0, 1, . . . up to stabilisation:
¡ ¢

εclosure0 (q 6 ) = {q 6 }

εclosure1 (q 6 ) = δ(εclosure0 (q 6 ), ε) ∪ εclosure0 (q 6 )


= δ({q 6 }, ε) ∪ {q 6 }
= {q 9 } ∪ {q 6 }
= {q 6 , q 9 }

εclosure2 (q 6 ) = δ({q 6 , q 9 }, ε) ∪ {q 6 , q 9 }
= {q 4 , q 9 , q 10 } ∪ {q 6 , q 9 }
= {q 4 , q 6 , q 9 , q 10 }

εclosure3 (q 6 ) = δ({q 4 , q 6 , q 9 , q 10 }, ε) ∪ {q 4 , q 6 , q 9 , q 10 }
= {q 4 , q 5 , q 7 , q 9 , q 10 } ∪ {q 4 , q 6 , q 9 , q 10 }
= {q 4 , q 5 , q 6 , q 7 , q 9 , q 10 }

εclosure4 (q 6 ) = δ({q 4 , q 5 , q 6 , q 7 , q 9 , q 10 }, ε) ∪ {q 4 , q 5 , q 6 , q 7 , q 9 , q 10 }
= {q 4 , q 5 , q 6 , q 7 , q 9 , q 10 } ∪ {q 4 , q 5 , q 6 , q 7 , q 9 , q 10 }
= {q 4 , q 5 , q 6 , q 7 , q 9 , q 10 }
= εclosure3 (q 6 )

So, we let εclosure q 6 = εclosure3 (q 6 ) = {q 4 , q 5 , q 6 , q 7 , q 9 , q 10 }. M


¡ ¢
46 I N T R O D U C T I O N T O L A N G UA G E T H E O RY A N D C O M P I L I N G

Again, we extend the εclosure function to set of states S, as we did for


the δ function: εclosure (S) = ∪q∈S εclosure (s). In particular εclosure (;) =
;.

Determinisation We can now give the construction to determinise an ε-


NFA.

Determinisation of ε-NFAs

Given an ε-NFA A = Q A , Σ, δ A , q 0A , F A , we build the DFA D =


­ ®
­ D
Q , Σ, δD , q 0D , F D as follows:
®

Recall that, for a finite


A
1. Q D = 2Q set S, the notation 2S de-
notes the set of subsets of S.
2. q 0D = εclosure q 0A
¡ ¢
For example, if S = {1, 2, 3}, then
2S = {1}, {2}, {3}, {1, 2}, {2, 3}, {1, 2, 3}, ; .
© ª

3. F D = {S ∈ Q D | S ∩ F A ̸= ;}

4. for all S ∈ Q D , for all a ∈ Σ: δD (S, a) = εclosure δ A (S, a)


¡ ¢

Let us comment briefly on the items of this definition:

1. As expected, the set of states of the DFA is the set of subsets of the ε-
NFAs states.

2. The initial state of the DFA is the set of states the NFA can reach from
its own initial state q 0A by reading only ε-labeled transitions. Thus, q 0D
is the set of states in which the ε-NFA can be before reading any letter.

3. A state of the DFA is accepting iff it contains at least one accepting state
of the ε-NFA. This is coherent with the intuition that at least one execu-
tion of the ε-NFA must accept for the word to be accepted.

4. The transition function consists in: first reading a letter, then following
as many ε-labeled transitions as possible.

Although we will not present the details here8 , one can show that the DFA 8
The interested reader can find a proof in:
obtained from any ε-NFA by the above construction preserves the accepted John E. Hopcroft, Rajeev Motwani, and
Jeffrey D. Ullman. Introduction to Au-
language of the ε-NFA: tomata Theory, Languages, and Computa-
tion (3rd Edition). Addison-Wesley Long-
Theorem 2.3 (Determinisation of ε-NFAs). For all ε-NFA A, the DFA D ob- man Publishing Co., Inc., Boston, MA,
USA, 2006. ISBN 0321455363
tained by determinising A accepts the same language as A: L(A) = L(D).

Proof. (Sketch) Let us assume A = Q A , Σ, δ A , q 0A , F A . The proof is based


­ ®

on an extension of the transition function which receives a state q, and a


(possibly empty) word w, and returns the set of all possible states that the
automaton can reach by reading the word w. Formally, given a transition
function δ, its extended version is δ̂ defined as:

δ̂(q, ε) = εclosure q
¡ ¢

δ̂(q, w a) = εclosure δ(δ̂(q, w), a)


¡ ¢
ALL THINGS REGULAR. . . 47

Then, clearly, we can define A’s language using δ̂ A instead of δ A , since a


word is accepted iff reading it from q 0A allows one to reach at least one
accepting state. In other words:

L(A) = w ∈ Σ∗ | δ̂ A (q 0A , w) ∩ F A ̸= ;
© ª

Then to prove that L(D) = L(A), it is sufficient to check, that, for all
words w the set δ̂ A (q 0 , w) is exactly the state which is reached by D when
reading w from its initial state. This can be established by induction on the
length of w, which is easy because of the inductive definition of δ̂ A .

Example 2.15. Let us consider again the ε-NFA in Figure 2.8, and let us
build its deterministic counterpart D = Q D , Σ, δD , q 0D , F D .
­ ®

D’s initial state q 0D = εclosure q 1 = {q 1 }. Let us call this state S 1 . From


¡ ¢

S 1 = {q 1 }, the automaton A reaches {q 2 } by reading an l. From q 2 , it can


take several ε-labeled transitions. In other words:

δD (S 1 , l) = εclosure δ A (S 1 , l)
¡ ¢

= εclosure {q 2 }
¡ ¢

= {q 2 , q 3 , q 4 , q 5 , q 7 , q 10 }

Let us denote this set S 2 . Observe that S 2 is accepting, since S 2 ∩ F A =


{q 10 } ̸= ;.
On the other hand, reading a d from S 1 yields the empty set. Hence:

δD (S 1 , d) = εclosure (;)
=;

We continue the construction of the DFA similarly, from S 2 :

εclosure δ A (S 2 , l) = εclosure {q 6 }
¡ ¢ ¡ ¢

= {q 4 , q 5 , q 6 , q 7 , q 9 , q 10 }

Let us denote this last set by S 3 . Observe that S 3 is accepting too.


Reading a d from S 2 yields:

εclosure δ A (S 2 , d) = εclosure {q 8 }
¡ ¢ ¡ ¢

= {q 4 , q 5 , q 8 , q 7 , q 9 , q 10 }

Let us denote this state S 4 .


And from ;:

εclosure δ A (;, d) = ;
¡ ¢

εclosure δ A (;, l) = ;
¡ ¢

Now, from S 3 :

εclosure δ A (S 3 , d) = εclosure {q 8 }
¡ ¢ ¡ ¢

= S4
¡ A
εclosure δ (S 3 , l) = εclosure {q 6 }
¢ ¡ ¢

= S3
48 I N T R O D U C T I O N T O L A N G UA G E T H E O RY A N D C O M P I L I N G

Finally, from S 4 :

εclosure δ A (S 4 , d) = εclosure {q 8 }
¡ ¢ ¡ ¢

= S4
¡ A
εclosure δ (S 4 , l) = εclosure {q 6 }
¢ ¡ ¢

= S3

The resulting DFA is depicted in Figure 2.9. Actually, this figure shows
the part of the DFA which is reachable from the initial state (since we have
built the states iteratively from the initial state). Indeed, a state like {q 1 , q 10 }
also exists in the DFA, but is not reachable.

l l Figure 2.9: The DFA obtained from the ε-


S1 S2 S3 l NFA A l·(l+d)∗ .
d
d d
l
; S4

l, d d

Observe that the result on the determination process does not always
yield a minimal automaton (in this case, the states S 2 and S 3 could be
‘merged’). We will review, in Section 2.5 a technique for minimising DFAs.
M

Size of the determinised automaton Since the set of states of the DFA D
A
obtained by the above construction is 2Q (where Q A is the set of states
of the original ε-NFA A), D could be, in theory, exponentially larger than
A. However, on the previous example, A l·(l+r)∗ has ten states, while its
corresponding DFA (Figure 2.9) has ‘only’ four states reachable (instead of
1024). So, even if the DFA has many states, most of them are not reachable
and their construction can thus easily be avoided.
Is it always going to be the case? The answer, unfortunately, is ‘no’: we
will exhibit an infinite family of languages L n (for all n ≥ 1) s.t., (i) for all
n ≥ 1: there is an ε-NFA A n that recognises L n , and the size of the A n ’s
grows linearly with n; and (ii) letting D n be any deterministic automaton
recognising L n (for all n), the size of the D n ’s grow exponentially with n.
Observe that the above statement is rather strong: whatever the determin-
istic automaton D n we chose to recognise L n , this automaton is bound to
a number of states which is exponential in n. Thus, there is no hope to ob-
tain a determinisation procedure that always produces a DFA that is poly-
nomial in the size of the original ε-NFA (and this holds in particular for the
determinisation procedure we have given above).
The languages L n are those of binary words that contain at least two 1’s
separated by n characters, i.e.:

L n = {w 1 w 2 · · · w ℓ ∈ {0, 1}∗ | ∃1 ≤ i ≤ ℓ − n − 1 : w i = w i +n+1 = 1}


ALL THINGS REGULAR. . . 49

Building an NFA accepting L n (for each n) is easy: we have already given


in Figure 2.4 an NFA accepting L 2 , and Figure 2.10 shows the general con-
struction. It is easy to see that, for all n ≥ 1, A n accepts L n . Indeed, if
a run of the automaton reaches the accepting state q a , it has necessarily
traversed the sequence of states q i , q 0 , q 1 ,. . . q n , q a , which guarantees that
the word contains two 1’s (read by the transitions from q i to q 0 and from
q n to q a respectively), separated by n characters. On the other hand, if a
word w is in L n , then it can be accepted by the automaton: the automa-
ton stays in q i , up to the point where it reads the first of the two 1’s that
are separated by n characters, and moves to q 1 . Then, the accepting states
will be reached for sure.

0, 1 0, 1

1 0, 1 0, 1 0, 1 0, 1 1
qi q0 q1 q2 ··· qn qa

n transitions
Figure 2.10: The family of ε-NFAs A n (n ≥
1) s.t. for all n ≥ 1: L(A n ) = L n .
Observe that the only non-deterministic choice of A n occurs in q i : when
reading a 1, the automaton can either stay in q i , or move to q 0 . If it decides
for the latter, it will accept only if this 1 is followed n +1 characters later by
another 1. In some sense, each time the automaton sees a 1 in state q i , it
must guess whether this 1 will be followed n+1 characters later by another
1, in which case it moves to q 0 . The purpose of the states q 0 , q 1 ,. . . , q n is
to check that this guess was correct.
Finally, it is easy to see that, for all n ≥ 1, A n has n + 3 states, so the size
of the A n automata grows indeed linearly wrt n.

Now, let us argue that the size of deterministic automata D n that accept
L n grows exponentially wrt n. To support our discussion, we consider the
automaton A 1 :

0, 1 0, 1 Figure 2.11: The ε-NFA A 1 recognising L 1 .

1 0, 1 1
qi q0 q1 qa

To check that a word contains indeed two 1’s separated by 1 character,


the automaton must, at all times, remember the two last read characters,
that we denote b 0 and b 1 (that is, if the automaton is reading character w i
of the input word, then b 0 = w i −2 and b 1 = w i −1 ). Then, the automaton
proceeds as follows every time it reads a character:

• If the character is 1, then, the automaton must check whether b 0 is a 1.


If yes, it accepts. Otherwise, it needs to update its memory, by copying
the value of b 1 to b 0 , and letting b 1 = 1.

• If the character is 0, then, the automaton must only update its memory,
by, again, copying the value of b 1 to b 0 , and letting b 1 = 0.
50 I N T R O D U C T I O N T O L A N G UA G E T H E O RY A N D C O M P I L I N G

Thus, the automaton clearly needs those two bits b 0 and b 1 of memory.
There are 22 = 4 possible memory values which are encoded in the states of
the DFA. Hence, D 1 must have at least 4 states. This reasoning generalises
to any n, letting the number of memory bits increase with n: for all n ≥ 1,
the automaton needs n +1 bits of memory. So, any DFA D n recognising L n
must have at least 2n+1 states.
As a matter of fact the automaton D 1 obtained by determinising A 1 (us-
ing the procedure of Section 2.4.2) is displayed in Figure 2.12. The four
states encoding the memory are the four non-accepting states. The gray
labels show the values of the two memory bits associated to those states—
of course, this intuition is valid only after D 1 has read at least 2 characters.
Clearly, this automaton could be made simpler, but only by ‘merging’ the
accepting states: it is not possible to reduce the number of non-accepting
states without changing the language of the automaton.

b 0 = 0, b 1 = 1 b 0 = 1, b 1 = 0 Figure 2.12: The DFA D 1 obtained from the


NFA A 1 . The gray labels show the values of
0 the memory bits associated to some states.
0

1 0 1
{q i } {q i , q 0 } {q i , q 1 } {q i , q 0 , q a }

0
1
b 0 = 0, b 1 = 0

{q i , q 0 , q 1 } 1
1
0
b 0 = 1, b 1 = 1

1 1

0
1 {q i , q 0 , q 1 , q a } {q i , q 1 , q a } {q i , q a }
0

0
ALL THINGS REGULAR. . . 51

2.4.3 From ε-NFAs to regular expressions

In order to complete the picture in Figure 2.7, we present now a technique


for turning an ε-NFA (thus, in particular, a DFA) A into a regular expression
defining the same language as A.
Several techniques exist to do so. The original one can be found in
Kleene’s seminal paper9 , and has later been rephrased using the (now) 9
Stephen C. Kleene. Representation
of events in nerve nets and finite au-
standard automata formalism by McNaughton and Yamada10 . The tech-
tomata. Technical Report RM-704,
nique we will now present has been introduced by Brzozowski and Mc- The RAND Corporation, 1951. URL
Cluskey in 196311 . http://minicomplexity.org/pubr.
php?t=2&id=2
This technique is often called the state elimination technique. Roughly 10
R. McNaughton and H. Yamada. Regular
speaking, it consists in eliminating states of the original ε-NFA one by one, expressions and state graphs for automata.
and updating the labels of the remaining transitions to make sure that the Electronic Computers, IRE Transactions on,
EC-9(1):39–47, March 1960. ISSN 0367-
accepted language does not change. To do this, one has to allow regular 9950. D O I : 10.1109/TEC.1960.5221603
expressions (instead of single characters) to label the transitions, as the 11
J.A. Brzozowski and Jr. McCluskey, E.J.
Signal flow graph techniques for sequen-
next simple example demonstrates. Consider the automaton:
tial circuit state diagrams. Electronic Com-
puters, IEEE Transactions on, EC-12(2):67–
b
76, April 1963. ISSN 0367-7508. D O I :
10.1109/PGEC.1963.263416
q0 q1 q2
a c

Then, eliminating state q 1 can be done if we re-label the transition from


q 0 to q 2 by a regular expression:

b + (a · c)

q0 q1 q2
a c

It is easy to check that the latter automaton (i.e., without state q 1 ) accepts
the same language as the former.
Let us now generalise the idea sketched in this example. Assume we
want to remove some state q of an ε-NFA. Let p 1 , p 2 ,. . . , p n denote the
predecessors of q, i.e., all states p i s.t. q ∈ δ(p i , a), for some a ∈ Σ ∪ {ε}. Let
us further denote by s 1 , s 2 ,. . . , s ℓ the successors of q, i.e. all states s i s.t.
s i ∈ δ(q, a) for some a ∈ Σ ∪ {ε}. Obviously, the removal of q will affect all Observe that a state could be at
the same time a successor and a
transitions from some p i to q, and all transitions from q to some s i . But
predecessor of q, but this is not a
it might also affect some transitions from some p i to some s j , as in the problem for our technique.
above example. In this case, q 0 is a predecessor of q = q 1 ; q 2 is a successor;
and we ‘report’ the information from the two deleted transitions to the
direct transition from q 0 to q 2 . So, in general, the states and transitions we
need to consider when deleting state q are as depicted in Figure 2.13 (left).
Observe that we assumed two important things in this figure:

1. first, all transitions are labeled by regular expressions r i , j . This will be


important because we will iteratively remove states and thus replace
some characters labeling transitions by more complex regular expres-
sions. Since each character is also a regular expression, this is not a
problem: the initial automaton respects this assumption.

2. second, we have assumed that there is a transition between each pair


of states (p i , s j ) (with 1 ≤ i ≤ n and 1 ≤ j ≤ ℓ). This assumption is im-
portant because the information on the transitions we will delete along
52 I N T R O D U C T I O N T O L A N G UA G E T H E O RY A N D C O M P I L I N G

with q will be moved to those transitions. If the automaton does not


respect this hypothesis, we can always add transitions that are labeled
by the regular expression ; without modifying the accepted language
of the automaton.


r 1,ℓ r 1,ℓ + r 1,q r q,q r q,ℓ


r 1,1 r 1,1 + r 1,q r q,q r q,1

p1 s1 p1 s1
r 1,q r q,1 r 1,q r q,1
r q,q r q,q

.. q .. .. q ..
. . . .

r n,q r q,ℓ r n,q r q,ℓ


pn sℓ pn sℓ

r n,ℓ ∗
r n,ℓ + r n,q r q,q r q,ℓ

r n,1 ∗
r n,1 + r n,q r q,q r q,1

Figure 2.13: The situation before (left) and


after (right) deletion of state q
Now, let us observe the right-hand side of Figure 2.13. It shows the au-
tomaton one obtains after removing all the part which is now dotted, i.e.,
q and its incoming and outgoing transitions. To justify this construction,
let us consider, for instance, a run fragment from p 1 to s ℓ . In the original
automaton, this can be done at least in two different ways:

• either by following the direct transition from p 1 to s ℓ , reading a word


that matches r 1,ℓ .

• or by following a path that goes from p 1 to q, follows an arbitrary num-


ber of times the self-loop on q, then goes from q to s ℓ . For this path
to be taken the automaton thus needs to read a word recognised by

r 1,q r q,q r q,ℓ .

As we delete state q, the regular expression r 1,q r q,q r q,ℓ corresponding to
the latter path must now be reported to the former. Hence, the label from

p 1 to s ℓ now becomes r 1,ℓ + r 1,q r q,q r q,ℓ . The other modified labels are
justified in a similar fashion. Since the deletion of q does not affect the
rest of the automaton, we conclude that applying this transformation to
any state q of any ε-NFA does not modify its accepted language.
Then, the algorithm to convert an ε-NFA A = Q, Σ, δ, q 0 , F into a reg-
­ ®

ular expression accepting the same language is as follows. For each ac-
ALL THINGS REGULAR. . . 53

cepting state q f ∈ F , we build an equivalent automaton A q f by deleting


all states except q 0 and q f from A, using the state elimination procedure
described above. Since all states but q 0 and q f have been removed, A q f is
necessarily of either forms shown in Figure 2.14. Indeed, only states q 0 and
q f are left, and it could be the case that q 0 = q f . In both cases, computing
the regular expression that corresponds to those automata is easy—they
are displayed under the automata.

r 0,0 r f ,f r Figure 2.14: The two possible forms for an


r 0, f automaton A q f obtained by eliminating
all states but q 0 and q f , and their corre-
q0 qf q0 sponding regular expressions. We obtain
the right automaton whenever q 0 = q f .
r f ,0

¢∗
r 0,0 + r 0, f · r f∗, f · r f ,0 · r 0, f · r f∗, f
¡
r∗

So, for each accepting state q f , we can now compute a regular expres-
sion r q f that accepts all the words A accepts by a run ending in q f . How-
ever, the language of A is exactly the set of all words that A accepts by a run
ending in either of the accepting states. Then, assuming that the set of ac-
cepting states of A is F = {q 1f , q 2f , . . . q nf }, we obtain the regular expression
corresponding to A as:
q 1f + q 2f + · · · + q nf

Example 2.16. As an example, consider the following ε-NFA:

1
1

q0 0 q1 0 q2 ε q3

Remember that, when no transition is displayed between a pair of states


(q 1 , q 2 )—potentially with q 1 = q 2 —we assume that there is a transition la-
beled by the regular expression ;. We do not display such transitions to
enhance readability.
Then, we apply iteratively the state elimination procedure to obtain A q2
and A q3 :

1. To obtain A q2 , we first observe that q 2 is not reachable from q 3 (in other


words, all outgoing transitions from q 3 are labeled by ;, except the self-
loop). It is thus safe to delete q 3 without further modification of the
transitions:

q0 0 q1 0 q2

From this automaton, we apply the state elimination procedure to delete


q 1 and obtain:
54 I N T R O D U C T I O N T O L A N G UA G E T H E O RY A N D C O M P I L I N G

1+0·0

q0 q2

This automaton is A q2 . Its corresponding regular expression is:

(1 · 1 + 0 · 0 · 1)∗ · (1 + 0 · 0)

2. To obtain A q3 , we first remove q 1 , as in the previous case, and obtain:

1
1+0·0

q0 q2 ε q3

Then, we remove q 2 , which has one predecessor: q 0 ; and two succes-


sors: q 0 and q 3 . Displaying q 2 ’s situation as in Figure 2.13, we obtain:

;
q0
1

q0 1+0·0 q2
1
ε
q3
;

Observe that we have duplicated q 0 for the sake of clarity, since it is


both a predecessor and a successor. Now let us eliminate q 2 , we obtain,
using still the same representation:
¡ ¢
; + (1 + 0 · 0) · 1
q0

q0
1

q3
; + (1 + 0 · 0) · ε
¡ ¢

However, the newly introduced regular expressions can be simplified:


¡ ¢ ¡ ¢
; + (1 + 0 · 0) · 1 = (1 + 0 · 0) · 1
= 1·1+0·0·1

and:

; + (1 + 0 · 0) · ε = (1 + 0 · 0) · ε
¡ ¢ ¡ ¢

= 1+0·0
ALL THINGS REGULAR. . . 55

Then, putting everything together (and taking into account that the dupli-
cate q 0 is a single state), we obtain the automaton A q3 :

1·1+0·0·1 1

q0 1+0·0 q3

Its corresponding regular expression is:

(1 · 1 + 0 · 0 · 1)∗ · (1 + 0 · 0) · 1∗

So, we conclude that a regular expression accepting the same language as


the original ε-NFA A is:

(1 · 1 + 0 · 0 · 1)∗ · (1 + 0 · 0) + (1 · 1 + 0 · 0 · 1)∗ · (1 + 0 · 0) · 1∗
¡ ¢ ¡ ¢

2.5 Minimisation of DFAs

As we have seen in Section 2.4.2 (see Figure 2.9), there can be several DFAs
accepting the same language, and some of them might be larger than the
others. It is thus natural to look for a minimal DFA accepting a given regu-
lar language, and to wonder whether there can be several different minimal
DFAs accepting the same language.
Answers to those questions are provided by a central theorem of au-
tomata theory, which has been established in 1958 by Myhill and Nerode12 . 12
A. Nerode. Linear automaton trans-
formations. Proceedings of the Ameri-
To avoid technicalities which are out of the scope of those notes, we will
can Mathematical Society, 9(4):pp. 541–
not state the theorem, but rather one of its consequence: 544, 1958. ISSN 00029939. URL http:
//www.jstor.org/stable/2033204
Corollary 2.4 (Consequence of the Myhill-Nerode theorem). For all regu-
lar languages L, there is a unique minimal DFA accepting L. This DFA can
be computed from any DFA accepting L.

q1 b q3 Figure 2.15: A DFA which is not minimal.


a
a

q0 q5 a q6
a
b a
q2 a q4

The aim of this section is to discuss the minimisation procedure for


DFAs. Let us start with the simple example shown in Figure 2.15. This
DFA is clearly not minimal. Consider for instance the two accepting states
q 5 and q 6 : it is easy to check that ‘merging’ them (preserving the a-labeled
loop on the merged state) retains the language of the automaton, because,
both states accept the same language a∗ , i.e., any word suffix read from q 5
will eventually be accepted iff it is accepted from q 6 . This characterisa-
tion of states that ‘can be merged’ is the central notion that we need to
minimise DFAs:
56 I N T R O D U C T I O N T O L A N G UA G E T H E O RY A N D C O M P I L I N G

Definition 2.17 (Language acccepted from a state). Given an ε-NFA A =


Q, Σ, δ, q 0 , F , and a state q ∈ Q, we let L(A, q) be the language accepted
­ ®

from q defined as L(A, q) = L(A ′ ) where A ′ = Q, Σ, δ, q, F is the ε-NFA


­ ®

obtained by replacing the initial state of A by q. M

In other words, L(A, q) is the language A would accept if its initial state
were q instead of q 0 . Then, we can characterise states that we will be able
to merge. Those states are said to be equivalent:

Definition 2.18 (Equivalence between states). Let A = Q, Σ, δ, q 0 , F be


­ ®

an ε-NFA, and let q 1 ∈ Q and q 2 ∈ Q be two states of A. Then, q 1 and q 2 are


equivalent (denoted q 1 ≡ q 2 ) iff L(A, q 1 ) = L(A, q 2 ). M

It is easy to check that ≡ is indeed an equivalence relation♣ . For all


states q, we denote by [q] its equivalence class, i.e., the set of all states
that accept the same language as q. Thanks to these definitions we can
present the minimisation procedure for DFAs. As expected, it consists in
‘merging’ equivalent states and updating the transitions accordingly. Con-
cretely, this amounts to using the set of equivalence classes of ≡ as the
states of the minimal automaton: We will give, on page 57, argu-
ments explaining why this defini-
tion of the transition relation is
well-founded and makes sense.
Minimisation of DFAs

Given a DFA A = Q A , Σ, δ A , q 0A , F A , the minimal DFA accepting


­ ®

L(A) is B = Q B , Σ, δB , q 0B , F B where:
­ ®

1. Q B = {[q] | q ∈ Q A }

2. For all [q] ∈ Q B , for all a ∈ Σ: δB ([q], a) = δ A (q, a)


£ ¤

3. q 0B = q 0A
£ ¤

4. F B = {[q] | q ∈ F A }.

Example 2.19. Let us consider the example in Figure 2.15. Here are the
languages accepted by the different states (denoted as regular expressions):

State q Accepted language L(A, q)

q0 a · b · a · a∗ + b · a · a · a∗
q1 b · a · a∗
q2 a · a · a∗
q3 a · a∗
q4 a · a∗
q5 a∗
q6 a∗

Clearly, q 5 ≡ q 6 , q 3 ≡ q 4 , but no other pair of states are equivalent. Thus,


ALL THINGS REGULAR. . . 57

the equivalence classes (and also the states of the minimal DFA) are:

[q 0 ] = {q 0 }
[q 1 ] = {q 1 }
[q 2 ] = {q 2 }
[q 3 ] = [q 4 ] = {q 3 , q 4 }
[q 5 ] = [q 6 ] = {q 5 , q 6 }

The minimal DFA is shown in Figure 2.16. M

Figure 2.16: A minimal DFA.


[q 1 ]
a b

a
[q 0 ] [q 3 ] [q 5 ]

b a
a
[q 2 ]

As is, this technique is not really practical, since it requests to compute,


for each state, its accepted language. A more efficient way of minimising
DFAs is to compute directly the equivalence relations by a process called
partition refinement. This algorithm is based on the two following obser-
vations:

1. It is not possible that an accepting state q 1 be equivalent to a non-


accepting state q 2 . Indeed, ε ∈ L(A, q 1 ) (since q 1 is accepting and we
are considering a DFA, hence an automaton without ε-transitions), but
ε ̸∈ L(A, q 2 ) (since q 2 is not accepting). Hence, L(A, q 1 ) is necessarily
different from L(A, q 2 ).

2. If two states q 1 and q 2 are equivalent, then it must be the case that, for
all letters a: δ(q 1 , a) ≡ δ(q 2 , a). That is, reading the same letter from two
equivalent states yields necessarily equivalent states.
This can be shown by contradiction. Assume q 1 ≡ q 2 but δ(q 1 , a) ̸≡
δ(q 2 , a) for some letter a. Since δ(q 1 , a) ̸≡ δ(q 2 , a), the language ac-
cepted from δ(q 1 , a) must be different from the language accepted from
δ(q 2 , a), by definition of the equivalence relation (Definition 2.17). Hence,
there is at least one word w that differentiates these two languages.
Without loss of generality, let us assume that w can be accepted from
δ(q 1 , a) but not from δ(q 2 , a). Since we consider DFAs, we conclude
that a · w ∈ L(A, q 1 ), but that a · w ̸∈ L(A, q 2 ). Hence, it is not possible
that q 1 ≡ q 2 .

Then, the partition refinement procedure consists in refining, itera-


tively, a symmetrical and reflexive relation♣ ∼ on the states, s.t. two states
q i and q j are kept in relation (q i ∼ q j ) as long as they are believed to be
58 I N T R O D U C T I O N T O L A N G UA G E T H E O RY A N D C O M P I L I N G

equivalent (or, in other words, as long as they have not been proved to be
non-equivalent). Initially, all final states are in relation with each oth- Observe that, once the algorithm
has declared that q i ̸∼ q j , then, we
ers, and all non-final states are too. However, no final state is in relation
are sure that q i ̸≡ q j . However, q i ∼
with a non-final one, since we know for sure that final and non-final states q j does not imply that q i ≡ q j . The fact
cannot be equivalent. that q i ∼ q j only represents the current
belief of the algorithm, but it could be re-
The current state of the relation is stored in a matrix P indexed by the vised later.
states (in both dimensions). We let P [q i , q j ] = 1 iff q i ∼ q j . Since the rela-
tion is symmetrical and reflexive, there are only 1’s on the diagonal and the
matrix is symmetrical , and so we keep only the (strictly) upper triangular
part of the matrix. For instance:

q2 q3
P: 0 1 q1
0 q2

indicates that q 1 ∼ q 3 , but that q 1 ̸∼ q 2 and that q 2 ̸∼ q 3 .


Then, the refinement step consists in finding two states q i and q j s.t.:

• q i ̸= q j ;

• q i is currently believed to be equivalent to q j , i.e., P [q i , q j ] = 1; but

• there is a letter a s.t. P [δ(q i , a), δ(q j , a)] = 0.

Because P [δ(q i , a), δ(q j , a)] = 0, we know for sure that δ(q i , a) ̸≡ δ(q j , a).
Hence, as discussed above, it is not possible that q i ≡ q j , and so we put a
0 in P [q i , q j ]. We go on like that as long as we can update some cells of the
matrix. Algorithm 1 presents this algorithm.
Obviously, the algorithm terminates after having updated all cells of
the matrix in the worst case. It is easy to check that it runs in polynomial
time13 . One can prove14 that, upon termination, this algorithm computes 13
Actually in O (n 5 ), which is not very
good. A more clever implementation al-
exactly the relation ≡ that we are looking for:
lows one to achieve O (n log(n)).
14
John E. Hopcroft, Rajeev Motwani, and
Proposition 2.5. The refinement algorithm always terminates. Upon ter-
Jeffrey D. Ullman. Introduction to Au-
mination, q i ≡ q j iff P [q i , q j ] = 1, for all pairs of states (q i , q j ). tomata Theory, Languages, and Computa-
tion (3rd Edition). Addison-Wesley Long-
Example 2.20. Let us apply Algorithm 1 to the example in Figure 2.15. Re- man Publishing Co., Inc., Boston, MA,
USA, 2006. ISBN 0321455363
member that, following our convention, we have not shown, in the figure,
a sink state q s to which the automaton goes every time a transition is not
represented explicitly (for instance, δ(q 1 , a) = q s ). In the algorithm, how-
ever, we must take this state explicitly into account. So, we start with:

q1 q2 q3 q4 q5 q6 qs
1 1 1 1 0 0 1 q0
1 1 1 0 0 1 q1
1 1 0 0 1 q2
1 0 0 1 q3
0 0 1 q4
1 0 q5
0 q6

because q 5 and q 6 are the only accepting states.


Then, the algorithm first treats the q 0 line and discovers that:
ALL THINGS REGULAR. . . 59

Input: A DFA A = Q = {q 1 , . . . , q n }, Σ, δ, q 0 , F .
­ ®

Output: A strictly upper diagonal Boolean matrix P s.t. P [q i , q j ] = 1


iff q i ≡ q j .

P ← strictly upper diagonal matrix of Boolean ;

foreach 1 ≤ i ≤ n do
foreach i < j ≤ n do
if q i ∈ F ↔ q j ∈ F then
P [q i , q j ] ← 1 ;
else
P [q i , q j ] ← 0 ;

Boolean f i ni shed ← 0 ;

while ¬ f i ni shed do
f i ni shed ← 1 ;
foreach 1 ≤ i ≤ n do
foreach i < j ≤ n do
if P [q i , q j ] = 1 then
foreach a ∈ Σ do
if P [δ(q i , a), δ(q j , a)] = 0 then
P [q i , q j ] ← 0 ;
f i ni shed ← 0 ;

return P ;
Algorithm 1: The algorithm to compute the matrix encoding the equiv-
alence classes of ≡.
60 I N T R O D U C T I O N T O L A N G UA G E T H E O RY A N D C O M P I L I N G

• q 0 is not equivalent to q 3 because we have that δ(q 0 , a) = q 1 and δ(q 3 , a) =


q 6 ; but P [q 1 , q 6 ] = 0;

• q 0 is not equivalent to q 4 because we have that δ(q 0 , a) = q 1 and δ(q 4 , a) =


q 5 ; but P [q 1 , q 5 ] = 0;

and updates the q 0 line accordingly. Then, the algorithm processes the q 1
line, and discovers that:

• q 1 is not equivalent to q 3 because we have that δ(q 1 , a) = q s and δ(q 3 , a) =


q 6 ; but P [q s , q 6 ] = 0;

• q 1 is not equivalent to q 4 because we have that δ(q 1 , a) = q s and δ(q 4 , a) =


q 5 ; but P [q s , q 5 ] = 0;

and updates the q 1 line too. Then, for similar reasons, it discovers that q 2
is not equivalent neither to q 3 nor to q 4 . It also finds that neither q 3 nor q 4
can be equivalent to q s . At the end of the first iteration of the while loop,
the matrix P is thus as follows:

q1 q2 q3 q4 q5 q6 qs
1 1 0 0 0 0 1 q0
1 0 0 0 0 1 q1
0 0 0 0 1 q2
1 0 0 0 q3
0 0 0 q4
1 0 q5
0 q6

At the next step the algorithm discovers that:

• q 0 is not equivalent to q 1 , because: δ(q 0 , b) = q 2 and δ(q 1 , b) = q 3 , but


P [q 2 , q 3 ] = 0;

• q 0 is not equivalent to q 2 , because: δ(q 0 , a) = q 1 and δ(q 2 , a) = q 4 , but


P [q 1 , q 4 ] = 0;

• q 1 is not equivalent to q 2 , because: δ(q 1 , b) = q 3 and δ(q 2 , b) = q s , but


P [q 3 , q s ] = 0;

• q 1 is not equivalent to q s , because: δ(q 1 , b) = q 3 and δ(q s , b) = q s , but


P [q 3 , q s ] = 0;

• q 2 is not equivalent to q s , because: δ(q 2 , a) = q 4 and δ(q s , a) = q s , but


P [q 4 , q s ] = 0.

At the end of the second iteration of the while loop, the matrix P is thus as
follows:
q1 q2 q3 q4 q5 q6 qs
0 0 0 0 0 0 1 q0
0 0 0 0 0 0 q1
0 0 0 0 0 q2
1 0 0 0 q3
0 0 0 q4
1 0 q5
0 q6
ALL THINGS REGULAR. . . 61

Finally, during the third and last iteration of the while loop, the al-
gorithm discovers that q 0 is not equivalent to q s because δ(q 0 , a) = q 1 ,
δ(q s , a) = q s , but P [q 1 , q s ] = 0. Hence, the final matrix is:

q1 q2 q3 q4 q5 q6 qs
0 0 0 0 0 0 0 q0
0 0 0 0 0 0 q1
0 0 0 0 0 q2
1 0 0 0 q3
0 0 0 q4
1 0 q5
0 q6

which indeed corresponds to the equivalence classes used to build the au-
tomaton in Figure 2.16. M

2.6 Operations on regular languages

In this section, we consider different operations on sets that also apply


to and are particularly relevant for languages, i.e., the union, the com-
plement and the intersection. We also consider the problems of testing
emptiness, inclusion and equality of regular languages. Of course, we want
to realise all those operations and tests in an algorithmic way. Since regu-
lar languages are potentially infinite, we need to fix a finite representation
for them. Unsurprisingly, we will rely on finite automata.

Union Given two ε-NFAs A 1 = Q 1 , Σ, δ1 , q 01 , F 1


­ ®
and A 2 =
­ 2
Q , Σ, δ2 , q 02 , F 2 , building an ε-NFA that accepts L(A 1 ) ∪ L(A 2 ) is easy: it
®

amounts to adding a fresh initial state that can, by means of an ε-transition,


jump to either q 01 or q 02 . That is, if we let: Remember that the ⊎ symbol de-
notes the ‘disjoint union’, i.e.: A ⊎ B
is A ∪B assuming A ∩B = ;. We use
A = Q 1 ⊎Q 2 ⊎ {q 0 }, Σ, δ′ , q 0 , F 1 ⊎ F 2
­ ®
it to formalise the facts that the set of states
of both automata should be disjoint, and
where, for all q ∈ Q 1 ⊎Q 2 ⊎ {q 0 }, all a ∈ Σ ∪ {ε}: that the initial state q 0 is a ‘newly created’
 state.


 {q 01 , q 02 } if q = q 0 and a = ε
 δ1 (q, a)

if q ∈ Q 1
δ(q, a) =


 δ2 (q, a) if q ∈ Q 2

 ; otherwise

then, L(A) = L(A 1 ) ∪ L(A 2 ).


Observe that the resulting automaton is necessarily an ε-NFA, even if
A and A 2 are DFAs. It can of course be determinised, like every ε-NFA,
1

using the procedure discussed in Section 2.4.2.

Complement Given an ε-NFA A, we want to compute an ε-NFA A s.t.


L(A) = L(A) = Σ∗ \ L(A). Probably the first idea that comes to one’s mind
when looking for a technique to complement automata is to ‘swap accept-
ing and non-accepting states’. This idea, unfortunately, does not work in
general as shown in Figure 2.17: the automaton (on alphabet Σ = {a}) in
the figure accepts a∗ , so, the complement of its language is Σ∗ \ a∗ = ;.
62 I N T R O D U C T I O N T O L A N G UA G E T H E O RY A N D C O M P I L I N G a

q1

However, the automaton obtained from the one in Figure 2.17 by having a
+
q 2 only as accepting state accepts a ̸= ;.
q0
From this example, it is clear that the problem comes from the non-
determinism. A DFA, however, has exactly one execution on each word w a a
that ends in an accepting state iff w is accepted. So, swapping accepting
q2
and non-accepting states of a DFA A, and keeping the rest of the automa-
ton identical yields an automaton A accepting the complement of A’s lan- Figure 2.17: Swapping accepting and non-
accepting states does not complement
guage. On each word w, the sequence of states traversed by A will be the non-deterministic automata.
same as in A. Only the final state will be accepting in A iff it is rejecting in
A.
To sum up, for a DFA A = Q, Σ, δ, q 0 , F , we let:
­ ®

A = Q, Σ, δ, q 0 ,Q \ F
­ ®

If the given automaton is not deterministic, we first determinise it using


the procedure of section 2.4.2.

Intersection To compute the intersection of two finite automata languages


L(A 1 ) and L(A 2 ), we build a new FA A that simulates, at the same time, the
executions of A 1 and A 2 on the same word. To do this, A needs to re-
member the current states of A 1 and A 2 . So A’s states are pairs of states
(q 1 , q 2 ), where q 1 is a state of A 1 and q 2 is a state of A 2 . Moreover, the
transition function of A must reflect the possible moves of both automata
when reading the same letter a.
As an example, consider the automata A 1 and A 2 given hereunder. They
accept a∗ · (b + ε) and a + b+ respectively:

A1: A2:
q2 q5 a q7
a b ε

q1 q4
b
b b
q3 q6

Obviously, the initial state of A will be (q 1 , q 4 ), since they are the respective
initial states of A 1 and A 2 . From this state, we can consider several options,
for the transition function:

1. Either the input word begins with an a. A 1 can read this a, but A 2 can-
not since there is not a-labeled transition from q 4 , A 2 ’s current state.
Thus, there is no a-labeled transition from (q 1 , q 4 ) in A.

2. Or, the input word begins with a b. Both automata can read this letter:
A 1 will move either to q 2 or to q 3 , and A 2 will move to q 6 . Hence, in A,
(q 1 , q 4 ) has two b-labeled successors: (q 2 , q 6 ) and (q 3 , q 6 ). Only (q 2 , q 6 )
is accepting, since q 2 and q 6 are both accepting.

3. Or, one of the automata (in this case, A 2 ) makes a spontaneous move,
while the other (A 1 ) is left unchanged. This is possible because there
ALL THINGS REGULAR. . . 63

is an ε-labeled transition from q 4 to q 5 in A 2 . Hence, there is, in A, an


ε-labeled transition from (q 1 , q 4 ) to (q 1 , q 5 ).

Continuing this construction, we obtain the automaton hereunder:

a
(q 1 , q 5 ) (q 1 , q 7 )
ε

b
(q 1 , q 4 ) (q 2 , q 6 )

b
(q 3 , q 6 )

It is easy to check that this automaton accepts a + b which is indeed L(A 1 )∩


L(A 2 ).
Formally, assume A 1 = Q 1 , Σ, δ1 , q 01 , F 1 and A 2 = Q 2 , Σ, δ2 , q 02 , F 2 are
­ ® ­ ®

two ε-NFAs Then, we let A 1 ∩ A 2 be the ε-NFA Q, Σ, δ, q 0 , F where:


­ ®
Remember that the product A ×
B of two sets A and B is the set
1. Q = Q 1 ×Q 2 ; of all pairs (a, b) s.t. the former
element a belongs to A and the latter
2. For all (q 1 , q 2 ) ∈ Q, for all a ∈ Σ ∪ {ε}, δ (q 1 , q 2 ), a contains (q 1′ , q 2′ ) iff
¡ ¢
b, to B . For instance: {1, 2} × {3, 4} =
{(1, 3), (1, 4), (2, 3), (2, 4)}.
one of the following holds:

• q 1′ ∈ δ(q 1 , a) and q 2′ ∈ δ(q 2 , a); or


• a = ε, q 1′ ∈ δ(q 1 , ε) and q 2′ = q 2 ; or
• a = ε, q 1′ = q 1 and q 2′ ∈ δ(q 2 , ε).

3. q 0 = (q 01 , q 02 );

4. F = F 1 × F 2 .

The following can easily be established by induction on the length of the


input words:

Theorem 2.6. Let A 1 and A 2 be two ε-NFAs. Then, L(A 1 ∩ A 2 ) = L(A 1 ) ∩


L(A 2 ).

Finally, observe that the construction for intersection can easily be mod-
ified to obtain an alternative algorithm to compute the union of two ε-
NFAs, simply by letting the set of final states be {(q 1 , q 2 ) | q 1 ∈ F 1 or q 2 ∈
F 2 }. The advantage of this construction is that it produce a DFA when
applied to two DFAs A 1 and A 2 , unlike the previous one that needs ε-
transitions.

Emptiness Clearly, an ε-NFA accepts a word iff there exists a path in the
automaton from the initial state to any accepting state, whatever the word
recognised along the path. Thus, testing for emptiness of ε-NFAs boils
down to a graph problem, which can be solved by classical algorithms
such as breadth- or depth-first search. Algorithm 2 shows a variation of
the breadth-first search to check for emptiness of ε-NFAs. At all times, it
maintains a set Passed of states that it has already visited and a set Frontier
containing all the states that have been visited for the first time at the pre-
vious iteration of the While loop. Each iteration of this loop consists in
64 I N T R O D U C T I O N T O L A N G UA G E T H E O RY A N D C O M P I L I N G

computing a new Frontier set: it contains all direct successors of nodes


from Frontier that are not in Passed. The loop terminates either when all
states have been explored (Frontier = ;) or when an accepting state has
been reached (Passed ∩ F ̸= ;).

Input: An ε-NFA A = Q, Σ, δ, q 0 , F
­ ®

Output: True iff L(A) = ;


Passed ← ; ;
Frontier ← {q 0 } ;
while Frontier ̸= ; and Passed ∩ F = ; do
Passed ← Passed ∪ Frontier ;
NewFrontier ← ; ;
foreach q ∈ Frontier do
foreach a ∈ Σ ∪ {ε} do
NewFrontier ← NewFrontier ∪ δ(q, a) \ Passed ;
¡ ¢

Frontier ← NewFrontier ;
return Passed ∩ F = ; ;
Algorithm 2: Checking for emptiness of ε-NFAs.

Language inclusion Given two ε-NFAs A 1 and A 2 , we would like to check


whether L(A 1 ) ⊆ L(A 2 ), i.e., whether all words accepted by A 1 are also ac-
cepted by A 2 . To do so, we can rely on the machinery we have developed
before. Indeed, it is easy to check that:

L(A 1 ) ⊆ L(A 2 )
iff
L(A 1 ) ∩ L(A 2 ) = ;
In practice, however, the opera-
tions (determinising and comple-
Indeed, if L(A 1 ) ⊆ L(A 2 ), then all words w ∈ L(A 1 ) do not belong to
menting A 2 , computing the inter-
L(A 2 ) (otherwise, they would be rejected by A 2 ). Hence, there is certainly section and checking whether it is empty)
no intersection between L(A 1 ) and L(A 2 ). On the other hand, if L(A 1 ) ∩ can be carried on-the-fly: this allows to
stop the algorithm (and potentially avoid a
L(A 2 ) is empty, this means that there is no word w which is (i) accepted by costly determinisation) as soon as a word
A 1 and (ii) rejected by A 2 . Thus, all words that are accepted by A 1 are also accepted by A 1 and rejected by A 2 is
found. This on-the-fly algorithm will not
accepted by A 2 , hence L(A 1 ) ⊆ L(A 2 ).
be detailed here. It allows to prove that
Checking whether L(A 1 ) ∩ L(A 2 ) = ; can be done by using the tech- the language inclusion problem belongs to
P S PA C E .
niques we have described above: by first building the automaton A = A 1 ∩
A 2 , then checking whether L(A) = ; using Algorithm 2.

Equality testing To test whether L(A 1 ) = L(A 2 ), for two ε-NFAs A 1 and A 2 ,
one can simply check whether L(A 1 ) ⊆ L(A 2 ) and L(A 2 ) ⊆ L(A 1 ). Again, an
efficient, on-the-fly, version of this algorithm better be implemented in
practice.
ALL THINGS REGULAR. . . 65

2.7 Exercises

2.7.1 Definition of regular languages


The definition of regular languages
Exercise 2.1. Consider the alphabet Σ = {0, 1}. Using the inductive defini- is Definition 2.1.
tion of regular languages, prove that the following languages are regular:

1. The set of words made of an arbitrary number of ones, followed by 01,


followed by an arbitrary number of zeroes.

2. The set of odd binary numbers.

Exercise 2.2. Prove that any finite language is regular. Is the language L =
{0n 1n | n ∈ N} regular? Give an intuition of why or why not.

Problem 2.3. Prove that, for all languages L and M : (L ∗ M ∗ )∗ = (L ∪ M )∗ .


Problem taken from Niwińsky and Rytter15 . 15
Damian Niwińsky and Wojciech Rytter.
200 Problems in Formal Languages and
Automata Theory. University of Warsaw,
2.7.2 Finite automata 2017
We have defined the classes of fi-
Exercise 2.4. For each of the following languages (defined on the alphabet nite automata in Section 2.3.
Σ = {0, 1}), design a nondeterministic finite automaton (NFA) that accepts
it:

1. the set of strings ending with 00;

2. the set of strings whose 3rd symbol, counted from the end of the string,
is a 1;

3. the set of strings where each pair of zeroes is directly followed by a pair
of ones;

4. the set of strings not containing 101;

5. the set of binary numbers divisible by 4.


See the procedure for determinis-
Exercise 2.5. Transform the following ε-NFA into DFA: ing automata in Section 2.4.2.

r 0
0, 1 0
0

p 0 q 1 1
t s

a a
b
p q
ε
ε
c c
b
a r
66 I N T R O D U C T I O N T O L A N G UA G E T H E O RY A N D C O M P I L I N G

2.7.3 Regular expressions


The inductive definition of regular
Exercise 2.6. Consider again all the languages from Exercise 2.4, and give expressions is Definition 2.3.
a regular expression that define each of them.
A procedure to turn RE into ε-NFA
Exercise 2.7. For each of the following DFA, give a regular expression ac- has been given in Section 2.4.3.

cepting the same language:

1 0
0 0
q1 q2 q3
1 1

0
q1 q2
0
0
1 1
1
q3

A procedure to turn RE into ε-NFA


has been given in Section 2.4.1.
Exercise 2.8. Convert the following RE into ε-NFA:

1. 01∗

2. (0 + 1)01

3. 00(0 + 1)∗

2.7.4 Extended regular expressions

For the next exercises, you are asked to provide regular expressions using
the ‘extended regular expression’ format (see Section 2.2.1), that is used in
practice. You can test your answers using the regular expression library16 16
Python Software Foundation. re – Regu-
lar expression operations. https://docs.
re in Python, with its re.search(pattern, string, flags=0) method.
python.org/3/library/re.html. On-
The method receives an extended regular expression as the pattern, and line: accessed on April 12th, 2023
returns a Match object indicating the first substring of string (if any)
that matches the pattern. For example:

1 >>> import re
2 >>> re.search("(a|b|c)+","abcbcab")
3 <re.Match object; span=(0, 7), match=’abcbcab’>
4 >>> re.search("(a|b|c)+","abcdef")
5 <re.Match object; span=(0, 3), match=’abc’>
6 >>> re.search("(a|b|c)+","decbaf")
7 <re.Match object; span=(2, 5), match=’cba’>
8 >>> re.search("(a|b|c)+","def")

Observe that the last call returns nothing because no match was possible.

Exercise 2.9. Give an extended regular expression (ERE) that accepts any
sequence of 5 characters, including the newline character \n.

Exercise 2.10. Give an ERE that accepts any string starting with an arbi-
trary number of \ followed by any number of *.
ALL THINGS REGULAR. . . 67

Exercise 2.11. UNIX-like shells (such as bash) allow the user to write batch
files in which comments can be added. A line is defined to be a comment
if it starts with a # sign. What ERE accepts such comments?

Exercise 2.12. Design an ERE that accepts numbers in scientific notation.


Such a number must contain at least one digit and has two optional parts:

• a decimal part: a dot followed by a sequence of digits; and

• an exponent part: an E followed by an integer that may be prefixed by +


or -.

For example, the following strings are valid numbers in scientific notation:
42, 66.4E-5, 8E17

Exercise 2.13. Design an ERE that accepts ‘correct’ sentences that fulfill
the following criteria: (i) no prepending/appending spaces; (ii) the first
word must start with a capital letter; (iii) the sentence must end with a
dot .; (iv) the phrase must be made of one or more words (made of the
characters a...z and A...Z) separated by a single space; (v) there must
be one sentence per line; and (vi) punctuation signs other than the dot are
not allowed.

Exercise 2.14. Give an ERE that accepts old school DOS-style filenames
respecting the following criteria. First, each filename starts by 8 characters
(among a...z, A...Z and _), and the first five characters must be abcde.
Next, each filename has an extension which is .ext. Finally, the ERE must
accept accept the filename only (i.e., without the extension)!
For example, on abcdeLOL.ext, the ERE must accept abcdeLOL.

2.7.5 Minimisation of automata and other operations

Exercise 2.15. Here is a finite automaton:

q0 1 q1 0 q3
1
0
q2 0 q4
0 q5

We want to compute a minimal DFA that accepts the same language as


this automaton: The subset construction technique
is the method of Section 2.4.2.
1. First, determinise this automaton using the subset construction tech-
nique.

2. Is the resulting automaton minimal (in terms of number of states)?

3. For each state q of the resulting DFA, give, as a regular expression, the
language L q that the automaton would accept if q were the initial state.
68 I N T R O D U C T I O N T O L A N G UA G E T H E O RY A N D C O M P I L I N G

4. Based on the information computed at the previous point, propose a


smaller DFA accepting the same language as the original automaton.
The method to minimise DFA is in
Section 2.5, Algorithm 1.
5. Finally, apply the systematic method to minimise DFA and compare
the results.

Exercise 2.16. Consider again the finite automaton at the beginning of


Exercise 2.15. Let us denote this automaton by A = Q, Σ, δ, q 0 , F . Then:
­ ®

1. draw the automaton A ′ = Q, Σ, δ, q 0 ,Q \ F . How can you describe the


­ ®

relationship between A and A ′ in plain english?

2. what is the relationship between L(A) and L(A ′ )? Do they have a non-
empty intersection? Is it the case that L(A) is the complement of L(A ′ )?
The method to complement a finite
automaton is found in Section 2.6
3. . Apply the systematic method to compute the complement of a finite
automaton, and check that the result indeed accepts the complement
of L(A).

Exercise 2.17. Here are two finite automata A and B on the singleton al-
phabet {,}:

q1
,

A: q0 , ,
B: q0 q1
,
q2
,

1. Describe, in plain english, the respective languages of these automata.


Then, describe, again in plain english, the intersection of these two lan-
guages.
The algorithm to compute the in-
tersection is in Section 2.6 again.
2. Use the method to compute the intersection of two finite automata on
A and B , and check whether the result matches your intuition.

2.7.6 The scanner generator JFlex

For this part of the exercises, we will rely on the scanner generator JFlex.
A scanner is a program that reads text on the standard input and prints it
on standard output after applying operations. For example, a filter that re-
places all a with b and that receives abracadabra on input would output
bbrbcbdbbrb. Then, JFlex is a tool that generates such a scanner based
on a set of regular expressions that specify which part of the input should
be matched and modified. To recognise these regular expressions, JFlex is
based on the theory of finite automata that we have studied. The gener-
ated scanned is, in fact, a Java function.
JFlex can be dowloaded from http://jflex.de and a user manual is
at http://jflex.de/manual.html.
We start by a very short explanation of the tool.

Specification format A JFlex specification is made of three parts sepa-


rated by lines with %%:
ALL THINGS REGULAR. . . 69

1. the first part is the user code. It can contain any Java code, that will be
added at the beginning of the generated scanner.

2. the second part contains options and declarations.


The options include:

• %class Name to tell JFlex to produce a scanner inside the classe called
Name;

• %unicode to enable unicode input;


• %line and %column to enable line and column counting respectively;
• %standalone to generate a scanner that is not called by a parser.

Then, some extra Java code included between %{ and %} can be gener-
ated. It will be copied verbatim inside the generated Java class (contrary
to the code of the first part which appears outside of the class).
Finally, some ERE can be defined. They will be used as macros in part 3
of the file to enhance readability. For example:

1 Comment = "/*" [^*] ~"*/" | "/*" "*"+ "/"

defines the macro Comment and associates it to the given ERE.

3. the third part contains the core of the scanner. It is a series of rules
that associate actions (in terms of Java code) to the regular expressions.
Each rule is of the form:

1 Regex {Action}

where:

• Regex is an extended regular expression (ERE), that can use some


of the regular expressions defined in part 2 as macros (using curly
braces around their names, for example: {Comment});
• Action is a Java code snippet that will be executed each time a token
matching Regex is found.

For example, the rule:

1 "==" { return symbol(sym.EQEQ); }

instructs the scanner to return sym.EQEQ every time == is found on the


input.

The reader is advised to have a look at the JFlex documentation17 for a 17


Gerwin Klein, Steve Rowe, and Régis Dé-
camps. Jflex user’s manual. https://
comprehensive example
jflex.de/manual.html, March 2023. Ver-
sion 1.9.1. Online: accessed on April, 12th,
Variables and special actions When writing actions, some special vari- 2023

ables and macros can be accessed:

• yylength() contains the length of the recognized token

• yytext() is a the actual string that was matched by the regular expres-
sion.
70 I N T R O D U C T I O N T O L A N G UA G E T H E O RY A N D C O M P I L I N G

• yyline is the line counter (requires the option %line).

• yycolumn is the column counter (requires the option %column).

• EOF is the End-of-file marker . On Mac, files which do not end


with an empty line can make the
lexer “forget about” the last line.
Meta states In order to track the progress of the scanner, states can be When you test your lexer, make sure that
used. Each action can change or query the current state of the scanner. your test files end with an empty line.

This amounts to having a finite automaton running in parallel to the scan-


ner. This way, when one particular token is recognised, the scanner can
change the current state, which allows to ‘store’ some information that can
be checked during a further call to the scanner.
There are actually two kind of states:

• inclusive states, declared by %state, which acts as Booleans (the scan-


ner can be in several of those states at a time);

• exclusive states, declared by %xstate which are mutually exclusive (like


regular automata states).

Then, the rules can be associated to states, and are active only in these
states. A state can be ‘activated’ using the function yybegin(S) in the code
(where S is the name of the state to activate). Here is an example:

1 %%
2 xstate YYINITIAL, PRINT;
3 %%
4 <YYINITIAL> {
5 "print" {yybegin(PRINT);}
6 }
7 <PRINT> {
8 ";" {yybegin(YYINITIAL);}
9 . {System.out.println(yytext());}
10 }

Executable To obtain the scanner executable, follow these steps18 : 18


Assuming you are using version 1.9.1,
which is the last version at the time of writ-
1. Generate the scanner code with: ing.

1 java -jar jflex-1.9.1.jar myspec.flex+

which creates the file Lexer.java containing the Lexer class (the %class
option can be used to change this);

2. compile the code into a class file: javac Lexer.java which creates
Lexer.class;

3. run it with java Lexer inputfile.

Here are now some exercises to get you familiar with JFlex:

Exercise 2.18. Write a scanner that outputs its input file with line numbers
in front of every line.
ALL THINGS REGULAR. . . 71

Exercise 2.19. Write a scanner that outputs the number of alphanumeric


characters, alphanumeric words and alphanumeric lines in the input file.

Exercise 2.20. Write a scanner that only shows the content of comments
in the input file. Such comments are enclosed within curly braces { }.
You can assume that the input file does not contain curly braces inside
comments.

Exercise 2.21. Write a scanner that transforms the input text as follows.
It should replace the word compiler by nope if the line starts with an a;
by ??? if it starts with a b and by !!! if it starts with a c.

Exercise 2.22. Write a lexical analysis function that recognises the follow-
ing tokens:

• decimal numbers in scientific notation (e.g. -0.4E-1);

• C99 variable identifiers: they start with an an alphabetical symbol, fol-


lowed by an arbitrary number of alphanumeric symbols or underscores;

• relational operators (<, >, ==, !=, >=, <=, !)

• The keywords if, then and else.

Each call to the function must seek the next token on the input. Every time
a token is found, your function must output a message of the form TOKEN
NAME: token (for example: C99VAR: myvariable) and return an object
Symbol containing the token type, its value and its position (line and col-
umn). Templates for the Symbol and LexicalUnit classes are provided on
the Université Virtuelle.
3 Grammars

G R A M M A R S A R E T H E T O O L W E W I L L U S E T O S P E C I F Y T H E F U L L S Y N TA X
O F P R O G R A M M I N G L A N G UA G E S . They are also the basic bulding block
of the systematic construction technique of parsers that we will discuss in
Chapter 5 and Chapter 6. Before giving the formal syntax and semantics
of grammars, we start with a discussion on the limits of regular languages,
to motivate the need for other, more expressive, formalisms.

3.1 The limits of regular languages and some intuitions

In the previous chapter, we have seen several examples of applications of


finite automata, and thus, also, several examples of languages that are reg-
ular. We have also sketched1 the intuitions explaining that the language L () 1
see Example 1.9, and the discussion on
page 14.
of well-parenthesised expressions is not regular. Recall that the intuition
was the following. To check that an expression is well-parenthesised, one
scans the expression from the left to the right, and maintains, at all times, a
counter that tracks the number of pending open parenthesis. Then, when-
ever an opening parenthesis is met, the counter is incremented. Whenever
a closing parenthesis is found, the counter must be strictly positive (oth-
erwise the word is rejected) and is decremented. At the end, the counter
must be equal to 0 (all opened parenthesis have been closed, and no pend-
ing open parenthesis remain) for the word to be accepted. Observe that
one cannot bound the value of the counter, because the length of the words
in L () is not bounded. Intuitively, it seems that this unbounded counter a c
is necessary to recognise words from L () , and that such an unbounded q0 q1 q2
counter cannot be coded in the finite structure of finite automata. b d
Indeed, the only ‘memory’ that finite automata have is their set of states. Figure 3.1: An automaton that ‘forgets’
whether the first letter was an a or a b.
To illustrate how states can be used as a ‘memory’ consider the language
{ab, cd}. This language can be loosely characterised as: ‘if the former letter a q1 b

is an a, then the latter should be a b; if the former is a c, the latter should be


q0 q2
a d. So, to decide whether we should accept after reading the latter letter,
we should remember the former one. c q 1′
d
A first (and very naive!) attempt at building an automaton recognis- Figure 3.2: An automaton that ‘remem-
ing {ab, cd} could be the DFA in Figure 3.1. Clearly, this attempt fails be- bers’ whether the first letter was an a or a b.

cause the automaton always ends up in state q 1 after reading the first
letter, regardless of the letter. As a consequence the automaton accepts
{ab, cd, ad, bc}. Of course, the automaton in Figure 3.2 now accepts the
right language, because states q 1 and q 1′ act as a memory: when the au-
tomaton reaches q 1 , it has recorded that the first letter was an a; and when
74 I N T R O D U C T I O N T O L A N G UA G E T H E O RY A N D C O M P I L I N G

it reaches q 1′ , the first letter was surely a c.

Now that we have a good intuition of what ‘memory’ means for a finite
automaton, let us prove formally that L () is indeed not regular. Our proof
strategy will be by contradiction: we will assume that L () is regular, and
hence, that there exists an ε-NFA A () that recognises it (by Theorem 2.2).
Then, we will exploit the fact that this hypothetical A () has finitely many
states to derive a contradiction. Assuming A () has n states, we will select
a word from L () which is ‘very long’: this means that the word will be long
enough to guarantee that, when the automaton accepts it, an accepting
run necessarily visits twice the same state. Coming back to our intuition
that each state of a finite automaton represents a possible memory con-
tent, this means that, after reading two different prefixes w 1 and w 2 —that
contain a different number of pending open parenthesis—the automaton
has memorised the exact same knowledge about those prefixes2 . However, 2
Just as in the example of Figure 3.1, where
the automaton has the same knowledge
since w 1 and w 2 contain a different number of pending open parenthesis,
when it reads either a or c as the first let-
the behaviour of the automaton should be different after reading these two ter.
prefixes. Unfortunately, since the automaton is in the same state in both
cases, there will be at least one common execution after w 1 and w 2 , i.e.,
by lack of memory, the automaton gets mixed up, and accepts words that
are not part of L () . Let us now formalise this intuition.

Theorem 3.1. L () is not regular

Proof. The proof is by contradiction. Assume L () is regular, and let A () be


an ε-NFA s.t. L(A () ) = L () . Such an ε-NFA exists by Theorem 2.2. Let n be
the number of states of A () , and let us consider the word:

w = ((· · · (( )) · · · ))
| {z } | {z }
n n

Clearly, w ∈ L () , hence w is accepted by A () . Thus, there exists an accept-


ing run of A () on w. Let us assume this run visits the sequence of states
q 0 , q 1 , . . . q 2n , which we represent as:
( ( ( ) )
q0 →
− q1 → − qn →
− ··· → − q n+1 · · · →
− q 2n
(
where q i →
− q i +1 means that the automaton moves from state q i to state
q i +1 by reading a ( (and similarly for the ) character).
Then, since A () has n states, the run portion that accepts the prefix ( · · · (
|{z}
n
must necessarily visit twice the same state3 . Formally, we let k and ℓ be 3
One can invoke here the (in)famous ‘pi-
geonhole principle’ stating that if m pi-
two positions s.t. 0 ≤ k < ℓ ≤ n and q k = q ℓ . In other words, the run por-
geons occupy n holes, and there are strictly
tion between q k and q ℓ is actually a loop in the automaton, that we can more pigeons than holes (m > n), then
repeat as many times as we want, obtaining a word which is not in L () . there is necessarily a hole that contains at
least two birdies.
More precisely, the run is of the form:

( ( ( ( ( ( ) )
q0 → − qk →
− ··· → −
| ·{z −} q ℓ →
·· → − qn →
− ··· → − q n+1 · · · →
− q 2n
loop

Hence, the run obtained by repeating the loop twice is also an accepting
run:

( ( ( ( ( ( ( ( ) )
q0 → − qk →
− ··· → −
| ·{z −} q k →
·· → |− ·{z −} q ℓ →
·· → − qn →
− ··· → − q n+1 · · · →
− q 2n
=q ℓ
loop loop
GRAMMARS 75

However, since ℓ − k > 0, this run accepts a word of the form:

w′ = (······()······)
| {z } | {z }
m times n times

with m > n, which means that w ′ ̸∈ L () , as not all opened parenthesis


have been properly closed. Thus, we have just shown that A () accepts a
word which is not in L () . This contradicts our assumption that L(A () ) =
L () . Hence, the hypothetical automaton A () does not exist. Since there
is no ε-NFA that accepts L () , we conclude, by Theorem 2.2 that L () is not
regular.

This Theorem clearly shows that there are interesting and (arguably)
simple languages that are not regular (hence, they cannot be specified by
means of a regular expression, nor recognised by a finite automaton). This
motivates the introduction of grammars, which are a much more powerful
formalism for specifying languages (in the next chapter, we will study ex-
tensions of automata to handle more languages than finite automata can
do).

Intuitive example In order to introduce grammars, we start with an intu-


itive example that somehow ‘generalises’ the language L () . Let us consider
a definition of ‘expressions’, which is inductive:

1. The sum of two expressions is an expression;

2. The product of two expressions is an expression;

3. An expression between matching parenthesis is an expression;

4. An identifier Id is an expression;

5. A constant Cst is an expression.

As an example, the string (Cst + Id) ∗ Id is an expression but )(CstId) is not,


as can be checked with the definition.
This definition is fine, and can easily be applied to check whether a
given string is a (syntactically correct) expression. However, we would like
to have a more ‘generative’ way of defining expressions. To achieve this, we
will rely on the idea of rewrite system. Roughly speaking, a rewrite system
is a set of rules that allow one to modify a given string of symbols to obtain
a new string. More precisely, each rewrite rule is of the form α → β, where
α and β are strings of symbols. Such a rewrite rule means, intuitively, that
the string of symbols α can be replaced (or ‘rewritten’) by β. For instance,
given the rule Ab → B c, the string a AbB c can be rewritten as aB cB c, by
substituting B c for Ab in the string.
Using this intuition, we can now set up a set of rewriting rules to gen-
erate any syntactically correct grammar (and only those grammars). To
achieve this, we need to introduce certain intermediary symbols that we
call variables. In our case, we need only one variable Exp that represents
an expression. Then, an expression Exp can be rewritten following one
of the five items used in the inductive definition above (i.e., as a sum of
expressions, or a product of expressions, or. . . ). This yields the following
rules:
76 I N T R O D U C T I O N T O L A N G UA G E T H E O RY A N D C O M P I L I N G

(1) Exp → Exp + Exp


(2) Exp → Exp ∗ Exp
(3) Exp → (Exp)
(4) Exp → Id
(5) Exp → Cst
Since all right-hand sides of the rules share the same variable, we often
omit to repeat it, and rather present the rules as:
(1) Exp → Exp + Exp
(2) → Exp ∗ Exp
(3) → (Exp)
(4) → Id
(5) → Cst
It is easy to check that this set of rules indeed allows to generate, by
successive rewriting the string (Cst + Id) ∗ Id, starting from the sequence
Exp. Indeed, Exp can be rewritten as Exp ∗ Exp, by rule 2. In this new
sequence, the latter occurrence of Exp, can be rewritten as Id, by rule 5,
yielding Exp ∗ Id, and so forth. The whole sequence of rewriting is given
hereunder:

2 4 3 1 4 5
Exp =
⇒ Exp ∗ Exp = ⇒ (Exp) ∗ Id =
⇒ Exp ∗ Id = ⇒ (Exp + Exp) ∗ Id =
⇒ (Exp + Id) ∗ Id =
⇒ (Cst + Id) ∗ Id

Note that the initial sequence of symbols Exp contains only one vari-
able; that the last sequence (Cst + Id) ∗ Id is actually a word (it contains
only symbols from the language’s alphabet); and that all the intermedi-
ary sequences contain symbols from the alphabet, and variables (that are
eventually eliminated by rewriting). Let us now formalise these notions.

3.2 Syntax and semantics

Syntax The formal definition of grammar follows the intuitions we have


sketched above:

Definition 3.1 (Grammar). A grammar is a quadruplet G = 〈V , T , P , S〉


where:

• V is a finite set of variables;

• T is a finite set of terminals;

• P is a finite set of production rules of the form α → β with:

– α ∈ (V ∪ T )∗ V (V ∪ T )∗ and

– β ∈ (V ∪ T )∗

• S ∈ V is a variable called the start symbol.

Example 3.2. Formally, the grammar that defines expressions is the tuple:

G Exp = 〈{Exp}, {Cst, Id, (, ), +, ∗}, P , Exp〉


GRAMMARS 77

with:
 

 Exp →Exp+Exp 

 
 Exp →Exp*Exp

 

 

P= Exp →(Exp)
 
Exp →Id

 


 

 
Exp →Cst
 

Observe that our definition of the syntax of grammars is very general.


Rules are of the form α → β with: α ∈ (V ∪ T )∗ V (V ∪ T )∗ and β ∈ (V ∪ T )∗ ,
which means that both left- and right-hand sides of each rules can contain
variables and terminals. The only requirement is that the left-hand side
contains at least one variable. The next example shows the kind of lan-
guages that one can define when left-hand sides of rules are not restricted
to a single variable:

Example 3.3. Consider the grammar:

G ABC = {A, B ,C , S, S ′ }, {a, b, c}, P , S


­ ®

where P contains the following set of rules:


(1) S → ε
(2) → S′
(3) S′ → ABC S ′
(4) → ABC
(5) AB → BA
(6) BA → AB
(7) AC → CA
(8) CA → AC
(9) BC → CB
(10) CB → BC
(11) A → a
(12) B → b
(13) C → c
We claim that this grammar allows one to generate all words on the al-
phabet {a, b, c} that contains the same number of a’s, b’s and c’s. For in-
stance, consider the word aabcbc. Since this word is not empty, we first
apply rule 2. Then, we apply rules 3 and rule 4 to generate 2 A’s, 2 B ’s and
2 C ’s:
2 3 4
⇒ S′ =
S= ⇒ ABC S ′ =
⇒ ABC ABC

Then, we use rules 6 and 8 to move the latter A two positions to the left:
8 6
ABC ABC =
⇒ AB AC BC =
⇒ A ABC BC

Finally, we use rules 11–13 to obtain the desired word:


11 11 12
A ABC BC =⇒ a ABC B =⇒ aaBC BC · · · =⇒ aabcbc

It is easy to generalise this example to any word on {a, b, c} with the same
number of a’s, b’s and c’s, and to check that only those words can be gen-
erated by the grammar. M
78 I N T R O D U C T I O N T O L A N G UA G E T H E O RY A N D C O M P I L I N G

Unfortunately, this generality is very powerful and makes many prob-


lems on grammars undecidable. In Section 3.3, we will introduce restricted Actually, it is possible to encode
a Turing machine into a grammar.
syntactic classes of grammars that are still sufficient in practice, but for
The rough idea is to use the deriv-
which meaningful problems are decidable. able strings of the grammar to describe the
reachable configurations of the machine.
A configuration of a Turing machine can
Semantics Let us now formally define the semantics of grammars, i.e., be encoded as a word of the form w 1 q w 2 ,
the set of words that a grammar can generate. where w 1 is the content of the tape to the
left of the head, and w 2 is the content of
Definition 3.4 (Derivation). Let G = 〈V , T , P , S〉 be a grammar, and let γ the tape (excluding blank characters) un-
der and to the right of the head. Such a
and δ be s.t. γ ∈ (V ∪ T )∗ V (V ∪ T )∗ , and δ ∈ (V ∪ T )∗ . Then, we say that δ
configuration can be encoded by the string
can be derived from γ (under the rules of G), written: w 1 Qw 2 where w 1 and w 2 contain only ter-
minals, and Q is a variable. Then, if the
γ ⇒G δ machine can move from q to q ′ , reading
an a, replacing it by a b, and moving the
head to the right, the grammar would con-
iff there are γ1 , γ2 ∈ (V ∪ T )∗ and a rule α → β ∈ P s.t.: γ = γ1 · α · γ2 and tain the rule Qa→bQ ′ , and so forth. . .
δ = γ1 · β · γ2 . M

On top of this definition, we introduce several notations and short-


hands. When the grammar G is clear from the context, we will often omit
it from ⇒G . That is, we write γ ⇒ δ instead of γ ⇒G δ. Clearly, for all gram-
mar G, ⇒G is a relation on words from (V ∪ T )∗ . Thus, we denote by ⇒∗
the reflexive and transitive closure of ⇒, similarly to what we did for finite
automata in Chapter 2. We also write γ ⇒i δ for i ∈ N iff δ can be derived Do not confuse γ ⇒i δ, which
means: ‘δ can be derived from γ in
from γ in i steps, i.e., there are γ1 , γ2 ,. . . ,γi −1 s.t. γ ⇒ γ1 ⇒ γ2 · · · γi −1 ⇒ δ. i
at most i steps’ and γ =
⇒ δ, which
Finally, let us introduce the following important vocabulary:
means: ‘δ can be derived from γ by apply-
ing rule number i once’.
Definition 3.5 (Sentential form). Let G = 〈V , T , P , S〉 be a grammar. A sen-
tential form is a word from (V ∪ T )∗ that can be derived from the start
symbol. Formally: γ ∈ (V ∪ T )∗ is a sentential form (of G) iff S ⇒G

γ. M

Of course, we will be interested in certain sentential forms: those that


contain terminals only, because they form the language of the grammar:

Definition 3.6 (Language of a grammar). Let G = 〈V , T , P , S〉 be a gram-


mar. The language of G is:

L(G) = {w ∈ T ∗ | S ⇒G

w}

3.3 The Chomsky hierarchy

As sketched before, the definition of grammar we have given (Definition 3.1)


is so general that many problems on grammars4 are undecidable. Yet, as 4
including the language membership
problem, i.e., does a given word w belong
the example of grammar G Exp clearly shows, grammars are excellent tools
to the language L(G) of a given grammar
to describe the syntax of programming languages (among other potential G?
applications). The aim of the Chomsky hierarchy we are about to intro-
duce is to identify syntactic classes of grammars that are useful for practical
and/or theoretical purposes.

Definition 3.7 (Chomsky hierarchy). The Chomsky hierarchy is made up


of four classes of grammars, defined according to syntactic criteria:

Class 0: Unrestricted grammars All grammars are in this class.


GRAMMARS 79

Class 1: Context-sensitive grammars A grammar G = 〈V , T , P , S〉 is context


sensitive iff each rule α → β ∈ P is s.t.: 1. either α = S and β = ε; 2. or
|α| ≤ ¯β¯ and S does not appear in β.
¯ ¯

Class 2: Context-free grammars A grammar G = 〈V , T , P , S〉 is context-free


iff each rule α → β ∈ P is s.t.: α ∈ V , i.e., the left-hand side is only one
variable.

Class 3: Regular grammars A grammar G = 〈V , T , P , S〉 is regular iff it is


either left-regular or right-regular: Observe that a grammar that con-
tains rules of the form A → wB and
Left-regular grammars G is left-regular iff each rule α → β ∈ P is s.t. α ∈ of the form A → B w at the same
time is not regular.
V and either β ∈ T ∗ , or β ∈ V · T ∗ .
Right-regular grammars G is left-regular iff each rule α → β ∈ P is s.t.
α ∈ V and either β ∈ T ∗ , or β ∈ T ∗ · V .

This definition calls for several comments. First, let us give some exam-
ples of grammars:

Example 3.8.

1. The grammar G a ∗ = 〈{S}, {a}, P , S〉 where P contains the rules:

(1) S → Sa
(2) → ε

is left-regular and L(G a ∗ ) = {a}∗ . Observe that replacing the first rule by
S→aS yields a right-regular grammar that accepts the same language.

2. To the contrary, the grammar G = 〈{S}, {a}, P , S〉 where P contains the


rules:
(1) S → Sa
(2) → aS
(3) → ε

is not regular because it is neither left-regular nor right-regular. In other


words, mixing left-recursive and right-recursive rules is not permitted
in a regular grammar. Yet, the language accepted by G is still {a}∗ and
thus regular.

3. The grammar G Exp of Example 3.2 is context-free but not regular, be-
cause, for instance, of rule Exp→Exp + Exp .

4. The grammar G ABC of Example 3.3 is context-sensitive but not context-


free, because of rule AB → B A for instance (two variables on the left-
hand side).

Then, let us discuss the name ‘hierarchy’. This term seems to imply that
the classes of grammars are contained into each other, in other words that
each class 3 grammar is a class 2 grammar, each class 2 grammar is a class
1 grammar, and each class 1 grammar is a class 0 grammar. Unfortunately
this is not the case, as shown by the next example:
80 I N T R O D U C T I O N T O L A N G UA G E T H E O RY A N D C O M P I L I N G

Example 3.9. Consider the grammar with the two following rules:
(1) S → S′
(2) S ′
→ ε
Obviously, this grammar is regular (class 3) because the first rule is of
the form S→wS ′ , with w = ε ∈ T ∗ ; and the latter is of the form S ′ →w
with w = ε ∈ T ∗ again. It is also context-free (class 2) because the left-
hand sides of all its rules are made up of only one variable. Of course, it
is also unrestricted (class 0). But it is clearly not context-sensitive (class 1)
because of the rule S ′ →ε where S ′ ̸= S.
However, observe that the class 1 grammar

〈{S}, ;, {S → ε}, S〉

accepts the same language as G. M

So, while the Chomsky hierarchy does not form a hierarchy of gram-
mars, it defines a hierarchy of languages. To establish this, let us begin
with a few more definitions:

Definition 3.10 (Context-free language). A language L is context-free iff


there exists a context-free grammar G s.t. L(G) = L. M

And, similarly:

Definition 3.11 (Context-sensitive language). A language L is context-sensitive


iff there is a context-sensitive grammar G s.t. L(G) = L. M

Definition 3.12 (Recursively enumerable language). A language L is re-


cursively enumerable (or recognisable) iff there is a grammar G s.t. L(G) =
L. M
Recall that we have used the nota-
We denote by CFL and CSL and RE the sets of context-free, context- tion Reg to denote the set of regular
languages.
sensitive, and recursively enumerable languages respectively5 Then, the 5
Note that, by abuse of notation, we often
next theorem justifies the name ‘Chomsky hierarchy’: write ‘a CFL’ or ‘a CSL’ to mean ‘a language
belonging to CFL’ or ‘a language belonging
Theorem 3.2. For i = 0, 1, 2, 3 let us denote by L i the class of languages to CSL’, etc.
recognised by type i grammars in the Chomsky hierarchy. Then:

L 3 = Reg ⊊ L 2 = CFL ⊊ L 1 = CSL ⊊ L 0 = RE

Proof. (Sketch) We will not prove the theorem with great detail, but we
b
will highlight the main arguments behind each of those relations.

q0 a q1
1. L 3 = Reg. We first show that each DFA A can be converted into a gram-
mar G A s.t. L(A) = L(G A ). The intuition of the construction is as follows.
b a
The set of variables of G A is the set of states of A. We will build the set of
rules in such a way that all sentential forms are of the form w q, where q2
w ∈ Σ∗ is the prefix read so far by the automaton, and q is the current
state. Then, for each transition from some q 1 to some q 2 labeled by a, a, b
the grammar contains a rule q 1 →aq 2 . Finally, when an accepting state (1) q0 → aq 1
(2) → bq 2
is reached, we must be able to get rid of the variable from the sentential
(3) q1 → bq 1
form, so we have rules of the form q→ε for all q ∈ F . Fig. 3.3 shows an (4) → aq 2
example of this transformation. (5) → ε
(6) q2 → aq 2
(7) → bq 2
Figure 3.3: An example of a DFA and its
corresponding right-regular grammar.
GRAMMARS 81

Formally, assuming A = Q, Σ, δ, q 0 , F , we build the grammar G A =


­ ®

Q, Σ, P , q 0 , where:
­ ®

P = {q → aq ′ | δ(q, a) = q ′ } ∪ {q → ε | q ∈ F }

Observe that the resulting grammar is right-regular, so all regular lan-


guages can be accepted by right-regular grammars. This shows that
L 3 ⊇ Reg. To show Reg ⊆ L 3 , we must show that both right-regular
and left-regular grammars define regular languages only. For right-regular
grammars, we can adapt the arguments of the proof above and trans-
form all right-regular grammars G into an ε-NFA AG s.t. L(G) = L(AG ).
We turn each variable of the grammar into a state of the automaton. (1) S → aaB
(2) → a
For all rules of the form A → wB , we add transitions from A to B read-
(3) B → bB
ing word w, adding intermediary states if appropriate. For all rules of (4) → ε
the form A → w, we let the automaton read the word w from state A and b
reach an accepting state. We do not give the details of the construction,
a a
but Fig. 3.4 gives an example. S B

Finally, it is possible to adapt this construction to build, for all left- ε


a
regular grammars G, an ε-NFA AG s.t. L(AG ) is the mirror image of
L(G), i.e., L(AG ) contains all words from G but reversed. Then, it is easy
to modify AG to let it accept L(G), by swapping final and initial states6 Figure 3.4: An example of a right-regular
and reversing the directions of the transitions. We omit the details here. grammar and its corresponding ε-NFA.
6
Adding a fresh initial state if necessary.
2. L 2 = CFL. This equality holds by definition (see Definition 3.10).

3. Reg ⊊ L 2 . To show that Reg ⊆ L 2 , it suffices to observe that all regu-


lar grammars are also context-free (check Definition 3.7). Hence, L 3 ⊆
L 2 , but since Reg = L 3 , we have Reg ⊆ L 2 . To show the strict inclu-
sion, it is sufficient to exhibit a language which is a CFL but not regular.
This is the case of L () (see beginning of the Chapter). We have shown, in
Theorem 3.1 that L () is not regular. We can show that it is in CFL (hence
a in L 2 ) by providing a context-free grammar that defines it:

(1) S → SS
(2) → (S)
(3) → ε

It is easy to check that this grammar is indeed context-free and defines


L () , so L () is a CFL.

4. L 1 = CSL. Again, this holds by definition (Definition 3.11).

5. CFL ⊊ L 1 . To prove this inclusion, we must first show that all con-
text free languages (which can be defined by a context free grammar,
since L 2 = CFL) are also context-sensitive languages. Unfortunately,
as shown by Example 3.9 above, not all context-free grammars are context-
sensitive, so we cannot use a simple and direct syntactic argument as
we did when proving that all regular languages are also context-free.
Observe that, in a context-free grammar, a rule of the form α → β that
violates the property |α| ≤ ¯β¯ is necessarily a rule of the form A → ε.
¯ ¯

Indeed, in a context-free grammar, all left-hand sides of rules contain


only one variable. Thus |α| = 1 in all rule α → β and so, |α| > ¯β¯ means
¯ ¯

¯β¯ = 0, hence β = ε.
¯ ¯
82 I N T R O D U C T I O N T O L A N G UA G E T H E O RY A N D C O M P I L I N G

Based on this observation, we propose a technique that turns any context-


free grammar into an equivalent context-free grammar without rules
of the form A → ε, except when A is the start symbol, because this is
the only case where an ε in the right-hand side is allowed in a context-
sensitive grammar. The procedure works as follows. If a context-free
grammar G = 〈V , T , P , S〉 contains a rule of the form A → ε, with A ̸= S,
then:

(a) Remove from P the rule A → ε:

P ′ = P \ {A → ε}

(b) Find in P ′ all the rules of the form B → β where β contains A, and,
for each of those rules, add to P ′ all the rule B → β′ , where β′ has
been obtained by removing all the A symbols7 from β: 7
Or, in other words, replacing all A’s by ε.

P ′′ = P ′ ∪ B → β1 β2 · · · βn | B → β1 Aβ2 A · · · Aβn ∈ P ′ with βi ∈ (T ∪ V \ {A})∗ for all i


© ª

This yields a new grammar G ′ = V , T , P ′′ , S . Observe that this trans-


­ ®

formation preserves the language of the grammar: L(G ′ ) = L(G). We can


iterate this process up to the point where no rule of the form A → ε, with
A ̸= S remain. Then, clearly, the resulting grammar G has the same lan-
guage as the original one, and contains no rule of the form A → ε, with Original grammar:
A ̸= S. Figure 3.5 illustrates this transformation. (1) S → Aa
(2) → A
However, we are not done yet, because the definition of context-sensitive (3) → Bb
grammar also asks that S does never occur in a right-hand side of rule. (4) A → a
(5) → ε
Again, we propose a transformation that eliminates such rules while (6) B → ε
preserving the language of the grammar. It works as follows: After treating A → ε:
(1) S → Aa
(a) Add a new variable S ′ to the grammar: (2) → a
(3) → A
(4) → ε
V ′ = V ⊎ {S ′ } (5) → Bb
(6) A → a
(7) B → ε
(b) For each rule of the form S → β with β ̸= ε, add to the grammar the
After treating B → ε:
rule S ′ → β:
(1) S → Aa
′ ′ (2) → a
P = P ∪ {S → β | S → β ∈ P with β ̸= ε} (3) → A
(4) → ε
(5) → Bb
(c) If the rule S → ε exists in the grammar, make a copy of all A → β
(6) → b
where β contains S, replacing all S by ε: (7) A → a
Figure 3.5: Removing rule of the form
A → β1 β2 · · · βn ∈ P ′ 
 

  V → ε in two steps. Observe that in the re-
sulting grammar, the variable B does not
P ′′ = P ′ ∪ A → β1 S ′ β2 S ′ · · · S ′ βn with
  produce any terminal, so we could also re-
βi ∈ (T ∪ V \ {S})∗
 
move the rule S → B b, but what matters is
that the language of the resulting grammar
is the same as the original one.
(d) Finally, replace all occurrences of S by S ′ in right-hand sides of
rules:

A → β1 Sβ2 S · · · Sβn ∈ P ′′ 
 

 
P ′′′ = A → β1 S ′ β2 S ′ · · · S ′ βn with
 
βi ∈ (T ∪ V \ {S})∗
 
GRAMMARS 83

Figure 3.6 illustrates the construction. It is easy to check that the result- Original grammar:
ing grammar has the same language as the original one, and that it now (1) S → ε
(2) → aS
respects the syntax of context-sensitive grammars.
(3) → A
This shows that CFL ⊆ L 1 . Now, to prove that the inclusion is strict, we (4) A → aS
After step (b):
need to exhibit a language which is context-sensitive but not context-
(1) S → ε
free. It is the case of the language L(G ABC ) generated by the grammar (2) → aS
given in Example 3.3. Clearly, this language is context-sensitive since (3) → A
it is generated by a context-sensitive grammar. We will not prove here (4) S′ → aS
(5) → A
that this language is not context free. Suffice it to say that this can be (6) A → aS
proved by techniques similar to those we have used to show that L () is After step (c):
not regular (proof of Theorem 3.1). The interested reader should refer (1) S → ε
to the so-called ‘pumping lemmata’ for regular and context-free lan- (2) → aS
(3) → a
guages, which are general techniques allowing one to prove that a lan- (4) → A
guage is not regular and not context-free respectively8 . (5) S′ → aS
(6) → a
6. L 0 = RE. This equality holds by Definition 3.12. (7) → A
(8) A → aS
0 0 (9) → a
7. CSL ⊊ L . The fact that CSL ⊆ L holds by definition: all grammars
Final grammar:
belong to class 0, so all CSL, that can be defined by a context-sensitive
(1) S → ε
grammar (by definition) can be defined by a class 0 grammar. Show- (2) → aS ′
ing that the inclusion is strict requires techniques that are beyond the (3) → a
(4) → A
scope of this course, so we will not prove it here.
(5) S′ → aS ′
(6) → a
(7) → A
(8) A → aS ′
(9) → a
Figure 3.6: Removing all occurrences of
the start symbol S from right-hand sides of
rules, while preserving the language of the
original grammar.

8
John E. Hopcroft, Rajeev Motwani, and
Jeffrey D. Ullman. Introduction to Au-
tomata Theory, Languages, and Computa-
tion (3rd Edition). Addison-Wesley Long-
man Publishing Co., Inc., Boston, MA,
USA, 2006. ISBN 0321455363

It is not easy to exhibit a ‘natu-


ral’ language that strictly separates
CSL and RE. The arguments can
be found in the Computablity and Com-
plexity course: the class CSL corresponds
exactly to the class of languages that can
be decided by a non-deterministic Tur-
ing machine running in linear space. The
space hierarchy theorem tells us that there
are languages (which are decidable, and
thus recognisable or, in other words, re-
cursively enumerable) that require strictly
more than linear space. For instance, the
language of any decider that runs in expo-
nential space is a recursively enumerable
language that is not context-sensitive.
84 I N T R O D U C T I O N T O L A N G UA G E T H E O RY A N D C O M P I L I N G

3.4 Exercises

3.4.1 Context-free languages

Exercise 3.1. Is the language L = {1n | ∃m ∈ N : n = m 2 } regular? Prove your


answer using the techniques we have used at the beginning of the chapter.

3.4.2 Grammars

Exercise 3.2. Informally describe the languages generated by the follow-


ing grammars and specify the classes of the Chomsky hierarchy they be-
long to:
(1) S → abc A
(1) S → 0 (1) S → a (2) → A abc
(2) → 1 (2) → ∗SS (3) A → ε
(3) → 1S (3) → +SS (4) Aa → Sa
(5) cA → cS

Then, give a derivation of the word 1110 according to the first grammar;
a derivation of the word ∗ + a + aa ∗ aa according to the second grammar
G 2 and a derivation of the word abc abc produced by grammar G 3 .

Exercise 3.3. Consider the following grammar:


(1) S → AB
(2) A → Aa
(3) A → bB
(4) B → a
(5) B → Sb

1. To which classes of the Chomsky hierarchy does this grammar belong?

2. Give the derivation trees of the three following sentential forms:

(a) baS b;
(b) bB AB b;
(c) baabaab.

3. Give the leftmost and rightmost derivations for baabaab.

Exercise 3.4. Give a context-free grammar that generates all strings of a


and b (in any order) such that there are strictly more a than b. Test your
grammar on the input baaba by giving a derivation on this word.

Exercise 3.5. Give a context-free grammar that generates the language:

{an bm cℓ | n + m = ℓ}

Exercise 3.6. Give a context-sensitive grammar for the language

{am bn cm dn | m ≥ 1, n ≥ 1}.

Do you think such a language can be generated by a context-free gram-


mar? Explain why.
GRAMMARS 85

Exercise 3.7. Give a context-free grammar that generates all the arith-
metic expressions on the alphabet {(, ), +, ., 0, 1} that evaluate to 2. Problem
taken from Niwińsky and Rytter9 . 9
Damian Niwińsky and Wojciech Rytter.
200 Problems in Formal Languages and
Hint: start by generating all expressions that evaluate to 0, then to 1,
Automata Theory. University of Warsaw,
then to 2. 2017
4 All things context free. . .

C O N T E X T- F R E E L A N G UA G E S A R E T H E S E C O N D I M P O RTA N T C L A S S O F
L A N G UA G E S T H AT W E W I L L C O N S I D E R I N T H I S C O U R S E , after regular
languages. This chapter will be, in some sense, the ‘context-free’ analo-
gous to Chapter 2, where we introduced and studied regular languages.
Let us first summarise quickly what we have learned so far about CFLs
and their relationship to regular languages. First, recall, from the Chomsky
hierarchy (Definition 3.7) that regular languages are all CFLs and that the
containment is strict: for instance the Dyck language1 L () is a CFL which is 1
Recall that this languages contains all
well-parenthesised words on {(, )}, i.e., a
not regular (see Theorem 3.1). Moreover, we already know several formal
parenthesis is closed only if it has been
tools to deal with regular languages and CFLs, as summarised in the table opened before, and, at the end of the
below: word, all opened parenthesis are eventu-
ally closed.
Reg CFL

Regular expressions,
Specification Context-free grammars
regular grammars
Automaton DFA, NFA, ε-NFA ??

As can be seen, one cell is still empty in this table: which automaton
model allows us to characterise the class of CFLs, just as finite automata
correspond to regular languages? We will answer this question in Sec-
tion 4.2, but let us already try and build some intuitions.
As explained in Chapter 2, finite automata (whether they are determin-
istic or not) can be regarded as a model of programs that have access to a
finite amount of memory. This allows them to recognise simple languages
such as (01)∗ , for instance, because the only piece of information the pro-
gram must ‘remember’ (in this example) is the last bit that has been read,
0
in order to check that the next one is indeed different. So, one bit of mem-
1 0
ory is sufficient for (01)∗ , which explains why the automaton accepting it
1
has two states (see Figure 4.1). Figure 4.1: A finite automaton accepting
Now, let us consider a typical CFL, which is the language of all palin- (01)∗ . The labels of the nodes represent
the automaton’s memory: it remembers
dromes on {0, 1} (where the two parts of the palindromes are separated by
the last bit read if any.
#). Formally, we consider the language:

L pal# = w · # · w R | w ∈ {0, 1}∗


© ª

We will prove later that L pal# is indeed a CFL (and is not regular). Let us
admit this fact for now, and let us understand why this language cannot
be recognised by a finite automaton. Continuing the intuition we have
sketched above, a program recognising L pal# must, when reading a word
of the form w#w ′ :
88 I N T R O D U C T I O N T O L A N G UA G E T H E O RY A N D C O M P I L I N G

1. store the whole prefix w up to the occurrence of #;

2. skip the symbol #;

3. read the suffix w ′ , letter by letter, checking that w ′ is indeed w R , the


mirror image of w.

It should be clear that, since the length of the prefix w is not bounded (in
the definition of L pal# ), such a program needs an unbounded amount of
memory, to store this prefix w. Since we are considering automata that
read the input word from the left to the right, in one pass, and cannot
modify the input, one can hardly imagine that a program (or an automa-
ton) could recognise L pal# using only a finite amount of memory.
So, to recognise CFLs, we need to extend finite automata with some
form of unbounded memory. In the case of L pal# , this memory can be
restricted to be a stack. Indeed, our program can be rewritten as: Recall that a stack is a data struc-
ture where elements are stored as
1. read the prefix w up to the occurrence of #, letter by letter, pushing a sequence, where only the last in-
serted symbol can be accessed (it is called
each letter on the stack;
the top of the stack), and that can be modi-
fied only by appending symbols to the end
2. skip the symbol #; of the sequence (a push to the stack); or by
deleting the last symbol if it exists (a pop
3. read the suffix w ′ , letter by letter. Compare each letter from the input from the stack). Therefore, a stack is often
to the top of the stack. If they differ, or if the stack is empty, reject the referred to as a LIFO (an acronym of ‘Last
In First Out’), because the first elements
word, otherwise pop the letter. that will be read from the stack is the last
one that has been written.
4. If the whole suffix has been read and the stack is empty, accept the
word, otherwise reject.

As an example, Figure 4.2 illustrates the execution of this program on


the word 001#100, which is recognised as a palindrome since the stack is
empty after the suffix is read. Each arrow between the stacks is labelled by
a letter which is read by the program.

1 1
0 0 0 1 0 # 0 1 0 0 0

→ −
→ −
→ →
− −
→ −
→ −

0 0 0 0 0 0
Figure 4.2: Recognising a palindrome us-
ing a stack.
So, in Section 4.2, we will extend finite automata by means of a stack, al-
lowing the automaton to perform one operation on the stack at each tran-
sition (in addition to reading an input letter). We will formally study this
new model, called pushdown automata2 (PDA for short), and show that 2
Pushdown is a synonymous of stack.
they recognise exactly the class of CFLs, just as finite automata recognise
regular languages.
In order to prove this last result, we will present a formal connection be-
tween PDAs and CFLs. Again, let us sketch some intuitions, by considering
the grammar:
(1) S → 0S 0
(2) → 1S 1
(3) → #
that generates exactly L pal# . We can check against Definition 3.7 that this
grammar is indeed context-free. It is easy to turn such a grammar into a
recursive program that recognises L pal# , by regarding each variable in the
ALL THINGS CONTEXT FREE. . . 89

right-hand side of the rule as a recursive call. In a python-like syntax, such


a program could be:

1 def S():
2 n = read_next_character()
3
4 if n == ’#’:
5 return True
6
7 if n == ’0’:
8 r = S()
9 if (!r): return False
10 n = read_next_character()
11 if (n == ’0’): return True
12 else: return False
13
14 if n == ’1’:
15 r = S()
16 if (!r): return False
17 n = read_next_character()
18 if (n == ’1’): return True
19 else: return False

where, as expected read_next_character() reads the next character on


the input and returns it. This code matches the semantics of the grammar
because, roughly speaking, a rule such as:

S→0S 0

can be interpreted as:

Read a 0; then check that it is followed by a palindrome, i.e. a word that can
be generated by S; and finally read a matching 0.

Observe that this intuition holds because the grammar is context-free,


i.e., all the rules are of the form A→α , where A is a single variable, that can
be assimilated to a function name. The mapping of grammar rule to recur-
sive functions is harder to figure out (if any) when the rules are allowed to
be context sensitive, such as aSb→c Ad for instance. Observe also that
the recursive functions obtained from CFGs by this construction do not
need unbounded memory: they only need to store a finite, bounded por-
tion of the input word (which is a finite word on a finite alphabet). In the
case of L pal# , each function call needs to store locally the first character
read (either 0 or 1) to compare it to the first character read after the re-
cursive call, so one bit of local memory is sufficient. The unboundedness
of the memory needed to recognise words from L pal# stems from the (un-
bounded) depth of the recursive calls.
Now we can bridge the gap between context free grammars and push-
down automata easily: a context-free grammar is, roughly speaking, a re-
cursive program where each function call needs only bounded memory.
Thus, the behaviour of each function call can be captured by a finite au-
tomaton, but the number of recursive calls cannot be bounded. This can
90 I N T R O D U C T I O N T O L A N G UA G E T H E O RY A N D C O M P I L I N G

be captured by a PDA, as sketched in Figure 4.3. Each time the function There are many different syntax’s
for PDAs. The one we are using
performs a recursive call, the PDA pushes information about its current
in this example aims at illustrating
state (that encodes the value of the local variables) to the stack and moves easily the intuitions of this introduction.
to its initial state. At each return, the PDA pops the top of the stack, to re- Note that the formal syntax we will use in
the rest of the chapter will differ slightly.
cover the value of the local variables and uses it to move to the state that
models the state of the function after the recursive call.

Figure 4.3: An intuition of a pushdown au-


q1 q 1′
tomaton that recognises L pal# . The key-
words Push, Pop and Top have their usual
Top = q1
Push(q1 ) 1 1 meaning. The edge labelled by empty can
Pop be taken only when the stack is empty.
# empty Note that, instead of pushing q 0 or q 1 , one
qi
could simply store 0 and 1’s on the stack.

Top = q0
Push(q0 ) 0 0
Pop
q0 q 0′

Transitions between q i , q 0 and q 1 simulate the recursive calls by push-


ing information on the stack. Once a # is read, the automaton starts pop-
ping the content of the stack to simulate the returns of the recursive calls.
The accepting state is reached only when the stack is empty, i.e., all pend-
ing recursive calls have completed.
This short example shows that, somehow, CFGs can be translated into
PDAs that accept the same language. Using the same ideas, the reverse
translation (from PDAs to CFGs) can be achieved. This is the outline of
the proof we will present in Section 4.2.3. The close connection between
CFGs, PDAs and recursive functions we have just sketched will also be
central in Chapter 5, where we will discuss the automated construction
of parsers from CFGs, in a compiler design perspective. Before that, we
first take a fresh look at grammars, by specialising the theory we have de-
veloped in Chapter 3 to the particular case of CFGs.

4.1 Context-free grammars

Let us start by discussing some formal tools that are useful when dealing
with context-free grammars (hence, also regular grammars). Some of them
(such as the notion of derivation) have already been introduced in Chap-
ter 3, but will be specialised for CFGs.
(1) Exp → Exp + Exp
(2) → Exp ∗ Exp
4.1.1 Derivation (3) → (Exp)
(4) → Id
We have already discussed the notion of derivation in the previous chap- (5) → Cst

ter, see Definition 3.4 sqq. and the intuition given before. Let us consider Figure 4.4: The grammar G Exp to generate
expressions.
again the grammar G Exp , given in Figure 4.4, and let us consider the word
Id + Id ∗ Id which can be generated by this grammar. Indeed, the following
is a possible derivation of G Exp for this word (where we have underlined
the variable which is rewritten at each rule application):

2 1 4 4 4
Exp =
⇒ Exp ∗ Exp =
⇒ Exp + Exp ∗ Exp =
⇒ Exp + Id ∗ Exp =
⇒ Exp + Id ∗ Id =
⇒ Id + Id ∗ Id (4.1)
ALL THINGS CONTEXT FREE. . . 91

We can already observe that, since G Exp is context-free, all the deriva-
tions it generates have a particular shape: they consist, at each step, in
replacing one variable by a word over (Σ ∪V )∗ . This peculiarity of CFGs al-
lows us to define new notions: the leftmost and rightmost derivations, the
derivation trees and the notion of ambiguity.

Leftmost and rightmost derivations Although the sequence (4.1) of deriva-


tions above is sufficient to prove that Id + Id ∗ Id is accepted by G Exp , other
sequences could be exhibited. Indeed, recall that there is some amount of
non-determinism in the definition of the language of a grammar (Defini-
tion 3.6): there should exist at least one derivation producing the word to
be accepted. Other sequences of derivations producing Id + Id ∗ Id are:

2 4 1 4 4
Exp =
⇒ Exp ∗ Exp =
⇒ Exp ∗ Id =
⇒ Exp + Exp ∗ Id =
⇒ Exp + Id ∗ Id =
⇒ Id + Id ∗ Id (4.2)
2 1 4 4 4
Exp =
⇒ Exp ∗ Exp =
⇒ Exp + Exp ∗ Exp =
⇒ Id + Exp ∗ Exp =
⇒ Id + Id ∗ Exp =
⇒ Id + Id ∗ Id (4.3)

Such sequences are called rightmost and leftmost respectively, because we


have obtained them by always deriving the rightmost and leftmost vari-
able in the all sentential forms. On the other hand, (4.1) is neither leftmost
nor rightmost: at the second step, we have derived the leftmost variable;
at the third step, we have derived a variable Exp which was neither left-
most nor rightmost; and at the fourth step we have derived the rightmost
variable. Leftmost and rightmost derivations
are important because they are the
Here is the formal definition capturing these intuitions:
ones which will be generated by the
parsers we will define in Chapter 5 and
Definition 4.1 (Left- and right-most derivation). Let G = 〈V , T , P , S〉 be a Chapter 6. Selecting the left- or right-most
context-free grammar. A derivation w S w ′ ⇒ w α w ′ of G which is ob- derivation allows somehow to get rid of a
part of the grammar’s non-determinism,
tained by applying S → α is leftmost iff: w ∈ T ∗ . It is rightmost iff: w ′ ∈
which is exactly what we need when we
T ∗. M build a program (which must be determin-
istic).
That is, in a leftmost derivation, one can replace variable S by α in
w S w ′ , yielding derivation w S w ′ ⇒ w α w ′ if and only if w contains only
terminals (i.e., no variable, otherwise, the derivation would not be left-
most). Symmetrically for a rightmost derivation, where w ′ must contain
only terminals.

Derivation tree Another way to prove that a given word belongs to the
language of a grammar is to exhibit a derivation tree for this word. The
idea behind the derivation tree is similar to the intuition we have given in
the introduction that a rule like S → α1 B α2 can be interpreted as ‘match
α1 ; then a string that can be generated by B ; then α2 ’, which suggests a
recursive definition of the acceptance of a word. Such a recursive view can
easily be expressed by means of a tree:
S

α1 B α2
This tree can then be completed up to the point where the leaves con-
tain only terminals.
92 I N T R O D U C T I O N T O L A N G UA G E T H E O RY A N D C O M P I L I N G

Exp
Example 4.2. Let us illustrate this idea by a more concrete example. Con-
sider again the grammar G Exp in Figure 4.4, and the word Id + Id ∗ Id. Then,
a derivation tree of this word is given in Figure 4.5. M Exp + Exp

Id
From this example, it is easy to determine the characteristics of a deriva- Exp ∗ Exp

tion tree for a word w. Its root must be labelled by the start symbol S of the Id Id
grammar; the children of each node must correspond to the right-hand Figure 4.5: A derivation tree for the word
Id + Id ∗ Id.
side of a rule whose left-hand side is the label of the node; and the se-
quence of leaves from left to right must be the word w. Here is a more
formal definition:

Definition 4.3 (x-tree). Let G = 〈V , T , P , S〉 be a CFG, and let x ∈ V ∪ T be


either a variable or a terminal of G. Then, an ordered3 , labelled tree T is 3
An ordered tree is a tree in which we have
fixed a total order on the children of all
an x-tree iff:
nodes. So, one can speak of the first, sec-
ond,. . . , last child of a node. A particular
1. either x ∈ T and T is a leaf labelled by x; or case of ordered trees are the classical bi-
nary trees, where the first and second chil-
2. x ∈ V is the label of T ’s root; and there is a rule x → α1 α2 · · · αk in P dren are called left child and right children
s.t. the sub-trees of T are T1 , T2 ,. . . , Tk where Ti is an αi -tree for all respectively.

1 ≤ i ≤ k.

Then, we can define what is a derivation tree for a given word w:

Definition 4.4 (Derivation tree). Given a CFG G = 〈V , T , P , S〉, a tree T is


a derivation tree of w iff T is an S-tree and w is obtained by traversing T ’s
leaves from the left to the right. M

As we will see in the next chapter, derivation trees are a most important
tool in compiler construction. Very often, the output of the parser4 will 4
See Section 1.3.1 for the different phases
of the compiling process and their relative
be a derivation tree (possibly with some extra information as we will see
connections.
later). Indeed, the structure of the derivation tree provides us with more
information on the structure of the input word than a derivation does. For
example, the derivation tree in Figure 4.5 reveals a possible structure for
the expression Id + Id ∗ Id, and suggests that it should be understood as the
sum of Id and Id ∗ Id. In other word, the structure of the tree suggests that
the semantics of the expression corresponds to that of Id + (Id ∗ Id) which
indeed matches the priority of the arithmetic operators. Such information
will clearly be important for the synthesis phase of compiling, where the
executable code corresponding to the expression will be created.

Ambiguities We have seen before that a given word might be derived us-
ing several different derivations (which has prompted us to introduce the Exp

notions of leftmost and rightmost derivations). One natural question is


thus whether there can be different derivation trees for the same word? Exp ∗ Exp
The answer is yes, as can be seen in Figure 4.6.
Exp + Exp Id
Contrary to the tree in Figure 4.5, this tree suggests that the expression
Id + Id ∗ Id should rather be understood as (Id + Id)∗ Id, instead of Id +(Id ∗ Id). Id Id
Figure 4.6: Another derivation tree for the
This is rather unfortunate: were such a tree returned by the parser, the word Id + Id ∗ Id.
code generated by the synthesis phase would not correspond to the actual
priority of the operators. The question is then: ‘how can we decide, based
ALL THINGS CONTEXT FREE. . . 93

only on the grammar G Exp (Figure 4.4) which derivation tree should be
generated for Id + Id ∗ Id?’ Unfortunately, the answer is ‘we can’t!’, because
the grammar does not contain any information about the priority of the
operators. In other words, the grammar is intrinsically ambiguous:

Definition 4.5 (Ambiguous grammar). A CFG is ambiguous iff it generates


at least one word which admits two different derivation trees. M

Example 4.6. As witnessed by Figure 4.5 and Figure 4.6, grammar G Exp
(Figure 4.4) is ambiguous. M

For the reason explained above, ambiguous grammars will be a big is-
sue when generating parsers. We will see, in Section 4.4, techniques to
turn an ambiguous grammar into a non-ambiguous one, that accepts the
same language, by taking into consideration extra information such as the
priority and associativity of the operators.

Relationship between derivations and derivation trees So far, we have seen


two mathematical tools that allow to show that a given word w belongs to
the language of a grammar G: either by exhibiting a derivation of G that
generates w, or by giving a derivation tree of w for G. A natural question
is thus to understand the relationship between derivations and derivation Exp, 1

trees. Roughly speaking, this relationship is a one-to-many one: to each


derivation tree of w correspond potentially several derivations producing
Exp, 4 + Exp, 2
w, while each derivation of w corresponds to one and only one derivation
Id
tree. Exp, 3 ∗ Exp, 5
To make this intuition more formal, consider again the derivation tree Id Id
in Figure 4.5. We call a top-down traversal of a tree any sequence of the Figure 4.7: The derivation tree for Id + Id ∗
Id of Figure 4.5 with a top-down traversal
tree’s internal nodes s.t. each time a node occurs in the sequence, all its
indicated by the position of each node in
ancestors have occurred before. As an example, consider the tree in Fig- the sequence
ure 4.7, which is the same derivation tree as in Figure 4.5, with all internal
nodes labelled by their index in a top-down traversal: first the root, then
the right son of the root, then the left son of this last node, etc.
It is easy to check that this top-down traversal corresponds to the fol-
lowing derivation:

Exp ⇒ Exp + Exp ⇒ Exp + Exp ∗ Exp


⇒ Exp + Id ∗ Exp ⇒ Id + Id ∗ Exp ⇒ Id + Id ∗ Id

This derivation has been obtained by following the sequence of internal


nodes corresponding to the top-down traversal, and applying, each time,
the derivation which has been used in the tree to generate the sons of this
node.
Then, it is clear that all derivations corresponding to a given derivation
tree are those that can be obtained by a top-down traversal of this tree. In
particular, the leftmost-derivation is obtained by the classical infix traver-
sal, and the rightmost-derivation is obtained by the infix traversal where
the left and right sons have been swapped (the right son is visited before
the left one).
94 I N T R O D U C T I O N T O L A N G UA G E T H E O RY A N D C O M P I L I N G

4.1.2 The membership problem

We close this section on CFGs by discussing the membership problem,


when specialised to such grammars. It is defined as follows:

Problem 4.7. Given a CFG G on the alphabet Σ and a word w ∈ Σ∗ , deter-


mine whether w ∈ L(G).

Clearly, this problem is central to the construction of compilers: if, as


explained in the Introduction, we specify the syntax of a programming lan-
guage by means of a CFG, then, checking whether the syntax of a given
program is correct boils down to check whether this program (actually,
the sequence of tokens that compose it) belongs to the language of the
grammar. This complexity sounds like good
news, as it suggests that parsing
We will now show that checking whether a word w is in L(G) can always
can actually be carried out effi-
be done in time O (n 3 ) and memory O (n 2 ) where n = |w|, and when G is a ciently. After all, we have all learned that
CFG. To achieve this result, we first need a very convenient transformation polynomial-time algorithms are efficient,
haven’t we? However, assume that we
of CFGs, which is called the Chomsky normal form. want to check the syntax of a program with
10,000 = 104 tokens (probably a few thou-
sands of lines of code). Then, the syn-
The Chomsky normal form The Chomsky normal form is a simple syn-
tax check would need 1012 steps. Assume
tactic restriction that can be applied to all CFGs, in the sense that all CFGs each of these steps can be carried out in
can be converted into an equivalent CFG (one that accepts the same lan- 10−6 sec, i.e. 1µsec. Then, checking the
syntax of this input would already take
guage) respecting the normal form. As expected, this special form was more than 11 days! Whereas a quadratic
introduced5 by. . . Noam C H O M S K Y (in 1959). Here is the definition: algorithm would run in less than two min-
utes, and a linear time algorithm would
Definition 4.8 (Chomsky normal form for CFG). A CFG G = 〈V , T , P , S〉 take a few milliseconds. As a matter of fact,
we will see how to craft efficient (linear-
is in Chomsky normal form (CNF for short) iff every rule is of one of the time) parsers in Chapter 5 and Chapter 6,
following forms for certain kinds of CFGs.
5
N. Chomsky. On certain formal proper-
ties of grammars. Information and Com-
A → BC
putation (formerly known as Information
A→a and Control), 2(2):137 – 167, 1959. D O I :
10.1016/S0019-9958(59)90362-6
S →ε

where:

• A is any variable (including S): A ∈ V ;

• B and C are any variable different from the start symbol: {B ,C } ⊆ V \{S};
and

• a is any terminal (i.e., different from ε): a ∈ T .

Roughly speaking, all the rules in the grammar either replace one vari-
able A by a sequence of exactly two variables BC (that are different from
the start symbol); or replace a variable A by a single non-empty termi-
nal a. The only exception to this rule is that the start symbol S can gener-
ate ε: this is necessary to ensure that the empty word can be accepted by
a grammar in CNF (and this also explains why we do not allow S to occur
in any right-hand part).
As announced above, we can build, from all CFGs an equivalent one
which is in CNF:
ALL THINGS CONTEXT FREE. . . 95

Theorem 4.1. For all CFGs G, we can build, in polynomial time, a CFG G ′
in CNF s.t. L(G) = L(G ′ ).

We will not prove this theorem. A formal proof can be found in C H O M -


S K Y ’s original paper6 , but we recommend the construction that can be 6
N. Chomsky. On certain formal proper-
ties of grammars. Information and Com-
found in M. S I P S E R ’s book7 on complexity. Let us highlight the main point
putation (formerly known as Information
of the transformation through an example. and Control), 2(2):137 – 167, 1959. D O I :
10.1016/S0019-9958(59)90362-6
Example 4.9. Consider the grammar given in Figure 4.8. It is clearly not in 7
Michael Sipser. Introduction to the The-
CNF. To turn it into an equivalent CFG in CNF, we first take care of the so- ory of Computation. International Thom-
son Publishing, 1st edition, 1996. ISBN
called unit rules, i.e. rules like A → B where the right-hand side contains
053494728X
just one variable. We can remove rule A → B provided we add the rule:
(1) S → a AB
(2) A → B
S → aB B , (3) A → ε
(4) B → aB c
since S → a AB is the only rule where A occurs in the right-hand side. Then, (5) B → d
we get rid of A → ε by deleting A from all right-hand side where it still oc- Figure 4.8: A CFG which is not in CNF.
curs (since ε is now the only way to derive A). Thus, our grammar has now
become:
(1) S → aB
(2) S → aB B
(3) B → aB c
(4) B → d
This is not yet satisfactory! As a matter of fact, only the last rule com-
plies with the definition of CNF. We can transform the other rules by in-
troducing a limited amount of variables. Then, S → aB becomes:

S →V1 B
V1 → a

where V1 is a fresh variable. Similarly, S → aB B becomes:

S →V1 V2
V2 → B B

and B → aB c becomes:

B →V1 V3
(1) S → V1 B
V3 → BV4 (2) S → V1 V2
(3) V2 → BB
V 4 → c.
(4) B → V1 V3
(5) V3 → BV4
The final grammar, which is now in CNF is given in Figure 4.9. (6) V1 → a
M (7) V4 → c
(8) B → d
Figure 4.9: A CFG in CNF that corresponds
Checking language membership of CFGs Equipped with the notion of
to the CFG in Figure 4.8.
CNF, we can now describe a polynomial-time algorithm to check whether
a given word w belongs to the language of a given CFG G = 〈V , T , P , S〉,
that we assume to be in CNF (recall that, if the given CFG is not in CNF, we
can obtain an equivalent CFG which is in CNF, in polynomial time—see
Theorem 4.1).
Let us first sketch some intuitions. We observe that, thanks to the spe-
cial syntax of production rules in CNF grammars, checking membership is
particularly easy for words of length 0 or 1. Indeed:
96 I N T R O D U C T I O N T O L A N G UA G E T H E O RY A N D C O M P I L I N G

• The only word of length 0 is w = ε. Since S → ε is the only rule that can
appear with an ε on the right-hand side, we conclude that ε ∈ L(G) iff
S → ε appears in P .

• Similarly, a word w of length 1 is a single terminal a. Again, w = a is


accepted iff the grammar has a rule S → a. Indeed, if S → a is a rule of G,
it is clear that G accepts a = w. On the other hand, if S → a is not a rule
of G, then there is no way G can accept a = w, since all other rules with
S as the left-hand side are either of the form S → b, with b ̸= a; or of the
form S → AB and will thus generate words of length at most 2.

The second item of the above reasoning can be generalised: we can check
whether any variable A ∈ V can generate a word w = a of length 1 simply
by checking whether the grammar has a rule A → a or not.
Now, let us turn our attention to the following more general problem:
checking whether some given variable A can generate a given word w =
w 1 w 2 · · · w n for n ≥ 2? Clearly, if we can answer that question, then we will
be able to answer the membership problem, simply by letting A = S. We
will base our reasoning on an intuition that we have already given: that
a rule A → α of a CFG can be regarded as a recursive function A which
performs a series of calls as given by α. In the setting of CNF grammars,
all ‘recursive’ rules are of the form A → BC , hence they correspond to two
successive ‘recursive’ calls. In other words, a variable A can generate w iff
we can find a rule A → BC and we can split w into two non-empty sub-
words u and v (i.e., w = uv) s.t.: (i) B generates u; and (ii) C generates v.
Observe that this recursive definition is sound since u and v are both non-
empty, and thus have necessarily a length which is strictly smaller than
n. So eventually, this definition will amount to check whether all charac-
ters of the word w can be generated by some variables, which can be done
easily as we have observed above.
Clearly, this discussion suggests a recursive procedure for checking mem-
bership of a word w to L(G). However, such a procedure could run in ex-
ponential time. In order to avoid this, we will rely on the very general idea
of dynamic programming. In our case, dynamic programming consists in Dynamic programming is a general
algorithmic technique, defined by
storing in a (quadratic) table the result of the procedure when called on all
Wikipedia as ‘a method for solving a
the possible sub-strings of w. By filling this table following a smart order, complex problem by breaking it down into
we will manage to keep the computing time polynomial. Basically, we will a collection of simpler subproblems, solv-
ing each of those subproblems just once,
first fill in the table for the shortest substrings of w (i.e., the individual let- and storing their solutions - ideally, us-
ters), then use this information to deduce whether we can accept longer ing a memory-based data structure’. It
has been introduced by Richard B E L L -
and longer substrings. . . up to the whole word w itself.
M A N in the forties. This technique occurs

More precisely, we will build a table Tab of dimension n × n s.t. each in many classical algorithms such as the
B E L L M A N - F O R D and F L OY D - W A R S H A L L
cell Tab(i , j ) will contain the list of all variables that can generate the sub- algorithms to compute shortest paths in a
word w i · · · w j . Formally, A ∈ Tab(i , j ) iff A ⇒∗ w i · · · w j . When this table graph.

is complete, checking whether w ∈ L(G) amounts to checking whether the


start symbol can generate w 1 · · · w n , i.e. whether S ∈ Tab(1, n).
As explained above, we start by filling the cells that correspond to the
individual letters making up w = w 1 w 2 · · · w n . For all 1 ≤ i ≤ n, we put
variable A in Tab(i , i ) iff the rule A → w i occurs in the grammar.
Then, we fill the cells corresponding to subwords of length ℓ for increas-
ing values of ℓ = 2, 3, . . .. Assuming all the cells for subwords of length < ℓ
ALL THINGS CONTEXT FREE. . . 97

Input: A CFG G = 〈V , T , P , S〉 (in CNF), a word


w = w1 w2 · · · wn ∈ T ∗.
Output: True iff w ∈ L(G).

if w = ε then
if S → ε ∈ P then return True ;
else return False ;

foreach 1 ≤ i ≤ n do
Tab(i , i ) ← {A | A → w i ∈ P } ;

foreach 1 ≤ ℓ ≤ n do
foreach 1 ≤ i ≤ n − ℓ + 1 do
j ← i +ℓ−1 ;
foreach i ≤ k ≤ j − 1 do
foreach rule A → BC ∈ P do
if B ∈ Tab(i , k) and C ∈ Tab(k + 1, j ) then
Add A to Tab(i , j ) ;

if S ∈ Tab(1, n) then return True ;


else return False ;
Algorithm 3: An O (n 3 ) algorithm to check whether w ∈ L(G) for a CFG
grammar in CNF.

have been filled, we fill the cells corresponding to some subword w i · · · w j


of length ℓ as follows. We put variable A in Tab(i , j ) iff there is a rule
A → BC in the grammar, and a split position i ≤ k < j s.t. B ∈ Tab(i , k)
and C ∈ Tab(k + 1, j ) (i.e., iff B can generate the suffix w i . . . w k and C can
generate the suffix w k+1 · · · w j , which we can test by querying the corre-
sponding cells of the table). The algorithm that implements this proce-
dure is given in Algorithm 3, following the presentation of Sipser8 . The 8
Michael Sipser. Introduction to the The-
ory of Computation. International Thom-
paternity of this algorithm has been attributed to several researchers by
son Publishing, 1st edition, 1996. ISBN
whom is has been independently re-discovered. . . It is often referred to as 053494728X
the Cocke-Younger-Kasami algorithm (CYK algorithm for short), although
it seems to have been introduced first9 by S A K A I in 1961. 9
Itiiro Sakai. Syntax in universal transla-
tion. In International Conference on Ma-
Example 4.10. Let us consider the grammar in Figure 4.10, which is in chine Translation of Languages and Ap-
plied Language Analysis, pages 593–608.
CNF. One can check that this grammar accepts the language a+ b. Observe
London: Her Majesty’s Stationery Office,
in particular that the word w = aaab can be accepted by two different 1961
derivations: (1) S → XB
(2) → X ′B
S ⇒ X B ⇒ X AB ⇒ X A AB ⇒ a A AB ⇒ aa AB ⇒ aaaB ⇒ aaab, (3) X → XA
(4) → a
and (5) X′ → AY
(6) Y → AA
(7) A → a
S ⇒ X ′ B ⇒ AY B ⇒ A A AB ⇒ a A AB ⇒ aa AB ⇒ aaaB ⇒ aaab.
(8) B → b
Figure 4.10: An example CNF grammar
So, in particular, the subword w 1 w 2 w 3 = aaa can be generated either by
generating a+ b.

X or by X .
Let us now apply the above algorithm on word w = aaab, and fill a 4 × 4
table Tab (since our word w is of length 4). We will actually only fill the
cells Tab[i , j ] s.t. i ≤ j because there is no subword starting in i and ending
98 I N T R O D U C T I O N T O L A N G UA G E T H E O RY A N D C O M P I L I N G

in j with i > j . Observe that we are interested in checking whether S ∈


Tab[1, 4], which is the top right cell of the table. We proceed by increasing
length of subwords:

• For the subwords of length 1, we consider the subwords w 1 = a, w 2 = a,


w 3 = a and w 4 = b. Clearly, a can be generated by X and A only; and b
can be generated by B only. Indeed, all other variables have derivations
of the form F →G H (where F , G, H are variables), hence, they will all
generate words of length at least 2, since no rule of the form F →ε is
allowed for a variable F ̸= S (by the Chomsky normal form). So, we have
the table:
1 2 3 4
A,X 1
A,X 2
A,X 3
B 4

• For the subwords of length 2, we consider the subwords w 1 w 2 = aa,


w 2 w 3 = aa and w 3 w 4 = ab. The only way to split these subwords is ‘in
the middle’, i.e. w 1 w 2 is split into w 1 and w 2 for example. So, we will
fill in Tab[1, 2] using the information of Tab[1, 1] = A, X and Tab[2, 2] =
A, X . To exploit this information, we need to find a variable V1 in Tab[1, 1],
and a variable V2 in Tab[2, 2] s.t. the grammar has a rule of the form Recall the intuition about the ta-
ble: V1 can generate w 1 and V2 can
V →V1 V2 . In this case, we can add V to Tab[1, 2]. There are two such
generate w 2 .
choices for V1 and V2 . Either we let V1 = X , V2 = A and consider the
rule X →X A ; or we let V1 = A, V2 = A and consider the rule Y →A A .
Observe that indeed, X and Y can both generate aa. Continuing so for
the other cells corresponding to subwords of length 2, we have:

1 2 3 4
A, X X ,Y 1
A, X X ,Y 2
A, X S 3
B 4

• For the words of length 3, we consider w 1 w 2 w 3 = aaa and w 2 w 3 w 4 =


aab. We can recognise w 1 w 2 w 3 by splitting it into w 1 and w 2 w 3 ; or
into w 1 w 2 and w 3 . Let us first consider the case where we split into
w 1 = a and w 2 w 3 = aa. By Tab[1, 1], we know that w 1 can be generated
either by A or by X . By Tab[2, 3], w 2 w 3 can be generated either by X or
by Y . The only rule that allows us to fill Tab[1, 3] in this case is X ′ →AY
, so w 1 w 2 w 3 = aaa can be generated by X ′ . In the latter case (where we
split into w 1 w 2 and w 3 ), we use rule X →X A and discover that X can
generate w 1 w 2 w 3 = aaa as well. We proceed similarly for w 2 w 3 w 4 =
aab, and obtain:
1 2 3 4

A, X X ,Y X,X 1
A, X X ,Y S 2
A, X S 3
B 4
ALL THINGS CONTEXT FREE. . . 99

• Finally, for the single subword of length 4, which is w itself, we need


to consider three possible splits: either into w 1 and w 2 w 3 w 4 ; or into
w 1 w 2 and w 3 w 4 ; or into w 1 w 2 w 3 and w 4 . Only the last split will yield
a new piece of information into Tab[1, 4], since w 1 w 2 w 3 can be gener-
ated either by X or by X ′ and w 4 can be generated by B . Then, both
rules S→X B and S→X ′ B allow us to conclude that S can generate w:

1 2 3 4

A, X X ,Y X,X S 1
A, X X ,Y S 2
A, X S 3
B 4

4.2 Pushdown automata

Let us now define formally the notion of pushdown automaton (PDA for
short) that we have described informally in the introduction of this chap-
ter. Remember that, in essence, a PDA is a finite state automaton aug-
mented with a stack that serves as a memory: at each transition, the au-
tomaton can test the value on the top of the stack, and modify (push, pop)
this top of stack.

4.2.1 Syntax and semantics

Syntax The formal definition of the syntax of PDA clearly shows that they
are an extension of finite automata:

Definition 4.11 (Pushdown automaton). A Pushdown automaton (PDA


for short) is a tuple Q, Σ, Γ, δ, q 0 , Z0 , F s.t.:
­ ®

1. Q is a finite set of states;

2. Σ is a finite input alphabet;

3. Γ is a finite stack alphabet;



4. δ : Q × (Σ ∪ {ε}) × Γ → 2(Q×Γ ) is the transition function;

5. q 0 is the initial state;

6. Z0 ∈ Γ is the initial symbol on the stack;

7. F ⊆ Q is the set of accepting states.

Clearly, the elements Q, Σ, δ and F were already present in the defini-


tion of finite automaton. Γ is the stack alphabet: it contains all the sym-
bols that can be pushed on the stack (in practice, we can thus store on
the stack symbols that are not taken from the input, i.e., not in Σ). Z0 is a
symbol that we assume will always be present on the stack initially. This
will be important to have a clean definition of operations that test whether
the stack is empty or not. Finally, observe that δ has a different signature:
100 I N T R O D U C T I O N T O L A N G UA G E T H E O RY A N D C O M P I L I N G

it takes, as input, the current state and the next symbol on the input (as
in a finite automaton), but also a stack symbol, which is meant to be the
symbol on the top of the stack. It outputs a set of pairs of the form (q, w),
where q is a destination state (as in a non-deterministic finite automaton)
and w is a word from Γ∗ that will replace the symbol on the top of the stack
after the transition has been fired. So, intuitively:

δ(q, a, b) = {(q 1 , γ1 ), . . . , (q n , γn )}

means:

When in state q, reading a from the input, and having b on the top of the
stack, chose non-deterministically a pair (q i , γi ), move to q i , and replace b
on the top of the stack by γi (where the leftmost letter of w i goes on the top).

Before making this definition of the transition function more formal, let
us give an example of PDA following this syntax. Such an example can be This syntax might look hard to read
and is indeed way less intuitive
found in Figure 4.11. In order to depict the translation relation, we draw
than the classical ‘pop’ and ‘push’.
an arrow between q and q ′ , labelled by a, b/w iff (q ′ , w) ∈ δ(q, a, b), i.e. we It allows, however, a very clean definition
can go from q to q ′ while reading an a on the input, seeing a b on the top of the semantics of PDAs, as we will see
later.
of the stack, and replacing this b by w. In other words, in this example, we
have:

1. Q = {q 0 , q 1 , q 2 };

2. Σ = {0, 1, #};

3. Γ = {0, 1, Z0 };

4. F = {q 2 };

5. and the transition function is as follows:


i /t 0 1 Z0
0 {(q 0 , 00)} {(q 0 , 01)} {(q 0 , 0 Z0 )}
δ(q 0 , i , t ) = 1 {(q 0 , 10)} {(q 0 , 11)} {(q 0 , 1 Z0 )}
# {(q 1 , 0)} {(q 1 , 1)} {(q 1 , Z0 )}
ε ; ; ;

i /t 0 1 Z0
0 {(q 1 , ε)} ; ;
δ(q 1 , i , t ) = 1 ; {(q 1 , ε)} ;
# ; ; ;
ε ; ; {(q 2 , Z0 )}
and δ(q 2 , t , i ) = ; for all t ∈ Γ, i ∈ Σ.

With the intuitive definition of the semantics we have sketched, one can
understand that the self-loop on state q 0 consists in pushing all 0 and 1
read from the input. Indeed, whenever a 0 is read, the automaton system-
atically tests for all possible characters x (that can be either 0 or 1 or Z0 )
on the top of the stack, and replaces it by 0x, which amounts to pushing
a 0 (and symmetrically when a 1 is read). The PDA moves from state q 0 to
q 1 only when a # is read on the input, and does not modify the stack in this
case. Then, the self-loop on q 1 consists, when reading a 0, in checking that
the top of the stack is indeed a 0 too, and replacing it by ε, i.e., popping the
ALL THINGS CONTEXT FREE. . . 101

Figure 4.11: An example PDA recognising


0, Z0 /0 Z0
L pal# (by accepting state).
1, Z0 /1 Z0
0, 0/00
1, 0/10
0, 1/01 0, 0/ε
1, 1/11 #, Z0 /Z0 1, 1/ε
#, 0/0
#, 1/1 ε, Z0 /Z0
q0 q1 q2

0 (and, again, symmetrically when a 1 is read). So the self-loop on q1 pops


all the stack content while checking that the characters that are popped
correspond to the symbol read on the input. Because of the LIFO proper-
ties of the stack, this amounts to checking that the suffix of the input word
that occurs after the # is the mirror image of the prefix that occurs before
the #. Finally, the PDA moves to the accepting state q 2 only when the stack We have silently assumed that, to
accept a word, a PDA must reach
is empty, which guarantees that the mirror image of the whole prefix had
an accepting state, as in the case of
been found in the suffix. Hence, the automaton indeed recognises L pal# . finite automata. While this is sufficient for
Let us make these notions more precise by formally defining the semantics the intuitive examples we are discussing
for the moment, bear in mind that we
of PDAs. will later refine this notion by defining two
kinds of acceptance conditions for PDAs.
Configuration of a PDA As for finite automata, a configuration of a PDA
describes the situation in which the different components of the automa-
ton are while it is busy reading a word. In the case of a PDA, this configu-
ration contains the following information:

1. the current state (an element from Q);

2. the remaining input word w ∈ Σ∗ ; and

3. the current content of the stack (an element from Γ∗ ).

This is exactly captured by the following definition:

Definition 4.12 (Configuration of a PDA). A configuration of a PDA


Q, Σ, Γ, δ, q 0 , Z0 , F is a triple
­ ®

q, w, γ ∈ Q × Σ∗ × Γ∗ .
­ ®

M
­ ®
In particular, the initial configuration when reading w is q 0 , w, Z0 ,
i.e., initially, the current state is q 0 , w is on the input and Z0 is on the stack.

Configuration change Let us now formally define how a PDA can move
from one configuration to another. As for finite automata, we use the
c ⊢P c ′ notation to denote the fact that the PDA P can move from config-
uration c to configuration c ′ . Thanks to the syntax introduced in Defini-
tion 4.11, the definition of ⊢ is very simple:

Definition 4.13 (Configuration change of a PDA). Let us consider a PDA


P = Q, Σ, Γ, δ, q 0 , Z0 , F . Then, we say that P can move from configuration
­ ®
102 I N T R O D U C T I O N T O L A N G UA G E T H E O RY A N D C O M P I L I N G

q, aw, X β to configuration q ′ , w, αβ (with a ∈ Σ ∪ {ε}, and X ∈ Γ) iff


­ ® ­ ®

there is (q ′ , α) ∈ δ(q, a, X ). In this case, we write:

q, aw, X β ⊢P q ′ , w, αβ .
­ ® ­ ®

Let us elucidate this definition to ensure that it captures the intuition


we have sketched so far. The original configuration is q, aw, X β , where
­ ®

a is a single input letter, or ε; and X is a stack symbol. Let us assume for the
moment that a ̸= ε. Thus, in the original configuration q, aw, X β , the
­ ®

remaining input is non-empty and begins by letter a. Moreover, symbol


X is on the top of the stack, and the PDA is in state q. Thus, we should
consider δ(q, a, X ), that contains all the possible moves10 that the PDA can 10
Thus, as said above, PDAs can be non-
deterministic.
do in this configuration. All these possible moves are pairs (q , α), where ′

q ′ is the destination state and α is the sequence of symbols that should


replace X on the top of the stack after the transition. Thus, the resulting
configuration is obtained from q, aw, X β by: (i) changing the current
­ ®

state from q to q ′ ; (ii) reading the a at the beginning of the input, hence
only w remains; and (iii) popping the X from the stack and pushing α
instead. Thus, the resulting configuration is indeed q ′ , w, αβ . We can
­ ®

now check that the intuition still works when a = ε. Indeed, in this case,
aw = ε · w = w, and the input word is thus not modified by the transition.
Note that, when the PDA is clear from the context, we might omit the
subscript on ⊢. We also denote by ⊢*P (or, simply, ⊢* ) the reflexive and
transitive closure♣ of ⊢.

Example 4.14. Consider for instance the PDA in Figure 4.11, and the in-
put word 01#10 ∈ L pal# . The initial configuration on this input word is
(q 0 , 01#10, Z0 ). One can check against the definition of ⊢ that:

(q 0 , 01#10, Z0 ) ⊢(q 0 , 1#10, 0 Z0 )

because δ(q 0 , 0, Z0 ) = {(q 0 , 0 Z0 )}


One can further check that the following sequence of configurations
can be visited by the PDA until the input is empty:

(q 0 , 01#10, Z0 ) ⊢(q 0 , 1#10, 0 Z0 ) ⊢(q 0 , #10, 10 Z0 ) ⊢(q 1 , 10, 10 Z0 ) ⊢(q 1 , 0, 0 Z0 ) ⊢(q 1 , ε, Z0 ) ⊢(q 2 , ε, Z0 ).

Hence, we can write that:

(q 0 , 01#10, Z0 ) ⊢* (q 2 , ε, Z0 )

Accepted language Equipped with this notion, we can now define which
words are accepted by a PDA. As said above, we will actually define two
notions of accepted languages. Indeed, one very natural notion of accep-
tance for PDAs is obtained by adapting the definition we have adopted for
finite automata: a word w is accepted iff there is at least one run reading
this word and reaching a final state. However, as shown by the example in
Figure 4.11, another natural notion of acceptance for PDAs is to accept a
word when the stack is empty. Intuitively, in many cases, the stack is used
ALL THINGS CONTEXT FREE. . . 103

as a memory to store some sort of input that still must be treated, so, it is
reasonable to accept a word as soon as this pending input is empty. These
two notions are captured by the following definition:

Definition 4.15 (Accepted languages of a PDA). Let us consider a PDA P =


Q, Σ, Γ, δ, q 0 , Z0 , F . Then:
­ ®

1. Its final state accepted language, denoted L(P ) is:

L(P ) = {w | there are q ∈ F and γ ∈ Γ∗ s.t. q 0 , w, Z0 ⊢*P q, ε, γ }


­ ® ­ ®

2. Its empty stack accepted language, denoted N (P ) is:

N (P ) = {w | there is q ∈ Q s.t. q 0 , w, Z0 ⊢*P q, ε, ε }


­ ® ­ ®

In other words:

1. A word w is in L(P ) (i.e., it is accepted by final state) iff, from the initial
­ ®
configuration q 0 , w, Z0 where w is in the input, one can find an exe-
cution reaching a configuration q, ε, γ where w has been read entirely
­ ®

and the current state q is accepting (q ∈ F ). Observe that the stack does
not need to be empty for w to be accepted (γ is any word in Γ∗ ).

2. On the other hand, a word w is in N (P ) (i.e., it is accepted by empty


­ ®
stack) iff, from the initial configuration q 0 , w, Z0 , one can find an ex-
ecution reaching a configuration q, ε, ε where w has been read en-
­ ®

tirely and the stack is empty (observe that, in this case, the current state
q does not need to be final: q ∈ Q).

Example 4.16. Considering again the sequence of transitions from Exam-


ple 4.14:

(q 0 , 01#10, Z0 ) ⊢* (q 2 , ε, Z0 )

we deduce that 01#10 ∈ L(P ), where P is the PDA in Figure 4.11, because,
q 2 is an accepting state, and the input is empty in the last configuration.
Observe that the stack is not empty, but this is not necessary for a word to
be in L(P ).
Now consider a PDA P ′ obtained from P by deleting state q 2 , and adding,
on q 1 a self-loop transition labelled by ε, Z0 /ε, i.e. a transition that emp-
ties the stack once the Z0 symbol occurs on the top; and where there are no
more accepting states. This PDA is shown in Figure 4.12. Then, we have:

(q 0 , 01#10, Z0 ) ⊢*P’ (q 1 , ε, ε)

i.e., there is an execution of the PDA that reaches (q 1 , ε, ε) where the stack
is empty (but where q 1 is not accepting). This entails that: 01#10 ∈ N (P ′ ).
Observe, however that 01#10 ̸∈ N (P ), because P never empties its stack;
neither that 01#10 ̸∈ L(P ′ ) because P ′ has no accepting state. That is, we
can show that L(P ) = N (P ′ ) = L pal# , and that L(P ′ ) = N (P ) = ;. M
104 I N T R O D U C T I O N T O L A N G UA G E T H E O RY A N D C O M P I L I N G

Figure 4.12: An example PDA recognising


0, Z0 /0 Z0
L pal# (by empty stack).
1, Z0 /1 Z0
0, 0/00
1, 0/10 0, 0/ε
0, 1/01 1, 1/ε
1, 1/11 #, Z0 /Z0 ε, Z0 /ε
#, 0/0
#, 1/1
q0 q1

4.2.2 Deterministic PDAs

When considering finite state automata, we have shown that ε-NFAs and
DFAs are equivalent in the sense that any ε-NFA can always be turned into
a DFA that accepts the same language. Is it the case for PDAs? Unfor-
tunately, the answer is ‘no’: there are some non-deterministic PDAs that
cannot be turned into an equivalent deterministic one. Let us give an ex-
ample.
Consider the language L pal , which is obtained from L pal# by deleting the
# in all words. Formally:

L pal = w w R | w ∈ {0, 1}∗


© ª

Thus, for instance, 010010 ∈ L pal , but 0100 ̸∈ L pal .


Now, let us consider again the PDA in Figure 4.11, which recognises
L pal# . Intuitively, the presence of the # symbol in the middle of all words
in L pal# ‘makes the life of the PDA easier’ because it marks the end of the
prefix w, and ‘tells’ the PDA when to move to state q 1 where it should start
popping from the stack. Unfortunately, this feature is not present anymore
in the words of L pal , so a PDA recognising this language (whether by ac-
cepting state of by empty stack), cannot determine when the prefix of the
palindrome has been read. Instead, it must ‘guess’ it, i.e., by using non-
determinism. This yields the PDA in Figure 4.13 (a similar PDA accepting
L pal by empty stack can be obtained from the PDA in Figure 4.12).

Figure 4.13: An non-deterministic PDA


0, Z0 /0 Z0
recognising L pal (by accepting state).
1, Z0 /1 Z0
0, 0/00
1, 0/10
0, 1/01 0, 0/ε
1, 1/11 ε, Z0 /Z0 1, 1/ε
ε, 0/0
ε, 1/1 ε, Z0 /Z0
q0 q1 q2

Clearly, the PDA in Figure 4.13 is non-deterministic, because the tran-


sition from q 0 to q 1 does not consume any character on the input, so
it can be fired any time. As an example, from the initial configuration
ALL THINGS CONTEXT FREE. . . 105

(q 0 , 0110, Z0 ), the two following successors are possible:

(q 0 , 0110, Z0 ) ⊢(q 0 , 110, 0 Z0 )


and
(q 0 , 0110, Z0 ) ⊢(q 1 , 0110, Z0 )

Moreover, in the latter case, the only possible next configuration change
is:

(q 1 , 0110, Z0 ) ⊢(q 2 , 0110, Z0 ),

i.e., state q 2 is reached without modification of the stack, and the input is
not empty. From this configuration (q 2 , 0110, Z0 ), no other configuration
is reachable, hence, this run of the automaton will not accept the word
0110. Nevertheless, 0110 is accepted by the PDA in Figure 4.13, with the
next sequence of transitions (that corresponds, as expected, to pushing
the prefix 01 and popping the suffix 10):

(q 0 , 0110, Z0 ) ⊢(q 0 , 110, 0 Z0 ) ⊢(q 0 , 10, 10 Z0 ) ⊢(q 1 , 10, 10 Z0 ) ⊢(q 1 , 0, 0 Z0 ) ⊢(q 1 , ε, Z0 ) ⊢(q 2 , ε, Z0 ).

Now that we have understood why non-determinism is crucial to let


the PDA in Figure 4.13 accept L pal , let us try and understand why there is
no deterministic PDA that accepts it. First, let us define this last notion:

Definition 4.17 (Deterministic Pushdown automaton). A Deterministic Push-


down automaton (DPDA) for short is a PDA Q, Σ, Γ, δ, q 0 , Z0 , F s.t.:
­ ®

1. for all q ∈ Q, a ∈ Σ ∪ {ε} and γ ∈ Γ: δ(q, a, γ) has at most one element;


and

2. for all q ∈ Q and γ ∈ Γ: if δ(q, ε, γ) ̸= ;, then δ(q, a, γ) = ; for all a ∈ Σ.

The intuition behind this definition is as follows. The first condition


says that, given a state q, an input letter a and a symbol γ on the top of the
stack, δ(q, a, γ) returns at most one move, i.e., the DPDA has no ‘choice’.
The second condition says that if there is a transition labelled by ε from
some state q and with some symbol γ on the top of the stack, then, this
is the only transition active in this case. Indeed, if there were q and γ s.t.
δ(q, ε, γ) ̸= ; and δ(q, a, γ) ̸= ; too (for some letter a ∈ Σ), then, in a con-
figuration where the PDA is in state q, γ is on the top of the stack, and a
The fact that there is no DPDA
is the first character of the input, the PDA would have the choice between
for some languages accepted by a
taking the ‘a-labelled transition’ and the ‘ε-labelled transition’. (non-deterministic) PDA does not
mean that no non-deterministic PDA can
Example 4.18. The PDA accepting L pal# in Figure 4.11 is a DPDA. The one be determinised. Recall that, in particu-
lar, ε-NFAs are PDAs, that all ε-NFAs can
in Figure 4.13 is not, because δ(q 0 , 0, 0) ̸= ;, and δ(q 0 , ε, 0) ̸= ;, which vio-
be turned into an equivalent DFA, and that
lates the second condition of Definition 4.17. M DFAs are DPDAs. In general, a PDA that
does not use its stack, or stores only a fi-
Now, we can argue that there is no DPDA for the language L pal , which nite amount of data on its stack, whatever
the word it accepts, is equivalent to a finite
shows that, unlike finite automata, PDAs cannot be determinised in gen-
automaton and can thus be determinised.
eral:
106 I N T R O D U C T I O N T O L A N G UA G E T H E O RY A N D C O M P I L I N G

Theorem 4.2. There is no DPDA that accepts L pal .

Proof. (Sketch) A full proof is beyond the scope of these lecture notes, so
we only give the intuition. Consider two words w 1 and w 2 in L pal that
share a common prefix. For instance w 1 = 0110 and w 2 = 011110. If
there is a DPDA that accepts L pal , then it reaches the same configuration
after reading the common prefix 01, and performs the same configura-
tion change when reading the next 1. However, the behaviour of the PDA
should differ when reading w 1 and w 2 , since in the case of w 1 the next 1
belongs to the suffix of the word, and the PDA should check that this suffix
is the mirror image of the prefix; while in the case of w 2 the 1 still belongs
to the prefix.
Every DPDA is a PDA just as every
As a consequence, the class of languages that can be accepted by a DFA is also an ε-NFA, even if the
names of those classes provide the
DPDA is stricly included in the class of languages that can be accepted by opposite intuition. . .
a PDA. Indeed, every DPDA is a PDA, which proves the inclusion; and we
have just identified a language that separates these two classes. We believe
it is important to stress out this result since it departs from what we have
observed with finite automata, where DFA accept the same languages as
NFA and ε-NFA, i.e. the regular languages. We will come back to this at the
end of the section.

4.2.3 Equivalence with context-free grammars

Now that we have characterised pushdown automata, let us study the class
of languages they define, exactly as we have done when we have proved
that finite automata accept the class of regular languages (Kleene’s theo-
rem). As sketched above, PDAs accept the class of context free languages
(CFL), that we have defined so far as the class of languages that are defined
by context-free grammars (CFGs):

Theorem 4.3. For all PDAs P , L(P ) and N (P ) are both context-free lan-
guages. Conversely, for all CFLs L, there are PDAs P and P ′ s.t. L(P ) =
N (P ′ ) = L.

We will prove this theorem in several steps:

1. First, we will show that all CFLs can be accepted by a PDA that accepts
on empty stack, by giving a translation from CFGs to PDAs. Hence, we
show that for all CFGs G, there is a PDA P s.t. N (P ) = L(G);

2. Second, we will show the reverse direction: for all PDAs P , we can build
a CFG G P s.t. L(G P ) = N (P ), i.e. N (P ) is a CFL. Together with point 1,
this shows that the class of languages accepted by empty stack by a PDA
is the class of CFLs. It remains to show that this holds also for PDAs that
accept with final states;

3. Third, we will show that we can convert any PDA accepting by empty
stack into a PDA accepting the same language by accepting state, i.e.
for all PDAs P , there is a PDA P ′ s.t. N (P ) = L(P ′ );

4. Finally, we will show that for all PDAs P , there is a PDA P ′ s.t. L(P ) =
N (P ′ ).
ALL THINGS CONTEXT FREE. . . 107

From CFGs to PDAs accepting by empty stack For this first point, we will
only show an example that should be sufficient to convince the reader of
the validity of the following Lemma:

Lemma 4.4. For all CFG G, there is a PDA P s.t. N (P ) = L(G)

The reason why we restrict ourselves to an example is because, in Chap-


ter 5 and Chapter 6, we will study extensively several techniques to trans-
late CFGs into PDAs. Indeed, as explained in the introduction, this is the
key element to build a parser, which is the second stage of a compiler, the
one that checks that the syntax of the input file is correct. This syntax is
specified by means of a CFG, and the resulting PDA (the parser) is a device
that checks the conformance of the input string to this CFG.

Let us consider again the grammar for arithmetic expressions given in


Figure 4.4, and the word Id + Id ∗ Id which is accepted by the grammar ac-
cording to the following leftmost derivation:

2 1 4 4 4
Exp =
⇒ Exp ∗ Exp =
⇒ Exp + Exp ∗ Exp =
⇒ Id + Exp ∗ Exp = ⇒ Id + Id ∗ Id.
⇒ Id + Id ∗ Exp =

One way for a device to check that Id + Id ∗ Id is indeed accepted by the


grammar is to build this derivation. To do so, the device needs to store, at
all times, the current sentential form, and must be able to perform rewrit-
ing of variables in this sentential form. If this device is a PDA (as we would
like to achieve), it is natural to store the current sentential form on the
stack, with the leftmost symbol on the top. For instance, the initial con-
tent of the stack would be the start symbol only, i.e., we let Z0 = S, where
S is the start symbol (in our example, Z0 = Exp). Then, we should update
the stack in order to reflect the derivations of the grammar. Graphically,
the expected content of the stack for the three first sentential forms of the
derivation should be:

Exp
+
Exp Exp
∗ ∗
Exp Exp Exp

So, now, the question is: how does the PDA update the stack? There are
two possibilities to consider:

1. Either the symbol on the top of the stack is a variable V of the gram-
mar, as in the three pictures above. In this case, the PDA must ‘sim-
ulate’ the derivation by finding a rule V → α in the grammar, popping
this variable V and pushing, instead the right-hand side α. This is what
happened in our example above, where the two steps correspond to
applying the rules Exp → Exp ∗ Exp and Exp → Exp + Exp, respectively. Non-determinism is crucial here:
if there are two rules of the form
Observe that these actions can be implemented in a non-deterministic
V → α1 and V → α2 , the PDA must
PDA with a single state q: for each rule V → α of the grammar, we have ‘guess’ which one to apply when seeing V
an element (q, α) in δ(q, ε,V ), i.e. we add a transition that does not on the top of the stack. The whole point
of Chapter 5 and Chapter 6 will be to build
change the state, does not read from the input, but checks that V is on a PDA that can make the ‘right choices’
the top of the stack and replaces it by α. deterministically in order to obtain a pro-
gram that can be implemented.
108 I N T R O D U C T I O N T O L A N G UA G E T H E O RY A N D C O M P I L I N G

2. Or there is, on the top of the stack, a terminal. In this case, we cannot
apply any grammar rule, and we cannot access the other variables that
could be deeper in the stack. This is what occurs if we further apply the
rule Exp → Id from the third stack above to obtain:

Id
+
Exp

Exp

where the Id on the top of the stack ‘hides’ the Exp variables. However,
in this case, we are sure that the word which will be generated by the
derivation we are currently simulating will start by Id+, i.e. the two ter-
minals which are on the top of the stack. So, we can check that these
two terminals are indeed the two first letters on the input. If it is not the
case, then, clearly, the derivation we are currently simulating will not al-
low to recognise the input word. So, the PDA will not be able to execute
any further step, and will not reach an accepting state. On the other
hand, if Id and + are the two next characters on the input, then, they
can safely be popped and read from the input. This can be achieved by
PDA transitions (still assuming a PDA with a single state q): for all a ∈ T ,
we have in δ(q, a, a) an element (q, ε), i.e. we add a transition that does
not change the content of the stack, but reads a character a from the
ε, Exp/Exp + Exp
input, provided that it is present on the top of the stack, and pops it. ε, Exp/Exp ∗ Exp
Eventually, if the input word is accepted, this will empty the stack, so ε, Exp/(Exp)
ε, Exp/Id
we obtain a PDA that accepts the language of the grammar by empty
ε, Exp/Cst
stack.
Id, Id/ε ∗, ∗/ε
q Cst, Cst/ε +, +/ε
To finish with our example, let us depict the PDA obtained from the
(, (/ε ), )/ε
grammar in Figure 4.4, using the technique described above. It is shown
Figure 4.14: A PDA accepting (by empty
in Figure 4.14. stack) arithmetic expressions with + and ∗
Moreover, one possible execution of this PDA on the input string Id + operators only.
Id ∗ Id is:

(q, Id + Id ∗ Id, Exp) ⊢(q, Id + Id ∗ Id, Exp ∗ Exp) ⊢(q, Id + Id ∗ Id, Exp + Exp ∗ Exp) ⊢
(q, Id + Id ∗ Id, Id + Exp ∗ Exp) ⊢(q, +Id ∗ Id, +Exp ∗ Exp) ⊢(q, Id ∗ Id, Exp ∗ Exp) ⊢
(q, Id ∗ Id, Id ∗ Exp) ⊢(q, ∗Id, ∗Exp) ⊢(q, Id, Exp) ⊢(q, Id, Id) ⊢(q, ε, ε)

which shows that Id + Id ∗ Id is indeed accepted, as (q, ε, ε) is an accepting


configuration (the stack is empty).

From PDA with empty stack to CFG In this case, again, we will restrict
ourselves to presenting an example of translation from a PDA that accepts
with empty stack to a CFG that accepts the same language. The construc-
tion is quite technical, but, fortunately, the intuitions behind the construc-
tion are quite simple. For our example, we will consider again the PDA in
Figure 4.12 recognising L pal# by empty stack. Recall that, in a PDA, the ex-
ecution starts with the Z0 symbol on the top of stack (so, initially, the stack
ALL THINGS CONTEXT FREE. . . 109

is not empty), and that the ultimate goal of the PDA is to empty its stack to
accept a word. This is why the variables in the CFG we will build are of the
form:
[pγq]
where: q and p are two (possibly equal) states of the PDA; γ ∈ Γ is a pos-
sible stack symbol; and the intuitive meaning of the variable is that the set
of words that can be generated from [pγq] is exactly the set of all words that
are accepted by the PDA when: (i) it starts its execution in state p with γ as
the content of the stack; and (ii) it ends its execution in state q. In other
words:

[pγq] ⇒∗ w
iff
(p, w, γ) ⊢* (q, ε, ε).

In addition to those variables, we will also have an S variable, which is the


start symbol of the grammar. So, following the intuition we have given
above, the rules of our grammars that have S as the left-hand side must
be:

S →[q 0 Z0 q 0 ]
S →[q 0 Z0 q 1 ].

Indeed, for a word to be accepted by the PDA in Figure 4.12, the PDA must:

• either have an execution that starts in q 0 , eventually removes the sym-


bol Z0 from the stack (so that it becomes empty), and reaches q 0 . By
the intuition above, the sets of all words accepted by such executions is
the set of words generated by [q 0 Z0 q 0 ];

• or have an execution that starts in q 0 , eventually removes the symbol Z0


from the stack and reaches q 1 . Again, these words are those generated
by [q 0 Z0 q 1 ].

There are no other possibilities since q 0 and q 1 are the only two states in
the PDA.
Now, let us see how we can add to our grammar the rules that have vari-
ables of the form [pγq] as the left-hand side. Let us consider the variable
[q 0 0q 1 ]. So, we need to understand what are the words that the PDA could
accept by an execution that: (i) starts in q 0 with 0 as the only content of
the stack; and (ii) ends in q 1 with an empty stack. For that purpose, we
look at all the possible transitions that can occur from q 0 when 0 is on the
top of the stack. There are three possibilities:

• Either the PDA reads a 1 on the input. In that case, it will push this new
1 to the top of the stack, and stay in q0 . This means that an execution
that should empty the stack must, after this transition, pop two symbols
from the stack: first the 1, then the 0 that was already on the stack. In-
between these two pops, the PDA might either stay in q 0 or move to q 1 .
Thus, we add to our grammar two rules:

[q 0 0q 1 ] → 1[q 0 1q 0 ][q 0 0q 1 ]
[q 0 0q 1 ] → 1[q 0 1q 1 ][q 1 0q 1 ].
110 I N T R O D U C T I O N T O L A N G UA G E T H E O RY A N D C O M P I L I N G

Intuitively, the first rule says: ‘we want the PDA to read a word from a
configuration where q 0 is the current state and 0 is on the top of the
stack, and we want that after this word is read, the PDA reaches q 1 and
the 0 on the top of the stack has been popped ([q 0 0q 1 ]). Then, one way
to do that is to read a 1 from the input (which will push the 1 on the top
of the stack), then continue the execution by reaching q 0 again after
having removed that 1 from the top of the stack ([q 0 1q 0 ]), then con-
tinue again the execution until reaching q 1 after which the 0 on the top
of the stack has been popped ([q 0 0q 1 ]). The second rule says essentially
the same, except that now the intermediary state is q 1 .

• Another possibility is that the PDA reads a 0 from the input. Symmetri-
cally to the previous case, we have the two following rules in the gram-
mar:

[q 0 0q 1 ] → 0[q 0 0q 0 ][q 0 0q 1 ]
[q 0 0q 1 ] → 0[q 0 0q 1 ][q 1 0q 1 ].

• Finally, one possibility is that the PDA reads a # from the input. In this
case, it will not modify the stack (so the 0 that we want eventually to
pop will remain), and it will move to q 1 . Thus, the only rule we have in
this case is:

[q 0 0q 1 ] → #[q 1 0q 1 ].

Now, let us consider variable [q 1 0q 1 ]. For this variable, we must look


for all the possible words that the PDA can accept from a configuration
where q 1 is the current state, 0 is on the top of the stack, and the run of the
PDA ends in q 1 and the 0 has been popped. Since no push can occur from
q 1 , the only word that can be accepted this way is 0—the only possible
transition from (q 1 , w, 0) is the one that pops a 0 and reads a 0 from the
input, and thus w is necessarily equal to 0.
ALL THINGS CONTEXT FREE. . . 111

If we continue with this intuition, we obtain11 : 11


To keep the grammar as short as possi-
ble (!) we have avoided some variables that
(1) S → [q 0 Z0 q 0 ]
cannot produce anything for obvious rea-
(2) → [q 0 Z0 q 1 ] sons, such as [q 1 0q 0 ], as there is no path
(3) [q 0 0q 0 ] → 0[q0 0q0 ][q0 0q0 ] in the PDA from q 1 to q 0 .

(4) → 1[q0 1q0 ][q0 0q0 ]


(5) [q 0 1q 0 ] → 0[q0 0q0 ][q0 1q0 ]
(6) → 1[q0 1q0 ][q0 1q0 ]
(7) [q 0 Z0 q 0 ] → 0[q0 0q0 ][q0 Z0 q0 ]
(8) → 1[q0 1q0 ][q0 Z0 q0 ]
(9) [q 0 0q 1 ] → 0[q0 0q0 ][q0 0q1 ]
(10) → 0[q0 0q1 ][q1 0q1 ]
(11) → 1[q0 1q0 ][q0 0q1 ]
(12) → 1[q0 1q1 ][q1 0q1 ]
(13) → #[q 1 0q 1 ]
(14) [q 0 1q 1 ] → 0[q0 0q0 ][q0 1q1 ]
(15) → 0[q0 0q1 ][q1 1q1 ]
(16) → 1[q0 1q0 ][q0 1q1 ]
(17) → 1[q0 1q1 ][q1 1q1 ]
(18) → #[q 1 1q 1 ]
(19) [q 0 Z0 q 1 ] → 0[q0 0q0 ][q0 Z0 q1 ]
(20) → 0[q0 0q1 ][q1 Z0 q1 ]
(21) → 1[q0 1q0 ][q0 Z0 q1 ]
(22) → 1[q0 1q1 ][q1 Z0 q1 ]
(23) → #[q 1 Z0 q 1 ]
(24) [q 1 0q 1 ] → 0
(25) [q 1 1q 1 ] → 1
(26) [q 1 Z0 q 1 ] → ε
We finish this example by giving a (leftmost) derivation of this grammar
that accepts 01#10:
2
S=
⇒ [q 0 Z0 q 1 ]
20
=⇒ 0[q 0 0q 1 ][q 1 Z0 q 1 ]
12
=⇒ 01[q 0 1q 1 ][q 1 0q 1 ][q 1 Z0 q 1 ]
18
=⇒ 01#[q 1 1q 1 ][q 1 0q 1 ][q 1 Z0 q 1 ]
25
=⇒ 01#1[q 1 0q 1 ][q 1 Z0 q 1 ]
24
=⇒ 01#10[q 1 Z0 q 1 ]
26
=⇒ 01#10. P

From PDA with empty stack to PDA with accepting states For the third ε, X 0 /Z0 X 0
q 0′ q0
step of our proof, we show how we can transform a PDA P that accepts
some language N (P ) by empty stack, into a PDA P ′ that accepts the same
language by accepting state. ε, X 0 /ε ε, X 0 /ε
ε, X 0 /ε
′ ′ qf
Lemma 4.5. For all PDA P , there is a PDA P s.t. N (P ) = L(P ).

Proof. We sketch the construction of P ′ from P . It is illustrated in Fig- Figure 4.15: An Illustration of the construc-
tion that turns a PDA accepting by empty
ure 4.15 Let us assume that P = Q, Σ, Γ, δ, q 0 , Z0 , F . Then, we build P ′ =
­ ®
stack into a PDA accepting the same lan-
guage by final state.
112 I N T R O D U C T I O N T O L A N G UA G E T H E O RY A N D C O M P I L I N G

Q , Σ, Γ ∪ {Z0 }, δ′ , q 0′ , X 0 , F ′ where Q ′ , δ′ and q 0′ are defined according to


­ ′ ®

the following intuition. The idea behind the construction of P ′ is that P ′


should somehow ‘simulate’ P , detect whenever P empties its stack, and
then move to an accepting state. To do so, P ′ will push, on its stack, the
symbol Z0 which is the symbol that is used to mark the bottom of the stack
in P ; and then move to the initial state of P . That is, the first move of P ′ will
be: (q 0′ , w, X 0 ) ⊢(q 0 , w, Z0 X 0 ). From that configuration, P ′ can act exactly
as P . Indeed, the transitions of P cannot test for the X 0 symbol which is
at the bottom of the stack, so this has no influence on the execution of P .
Eventually, P will empty its stack by popping Z0 , so we should make sure
that P ′ accepts in this case. When this occurs, the symbol on the top of the
stack, in P ′ will be X 0 . So, we can add, from all states of P , a transition that
tests for X 0 on the top of the stack, and moves to an accepting state in this
case.
Formally:

• Q ′ = Q ∪ {q 0′ , q f }, where q 0′ is a new initial state, and q f is a new accept-


ing state;

• δ′ (q 0′ , ε, X 0 ) = {(q 0 , Z0 X 0 )}, and δ′ (q 0′ , a, X 0 ) = ; for all a ∈ Σ. That is, P ′


can only push Z0 on top of X 0 from its initial state q 0′ and then move to
q0 ;

• for all states q ∈ Q: δ′ (q, ε, X 0 ) = {(q f , ε)}, and δ′ (q, a, X 0 ) = ; for all
a ∈ Σ. Otherwise, δ′ coincide with δ. That is, we add transitions to the
accepting state only when X 0 occurs on the top of the stack;

• δ′ (q f , a, γ) = ; for all a ∈ Σ ∪ {ε} and γ ∈ Γ, i.e., there is no transition


from the accepting state;

• F ′ = {q f }.

It is easy to check that, we have in P , for some input word w:

(q 0 , w, Z0 ) ⊢*P (q, w ′ , ε)

iff we have, in P ′ :

(q 0′ , w, X 0 ) ⊢P’ (q 0 , w, Z0 X 0 ) ⊢*P’ (q, w ′ , X 0 ).

That is, P ′ can ‘simulate’ the execution of P with the additional X 0 at the
bottom of the stack. Then, from (q, w ′ , X 0 ), P ′ can move to the accepting
state:

(q, w ′ , X 0 ) ⊢P’ (q f , w ′ , ε).

Hence, P accepts w (by empty stack) iff P ′ does (by accepting state), i.e.,
N (P ) = L(P ′ ).

From PDA with accepting state to PDA with empty stack We close the loop
by showing how we can convert a PDA accepting some language L(P ) by
accepting state into one accepting the same language by empty stack.

Lemma 4.6. For all PDA P , there is a PDA P ′ s.t. L(P ) = N (P ′ ).


ALL THINGS CONTEXT FREE. . . 113

P
Proof. The construction is very similar to the one we have used in the
previous proof. Given a PDA P Q, Σ, Γ, δ, q 0 , Z0 , F that accepts some lan-
­ ®
ε, X 0 /Z0 X 0
guage L(P ) by accepting state, we build a PDA P ′ that ‘simulates’ the exe- q 0′ q0

cution of P , checks when an accepting state is reached, and, in this case,


moves, by an ε-transition to a state where it empties its stack.
This can be done as follows12 (the construction is illustrated in Fig- ε, γ/ε ε, γ/ε

ure 4.16): ε, γ/ε qf

Figure 4.16: An Illustration of the construc-


1. We add to P two fresh states q 0′ and q f , where q 0′ is the new initial state;
tion that turns a PDA accepting by final
state into a PDA accepting the same lan-
2. The bottom of stack symbol of P ′ is now X 0 ; guage by empty stack. Transitions labelled
by ε, γ/ε represent all possible transitions
3. From q 0′ there is a single transition to q 0 (the initial state of P ) that for all possible γ ∈ Γ.
12
pushes Z0 on the stack. Thus, after this initial transition, the content We skip the formal details, they can be
found in classical textbooks such as:
of the stack is Z0 X 0 , and all transitions of P can be executed, hence P
John E. Hopcroft, Rajeev Motwani, and
can be ‘simulated’ by P ′ ; Jeffrey D. Ullman. Introduction to Au-
tomata Theory, Languages, and Computa-
4. From all accepting states of P , there are transitions to q f labelled by tion (3rd Edition). Addison-Wesley Long-
man Publishing Co., Inc., Boston, MA,
ε, γ/ε (for all γ ∈ Γ) to q f . That is, once an accepting state is reached in
USA, 2006. ISBN 0321455363
P , P ′ moves to q f and starts popping the symbols from the stack;

5. On q f , there are self-loop transitions labelled by ε, γ/ε (for all γ ∈ Γ), to


empty the stack.

Again, it is easy to check that, there is an execution

(q 0 , w, Z0 ) ⊢*P (q, w ′ , x)

in P iff there is an execution of the form

(q 0′ , w, X 0 ) ⊢P’ (q 0 , w, Z0 X 0 ) ⊢*P’ (q, w ′ , x X 0 )

in P ′ . In the case where q ∈ F (that is, q is accepting in P ), then, from


(q, w ′ , x X 0 ), P ′ can move to q f , where it will empty its stack, i.e.:

(q, w ′ , x X 0 ) ⊢*P’ (q f , w ′ , ε),

where this last configuration is accepting for the ‘empty stack’ condition.
Hence, w is in L(P ) iff w is in N (P ′ )

4.2.4 Deterministic Context-Free Languages

We close this section by considering again the special case of DPDA. We


start by introducing a name and a notation for the class of languages recog-
nised by DPDA. This naming is due to G I N S B U R G and G R E I B A C H 13 : 13
Seymour Ginsburg and Sheila Greibach.
Deterministic context free languages.
Definition 4.19 (Deterministic CFL). A CFL L is deterministic (DCFL for Information and Computation (for-
merly known as Information and
short) iff there is a DPDA P that accepts it, i.e. L(P ) = L. M Control), 9(6):620–648, 1966. ISSN
0019-9958. DOI: 10.1016/S0019-
By the discussion at the end of Section 4.2.2, we already know that some 9958(66)80019-0. URL https:
CFL cannot be recognised by deterministic PDA, so we can state the fol- //www.sciencedirect.com/science/
article/pii/S0019995866800190
lowing about DCFL:
Observe that, thanks to the equiv-
alence between acceptance con-
Lemma 4.7. The class of deterministic context-free languages is stricly con- ditions we have given above, we
tained in the class of context-free languages: DCFL ⊊ CFL. could have written N (P ) instead of L(P ).
114 I N T R O D U C T I O N T O L A N G UA G E T H E O RY A N D C O M P I L I N G

4.3 Operations and closure properties of context-free languages

Let us now study the operations we can apply to context-free languages,


and what are their effect. The operations we are interested in are the ones
we have introduced in Section 1.4, namely: union, intersection, concate-
nation, complement and Kleene closure. Recall that for the class of regu-
lar languages, the answer to those
More precisely, we are interested by the closure properties of CFLs, i.e.,
questions is always ‘yes’: the union,
given two CFLs L 1 and L 2 to which we apply one of the operations listed the intersection and the concatenation of
above, is the resulting language still a CFL? two regular languages are regular; and the
complement and Kleene closure of a regu-
lar language are also regular. This is a re-
Union, concatenation and Kleene closure The union, the concatenation markable property of regular languages.
and the Kleene closure are three operations that preserve the context-free
character of languages. Let us prove this:

Theorem 4.8. Let L 1 and L 2 be two CFLs. Then, L 1 ∪ L 2 , L 1 · L 2 and L ∗1 are


CFLs.

Proof. As L 1 and L 2 are CFLs, there are CFGs G 1 = 〈V1 , T1 , P 1 , S 1 〉 and G 2 =


〈V2 , T2 , P 2 , S 2 〉 that accept them. Then, let us show how we can combine
them to produce grammars G, G ′ and G ′′ that accept L 1 ∪L 2 , L 1 ·L 2 and L ∗1
respectively . Instead of showing how to ap-
ply these operations on grammars
• For the union, we build the grammar recognising L 1 and L 2 , we could
also consider two PDA P 1 and P 2 recog-
nising them and show how to combine
G = 〈V1 ∪ V2 ∪ {S}, T1 ∪ T2 , P , S〉 , them to produce PDA accepting L 1 ∪ L 2 ,
L 1 · L 2 and L ∗
1 respectively. This construc-
where S is fresh variable and: tion would be in the spirit of the transla-
tion from regular expressions to ε-NFA (see
Section 2.4.1).
P = P 1 ∪ P 2 ∪ {S → S 1 , S → S 2 }.

That is, all the rules from both grammars are added to G; and the extra
rules allow the grammar G to ‘chose’ between L 1 or L 2 . More precisely, if
a word is generated from S (i.e. S ⇒∗ w), then it is necessarily generated
either from S 1 (i.e., the derivation is actually S ⇒ S 1 ⇒∗ w) or from S 2
(i.e., the derivation is actually S ⇒ S 2 ⇒∗ w). Thus, w belongs either to
L 1 or to L 2 , i.e. w ∈ L 1 ∪ L 2 . We have just shown that L(G) ⊆ L 1 ∪ L 2 .
Symmetrically, if w ∈ L 1 ∪L 2 , then either w ∈ L 1 or w ∈ L 2 . In the former
∗ ∗
case, w ∈ L 1 implies S 1 ⇒G 1
w, hence S ⇒G S 1 ⇒G w and thus w ∈ L(G).
∗ ∗
In the latter case, w ∈ L 2 implies S 2 ⇒G 2
w, hence S ⇒G S 2 ⇒G w and
thus w ∈ L(G), again. This shows that L 1 ∪ L 2 ⊆ L(G). Together with
L(G) ⊆ L 1 ∪ L 2 , this implies that L(G) = L 1 ∪ L 2 .

• For the concatenation, we build the grammar

G ′ = V1 ∪ V2 ∪ {S ′ }, T1 ∪ T2 , P ′ , S ′ ,
­ ®

where S ′ is fresh variable and:

P ′ = P 1 ∪ P 2 ∪ {S ′ → S 1 S 2 }

It is easy to check that a word is generated by G ′ iff it is generated from


the sentential form S 1 S 2 , hence w is the concatenation of w 1 ∈ L 1 and
w 2 ∈ L 2 , and L(G) = L 1 · L 2 .
ALL THINGS CONTEXT FREE. . . 115

• For the Kleene closure, we proceed in a similar fashion, with a recursive


rule. We build the grammar

G ′′ = V1 ∪ {S ′′ }, T1 , P ′′ , S ′′ ,
­ ®

where S ′′ is fresh variable and:

P ′′ = P 1 ∪ {S ′′ → S 1 S ′′ , S ′′ → ε}.

Again, one can easily check that w is accepted by G ′′ iff it is generated


from a sentential form S 1 S 1 · · · S 1 . In other words, w is a concatenation
of an arbitrary number of words generated from S 1 . Hence, L(G ′′ ) = L ∗1 .

Intersection and complement Unfortunately, neither intersection, nor com-


plement do preserve the context-free character of languages. That is, one
can find two languages L 1 and L 2 that are CFLs, but, s.t. L 1 ∩ L 2 is not a
CFL. From this, we will also be able to deduce that CFLs are not closed
under complement either. Theorem 4.9 can be proved by
invoking the pumping lemma for
To establish those results, we need to admit the following one:
context-free languages, which is re-
motely related to the arguments we have
Theorem 4.9. The language L abc = {an bn cn | n ≥ 0} is not context free. used to show that L () is not regular, at the
beginning of the present chapter.
The intuition behind this result is as follows. Assume we have a PDA
It is not difficult to see that, if we
accepting L abc . Clearly, such a PDA needs to count the number of a’s that admit PDAs with two stacks, then
are on the input, then check that this number corresponds to the number L abc can be recognised. Indeed,
such a PDA would push an a on both stacks
of b’s and to the number of c’s. To count this number of a’s, the PDA has when reading the a’s, pop from the first
no other choice than pushing all the a’s on its stack. Then, to check that when reading the b’s and from the sec-
ond when reading the c’s. However, PDA
the number of b’s is equal to the number of a’s, the PDA must necessarily
with two stacks can simulate Turing ma-
pop all the a’s from the stack while reading the b’s. However, at that point, chines, so adding only a second stack to
the stack is empty, so the count of the number of a’s has been lost, and the PDA greatly increases its expressive power,
so this model cannot be used for practical
PDA cannot check anymore that the number of c’s is correct. purposes in a compiler.

Now, using this result, let us prove that the intersection of two CFLs
might not be a CFL.

Theorem 4.10. CFLs are not closed under intersection.

Proof. Consider the two languages:

L 1 = {an bn ck | k, n ≥ 0}
L 2 = {ak bn cn | k, n ≥ 0}.

So, L 1 contains all words of the form a · · · ab · · · bc · · · c s.t. the number of a’s
is equal to the number of b’s, and L 2 all the words of the same form s.t. the
number of b’s is equal to the number of c’s.
It is easy to check that L 1 and L 2 are CFLs. A PDA accepting L 1 would
push on the stack all the a’s it reads, then check that the number of b’s is
equal by popping from the stack, and finally read c’s without constraint.
Symmetrically, a PDA accepting L 2 would read a prefix of a’s, push when
reading the b’s, then pop when reading the c’s14 14
Another argument to show that L 1 and
L 2 are CFLs is to observe that L 1 is the
So, L 1 and L 2 are both CFLs, but clearly L 1 ∩ L 2 = L abc , which is not a
concatenation of {an bn | n ≥ 0} with c∗ ,
CFL by Theorem 4.9. while L 2 is the concatenation of a∗ with
{bn cn | n ≥ 0}. Clearly, all these languages
are CFLs (in particular, a∗ and c∗ are reg-
ular), so their concatenation is also a CFL,
by Theorem 4.8.
116 I N T R O D U C T I O N T O L A N G UA G E T H E O RY A N D C O M P I L I N G

Finally, we can show that:

Theorem 4.11. CFLs are not closed under complement

Proof. The argument is by contradiction and is based on the previous


Theorem. Let L 1 and L 2 be two CFLs, and assume, for the sake of con-
tradiction, that, for all CFLs L, L = Σ∗ \ L is a CFL. Then, consider the lan-
guage:
³ ´
L1 ∪ L2 .

Clearly, since L 1 and L 2 are CFLs, this language is a CFL too: by hypothesis,
L 1 and L 2 are CFLs; so L 1 ∪ L 2 is a CFL too by Theorem 4.8, hence, its
complement is a CFL. However, classical set theory tells us that:
³ ´
L1 ∩ L2 = L1 ∪ L2 .

Thus, we conclude that, if the complement of any CFL L were a CFL too,
then the intersection of any pair of CFLs would be a CFL, which we know
is not the case by Theorem 4.9. Contradiction.

One can also note that some prob-


To conclude, CFLs do not enjoy all the nice closure properties of regular lems are undecidable for CFLs,
such as inclusion for instance. To
languages. This is not very surprising: an increase in the expressive power mitigate these problems, the class of vis-
usually comes at the price of a loss of properties. ibly pushdown languages (VPL) has been
introduced. They form an intermediary
class between regular languages and CFLs,
4.4 Grammar transformations while retaining enough expressive power
to have interesting applications. The class
of VPL is closed under all classical oper-
Let us close this chapter by considering several techniques that will turn ations (union, intersection, complement,
to be useful when we will build parsers for a given grammar, in order to Kleene star, concatenation), and inclusion
is decidable.
produce syntactic analysers. Those techniques ensure that the grammars
Rajeev Alur and P. Madhusudan.
we consider have certain important properties for the kind of parsers we Visibly pushdown languages. In Pro-
will consider (the importance of these properties will thus become clear in ceedings of the Thirty-sixth Annual ACM
Symposium on Theory of Computing,
the course of Chapter 5). In particular, the transformation that consists in STOC ’04, pages 202–211, New York,
modifying the grammar to take into account priority and associativity of NY, USA, 2004. ACM. ISBN 1-58113-
852-0. D O I : 10.1145/1007352.1007390.
operators allows one to remove (some) ambiguities of grammars.
URL http://doi.acm.org/10.1145/
1007352.1007390

4.4.1 Factoring

The first transformation is called factoring and can be applied when a


grammar contains at least two rules with the same left-hand side, and a
common prefix in the right-hand side. A typical example is in the specifi-
cation of an if in an imperative language:
(1) [if] → if [Cond] then [Code] fi
(2) [if] → if [Cond] then [Code] else [Code] fi
Why is it so important that no two
In this case, if [Cond] then [Code] is the common prefix. Clearly, it is right-hand sides of rules (with the
possible to ‘factor’ this common prefix and transform this grammar into: same left-hand side) share a com-
mon prefix? The intuition is as follows:
(1) [if ] → if [Cond] then [Code] [ifSeq] when we write a parser, we want to pro-
(2) [ifSeq] → fi duce a program that, given a word w and
a grammar G, builds, if possible, a deriva-
(3) [ifSeq] → else [Code] fi tion of G that produces w. Assume that
This latter grammar accepts the same language as the former, but there the parser has already built a prefix of the
derivation; obtained the sentential form
are no common prefixes in right-hand sides of rules. x 1 V x 2 ; and needs to decide which rule to
apply to rewrite V . Assume further that
there are in the grammar two rules, say
V → aα1 and V → bα2 . To make a choice
between the rules, the parser will look at
the next symbol in the input: if it is an a,
then it will apply the former rule; if it is a b,
it will apply the latter. This allows to make
ALL THINGS CONTEXT FREE. . . 117

In general, whenever we have, in a grammar, a set of rules of the form:

V → αβ1
V → αβ2
..
.
V → αβn ,

we can replace them by:

V → αV ′
V ′ → β1
V ′ → β2
..
.
V ′ → βn ,

where V ′ is a fresh variable. This process can be iterated until there are no
more rules to factor.

4.4.2 Removing left-recursion

As already explained, recursion in a CFG refers to the occurrence of the


left-hand side of a rule in its right-hand side. For instance, in the following
grammar that accepts a∗ , the first rule is recursive (and the second is used
to stop the recursion):
(1) S → Sa
(2) S → ε
One speaks of left-recursion when this recursive variable occurs as the
first symbol of the right-hand side, as in the example above. This is actu-
ally a case of direct recursion, but it can also be indirect, as in the following
example:
(1) S → Aa
(2) A → Sb
(3) A → ε
For reasons akin to the one that have prompted us to factor rules, left-
recursion will be problematic when building parsers, so we need to find
a way to remove it. Of course, completely removing recursion will not be
possible: recursion is the only way for a grammar to accept an infinite
language, and any grammar without recursion necessarily accepts a finite
language. So, our technique to remove direct and indirect left-recursion
will be as follows:

1. First, transform indirect left-recursion into direct left-recursion;

2. then, transform left-recursion into right-recursion.

To turn indirect left-recursion into direct left-recursion, we proceed as


follows. Every time there is a rule of the form:

A →Bα
118 I N T R O D U C T I O N T O L A N G UA G E T H E O RY A N D C O M P I L I N G

where A and B are variables, and where all rules with B as left-hand side
are:

B → β1
B → β2
..
.
B → βn ,

we replace A → B α by:

A → β1 α
A → β2 α
..
.
A → βn α.

Clearly, this preserves the language of the grammar. We repeat this trans-
formation until there is no indirect left-recursion left in the grammar.

Next, we need to remove direct left-recursion by turning it into right-


recursion. We proceed as follows. Consider a variable V and assume:

V →V α1
V →V α2
..
.
V →V αn

is the set of all direct left-recursive rules with V as the left-hand side. Fur-
ther assume that:

V → β1
V → β2
..
.
V → βm

is the set of all other (non-left-recursive) rules that have V as the left-hand
side. Observe that a word which is generated from A is necessarily of the
form:

w w 1′ w 2′ · · · w k′

where w is generated from one of the βi ’s, and each w i′ is generated from
ALL THINGS CONTEXT FREE. . . 119

one of the α j ’s. Following this intuition, we replace all those rules by:

V → β1 V ′
V → β2 V ′
..
.
V → βm V ′

V ′ → α1 V ′
V ′ → α2 V ′
..
.
V ′ → αn V ′
V ′ → ε.

As can be seen, V ′ is now the recursive variable, but we have used right-
recursion to generate the sequence of αi ’s.

4.4.3 Removing useless symbols

Definition of useless symbols The formal definition of CFGs we have given


is a syntactic one (i.e., there must be exactly one variable, and no terminal,
on the left-hand side); but it does not guarantee anything about the pos-
sible derivations and the use of the variables and terminals along these
derivations. In particular, it is perfectly possible, but rather undesirable,
to build grammars that still satisfy this definition but contain useless sym-
bols (variables or terminals), as shown by the next examples. (1) S → a
(2) → A
Example 4.20. Let us consider the CFG in Figure 4.17. (3) A → Aa

Clearly, any derivation that starts by S ⇒ A will not allow one to pro- Figure 4.17: A grammar with an unproduc-
tive variable (A).
duce any word, because all sentential forms derived from one containing
an A will also contain an A, that can never be eliminated. In other words,
the variable A is recursive, but there is no way to stop the recursion. This
means that A is useless in this grammar (it will never allow to produce any
word). So, we can safely remove rule 3 from the grammar without modi-
fying its language. But then, we can also remove rule 2, and the grammar
becomes:
(1) S → a M

This example shows a case where a variable (A) is unproductive. More


formally:

Definition 4.21 (Unproductive variable). A variable A in a grammar G =


〈V , T , P , S〉 is unproductive iff there is no word w ∈ T ∗ s.t.

A ⇒G w. M

Example 4.22. Our second example shows a case of a symbol that is pro-
ductive but is nonetheless useless because no sentential form obtained
(1) S → A
from the start symbol will ever contain it. Consider the grammar in Fig- (2) A → a
ure 4.18. (3) B → b

In this case, variable B is productive because B ⇒∗ b, but it can never Figure 4.18: A grammar with an unreach-
able symbol (B ).
be ‘reached’ in any sentential form produced from S. Remark that it is also
120 I N T R O D U C T I O N T O L A N G UA G E T H E O RY A N D C O M P I L I N G

the case with terminal b that occurs only in rule 3 (whereas all terminals
are necessarily productive). M Observe that symbols can be be-
come unreachable because some
Definition 4.23. Let G = 〈V , T , P , S〉 be a grammar. A symbol X ∈ V ∪ T rules have been removed due to the
is unreachable iff there is no sentential form of G that contains an X , i.e. removal of unproductive symbols.

there is no derivation of the form S ⇒G α1 X α2 . M

Now, let us devise algorithms to compute symbols that are unproduc-


tive or unreachable (i.e., the useless symbols). More precisely, we will present
algorithms to compute all the productive and reachable symbols, then use
this information to remove symbols and rules that are useless.

Unproductive symbols First, for unproductive symbols, remember that


all terminals are always productive. Moreover, if we consider a rule of the
form A → α, where α contains only productive symbols, then A is clearly
also productive.

Example 4.24. If we have A → aBC , where B ⇒∗ bb and C ⇒∗ cc, we can


compose these two derivations and obtain aBC ⇒∗ abbC ⇒∗ abbcc, and
thus aBC ⇒∗ abbcc Hence, we also have A ⇒ aBC ⇒∗ abbcc, and A is
productive. M

Based on this observation, we can devise an iterative algorithm that


computes the set of productive symbols of a CFG (it is given in Algorithm 4).
This algorithm maintains a set of symbols that are productive for sure.
Initially, it contains all the terminals. Then, the algorithm considers it-
eratively all the rules of the grammar, and, each time it finds a rule of the
form A → α where all symbols in α are productive, it adds A to the set of
productive symbols. The algorithm grows the set of productive symbols
this way until it reaches a fixed point. Upon termination, all the produc-
tive symbols have been computed, so, all the others are unproductive.

Input: A CFG G = 〈V , T , P , S〉
Output: The set Prod ⊆ V ∪ T of productive symbols

Prod ← T ;
Prec ← ; ;
while Prec ̸= Prod do
Prec ← Prod ;
foreach A → α ∈ P do
if α ∈ Prod∗ then
Prod ← Prod ∪ {A} ;

return Prod ;
Algorithm 4: The algorithm to compute productive symbols in a CFG.

Once the set Prod of productive symbols has been computed, removing
unproductive symbols from G = 〈V , T , P , S〉 yields G ′ = V ′ , T , P ′ , S , where
­ ®

V ′ = Prod ∩ V ∪ {S}, and P ′ contains all the rules of the form A → α ∈ P s.t. Observe that we keep S in V even
when S is not productive, because
α ∈ Prod∗ , i.e., α contains only productive symbols.
the syntax of grammars requests
that V always contains at least the start
Unreachable symbols Next, let us devise an algorithm to compute reach- symbol. However, if S is unproductive, the
set of rules P will not contain any rule of
able symbols. We follow the same kind of inductive reasoning as in the the form S → α.
ALL THINGS CONTEXT FREE. . . 121

case of productive symbols: clearly the start symbol S is reachable. Then,


if a variable A is reachable, and there is a rule A → α, then all symbols in α
are reachable too. This algorithm can be assimilated
to a breadth-first search in a graph.
Based on this observation, we obtain an algorithm to compute reach-
Imagine the nodes of the graphs
able symbols that maintains at all times a set Reach of symbols that are are the terminals and the variables of the
surely reachable. Initially, this set contains only S. Then, each time a grammar, and imagine that a rule of the
form A → α means that there is an edge
rule A → α with A ∈ Reach is found, all the symbols from α are inserted between A and each symbol in α. Then,
in Reach. The algorithm grows the set Reach until a fixed point is found. all the reachable symbols are exactly those
that are reachable in the graph from node
Upon termination, the set Reach contains all reachable symbols, and only
S. This can be computed by a breadth-first
those. Algorithm 5 presents this algorithm. search, which is exactly what the algorithm
does.
Input: A CFG G = 〈V , T , P , S〉
Output: The set Reach ⊆ V ∪ T of reachable symbols

Reach ← {S} ;
Prec ← ; ;
while Prec ̸= Reach do
Prec ← Reach ;
foreach A → α ∈ P do
if A ∈ Reach then
Add to Reach all symbols occurring in α ;

return Reach ;
Algorithm 5: The algorithm to compute reachable symbols in a CFG.

Again, removing unreachable symbols from G = 〈V , T , P , S〉 is easy once


the set Reach has been computed. We obtain the CFG G ′ = V ′ , T ′ , P ′ , S ,
­ ®

where V ′ = Reach ∩ V ; T ′ = Reach ∩ T ; and P ′ contains all the rules of the


form A → α ∈ P s.t. A ∈ Reach.

Removing all useless symbols Finally, let us show how we can combine
the two operations described above to obtain a grammar that contains no
useless symbols. Consider the following grammar:
(1) S → a
(2) → A
(3) A → AB
(4) B → b
where, A is unproductive; B is productive; and both A and B are reachable.
Removing A from the grammar as well as rules 2 and 3 yields the grammar:
(1) S → a
(2) B → b
that indeed contains only productive symbols, but where B is not reach-
able anymore (indeed, it was reachable ‘through’ A which has been re-
moved). We conclude that removing unproductive symbols can create un-
reachable symbols: after removing unproductive symbols, we will need to
run the algorithm to remove unreachable symbols.
Does the reverse hold? That is, is it possible that removing unreachable
symbols make some symbols unproductive? We will argue that this is not
possible, by contradiction. Assume some variable A in a grammar G which

is productive (i.e., A ⇒G w for some w ∈ T ∗ ), and assume we remove the
122 I N T R O D U C T I O N T O L A N G UA G E T H E O RY A N D C O M P I L I N G

unreachable symbols from the grammar, and that, after this removal, the
resulting grammar G ′ still contains A which is now unproductive, i.e. there
∗ ′
is no w s.t. A ⇒G ′ w. Clearly, if A cannot produce any word in G , while

it could in G, it is because all possible derivations that produce a word


from A in G make use of one of the removed unreachable symbols. That
is, for all w ∈ T ∗ : A ⇒G
∗ ∗
w implies that A ⇒G α ⇒G

w, where α contains at
least one variable B which is not reachable in G. However, since we have
assumed that A is still present in G ′ after removal of unreachable symbols,

we conclude that A is reachable, hence, A ⇒G α (with B occurring in α)
implies that B is reachable too. Contradiction.
The conclusion of this discussion is that removing unproductive sym-
bols can create unreachable ones, but that removing unreachable symbols
will not make variables unproductive. Thus, to remove all useless symbols
from a grammar, one should:

1. First, remove unproductive variables;

2. then, remove unreachable symbols.

After that, all variables are guaranteed to be productive, and all symbols to
be reachable.

Example 4.25. We close this section by a complete example showing the


removal of useless symbols. Consider the grammar G = 〈V , T , P , S〉, where
V = {S, A, B ,C , D}, T = {a, b, c} and P contains the set of rules:
(1) S → A
(2) → CC a
(3) A → Da
(4) → AB c
(5) B → b
(6) C → c
First, we compute the set of productive symbols. Following Algorithm 4,
we initialise Prod with T , i.e.:

Prod = {a, b, c}.

Considering rule 6, we discover that C is productive too, because the right-


hand side contains only c ∈ Prod, so now:

Prod = {a, b, c,C }.

Similarly, rule 5 tells us that B is productive, so:

Prod = {a, b, c,C , B }.

Next, by rule 2, we conclude that S is productive, since the right-hand side


of the rule contains only elements from Prod (a and C ),:

Prod = {a, b, c,C , B , S}.

However, we cannot add A to Prod, because the right-hand sides of rules 3


and 4 both contain a symbol (D and A respectively) that does not be-
long to Prod. So the set of productive symbols is exactly {a, b, c, B ,C , S},
ALL THINGS CONTEXT FREE. . . 123

and removing the unproductive symbols from the grammar yields G ′ =


­ ′
V , T , P ′ , S , where V ′ = {S, B ,C }, and P ′ contains the rules:
®

(1) S → CC a
(2) B → b
(3) C → c
Now, we compute the reachable symbols in G ′ . We start with:

Reach = {S}.

By rule 1, we discover that C and a are reachable too, hence:

Reach = {S,C , a}.

Now that C is known to be reachable, we deduce, by rule 3, that terminal c


is reachable too:

Reach = {S,C , a, c}.

At that point, we reach a fixed point: there is no rule of the grammar that
allows to reach neither B nor b from either S or C , so {S,C , a} is exactly the
set of reachable symbols. The resulting grammar is G ′′ = V ′′ , T ′′ , P ′′ , S ,
­ ®

where V ′′ = {S,C }, T ′′ = {a, c} and P ′′ contains the rules:


(1) S → CC a
(2) C → c
which contains only useful symbols. M

4.4.4 Priority and associativity of operators

We close this chapter by explaining a technique that allows to remove am-


biguities occurring typically in grammars designed for arithmetic or Boolean (1) Exp → Exp + Exp
expressions. We will consider once again the grammar for arithmetic ex- (2) → Exp ∗ Exp
(3) → Id
pressions that we recall in Figure 4.19. (4) → Cst
As we have already discussed in Section 4.1.1 this grammar is ambigu- Figure 4.19: A simple grammar for arith-
ous. Consider for instance the word Id + Id ∗ Id, the two following trees are metic expressions.

derivation trees of this word:

Exp Exp

Exp + Exp Exp ∗ Exp

Id Exp ∗ Exp Exp + Exp Id

Id Id Id Id

If we want to build a parser for this grammar, which is a deterministic


program, we will need to choose which of these trees will be returned by
the parser (and then modify the grammar to make sure that only this tree
is returned). To guide our choice, we will take into account the natural pri-
ority and associativity properties of the arithmetic and Boolean operators.
In the example above, the tree we want to be returned by the parser is the
one on the left. Indeed, this tree clearly show that the expression is the
124 I N T R O D U C T I O N T O L A N G UA G E T H E O RY A N D C O M P I L I N G

sum of the first identifier, on the one hand; and of the product of the sec-
ond and third identifiers, on the other hand. Symbolically, this expression
is thus equivalent to Id + (Id ∗ Id), which is indeed the right priority.
However, ambiguities occur even when operator priority doesn’t play
any role. Consider for instance the word Id + Id + Id. In this case, the two
following derivation trees are possible:

Exp Exp

Exp + Exp Exp + Exp

Id Exp + Exp Exp + Exp Id

Id Id Id Id
Now, the tree that we want to obtain is the one on the right, because it
corresponds to the expression (Id + Id) + Id, which is indeed the correct as-
sociativity for the + operator (the associativity is on the left).

Modifying the grammar Let us now modify the grammar to lift these am-
biguities and make sure that the only derivation trees that will be returned
by the parser are those that enforce the priority and associativity of the op-
erators. We discuss priority first. Intuitively, for the priority of the ∗ and
+ operators to be enforced, an expression must be a sum of products of
atoms, where an atom is a basic building block, i.e. either an Id or a Cst.
For instance, an expression like Id ∗ Id + Id ∗ Id must be regarded as the sum
of the two products Id ∗ Id, which means that we will compute the values
of those products first, then take their sum. This can be reflected in the
grammar, by introducing fresh variables corresponding to the concepts of
‘products’ and ‘atoms’, and to modify the rules in order to enforce a hier-
archy between these concepts. In the case of the grammar of Figure 4.19,
we would first introduce rules:

Atom → Id
→ Cst

to define the notion of ‘atom’. Then, we introduce the notion of ‘product’


(of atoms), using a recursive rule as there can be as many atoms as we want
in a product:

Prod → Prod ∗ Atom


→ Atom.

Using the same canvas, we define an Exp as a sum of products, and we Observe that the resulting gram-
mar is left-recursive (and we will
obtain the grammar:
see hereinafter that this left recur-
(1) Exp → Exp + Prod sion is crucial). However, we can use the
techniques of Section 4.4 to remove this
(2) → Prod
left-recursion, if need be.
(3) Prod → Prod ∗ Atom
(4) → Atom
(5) Atom → Cst
(6) → Id
ALL THINGS CONTEXT FREE. . . 125

Let us check that this new grammar is not ambiguous, and that the
Exp
derivation trees indeed respect the priority of operators. Consider again
the word Id + Id ∗ Id. Its (unique) derivation tree is the one shown in Fig-
ure 4.20. Clearly, this tree respects the priority of the operators. Exp + Prod

Now, let us consider the word Id + Id + Id. Its derivation tree is given in Prod Atom
Figure 4.21. Since we have used left-recursion in the rules associated to
Id
Exp and Prod, the left associativity of the operator is naturally respected. Prod ∗ Atom

Atom Id

Id
Figure 4.20: The derivation tree of Id∗Id+Id
Unary minus and parenthesis Let us now consider a more complex, yet taking into account the priority of the op-
typical example, where we allow the use of parenthesis and of the unary erators.

minus (in addition to the − and / operators that were missing in the previ- Exp
ous grammar):

(1) Exp → Exp + Exp Exp + Prod


(2) → Exp − Exp Atom
Exp + Prod
(3) → Exp ∗ Exp
Id
(4) → Exp/Exp Prod Atom

(5) → (Exp) Atom Id


(6) → −Exp
Id
(7) → Id Figure 4.21: The derivation tree of Id+Id+Id
(8) → Cst taking into account the associativity of the
operators.
Let us first discuss the case of the unary minus. Clearly, an expression
like −Id + Id must be understood as (−Id) + Id, and not as −(Id + Id), i.e., the
minus always ranges on the next atom. Thus, we should incorporate the
unary minus to the definition of atom:

Atom → −Atom
→ Id
→ Cst.

We handle (Exp) similarly. Indeed, the parentheses mean that the ex-
pression must be considered as a basic building block, and the priority of
the operators within the parenthesis must not interfere with the operators
outside the parenthesis. So, we obtain the grammar:

(1) Exp → Exp + Prod


(2) → Exp − Prod
(3) → Prod
(4) Prod → Prod ∗ Atom
(5) → Prod/Atom
(6) → Atom
(7) Atom → −Atom
(8) → Cst
(9) → Id
(10) → (Exp)

Finally, after removing left-recursion, we obtain:


126 I N T R O D U C T I O N T O L A N G UA G E T H E O RY A N D C O M P I L I N G

(1) Exp → Prod Exp′



(2) Exp → +Prod Exp′
(3) → −Prod Exp′
(4) → ε
(5) Prod → Atom Prod′

(6) Prod → ∗Atom Prod′
(7) → /Atom Prod′
(8) → ε
(9) Atom → −Atom
(10) → Cst
(11) → Id
(12) → (Exp)
which is a grammar that we will be able to exploit when building parsers,
as explained in the next chapter.
ALL THINGS CONTEXT FREE. . . 127

4.5 Exercises

4.5.1 Pushdown automata

Exercise 4.1. Give a PDA that accepts the language containing all words
of the form w w R where w is any given word on the alphabet Σ = {a, b}
and w R is the mirror image of w. Test your automaton on the input word
abaaaaba, by giving an accepting run of your automaton on this word.
Does your automaton accept by empty stack or by accepting state?

Exercise 4.2. (Exam question in 2014) Give the diagram of a determin-


istic pushdown automaton, on the alphabet Σ = {a, b, c, d, e} ,that accepts
the language L = {(ab)n c(de)n | n ≥ 0} using the empty stack acceptance
condition.

4.5.2 Grammar transformations

The techniques to do so have been


Exercise 4.3. Remove the useless symbols in the following grammars: described in Section 4.4.3.

(1) S → a
(2) → A
(3) A → AB
(4) B → b

(1) S → A
(2) → B
(3) A → aB
(4) → bS
(5) → b
(6) B → AB
(7) → Ba
(8) C → AS
(9) → b
128 I N T R O D U C T I O N T O L A N G UA G E T H E O RY A N D C O M P I L I N G

Exercise 4.4. Consider the following grammar:


(1) E → E op E
(2) → ID[E ]
(3) → ID
(4) op → ∗
(5) → /
(6) → +
(7) → −
(8) → ⇒
See Definition 4.5 and Sec-
tion 4.4.4.
1. Show that it is ambiguous.

2. The priorities of the various operators are as follows: [] and ⇒ have


higher priority than ∗ and /, which have higher priority than + and −.
Modify the grammar to take operator precedence into account as well
as left associativity.
See Section 4.4 for the techniques.
Exercise 4.5. Left-factor the following production rules:
(1) stmt → if expr then stmt − list end if
(2) stmt → if expr then stmt − list else stmt − list end if
See Section 4.4 for the techniques.
Exercise 4.6. Apply the left recursion removal algorithm to the following
grammar:
(1) E → E +T
(2) → T
(3) T → T ∗P
(4) → P
(5) P → ID

Exercise 4.7. (Excerpt from an exam question) Remove unproductive sym-


bols and then inaccessible symbols from the following grammar:
(1) S → aE
(2) → bF
(3) E → bE
(4) → ε
(5) F → aF
(6) → aG
(7) → aH D
(8) G → Gc
(9) → d
(10) H → Ca
(11) C → Hb
(12) D → ab
Then, remove left-recursion and perform left-factoring whenever pos-
sible.
5 Top-down parsers

P A R S I N G I S T H E S E C O N D S T E P O F T H E C O M P I L I N G P R O C E S S . During
this stage, the compiler analyses the syntax of the input program to check
its correctness. Just as we have formalised the scanning step using finite
automata, we will rely on pushdown automata to define rigorously what a
parser does.
More precisely, in this chapter, we will define a first major family of
parsers, namely the top-down parsers. In the next chapter, we will study
a different family of parsers, the bottom-up parsers. As their names indi-
cate, these parsers work in two completely, and actually opposite ways: a
top-down parser tries and build a derivation tree for the input string start-
ing from the root, applying the grammar rules one after the other, until
the sequence of tree leaves forms the desired string. On the other hand,
a bottom-up parser builds a derivation tree starting from the leaves (i.e.,
the input string), follow backwards the derivation rules of the grammar,
until it manages to reach the root of the tree. We will see that these two
paradigms have their own merits. Top-down parsers are perhaps more in-
tuitive, but bottom-up parsers are more powerful. Historically, compilers
such as gcc were written by hacking the code of an automatically gener-
ated bottom-up parser. Nowadays, recent versions of gcc or clang use a
hand-written top-down parser1 . 1
GCC wiki: new C parser. https://gcc.
gnu.org/wiki/New_C_Parser, 2008. On-
For these two main families of parsers, we will present techniques that
line: accessed on December, 29th, 2015;
allow one (when possible) to build automatically parsers from grammars, and CLang: features and goals. http:
which is exactly what we need in the framework of compiler design. //clang.llvm.org/features.html. On-
line: accessed on December, 29th, 2015

5.1 Principle of top-down parsing

We have already explained the main ideas behind top-down parsing when
showing how we can turn any CFG into a PDA that accepts the same lan-
guage by empty stack: see Section 4.2.3, and, in particular, the discussion
of Lemma 4.4, that we recall now. Consider the grammar for arithmetic
expressions in Figure 4.4. Then, a PDA that accepts (by empty stack) the
language of the grammar in Figure 4.4 is the following (where the initial
symbol on the stack is the start symbol of the grammar, namely Exp):
130 I N T R O D U C T I O N T O L A N G UA G E T H E O RY A N D C O M P I L I N G

ε, Exp/Exp + Exp
ε, Exp/Exp ∗ Exp
ε, Exp/(Exp)
ε, Exp/Id
ε, Exp/Cst (, (/ε
), )/ε
+, +/ε
q
∗, ∗/ε
Id, Id/ε
Cst, Cst/ε

This PDA simulates a leftmost derivation by maintaining, at all times,


on its stack, a suffix of the sentential form. For example, if we consider the
word Id + Id ∗ Id and its associated leftmost derivation:

2 1 4 4 4
Exp =
⇒ Exp ∗ Exp =
⇒ Exp + Exp ∗ Exp =
⇒ Id + Exp ∗ Exp =
⇒ Id + Id ∗ Exp =
⇒ Id + Id ∗ Id

Then, the PDA given above will start its execution with the start symbol
Exp of the grammar on the top of its stack, i.e., the execution will start in
configuration:
(q, Id + Id ∗ Id, Exp).

The PDA simulates the first step of the derivation by applying the rule
Exp → Exp ∗ Exp, which consists in popping the left-hand side of the rule
and pushing the right-hand side, yielding the new configuration:

(q, Id + Id ∗ Id, Exp ∗ Exp).

Performing twice the same operations with the rules Exp → Exp + Exp and
Exp → Id, the PDA reaches the configuration:

(q, Id + Id ∗ Id, Id + Exp ∗ Exp),

where a terminal (Id) is now on the top of the stack. At this point, the
PDA can check that the same terminal is on the input, consume it, and
pop the terminal. This can be performed twice, and we obtain the new
configuration:
(q, Id ∗ Id, Exp ∗ Exp).

The simulation of the derivation by the PDA goes on like that up to the
point where the stack is empty and the whole input has been read.

5.1.1 Systematic construction of a top-down parser

Let us now formalise these ideas, and show how we can build a PDA ac-
cepting, by empty stack, the language of a given CFG. Let G = 〈V , T , P , S〉
be a CFG. We build a PDA PG with a single state:

PG = {q}, T ,V ∪ T , δ, q, S, ; ,
­ ®

where δ is such that:


T O P - D OW N PA R S E R S 131

1. for all A ∈ V : δ(q, ε, A) = {(q, α) | A → α ∈ P }. That is, for all symbols A


of the grammar, for all rules of the form A → α, the PDA has a transition
that pops A and pushes α instead (without reading any character from
the input). This operation is called a produce (of rule A → α).

2. for all a ∈ T : δ(q, a, a) = {(q, ε)}. That is, for all terminals a of the gram-
mar, there is a transition that reads a from the input and pops a from
the stack. This operation is called a match (of terminal a).

3. in all other cases that have not been covered above, δ(a, b, c) = ;.

We can prove that this construction is indeed correct (which establishes


Lemma 4.4). We only give the main ideas of the proof, the details can easily
be worked out, and are left as an exercise to the reader:

Lemma 5.1. For all CFGs G, the PDA PG is s.t. L(G) = N (PG ).

Proof. (Sketch) The proof can be done by showing that: (i) for all words
w ∈ L(G), the leftmost derivation producing w can be simulated by an
accepting run of PG ; and that (ii) all accepting runs of PG (accepting a
word w) can be mapped to a leftmost derivation of G that produces w.
These two points are easily established by induction (on the lengths of the
derivation and of the run, respectively).

Non-determinism in the parser While this construction allows one to de-


rive a PDA from any CFG, this PDA is not (yet) a parser because it is non-
deterministic. This can be seen in the example above: when the symbol on
the top of the stack is Exp, the PDA can replace it either by Exp + Exp, or
by Exp ∗ Exp (among other possibilities) independently of the input string.
Observe, however, that when a terminal is present on the top of the stack,
the behaviour of the PDA is deterministic: it will match this symbol with
the same symbol on the input.
This example has allowed us to pinpoint the source of potential non-
determinism in the top-down parsers that we have built from CFGs. Such
non-determinism can only occur when a produce must be performed with
symbol A on the top of the stack and when there are, in the original gram-
mar, at least two rules A → α1 and A → α2 to choose from. Or, put other-
wise, resolving non-determinism in such PDAs amounts to answering the
following question: assuming some variable A is on the top of the stack,
which rule should we produce? (1) S → a
(2) → b

ε, S/a
5.2 Predictive parsers ε, S/b

The rest of this chapter is devoted to identifying classes of grammars for a, a/ε
q
b, b/ε
which a deterministic parser can be achieved, if we allow this parser to
Figure 5.1: A trivial grammar and its
make use of some extra information, that we call a look-ahead. This look-
corresponding (non-deterministic) parser
ahead is parametrised by a natural number k and consists of the k next (where the initial stack symbol is S).
characters on the input, that the parser can now take into account, with- It is important not to confuse a
look-ahead and a read on the in-
out actually reading them, to decide which transition to take. Parsers that
put. The look-ahead allows the
make use of a look-ahead are called predictive parsers. parser to have a view on the future of the
input, without modifying it, while a read
modifies the input—the characters read on
the input cannot be recovered.
132 I N T R O D U C T I O N T O L A N G UA G E T H E O RY A N D C O M P I L I N G

The intuition behind the notion of look-ahead is quite simple. Con-


sider for instance the trivial grammar on the top of Figure 5.1. The corre-
sponding parser is displayed below. When running on the input string
b, the parser is initially in the configuration (q, b, S), and has two non-
deterministic choices: either perform a produce of the former rule, or
of the latter. Clearly, knowing that the next symbol on the input is a b
allows the parser to make the right choice. So, the grammar above can
be parsed deterministically (by a top-down parser) with one character of
look-ahead—this is what we will later call an LL(1) grammar.

5.2.1 Pushdown automata with look-ahead

In order to formalise these ideas, let us now extend the definition of PDAs
by modifying their transition relation so that it can take into account the k
next characters on the input (for a look-ahead of k characters). This means
that the successors of a configuration will now be computed on the basis
of the current state, the current stack content (as in the case of ‘regular’
PDAs), but also the k first characters on the input. Observe that the only difference
between this definition and that of
Definition 5.1 (k-look-ahead PDA). A pushdown automaton with k char- PDAs, is that the transition func-
tion has a fourth parameter, which consti-
acters of look-ahead (k-LPDA for short) is a tuple Q, Σ, Γ, δ, q 0 , Z0 , F , where
­ ®
tutes the look-ahead. This look-ahead is a
all the components are defined as for PDAs (see Definition 4.11), except for word of k characters at most, since we have

the transition function δ that maps Q ×(Σ∪{ε})×Γ×Σ≤k to 2(Q×Γ ) ; where: no guarantee that there are always k char-
acters (or more) remaining on the input.

Σ≤k = ∪ki=0 Σi

is the set of all words of length at most k on the alphabet Σ.


When k = 1, we note LPDA instead of 1-LPDA. M

Let us now define formally the new semantics that takes into account
the look-ahead. We lift the notion of configuration from PDAs to k-LPDAs:
Definition 4.12 carries on to k-LPDAs. The notion of configuration change,
however, must be adapted:

Definition 5.2 (k-LPDA configuration change). Let us consider a k-LPDA


P = Q, Σ, Γ, δ, q 0 , Z0 , F . Let q, auv, X β be a configuration of P , where:
­ ® ­ ®

• X ∈ Γ;

• a ∈ Σ ∪ {ε};

• u ∈ Σ≤k−1 ;

• v ∈ Σ∗ ; and

• if |auv| ≥ k then |au| = k, otherwise v = ε (i.e., au is a prefix of length k


of the remaining input word if there are at least k characters remaining
on the input. Otherwise, au contains all the remaining input).

Then, P can move from q, auv, X β to q ′ , uv, αβ iff there is (q ′ , α) ∈


­ ® ­ ®

δ(q, a, X , au). In this case, we write:

q, auv, X β ⊢P q ′ , uv, αβ .
­ ® ­ ®

M
T O P - D OW N PA R S E R S 133

The notion of configuration change is the only one we need to adapt


for k-LPDAs: all the other notions regarding their semantics (the different
notions of accepted language, etc) are lifted from their counterparts on
PDAs. Now, let us consider an example that shows how the look-ahead
can be used to obtain deterministic automata more easily.

Example 5.3. Let us consider again the trivial grammar in Figure 5.1. We
can now extend its corresponding PDA with a look-ahead of one charac-
ter, to obtain a deterministic LPDA. To this end, we consider the following
transition function:

• δ(q, ε, S, a) = {(q, a)} for the produce of S → a. Observe that we perform


this produce only when the next character on the input is an a;

• δ(q, ε, S, b) = {(q, b)} for the produce of S → b. Again, the produce is per-
formed only when b is the next character on the input;

• δ(q, a, a, a) = {(q, ε)} for the match of a; and

• δ(q, b, b, b) = {(q, ε)} for the match of b.

Graphically, we obtain the following LPDA, assuming a label of the form


u : a, X /β means: ‘if the look-ahead is u, the first character on the input is
a, and the top of the stack is X , replace it by β’ (in other words, we keep
the same convention as for PDAs, prepending the look-ahead followed by
a colon).

a : ε, S/a
b : ε, S/b

a : a, a/ε
q
b : b, b/ε

Observe that this LPDA is now deterministic (and can thus be implemented
as a computer program in a straightforward way). Observe further that the
look-ahead allows the LPDA to query the next character on the input with-
out reading it: once again, a transition labeled by ‘a : ε, S/a’ means that
the automaton checks that the next character on the input is a, but does
not consume it (as indicated by the ε), hence does not modify the input in
the next configuration. On the other hand, a transition labeled by a : a, a/ε
not only checks that there is an a on the input but also reads it. M

Expressive power of k-LPDAs It should now be clear to the reader that


LPDAs are a natural and useful class of automata to build deterministic
parsers from CFGs. While LPDAs seem like a perfect tool on the practi-
cal side, it is not (yet) clear how they fit in the theory we have built so far.
In other words: what is the position of k-LPDAs in the Chomsky hierar-
chy? Clearly, since k-LPDAs extend PDAs they accept all the CFLs. But
could it be the case that they accept (thanks to the look-ahead) more lan-
guages than PDAs? The answer, fortunately is: no. We will establish this
by showing that all k-LPDAs can be turned into an equivalent PDA (which,
however, might be exponentially bigger and non-deterministic). So, at the
134 I N T R O D U C T I O N T O L A N G UA G E T H E O RY A N D C O M P I L I N G

end of the day, k-LPDAs are nothing more than a more convenient syntax
for PDAs (but a very convenient one, as we will see!).

Proposition 5.2. For all k-LPDAs P , we can build a PDA P ′ s.t. L(P ) = L(P ′ ).

Proof. We will present the proof for the case where k = 1, the ideas are
easy to generalise to k > 1. Let us first sketch the intuition of the proof, by
explaining how executions of P ′ correspond to executions of P , and vice
versa. First, we let the set of states of P ′ be pairs of the form (q, a), where q To complement this intuition, re-
call that a finite automaton can be
is a state of q and a ∈ Σ is a single-letter that represents the current look- regarded as a program that uses a
ahead. Note that this look-ahead will not be computed by P ′ , but rather finite amount of memory only. This mem-
ory is encoded in the states of the automa-
guessed using non-determinism, and then checked afterwards. Thus, in-
ton. This is the same idea that we use
tuitively, when P ′ is in (q, a), this corresponds to P being in state q and here. Since the look-ahead is bounded by
having guessed that a is the first character on the input. Following this k, and since the alphabet is finite, there are
only finitely many possible values for the
idea, we have to define the transition function of P ′ so that it properly up- look-ahead, which can thus be stored in
dates the look-ahead contained in its states, and checks the validity of the the states, and then queried and updated
when need be, using non-determinism.
non-deterministic guesses, in order to keep P ′ synchronised with P .
Initially, P ′ simply jumps to a state of the form (q 0 , x) for some x ∈ Σ∪{ε}
(thus, the x is the first guess performed by P ′ ). Then, Figure 5.2 shows the
rest of the construction. If, from state q 1 , P can read some character x ∈ Σ x : x, X /γ
q1 q2
(hence, the look-ahead is necessarily equal to x, otherwise the transition
cannot be taken), then, in P ′ , we can ‘simulate’ this transition from (q 1 , x) becomes:

only (because the look-ahead must be x). Since the corresponding tran- q2 , a

sition of P ′ does read an x, we are certain that the guess was correct. The x, X
x, X /γ
different possible successors correspond to the different possible guesses q1 , x q2 , b
x, X
for the next look-ahead. A special case is displayed at the bottom of the fig- /γ

ure and occurs when P reads ε from the input. In this case, the look-ahead q2 , ε
could be non-empty (otherwise, the look-ahead would have no interest!),
and must not be updated in the state of P ′ . y : ε, X /γ
q1 q2
Let us formalise this. Given an LPDA P = Q, Σ, Γ, δ, q 0 , Z0 , F , we build
­ ®

a PDA P ′ = Q ′ , Σ, Γ, δ′ , q 0′ , Z0 , F ′ where:
­ ®
becomes:

ε, X /γ
• Q ′ = Q × Σ ∪ {ε} ⊎ {q 0′ } (thus, q 0′ is a fresh initial state);
¡ ¢
q1 , y q2 , y

• F ′ = F × {ε} ; and Figure 5.2: Illustration of the transforma-


tion of a k-LPDA into an equivalent PDA,
assuming Σ = {a, b}, x ∈ Σ and y ∈ Σ ∪ {ε}.
• δ′ is s.t.:
n¡ ¢¯¯ o
1. δ′ (q 0′ , ε, Z0 ) = (q 0 , x), Z0 ¯x ∈ Σ∪{ε} : this corresponds to the initial
guess of the look-ahead;

2. for all q ∈ Q, x ∈ Σ and X ∈ Γ:


¢ n¡ ¢¯¯ o
δ′ (q, x), x, X = (q ′ , z), γ ¯(q ′ , γ) ∈ δ(q, x, X , x), z ∈ Σ ∪ {ε} ,
¡

which corresponds to the top part of Figure 5.2;

3. for all q ∈ Q, y ∈ Σ and X ∈ Γ:


¢ n¡ ¢¯¯ o
δ′ (q, y), ε, X = (q ′ , y), γ ¯(q ′ , γ) ∈ δ(q, ε, X , y) ,
¡

which corresponds to the bottom part of Figure 5.2; and

4. δ′ (q, x), y, X = ; in all the cases that have not been specified above.
¡ ¢
T O P - D OW N PA R S E R S 135

We finish by sketching the arguments to show that the construction is


correct. First, we consider an execution of the k-LPDA P on some word w:

q 0 , w 0 , Z0 ⊢P q 1 , w 1 , γ1 ⊢P . . . ⊢P q k , w k , γk ,
­ ® ­ ® ­ ®

where w 0 = w, w k ε and q k ∈ F . Then, one can show by induction on the


length of the execution that it corresponds to an accepting execution in P ′ .
Assuming that for all 0 ≤ i ≤ k, a i denotes the first character of w i (with
a i = ε if w i = ε), this accepting execution in P ′ is:

q 0′ , w 0 , Z0 ⊢P (q 0 , a 0 ), w 0 , Z0 ⊢P’ (q 1 , a 1 ), w 1 , γ1 ⊢P’ . . . ⊢P’ (q k , a k ), w k , γk .


­ ® ­ ® ­ ® ­ ®

That is, it is the execution obtained when P ′ always ‘guesses’ correctly the
next character on the input. It is easy to check that this is indeed an ex-
ecution of P ′ (see the definition of δ′ above), which is accepting because
(q k , a k ) is a final state when q k is.
On the other hand, if

q 0′ , w 0 , Z0 ⊢P (q 0 , a 0 ), w 0 , Z0 ⊢P’ (q 1 , a 1 ), w 1 , γ1 ⊢P’ . . . ⊢P’ (q ℓ , a ℓ ), w ℓ , γℓ


­ ® ­ ® ­ ® ­ ®

is an accepting execution of P ′ on some word w 0 = w (thus, with w ℓ = ε


and q ℓ ∈ F ), then, one can check that:

q 0 , w 0 , Z0 ⊢P q 1 , w 1 , γ1 ⊢P . . . ⊢P q ℓ , w ℓ , γℓ
­ ® ­ ® ­ ®

is an accepting execution of P . This again can be done by induction on


the length of the execution. This part is a bit more difficult that the re-
verse direction, because one has to check that all the steps are allowed by
the look-ahead in P . The key point to prove this is the fact that, at the
end of the execution, w ℓ = ε, otherwise the execution would not be ac-
cepting. Considering the definition of δ′ , this means necessarily that, all
look-aheads ‘guessed’ by P ′ were eventually checked to be correct. Indeed,
when P ′ performs some step (q i , a i ), w i , γi ⊢P’ (q i +1 , a i +1 ), w i +1 , γi +1 ,
­ ® ­ ®

then:

1. either it performs an ε-labeled transition in which case a i = a i +1 and


w i = w i +1 (i.e., no check of the look-ahead is performed, but neither
the input word nor the guessed look-ahead change. So the check is de-
ferred to a later transition);

2. or a transition labeled by a i is performed. This implies that a i was in-


deed the first character of w i , hence the ‘guess’ in (q i , a i ) was correct.

Theorem 5.3. For all k, the class of languages accepted by k-LPDAs is the
class of CFLs.

Proof. Since, for all k, we can translate any k-LPDA into an equivalent PDA
(see Proposition 5.2), the class of k-LPDAs accepts no more than the CFLs.
On the other hand, all PDAs P can trivially be translated into an equivalent
k-LPDA P ′ (for all k): it suffices to define the transition relation of P ′ in
such a way that it ignores the look-ahead (or, in other words, such that it
performs the same actions on the input and on the stack for all possible
values of the look-ahead).
136 I N T R O D U C T I O N T O L A N G UA G E T H E O RY A N D C O M P I L I N G

Now that we have k-LPDAs at our disposal, let us show how to trans-
form, when possible, and in a systematic way, CFGs into deterministic
k-LPDAs that we will be able to translate easily into programs. To this end,
we need to introduce some extra definitions.

5.3 Firstk and Followk

In order to introduce these two notions, we start by an extensive example.

Example 5.4. Let us consider the grammar:


(1) A → aaa
(2) → B bb
(3) → C dd
(4) B → b
(5) C → c
(6) → ε
and let us assume we want to build a predictive parser with one character
of look-ahead. There are two sources of non-determinism in this gram-
mar: on variable A, and on variable C .

1. In the case of variable A, we need to choose between rule 1, rule 2 and


rule 3. Obviously, all words generated from A using rule 1 as the first rule
in the derivation will start with an a, so we will apply rule 1 in this case
only. What about rules 2 and 3? Clearly, there will never be a B nor a C
on the input, as these symbols are variables and not terminals, so we
need to examine what B and C can produce:

(a) Similarly to the case of rule 1, it is easy to see that all words pro-
duced from B will start by a b, by rule 4. So, rule 2 should be applied
only when a b is on the input.
(b) The case of C is more complicated, as C can produce either c or ε.
In the former case, we expect c to be the first next character on the
input to apply rule 3. In the latter case, the derivation is A ⇒ C dd ⇒
dd, so we expect a d as the next character on the input. We conclude
that all derivations starting by rule 3 will produce words that start by
c or d only.

The case of variable A are thus summarised in Table 5.1, which gives for
each look-ahead, the rule to apply when A is on the top of the stack.

Table 5.1: The rules to apply when A is on


Look-ahead
the top of the stack.
Var a b c d
A 1 2 3 3

To obtain this information, we have computed, for each rule of the form
A→α , the set of all the possible first characters of words that can be de-
rived from α. Indeed, in the case of A→aaa , all words derived from aaa
start by an a; in the case of A→B bb , all words derived from B bb start by
b; in the case of A→C dd , all words derived from C dd start either by c
or by d. This captures the intuition behind the First1 , i.e., as we will see
T O P - D OW N PA R S E R S 137

hereinafter, First1 (aaa) = {a}, First1 (B aa) = {b} and First1 (C dd) = {c, d}.
We already have the intuition that computing those sets must some-
times be done recursively: for instance First1 (B aa) is equal to First1 (B ),
because B is the first symbol in B aa.

2. Applying this idea to rule 4 allows us to deduce immediately that we


will apply this rule only when B is on the top of the stack and b is the
next character on the input.

3. The case of variable C is, once again, interesting, as the computation


of the First1 is not sufficient. Indeed, all words that are derived from C
(rule 5) necessarily start by c, but how do we handle the case of ε? Rule 6
does not give us any clue about the next character that we expect on the
input when this rule must be applied.

Instead, we must consider the context in which a C can occur, and what
are the characters that could follow it. The only place where a C occurs
is in rule 3. From this rule we can deduce that all words generated from
C will necessarily be followed by a d. This is better understood by visual-
ising the derivation tree of dd, i.e. the only word that one can generate
from A by applying rule 6, as shown in Figure 5.3.

What we have Figure 5.3: The derivation tree of dd and


the notion of Follow1 .
on the top A
of the stack

C d d
What we expect
on the input

ε
∈ Follow1 (C )

This intuition is captured by the notion of Follow1 : the set Follow1 (X ),


for some variable X will be defined as the set of all possible first charac-
ters of some words that are generated immediately after a word derived
from X .

We can now complete Table 5.1 and obtain Table 5.2, which tells us ex-
actly which rule to apply for each possible symbol on the top of stack and
next character on the input. Since there is at most one rule in each cell of
the table, we have now a deterministic parser at our disposal. M

Table 5.2: An action table for our example


Look-ahead
grammar, and a look-ahead of one char-
Var a b c d acter: it tells us which rule to produce for
each possible symbol on the top of the
A 1 2 3 3 stack, and each possible first character on
B 4 the input.

C 5 6
A word w is thus in Firstk (α) iff:
(i) it is the prefix of some sentential
Let us now formalise properly the notions of Firstk and Followk . form generated from α (i.e., α ⇒∗
w x for some w); and (ii) it has the right
size: either it contains exactly k characters
(|w| = k), or it contains less than k charac-
ter, but this can occur only if cannot make
this prefix any longer, which implies that
x = ε (after all, there is no reason that all
sentential forms generated from α contain
at least k characters).
138 I N T R O D U C T I O N T O L A N G UA G E T H E O RY A N D C O M P I L I N G

Definition 5.5 (Firstk ). Let G = 〈V , T , P , S〉 be a CFG, and let α be a sen-


tential form of G (i.e., α ∈ (T ∪ V )∗ ). Then,

α ⇒∗ w x
 ¯ 
 ¯ 
 ¯ 
k ∗¯
First (α) = w ∈ T ¯ and .
 ¯ 
¯ either |w| = k or |w| < k and x = ε
 ¡ ¢ 

In the case where k = 1, we write First(α) instead of First1 (α). M


The intuition behind the definition
of Followk is easier: w is in the
Definition 5.6 (Followk ). Let G = 〈V , T , P , S〉 be a CFG, and let α be a sen- Followk (α) iff there is some deriva-
tential form of G (i.e., α ∈ (T ∪ V )∗ ). Then, tion allowed by the grammar that pro-
n ¯ ¡ ¢o duces α, followed by γ, and w is in the
Followk (α) = w ∈ T ∗ ¯there are β, γ s.t. S ⇒∗ βαγ and w ∈ Firstk γ . Firstk of γ.
¯

In the case where k = 1, we write Follow(α) instead of Follow1 (α). M

Example 5.7. Let us consider the grammar in Figure 5.4, which generates
expressions. Remember that this is the grammar that we have obtained
after taking into account the priority of the operators and removing left-
recursion (see the last pages of Chapter 4). Observe that we have added a
rule S→Exp$ to the grammar to make sure that all strings end with the
marker $. This will actually make our life easier when computing Follow (1) S → Exp$
(2) Exp → Prod Exp′
sets. Let us start by considering some values of First sets: (3) Exp′ → +Prod Exp′
© ª (4) → −Prod Exp′
• First(Atom) = −, Cst, Id, ( ; (5) → ε
(6) Prod → Atom Prod′
• First2 (Atom) = {−−, −Cst, −Id, −(, (−, ((, (Cst, (Id, Cst, Id}; (7) Prod′ → ∗Atom Prod′
(8) → /Atom Prod′
• First Prod′ = ∗, /, ε ;
¡ ¢ © ª
(9) → ε
(10) Atom → −Atom
• What is the value of First2 Prod′ ? We see that Prod′ produces either a
¡ ¢
(11) → Cst
string starting with ∗ and followed by some string generated by Atom; or (12) → Id
(13) → (Exp)
a string starting with / and followed by some string generated by Atom;
Figure 5.4: The grammar generating ex-
or ε. So, we can rely on First(Atom) to characterise First2 Prod′ , and we
¡ ¢
pressions (followed by $ as an end-of-
find: string marker), where we have taken into
account the priority of the operators, and
First2 Prod′ = ∗ · First1 (Atom) ∪ / · First1 (Atom) ∪ ε
¡ ¢ © ª © ª © ª
removed left-recursion.

= ∗ −, ∗Cst, ∗Id, ∗(, /−, /Cst, /Id, /(, ε .


© ª

Now, let us consider some values of Follow sets:

• What is Follow Exp′ ? All strings generated by Exp′ necessarily appear


¡ ¢

at the end of a string generated by Exp, and all strings generated by Exp
are followed by a $ or by a ) in the final output, so Follow Exp′ = {$, )}.
¡ ¢

• What is Follow(Prod)? All strings generated by Prod are followed by a


string generated by Exp′ . Such a string can: (i) either start by + or −, so S

these two symbols are in Follow(Prod); (ii) or be the empty word. In this Exp $
latter case, the Follow of Prod will be the Follow of Exp′ . This is sketched
Prod Exp′
in Figure 5.5. Here, Prod eventually generates some string α, and Exp′
.. ε
generates ε. Then, clearly, the generated word is α · ε · $ = α$ (this can .
be seen by inspecting the tree’s leaves). This show that $ can indeed α
immediately follow a string (α) generated by Prod. This reasoning holds Figure 5.5: An example showing that
Follow(Prod) contains $ in a case where
for all symbols in Follow Exp′ = Follow(Exp) = {$, )}. We conclude that:
¡ ¢
© ª Exp′ generates ε.
Follow(Prod) = +, −, $, ) .

M
T O P - D OW N PA R S E R S 139

5.3.1 Computation of Firstk

While the discussion above provides us with a clean definition of Firstk ,


it does not provide us with an algorithm to compute those sets2 . The al- 2
Observe that Firstk (α) is finite for all α
since it contains words of length at most k
gorithm we are about to present is based on the following observation. As-
only.
sume we want to compute Firstk (α) for some sentential form α = X 1 X 2 · · · X n
(with all X i ’s individual terminals or variables of the grammar). Then,
we can first compute Firstk (X 1 ). If all words in Firstk (X 1 ) are of length
k, then, we are done. Otherwise (in particular if ε ∈ Firstk (X 1 )), we need
to complete those elements of Firstk (X 1 ) that are not ‘long enough’ by ele-
ments from Firstk (X 2 ). Again, this might not be sufficient, so we compute
Firstk (X 3 ), etc. This suggests Algorithm 6, a greedy algorithm that com-
putes Firstk (X ), for all variables X .
Initially, the algorithm computes Firstk (a) for all terminal a—this is
actually trivial since a terminal a can only generate the word a, hence
Firstk (a) = {a}. Then, it initialises the sets Firstk (A) to ; for all variable A ∈
V and grows those sets in the repeat loop. As long as some of the Firstk (X )
sets have been updated, the information computed during one iteration of
the loop is used to try and enrich other Firstk (A) sets during the next iter-
ation, based on the rules of the grammar: for all rules A→X 1 X 2 · · · X n , we
re-compute the set Firstk (A) by concatenating Firstk (X 1 ), Firstk (X 2 ),. . . ,
Firstk (X n ), and truncating the resulting words to k characters3 . This fol- 3
Remark that we could restrict ourselves to
rules of the form A→X 1 X 2 · · · X n where at
lows exactly the intuition given hereinbefore.
least one of the X i ’s has been updated in
the previous iteration of the loop.
Input: A CFG G = 〈V , T , P , S〉
Output: The sets Firstk (X ) for all X ∈ V ∪ T .

foreach a ∈ T do
Firstk (a) ← {a} ;
foreach A ∈ V do
Firstk (A) ← ; ;
repeat
foreach A → X 1 X 2 · · · X n ∈ P do
Firstk (A) ←
Firstk (A) ∪ Firstk (X 1 ) ⊙k Firstk (X 2 ) ⊙k · · · ⊙k Firstk (X n ) ;
¡ ¢

until no Firstk (A) has been updated, for any A ∈ V ;


Algorithm 6: Computation of Firstk (X ) for all X ∈ V ∪ T .

Example 5.8. Let us close the discussion of Algorithm 6 by an example


execution. Consider again the grammar in Figure 5.4. Here are the first
few steps of execution of the algorithm (we only detail the computation of
First1 (X ) for the variables):

0. Initially, we have: First1 (X ) = ; for all variables X ∈ V .

1. During the first iteration, we go through all rules, and update the First
of their left-hand side:

(a) For the rule S → Exp$, we get the opportunity to update First(S). We
compute: Remember that ;·L = ; for all lan-
guages L. So ; ⊙k L = ; for all lan-
guages L and all k.
140 I N T R O D U C T I O N T O L A N G UA G E T H E O RY A N D C O M P I L I N G

First1 E xp ⊙1 First1 $ = ; ⊙1 {$}


¡ ¢ ¡ ¢

=;

So, this first rule does not allow us to infer more information on
First(S) at that point, because we have not computed any word from
First(Exp) yet.
(b) Actually, at this step of computations, all rules that contain at least
one variable in the right-hand side will yield a similar result, because
all the First’s are still empty.
(c) However, rules E xp ′ → ε, Prod′ → ε, Atom → Cst and Atom → Id al-
low us to update the First Exp′ , First Prod′ and First(Atom) respec-
¡ ¢ ¡ ¢

tively.

So, at the end of the first iteration, we have:

X First1 (X )

Exp′ ε
Prod′ ε
Atom Cst, Id

and First1 (X ) = ; for all other variables.

2. During the second iteration, we will discover new values of First1 sets.
Thanks to Prod → AtomProd′ , we add to First1 (Prod) the elements from:

First1 (Atom) ⊙1 First1 Prod′ = {Cst, Id} ⊙1 {ε}


¡ ¢

= {Cst, Id};

to First1 Prod′ the elements from:


¡ ¢

First1 {∗} ⊙1 First1 (Atom) ⊙1 First1 Prod′ = {∗} ⊙1 {Cst, Id} ⊙1 {ε}
¡ ¢

= {∗},

and from:

First1 {/} ⊙1 First1 (Atom) ⊙1 First1 Prod′ = {/} ⊙1 {Cst, Id} ⊙1 {ε}
¡ ¢

= {/};

and, finally, to First1 (Atom), the elements from:

First1 {−} ⊙1 First1 (Atom) = {−} ⊙1 {Cst, Id}


= {−}.

So, at the end of this iteration, we have:

X First1 (X )

Exp′ ε
Prod Cst, Id
Prod′ ∗, /, ε
Atom −, Cst, Id
T O P - D OW N PA R S E R S 141

and First1 (X ) = ; for all other variables.

3. The algorithm goes on similarly up to stabilisation, and computes the


following values for the First1 sets:

X First1 (X )

S −, Cst, Id, (
Exp −, Cst, Id, (
Exp′ +, −, ε
Prod −, Cst, Id, (
Prod′ ∗, /, ε
Atom −, Cst, Id, (

5.3.2 Computation of Followk

Let us now turn our attention to the computation of Followk (X ) for all vari-
ables X of a CFG. The algorithm is given in Algorithm 7, and is, again, a
greedy algorithm that grows the sets Followk (X ) up to stabilisation. To do
so, we rely on the following intuition: every time we have a rule of the form:

A → αB β,

(i.e., a rule that contains variable B in its right-hand side), we can poten-
tially add more information to Followk (B ). Indeed, a string generated by
B can be followed by a string generated by β, so we can use Firstk β . Ob-
¡ ¢

serve however, that the words in Firstk β might be shorter than k, so we


¡ ¢

might need to complete them with Followk (A).

Input: A CFG G = 〈V , T , P , S〉
Output: The sets Followk (X ) for all X ∈ V .

foreach A ∈ V \ {S} do
Followk (A) ← ; ;
Followk (S) ← {ε} ;
repeat
foreach A → αB β ∈ P (with B ∈ V and α, β ∈ (V ∪ T )∗ ) do
Followk (B ) ← Followk (B ) ∪ Firstk β ⊙k Followk (A) ;
¡ ¡ ¢ ¢

until no Followk (A) has been updated, for any A ∈ V ;


Algorithm 7: Computation of Followk (X ) for all X ∈ V .

Example 5.9. Let us consider again the grammar from Figure 5.4, and let
us apply Algorithm 7 to it, for k = 1.

0. Initially, we have Follow1 (X ) = ; for all variables X , except Follow1 (S), Observe that initialising
which is equal to {ε}. Follow1 (S) to {ε} is crucial
here. Otherwise, if we initialise
1. During the first iteration, the algorithm adds $ to Follow1 (Exp), thanks Follow(X ) to ; for all variables X ,
then the algorithm would terminate af-
to the rule S → Exp$. Indeed, this corresponds, in the algorithm, to hav- ter one iteration with Followk (X ) = ;
ing A = S, B = Exp, α = ε and β = $. So the string: for all X , since the expression
Firstk β ⊙k Followk (A) = Firstk β ⊙k ;
¡ ¢ ¡ ¢

would always evaluate to ;.


142 I N T R O D U C T I O N T O L A N G UA G E T H E O RY A N D C O M P I L I N G

First1 $ ⊙1 Follow1 (S) = {$} ⊙1 {ε}


¡ ¢

= {$}

is indeed added to Follow1 (Exp).


Then, the rule Exp → ProdExp′ allows us to grow the sets Follow1 (Prod)
and Follow1 Exp′ . We add to Follow1 (Prod) the elements from:
¡ ¢

First1 Exp′ ⊙1 Follow1 (Exp) = {+, −, ε} ⊙1 {$}


¡ ¢

= {+, −, $};

and to Follow1 Exp′ the elements from:


¡ ¢

First1 (ε) ⊙1 Follow1 (Exp) = {ε} ⊙1 {$}


= {$}.

Etc. . .

2. The algorithm goes on up to stabilisation, and returns:

X Follow1 (X )

S ε
Exp $, )
Exp′ $, )
Prod +, −, $, )
Prod′ +, −, $, )
Atom ∗, /, +, −, $, )

5.4 LL(k) grammars

Using the tools we have just defined (First and Follow sets), we can now
identify classes of CFGs for which the predictive parsers using k characters
of look-ahead (as sketched above) will be deterministic. Those grammars
are called LL(k) grammars, where ‘LL’ stands for ‘Left scanning, Left Pars-
ing’, because the input string is read (scanned) from the left to the right;
and the parser builds a leftmost derivation when successfully recognising
the input word. This class of grammars has first been introduced by Lewis
and Stearns in 19684 , with further important refinements by Rosenkrantz 4
P. M. Lewis, II and R. E. Stearns. Syntax-
directed transduction. J. ACM, 15(3):465–
and Stearns in 19705 (and many others afterwards. . . )
488, July 1968. ISSN 0004-5411. D O I :
What are the conditions we need to impose on the derivations of a gram- 10.1145/321466.321477
5
mar to make sure that its corresponding parser will be deterministic when D.J. Rosenkrantz and R.E. Stearns. Prop-
erties of deterministic top-down gram-
it has access to k characters of look-ahead? As we have seen already, the mars. Information and Computation (for-
only possible source of non-determinism in the parser stems from the pro- merly known as Information and Control),
17(3):226 – 256, 1970. ISSN 0019-9958.
duces, more specifically, when the grammar contains at least two rules of
D O I : 10.1016/S0019-9958(70)90446-8
the from A → α1 and A → α2 . Given that these two rules exist, let us now
pinpoint a situation that will confuse a parser that has access to k charac-
ters of look-ahead only. Such a situation occur if, at some point in a deriva-
tion of the grammar, A is the leftmost symbol (hence, it is on the top of the
stack), and the parser must decide whether to apply A → α1 or A → α2 , but
T O P - D OW N PA R S E R S 143

the k characters of look-ahead dot not allow it to discriminate between


these two choices. To obtain such a pathological situation, we thus need
to expose two different derivations that the parser cannot distinguish. For
the first derivation, we can hold the following reasoning:

1. First, since we want to have A as the leftmost symbol at some point, the
derivation prefix is:
S ⇒∗ w Aγ

with w ∈ T ∗ and γ ∈ (V ∪ T )∗ .

2. Then, let us assume that in this first derivation, the right choice is to
apply A → α1 , i.e.,
w Aγ ⇒ wα1 γ.

3. Eventually, this derivation will produce a word, which will necessary


be of the form w x 1 , since w is already a string of terminals (in other
words, to finish the derivation, we just need to derive the variables that
potentially remain in α1 γ). So, this first derivation is of the form:

S ⇒∗ w Aγ ⇒ wα1 γ ⇒∗ w x 1 .

Thus, this first derivation generates the word w x 1 . Let us consider again
the moment in the derivation when the sentential form was w Aγ and the
parser had to decide to apply A → α1 . As we have already remarked, A is,
at that point, on the top of the stack, and w has already been read from the
input. Hence, at that point, the string that remains on the input is x 1 , and
all the parser ‘sees’ is Firstk (x 1 ).
Then, it is easy to build a second derivation that will confuse the parser.
Assume that, in the grammar we have a derivation of the form:

S ⇒∗ w Aγ ⇒ wα2 γ ⇒∗ w x 2 .

Observe that, now, the right choice to derive A is A → α2 , and, when the
parser must take this choice, he ‘sees’ a look-ahead of Firstk (x 2 ). So, the
parser will be able to make the right decision regarding the derivation of A
iff the look-ahead it has at its disposal is sufficient to tell those two situa-
tions apart, i.e.: Remember that x 1 and x 2 are
k k words, so their Firstk is a sin-
First (x 1 ) ̸= First (x 2 ) .
gleton containing one string of
The definition6 of LL(k) grammar is based on these intuitions: it says length at most k. This is why we can
write Firstk (x 1 ) ̸= Firstk (x 2 ) instead of
that, whenever a pathological situation such as the one described above Firstk (x 1 ) ∩ Firstk (x 2 ) = ;, for instance.
occurs (the two derivations and Firstk (x 1 ) = Firstk (x 2 )), then, we must 6
P. M. Lewis, II and R. E. Stearns. Syntax-
directed transduction. J. ACM, 15(3):465–
have α1 = α2 ; which means that there is actually no choice to be made in
488, July 1968. ISSN 0004-5411. D O I :
the grammar. Otherwise, the parser would not be able to take a decision 10.1145/321466.321477
and the grammar would not be LL(k): Observe that, if a grammar is LL(k)
for some k, then it is also LL k ′ for
¡ ¢

Definition 5.10 (LL(k) CFGs). A CFG 〈P , T ,V , S〉 is LL(k) iff for all pairs of ′
all k ≥ k. This is coherent with our
intuition that LL(k) means ‘k characters of
derivations:
look-ahead are sufficient’.

S ⇒∗ w Aγ ⇒ wα1 γ ⇒∗ w x 1
S ⇒∗ w Aγ ⇒ wα2 γ ⇒∗ w x 2

with w, x 1 , x 2 ∈ T ∗ , A ∈ V and γ ∈ (V ∪ T )∗ , and Firstk (x 1 ) = Firstk (x 2 ), we


have: α1 = α2 . M
144 I N T R O D U C T I O N T O L A N G UA G E T H E O RY A N D C O M P I L I N G

Example 5.11. Let us consider the following grammar:


(1) S → a Aa
(2) S → b AB a
(3) A → b
(4) A → ε
(5) B → b
(6) B → c
This grammar is actually quite simple, since it can generate only 6 words
through 10 different derivations:

S ⇒ a A a ⇒ aba
S ⇒ a A a ⇒ aa
S ⇒ b AB a ⇒ bbB a ⇒ bbba
S ⇒ b AB a ⇒ bbB a ⇒ bbca
S ⇒ b AB a ⇒ bB a ⇒ bba
S ⇒ b AB a ⇒ bB a ⇒ bca
S ⇒ b AB a ⇒ b A ba ⇒ bbba
S ⇒ b AB a ⇒ b A ca ⇒ bbca
S ⇒ b AB a ⇒ b A ba ⇒ bba
S ⇒ b AB a ⇒ b A ca ⇒ bca.

One can then check that:

1. this grammar is not LL(1). Indeed, consider the pair of derivations:

S ⇒ b AB a ⇒ bbB a ⇒ bbba

and
S ⇒ b AB a ⇒ bB a ⇒ bba
which both read the sentential form b AB a after one step, correspond-
ing to
w = b and γ = B a
in the definition. In the former derivation, one applies A → b, while in
the latter derivation, A → ε is used, corresponding to:

α1 = b and α2 = ε

in the definition. The resulting words are respectively bbba and bba,
corresponding to:
x 1 = bba and x 2 = ba
in the definition (since w = b), so we have:

First1 (x 1 ) = First1 (x 2 ) = {b}.

We are thus in the conditions of the definition, yet α1 ̸= α2 , hence the


definition is not satisfied, and the grammar is not LL(1).

2. this grammar, however, is LL(2). Proving this is a painstaking procedure


as it requires to check the conditions given by the definition for many
pairs of rules, but it can be done on this simple grammar. For instance,
the case that we have identified in the previous item is not problematic
anymore with a look-ahead of k = 2, since now: Bear in mind that, to prove that
the grammar is LL(2) one needs to
{bb} = First2 (x 1 ) ̸= First1 (x 2 ) = {ba} check all the pairs of derivations.
This is just an example!
T O P - D OW N PA R S E R S 145

This example shows well that, while the definition of LL(k) grammar
makes perfect sense and captures our intuition of what an LL(k) grammar
should be, it is of limited use in practice when one wants to check whether
a grammar is LL(k) or not. Indeed, Definition 5.10 requires to check all
possible pairs of derivations in the grammar, and there can be infinitely
many such pairs. Instead, we will now identify a stronger condition that
we will be able to test, and that will still be relevant in practice.

5.4.1 Strong LL(k) grammars

Instead of relying on a semantic condition, as in Definition 5.10, the con-


dition we will present now is a syntactic one as it concerned only with the
rules of grammars. Since there are always finitely many rules (contrary
to the number of derivations which can be infinite), this will allow us to
derive a practical test to check whether a grammar is LL(k) or not. The
definition7 is as follows: 7
D.J. Rosenkrantz and R.E. Stearns. Prop-
erties of deterministic top-down gram-
mars. Information and Computation (for-
Definition 5.12 (Strong LL(k) CFG). A CFG G = 〈V , T , P , S〉 is strong LL(k) merly known as Information and Control),
iff, for all pairs of rules A → α1 and A → α2 in P (with α1 ̸= α2 ): 17(3):226 – 256, 1970. ISSN 0019-9958.
D O I : 10.1016/S0019-9958(70)90446-8
Observe that this definition does
³ ´ ³ ´
Firstk α1 Followk (A) ∩ Firstk α2 Followk (A) = ; not mention derivations, only the
(finitely many) rules of the gram-
mar.
M

Example 5.13. Let us consider the grammar for arithmetic expression from
Figure 5.4, and let us check that it is a strong LL(1) grammar. To this end,
we can rely on the computation of the First and Follow sets from Exam-
ple 5.8 and Example 5.9. To apply Definition 5.12, we need to consider all
the group of rules that have the same left-hand side. There are three such
groups:

1. The three rules that have Exp′ as left-hand side are: Exp′ → +ProdExp′ ,
Exp′ → −ProdExp′ , and Exp′ → ε. Hence, we check that there is no com-
mon element between the three following sets:

First +ProdExp′ Follow Exp′ = {+}


¡ ¡ ¢¢

First −ProdExp′ Follow Exp′ = {−}


¡ ¡ ¢¢

First εFollow Exp′ = Follow Exp′


¡ ¡ ¢¢ ¡ ¢

= {$, )}.

This is indeed the case. Intuitively, this means that, when Exp′ is on the
top of the parser’s stack, it can determine which rule to apply basing
its decision on a single character look-ahead: when the look-ahed is +,
apply the first rule; when the look-ahead is −, apply the second; and
apply the last only when the look-ahead is $ or ).

2. For the three rules that have Prod′ as the left-hand side, we check that
146 I N T R O D U C T I O N T O L A N G UA G E T H E O RY A N D C O M P I L I N G

there is no common element between the three following sets:

First ∗AtomProd′ Follow Prod′ = {∗}


¡ ¡ ¢¢

First /AtomProd′ Follow Prod′ = {/}


¡ ¡ ¢¢

First εFollow Prod′ = Follow Prod′


¡ ¡ ¢¢ ¡ ¢

= {$, +, −, )}.

3. Finally, for the four rules that have Atom as the left-hand side, we con-
sider the four sets:

First(−AtomFollow(Atom)) = {−}
First(CstFollow(Atom)) = {Cst}
First(IdFollow(Atom)) = {Id}
First((Exp)Follow(Atom)) = {(},

which have no element in common. We conclude that the grammar is


indeed strong LL(1). M

The name strong LL(k) suggests that the conditions of Definition 5.12
are stronger than those of Definition 5.10. This is indeed the case: all
strong LL(k) grammars are LL(k) grammars; however, the converse is, in
general, not true, as shown by the next example:

Example 5.14. Let us consider again the grammar given in Example 5.11,
which is LL(2), and let us show that it is not strong LL(2). Indeed, if we
consider the two rules A → b and A → ε, we have:

First2 bFollow2 (A) = First2 ba, bba, bca


¡ ¢ ¡© ª¢

= {ba, bb, bc}

and:

First2 εFollow2 (A) = First2 a, ba, ca


¡ ¢ ¡© ª¢

= {a, ba, ca}.

Since these two sets both contain ba, the grammar is not strong LL(2). M

Nevertheless, in the case where the look-ahead is only one character, it


turns out that strong LL(1) grammars are not more restrictive than LL(1)
grammars. All these results are summarised in the following theorem:

Theorem 5.4.

1. For all k ≥ 1, for all CFG G: if G is strong LL(k), then it is also LL(k).

2. For all k ≥ 2, there is a CFG G which is LL(k) but not strong LL(k).

3. However, all LL(1) grammars are also strong LL(1), i.e. the classes of
LL(1) and strong LL(1) grammars coincide.

Proof. (Sketch) Points 1. and 3. can be derived from Definition 5.10 and
Definition 5.12. Point 2 stems from Example 5.14 that can be generalised
to any k ≥ 2.
T O P - D OW N PA R S E R S 147

5.4.2 The top-down hierarchy of grammars

Now that we have identified an infinite sequence LL(0), LL(1),. . . ,LL(k),. . .


of families of grammars, one can wonder what are the relationships be-
tween them. Obviously LL(k) ⊆ LL(k + 1) for all k. Indeed, if a parser is
deterministic with k characters of look-ahead, it will still be determinis-
tic with an extra character of look-ahead. We also know that the grammar
from Example 5.11, is LL(2) but not LL(1), so LL(1) ⊊ LL(2). Is it true in
general ? The answer is ‘yes’, as shown by the following example.

Example 5.15. This example has been proposed by Kurki-Suonio in 19698 . 8


R. Kurki-Suonio. Notes on top-down lan-
guages. BIT Numerical Mathematics, 9(3):
We only present here the grammars that are of interest to us, but do not
225–238, 1969. ISSN 1572-9125. D O I :
present the formal proof, which is quite involved. For all k ≥ 1, let us 10.1007/BF01946814
consider the grammar G k as in Figure 5.6 (where the third rule is param-
(1) S → aS A
eterised by k). Then, we claim that G k is an LL(k + 1) grammar but not (2) → ε
LL(k). (3) A → ak bS
(4) → c
Although we refer the reader to the cited article for the full proof, we
Figure 5.6: The family of grammars G k .
can observe that, in G k : Each G k is in LL(k + 1) \ LL(k).

1. Firstk+1 (A) = {ak b, c};

2. hence Followk+1 (S) = {ak b, c} as well, by the first rule of the grammar.

3. However, Firstk+1 (S) = {ε, ak+1 }. Indeed, the recursion in the first rule
of the grammar will produce an arbitrarily long prefix of a’s, containing
at least one a, which will be followed by k more a′ s produced by A.

So, we can already conclude that G k is strong LL(k + 1), hence, it is also
LL(k + 1). Using similar arguments, we can show that G k is not strong
LL(k). Unfortunately, this does not imply that G k is not LL(k), and this
needs to be proved with other arguments (is the most involved part of the
proof from the original article). M

So we conclude that:

Theorem 5.5. The family of LL(k) grammars (for all k ≥ 0) forms a strict
hierarchy:

LL(0) ⊊ LL(1) ⊊ LL(2) ⊊ · · · ⊊ LL(k) ⊊ LL(k + 1) ⊊ · · ·

We call this infinite hierarchy of classes of grammars the top-down hi-


erarchy (of grammars).

5.4.3 The top-down hierarchy of languages

Observe that the definitions we have given so far (LL(k), strong LL(k)) are
concerned by grammars9 , but do not speak explicitly about the languages 9
Actually, the definition of strong LL(k) is a
purely syntactical condition on grammars.
those grammars define. We have already seen, in Example 5.11 that there
is at least one grammar which is LL(2) but not LL(1) (hence, not strong
LL(1)). However, we also know that there are potentially several different
grammars to define the same language. So, instead of considering classes
of LL(k) grammars, one could naturally define LL(k) languages:

Definition 5.16 (LL(k) language). A language L is LL(k) iff there is an LL(k)


grammar G L that accepts it, i.e.: L(G L ) = L. M
148 I N T R O D U C T I O N T O L A N G UA G E T H E O RY A N D C O M P I L I N G

Now, we can compare classes of LL(k) languages. Obviously, all LL(k) The intuition is always the same: if
a parser can recognise a language
languages are also LL(k + 1) for all k. Is the converse true? Let us consider
with k characters of look-ahead, it
again the grammar from Example 5.11, which is LL(2) but not LL(1), and can also do so with k +1 characters of look-
let us check whether there is another grammar that generates the same ahead.

language. For this grammar, the answer is trivially ‘yes’ since the language
generated by the grammar generates the finite language:

{aba, aa, bbba, bbca, bba, bca}.

So, an equivalent grammar (which is not yet LL(1)) is:


(1) S → aba
(2) → aa
(3) → bbba
(4) → bbca
(5) → bba
(6) → bca
We can now use the factoring techniques from Section 4.4, and obtain:
(1) S → aA
(2) S → bB
(3) A → ba
(4) → a
(5) B → bB ′
(6) → ca

(7) B → ba
(8) → ca
(9) → a
which one can check is indeed (strong) LL(1), using Definition 5.12. Observe that this statement is
stronger than saying that G k is
So, we have been able to turn our ‘non-LL(1)’ grammar into an equiva-
LL(k) and not LL(k − 1). The latter
lent LL(1). Is it always the case? As a matter of fact, it is not. The proof of statement does not guarantee that there
this statement can be found again in the paper of Kurki-Suonio10 , where is no LL(k − 1) grammar that generates
L(G k ), and this requires a special proof
they prove that the grammar G k from Figure 5.6 generates an LL(k) lan- which is in the cited paper. Once again,
guage that is not LL(k − 1). Hence: one should pay attention to the difference
between syntax (grammars) and semantics
Theorem 5.6. The families of LL(k) languages (for all k ≥ 0) form a strict (languages of grammars).
hierarchy:

LL(0) lang. ⊊ LL(1) lang. ⊊ LL(2) lang. ⊊ · · · ⊊ LL(k) lang. ⊊ LL(K + 1) lang. ⊊ · · ·

Relationship with DCFL Finally, since the point of considering LL(k) lan-
guages is to obtain deterministic parsers, one can wonder of LL(k) lan-
guages compare to DCFL? Clearly, we have: 10
R. Kurki-Suonio. Notes on top-down lan-
guages. BIT Numerical Mathematics, 9(3):
For all k ≥ 0 : LL(k) lang. ⊊ DCFL 225–238, 1969. ISSN 1572-9125. D O I :
10.1007/BF01946814
Indeed, each LL(k) language is recognised by an LL(k) parser which is a
deterministic PDA, so all those languages are deterministic CFL. The con-
tainment needs to be strict since LL(k) lang. ⊊ LL(K + 1) lang. for all k.
One can actually prove a further result: even the (infinite) union of all
LL(k) lang. is still not sufficient to cover all DCFL. This can be proved by
considering the language that is obtained by the union of the regular lan-
guage {an | n ≥ 0} and the CFL {an bn | n ≥ 0}. Indeed, we can show that:
T O P - D OW N PA R S E R S 149

Lemma 5.7. L = {an | n ≥ 0} ∪ {an bn | n ≥ 0} is a DCFL that is not LL(k) for


all k ≥ 0.

(Sketch). One can easily build a DPDA that accepts L by accepting state.
This DPDA pushes all the a it reads on the stack. This will be done by a self-
loop on an accepting state, so all the words of the form an are accepted. If
a b is read from this state, the DPDA moves deterministically to another
state where it will read all the b’s and check that there are as many b’s as a’s
by emptying the stack. When the stack becomes empty, the DPDA moves
to an accepting state. So, in this last state, all words of the form an bn will
be accepted.
However, L cannot be LL(k) for any k. Assume it is the case for some
value k. Then, consider the two words ak and ak bk . We can derive a con-
tradiction from the definition of LL(k) grammars. In its initial state, our
hypothetical parser will perform the same action, since the look-ahead ak
is the same. However, it is clear that there must be two different deriva-
tions: in the first case, only a’s must be generated from the sentential form
that is being built; while in the second case, some symbols must occur
in order to show to generate some amount of b’s (as many as there are
a’s).

So, we can conclude that:

LL(0) lang. ⊊ LL(1) lang. ⊊ · · · ⊊ LL(k) lang. ⊊ · · · ⊊ DCFL,

and that:
[
LL(k) lang. ⊊ DCFL.
k≥0

5.5 LL(1) parsers

Equipped with this general theory, we are now ready to discuss the con-
struction of deterministic top-down parsers for a large and practical class
of grammars, namely the LL(1) grammars. Those parser will thus be called
LL(1) parsers.

5.5.1 Obtaining an LL(1) grammar

As we have seen before, not all grammars are LL(1), and some languages
cannot be defined by an LL(1) grammar. However, for practical matters,
when one wants to generate a parser for a typical programming language,
obtaining an LL(1) grammar for that language is feasible. Here are the typ-
ical obstacles to the LL(1) that can easily be alleviated with the techniques
we have seen so far:

Ambiguity First of all, if a grammar is LL(k) for some k, then it is nec-


essarily unambiguous11 . Thus, to obtain an LL(1) grammar, one must 11
D.J. Rosenkrantz and R.E. Stearns. Prop-
erties of deterministic top-down gram-
first make sure that it is unambiguous. Consider for example the gram-
mars. Information and Computation (for-
mar for arithmetic expression from Figure 4.19. As we have already ar- merly known as Information and Control),
gued, this grammar is ambiguous. However, setting the priority and 17(3):226 – 256, 1970. ISSN 0019-9958.
D O I : 10.1016/S0019-9958(70)90446-8
associativity of operators with the techniques from Section 4.4 yields
an unambiguous and equivalent grammar.
150 I N T R O D U C T I O N T O L A N G UA G E T H E O RY A N D C O M P I L I N G

Left recursion It is easy to check that no grammar that contains a left-


recursive rule can be LL(1). Consider a grammar of the form:

(1) S → Sα
(2) → β
where β is a string of terminals. Then, this grammar is obviously not
LL(1), since the parser cannot decide which rule to apply when S on
the top of the stack, and First β is seen on the input, i.e.:
¡ ¢

First β ∈ First(S) ⊆ First(Sα) .


¡ ¢

However, we have seen in Section 4.4 a technique to turn left-recursion


into right-recursion. On the above example, we obtain:

(1) S → βS ′
(2) S′ → αS ′
(3) → ε
which is now LL(1).

Common prefixes Another source of trouble is when two rules share the
same left-hand side, and a common prefix on their right-hand side,
such as in:
(1) [if] → if [Cond] then [Code] fi
(2) [if] → if [Cond] then [Code] else [Code] fi
...
Here, if the parser sees variable [if ] on the top of the stack, and symbol
if on the input, it cannot decide which rule to apply, so the grammar
is not LL(1). However, factoring (see Section 4.4) solves this issue:

(1) [if ] → if [Cond] then [Code] [ifSeq]


(2) [ifSeq] → fi
(3) [ifSeq] → else [Code] fi
...

Now, let us assume that we have a proper LL(1) grammar to describe the
language we are interested to parse, and let us describe the construction
of its associated LL(1) parser.

5.5.2 Action table

The core of the construction will be the building of the so-called ‘action
table’, which describes what actions the parser must perform (either pro-
duce or match), depending on the look-ahead and the top of the stack.
We have already sketched an example of such a table at the beginning of
Section 5.3. This table describes completely the behaviour of the parser,
so, from now on, we will describe a parser with look-ahead by this means
only, hiding the fact that the parser is actually a PDA12 . Here is a more for- 12
Actually, a PDA with a single state, which
is thus irrelevant. Also, we will hide the fact
mal definition of the action table:
that there is always a transition that can
pop the Z0 symbol to reach an accepting
Definition 5.17 (LL(1) action table). Let G = 〈P , T ,V , S〉 be a CFG. Let us
configuration.
assume that:

1. G’s rules (elements of P ) are indexed from 1 to n; and


T O P - D OW N PA R S E R S 151

2. P contains a rule of the form S → α$, where $ ∈ T is a terminal that does


not occur elsewhere in the grammar (it is an end-of-string marker) and
this rule is the only one that has S on the left-hand side.

Then, the LL(1)-action table of G is a two-dimensional table M s.t.:

• the lines of M are indexed by elements from T ∪ V (the potential tops


of stack);

• the rows of M are indexed by elements from T (the potential look-aheads);


and
In this definition, we state that
each cell of the table can poten-
tially contain several actions. Of
• each cell M [α, ℓ] contains a set of actions that the parser must perform course, if the grammar is LL(1), then, each
in configurations where α is the symbol on the top of the stack, and ℓ is cell should contain only one action and the
parser will be deterministic.
the next terminal on the input. These actions can be either:

– an integer i s.t. 1 ≤ i ≤ n, denoting that a produce of rule number i


must be performed (i.e., if rule number i is α → β, then pop α from
the stack and push β); or

– Accept, denoting that the string read so far is accepted. This occurs
only in cell M [$, $], i.e., when $ is on the top of the stack and also the
next symbol on the input. In terms of PDA, this consists in reading
$, and popping it, to reach an accepting configuration (provided that
no characters are left on the input); or

– Match, denoting that a match action must be performed. This action


occurs only in the cases where α = ℓ ∈ T . Then, the action consists
in popping α and reading α from the input.

– Error, denoting the fact that the parser has discovered an error and
cannot go on with the construction of a derivation. The input should
be rejected.

(1) S → Exp$
Before explaining how to build such a table in a systematic way, we (2) Exp → Prod Exp′
(3) Exp′ → +Prod Exp′
present a complete example of such a table, and the execution of the parser
(4) → −Prod Exp′
on example input strings. (5) → ε
(6) Prod → Atom Prod′
(7) Prod′ → ∗Atom Prod′
Example 5.18. Let us consider once again the grammar for artihmetic ex- (8) → /Atom Prod′
(9) → ε
pressions (Figure 5.4), which we reproduce in Figure 5.7 to enhance read-
(10) Atom → −Atom
ability. Its action table is as follows (where M, A and empty cells denote (11) → Cst
‘Match’, ‘Accept’ and ‘Error’, respectively): (12) → Id
(13) → (Exp)
Figure 5.7: The grammar generating ex-
pressions (followed by $ as an end-of-
string marker). This is the same grammar
as in Figure 5.4, reproduced here for read-
ability.
152 I N T R O D U C T I O N T O L A N G UA G E T H E O RY A N D C O M P I L I N G

M $ + − ∗ / Cst Id ( )

S 1 1 1 1
Exp 2 2 2 2
Exp′ 5 3 4 5
Prod 6 6 6 6
Prod′ 9 9 9 7 8 9
Atom 10 11 12 13

$ A
+ M
− M
∗ M
/ M
Cst M
Id M
( M
) M

Note that the bottom half of the table is not very informative: it just tells
us that we should match terminals when they occur at the top of the stack.
This is not surprising: non-determinism can occur only because of the
‘Produce’ actions. So, in the rest of these notes, we will not show that part
of the table anymore.
Now, let us consider the input word Id + Id ∗ Id which is accepted by the
grammar, and let us build the corresponding run.

1. Initially, we are in a configuration where the stack contains only S and


the input contains only Id + Id ∗ Id. Thus, we look at M [S, Id], and see that
rule (1) must be produced. This is not a surprise, since rule (1) is the
only one that has S as the left-hand side; but at least M [S, Id] does not
raise an error. Had the first character of the input been / for instance,
we would have been able to reject the input string immediately. The
Produce replaces S by Exp$ on the the stack but does not modify the
input.

2. Then, we look up M [Exp, Id], and Produce rule (2). This replaces Exp
by ProdExp′ on the stack (with Prod on top, thus) and does not modify
the input. The parsing continues accordingly, for a couple of steps (the
remaining inputs are drawn below the stacks):

Atom Id

1


2


Prod 6

− Prod 12
−→ Prod′
′ ′
Exp Exp Exp Exp′
S $ $ $ $
Id + Id ∗ Id$ Id + Id ∗ Id$ Id + Id ∗ Id$ Id + Id ∗ Id$ Id + Id ∗ Id$

3. At that point, the terminal Id is present both on the top of the stack and
on the input, so a Match occurs, which modifies the input:
T O P - D OW N PA R S E R S 153

Id
Prod′ −

M Prod′
Exp′ Exp′
$ $
Id + Id ∗ Id$ +Id ∗ Id$

4. Then, the parsing goes on. . . Observe that the next produce consists in
popping the Prod′ variable from the top of the stack (i.e., applying rule
Prod′ → ε):

+ Atom Id Atom
Prod′ 9 3 Prod M Prod 6 Prod′ 12 Prod′ M Prod′ 7 Prod′

− →
− −
→ →
− −→ −→ →


Exp′ Exp ′ Exp Exp ′
Exp′ Exp′ Exp′ Exp′
$ $ $ $ $ $ $ $
+Id ∗ Id$ +Id ∗ Id$ +Id ∗ Id$ Id ∗ Id$ Id ∗ Id$ Id ∗ Id$ ∗Id$ ∗Id$

Atom Id
M Prod′ 12 Prod′ M Prod′ 9 5 A
−→ −→ −→ →
− →
− → Accept!

′ ′ ′ ′
Exp Exp Exp Exp
$ $ $ $ $
Id$ Id$ $ $ $

So, the word Id + Id ∗ Id$ is accepted.

Now, let us consider the word Id(Id)$ which is not syntactically correct
(it is not accepted by the grammar).

1. The corresponding run starts as in the case of Id + Id ∗ Id$, until the mo-
ment where the Id symbol on the top of the stack has to be matched:

Atom Id
1


2


Prod 6

− Prod′ 12
−→ Prod′ M

→ Prod′
Exp Exp′ Exp′ Exp′ Exp′
S $ $ $ $ $
Id(Id)$ Id(Id)$ Id(Id) Id(Id)$ Id(Id)$ (Id)$

2. At this point, the action table returns an error (i.e., M [Prod′ , (] = Error,
indicated by a blank cell in the table above), and the parsing stops

So the word Id(Id)$ is not accepted by the parser and by the grammar. One could also imagine that the
compiler skips the error at that
Observe that when the error is detected, the information in row Prod′
point and tries and re-synchronise,
of the action table could be used to give some feedback to the user, by i.e., tries and compile the remainder of the
telling him which symbol could have been correct at that point of the pars- input (in this case (Id)$, which is correct)
in order to inform the user of potential
ing (without guarantee that the parsing could have continued even if that further errors in the input. Error report-
symbol were present). For example, an error message in the case could ing and re-synchronisation are beyond the
scope of these notes. The interested reader
have been:
can refer to the ‘Dragon book’ .
Error, unexpected symbol (. I was expecting $, +, −, ∗, / or ). A. Aho, M. Lam, R. Sethi, and Ullman
J. Compilers: Principles, Techniques, &
M Tools. Addison-Wesley series in computer
science. Pearson/Addison Wesley, 2007
154 I N T R O D U C T I O N T O L A N G UA G E T H E O RY A N D C O M P I L I N G

Algorithm to build the action table Let us now formalise these ideas. Al-
gorithm 8 presents the construction of the LL(1) table. The algorithm starts
by initialising all the cells M [A, a] (where A is a variable and a a terminal)
to the empty set. These are all the cells that can potentially contain one or
several ‘Produce’ actions. Then all cells M [a, b], where a and b are termi-
nals, are populated: they are all initialised to the empty set, except for the
cells M [a, a] (with a ̸= $) that are initialised to a ‘Match’, and M [$, $] that
contains the ‘Accept’ action. It is useful to compare the way
the algorithm fills in the table with
After this initialisation, all rules A → α are taken into account: all sym-
the definition of strong LL(1) gram-
bols a from First(αFollow(A)) are computed, and the number i of the rule mars. One can then see that there will be
A → α is added to the corresponding cell M [A, a]. Since the algorithm adds no conflict in the table built by the algo-
rithm iff the grammar is strong LL(1), i.e.
the rule number to the cell, each cell can contain several rule numbers, LL(1) since these two notions are equiva-
thereby allowing to detect potential conflicts. lent.

Input: A CFG G = 〈P , T ,V , S〉.


Output: The LL(1) action M table of G

/* Initialisation of the table */


foreach a ∈ T do
foreach A ∈ V do
M [A, a] ← ; ;
foreach b ∈ T \ {a} do
M [b, a] ← ; ;
M [a, a] ← {M} ;
M [$, $] ← {A} ;
/* Adding the ‘Produce’ actions */
foreach rule A → α ∈ P with number i do
foreach a ∈ First(α · Follow(A)) do
M [A, a] ← M [A, a] ∪ {i } ;

return M ;
Algorithm 8: Systematic construction of the LL(1) table of a CFG.

5.5.3 Algorithm of the parser

Next, we formalise, by means of an algorithm, the execution of the parser,


using the action table computed before. We use the syntax of pseudo-code
that we have used to describe our algorithms so far instead of the syntax
of PDA, because it is more readable, and closer to implementation. Note,
however, that we propose, in the next section, another way of implemen-
tation such parsers, which is probably more practical.
The algorithm of the parser behaves as illustrated in Example 5.18: it
starts by pushing the start symbol of the grammar on the stack, then, at all
times, it reads the symbol x on the top of the stack (it can be a variable or a
terminal), checks the look-ahead a, and applies the action which is given
by the action table in cell M [x, a]. It proceeds this way until the ‘Accept’
action is executed, or until an error is encountered (which is the case when
the cell is empty).
T O P - D OW N PA R S E R S 155

Input: An LL(1) CFG G = 〈P , T ,V , S〉 with its action table M and an


input word w = w 1 w 2 · · · w n ∈ T ∗ .
Output: True iff w ∈ L(G). In this case, the sequence of rule
numbers in a left-most derivation is printed on the output.

/* Position of the ‘reading head’ in the input word:


w j is the look-ahead. */
j ←1;
/* Pushing the start symbol. */
Push(S ) ;

while the stack is not empty do


x ← Top() ;
if M [x, w j ] = {i } then
Assume rule number i is A → α ; /* Produce i */
Pop() ;
Push(α) ;
Print(i ) ;

else if M [x, w j ] = {M} then


Pop() ; /* Match */
j ← j +1 ;
else if M [x, w j ] = {A} then
return True ; /* Accept */

else
return False ; /* Error */
156 I N T R O D U C T I O N T O L A N G UA G E T H E O RY A N D C O M P I L I N G

5.6 Implementation using recursive descent

To close the section on top-down parsers, let us describe a straightforward


way to implement those parsers using recursive functions. To do this, we
will build on the intuition we have already given in the introduction of
Section 4: consider a rule like S → 0S 0, for instance. Such a rule can be
interpreted by saying that ‘to recognise an S, one should first read a 0 from
the input, then recognise an S, then read another 0’. In terms of functions,
this could mean that: ‘to return true the S function should first read a 0 on
the input, then make a succesful call (one that returns true) to S, then read
a 0 from the input’.
While this intuition seems to hold for a single rule, and seems to provide
an easy way to turn a CFG into a recursive code that implements a parser,
it fails when there are at least two rules with the same left-hand side, for
instance:
S → Ab

and:
S → B c,

because this would yield two (or more) different implementations for the
same function corresponding to S.
However, in order to resolve this non-determinism, we can rely on the
LL(1) techniques we have presented throughout this chapter. In the exam-
ple above, let us assume that:

First1 A bFollow1 (S) = {a1, a2, . . . an},


¡ ¢

First1 B cFollow1 (S) = {b1, b2, . . . bk}.


¡ ¢

Then, we would obtain the code (using the python syntax):

1 def S():
2 n = get_next_character() # This is a look-ahead
3
4 # If the look-ahead is in First(A b Follow(S))
5 if n == ’a1’ or n == ’a2’ or ... or n == ’an’ :
6 read_next_character() # Discard the look-ahead
7 r = A()
8 if (!r): return False
9 n = read_next_character()
10 if (n == ’b’): return True
11 else: return False
12
13 # If the look-ahead is in First(B c Follow(S))
14 if n == ’b1’ or n == ’b2’ or ... or n == ’bn’ :
15 read_next_character() # Discard the look-ahead
16 r = B()
17 if (!r): return False
18 n = read_next_character()
19 if (n == ’c’): return True
T O P - D OW N PA R S E R S 157

20 else: return False


21
22 # Otherwise, the input cannot be valid.
23 return False

where we rely on two auxiliary functions:

• get_next_character() that returns the next character on the input


but does not consume it from the input (i.e., several subsequent calls to
get_next_character() always return the same value); and

• read_next_character() that returns the next character on the input


and also consume it (the read pointer moves in the input stream).
158 I N T R O D U C T I O N T O L A N G UA G E T H E O RY A N D C O M P I L I N G

5.7 Exercises

5.7.1 First and Follow sets

(1) <S> → <program>


(2) <program> → begin <statement list>
(3) <statement list> → <statement> <statement tail>
(4) <statement tail> → <statement> <statement tail>
(5) <statement tail> → ε
(6) <statement> → ID := <expression> ;
(7) <statement> → read ( <id list> ) ;
(8) <statement> → write ( <expr list> ) ;
(9) <id list> → ID <id tail>
(10) <id tail> → , ID <id tail>
(11) <id tail> → ε
(12) <expr list> → <expression> <expr tail>
(13) <expr tail> → , <expression> <expr tail>
(14) <expr tail> → ε
(15) <expression> → <primary> <primary tail>
(16) <primary tail> → <add op> <primary> <primary tail>
(17) <primary tail> → ε
(18) <primary> → ( <expression> )
(19) <primary> → ID
(20) <primary> → INTLIT
(21) <add op> → +
(22) <add op> → − (1) S → AB B A
(2) A → a
(3) → ε
Exercise 5.1. We consider the grammar given above.
(4) B → b
1 1 (5) → ε
1. Give the values of First (A) and the Follow (A) sets for all variables A
of the grammar. (1) S → aS e
(2) → B
(3) B → bB e
2. Give the values of First <expression> and Follow2 <expression> .
2
¡ ¢ ¡ ¢
(4) → C
(5) C → cC e
(6) → d
5.7.2 LL(k) grammars
(1) S → AB c
Exercise 5.2. Consider the four grammars in Figure 5.8. Which of those (2) A → a
grammars are LL(1)? Justify your answers. (3) → ε
(4) B → b
Exercise 5.3. Give the LL(1) action table for the following grammar: (5) → ε

(1) <S> → <expr>$ (1) S → Ab


(2) A → a
(2) <expr> → − <expr> (3) → B
(3) <expr> → ( <expr> ) (4) → ε
(4) <expr> → <var> <expr-tail> (5) B → b
(6) → ε
(5) <expr-tail> → − <expr>
Figure 5.8: Which grammars are LL(1)?
(6) <expr-tail> → ε
(7) <var> → ID <var-tail>
(8) <var-tail> → ( <expr> )
(9) <var-tail> → ε
6 Bottom-up parsers

I N T H I S S E C T I O N , W E W I L L C O N S I D E R A C O M P L E T E LY D I F F E R E N T F A M -
I LY O F PA R S E R S , W H I C H A R E C A L L E D B OT T O M - U P PA R S E R S . As sketched
already in the introduction of the previous chapter, bottom-up parsers
build a derivation tree starting from the leaves (i.e., the input string), and
follow backwards the derivation rules of the grammar, until they manage
to reach the root of the tree. Those parsers are generally regarded as more
powerful than their top-down counterparts (we will give formal elements
to support this claim in Section 6.9). As such, automatic parser generators
such as yacc1 , bison2 and cup3 implement bottom-up parsers. 1
Stephen C. Johnson. Yacc: Yet an-
other compiler-compiler. Techni-
cal report, AT&T Bell Laboratories,
6.1 Principle of bottom-up parsing 1975. Readable online at http:
//dinosaur.compilertools.net/yacc/
2
Recall the two main actions that a top-down parser can perform: Gnu bison. https://www.gnu.org/
software/bison/. Online: accessed on
1. the Produce, which consists in replacing, on the top of the stack, the December, 29th, 2015
3
Cup: Construction of useful parsers.
left-hand side A by the right-hand side α of some production rule A→α http://www2.cs.tum.edu/projects/
of the grammar; and cup/. Online: accessed on December,
29th, 2015
2. the Match, that consists in reading from the input some character a,
which is at the same time popped from the top of the stack.

Such top-down parsers start their execution with the start symbol S on the
stack and accept with the empty stack. Doing so, they unravel a parse tree
for the input string from the top to the bottom, and produce a leftmost
derivation.
Bottom-up parsers, on the other hand, work in a complete reverse way.
As their name suggest, they build the parse tree from the bottom to the
top. As such, they are often regarded as more efficient, since they deduce
the nodes of the parse tree based on the actual input, whose elements
they recognise as being generated by the grammar; contrary to top-down 4
Donald E. Knuth. On the translation of
parsers that have to start from the start symbol and find a way to obtain languages from left to right. Information
the input by applying the proper grammar rules. and Computation (formerly known as In-
formation and Control), 8:607–639, 1965.
For these notes, we will consider the most prominent class of bottom- D O I : 10.1016/S0019-9958(65)90426-2
5
up parsers, which are those that are based on two main actions: the Shift Prominent american computer scientist,
born in 1938, former Stanford and CalTech
and the Reduce, as we are about to explain. Those parsers (in particular the professor, recipient of the T U R I N G award
LR(k) parsers that we are about to study) have been introduced4 by Don- in 1974. He is the author of the famous
series of books The Art of Computer Pro-
ald E. K N U T H 5 in 1965, in a paper where he generalises previous works
gramming, and of the TEX typesetting lan-
from other prominent computer scientists such as Robert F L OY D 6 . guage, which has later been extend to LATEX
by Leslie L A M P O RT .
In order to build the parse tree from the leaves to the root, Shift-Reduce 6
Robert E. F L OY D , american computer
bottom-up parsers procede as follows: scientist (1936–2001). Recipient of the
T U R I N G award in 1978, and well-known
for several contributions to algorithmic,
including the F L OY D - W A R S H A L L algo-
rithm.
160 I N T R O D U C T I O N T O L A N G UA G E T H E O RY A N D C O M P I L I N G

• First, they rely on the Shift action to move terminals from the input to
the top of the stack. The picture hereunder illustrates a shift, where the
terminal a is read from the input and pushed on the top of the stack:

a
A Shift A
−−−→
b b Observe that the handle A a ap-
abc bc pears with the rightmost character
on the top of the stack. That is,
the handle is reversed wrt to what would
have been pushed to the stack by a Pro-
• Second, they apply the grammar rules in reverse. The parser looks for
duce of the same rule in a top-down parser.
a so-called handle on the top of the stack, i.e. the right-hand side α of This is because the characters that have
some grammar rule A → α. When such a handle is present (in mirror produced the variable A (through other
Reduces, presumably) have been read on
image) on the top of the stack, the parser can perform a Reduce. A Re- the input before the a, so the A has been
duce amounts to popping the handle α, and pushing A instead. This pushed to the stack before the a and is thus
under the a in the stack.
way, the parser unravels a parse tree from the bottom to the top (hence
the name bottom-up parser). The picture hereunder illustrates a Re-
duce of the rule B →A a :

a
A Reduce B
−−−−→
b b
bc bc

• Finally, the aim of the parser is not to empty the stack (by matching all
the terminals), but rather to end up with only the start symbol S on the
stack (and, of course, an empty input). This means that a derivation has
been produced for the whole string, but in the reverse order (since rules
have been applied in the reverse order too when doing the Reduces).
Actually, the bottom-up parser builds a right-most derivation in reverse
order.

Example 6.1. Let us consider once again the grammar for arithmetic ex-
pressions of Figure 4.4, and let us consider the string Id + Id ∗ Id, which
is accepted by the grammar. One possible7 rightmost derivation for this 7
Recall that this grammar is ambiguous, so
several rightmost derivations are possible.
string is as follows:

2 4 1 4 4
Exp =
⇒ Exp ∗ Exp =
⇒ Exp ∗ Id =
⇒ Exp + Exp ∗ Id = ⇒ Id + Id ∗ Id.
⇒ Exp + Id ∗ Id = (6.1)

The parser starts its execution in a configuration where: the stack is


empty (formally, it contains only the empty stack symbol Z0 , but we will
not display it, or the sake of readability); and the input contains Id + Id ∗ Id:

Id + Id ∗ Id

From this configuration, the parser can only shift the first character on the
input:
B OT T O M - U P PA R S E R S 161

Id

+Id ∗ Id

Now, the top of the stack constitutes a handle for the rule Exp→Id . Remark
that, at this point, the parser can decide either to Reduce Exp→Id or to
Shift another character. The former choice is the good one. Indeed, if we
shift a + on top of the Id, it means we will need to find a rule whose right-
hand part contains Id and some other symbols, but there is no such rule
in our grammar. The Reduce of Exp→Id yields:

Exp

+Id ∗ Id

Observe that, in the rightmost derivation (6.1), the last rule is indeed Exp→Id
. So, this is coherent with our claim that the parser builds the rightmost
derivation in a reverse manner. Now, the parser can shift two more sym-
bols, and reduce the Id that ends up on the top of the stack:

Id Exp
S + S + 4 +

− →
− →

Exp Exp Exp Exp

+Id ∗ Id Id ∗ Id ∗Id ∗Id

Next, following the rightmost derivation, Exp + Exp froms a handle of the
rule Exp→Exp + Exp , which can be reduced8 . Then, the parser can shift 8
Formally, this must be performed, in the
PDA, by three transitions in a row: one to
twice, and finally reduce Exp→Id , then Exp→Exp ∗ Exp :
pop the first Exp, one to pop the +, and
one to replace the last Exp by another Exp.
This last transition can be skipped how-
Exp Id Exp ever, since we are in the special case where
+ 1 S ∗ S ∗ 4 ∗ 2 the last symbol that must be popped is also

− →
− →
− →
− →

Exp Exp Exp Exp Exp Exp the one than has to be pushed.

∗Id ∗Id Id

The last configuration is an accepting configuration9 for this parser, be- 9


Formally, the PDA that corresponds to
the parser should contain a transition that
cause: (i) the start symbol Exp of the grammar is on the top of the stack;
pops Exp from the stack and moves to an
and (ii) the input is empty. Thus, the string Id + Id ∗ Id is accepted. One accepting state.
can check that the sequence of reductions performed by the parser cor-
responds to the mirror image sequence of rules that have been applied in
the rightmost derivation. M

6.1.1 Systematic construction of a bottom-up parser

Following these intuitions, we can now explain how to systematically build


a bottom-up parser from a given CFG. We are giving this construction for
the sake of completeness. Actually, when we will make those parsers deter-
ministic (as we did for top-down parsers using look-ahead), we will need
162 I N T R O D U C T I O N T O L A N G UA G E T H E O RY A N D C O M P I L I N G

to slightly alter the behaviour of the basic parser that we are presenting
now. Nevertheless, we believe it is a good exercise to show that the actions
we have described above can actually be implemented in a PDA.
The parser we are about to describe is unfortunately not as simple as
the one-state top-down parser we had obtained in Section 5.1.1. This is
due to the fact that a Reduce entails a sequence of pops from the stack,
which cannot be performed by a single transition. Hence, we will need to
introduce intermediary states. More precisely, for each rule of the form
A→α1 · · · αn (where the αi ’s are individual variables or terminals), we will
have n states, that we call (A, α1 · · · α j ) for 1 ≤ j ≤ n − 1 and (A, ε). Intu-
itively, the PDA reaches (A, α1 · · · α j ) iff: (i) it is in the middle of the reduc-
tion of A→α1 · · · αn ; and (ii) it has already popped characters αn , αn−1 ,. . . ,
α j +1 from the stack. Thus, to finish the reduction from this state, the PDA
must still: (i) pop α j , α j −1 ,. . . , α1 (in this order); and (ii) push A. For ex-
ample, if the grammar contains rule A→bc , then, the parser can, from its
initial state:

1. take a transition that pops c and move to (A, b); then,

2. from (A, b), take a transition that pops b and move to (A, ε); and finally,

3. from (A, ε), take a transition that pushes A, and move back to the initial
state of the PDA.

In addition to these transitions, we also need transitions to perform Shifts, Recall that a PDA with accepting
states accepts only when the ac-
and one transition to move to a dedicated state called q a when the start
cepting state is reached and the in-
symbol of the grammar occurs on the top of the stack. Note that this state put is empty. There is thus no problem in
is not accepting (we build again a PDA that accepts on empty stack), but jumping non-deterministically to the ac-
cepting state whenever the start symbol
its aim is to check that, at this point, the only symbol left on the stack is Z0 , occurs on the top of the stack.
which means that all the symbols on the stack have indeed been reduced
to S.
Formally, from a CFG G = 〈V , T , P , S〉, we build a PDA PG′ as follows:

PG′ = Q, T ,V ∪ T , δ, q i , Z0 , ; ,
­ ®

where:

1. The set of states Q is defined as follows:

Q = q i , q a ∪ (A, ε) | A ∈ V
© ª © ª

∪ (A, α1 · · · α j ) | A → α1 · · · αn ∈ P ∧ 1 ≤ j ≤ n − 1 .
© ª

That is, we have one initial state q i , one accepting state q a , and the
intermediary states for the reductions, as announced;

2. For the transitions function, we start by describing it from the initial


state q i , then we consider the intermediary states:

(a) For all a ∈ T , for all s ∈ V ∪ T ∪ {Z0 }:

δ(q i , a, s) = {(q i , as)} (Shift),


δ(q i , ε, S) = {(q a , ε)} (Accept),
n¡ ¢ ¯¯ o
δ(q i , ε, s) = (A, α), ε ¯ A → αs ∈ P (Reduce).
B OT T O M - U P PA R S E R S 163

Intuitively, there are self-loops on the initial state that Shift any sym-
bol on the input to the stack; when the start symbol S occurs on the
top of the stack, the parser can move to the accepting state (Accept
action); and the parser can decide at any moment to start a Reduce,
provided that the top of the stack is the right-most character in the
handle.
(b) For all states of the form (A, α) with α ̸= ε:
¢ n¡ ¢o
δ (A, αs), ε, s = (A, α), ε
¡

δ (A, α), a, s = ;
¡ ¢
in all other cases.

(c) For all states of the form (A, ε), we have, for all a ∈ T , for all s ∈
T ∪ V ∪ {Z0 }:

δ (A, ε), ε, s = (q i , As)


¡ ¢ © ª

δ (A, ε), a, s = ;.
¡ ¢

(d) Finally, the only transition on q a is a self-loop that removes Z0 , so


that the PDA accepts only when all the symbols on the stack have
been reduced to S:

δ(q a , ε, Z0 ) = (q a , ε)
δ(q a , a, s) = ; in all other cases.

The correctness of this construction is given by the following Lemma:

Lemma 6.1. For all CFGs G, the PDA PG′ is s.t. L(G) = N (PG′ ).

Sketch. The proof is done in two steps. First, one shows that all words w
accepted by the grammar G correspond to an accepting run of PG′ , by in-
duction on the length of a (reversed) rightmost derivation producing w
in G. Second, one shows that all accepting runs of PG′ on some word w can
be translated to a rightmost derivation in G (by induction on the length of
the run).

6.1.2 Non-determinism in the parser

As in the case of top-down parsers, the main limitation to the bottom-up


parser we have just defined is that it is non-deterministic. There are two
sources of non-determinism in this parser:

Reduce-Reduce conflict: such conflicts occur when the top of the stack is
the handle to two different rules, and the parser cannot decide which
rule to reduce;

Shift-Reduce conflicts: such conflicts occur when the top of the stack con-
stitutes a handle to some rule, but the parser cannot determine whether
it should continue shifting or whether it should reduce now.

Let us first focus on Shift-Reduce conflicts. One (but not the only one)
of the difficulties we need to overcome to get rid of such conflicts is to
determine when shifting new symbols might still produce a handle. This
164 I N T R O D U C T I O N T O L A N G UA G E T H E O RY A N D C O M P I L I N G

has been illustrated in Example 6.1. After the first Shift, the configuration
reached by the parser is as shown in Figure 6.1. In this configuration, the
non-deterministic parser can either Reduce rule Exp→Id , or Shift the +.
However, as we have already argued, shifting a + in this configuration will
not yield an accepting run, as there is no other handle than the one of Id

Exp→Id that contains an Id.


+Id ∗ Id
Figure 6.1: A configuration of the bottom-
6.1.3 Viable prefixes up parser where a Reduce must be per-
formed.
This example shows that there is a fundamental difference between the
two stack contents Id and Id+. This is captured by the notion of viable
prefix. To define this notion, we must first observe a relationship between
the stack contents during an accepting run of the bottom-up parser on
input w and a rightmost derivation of w. Looking again at Example 6.1, Recall that a sentential form is a
word over T ∪ V (i.e., a string con-
one can check that all stack contents are prefixes of some sentential taining terminal and variables) that
form obtained along the right-most derivation (6.1). As a matter of fact, occurs along some derivation of the gram-
mar. When a sentential form has been ex-
one can prove the more precise result which is given by Proposition 6.2 tracted from a rightmost (leftmost) deriva-
hereinafter. tion, we call it a right (respectively left)
sentential form.
Proposition 6.2. Let G = 〈V , T , P , S〉 be a CFG, and let PG′ be its correspond-
ing bottom-up parser. Let q i , w, γZ0 be a configuration reached along an
­ ®
Recall that w R is the mirror image
accepting run of PG′ (i.e., a configuration where PG′ is neither in one of the of w. In this case, γR is a word rep-
intermediate states introduced for the Reduce, nor in q a ). Then: resenting the content of the stack
with the top on the right-hand side.
S ⇒∗ γR · w

along a rightmost derivation.

Sketch. The proof is by induction on the length of the run. Clearly, the
property holds in the initial configuration, since in this case γ = ε, hence
γR w = w, and w is an accepted word of the grammar (as the run is accept-
ing).
Next, if the property holds on some configuration q i , w 1 , γ1 Z0 , then
­ ®

moving to the next configuration q i , w 2 , γ2 Z0 where the parser is in q i


­ ®

can be done either by a Shift or by a Reduce. In the case of a Shift, we have


γR1 · w 1 = γR2 · w 2 , because the first letter of w 1 has been transfered to the
stack, so the property still holds. In the case of a Reduce, only the top of the
stack is modified by the reduction of some handle α into some variable A
(because the grammar contains the rule A→α ). That is:

w1 = w2
γR1 = βα
γR2 = βA

for some stack content β. Since w 1 = w 2 is a word of terminals (they con-


tain no grammar variable), A is indeed the rightmost variable in the sen-
tential form γR2 w 2 . Since γR1 w 1 can be derived from S (by induction hy-
pothesis) we conclude that γR2 w 2 precedes γR1 w 1 in the rightmost deriva-
tion that yields γR1 w 1 , hence, S ⇒∗ γR2 · w 2 .

Example 6.2. To illustrate Proposition 6.2, we can check that it holds at


all times along the run shown in Example 6.1. For example, when the con-
figuration is q i , γ, w = q i , Id + Exp, ∗Id , we have γR = Exp + Id, which is
­ ® ­ ®
B OT T O M - U P PA R S E R S 165

indeed a prefix of the sentential form Exp + Id ∗ Id obtained after 4 deriva-


tions in (6.1). Observe that this sentential form is exactly γR · w. M

Thus, the content of the stack in a successful run of the bottom up


parser should always be a prefix of a right-sentential form. Unfortunately,
not all such prefixes allow us to build a succesful run. Continuing Exam-
ple 6.1, Id+ is a prefix of a right-sentential form in (6.1), but, as we have
already argued, it does not allow us to build a successful run because the
parser has shifted past the handle Id that should have been reduced to Exp.
This discussion brings us to a central concept of bottom-up parsers,
that of viable prefix. A prefix is viable iff it can lead to a successful run. In some references, viable prefixes
are called feasible prefixes.
Formally:

Definition 6.3 (Viable prefix). A viable prefix of a CFG G = 〈V , T , P , S〉 is


a prefix of a right-sentential form that can occur (in reverse) on the stack
during an accepting run of the associated bottom-up parser PG′ .
That is, p ∈ (V ∪ T )∗ is a viable prefix iff there is a word w ∈ L(G) and an
accepting run

(q i , w, Z0 ) = (q 1 , w 1 , γ1 Z0 ) ⊢P’G · · · ⊢P’G (q j , w j , γ j Z0 ) ⊢P’G · · · ⊢P’G (q n , w n , γn Z0 ) = (q a , ε, ε)

of PG′ s.t. p = γRj for some j s.t. q j = q i (i.e. PG′ is in the initial state at step
j and not in one of the intermediary states used for the Reduce). M

Observe that the set of viable prefixes of a grammar can be infinite, and
actually constitutes a language on V ∪ T . If we consider once again the
grammar of Example 6.1, we can see that all words of the form Exp + Exp +
· · ·+ Exp are viable prefixes of the grammar. Indeed, for every such word of
the form:
Exp +Exp + · · · + Exp,
| {z }
n times

one can build a derivation obtained by applying n times the first grammar
rule on the rightmost Exp of the sentential form:

Exp ⇒ Exp + Exp ⇒ · · · ⇒ Exp +Exp + · · · + Exp,


| {z }
n times

which can then yield the word:

Id + · · · + Id} .
| Id +{z
n times

Then, an accepting run of the bottom-up parser consists in systematically


shifting all the Id and + tokens on the stack and reducing the Id to Exp as
soon as they are shifted. Finally, the parser reduces the rule Exp→Exp +
Exp n times and accepts. Along this run all the prefixes Exp, Exp + Exp,
Exp + Exp + Exp have been present on the stack.

From this discussion, it should be clear that identifying viable prefixes


will be a central condition to build deterministic bottom-up parsers. The
point of the next section is to introduce a tool to do so.
166 I N T R O D U C T I O N T O L A N G UA G E T H E O RY A N D C O M P I L I N G

6.2 The canonical finite state machine

In this section, we explain how to build, from a CFG G, a finite automaton


recognising exactly the viable prefixes of G. As explained, this automaton
will be instrumental in building deterministic bottom-up parsers, and it Not to be confused with the
Church of the Flying Spaghetti
is called the canonical finite state machine, or CFSM.
Monster. . .
In order to introduce the construction of the CFSM, we will consider an
example grammar.

Example 6.4. Consider the grammar in Figure 6.2. The set of viable pre- Observe that this grammar is not
fixes of this grammar is: LL(1), because of the two rules hav-
ing A as the left-hand side. Never-
ε, A, A $, a, aC , ac, aC D, aC d , ab . theless, we will manage to parse it with a
© ª
bottom-up parser without any symbol of
prevision.
In particular, observe that acD is not a viable prefix, because the parser
has missed the handle c of C →c . Hence, we cannot reduce the rule (1) S → A$
(2) A → aC D
A→aC D . Instead, the parser had to reduce the c into C before shifting (3) → ab
and reducing the d. M (4) C → c
(5) D → d
Now, let’s see how to build the CFSM. If we consider the first rule of the Figure 6.2: An example grammar to
demonstrate the notion of viable prefix.
grammar S→A $ , we see immediately that A and A $ are viable prefixes.
So, we could start building our CFSM by having a three-state automaton
of the form:

A $

However, to track the progress of the automataon along the rule S→A $ ,
we will associate each state with so-called items. An item is simply a gram-
mar rule where we have inserted a • at some point in the right-hand part,
in order to makr the current progress of the CFSM (and, as we will see later,
of the bottom-up parser):

Definition 6.5 (CFSM item). Given a grammar G = 〈V , T , P , S〉, an item of


G is a rule of the form A → α1 • α2 , where A → α1 α2 ∈ P is a rule of G. We
denote by Items (G) the set of items of G. M

So, intuitively the item S→A•$ means: (i) that the automaton is trying
to recognise the handle A $ of the rule S→A $ , (ii) that it has recognised an
A so far, and (iii) that it still expects to read a $ to complete the handle (and
hence, complete the viable prefix). Using this notation, our automaton
becomes:
0 1 2
S → •A $ A S → A•$ $ S → A $•

Observe that we have slightly altered our conventions for depicting au-
tomata in this figure (in order to reflect common practice of the literature).
First, we do not mark explicitly the states as ‘accepting’ since they are all
accepting anyway. Indeed, the prefix of a viable prefix is itself a viable pre-
fix. Second, our states are now divided into two parts: we will understand
why in a moment. Finally, we are now numbering the states, to be able to
refer to them easily (see the bold numbers on top left of states).
B OT T O M - U P PA R S E R S 167

Note that, for the moment, our automaton contains only one item per The fact that states of the CFSM are
state. In general states of the CFSM will be sets of prefixes. Observe that, for sets of items should not be surpris-
ing, since we are building a deter-
a given grammar, there are finitely many items, because there are finitely
ministic automaton, but there can be non-
many rules, which have all a finite right-hand side. So the CFSM is guar- determinism in the grammar. This is rem-
anteed to be finite. iniscent of the subset construction tech-
nique to determinise finite automata from
Clearly, this automaton is not sufficient to accept all viable prefixes. In- Section 2.4.2.
deed, S ⇒ A $ ⇒ aC D $, for instance, is a possible prefix of derivation in
our grammar, so we should also accept the viable prefixes a, aC , etc.
How can we obtain this? We need to incorporate in our CFSM the fact
that A can be derived as aC D, but where to add this information? If we
observe the initial state, it contains the item S→•A $ , in which the • sym-
bol is immediately followed by the variable A. This is a sign that we need
to add more information to the initial state: when the automaton tries to
recognise a viable prefix generated using rule S→A $ , it might need to
read an a, because the rule A→aC D might be applied next. Thus, we will
add to the initial states two new items: A→•aC D and A→•ab .

0
1 2
S → •A $
A S → A•$ $ S → A $•
A → •aC D
A → •ab

The operation that consists in looking, in a state, for all the items of the
form A→α1 •B α2 , where B is a variable; and adding to the state all the
items of the form B →•α is called the closure operation. It needs to be
applied to all states10 of the CFSM (possibly several times until no more 10
The closure operation does not add
items to states 1 and 2, because the • is not
items can be added).
followed by a variable in the correspond-
In our depiction, the items that result from the closure operation ap- ing items.
pear in the lower part of the states. The items from the top part of the
states form the kernel of the state. In our depiction we keep them sep-
arated because it makes it easier to identify states, since the closure of a
given kernel will always be the same.
This new information in the initial state allows us to complete our au-
tomaton. Since the state now contains items A→•aC D and A→•ab , the
automaton can progress in the recognition of a valid prefix by reading an a
from the initial state. This will yield a new state where the automaton
has progressed in both items, so the kernel of the new state will contain
both A→a•C D and A→a•b . This means that we don’t know, at this point,
whether the handle that will be read will be aC D or ab, but both are still Again, compare with the subset
construction for determinising fi-
possible so far. In addition, the closure operation will add the item C →•c
nite automata.
to the state, because the • is directly followed by a C in A→a•C D :
168 I N T R O D U C T I O N T O L A N G UA G E T H E O RY A N D C O M P I L I N G

0
S → •A $
A
A → •aC D ···
A → •ab

a
1
A → a•C D
A → a•b

C → •c

Continuing this systematic construction, we obtain the automaton which


is shown in Figure 6.3.

0
1 2
S → •A $
A S → A•$ $ S → A $•
A → •aC D
A → •ab

a
3
4
5
A → a•C D
C A → aC •D D
A → a•b A → aC D•
D → •d
C → •c

c d
b
6 7 8
A → ab• C → c• D → d•

Figure 6.3: The CFSM for our example


grammar. All states shown on the figure
With these intuitions in mind, we can now describe formally the con- are accepting.
struction of the CFSM for a given CFL G.

The Closure operation We start by giving an algorithm to compute the


closure that we have informally discussed above. It is given in Algorithm 9.
As can be seen, it is a fixed point algorithm, to adds items of the form
B →•β for every item of the form A→α1 •B α2 where the • is followed by
some variable B . This operation is carried on up to stabilisation, i.e. until
no more items can be added to the set, which is then returned.

Computing the successors of items The next step in the construction is


to define formally the notion of successor in a CFSM. For that, we define
a function that receives a set of items I ⊆ Items (G) and a terminal a ∈ T , Thus, I is a state of the CFSM.
and returns the set CFSMSucc (I , a) of all the items obtained from I af-
ter reading a. This is obtained by selecting, from I , all the items of the
form A→α1 •aα2 , where the • is immediately followed by a; and by mov-
ing the • one position ot the right, in order to reflect the progress of the
B OT T O M - U P PA R S E R S 169

Input: A set I ⊆ Items (G) of items of some CFG G = 〈V , T , P , S〉.


Output: The closure of I

Closure ← I ;
repeat
PrecClosure ← Closure ;
foreach A→α1 •B α2 ∈ Closure s.t. B ∈ V do
Closure ← Closure ∪ B → •β | B → β ∈ P ;
© ª

until PrecClosure = Closure;


return Closure ;
Algorithm 9: Computing the Closure of a set I of items for the LR(0)
CFSM.

automaton. Formally:

CFSMSucc (I , a) = A → α1 a • α2 ¯ A → α1 • aα2 ∈ I .
© ¯ ª

Construction of the CFSM Then, the construction of the CFSM associ-


ated to a CFL G is rather straightforward.
Definition 6.6 (Canonical Finite State Machine). Let G = 〈V , T , P , S〉 be a
CFG. Its Canonical Finite State Machine (or CFSM for short) is the DFA

CFSM (G) = Q,V ∪ T , δ, q 0 , F ,


­ ®

where:
1. the set of states Q is the set of all sets of G’s items: Q = 2Items(G) ;

2. the initial state q 0 = Closure ({S → •α | S → α ∈ P }) is obtained by taking


the closure of all the items obtained from the rules where S is the left-
hand side, and where the • appears at the initial position (on the left of
the right-hand side);

3. for all q ∈ Q, for all a ∈ V ∪ T : Observe that the automaton is in-


deed deterministic: the Closure
δ(q, a) = Closure CFSMSucc q, a .
¡ ¡ ¢¢
function returns a set of items,
which is indeed a state of the CFSM.
That is, the transition function is obtained by first computing the suc-
cessors of q by a (hence by advancing the • one position to the right in
the relevant items), and then taking the closure of the resulting set.
Observe that the automaton is ac-
tually complete: there are transi-
4. all the states are accepting but the empty set, which is considered as an
tions labelled by all alphabet sym-
error state: F = Q \ {;}. bols from all states. However, these transi-
tions can lead to ;, which is the error state,
M and which we usually do not depict (see for
instance Figure 6.3) to keep figures com-
The main property of the CFSM is given by the following theorem (which
pact and readable.
we will not prove here):
Theorem 6.3. For all CFG G: CFSM (G) accepts the set of all viables prefixes
of G.

Now that we have the CFSM in our toolbox, how can we exploit it to
make our bottom-up parsers deterministic? In the next sections, we will
introduce the LR(k) parsers, which are a family of deterministic bottom-
up parsers that use k characters of look-ahead. Similarly to the case of
LL(k), the ‘LR’ in LR(k) stands for Left scanning, Right parsing.
170 I N T R O D U C T I O N T O L A N G UA G E T H E O RY A N D C O M P I L I N G

6.3 LR(0) parsers

We start with the LR(0) parsers, which use no look-ahead. This might
sound surprising, since it is hard to imagine that a top-down parser would
be able to parse anything meaningful without look-ahead. As a matter of
fact, a grammar is LL(0) iff it does not contain two rules with the same
left-hand side, which is a very strong restriction, since such grammars
would accept only one word at most! Nevertheless, LR parsers are usu-
ally regarded as more powerful than LL parsers, and there are non-trivial
grammars (such as the one in Example 6.4) that can be parsed by an LR(0)
parser, as shown by the next example, which will help us build our intu-
ition.

Example 6.7. Let us consider again the grammar G of Example 6.4, and
the word acd$ which is accepted by this grammar. How does CFSM (G), as
given in Figure 6.3, helps us parse acd$ in a deterministic fashion.
Let us observe the run of the bottom-up parser on acd$. Initially, the
stack is empty11 , and the only thing the parser can do is to shift a (since 11
It contains only the Z0 character, that we
will not depict in this example in order to
there is no rule which has ε as its right-hand side, no Reduce is possible):
keep it readable.

Shift As usual, we represent the se-


−−−→ quence of stack contents, with the
a
current input under the stack.
acd$ cd$
Don’t forget that the viable prefixes
are mirror images of the stack con-
Now, let us consider the associated CFSM. Remember that its goal is to
tents, see Definition 6.3. So, the
accept viable prefixes, which are stack contents, so let us run it on the two stack content must be read from the bot-
stack contents obtained so far. In the initial configuration of the parser, the tom to the top by the CFSM!

stack was empty, so the CFSM reaches state 0. Observe that in this state,
there are no items of the form A→α• , which indicates that no handle has
been read completely. This is coherent with our decision of Shifting and
not Reducing at this point. In the second configuration, the CFSM reaches
state 1. Again, the items contained in this state indicate that a Shift must
occur, and the parser reaches a configuration where the stack contains
(from the bottom to the top) ac, which moves the CFSM to state 7. We
depict this as follows, indicating the current CFSM state under the input:

Shift Shift c
−−−→ −−−→
a a
acd$ cd$ d$
0 3 7

Now, the parser is in a situation of Shift/Reduce conflict: indeed, it


could reduce the c on the top of the stack to C , or it could shift the d. The
CFSM helps us lift this conflict: in state 7, the only item is C →c• , which
indicates that the handle for C →c has been read completely and must be
reduced. This yields:

c −Reduce 4
−−−−−→ C
a a
d$ d$
7 4
B OT T O M - U P PA R S E R S 171

Following the same reasoning, we now Shift the d, and the CFSM reaches
state 8 from which rule 5 must be reduced (item D→d• ). After this reduce,
the CFSM is in state 5, where a Reduce of rule 2 occurs, which moves the
CFSM to state 1, as the stack now contains A only:

d D
Shift
C −−−→ C −Reduce 5 Reduce 2
−−−−−→ C −−−−−−→
a a a A
d$ $ $ $
4 8 5 1

Finally, the parser shifts the remaining character $, which moves the
CFSM to state 2. In this state, the parser reduces rule 1 which leaves only
the start symbol S on the top of the stack. This amounts to accepting the
input string, and is denoted by the Accept action. The full run is displayed
in Figure 6.4 (top of the figure). M

d D
Shift
−−−→
Shift
−−−→ c −Red. 4 Shift Red. 5
−−−→ C −−−→ C −−−−→ C −Red.
−−−→
2 Shift
−−−→ $ −Accept
−−−−→
!
a a a a a A A S
acd$ cd$ d$ d$ $ $ $ ε ε
0 3 7 4 8 5 1 2

8 5
Shift Shift 7 Red. 4 4 Shift 4 Red. 5 4 Red. 2 Shift 2 Accept !
−−−→ 3 −−−→ 3 −−−−→ 3 −−−→ 3 −−−−→ 3 −−−−→ 1 −−−→ 1 −−−−−→
0 0 0 0 0 0 0 0
acd$ cd$ d$ d$ $ $ $ ε ε

Figure 6.4: Two versions of the run of the


LR(0) parser on acd$. On the top: the run
Let us summarise what we have learned from this example. First, we where we push grammar symbols on the
can always associate a state of the CFSM to the current stack content, and stack. Bottom: the same run where we
push CFSM states instead (which is the ac-
this state is sufficient to determine the action that the parser must per-
tual run of the LR(0) parser).
form. These actions are to:

• Accept in all states containing an item of the form S→α• ;

• Shift whenever the current state does not contain an item of the form
A→α• because we have not yet shifted a complete handle on the stack;
and

• Reduce rule A→α whenever the current state contains the item A→α•
(except if this is a state where an Accept is performed).
172 I N T R O D U C T I O N T O L A N G UA G E T H E O RY A N D C O M P I L I N G

As in the case of top-down parsers, this can be summed up in an action


table, as follows, where an ‘A’ stands for Accept, an ‘S’ stands for Shift, a
number i stands for ‘Reduce rule number i ’, and states are identified by
their numbers: Observe that in our example, each
line of the table contains one and
State Action
only one action. This need not be
the case in general: one could imagine a
0 S
single state of the CFSM containing both
1 S A→α• and B →β• (a Reduce/Reduce con-
2 A flict) or both A→α1 •α2 and B →β• (a
Shift/Reduce conflict). We will see such ex-
3 S amples later and how to solve them using
4 S look-ahead.
5 2
6 3
7 4
8 5

Now, since the CFSM state is the only meaningful information for de-
termining the behaviour of the parser, one could wonder what is the point
of actually pushing symbols on the stack? It turns out that we can, instead,
push the sequence of CFSM states. To better understand this, consider the
excerpt of the run above:
d D
C Shift Red. 5
−−−→ C −−−−→ C
a a a
d$ $ $
4 8 5

In this example, the CFSM state is 4 when the stack content is C a, then
a Shift occurs, which moves the CFSM to state 8. To obtain this state 8,
we only need to know that we were previously in state 4 and that we have
read symbol d. Then, a Reduce occurs that pops one character (because
the right-hand side of rule 5 has length 1), pushes a D instead, and moves
the CFSM to state 5. It is here crucial to observe that, in order to determine
that the CFSM ends up in state 5, all we need to know is:

1. the current state of the CFSM before the reduced handle (here: d) was
on the stack (here this state was 4); and

2. the fact that we are pushing a D into the stack.


We abuse notations and denote
Indeed, in the CFSM: δ(4, D) = 5. In other words, to determine that the CFSM states by their numbers.

CFSM reaches state 5 when the stack contains DC a, one does not need
to re-run the CFSM on the whole stack content from scratch: instead, we
can remember on the stack the state reached with stack content C a (here,
state 4), then look for the (unique) successor of 4 by letter D. This holds Observe that, since the stack now
contains the initial state of the
because the CFSM is deterministic. CFSM from the beginning, we do
So, in practice, the run of the LR(0) parser on acd is the one displayed not even need the Z0 symbol anymore. Or,
in other words, we let Z0 = 0, and the Ac-
in Figure 6.4 (bottom). Let us now formalise these ideas. cept will pop 0 to empty the stack.

6.3.1 Action table of the LR(0) parser

The action table of the LR(0) parser associates one or several actions to
each state:
B OT T O M - U P PA R S E R S 173

Definition 6.8 (LR(0) action table). Let G = 〈V , T , P , S〉 be a CFG with its


associated canonical finite state machine CFSM (G) = Q, q 0 ,V ∪ T , δ, F ,
­ ®

and let us assume that:

• G’s rules (i.e. the elements of P ) are indexed from 1 to n; and

• The non-empty states of CFSM (G) (i.e. the elements of Q \ ;) are in-
dexed from 0 to m − 1, so that Q = {0, 1 . . . , m − 1, ;}.
As in the case of LL(1) our defini-
Then, the LR(0) action table is a table M with |Q| = m + 1 lines s.t., for all tion allows for several actions in a
given cell, but the parser will be de-
0 ≤ i ≤ m, M [i ] contains a set of actions. The actions can be: terministic only when there is exactly one
action per cell.
• either an integer 1 ≤ j ≤ n denoting a Reduce of rule number j ; When we depict the action tables,
we often denote the Shift, the Ac-
• or Shift denoting that the parser must Shift the next character of the cept and the Error by S, A and an
empty cell, respectively.
input to the stack ;

• or Accept denoting that the string read so far is accepted and belongs to
L(G);

• or Error denoting that the string must be rejected. The only state which
is associated with this action (in the case of LR(0)) is ;.

To build this action table, we look for items of the form A→α• in the
states of the CFSM and we associate a Reduce of the corresponding rule to
those states. When the state contains A→α1 •α2 , we associate a Shift to
the state. This is detailed in Algorithm 10. Observe that after the execution
of this algorithm, no cell will be empty, because each state of the CFSM
always contains at least one item that either will satisfy one of the two If’s
of the main loop, or will allow for at least one execution of the innermost
foreach loop.

6.3.2 Running the LR(0) parser

Now that we have described the construction of the action table of the
LR(0) parser, let us formalise how it runs on a given input string. For the
sake of clarity, we will not describe this parser as a PDA, but rather as an
algorithm that can access a stack S from which it can push and pop. The
parser is given in Algorithm 11 and assumes that the action table contains
exactly one action per cell. The NextSymbol() function reads and returns
the next symbol on the input (i.e. in the word w).
It is easy to check that the algorithm follows exactly the intuitions that
we have described so far. Namely, in the main while loop, the parser checks
the content of the LR(0) table in the cell given by the top of the stack, to ob-
tain the action that must be executed. Then:

• if this action is an Error or an Accept, the execution finishes immedi-


ately;

• if the action is a Shift, the next character is read and stored in variable c;
and
174 I N T R O D U C T I O N T O L A N G UA G E T H E O RY A N D C O M P I L I N G

Input: A CFG G = 〈V , T , P , S〉 whose rules are indexed from 1 to n;


and CFSM (G) = 〈Q, 0,V ∪ T , δ, F 〉 where
Q = {0, 1 . . . , m − 1, ;}.
Output: LR(0) action table of G.
/* Initialisation */
M [;] ← Error;
foreach q ∈ {0, 1, . . . , m − 1} do
M [q] ← ; ;
/* Populating the table */
foreach q ∈ {0, 1, . . . , m − 1} do
if q contains an item of the form S→α• then
M [q] ← M [q] ∪ {Accept} ;
foreach item in q that is of the form A→α• with A ̸= S do
M [q] ← M [q] ∪ { j } where rule number j is A→α ;
if q contains an item of the form A→α1 •α2 with α2 ̸= ε then
M [q] ← M [q] ∪ {Shift} ;

return M ;
Algorithm 10: The algorithm to compute the LR(0) action table of a
CFG G using CFSM (G).

Input: The LR(0) action table M of some CFG G and its associated
CFSM (G) = 〈Q, 0,V ∪ T , δ, F 〉; and an input word w.
Output: True iff w ∈ L(G)
Let S be an empty stack ;
Push (S , 0) ; // Push of the initial CFSM state
while M [Top (S )] ̸= Error and M [Top (S )] ̸= Accept do
if M [Top (S )] = {Shift} then
c ← NextSymbol() ;
else if M [Top (S )] = {i } then
Let A→α be rule number i in G;
Pop() |α| times from S ;
c←A;
/* Compute the next CFSM state and push it */
nextState ← δ(Top(S ), c) ;
Push (S , nextState) ;
Algorithm 11: The LR(0) top-down parser. This algorithm assumes that
each cell of action table M contains exactly one action.
B OT T O M - U P PA R S E R S 175

• if the action is a rule number i (meaning that a Reduce of rule num-


ber i must be performed), and if rule number i is A→α , then |α| state
numbers are popped from the stack, and grammar variable A is stored
in c.

Finally, the parser computes the next state based on the current state (which In practice, only the transition rela-
is now on top of stack) and the symbol c that the CFSM must read. This tion of the CFSM is needed for the
parser. It can be given by a simple
new state is pushed onto the stack. After all these actions have been per-
table that returns, for all states s and ter-
formed, the state of the parser has been correctly updated, and it can con- minal a, the successor state of s by a.
tinue its execution the same way.

6.3.3 Another example

We close this section by discussing a final example of LR(0) grammar, one


where the set of viable prefixes is infinite (yet, still regular, as it always is).

Example 6.9. Let us consider the simple grammar in Figure 6.5. This
grammar has one recursive rule, and the rule S→ε , which is particularly
interesting here. Indeed, a bottom-up parser parsing this grammar will (1) S′ → S$
necessarily need to Reduce it at some point (it is needed to terminate the (2) S → S aS b
(3) → ε
recursion). But reducing this rule has a perhaps unexpected effect: since
Figure 6.5: An LR(0) grammar whose set of
the right-hand side is ε, there is nothing to pop from the stack when per- viable prefixes is infinite.
forming the reduction. So, reducing S→ε amounts to pushing S on the
top of the stack, and this can occur basically at any moment in the run of
the parser. This might give use the impression that we will end up with a
non-deterministic parser, yet our LR(0) parser will be deterministic, even
without look-ahead!

0
1
2
S′ → •S $ ′
S → S•$
S $ S ′
→ S $•
S → •S aS b S → S•aS b
S → •
a
5
4 S
3
S → S a•S b
S → S aS•b
S → S aS b• b
S → S•aS b S → •S aS b
S → •
a
Figure 6.6: The CFSM of the previous
grammar. Its language is infinite.
The (LR(0)) CFSM of the grammar is given in Figure 6.6. On can check
that its language is indeed infinite since it contains a loop between states 4 Table 6.1: The LR(0) action table of the pre-
and 5. One can also remark the presence of the item S→• in states 0 and 5, vious grammar.
that triggers the reduce of rule number 3 (i.e., pushing an S on the top of State Action
the stack as discussed above). 0 3
Then, the LR(0) action table is given in Table 6.1. Observe that states 0 1 S
2 A
and 5, again, prompt a Reduce of S→ε , and this Reduce only. There is 3 2
no Shift in these states since there is no outgoing transition labeled by a 4 S
5 3
terminal.

Now, let us build the run of the LR(0) parser on aabb$. As always, we
start with the initial state 0 on the stack. In state 0, a Reduce of S→ε must
176 I N T R O D U C T I O N T O L A N G UA G E T H E O RY A N D C O M P I L I N G

be performed. This amounts to pushing an S on the stack, or rather push-


ing the CFSM state which is the successor of 0 by S, i.e. 1. From 1, a Shift
(of a) is performed, and we move to state 5. So the three first steps of the
run are:

5
R3 1 S 1
−−→ →

0 0 0
aabb$ aabb$ abb$

After that, the run continues as usual. The full run is given in the following
table (in order to save space), where the stack content is displayed from
the bottom to the top:

Input Stack content Action

aabb$ 0 Reduce 3
aabb$ 01 Shift
abb$ 015 Reduce 3
abb$ 0154 Shift
bb$ 01545 Reduce 3
bb$ 015454 Shift
b$ 0154543 Reduce 2
b$ 0154 Shift
$ 01543 Reduce 2
$ 01 Shift
ε 012 Accept

6.4 SLR(1) parsers

One thing which is remarkable about LR(0) parsers is their ability to parse
grammars that are not entirely trivial, without using any look-ahead, as the
previous examples have clearly shown. However, as it often happens with
life and computing, there is no free lunch, and there are grammars of prac-
tical interest that we cannot parse deterministically with LR(0) parsers! Let
us have a look at such an example. . .
(1) S → Exp$
Example 6.10. Consider the grammar in Figure 6.7. It is a simplification (2) Exp → Exp + Prod
of the grammar to generate arithmetic expressions that we have consid- (3) → Prod
(4) Prod → Prod ∗ Atom
ered several times before (the point of the simplification here is to keep (5) → Atom
the example short, but everything we are about to discuss extends to the (6) Atom → Id
grammar with all operators). Notice that this grammar implements the (7) → (Exp)
Figure 6.7: A simple grammar for generat-
priority of operator as discussed at the end of Chapter 4.
ing arithmetic expressions. This grammar
We claim that this grammar is not LR(0). Let us build its CFSM to check is not LR(0). It is also not LL(k) for any k
this. It is given in Figure 6.8. We can immediately spot conflicts in state 1 since it contains left-recursive rules.

and in state 12. In state 1 the parser cannot decide between performing a
shift (which will necessarily be of symbol ∗), or reducing Exp → Prod. Simi-
larly, in state 12, there is a conflict between a shift (of ∗ again) and a reduce
of Exp → Exp + Prod. M
B OT T O M - U P PA R S E R S 177

9
S → Exp$•
0
S → •Exp$ 8
$
4 Exp → Exp+•Prod
Exp → •Exp + Prod
Exp → •Prod S → Exp•$ Prod → •Prod ∗ Atom
Exp +
Prod → •Prod ∗ Atom Exp → Exp•+Prod Prod → •Atom
Prod → •Atom Atom → •Id
Atom → •Id Atom Atom Atom → •(Exp)
Atom → •(Exp) 5 Id
+ Prod
Id Prod → Atom• 10
Prod (
1 ( Atom → (Exp•)
Exp → Prod• Atom Exp → Exp•+Prod
Prod → Prod•∗Atom 6
Atom → (•Exp)
Exp )
Prod Exp → •Exp + Prod 11

2 Exp → •Prod Atom → (Exp)•
Prod → Prod∗•Atom ( Prod → •Prod ∗ Atom (
Prod → •Atom
Atom → •Id
Atom → •(Exp)
Atom → •Id 12
Atom → •(Exp) Exp → Exp + Prod•
Atom Prod → Prod•∗Atom
3 *
7 Id
Prod → Prod ∗ Atom• Id
Atom → Id•

Figure 6.8: The CFSM for the grammar


generating expressions.
178 I N T R O D U C T I O N T O L A N G UA G E T H E O RY A N D C O M P I L I N G

While this example shows that some grammars cannot be parsed by a


deterministic bottom-up parser without any symbol of look-ahead, it also We will later introduce formally the
suggests an easy way to lift this conflict. Let us again consider state 1. notion of LR(k) grammar, and we
will be able to check that the gram-
Clearly, in this state, the Shift action should occur only when a ∗ symbol is mar of the previous example is not LR(0).
the next character on the input (all other shifts lead to the error state that
is not shown). But how can we find a subset of symbols that contain at
least all the symbols for which a Reduce of Exp → Prod should occur?
Another way to express this question is the following: ‘what are the next
symbols that we expect on the input when the top-down parser is sup-
posed to reduce A → α?’ Clearly, if the top-down parser is to reduce A → α,
this means that is has already read (i.e., shifted on the stack) a string of
symbols that have been derived from A. So, the next symbol on the input
(i.e., the first one that we have not shifted yet), must be a symbol that we
expect after a word which is derived from A; and this is exactly the defini-
tion of Follow(A).
We have thus found a practical way to lift the conflicts, at least in our
example:
12
Franklin Lewis DeRemer. Practical
Translators for LR(k) Languages. PhD
1. whenever a state contains an item of the form A → α•, perform a Re-
thesis, Massachusetts Institute of Technol-
duce of A → α if and only if the look-ahead is a symbol from Follow(A); ogy, 1969. URL https://web.archive.
org/web/20130819012838/http:
2. whenever a state contains an item of the form A → α1 • α2 where α2 //publications.csail.mit.edu/lcs/
pubs/pdf/MIT-LCS-TR-065.pdf; and
starts with a terminal, perform a Shift if and only if the look-ahead is a
Franklin Lewis DeRemer. Simple LR(k)
symbol from First(α2 ). grammars. Communications of the ACM,
14(7), 1971. D O I : 10.1145/362619.362625
The technique we have just sketched has been introduced in 1969 by
Franklin D E R E M E R in his PhD thesis12 , and is called Simple LR(1) or SLR(1)
for short, because it uses one character of look-ahead and it can be re- If this ‘simple’, what will come
next? You have no idea. . .
garded as a simplification of the more general LR(1) technique that we will
introduce next. In order to formalise it, we have to modify the algorithm On a more serious note, we point
building the action table, and the parsing algorithm of LR(0). The con- out that this technique can be gen-
eralised to k characters of look-
struction of the CFSM stays the same. Those new algorithms are given in ahead.
Algorithm 12 (for the construction of the action table) and in Algorithm 13
(for the actual parser).
Observe that the action table M is now computed as a two-dimensional
table. For all states q of the CFSM, and for all symbols a ∈ T ∪ {ε}: M [q, a]
provides the parser with the action(s) that it should perform when the cur-
rent state of the CFSM is q and the next symbol on the input (i.e. the look-
ahead) is a.
B OT T O M - U P PA R S E R S 179

Input: A CFG G = 〈V , T , P , S〉 whose rules are indexed from 1 to n;


and CFSM (G) = 〈Q, 0,V ∪ T , δ, F 〉 where
Q = {0, 1 . . . , m − 1, ;}.
Output: SLR(1) action table of G.
/* Initialisation */
foreach a ∈ T do
M [;, a] ← Error;
foreach q ∈ {0, 1, . . . , m − 1} do
M [q, a] ← ; ;

/* Populating the table */


foreach q ∈ {0, 1, . . . , m − 1} do
if q contains an item of the form S→α• then
M [q, ε] ← M [q, ε] ∪ {Accept} ;
foreach item in q that is of the form A→α• with A ̸= S do
foreach a ∈ Follow(A) do
M [q, a] ← M [q, a] ∪ { j } where rule number j is A→α ;

foreach item in q that is of the form A→α1 •aα2 with a ∈ T do


M [q, a] ← M [q, a] ∪ {Shift} ;

return M ;
Algorithm 12: The algorithm to compute the SLR(1) action table of a
CFG G using CFSM (G).

Input: The SLR(1) or LR(1) action table M of some CFG G and its
associated CFSM (G) = 〈Q, 0,V ∪ T , δ, F 〉; and an input word
w.
Output: True iff w ∈ L(G)
Let S be an empty stack ;
/* Push of the initial CFSM state */
Push (S , 0) ;
/* Initialisation of the look-ahead */
ℓ ← first symbol on the input ;
while M [Top (S ) , ℓ] ̸= Error and M [Top (S ) , ℓ] ̸= Accept do
if M [Top (S ) , ℓ] = {Shift} then
c ← NextSymbol() ;
else if M [Top (S ) , ℓ] = {i } then
Let A→α be rule number i in G;
Pop() |α| times from S ;
c←A;
/* Compute the next CFSM state and push it */
nextState ← δ(Top(S ), c) ;
Push (S , nextState) ;
/* Update the look-ahead */
ℓ ← first symbol on the remaining input ;
Algorithm 13: The SLR(1) and LR(1) top-down parser. This algorithm
assumes that each cell of action table M contains exactly one action.
180 I N T R O D U C T I O N T O L A N G UA G E T H E O RY A N D C O M P I L I N G

Table 6.2: The SLR(1) action table for our


Example 6.11. We continue Example 6.10, and build the action table of simple grammar of expressions. The first
column gives the state number, the first
the grammar in Figure 6.7. To this end, we first compute: row lists the possible look-ahead symbols.

1. Follow(Exp) = {$, +, )}; M + ∗ Id ( ) $ ε

2. Follow(Prod) = {∗} ∪ Follow(Exp) = {∗, $, +, )}; 0 S S


1 3 S 3 3
2 S S
3. Follow(Atom) = Follow(Prod) = {∗, $, +, )}.
3 4 4 4 4
Then, we obtain the table given in Figure 6.2. Observe that this table con- 4 S S
5 5 5 5 5
tains no conflict now. 6 S S
Now, let us consider the word Id + Id ∗ Id$, which is in the language of the 7 6 6 6 6
8 S S
grammar, and let us observe the corresponding run of the SLR(1) parser.
9 A
The run starts as usual with: 10 S S
11 7 7 7 7
12 2 S 2 2

0
Id + Id ∗ Id$

When the CFSM is in state 0 and the look-ahead is Id, the action table says
to perform a Shift, which leads to state 7. In this new state, and with a new
look-ahead equal to +, the parser must Reduce rule 6, which leads to sate
5, and so on. . . The full run is then:

S R6 R5 R3 S S 7 R6 5 R5 12 S

− −−→ −−→ −−→ →
− 8 →
− 8 −−→ 8 −−→ 8 →

7 5 1 4 4 4 4 4
0 0 0 0 0 0 0 0 0
Id + Id ∗ Id$ +Id ∗ Id$ +Id ∗ Id$ +Id ∗ Id$ +Id ∗ Id$ Id ∗ Id$ ∗Id$ ∗Id$ ∗Id$

7 3
2 2 2
12 S 12 R6 12 R4 12 R1 S Accept!
8 →
− 8 −−→ 8 −−→ 8 −−→ →
− 9 −−−−−→
4 4 4 4 4 4
0 0 0 0 0 0
Id$ $ $ $ $ ε ε

6.5 LR(k) parsers


(1) S′ → S$
While SLR(1)1 parsers offer a solution to some of the conflicts that exist in (2) S → L=R
(3) S → R
LR(0) parsers, they are not paramount as we will see in the next example:
(4) L → ∗R
Example 6.12. Let us consider the grammar from Figure 6.9. It can gen- (5) L → Id
(6) R → L
erate assignments where both the left-hand and the right-hand sides can
Figure 6.9: A grammar which is not SLR(1).
contain references. For example, the assignment ∗ ∗ Id = ∗ ∗ ∗Id$ can be
generated by this grammar. Observe that we have the following values for
the Follows:
B OT T O M - U P PA R S E R S 181

• Follow(S) = {$};

• Follow(L) = {=, $}; and

• Follow(R) = {=, $}.

Now, let us try and build the CFSM for this grammar. An excerpt is given
in Figure 6.10. On this small excerpt, we can clearly see that there is a
shift/reduce conflict in state 1, if we build an LR(0) parser. Unfortunately,
this conflict subsides if we build an SLR(1) parser. Clearly, when the look-
ahead is =, a Shift can be performed. But since =∈ Follow(S), we would
also perform a Reduce in this case. M

3 4
S ′
→ S•$ $ S ′
→ S $•

S
0
S′ → •S $
1
S → •L = R
S → L•= R
S → •R L
R → L•
L → •∗R
L → •Id
R → •L

R
2
S → R•

Figure 6.10: An excerpt of the CFSM for the


previous grammar.
This example clearly shows that using Follow(R) to lift the conflict does
not work in this example. Observe, however that Follow(R) contains two
elements: = and $, but only one of them causes the conflict. Could it be
the case that we can refine the notion of Follow in order to find a way to
avoid this conflict? In the present example, it turns out to be possible!
Indeed, let us check how we came to have the item R → L• in state 1.
First, starting from S ′ → •S $ in state 0, we have applied the closure oper-
ation to yield item S → •R, then another iteration of the closure to obtain
R → •L, both in state 0.
Now, observe that we have introduced S → •R, because we expect to
reduce an R into an S, in order to finally reduce S $ into S ′ (item S ′ → •S $).
But this means that we expect to have an R on the top of the stack in a
particular context, which is when it is followed by a $. And then, since we
expect to reduce the R from an L (item R → •L), it means, in turn, that we
expect to find an L on the top of the stack only in the context where it is
followed by a $.
A contrario, imagine that we perform a Reduce of R→L in state 1, with
an = as look-ahead. Then, the parser will move to state 2, where it will
reduce an S (still with = as look-ahead), and further move to state 3. How-
182 I N T R O D U C T I O N T O L A N G UA G E T H E O RY A N D C O M P I L I N G

ever, from this state, it is not possible to Shift the = and to continue the
run of the parser.
With this discussion, we conclude that, in state 1, an = in the look-
ahead must trigger a Shift; while we must do a Reduce only when $ is
on the look-ahead. By doing so, we actually refined the notion of Follow.
While Follow(R) = {=, $}, this can be regarded as a global Follow, because
it is computed over the whole grammar. With the finer analysis we have
sketched here, we have obtained some sort of local Follow: in state 1, this
local Follow of R contains only $, which allows us to lift the conflict. The
point of this section is to formalise those notions.

6.5.1 The LR(k) CFSM

In order to formalise the ideas we have just sketched, we need to extend


the notion of CFSM, to incorporate local Follows. We start by redefining
the notion of item:

Definition 6.13 (LR(k) CFSM item). Given a grammar G = 〈V , T , P , S〉, an Remember that the notation T ≤k
item of G is a rule of the form A → α1 • α2 , u, where A → α1 α2 ∈ P is a rule means ∪ki=0 T i , i.e., the set of all
strings of length at most k on T .
of G and u ∈ T ≤k . We denote by LR(k) − Items (G) the set of LR(k) items
of G. M

As can be seen, we have augmented the notion of item with a local


Follows u (of size at most k). The intuition behind an item of the form
A → α1 • α2 , u is that:

• the parser is trying to reduce an A from the handle α1 α2 ;

• it has already reduced α1 on the top of the stack;

• this occurs in the context where A is followed by u. In other words,


if α2 = ε, the look-ahead should be u, otherwise the run of the parser
cannot succeed.

Example 6.14. Continuing the example of Figure 6.9 and Figure 6.10, the
second item in state 1 should be R → L•, $. M

In order to compute an LR(k)-CFSM with such items, we need to rede-


fine the initial state, and the closure operation.
For the initial state, we compute it by taking the closure (which we de-
fine hereunder) of the set of items:

S → •α, ε | S → α ∈ P
© ª

where P is the set of rules of the grammar. That is, we take the closure of all
the items of the form S → •α extended with the local Follow ε (and where
S is, as usual, the start symbol).
Now, for the closure operation, assume we have an item of the form
A → α1 • B α2 , u. We will perform the closure by creating items based on
all rules of the form B → β. The local Follows will be computed on the ba-
sis of what we expect on the input after B . Since we start from the item
A → α1 • B α2 , u, we expect on the input the k first characters of α2 , which
we can complete with u if need be. That is, we will have all the items of
B OT T O M - U P PA R S E R S 183

Input: A set I ⊆ Items (G) of LR(k)-items of some CFG


G = 〈V , T , P , S〉.
Output: The closure of I

Closure ← I ;
repeat
PrecClosure ← Closure ;
foreach A→α1 •B α2 , u ∈ Closure s.t. B ∈ V do
Closure ←
Closure ∪ B → •β, u ′ ¯B → β ∈ P and u ′ ∈ Firstk (α2 · u) ;
© ¯ ª

until PrecClosure = Closure;


return Closure ;
Algorithm 14: Computing the Closure of a set I of items for the LR(k)
CFSM.

the form B → •β, u ′ , where u ′ ∈ Firstk (α2 · u). Algorithm 14 gives the for-
malised algorithm for computing this closure operation.
Finally, the transitions of the CFSM are computed by simply propagat-
ing the local follows. That is, we redefine the successor function as:

CFSMSucc (I , a) = A → α1 a • α2 , u ¯ A → α1 • aα2 , u ∈ I .
© ¯ ª

Let us illustrate these notions by continuing our example:

Example 6.15. Let us build on the example of Figure 6.9 and Figure 6.10.
Let us first compute the initial state of the LR(1) CFSM for the grammar of
Figure 6.9. The kernel of the state will be the item S ′ → •S $, ε. Then, we
compute the closure of this item step by step:

• First, we add the items S → •L = R, $ and S → •R, $. The local Follows


are obtained by computing:

First $ · ε = First $
¡ ¢ ¡ ¢

= {$}.

• Then, from the item S → •L = R, $, we obtain the items L → • ∗ R, = and


L → •Id, =.

• Next, from S → •R, $ we add the item R → •L, $.

• Finally, from this last item, we obtain two new items L → • ∗ R, $ and
L → •Id, $. Observe that the local Follows we have obtained in these
items are different from the one computed before. In order to reduce
the size of the representation of the CFSM, we will often merge items
that differ only by the local Follows into sets of possible local Follows:
we will thus write L → • ∗ R, {=, $} and L → •Id, {=, $}. For the sake of
clarity, we will thus systematically represent the possible local Follows
as sets even when they are singletons.

This state is displayed in Figure 6.11, along with its L-successor, state 1.
This shows that the conflict on state 1 has been lifted, since the local Fol-
low for item R → L• is $. M
184 I N T R O D U C T I O N T O L A N G UA G E T H E O RY A N D C O M P I L I N G

0 Figure 6.11: An excerpt of the LR(1) CFSM


for the example grammar.
S ′
→ •S $ , ε
1
S → •L = R , {$}
S → L•= R , {$}
S → •R , {$} L
R → L• , {$}
L → •∗R , {=, $}
L → •Id , {=, $}
R → •L , {$}

Example 6.16. For our second example, we will build the LR(1) CFSM for
the grammar in Figure 6.7. We already know that this grammar is SLR(1),
so we will be able to compare the SLR(1) and LR(1) CFSM for this grammar.
The CFSM is given in Figure 6.12 and Figure 6.13. Observe that the states
in these two figures are very similar, but that the local follows in the items
are different. As an example, compare state 6 in Figure 6.12 and state 16
in Figure 6.13, where the difference lies in the first item only. This clearly
shows that these local follows allow for a finer analysis, but this comes at
the cost of the number of states. M

LR(k) Action table Now that we can compute the LR(k) CFSM, let us see
how we can exploit it in a parser. We will adapt the techniques we have
introduced for SLR(1). There are mainly two cases to consider:

1. When a state s contains an item of the form A→α•aβ, u , where a is a


terminal (i.e. the dot is directely followed by a terminal), then we must
perform a Shift. This will occur in state s, and when the look-ahead is
some element y ∈ Firstk aβu . Observe that we complete aβ with the
¡ ¢

local Follow that is associeted to the item, in order to obtaint the look-
ahead.

2. When a state s contains an item of the form A→α•, u , we need to Remember that in ‘A→α•, u ’,
the element u is a word of
perform a Reduce of the rule A→α . This action will occur in state s
at most k characters. In the
and when the look-ahead is u. CFSM, we adopt a compact notation like
A→α•, {w 1 , w 2 , . . . , w n } to represent the n
For example, if we rely on the CFSM of Figure 6.12, in state 9, we per- items A→α•, w 1 ,. . . , A→α•, w n .
form a Reduce when the look-ahead is either $, + or ∗. Note that we do It is worth comparing the rule for
not readuce when the look-ahead is ), although ) ∈ Follow(Prod). The performaing a Reduce in the case
of LR(1) to what we have done
Reduce will occur on this look-ahead in state 14 (but, in this state, it will
for SLR(1) (see Algorithm 12). In the
not occur on the look-ahead $). case of SLR(1), the look-ahead we use is
Follow(A). Here, we use u, that is, we use a
We can now exploit those ideas to adapt Algorithm 12 to the LR(k) case. local Follow instead of a global one.

This new algorithm is given in Algorithm 15. Apart from the use of the
local Follows that we have already discussed, the most notable addition to
Algorithm 15 is the use of k characters of look-ahead. This means that the We are indexing by words of length
at most k and not exactly k be-
action table is now indexed by a state q and a word u of length at most k.
cause the actual look-ahead avail-
able on the input might be shorter than k
Example 6.17. We apply Algorithm 15 to the grammar in Figure 6.7. Its (for example when we reach the end of the
LR(1) CFSM is given in Figure 6.12 and Figure 6.13. Its LR(1) action table is input).

in Table 6.3. M
B OT T O M - U P PA R S E R S 185

1
2

S → Exp•$ , {ε}
S′ → Exp$• , {ε} $
Exp → Exp•+Prod , {$, +}

Exp +
0 3
S′ → •Exp$ , ε Exp → Exp+•Prod , {$, +}

Exp → •Exp + Prod , {$, +} Prod → •Prod ∗ Atom , {$, +, ∗}


Exp → •Prod , {$, +} Prod → •Atom , {$, +, ∗}
Prod → •Prod ∗ Atom , {$, +, ∗} Atom → •Id , {$, +, ∗}
Prod → •Atom , {$, +, ∗} Atom → •(Exp) , {$, +, ∗}
Atom → •Id , {$, +, ∗}
Atom Atom
Atom → •(Exp) , {$, +, ∗} ( 4

Prod Id Prod → Atom• , {$, +, ∗} Id


7
Exp → Prod• , {$, +}
Prod → Prod•∗Atom , {$, +, ∗} 5 (
Atom → Id• , {$, +, ∗}
∗ Id
8
Prod → Prod∗•Atom , {$, +, ∗} 6
Prod
Atom → •Id , {$, +, ∗} Atom → (•Exp) , {$, +, ∗}
Atom → •(Exp) , {$, +, ∗} Exp → •Exp + Prod , {), +}
(
Exp → •Prod , {), +}
Atom
9 Prod → •Prod ∗ Atom , {), +, ∗}
Prod → Prod ∗ Atom• , {$, +, ∗} Prod → •Atom , {), +, ∗}
Atom → •Id , {), +, ∗}
Exp Atom → •(Exp) , {), +, ∗}

11
∗ Atom Id ( Prod
Atom → (Exp•) , {$, +, ∗}
+
Exp → Exp•+Prod , {), +} 13 14 15 16 17

10
)
12 Exp → Exp + Prod• , {$, +}
Atom → (Exp)• , {$, +, ∗} Prod → Prod•∗Atom , {$, +, ∗}

Figure 6.12: The LR(1) CFSM for the gram-


mar for arithmetic expressions, first part.
States 13 through 17 (and their successors)
are given in the next figure.
186 I N T R O D U C T I O N T O L A N G UA G E T H E O RY A N D C O M P I L I N G

13
Exp → Exp+•Prod , {), +}

Prod → •Prod ∗ Atom , {), +, ∗}


+
Prod → •Atom , {), +, ∗}
+ Atom → •Id , {), +, ∗}
11 Atom → •(Exp) , {), +, ∗}

17 Atom
14
Exp → Prod• , {), +}
Prod → Atom• , {), +, ∗}
Prod → Prod•∗Atom , {), +, ∗}
Id

Prod Atom
∗ 6
Prod 18
Id (
Prod → Prod∗•Atom , {), +, ∗} ( 15
Atom → •Id , {), +, ∗} Id
Atom → Id• , {), +, ∗}
Atom → •(Exp) , {), +, ∗} Atom Prod
Atom
Id
19 ( 16
Prod → Prod ∗ Atom• , {), +, ∗} Atom → (•Exp) , {), +, ∗}

Exp → •Exp + Prod , {), +}


Exp → •Prod , {), +}
21 Prod → •Prod ∗ Atom , {), +, ∗}
Exp
Atom → (Exp•) , {), +, ∗} Prod → •Atom , {), +, ∗}
Exp → Exp•+Prod , {), +} Atom → •Id , {), +, ∗}
Atom → •(Exp) , {), +, ∗}

)
22 (
20
Atom → (Exp)• , {), +, ∗} Exp → Exp + Prod• , {), +}
Prod → Prod•∗Atom , {), +, ∗}


Figure 6.13: The LR(1) CFSM for the gram-
mar for arithmetic expressions, (contin-
ued). States 6 and 11 are displayed on the
previous figure.
B OT T O M - U P PA R S E R S 187

Input: A CFG G = 〈V , T , P , S〉 whose rules are indexed from 1 to n;


and CFSM (G) = 〈Q, 0,V ∪ T , δ, F 〉 where
Q = {0, 1 . . . , m − 1, ;}.
Output: LR(k) action table of G.
/* Initialisation */
≤k
foreach u ∈ T do
M [;, u] ← Error;
foreach q ∈ {0, 1, . . . , m − 1} do
M [q, u] ← ; ;

/* Populating the table */


foreach q ∈ {0, 1, . . . , m − 1} do
if q contains an item of the form S→α•, ε then
M [q, ε] ← M [q, ε] ∪ {Accept} ;
foreach item in q that is of the form A→α•, u with A ̸= S do
M [q, u] ← M [q, u] ∪ { j } where rule number j is A→α ;
foreach item in q that is of the form A→α1 •aα2 , u with a ∈ T
do
foreach y ∈ Firstk (aα2 u) do
M [q, y] ← M [q, y] ∪ {Shift} ;

return M ;
Algorithm 15: The algorithm to compute the LR(k) action table of a
CFG G using CFSM (G).

Table 6.3: The LR(1) action table for our


M + ∗ Id ( ) $ ε
simple grammar of expressions.
0 S S
1 S S
2 A
3 S S
4 5 5 5
5 6 6 6
6 S S
7 3 S 3
8 S S
9 4 4 4
10 2 S 2
11 S S
12 7 7 7
13 S S
14 5 5 5
15 6 6 6
16 S S
17 3 S 3
18 S S
19 4 4 4
20 2 2
21 ) S
22 7 7 7
188 I N T R O D U C T I O N T O L A N G UA G E T H E O RY A N D C O M P I L I N G

6.5.2 Running the LR(k) parser

Now that we have managed to build the action table of an LR(k) parser,
we can discuss how it runs on a given input word. Actually, the run of
an LR(1) parser will not be different from that of an SLR(1)1 parser: the
origin of the action table and of the CFSM does not matter for Algorithm 13
to run. All this algorithm needs is an action table indexed by states (for
the rows) and look-aheads (for the columns), and a way to compute the
successor states of the CFSM (which can be given as a table, as well, and
is implicitly represented by the CFSM itself ). While we have restricted our
presentation of Algorithm 13 to the case k = 1 (i.e., and LR(1) parser), it is
straightforward to adapt it to the case where k > 1: it suffices to replace the
look-ahead ℓ with a word of size (at most) k.
We can thus close our discussion on the construction of LR(k) parsers
by giving an example of run.

Example 6.18. We repeat Example 6.11, this time using the action table
of the LR(k) parser, and its associated LR(k) CFSM (see Example 6.15 and
Example 6.17). We consider the input word Id + Id ∗ Id$, which is in the
language of the grammar (see Figure 6.7). As always, we start our run in
the following configuration:

0
Id + Id ∗ Id$

In this configuration, the look-ahead is Id, the current state if 0 and the
action table tells us to perform a Shift, which leads to state 5 in the CFSM,
etc. The full run is then:

S R6 R5 R3 S S 5 R6 4 R5 10 S

− −−→ −−→ −−→ →
− 3 →
− 3 −−→ 3 −−→ 3 →

5 4 7 1 1 1 1 1
0 0 0 0 0 0 0 0 0
Id + Id ∗ Id$ +Id ∗ Id$ +Id ∗ Id$ +Id ∗ Id$ +Id ∗ Id$ Id ∗ Id$ ∗Id$ ∗Id$ ∗Id$

5 9
8 8 8
10 S 10 R6 10 R4 10 R1 S Accept!
3 →
− 3 −−→ 3 −−→ 3 −−→ →
− 2 −−−−−→
1 1 1 1 1 1
0 0 0 0 0 0
Id$ $ $ $ $ ε ε

As can be seen, this run is very similar to the run of the SLR(1) parser. This should not be a surprise, since both
parsers build the same rightmost derivation.
M
B OT T O M - U P PA R S E R S 189

6.5.3 LR(k) grammars

In the previous section, we have managed to build LR(k) parsers, which


are bottom-up parsers that use k characters of look-ahead. As their name
suggests, they are some sort of bottom-up relatives to the LL(k) parsers.
In the case of LL(k) parsers we had identified the class of LL(k) grammars
for which we had the guarantee that their corresponding LR(k) parser is
deterministic (see Definition 5.10).
In this section, we will consider a similar definition of LR(k) grammars,
which captures exactly the class of all grammars for which the LR(k) parser
is deterministic (i.e., has no conflict in its action table). Remember the remark on
page 178? There we are. . .
As for LL(k), the formal definition might seem daunting, so we will start
with some intuition. We will proceed as for LL(k): we will try and identify
situations that will be ‘confusing’ for the LR(k) parser, and the definition
will basically state that such situations are not allowed.
Let us assume that we have a first rightmost derivation starting with:

S ⇒∗ γAx ⇒ γαx ⇒ · · ·

In this prefix we have pinpointed the derivation of rule A → α. Since this is


a rightmost derivation, we have the guarantee that x ∈ T ∗ , i.e. x contains
terminals only. On the other hand α and γ are in (T ∪V )∗ : they can contain
terminals and variables. In this derivation, α is the handle of A → α, and
the task of the parser will be to correctly identify this handle and this rule
to perform the reduction.
Thus, in order to build a situation that confuses the parser, we have to
remember what is the information that the parser uses to take a decision.
From the discussion before, we know that it uses two pieces of informa-
tion: the current CFSM state and the look-ahead. Where can we find this
information in the derivation? To answer this question, we must find out
what is on the stack at the moment where the parser must reduce A → α,
and what remains on the input. Since α is reduced as A, α must necessar-
ily be on the top of the stack, so the stack content is (γα)R ; and x remains We have (γα)R and not γα because
we have taken the convention to
on the input (see Proposition 6.2). Then, when the Reduce of A → α must
read the stack from the top to the
occur: bottom.

1. the current CFSM state is the one which is reached when the CFSM
reads the mirror image of the stack content, i.e. γα. Since the CFSM is
deterministic, we can identify the state with γα (i.e., there will be one
and only one state reached when reading γα, so we can abuse notations
slightly and say that γα ‘is the state’); and

2. the look-ahead contains the k first symbols of what remains on the in-
put (if they exist), i.e. Firstk (x).

Now, let us build another derivation that will create a situation where
the parser is ‘confused’, i.e. a derivation where the parser cannot tell the
difference between the reduce of A → α described above, and another re-
duce. Let us thus assume that we have now two rightmost derivations (we
recall the previous one for the sake of comparison):

S ⇒∗ γAx ⇒ γαx ⇒ · · ·
S ⇒∗ δB y ⇒ δβy ⇒ · · ·
190 I N T R O D U C T I O N T O L A N G UA G E T H E O RY A N D C O M P I L I N G

In the latter derivation, the parser must reduce B → β. This occurs when
(δβ)R is on the stack and y remains on the input. So the situation where
the parser gets ‘confused’ between the Reduce of A → α and B → β is when
the CFSM state (or equivalently the stack content) is the same in both
cases, and when the look-aheads are the same. This occurs when:

δβy = γαx ′

(i.e. the stack content is the same) for some x ′ s.t.:

Firstk (x) = Firstk x ′


¡ ¢

(the look-ahead is the same). Observe that this is a semantic def-


inition that talks about all the pairs
Then, the definition of LR(k) grammar says essentially that such a sit-
of derivations (like the definition of
uation cannot occur. So, if the above conditions are met, we must have LL(k)).
γ = δ, A = B and x ′ = y for the grammar to be LR(k). Here it is13 : 13
Donald E. Knuth. On the translation of
languages from left to right. Information
Definition 6.19 (LR(k) grammar). A CFG G = 〈P , T ,V , S〉 is LR(k) iff for all and Computation (formerly known as In-
formation and Control), 8:607–639, 1965.
pairs of rightmost derivations:
D O I : 10.1016/S0019-9958(65)90426-2

S ⇒∗ γAx ⇒ γαx ⇒∗ w 1 x
S ⇒∗ δB y ⇒ δβy ⇒∗ w 2 y

s.t.: (i) δβy = γαx ′ ; and (ii) Firstk (x) = Firstk x ′ , we have: γAx ′ = δB y.
¡ ¢

M
While testing whether a grammar
Testing that a given grammar is LR(k) (for a given k) can be challenging is LR(k) for a given k (for exam-
ple: ‘is this grammar LR(5)?’) is
with this definition. The best way is still to build the LR(k) parser and clearly decidable, K N U T H shows in his pa-
check that it is deterministic. per cite above that testing whether there
exists some k s.t. a given grammar is LR(k)
Let us illustrate Definition 6.19 with an example:
is an undecidable problem.

Example 6.20. Consider the simple grammar in Figure 6.14. It is clearly (1) S → a
not LR(0), since there is a Shift/Reduce conflict in state 1 of the CFSM (see (2) → ab
Figure 6.15). Figure 6.14: A simple grammar which is
not LR(0).
Let us show why it does not satisfy Definition 6.19 for k = 0. To do so, we
0
must find a pair of derivations that do not satisfy the conditions of the def-
S → •a
inition. In our case, this will be pretty straightforward, since the grammar S → •ab
has two derivations only:

S ⇒a a
1
S ⇒ ab S → a•
S → a•b
Matching these derivations with the notations of Definition 6.19, we ob-
tain:
b
2
A=B =S
S → ab•
α=a
β = ab Figure 6.15: The CFSM of the previous
grammar.
γ=δ=ε
x =y =ε
Intuitively, γAx ′ is the situation
Thus, δβy = ab. Since, γα = a, we have δβy = γαx ′ with x ′ = b. So, where we have the stack content
(γA)R from the first derivation but
all the conditions of the definition are satisfied. In particular, note that
with the input x ′ which is a suffix of the
remaining input in the second derivation,
and the parser should be able to tell the
difference between these two situations
and take the right decision. In our ex-
ample, when the parser decides to reduce
S → a in the first derivation, it should still
do this even if we add x ′ = b to the input,
B OT T O M - U P PA R S E R S 191

First0 (x) = ; = First0 x ′ (we have no look-ahead in LR(0)). In this case,


¡ ¢

the definition tells us that we should have γAx ′ = δB y, however:

γAx ′ = A b
δB y = A

So these two values are clearly different.


The grammar, however, is LR(1) since the look-ahead allows us to de-
cide whether to Shift or to Reduce in state 1 of the CFSM. This can also be
checked with the definition: for k = 1, there is no way to arrange the val-
ues α, β, γ, δ, A, B , x, x ′ and y to satisfy the conditions of the definition
and have γAx ′ ̸= δB y. In particular, if we re-use the values as above, we
fail to satisfy the condition First1 (x) = First1 x ′ , since First1 (x) = {ε} and
¡ ¢

First1 x ′ = {b}. M
¡ ¢

6.6 LALR(k) parsers

We close this section on bottom-up techniques by discussing (briefly) a


last family of parsers that are essential in practice since they are the ones
that are actually implemented in tools such as yacc, bison and cup.
One of the main criticism of LR(k) parsers is that their associated CFSM
can be very big (because the number of potential items is multiplied by
the number of potential local look-aheads), hence, the action and succes-
14
sor tables needed by the parser can become unmanageable. As a conse- Franklin Lewis DeRemer. Practical
Translators for LR(k) Languages. PhD
quence, researchers realised quickly that some techniques were needed to thesis, Massachusetts Institute of Technol-
make LR(k) parsers practical. ogy, 1969. URL https://web.archive.
org/web/20130819012838/http:
One such solution was introduced14 by D E R E M E R along with SLR(1),
//publications.csail.mit.edu/lcs/
under the name of LALR(k) parsers. LALR(k) stands for ‘Look-Ahead LR(k)’ pubs/pdf/MIT-LCS-TR-065.pdf
parsers. While D E R E M E R introduced the idea of LALR(k) parsers, a practi-
cal algorithm to build them (efficiently) was given a few years later15 only. 15
W. R. LaLonde, E. S. Lee, and J. Horning.
An LALR(k) parser generator. In Proceed-
Since then, several algorithms have been presented16 .
ings of IFIP congress, pages 151–153. Else-
The idea of the LALR(k) parser is based on an observation we have al- vier Science, New York, 1971
16
ready made when considering LR(k) parsers: in LR(k) parsers, many states Thomas J. Penello and Franklin Lewis
DeRemer. Efficient computation of
are similar in the sense that they contain the same LR(0) items but with
LALR(1) look-ahead sets. ACM SIG-
different ‘local follows’. For example, in the CFSM of Figure 6.12 and Fig- PLAN Notices, 39(4), 2004. DOI:

ure 6.13, states 6 and 16 only differ on the local follows. So, a natural idea 10.1145/69622.357187

consists in ‘merging’ those states, i.e., computing the union of the sets of
items of those states, and building an action table and a successor table
based on this new automaton. Of course, by doing so, we are losing some
precision, we cannot expect that LR(k) grammars will also be LALR(k),
but if the resulting parser is still deterministic, we have obtained a CFSM
which has the same size as the LR(0) CFSM, while using look-aheads for
lifting ambiguities.

6.6.1 Building the LALR(k) parser from the LR(k) CFSM

More formally, we will first explain how to obtain an LALR(k) parser from
the LR(k) CFSM of a given grammar. We start with the definition of the
heart of a CFSM state
192 I N T R O D U C T I O N T O L A N G UA G E T H E O RY A N D C O M P I L I N G

Definition 6.21 (Heart of a CFSM state). Let {A 1 → α1 •β1 , u 1 ; . . . ; A n → αn •


βn , u n } be a set of LR(k) items, i.e. a state of some LR(k) CFSM. Then, its
heart 17 is the set {A 1 → α1 • β1 ; . . . ; A n → αn • βn } of corresponding LR(0) 17
Some references call it the core of the
state.
items. M

Then, we say that two states of the LR(k) automaton are equivalent if
they have the same heart:

Definition 6.22 (LALR(k) equivalence). Two states q 1 and q 2 of some LR(k)


CFSM are equivalent iff they have the same heart. We denote this by q 1 ≡
q2 . M
Thus, the equivalence classe [q] of
One can check that ≡ is indeed an equivalence relation. Hence, it par- a state q is set of all states q ′ that
are equivalent to q, i.e. such that
titions the set of states of the LR(k) CFSM into equivalence classes. For all q ′ ≡ q.
states q of the LR(k) CFSM, we denote by [q] the equivalence class that
contains it. This partition will help us to build a so-called LALR(k) CFSM
as we are about to see. But let us first have a look at an example for ≡.

Example 6.23. Let us consider again the LR(1) CFSM of Figure 6.12 and
Figure 6.13. Here is the list of equivalence classes of its states:

[0] = {0}
[1] = {1}
[2] = {2}
[3] = [13] = {3, 13}
[4] = [14] = {4, 14}
[5] = [15] = {5, 15}
[6] = [16] = {6, 16}
[7] = [17] = {7, 17}
[8] = [18] = {8, 18}
[9] = [19] = {9, 19}
[10] = [20] = {10, 20}
[11] = [21] = {11, 21}
[12] = [22] = {12, 22}

Then, based on this equivalence relation, and from the LR(k) CFSM, we
can define the LALR(k) CFSM. To this end, we need to define the new set
of states and the new transition relation of this automaton: Note that the set of states of the
LALR(k) CFSM is not the set of
1. For each equivalence class of ≡, we will have one state in the LALR(k) equivalence classes. Otherwise,
the states of the new automaton would be
CFSM. More precisely, for each equivalence class [q], there will be a sets of states of the LR(k) CFSM, i.e. sets
new state which contains all the items that are in all the states in [q]. In of sets of items; while we want the states of
other words, the new state (which we denote S [q] ) is: the new automaton to bet sets of items to
obtain a CFSM.

q ′.
[
S [q] =
q ′ ∈[q]

2. Then, we need to adapt the transition relation. This is easy, if we make


the following observation. Let [q] be some equivalence class, and let
B OT T O M - U P PA R S E R S 193

q ′ ∈ [q] be one of these states. By definition of the ‘merging’ of LR(k)


states into LALR(k) states, whenever there is an item of the form

A → α • β, u

in q ′ , then there is an item of the form

A → α • β, v

in all other states q ′′ of the equivalence class (i.e., with the same LR(0)
item, but a different context). This means that the transitions from all
states q ′′ of the equivalence class will be similar in the following sense.
Whenever there is, in the LR(k) CFSM, a transition from q 1 , labeled by η
and going to q 2 , then there is also a transition labeled by η from all q 1′ ∈
[q 1 ], and this transition leads to a state q 2′ which is equivalent to q 2 (i.e.
q 2′ ∈ [q 2 ]). It is thus safe to have an η-labeled transition in the LALR(k)
CFSM from S [q1 ] to S [q2 ] . Remark that this construction preserves the
determinism of the automaton.

Example 6.24. Let us illustrate this on our running example. Consider for
example the equivalence class {8, 18}. Observe that the sets of correspond-
ing LR(0) items of these two states are the same, which is not surprising
since this is the definition of the heart:
n o
Prod → Prod ∗ •Atom, Atom → •Id, Atom → •(Exp) .

This is why we have identically-labeled transitions from both states: one


labeled by Atom, one by Id and one by (. Moreover, these respective transi-
tions reach states that are ≡-equivalent. For example, reading Atom from
state 18 leads to state 19; while reading Atom from 8 leads to 9, with 19 ≡ 9.
M

Based on this discussion, we are now ready to formally define the LALR(k)
CFSM from the LR(k) CFSM:

Definition 6.25 (LALR(k) CFSM). Let G = 〈V , T , P , S〉 be a CFG, and let


Q,V ∪ T , δ, q 0 ,Q \ {;} be its LR(k) CFSM. Then, the LALR(k) CFSM is the
­ ®

DFA Q ,V ∪ T , δ′ , q 0′ ,Q ′ \ {;} where:


­ ′ ®
We compute S [q] for all states q
in Q, however note that, unless all
• Q ′ = {S [q] | q ∈ Q}, where, for all q ∈ Q: equivalence classes are singleton,
there will be at least two states q 1 and q 2 in
Q s.t. S [q1 ] = S [q2 ] . That is not a problem
q ′;
[
S [q] =
for the definition of Q ′ , since it is a set, and
q ′ ∈[q]
thus each element appears at most once.

• q 0′ = S [q0 ] ; and

• for all S [q] ∈ Q ′ , for all a ∈ V ∪ T :

δ′ (S [q] , a) = S [q ′ ] iff there are q 1 ∈ q, q 2 ∈ q ′ s.t. δ(q 1 , a) = q 2 .

Once the LALR(k) CFSM is built, we build the action table as in the case
of an LR(k) parser, and the runs of the parser follow this action table simi-
larly.
194 I N T R O D U C T I O N T O L A N G UA G E T H E O RY A N D C O M P I L I N G

S [1]
S [2]
S′ → Exp•$ , {ε}
S′ → Exp$• , {ε} $
Exp → Exp•+Prod , {$, +}

Exp
+
S [0] S [3]

S′ → •Exp$ , ε Exp → Exp+•Prod , {$, +, )}

Exp → •Exp + Prod , {$, +, )} Prod → •Prod ∗ Atom , {$, +, ∗, )}


Exp → •Prod , {$, +, )} Prod → •Atom , {$, +, ∗, )}
Prod → •Prod ∗ Atom , {$, +, ∗, )} Atom → •Id , {$, +, ∗, )}
Prod → •Atom , {$, +, ∗, )} Atom → •(Exp) , {$, +, ∗, )}
Atom → •Id , {$, +, ∗, )}
Atom Atom
Atom → •(Exp) , {$, +, ∗, )} ( S [4]

Prod Id Prod → Atom• , {$, +, ∗, )} Id


S [7]

Exp → Prod• , {$, +, )}


Prod → Prod•∗Atom , {$, +, ∗, )} S [5] (
Atom → Id• , {$, +, ∗, )}
∗ Id
S [8]

Prod → Prod∗•Atom , {$, +, ∗, )} S [6]


Prod
Atom → •Id , {$, +, ∗, )} Atom → (•Exp) , {$, +, ∗, )}
Atom → •(Exp) , {$, +, ∗, )} Exp → •Exp + Prod , {), +}
(
Exp → •Prod , {), +}
Atom
S [9] Prod → •Prod ∗ Atom , {), +, ∗}
Prod → Prod ∗ Atom• , {$, +, ∗, )} Prod → •Atom , {), +, ∗}
Atom → •Id , {), +, ∗}
Exp Atom → •(Exp) , {), +, ∗}
S [11]
∗ Atom Id ( Prod
Atom → (Exp•) , {$, +, ∗, )}
+
Exp → Exp•+Prod , {), +} S [3] S [4] S [5] S [6] S [7]

S [10]
)
S [12] Exp → Exp + Prod• , {$, +, )}
Atom → (Exp)• , {$, +, ∗, )} Prod → Prod•∗Atom , {$, +, ∗, )}

Figure 6.16: The LALR(k) CFSM for the


grammar for arithmetic expressions.
B OT T O M - U P PA R S E R S 195

Example 6.26. Let us close this discussion by building the LALR(k) parser
for our running example. We start by building the LALR(k) CFSM. It is
given in Figure 6.16. We have purportedly chosen to keep the same pre-
sentation as in Figure 6.12, to allow the reader to compare them.
Then, the action table computed from this CFSM is given in Table 6.4.

Table 6.4: The LALR(1) action table for the


M + ∗ Id ( ) $ ε
simple grammar of expressions.
S [0] S S
S [1] S
S [2] A
S [3] S S
S [4] 5 5 5 5
S [5] 6 6 6 6
S [6] S S
S [7] 3 S 3 3
S [8] S S
S [9] 4 4 4 4
S [10] 2 S 2 2
S [11] S S
S [12] 6 6 6 6

6.6.2 Some remarks on the LALR(k) parser

We close this section on LALR(k) parsers by two remarks. First of all, since
our goal was to obtain a parser which is more efficient than LR(0) but more
compact than LR(k), we should ask ourselves whether it is indeed the case.
As a matter of fact, it is easy to see that the size (in terms of number of
states) of the LALR(k) CFSM is the same as the size of the LR(0) CFSM.
196 I N T R O D U C T I O N T O L A N G UA G E T H E O RY A N D C O M P I L I N G

Proposition 6.4. For all CFG, the LALR(k) CFSM has the same number of
states than the LR(0) CFSM.

Proof. This property stems directly from the definition of LALR(k) states.
Let us first consider the states of the LR(k) CFSM, from which the states of
the LALR(k) CFSM are extracted. We can observe that the set of hearts of
the LR(k) CFSM is exactly the set of LR(0) states. This can be seen by com-
paring the respective effects of the closure operation in the cases of LR(0)
(see Algorithm 9) and LR(k) (Algorithm 14). More precisely, assume that,
when applied to an item A→α•β , the LR(0) closure produces A 1 →α1 •β1
;. . . ; A n →αn •βn . Then, when applied to an item A→α•β, u with the
same heart, the LR(k) closure will produce a set of items A 1 →α1 •β1 u 1 ;. . . ;
A n →αn •βn u n with the same heart as well. Thus, the set of hearts of the
LR(k) CFSM is exactly the set of LR(0) states. However, the set of LALR(k)
states has the same size as the set of LR(k) hearts, by construction. Hence
our proposition.

Moreover, since the SLR(1) parser is essentially the LR(0) parser with
some look-ahead, we also have that:

Corollary 6.5. For all CFG, the LALR(k) parser has the same number of
states as the SLR(1) parser.

So, from these two results, it seems that we have managed to obtain
a pretty good reduction in the number of states that the LALR(k) parser
must use. Now, the question of the usefulness of LALR(k) remains open,
as we haven’t shown yet that there are some grammars that can be parsed
with LALR(k) but not LR(0) or SLR(1). We will address this in more details
in Section 6.6. Observe, however that, in the example we have discussed
above, the obtained LALR(1) parser happens to be the same as the SLR(1)
parser, but this is not always the case.

Finally, we observe that the technique we have described here to build


an LALR(k) parser is pretty inefficient since it requires to first build the
LR(k) parser, then merge states. More efficient techniques, that perform
this merge on-the-fly exist18 , but we will not discuss them here. The de- 18
T. Anderson, J. Eve, and J. Horning. Ef-
ficient LR(1) parsers. Acta Informatica, 2:
scription of such a technique can be found in Section 4.7.5 of the ‘Dragon
2–39, 1973. D O I : 10.1007/BF00571461
Book19 ’. 19
A. Aho, M. Lam, R. Sethi, and Ullman
J. Compilers: Principles, Techniques, &
Tools. Addison-Wesley series in computer
6.7 The bottom-up hierarchy of grammars science. Pearson/Addison Wesley, 2007

Now that we have such a wide range of (bottom-up) parsers at our dis-
posal, one might wonder which one to chose from? Clearly, increasing the
size of the look-ahead increases the number of grammars that a parser can
recognise deterministically; but we also know that such an increase has a
cost in the complexity of the parser. So, it seems that choosing a parser
will be a trade-off between its expressive power and its complexity.
As we did in the case of LL(k) grammars (see Section 5.4.2 and Sec-
tion 5.4.3), we will now establish bottom-up hierarchies of grammars and
languages. We will first compare the families of grammars that can be
parsed deterministically with the different bottom-up techniques we have
B OT T O M - U P PA R S E R S 197

seen, i.e., establish a syntactic hierarchy. However, the fact that a gram-
mar G is, for instance, LR(k + 1) but not LR(k) does not necessarily mean
that L(G) cannot be recognised by an LR(k) parser, because there might be
an LR(k) grammar G ′ with the same language. . . Thus, in Section 6.8, we
will be comparing the families of languages that can be recognised by the
different parsers that we have at our disposal, i.e. to establish a seman-
tic hierarchy. Finally, in Section 6.9 we will compare the top-down and
bottom-up hierarchies.
One could discuss whether the
We will start this discussion with the syntactic comparison. We have al- LR(k) and LL(k) classes are truly
syntactic. Clearly, the definition of
ready defined LL(k), strong LL(k) and LR(k) grammars (see Definition 5.10, strong LL(k) is purely syntactic since it
Definition 5.12 and Definition 6.19). We can further define SLR(1) and concerns the rules of the grammar. The
definitions of LL(k) and LR(k) are more
LALR(k) grammars:
semantic since they constrain the deriva-
Definition 6.27 (SLR(1) and LALR(k) grammars). A CFG is SLR(1) iff the tions that the grammar generates. How-
ever, those definition do no constraint the
SLR(1) parser generated for this grammar is deterministic. A CFG is LALR(k) ultimate semantic object that a grammar
iff the LALR(k) parser generated for this grammar is deterministic. M generates, i.e. its language. So, it is fair
to say that comparing classes of grammars
Now, let us compare these different families of grammars. One can es- (which are syntactic objects) is a syntac-
tic comparison, while comparing the lan-
tablish the following inclusions:
guages that these grammars generate is se-
mantic.
Theorem 6.6. The following (strict) inclusions hold:
We will not venture further into this
(pseudo-)philosophical discussion. . . Let
LR(0) ⊊ SLR(1) ⊊ LALR(1) ⊊ LR(1) ⊊ LR(2) ⊊ · · · ⊊ LR(k) ⊊ · · ·
us just quote this beautiful piece of coffee-
machine wisdom that we have overheard
Proof. We look at all these inclusions one after the other: during the break of a scientific conference:
‘You know, one man’s semantic is another
• LR(0) ⊊ SLR(1). We know that all LR(0) grammars are SLR(1) grammars, man’s syntax’. . .
since SLR(1) is an extension of LR(0) with a look-ahead. The strict in-
clusion stems from Example 6.10, which shows that the grammar in
Figure 6.7 is SLR(1) but not LR(0).

• SLR(1) ⊊ LALR(1). The fact that SLR(1) ⊆ LALR(1) stems from the def-
initions of SLR(1) and LALR(1). Both have states with the same hearts,
which are the LR(0) states by construction, but the contexts are more
precise in the case of LALR(1) since they are unions of local Follows. (1) S → Aa
Hence, SLR(1) does not offer more opportunities to lift conflicts than (2) → b Ac
(3) → dc
LALR(1), so all SLR(1) grammars are also LALR(1). (4) A → d
To separate strictly LALR(1) and SLR(1), we consider the grammar in Figure 6.17: An LALR(1) grammar that is
not SLR(1) and not LL(1) but LL(2).
Figure 6.17 (adapted from exercise 20, Chapter 3.5 of S E I D L , W H I L -
HELM and H A C K ’s book20 ). It is easy to check that it is LALR(1) but not 20
Reinhard Wilhelm, Helmut Seidl, and Se-
bastian Hack. Compiler Design, Syntactic
SLR(1). Indeed, the state that is reached after reading d in the LR(0)
and Semantic Analysis. Springer-Verlag,
CFSM is: 2013. ISBN 978-3-642-17539-8. DOI:
10.1007/978-3-642-17540-4
S → d•c
A → d•

and SLR(1) is not able to lift this Shift/Reduce conflict since c ∈ Follow(A).
However, in the LALR(1) CFSM, this state becomes:

S → d•c , {ε}
A → d• , {a}
198 I N T R O D U C T I O N T O L A N G UA G E T H E O RY A N D C O M P I L I N G

which lifts the conflict (and no other conflicts exist in the LALR(1) au-
tomaton).

• LALR(1) ⊊ LR(1). The fact that LALR(1) ⊆ LR(1) stems from the con-
struction of LALR(1): if a grammar can be parsed deterministically by
an LALR(k) parser, it will also be the case by an LR(k) parser since the
latter is a refinement of the former. To prove that the inclusion is strict,
all we need is to exhibit an example of grammar that is LR(1) but not
LALR(1). Such a grammar is given in Figure 6.18 and can be found as
Example 4.58 in the ‘Dragon book21 ’. 21
A. Aho, M. Lam, R. Sethi, and Ullman
J. Compilers: Principles, Techniques, &
In particular, building the LR(1) CFSM creates the two following states: Tools. Addison-Wesley series in computer
science. Pearson/Addison Wesley, 2007

A → c• , {d} A → c• , {e} (1) S → a Ad


B → c• , {e} B → c• , {d} (2) → bB d
(3) → aB e
(4) → b Ae
(5) A → c
(6) B → c
which have no conflict since the look-ahead always allows one to de-
Figure 6.18: An LR(1) grammar that is not
cide which Reduce to perform. However, they have the same heart, and LALR(1) and not LL(1).
their merge in the LALR(1) CFSM will yield a state with a Reduce/Re-
duce conflict:

A → c• , {d, e}
B → c• , {d, e}

• For all k: LR(k) ⊊ LR(k + 1). The fact that LR(k) ⊆ LR(k + 1) is trivial,
since having a longer look-ahead allows one to parse more grammars
deterministically. Then, for all k ≥ 1, the grammar in Figure 6.19 is (1) S → A bk c
LR(k + 1) but not LR(k). Indeed, after reading a, its LR(k) CFSM reaches (2) → B bk d
(3) A → a
the state:
(4) B → a
Figure 6.19: An LR(k + 1) grammar that is
A → a• , {bk } not LR(k) and not LL(k + 1).

B → a• , {bk }

which clearly contains a Reduce/Reduce conflict. This conflict is lifted


with an additional symbol of look-ahead (and it is easy to check that
the other states contain no conflit):

A → a• , {bk c}
B → a• , {bk d}

Finally, we address a final question regarding the hierarchy of LR(k) Recall that an ambiguous CFG is
one that admits at least two dif-
languages: since the sequence LR(0) , LR(1) , . . . , LR(k) , . . . is growing, one ferent parse trees on a given word,
could think that each CFG can fit snuggly in one of those classes. It turns hence, at least two different rightmost
derivations on this word.
out that this is not the case. Clearly, any ambiguous grammar will not be
B OT T O M - U P PA R S E R S 199

LR(k) since the parser cannot decide which parse tree to use to produce
the rightmost derivation, hence it will necessarily be non-deterministic.
However, there are unambiguous CFG which are not LR(k) for any k. Here
is an example:

Example 6.28. This example has been proposed by K N U T H in his original


paper22 on LR(k), it is given in Figure 6.20. 22
Donald E. Knuth. On the translation of
languages from left to right. Information
To be convinced that this grammar is not LR(k) for any k, we need to
and Computation (formerly known as In-
consider its accepted language, which is {ab2n+1 c | n ≥ 0}. That is, the formation and Control), 8:607–639, 1965.
grammar accepts all the words of the form abb · · · bc where the number of D O I : 10.1016/S0019-9958(65)90426-2

b’s is odd. To do so, the grammar relies on the recursive rule A→b A b , as (1) S → a Ac
shown in the following derivation: (2) A → b Ab
(3) → b
S ⇒ a A c ⇒ ab A bc ⇒ abb A bbc ⇒ abbb A bbbc ⇒ abbbbbc Figure 6.20: A grammar which is not LR(k)
for any k.
So, the crucial difficulty here will be for the parser to decide when to Re-
duce the ‘middle b’ (the one that is underlined) as A.
Assume the word to be accepted is ab2n+1 c for some n. Then, when this
particular Reduce must happens, an LR(k) parser ‘sees’ the look-ahead
Firstk (bn c), and needs to ‘see’ the c in order to realise that it is time to
Reduce the b into A. In other words, if the look-ahead is long enough (i.e.
k ≥ n + 1), the parser will ‘see’ the c and can decide to perform the Reduce
at the right moment on that particular word. Otherwise, the look-ahead
will contain only b’s which does not allow the parser to decide whether to
Reduce or to keep shifting b’s. Unfortunately, the size of the look-ahead
is fixed, and there will always be a word which is too long for this look-
ahead to be sufficient. More precisely, if the size of the look-ahead is k,
the parser will not have enough information to parse deterministically all
words ab2n+1 c with k ≤ n. So the grammar cannot be LR(k) for any k (and
it is clearly unambiguous). M

6.8 The bottom-up hierarchy of languages

The next step in our comparison is to compare the bottom-up parsers


from the point of view of the families of languages that they accept. To
this end, we start with the following definition:

Definition 6.29 (LR(k), SLR(1) and LALR(k) languages). A language L is


LR(k) (SLR(1), LALR(k)) for some k ≥ 0 iff there is an LR(k) grammar (re-
spectively an SLR(1) or an LALR(k) grammar) G s.t. L(G) = L. M

We denote by ‘LR(k) lang.’ (‘SLR(1) lang.’ and ‘LALR(k) lang.’) the


classes of LR(k) languages (respectively, SLR(1) languages and LALR(k)
languages). Clearly:

LR(0) lang. ⊆ SLR(1) lang. ⊆ LALR(1) lang. ⊆ LR(1) lang. ⊆ · · · LR(k) lang. ⊆ LR(K + 1) lang. ⊆ · · ·

This is a direct consequence of the hierarchy of bottom-up grammars (The-


orem 6.6) and of Definition 6.29. Let us now check whether these inclu-
sions are strict or not.

Most of the results of this section can be found in the seminal paper on
LR(k) by Donald K N U T H 23 . We start by observing, as K N U T H does, that 23
Donald E. Knuth. On the translation of
languages from left to right. Information
and Computation (formerly known as In-
formation and Control), 8:607–639, 1965.
D O I : 10.1016/S0019-9958(65)90426-2
200 I N T R O D U C T I O N T O L A N G UA G E T H E O RY A N D C O M P I L I N G

there is at least one language which is an LR(1) language but not an LR(0)
language:
Proposition 6.7. The language L = {a, ε} is an LR(1) language but not an
LR(0) language.
Proof. The fact that L is an LR(1) language can be established by finding
an LR(1) grammar for it. The CFG that contains only the rules S → a and
S → ε is such a grammar. In the initial state of the CFSM, the character of
look-ahead allows the parser to decide between: (i) shifting the a from the
input when it is in the look-ahead; or (ii) reducing the S immediately when
the look-ahead is empty.
For similar reasons, we can conclude that the language L is not LR(0).
This is a bit more difficult, from the conceptual point of view, since we
need to make a reasoning on all possible grammars that can define L and
show that none of them can be LR(0). Assume we a have a grammar G
s.t. L(G) = L, and assume we build the LR(0) CFSM of this grammar. Recall
that the CFSM is a deterministic finite state automaton that reads all the vi-
able prefixes of the grammar in order to identify the handles. In the initial
state, the CFSM has read ε which is a handle (as ε is accepted); so, in the
initial state, Accept must be one action that the parser can perform (oth-
erwise, the parser would shift and miss the handle that allows it to accept
ε ∈ L). However, a Shift must also be one of the actions of the parser, in or-
der to obtain the handle a. So, there is necessarily a Shift/Accept conflict in
the initial state of the LR(0) CFSM, that only one character of look-ahead
can lift, as we have seen.
The fact that DPDA pop up once
So, clearly, LR(0) ⊂ LR(1), this inclusion is strict. But, surprisingly, such again when establishing the ex-
pressive powers of parsers should
a strict inclusion does not carry out to the next levels of the hierarchy! not be a total surprise to the attentive
Indeed, K N U T H , again, proves the following beautiful result, linking the reader (!) Indeed, the whole purpose of
studying parsers was to find a way to parse
LR(k) grammars for k ≥ 1 and the deterministic PDA that we have studied
(thus, by means of a PDA) CFL in a deter-
in Section 4.2.2. ministic way. In some sense, this theo-
rem tells us that we have reached our Graal
Theorem 6.8. If L is the language of a DPDA, then there is an LR(1) gram- with the LR(k) grammars!
mar G that recognises it, i.e. L(G) = L
We will refer the reader to the previously mentioned paper24 for the 24
Donald E. Knuth. On the translation of
languages from left to right. Information
proof of this theorem, and rather discuss here its implications.
and Computation (formerly known as In-
First, let us recall that we denote by DCFL the class of languages ac- formation and Control), 8:607–639, 1965.
cepted by deterministic PDA (DPDA). So, we can rephrase the theorem D O I : 10.1016/S0019-9958(65)90426-2

above by writing:
DCFL ⊆ LR(1) lang.
However, we already know that all the LR(k) classes of languages can be
recognised by a deterministic PDA25 , by definition! So, we conclude now 25
Actually, by a deterministic PDA with
look-ahead, but we have seen in Sec-
that:
tion 5.2.1 that the look-ahead is essentially
for all k ≥ 1 : LR(k) lang. ⊆ DCFL. syntactic sugar and can be incorporated in
a regular (deterministic) PDA
Putting everything together, we obtain:

DCFL ⊆ LR(1) lang. ⊆ LR(2) lang. ⊆ · · · ⊆ LR(k) lang. ⊆ DCFL,


which implies that:

for all k ≥ 1 : LR(k) lang. = DCFL.


B OT T O M - U P PA R S E R S 201

6.8.1 Case of the languages suffixed by $

Is there a way to extend this result to LR(0)? We have already argued in


Proposition 6.7 that LR(0) ⊂ LR(1). However, observe that the example we
have used to separate LR(0) and LR(1) can be slightly modified to yield a
‘very similar’ language that is now LR(0): one just needs to concatenate
the language with a fresh special character, such as $. Indeed, the lan-
guage {a$, $} can now be checked to be LR(0): in the initial state of the
CFSM, a Shift is the only possible action. Depending whether an a or a $ is
shifted, we will end in two different states, where the proper action (Shift
after reading an a, Reduce after reading a $) can be performed. Here, we have chosen to use the let-
ter $ to end words in the language,
It turns out that this result can be generalised when we restrict our-
but any fixed symbol that does not
selves to languages of the form L · {$}, i.e. languages where all the words occur anywhere in words can be used.
end by the same fixed letter. This is again a result due to K N U T H 26 : 26
Donald E. Knuth. On the translation of
languages from left to right. Information
Theorem 6.9. For all languages L in DCFL ∩ Σ∗ · {$} (where $ ̸∈ Σ): there is and Computation (formerly known as In-
formation and Control), 8:607–639, 1965.
an LR(0) grammar to define L. D O I : 10.1016/S0019-9958(65)90426-2

As a consequence, if we restrict ourselves to languages where all words


end by a given character, we can state that LR(k) lang. = DCFL for all k ≥ 0.
In practice, when considering compiler construction, the restriction
that ‘all words need to end with $’ is not really a problem. Indeed, one
can interpret the $ as the ‘end of file’ special symbol that is usually added
at the end of files by operating systems, for example.

6.8.2 Practical interest of LR(k)

Since ‘all LR(k) parsers can accept the same families of languages‘, one can
reasonably wonder why it was necessary to define them all? The motiva-
tion is of course purely practical: even if we can recognise all languages of
DPDA, even with an LR(0) parser (under the light restriction we have seen
above), this does not mean that it is always easy or desirable27 to obtain an 27
After all, grammars are also designed to
specify the syntax of computer languages
LR(0) grammar for a given language. So, we need parsers and parser gen-
to human beings who have to program. So
erators that can exploit broad classes of grammars. The experience shows they should be readable.
that many grammars that one uses in practice fall into the LALR(1) class,
which explains why tools such as yacc28 , bison29 and cup30 target this 28
Stephen C. Johnson. Yacc: Yet an-
other compiler-compiler. Techni-
particular class of grammars.
cal report, AT&T Bell Laboratories,
1975. Readable online at http:
//dinosaur.compilertools.net/yacc/
6.9 Comparison of the top-down and bottom-up hierarchies 29
Gnu bison. https://www.gnu.org/
software/bison/. Online: accessed on
The next step, in our discussion of the different families of context-free December, 29th, 2015
30
grammars and languages that can be parsed deterministically, is to com- Cup: Construction of useful parsers.
http://www2.cs.tum.edu/projects/
pare the bottom-up hierarchies we have established in the two previous cup/. Online: accessed on December,
sections, to the results of Section 5.4.2 and Section 5.4.3 that concern the 29th, 2015

top-down hierarchy.

6.9.1 Comparing the hierarchies of grammars

We start by comparing the top-down and bottom-up hierarchies of gram-


mars. Recall from Theorem 5.5 that the families of LL(k) grammars form a
strict hierarchy, i.e.: LL(0) ⊊ LL(1) ⊊ LL(2) ⊊ · · · ⊊ LL(k) ⊊ LL(k + 1) ⊊ · · · .
202 I N T R O D U C T I O N T O L A N G UA G E T H E O RY A N D C O M P I L I N G

Recall also that, from Theorem 6.6, the same holds the families of LR(k)
grammars, i.e.: LR(0) ⊊ SLR(1) ⊊ LALR(1) ⊊ LR(1) ⊊ LR(2) ⊊ · · · ⊊ LR(k) ⊊
LR(k + 1) ⊊ · · ·
The main result for this comparison consists in checking that all LL(k)
grammars are LR(k), while there are LR(k) grammars that are not LL(k). In
other words: LL(k) ⊊ LR(k) for all k ≥ 0. An history of attempts to establish
this result together with a comprehensive proof can be found in a 1982
paper by Anton N I J H O LT 31 . 31
Anton Nijholt. On the relationship
between LL(k) and LR(k) grammars. In-
Theorem 6.10. For all k ≥ 0: LL(k) ⊊ LR(k). formation Processing Letters, 15(3):97–101,
1982. D O I : 10.1016/0020-0190(82)90038-
2. URL https://www.researchgate.
Proof. For the proof that LL(k) ⊆ LR(k), we refer the reader to the paper by
net/publication/222460902_On_the_
N I J H O LT . The argument is rather short and consists in showing that: “if relationship_between_the_LLk_and_
the leftmost derivations of a grammar satisfy the LL(k) conditions, then LRk_grammars

the rightmost derivations satisfy the LR(k) conditions” (see page 98 of the
article).
To prove that the inclusion is strict, one can rely on the grammar of
Figure 6.19, which is in LR(k + 1) but not in LL(k + 1) (for all k ≥ 0) since
Firstk+1 A bk c = Firstk+1 B bk d = abk .
¡ ¢ ¡ ¢

This results can be seen as the ground behind the folklore assertion that
‘bottom-up parsers are more powerful than top-down parsers’ (although
we will provide other arguments to support this in Section 6.9.2). In other
words, when we have ascertained that k characters of look-ahead are suf-
ficient to parse a grammar top-down, we are sure that k characters of look-
ahead will be sufficient for a bottom-up parser as well.

With this result in mind, we can summarise the relationships between


the different (syntactic) classes of grammars in Figure 6.21. On this figure,
we have placed several points that correspond to grammars that we are
about to describe now. These examples are interesting because they pro-
vide a nice insight into the characteristics that make a grammar belong to
a particular class.
We also refer to the LL and the LR hierarchy to denote the classes of
grammars that belong respectively to LL(k) and LR(k) for some k. In other
words, the LL hierarchy refers to the infinite union:
[
LL(k) ,
k≥0

and, symmetrically, the LR hierarchy refers to:


[
LR(k) .
k≥0

In order to discuss all these examples, we first introduce two grammar


transformations.

1. The first transformation (henceforth called T1) turns a grammar G 1


which is assumed to be LR(k 1 ) (for some k 1 ≥ 0) into a grammar G 2
that is still LR(k 1 ) but not LL(k 2 ) for a chosen k 2 ≥ 0. Let G 1 be an Observe that this argument also
applies when G 1 is SLR(1) or
LR(k 1 ) grammar, and let a and b be two fresh terminals (i.e., terminals
LALR(1). Further observe that
that do not appear in G 1 ). Then, let k 2 be a natural number (possibly this transformation does not preserve the
equal to k 1 ) and let G 2 be the grammar obtained from G 1 by adding language of G 1 , so it can only be used
to prove the existence of a grammar that
belongs to the same bottom-up class
than G 1 but is not LL(k 2 ) for a chosen k 2 .
B OT T O M - U P PA R S E R S 203

(19) Fig. 6.20


LR hierarchy
LL hierarchy

LR(k + 1)
(17) Fig. 6.19

LL(k + 1) (18) Fig. 6.19 + T2


(4) Fig. 5.6

LR(k)
LL(k)

LR(1)
(12) (16)
LL(1)
(8) Fig. 6.18 Fig. 6.18 + T1 Fig. 6.18 + T2
(3) Fig. 6.23
• • •

LALR(1)
(11) (15)
(2) Fig. 6.22 (7) Fig. 6.17
Fig. 6.17 + T1 Fig. 6.17 + T2
• •
• •
SLR(1)
(10)
(6) Fig. 6.24 (14)
(1) Fig. 5.6 Fig. 6.24 + T1
• Fig. 6.24 + T2
for k = 0 •
• •
LR(0)

(0) S → a$ (9)
(5) Fig. 6.2
• Fig. 6.2 + T1 (13)

Fig. 6.2 + T2

Figure 6.21: Comparison of different (syn-


tactic) classes of grammars. The top-down
(LL) hierarchy is in bold, the other classes
are from the bottom-up (LR) hierarchy. All
subsets are non-empty.
204 I N T R O D U C T I O N T O L A N G UA G E T H E O RY A N D C O M P I L I N G

rules S→ak2 b as well as S→a(k2 +1) b . Clearly, those rules imply that
G 2 ̸∈ LL(k 2 ). However, those new rules do not change the LR(k 1 ) char-
acter of the grammar, because a and b are fresh symbols. Indeed, if we
build the LR(0) CFSM for grammar G 2 , we will obtain, in comparaison
with the LR(0) CFSM of G 1 :

• k 2 extra states:

S → •ak2 b, S → •a(k2 +1) b , S → a • a(k2 −1) b, S → a • a(k2 ) b , . . . , S → ak2 • bS → ak2 • ab ,


© ª© ª © ª

in which the parser will shift the a’s to the stack; and
• three extra states S → ak2 b • , S → a(k2 +1) • b and S → a(k2 +1) b •
© ª © ª © ª

where there are no conflicts, even with a look-ahead of size 0.

2. The second transformation (henceforth called T2) is similar to the


previous one but allows one to make sure that the resulting grammar G 2
is not LL(k) for any k ≥ 1, while retaining its properties in the bottom-
up hierarchy. Again, let G 1 be an LR(k 1 ) grammar, and let G 2 be the
grammar obtained by adding to G 1 the rules S→A , A→A a and A→a
(where A is a fresh variable and a is a fresh terminal). Clearly, G 2 is not
LL(k) for any k since it is now left-recursive. However, one can check
here as well that this transformation does not introduce conflicts in the
LR(0) CFSM of the grammar, so G 2 retains the same bottom-up proper-
ties than G 1 .

Now, we can discuss the grammars that appear as elements of the sets
in Figure 6.21:

(0) A simple grammar containing only the rule S→a$ for example is
clearly both LL(1) and LR(0). It is actually LL(0) as well.

(1) The grammar from Figure 5.6, when we let k = 0 is an LL(1) grammar
that can be checked to be SLR(1) as well. It is however not LR(0) since
the two first rules introduce a Shift/Reduce conflict in the initial state
of the CFSM.

(2) The grammar in Figure 6.22 is an LL(1) grammar that can be checked
to be LALR(1) as well. It is however not SLR(1) since:
(1) S → Aa Ab
(2) → B bB a
Follow(A) = Follow(B ) = {a, b}.
(3) A → ε
(4) B → ε
Figure 6.22: An grammar which is LL(1)
(3) The grammar in Figure 6.23 can be checked to be LL(1) but not LALR(1).
and LALR(1) but not SLR(1).
This grammar is adapted from an example of book of Andrew A P -
P E L 32 . The grammar is not LALR(1) because the LR(1) CFSM contains 32
Andrew W. Appel. Modern Compiler Im-
plementation in ML. Cambridge Univer-
the two states:
sity Press, 1998. ISBN 0-521-58274-1

E → A• , {b}
F → A• , {c}

and
B OT T O M - U P PA R S E R S 205

E → A• , {c}
F → A• , {b} (1) S → aX
(2) → Eb
(3) → Fc
(4) X → Ec
(5) → Fb
whose merge in the LALR(1) will create a Reduce/Reduce conflict. (6) E → A
(7) F → A
(4) In general, grammar G k as found in Figure 5.6 (with parameter set to (8) A → ε
k + 1), gives a grammar that is LL(k + 1) but not LR(k). Figure 6.23: An grammar which is LL(1)
and not LALR(1)
(5) The grammar we have used at the beginning of the chapter (in Exam-
ple 6.4) is LR(0) but not LL(1) as we have already seen. It is however
still in the LL hierarchy, as it is LL(2).
(1) S → ab
(6) The grammar in Figure 6.24 is a simple example of grammar that is (2) S → ac
not LL(1) and not LR(0), but which is LL(2) and SLR(1). (3) S → a
Figure 6.24: A grammar that is LL(2) and
(7) The grammar in Figure 6.17 has already been shown to be LALR(1) SLR(1) but neither LL(1) nor LR(0).

but not SLR(1). We can also observe that it is not LL(1) because d ∈
First(A a) ∩ First(dc). However, it is LL(2).

(8) The grammar in Figure 6.18 has also been discussed already, and we
know that it is LR(1) but not LALR(1). We can check that it is not LL(1)
nor LL(2): for example, a look-ahead of ac does not allow one to chose
between S→a A d and S→aB e since both A and B produce c. How-
ever, it is still in the LL hierarchy, as it is LL(3).

Alternatively, the grammar in Figure 6.23 can be modified using tran-


formation T1 above to obtain a grammar that is LR(1) and LL(2) but
not LALR(1)

(9)–(12) The grammars which have been used in the points (5) to (8) above
can be modified using transformation T1 in order to make them LL(k)
but not LL(k + 1) for any k.

(13)–(16) Similarly, transformation T2 can be used to make all these gram-


mars be outside of the LL hierarchy.

(17) We have already seen that the grammar in Figure 6.19 is LR(k + 1) but
not LR(k). One can also check that this grammar is not LL(k + 1), since
Firstk+1 A bk c = Firstk+1 B bk d = abk . It is however, LL(k + 2), so
¡ ¢ ¡ ¢ © ª

still in the LL hierarchy.

(18) Then, applying transformation T2 to the grammar of Figure 6.19 al-


lows one to obtain a grammar that is in LR(k + 1) \ LR(k) but not LL(k)
for any k. Don’t forget that a grammar which
is ambiguous cannot be LL(k) nor
(19) Finally, we have already shown that the grammar in Figure 6.20 is not LR(k) for any k. However, that
does not necessarily imply that all non-
LR(k) for any k, so outside the LR hierarchy (while still non-ambiguous). ambiguous CFG fall into one of those cat-
egories.

6.9.2 Comparing the hierarchies of languages

Let us now compare the top-down and bottom-up hierarchies of languages.


This is will turn out to be much easier than the comparison of grammar
206 I N T R O D U C T I O N T O L A N G UA G E T H E O RY A N D C O M P I L I N G

classes, since we already know that the LR(k) hierarchy of languages ‘col-
lapses’ to correspond to the languages of DPDA. Associating this result
with the hierarchy we have obtained in Section 5.4.3, we obtain:

LL(1) lang. ⊊ LL(1) lang. ⊊ · · · ⊊ LL(k) lang. ⊊ · · · ⊊ LR(1) lang. = LR(2) lang. = · · · = LR(k) lang. = · · · = DCFL
B OT T O M - U P PA R S E R S 207

6.10 Exercises

6.10.1 LR(0)
The algorithms to build a CFSM are
Exercise 6.1. Build the CFSM corresponding to the following grammar: found in Section 6.2.

(1) S′ → S$
(2) S → aC d
(3) → bD
(4) → Cf
(5) C → eD
(6) → Fg
(7) → CF
(8) F → z
(9) D → y
The building techniques for an
Exercise 6.2. Give the action table of the LR(0) parser on the grammar of LR(0) are in Section 6.3.
the previous exercise.
See Section 6.3 again.
Exercise 6.3. Simulate the run of the LR(0) parser for the grammar of the
previous exercises, on the word aeyzzd$.

6.10.2 SLR(1)
SLR(1) parsers are covered in Sec-
Exercise 6.4. Build the SLR(1) parser for the following grammar (i.e., build tion 6.4.
the appropriate CFSM and give the SLR(1) action table):
(1) S′ → S$
(2) S → A
(3) A → bB
(4) → a
(5) B → cC
(6) → cC e
(7) C → d Af
Is the above grammar LR(0)? Justify your answer.

6.10.3 LR(k)
Section 6.5 is devoted to LR(k)
Exercise 6.5. Build the LR(1) parser for the following grammar (i.e., build parsers.
the appropriate CFSM and give the LR(1) action table):
(1) S′ → S$
(2) S → S aS b
(3) → c
(4) → ε
Is this grammar LR(0)? Is it SLR(1) 1? Justify your answers.

Exercise 6.6. Simulate the run of the parser you built at the previous ex-
ercise on the word abacb.

6.10.4 LALR(1)
The definition of LALR(1) parsers
Exercise 6.7. Build the LALR(1) parser for the grammar of exercise 6.5, in Section 6.6 shows how to build
using the LR(1) parser you have built for the same exercise. them from LR(1) parsers.
Since LALR(1) parsers can be built
Exercise 6.8. Find a grammar which is LR(1) but not LALR(1). from LR(1) parsers, try to come up
with states of an LR(1) parser that
would be generate a conflict when the
LALR(1) parser is built, and infer a gram-
mar from that.
A Some reminders of mathematics

This section is a quick reminder of some mathematical concepts that are


pervasive in these lecture notes. It is intended as a refresher, not as an
in-depth explanation from which one could study from. Readers are ad-
vised to read this section first to make sure they are familiar with all the
concepts. If not, they are advised to refer to a good textbook on discrete
mathematics, such as the free online book1 by D O E R and L E VA S S E U R , for 1
A. Doer and K. Levasseur. Applied Dis-
crete Structures. 2012. Available online
example.
with supplementary material at: https://
discretemath.org/

A.1 Greek letters


Did you know where the word ‘al-
Although not strictly mathematical content, we believe that a reminder of phabet’ comes from? If not, have a
the names of the greek letters (as they are used everywhere in these notes) look at the two first lines of the ta-
ble. . .
would be useful. This alphabet is given in Table A.1.

Table A.1: Greek letters.


Capital Lowercase Name

A α alpha
B β beta
Γ γ gamma
∆ δ delta
E ε epsilon
Z ζ zeta
H η eta
Θ θ theta
I ι iota
K κ kappa
Λ λ lambda
M µ mu
N ν nu
Ξ ξ xi
O o omicron
Π π pi
P ρ rho
Σ σ sigma
T τ tau
Y υ upsilon
Φ ϕ phi
X χ chi
Ψ ψ psi
Ω ω omega
210 I N T R O D U C T I O N T O L A N G UA G E T H E O RY A N D C O M P I L I N G

One should pay attention not to


The names we are used in the table are the ‘ancient greek’ names. In confuse ε (epsilon) with ξ (xi). One
should also remark that the alter-
contemporary Greece some of these letters are called differently. For ex- nate glyph ϵ for epsilon is not the same as
ample, µ and ν are now called ‘mi’ and ‘ni’ respectively. Also, some lettres the mathematical symbol ∈ which means
‘belongs to’ (as in x ∈ X : the element x be-
can be written differently: Epsilon can be written ε (as in those notes) or ϵ.
longs to the set X ). Finally, the letters Φ
Phi can be written ϕ (as we do here) or φ. and φ (phi) should not be confused with
the empty set symbol ;.

A.2 Sets and relations

For the sake of completeness, we recall the very basic definitions about
sets, although we assume the reader must be pretty familiar with them.

A.2.1 Sets

Definition A.1 (Set (informal)). A set is a (finite or infinite) collection of


elements. M

We rely on the classical notation x ∈ X to say that x is an element of set


X . We also use A ⊆ B to denote that ‘set A is a subset of B ’, which means
that all elements of set A are also elements of B . We use A ⊂ B to denote
the fact that set A is a proper subset of B . This means that A ⊆ B and that
there is at least one element of B that is not in A (in other words A ̸= B ).
We sometimes use the alternate notation A ⊊ B to emphasise the fact that
A ̸= B .

Example A.2. We can denote the set S containing all natural numbers be-
tween 2 and 7 (included) as:

S = {2, 3, 4, 5, 6, 7}.

Here, we use an enumeration to list all the elements of the set. Although
we have chosen to enumerate the elements in increasing order, we could
have written:
S = {3, 6, 5, 2, 7, 4}

instead. Indeed, sets by themselves do not carry a notion of order.


Other sets of interest to us are the sets of natural numbers and of inte-
gers:

N = {0, 1, 2, 3, . . .}
Z = {. . . , −3, −2, −1, 0, 1, 2, 3, . . .}.

These sets, however, are examples of infinite sets and cannot be com-
pletely enumerated2 . Moreover, even some sets which are finite can be 2
In the sense that we cannot write all their
elements on a sheet of paper. Note that
more conveniently represented using the so-called set-builder notation.
there exists a mathematical notion of ‘enu-
For example, we could define the previous set S like so: merable set’ for which the natural and in-
teger numbers are enumerable
S = {x | 2 ≤ x ≤ 7}.

Similarly, the set of all even natural numbers can be expressed as:

{x ∈ N | x = 2.y for some y ∈ N}.

Finally, let us note that the elements of sets can be anything, including: Note how we use brackets () for
denoting pairs and curly brackets
{} for denoting sets, as is standard
practice.
S O M E R E M I N D E R S O F M AT H E M AT I C S 211

1. pairs. A pair is an ordered collection of two elements. For example,


(1, 2) is the pair where the first element is 1 and the second is 2. Pairs
are commonly used to denote coordinates in a plane for instance, and
clearly, one should not confuse the x and the y coordinates, so the order
matters. Then:
{(1, 2), (2, 3), (1, 3)}

and
{(x, y) | x ∈ N and y ∈ N and y = 2.x}

are two examples of sets containing pairs.

2. other sets.

A.2.2 Relations

Before we can introduce relations, we need the notion of cartesian prod-


uct:

Definition A.3 (Cartesian product). Given two sets A and B (which might
be equal), their cartesian product, denoted A × B is the set:

{(a, b) | a ∈ A and b ∈ B }

As can be seen from the definition, the cartesian product of A and B is


a set of pairs of elements from A and B .

Example A.4. The cartesian product of A = {1, 2, 3} and B = {a, b} is:


© ª
A × B = (1, a), (2, a), (3, a), (1, b), (2, b), (3, b) .

The cartesian product of Z by itself is the set of all possible integer co-
ordinates in a two-dimensional plane. Instead of writing Z × Z, we rather
write Z2 . M

We can now introduce the notion of (binary) relation:

Definition A.5 (Binary relation). Given two sets A and B , a binary relation
over A and B is a subset of A × B . M

Since we will only consider binary relations here, we will simply call
them relations. Put simply, a relation over A and B is a set of pairs (a, b)
where a ∈ A and b ∈ B . Let R be such a relation. Instead of writing (a, b) ∈
R, we will adopt the common shorthand aRb, as shown by the following
examples.

Example A.6. Let S = {1, 2, 3}. Then, the set:


© ª
(1, 1), (1, 2), (1, 3), (2, 2), (2, 3), (3, 3)

is a relation over S 2 (in this case, both sets A and B from the definition are
equal to S). It defines the ‘smaller than or equal to’ concept over S. So, we
can call this relation ‘≤’, and write (a, b) ∈≤ iff a is smaller than or equal to
b. With the shorthand notation, we write a ≤ b. M
212 I N T R O D U C T I O N T O L A N G UA G E T H E O RY A N D C O M P I L I N G

The previous example highligths a particular case of relation, which is


among the homogenous relations:

Definition A.7 (Homogenous relation). A relation R ⊆ A × B is homoge-


nous iff A = B . M

In other words, a relation is homogenous if it is over A 2 for some set A.

Properties of relations The notion of relation is, as we have seen very gen-
eral. It can be used to formalise several concepts. For instance, the previ-
ous example shows that the natural notion of order can be formalised as a
binary relation. In order to identify certain relations of interest, we need to
define some properties that relations can have. We start we two different
properties of relations in general.
First, we look at functional relations. As the name indicates, this prop-
erty captures the notion of function. Consider for example the function sin.
We know that, for all possible value of x, sin(x) is a unique value. However,
several values of x can be mapped by the function to the same value. For
example, sin π2 = sin 5π
¡ ¢ ¡ ¢
2 = 1. Now, if we see the function sin as a rela-
tion containing all the pairs (x, sin(x)), we know that: (i) there can’t be two
pairs (x, y 1 ) and (x, y 2 ) with y 1 ̸= y 2 , since both y 1 and y 2 must be equal
to sin(x), which is unique; however (ii) there can be several pairs (x 1 , y)
and (x 2 , y) with x 1 ̸= x 2 . For example π2 , 1 and 5π
¡ ¢ ¡ ¢
2 , 1 both exist in the
relation. This is captured in the following definition:

Definition A.8 (Functional relation). A relation R ⊆ X × Y over X and Y is


functional iff for all x ∈ X , for all y 1 , y 2 ∈ Y : xR y 1 and xR y 2 implies y 1 =
y2. M

Now, if we look closely at this definition, we remark that it captures the


notion of partial function. Indeed, the defintion says that, for all x ∈ X ,
there can be at most one pair of the form (x, y). When we ‘always want the
function to return something’, we need an additional condition:

Definition A.9 (Total relation). A relation R ⊆ X × Y over X and Y is total


iff there exists a pair (x, y) ∈ R for all x ∈ X . M

Then, a complete function can be defined:

Definition A.10 (Complete function). A function f : X → Y is complete (or


M
©¡ ¢¯ ª
total) iff the set x, f (x) ¯ x ∈ X is a functional and total relation.

Finally, the notion of injective relation can be defined symmetrically to


that of a functional relation:

Definition A.11 (Injective relation). A relation R ⊆ X × Y over X and Y is


injective iff for all y ∈ Y , for all x 1 , x 2 ∈ X : x 1 R y and x 2 R y implies x 1 =
x2 . M

Properties of homogenous relations Let us now focus on relevant proper-


ties of homogeneous relations, i.e. those relations that are subsets of A 2 for
some set A.

Definition A.12 (Properties of homogeneous relations). Let R ⊆ A 2 be a


relation over A 2 . Then, we say that:
S O M E R E M I N D E R S O F M AT H E M AT I C S 213

1. R is reflexive iff, for all a ∈ A: (a, a) ∈ R. That is, all elements are always
put in relation with themselves. Observe that a relation which is
not symmetric is not necessarily
antisymmetric and vice-versa. It is
2. R is symmetric iff, for all (a, b) ∈ R: we also have (b, a) ∈ R. That is, every
possible that (a, b) ∈ R and (b, a) ∈ R for
time a is put into relation with b through R, then b is also put in relation some pairs (a, b) but not all (hence, R is not
to a through R. symmetric), and that there exists at least
such a pair with a ̸= b (hence, it is not an-
tisymmetric).
3. R is antisymmetric iff, for all a, b ∈ A: if (a, b) ∈ R and (b, a) ∈ R, then a = Further observe that some authors use
b. In some sense, the antisymmetric property does not allo symmetry the word ‘total’ instead of ‘strongly con-
nected’. However, this seems to be dep-
to happen on distinct elements: every time we have (a, b) and (b, a) in
recated, and we have adopted the mod-
R, it must happens on the same elements, i.e. a = b. ern notion of ‘total relation’, see Defini-
tion A.2.2.
4. R is transitive iff, whenenver (a, b) ∈ R and (b, c) ∈ R: we also have
(a, c) ∈ R. This is the classical definition of transitivity.

5. R is strongly connected iff, for all a ∈ A, for all b ∈ B : either (a, b) ∈ R or


(b, a) ∈ R. That is, for all pairs of elements in A, R always put them into
relation one way or the order.

While these properties might sound very abstract, they are the basic
building blocks that allow one to define the classical concepts that we are
used to manipulate, like partial orders, orders or equivalences, as we are
about to see.

Partial orders and orders Let’s start with the classical notion of order.

Definition A.13 ((Partial) orders). A partial order is a transitive and anti-


symmetric homogeneous relation. An order (sometimes called total order)
is a partial order which is also reflexive and strongly connected. M

Example A.14. For example, the classical ordering relation ≤ on the inte-
gers is both an order and a partial order. Indeed, if x ≤ y and y ≤ z, then
x ≤ z, so ≤ is transitive. It is also antisymmetric, because if x ≤ y and y ≤ x,
it can only be that x = y. So, ≤ is a partial order. Moreover, x ≤ x for all in-
teger x, so ≤ is also reflexive. Finally, we can always compare two integers
through ≤, so it is also strongly connected. Hence, ≤ is indeed an order.
Now let’s lift the ≤ relation to pairs of integers: (x 1 , y 1 ) ≤ (x 2 , y 2 ) iff
x 1 ≤ x 2 and y 1 ≤ y 2 . In this case, we have elements that are incomparable.
As a concrete example, assume we need to buy a washing machine, and
that we rate the models according to their yearly energy consumption and
their price. So each machine is characterised by (e, p). Assume we have
a machine that consumes 100 kWh and costs 500 euros. We assign it the
pair (100, 500). So, a machine with the pair (80, 400) is clearly better, and
we have (80, 400) ≤ (100, 500). However, a machine with characteristics
(75, 700) is not comparable to our (100, 500) machine: (75, 700) ̸≤ (100, 500)
and (100, 500) ̸≤ (75, 700).
This new relation ≤ is still transitive and antisymmetric, so it is indeed
a partial order. It is also reflexive, but not strongly connected, and is thus
not an order. M
214 I N T R O D U C T I O N T O L A N G UA G E T H E O RY A N D C O M P I L I N G

Equivalence relations Another classical concept is that of equivalence re-


lation. It can also be defined on top of the properties listed above.

Definition A.15 (Equivalence relation). An equivalence relation is an ho-


mogeneous relation which is reflexive, transitive and symmetric. M

Example A.16. As an example, let us say that fruits are equivalent when
they have the same color (assuming they have only one color). So, for
instance, tomatoes are equivalent to cherries because they are both red,
cherries are also equivalent to strawberries because strawberries are red as
well. So, clearly, strawberries must be equivalent to tomatoes. This show
why an equivalence relation must be transitive. Of course, if cherries are
the same color as tomatoes, then tomatoes are the same color as cherries
(!) so our relation is symmetric. Finally, tomatoes are the same color as
tomatoes (!!) so our relation is also reflexive. We conclude that our ‘has
the same color’ relation is indeed an equivalence relation.
We can continue this example and see that equivalence relations natu-
rally induce a splitting of the fruits between so-called classes: while toma-
toes, cherries and strawberries are all red; bananas and lemons belong to
their own gang of yellow fruits. All yellow fruits are equivalent to each
other, but no yellow fruit can be equivalent to a red one. This is further
formalised in the next definitions. M

Definition A.17 (Partition). A partition of a set A is a set of subsets A 1 ,


A 2 ,. . . , A n of A s.t.:

1. All elements of A occur in some subset A i . For all a ∈ A, there exists i


s.t. a ∈ A i .

2. There is no overlapping between the subsets: for all i ̸= j : A i ∩ A j = ;.

So, the notion of partition consists in ‘splitting’ the whole set A into dif-
ferent subsets, much like we do when we cut a cake. Such a ‘cut’ of the set
can be done through an equivalence relation, when we put all equivalent
elements together in a subset:

Definition A.18 (Equivalence classes). Given a set A and an equivalence


relation R on A, the equivalence classes of R are all the non-empty subsets
A 1 , A 2 , . . . A n of A s.t. for all 1 ≤ i ≤ n: a, b ∈ A i iff aRb.
The equivalence classes of R form a partition of A. M

Example A.19. One can now check that these definitions match the intu-
itions given in the example above. Given the set

A = {tomatoes, cherries, strawberries, lemons, bananas};

the subsets {tomatoes, cherries, strawberries} and {lemons, bananas} are the
two equivalence classes of our ‘has the same color’ relation, and they in-
deed form a partition of A, since all fruits end up in either equivalence
class, and there is no intersection between these classes. M
S O M E R E M I N D E R S O F M AT H E M AT I C S 215

Transitive closure Finally, an important concept regarding relations is


that of transitive closure. Roughly speaking, computing the transitive clo-
sure of a given relation amounts to adding the minimal number of pairs
that are necessary to make the relation transitive. Let us illustrate this on
an example.

Example A.20. Let us consider the five cities: Antwerp (A), Brussels (B),
Paris (P), New York (NY) and Miami (M). Let us assume we are given some
information about the possibility to travel from one city to the other by
road, as a relation:
© ª
R = (A, B ), (B , P ), (N Y , M ) .

That is, we know there is a road from Antwerp to Brussels, from Brussels
to Paris and from New York to Miami. Let is now assume we want to know
what are all the possible road connections we can deduce from this infor-
mation. Clearly, if we can go from Antwerp to Brussels and from Brussels
to Paris, then we can also go from Antwerp to Paris, so we can add the pair
(A, P ), but no further pair based on the information which is given to us.
This is exactly the transitive closure of the above relation:

R ′ = (A, B ), (B , P ), (N Y , M ), (A, P ) .
© ª

Observe that this relation is indeed transitive, and that it contains R. It is


also the smallest such relation. Indeed, the pair (A, P ) must be added to
make the relation transitive, and no other pair needs to be added to this
aim. For example, if we had further added the pairs (A, N Y ) and (A, M ),
we would also have a transitive relation that contains R, but it is not the
smallest that has these properties. M
Observe that the transitive closure
Let us formalise this: of a transitive relation R is R itself.

Definition A.21 (Transitive closure of a relation). Given a relation R, its


transitive closure is the smallest relation R ′ s.t. (i) R ⊆ R ′ ; and (ii) R ′ is
transitive. M

Finally, let us note that the notion of transitive closure can be extended
to other properties of relations, such as: the transitive and reflexive closure
of R is the smallest transitive and reflexive relation R ∗ that contain R, and
so forth.
B Bibliography

Gnu bison. https://www.gnu.org/software/bison/. Online: accessed


on December, 29th, 2015.

CLang: features and goals. http://clang.llvm.org/features.html.


Online: accessed on December, 29th, 2015.

Cup: Construction of useful parsers. http://www2.cs.tum.edu/


projects/cup/. Online: accessed on December, 29th, 2015.

GCC wiki: new C parser. https://gcc.gnu.org/wiki/New_C_Parser,


2008. Online: accessed on December, 29th, 2015.

A. Aho, M. Lam, R. Sethi, and Ullman J. Compilers: Principles, Techniques,


& Tools. Addison-Wesley series in computer science. Pearson/Addison
Wesley, 2007.

Frances E. Allen. Control flow analysis. SIGPLAN Not., 5(7):1–19, July 1970.
ISSN 0362-1340. D O I : 10.1145/390013.808479.

Rajeev Alur and P. Madhusudan. Visibly pushdown languages. In Proceed-


ings of the Thirty-sixth Annual ACM Symposium on Theory of Comput-
ing, STOC ’04, pages 202–211, New York, NY, USA, 2004. ACM. ISBN 1-
58113-852-0. D O I : 10.1145/1007352.1007390. URL http://doi.acm.
org/10.1145/1007352.1007390.

T. Anderson, J. Eve, and J. Horning. Efficient LR(1) parsers. Acta Informat-


ica, 2:2–39, 1973. D O I : 10.1007/BF00571461.

Andrew W. Appel. Modern Compiler Implementation in ML. Cambridge


University Press, 1998. ISBN 0-521-58274-1.

J.A. Brzozowski and Jr. McCluskey, E.J. Signal flow graph techniques
for sequential circuit state diagrams. Electronic Computers, IEEE
Transactions on, EC-12(2):67–76, April 1963. ISSN 0367-7508. D O I :
10.1109/PGEC.1963.263416.

N. Chomsky. Syntactic Structures. Mouton and Co, The Hague, 1957.

N. Chomsky. On certain formal properties of grammars. Information and


Computation (formerly known as Information and Control), 2(2):137 –
167, 1959. D O I : 10.1016/S0019-9958(59)90362-6.

Franklin Lewis DeRemer. Practical Translators for LR(k) Languages.


PhD thesis, Massachusetts Institute of Technology, 1969. URL https:
//web.archive.org/web/20130819012838/http://publications.
csail.mit.edu/lcs/pubs/pdf/MIT-LCS-TR-065.pdf.
218 I N T R O D U C T I O N T O L A N G UA G E T H E O RY A N D C O M P I L I N G

Franklin Lewis DeRemer. Simple LR(k) grammars. Communications of the


ACM, 14(7), 1971. D O I : 10.1145/362619.362625.

A. Doer and K. Levasseur. Applied Discrete Structures. 2012. Available


online with supplementary material at: https://discretemath.org/.

Python Software Foundation. re – Regular expression operations. https:


//docs.python.org/3/library/re.html. Online: accessed on April
12th, 2023.

H.W. Fowler, J.B. Sykes, and F.G. Fowler. The Concise Oxford dictionary of
current English. Clarendon Press, 1976.

Seymour Ginsburg and Sheila Greibach. Deterministic context free lan-


guages. Information and Computation (formerly known as Information
and Control), 9(6):620–648, 1966. ISSN 0019-9958. D O I : 10.1016/S0019-
9958(66)80019-0. URL https://www.sciencedirect.com/science/
article/pii/S0019995866800190.

John E. Hopcroft, Rajeev Motwani, and Jeffrey D. Ullman. Introduction to


Automata Theory, Languages, and Computation (3rd Edition). Addison-
Wesley Longman Publishing Co., Inc., Boston, MA, USA, 2006. ISBN
0321455363.

Stephen C. Johnson. Yacc: Yet another compiler-compiler. Techni-


cal report, AT&T Bell Laboratories, 1975. Readable online at http:
//dinosaur.compilertools.net/yacc/.

B.W. Kernighan and D.M. Ritchie. The C Programming Language. Prentice-


Hall software series. Prentice Hall, 1988.

Stephen C. Kleene. Representation of events in nerve nets and finite au-


tomata. Technical Report RM-704, The RAND Corporation, 1951. URL
http://minicomplexity.org/pubr.php?t=2&id=2.

Gerwin Klein, Steve Rowe, and Régis Décamps. Jflex user’s manual.
https://jflex.de/manual.html, March 2023. Version 1.9.1. Online:
accessed on April, 12th, 2023.

Donald E. Knuth. On the translation of languages from left to right. Infor-


mation and Computation (formerly known as Information and Control),
8:607–639, 1965. D O I : 10.1016/S0019-9958(65)90426-2.

Dexter Kozen. On Kleene algebras and closed semirings. In Mathemati-


cal foundations of computer science, Proceedings of the 15th Symposium,
MFCS ’90, Banská Bystrica/Czech. 1990, volume 452 of Lecture notes in
computer science, pages 26–47, 1990. URL http://www.cs.cornell.
edu/~kozen/Papers/kacs.pdf.

R. Kurki-Suonio. Notes on top-down languages. BIT Numerical Mathe-


matics, 9(3):225–238, 1969. ISSN 1572-9125. D O I : 10.1007/BF01946814.

W. R. LaLonde, E. S. Lee, and J. Horning. An LALR(k) parser generator. In


Proceedings of IFIP congress, pages 151–153. Elsevier Science, New York,
1971.
BIBLIOGRAPHY 219

Leslie Lamport. LATEX: A Document Preparation System. Addison-Wesley,


1986. ISBN 0-201-15790-X.

P. M. Lewis, II and R. E. Stearns. Syntax-directed transduction. J. ACM, 15


(3):465–488, July 1968. ISSN 0004-5411. D O I : 10.1145/321466.321477.

R. McNaughton and H. Yamada. Regular expressions and state graphs for


automata. Electronic Computers, IRE Transactions on, EC-9(1):39–47,
March 1960. ISSN 0367-9950. D O I : 10.1109/TEC.1960.5221603.

A. Nerode. Linear automaton transformations. Proceedings of the Ameri-


can Mathematical Society, 9(4):pp. 541–544, 1958. ISSN 00029939. URL
http://www.jstor.org/stable/2033204.

Anton Nijholt. On the relationship between LL(k) and LR(k) gram-


mars. Information Processing Letters, 15(3):97–101, 1982. DOI:
10.1016/0020-0190(82)90038-2. URL https://www.researchgate.
net/publication/222460902_On_the_relationship_between_
the_LLk_and_LRk_grammars.

Damian Niwińsky and Wojciech Rytter. 200 Problems in Formal Languages


and Automata Theory. University of Warsaw, 2017.

Thomas J. Penello and Franklin Lewis DeRemer. Efficient computation


of LALR(1) look-ahead sets. ACM SIGPLAN Notices, 39(4), 2004. D O I :
10.1145/69622.357187.

M.O. Rabin and D. Scott. Finite automata and their decision prob-
lems. IBM Journal of Research and Development, 3(2):114–125,
April 1959. ISSN 0018-8646. DOI: 10.1147/rd.32.0114. URL
https://www.researchgate.net/publication/230876408_
Finite_Automata_and_Their_Decision_Problems.

R.M. Ritter. The Oxford Guide to Style. Language Reference Series. Oxford
University Press, 2002.

D.J. Rosenkrantz and R.E. Stearns. Properties of deterministic top-down


grammars. Information and Computation (formerly known as Infor-
mation and Control), 17(3):226 – 256, 1970. ISSN 0019-9958. D O I :
10.1016/S0019-9958(70)90446-8.

Itiiro Sakai. Syntax in universal translation. In International Conference


on Machine Translation of Languages and Applied Language Analysis,
pages 593–608. London: Her Majesty’s Stationery Office, 1961.

Michael Sipser. Introduction to the Theory of Computation. International


Thomson Publishing, 1st edition, 1996. ISBN 053494728X.

L. J. Stockmeyer and A. R. Meyer. Word problems requiring exponential


time (preliminary report). In Proceedings of the Fifth Annual ACM Sym-
posium on Theory of Computing, STOC ’73, pages 1–9, New York, NY,
USA, 1973. ACM. D O I : 10.1145/800125.804029.

Ken Thompson. Programming techniques: Regular expression search al-


gorithm. Commun. ACM, 11(6):419–422, June 1968. ISSN 0001-0782.
DOI: 10.1145/363347.363387.
220 I N T R O D U C T I O N T O L A N G UA G E T H E O RY A N D C O M P I L I N G

Reinhard Wilhelm, Helmut Seidl, and Sebastian Hack. Compiler Design,


Syntactic and Semantic Analysis. Springer-Verlag, 2013. ISBN 978-3-
642-17539-8. D O I : 10.1007/978-3-642-17540-4.

Niklaus Wirth and Helmut Weber. EULER: A generalization of ALGOL, and


its formal definition: Part II. Commun. ACM, 9(2):89–99, February 1966.
ISSN 0001-0782. D O I : 10.1145/365170.365202.

You might also like