Learn Compiler Design: From B. K.
Sharma
UNIT I
Regular Expressions and Finite Automata
Learn Compiler Design: From B. K. Sharma
Unit I: Syllabus
• Introduction to Compiler
• Major Data Structures in Compiler
• Types of Compiler
• Front-End and Back-End of Compiler
• Compiler Structure:
– Analysis-synthesis model of compilation
• Various phases of Compiler
Learn Compiler Design: From B. K. Sharma
Unit I: Syllabus
• Lexical Analysis:
– Input Buffering
– Specification and Recognition of Tokens
– Design of a Lexical Analyzer Generator
• Lex
Learn Compiler Design: From B. K. Sharma
Summary of Lecture 8: Regular Expression and
Lexical Analyzer
1: The regular expression can be used to describe
a sequence of pattern that defines a string.
2: Operations that can be performed on Regular
Expressions are : Union /Or, Concatenation and
Kleene Closure.
Learn Compiler Design: From B. K. Sharma
Mapping of Lesson with Course Outcome
(CO)
Lesson CO
Lesson 9: Regular Understand Lexical
Expression and Finite Analysis and
Automata implement it using lex
tool.
Learn Compiler Design: From B. K. Sharma
Lesson 9: Regular Expression and Finite
Automata : Learning Outcomes
At the end of this lesson, students will be able to
1: List 3 specifications of tokens
2: Explain issues in lexical analysis.
3: Explain role of finite automata in lexical
analysis.
4: Convert Regular Expression into NFA using
Thompson’s Construction
Learn Compiler Design: From B. K. Sharma
Active Learning Activity: Diagnostic
Assessment
One- Minute Paper
What is Automata?
Name different types of automata.
Learn Compiler Design: From B. K. Sharma
Specification of Tokens
There are 3 specifications of tokens
1) Strings
2) Language
3) Regular expression
A string over an alphabet is a finite sequence of symbols
drawn from that alphabet.
A language is any countable set of strings over some fixed
alphabet.
Each Regular Expression denotes a language.
Learn Compiler Design: From B. K. Sharma
Tasks of Lexical Analyzer:
The Lexical Analyzer in a Compiler performs two
tasks:
First Step: Reads the source code from left to
right character by character and divide into pieces
called lexemes according to some criteria such as
space, tab, separators, punctuation symbol, etc.
These pieces are called lexemes.
Second Step: Assign each lexemes to a predefined
class such as Identifier, Keyword, Operator,
Delimiter, etc. These classes are called tokens.
Each Token is a pair: <Lexeme, Class>
Learn Compiler Design: From B. K. Sharma
Tasks of Lexical Analyzer:
The second step within Lexical Analysis is
performed using Regular Expressions, so
that the sequence of characters that forms
each lexeme is matched with a token class.
When the lexeme ”if” is extracted in the
first step, and then in the second step this
lexeme is matched with the regular
expression:
Keyword: if | else | while | for | return |....
Learn Compiler Design: From B. K. Sharma
Tasks of Lexical Analyzer:
Let us consider the code given below:
if (x==y)
z =12;
else
z = 3;
i f ( x = = y ) \n \t z = 1 2 ; \n e l s e \n \t z = 3 ; \n
In the First Step, the Lexical Analyzer splits the code into
units such as if, (, x, ==, y and so on called lexemes.
In the second step, the lexical Analyzer assign each of these
lexemes to a class.
Here is where Regular Expressions come into play:
the act of assigning strings to a class is done with a pattern-
matching operation, where each string is matched to a regular
expression.
Learn Compiler Design: From B. K. Sharma
Tasks of Lexical Analyzer: Important Take-
away
First, and foremost, during the Lexical Analysis everything is
a character (or a sequence of characters called a string).
Secondly, the objective of the Lexical Analysis is to generate
a list of pairs <lexeme, class>.
Formally, each of these pairs is called Token.
A lexeme is a piece of the source code that is matched to
one of the classes defined by the language’s grammar.
Thirdly, lexemes are extracted from the source code
according to rules that are specific to the language.
After that, each lexeme is assigned to a class, using regular
expressions.
Learn Compiler Design: From B. K. Sharma
Two main Issues in Lexical Analysis
How to specify tokens (patterns)?
How to recognize the tokens given a token
specification?
How to specify tokens:
Tokens are specified by regular expressions.
Each regular expression specifies a token.
How to recognize the tokens given a token
specification?
Tokens are recognized by FA.
Learn Compiler Design: From B. K. Sharma
Two other Issues in Lexical Analysis:
Lookahead, and
ambiguities
“Lookahead” may be required to decide where one
token ends and the next token begins.
Even our simple example has lookahead issues:
i vs. if
= vs. ==
A way to resolve ambiguities:
Is “if” two variables ‘i’ and ‘f’?
Is “==“ two equal signs ‘=‘ ‘=‘?
Learn Compiler Design: From B. K. Sharma
Regular Expressions and Lexical Analyzer
Regular expressions are used to represent the
language for Lexical Analyzer.
What is that language?
Regular Language.
What is the grammar for Regular Language?
Regular grammar generates regular languages which
are all languages accepted by Finite Automaton.
The language accepted by finite automata and
described by an expression called regular
expression.
Learn Compiler Design: From B. K. Sharma
Chomsky Hierarchy
Learn Compiler Design: From B. K. Sharma
What is Finite Automata?
Abstract model of a computing entity or digital
computers that
decides whether to accept or reject a string.
A finite-state machine or finite-state automaton,
finite automaton, or simply a state machine, is a
mathematical model of computation.
It is an abstract machine that can be in exactly
one of a finite number of states at any given
time.
Learn Compiler Design: From B. K. Sharma
Finite Automata AKA FSM
Two types of FAs:
Deterministic (DFA) and Non-Deterministic(NFA)
Deterministic (DFA):
In DFA, there is only one path for specific input from the current
state to the next state.
DFA
Non-deterministic (NFA)
In NFA, there exist many paths for specific input from the current
state to the next state.
Learn Compiler Design: From B. K. Sharma
Finite Automata AKA FSM
FA for ‘a’ FA for ‘ab’ FA for (a | b)
FA for a* FA for a*b* FA for (a | b)*
a b
q0 q1
Learn Compiler Design: From B. K. Sharma
Regular Expressions and Lexical Analyzer
A finite state machine (or Finite automata)
recognizes legal strings from a language. „
Example: Identifier
letter followed by letters or digits
Regular Expression: ([a-z]|[A-Z]) ([a-z]|[A-Z]|[0-9])*
Learn Compiler Design: From B. K. Sharma
Active Learning Activity
One- Minute Paper
NFA, in its name has ’non-deterministic’
because of :
a) The result is undetermined
b) The choice of path is non-deterministic
c) The state to be transited next is non-deterministic
d) All of the mentioned
Learn Compiler Design: From B. K. Sharma
Regular Expressions and Lexical Analyzer
Regular Scanner generator
Finite
expression Automaton
String stream
Regular Finite Automaton is Scanner
created by Scanner program
expression
generator such as
lex
Tokens
RE and FA
A scanner generator (e.g., lex) bridges the gap between
regular expressions and FAs.
Learn Compiler Design: From B. K. Sharma
Inside the scanner generator
RE
Thompson construction
NFA
Subset construction
DFA
Minimization Hopcroft’s Algorithm
Minimized DFA
DFA simulation Scanner
generator
Scanner Program
Learn Compiler Design: From B. K. Sharma
RE To NFA: Thompson Construction
1. For in the regular expression, construct NFA
start
i f L()
2. For a in the regular expression, construct NFA
start a
i f L(a)
Learn Compiler Design: From B. K. Sharma
RE To NFA: Thompson Construction
3.(a) Union or OR operation or Alternative
Learn Compiler Design: From B. K. Sharma
RE To NFA: Thompson Construction
3.(b) If s, t are regular expressions, N(s), N(t) their
NFAs, s|t (union) has NFA:
N(s)
start
i f L(s) L(t)
N(t)
where i and f are new start / final states, and -moves
are introduced from i to the old start states of N(s) and
N(t) as well as from all of their final states to f.
Learn Compiler Design: From B. K. Sharma
RE To NFA: Thompson Construction
3.(c) Concatenation Operation
Learn Compiler Design: From B. K. Sharma
RE To NFA: Thompson Construction
3.(d) If s, t are regular expressions, N(s), N(t) their
NFAs s.t (concatenation) has NFA:
start
i N(s) N(t) f L(s) L(t)
overlap
Alternative:
start
i N(s) N(t) f
Concatenate
where i is the start state of N(s) (or new under the
alternative) and f is the final state of N(t) (or new).
Overlap maps final states of N(s) to start state of N(t).
Learn Compiler Design: From B. K. Sharma
RE To NFA: Thompson Construction
3.(e) a*
Learn Compiler Design: From B. K. Sharma
RE To NFA: Thompson Construction
3.(e) a*
Learn Compiler Design: From B. K. Sharma
RE To NFA: Thompson Construction
3.(e) a*
Learn Compiler Design: From B. K. Sharma
RE To NFA: Thompson Construction
3.(e) a*
Learn Compiler Design: From B. K. Sharma
RE To NFA: Thompson Construction
3.(f) If s is a regular expressions, N(s) its NFA, s*
(Kleene star) has NFA:
start
i f
N(s)
where : i is new start state and f is new final state
ε-move i to f (to accept null string)
ε -moves i to old start, old final(s) to f
ε -move old final to old start
Learn Compiler Design: From B. K. Sharma
RE To NFA: Thompson Construction
Example: 1
Let’s try aa*b
Learn Compiler Design: From B. K. Sharma
RE To NFA: Thompson Construction
Example: 1
Let’s try aa*b
overlap overlap
Learn Compiler Design: From B. K. Sharma
Active Learning Activity
One- Minute Paper
Draw NFA for the following regular
expression:
a*b (a | b)
Learn Compiler Design: From B. K. Sharma
RE To NFA: Thompson Construction
Example: 2
Let’s try a*b(a|b)
Learn Compiler Design: From B. K. Sharma
RE To NFA: Thompson Construction
Example: 2
Let’s try a*b(a|b)
overlap overlap
Learn Compiler Design: From B. K. Sharma
RE To NFA: Thompson Construction
Example: 3
Let’s try ab ( a | b )*
Learn Compiler Design: From B. K. Sharma
RE To NFA: Thompson Construction
Example: 3
Let’s try ab ( a | b )*
Learn Compiler Design: From B. K. Sharma
RE To NFA: Thompson Construction
Example: 3
Let’s try ab ( a | b )*
overlap overlap
Learn Compiler Design: From B. K. Sharma
Active Learning Activity
One- Minute Paper
Draw NFA for the following regular
expression:
a ( b | c )*
Learn Compiler Design: From B. K. Sharma
RE To NFA: Thompson Construction
Example: 4
Let’s try a ( b | c )*
1. a, b, & c a b c
S0 S1 S0 S1 S0 S1
b
S1 S2
2. b | c
S0 S5
c
S3 S4
Learn Compiler Design: From B. K. Sharma
RE To NFA: Thompson Construction
Example: 4
Let’s try a ( b | c )*
b
S2 S3
3. ( b | c )* S0 S1 S6 S7
c
S4 S5
Learn Compiler Design: From B. K. Sharma
RE To NFA: Thompson Construction
Example: 4
Let’s try a ( b | c )*
4. a ( b | c )*
b
S S5
4
a
S0 S1 S2 S3 S8 S9
c
S6 S7
Concatenate
Learn Compiler Design: From B. K. Sharma
RE To NFA: Thompson Construction
Example:4
Let’s try a ( b | c )*
Of course, a human would design something simpler ...
b | c
a
S0 S1
Learn Compiler Design: From B. K. Sharma
RE To NFA: Thompson Construction
Example:5
Let’s try (a | b)* abb
Learn Compiler Design: From B. K. Sharma
Active Learning Activity
One- Minute Paper
Draw NFA for the following regular
expression:
(a | b)* abb
Learn Compiler Design: From B. K. Sharma
RE To NFA: Thompson Construction
Example:5
Let’s try (a | b)* abb
Learn Compiler Design: From B. K. Sharma
RE To NFA: Thompson Construction
Example:5
Let’s try (a | b)* abb
overlap
Learn Compiler Design: From B. K. Sharma
Summary of Lecture 9: Regular Expression and
Lexical Analyzer
1: Abstract model of a computing entity or digital
computers that decides whether to accept or reject a
string is called
2: In Finite Automaton, the automaton is exactly one of a
finite number of states at any given time.
3: Two main types of automata are DFA and NFA.
4: Three specifications of tokens are : string, language and
regular expressions.
5. Regular language is accepted by Finite Automata and is
described by regular expression.
6: Tokens are specified by regular expressions and recognized
by Finite Automata.
7: Thompson’s Construction is used to convert regular expressions to NFA with
-moves
Learn Compiler Design: From B. K. Sharma
Regular Expressions and FSA: Questions
How is Finite Automata useful for lexical analysis?
Construct the NFA and DFA for the following regular
expression: ( a + b )*abb
Construct minimum state expression DFA for the following
regular expression:
(a | b)* a(a | b)
Show the construction of NFA for following regular
expression.
(a | b)*a (a | b)
Draw the transition diagram for identifiers.