Compiler Design 2
Rajashekara Murthy S
B.E., M.Tech., Ph.D.
[email protected]
BITS Pilani
Pilani | Dubai | Goa | Hyderabad
Lexical Analysis
Objectives
To Understand
1. The Role of a Lexical Analyzer
Programming Language Structure
Recall that a Programming Language is defined by
1. SYNTAX:
– Decides whether a sentence in a language is well-formed
2. SEMANTICS
– Determines the meaning, if any, of a syntactically well-
formed sentence
3. GRAMMAR
– A formal system that provides a generative finite description
of the language
Syntax of a Programming Language
Describes the structure of programs without any
consideration of their meaning.
The syntactic elements of a programming language
are determined by the computation model and
pragmatic concerns.
Well-developed tools (regular, context-free and attribute
grammars) are available for describing the
syntax of a programming language.
The Lexical Analyzer and the Parser of a compiler handle the
syntax of the programming language.
Some Basic Definitions
lex-i-cal: Of or relating to words or the vocabulary of a
language as distinguished from its grammar and
construction
lexical analysis: The task concerned with breaking an input
into its smallest meaningful units, called tokens.
syntax analysis: The task concerned with fitting a sequence of
tokens into a specified syntax.
Tokens, Patterns and Lexemes
What are Tokens ?
– The basic lexical units of the language
– A sequence of Abstract Characters that can be treated as
a unit in the grammar of the language
– A programming language classifies the tokens into a finite
set of token types
Some tokens may have attributes:
– An integer constant token will have the actual integer (17, 42) as an
attribute
– Identifiers will have a string with the actual id
A note on terminology: some texts refer to
– token types as tokens
– tokens as lexemes
if                                { return IF; }
[a-z][a-z0-9]*                    { return ID; }
([0-9]+"."[0-9]*)|("."[0-9]+)     { return REAL; }
.                                 { return ERROR; }
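The four rules above can be tried out with a small sketch. It is not how Lex itself works internally. It uses Python's `re` module and assumes two conventions that the rules imply: the longest match wins, and on a tie the earlier rule wins. The names `RULES` and `tokenize` are made up for this illustration.

```python
import re

# Rules in priority order, mirroring the four patterns above.
RULES = [
    ("IF", re.compile(r"if")),
    ("ID", re.compile(r"[a-z][a-z0-9]*")),
    ("REAL", re.compile(r"[0-9]+\.[0-9]*|\.[0-9]+")),
    ("ERROR", re.compile(r".", re.DOTALL)),  # any single character
]


def tokenize(text):
    """Return a list of (token_type, lexeme) pairs using longest match."""
    tokens = []
    pos = 0
    while pos < len(text):
        best_type, best_len = None, 0
        for token_type, pattern in RULES:
            m = pattern.match(text, pos)
            # A strictly longer match wins; on a tie the earlier rule is kept.
            if m and m.end() - pos > best_len:
                best_type, best_len = token_type, m.end() - pos
        tokens.append((best_type, text[pos:pos + best_len]))
        pos += best_len
    return tokens
```

With these rules, `if8` is scanned as one ID rather than IF followed by something else, because the ID match is longer. Blanks come out as ERROR here, since the four rules contain no whitespace pattern.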
A Regular Expression Recognizer
Given an input string,
the function of a “regular expression recognizer” is to
say :
– “YES, the input is part of the language generated
from the regular expression”
– “NO, the input isn’t part of the language generated
from the regular expression”
Using results from Finite Automata theory and theory
of algorithms, we can automate construction of such
recognizers from Regular Expressions
Finite Automata
A finite automaton is a transition graph that has:
– A finite set of states S (represented by nodes) with edges
leading from one state to another
– Each edge is labeled with the symbol (from the set Σ) that
causes the transition (could also be ε)
– One state is denoted as the start state s0, and certain
states are distinguished as final states F (normally drawn
as two concentric circles)
Mathematically, it can be represented as:
A = (S, Σ, s0, F, move)
Recognizing Expressions as Tokens with
Finite State Automaton
Operate by reading input symbols (usually characters)
– Transition can be taken if labeled with current symbol
– ε-transition can be taken at any time
Accept when final state reached & no more input
– Scanner slightly different – accept longest match even if
more input
Reject if no transition possible or no more input and
not in final state (DFA)
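The operating rules above can be sketched as a small simulator. This is only an illustration: the class name `Automaton`, the use of `None` as the ε label, and the example machine (an automaton for a*b with one ε-edge) are assumptions made for this sketch, not part of the slides.

```python
from dataclasses import dataclass

EPSILON = None  # label used for ε-edges (a convention of this sketch)


@dataclass
class Automaton:
    states: set
    alphabet: set
    start: int
    finals: set
    move: dict  # (state, symbol) -> set of next states


def epsilon_closure(a, states):
    """All states reachable from `states` using ε-transitions only."""
    stack, closure = list(states), set(states)
    while stack:
        s = stack.pop()
        for t in a.move.get((s, EPSILON), set()):
            if t not in closure:
                closure.add(t)
                stack.append(t)
    return closure


def accepts(a, text):
    """Accept if a final state is reached and no input is left."""
    current = epsilon_closure(a, {a.start})
    for ch in text:
        nxt = set()
        for s in current:
            nxt |= a.move.get((s, ch), set())
        current = epsilon_closure(a, nxt)
        if not current:
            return False  # no transition possible: reject
    return bool(current & a.finals)


# Example machine for a*b: 0 -a-> 0, 0 -ε-> 1, 1 -b-> 2
A_STAR_B = Automaton(
    states={0, 1, 2},
    alphabet={"a", "b"},
    start=0,
    finals={2},
    move={(0, "a"): {0}, (0, EPSILON): {1}, (1, "b"): {2}},
)
```

Tracking a set of current states is what lets the same loop handle ε-transitions and several edges with the same label.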
Finite Automata Examples
[Figure: three transition diagrams]
– if: start -> 1 -i-> 2 -f-> 3, return IF
– [a-z][a-z0-9]*: start -> 1 -(a-z)-> 2, with a loop on state 2 labeled a-z and 0-9; return ID
– REAL: five states (1 to 5) with edges labeled 0-9 and "."; return REAL
Deterministic Finite Automata (DFA)
A finite automaton is deterministic if
1. It has no edges/transitions labeled with epsilon.
2. For each state and for each symbol in the alphabet,
there is exactly one edge labeled with that symbol.
Such a transition graph is called a state graph.
A Deterministic Finite Automaton (DFA):
[Figure: state graph for b*abb: start -> 0 -a-> 1 -b-> 2 -b-> 3, with a loop on state 0 labeled b]
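A DFA needs only one current state, so it can be run from a transition table. The sketch below assumes the state graph for b*abb shown above (states 0 to 3, loop on state 0 labeled b). Missing table entries are treated as a move to a dead state, which is one common way to make "exactly one edge per symbol" hold. The names `TABLE`, `DEAD` and `dfa_accepts` are chosen for this illustration.

```python
DEAD = -1  # hypothetical dead state standing in for all missing edges

# Transition table for b*abb: (state, symbol) -> next state
TABLE = {
    (0, "b"): 0,
    (0, "a"): 1,
    (1, "b"): 2,
    (2, "b"): 3,
}
FINALS = {3}


def dfa_accepts(text):
    """Run the table-driven DFA over the whole input."""
    state = 0
    for ch in text:
        state = TABLE.get((state, ch), DEAD)
        if state == DEAD:
            return False  # once dead, no input can lead to acceptance
    return state in FINALS
```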
Non-deterministic Finite Automata (NFA)
In Non-deterministic Finite Automata:
1. From a state (node), there may be more than one
edge labeled with the same symbol, and there may
be no edge from a node labeled with an input symbol
2. An edge can be labeled with the empty string ε too
A Non-deterministic Finite Automaton (NFA):
[Figure: NFA for (a|b)*abb: start -> 0 -a-> 1 -b-> 2 -b-> 3, with loops on state 0 labeled a and b]
Another NFA
[Figure: an NFA with edges labeled a and b, and the basic NFAs from start state i to final state f (one with an edge labeled a)]
Building NFA for Symbols & Operations
2. Building NFA for Alternation N (s | t) :
– Given two NFA N(s) and N(t),
1. Construct new start state i, and new final state f.
2. Add a transition from the start state i to the start states of N(s) and N(t) and label them with
epsilon symbol
3. Add a transition from the Final states of N(s) and N(t) to the final state f and label them with
Epsilon symbol
[Figure: start state i with ε-edges to N(s) and N(t), whose final states have ε-edges to the final state f]
Building NFA for Symbols & Operations
3. Building NFA for Concatenation N(s.t) or N(st) :
– Given two NFA N(s) and N(t),
1. Construct new start state i, and new final state f.
2. Overlap the start state of the latter [ N(t) ] with the final state of the former
[ N(s) ]
3. From the start state, add an edge labeled with epsilon to the start state of
N(s)
4. From the final state of N(s), add an epsilon transition to the start state of N(t)
[Figure: start state i, then N(s) followed by N(t), then final state f]
Building NFA for Symbols & Operations
4. Building NFA for Repetition N(s*) :
1. Construct new start state and new final state
2. Add an epsilon transition from new Start state to the new
final state.
3. Add an epsilon transition from the new final state to the
start state of N(s).
4. Add another epsilon transition from the final state of N(s)
to the constructed final state.
[Figure: start state i, N(s) and final state f, connected by the epsilon transitions listed above]
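The construction rules for symbols, alternation, concatenation and repetition can be sketched in code as follows. This is a minimal illustration, not a definitive implementation. The class `NFA` and all function names are made up here. Concatenation links N(s) to N(t) with an ε-edge (step 4 of the rule) rather than merging states, and the ε-edge from the final state of N(t) to the new final state is an assumption, since the slide does not state it. A small `accepts` helper is included only so the result can be checked.

```python
from itertools import count

EPS = None  # ε label (a convention of this sketch)


class NFA:
    """An NFA fragment with one start state and one final state."""
    _ids = count()

    def __init__(self):
        self.start = next(NFA._ids)
        self.final = next(NFA._ids)
        self.edges = {}  # (state, label) -> set of target states

    def add(self, src, label, dst):
        self.edges.setdefault((src, label), set()).add(dst)

    def absorb(self, other):
        """Copy all edges of another fragment into this one."""
        for key, targets in other.edges.items():
            self.edges.setdefault(key, set()).update(targets)


def symbol(ch):
    n = NFA()
    n.add(n.start, ch, n.final)
    return n


def alternation(s, t):
    n = NFA()
    n.absorb(s)
    n.absorb(t)
    n.add(n.start, EPS, s.start)
    n.add(n.start, EPS, t.start)
    n.add(s.final, EPS, n.final)
    n.add(t.final, EPS, n.final)
    return n


def concatenation(s, t):
    n = NFA()
    n.absorb(s)
    n.absorb(t)
    n.add(n.start, EPS, s.start)
    n.add(s.final, EPS, t.start)
    n.add(t.final, EPS, n.final)  # assumption: not stated on the slide
    return n


def repetition(s):
    n = NFA()
    n.absorb(s)
    n.add(n.start, EPS, n.final)  # zero occurrences
    n.add(n.final, EPS, s.start)  # enter N(s), possibly again
    n.add(s.final, EPS, n.final)  # leave N(s)
    return n


def closure(n, states):
    stack, seen = list(states), set(states)
    while stack:
        for t in n.edges.get((stack.pop(), EPS), ()):
            if t not in seen:
                seen.add(t)
                stack.append(t)
    return seen


def accepts(n, text):
    current = closure(n, {n.start})
    for ch in text:
        step = set()
        for s in current:
            step |= n.edges.get((s, ch), set())
        current = closure(n, step)
    return n.final in current
```

Each call creates fresh state numbers, so a sub-expression that occurs twice, as in (a|b).(a|b), has to be built twice rather than reused.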
Construction of NFA – Examples
(a|b).(a|b)
[Figure: NFAs for (a) and (b), their combination into (a|b), and two copies of (a|b) joined into (a|b).(a|b)]
Construction of NFA – Examples (Contd.)
[Figure: NFA for [a-z][a-z0-9]* (a symbol followed by a repetition), returning ID]
[Figure: NFA for [0-9]+ = [0-9][0-9]* (a symbol followed by a repetition), returning NUM]
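Combining Several NFAs
[Figure: one NFA with start state 1 and four branches: states 2-3 for IF (edges i, f), states 4-8 for ID (edges a-z, 0-9), states 9-13 for NUM (edges 0-9), and states 14-15 for ERROR (edge "any character")]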
Automating a RE Recognizer Construction
To convert a specification into code:
1. Write down the RE for the input language
2. Build a big NFA
3. Build the DFA that simulates the NFA
4. Systematically shrink the DFA
5. Turn it into code
Note: The DFA construction is done automatically by a tool
such as lex
Conversion of NFA to DFA
A DFA can be constructed from the NFA, where each
DFA state represents a set of NFA states from the NFA
Key idea
The state of the DFA after reading some input is
the set of all states the NFA could have reached
after reading the same input
If the NFA has n states, the DFA will have at most 2^n states
Resulting DFA may have more states than needed
Let us study the conversion with an example
Converting NFA to DFA
[Figure: the combined NFA for IF, ID, NUM and ERROR, states 1 to 15]
Q: What states can be reached from state 1
without consuming a character?
A: {1,4,9,14} form the ε-closure of state 1
1. Start with the initial state in the NFA (s0), and work out the set of
states in the DFA, Dstates, initialized with a state representing
ε-closure(s0).
Converting NFA to DFA
Dstates = {1-4-9-14}
Now we need to compute:
Move(1-4-9-14, a-h) = {5,15}
Then, ε-closure({5,15}) = {5,6,8,15}
New DFA edge: 1-4-9-14 -(a-h)-> 5-6-8-15
Converting NFA to DFA
Dstates = {1-4-9-14}
Next we need to compute:
Move(1-4-9-14, i) = {2,5,15}
Then, ε-closure({2,5,15}) = {2,5,6,8,15}
New DFA edge: 1-4-9-14 -i-> 2-5-6-8-15
Converting NFA to DFA
Dstates = {1-4-9-14}
Next we need to compute:
Move(1-4-9-14, j-z) = {5,15}
Then, ε-closure({5,15}) = {5,6,8,15}
New DFA edge: 1-4-9-14 -(j-z)-> 5-6-8-15
Converting NFA to DFA
Dstates = {1-4-9-14}
Next we need to compute:
Move(1-4-9-14, 0-9) = {10,15}
Then, ε-closure({10,15}) = {10,11,13,15}
New DFA edge: 1-4-9-14 -(0-9)-> 10-11-13-15
Converting NFA to DFA
Dstates = {1-4-9-14}
Next we need to compute:
Move(1-4-9-14, other) = {15}
Then, ε-closure({15}) = {15}
New DFA edge: 1-4-9-14 -other-> 15
Converting NFA to DFA
Dstates = {1-4-9-14}
DFA edges so far from 1-4-9-14: a-h and j-z to 5-6-8-15, i to 2-5-6-8-15, 0-9 to 10-11-13-15, other to 15
The analysis for 1-4-9-14 is complete. We mark it and pick
another state in the DFA to analyze.
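The steps worked through above (compute Move, take the ε-closure, add the new set to Dstates, mark the analyzed state and pick another) can be sketched as a loop. To keep it short, the sketch runs on the smaller NFA for (a|b)*abb from the NFA slide, not on the 15-state example. All names (`NFA_MOVE`, `eps_closure`, `move`, `subset_construction`) are chosen for this illustration.

```python
from collections import deque

EPS = None  # ε label (this small NFA happens to have no ε-edges)

# NFA for (a|b)*abb: (state, symbol) -> set of next states
NFA_MOVE = {
    (0, "a"): {0, 1},
    (0, "b"): {0},
    (1, "b"): {2},
    (2, "b"): {3},
}
NFA_START, NFA_FINALS, ALPHABET = 0, {3}, ("a", "b")


def eps_closure(states):
    stack, seen = list(states), set(states)
    while stack:
        for t in NFA_MOVE.get((stack.pop(), EPS), ()):
            if t not in seen:
                seen.add(t)
                stack.append(t)
    return seen


def move(states, symbol):
    return set().union(*(NFA_MOVE.get((s, symbol), set()) for s in states))


def subset_construction():
    start = frozenset(eps_closure({NFA_START}))
    dstates = {start}
    dtran = {}  # (DFA state, symbol) -> DFA state
    work = deque([start])  # unmarked DFA states
    while work:
        T = work.popleft()  # mark T
        for a in ALPHABET:
            U = frozenset(eps_closure(move(T, a)))
            if not U:
                continue  # no NFA state reachable: no edge
            if U not in dstates:
                dstates.add(U)
                work.append(U)
            dtran[(T, a)] = U
    finals = {T for T in dstates if T & NFA_FINALS}
    return start, dstates, dtran, finals
```

Each DFA state is a `frozenset` of NFA states, which matches the key idea: the DFA state after some input is the set of all states the NFA could be in after the same input.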
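Converted DFA
[Figure: the resulting DFA. States shown: 1-4-9-14 (start), 2-5-6-8-15 (ID), 3-6-7-8 (IF), 5-6-8-15 (ID), 6-7-8 (ID), 10-11-13-15 (NUM), 11-12-13 (NUM). Edges are labeled with the character classes a-h, i, j-z, f, "a-e, g-z, 0-9", "a-z, 0-9" and 0-9.]
[Figure: a second DFA with states {s0,s1,s2}, {s3,s5,s6,s7,s8}, {s4,s5,s6,s7,s8}, {s9,s11} and {s10,s11}, and edges labeled a and b]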
Systematically shrink the DFA
The Big Picture
– Discover sets of equivalent states
– Represent each such set with just one state
Two states are equivalent if and only if:
– The set of paths leading to them are equivalent
– ∀ α ∈ Σ, transitions on α lead to equivalent states (DFA)
– α-transitions to distinct sets ⇒ states must be in distinct sets
A partition P of S
– A collection of sets P s.t. each s ∈ S is in exactly one pi ∈ P
– The algorithm iteratively partitions the DFA’s states
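The partitioning idea can be sketched as repeated splitting: states stay in the same set only while their transitions on every symbol lead into the same sets. The sketch starts from the usual split into final and non-final states and assumes a complete DFA (every state has an edge for every symbol). The example DFA for (a|b)*abb with states A to E is a textbook-style example chosen for this illustration, not one from the slides, and `minimize` is a made-up name.

```python
def minimize(states, alphabet, delta, finals):
    """Return the partition of `states` into sets of equivalent states."""
    # Initial partition: final states versus all other states.
    partition = [g for g in (set(finals), set(states) - set(finals)) if g]
    while True:
        # Index of the set each state currently belongs to.
        block = {s: i for i, g in enumerate(partition) for s in g}
        new_partition = []
        for g in partition:
            buckets = {}
            for s in g:
                # Signature: which set each symbol's transition leads to.
                sig = tuple(block[delta[(s, a)]] for a in alphabet)
                buckets.setdefault(sig, set()).add(s)
            new_partition.extend(buckets.values())
        if len(new_partition) == len(partition):
            return new_partition  # nothing was split: partition is stable
        partition = new_partition


# Example DFA for (a|b)*abb with a redundant state (A and C are equivalent).
STATES = ["A", "B", "C", "D", "E"]
ALPHABET = ("a", "b")
DELTA = {
    ("A", "a"): "B", ("A", "b"): "C",
    ("B", "a"): "B", ("B", "b"): "D",
    ("C", "a"): "B", ("C", "b"): "C",
    ("D", "a"): "B", ("D", "b"): "E",
    ("E", "a"): "B", ("E", "b"): "C",
}
FINALS = {"E"}
```

Calling `minimize(STATES, ALPHABET, DELTA, FINALS)` groups A with C and leaves B, D and E on their own, so the minimal DFA has four states. Each resulting set is then represented by a single state.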
Minimization
[Figure: DFA with states p0 to p4 and edges labeled a and b]
Group all the states together: {p0, p1, p2, p3, p4}.
Pseudo Code For Lexical Analyzer
function lexan: integer;
var lexbuf: array [0..100] of char;
    c: char;
begin
  loop begin
    read a character into c;
    if c is a blank or a tab then
      do nothing
    else if c is a newline then
      increment lineno
    else if c is a digit then
    begin
      set tokenval to the value of this and following digits;
      return NUM
    end
    else if c is a letter then
    begin
      place c and successive letters and digits into lexbuf;
      p := lookup(lexbuf);
      tokenval := p;
      return the token field of table entry p
    end
    else
    begin /* token is a single character */
      set tokenval to NONE; /* no attribute */
      return integer encoding of character c
    end
  end
end
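The pseudo code above can be turned into a runnable sketch. Several details are assumptions made for this illustration and are not given in the pseudo code: the numeric token codes `NUM`, `ID` and `NONE`, a symbol table kept as a simple list, a `lookup` that inserts unknown names as ID, and returning `None` at end of input.

```python
NUM, ID, NONE = 256, 257, -1  # hypothetical token codes
DIGITS = "0123456789"


class Lexer:
    def __init__(self, text):
        self.text = text
        self.pos = 0
        self.lineno = 1
        self.tokenval = NONE
        self.symtable = []  # entries are (lexeme, token) pairs

    def lookup(self, lexeme):
        """Return the table index of lexeme, inserting it as ID if new."""
        for i, (name, _) in enumerate(self.symtable):
            if name == lexeme:
                return i
        self.symtable.append((lexeme, ID))  # assumption: insert on miss
        return len(self.symtable) - 1

    def lexan(self):
        while self.pos < len(self.text):
            c = self.text[self.pos]
            self.pos += 1
            if c in " \t":
                continue  # do nothing
            elif c == "\n":
                self.lineno += 1
            elif c in DIGITS:
                start = self.pos - 1
                while self.pos < len(self.text) and self.text[self.pos] in DIGITS:
                    self.pos += 1
                self.tokenval = int(self.text[start:self.pos])
                return NUM
            elif c.isalpha():
                start = self.pos - 1
                while self.pos < len(self.text) and self.text[self.pos].isalnum():
                    self.pos += 1
                p = self.lookup(self.text[start:self.pos])
                self.tokenval = p
                return self.symtable[p][1]
            else:
                self.tokenval = NONE  # single-character token, no attribute
                return ord(c)
        return None  # end of input (not covered by the pseudo code)
```

For example, `Lexer("x1 = 42").lexan()` returns `ID` and sets `tokenval` to the symbol-table index of `x1`.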
Building Lexical Analyzers Automatically
The point to note is :
The Process studied so far is well suited for Automation
1. Implementer writes down the regular expressions
2. Scanner generator builds NFA, DFA, minimal DFA,
and then writes out the (table-driven or direct-coded)
code
3. This process reliably produces fast, robust Lexical
Analyzers
One such Tool is Lex
Lex – A Tool for Generating Scanners
A widely used tool for specifying Lexical Analyzers for a
wide variety of languages. How does it work ?
1. The specification of a Lexical Analyzer is prepared by creating a program
lex.l (containing RE’s) in the Lex language
2. Then lex.l is run through the Lex compiler to produce a program
lex.yy.c (contains a tabular representation of the state transition
diagram)
3. lex.yy.c is run through the C compiler to produce the object code of the
Lexical Analyzer, a.out
[Figure: Lex source program lex.l -> Lex compiler -> lex.yy.c; lex.yy.c -> C compiler -> a.out; input stream -> a.out -> sequence of tokens]
Lex Functions
1. Translates the definitions into an automaton.
Thank you
Regular Expression Construction
Problem : Specify a set of unsigned numbers as a
regular expression. (Examples: 1997, 19.97)
Solution : Start with symbols and keep defining regular
sub-expressions till the final expression is achieved
RULE 1. digit → 0|1|2|3| … |9
RULE 2. digits → digit digit* (or digit+)
[digit+ is the positive closure, meaning 1 or more digits]
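The two rules can be written with Python's `re` module. The slide shows only RULE 1 and RULE 2. The optional fraction part used below is an assumption, added so that both examples (1997 and 19.97) match. The names `UNSIGNED` and `is_unsigned_number` are chosen for this sketch.

```python
import re

DIGIT = r"[0-9]"        # RULE 1: digit -> 0|1|...|9
DIGITS = rf"{DIGIT}+"   # RULE 2: digits -> digit digit*

# Assumption: an unsigned number is digits with an optional ".digits" part.
UNSIGNED = re.compile(rf"{DIGITS}(\.{DIGITS})?")


def is_unsigned_number(text):
    """True if the whole string is an unsigned number."""
    return UNSIGNED.fullmatch(text) is not None
```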
Regular Expression Construction
Qn: How to write a regular expression for identifiers?
(An identifier is a letter followed by zero or more letters or digits.)
Answer:
1. Letter → a|A|b|B|… |z|Z
2. Digit → 0|1|2|3| … |9
3. Letter_or_Digit → Letter | Digit
4. Identifier → Letter Letter_or_Digit*
One can define similar regular expression (s) for
comments, Strings, operators and delimiters ( the
different tokens of a language)
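The identifier definition can be checked the same way. The sketch reads an identifier as one letter followed by zero or more letters or digits. The names `IDENTIFIER` and `is_identifier` are chosen for this illustration.

```python
import re

LETTER = r"[a-zA-Z]"
DIGIT = r"[0-9]"
LETTER_OR_DIGIT = rf"(?:{LETTER}|{DIGIT})"

# One letter, then any number of letters or digits.
IDENTIFIER = re.compile(rf"{LETTER}{LETTER_OR_DIGIT}*")


def is_identifier(text):
    """True if the whole string is an identifier."""
    return IDENTIFIER.fullmatch(text) is not None
```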
Grammar for a Tiny Language
program ::= statement | program statement
statement ::= assignStmt | ifStmt
assignStmt ::= id = expr ;
ifStmt ::= if ( expr ) statement
expr ::= id | int | expr + expr
id ::= a | b | c | i | j | k | n | x | y | z
int ::= 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
The rules of a grammar are also known as productions