Unit 2-Introduction To Compilers
Unit-2
Introduction to Compilers
Outline
• Phases of Compilers
• Role of lexical analyzer
• Specification of tokens
• Recognition of tokens
• Lexical analyzer generator
• Finite automata
• Design of lexical analyzer generator
Phases of Compilers
Phases of Compilers (Cont…)
The role of lexical analyzer
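[Figure: the source program is read by the lexical analyzer; the parser calls getNextToken and the lexical analyzer returns a token; the parser's output goes on to semantic analysis; the lexical analyzer and the parser both use the symbol table]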
Why separate lexical analysis and parsing?
1. Simplicity of design
2. Improving compiler efficiency
3. Enhancing compiler portability
Tokens, Patterns and Lexemes
• A token is a pair: a token name and an optional
token value
• A pattern is a description of the form that the
lexemes of a token may take
• A lexeme is a sequence of characters in the
source program that matches the pattern for a
token
Example
Attributes for tokens
• E = M * C ** 2
– <id, pointer to symbol table entry for E>
– <assign-op>
– <id, pointer to symbol table entry for M>
– <mult-op>
– <id, pointer to symbol table entry for C>
– <exp-op>
– <number, integer value 2>
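The token list above can be sketched in C. This is a minimal illustration, not a full lexer: the names Token, TokenName and next_token are assumptions made for this example, and a pointer to the lexeme stands in for the symbol-table pointer.

```c
#include <ctype.h>
#include <stdlib.h>

/* Token names follow the slide; the enum itself is a hypothetical encoding. */
typedef enum { TOK_ID, TOK_NUMBER, TOK_ASSIGN_OP, TOK_MULT_OP,
               TOK_EXP_OP, TOK_EOF, TOK_ERROR } TokenName;

typedef struct {
    TokenName name;
    const char *lexeme;  /* start of the lexeme; stands in for a symbol-table pointer */
    int length;          /* number of characters in the lexeme */
    long value;          /* attribute value, used only for TOK_NUMBER */
} Token;

/* Scan one token starting at *p and advance *p past its lexeme. */
Token next_token(const char **p) {
    while (**p == ' ' || **p == '\t' || **p == '\n') (*p)++;   /* skip whitespace */
    const char *start = *p;
    Token t = { TOK_ERROR, start, 0, 0 };
    unsigned char c = (unsigned char)*start;

    if (c == '\0') { t.name = TOK_EOF; return t; }
    if (isalpha(c) || c == '_') {                  /* id: letter_ (letter_ | digit)* */
        while (isalnum((unsigned char)**p) || **p == '_') (*p)++;
        t.name = TOK_ID;
    } else if (isdigit(c)) {                       /* integer constant */
        char *end;
        t.value = strtol(start, &end, 10);
        *p = end;
        t.name = TOK_NUMBER;
    } else if (c == '=') {
        (*p)++; t.name = TOK_ASSIGN_OP;
    } else if (c == '*') {                         /* one character of lookahead: * or ** */
        (*p)++;
        if (**p == '*') { (*p)++; t.name = TOK_EXP_OP; }
        else t.name = TOK_MULT_OP;
    } else {
        (*p)++;                                    /* no pattern matches: report TOK_ERROR */
    }
    t.length = (int)(*p - start);
    return t;
}
```

Calling next_token repeatedly on "E = M * C ** 2" yields id, assign-op, id, mult-op, id, exp-op, number, in that order.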
Lexical errors
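• Some errors are beyond the power of the lexical
analyzer to recognize:
– fi (a == f(x)) …
– fi is a valid lexeme for the token id, so the lexical analyzer cannot tell that it is a misspelling of if
• However, it may be able to recognize errors
like:
– d = 2r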
• Such errors are recognized when no pattern
for tokens matches a character sequence
Error recovery
• Panic mode: successive characters are ignored
until we reach a well-formed token
• Delete one character from the remaining input
• Insert a missing character into the remaining
input
• Replace a character by another character
• Transpose two adjacent characters
Input buffering
• Sometimes lexical analyzer needs to look
ahead some symbols to decide about the
token to return
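– In C language: we need to look at the character after -, = or < to
decide what token to return
– In Fortran: DO 5 I = 1.25 is an assignment to the variable DO5I; only the . in place of , distinguishes it from the loop DO 5 I = 1,25
• We need to introduce a two-buffer scheme to
handle large look-aheads safely
[Figure: input buffer holding E = M * C * * 2 followed by eof]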
Sentinels
switch (*forward++) {
case eof:
    if (forward is at end of first buffer) {
        reload second buffer;
        forward = beginning of second buffer;
    }
    else if (forward is at end of second buffer) {
        reload first buffer;
        forward = beginning of first buffer;
    }
    else /* eof within a buffer marks the end of input */
        terminate lexical analysis;
    break;
cases for the other characters;
}
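The sentinel scheme above can be sketched as runnable C. This is a simplified illustration under stated assumptions: the input comes from an in-memory string rather than a file, '\0' plays the role of eof, the buffer halves are tiny (HALF = 4) so that reloads actually happen, and only the forward pointer is modeled. Scanner, reload, scanner_init and next_char are names invented for this example.

```c
#include <string.h>

#define HALF 4            /* tiny on purpose so that reloads are exercised */
#define SENTINEL '\0'     /* assumption: never occurs inside the source text */

typedef struct {
    const char *src;              /* unread input (hypothetical in-memory source) */
    char buf[2 * (HALF + 1)];     /* two halves, each followed by a sentinel slot */
    char *forward;
} Scanner;

/* Fill one half with up to HALF characters and terminate it with the sentinel. */
static void reload(Scanner *s, char *half) {
    size_t n = strlen(s->src);
    if (n > HALF) n = HALF;
    memcpy(half, s->src, n);
    half[n] = SENTINEL;           /* falls inside the half when the input runs out */
    s->src += n;
}

void scanner_init(Scanner *s, const char *text) {
    s->src = text;
    reload(s, s->buf);
    s->forward = s->buf;
}

/* Return the next character, or -1 at the real end of input. */
int next_char(Scanner *s) {
    for (;;) {
        char c = *s->forward++;
        if (c != SENTINEL)
            return (unsigned char)c;                 /* common case: one test per character */
        if (s->forward == s->buf + HALF + 1) {       /* sentinel at end of first half */
            reload(s, s->buf + HALF + 1);
            s->forward = s->buf + HALF + 1;
        } else if (s->forward == s->buf + 2 * (HALF + 1)) {  /* end of second half */
            reload(s, s->buf);
            s->forward = s->buf;
        } else {                                     /* sentinel within a half: end of input */
            s->forward--;                            /* stay put so later calls also return -1 */
            return -1;
        }
    }
}
```

The point of the sentinel is visible in next_char: the ordinary path performs a single comparison per character, and the buffer-boundary checks run only when the sentinel is seen.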
Specification of tokens
• In compiler theory, regular expressions
are used to formalize the specification of
tokens
• Regular expressions are a means of specifying
regular languages
• Example:
• letter_ (letter_ | digit)*
• Each regular expression is a pattern specifying
the form of strings
Regular expressions
• ε is a regular expression, L(ε) = {ε}
• If a is a symbol in Σ, then a is a regular
expression, L(a) = {a}
• If r and s are regular expressions denoting L(r) and L(s), then:
– (r) | (s) is a regular expression denoting the
language L(r) ∪ L(s)
– (r)(s) is a regular expression denoting the
language L(r)L(s)
– (r)* is a regular expression denoting (L(r))*
– (r) is a regular expression denoting L(r)
Regular definitions
d1 -> r1
d2 -> r2
…
dn -> rn
• Example:
letter_ -> A | B | … | Z | a | b | … | z | _
digit -> 0 | 1 | … | 9
id -> letter_ (letter_ | digit)*
Extensions
• One or more instances: (r)+
• Zero or one instances: r?
• Character classes: [abc]
• Example:
– letter_ -> [A-Za-z_]
– digit -> [0-9]
– id -> letter_ (letter_ | digit)*
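As a rough illustration, the id definition can be checked by hand in C. The function names are invented for this sketch, and isalpha and isdigit are assumed to run in the default "C" locale, where they correspond to [A-Za-z] and [0-9].

```c
#include <ctype.h>

/* letter_ -> [A-Za-z_] */
static int is_letter_(int c) { return isalpha(c) || c == '_'; }

/* Return 1 if the whole string s matches id -> letter_ (letter_ | digit)*. */
int matches_id(const char *s) {
    const unsigned char *p = (const unsigned char *)s;
    if (!is_letter_(*p)) return 0;          /* first character must be letter_; rejects "" */
    for (p++; *p != '\0'; p++)
        if (!is_letter_(*p) && !isdigit(*p)) return 0;
    return 1;
}
```

For instance, matches_id("count_2") succeeds, while matches_id("2r") fails because the first character is a digit.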
Recognition of tokens
• Starting point is the language grammar to
understand the tokens:
stmt -> if expr then stmt
      | if expr then stmt else stmt
      | ε
expr -> term relop term
      | term
term -> id
      | number
Recognition of tokens (cont.)
• The next step is to formalize the patterns:
digit -> [0-9]
digits -> digit+
number -> digits (. digits)? (E [+-]? digits)?
letter -> [A-Za-z_]
id -> letter (letter | digit)*
if -> if
then -> then
else -> else
relop -> < | > | <= | >= | = | <>
• We also need to handle whitespace:
ws -> (blank | tab | newline)+
Transition diagrams
• Transition diagram for relop
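A transition diagram for relop can be turned into code along these lines. This is a sketch of one possible hand-coded recognizer for the relop pattern (< | > | <= | >= | = | <>); the attribute names LT, LE, EQ, NE, GT, GE are the usual manifest constants, while get_relop, NOT_RELOP and the len parameter are assumptions made for this example.

```c
typedef enum { LT, LE, EQ, NE, GT, GE, NOT_RELOP } RelopAttr;

/* Recognize a relop at the start of s and store the lexeme length in *len. */
RelopAttr get_relop(const char *s, int *len) {
    switch (s[0]) {                       /* start state: branch on the first character */
    case '<':
        if (s[1] == '=') { *len = 2; return LE; }
        if (s[1] == '>') { *len = 2; return NE; }
        *len = 1; return LT;              /* retract: s[1] belongs to the next token */
    case '=':
        *len = 1; return EQ;
    case '>':
        if (s[1] == '=') { *len = 2; return GE; }
        *len = 1; return GT;              /* retract here as well */
    default:
        *len = 0; return NOT_RELOP;       /* no edge leaves the start state */
    }
}
```

The cases that return after reading only one character correspond to the states of the diagram where the input must be retracted by one position.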
Transition diagrams (cont.)
• Transition diagram for reserved words and
identifiers
Transition diagrams (cont.)
• Transition diagram for unsigned numbers
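The diagram for unsigned numbers can be sketched the same way. The code below follows the pattern digit+ (. digit+)? (E [+-]? digit+)? and returns the length of the longest matching prefix; match_number is a name invented for this example.

```c
#include <ctype.h>

/* Length of the longest prefix of s that matches the number pattern, or 0 if none. */
int match_number(const char *s) {
    int i = 0;
    if (!isdigit((unsigned char)s[i])) return 0;
    while (isdigit((unsigned char)s[i])) i++;                /* digits */

    if (s[i] == '.' && isdigit((unsigned char)s[i + 1])) {   /* optional fraction */
        i++;
        while (isdigit((unsigned char)s[i])) i++;
    }
    if (s[i] == 'E') {                                       /* optional exponent */
        int j = i + 1;
        if (s[j] == '+' || s[j] == '-') j++;
        if (isdigit((unsigned char)s[j])) {
            while (isdigit((unsigned char)s[j])) j++;
            i = j;            /* commit only if at least one exponent digit was seen */
        }
    }
    return i;
}
```

When an optional part starts but cannot be completed, as in "1." or "1E", the function falls back to the last accepting position, which mirrors retracting the input in a transition diagram.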
Transition diagrams (cont.)
• Transition diagram for whitespace
Lexical Analyzer Generator - Lex
[Figure: lex.yy.c -> C compiler -> a.out; input stream -> a.out -> sequence of tokens]
Structure of Lex programs
declarations
%%
translation rules Pattern {Action}
%%
auxiliary functions
Example
%{
/* definitions of manifest constants
LT, LE, EQ, NE, GT, GE,
IF, THEN, ELSE, ID, NUMBER, RELOP */
%}

/* regular definitions */
delim [ \t\n]
ws {delim}+
letter [A-Za-z]
digit [0-9]
id {letter}({letter}|{digit})*
number {digit}+(\.{digit}+)?(E[+-]?{digit}+)?

%%

{ws} {/* no action and no return */}
if {return(IF);}
then {return(THEN);}
else {return(ELSE);}
{id} {yylval = (int) installID(); return(ID); }
{number} {yylval = (int) installNum(); return(NUMBER);}
…

%%

int installID() {/* function to install the lexeme, whose first
                    character is pointed to by yytext, and whose
                    length is yyleng, into the symbol table and
                    return a pointer thereto */
}

int installNum() {/* similar to installID, but puts numerical
                     constants into a separate table */
}
Finite Automata
• Regular expressions = specification
• Finite automata = implementation
• If end of input
– If in accepting state => accept, otherwise => reject
• If no transition possible => reject
Finite Automata State Graphs
• A state
• An accepting state
• A transition (an arrow labeled with an input symbol such as a)
A Simple Example
• A finite automaton that accepts only “1”
• A finite automaton accepts a string if we can follow transitions labeled with the characters in the string from the start to some accepting state
Another Simple Example
• A finite automaton accepting any number of 1’s followed by a
single 0
• Alphabet: {0,1}
• Check that “1110” is accepted but “110…” is not
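One way to check the claim is to code the automaton directly. The sketch below assumes three states: 0 is the start state (looping on 1), 1 is the accepting state reached on the single 0, and 2 is an explicit dead state that stands for "no transition possible => reject". The state numbering and the function name are choices made for this example.

```c
/* Accepts any number of 1's followed by a single 0, over the alphabet {0,1}. */
int accepts_ones_then_zero(const char *s) {
    int state = 0;
    for (; *s != '\0'; s++) {
        if (*s != '0' && *s != '1') return 0;        /* symbol outside the alphabet */
        switch (state) {
        case 0: state = (*s == '1') ? 0 : 1; break;  /* stay on 1, move to accept on 0 */
        case 1: state = 2; break;                    /* anything after the 0 is rejected */
        default: break;                              /* dead state: stay rejected */
        }
    }
    return state == 1;                               /* accept only if input ends in state 1 */
}
```

With this encoding "1110" ends in the accepting state, while any string that continues after "110" falls into the dead state.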
And Another Example
• Alphabet {0,1}
• What language does this recognize?
And Another Example
• Alphabet still { 0, 1 }
Epsilon Moves
• Another kind of transition: ε-moves
– The machine can move from state A to state B without reading input
Deterministic and Nondeterministic
Automata
• Deterministic Finite Automata (DFA)
– One transition per input per state
– No ε-moves
• Nondeterministic Finite Automata (NFA)
– Can have multiple transitions for one input in a
given state
– Can have ε-moves
• Finite automata have finite memory
– Need only to encode the current state
Execution of Finite Automata
• A DFA can take only one path through the
state graph
– Completely determined by input
Acceptance of NFAs
• An NFA can get into multiple states
• Input: 1 0 1
• Rule: NFA accepts if it can get into a final
state
NFA vs. DFA (1)
• NFAs and DFAs recognize the same set of
languages (regular languages)
NFA vs. DFA (2)
• For a given language the NFA can be simpler than the
DFA
[Figure: an NFA and an equivalent DFA for the same language]
• DFA can be exponentially larger than NFA
Regular Expressions to Finite Automata
• High-level sketch
Lexical Specification -> Regular expressions -> NFA -> DFA -> Table-driven Implementation of DFA
Regular Expressions to NFA (1)
• For each kind of rexp, define an NFA
– Notation: NFA for rexp A
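• For ε: a single ε-move from the start state to an accepting state
• For input a: a single transition on a from the start state to an accepting state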
Regular Expressions to NFA (2)
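• For AB: an ε-move joins the accepting state of the NFA for A to the start state of the NFA for B
• For A | B: a new start state has ε-moves into the NFAs for A and for B, and both have ε-moves to a new accepting state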
Regular Expressions to NFA (3)
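• For A*: a new start state has ε-moves into the NFA for A and to a new accepting state; the end of the NFA for A has an ε-move back to the new start state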
Example of RegExp -> NFA conversion
• Consider the regular expression
(1 | 0)*1
• The NFA is
[Figure: NFA with states A to J; C goes to E on 1, D goes to F on 0, I goes to J on 1, and the remaining transitions are ε-moves]
Next
Lexical Specification -> Regular expressions -> NFA -> DFA -> Table-driven Implementation of DFA
NFA to DFA. The Trick
• Simulate the NFA
• Each state of resulting DFA
= a non-empty subset of states of the NFA
• Start state
= the set of NFA states reachable through ε-moves from
NFA start state
• Add a transition from S to S’ on input a to the DFA iff
– S’ is the set of NFA states reachable from the states in S
after seeing the input a
• considering ε-moves as well
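The trick can be sketched in C with sets of NFA states stored as bit masks. The NFA used here is for (1 | 0)*1 with states A to J; its ε-edges are an assumption, reconstructed so that the start state comes out as {A,B,C,D,H,I}. The names qA…qJ, Set, eps_closure, dfa_move and nfa_accepts are invented for this example.

```c
enum { qA, qB, qC, qD, qE, qF, qG, qH, qI, qJ, NSTATES };
typedef unsigned Set;                 /* bit k set <=> NFA state k is in the set */
#define BIT(s) (1u << (s))

/* Assumed edges of the NFA for (1 | 0)*1; J is the accepting state. */
static const Set eps[NSTATES] = {
    [qA] = BIT(qB) | BIT(qH), [qB] = BIT(qC) | BIT(qD),
    [qE] = BIT(qG), [qF] = BIT(qG), [qG] = BIT(qA), [qH] = BIT(qI)
};
static const Set on0[NSTATES] = { [qD] = BIT(qF) };
static const Set on1[NSTATES] = { [qC] = BIT(qE), [qI] = BIT(qJ) };

/* All states reachable from s through zero or more ε-moves. */
Set eps_closure(Set s) {
    Set prev;
    do {
        prev = s;
        for (int q = 0; q < NSTATES; q++)
            if (s & BIT(q)) s |= eps[q];
    } while (s != prev);              /* repeat until no new state is added */
    return s;
}

/* DFA transition: states reachable from s on input a, then closed under ε-moves. */
Set dfa_move(Set s, char a) {
    Set t = 0;
    for (int q = 0; q < NSTATES; q++)
        if (s & BIT(q)) t |= (a == '0') ? on0[q] : on1[q];
    return eps_closure(t);
}

/* Simulate the NFA by tracking the current subset, one DFA state at a time. */
int nfa_accepts(const char *w) {
    Set s = eps_closure(BIT(qA));     /* DFA start state */
    for (; *w != '\0'; w++) {
        if (*w != '0' && *w != '1') return 0;
        s = dfa_move(s, *w);
    }
    return (s & BIT(qJ)) != 0;        /* accept if the subset contains J */
}
```

Building the full DFA amounts to calling dfa_move on every subset discovered so far until no new subsets appear; nfa_accepts does the same work lazily for a single input string.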
NFA -> DFA Example
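[Figure: the NFA for (1 | 0)*1 and the DFA obtained from it]
• DFA states: ABCDHI (start), FGABCDHI, EJGABCDHI
• On input 0 every DFA state goes to FGABCDHI
• On input 1 every DFA state goes to EJGABCDHI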
NFA to DFA. Remark
• An NFA may be in many states at any time
Table Implementation of a DFA
[Figure: DFA with states S, T, U; every state goes to T on 0 and to U on 1]
     0  1
S    T  U
T    T  U
U    T  U
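The table maps directly onto a two-dimensional array in C. In this sketch the rows are states and the columns are the inputs 0 and 1. That U is the accepting state is an assumption (it corresponds to the DFA state containing the NFA accepting state J), and dfa_accepts is a name chosen for the example.

```c
enum { S, T, U, N_STATES };

/* next_state[state][symbol], copied from the transition table. */
static const int next_state[N_STATES][2] = {
    [S] = { T, U },
    [T] = { T, U },
    [U] = { T, U },
};

/* Run the DFA over w; each input character costs one table lookup. */
int dfa_accepts(const char *w) {
    int state = S;
    for (; *w != '\0'; w++) {
        if (*w != '0' && *w != '1') return 0;    /* symbol outside the alphabet */
        state = next_state[state][*w - '0'];
    }
    return state == U;                           /* assumption: U is the accepting state */
}
```

The loop never inspects more than the current state and the current character, which is why the table-driven form is fast regardless of how large the original NFA was.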
Implementation (Cont.)
• NFA -> DFA conversion is at the heart of tools
such as flex or jflex