Unit II - Lexical Analysis-20-1-2021
UNIT II
Contents
Role of lexical analyzer
Specification of tokens
Recognition of tokens
Lexical analyzer generator
Finite automata
From regular expression to NFA
Design of lexical analyzer generator
Optimization of DFA-based pattern matchers
The role of lexical analyzer
[Diagram: the lexical analyzer reads the source program and, on each getNextToken call from the parser, returns the next token; the parser's output goes on to semantic analysis. Both the lexical analyzer and the parser consult the symbol table.]
Introduction
A simple way to build a lexical analyzer is to construct a diagram that illustrates the structure of the tokens of the source language.
The techniques used to implement lexical analyzers are also applicable to query languages and information-retrieval systems.
Lexical analyzers can make use of pattern-matching algorithms.
Two secondary tasks: removal of whitespace and comments, and correlating error messages with the source program.
The phase is sometimes divided into scanning and lexical analysis proper.
Why separate lexical analysis and parsing?
1. Simplicity of design
2. Improving compiler efficiency: specialized buffering techniques for reading input characters and processing tokens can significantly speed up the compiler
3. Enhancing compiler portability
Tokens, Patterns and Lexemes
A token is a pair consisting of a token name and an optional attribute value
A pattern is a description of the form that the
lexemes of a token may take
A lexeme is a sequence of characters in the source
program that matches the pattern for a token
Example: for the token id, lexemes such as count and rate match the pattern letter_ (letter_ | digit)*.
(Input buffering: eof sentinels mark the end of each buffer half, simplifying the end-of-buffer test.)
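A minimal C sketch of a token as the (token name, optional attribute value) pair defined above; the token names and attribute fields below are assumptions for illustration only:

/* Sketch only: token names and attribute fields assumed for illustration. */
enum token_name { ID, NUMBER, IF, THEN, ELSE, RELOP };

struct token {
    enum token_name name;      /* the token name, e.g. ID or RELOP             */
    union {
        int    symtab_index;   /* ID: index of the lexeme in the symbol table  */
        double value;          /* NUMBER: numeric value of the lexeme          */
        int    relop_kind;     /* RELOP: which comparison operator was seen    */
    } attribute;               /* the optional attribute value                 */
};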
Specification of tokens
In the theory of compilation, regular expressions are used to formalize the specification of tokens.
Regular expressions are a means of specifying regular languages.
Example:
letter_ (letter_ | digit)*
Each regular expression is a pattern specifying the form of strings.
Regular expressions
ε is a regular expression, L(ε) = {ε}
If a is a symbol in Σ, then a is a regular expression, L(a) = {a}
(r) | (s) is a regular expression denoting the language L(r) ∪ L(s)
(r)(s) is a regular expression denoting the language L(r)L(s)
(r)* is a regular expression denoting (L(r))*
(r) is a regular expression denoting L(r)
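For example (an illustration, not from the slides), over the alphabet Σ = {a, b}:
L(a | b) = {a, b}
L((a | b)(a | b)) = {aa, ab, ba, bb}
L(a*) = {ε, a, aa, aaa, …}
L((a | b)*) = the set of all strings of a's and b's, including ε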
Regular definitions
A regular definition gives names to regular expressions, as a sequence of definitions:
d1 -> r1
d2 -> r2
…
dn -> rn
where each di is a new name and each ri is a regular expression over Σ and the previously defined names.
Example:
letter_ -> A | B | … | Z | a | b | … | z | _
digit -> 0 | 1 | … | 9
id -> letter_ (letter_ | digit)*
Notational shorthands
One or more instances: (r)+
Zero or one instances: r?
Character classes: [abc]
Example:
letter_ -> [A-Za-z_]
digit -> [0-9]
id -> letter_ (letter_ | digit)*
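As a small illustration (not part of the slides; the function name is assumed), a hand-written C check that a string matches id -> letter_ (letter_ | digit)*:

#include <ctype.h>

/* Returns 1 if s matches id -> letter_ (letter_ | digit)*, else 0. */
int is_identifier(const char *s) {
    if (!(isalpha((unsigned char)*s) || *s == '_'))
        return 0;                          /* must start with letter_      */
    for (s++; *s != '\0'; s++)
        if (!(isalnum((unsigned char)*s) || *s == '_'))
            return 0;                      /* rest must be letter_ or digit */
    return 1;
}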
Recognition of tokens
The starting point is the language grammar, to understand the tokens:
stmt -> if expr then stmt
| if expr then stmt else stmt
| ε
expr -> term relop term
| term
term -> id
| number
Recognition of tokens (cont.)
The next step is to formalize the patterns:
digit -> [0-9]
digits -> digit+
number -> digits (. digits)? (E [+-]? digits)?   (a direct-coded recognizer is sketched below)
letter -> [A-Za-z_]
id -> letter (letter | digit)*
if -> if
then -> then
else -> else
relop -> < | > | <= | >= | = | <>
We also need to handle whitespace:
delim -> blank | tab | newline
ws -> delim+
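The number pattern above can be checked by a small hand-written routine; this C sketch (the function name is an assumption) follows digits (. digits)? (E [+-]? digits)?:

#include <ctype.h>

/* Returns 1 if s matches number -> digits (. digits)? (E [+-]? digits)?, else 0. */
int is_number(const char *s) {
    const unsigned char *p = (const unsigned char *)s;
    if (!isdigit(*p)) return 0;
    while (isdigit(*p)) p++;                 /* digits                    */
    if (*p == '.') {                         /* optional fraction         */
        p++;
        if (!isdigit(*p)) return 0;
        while (isdigit(*p)) p++;
    }
    if (*p == 'E') {                         /* optional exponent         */
        p++;
        if (*p == '+' || *p == '-') p++;
        if (!isdigit(*p)) return 0;
        while (isdigit(*p)) p++;
    }
    return *p == '\0';                       /* must consume all of s     */
}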
Transition diagrams
Transition diagram for relop
Transition diagrams (cont.)
Transition diagram for reserved words and
identifiers
Transition diagrams (cont.)
Transition diagram for unsigned numbers
Transition diagrams (cont.)
Transition diagram for whitespace
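As an illustration of how such transition diagrams become code (the function and enum names below are assumptions, not from the slides), the relop diagram can be direct-coded as a small state machine in C:

/* Assumed attribute values for the relop token. */
enum relop { LT, LE, NE, GT, GE, EQ, NOT_RELOP };

/* Direct-coded transition diagram for relop -> < | > | <= | >= | = | <>.
 * On success, *len is set to the number of characters consumed. */
enum relop scan_relop(const char *s, int *len) {
    switch (s[0]) {
    case '<':
        if (s[1] == '=') { *len = 2; return LE; }
        if (s[1] == '>') { *len = 2; return NE; }
        *len = 1; return LT;          /* any other char: retract, it is just '<' */
    case '>':
        if (s[1] == '=') { *len = 2; return GE; }
        *len = 1; return GT;
    case '=':
        *len = 1; return EQ;
    default:
        *len = 0; return NOT_RELOP;
    }
}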
Lexical Analyzer Generator - Lex
[Diagram: a Lex source program is translated by the Lex compiler into lex.yy.c; lex.yy.c is compiled by the C compiler into a.out, which reads an input stream and produces a sequence of tokens.]
A Lex program has three sections:
declarations
%%
translation rules    (each of the form:  pattern { action })
%%
auxiliary functions / code
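As an illustration of this layout (not from the slides; the token codes and the install_id() helper are assumptions), a minimal Lex specification for the tokens defined earlier might look like:

%{
/* declarations: token codes and install_id() are assumed for this sketch */
#define IF   1
#define ID   2
#define NUM  3
int yylval;                 /* attribute value of the last token           */
int install_id(void);       /* would enter yytext into the symbol table    */
%}
%option noyywrap
delim    [ \t\n]
ws       {delim}+
letter   [A-Za-z_]
digit    [0-9]
id       {letter}({letter}|{digit})*
%%
{ws}       { /* skip whitespace: no token returned */ }
if         { return IF; }
{id}       { yylval = install_id(); return ID; }
{digit}+   { return NUM; }
%%
int install_id(void) { /* sketch only: look up / insert yytext */ return 0; }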
Finite Automata
Regular expressions = specification
Finite automata = implementation
If at the end of input: if in an accepting state => accept, otherwise => reject
If no transition is possible => reject
Finite Automata State Graphs
[Diagram: a state is drawn as a circle, an accepting state as a double circle, and a transition on input a as a labeled arrow between states.]
A Simple Example
A finite automaton that accepts only “1”
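A minimal C sketch of the acceptance rule above, encoding the two-state automaton that accepts only the string "1" (the state numbering and function names are assumptions for illustration):

#define REJECT -1

/* DFA that accepts only "1": state 0 is the start state,
 * state 1 is the only accepting state. */
static int next_state(int state, char c) {
    if (state == 0 && c == '1') return 1;
    return REJECT;                        /* no transition possible        */
}

/* Simulate the DFA: read each character, follow the transition,
 * and accept iff we end in the accepting state. */
int accepts(const char *input) {
    int state = 0;
    for (const char *p = input; *p != '\0'; p++) {
        state = next_state(state, *p);
        if (state == REJECT) return 0;    /* no transition => reject       */
    }
    return state == 1;                    /* in accepting state => accept  */
}

For a full lexer the transition function would be a table indexed by state and input character, which is the table-driven implementation shown in the pipeline on the next slide.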
[Diagram: Lexical Specification → Regular expressions → NFA → DFA → Table-driven implementation of DFA.]
Thomson’s Construction
I/P – Regular Expression(r)
o/p- NFA accepting L(r)
Method-
Break r into its construction sub expressions
Construct NFA for each basic symbol
Combine all NFA’s to get final one
Properties
Each state has a unique name
The NFA for any r has exactly one start state and one accepting state
N(r) has at most twice as many states as there are symbols and operators in r
Each state of the NFA for r has either one outgoing transition on an input symbol or at most two outgoing ε-transitions
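The construction can be sketched in C as follows (the state and fragment representation is an assumption for illustration); each operator builds a fragment with exactly one start and one accepting state, matching the properties above:

#include <stdlib.h>

enum { EPS = 0 };               /* label 0 stands for an ε-transition      */

struct state {
    int   label1, label2;       /* labels of up to two outgoing edges      */
    struct state *out1, *out2;  /* targets of those edges (NULL if unused) */
};

struct frag { struct state *start, *accept; };  /* one start, one accept   */

static struct state *new_state(void) { return calloc(1, sizeof(struct state)); }

/* NFA for a single symbol a (a != 0): start --a--> accept */
struct frag nfa_symbol(int a) {
    struct frag f = { new_state(), new_state() };
    f.start->label1 = a;  f.start->out1 = f.accept;
    return f;
}

/* NFA for rs: accepting state of r is glued to the start of s by an ε-edge */
struct frag nfa_concat(struct frag r, struct frag s) {
    r.accept->label1 = EPS;  r.accept->out1 = s.start;
    return (struct frag){ r.start, s.accept };
}

/* NFA for r|s: a new start ε-branches into r and s; both ε-join a new accept */
struct frag nfa_union(struct frag r, struct frag s) {
    struct frag f = { new_state(), new_state() };
    f.start->label1 = EPS;  f.start->out1 = r.start;
    f.start->label2 = EPS;  f.start->out2 = s.start;
    r.accept->label1 = EPS; r.accept->out1 = f.accept;
    s.accept->label1 = EPS; s.accept->out1 = f.accept;
    return f;
}

/* NFA for r*: ε-edges allow skipping r entirely or repeating it */
struct frag nfa_star(struct frag r) {
    struct frag f = { new_state(), new_state() };
    f.start->label1 = EPS;  f.start->out1 = r.start;
    f.start->label2 = EPS;  f.start->out2 = f.accept;
    r.accept->label1 = EPS; r.accept->out1 = f.start;
    return f;
}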
Regular Expressions to NFA (1)
For each kind of regular expression, define an NFA. Notation: the NFA for a regular expression A is drawn as a box labeled A.
[Diagram: for ε, the start state has an ε-edge to the accepting state; for input a, the start state has an a-edge to the accepting state.]
Regular Expressions to NFA (2)
[Diagram: for AB, the NFA for A is joined in series to the NFA for B; for A | B, a new start state branches on ε into the NFAs for A and B, and both rejoin at a new accepting state.]
Regular Expressions to NFA (3)
[Diagram: for A*, ε-edges allow the input either to bypass the NFA for A or to pass through it and loop back any number of times.]
Example of RegExp -> NFA conversion
[Diagram: step-by-step Thompson construction of an NFA with states A through J for a regular expression over the alphabet {0, 1}; figure not reproduced.]
Conversion of an NFA into DFA
The subset construction algorithm is useful for simulating an NFA with a computer program.
In the transition table of an NFA, each entry is a set of states;
in the transition table of a DFA, each entry is just a single
state.
The general idea behind the NFA-to-DFA construction is that
each DFA state corresponds to a set of NFA states.
The DFA uses its state to keep track of all possible states the
NFA can be in after reading each input symbol.
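A sketch of the two key operations, ε-closure(T) and move(T, a), using bitmask state sets (the representation and names are assumptions, not from the slides):

/* Each DFA state is a set of NFA states, stored here as a bitmask. */
#define NSTATES 32                 /* NFA states 0..31 fit in one mask   */
#define NSYMS    2                 /* example alphabet {0, 1}            */

typedef unsigned int set_t;        /* bit i set <=> NFA state i in set   */

set_t eps[NSTATES];                /* eps[i]: ε-successors of state i    */
set_t delta[NSTATES][NSYMS];       /* delta[i][a]: successors of i on a  */

/* ε-closure(T): all NFA states reachable from T using ε-edges alone */
set_t eps_closure(set_t T) {
    set_t closure = T, old;
    do {
        old = closure;
        for (int i = 0; i < NSTATES; i++)
            if (closure & (1u << i))
                closure |= eps[i];
    } while (closure != old);
    return closure;
}

/* move(T, a): NFA states reachable from T on one edge labeled a */
set_t move(set_t T, int a) {
    set_t result = 0;
    for (int i = 0; i < NSTATES; i++)
        if (T & (1u << i))
            result |= delta[i][a];
    return result;
}

/* One entry of Dtran: from DFA state T on symbol a, the next DFA state is
 * ε-closure(move(T, a)); the full algorithm repeats this for every unmarked
 * DFA state until no new state sets appear. */
set_t dtran_step(set_t T, int a) {
    return eps_closure(move(T, a));
}

The start state of the DFA is eps_closure(1u << s0), where s0 is the NFA's start state, and a DFA state is accepting if its set contains an accepting NFA state.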
Subset Construction
- constructing a DFA from an NFA
Input: An NFA N.
Output: A DFA D accepting the same language.
Method: We construct a transition table Dtran for D.
Each DFA state is a set of NFA states and we
construct Dtran so that D will simulate “in parallel”
all possible moves N can make on a given input
string.
Subset Construction (II)