CD - Unit II - Notes
Lexical analysis, Role of a Lexical analyzer, A simple approach to the design of lexical analyzer,
Regular expressions, Finite automata, Minimizing number of states of a DFA, Implementation of
a lexical analyzer
_____________________________________________________________________________
Introduction
The function of the lexical analyzer is to read the source program one character at a time and to
translate it into a sequence of primitive units called tokens.
Regular expression: a notation that can be used to describe all the tokens of a programming
language.
If the lexical analyzer and parser are in separate passes, the lexical analyzer places its output on
an intermediate file from which the parser then takes its input.
If the lexical analyzer and parser are together in the same pass, the lexical analyzer acts as a
co-routine or subroutine which is called by the parser whenever it needs a new token. This
organization eliminates the need for the intermediate file. In this arrangement, the lexical
analyzer returns to the parser a representation for the token it has found. The representation is an
integer code if the token is built in, such as a left parenthesis, comma or colon. The
representation is a pair consisting of an integer code and a pointer to a table if the token is
user-defined, such as an identifier or constant.
The integer code gives the token type; the pointer points to the value of that token.
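The single-pass arrangement can be sketched with a Python generator standing in for the co-routine: the parser asks for one token at a time and no intermediate file is needed. This is only an illustration. The integer codes, the name `tokens`, and the use of a list index as the "pointer" are assumptions, not part of the notes.

```python
# Hypothetical integer codes, chosen only for illustration.
LPAREN, COMMA, COLON, IDENT, CONST = 1, 2, 3, 4, 5
BUILT_IN = {"(": LPAREN, ",": COMMA, ":": COLON}

def tokens(source, table):
    """Yield an int for a built-in token, or a (code, index) pair otherwise."""
    i = 0
    while i < len(source):
        c = source[i]
        if c.isspace():
            i += 1
        elif c in BUILT_IN:
            yield BUILT_IN[c]
            i += 1
        elif c.isalnum():
            j = i
            while j < len(source) and source[j].isalnum():
                j += 1
            lexeme = source[i:j]
            if lexeme not in table:
                table.append(lexeme)            # the value lives in the table
            code = CONST if lexeme.isdigit() else IDENT
            yield (code, table.index(lexeme))   # the index plays the role of the pointer
            i = j
        else:
            raise ValueError(f"unexpected character {c!r}")

symbol_table = []
lexer = tokens("x, y: 10", symbol_table)
first = next(lexer)    # the parser requests a single token
rest = list(lexer)     # ... and later the remaining ones
```

Each call to `next(lexer)` resumes the analyzer where it stopped, which is the behavior the notes describe for "get next token".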
[Figure: Source program -> Lexical analyzer -> (token) -> Parser; the parser sends "get next token" to the lexical analyzer; a symbol table is also shown.]
The need for a lexical analyzer:
The purpose of splitting the analysis of the source program into two phases, lexical
analysis and syntax analysis, is to simplify the overall design of the compiler.
It is easier to specify the structure of tokens than the syntactic structure of the source program.
The functions of the lexical analyzer are:
Keeping track of line numbers
Producing an output listing if necessary
Stripping out white space (blanks or tabs)
Deleting comments
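A minimal sketch of three of these housekeeping tasks (line numbers, white space, comments) is given below. The notes do not say what a comment looks like, so the sketch assumes a comment starts with `#` and runs to the end of the line. The function name is hypothetical, and runs of blanks are collapsed to a single blank rather than removed entirely.

```python
def strip_blanks_and_comments(source):
    """Return (line_number, text) pairs with comments deleted and blanks collapsed.

    Assumption: a comment starts with '#' and runs to the end of the line.
    """
    result = []
    for line_number, line in enumerate(source.split("\n"), start=1):
        code = line.split("#", 1)[0]       # delete the comment, if any
        code = " ".join(code.split())      # collapse runs of blanks and tabs
        if code:                           # skip lines that are now empty
            result.append((line_number, code))
    return result

program = "x  :=\t1   # set x\n\n# only a comment\ny := x"
listing = strip_blanks_and_comments(program)
```

Keeping the original line number with each surviving line lets later phases report errors against the source as the programmer wrote it.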
A flowchart is used to describe the behavior of a program. When designing lexical analyzers, a
specialized kind of flowchart called a transition diagram is used.
In transition diagram,
The above Figure (2) shows a transition diagram for an identifier, defined to be a letter followed
by any number of letters or digits.
The starting state of the transition diagram is state 0; the edge leaving it indicates that the first
input character must be a letter.
If this is the case, we enter state 1 and look at the next input character. If that is a letter or digit,
we re-enter state 1 and look at the input character after that. We continue this way, reading letters
and digits and making transitions from state 1 to itself, until the next input character is a
delimiter for an identifier. On reading the delimiter, we enter state 2.
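Example:
state 0: if C is a letter then goto state 1
         else FAIL( )
state 1: if C is a letter or DIGIT(C) then goto state 1
         else if DELIMITER(C) then goto state 2
         else FAIL( )
state 2: RETRACT( )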
The state 2 indicates that an identifier has been found. Since the delimiter is not part of the
identifier, we must retract the lookahead pointer one character, for which we use a procedure
RETRACT( ).
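The three-state diagram described above can be simulated directly. The sketch below is one possible rendering, not the notes' own code. It assumes that any character other than a letter or digit counts as a delimiter, and it uses `"\0"` as a stand-in for end of input. The names `scan_identifier` and `is_delimiter` are hypothetical.

```python
def is_delimiter(c):
    # Assumption: anything that is not a letter or digit may follow an identifier.
    return not c.isalnum()

def scan_identifier(text, start=0):
    """Follow states 0 -> 1 -> 2; return (lexeme, next_position) or None on failure."""
    pos = start
    state = 0
    while True:
        c = text[pos] if pos < len(text) else "\0"   # "\0" marks end of input
        pos += 1                                      # advance the lookahead pointer
        if state == 0:
            if c.isalpha():
                state = 1
            else:
                return None                           # corresponds to FAIL( )
        elif state == 1:
            if c.isalpha() or c.isdigit():
                state = 1                             # stay in state 1
            elif is_delimiter(c):
                state = 2
            else:
                return None
        if state == 2:
            pos -= 1                                  # corresponds to RETRACT( )
            return text[start:pos], pos

found = scan_identifier("count1 = 0")
```

The returned position points at the delimiter, so the next scan starts with the character that ended the identifier.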
ii) FAIL( ) is a routine that retracts the lookahead pointer and starts up the next
transition diagram, or calls the error routine if no diagram remains.
iii) DIGIT( ) is a procedure that returns true if and only if C is one of the digits 0, 1, 2, …, 9.
iv) DELIMITER( ) is a procedure that returns true if C is a character that could follow an
identifier.
v) RETRACT( ) is a procedure that moves the lookahead pointer back one character. It is needed
because the delimiter that ends an identifier is not part of the identifier.
The symbol * marks a state on which input retraction must take place.
vi) INSTALL( ) installs a newly found identifier in the symbol table if it is not already there.
Token    Code    Value
begin    1       --
end      2       --
if       3       --
then     4       --
else     5       --
<        8       1
<=       9       2
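The table above can be held as a dictionary, with INSTALL( ) handling anything that is not listed. This is a sketch under stated assumptions: the codes and values are copied from the table as given, while the identifier code 6, the list-based symbol table, and the name `classify` are hypothetical.

```python
# Codes and values copied from the table; "--" entries become None.
TOKEN_TABLE = {
    "begin": (1, None),
    "end": (2, None),
    "if": (3, None),
    "then": (4, None),
    "else": (5, None),
    "<": (8, 1),
    "<=": (9, 2),
}
IDENTIFIER_CODE = 6   # assumption: the table shown does not list a code for identifiers

symbol_table = []

def install(lexeme):
    """Add lexeme to the symbol table if it is not already there; return its index."""
    if lexeme not in symbol_table:
        symbol_table.append(lexeme)
    return symbol_table.index(lexeme)

def classify(lexeme):
    """Return the (code, value) pair for a lexeme."""
    if lexeme in TOKEN_TABLE:
        return TOKEN_TABLE[lexeme]
    return (IDENTIFIER_CODE, install(lexeme))

pairs = [classify(w) for w in ["if", "total", "<=", "total", "n"]]
```

Looking up the keyword table first is what keeps reserved words such as `if` out of the symbol table.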
Regular expressions
1) Alphabet/character set
Ex. The set {0,1} is an alphabet. It consists of the two symbols 0 and 1, and it is often called the
binary alphabet.
A) String: a finite sequence of symbols, such as 001.
Operations on strings:
i) Length of string:
The length of a string x, usually denoted by |x|, is the total number of symbols in x.
|x|=5
If x and y are strings, then the concatenation of x and y, written xy or x.y, is the string formed
by following the symbols of x by the symbols of y.
The concatenation of the empty string with any string is that string. That is, εx = xε = x.
x^1 = x
x^2 = xx
Prefix of string: If x is some string, then any string formed by discarding zero or more trailing
symbols of x is called a prefix of x.
Suffix of string: A suffix of x is a string formed by deleting zero or more of the leading symbols of
x.
Substring: A substring of x is any string formed by deleting a prefix and a suffix of x.
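The string operations above map directly onto Python slicing. The short sketch below enumerates the prefixes, suffixes and substrings of the string 001 used earlier in the notes. The helper names are illustrative only.

```python
def prefixes(x):
    # Discard zero or more trailing symbols.
    return [x[:i] for i in range(len(x) + 1)]

def suffixes(x):
    # Delete zero or more leading symbols.
    return [x[i:] for i in range(len(x) + 1)]

def substrings(x):
    # Delete a prefix and a suffix; a set removes repeats such as "0".
    return sorted({x[i:j] for i in range(len(x) + 1) for j in range(i, len(x) + 1)})

x = "001"
length = len(x)          # |x|
same = "" + x == x + ""  # concatenation with the empty string
```

Note that the empty string and x itself are both prefixes, suffixes and substrings of x.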
Operations on languages
i) Concatenation of languages
If L and M are languages then LM or L.M is the language consisting of all strings xy which can
be formed by selecting a string x from L and a string y from M.
Example:
L = {0, 01, 110}
M = {10, 110}
LM = {010, 0110, 01110, 11010, 110110}
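Concatenation of languages is a one-line set comprehension. In the sketch below, M is taken from the example, while L = {0, 01, 110} is an assumption inferred from the strings listed in LM. Because the result is a set, the string 0110 (obtainable both as 0 followed by 110 and as 01 followed by 10) appears only once.

```python
def concat(L, M):
    """All strings xy with x taken from L and y taken from M."""
    return {x + y for x in L for y in M}

L = {"0", "01", "110"}   # assumed; inferred from the strings listed in LM
M = {"10", "110"}
LM = concat(L, M)
```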
L ∪ M = {x | x is in L or x is in M}
Example:
M= {10, 110}
L^0 = {ε}
L^1 = L
L^2 = LL
L^3 = LLL
iv) The Kleene (*) closure of a language L, denoted by L*, is “zero or more occurrences of” L.
v) The positive (+) closure of a language L, denoted by L+, is “one or more occurrences of” L.
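Since L* and L+ are infinite for any language containing a non-empty string, a program can only list them up to a bound. The sketch below builds the powers of L by repeated concatenation and takes the union up to a chosen exponent. The function names and the example language {a, b} are illustrative assumptions.

```python
def concat(L, M):
    return {x + y for x in L for y in M}

def power(L, n):
    """L^n, with L^0 = {""} (the set holding only the empty string)."""
    result = {""}
    for _ in range(n):
        result = concat(result, L)
    return result

def kleene_up_to(L, max_n):
    """Union of L^0 .. L^max_n: a finite slice of L*."""
    out = set()
    for n in range(max_n + 1):
        out |= power(L, n)
    return out

def positive_up_to(L, max_n):
    """Union of L^1 .. L^max_n: a finite slice of L+."""
    out = set()
    for n in range(1, max_n + 1):
        out |= power(L, n)
    return out

L = {"a", "b"}
star2 = kleene_up_to(L, 2)
plus2 = positive_up_to(L, 2)
```

For this L, the only difference between the two slices is the empty string, which belongs to L* but not to L+.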
Finite Automata
A recognizer for language L is a program that takes as input a string x and answers “yes” if x is a
sentence of L and “no” otherwise.
I) NFA
II) DFA
A nondeterministic finite automaton (NFA) is a directed labeled graph in which the nodes are called
states and the labeled edges are called transitions.
An NFA looks like a transition diagram, but its edges can be labeled with ε (epsilon) as well as
with characters.
1. a set of states S
6. A set of states F, called the accepting or final states, indicated by double circles
2. There may not be two identically labeled transitions out of the same state
The state and the input symbol uniquely determine the transition. This is why the automaton is
called a deterministic finite automaton (DFA).
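Because the state and the input symbol determine the transition uniquely, a DFA can be stored as a dictionary keyed by (state, symbol) and run with a simple loop. The sketch below uses the transition table from the worked example later in these notes for (a|b)*abb. Treating A as the start state and E as the only accepting state is an assumption consistent with that example.

```python
# Transition table for (a|b)*abb, as in the worked example in these notes.
DFA = {
    ("A", "a"): "B", ("A", "b"): "C",
    ("B", "a"): "B", ("B", "b"): "D",
    ("C", "a"): "B", ("C", "b"): "C",
    ("D", "a"): "B", ("D", "b"): "E",
    ("E", "a"): "B", ("E", "b"): "C",
}
START = "A"          # assumption
ACCEPTING = {"E"}    # assumption

def accepts(text):
    """Answer "yes" (True) if text is in the language, "no" (False) otherwise."""
    state = START
    for ch in text:
        if (state, ch) not in DFA:
            return False             # symbol outside the alphabet
        state = DFA[(state, ch)]     # exactly one choice per (state, symbol)
    return state in ACCEPTING
```

A dictionary cannot hold two entries for the same key, which mirrors the rule that no state has two identically labeled outgoing transitions.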
Algorithms:
1. Regular expression to NFA
2. NFA to DFA
Algorithm 1: From a regular expression to a finite automaton (NFA)
Method: Decompose R into its primitive components. For each component we construct a finite
automaton inductively.
4. For R1R2, we construct the NFA
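The inductive idea can be sketched in the style of Thompson's construction: build a small NFA for each primitive component and glue the pieces together with ε-transitions. The diagrams for the individual cases are not reproduced here, so the exact wiring below is an assumption. In particular, this sketch joins R1 and R2 with an ε edge. The tuple encoding of expressions and all function names are hypothetical.

```python
import itertools

_ids = itertools.count()   # supplies fresh state numbers

def build(r, edges):
    """Return (start, final) of an NFA for r, appending (src, label, dst) to edges.

    r is a nested tuple: ("sym", c), ("cat", r1, r2), ("alt", r1, r2), ("star", r1).
    A label of None denotes an epsilon transition.
    """
    kind = r[0]
    if kind == "sym":
        s, f = next(_ids), next(_ids)
        edges.append((s, r[1], f))
        return s, f
    if kind == "cat":
        s1, f1 = build(r[1], edges)
        s2, f2 = build(r[2], edges)
        edges.append((f1, None, s2))          # R1 followed by R2
        return s1, f2
    if kind == "alt":
        s, f = next(_ids), next(_ids)
        for sub in r[1:]:
            ss, sf = build(sub, edges)
            edges.extend([(s, None, ss), (sf, None, f)])
        return s, f
    if kind == "star":
        s, f = next(_ids), next(_ids)
        ss, sf = build(r[1], edges)
        # enter the body, leave it, skip it entirely, or repeat it
        edges.extend([(s, None, ss), (sf, None, f), (s, None, f), (sf, None, ss)])
        return s, f
    raise ValueError(f"unknown node {kind!r}")

def closure(states, edges):
    """States reachable from the given states on epsilon transitions alone."""
    stack, seen = list(states), set(states)
    while stack:
        s = stack.pop()
        for src, label, dst in edges:
            if src == s and label is None and dst not in seen:
                seen.add(dst)
                stack.append(dst)
    return seen

def matches(r, text):
    """Build the NFA for r and simulate it on text."""
    edges = []
    start, final = build(r, edges)
    current = closure({start}, edges)
    for ch in text:
        moved = {dst for src, label, dst in edges if src in current and label == ch}
        current = closure(moved, edges)
    return final in current

a, b = ("sym", "a"), ("sym", "b")
R = ("cat", ("cat", ("cat", ("star", ("alt", a, b)), a), b), b)   # (a|b)*abb
```

The simulation in `matches` is there only to check the construction. It tracks the set of states the NFA could be in, which is the same idea the subset construction turns into a DFA.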
Algorithm 2: Constructing a DFA from an NFA (subset construction algorithm)
Input: An NFA
Operation        Description
ε-closure(T)     Set of states reachable from some state s in T on ε-transitions alone
MOVE(T, a)       Set of states to which there is a transition on input symbol a from some
                 state s in T
Example:
Input:
Method: R=(a|b)*abb
ε-closure(S) = 0
ε-closure(T) = {0,1,2,4,7} = A
State   Input symbol
        a     b
A       B     C
B       B     D
C       B     C
D       B     E
E       B     C
Transition table
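The transition table can be reproduced with a short program built on ε-closure and MOVE. The NFA diagram for (a|b)*abb is not shown in these notes, so the sketch assumes the usual eleven-state textbook NFA (states 0 to 10), which agrees with the ε-closure {0,1,2,4,7} given in the example. The dictionaries `EPS` and `DELTA` and the function names are hypothetical.

```python
# Assumed NFA for (a|b)*abb: epsilon edges and labeled edges.
EPS = {0: {1, 7}, 1: {2, 4}, 3: {6}, 5: {6}, 6: {1, 7}}
DELTA = {(2, "a"): {3}, (4, "b"): {5}, (7, "a"): {8}, (8, "b"): {9}, (9, "b"): {10}}

def epsilon_closure(T):
    """States reachable from some state in T on epsilon transitions alone."""
    stack, closure = list(T), set(T)
    while stack:
        s = stack.pop()
        for t in EPS.get(s, ()):
            if t not in closure:
                closure.add(t)
                stack.append(t)
    return closure

def move(T, a):
    """States reachable on input symbol a from some state in T."""
    result = set()
    for s in T:
        result |= DELTA.get((s, a), set())
    return result

def subset_construction(start, alphabet):
    """Return (names, table): DFA state names for NFA state sets, and the DFA transitions."""
    start_set = frozenset(epsilon_closure({start}))
    names = {start_set: "A"}
    table = {}
    worklist = [start_set]
    while worklist:
        T = worklist.pop(0)                      # take the oldest unmarked state
        for a in alphabet:
            U = frozenset(epsilon_closure(move(T, a)))
            if not U:
                continue                         # no transition on a
            if U not in names:
                names[U] = chr(ord("A") + len(names))
                worklist.append(U)
            table[(names[T], a)] = names[U]
    return names, table

names, table = subset_construction(0, "ab")
```

Under these assumptions, the only DFA state whose set contains NFA state 10 is E, so E is the accepting state.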
_____________________________________________________________________________
Input: A DFA M
Output: A DFA M’ accepting the same language as M and having as few states as possible.
1. Final states
Rules:
Example:
State   Input symbol
        a     b
A       B     C
B       B     D
C       B     C
D       B     E
E       B     C
Transition table
1. Construct Πnew
Π= (E)
Πnew= (E)
______________________________________________________________________________
2. Construct Πnew
Π= (ABCD)
Πnew= (ABCD)
On input a,
A goes to B
B goes to B
C goes to B
D goes to B
On input b,
A goes to C
B goes to D
C goes to C
D goes to E
_____________________________________________________________________________
3. Construct Πnew
Π = (ABC)
Πnew= (ABC)
On input a,
A goes to B
B goes to B
C goes to B
On input b,
A goes to C
B goes to D
C goes to C
Π = (ABC)
_____________________________________________________________________________
4. Construct Πnew
Π= (AC)
Πnew= (AC)
On input a,
A goes to B
C goes to B
On input b,
A goes to C
C goes to C
Π= (AC)
Πnew= (AC)
The final partition Π consists of (AC)(B)(D)(E)
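The refinement steps worked through above can be automated. The sketch below starts from the partition (ABCD)(E) and keeps splitting groups whose members disagree on which group an input leads to. It is one straightforward way to code the idea, not necessarily the exact procedure intended in the notes. All names are hypothetical.

```python
DFA = {
    ("A", "a"): "B", ("A", "b"): "C",
    ("B", "a"): "B", ("B", "b"): "D",
    ("C", "a"): "B", ("C", "b"): "C",
    ("D", "a"): "B", ("D", "b"): "E",
    ("E", "a"): "B", ("E", "b"): "C",
}
STATES = set("ABCDE")
ACCEPTING = {"E"}
ALPHABET = "ab"

def refine(partition):
    """Split each group so that its members agree, on every input, on the target group."""
    def group_of(state):
        for i, group in enumerate(partition):
            if state in group:
                return i
    new_partition = []
    for group in partition:
        buckets = {}
        for s in sorted(group):
            key = tuple(group_of(DFA[(s, a)]) for a in ALPHABET)
            buckets.setdefault(key, set()).add(s)
        new_partition.extend(buckets.values())
    return new_partition

def minimize():
    partition = [STATES - ACCEPTING, set(ACCEPTING)]   # (ABCD)(E)
    while True:
        new_partition = refine(partition)
        if len(new_partition) == len(partition):       # nothing was split: stable
            return partition
        partition = new_partition

final_partition = sorted(sorted(group) for group in minimize())
```

Refinement can only split groups, never merge them, so an unchanged group count is enough to detect that the partition has stabilized.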
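State   Input symbol
        a     b
A       B     C
B       B     D
C       B     C
D       B     E
E       B     C

Choosing A as the representative of the group (AC), every transition to C is redirected to A and
the row for C is dropped. The minimized transition table is:

State   Input symbol
        a     b
A       B     A
B       B     D
D       B     E
E       B     A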
[Figure: Minimized DFA]