
CD - Unit II - Notes

The document outlines the principles of lexical analysis, including the role of a lexical analyzer, the use of regular expressions, and finite automata in recognizing tokens in programming languages. It describes the design of lexical analyzers through transition diagrams and algorithms for converting regular expressions to finite automata, as well as minimizing the number of states in a DFA. The document emphasizes the importance of separating lexical analysis from syntax analysis to simplify compiler design.

Uploaded by

alimdhalait

UNIT – II Lexical Analysis

Lexical analysis, Role of a Lexical analyzer, A simple approach to the design of lexical analyzer,
Regular expressions, Finite automata, Minimizing number of states of a DFA, Implementation of
a lexical analyzer
_____________________________________________________________________________

 Introduction

The function of the lexical analyzer is to read the source program one character at a time and to
translate it into a sequence of primitive units called tokens.

Example: keywords, constants, identifiers, symbols and operators.

Regular expression: a notation that can be used to describe all the tokens of a programming
language.

Finite automata: recognizers used to identify the tokens of a programming language.

 The role of the lexical analyzer

If the lexical analyzer and parser are in separate passes, the lexical analyzer places its output on
an intermediate file from which the parser then takes its input.

If the lexical analyzer and parser are together in the same pass, the lexical analyzer acts as a
coroutine or subroutine which is called by the parser whenever it needs a new token. This
organization eliminates the need for the intermediate file. In this arrangement, the lexical
analyzer returns to the parser a representation for the token it has found. The representation is an
integer code if the token is built in, such as a left parenthesis, comma or colon. The
representation is a pair consisting of an integer code and a pointer to a table if the token is
user defined, such as an identifier or constant.

The integer code gives the token type; the pointer points to the value of that token.
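
Figure (1) Working of lexical analyzer and parser. The boxes in the figure are the source program, the lexical analyzer, the parser and the symbol table; the lexical analyzer passes a token to the parser in response to each "get next token" request.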

 The need for lexical analyzer :

The purpose of splitting the analysis of the source program into two phases, lexical
analysis and syntax analysis, is to simplify the overall design of the compiler.

It is easier to specify the structure of tokens than the syntactic structure of the source program.

Other functions of the lexical analyzer are:
 Keeping track of line numbers
 Producing output listing if necessary
 Stripping out white spaces (blanks or tab)
 Deleting comments

 Simple approach to the design of lexical analyzer

A flowchart describes the behavior of a program. When designing lexical analyzers, a
specialized kind of flowchart called a transition diagram is used.

In transition diagram,

 The circles are called states.


 The states are connected by arrows called edges.
 The labels on the various edges leaving a state indicate the input character that can appear
after that state.

Figure (2) Transition diagram for identifier

The above Figure (2) shows a transition diagram for an identifier, defined to be a letter followed
by a number of letters or digits.

The starting state of the transition diagram is state 0, the edge from which indicates that the first
input character must be a letter.

If this is the case, we enter state 1 and look at the next input character. If that is a letter or digit,
we re-enter state 1 and look at the input character after that. We continue this way, reading letters
and digits and making transitions from state 1 to itself, until the next input character is a
delimiter for an identifier. On reading the delimiter, we enter state 2.

Example:

Consider the transition diagram shown in Figure (2).

The code for state 0:

State 0: C := GETCHAR( );
         if LETTER(C) then goto state 1
         else FAIL( )

The code for state 1:

State 1: C := GETCHAR( );
         if LETTER(C) or DIGIT(C) then goto state 1
         else if DELIMITER(C) then goto state 2
         else FAIL( )

The code for state 2:

State 2: RETRACT( );
         return (id, INSTALL( ))

The state 2 indicates that an identifier has been found. Since the delimiter is not part of the
identifier, we must retract the lookahead pointer one character, for which we use a procedure
RETRACT( ).

i) LETTER( ) is a procedure which returns true if and only if C is a letter.

ii) FAIL( ) is a routine which retracts the lookahead pointer and starts up the next
transition diagram if there is one; otherwise it calls the error routine.

iii) DIGIT( ) is a procedure which returns true if and only if C is one of the digits 0,1,2,…,9.

iv) DELIMITER( ) is a procedure which returns true if C is a character that could follow an
identifier, e.g. blank, comma, semicolon etc.

v) RETRACT( ) is a procedure which moves the lookahead pointer back one character. It is needed
because the delimiter that ends an identifier is not part of the identifier. In a transition
diagram, the symbol * marks a state on which input retraction must take place.

vi) INSTALL( ) is a procedure which installs the newly found identifier in the symbol table if it is
not already there and returns a pointer to its entry.
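
The three states above can be sketched in Python roughly as follows. This is only an illustration, not the notes' own implementation: the function names are hypothetical, `str.isalpha` and `str.isalnum` stand in for LETTER and DIGIT (they also accept non-ASCII letters), and a delimiter is assumed to be any non-alphanumeric character or the end of input.

```python
def is_delimiter(c):
    # Assumption: anything that is not a letter or digit ends an identifier;
    # the empty string represents end of input.
    return c == "" or not c.isalnum()

def scan_identifier(text, pos=0):
    """Return (lexeme, next_pos) if an identifier starts at pos, else None."""
    state, i = 0, pos
    while True:
        c = text[i] if i < len(text) else ""   # GETCHAR
        i += 1
        if state == 0:
            if c.isalpha():                    # LETTER(C)
                state = 1
            else:
                return None                    # FAIL
        elif state == 1:
            if c.isalnum():                    # LETTER(C) or DIGIT(C)
                state = 1
            elif is_delimiter(c):              # DELIMITER(C)
                state = 2
            else:
                return None                    # FAIL
        if state == 2:
            i -= 1                             # RETRACT: delimiter is not consumed
            return text[pos:i], i

result = scan_identifier("count1 = 0")
```

Here `result` holds the lexeme and the position of the delimiter, so the next scan can resume from the character that was retracted.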

State 2 returns a (code, value) pair as follows:

Token        Code   Value
begin        1      --
end          2      --
if           3      --
then         4      --
else         5      --
identifier   6      pointer to symbol table
constant     7      pointer to symbol table
<            8      1
<=           9      2

Table (1) Tokens recognized
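
A minimal sketch of how such (code, value) pairs might be produced for keywords and identifiers is given below. The codes come from Table (1); the list-based symbol table, the use of a list index as the "pointer", and the names `install` and `make_token` are assumptions made for illustration.

```python
KEYWORD_CODES = {"begin": 1, "end": 2, "if": 3, "then": 4, "else": 5}
IDENTIFIER_CODE = 6

symbol_table = []   # assumption: a plain list; the index acts as the pointer

def install(lexeme):
    """Add lexeme to the symbol table if absent and return its index."""
    if lexeme not in symbol_table:
        symbol_table.append(lexeme)
    return symbol_table.index(lexeme)

def make_token(lexeme):
    """Return the (code, value) pair for a keyword or an identifier."""
    if lexeme in KEYWORD_CODES:
        return (KEYWORD_CODES[lexeme], None)     # built-in token: no value
    return (IDENTIFIER_CODE, install(lexeme))    # user-defined token

tokens = [make_token(w) for w in ["if", "total", "then", "rate", "total"]]
```

The second occurrence of `total` yields the same table index as the first, which is the point of installing an identifier only once.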

 Regular expressions

Regular expression is a notation suitable for describing tokens.

Strings and languages

1) Alphabet/character set

Alphabet or character class denotes any finite set of symbols.

Ex. The set {0,1} is an alphabet. It consists of the two symbols 0 and 1, and it is often called the
binary alphabet.

A) String: is a finite sequence of symbols such as 001.

Operations on strings:

i) Length of string:

The length of a string x, usually denoted by |x|, is the total number of symbols in x.

Ex. x = 01101 is a string of length 5, so |x| = 5.

ii) Empty string

The empty string is denoted by ε and has length zero.

iii) Concatenation of strings

If x and y are strings, then the concatenation of x and y, written as xy or x.y, is the string formed
by following the symbols of x by the symbols of y.

e.g. If x=abc and y=de then xy=abcde

The concatenation of the empty string with any string is that string. That is, εx = xε = x.

iv) Exponentiation of strings

x^0 = ε

x^1 = x

x^2 = xx

x^3 = xxx and so on.

Here ε plays the role of 1, the multiplicative identity, since εx = xε = x.

v) Prefix, suffix and substring

Prefix of string: If x is some string, then any string formed by discarding zero or more trailing
symbols of x is called a prefix of x.

e.g. abc is a prefix of abcde

Suffix of string: A suffix of x is a string formed by deleting zero or more of leading symbols of
x.

e.g. cde is a suffix of abcde

Substring: A substring of x is any string formed by deleting a prefix and a suffix from x.

e.g. cd is a substring of abcde, formed by deleting the prefix ab and the suffix e.
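
The string operations above map directly onto Python string slicing. The helper names below are hypothetical and the sketch simply enumerates every prefix, suffix and substring of a string, using the empty Python string for ε.

```python
def prefixes(x):
    # Discard zero or more trailing symbols.
    return [x[:i] for i in range(len(x) + 1)]

def suffixes(x):
    # Discard zero or more leading symbols.
    return [x[i:] for i in range(len(x) + 1)]

def substrings(x):
    # Delete a prefix and a suffix (either may be empty).
    return {x[i:j] for i in range(len(x) + 1) for j in range(i, len(x) + 1)}

def power(x, n):
    # x^n: n copies of x concatenated; x^0 is the empty string.
    return x * n

x, y = "abc", "de"
xy = x + y   # concatenation
```

Note that, by these definitions, every string is a prefix, a suffix and a substring of itself, and so is ε.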


B) Language:

Language is any set of strings formed from some specific alphabet.

 Operations on languages

i) Concatenation of languages

If L and M are languages then LM or L.M is the language consisting of all strings xy which can
be formed by selecting a string x from L and a string y from M.

LM={ xy|x is in L and y is in M}

Example:

L= {0, 01, 110}

M= {10, 110}

LM= {010, 0110, 01110, 11010, 110110}

The string 0110 arises twice (as 0.110 and as 01.10), but a set lists it only once.

ii) Union of languages

The union of languages L and M is given by

L ∪ M = {x | x is in L or x is in M}

Example:

L= {0, 01, 110}

M= {10, 110}

L ∪ M = {0, 01, 10, 110}

iii) Exponentiation of language

L^0 = {ε}

L^1 = L

L^2 = LL

L^3 = LLL

iv) The Kleene (*) closure of language L, denoted by L*, is "zero or more occurrences of" L, that is, L* = L^0 ∪ L^1 ∪ L^2 ∪ ...

v) The Positive (+) closure of language L, denoted by L+, is "one or more occurrences of" L, that is, L+ = L^1 ∪ L^2 ∪ L^3 ∪ ...
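
For finite languages these operations can be tried out with Python sets of strings. The sketch below uses hypothetical helper names; because L* is infinite whenever L contains a non-empty string, `closure_up_to` only builds the union of L^0 through L^k for a chosen bound k.

```python
def concat(L, M):
    # LM = { xy | x in L and y in M }
    return {x + y for x in L for y in M}

def power(L, n):
    # L^0 = {ε}; L^n = L^(n-1) L
    result = {""}
    for _ in range(n):
        result = concat(result, L)
    return result

def closure_up_to(L, k):
    # Finite approximation of L*: L^0 ∪ L^1 ∪ ... ∪ L^k
    result = set()
    for i in range(k + 1):
        result |= power(L, i)
    return result

L = {"0", "01", "110"}
M = {"10", "110"}
LM = concat(L, M)
L_union_M = L | M
```

Since the result of `concat` is a set, the string 0110 appears in `LM` only once even though two different pairs produce it.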

 Finite Automata

A recognizer for language L is a program that takes as input a string x and answers “yes” if x is a
sentence of L and “no” otherwise.

I) NFA

II) DFA

I) NFA (Nondeterministic Finite Automata)

A nondeterministic finite automaton is a directed labeled graph in which the nodes are called states
and the labeled edges are called transitions.

An NFA looks like a transition diagram, but edges can be labeled with ε (epsilon) as well as
with characters.

An NFA consists of:

1. a set of states S

2. a set of input symbols Σ called the alphabet

3. a transition function move that maps state-symbol pairs to sets of states

4. labeled edges, called transitions

5. a state S0 called the start state

6. a set of states F called the accepting or final states, indicated by double circles

II. DFA (Deterministic Finite Automata)

DFA is a special case of NFA in which

1. No state has epsilon or empty transition

2. There may not be two identically labeled transitions out of the same state

The current state and the input symbol together uniquely determine the next state. This is why
the automaton is called a deterministic finite automaton.
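
One possible way to hold a transition function in Python, and to check the two DFA conditions above, is sketched here. The dictionary layout, the use of `None` for ε, and the two small example automata are assumptions made for illustration.

```python
EPSILON = None   # assumption: None labels an ε-transition in this sketch

def is_deterministic(move):
    """True if there are no ε-edges and each (state, symbol) has one target at most."""
    return all(symbol is not EPSILON and len(targets) <= 1
               for (_, symbol), targets in move.items())

def run_dfa(move, start, accepting, text):
    """Answer "yes" (True) if the DFA accepts text, "no" (False) otherwise."""
    state = start
    for ch in text:
        targets = move.get((state, ch))
        if not targets:
            return False          # no transition on ch: reject
        (state,) = targets        # a DFA has exactly one target here
    return state in accepting

# move maps (state, symbol) pairs to sets of states.
NFA_MOVE = {(0, EPSILON): {1}, (0, "a"): {0, 1}, (1, "b"): {2}}
DFA_MOVE = {(0, "a"): {1}, (1, "b"): {1}}   # hypothetical DFA for a followed by b's
```

`NFA_MOVE` violates both conditions (an ε-edge and two targets on a), while `DFA_MOVE` satisfies them, so only the latter can be run with `run_dfa`.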

 Algorithms:

1. Regular expression to NFA

2. NFA to DFA

3. Minimizing the number of states of DFA

 Algorithm 1: From regular expression to finite automata (NFA)

Algorithm: Constructing a NFA from a regular expression

Input: A regular expression R over alphabet Σ

Output: A NFA accepting the language denoted by R

Method: Decompose R into its primitive components. For each component we construct finite
automata inductively.

1. For ε, we construct the NFA

2. For a, we construct the NFA

3. For R1|R2, we construct the NFA

4. For R1R2, we construct the NFA

5. For R1*, we construct the NFA
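
The five cases can be sketched in Python in the style of Thompson's construction. This is an assumed variant, not necessarily the exact diagrams of these notes: every case allocates a fresh start and accepting state and glues sub-automata together with ε-edges, so state numbering will differ from the figures. The tuple encoding of regular expressions and all names are hypothetical.

```python
from itertools import count

class NFABuilder:
    def __init__(self):
        self.edges = []        # (source, label, target); label None means ε
        self._ids = count()

    def build(self, r):
        """Return (start, accept) of an NFA fragment for expression r."""
        s, f = next(self._ids), next(self._ids)
        kind = r[0]
        if kind == "eps":                       # case 1: ε
            self.edges.append((s, None, f))
        elif kind == "sym":                     # case 2: a single symbol a
            self.edges.append((s, r[1], f))
        elif kind == "alt":                     # case 3: R1|R2
            for sub in r[1:]:
                s1, f1 = self.build(sub)
                self.edges += [(s, None, s1), (f1, None, f)]
        elif kind == "cat":                     # case 4: R1R2
            s1, f1 = self.build(r[1])
            s2, f2 = self.build(r[2])
            self.edges += [(s, None, s1), (f1, None, s2), (f2, None, f)]
        elif kind == "star":                    # case 5: R1*
            s1, f1 = self.build(r[1])
            self.edges += [(s, None, s1), (f1, None, s1),
                           (f1, None, f), (s, None, f)]
        else:
            raise ValueError(f"unknown expression kind: {kind}")
        return s, f

def accepts(edges, start, final, text):
    """Simulate the NFA by tracking the set of states it could be in."""
    def closure(states):
        stack, seen = list(states), set(states)
        while stack:
            q = stack.pop()
            for src, label, dst in edges:
                if src == q and label is None and dst not in seen:
                    seen.add(dst)
                    stack.append(dst)
        return seen
    current = closure({start})
    for ch in text:
        current = closure({dst for src, label, dst in edges
                           if src in current and label == ch})
    return final in current

a, b = ("sym", "a"), ("sym", "b")
regex = ("cat", ("cat", ("cat", ("star", ("alt", a, b)), a), b), b)  # (a|b)*abb
builder = NFABuilder()
start, final = builder.build(regex)
```

The `accepts` helper is included only so the constructed NFA can be exercised; it is not part of the construction itself.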

 Algorithm 2: Constructing a DFA from a NFA or Subset construction algorithm

Input: A NFA

Output: A DFA accepting the same language

Method: use the following operations to keep track of sets of NFA states

Operation        Description

ε-closure(s)     set of NFA states reachable from state s on ε-transitions alone

ε-closure(T)     set of NFA states reachable from some state s in T on ε-transitions alone

MOVE(T, a)       set of NFA states to which there is a transition on input symbol a from some state s in T
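
The two operations might be written in Python as below. The small NFA used to try them out is hypothetical (it recognizes zero or more a's followed by one b), and storing ε-edges and symbol edges in two separate dictionaries is just one possible layout.

```python
# Hypothetical NFA: 0 -ε-> 1, 1 -a-> 1, 1 -ε-> 2, 2 -b-> 3
EPS_EDGES = {0: {1}, 1: {2}}
SYM_EDGES = {(1, "a"): {1}, (2, "b"): {3}}

def epsilon_closure(T):
    """All states reachable from some state in T on ε-transitions alone."""
    stack, seen = list(T), set(T)
    while stack:
        s = stack.pop()
        for t in EPS_EDGES.get(s, ()):
            if t not in seen:
                seen.add(t)
                stack.append(t)
    return seen

def move(T, a):
    """All states with a transition on symbol a from some state in T."""
    return {t for s in T for t in SYM_EDGES.get((s, a), ())}

A = epsilon_closure({0})
```

A set T always belongs to its own ε-closure, since every state reaches itself on zero ε-transitions.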

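Example:

Input: the NFA for the regular expression R = (a|b)*abb, with start state 0

Method: each DFA state is the ε-closure of a MOVE set; the MOVE set itself is shown in parentheses.

ε-closure(0) = {0,1,2,4,7} = A

ε-closure(MOVE(A, a)) = ε-closure({3,8}) = {1,2,3,4,6,7,8} = B

ε-closure(MOVE(A, b)) = ε-closure({5}) = {1,2,4,5,6,7} = C

ε-closure(MOVE(B, a)) = ε-closure({3,8}) = {1,2,3,4,6,7,8} = B

ε-closure(MOVE(B, b)) = ε-closure({5,9}) = {1,2,4,5,6,7,9} = D

ε-closure(MOVE(C, a)) = ε-closure({3,8}) = {1,2,3,4,6,7,8} = B

ε-closure(MOVE(C, b)) = ε-closure({5}) = {1,2,4,5,6,7} = C

ε-closure(MOVE(D, a)) = ε-closure({3,8}) = {1,2,3,4,6,7,8} = B

ε-closure(MOVE(D, b)) = ε-closure({5,10}) = {1,2,4,5,6,7,10} = E

ε-closure(MOVE(E, a)) = ε-closure({3,8}) = {1,2,3,4,6,7,8} = B

ε-closure(MOVE(E, b)) = ε-closure({5}) = {1,2,4,5,6,7} = C

E is the only set that contains NFA state 10, the accepting state of the NFA, so E is the accepting state of the DFA.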

State    Input symbol
         a     b
A        B     C
B        B     D
C        B     C
D        B     E
E        B     C

Transition table

DFA for (a|b)*abb
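
The whole subset construction for this example can be sketched in Python as follows. The NFA figure is not reproduced in these notes, so the edges below are an assumption: they are the usual 11-state NFA for (a|b)*abb, chosen to be consistent with the state sets A to E listed above. The function and variable names are hypothetical.

```python
from collections import deque
from string import ascii_uppercase

# Assumed NFA for (a|b)*abb: start state 0, accepting state 10.
EPS = {0: {1, 7}, 1: {2, 4}, 3: {6}, 5: {6}, 6: {1, 7}}
SYM = {(2, "a"): {3}, (4, "b"): {5}, (7, "a"): {8}, (8, "b"): {9}, (9, "b"): {10}}

def subset_construction(eps, sym, start, alphabet):
    def closure(states):
        stack, seen = list(states), set(states)
        while stack:
            for t in eps.get(stack.pop(), ()):
                if t not in seen:
                    seen.add(t)
                    stack.append(t)
        return seen

    def move(states, a):
        return {t for s in states for t in sym.get((s, a), ())}

    start_set = frozenset(closure({start}))
    names = {start_set: "A"}          # DFA state name for each set of NFA states
    table = {}                        # (DFA state, symbol) -> DFA state
    queue = deque([start_set])
    while queue:
        T = queue.popleft()
        for a in alphabet:
            U = frozenset(closure(move(T, a)))
            if not U:
                continue              # no transition on a from T
            if U not in names:        # a new DFA state was discovered
                names[U] = ascii_uppercase[len(names)]
                queue.append(U)
            table[(names[T], a)] = names[U]
    return names, table

names, table = subset_construction(EPS, SYM, 0, "ab")
accepting = {name for states, name in names.items() if 10 in states}
```

New DFA states are named in the order they are discovered, which for this input reproduces the names A to E used in the transition table.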

_____________________________________________________________________________

Algorithm 3: Minimizing the number of states of a DFA

Input: A DFA M

Output: A DFA M’ accepting the same language as M and having as few states as possible.

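Method: Construct a partition Π of the set of states

Initially Π consists of two groups:

1. Final states

2. Non final states

Rules:

Construct Πnew by splitting each group of Π: two states stay in the same group only if, on every input symbol, they go to states in the same group of Π.

i) If Π ≠ Πnew, replace Π by Πnew and continue the replacement

ii) If Π = Πnew, no more changes can occur (stop replacement)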
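
A compact Python sketch of this partitioning method is shown below. It refines all groups in each round rather than one group at a time, which is a presentational difference only, and it assumes the DFA has a transition for every state and symbol. The function name and data layout are hypothetical; the sample table is the DFA for (a|b)*abb obtained earlier.

```python
def minimize(states, alphabet, delta, accepting):
    """Return the final partition of states as a list of sets."""
    partition = [set(accepting), set(states) - set(accepting)]
    partition = [g for g in partition if g]          # drop an empty group
    while True:
        def group_of(s):
            for i, g in enumerate(partition):
                if s in g:
                    return i
        new_partition = []
        for group in partition:
            buckets = {}
            for s in sorted(group):
                # States stay together only if they reach the same groups.
                key = tuple(group_of(delta[(s, a)]) for a in alphabet)
                buckets.setdefault(key, set()).add(s)
            new_partition.extend(buckets.values())
        if len(new_partition) == len(partition):     # Π = Πnew: stop
            return new_partition
        partition = new_partition                    # Π ≠ Πnew: continue

DELTA = {("A", "a"): "B", ("A", "b"): "C", ("B", "a"): "B", ("B", "b"): "D",
         ("C", "a"): "B", ("C", "b"): "C", ("D", "a"): "B", ("D", "b"): "E",
         ("E", "a"): "B", ("E", "b"): "C"}
groups = minimize("ABCDE", "ab", DELTA, {"E"})
```

Because refinement only ever splits groups, comparing the number of groups is enough to detect that Π = Πnew.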

Example:

State    Input symbol
         a     b
A        B     C
B        B     D
C        B     C
D        B     E
E        B     C

Transition table

Initially Π consists of two groups:

1. Final states (E)

2. Non final states (ABCD)


_____________________________________________________________________________

1. Construct Πnew

Π= (E)

Πnew= (E)

The group (E) consists of a single state, so it cannot be split further.

So, Π= Πnew, no more changes occur

______________________________________________________________________________
2. Construct Πnew

Π= (ABCD)

Πnew= (ABCD)

On input a,

A goes to B

B goes to B

C goes to B

D goes to B

On input b,

A goes to C

B goes to D

C goes to C

D goes to E (a member of the other group)

On input b, D leaves the group (ABCD) while A, B and C stay inside it, so D is split off.

Π= (ABCD)

Πnew= (ABC) (D)

Π ≠ Πnew, continue the process of splitting

_____________________________________________________________________________

3. Construct Πnew

Π = (ABC)

Πnew= (ABC)

On input a,

A goes to B

B goes to B

C goes to B

On input b,

A goes to C

B goes to D (a member of the other group)


C goes to C

Π = (ABC)

Πnew= (AC) (B)

Π ≠ Πnew, continue the process

_____________________________________________________________________________

4. Construct Πnew

Π= (AC)

Πnew= (AC)

On input a,

A goes to B

C goes to B

On input b,

A goes to C

C goes to C

Here, A and C have the same transitions on both inputs, so they are equivalent; take either A or C as the representative of the group.

Π= (AC)

Πnew= (AC)

Π = Πnew, stop the process

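The final partition Π consists of (AC)(B)(D)(E)

Original transition table:

State    Input symbol
         a     b
A        B     C
B        B     D
C        B     C
D        B     E
E        B     C

After replacing C by A, the representative of the group (AC):

State    Input symbol
         a     b
A        B     A
B        B     D
A        B     A
D        B     E
E        B     A

After removing the duplicate row for A:

State    Input symbol
         a     b
A        B     A
B        B     D
D        B     E
E        B     A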


Minimized DFA
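
As a sanity check, the original and minimized transition tables can be run side by side in Python. The sketch below assumes start state A and accepting state E for both automata, as in the example, and compares them on every string over {a, b} up to a chosen length; the names are hypothetical.

```python
from itertools import product

ORIGINAL = {"A": {"a": "B", "b": "C"}, "B": {"a": "B", "b": "D"},
            "C": {"a": "B", "b": "C"}, "D": {"a": "B", "b": "E"},
            "E": {"a": "B", "b": "C"}}
MINIMIZED = {"A": {"a": "B", "b": "A"}, "B": {"a": "B", "b": "D"},
             "D": {"a": "B", "b": "E"}, "E": {"a": "B", "b": "A"}}

def run(table, text, start="A", accepting=("E",)):
    """Follow the table from the start state and report acceptance."""
    state = start
    for ch in text:
        state = table[state][ch]
    return state in accepting

def all_strings(alphabet, max_len):
    for k in range(max_len + 1):
        for letters in product(alphabet, repeat=k):
            yield "".join(letters)

def agree_up_to(max_len):
    # True if both DFAs give the same answer on every string up to max_len.
    return all(run(ORIGINAL, w) == run(MINIMIZED, w)
               for w in all_strings("ab", max_len))
```

Agreement on all short strings does not prove equivalence, but a disagreement would immediately reveal a mistake in the merged table.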
