Chapter 3: Lexical Analysis

Aggelos Kiayias
Computer Science & Engineering Department
CSE244 The University of Connecticut
371 Fairfield Road, Box U-1155
Storrs, CT 06269
[email protected]
http://www.cse.uconn.edu/~akiayias

CH3.1
Lexical Analysis
• Basic Concepts & Regular Expressions
    • What does a Lexical Analyzer do?
    • How does it Work?
    • Formalizing Token Definition & Recognition
• LEX - A Lexical Analyzer Generator (Defer)
• Reviewing Finite Automata Concepts
    • Non-Deterministic and Deterministic FA
    • Conversion Process
        • Regular Expressions to NFA
        • NFA to DFA
• Relating NFAs/DFAs/Conversion to Lexical Analysis
• Concluding Remarks / Looking Ahead

Lexical Analyzer in Perspective

[Figure: source program → lexical analyzer → parser. The parser issues "get next token" and the lexical analyzer returns a token; both boxes consult the symbol table.]

Important Issue:
What are Responsibilities of each Box ?
Focus on Lexical Analyzer and Parser

Lexical Analyzer in Perspective

LEXICAL ANALYZER
• Scan Input
• Remove WS, NL, …
• Identify Tokens
• Create Symbol Table
• Insert Tokens into ST
• Generate Errors
• Send Tokens to Parser

PARSER
• Perform Syntax Analysis
• Actions Dictated by Token Order
• Update Symbol Table Entries
• Create Abstract Rep. of Source
• Generate Errors
• And More…. (We'll see later)

What Factors Have Influenced the Functional Division of Labor ?
• Separation of Lexical Analysis From Parsing Presents a Simpler Conceptual Model
    • From a Software Engineering Perspective Division Emphasizes
        • High Cohesion and Low Coupling
        • Implies Well Specified ⇒ Parallel Implementation
• Separation Increases Compiler Efficiency (I/O Techniques to Enhance Lexical Analysis)
• Separation Promotes Portability.
    • This is critical today, when platforms (OSs and Hardware) are numerous and varied!
    • Emergence of Platform Independence - Java

Introducing Basic Terminology
• What are Major Terms for Lexical Analysis?
    • TOKEN
        • A classification for a common set of strings
        • Examples Include <Identifier>, <number>, etc.
    • PATTERN
        • The rules which characterize the set of strings for a token
        • Recall File and OS Wildcards ([A-Z]*.*)
    • LEXEME
        • Actual sequence of characters that matches pattern and is classified by a token
        • Identifiers: x, count, name, etc…

Introducing Basic Terminology

Token      Sample Lexemes           Informal Description of Pattern
const      const                    const
if         if                       if
relation   <, <=, =, <>, >, >=      < or <= or = or <> or >= or >
id         pi, count, D2            letter followed by letters and digits
num        3.1416, 0, 6.02E23       any numeric constant
literal    "core dumped"            any characters between " and " except "

The token classifies the pattern. Actual values are critical. Info is :
1. Stored in symbol table
2. Returned to parser

Handling Lexical Errors
• Error Handling is very localized, with Respect to Input Source
• For example: whil ( x := 0 ) do
  generates no lexical errors in PASCAL
• In what Situations do Errors Occur?
    • Prefix of remaining input doesn't match any defined token
• Possible error recovery actions:
    • Deleting or Inserting Input Characters
    • Replacing or Transposing Characters
• Or, skip over to next separator to "ignore" problem

Designing efficient Lex Analyzers
• Is efficiency an issue?
• 3 Lexical Analyzer construction techniques. How do they address efficiency?
    • Lexical Analyzer Generator
    • Hand-Code / High Level Language (I/O facilitated by the language)
    • Hand-Code / Assembly Language (explicitly manage I/O).
• In Each Technique …
    • Who handles efficiency ?
    • How is it handled ?

I/O - Key For Successful Lexical Analysis
• Character-at-a-time I/O versus Block / Buffered I/O: Tradeoffs ?
• Block/Buffered I/O
    • Utilize Block of memory
    • Stage data from source to buffer block at a time
    • Maintain two blocks - Why (Recall OS)?
        • Asynchronous I/O - for 1 block
        • While Lexical Analysis on 2nd block

[Figure: Block 1 and Block 2 side by side. When done with Block 1, issue I/O for it while ptr... still processes the token in the 2nd block.]

Algorithm: Buffered I/O with Sentinels

[Figure: a buffer of two halves holding "E = M * eof" and "C * * 2 eof eof"; each half ends with an eof sentinel. lexeme_beginning marks the start of the current token; forward scans ahead to find a pattern match.]

    forward := forward + 1;
    if forward is at eof then begin
        if forward at end of first half then begin
            reload second half;                    /* block I/O */
            forward := forward + 1
        end
        else if forward at end of second half then begin
            reload first half;                     /* block I/O */
            move forward to beginning of first half
        end
        else /* eof within buffer signifying end of input */
            terminate lexical analysis             /* 2nd eof: no more input ! */
    end

The algorithm performs block I/O's. We can still have getchar & ungetchar.

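The two-half buffer with sentinels can be sketched in C as follows. This is a minimal illustration under stated assumptions, not the course's implementation: the "source file" is an in-memory string, the half size HALF is unrealistically small so that reloads are easy to observe, a NUL byte stands in for the eof sentinel, and the names Scanner, scanner_init, next_char, and drain are hypothetical. Unlike the pseudocode, next_char reads the character first and advances forward afterwards.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define HALF 4                 /* tiny half size so reloads are easy to observe */
#define SENTINEL '\0'          /* stands in for the slide's eof marker */

typedef struct {
    const char *src;           /* remaining unread source text */
    size_t left;               /* number of characters still in src */
    char buf[2 * (HALF + 1)];  /* two halves, each followed by a sentinel slot */
    char *forward;             /* scanning pointer */
    int reloads;               /* how many block reads were issued */
} Scanner;

/* Fill one half from the source and place a sentinel right after the data. */
static void reload(Scanner *sc, char *half)
{
    size_t n = sc->left < HALF ? sc->left : HALF;
    memcpy(half, sc->src, n);
    half[n] = SENTINEL;        /* at half[HALF] if full, earlier at true end of input */
    sc->src += n;
    sc->left -= n;
    sc->reloads++;
}

void scanner_init(Scanner *sc, const char *text)
{
    assert(text != NULL);
    sc->src = text;
    sc->left = strlen(text);
    sc->reloads = 0;
    reload(sc, sc->buf);
    sc->forward = sc->buf;
}

/* Return the next character, or -1 at end of input. */
int next_char(Scanner *sc)
{
    char *first_end = sc->buf + HALF;            /* sentinel slot of first half  */
    char *second = sc->buf + HALF + 1;           /* start of second half         */
    char *second_end = sc->buf + 2 * HALF + 1;   /* sentinel slot of second half */

    for (;;) {
        char c = *sc->forward;
        if (c != SENTINEL) {   /* common case: a single test per character */
            sc->forward++;
            return (unsigned char)c;
        }
        if (sc->forward == first_end) {
            reload(sc, second);
            sc->forward = second;
        } else if (sc->forward == second_end) {
            reload(sc, sc->buf);
            sc->forward = sc->buf;
        } else {
            return -1;         /* sentinel inside a half: real end of input */
        }
    }
}

/* Convenience: copy the whole input into out (capacity cap >= 1). */
size_t drain(Scanner *sc, char *out, size_t cap)
{
    size_t n = 0;
    int c;
    while (n + 1 < cap && (c = next_char(sc)) != -1)
        out[n++] = (char)c;
    out[n] = '\0';
    return n;
}
```

The point of the sentinel is visible in next_char: the fast path performs one comparison per character, and the end-of-half tests run only when a sentinel is actually seen.
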
Formalizing Token Definition

EXAMPLES AND OTHER CONCEPTS:

Suppose: S is the string banana

Prefix : ban, banana
Suffix : ana, banana
Substring : nan, ban, ana, banana
Subsequence: bnan, nn

A proper prefix, suffix, or substring cannot be all of S.

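These string relations are easy to state as predicates. The sketch below is only an illustration; the function names are hypothetical and not part of the slides.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* True if p is a prefix of s (every string is a prefix of itself). */
bool is_prefix(const char *p, const char *s)
{
    assert(p != NULL && s != NULL);
    return strncmp(p, s, strlen(p)) == 0;
}

/* True if x is a suffix of s. */
bool is_suffix(const char *x, const char *s)
{
    size_t lx = strlen(x), ls = strlen(s);
    return lx <= ls && strcmp(s + (ls - lx), x) == 0;
}

/* True if x occurs as a contiguous block inside s. */
bool is_substring(const char *x, const char *s)
{
    return strstr(s, x) != NULL;
}

/* True if x can be obtained from s by deleting zero or more characters. */
bool is_subsequence(const char *x, const char *s)
{
    while (*x != '\0' && *s != '\0') {
        if (*x == *s)
            x++;          /* matched one character of x */
        s++;              /* always advance in s */
    }
    return *x == '\0';
}

/* The "proper" variant excludes s itself. */
bool is_proper_prefix(const char *p, const char *s)
{
    return is_prefix(p, s) && strlen(p) < strlen(s);
}
```

With S = banana, bnan is a subsequence but not a substring, because its characters appear in order in S without being adjacent.
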
Language Concepts
A language, L, is simply any set of strings over a fixed alphabet.

Alphabet                          Languages
{0,1}                             {0,10,100,1000,100000…}
                                  {0,1,00,11,000,111,…}
{a,b,c}                           {abc,aabbcc,aaabbbccc,…}
{A, … ,Z}                         {TEE,FORE,BALL,…}
                                  {FOR,WHILE,GOTO,…}
{A,…,Z,a,…,z,0,…9,+,-,…,<,>,…}    { All legal PASCAL progs}
                                  { All grammatically correct English sentences }

Special Languages:
∅ - EMPTY LANGUAGE
{ε} - contains the ε string only

Formal Language Operations

OPERATION                                 DEFINITION
union of L and M, written L ∪ M           L ∪ M = {s | s is in L or s is in M}
concatenation of L and M, written LM      LM = {st | s is in L and t is in M}
Kleene closure of L, written L*           $L^* = \bigcup_{i=0}^{\infty} L^i$
                                          L* denotes "zero or more concatenations of" L
positive closure of L, written L+         $L^+ = \bigcup_{i=1}^{\infty} L^i$
                                          L+ denotes "one or more concatenations of" L

Formal Language Operations
Examples
L = {A, B, C, D }    D = {1, 2, 3}
L ∪ D = {A, B, C, D, 1, 2, 3 }
LD = {A1, A2, A3, B1, B2, B3, C1, C2, C3, D1, D2, D3 }
L^2 = { AA, AB, AC, AD, BA, BB, BC, BD, CA, … DD}
L^4 = L^2 L^2 = ??
L* = { All possible strings over L, plus ε }
L+ = L* - {ε}
L (L ∪ D ) = ??
L (L ∪ D )* = ??

Language & Regular Expressions
• A Regular Expression is a Set of Rules / Techniques for Constructing Sequences of Symbols (Strings) From an Alphabet.
• Let Σ Be an Alphabet, r a Regular Expression. Then L(r) is the Language That is Characterized by the Rules of r

Rules for Specifying Regular Expressions:

Fix alphabet Σ.
• ε is a regular expression denoting {ε}
• If a is in Σ, a is a regular expression that denotes {a}
• Let r and s be regular expressions with languages L(r) and L(s). Then
    (a) (r) | (s) is a regular expression denoting L(r) ∪ L(s)
    (b) (r)(s) is a regular expression denoting L(r) L(s)
    (c) (r)* is a regular expression denoting (L(r))*
    (d) (r) is a regular expression denoting L(r)
  Precedence increases from (a) to (c): | binds loosest, then concatenation, then *.
All are Left-Associative. Parentheses are dropped as allowed by precedence rules.

EXAMPLES of Regular Expressions

L = {A, B, C, D }    D = {1, 2, 3}
A | B | C | D = L
(A | B | C | D ) (A | B | C | D ) = L^2
(A | B | C | D )* = L*
(A | B | C | D ) ((A | B | C | D ) | ( 1 | 2 | 3 )) = L (L ∪ D)

Algebraic Properties of Regular Expressions

AXIOM                        DESCRIPTION
r | s = s | r                | is commutative
r | (s | t) = (r | s) | t    | is associative
(r s) t = r (s t)            concatenation is associative
r ( s | t ) = r s | r t      concatenation distributes over |
( s | t ) r = s r | t r
εr = r                       ε is the identity element for concatenation
rε = r
r* = ( r | ε )*              relation between * and ε
r** = r*                     * is idempotent

Regular Expression Examples

• All Strings that start with "tab" or end with "bat":
  tab{A,…,Z,a,...,z}*|{A,…,Z,a,....,z}*bat
• All Strings in Which Digits 1,2,3 exist in ascending numerical order:
  {A,…,Z}*1 {A,…,Z}*2 {A,…,Z}*3 {A,…,Z}*

Towards Token Definition
Regular Definitions: Associate names with Regular Expressions
For Example : PASCAL IDs
    letter → A | B | C | … | Z | a | b | … | z
    digit → 0 | 1 | 2 | … | 9
    id → letter ( letter | digit )*
Shorthand Notation:
    "+" : one or more        r* = r+ | ε  &  r+ = r r*
    "?" : zero or one        r? = r | ε
    [range] : set range of characters (replaces "|" )
              [A-Z] = A | B | C | … | Z
Example Using Shorthand : PASCAL IDs
    id → [A-Za-z][A-Za-z0-9]*

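The id definition can be recognized by hand with a few lines of C. This is a minimal sketch; match_id is a hypothetical name, and the use of isalpha/isalnum assumes the default "C" locale, where they agree with [A-Za-z] and [A-Za-z0-9].

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Length of the longest prefix of s matching letter (letter | digit)*,
   or 0 if s does not start with a letter. */
size_t match_id(const char *s)
{
    assert(s != NULL);
    size_t n = 0;
    if (!isalpha((unsigned char)s[0]))
        return 0;                       /* must start with a letter */
    n = 1;
    while (isalnum((unsigned char)s[n]))
        n++;                            /* (letter | digit)* */
    return n;
}
```

For example, match_id("D2+1") reports a lexeme of length 2, while match_id("2D") reports 0 because the first character is a digit.
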
Token Recognition
How can we use concepts developed so far to assist in recognizing tokens of a source language ?

Assume Following Tokens:
if, then, else, relop, id, num

What language construct are they used for ?

Scan away blanks (b), newlines (nl), and tabs. Can we Define Tokens For These?

blank → b
tab → ^T
newline → ^M
delim → blank | tab | newline
ws → delim+

Overall

Regular Expression   Token   Attribute-Value
ws                   -       -
if                   if      -
then                 then    -
else                 else    -
id                   id      pointer to table entry
num                  num     pointer to table entry
<                    relop   LT
<=                   relop   LE
=                    relop   EQ
<>                   relop   NE
>                    relop   GT
>=                   relop   GE

Note: Each token has a unique token identifier to define category of lexemes

Constructing Transition Diagrams for Tokens

• Transition Diagrams (TD) are used to represent the tokens
• As characters are read, the relevant TDs are used to attempt to match lexeme to a pattern
• Each TD has:
    • States : Represented by Circles
    • Actions : Represented by Arrows between states
    • Start State : Beginning of a pattern (Arrowhead)
    • Final State(s) : End of pattern (Concentric Circles)
• Each TD is Deterministic - No need to choose between 2 different actions !

Example TDs

>= :
    state 0 (start): on > go to state 6
    state 6: on = go to state 7; on other go to state 8
    state 7: RTN(GE)
    state 8 *: RTN(GT)

* : We've accepted ">" and have read other char that must be unread.

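Example : All RELOPs

    state 0 (start): on < go to 1; on = go to 5; on > go to 6
    state 1: on = go to 2; on > go to 3; on other go to 4
    state 2: return(relop, LE)
    state 3: return(relop, NE)
    state 4 *: return(relop, LT)
    state 5: return(relop, EQ)
    state 6: on = go to 7; on other go to 8
    state 7: return(relop, GE)
    state 8 *: return(relop, GT)

* : the last character read must be unread.
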
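The relop diagram translates almost mechanically into a switch. The sketch below is an illustration only: match_relop and the enum are hypothetical names, and instead of reading and then unreading the "other" character, the function simply reports how many characters belong to the lexeme.

```c
#include <assert.h>
#include <stddef.h>

enum relop { NONE, LT, LE, EQ, NE, GT, GE };

/* Follow the relop transition diagram on s; store the lexeme length in *len. */
enum relop match_relop(const char *s, size_t *len)
{
    assert(s != NULL && len != NULL);
    switch (s[0]) {
    case '<':                                       /* state 1 */
        if (s[1] == '=') { *len = 2; return LE; }   /* state 2 */
        if (s[1] == '>') { *len = 2; return NE; }   /* state 3 */
        *len = 1; return LT;                        /* state 4*: "other" is not consumed */
    case '=':                                       /* state 5 */
        *len = 1; return EQ;
    case '>':                                       /* state 6 */
        if (s[1] == '=') { *len = 2; return GE; }   /* state 7 */
        *len = 1; return GT;                        /* state 8*: "other" is not consumed */
    default:
        *len = 0; return NONE;                      /* not a relop */
    }
}
```

The starred states of the diagram correspond to the branches that return a length of 1 even though a second character was inspected.
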
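Example TDs : id and delim

id :
    state 9 (start): on letter go to state 10
    state 10: on letter or digit stay in state 10; on other go to state 11
    state 11 *: return( get_token(), install_id())

    Either returns ptr or "0" if reserved

delim :
    state 28 (start): on delim go to state 29
    state 29: on delim stay in state 29; on other go to state 30
    state 30 *
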
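Example TDs : Unsigned #s

Diagram 1 (states 12-19):
    state 12 (start): on digit go to 13
    state 13: on digit stay in 13; on . go to 14; on E go to 16
    state 14: on digit go to 15
    state 15: on digit stay in 15; on E go to 16
    state 16: on + or - go to 17; on digit go to 18
    state 17: on digit go to 18
    state 18: on digit stay in 18; on other go to 19
    state 19 *

Diagram 2 (states 20-24):
    state 20 (start): on digit go to 21
    state 21: on digit stay in 21; on . go to 22
    state 22: on digit go to 23
    state 23: on digit stay in 23; on other go to 24
    state 24 *

Diagram 3 (states 25-27):
    state 25 (start): on digit go to 26
    state 26: on digit stay in 26; on other go to 27
    state 27 *

Accepting states execute return(num, install_num()).

Questions: Is ordering important for unsigned #s ?
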
QUESTION :

What would the transition diagram (TD) for strings containing each vowel, in their strict lexicographical order, look like ?

Answer

cons → B | C | D | F | … | Z
string → cons* A cons* E cons* I cons* O cons* U cons*

[Figure: a chain of six states from start. Each state loops on cons; successive states are linked by A, E, I, O, U, and the last state leaves on other. An error path leads out of the chain.]

Note: The error path is taken if the character is other than a cons or the vowel in the lex order.

What Else Does Lexical Analyzer Do?
All Keywords / Reserved words are matched as ids
• After the match, the symbol table or a special keyword table is consulted
• Keyword table contains string versions of all keywords and associated token values

      if       15
      then     16
      begin    17
      ...      ...

• When a match is found, the token is returned, along with its symbolic value, i.e., "then", 16
• If a match is not found, then it is assumed that an id has been discovered

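A keyword table lookup of this kind might look as follows in C. It is a minimal sketch: the three entries and their codes 15-17 are the ones on the slide, while TOK_ID, the struct layout, and the function name classify are assumptions made for illustration. A linear search is used for clarity; a real table would likely be hashed or sorted.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define TOK_ID 1   /* hypothetical code for identifiers, not from the slides */

struct keyword { const char *text; int token; };

/* Codes 15-17 are the ones shown on the slide; the table is deliberately short. */
static const struct keyword keywords[] = {
    { "if", 15 }, { "then", 16 }, { "begin", 17 },
};

/* Classify a lexeme that has already been matched by the id pattern. */
int classify(const char *lexeme)
{
    assert(lexeme != NULL);
    for (size_t i = 0; i < sizeof keywords / sizeof keywords[0]; i++)
        if (strcmp(lexeme, keywords[i].text) == 0)
            return keywords[i].token;   /* reserved word */
    return TOK_ID;                      /* no match: an ordinary identifier */
}
```

Because the comparison is on the whole lexeme, thenx is classified as an identifier even though it begins with a keyword.
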
Implementing Transition Diagrams

FUNCTIONS USED: nextchar(), forward, retract(), install_num(), install_id(), gettoken(), isdigit(), isletter(), recover()

[Figure: the relop transition diagram, states 0-8, shown beside the code.]

lexeme_beginning = forward;
state = 0;

token nexttoken()
{   while (1) {                     /* repeat until a "return" occurs */
        switch (state) {
        case 0: c = nextchar();
                /* c is lookahead character */
                if (c == blank || c == tab || c == newline) {
                    state = 0;
                    lexeme_beginning++;   /* advance beginning of lexeme */
                }
                else if (c == '<') state = 1;
                else if (c == '=') state = 5;
                else if (c == '>') state = 6;
                else state = fail();
                break;
        … /* cases 1-8 here */

Implementing Transition Diagrams, II

[Figure: the integer diagram: 25 -digit-> 26, digit loop on 26, 26 -other-> 27 (*).]

Case numbers correspond to transition diagram states !

.............
case 25: c = nextchar();                  /* nextchar() advances forward */
         if (isdigit(c)) state = 26;
         else state = fail();
         break;
case 26: c = nextchar();
         if (isdigit(c)) state = 26;
         else state = 27;
         break;
case 27: retract(1);                      /* retracts forward */
         lexical_value = install_num();   /* looks at the region lexeme_beginning ... forward */
         return ( NUM );
.............

Implementing Transition Diagrams, III

[Figure: the id diagram: 9 -letter-> 10, letter or digit loop on 10, 10 -other-> 11 (*).]

.............
case 9:  c = nextchar();
         if (isletter(c)) state = 10;
         else state = fail();
         break;
case 10: c = nextchar();
         if (isletter(c)) state = 10;
         else if (isdigit(c)) state = 10;
         else state = 11;
         break;
case 11: retract(1); lexical_value = install_id();
         return ( gettoken(lexical_value) );   /* reads token name from ST */
.............

When Failures Occur:

int fail()
{   start = state;
    forward = lexeme_beginning;
    switch (start) {              /* switch to next transition diagram */
    case 0: start = 9; break;
    case 9: start = 12; break;
    case 12: start = 20; break;
    case 20: start = 25; break;
    case 25: recover(); break;
    default: /* lex error */ break;
    }
    return start;
}

Finite Automata & Language Theory

Finite Automata : A recognizer that takes an input string & determines whether it's a valid sentence of the language
Non-Deterministic : Has more than one alternative action for the same input symbol.
Deterministic : Has at most one action for a given input symbol.

Both types are used to recognize regular expressions.

NFAs & DFAs

Non-Deterministic Finite Automata (NFAs) easily represent regular expressions, but are somewhat less precise.
Deterministic Finite Automata (DFAs) require more complexity to represent regular expressions, but offer more precision.
We'll review both plus conversion algorithms, i.e., NFA → DFA and DFA → NFA

Non-Deterministic Finite Automata

An NFA is a mathematical model that consists of :
• S, a set of states
• Σ, the symbols of the input alphabet
• move, a transition function.
    • move(state, symbol) → set of states
    • move : S × (Σ ∪ {ε}) → Pow(S)
• A state, s0 ∈ S, the start state
• F ⊆ S, a set of final or accepting states.

Representing NFAs

Transition Diagrams : Number states (circles), arcs, final states, …
Transition Tables: More suitable to representation within a computer

We'll see examples of both !

Example NFA

S = { 0, 1, 2, 3 }
s0 = 0
F = { 3 }
Σ = { a, b }

[Figure: start → 0; state 0 loops on a and on b; 0 -a-> 1 -b-> 2 -b-> 3.]

What Language is defined ?
What is the Transition Table ?

           input
state      a          b
0          { 0, 1 }   { 0 }
1          --         { 2 }
2          --         { 3 }

ε (null) moves possible: i -ε-> j
Switch state but do not use any input symbol

How Does An NFA Work ?

[Figure: the example NFA: state 0 loops on a and on b; 0 -a-> 1 -b-> 2 -b-> 3.]

• Given an input string, we trace moves
• If no more input & in final state, ACCEPT

EXAMPLE: Input: ababb

One trace:                       -OR-    Another trace:
move(0, a) = 1                           move(0, a) = 0
move(1, b) = 2                           move(0, b) = 0
move(2, a) = ? (undefined)               move(0, a) = 1
                                         move(1, b) = 2
                                         move(2, b) = 3
REJECT !                                 ACCEPT !

Handling Undefined Transitions

We can handle undefined transitions by defining one more state, a "death" state, and transitioning all previously undefined transitions to this death state.

[Figure: the example NFA with an added death state 4 that receives the previously undefined transitions.]

NFA- Regular Expressions & Compilation
Problems with NFAs for Regular Expressions:
1. Valid input might not be accepted
2. NFA may behave differently on the same input

Relationship of NFAs to Compilation:
1. Regular expression "recognized" by NFA
2. Regular expression is "pattern" for a "token"
3. Tokens are building blocks for lexical analysis
4. Lexical analyzer can be described by a collection of NFAs. Each NFA is for a language token.

Second NFA Example

Given the regular expression : (a (b*c)) | (a (b | c+)?)
Find a transition diagram NFA that recognizes it.

Second NFA Example - Solution

[Figure: transition diagram of an NFA with states 0 through 5 for this expression.]

String abbc can be accepted.

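Alternative Solution Strategy

[Figure: one NFA for a (b*c) on states 1, 2, 3, and a second NFA for a (b | c+)? that starts with 4 -a-> 5.]

Now that you have the individual diagrams, "or" them as follows:

Using Null Transitions to "OR" NFAs

[Figure: a new start state 0 with ε-transitions to state 1 and to state 4, the start states of the two individual NFAs.]
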
Other Concepts

Not all paths may result in acceptance.

[Figure: the example NFA: state 0 loops on a and on b; 0 -a-> 1 -b-> 2 -b-> 3.]

aabb is accepted along path : 0 → 0 → 1 → 2 → 3
BUT… it is not accepted along the valid path: 0 → 0 → 0 → 0 → 0

Deterministic Finite Automata

A DFA is an NFA with the following restrictions:
• ε moves are not allowed
• For every state s ∈ S, there is one and only one path from s for every input symbol a ∈ Σ.

Since transition tables don't have any alternative options, DFAs are easily simulated via an algorithm.

    s ← s0
    c ← nextchar;
    while c ≠ eof do
        s ← move(s,c);
        c ← nextchar;
    end;
    if s is in F then return "yes"
    else return "no"

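Example - DFA

[Figure: a DFA on states 0, 1, 2, 3 with the path 0 -a-> 1 -b-> 2 -b-> 3 plus additional a and b edges, so that every state has exactly one transition per symbol.]

What Language is Accepted?

Recall the original NFA:

[Figure: the example NFA: 0 -a-> 1 -b-> 2 -b-> 3, with a loop on state 0.]
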
Conversion : NFA → DFA Algorithm

• Algorithm Constructs a Transition Table for DFA from NFA
• Each state in DFA corresponds to a SET of states of the NFA
• Why does this occur ?
    • ε moves
    • non-determinism
  Both require us to characterize multiple situations that occur for accepting the same string.
  (Recall : Same input can have multiple paths in NFA)
• Key Issue : Reconciling AMBIGUITY !

Converting NFA to DFA – 1st Look

[Figure: an NFA on states 0-8 with the labeled paths 2 -a-> 3 -b-> 4 and 6 -c-> 7; all remaining edges are ε-moves.]

From State 0, Where can we move without consuming any input ?
This forms a new state: 0,1,2,6,8
What transitions are defined for this new state ?

The Resulting DFA

[Figure: a DFA over the state sets {0, 1, 2, 6, 8}, {3}, {1, 2, 4, 5, 6, 8}, and {1, 2, 5, 6, 7, 8}, with edges labeled a, b, and c. The same DFA is redrawn with its states renamed A, B, C, D.]

Which States are FINAL States ?
How do we handle alphabet symbols not defined for A, B, C, D ?

Algorithm Concepts
NFA N = ( S, Σ, s0, F, MOVE )

ε-Closure(s) : s ∈ S
    : set of states in S that are reachable from s via ε-moves of N that originate from s. (No input is consumed.)
ε-Closure(T) : T ⊆ S
    : NFA states reachable from all t ∈ T on ε-moves only.
move(T,a) : T ⊆ S, a ∈ Σ
    : Set of states to which there is a transition on input a from some t ∈ T

These 3 operations are utilized by algorithms / techniques to facilitate the conversion process.

Illustrating Conversion – An Example
Start with NFA: (a | b)*abb

[Figure: NFA with states 0-10. ε-moves: 0 → 1, 0 → 7, 1 → 2, 1 → 4, 3 → 6, 5 → 6, 6 → 1, 6 → 7. Labeled moves: 2 -a-> 3, 4 -b-> 5, 7 -a-> 8, 8 -b-> 9, 9 -b-> 10.]

First we calculate: ε-closure(0) (i.e., state 0)
ε-closure(0) = {0, 1, 2, 4, 7} (all states reachable from 0 on ε-moves)
Let A={0, 1, 2, 4, 7} be a state of new DFA, D.

Conversion Example – continued (1)
2nd, we calculate : a : ε-closure(move(A,a)) and b : ε-closure(move(A,b))

a : ε-closure(move(A,a)) = ε-closure(move({0,1,2,4,7},a))
    adds {3,8} (since move(2,a)=3 and move(7,a)=8)
    From this we have : ε-closure({3,8}) = {1,2,3,4,6,7,8}
    (since 3 → 6 → 1 → 4, 6 → 7, and 1 → 2 all by ε-moves)
    Let B={1,2,3,4,6,7,8} be a new state. Define Dtran[A,a] = B.

b : ε-closure(move(A,b)) = ε-closure(move({0,1,2,4,7},b))
    adds {5} (since move(4,b)=5)
    From this we have : ε-closure({5}) = {1,2,4,5,6,7}
    (since 5 → 6 → 1 → 4, 6 → 7, and 1 → 2 all by ε-moves)
    Let C={1,2,4,5,6,7} be a new state. Define Dtran[A,b] = C.

Conversion Example – continued (2)
3rd, we calculate for state B on {a,b}

a : ε-closure(move(B,a)) = ε-closure(move({1,2,3,4,6,7,8},a))
    = {1,2,3,4,6,7,8} = B
    Define Dtran[B,a] = B.
b : ε-closure(move(B,b)) = ε-closure(move({1,2,3,4,6,7,8},b))
    = {1,2,4,5,6,7,9} = D
    Define Dtran[B,b] = D.

4th, we calculate for state C on {a,b}

a : ε-closure(move(C,a)) = ε-closure(move({1,2,4,5,6,7},a))
    = {1,2,3,4,6,7,8} = B
    Define Dtran[C,a] = B.
b : ε-closure(move(C,b)) = ε-closure(move({1,2,4,5,6,7},b))
    = {1,2,4,5,6,7} = C
    Define Dtran[C,b] = C.

Conversion Example – continued (3)
5th, we calculate for state D on {a,b}

a : ε-closure(move(D,a)) = ε-closure(move({1,2,4,5,6,7,9},a))
    = {1,2,3,4,6,7,8} = B
    Define Dtran[D,a] = B.
b : ε-closure(move(D,b)) = ε-closure(move({1,2,4,5,6,7,9},b))
    = {1,2,4,5,6,7,10} = E
    Define Dtran[D,b] = E.

Finally, we calculate for state E on {a,b}

a : ε-closure(move(E,a)) = ε-closure(move({1,2,4,5,6,7,10},a))
    = {1,2,3,4,6,7,8} = B
    Define Dtran[E,a] = B.
b : ε-closure(move(E,b)) = ε-closure(move({1,2,4,5,6,7,10},b))
    = {1,2,4,5,6,7} = C
    Define Dtran[E,b] = C.

Conversion Example – continued (4)
This gives the transition table Dtran for the DFA of:

             Input Symbol
Dstates      a      b
A            B      C
B            B      D
C            B      C
D            B      E
E            B      C

[Figure: the resulting DFA with start state A; its edges are exactly the entries of Dtran.]

Algorithm For Subset Construction

Computing the ε-closure:

    push all states in T onto stack;
    initialize ε-closure(T) to T;
    while stack is not empty do begin
        pop t, the top element, off the stack;
        for each state u with edge from t to u labeled ε do
            if u is not in ε-closure(T) do begin
                add u to ε-closure(T) ;
                push u onto stack
            end
    end

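The stack-based ε-closure computation can be sketched in C for the (a | b)*abb NFA used in the conversion example. State sets are represented as bitmasks, which is an implementation choice made here, not something the slides prescribe; the ε-edges are the ones implied by the worked closure computations (0 → 1, 0 → 7, 1 → 2, 1 → 4, 3 → 6, 5 → 6, 6 → 1, 6 → 7), and eps_closure is a hypothetical name.

```c
#include <assert.h>
#include <stdint.h>

#define NSTATES 11

/* eps[s] is the set (bitmask) of states reachable from s by one ε-edge. */
static const uint16_t eps[NSTATES] = {
    [0] = 1u << 1 | 1u << 7,
    [1] = 1u << 2 | 1u << 4,
    [3] = 1u << 6,
    [5] = 1u << 6,
    [6] = 1u << 1 | 1u << 7,
};

/* ε-closure(T), computed with an explicit stack as in the pseudocode. */
uint16_t eps_closure(uint16_t T)
{
    int stack[NSTATES], top = 0;
    uint16_t closure = T;                    /* initialize ε-closure(T) to T */

    assert((T >> NSTATES) == 0);             /* only states 0..10 exist */
    for (int s = 0; s < NSTATES; s++)
        if (T >> s & 1)
            stack[top++] = s;                /* push all states in T */

    while (top > 0) {
        int t = stack[--top];                /* pop t */
        for (int u = 0; u < NSTATES; u++) {
            if ((eps[t] >> u & 1) && !(closure >> u & 1)) {
                closure |= (uint16_t)(1u << u);   /* add u to the closure */
                stack[top++] = u;                 /* push u */
            }
        }
    }
    return closure;
}
```

Every state is pushed at most once, so the stack never needs more than NSTATES slots. Calling eps_closure on {0} yields {0, 1, 2, 4, 7}, the DFA start state A of the worked example.
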
Algorithm For Subset Construction – (2)

    initially, ε-closure(s0) is only (unmarked) state in Dstates;
    while there is unmarked state T in Dstates do begin
        mark T;
        for each input symbol a do begin
            U := ε-closure(move(T,a));
            if U is not in Dstates then
                add U as an unmarked state to Dstates;
            Dtran[T,a] := U
        end
    end

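The whole subset construction fits in a short C program for the (a | b)*abb NFA. This is a sketch under stated assumptions: state sets are bitmasks, the closure is computed by iterating to a fixed point rather than with an explicit stack, "marking" is done by processing Dstates in array order, and all identifiers (set_t, on_sym, dstates, subset_construction) are hypothetical. If some move produced the empty set, it would simply show up as one more D-state acting as the dead state; that does not happen in this example.

```c
#include <assert.h>
#include <stdint.h>

#define NFA_STATES 11
#define MAX_DSTATES 32          /* generous bound for this small example */

typedef uint16_t set_t;         /* bit i set <=> NFA state i is in the set */
#define BIT(i) ((set_t)(1u << (i)))

static const set_t eps[NFA_STATES] = {
    [0] = BIT(1) | BIT(7), [1] = BIT(2) | BIT(4),
    [3] = BIT(6), [5] = BIT(6), [6] = BIT(1) | BIT(7),
};
/* on_sym[c][s]: target of the edge from s labeled c (0 = a, 1 = b), or -1. */
static const int on_sym[2][NFA_STATES] = {
    { -1, -1,  3, -1, -1, -1, -1,  8, -1, -1, -1 },   /* a */
    { -1, -1, -1, -1,  5, -1, -1, -1,  9, 10, -1 },   /* b */
};

static set_t closure(set_t T)
{
    set_t old;
    do {                          /* iterate to a fixed point instead of using a stack */
        old = T;
        for (int s = 0; s < NFA_STATES; s++)
            if (old & BIT(s))
                T |= eps[s];
    } while (T != old);
    return T;
}

static set_t move(set_t T, int c)
{
    set_t U = 0;
    for (int s = 0; s < NFA_STATES; s++)
        if ((T & BIT(s)) && on_sym[c][s] >= 0)
            U |= BIT(on_sym[c][s]);
    return U;
}

set_t dstates[MAX_DSTATES];       /* dstates[i] is the NFA-state set of DFA state i */
int dtran[MAX_DSTATES][2];

/* Build Dstates and Dtran; returns the number of DFA states. */
int subset_construction(void)
{
    int n = 0, marked = 0;
    dstates[n++] = closure(BIT(0));             /* ε-closure(s0) */
    while (marked < n) {                        /* some state is still unmarked */
        set_t T = dstates[marked];
        for (int c = 0; c < 2; c++) {
            set_t U = closure(move(T, c));
            int j = 0;
            while (j < n && dstates[j] != U)
                j++;                            /* is U already in Dstates? */
            if (j == n) {
                assert(n < MAX_DSTATES);
                dstates[n++] = U;               /* add U as an unmarked state */
            }
            dtran[marked][c] = j;
        }
        marked++;                               /* mark T */
    }
    return n;
}
```

Run on this NFA, the construction discovers five D-states in the order A, B, C, D, E (indices 0 to 4), matching the hand computation.
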
Regular Expression to NFA Construction
We now focus on transforming a Reg. Expr. to an NFA
This construction allows us to take:
• Regular Expressions (which describe tokens)
• To an NFA (to characterize language)
• To a DFA (which can be "computerized")
The construction process is component-wise: it builds the NFA from components of the regular expression in a special order with particular techniques.
NOTE: Construction is "syntax-directed" translation, i.e., syntax of regular expression is determining factor for NFA construction and structure.

Motivation: Construct NFA For:

ε :           start → i -ε-> f
a :           start → 0 -a-> 1
b :           start → A -b-> B
ab :          start → 0 -a-> 1 -ε-> A -b-> B
ε | ab :
( ε | ab )* :

Construction Algorithm : R.E. → NFA

Construction Process :
1st : Identify subexpressions of the regular expression
        Σ symbols
        r | s
        rs
        r*
2nd : Characterize "pieces" of NFA for each subexpression

Piecing Together NFAs

1. For ε in the regular expression, construct NFA
       start → i -ε-> f        L(ε)
2. For a ∈ Σ in the regular expression, construct NFA
       start → i -a-> f        L(a)

Piecing Together NFAs – continued(1)

3.(a) If s, t are regular expressions, N(s), N(t) their NFAs, s|t has NFA:

[Figure: start → i, with ε-moves from i into N(s) and into N(t), and ε-moves from N(s) and N(t) into f. Recognizes L(s) ∪ L(t).]

where i and f are new start / final states, and ε-moves are introduced from i to the old start states of N(s) and N(t) as well as from all of their final states to f.

Piecing Together NFAs – continued(2)
3.(b) If s, t are regular expressions, N(s), N(t) their NFAs, st (concatenation) has NFA:

[Figure: start → i, N(s) followed by N(t), ending in f, with the final state of N(s) overlapping the start state of N(t). Recognizes L(s) L(t).]

Alternative:

[Figure: start → i -ε-> N(s) -ε-> N(t) -ε-> f.]

where i is the start state of N(s) (or new under the alternative) and f is the final state of N(t) (or new). Overlap maps final states of N(s) to start state of N(t).

Piecing Together NFAs – continued(3)
3.(c) If s is a regular expression, N(s) its NFA, s* (Kleene star) has NFA:

[Figure: start → i -ε-> N(s) -ε-> f, with an ε-move from i directly to f and an ε-move from the final state of N(s) back to its start state.]

where : i is new start state and f is new final state
    ε-move i to f (to accept null string)
    ε-moves i to old start, old final(s) to f
    ε-move old final to old start (WHY?)

Properties of Construction

Let r be a regular expression, with NFA N(r), then
1. N(r) has #of states ≤ 2*(#symbols + #operators) of r
2. N(r) has exactly one start and one accepting state
3. Each state of N(r) has at most one outgoing edge labeled a ∈ Σ or at most two outgoing ε's
4. BE CAREFUL to assign unique names to all states !

Detailed Example
See example 3.16 in textbook for (a | b)*abb
2nd Example - (ab*c) | (a(b|c*))
Parse Tree for this regular expression:

r13 = r5 | r12
    r5 = r3 r4
        r3 = a
        r4 = r1 r2
            r1 = r0 *
                r0 = b
            r2 = c
    r12 = r11 r10
        r11 = a
        r10 = ( r9 )
            r9 = r7 | r8
                r7 = b
                r8 = r6 *
                    r6 = c

What is the NFA? Let's construct it !

Detailed Example – Construction(1)

r0 : b
r3 : a
r1 : r0* (the NFA for b with ε-moves added for the star)
r2 : c
r4 : r1 r2
r5 : r3 r4

Detailed Example – Construction(2)

r7 : b
r6 : c
r8 : r6* (the NFA for c with ε-moves added for the star)
r11 : a
r9 : r7 | r8
r10 : ( r9 )
r12 : r11 r10

Detailed Example – Final Step

r13 : r5 | r12

[Figure: the complete NFA with states 1-17. State 1 is the new start state, with ε-moves into the NFA for r5 (states 2-7) and the NFA for r12 (states 8-16); state 17 is the new final state.]

Direct Simulation of an NFA

DFA simulation:

    s ← s0
    c ← nextchar;
    while c ≠ eof do
        s ← move(s,c);
        c ← nextchar;
    end;
    if s is in F then return "yes"
    else return "no"

NFA simulation:

    S ← ε-closure({s0})
    c ← nextchar;
    while c ≠ eof do
        S ← ε-closure(move(S,c));
        c ← nextchar;
    end;
    if S ∩ F ≠ ∅ then return "yes"
    else return "no"

Final Notes : R.E. to NFA Construction

• So, an NFA may be simulated by algorithm, when NFA is constructed using Previous techniques
• Algorithm run time is proportional to |N| * |x| where |N| is the number of states and |x| is the length of input
• Alternatively, we can construct DFA from NFA and use the resulting Dtran to recognize input:

         space required    time to simulate
NFA      O(|r|)            O(|r|*|x|)
DFA      O(2^|r|)          O(|x|)

where |r| is the length of the regular expression.

Pulling Together Concepts

• Designing Lexical Analyzer Generator
    Reg. Expr. → NFA construction
    NFA → DFA conversion
    DFA simulation for lexical analyzer
• Recall Lex Structure

    Pattern       Action
    Pattern       Action
    …             …

    e.g. (a | b)*abb, (abc)*ab, etc.

  Each pattern is a recognizer!
    - Each pattern recognizes lexemes
    - Each pattern described by regular expression

Lex Specification → Lexical Analyzer

• Let P1, P2, … , Pn be Lex patterns (regular expressions for valid tokens in prog. lang.)
• Construct N(P1), N(P2), … N(Pn)
• Note: accepting state of N(Pi) will be marked by Pi
• Construct NFA:

  [Figure: a new start state with ε-moves into each of N(P1), N(P2), …, N(Pn).]

• Lex applies conversion algorithm to construct DFA that is equivalent!

Pictorially

(a) Lex Compiler: Lex Specification → Lex Compiler → Transition Table

(b) Schematic lexical analyzer: the FA Simulator reads the lexeme from the input buffer and is driven by the Transition Table.

Example

3 patterns:
P1 : a
P2 : abb
P3 : a*b+

NFA's :
P1 :  start → 1 -a-> 2
P2 :  start → 3 -a-> 4 -b-> 5 -b-> 6
P3 :  start → 7 -b-> 8, with an a loop on state 7 and a b loop on state 8

Example – continued (2)
Combined NFA :

[Figure: a new start state 0 with ε-moves to states 1, 3, and 7, the start states of the NFAs for P1, P2, and P3.]

Examples

Input a a b a :
    state sets :        {0,1,3,7} -a-> {2,4,7} -a-> {7} -b-> {8} -a-> death
    pattern matched :   -              P1           -        P3       -

Input a b b :
    state sets :        {0,1,3,7} -a-> {2,4,7} -b-> {5,8} -b-> {6,8}
    pattern matched :   -              P1           P3         P2,P3 (break tie in favor of P2)

Example – continued (3)

Alternatively Construct DFA: (keep track of correspondence between patterns and new accepting states)

               Input Symbol
STATE          a          b          Pattern
{0,1,3,7}      {2,4,7}    {8}        none
{2,4,7}        {7}        {5,8}      P1
{8}            -          {8}        P3
{7}            {7}        {8}        none
{5,8}          -          {6,8}      P3
{6,8}          -          {8}        P2 (break tie in favor of P2)

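A scanner driven by this table keeps running until the DFA dies and then reports the last accepting state it passed through (the longest match). The sketch below illustrates that idea with the six-state DFA for P1 = a, P2 = abb, P3 = a*b+. The state numbering 0-5 follows the row order of the table, and the names longest_match, dtran, and pattern are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

enum { DEAD = -1 };

/* Rows in table order: {0,1,3,7}, {2,4,7}, {8}, {7}, {5,8}, {6,8}.
   Column 0 is input a, column 1 is input b. */
static const int dtran[6][2] = {
    { 1, 2 }, { 3, 4 }, { DEAD, 2 }, { 3, 2 }, { DEAD, 5 }, { DEAD, 2 },
};
/* Pattern announced by each state; 0 means "not accepting".
   State {6,8} matches both P2 and P3, and the tie goes to P2. */
static const int pattern[6] = { 0, 1, 3, 0, 3, 2 };

/* Return the pattern number of the longest match at the start of s
   (0 if none) and store the lexeme length in *len. */
int longest_match(const char *s, size_t *len)
{
    assert(s != NULL && len != NULL);
    int state = 0, best = 0;
    *len = 0;
    for (size_t i = 0; s[i] == 'a' || s[i] == 'b'; i++) {
        state = dtran[state][s[i] == 'b'];
        if (state == DEAD)
            break;                      /* no longer match is possible */
        if (pattern[state] != 0) {      /* remember the last accepting state seen */
            best = pattern[state];
            *len = i + 1;
        }
    }
    return best;
}
```

On aaba the scanner passes through P1 after one character but ends up reporting P3 with lexeme aab; on abb it reports P2 because of the tie-breaking rule.
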
Minimizing the Number of States of DFA

1. Construct initial partition Π of S with two groups: accepting / non-accepting.
2. (Construct Πnew) For each group G of Π do begin
       1. Partition G into subgroups such that two states s,t of G are in the same subgroup iff for all symbols a states s,t have transitions on a to states of the same group of Π.
       2. Replace G in Πnew by the set of all these subgroups.
   end
3. Compare Πnew and Π. If equal, Πfinal := Π then proceed to 4, else set Π := Πnew and goto 2.
4. Aggregate states belonging in the groups of Πfinal

Example

[Figure: a DFA on states A, B, C, D, F with edges labeled a and b.]

Minimized DFA:

[Figure: two states, {A,C,D} and {B,F}, with edges labeled a and b.]

Other Issues - § 3.9 – Not Discussed

• More advanced algorithm construction – regular expression to DFA directly

Using LEX

Lex Program Structure:

    declarations
    %%
    translation rules
    %%
    auxiliary procedures

Name the file e.g. test.lex
Then, "lex test.lex" produces the file "lex.yy.c" (a C-program)

LEX

C declarations:
    %{
    /* definitions of all constants
       LT, LE, EQ, NE, GT, GE, IF, THEN, ELSE, ... */
    %}
    ......
Declarations:
    letter [A-Za-z]
    digit [0-9]
    id {letter}({letter}|{digit})*
    ......
    %%
Rules:
    if { return(IF);}
    then { return(THEN);}
    {id} { yylval = install_id(); return(ID); }
    ......
    %%
Auxiliary:
    install_id()
    { /* procedure to install the lexeme to the ST */
      ......
    }

Example of a Lex Program
        int num_lines = 0, num_chars = 0;   /* indented, so it is copied verbatim into lex.yy.c */
%%
\n      {++num_lines; ++num_chars;}
.       {++num_chars;}
%%
int main( int argc, char **argv )
{
    ++argv, --argc; /* skip over program name */
    if ( argc > 0 )
        yyin = fopen( argv[0], "r" );
    else
        yyin = stdin;
    yylex();
    printf( "# of lines = %d, # of chars = %d\n",
            num_lines, num_chars );
    return 0;
}

Another Example
%{
#include <stdio.h>
%}
WS      [ \t\n]*
%%
[0123456789]+           printf("NUMBER\n");
[a-zA-Z][a-zA-Z0-9]*    printf("WORD\n");
{WS}                    /* do nothing */
.                       printf("UNKNOWN\n");
%%
int main( int argc, char **argv )
{
    ++argv, --argc;
    if ( argc > 0 ) yyin = fopen( argv[0], "r" );
    else yyin = stdin;
    yylex();
    return 0;
}

Concluding Remarks

Focused on Lexical Analysis Process, Including
- Regular Expressions
- Finite Automaton
- Conversion
- Lex
- Interplay among all these various aspects of lexical analysis

Looking Ahead:
The next step in the compilation process is Parsing:
- Top-down vs. Bottom-up
- Relationship to Language Theory

41 pages
Os Manual Next Sem
No ratings yet
Os Manual Next Sem
64 pages
Become A Python Developer in 45 Days - Bababaana
No ratings yet
Become A Python Developer in 45 Days - Bababaana
21 pages
Ch3 Modified
No ratings yet
Ch3 Modified
80 pages
SIM Cards Personalization
No ratings yet
SIM Cards Personalization
35 pages
Scanner (Lexical Analyzer) : The Structure of A Compiler
No ratings yet
Scanner (Lexical Analyzer) : The Structure of A Compiler
109 pages
2 - Lexical Analysis
No ratings yet
2 - Lexical Analysis
52 pages
Lecture3 E
No ratings yet
Lecture3 E
153 pages
Ch3 1
No ratings yet
Ch3 1
52 pages
21CS51 ATCD MODULE 2 - 2 Lexical Analyser Part2
No ratings yet
21CS51 ATCD MODULE 2 - 2 Lexical Analyser Part2
62 pages
Unit II - Lexical Analysis-20-1-2021
No ratings yet
Unit II - Lexical Analysis-20-1-2021
49 pages
Chapter 3 Finite Automata and Lexical Analysis
No ratings yet
Chapter 3 Finite Automata and Lexical Analysis
100 pages
Lexical Analysis
No ratings yet
Lexical Analysis
31 pages
Chapter 9 - Public Key Cryptography and RSA
No ratings yet
Chapter 9 - Public Key Cryptography and RSA
26 pages
Chapter 2
No ratings yet
Chapter 2
77 pages
Slides 02 - Compiler Construction - UET CS - Lexical Analyzer Rev 2
No ratings yet
Slides 02 - Compiler Construction - UET CS - Lexical Analyzer Rev 2
69 pages
Unit2-Compiler Design
No ratings yet
Unit2-Compiler Design
24 pages
Chapter 3 - Lexical Analysis and Lexical Analyzer Generators
No ratings yet
Chapter 3 - Lexical Analysis and Lexical Analyzer Generators
52 pages
Lecture 03
No ratings yet
Lecture 03
42 pages
Ch3 - Lexical Analysis
No ratings yet
Ch3 - Lexical Analysis
52 pages
Lexical Analysis1
No ratings yet
Lexical Analysis1
44 pages
Chapter 7 Lexical Analysis
No ratings yet
Chapter 7 Lexical Analysis
61 pages
Chapter 2
No ratings yet
Chapter 2
27 pages
Lexical Analysis
No ratings yet
Lexical Analysis
62 pages
Application Trial Maker - CodeProject
No ratings yet
Application Trial Maker - CodeProject
8 pages
C Practical Assignment in Word
No ratings yet
C Practical Assignment in Word
68 pages
Chapter2-Lexical Analysis
No ratings yet
Chapter2-Lexical Analysis
64 pages
CSC 415 Compiler Design: Lexical Analysis
No ratings yet
CSC 415 Compiler Design: Lexical Analysis
40 pages
2024 CD-Ch02 Lexical Analysis
No ratings yet
2024 CD-Ch02 Lexical Analysis
25 pages
WINSEM2023-24 CSI2005 TH VL2023240501823 2024-01-08 Reference-Material-I
No ratings yet
WINSEM2023-24 CSI2005 TH VL2023240501823 2024-01-08 Reference-Material-I
23 pages
String Algorithms in C: Efficient Text Representation and Search 1st Edition Thomas Mailund All Chapter Instant Download
100% (8)
String Algorithms in C: Efficient Text Representation and Search 1st Edition Thomas Mailund All Chapter Instant Download
49 pages
CH 3 Myppt
No ratings yet
CH 3 Myppt
59 pages
Lexical Analysis
No ratings yet
Lexical Analysis
36 pages
Lexical Analysis
No ratings yet
Lexical Analysis
57 pages
Lexical Analysis
No ratings yet
Lexical Analysis
121 pages
ch-2.pdf 2
No ratings yet
ch-2.pdf 2
27 pages
Python Unit 2
No ratings yet
Python Unit 2
29 pages
Chapter 2 Lexical - Analysis
No ratings yet
Chapter 2 Lexical - Analysis
38 pages
SSC Module2 LexicalAnalysis
No ratings yet
SSC Module2 LexicalAnalysis
26 pages
Ch2+3 Compiler
No ratings yet
Ch2+3 Compiler
21 pages
CNL 10 PDF
No ratings yet
CNL 10 PDF
12 pages
PU Syllabus PDF
No ratings yet
PU Syllabus PDF
5 pages
Lexical Analyzer in Perspective: Parser Source Program Token
No ratings yet
Lexical Analyzer in Perspective: Parser Source Program Token
22 pages
Lexical Analysis
No ratings yet
Lexical Analysis
62 pages
4 LexicalAnalysis
No ratings yet
4 LexicalAnalysis
27 pages
ch-2 Compiler Design
No ratings yet
ch-2 Compiler Design
9 pages
Compiler - Lexical Analyzer-2
No ratings yet
Compiler - Lexical Analyzer-2
16 pages
Chapter-3 Short
No ratings yet
Chapter-3 Short
50 pages
Slides CHP 3 and 4
No ratings yet
Slides CHP 3 and 4
21 pages
Course Title:: Modern Programming Tools and Techniques-I
No ratings yet
Course Title:: Modern Programming Tools and Techniques-I
11 pages
DM 2018-19
No ratings yet
DM 2018-19
5 pages
Bhoj Reddy Engineering College For Women: Department of Information Technology
No ratings yet
Bhoj Reddy Engineering College For Women: Department of Information Technology
3 pages
Lec2 LexicalAnalyser
No ratings yet
Lec2 LexicalAnalyser
30 pages
DigitalForensics Autonomous Syllabus
No ratings yet
DigitalForensics Autonomous Syllabus
2 pages
Bhoj Reddy Engineering College For Women: Hyderabad: Department of Information Technology
No ratings yet
Bhoj Reddy Engineering College For Women: Hyderabad: Department of Information Technology
2 pages
Academic Council Minutes October 2020
No ratings yet
Academic Council Minutes October 2020
2 pages
115AP032016
No ratings yet
115AP032016
2 pages
Bhoj Reddy Engineering College For Women: Hyderabad: Department of Information Technology
No ratings yet
Bhoj Reddy Engineering College For Women: Hyderabad: Department of Information Technology
2 pages
WWW - Manaresults.Co - In: (Computer Science and Engineering)
No ratings yet
WWW - Manaresults.Co - In: (Computer Science and Engineering)
2 pages
Pps Internal1
No ratings yet
Pps Internal1
3 pages
Naive String Matching
No ratings yet
Naive String Matching
2 pages
Compiler Design - Phases of Compiler
No ratings yet
Compiler Design - Phases of Compiler
2 pages
Compiler Design - Lexical Analysis
No ratings yet
Compiler Design - Lexical Analysis
2 pages
IT B Farewell Bill 2018
No ratings yet
IT B Farewell Bill 2018
1 page
Lex Analysis and Implementation: Definitive Reference for Developers and Engineers
From Everand
Lex Analysis and Implementation: Definitive Reference for Developers and Engineers
Richard Johnson
No ratings yet
The Tech Interview Playbook: From DSA to System Design
From Everand
The Tech Interview Playbook: From DSA to System Design
Chinmoy Mukherjee
No ratings yet