Chapter 3: Lexical Analysis
Aggelos Kiayias
Computer Science & Engineering Department
CSE244 The University of Connecticut
371 Fairfield Road, Box U-1155
Storrs, CT 06269
http://www.cse.uconn.edu/~akiayias
Lexical Analysis
Basic Concepts & Regular Expressions
    What does a Lexical Analyzer do?
    How does it Work?
    Formalizing Token Definition & Recognition
LEX - A Lexical Analyzer Generator (Defer)
Reviewing Finite Automata Concepts
    Non-Deterministic and Deterministic FA
    Conversion Process
        Regular Expressions to NFA
        NFA to DFA
Relating NFAs/DFAs/Conversion to Lexical Analysis
Concluding Remarks / Looking Ahead
Lexical Analyzer in Perspective
[Figure: source program → lexical analyzer → parser. The parser sends "get next token" to the lexical analyzer, which returns a token. Both boxes are connected to the symbol table.]
Important Issue:
What are Responsibilities of each Box ?
Focus on Lexical Analyzer and Parser
Lexical Analyzer in Perspective
LEXICAL ANALYZER              PARSER
Scan Input                    Perform Syntax Analysis
Remove WS, NL, …              Actions Dictated by Token Order
Identify Tokens               Update Symbol Table Entries
Create Symbol Table           Create Abstract Rep. of Source
Insert Tokens into ST         Generate Errors
Generate Errors               And More…. (We’ll see later)
Send Tokens to Parser
What Factors Have Influenced the Functional Division of Labor ?
Separation of Lexical Analysis From Parsing Presents a Simpler Conceptual Model
    From a Software Engineering Perspective Division Emphasizes
        High Cohesion and Low Coupling
        Implies Well Specified Parallel Implementation
Separation Increases Compiler Efficiency (I/O Techniques to Enhance Lexical Analysis)
Separation Promotes Portability.
    This is critical today, when platforms (OSs and Hardware) are numerous and varied!
    Emergence of Platform Independence - Java
Introducing Basic Terminology
What are Major Terms for Lexical Analysis?
TOKEN
    A classification for a common set of strings
    Examples Include <Identifier>, <number>, etc.
PATTERN
    The rules which characterize the set of strings for a token
    Recall File and OS Wildcards ([A-Z]*.*)
LEXEME
    Actual sequence of characters that matches pattern and is classified by a token
    Identifiers: x, count, name, etc…
Handling Lexical Errors
Error Handling is very localized, with Respect to Input Source
    For example: whil ( x := 0 ) do
    generates no lexical errors in PASCAL
    (whil is scanned as an identifier; the mistake is only caught later, by the parser)
In what Situations do Errors Occur?
    Prefix of remaining input doesn’t match any defined token
Possible error recovery actions:
    Deleting or Inserting Input Characters
    Replacing or Transposing Characters
Is efficiency an issue?
3 Lexical Analyzer construction techniques. How do they address efficiency?
    Lexical Analyzer Generator
    Character-at-a-time I/O
    Block / Buffered I/O
Tradeoffs ?
Block/Buffered I/O
    Utilize Block of memory
    Stage data from source to buffer block at a time
    Maintain two blocks - Why (Recall OS)?
Subsequence: bnan, nn
Language Concepts
A language, L, is simply any set of strings over a fixed alphabet.
Alphabet                  Languages
{0,1}                     {0,10,100,1000,100000…}
                          {0,1,00,11,000,111,…}
{a,b,c}                   {abc,aabbcc,aaabbbccc,…}
{A, … ,Z}                 {TEE,FORE,BALL,…}
                          {FOR,WHILE,GOTO,…}
{A,…,Z,a,…,z,0,…,9,       { All legal PASCAL progs}
 +,-,…,<,>,…}             { All grammatically correct English sentences }
Special Languages:  ∅ - EMPTY LANGUAGE
                    {ε} - contains ε string only
Formal Language Operations
Kleene closure of L, written L*:     $L^* = \bigcup_{i \ge 0} L^i$
Positive closure of L, written L+:   $L^+ = \bigcup_{i \ge 1} L^i$
Formal Language Operations
Examples
L = {A, B, C, D }    D = {1, 2, 3}
L ∪ D = {A, B, C, D, 1, 2, 3 }
LD = {A1, A2, A3, B1, B2, B3, C1, C2, C3, D1, D2, D3 }
L^2 = { AA, AB, AC, AD, BA, BB, BC, BD, CA, … DD}
L^4 = L^2 L^2 = ??
L* = { All possible strings of L plus ε }
L+ = L* - {ε}
L (L ∪ D ) = ??
L (L ∪ D )* = ??
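The concatenation example can be checked mechanically. The following is a minimal sketch in C, assuming the two finite languages are given as arrays of strings; the function name `concat_languages` and the fixed 16-byte output slots are illustrative choices, not part of the slides.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Build the concatenation of two finite languages: every string s t
   with s taken from l and t taken from d, in row-major order.
   Each result is written into a 16-byte slot of out (an assumption
   that is ample for the one-character strings of the example).
   Returns the number of strings produced. */
size_t concat_languages(const char *l[], size_t nl,
                        const char *d[], size_t nd,
                        char out[][16])
{
    assert(l != NULL && d != NULL && out != NULL);
    size_t k = 0;
    for (size_t i = 0; i < nl; i++) {
        for (size_t j = 0; j < nd; j++) {
            snprintf(out[k], 16, "%s%s", l[i], d[j]);
            k++;
        }
    }
    return k;
}
```

Called with L = {A, B, C, D} and D = {1, 2, 3}, it fills twelve slots, starting with A1 and ending with D3. Concatenating L with itself the same way gives the sixteen strings of L^2.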
Language & Regular Expressions
A Regular Expression is a Set of Rules / Techniques for Constructing Sequences of Symbols (Strings) From an Alphabet.
Rules for Specifying Regular Expressions:
fix alphabet Σ
L = {A, B, C, D }    D = {1, 2, 3}
A | B | C | D = L
(A | B | C | D ) (A | B | C | D ) = L^2
(A | B | C | D )* = L*
(A | B | C | D ) ((A | B | C | D ) | ( 1 | 2 | 3 )) = L (L ∪ D)
Algebraic Properties of Regular Expressions
AXIOM                         DESCRIPTION
r | s = s | r                 | is commutative
r | (s | t) = (r | s) | t     | is associative
(r s) t = r (s t)             concatenation is associative
r (s | t) = r s | r t
(s | t) r = s r | t r         concatenation distributes over |
εr = r
rε = r                        ε is the identity element for concatenation
Regular Expression Examples
• All Strings that start with “tab” or end with “bat”:
  tab{A,…,Z,a,…,z}* | {A,…,Z,a,…,z}*bat
Towards Token Definition
Regular Definitions: Associate names with Regular Expressions
For Example : PASCAL IDs
    letter → A | B | C | … | Z | a | b | … | z
    digit  → 0 | 1 | 2 | … | 9
    id     → letter ( letter | digit )*
Shorthand Notation:
    “+” : one or more        r* = r+ | ε   and   r+ = r r*
    “?” : zero or one        r? = r | ε
    [range] : set range of characters (replaces “|” )
              [A-Z] = A | B | C | … | Z
Example Using Shorthand : PASCAL IDs
    id → [A-Za-z][A-Za-z0-9]*
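The id pattern translates almost directly into a hand-written recognizer. This is a minimal sketch in C, assuming the default "C" locale so that `isalpha` and `isalnum` accept exactly the ASCII letters and digits; the function name `is_pascal_id` is hypothetical.

```c
#include <assert.h>
#include <ctype.h>
#include <stdbool.h>
#include <stddef.h>

/* Returns true iff the whole string s matches [A-Za-z][A-Za-z0-9]*. */
bool is_pascal_id(const char *s)
{
    assert(s != NULL);
    /* First character: must be a letter (the "letter" part). */
    if (!isalpha((unsigned char)*s))
        return false;
    /* Remaining characters: letters or digits, zero or more times. */
    for (s++; *s != '\0'; s++) {
        if (!isalnum((unsigned char)*s))
            return false;
    }
    return true;
}
```

For example, `count` and `x1` are accepted, while `1x`, the empty string, and `a_b` are rejected.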
Token Recognition
How can we use concepts developed so far to assist in recognizing tokens of a source language ?
Assume Following Tokens:
    if, then, else, relop, id, num
    blank   → b
    tab     → ^T
    newline → ^M
    delim   → blank | tab | newline
    ws      → delim+
Overall
Regular Expression    Token    Attribute-Value
ws                    -        -
if                    if       -
then                  then     -
else                  else     -
id                    id       pointer to table entry
num                   num      pointer to table entry
<                     relop    LT
<=                    relop    LE
=                     relop    EQ
<>                    relop    NE
>                     relop    GT
>=                    relop    GE
Note: Each token has a unique token identifier to define category of lexemes
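Constructing Transition Diagrams for Tokens
Transition diagram for >= and > (a state marked * retracts one input character):
    state 0 (start) --   >   --> state 6
    state 6         --   =   --> state 7      RTN(GE)
    state 6         -- other --> state 8 *    RTN(GT)
Transition diagram for all relops:
    state 0 (start) --   <   --> state 1
    state 1         --   =   --> state 2      return(relop, LE)
    state 1         --   >   --> state 3      return(relop, NE)
    state 1         -- other --> state 4 *    return(relop, LT)
    state 0         --   =   --> state 5      return(relop, EQ)
    state 0         --   >   --> state 6
    state 6         --   =   --> state 7      return(relop, GE)
    state 6         -- other --> state 8 *    return(relop, GT)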
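The relop transition diagram (states 0 through 8) can be coded as one switch on the current state. The sketch below is in C and scans a string instead of a real input buffer; the function name `scan_relop`, the `NO_RELOP` value and the `len` out-parameter are assumptions made for illustration. States 4 and 8 are the ones that give back the lookahead character.

```c
#include <assert.h>
#include <stddef.h>

typedef enum { LT, LE, EQ, NE, GT, GE, NO_RELOP } Relop;

/* Scan a relational operator at the start of s.
   Stores the lexeme length in *len and returns its attribute value,
   or NO_RELOP (with *len == 0) if s does not start with a relop. */
Relop scan_relop(const char *s, size_t *len)
{
    assert(s != NULL && len != NULL);
    int state = 0;
    size_t forward = 0;              /* index of the next unread character */
    for (;;) {
        char c;
        switch (state) {
        case 0:
            c = s[forward++];
            if (c == '<') state = 1;
            else if (c == '=') state = 5;
            else if (c == '>') state = 6;
            else { *len = 0; return NO_RELOP; }
            break;
        case 1:
            c = s[forward++];
            if (c == '=') state = 2;
            else if (c == '>') state = 3;
            else state = 4;
            break;
        case 2: *len = forward;     return LE;
        case 3: *len = forward;     return NE;
        case 4: *len = forward - 1; return LT;   /* starred: retract one char */
        case 5: *len = forward;     return EQ;
        case 6:
            c = s[forward++];
            state = (c == '=') ? 7 : 8;
            break;
        case 7: *len = forward;     return GE;
        case 8: *len = forward - 1; return GT;   /* starred: retract one char */
        }
    }
}
```

On the input `<5` the scanner reads two characters, lands in state 4, and reports LT with a lexeme length of 1, so the `5` stays available for the next token.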
Example TDs : id and delim
id :
[Figure: transition diagrams for id and delim; the id diagram loops on "letter or digit"]
Example TDs : Unsigned #s
[Figure: transition diagrams for unsigned numbers, built from digit transitions; the accepting states execute return(num, install_num())]
QUESTION :
Answer
    cons   → B | C | D | F | … | Z
    string → cons* A cons* E cons* I cons* O cons* U cons*
[Figure: transition diagram for string, ending in an accept state]
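The regular definition for string describes uppercase words that contain each of the vowels A, E, I, O, U exactly once and in that order, with any number of consonants in between. A hand-coded recognizer only needs to remember how many vowels have been seen. This is a minimal sketch in C, assuming an ASCII-compatible character set; the function name `vowels_in_order` is hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Returns true iff s matches cons* A cons* E cons* I cons* O cons* U cons*,
   where cons is any uppercase letter other than A, E, I, O, U. */
bool vowels_in_order(const char *s)
{
    assert(s != NULL);
    static const char vowels[] = "AEIOU";
    size_t state = 0;                      /* number of vowels seen so far */
    for (; *s != '\0'; s++) {
        if (*s < 'A' || *s > 'Z')
            return false;                  /* outside the alphabet {A,…,Z} */
        if (strchr(vowels, *s) != NULL) {
            if (state < 5 && *s == vowels[state])
                state++;                   /* the next expected vowel */
            else
                return false;              /* repeated or out-of-order vowel */
        }
        /* a consonant leaves the state unchanged (the cons* loops) */
    }
    return state == 5;                     /* accept only after U was seen */
}
```

FACETIOUS and ABSTEMIOUS are accepted; EDUCATION is rejected because E arrives before A.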
Implementing Transition Diagrams
FUNCTIONS USED: nextchar(), forward, retract(), install_num(), install_id(), gettoken(), isdigit(), isletter(), recover()
[Figure: the relop transition diagram, states 0-8]

lexeme_beginning = forward;
state = 0;
token nexttoken()
{   while(1) {                  /* repeat until a “return” occurs */
        switch (state) {
        case 0: c = nextchar();
                /* c is lookahead character */
                if (c == blank || c == tab || c == newline) {
                    state = 0;
                    lexeme_beginning++;
                    /* advance beginning of lexeme */
                }
                else if (c == '<') state = 1;
                else if (c == '=') state = 5;
                else if (c == '>') state = 6;
                else state = fail();
                break;
        … /* cases 1-8 here */
Implementing Transition Diagrams, II
[Figure: state 25 --digit--> state 26 (loops on digit) --other--> state 27 *]
Case numbers correspond to transition diagram states !
.............
case 25: c = nextchar();                 /* nextchar() advances forward */
         if (isdigit(c)) state = 26;
         else state = fail();
         break;
case 26: c = nextchar();
         if (isdigit(c)) state = 26;
         else state = 27;
         break;
case 27: retract(1);                     /* retracts forward */
         lexical_value = install_num();  /* looks at the region lexeme_beginning ... forward */
         return ( NUM );
.............
Implementing Transition Diagrams, III
.............
case 9:  c = nextchar();
         if (isletter(c)) state = 10;
         else state = fail();
         break;
case 10: c = nextchar();
         if (isletter(c)) state = 10;
         else if (isdigit(c)) state = 10;
         else state = 11;
         break;
case 11: retract(1); lexical_value = install_id();
         return ( gettoken(lexical_value) );   /* reads token name from ST */
.............
[Figure: state 9 --letter--> state 10 (loops on letter or digit) --other--> state 11 *]
When Failures Occur:
int fail()
{   start = state;
    forward = lexeme_beginning;
    switch (start) {
    case 0:  start = 9;  break;      /* switch to next transition diagram */
    case 9:  start = 12; break;
    case 12: start = 20; break;
    case 20: start = 25; break;
    case 25: recover();  break;
    default: /* lex error */ break;
    }
    return start;
}
Finite Automata & Language Theory
NFAs & DFAs
Non-Deterministic Finite Automata
Representing NFAs
Example NFA
S = { 0, 1, 2, 3 }
s0 = 0
F = { 3 }
Σ = { a, b }
Transitions: 0 --a--> 0,  0 --b--> 0,  0 --a--> 1,  1 --b--> 2,  2 --b--> 3
What Language is defined ?
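Handling Undefined Transitions
[Figure: the automaton over states 0, 1, 2, 3 with an added state 4; state 4 loops to itself on a, b and receives the transitions that were previously undefined]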
NFA - Regular Expressions & Compilation
Problems with NFAs for Regular Expressions:
1. Valid input might not be accepted
2. NFA may behave differently on the same input
Second NFA Example
Second NFA Example - Solution
[Figure: NFA diagram for the solution, with edges labeled a, b and c]
String abbc can be accepted.
Alternative Solution Strategy
a (b*c) :      1 --a--> 2,  2 --b--> 2,  2 --c--> 3
a (b | c+)? :  [Figure: NFA starting at state 4, with 4 --a--> 5 and further edges labeled b and c]
Now that you have the individual diagrams, “or” them as follows:
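Using Null Transitions to “OR” NFAs
[Figure: a new start state 0 with ε-transitions to state 1, the start of the NFA for a (b*c), and to state 4, the start of the NFA for a (b | c+)?]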
Other Concepts
Deterministic Finite Automata
[Figure: DFA over states 0, 1, 2, 3 with edges labeled a and b]
Conversion : NFA → DFA Algorithm
Converting NFA to DFA – 1st Look
[Figure: NFA with ε-transitions over states 0-8; labeled edges 2 --a--> 3, 3 --b--> 4 and 6 --c--> 7]
From State 0, Where can we move without consuming any input ?
This forms a new state: 0,1,2,6,8 What transitions are defined for this new state ?
The Resulting DFA
[Figure: DFA with states {0,1,2,6,8}, {3}, {1,2,4,5,6,8} and {1,2,5,6,7,8}; edges labeled a, b and c]
[Figure: NFA for the conversion example, states 0-10, with edges labeled a and b]
b : ε-closure(move(A,b)) = ε-closure(move({0,1,2,4,7},b))
    adds {5} ( since move(4,b)=5)
b : ε-closure(move(B,b)) = ε-closure(move({1,2,3,4,6,7,8},b))
    = {1,2,4,5,6,7,9} = D
    Define Dtran[B,b] = D.
b : ε-closure(move(C,b)) = ε-closure(move({1,2,4,5,6,7},b))
    = {1,2,4,5,6,7} = C
    Define Dtran[C,b] = C.
Conversion Example – continued (3)
5th, we calculate for state D on {a,b}
a : ε-closure(move(D,a)) = ε-closure(move({1,2,4,5,6,7,9},a))
    = {1,2,3,4,6,7,8} = B
    Define Dtran[D,a] = B.
b : ε-closure(move(D,b)) = ε-closure(move({1,2,4,5,6,7,9},b))
    = {1,2,4,5,6,7,10} = E
    Define Dtran[D,b] = E.
b : ε-closure(move(E,b)) = ε-closure(move({1,2,4,5,6,7,10},b))
    = {1,2,4,5,6,7} = C
    Define Dtran[E,b] = C.
Conversion Example – continued (4)
This gives the transition table Dtran for the DFA of:
              Input Symbol
Dstates       a       b
A             B       C
B             B       D
C             B       C
D             B       E
E             B       C
[Figure: the resulting DFA with start state A and states B, C, D, E; its edges are the Dtran entries listed in the table]
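Once Dtran is known, simulating the DFA is a table lookup per input character. The sketch below is in C and hard-codes the five rows of the table. Treating E as the only accepting state is an assumption: it is the one subset that contains NFA state 10, which is taken here to be the NFA's accepting state. The function name `dfa_accepts` is hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum { A, B, C, D, E, N_DSTATES };

/* Dtran[state][symbol]; column 0 is input a, column 1 is input b. */
static const int Dtran[N_DSTATES][2] = {
    /* A */ { B, C },
    /* B */ { B, D },
    /* C */ { B, C },
    /* D */ { B, E },
    /* E */ { B, C },
};

/* Returns true iff the DFA ends in state E after reading all of x.
   Any character other than a or b rejects the input. */
bool dfa_accepts(const char *x)
{
    assert(x != NULL);
    int s = A;                               /* start state */
    for (; *x != '\0'; x++) {
        if (*x != 'a' && *x != 'b')
            return false;
        s = Dtran[s][*x == 'b'];             /* one table lookup per character */
    }
    return s == E;
}
```

With this table the input abb visits A, B, D, E and is accepted, while abba ends in B and is rejected.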
Algorithm For Subset Construction
Motivation: Construct NFA For:
    ε :            start → i --ε--> f
    a :            start → 0 --a--> 1
    b :            start → A --b--> B
    ab :           start → 0 --a--> 1, joined in sequence with A --b--> B
    ε | ab :
    a* :
    ( ε | ab )* :
Construction Algorithm : R.E. → NFA
Construction Process :
1st : Identify subexpressions of the regular expression
    symbols
    r | s
    rs
    r*
Piecing Together NFAs
[Figure: NFA for a single symbol a: start → i --a--> f, recognizing L(a)]
Piecing Together NFAs – continued(1)
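Piecing Together NFAs – continued(2)
3.(b) If s, t are regular expressions, N(s), N(t) their NFAs
      st (concatenation) has NFA:
[Figure: start → i, N(s) followed by N(t), then f, recognizing L(s) L(t); N(s) and N(t) overlap in one shared state]
Alternative:
[Figure: start → i, N(s), N(t), f drawn as two separate machines joined in sequence]
[Figure: start → i, N(s), f]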
Detailed Example
See example 3.16 in textbook for (a | b)*abb
2nd Example - (ab*c) | (a(b|c*))
Parse Tree for this regular expression:
    r13 = r5 | r12
        r5 = r3 r4
            r3 = a
            r4 = r1 r2
                r1 = r0 *
                    r0 = b
                r2 = c
        r12 = r11 r10
            r11 = a
            r10 = ( r9 )
                r9 = r7 | r8
                    r7 = b
                    r8 = r6 *
                        r6 = c
What is the NFA? Let’s construct it !
Detailed Example – Construction(1)
r0 : b       r1 : b*      r2 : c       r3 : a
r4 : r1 r2 = b*c
r5 : r3 r4 = ab*c
[Figure: NFA fragments for r0 through r5]
Detailed Example – Construction(2)
r6 : c       r7 : b       r8 : c*      r11 : a
r9 : r7 | r8 = b | c*
r10 : ( r9 )
[Figure: NFA fragments for r6 through r11]
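Detailed Example – Final Step
r13 : r5 | r12
[Figure: the complete NFA over states 1-17; the upper branch (states 2-7) recognizes ab*c, the lower branch (states 8-16) recognizes a(b|c*), and both are joined at states 1 and 17]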
Direct Simulation of an NFA
DFA simulation:
    s ← s0
    c ← nextchar;
    while c ≠ eof do
        s ← move(s,c);
        c ← nextchar;
    end;
    if s is in F then return “yes”
    else return “no”
NFA simulation:
    S ← ε-closure({s0})
    c ← nextchar;
    while c ≠ eof do
        S ← ε-closure(move(S,c));
        c ← nextchar;
    end;
    if S ∩ F ≠ ∅ then return “yes”
    else return “no”
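The NFA simulation loop can be sketched in C by storing each set of states as a bit mask. The transition tables below are an assumption: they encode the textbook's 11-state NFA for (a | b)*abb, chosen because its ε-closures reproduce the sets A through E used in the conversion example, with state 10 taken as the accepting state. The names `eps_closure`, `nfa_move` and `nfa_accepts` are illustrative.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define NFA_STATES 11
typedef unsigned StateSet;          /* bit i set <=> NFA state i is in the set */
#define BIT(i) (1u << (i))

/* Assumed transitions; entries not listed are the empty set. */
static const StateSet eps[NFA_STATES] = {
    [0] = BIT(1) | BIT(7), [1] = BIT(2) | BIT(4),
    [3] = BIT(6), [5] = BIT(6), [6] = BIT(1) | BIT(7),
};
static const StateSet on_a[NFA_STATES] = { [2] = BIT(3), [7] = BIT(8) };
static const StateSet on_b[NFA_STATES] = { [4] = BIT(5), [8] = BIT(9), [9] = BIT(10) };

/* All states reachable from T using ε-transitions only. */
StateSet eps_closure(StateSet T)
{
    StateSet closure = T, prev;
    do {                                        /* iterate to a fixed point */
        prev = closure;
        for (int s = 0; s < NFA_STATES; s++)
            if (closure & BIT(s))
                closure |= eps[s];
    } while (closure != prev);
    return closure;
}

/* All states reachable from some state in T by one transition on c. */
StateSet nfa_move(StateSet T, char c)
{
    const StateSet *delta = (c == 'a') ? on_a : (c == 'b') ? on_b : NULL;
    StateSet out = 0;
    if (delta == NULL)
        return 0;                               /* symbol outside {a, b} */
    for (int s = 0; s < NFA_STATES; s++)
        if (T & BIT(s))
            out |= delta[s];
    return out;
}

/* S ← ε-closure({s0}); for each c: S ← ε-closure(move(S,c)); accept iff S ∩ F ≠ ∅. */
bool nfa_accepts(const char *x)
{
    assert(x != NULL);
    StateSet S = eps_closure(BIT(0));
    for (; *x != '\0'; x++)
        S = eps_closure(nfa_move(S, *x));
    return (S & BIT(10)) != 0;
}
```

Here `eps_closure(BIT(0))` yields the set {0,1,2,4,7}, the same set the conversion example calls A. The simulation never builds the DFA. It recomputes one subset per input character, which is why it saves space at the cost of time.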
Final Notes : R.E. to NFA Construction
        space required    time to simulate
NFA     O(|r|)            O(|r|*|x|)
DFA     O(2^|r|)          O(|x|)
(|r| is the length of the regular expression, |x| the length of the input string)
3 patterns:
    P1 : a
    P2 : abb
    P3 : a*b+
NFA’s :
    P1 : start → 1 --a--> 2
    P2 : start → 3 --a--> 4 --b--> 5 --b--> 6
    P3 : start → 7 --b--> 8, with loops 7 --a--> 7 and 8 --b--> 8
Example – continued (2)
Combined NFA : a new start state 0 leads to the start states 1 (P1), 3 (P2) and 7 (P3); accepting states are 2 (P1), 6 (P2) and 8 (P3)
Examples
Input a a b a :
    {0,1,3,7} --a--> {2,4,7} --a--> {7} --b--> {8} --a--> death
    pattern matched:  -  P1  -  P3  -
Input a b b :
    {0,1,3,7} --a--> {2,4,7} --b--> {5,8} --b--> {6,8}
    pattern matched:  -  P1  P3  P2,P3   (break tie in favor of P2)
Example – continued (3)
                Input Symbol
STATE           a          b          Pattern
{0,1,3,7}       {2,4,7}    {8}        none
{2,4,7}         {7}        {5,8}      P1
{8}             -          {8}        P3
{7}             {7}        {8}        none
{5,8}           -          {6,8}      P3
{6,8}           -          {8}        P2 (break tie in favor of P2)
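The table can drive a longest-match scanner: keep running until the DFA dies, and remember the last state that announced a pattern. The sketch below is in C and hard-codes the six rows of the table; the state names, `DEAD`, and the function `longest_match` are hypothetical, and the input is a plain string instead of a buffered source.

```c
#include <assert.h>
#include <stddef.h>

/* One name per row of the table, in the order the rows appear. */
enum { S0137, S247, S8, S7, S58, S68, PM_STATES, DEAD = -1 };

/* next_state[state][symbol]; column 0 is input a, column 1 is input b. */
static const int next_state[PM_STATES][2] = {
    /* {0,1,3,7} */ { S247, S8  },
    /* {2,4,7}   */ { S7,   S58 },
    /* {8}       */ { DEAD, S8  },
    /* {7}       */ { S7,   S8  },
    /* {5,8}     */ { DEAD, S68 },
    /* {6,8}     */ { DEAD, S8  },
};

/* Pattern announced by each state: 1 = P1, 2 = P2, 3 = P3, 0 = none.
   {6,8} matches both P2 and P3; the tie is broken in favor of P2. */
static const int pattern[PM_STATES] = { 0, 1, 3, 0, 3, 2 };

/* Returns the pattern number of the longest matching prefix of x
   (0 if no prefix matches) and stores the lexeme length in *len. */
int longest_match(const char *x, size_t *len)
{
    assert(x != NULL && len != NULL);
    int state = S0137, last_pattern = 0;
    size_t last_len = 0;
    for (size_t i = 0; x[i] == 'a' || x[i] == 'b'; i++) {
        state = next_state[state][x[i] == 'b'];
        if (state == DEAD)
            break;                         /* no transition: stop scanning */
        if (pattern[state] != 0) {         /* remember the most recent match */
            last_pattern = pattern[state];
            last_len = i + 1;
        }
    }
    *len = last_len;
    return last_pattern;
}
```

On the input aaba the scanner reports P3 with length 3, matching the first trace: the last accepting set before death is {8}. On abb it reports P2 with length 3.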
Minimizing the Number of States of DFA
[Figure: Minimized DFA, including the merged states {A,C,D} and {B,F}; edges labeled a and b]
Other Issues - § 3.9 – Not Discussed
Using LEX
LEX
C declarations:
    %{
    /* definitions of all constants
       LT, LE, EQ, NE, GT, GE, IF, THEN, ELSE, ... */
    %}
    ......
declarations:
    letter   [A-Za-z]
    digit    [0-9]
    id       {letter}({letter}|{digit})*
    ......
    %%
Rules:
    if       { return(IF);}
    then     { return(THEN);}
    {id}     { yylval = install_id(); return(ID); }
    ......
    %%
Auxiliary:
    install_id()
    { /* procedure to install the lexeme to the ST */ }
Example of a Lex Program
%{
int num_lines = 0, num_chars = 0;
%}
%%
\n      { ++num_lines; ++num_chars; }
.       { ++num_chars; }
%%
int main(int argc, char **argv)
{
    ++argv, --argc;    /* skip over program name */
    if ( argc > 0 )
        yyin = fopen( argv[0], "r" );
    else
        yyin = stdin;
    yylex();
    printf( "# of lines = %d, # of chars = %d\n",
            num_lines, num_chars );
    return 0;
}
Another Example
%{
#include <stdio.h>
%}
WS      [ \t\n]*
%%
[0123456789]+           printf("NUMBER\n");
[a-zA-Z][a-zA-Z0-9]*    printf("WORD\n");
{WS}                    /* do nothing */
.                       printf("UNKNOWN\n");
%%
Looking Ahead:
The next step in the compilation process is Parsing:
- Top-down vs. Bottom-up
-- Relationship to Language Theory