
1
Lexical Analysis and Lexical Analyzer Generators

Chapter 3
Compiler Construction
2
The Reason Why Lexical Analysis is a Separate Phase

• Simplifies the design of the compiler
  – LL(1) or LR(1) parsing with 1 token of lookahead would not be possible otherwise (multiple characters/tokens would have to be matched)
• Provides an efficient implementation
  – Systematic techniques to implement lexical analyzers by hand or automatically from specifications
  – Stream buffering methods to scan the input
• Improves portability
  – Non-standard symbols and alternate character encodings can be normalized (e.g. UTF-8, trigraphs)
3
Interaction of the Lexical Analyzer with the Parser

[Diagram: the source program feeds the lexical analyzer; the parser asks it to "get next token" and receives a token together with its attribute value tokenval; both components report errors and both consult the symbol table]
4
Attributes of Tokens

The lexical analyzer turns the source text

  y := 31 + 28*x

into the token stream

  <id, "y"> <assign, > <num, 31> <'+', > <num, 28> <'*', > <id, "x">

The parser sees the token name (its lookahead); tokenval carries the token attribute (e.g. a numeric value or a symbol-table pointer).
5
Tokens, Patterns, and Lexemes

• A token is a classification of lexical units
  – For example: id and num
• Lexemes are the specific character strings that make up a token
  – For example: abc and 123
• Patterns are rules describing the set of lexemes belonging to a token
  – For example: "letter followed by letters and digits" and "non-empty sequence of digits"
6
Specification of Patterns for Tokens: Definitions

• An alphabet Σ is a finite set of symbols (characters)
• A string s is a finite sequence of symbols from Σ
  – |s| denotes the length of string s
  – ε denotes the empty string, thus |ε| = 0
• A language is a specific set of strings over some fixed alphabet Σ
7
Specification of Patterns for Tokens: String Operations

• The concatenation of two strings x and y is denoted by xy
• The exponentiation of a string s is defined by

  s^0 = ε
  s^i = s^(i-1) s for i > 0

  note that sε = εs = s
8
Specification of Patterns for Tokens: Language Operations

• Union
  L ∪ M = { s | s ∈ L or s ∈ M }
• Concatenation
  LM = { xy | x ∈ L and y ∈ M }
• Exponentiation
  L^0 = {ε}; L^i = L^(i-1) L
• Kleene closure
  L* = L^0 ∪ L^1 ∪ L^2 ∪ …
• Positive closure
  L+ = L^1 ∪ L^2 ∪ …
9
Specification of Patterns for Tokens: Regular Expressions

• Basis symbols:
  – ε is a regular expression denoting the language {ε}
  – a ∈ Σ is a regular expression denoting {a}
• If r and s are regular expressions denoting languages L(r) and M(s) respectively, then
  – r|s is a regular expression denoting L(r) ∪ M(s)
  – rs is a regular expression denoting L(r)M(s)
  – r* is a regular expression denoting L(r)*
  – (r) is a regular expression denoting L(r)
• A language defined by a regular expression is called a regular set
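For example (added illustration): a|b denotes {a, b}, (a|b)(a|b) denotes {aa, ab, ba, bb}, a* denotes {ε, a, aa, aaa, …}, and (a|b)*abb denotes the set of all strings of a's and b's ending in abb, which is the running example used in the rest of the chapter.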
10
Specification of Patterns for Tokens: Regular Definitions

• Regular definitions introduce a naming convention with name-to-regular-expression bindings:
  d1 → r1
  d2 → r2
  …
  dn → rn
  where each ri is a regular expression over Σ ∪ {d1, d2, …, di-1}
• Any dj in ri can be textually substituted in ri to obtain an equivalent set of definitions
11
Specification of Patterns for Tokens: Regular Definitions

• Example:
  letter → A | B | … | Z | a | b | … | z
  digit  → 0 | 1 | … | 9
  id     → letter ( letter | digit )*

• Regular definitions cannot be recursive:
  digits → digit digits | digit    wrong!


12
Specification of Patterns for Tokens: Notational Shorthand

• The following shorthands are often used:

  r+    = rr*
  r?    = r | ε
  [a-z] = a | b | c | … | z

• Examples:
  digit → [0-9]
  num   → digit+ (. digit+)? ( E (+|-)? digit+ )?
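For instance (added illustration): num as defined above matches 31, 3.14, and 6.02E23, but it does not match .5 (at least one digit must precede the optional fraction) or 3. (the optional fraction must contain at least one digit after the dot).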
13
Regular Definitions and Grammars

Grammar
  stmt → if expr then stmt
       | if expr then stmt else stmt
       | ε
  expr → term relop term
       | term
  term → id
       | num

Regular definitions
  if    → if
  then  → then
  else  → else
  relop → < | <= | <> | > | >= | =
  id    → letter ( letter | digit )*
  num   → digit+ (. digit+)? ( E (+|-)? digit+ )?
14
Coding Regular Definitions in Transition Diagrams

relop → < | <= | <> | > | >= | =

[Transition diagram for relop: from start state 0, '<' leads to state 1; from state 1, '=' accepts LE (state 2), '>' accepts NE (state 3), and any other character accepts LT (state 4, retracting one character); '=' from state 0 accepts EQ (state 5); '>' from state 0 leads to state 6, from which '=' accepts GE (state 7) and any other character accepts GT (state 8, retracting one character)]

id → letter ( letter | digit )*

[Transition diagram for id: from start state 9, a letter leads to state 10, which loops on letters and digits; any other character accepts at state 11 (retracting one character) and returns (gettoken(), install_id())]
15
Coding Regular Definitions in Transition Diagrams: Code

token nexttoken()
{ while (1) {
    switch (state) {
    case 0: c = nextchar();
      if (c==blank || c==tab || c==newline) {
        state = 0;
        lexeme_beginning++;
      }
      else if (c=='<') state = 1;
      else if (c=='=') state = 5;
      else if (c=='>') state = 6;
      else state = fail();   /* fail() decides the next start state to check */
      break;
    case 1:
      ...
    case 9: c = nextchar();
      if (isletter(c)) state = 10;
      else state = fail();
      break;
    case 10: c = nextchar();
      if (isletter(c)) state = 10;
      else if (isdigit(c)) state = 10;
      else state = 11;
      break;
    ...
    } } }

int fail()
{ forward = token_beginning;
  switch (start) {
  case 0:  start = 9;  break;
  case 9:  start = 12; break;
  case 12: start = 20; break;
  case 20: start = 25; break;
  case 25: recover();  break;
  default: /* error */
  }
  return start;
}
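The code above is only a fragment. Below is a minimal, self-contained C sketch (an illustration added here, not the course's actual code) of one transition diagram coded directly: it recognizes relop = < | <= | <> | > | >= | = using the state numbers of slide 14. The names Token and scan_relop are made up for the example.

#include <stdio.h>

typedef enum { LT, LE, EQ, NE, GT, GE, NONE } Token;

/* Scan one relational operator from the string at *p, advancing *p past it.
   State numbers follow the transition diagram on slide 14. */
Token scan_relop(const char **p)
{
    int state = 0;
    const char *s = *p;
    for (;;) {
        char c = *s;
        switch (state) {
        case 0:
            if (c == '<')      { state = 1; s++; }
            else if (c == '=') { s++; *p = s; return EQ; }   /* state 5 */
            else if (c == '>') { state = 6; s++; }
            else               return NONE;
            break;
        case 1:                                              /* saw '<' */
            if (c == '=')      { s++; *p = s; return LE; }   /* state 2 */
            else if (c == '>') { s++; *p = s; return NE; }   /* state 3 */
            else               { *p = s; return LT; }        /* state 4, retract */
        case 6:                                              /* saw '>' */
            if (c == '=')      { s++; *p = s; return GE; }   /* state 7 */
            else               { *p = s; return GT; }        /* state 8, retract */
        }
    }
}

int main(void)
{
    const char *input = "<>", *p = input;
    printf("token = %d\n", scan_relop(&p));   /* prints 3, the value of NE */
    return 0;
}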

16
The Lex and Flex Scanner Generators

• Lex and its newer cousin flex are scanner generators
• Scanner generators systematically translate regular definitions into C source code for efficient scanning
• Generated code is easy to integrate in C applications
17
Creating a Lexical Analyzer with Lex and Flex

[Diagram: the lex source program lex.l is fed to lex (or flex), which produces lex.yy.c; lex.yy.c is compiled by the C compiler into a.out; a.out reads the input stream and produces a sequence of tokens]
18
Lex Specification

• A lex specification consists of three parts:
  regular definitions, C declarations in %{ %}
  %%
  translation rules
  %%
  user-defined auxiliary procedures
• The translation rules are of the form:
  p1 { action1 }
  p2 { action2 }
  …
  pn { actionn }
19
Regular Expressions in Lex

x        match the character x
\.       match the character .
"string" match the contents of the string of characters
.        match any character except newline
^        match the beginning of a line
$        match the end of a line
[xyz]    match one character x, y, or z (use \ to escape -)
[^xyz]   match any character except x, y, and z
[a-z]    match one of a to z
r*       closure (match zero or more occurrences)
r+       positive closure (match one or more occurrences)
r?       optional (match zero or one occurrence)
r1r2     match r1 then r2 (concatenation)
r1|r2    match r1 or r2 (union)
( r )    grouping
r1/r2    match r1 when followed by r2
{d}      match the regular expression defined by d
20
Example Lex Specification 1

%{
#include <stdio.h>
%}
%%
[0-9]+  { printf("%s\n", yytext); }   /* translation rules; yytext contains the matching lexeme */
.|\n    { }
%%
main()                                /* main() invokes the lexical analyzer yylex() */
{ yylex();
}

Build and run:
  lex spec.l
  gcc lex.yy.c -ll
  ./a.out < spec.l
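As a usage note (assuming the specification above is saved as spec.l and built with the commands shown): the generated scanner echoes every maximal run of digits in its input on a line of its own and discards everything else, so feeding the specification file to itself as input would print just 0 and 9, the two digits appearing in the pattern [0-9]+.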
21
Example Lex Specification 2

%{
#include <stdio.h>
int ch = 0, wd = 0, nl = 0;
%}
delim     [ \t]+                    /* regular definition */
%%
\n        { ch++; wd++; nl++; }     /* translation rules */
^{delim}  { ch+=yyleng; }
{delim}   { ch+=yyleng; wd++; }
.         { ch++; }
%%
main()
{ yylex();
  printf("%8d%8d%8d\n", nl, wd, ch);
}
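As a usage note (a hypothetical run, not from the slides): on a one-line input file containing hello world, the program prints 1 2 12 (in %8d fields), i.e. one newline, two words, and twelve characters, since the space and the newline each end a word and every character, including the space and the newline, is counted.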
22
Example Lex Specification 3

%{
#include <stdio.h>
%}
digit    [0-9]                      /* regular definitions */
letter   [A-Za-z]
id       {letter}({letter}|{digit})*
%%
{digit}+ { printf("number: %s\n", yytext); }   /* translation rules */
{id}     { printf("ident: %s\n", yytext); }
.        { printf("other: %s\n", yytext); }
%%
main()
{ yylex();
}
23
Example Lex Specification 4

%{ /* definitions of manifest constants */
#define LT (256)
…
%}
delim    [ \t\n]
ws       {delim}+
letter   [A-Za-z]
digit    [0-9]
id       {letter}({letter}|{digit})*
number   {digit}+(\.{digit}+)?(E[+\-]?{digit}+)?
%%
{ws}      { }
if        {return IF;}                              /* return the token to the parser */
then      {return THEN;}
else      {return ELSE;}
{id}      {yylval = install_id(); return ID;}       /* yylval holds the token attribute */
{number}  {yylval = install_num(); return NUMBER;}
"<"       {yylval = LT; return RELOP;}
"<="      {yylval = LE; return RELOP;}
"="       {yylval = EQ; return RELOP;}
"<>"      {yylval = NE; return RELOP;}
">"       {yylval = GT; return RELOP;}
">="      {yylval = GE; return RELOP;}
%%
int install_id()    /* installs yytext as an identifier in the symbol table */
…
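The slide only names install_id(); the following is a minimal sketch of what it might look like (an assumption for illustration, not the course's actual code), using a hypothetical fixed-size linear symbol table. A real compiler would use a hash table; the includes would go in the %{ %} section.

#include <string.h>
#include <stdlib.h>

extern char *yytext;              /* the matching lexeme, provided by lex */

#define MAXSYMS 1024              /* hypothetical symbol-table capacity */
static char *symtab[MAXSYMS];
static int nsyms = 0;

int install_id(void)
{
    /* return the index of yytext in the symbol table, adding it if absent */
    for (int i = 0; i < nsyms; i++)
        if (strcmp(symtab[i], yytext) == 0)
            return i;
    symtab[nsyms] = strdup(yytext);   /* keep a private copy of the lexeme */
    return nsyms++;
}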



24
Design of a Lexical Analyzer Generator

• Translate regular expressions to an NFA
• Translate the NFA to an efficient DFA (optional)

[Diagram: regular expressions are converted to an NFA; the NFA can be simulated directly to recognize tokens, or (optionally) converted to a DFA that is simulated to recognize tokens]
25
Nondeterministic Finite Automata

• An NFA is a 5-tuple (S, Σ, δ, s0, F) where
  – S is a finite set of states
  – Σ is a finite set of symbols, the alphabet
  – δ is a mapping from S × Σ to a set of states
  – s0 ∈ S is the start state
  – F ⊆ S is the set of accepting (or final) states
26
Transition Graph

• An NFA can be diagrammatically represented by a labeled directed graph called a transition graph

[Transition graph for (a|b)*abb: state 0 loops on a and b and moves to state 1 on a; state 1 moves to state 2 on b; state 2 moves to state 3 on b]

  S  = {0,1,2,3}
  Σ  = {a,b}
  s0 = 0
  F  = {3}
27
Transition Table

• The mapping δ of an NFA can be represented in a transition table

  δ(0,a) = {0,1}        State   Input a   Input b
  δ(0,b) = {0}            0     {0,1}     {0}
  δ(1,b) = {2}            1               {2}
  δ(2,b) = {3}            2               {3}
28
The Language Defined by an NFA

• An NFA accepts an input string x if and only if there is some path with edges labeled with the symbols of x in sequence from the start state to some accepting state in the transition graph
• A state transition from one state to another on the path is called a move
• The language defined by an NFA is the set of input strings it accepts, such as (a|b)*abb for the example NFA
29
Design of a Lexical Analyzer Generator: RE to NFA to DFA

[Diagram: a Lex specification with regular expressions p1 { action1 }, p2 { action2 }, …, pn { actionn } is converted to a combined NFA: a new start state s0 has ε-transitions to N(p1), N(p2), …, N(pn), each annotated with its action; the subset construction then yields a DFA]
30
From Regular Expression to NFA (Thompson's Construction)

[Diagram: each case produces an NFA fragment N(r) with a start state i and an accepting state f:
  ε:      i -ε-> f
  a:      i -a-> f
  r1|r2:  i has ε edges to N(r1) and N(r2); both have ε edges to f
  r1r2:   i into N(r1), then into N(r2), then to f
  r*:     i -ε-> N(r) -ε-> f, with an ε edge from N(r)'s accept back to its start and an ε edge from i directly to f]
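As a concrete illustration (a sketch under assumed names, not the course's code), the fragments above can be built in C as follows. Each Frag has one start and one accepting state, and EPS labels ε edges. Unlike the diagram, this sketch glues concatenated fragments with an ε edge instead of merging states, which yields an equivalent NFA.

#include <stdlib.h>

#define EPS 0                     /* label used for epsilon edges */

typedef struct State {
    int label1, label2;           /* edge labels (EPS or a character) */
    struct State *out1, *out2;    /* at most two outgoing edges */
} State;

typedef struct { State *start, *accept; } Frag;

static State *new_state(void) { return calloc(1, sizeof(State)); }

static void add_edge(State *s, int label, State *to)
{
    if (!s->out1) { s->label1 = label; s->out1 = to; }
    else          { s->label2 = label; s->out2 = to; }
}

/* N(a): start -a-> accept */
Frag frag_symbol(int a)
{
    Frag f = { new_state(), new_state() };
    add_edge(f.start, a, f.accept);
    return f;
}

/* N(r1 r2): r1's accept is glued to r2's start by an epsilon edge */
Frag frag_concat(Frag r1, Frag r2)
{
    add_edge(r1.accept, EPS, r2.start);
    return (Frag){ r1.start, r2.accept };
}

/* N(r1 | r2): a new start forks to both fragments; both accepts join a new accept */
Frag frag_union(Frag r1, Frag r2)
{
    Frag f = { new_state(), new_state() };
    add_edge(f.start, EPS, r1.start);
    add_edge(f.start, EPS, r2.start);
    add_edge(r1.accept, EPS, f.accept);
    add_edge(r2.accept, EPS, f.accept);
    return f;
}

/* N(r*): loop back from r's accept to r's start, plus a bypass edge */
Frag frag_star(Frag r)
{
    Frag f = { new_state(), new_state() };
    add_edge(f.start, EPS, r.start);
    add_edge(f.start, EPS, f.accept);
    add_edge(r.accept, EPS, r.start);
    add_edge(r.accept, EPS, f.accept);
    return f;
}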


31
Combining the NFAs of a Set of Regular Expressions

a    { action1 }
abb  { action2 }
a*b+ { action3 }

[Diagram: separate NFAs for the three patterns: 1 -a-> 2 for a; 3 -a-> 4 -b-> 5 -b-> 6 for abb; state 7 looping on a with b edges 7 -> 8 and 8 -> 8 for a*b+. They are combined by adding a new start state 0 with ε-transitions to states 1, 3, and 7]
32
Simulating the Combined NFA: Example 1

[Diagram: the combined NFA with start state 0 and ε-transitions to 1, 3, and 7; accepting states 2 (action1), 6 (action2), and 8 (action3)]

On input a a b a the simulation moves through the state sets

  {0,1,3,7} -a-> {2,4,7} -a-> {7} -b-> {8} -a-> none

Must find the longest match: continue until no further moves are possible.
When the last state set reached contains an accepting state, execute its action (here state 8, so action3).
33
Simulating the Combined NFA: Example 2

[Diagram: the same combined NFA with accepting states 2 (action1), 6 (action2), and 8 (action3)]

On input a b b a the simulation moves through the state sets

  {0,1,3,7} -a-> {2,4,7} -b-> {5,8} -b-> {6,8} -a-> none

The last non-empty set {6,8} contains two accepting states (for action2 and action3).
When two or more accepting states are reached, the first action given in the Lex specification is executed (here action2).
34
Deterministic Finite Automata

• A deterministic finite automaton is a special case of an NFA
  – No state has an ε-transition
  – For each state s and input symbol a there is at most one edge labeled a leaving s
• Each entry in the transition table is a single state
  – At most one path exists to accept a string
  – Simulation algorithm is simple
35
Example DFA

A DFA that accepts (a|b)*abb

[Diagram: states 0 (start), 1, 2, 3 (accepting); transitions 0 -a-> 1, 0 -b-> 0, 1 -a-> 1, 1 -b-> 2, 2 -a-> 1, 2 -b-> 3, 3 -a-> 1, 3 -b-> 0]
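A table-driven simulation of this DFA is straightforward. The following minimal C sketch (added for illustration) encodes the transition function of the diagram and accepts exactly the strings of a's and b's ending in abb; the input is assumed to contain only a's and b's.

#include <stdio.h>

int main(void)
{
    /* next[state][input]: input 0 = 'a', 1 = 'b' */
    static const int next[4][2] = {
        {1, 0},   /* state 0 */
        {1, 2},   /* state 1 */
        {1, 3},   /* state 2 */
        {1, 0},   /* state 3 (accepting) */
    };
    const char *input = "ababb";
    int state = 0;
    for (const char *p = input; *p; p++)
        state = next[state][*p == 'b'];
    printf("%s %s\n", input, state == 3 ? "accepted" : "rejected");
    return 0;
}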
36
Conversion of an NFA into a DFA

• The subset construction algorithm converts an NFA into a DFA using:
  ε-closure(s) = {s} ∪ {t | s -ε-> … -ε-> t}
  ε-closure(T) = the union of ε-closure(s) over all s ∈ T
  move(T,a)    = {t | s -a-> t and s ∈ T}
• The algorithm produces:
  Dstates, the set of states of the new DFA, each consisting of a set of states of the NFA
  Dtran, the transition table of the new DFA
37
ε-closure and move Examples

[Diagram: the combined NFA with start state 0 and ε-transitions to states 1, 3, and 7]

  ε-closure({0})     = {0,1,3,7}
  move({0,1,3,7},a)  = {2,4,7}
  ε-closure({2,4,7}) = {2,4,7}
  move({2,4,7},a)    = {7}
  ε-closure({7})     = {7}
  move({7},b)        = {8}
  ε-closure({8})     = {8}
  move({8},a)        = ∅

On input a a b a this gives the state sets {0,1,3,7}, {2,4,7}, {7}, {8}, none, as before: ε-closure and move are also used to simulate the NFA.
38
Simulating an NFA using ε-closure and move

S := ε-closure({s0})
Sprev := ∅
a := nextchar()
while S ≠ ∅ do
  Sprev := S
  S := ε-closure(move(S,a))
  a := nextchar()
end do
if Sprev ∩ F ≠ ∅ then
  execute the action in Sprev
  return "yes"
else return "no"
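The following self-contained C sketch (an illustration added here, with assumed names and encodings) runs this algorithm on the combined NFA of slides 31-33, representing each state set as a bitset over the nine states; eps_closure and move follow the definitions on slide 36.

#include <stdio.h>

enum { NSTATES = 9 };

/* eps[s] = bitset of states reachable from s by one epsilon edge */
static const unsigned eps[NSTATES] = {
    [0] = (1u<<1) | (1u<<3) | (1u<<7),    /* 0 -eps-> 1, 3, 7 */
};

/* bitset of states reachable from state s on character c */
static unsigned delta(int s, char c)
{
    switch (s) {
    case 1: return c == 'a' ? 1u<<2 : 0;
    case 3: return c == 'a' ? 1u<<4 : 0;
    case 4: return c == 'b' ? 1u<<5 : 0;
    case 5: return c == 'b' ? 1u<<6 : 0;
    case 7: return c == 'a' ? 1u<<7 : c == 'b' ? 1u<<8 : 0;
    case 8: return c == 'b' ? 1u<<8 : 0;
    default: return 0;
    }
}

static unsigned eps_closure(unsigned T)
{
    unsigned result = T, changed = 1;
    while (changed) {
        changed = 0;
        for (int s = 0; s < NSTATES; s++)
            if ((result >> s & 1) && (eps[s] & ~result)) {
                result |= eps[s];
                changed = 1;
            }
    }
    return result;
}

static unsigned move(unsigned T, char c)
{
    unsigned result = 0;
    for (int s = 0; s < NSTATES; s++)
        if (T >> s & 1) result |= delta(s, c);
    return result;
}

int main(void)
{
    const char *input = "abb";
    unsigned F = (1u<<2) | (1u<<6) | (1u<<8);   /* accepting states 2, 6, 8 */
    unsigned S = eps_closure(1u << 0);          /* start set {0,1,3,7} */
    unsigned Sprev = 0;
    for (const char *p = input; *p && S; p++) {
        Sprev = S;
        S = eps_closure(move(S, *p));
    }
    if (S) Sprev = S;                           /* input ended before S became empty */
    printf("accepts? %s\n", (Sprev & F) ? "yes" : "no");
    return 0;
}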
39
The Subset Construction Algorithm

initially, ε-closure(s0) is the only state in Dstates and it is unmarked
while there is an unmarked state T in Dstates do
  mark T
  for each input symbol a ∈ Σ do
    U := ε-closure(move(T,a))
    if U is not in Dstates then
      add U as an unmarked state to Dstates
    end if
    Dtran[T,a] := U
  end do
end do
40
Subset Construction Example 1

[Diagram: the Thompson NFA for (a|b)*abb with states 0-10: 0 -ε-> 1 and 7; 1 -ε-> 2 and 4; 2 -a-> 3; 4 -b-> 5; 3 and 5 -ε-> 6; 6 -ε-> 1 and 7; 7 -a-> 8; 8 -b-> 9; 9 -b-> 10]

Dstates
  A = {0,1,2,4,7}
  B = {1,2,3,4,6,7,8}
  C = {1,2,4,5,6,7}
  D = {1,2,4,5,6,7,9}
  E = {1,2,4,5,6,7,10}

[Resulting DFA: start state A; A -a-> B, A -b-> C; B -a-> B, B -b-> D; C -a-> B, C -b-> C; D -a-> B, D -b-> E; E -a-> B, E -b-> C; E is accepting]
41
Subset Construction Example 2

[Diagram: the combined NFA for a { action1 }, abb { action2 }, a*b+ { action3 } with states 0-8]

Dstates
  A = {0,1,3,7}
  B = {2,4,7}   accepting, action1
  C = {8}       accepting, action3
  D = {7}
  E = {5,8}     accepting, action3
  F = {6,8}     accepting, action2 (also action3; the action listed first in the specification wins)

[Resulting DFA: start state A; A -a-> B, A -b-> C; B -a-> D, B -b-> E; C -b-> C; D -a-> D, D -b-> C; E -b-> F; F -b-> C]
42
Minimizing the Number of States of a DFA

[Diagram: the DFA for (a|b)*abb obtained by subset construction (states A, B, C, D, E) is minimized by merging the equivalent states A and C into a single state AC, giving a four-state DFA: AC -a-> B, AC -b-> AC; B -a-> B, B -b-> D; D -a-> B, D -b-> E; E -a-> B, E -b-> AC; E is accepting]
43
From Regular Expression to DFA Directly

• The "important states" of an NFA are those without an ε-transition, that is, if move({s},a) ≠ ∅ for some a then s is an important state
• The subset construction algorithm uses only the important states when it determines ε-closure(move(T,a))
44
From Regular Expression to DFA Directly (Algorithm)

• Augment the regular expression r with a special end symbol # to make accepting states important: the new expression is r#
• Construct a syntax tree for r#
• Traverse the tree to construct functions nullable, firstpos, lastpos, and followpos
45
From Regular Expression to DFA Directly: Syntax Tree of (a|b)*abb#

[Diagram: the syntax tree of (a|b)*abb# is a spine of concatenation nodes; its leftmost subtree is (a|b)* (a closure node * over an alternation node | with leaves a at position 1 and b at position 2), followed by the leaves a at position 3, b at position 4, b at position 5, and # at position 6; position numbers are attached to the leaves]
46
From Regular Expression to DFA Directly: Annotating the Tree

• nullable(n): true if the subtree at node n generates a language including the empty string
• firstpos(n): the set of positions that can match the first symbol of a string generated by the subtree at node n
• lastpos(n): the set of positions that can match the last symbol of a string generated by the subtree at node n
• followpos(i): the set of positions that can follow position i in the tree
47
From Regular Expression to DFA Directly: Annotating the Tree

Leaf ε:
  nullable = true; firstpos = ∅; lastpos = ∅
Leaf with position i:
  nullable = false; firstpos = {i}; lastpos = {i}
Alternation node | with children c1, c2:
  nullable = nullable(c1) or nullable(c2)
  firstpos = firstpos(c1) ∪ firstpos(c2)
  lastpos  = lastpos(c1) ∪ lastpos(c2)
Concatenation node • with children c1, c2:
  nullable = nullable(c1) and nullable(c2)
  firstpos = if nullable(c1) then firstpos(c1) ∪ firstpos(c2) else firstpos(c1)
  lastpos  = if nullable(c2) then lastpos(c1) ∪ lastpos(c2) else lastpos(c2)
Star node * with child c1:
  nullable = true; firstpos = firstpos(c1); lastpos = lastpos(c1)
48
From Regular Expression to DFA Directly: Syntax Tree of (a|b)*abb#

[Diagram: the syntax tree of (a|b)*abb# annotated with firstpos to the left and lastpos to the right of each node: the leaves a:1, b:2, a:3, b:4, b:5, #:6 have firstpos = lastpos = their own position; the | node and the * node both have firstpos = lastpos = {1,2}; every concatenation node on the spine has firstpos {1,2,3}, and their lastpos values from the lowest node up to the root are {3}, {4}, {5}, and {6}; the * node is the only nullable node]
49
From Regular Expression to DFA Directly: followpos

for each node n in the tree do
  if n is a cat-node with left child c1 and right child c2 then
    for each i in lastpos(c1) do
      followpos(i) := followpos(i) ∪ firstpos(c2)
    end do
  else if n is a star-node then
    for each i in lastpos(n) do
      followpos(i) := followpos(i) ∪ firstpos(n)
    end do
  end if
end do
50
From Regular Expression to DFA Directly: Algorithm

s0 := firstpos(root), where root is the root of the syntax tree of r#
Dstates := {s0}, unmarked
while there is an unmarked state T in Dstates do
  mark T
  for each input symbol a ∈ Σ do
    let U be the set of positions that are in followpos(p)
        for some position p in T,
        such that the symbol at position p is a
    if U is not empty and not in Dstates then
      add U as an unmarked state to Dstates
    end if
    Dtran[T,a] := U
  end do
end do
51
From Regular Expression to DFA Directly: Example

Node   followpos
  1    {1, 2, 3}
  2    {1, 2, 3}
  3    {4}
  4    {5}
  5    {6}
  6    -

[Resulting DFA: start state {1,2,3}; {1,2,3} -a-> {1,2,3,4}, -b-> {1,2,3}; {1,2,3,4} -a-> {1,2,3,4}, -b-> {1,2,3,5}; {1,2,3,5} -a-> {1,2,3,4}, -b-> {1,2,3,6}; {1,2,3,6} -a-> {1,2,3,4}, -b-> {1,2,3}; {1,2,3,6} is accepting because it contains position 6, the position of #]
52
Time-Space Tradeoffs

Automaton   Space (worst case)   Time (worst case)
NFA         O(|r|)               O(|r| × |x|)
DFA         O(2^|r|)             O(|x|)
