
Lexing

Rupesh Nasre.

CS3300 Compiler Design


IIT Madras
July 2024
[Figure: phases of a compiler. Character stream → Lexical Analyzer → token stream → Syntax Analyzer → syntax tree → Semantic Analyzer → syntax tree → Intermediate Code Generator → intermediate representation → Machine-Independent Code Optimizer → intermediate representation → Code Generator → target machine code → Machine-Dependent Code Optimizer → target machine code. The analysis phases form the frontend and the optimizer/generator phases the backend; all phases share the Symbol Table.]

Lexing Summary

● Basic lex
● Input Buffering
● KMP String Matching
● Regex → NFA → DFA
● Regex → DFA

[Figure: the compiler-phase pipeline again, with the Lexical Analyzer stage highlighted.]

Role

● Read input characters
● Group into words (lexemes)
● Return sequence of tokens
● Sometimes
  – Eat up whitespace
  – Remove comments
  – Maintain line number information

Token, Pattern, Lexeme

    Token        Pattern                               Sample lexeme
    if           Characters i, f                       if
    comparison   <= or >= or < or > or == or !=        <=, !=
    identifier   letter (letter + digit)*              pi, score, D2
    number       Any numeric constant                  3.14159, 0, 6.02e23
    literal      Anything but ", surrounded by ""      "core dumped"

The following classes cover most or all of the tokens:

● One token for each keyword
● Tokens for the operators, individually or in classes
● One token for identifiers
● One or more tokens for constants
● One token each for punctuation symbols

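These token classes map naturally onto a small data type inside the lexer. A minimal C sketch; the names (token_kind, struct token) are illustrative, not from the slides:

    #include <stdio.h>

    /* One kind per class listed above: keywords, operator classes,
     * identifiers, constants, punctuation. */
    enum token_kind { TK_IF, TK_COMPARISON, TK_IDENTIFIER, TK_NUMBER, TK_LITERAL };

    /* A token pairs its kind with the matched lexeme (and, in a real lexer,
     * an attribute such as a symbol-table index). */
    struct token {
        enum token_kind kind;
        const char *lexeme;
    };

    int main(void) {
        struct token t = { TK_IDENTIFIER, "pi" };   /* "pi" matches letter (letter + digit)* */
        printf("kind=%d lexeme=%s\n", t.kind, t.lexeme);
        return 0;
    }
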
Representing Patterns

● Keywords can be directly represented (break, int).
● And so can punctuation symbols ({, +).
● Others are finite, but too many!
  – Numbers
  – Identifiers
  – They are better represented using a regular expression.
  – [a-z][a-z0-9]*, [0-9]+

Classwork: Regex Recap

● If L is a set of letters (A-Z, a-z) and D is a set of digits (0-9),
  – Find the size of the language LD.
  – Find the size of the language L ∪ D.
  – Find the size of the language L⁴.
● Write a regex for real numbers
  – Without eE, without +- in the mantissa (1.89)
  – Without eE, with +- in the mantissa (-1.89)
  – With eE, with +- in the exponent (-1.89E-4)

Classwork

● Write a regex for strings over the alphabet {a, b} that start and end with a.
● Strings with the third-last letter a.
● Strings with exactly three bs.
● Strings with even length.

Homework
– Exercise 3.3.6 from ALSU.

Example Lex

    /* variables */
    [a-z]        {
                     yylval = *yytext - 'a';
                     return VARIABLE;
                 }

    /* integers */
    [0-9]+       {
                     yylval = atoi(yytext);
                     return INTEGER;
                 }

    /* operators */
    [-+()=/*\n]  { return *yytext; }

    /* skip whitespace */
    [ \t]        ;

    /* anything else is an error */
    .            yyerror("invalid character");

(The left-hand column gives the patterns; the returned values such as VARIABLE and INTEGER are the tokens; yytext holds the matched lexeme.)

[Figure: build flow. a1.l --lex--> lex.yy.c; a1.y --yacc--> y.tab.c, y.tab.h; both are compiled together with gcc into a.out.]

● Lexer and parser are not separate binaries; they are part of the same executable.
● This is your compiler: a.out.

Lex Regex

    Expression   Matches                                   Example
    c            Character c                               a
    \c           Character c literally                     \*
    "s"          String s literally                        "**"
    .            Any character but newline                 a.*b
    ^            Beginning of a line                       ^abc
    $            End of a line                             abc$
    [s]          Any of the characters in string s         [abc]
    [^s]         Any one character not in string s         [^abc]
    r*           Zero or more strings matching r           a*
    r+           One or more strings matching r            a+
    r?           Zero or one r                             a?
    r{m,n}       Between m and n occurrences of r          a{1,5}
    r1r2         An r1 followed by an r2                   ab
    r1 | r2      An r1 or an r2                            a|b
    (r)          Same as r                                 (a|b)
    r1/r2        r1 when followed by r2                    abc/123

Homework

● Write a lexer to identify special words in a text.
  – Words like stewardesses: only one hand
  – Words like typewriter: only one keyboard row
  – Words like skepticisms: alternate hands
● Implement grep using lex, with the search pattern as alphabetical text (no operators *, ?, ., etc.).

Lexing and Context

● Language design should ensure that lexing can be done without context.
● Your assignments and most languages need context-insensitive lexing.

    DO 5 I = 1.25          DO 5 I = 1,25

● "DO 5 I" is an identifier in Fortran, as spaces are allowed in identifiers.
● Thus, the first is an assignment, while the second is a loop.
● The lexer doesn't know whether to consider the input "DO 5 I" as an identifier or as a part of the loop, until the parser informs it based on the dot or comma.
● Alternatively, the lexer may employ a lookahead.

Lookahead

● Duniya usi ki hai jo aage dekhe.
  (The world belongs to those who look ahead.)

Lookahead

● The lexer needs to look into the future to know where it is presently.

    DO / .* COMMA    { return DO; }

    DO 5 I = 1,25

● / signifies the lookahead symbol. The input is read and matched, but is left unconsumed in the current rule.
● Corollary: the DO loop index and increment must be on the same line
  – no arbitrary whitespace allowed.

Lexical Errors

● It is often difficult for a lexer to report errors.
  – fi (a == f(x)) ...
  – A lexer doesn't know the context of fi. Hence it cannot "see" the structure of the sentence – structure is known only to the parser.
  – fi = 2;   OR   fi(a == f(x));
● But some errors a lexer can catch.
  – 23 = @a;
  – if $x friendof anil ...
● What should a lexer do on catching an error?


Error Handling

● Multiple options
  – exit(1);
  – Panic-mode recovery: delete enough input to recognize a token
  – Delete one character from the input
  – Insert a missing character into the remaining input
  – Replace a character by another character
  – Transpose two adjacent characters
● In practice, most lexical errors involve a single character.
● Theoretical problem: find the smallest number of transformations (add, replace, delete) needed to convert the source program into one that consists only of valid lexemes.
  – Too expensive in practice to be worth the effort.

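The "smallest number of transformations" above is the classic edit-distance computation. A minimal dynamic-programming sketch in C, only to make the cost of the idea concrete; the function name and fixed-size table are illustrative, not part of the slides:

    #include <stdio.h>
    #include <string.h>

    #define MAXLEN 256

    static int min3(int a, int b, int c) { int m = a < b ? a : b; return m < c ? m : c; }

    /* Minimum number of additions, deletions, and replacements needed
     * to turn string a into string b. */
    int edit_distance(const char *a, const char *b) {
        int n = strlen(a), m = strlen(b);
        static int d[MAXLEN + 1][MAXLEN + 1];
        for (int i = 0; i <= n; ++i) d[i][0] = i;          /* delete all of a */
        for (int j = 0; j <= m; ++j) d[0][j] = j;          /* insert all of b */
        for (int i = 1; i <= n; ++i)
            for (int j = 1; j <= m; ++j) {
                int subst = (a[i-1] == b[j-1]) ? 0 : 1;    /* replace if characters differ */
                d[i][j] = min3(d[i-1][j] + 1, d[i][j-1] + 1, d[i-1][j-1] + subst);
            }
        return d[n][m];
    }

    int main(void) {
        printf("%d\n", edit_distance("fi", "if"));         /* 2: two replacements */
        return 0;
    }
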
Homework

● Try exercise 3.1.2 from ALSU.

Input Buffering

● "We cannot know we were executing a finite loop until we come out of the loop."
● In C, without reading the next character we cannot determine a binary minus symbol (a-b).
  – ->, -=, --, -e, ...
● Sometimes we may have to look several characters into the future, called lookahead.
● In the Fortran example (DO 5 I), the lookahead could be up to the dot or comma.
● Reading character-by-character from disk is inefficient. Hence buffering is required.

Input Buffering

● A block of characters is read from disk into a buffer.
● The lexer maintains two pointers:
  – lexemeBegin
  – forward

    [Figure: a buffer holding  E = M * C * * 2  followed by an end-of-input marker, with lexemeBegin at the start of the current lexeme and forward scanning ahead.]

● What is the problem with such a scheme?

Input Buffering

● The issue arises when the lookahead is beyond the buffer.
● When you load the buffer, the previous content is overwritten!

    [Figure: the same buffer, split into "input read" and "input to be read"; forward has reached the end of the buffer while lexemeBegin still points into it.]

● How do we solve this problem?

Double Buffering

● Uses two (half) buffers.
● Assumes that the lookahead would not be more than one buffer size.

    [Figure: two adjacent half-buffers, Buf1 and Buf2, holding  E = M * C * * 2 ; lexemeBegin points into Buf1 while forward has crossed into Buf2.]

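A C sketch of the two-half-buffer scheme with an end-of-buffer sentinel after each half (in the spirit of the sentinel technique in ALSU); the buffer size, file handling, and function names are illustrative:

    #include <stdio.h>

    #define N 4096                         /* size of each half buffer */

    static char buf[2 * N + 2];            /* two halves, each followed by a sentinel byte */
    static char *lexemeBegin, *forward;
    static FILE *src;

    /* Fill one half from disk and terminate it with a sentinel.
     * Assumes the sentinel byte does not occur in the source text. */
    static void load(char *half) {
        size_t got = fread(half, 1, N, src);
        half[got] = (char)EOF;
    }

    void init_buffers(FILE *f) {
        src = f;
        load(buf);
        lexemeBegin = forward = buf;
    }

    /* Advance forward by one character, reloading the other half when a sentinel
     * at a half boundary is reached.  A lexeme plus its lookahead is assumed to
     * fit within one half, as stated on the slide. */
    int next_char(void) {
        char c = *forward++;
        if (c == (char)EOF) {
            if (forward == buf + N + 1) {              /* end of first half */
                load(buf + N + 1);
                forward = buf + N + 1;
                c = *forward++;
            } else if (forward == buf + 2 * N + 2) {   /* end of second half */
                load(buf);
                forward = buf;
                c = *forward++;
            }
            /* otherwise: genuine end of input */
        }
        return c;
    }
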
Transition Diagrams

● The step to be taken on each character can be specified as a state transition diagram.
  – Sometimes, an action may be associated with a state.

    [Figure: transition diagram for comparison/assignment operators (states 0–9).
     0 --<--> 1:   1 --=--> 2  return(comp, LE);    1 --other--> 3  yyless(1); return(comp, LT);
     0 --=--> 4:   4 --=--> 5  return(comp, EQ);    4 --other--> 6  yyless(1); return(assign, ASSIGN);
     0 -->--> 7:   7 --=--> 8  return(comp, GE);    7 --other--> 9  yyless(1); return(comp, GT);  ...]

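A hand-coded C sketch of this transition diagram; the token names and the advance/retract helpers (retract standing in for yyless(1)) are illustrative:

    #include <stdio.h>

    enum token { LE, LT, EQ, ASSIGN, GE, GT, OTHER };

    static const char *in;                         /* current input position */
    static int  advance(void)  { return *in++; }
    static void retract(void)  { --in; }           /* plays the role of yyless(1) */

    /* States 1–9: after '<', '=' or '>' has been read, check for a following '='
     * and retract the extra character on the "other" edge. */
    static enum token after(int expect, enum token yes, enum token no) {
        if (advance() == expect) return yes;
        retract();                                 /* unread the extra character */
        return no;
    }

    enum token relop(void) {
        switch (advance()) {                       /* state 0 */
        case '<': return after('=', LE, LT);       /* states 1, 2, 3 */
        case '=': return after('=', EQ, ASSIGN);   /* states 4, 5, 6 */
        case '>': return after('=', GE, GT);       /* states 7, 8, 9 */
        }
        return OTHER;
    }

    int main(void) {
        in = "<=";
        printf("%d\n", relop());                   /* prints 0 (LE) */
        return 0;
    }
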
Keywords vs. Identifiers

● Keywords may match the identifier pattern
  – Keywords: int, const, break, ...
  – Identifiers: (alpha | _) (alpha | num | _)*
● If unaddressed, this may lead to strange errors.
  – Install keywords a priori in the symbol table.
  – Prioritize keywords.
● In lex, the rule for a keyword must precede that of the identifier.

    [Two lex fragments were shown: one with the identifier rule before the keyword rules (incorrect; lex may give a warning), and one with the keyword rules first (correct).]

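One way to realize "install keywords a priori" is to look each matched identifier up in a keyword table; a small C sketch, with a plain array standing in for the symbol-table lookup and illustrative token codes:

    #include <stdio.h>
    #include <string.h>

    enum { IDENTIFIER = 300, KW_INT, KW_CONST, KW_BREAK };

    static const struct { const char *name; int tok; } keywords[] = {
        { "int", KW_INT }, { "const", KW_CONST }, { "break", KW_BREAK },
    };

    /* Called on every lexeme that matched the identifier pattern:
     * return the keyword token if it is a keyword, IDENTIFIER otherwise. */
    int classify(const char *lexeme) {
        for (size_t i = 0; i < sizeof keywords / sizeof keywords[0]; ++i)
            if (strcmp(lexeme, keywords[i].name) == 0)
                return keywords[i].tok;
        return IDENTIFIER;
    }

    int main(void) {
        printf("%d %d\n", classify("break"), classify("breaker"));  /* keyword, identifier */
        return 0;
    }
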


Special vs. General

● In general, a specialized pattern must precede the general pattern (associativity).
● Lex also follows the maximum substring matching rule (precedence).
  – Reordering the rules for < and <= would not affect the functionality.
● Compare with rule specialization in Prolog.
● Classwork: Count the number of he and she in a text.
● Classwork: Write lex rules to recognize quoted strings in C.
  – Try to recognize \" inside it.

he and she

    Without REJECT:              With REJECT:
    she   ++s;                   she   {++s; REJECT;}
    he    ++h;                   he    {++h;}
                                 (REJECT retries another rule)

● What if I want to count all possible substrings he?
● In general, the action associated with a rule may not be easy / modular to duplicate.

    Input: he ahe he she she fsfds fsf fs sfhe he she she she
    Without REJECT: he=5, she=5        With REJECT: he=10, she=5

By the way...

● Sometimes, you need not have a parser at all...
  – You could define main in your lex file.
  – Simply call yylex() from main.
  – Compile using lex, then compile lex.yy.c using gcc and execute a.out.

String Matching

● The lexical analyzer relies heavily on string matching.
● Given a program text T (length n) and a pattern string s (length m), we want to check if s occurs in T.
● A naive algorithm would try all positions of T to check for s (complexity O(m*n)).

    [Figure: text T of length n with the pattern s of length m slid along it.]

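The naive O(m*n) scheme, written out as a short C function (names are illustrative):

    #include <stdio.h>
    #include <string.h>

    /* Try every starting position i of T and compare s character by character.
     * Returns the first match position, or -1 if s does not occur in T. */
    int naive_match(const char *T, const char *s) {
        int n = strlen(T), m = strlen(s);
        for (int i = 0; i + m <= n; ++i) {
            int j = 0;
            while (j < m && T[i + j] == s[j]) ++j;
            if (j == m) return i;
        }
        return -1;
    }

    int main(void) {
        printf("%d\n", naive_match("abababaababbbabbababb", "ababaa"));  /* prints 2 */
        return 0;
    }
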
Where can we do better?

● T = abababaababbbabbababb
● s = ababaa

    i=0:
    abababaababbbabbababb
    ababaa                  (mismatch at the sixth character)

    i=1:
    abababaababbbabbababb
     ababaa                 (immediate mismatch)

    i=2:
    abababaababbbabbababb
      ababaa                (match found)

Where can we do better?

● T = abababaababbbabbababb
● s = ababaa

    i=0:
    abababaababbbabbababb
    ababaa
        (T's current suffix  /  s's proper prefix)

● Key observation: T's current suffix which is a proper prefix of s has the treasure for us.
● Whenever there is a mismatch, we should utilize this overlap, rather than restarting.

Knuth-Morris-Pratt Algorithm

● In 1970, Morris conceived the idea.
● After a few weeks, Knuth independently discovered the idea.
● In 1970, Morris and Pratt published a techreport.
● KMP published the algorithm jointly in 1977.
● In 1969, Matiyasevich discovered a similar algorithm.

Source: Wikipedia

KMP String Matching

● First linear-time algorithm for string matching.
● Whenever there is a mismatch, do not restart; rather, fail intelligently.
● We define a failure function for each position, taking into account the suffix and the prefix.
● Note that the matched part of the large string T is essentially the pattern string s. Thus, the failure function can be computed simply using the pattern s.

    abababaababbbabbababb
    ababaa

Failure is not final.

● Failure function for ababaa

    i        1    2     3     4      5       6
    f(i)     0    0     1     2      3       1
    seen     a    ab    aba   abab   ababa   ababaa
    prefix   ϵ    ϵ     a     ab     aba     a

● Algorithm given as Figure 3.19 in ALSU.

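A C sketch of computing the failure function, in the spirit of Figure 3.19 in ALSU; the 1-indexed convention matches the slide, and the names are illustrative:

    #include <stdio.h>

    /* f[i] = length of the longest proper prefix of b[1..i] that is also a
     * suffix of b[1..i].  The pattern b is 1-indexed: b[1..n]. */
    void failure(const char *b, int n, int *f) {
        int t = 0;                         /* current candidate prefix length */
        f[1] = 0;
        for (int i = 2; i <= n; ++i) {
            while (t > 0 && b[i] != b[t + 1]) t = f[t];
            if (b[i] == b[t + 1]) ++t;
            f[i] = t;
        }
    }

    int main(void) {
        const char *pat = " ababaa";       /* leading blank makes the pattern 1-indexed */
        int f[8];
        failure(pat, 6, f);
        for (int i = 1; i <= 6; ++i) printf("f(%d)=%d ", i, f[i]);   /* 0 0 1 2 3 1 */
        printf("\n");
        return 0;
    }
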
String matching with failure function

    Text = a_1 a_2 ... a_m;  pattern = b_1 b_2 ... b_n  (both indexed from 1)

    s = 0
    for (i = 1; i <= m; ++i) {                      // go over the text
        if (s > 0 && a_i != b_(s+1))  s = f(s)      // handle failure
        if (a_i == b_(s+1))  ++s                    // character match
        if (s == n)  return "yes"                   // full match
    }
    return "no"

    (A single if is not always enough to handle failure; the next slide replaces it with a while.)

    i        1    2     3     4      5       6
    f(i)     0    0     1     2      3       1
    seen     a    ab    aba   abab   ababa   ababaa
    prefix   ϵ    ϵ     a     ab     aba     a

String matching with failure function

    Text = a_1 a_2 ... a_m;  pattern = b_1 b_2 ... b_n  (both indexed from 1)

    s = 0
    for (i = 1; i <= m; ++i) {                      // go over the text
        while (s > 0 && a_i != b_(s+1))  s = f(s)   // handle failure
        if (a_i == b_(s+1))  ++s                    // character match
        if (s == n)  return "yes"                   // full match
    }
    return "no"

    i        1    2     3     4      5       6
    f(i)     0    0     1     2      3       1
    seen     a    ab    aba   abab   ababa   ababaa
    prefix   ϵ    ϵ     a     ab     aba     a

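The same loop as compilable C, with the failure function of ababaa hard-coded from the table above (function and variable names are illustrative):

    #include <stdio.h>
    #include <string.h>

    /* a = text (1-indexed, length m), b = pattern (1-indexed, length n),
     * f = failure function of the pattern.  Returns 1 iff b occurs in a. */
    int kmp_match(const char *a, int m, const char *b, int n, const int *f) {
        int s = 0;                                        /* length of current partial match */
        for (int i = 1; i <= m; ++i) {                    /* go over the text */
            while (s > 0 && a[i] != b[s + 1]) s = f[s];   /* handle failure */
            if (a[i] == b[s + 1]) ++s;                    /* character match */
            if (s == n) return 1;                         /* full match */
        }
        return 0;
    }

    int main(void) {
        const char *text = " abababaababbbabbababb";      /* leading blank: 1-indexed */
        const char *pat  = " ababaa";
        int f[] = { 0, 0, 0, 1, 2, 3, 1 };                /* f(1..6) from the slide; f[0] unused */
        printf("%d\n", kmp_match(text, (int)strlen(text) - 1, pat, 6, f));  /* prints 1 */
        return 0;
    }
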
Classwork

● Find the failure function for pattern ababba.
  – Test it on string abababbaa.
● Find the failure function for aaaaa (and apply it on aaaab).
  – An example needing multiple iterations of the while loop.
  – Despite the nested loop, the complexity is O(m+n).
  – The backward traversal of the pattern is upper-bounded by the forward traversal.
  – Forward traversal increments the text index.
  – For an n-length pattern match, the backward traversal can be at most n. Thus, O(2n) for an n-length match.
  – Examples at this link.

Fibonacci Strings

● s_1 = b, s_2 = a, s_k = s_(k-1) s_(k-2) for k > 2
  – e.g., s_3 = ab, s_4 = aba, s_5 = abaab
● They do not contain bb or aaa.
● The words end in ba and ab alternately.
● Suppressing the last two letters creates a palindrome.
● ...
● Find the failure function for the Fibonacci string s_6.

Source: Wikipedia

KMP Generalization

● KMP can be used for keyword matching.
● Aho and Corasick generalized KMP to recognize any of a set of keywords in a text.

    [Transition diagram for the keywords he, she, his and hers:
     0 --h--> 1 --e--> 2 --r--> 8 --s--> 9;   1 --i--> 6 --s--> 7;   0 --s--> 3 --h--> 4 --e--> 5.]

    i      1   2   3   4   5   6   7   8   9
    f(i)   0   0   0   1   2   0   3   0   3

KMP Generalization

● When in state i, the failure function f(i) notes the state corresponding to the longest proper suffix that is also a prefix of some keyword.

    [The same transition diagram and failure function for he, she, his and hers as on the previous slide.]

● In state 7, character s matches a prefix of the keyword she, to reach state 3.

Regex to DFA

● Approach 1: Regex → NFA → DFA
● Approach 2: Regex → DFA
  – The ideas would be helpful in parsing too.

Regex → NFA → DFA

● Draw an NFA for *cpp.

    [NFA: state 0 with a self-loop on Ʃ; 0 --c--> 1 --p--> 2 --p--> 3 (accepting).]
    [DFA: 0 --c--> 1 --p--> 2 --p--> 3 (accepting); on c every state returns to 1, and on a mismatching character back to 0.]

● How does a machine draw an NFA for an arbitrary regular expression such as ((aa)*b(bb)*(aa)*)* ?

Regex → NFA → DFA

● For the sake of convenience, let's convert *cpp into *abb and restrict to the alphabet {a, b}.
● Thus, the regex is (a|b)*abb.
● How do we create an NFA for (a|b)*abb?

    [NFA with ϵ-transitions (states 0–10):
     0 -ϵ-> 1, 0 -ϵ-> 7;  1 -ϵ-> 2, 1 -ϵ-> 4;  2 -a-> 3;  4 -b-> 5;  3 -ϵ-> 6, 5 -ϵ-> 6;
     6 -ϵ-> 1, 6 -ϵ-> 7;  7 -a-> 8;  8 -b-> 9;  9 -b-> 10 (accepting).]

Regex → NFA → DFA

    State transition table (subset construction):

    NFA state                  DFA state   a   b
    {0, 1, 2, 4, 7}            A           B   C
    {1, 2, 3, 4, 6, 7, 8}      B           B   D
    {1, 2, 4, 5, 6, 7}         C           B   C
    {1, 2, 4, 5, 6, 7, 9}      D           B   E
    {1, 2, 4, 5, 6, 7, 10}     E           B   C

    [Resulting DFA: A --a--> B, A --b--> C;  B --a--> B, B --b--> D;  C --a--> B, C --b--> C;
     D --a--> B, D --b--> E;  E --a--> B, E --b--> C.  E is the accepting state.]

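The transition table above drives directly into a table-driven recognizer; a C sketch with states A–E encoded as 0–4 (the encoding is illustrative):

    #include <stdio.h>

    enum { A, B, C, D, E };                   /* DFA states; E is the only accepting state */

    /* Rows follow the table: column 0 is input a, column 1 is input b. */
    static const int delta[5][2] = {
        /* A */ { B, C },
        /* B */ { B, D },
        /* C */ { B, C },
        /* D */ { B, E },
        /* E */ { B, C },
    };

    /* Returns 1 iff the input over {a, b} matches (a|b)*abb. */
    int accepts(const char *w) {
        int state = A;
        for (; *w; ++w)
            state = delta[state][*w == 'b'];
        return state == E;
    }

    int main(void) {
        printf("%d %d\n", accepts("aababb"), accepts("abba"));   /* prints 1 0 */
        return 0;
    }
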
Regex → NFA → DFA

    [NFA for (a|b)*abb with a Ʃ self-loop on state 0:  0 --a--> 1 --b--> 2 --b--> 3 (accepting).]

    [Minimal DFA (four states):  0 --a--> 1 --b--> 2 --b--> 3 (accepting); every state goes to 1 on a;
     0 and 3 go to 0 on b.]

    [The five-state DFA A–E from the subset construction is non-minimal.]

Regex → NFA → DFA

    (a|b)*abb                                                  Regex

    [The ϵ-NFA with states 0–10 constructed above]             NFA

    [The five-state DFA A–E from the subset construction]      DFA (non-minimal)

● Lex directly converts a regex to a DFA without an intermediate NFA. How?

Regex → DFA
1. Construct a syntax tree for regex#.
2. Compute nullable, firstpos, lastpos, followpos.
3. Construct the DFA using the transition function.
4. Mark firstpos(root) as the start state.
5. Mark states that contain the position of # as accepting states.

Regex → DFA

● The regex is (a|b)*abb#.
● Construct a syntax tree for the regex.

    [Syntax tree for (a|b)*abb#: a spine of cat-nodes; the leftmost child is a star-node over an
     or-node with leaves a (position 1) and b (position 2); the remaining leaves are a (3), b (4),
     b (5) and # (6).]

● Leaves correspond to operands.
● Interior nodes correspond to operators.
● Operands constitute strings.

Functions from Syntax Tree

● For a syntax tree node n
  – nullable(n): true if n represents ϵ.
  – firstpos(n): set of positions that correspond to the first symbol of strings in n's subtree.
  – lastpos(n): set of positions that correspond to the last symbol of strings in n's subtree.
  – followpos(n): set of next possible positions from n for valid strings.

    [The ϵ-NFA for (a|b)*abb (states 0–10) is shown again for comparison.]

nullable

● nullable(n): true if n represents ϵ.
● The regex is (a|b)*abb#.

    [The syntax tree annotated with nullable values: every node is F (false) except the star-node,
     which is T (true).]

nullable

● nullable(n): true if n represents ϵ.

    Node n                  nullable(n)
    leaf labeled ϵ          true
    leaf with position i    false
    or-node n = c1 | c2     nullable(c1) or nullable(c2)
    cat-node n = c1c2       nullable(c1) and nullable(c2)
    star-node n = c*        true

● Classwork: Write down the rules for firstpos(n).
● firstpos(n): set of positions that correspond to the first symbol of strings in n's subtree.

firstpos

● firstpos(n): set of positions that correspond to the first symbol of strings in n's subtree.

    Node n                  firstpos(n)
    leaf labeled ϵ          {}
    leaf with position i    {i}
    or-node n = c1 | c2     firstpos(c1) ∪ firstpos(c2)
    cat-node n = c1c2       if (nullable(c1)) firstpos(c1) ∪ firstpos(c2) else firstpos(c1)
    star-node n = c*        firstpos(c)

firstpos

    [The syntax tree annotated with firstpos: {1} at leaf a(1), {2} at leaf b(2), {1,2} at the or-node
     and the star-node, {3} at a(3), {4} at b(4), {5} at b(5), {6} at #(6), and {1,2,3} at every
     cat-node along the spine.]

firstpos

    [The firstpos table is repeated here.]

● Classwork: Write down the rules for lastpos(n).

lastpos

● lastpos(n): set of positions that correspond to the last symbol of strings in n's subtree.

    Node n                  lastpos(n)
    leaf labeled ϵ          {}
    leaf with position i    {i}
    or-node n = c1 | c2     lastpos(c1) ∪ lastpos(c2)
    cat-node n = c1c2       if (nullable(c2)) lastpos(c1) ∪ lastpos(c2) else lastpos(c2)
    star-node n = c*        lastpos(c)

firstpos and lastpos

    [The syntax tree annotated with both firstpos (left) and lastpos (right) at each node.
     For example, the root has firstpos {1,2,3} and lastpos {6}; the star-node has firstpos {1,2}
     and lastpos {1,2}; each leaf i has firstpos = lastpos = {i}.]

followpos

● followpos(n): set of next possible positions from n for valid strings.
  – If n is a cat-node with child nodes c1 and c2 (c1.c2), then for each position in lastpos(c1), all positions in firstpos(c2) follow.
  – If n is a star-node, then for each position in lastpos(n), all positions in firstpos(n) follow.
  – If n is an or-node, ...

followpos

● If n is a cat-node with child nodes c1 and c2, then for each position in lastpos(c1), all positions in firstpos(c2) follow.

    [Applying this rule at each cat-node of the annotated tree gives:]

    n    followpos(n)
    1    {3}
    2    {3}
    3    {4}
    4    {5}
    5    {6}
    6    {}

followpos

● If n is a star-node, then for each position in lastpos(n), all positions in firstpos(n) follow.

    [The star-node has lastpos = firstpos = {1, 2}; adding these pairs completes the table:]

    n    followpos(n)
    1    {1, 2, 3}
    2    {1, 2, 3}
    3    {4}
    4    {5}
    5    {6}
    6    {}

Regex → DFA
1. Construct a syntax tree for regex#.
2. Compute nullable, firstpos, lastpos, followpos.
3. Construct the DFA using the transition function (next slide).
4. Mark firstpos(root) as the start state.
5. Mark states that contain the position of # as accepting states.

DFA Transitions

    create unmarked state firstpos(root)
    while there exists an unmarked state s {
        mark s
        for each input symbol x {
            uf = U followpos(p) where p is in s and labeled x
            transition[s, x] = uf
            if uf is newly created, unmark uf
        }
    }

    [Example: starting from firstpos(root) = {1,2,3}, on a we get {1,2,3} --a--> {1,2,3,4},
     and on b, {1,2,3} --b--> {1,2,3}.]

    n    followpos(n)
    1    {1, 2, 3}
    2    {1, 2, 3}
    3    {4}
    4    {5}
    5    {6}
    6    {}

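A C sketch of this loop for (a|b)*abb#, with DFA states represented as bit sets over positions 1–6 and followpos taken from the table above; the bit-set encoding and names are illustrative:

    #include <stdio.h>

    /* Symbol at each position of (a|b)*abb#, and followpos as bit sets
     * (bit p set means position p is in the set). */
    static const char sym[7] = { 0, 'a', 'b', 'a', 'b', 'b', '#' };
    static const int  fp[7]  = { 0,
        (1<<1)|(1<<2)|(1<<3),      /* followpos(1) = {1,2,3} */
        (1<<1)|(1<<2)|(1<<3),      /* followpos(2) = {1,2,3} */
        (1<<4), (1<<5), (1<<6), 0 };

    static void print_set(int s) {
        printf("{");
        for (int p = 1, first = 1; p <= 6; ++p)
            if (s & (1 << p)) { printf(first ? "%d" : ",%d", p); first = 0; }
        printf("}");
    }

    int main(void) {
        int states[64], nstates = 0, marked = 0;
        states[nstates++] = (1<<1)|(1<<2)|(1<<3);         /* start state = firstpos(root) = {1,2,3} */
        while (marked < nstates) {                        /* while there is an unmarked state */
            int s = states[marked++];                     /* mark s */
            for (int xi = 0; xi < 2; ++xi) {              /* for each input symbol */
                char x = "ab"[xi];
                int uf = 0;
                for (int p = 1; p <= 6; ++p)              /* uf = U followpos(p), p in s labeled x */
                    if ((s & (1 << p)) && sym[p] == x) uf |= fp[p];
                print_set(s); printf(" --%c--> ", x); print_set(uf);
                printf((uf & (1 << 6)) ? "   (accepting)\n" : "\n");  /* contains # position */
                int seen = 0;
                for (int i = 0; i < nstates; ++i) if (states[i] == uf) seen = 1;
                if (!seen && uf) states[nstates++] = uf;  /* newly created state stays unmarked */
            }
        }
        return 0;
    }

Running it lists the four states 123, 1234, 1235 and 1236 with their transitions, matching the final DFA on the next slide.
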
Final DFA

    [DFA obtained from followpos:  123 --a--> 1234 --b--> 1235 --b--> 1236 (accepting);
     every state goes to 1234 on a;  123 and 1236 go to 123 on b.]

    [For comparison: the Ʃ-loop NFA  0 --a--> 1 --b--> 2 --b--> 3,  and the minimal four-state DFA
     0 --a--> 1 --b--> 2 --b--> 3 with every state going to 1 on a, and 0 and 3 going to 0 on b.]

Regex → DFA
1. Construct a syntax tree for regex#.
2. Compute nullable, firstpos, lastpos, followpos.
3. Construct the DFA using the transition function.
4. Mark firstpos(root) as the start state.
5. Mark states that contain the position of # as accepting states.

● Do this for (b|ab)*(aa|b)*.

In case you are wondering...

● What to do with this DFA?
  – Recognize strings during lexical analysis.
  – Could be used in utilities such as grep.
  – Could be used in regex libraries as supported in php, python, perl, ....

Lexing Summary

● Basic lex
● Input Buffering
● KMP String Matching
● Regex → NFA → DFA
● Regex → DFA

[Figure: the compiler-phase pipeline once more, with the Lexical Analyzer stage highlighted.]
