2 Lexing

The document discusses the process of lexing in compiler design, detailing the role of a lexical analyzer in reading input characters, grouping them into tokens, and managing patterns using regular expressions. It covers various concepts such as input buffering, error handling, and the implementation of string matching algorithms like the Knuth-Morris-Pratt algorithm. Additionally, it emphasizes the importance of context-insensitive lexing and the organization of token patterns to avoid conflicts between keywords and identifiers.

Lexing

Rupesh Nasre.

CS3300 Compiler Design

IIT Madras
July 2024
[Figure: phases of a compiler. Frontend: character stream → Lexical Analyzer → token stream → Syntax Analyzer → syntax tree → Semantic Analyzer → syntax tree → Intermediate Code Generator → intermediate representation. Backend: Machine-Independent Code Optimizer → intermediate representation → Code Generator → target machine code → Machine-Dependent Code Optimizer → target machine code. A Symbol Table sits alongside the phases.]
Lexing Summary
●
Basic lex
●
Input Buffering
●
KMP String Matching
●
Regex → NFA → DFA
●
Regex → DFA

Role
●
Read input characters
●
Group into words (lexemes)
●
Return sequence of tokens
●
Sometimes
– Eat-up whitespace
– Remove comments
– Maintain line number information

Token, Pattern, Lexeme
Token        Pattern                              Sample lexeme
if           Characters i, f                      if
comparison   <= or >= or < or > or == or !=       <=, !=
identifier   letter (letter | digit)*             pi, score, D2
number       Any numeric constant                 3.14159, 0, 6.02e23
literal      Anything but ", surrounded by "s     "core dumped"

The following classes cover most or all of the tokens

●
One token for each keyword
●
Tokens for the operators, individually or in classes
●
Token for identifiers
●
One or more tokens for constants
●
One token each for punctuation symbols
Representing Patterns
●
Keywords can be represented directly (break, int).
●
So can punctuation symbols ({, +).
●
The remaining lexemes are finite in number, but far too many to enumerate!
– Numbers
– Identifiers
– They are better represented using regular expressions.
– [a-z][a-z0-9]*, [0-9]+
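
As a minimal sketch of how the two patterns above classify lexemes, the following Python snippet uses the `re` module as a stand-in for lex. The function name `classify` and the token class names are made up for illustration.

```python
import re

# The two patterns from the slide; fullmatch() requires the whole lexeme to match.
IDENTIFIER = re.compile(r"[a-z][a-z0-9]*")
NUMBER = re.compile(r"[0-9]+")

def classify(lexeme):
    """Return the token class of a lexeme, or None if neither pattern matches."""
    if IDENTIFIER.fullmatch(lexeme):
        return "identifier"
    if NUMBER.fullmatch(lexeme):
        return "number"
    return None
```

For example, `classify("d2")` yields "identifier", while `classify("2d")` matches neither pattern.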

Classwork: Regex Recap
●
If L is the set of letters (A-Z, a-z) and D is the set of digits (0-9),
– Find the size of the language LD.
– Find the size of the language L ∪ D.
– Find the size of the language L^4.
●
Write regex for real numbers
– Without eE, without +- in mantissa (1.89)
– Without eE, with +- in mantissa (-1.89)
– With eE, with -+ in exponent (-1.89E-4)
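
One possible set of answers to the real-number exercise, sketched in Python's `re` syntax. It assumes that at least one digit is required on each side of the decimal point, which the exercise does not state; the names `PLAIN`, `SIGNED`, `SCIENTIFIC` and `matches` are illustrative only.

```python
import re

# Each pattern extends the previous one.
PLAIN = r"[0-9]+\.[0-9]+"                      # e.g. 1.89
SIGNED = r"[+-]?" + PLAIN                      # e.g. -1.89
SCIENTIFIC = SIGNED + r"([eE][+-]?[0-9]+)?"    # e.g. -1.89E-4

def matches(pattern, text):
    """True if the whole text is matched by the pattern."""
    return re.fullmatch(pattern, text) is not None
```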
Classwork
●
Write regex for strings over alphabet {a, b} that
start and end with a.
●
Strings with third last letter as a.
●
Strings with exactly three bs.
●
Strings with even length.
●
Homework
– Exercise 3.3.6 from ALSU.

Example Lex
/* variables */
[a-z]        {
                 yylval = *yytext - 'a';
                 return VARIABLE;
             }

/* integers */
[0-9]+       {
                 yylval = atoi(yytext);
                 return INTEGER;
             }

/* operators */
[-+()=/*\n]  { return *yytext; }

/* skip whitespace */
[ \t]        ;

/* anything else is an error */
.            yyerror("invalid character");

Slide annotations: [a-z] and [0-9]+ are patterns, VARIABLE and INTEGER are tokens, and yytext holds the lexeme.
a1.l → lex → lex.yy.c
a1.y → yacc → y.tab.c, y.tab.h
lex.yy.c, y.tab.c, y.tab.h → gcc → a.out

a.out: this is your compiler. Lexer and parser are not separate binaries; they are part of the same executable.
Lex Regex
Expression   Matches                                  Example
c            Character c                              a
\c           Character c literally                    \*
"s"          String s literally                       "**"
.            Any character but newline                a.*b
^            Beginning of a line                      ^abc
$            End of a line                            abc$
[s]          Any of the characters in string s        [abc]
[^s]         Any one character not in string s        [^abc]
r*           Zero or more strings matching r          a*
r+           One or more strings matching r           a+
r?           Zero or one r                            a?
r{m,n}       Between m and n occurrences of r         a{1,5}
r1r2         An r1 followed by an r2                  ab
r1 | r2      An r1 or an r2                           a|b
(r)          Same as r                                (a | b)
r1/r2        r1 when followed by r2                   abc/123
Homework
●
Write a lexer to identify special words in a text.
– Words like stewardesses: only one hand
– Words like typewriter: only one keyboard row
– Words like skepticisms: alternate hands
●
Implement grep using lex with search pattern
as alphabetical text (no operators *, ?, ., etc.).

Lexing and Context
●
Language design should ensure that lexing
can be done without context.
●
Your assignments and most languages need
context-insensitive lexing.

DO 5 I = 1.25        DO 5 I = 1,25

●
“DO 5 I” is an identifier in Fortran, as spaces are allowed in identifiers.
●
Thus, the first statement is an assignment, while the second is a loop.
●
The lexer doesn't know whether to treat the input “DO 5 I” as an identifier or as part of a loop until the parser informs it, based on the dot or the comma.
●
Alternatively, the lexer may employ a lookahead.
Lookahead

The world belongs to those who look ahead.

Lookahead
●
Lexer needs to look into the future to know
where it is presently.

DO / .* COMMA    { return DO; }

DO 5 I = 1,25

●
/ signifies the lookahead operator. The input after it is read and matched, but is left unconsumed by the current rule.

Corollary: the DO loop index and increment must be on the same line; no arbitrary whitespace is allowed.
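
The same idea can be sketched with Python's `re` module, whose `(?=...)` construct plays the role of lex's `/`: the lookahead must match but consumes nothing. This is only an illustration; the helper name `starts_with_do_keyword` is hypothetical, and a literal comma stands in for COMMA.

```python
import re

# "DO" is accepted only if a comma follows later on the same line
# ('.' does not match a newline, so the lookahead stays on one line).
DO_KEYWORD = re.compile(r"DO(?=.*,)")

def starts_with_do_keyword(line):
    """True if the line begins with DO used as a loop keyword."""
    m = DO_KEYWORD.match(line)
    # Only the two characters of "DO" are consumed; the lookahead is left in the input.
    return m is not None and m.end() == 2
```

With the two Fortran statements above, `DO 5 I = 1,25` is recognized as a loop, while `DO 5 I = 1.25` is not.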
Lexical Errors
●
It is often difficult to report errors for a lexer.
– fi (a == f(x)) ...
– A lexer doesn't know the context of fi. Hence it
cannot “see” the structure of the sentence –
structure is known only to the parser.
– fi = 2; OR fi(a == f(x));
●
But some errors a lexer can catch.
– 23 = @a;
– if $x friendof anil ...

What should a lexer do on catching an error?

Error Handling
●
Multiple options
– exit(1);
– Panic mode recovery: delete enough input to recognize a
token
– Delete one character from the input
– Insert a missing character into the remaining input
– Replace a character by another character
– Transpose two adjacent characters
●
In practice, most lexical errors involve a single character.
●
Theoretical problem: Find the smallest number of transformations (add, replace, delete) needed to convert the source program into one that consists only of valid lexemes.
– Too expensive in practice to be worth the effort.
Homework
●
Try exercise 3.1.2 from ALSU.

Input Buffering
●
“We cannot know we were executing a finite loop until we come out of the loop.”
●
In C, without reading the next character we cannot determine a binary minus symbol (a-b).
– ->, -=, --, -e, ...
– Sometimes we may have to look several characters into the future; this is called lookahead.
– In the Fortran example (DO 5 I), the lookahead could be up to the dot or comma.
●
Reading character-by-character from disk is inefficient. Hence buffering is required.
Input Buffering
●
A block of characters is read from disk into a buffer.
●
Lexer maintains two pointers:
– lexemeBegin
– forward
E = M * C * * 2 \f
[Figure: the buffer above, with the lexemeBegin and forward pointers into it.]

What is the problem with such a scheme?
Input Buffering
●
The issue arises when the lookahead is
beyond the buffer.
●
When you load the buffer, the previous content
is overwritten!
[Figure: the buffer E = M * C * * 2 \f divided into "input read" and "input to be read", with the lexemeBegin and forward pointers.]

How do we solve this problem?
Double Buffering
●
Uses two (half) buffers.
●
Assumes that the lookahead would not be
more than one buffer size.

[Figure: the input E = M * C * * 2 \f laid across two half-buffers Buf1 and Buf2, with the lexemeBegin and forward pointers.]

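
A rough sketch of the two-half-buffer idea in Python follows. It is not the scheme of any particular lexer: the class name `DoubleBuffer`, the tiny default buffer size and the `next_char` method are all assumptions for illustration, and the lexemeBegin pointer and sentinels are left out.

```python
import io

class DoubleBuffer:
    """Reads a stream through two half-buffers, reloading the idle half on demand."""

    def __init__(self, stream, size=4):
        self.stream = stream
        self.size = size
        self.halves = [stream.read(size), ""]  # Buf1 is loaded first
        self.cur = 0   # half that the forward pointer is in
        self.pos = 0   # offset of the forward pointer within that half

    def next_char(self):
        """Advance the forward pointer; return None at end of input."""
        if self.pos == len(self.halves[self.cur]):
            if len(self.halves[self.cur]) < self.size:
                return None                    # a short half means the input ended
            other = 1 - self.cur
            # Only the other half is overwritten; the current half stays intact.
            self.halves[other] = self.stream.read(self.size)
            self.cur, self.pos = other, 0
            if not self.halves[other]:
                return None
        ch = self.halves[self.cur][self.pos]
        self.pos += 1
        return ch

buf = DoubleBuffer(io.StringIO("E = M * C ** 2"))
text = "".join(iter(buf.next_char, None))
```

Because only one half is reloaded at a time, a lexeme that started in the other half is still available, as long as the lookahead does not exceed one buffer size.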
Transition Diagrams
●
Step to be taken on each character can be
specified as a state transition diagram.
– Sometimes, action may be associated with a state.
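State 0: on < go to 1; on = go to 4; on > go to 7.
State 1: on = go to 2; on other go to 3.
State 2: return(comp, LE);
State 3: yyless(1); return(comp, LT);
State 4: on = go to 5; on other go to 6.
State 5: return(comp, EQ);
State 6: yyless(1); return(assign, ASSIGN);
State 7: on = go to 8; on other go to 9.
State 8: return(comp, GE);
State 9: yyless(1); return(comp, GT);
...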
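
The transition diagram can be turned into a small table-driven recognizer. The sketch below is one way to do it in Python; the names `TRANSITIONS`, `OTHER`, `ACCEPT` and `relop` are invented here, and giving back the lookahead character corresponds to the yyless(1) calls in the diagram.

```python
# (state, character) -> next state, for the labeled edges of the diagram.
TRANSITIONS = {(0, "<"): 1, (1, "="): 2, (0, "="): 4, (4, "="): 5, (0, ">"): 7, (7, "="): 8}
# state -> next state on any other character.
OTHER = {1: 3, 4: 6, 7: 9}
# accepting state -> (token, attribute, retract one character?)
ACCEPT = {
    2: ("comp", "LE", False), 3: ("comp", "LT", True),
    5: ("comp", "EQ", False), 6: ("assign", "ASSIGN", True),
    8: ("comp", "GE", False), 9: ("comp", "GT", True),
}

def relop(text, pos=0):
    """Return (token, attribute, next position), or None if no operator starts at pos."""
    state = 0
    while state not in ACCEPT:
        ch = text[pos] if pos < len(text) else ""  # "" stands for end of input
        if (state, ch) in TRANSITIONS:
            state = TRANSITIONS[(state, ch)]
        elif state in OTHER:
            state = OTHER[state]
        else:
            return None
        pos += 1
    token, attr, retract = ACCEPT[state]
    if retract:
        pos -= 1  # give back the lookahead character
    return token, attr, pos
```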
Keywords vs. Identifiers
●
Keywords may match identifier pattern
– Keywords: int, const, break, ...
– Identifiers: (alpha | _) (alpha | num | _)*
●
If unaddressed, may lead to strange errors.
– Install keywords a priori in the symbol table.
– Prioritize keywords
●
In lex, the rule for a keyword must precede
that of the identifier.
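
The "install keywords a priori" option can be sketched as follows: every word is first matched by the identifier pattern and then looked up in a keyword table. This is an illustrative Python sketch, not lex; `KEYWORDS`, `IDENT` and `tokenize_words` are hypothetical names, and alpha is assumed to mean ASCII letters.

```python
import re

KEYWORDS = {"int", "const", "break"}             # installed before any input is read
IDENT = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")    # (alpha | _) (alpha | num | _)*

def tokenize_words(text):
    """Classify every identifier-shaped word as a keyword or an identifier."""
    tokens = []
    for m in IDENT.finditer(text):
        word = m.group()
        kind = "keyword" if word in KEYWORDS else "identifier"
        tokens.append((kind, word))
    return tokens
```

Because the whole word is matched before the lookup, `interest` and `breaker` stay identifiers even though they begin with keywords.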

Incorrect (lex may give warning) Correct

Special vs. General
●
In general, a specialized pattern must precede the
general pattern (associativity).
●
Lex also follows maximum substring matching rule
(precedence).
– Reordering the rules for < and <= would not affect the
functionality.
●
Compare with rule specialization in Prolog.
●
Classwork: Count number of he and she in a text.
●
Classwork: Write lex rules to recognize quoted strings in C.
– Try to recognize \" inside it.
he and she
Without REJECT:            With REJECT:
she   ++s;                 she   {++s; REJECT;}
he    ++h;                 he    {++h;}

REJECT retries another rule on the same input.

What if I want to count all possible substrings he?

In general, the action associated with a rule may not be easy / modular to duplicate.
Input: he ahe he she she fsfds fsf fs sfhe he she she she

Without REJECT: he=5, she=5        With REJECT: he=10, she=5

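
The two sets of counts can be reproduced outside lex. The Python sketch below imitates the behavior rather than using REJECT itself: non-overlapping matching models the plain rules, and a zero-width lookahead at every position models the effect of REJECT. The function names are made up for illustration.

```python
import re

text = "he ahe he she she fsfds fsf fs sfhe he she she she"

def count_without_reject(s):
    """Leftmost, non-overlapping matches: once "she" is consumed, its "he" is gone."""
    found = re.findall(r"she|he", s)
    return found.count("he"), found.count("she")

def count_with_reject(s):
    """A zero-width lookahead is tried at every position, so overlaps are counted."""
    he = len(re.findall(r"(?=he)", s))
    she = len(re.findall(r"(?=she)", s))
    return he, she
```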
By the way...
●
Sometimes, you need not have a parser at all...
– You could define main in your lex file.
– Simply call yylex() from main.
– Compile using lex, then compile lex.yy.c using gcc
and execute a.out.

String Matching
●
Lexical analyzer relies heavily on string
matching.
●
Given a program text T (length n) and a
pattern string s (length m), we want to check if
s occurs in T.
●
A naive algorithm would try all positions of T to
check for s (complexity O(m*n)).
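
A minimal Python sketch of the naive approach, for reference; the function name `naive_find` is illustrative.

```python
def naive_find(T, s):
    """Return the first index i with T[i:i+len(s)] == s, or -1 if s does not occur."""
    n, m = len(T), len(s)
    for i in range(n - m + 1):          # try every start position in T
        j = 0
        while j < m and T[i + j] == s[j]:
            j += 1                      # up to m comparisons per position
        if j == m:
            return i
    return -1
```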
[Figure: text T of length n, with pattern s of length m aligned beneath it.]

Where can we do better?
●
T = abababaababbbabbababb
●
s = ababaa

i=0
abababaababbbabbababb
ababaa

i=1
abababaababbbabbababb
 ababaa

i=2
abababaababbbabbababb
  ababaa Match found

[Figure: at i=0, the matched part of T (T's current suffix) is lined up against s's proper prefix.]

Key observation: T's current suffix which is a proper prefix in s has the treasure for us.
Whenever there is a mismatch, we should utilize this overlap, rather than restarting.
Knuth-Morris-Pratt Algorithm
●
In 1970, Morris conceived the idea.
●
After a few weeks, Knuth independently discovered the idea.
●
In 1970, Morris and Pratt published a technical report.
●
Knuth, Morris and Pratt published the algorithm jointly in 1977.
●
In 1969, Matiyasevich discovered a similar algorithm.

Source: Wikipedia
KMP String Matching
●
First linear time algorithm for string matching.
●
Whenever there is a mismatch, do not restart; rather, fail intelligently.
●
We define a failure function for each position, taking into account the suffix and the prefix.
●
Note that the matched part of the large string T is essentially the pattern string s. Thus, the failure function can be computed using the pattern s alone.

abababaababbbabbababb
ababaa
Failure is not final.

Failure function for ababaa

i        1    2    3    4     5      6
f(i)     0    0    1    2     3      1
seen     a    ab   aba  abab  ababa  ababaa
prefix   ϵ    ϵ    a    ab    aba    a

Algorithm given as Figure 3.19 in ALSU.

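
A sketch of how such a failure function can be computed, written in Python in the spirit of the textbook algorithm rather than as a copy of it. The name `failure_function` is illustrative, and the returned list is ordered by i = 1, 2, ... to match the table.

```python
def failure_function(pattern):
    """f(i) = length of the longest proper prefix of pattern[:i] that is also its suffix."""
    n = len(pattern)
    f = [0] * (n + 1)      # f[0] is unused padding so that indices match the slides
    t = 0
    for i in range(2, n + 1):
        # Python strings are 0-indexed: b_i is pattern[i - 1] and b_(t+1) is pattern[t].
        while t > 0 and pattern[i - 1] != pattern[t]:
            t = f[t]       # fall back to a shorter prefix
        if pattern[i - 1] == pattern[t]:
            t += 1
        f[i] = t
    return f[1:]
```

For the pattern ababaa this gives 0, 0, 1, 2, 3, 1, as in the table.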
String matching with failure function (first attempt)
Text = a_1 a_2 ... a_m; pattern = b_1 b_2 ... b_n (both indexed from 1)
s = 0
for (i = 1; i <= m; ++i) {                    // Go over Text
    if (s > 0 && a_i != b_{s+1}) s = f(s)     // Handle failure
    if (a_i == b_{s+1}) ++s                   // Character match
    if (s == n) return "yes"                  // Full match
}
return "no"

A single if follows the failure function only once per text character, which is not always enough.

String matching with failure function (if replaced by while)
Text = a_1 a_2 ... a_m; pattern = b_1 b_2 ... b_n (both indexed from 1)
s = 0
for (i = 1; i <= m; ++i) {                       // Go over Text
    while (s > 0 && a_i != b_{s+1}) s = f(s)     // Handle failure
    if (a_i == b_{s+1}) ++s                      // Character match
    if (s == n) return "yes"                     // Full match
}
return "no"

i        1    2    3    4     5      6
f(i)     0    0    1    2     3      1
seen     a    ab   aba  abab  ababa  ababaa
prefix   ϵ    ϵ    a    ab    aba    a

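
The pseudocode with the while loop translates almost line by line into Python. The sketch below is self-contained, so it recomputes the failure function; `kmp_contains` is an illustrative name, and the early return for an empty pattern is an added assumption that the slides do not discuss.

```python
def failure_function(pattern):
    """f[i] for i = 1..n (f[0] unused): longest proper prefix of pattern[:i] that is also its suffix."""
    f = [0] * (len(pattern) + 1)
    t = 0
    for i in range(2, len(pattern) + 1):
        while t > 0 and pattern[i - 1] != pattern[t]:
            t = f[t]
        if pattern[i - 1] == pattern[t]:
            t += 1
        f[i] = t
    return f

def kmp_contains(text, pattern):
    """Return True if pattern occurs in text."""
    if not pattern:
        return True                        # assumption: the empty pattern always matches
    f = failure_function(pattern)
    s = 0                                  # number of pattern characters matched so far
    for ch in text:                        # go over the text once
        while s > 0 and ch != pattern[s]:  # handle failure, possibly several times
            s = f[s]
        if ch == pattern[s]:               # character match
            s += 1
        if s == len(pattern):              # full match
            return True
    return False
```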
Classwork
●
Find failure function for pattern ababba.
– Test it on string abababbaa.
●
Find failure function for aaaaa (and apply on aaaab)
– Example needing multiple iterations of while.
– Despite the nested loop, the complexity is O(m+n).
– The backward traversal of the pattern is upper-bounded by the forward traversal.
– Forward traversal increments the text index.
– For an n-length pattern match, the backward traversal can be at most n. Thus, O(2n) for an n-length match.
– Examples at this link.
Fibonacci Strings
– s_1 = b, s_2 = a, s_k = s_{k-1} s_{k-2} for k > 2
– e.g., s_3 = ab, s_4 = aba, s_5 = abaab

●
Do not contain bb or aaa.
●
The words end in ba and ab alternately.
●
Suppressing the last two letters creates a palindrome.
●
...
●
Find the failure function for Fibonacci string s_6.
Source: Wikipedia
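
The recurrence above is easy to run. A small Python sketch, with `fibonacci_string` as an illustrative name:

```python
def fibonacci_string(k):
    """s_1 = "b", s_2 = "a", s_k = s_(k-1) followed by s_(k-2) for k > 2."""
    prev, cur = "b", "a"
    if k == 1:
        return prev
    for _ in range(k - 2):
        prev, cur = cur, cur + prev
    return cur
```

It can be used to generate inputs for the failure-function exercise.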
KMP Generalization
●
KMP can be used for keyword matching.
●
Aho and Corasick generalized KMP to
recognize any of a set of keywords in a text.
0 -h-> 1 -e-> 2 -r-> 8 -s-> 9
1 -i-> 6 -s-> 7
0 -s-> 3 -h-> 4 -e-> 5

Transition diagram for keywords he, she, his and hers.

i      1  2  3  4  5  6  7  8  9
f(i)   0  0  0  1  2  0  3  0  3
KMP Generalization
●
When in state i, the failure function f(i) notes the state corresponding to the longest proper suffix (of the string that leads to state i) that is also a prefix of some keyword.
●
In state 7, character s matches a prefix of the keyword she, to reach state 3.
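
The failure values for the keyword diagram can be computed breadth-first over the trie. The Python sketch below hard-codes the transitions for he, she, his and hers with the state numbers used on the slide; `GOTO` and `failure_links` are illustrative names, and this is a simplified fragment, not a full Aho-Corasick matcher.

```python
from collections import deque

# Trie transitions for he, she, his, hers, numbered as in the transition diagram.
GOTO = {0: {"h": 1, "s": 3}, 1: {"e": 2, "i": 6}, 2: {"r": 8},
        3: {"h": 4}, 4: {"e": 5}, 6: {"s": 7}, 8: {"s": 9}}

def failure_links(goto):
    """Return {state: f(state)} for every non-root state."""
    f = {}
    queue = deque()
    for child in goto[0].values():       # states at depth 1 fail to the root
        f[child] = 0
        queue.append(child)
    while queue:
        state = queue.popleft()
        for ch, child in goto.get(state, {}).items():
            t = f[state]
            while t > 0 and ch not in goto.get(t, {}):
                t = f[t]                 # follow failure links, as in KMP
            f[child] = goto.get(t, {}).get(ch, 0)
            queue.append(child)
    return f
```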
Regex to DFA
●
Approach 1: Regex → NFA → DFA
●
Approach 2: Regex → DFA
– The ideas would be helpful in parsing too.

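Regex → NFA → DFA
Draw an NFA for *cpp

NFA: 0 -c-> 1 -p-> 2 -p-> 3, with a self-loop on state 0 for every symbol in Ʃ.
[Figure: a DFA over the same states 0 to 3, with extra edges labeled c and p.]

How does a machine draw an NFA for an arbitrary regular expression such as ((aa)*b(bb)*(aa)*)* ?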
Regex → NFA → DFA
●
For the sake of convenience, let's convert *cpp into *abb and restrict to alphabet {a, b}.
●
Thus, the regex is (a|b)*abb.
●
How do we create an NFA for (a|b)*abb?

NFA transitions (start state 0, accepting state 10):
0 -ϵ-> 1, 0 -ϵ-> 7
1 -ϵ-> 2, 1 -ϵ-> 4
2 -a-> 3, 4 -b-> 5
3 -ϵ-> 6, 5 -ϵ-> 6
6 -ϵ-> 1, 6 -ϵ-> 7
7 -a-> 8, 8 -b-> 9, 9 -b-> 10

Regex → NFA → DFA
State transition table
NFA state                 DFA state   a   b
{0, 1, 2, 4, 7}           A           B   C
{1, 2, 3, 4, 6, 7, 8}     B           B   D
{1, 2, 4, 5, 6, 7}        C           B   C
{1, 2, 4, 5, 6, 7, 9}     D           B   E
{1, 2, 4, 5, 6, 7, 10}    E           B   C

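
The table can be reproduced mechanically with the subset construction. The Python sketch below hard-codes the ϵ-NFA for (a|b)*abb with the state numbers used in the table; the names `EPS`, `MOVE`, `eps_closure` and `subset_construction` are illustrative.

```python
# ϵ-transitions and labeled transitions of the NFA for (a|b)*abb.
EPS = {0: {1, 7}, 1: {2, 4}, 3: {6}, 5: {6}, 6: {1, 7}}
MOVE = {(2, "a"): 3, (4, "b"): 5, (7, "a"): 8, (8, "b"): 9, (9, "b"): 10}

def eps_closure(states):
    """All NFA states reachable from `states` using only ϵ-transitions."""
    stack, closure = list(states), set(states)
    while stack:
        for nxt in EPS.get(stack.pop(), ()):
            if nxt not in closure:
                closure.add(nxt)
                stack.append(nxt)
    return frozenset(closure)

def subset_construction(start=0, alphabet="ab"):
    """Return (start DFA state, {DFA state: {symbol: DFA state}})."""
    start_state = eps_closure({start})
    table, worklist = {}, [start_state]
    while worklist:
        S = worklist.pop()
        if S in table:
            continue
        table[S] = {}
        for x in alphabet:
            target = eps_closure({MOVE[(q, x)] for q in S if (q, x) in MOVE})
            table[S][x] = target
            if target not in table:
                worklist.append(target)
    return start_state, table

start_state, table = subset_construction()
```

Each key of `table` is a set of NFA states, i.e. one of the DFA states A to E.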
Regex → NFA → DFA
[Figure: the DFA drawn from the state transition table, with states A to E, start state A and accepting state E.]
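Regex → NFA → DFA
NFA: 0 -a-> 1 -b-> 2 -b-> 3, with a self-loop on state 0 for every symbol in Ʃ.
[Figure: a four-state DFA on states 0 to 3, shown above the five-state DFA on states A to E, which is labeled non-minimal.]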
Regex → NFA → DFA
Regex: (a|b)*abb
NFA: [Figure: the ϵ-NFA on states 0 to 10.]
DFA: [Figure: the non-minimal DFA on states A to E.]

Lex directly converts Regex to DFA without an intermediate NFA. How?
Regex → DFA
1. Construct a syntax tree for regex#.
2. Compute nullable, firstpos, lastpos, followpos.
3. Construct DFA using transition function.
4. Mark firstpos(root) as start state.
5. Mark states that contain position of # as accepting states.

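Regex → DFA
●
Regex is (a|b)*abb#.
●
Construct a syntax tree for the regex.

.
├── .
│   ├── .
│   │   ├── .
│   │   │   ├── *
│   │   │   │   └── |
│   │   │   │       ├── a (position 1)
│   │   │   │       └── b (position 2)
│   │   │   └── a (position 3)
│   │   └── b (position 4)
│   └── b (position 5)
└── # (position 6)

●
Leaves correspond to operands.
●
Interior nodes correspond to operators.
●
Operands constitute strings.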
Functions from Syntax Tree
●
For a syntax tree node n
– nullable(n): true if the subexpression rooted at n can generate ϵ.
– firstpos(n): set of positions that correspond to the first symbol of strings in n's subtree.
– lastpos(n): set of positions that correspond to the last symbol of strings in n's subtree.
– followpos(n): for a position n, the set of positions that can come next in valid strings.

nullable
●
nullable(n): true if n represents ϵ.
●
Regex is (a|b)*abb#.

Node                                          nullable
leaves a (1), b (2), a (3), b (4), b (5), # (6)   F
or-node a | b                                 F
star-node (a|b)*                              T
each cat-node, including the root             F
nullable
●
nullable(n): true if n represents ϵ.
Node n                  nullable(n)
leaf labeled ϵ          true
leaf with position i    false
or-node n = c1 | c2     nullable(c1) or nullable(c2)
cat-node n = c1c2       nullable(c1) and nullable(c2)
star-node n = c*        true

Classwork: Write down the rules for firstpos(n).

●
firstpos(n): set of positions that correspond to the first symbol of strings in n's subtree.
firstpos
●
firstpos(n): set of positions that correspond to the first symbol of strings in n's subtree.
Node n                  firstpos(n)
leaf labeled ϵ          {}
leaf with position i    {i}
or-node n = c1 | c2     firstpos(c1) U firstpos(c2)
cat-node n = c1c2       if (nullable(c1)) firstpos(c1) U firstpos(c2)
                        else firstpos(c1)
star-node n = c*        firstpos(c)

firstpos
Node                                   firstpos
leaf a (1)                             {1}
leaf b (2)                             {2}
or-node a | b                          {1,2}
star-node (a|b)*                       {1,2}
leaf a (3)                             {3}
leaf b (4)                             {4}
leaf b (5)                             {5}
leaf # (6)                             {6}
each cat-node, including the root      {1,2,3}

Classwork: Write down the rules for lastpos(n).

lastpos
●
lastpos(n): set of positions that correspond to the last symbol of strings in n's subtree.
Node n                  lastpos(n)
leaf labeled ϵ          {}
leaf with position i    {i}
or-node n = c1 | c2     lastpos(c1) U lastpos(c2)
cat-node n = c1c2       if (nullable(c2)) lastpos(c1) U lastpos(c2)
                        else lastpos(c2)
star-node n = c*        lastpos(c)

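Node                           firstpos    lastpos
leaf a (1)                     {1}         {1}
leaf b (2)                     {2}         {2}
or-node a | b                  {1,2}       {1,2}
star-node (a|b)*               {1,2}       {1,2}
leaf a (3)                     {3}         {3}
cat-node (a|b)*a               {1,2,3}     {3}
leaf b (4)                     {4}         {4}
cat-node (a|b)*ab              {1,2,3}     {4}
leaf b (5)                     {5}         {5}
cat-node (a|b)*abb             {1,2,3}     {5}
leaf # (6)                     {6}         {6}
root cat-node (a|b)*abb#       {1,2,3}     {6}
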
followpos
●
followpos(n): set of next possible positions
from n for valid strings.
– If n is a cat-node with child nodes c1 and c2
(c1.c2), then for each position in lastpos(c1), all
positions in firstpos(c2) follow.
– If n is a star-node, then for each position in
lastpos(n), all positions in firstpos(n) follow.
– If n is an or-node, ...

followpos
If n is a cat-node with child nodes c1 and c2, then for each position in lastpos(c1), all positions in firstpos(c2) follow.

After applying the cat-node rule to (a|b)*abb#:
n    followpos(n)
1    {3}
2    {3}
3    {4}
4    {5}
5    {6}
6    {}

If n is a star-node, then for each position in lastpos(n), all positions in firstpos(n) follow.

After also applying the star-node rule:
n    followpos(n)
1    {3, 1, 2}
2    {3, 1, 2}
3    {4}
4    {5}
5    {6}
6    {}
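
The four functions can be computed in a single bottom-up pass over the syntax tree. The Python sketch below does this for (a|b)*abb#; the tuple encoding of the tree and the names `cat`, `TREE` and `analyze` are assumptions made for illustration, and ϵ leaves are left out because this regex has none.

```python
from collections import defaultdict

def cat(left, right):
    return ("cat", left, right)

# Syntax tree for (a|b)*abb#; a leaf is ("leaf", position).
TREE = cat(cat(cat(cat(("star", ("or", ("leaf", 1), ("leaf", 2))),
                       ("leaf", 3)), ("leaf", 4)), ("leaf", 5)), ("leaf", 6))

followpos = defaultdict(set)

def analyze(node):
    """Return (nullable, firstpos, lastpos) of node, filling followpos on the way."""
    kind = node[0]
    if kind == "leaf":
        return False, {node[1]}, {node[1]}
    if kind == "or":
        n1, f1, l1 = analyze(node[1])
        n2, f2, l2 = analyze(node[2])
        return n1 or n2, f1 | f2, l1 | l2
    if kind == "cat":
        n1, f1, l1 = analyze(node[1])
        n2, f2, l2 = analyze(node[2])
        for p in l1:                    # cat rule: firstpos(c2) follows lastpos(c1)
            followpos[p] |= f2
        first = (f1 | f2) if n1 else f1
        last = (l1 | l2) if n2 else l2
        return n1 and n2, first, last
    # star-node
    _, f, l = analyze(node[1])
    for p in l:                         # star rule: firstpos(n) follows lastpos(n)
        followpos[p] |= f
    return True, f, l

root_nullable, root_firstpos, root_lastpos = analyze(TREE)
```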
DFA Transitions
create unmarked state firstpos(root).
while there exists unmarked state s {
    mark s
    for each input symbol x {
        uf = U followpos(p) where p is in s labeled x
        transition[s, x] = uf
        if uf is newly created
            unmark uf
    }
}

Example for (a|b)*abb#: firstpos(root) = {1,2,3}, where positions 1, 2, 3 are labeled a, b, a.
n    followpos(n)
1    {3, 1, 2}
2    {3, 1, 2}
3    {4}
4    {5}
5    {6}
6    {}
First step: 123 -a-> 1234 and 123 -b-> 123.
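
The loop above can be run directly on the followpos table. In the Python sketch below, the table and the position labels are copied from the slides, while the names `SYMBOL`, `FOLLOWPOS` and `build_dfa` are illustrative.

```python
# Position labels and followpos table for (a|b)*abb#.
SYMBOL = {1: "a", 2: "b", 3: "a", 4: "b", 5: "b", 6: "#"}
FOLLOWPOS = {1: {1, 2, 3}, 2: {1, 2, 3}, 3: {4}, 4: {5}, 5: {6}, 6: set()}

def build_dfa(firstpos_root, alphabet="ab"):
    """Return (start state, transitions, accepting states); states are sets of positions."""
    start = frozenset(firstpos_root)
    transition, unmarked, seen = {}, [start], {start}
    while unmarked:
        s = unmarked.pop()                         # mark s
        for x in alphabet:
            # union of followpos(p) over the positions p in s labeled x
            uf = frozenset().union(*(FOLLOWPOS[p] for p in s if SYMBOL[p] == x))
            transition[(s, x)] = uf
            if uf not in seen:                     # newly created state
                seen.add(uf)
                unmarked.append(uf)
    accepting = {s for s in seen if 6 in s}        # states containing the position of #
    return start, transition, accepting

start, transition, accepting = build_dfa({1, 2, 3})
```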
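Final DFA
DFA (start state 123, accepting state 1236):
State   a       b
123     1234    123
1234    1234    1235
1235    1234    1236
1236    1234    123

[Figure: for comparison, the NFA 0 -a-> 1 -b-> 2 -b-> 3 with a Ʃ self-loop on state 0, and the four-state DFA on states 0 to 3.]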
Regex → DFA
1. Construct a syntax tree for regex#.
2. Compute nullable, firstpos, lastpos, followpos.
3. Construct DFA using transition function.
4. Mark firstpos(root) as start state.
5. Mark states that contain position of # as accepting states.

Do this for (b|ab)(aa|b).

In case you are wondering...
●
What to do with this DFA?
– Recognize strings during lexical analysis.
– Could be used in utilities such as grep.
– Could be used in regex libraries as supported in PHP, Python, Perl, ....

Lexing Summary
●
Basic lex
●
Input Buffering
●
KMP String Matching
●
Regex → NFA → DFA
●
Regex → DFA
