2 Lexing
2 Lexing
Rupesh Nasre.
Machine-Independent
Machine-Independent
Lexical
LexicalAnalyzer
Analyzer Code
CodeOptimizer
Optimizer
Intermediate representation
Backend
Token stream
Frontend
Syntax
SyntaxAnalyzer
Analyzer Code
CodeGenerator
Generator
Machine-Dependent
Machine-Dependent
Semantic
SemanticAnalyzer
Analyzer Code
CodeOptimizer
Optimizer
Intermediate
Intermediate
Code Symbol
CodeGenerator
Generator Table
2
Intermediate representation
Lexing Summary
Character stream
● Basic lex Machine-Indep.
Machine-Indep.
Lexical
LexicalAnalyzer
Analyzer Code
CodeOptimizer
Optimizer
●
Input Buffering Token stream
Intermediate
representation
●
KMP String Matching Syntax
SyntaxAnalyzer
Analyzer Code
CodeGenerator
Generator
3
Role
●
Read input characters
●
Group into words (lexemes)
●
Return sequence of tokens
●
Sometimes
– Eat-up whitespace
– Remove comments
– Maintain line number information
4
Token, Pattern, Lexeme
Token Pattern Sample lexeme
if Characters i, f if
comparison <= or >= or < or > or == or != <=, !=
identifier letter (letter + digit)* pi, score, D2
number Any numeric constant 3.14159, 0, 6.02e23
literal Anything but “, surrounded by “” “core dumped”
6
Classwork: Regex Recap
●
If L is a set of letters (A-Z, a-z) and D is a set
of digits (0-9),
– Find the size of the language LD.
– Find the size of the language L U D.
– Find the size of the language L4.
●
Write regex for real numbers
– Without eE, without +- in mantissa (1.89)
– Without eE, with +- in mantissa (-1.89)
– With eE, with -+ in exponent (-1.89E-4)
7
Classwork
●
Write regex for strings over alphabet {a, b} that
start and end with a.
●
Strings with third last letter as a.
●
Strings with exactly three bs.
●
Strings with even length.
●
Homework
– Exercises 3.3.6 from ALSU.
8
Example Lex
/*/*variables
variables*/*/
Patterns [a-z]
[a-z] {{
yylval
yylval==*yytext
*yytext--'a';
'a';
return
returnVARIABLE;
VARIABLE; Tokens
}}
/*/*integers
integers*/*/
[0-9]+
[0-9]+ {{
yylval
yylval==atoi(yytext);
atoi(yytext); Lexemes
return
returnINTEGER;
INTEGER;
}}
/*/*operators
operators*/*/
[-+()=/*\n]
[-+()=/*\n]{{return
return*yytext;
*yytext;}}
/*/*skip
skipwhitespace
whitespace*/*/
[[\t]
\t] ;;
/*/*anything
anythingelse
elseis
isan
anerror
error*/*/
.. yyerror("invalid
yyerror("invalidcharacter");
character");
9
a1.l a1.y
lex
lex yacc
yacc
gcc
gcc
12
Lexing and Context
●
Language design should ensure that lexing
can be done without context.
●
Your assignments and most languages need
context-insensitive lexing.
DO
DO55 I I==1.25
1.25 DO
DO55 I I==1,25
1,25
●
“DO 5 I” is an identifier in Fortran, as spaces are allowed in identifiers.
●
Thus, first is an assignment, while second is a loop.
●
Lexer doesn't know whether to consider the input “DO 5 I” as an identifier
or as a part of the loop, until parser informs it based on dot or comma.
●
Alternatively, lexer may employ a lookahead.
13
Lookahead
●
/ signifies the lookahead symbol. The input is
read and matched, but is left unconsumed in
the current rule.
18
Input Buffering
●
“We cannot know we were executing a finite
loop until we come out of the loop.”
●
In C, without reading the next character we
cannot determine a binary minus symbol (a-b).
->, -=, --, -e, ...
Sometimes we may have to look several
characters in future, called lookahead.
In the fortran example (DO 5 I), the lookahead
could be upto dot or comma.
●
Reading character-by-character from disk is
inefficient. Hence buffering is required. 19
Input Buffering
●
A block of characters is read from disk into a buffer.
●
Lexer maintains two pointers:
– lexemeBegin
– forward
E = M * C * * 2 \f
forward
lexemeBegin
What
Whatis
isthe
theproblem
problemwith
withsuch
suchaascheme?
scheme?
20
Input Buffering
●
The issue arises when the lookahead is
beyond the buffer.
●
When you load the buffer, the previous content
is overwritten!
Input read Input to be
read
E = M * C * * 2 \f
forward
lexemeBegin
How
Howdo
dowe
wesolve
solvethis
thisproblem?
problem? 21
Double Buffering
●
Uses two (half) buffers.
●
Assumes that the lookahead would not be
more than one buffer size.
Buf1 Buf2
E = M * C * * 2 \f
forward
lexemeBegin
22
Transition Diagrams
●
Step to be taken on each character can be
specified as a state transition diagram.
– Sometimes, action may be associated with a state.
< =
0 1 2 return(comp, LE);
other yyless(1); return(comp, LT);
= 3
= return(comp, EQ);
4 5
>
other yyless(1); return(assign, ASSIGN);
6
= 8 return(comp, GE);
7
other 9 yyless(1); return(comp, GT);
23
...
Keywords vs. Identifiers
●
Keywords may match identifier pattern
– Keywords: int, const, break, ...
– Identifiers: (alpha | _) (alpha | num | _)*
●
If unaddressed, may lead to strange errors.
– Install keywords a priori in the symbol table.
– Prioritize keywords
●
In lex, the rule for a keyword must precede
that of the identifier.
24
27
String Matching
●
Lexical analyzer relies heavily on string
matching.
●
Given a program text T (length n) and a
pattern string s (length m), we want to check if
s occurs in T.
●
A naive algorithm would try all positions of T to
check for s (complexity O(m*n)).
n
T
m
s
28
Where can we do better?
●
T = abababaababbbabbababb
●
s = ababaa
i=0
abababaababbbabbababb
ababaa
29
Where can we do better?
●
T = abababaababbbabbababb
●
s = ababaa
i=0
abababaababbbabbababb
ababaa
30
Where can we do better?
●
T = abababaababbbabbababb
●
s = ababaa
i=1
abababaababbbabbababb
ababaa
31
Where can we do better?
●
T = abababaababbbabbababb
●
s = ababaa
i=2
abababaababbbabbababb
ababaa Match found
32
Where can we do better?
●
T = abababaababbbabbababb
●
s = ababaa
35
Source: wikipedia
KMP String Matching
●
First linear time algorithm for string matching.
●
Whenver there is a mismatch, do not restart;
rather fail intelligently.
●
We define a failure function for each position,
taking into account the suffix and the prefix.
●
Note that the matched part of the large string T is
essentially the pattern string s. Thus, failure
function can be computed simply using pattern s.
abababaababbbabbababb
ababaa
36
Failure is not final.
i 1 2 3 4 5 6
f(i) 0 0 1 2 3 1
seen a ab aba abab ababa ababaa
prefix ϵ ϵ a ab aba a
37
String matching with failure function
Text = a1a2...am; pattern = b1b2...bn (both indexed from 1)
s=0
for (i = 1; i <= m; ++i) { Go over Text
if (s > 0 && ai != bs+1) s = f(s) Handle failure
if (ai == bs+1) ++s Character match
38
String matching with failure function
Text = a1a2...am; pattern = b1b2...bn (both indexed from 1)
s=0
for (i = 1; i <= m; ++i) { Go over Text
while (s > 0 && ai != bs+1) s = f(s) Handle failure
if (ai == bs+1) ++s Character match
39
Classwork
●
Find failure function for pattern ababba.
– Test it on string abababbaa.
●
Find failure function for aaaaa (and apply on aaaab)
– Example needing multiple iterations of while.
– Despite the nested loop, the complexity is O(m+n).
– The backward traversal of the pattern is upper-
bounded by the forward traversal.
– Forward traversal increments text index.
– For n-length pattern match, the backward traversal
can be at most n. Thus, O(2n) for n-length match.
– Examples at this link. 40
Fibonacci Strings
– s1 = b, s2 = a, sk = sk-1sk-2 for k > 2
– e.g., s3 = ab, s4 = aba, s5 = abaab
●
Do not contain bb or aaa.
●
The words end in ba and ab alternatively.
●
Suppressing last two letters creates a palindrome.
●
...
● Find the failure function for Fibonacci String s6.
Source: Wikipedia 41
KMP Generalization
●
KMP can be used for keyword matching.
●
Aho and Corasick generalized KMP to
recognize any of a set of keywords in a text.
h e r s
0 1 2 8 9
i
s s
6 7
h e
3 4 5
i 1 2 3 4 5 6 7 8 9
f(i) 0 0 0 1 2 0 3 0 3 42
KMP Generalization
●
When in state i, the failure function f(i) notes
the state corresponding to the longest proper
suffix that is also a prefix of some keyword.
h e r s
0 1 2 8 9
i
s s
6 7
h e
3 4 5
44
Regex NFA DFA
Draw an NFA for *cpp
Ʃ
0 c 1 p 2 p 3
p p
c
0 c 1 p 2 p 3
c c
46
Regex NFA DFA
●
For the sake of convenience, let's convert *cpp
into *abb and restrict to alphabet {a, b}.
●
Thus, the regex is (a|b)*abb.
●
How do we create an NFA for (a|b)*abb?
ϵ
a
ϵ 2 3 ϵ
0 ϵ 1 6 ϵ 7
a
8
b
9
b 10
ϵ 4 b
5 ϵ
47
Regex NFA DFA
NFA state DFA state a b
{0, 1, 2, 4, 7} A B C State
{1, 2, 3, 4, 6, 7, 8} B B D Transition
Table
{1, 2, 4, 5, 6, 7} C B C
{1, 2, 4, 5, 6, 7, 9} D B E
{1, 2, 4, 5, 6, 7, 10} E B C
ϵ
a
ϵ 2 3 ϵ
0 ϵ 1 6 ϵ 7
a
8
b
9
b 10
ϵ 4 b
5 ϵ
48
Regex NFA DFA
NFA state DFA state a b
{0, 1, 2, 4, 7} A B C State
{1, 2, 3, 4, 6, 7, 8} B B D Transition
Table
{1, 2, 4, 5, 6, 7} C B C
{1, 2, 4, 5, 6, 7, 9} D B E
{1, 2, 4, 5, 6, 7, 10} E B C
b
C
b a b
a b
A B D b E DFA
a a
a
49
Regex NFA DFA
Ʃ
a b b NFA
0 1 2 3
b b
a
a b b DFA
0 1 2 3
a a
b
C
b a b
a b
A B D b E DFA
a a non-minimal
a
50
Regex NFA DFA
(a|b)*abb Regex
ϵ
a
ϵ 2 3 ϵ
0 ϵ 1 6 ϵ 7 a 8 b 9 b 10 NFA
ϵ b ϵ
4 5
ϵ
b
C
b a b
a b
A B D b E DFA
a a non-minimal
a
51
Lex directly converts Regex to DFA without an intermediate NFA. How?
Regex DFA
1. Construct a syntax tree for regex#.
2. Compute nullable, firstpos, lastpos, followpos.
3. Construct DFA using transition function.
4. Mark firstpos(root) as start state.
5. Mark states that contain position of # as
accepting states.
52
Regex DFA
●
Regex is (a|b)*abb#.
●
Construct a syntax tree for the regex.
.
. #
. 6
b
. b 5
* 4
a
3
| ●
Leaves correspond to operands.
●
Interior nodes correspond to operators.
●
Operands constitute strings.
1 a b 2
53
Functions from Syntax Tree
●
For a syntax tree node n
– nullable(n): true if n represents ϵ.
– firstpos(n): set of positions that correspond to the
first symbol of strings in n's subtree.
– lastpos(n): set of positions that correspond to the
last symbol of strings in n's subtree.
– followpos(n): set of next possible positions from n
for valid strings.
ϵ
a
ϵ 2 3 ϵ
0 ϵ 1 6 ϵ 7 a 8
b
9
b
10
ϵ 4 b 5 ϵ
54
ϵ
nullable
●
nullable(n): true if n represents ϵ.
●
Regex is (a|b)*abb#.
F .
F . #
F . F
b
F . b F
T * F
a
F
F |
F a b F
55
nullable
●
nullable(n): true if n represents ϵ.
Node n nullable(n)
leaf labeled ϵ true
leaf with position i false
or-node n = c1 | c2 nullable(c1) or nullable(c2)
cat-node n = c1c2 nullable(c1) and nullable(c2)
star-node n = c* true
57
firstpos
●
firstpos(n): set of positions that correspond
to the first symbol of strings in n's subtree.
Node n firstpos(n)
leaf labeled ϵ {}
leaf with position i {i}
or-node n = c1 | c2 firstpos(c1) U firstpos(c2)
cat-node n = c1c2 if (nullable(c1)) firstpos(c1) U firstpos(c2)
else firstpos(c1)
star-node n = c* firstpos(c)
58
firstpos
{1,2,3} .
{1,2,3} . #
{1,2,3} . 6
b {6}
{1,2,3} . b 5
{5}
{1,2} * 4
a {4}
3
{3}
{1,2} |
1 a b 2
{1} {2}
59
firstpos
●
firstpos(n): set of positions that correspond
to the first symbol of strings in n's subtree.
Node n firstpos(n)
leaf labeled ϵ {}
leaf with position i {i}
or-node n = c1 | c2 firstpos(c1) U firstpos(c2)
cat-node n = c1c2 if (nullable(c1)) firstpos(c1) U firstpos(c2)
else firstpos(c1)
star-node n = c* firstpos(c)
60
lastpos
●
lastpos(n): set of positions that correspond
to the last symbol of strings in n's subtree.
Node n lastpos(n)
leaf labeled ϵ {}
leaf with position i {i}
or-node n = c1 | c2 lastpos(c1) U lastpos(c2)
cat-node n = c1c2 if (nullable(c2)) lastpos(c1) U lastpos(c2)
else lastpos(c2)
star-node n = c* lastpos(c)
61
firstpos lastpos
{1,2,3} {6} .
{1,2,3} {5} .
#
{1,2,3} {4} . 6
b {6} {6}
{1,2,3} {3} . b 5
{5} {5}
{1,2} {1,2} * 4
a {4} {4}
3
{3} {3}
{1,2} {1,2} |
1 a b 2
{1} {1} {2} {2}
62
followpos
●
followpos(n): set of next possible positions
from n for valid strings.
– If n is a cat-node with child nodes c1 and c2
(c1.c2), then for each position in lastpos(c1), all
positions in firstpos(c2) follow.
– If n is a star-node, then for each position in
lastpos(n), all positions in firstpos(n) follow.
– If n is an or-node, ...
63
followpos
If n is a cat-node with child nodes c1 and c2, then for each position in
lastpos(c1), all positions in firstpos(c2) follow.
{1,2,3} {6} .
{1,2,3} {5} .
#
{1,2,3} {4} . 6
b {6} {6}
{1,2,3} {3} . b 5
{5} {5}
{1,2} {1,2} * 4
a {4} {4}
3 n followpos(n)
{3} {3}
{1,2} {1,2} | 1 {3}
2 {3}
1 a b 2
{1} {1} {2} {2}
64
followpos
If n is a cat-node with child nodes c1 and c2, then for each position in
lastpos(c1), all positions in firstpos(c2) follow.
{1,2,3} {6} .
{1,2,3} {5} .
#
{1,2,3} {4} . 6
b {6} {6}
{1,2,3} {3} . b 5
{5} {5}
{1,2} {1,2} * 4
a {4} {4}
3 n followpos(n)
{3} {3}
{1,2} {1,2} | 1 {3}
2 {3}
3 {4}
1 a b 2 4 {5}
{1} {1} {2} {2} 5 {6}
6 {} 65
followpos
If n is a star-node, then for each position in lastpos(n), all positions in
firstpos(n) follow.
{1,2,3} {6} .
{1,2,3} {5} .
#
{1,2,3} {4} . 6
b {6} {6}
{1,2,3} {3} . b 5
{5} {5}
{1,2} {1,2} * 4
a {4} {4}
3 n followpos(n)
{3} {3}
{1,2} {1,2} | 1 {3}
2 {3}
3 {4}
1 a b 2 4 {5}
{1} {1} {2} {2} 5 {6}
6 {} 66
followpos
If n is a star-node, then for each position in lastpos(n), all positions in
firstpos(n) follow.
{1,2,3} {6} .
{1,2,3} {5} .
#
{1,2,3} {4} . 6
b {6} {6}
{1,2,3} {3} . b 5
{5} {5}
{1,2} {1,2} * 4
a {4} {4}
3 n followpos(n)
{3} {3}
{1,2} {1,2} | 1 {3, 1, 2}
2 {3, 1, 2}
3 {4}
1 a b 2 4 {5}
{1} {1} {2} {2} 5 {6}
6 {} 67
Regex DFA
1.Construct a syntax tree for regex#.
2.Compute nullable, firstpos, lastpos, followpos.
3.Construct DFA using transition function (next slide).
4.Mark firstpos(root) as start state.
5.Mark states that contain position of # as
accepting states.
68
DFA Transitions
create unmarked state firstpos(root). {1,2,3} {6} .
while there exists unmarked state s {
mark s a b a
1 2 3
for each input symbol x {
uf = U followpos(p) where p is in s labeled x
transition[s, x] = uf n followpos(n)
1 {3, 1, 2}
if uf is newly created
2 {3, 1, 2}
unmark uf b 3 {4}
a
} 123 1234 4 {5}
5 {6} 69
} 6 {}
Final DFA
b
b a
a b b DFA
123 1234 1235 1236
a
a
Ʃ
a b b NFA
0 1 2 3
b b
a
0 a 1 b 2 b 3 DFA
a a
70
Regex DFA
1.Construct a syntax tree for regex#.
2.Compute nullable, firstpos, lastpos, followpos.
3.Construct DFA using transition function.
4.Mark firstpos(root) as start state.
5.Mark states that contain position of # as
accepting states.
72
Lexing Summary
Character stream
● Basic lex Machine-Indep.
Machine-Indep.
Lexical
LexicalAnalyzer
Analyzer Code
CodeOptimizer
Optimizer
●
Input Buffering Token stream Intermediate
representation
●
KMP String Matching Syntax
SyntaxAnalyzer
Analyzer Code
CodeGenerator
Generator
73