Ch3 1
Compiler Construction
The Reason Why Lexical Analysis is a Separate Phase
• Simplifies the design of the compiler
– LL(1) or LR(1) parsing with one token of lookahead
would not be possible on raw characters (a token
may take multiple characters to match)
• Provides efficient implementation
– Systematic techniques to implement lexical
analyzers by hand or automatically from
specifications
– Stream buffering methods to scan input
• Improves portability
– Non-standard symbols and alternate character
encodings can be normalized (e.g. UTF8,
trigraphs)
Interaction of the Lexical Analyzer with the Parser
[Figure: lexical analyzer, parser, and symbol table]
Attributes of Tokens
… <assign, > <num, 31> <'+', > <num, 28> <'*', > …
[Figure: the lexical analyzer passes token (the lookahead) and tokenval (the token attribute) to the parser]
s^0 = ε
s^i = s^(i-1) s for i > 0
note that sε = εs = s
Specification of
r+ = rr*
r? = r | ε
[a-z] = a | b | c | … | z
• Examples:
digit → [0-9]
num → digit+ (. digit+)? ( E (+|-)? digit+ )?
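The definition of num can be hand-coded as a small scanner. The following is a minimal sketch in C, not taken from the slides: `match_num` and `skip_digits` are hypothetical helper names, and the function is assumed to return the length of the longest prefix that matches num (0 if there is none).

```c
#include <ctype.h>
#include <stddef.h>

/* Skip a run of digits starting at s[i]; return the index after the run. */
static size_t skip_digits(const char *s, size_t i)
{
    while (isdigit((unsigned char)s[i]))
        i++;
    return i;
}

/* Hypothetical helper: length of the longest prefix of s matching
   num -> digit+ (. digit+)? ( E (+|-)? digit+ )?   (0 if no match). */
size_t match_num(const char *s)
{
    size_t end = skip_digits(s, 0);        /* digit+ */
    if (end == 0)
        return 0;

    if (s[end] == '.') {                   /* optional fraction */
        size_t j = skip_digits(s, end + 1);
        if (j > end + 1)                   /* at least one digit after '.' */
            end = j;
    }

    if (s[end] == 'E') {                   /* optional exponent */
        size_t j = end + 1;
        if (s[j] == '+' || s[j] == '-')
            j++;
        size_t k = skip_digits(s, j);
        if (k > j)                         /* at least one exponent digit */
            end = k;
    }
    return end;
}
```

An optional part is only consumed when it matches completely, so for the input `1.` the sketch reports a match of length 1 and leaves the dot for the next token.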
Regular Definitions and Grammars
Grammar
stmt → if expr then stmt
     | if expr then stmt else stmt
expr → term relop term
     | term
term → id
     | num
Regular definitions
if → if
then → then
else → else
relop → < | <= | <> | > | >= | =
id → letter ( letter | digit )*
num → digit+ (. digit+)? ( E (+|-)? digit+ )?
Coding Regular Definitions in Transition Diagrams
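relop → < | <= | <> | > | >= | =
[Transition diagram, start state 0; * marks a state that retracts the input by one character]
0 --'<'--> 1
1 --'='--> 2        return(relop, LE)
1 --'>'--> 3        return(relop, NE)
1 --other--> 4 *    return(relop, LT)
0 --'='--> 5        return(relop, EQ)
0 --'>'--> 6
6 --'='--> 7        return(relop, GE)
6 --other--> 8 *    return(relop, GT)
id → letter ( letter | digit )*
[Transition diagram: a letter edge leads to a state that loops on letter or digit]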
Transition Diagrams: Code
token nexttoken()
{ while (1) {
    switch (state) {
    case 0: c = nextchar();
      if (c==blank || c==tab || c==newline) {
        state = 0;
        lexeme_beginning++;
      }
      else if (c=='<') state = 1;
      else if (c=='=') state = 5;
      else if (c=='>') state = 6;
      else state = fail();
      break;
    case 1:
      …
    case 9: c = nextchar();
      if (isletter(c)) state = 10;
      else state = fail();
      break;
    case 10: c = nextchar();
      if (isletter(c)) state = 10;
      else if (isdigit(c)) state = 10;
      else state = 11;
      break;
    …

/* Decides the next start state to check */
int fail()
{ forward = token_beginning;
  switch (start) {
  case 0: start = 9; break;
  case 9: start = 12; break;
  case 12: start = 20; break;
  case 20: start = 25; break;
  case 25: recover(); break;
  default: /* error */ break;
  }
  return start;
}
[Figure: the C compiler translates lex.yy.c into a.out; a.out translates the input stream into a sequence of tokens]
Lex Specification
• A lex specification consists of three parts:
regular definitions, C declarations in
%{ %}
%%
translation rules
%%
user-defined auxiliary procedures
• The translation rules are of the form:
p1 { action1 }
p2 { action2 }
…
pn { actionn }
Regular Expressions in Lex
x match the character x
\. match the character .
"string" match contents of string of characters
. match any character except newline
^ match beginning of a line
$ match the end of a line
[xyz] match one character x, y, or z (use \ to escape -)
[^xyz] match any character except x, y, and z
[a-z] match one of a to z
r* closure (match zero or more occurrences)
r+ positive closure (match one or more occurrences)
r? optional (match zero or one occurrence)
r1r2 match r1 then r2 (concatenation)
r1|r2 match r1 or r2 (union)
( r ) grouping
r1/r2 match r1 when followed by r2
{d} match the regular expression defined by d
Example Lex Specification 1
%{
#include <stdio.h>
%}
%%
[0-9]+  { printf("%s\n", yytext); }
.|\n    { }
%%
int main(void)
{ yylex();
  return 0;
}
Callouts: the lines between the two %% markers are the translation rules; yytext contains the matching lexeme; yylex() invokes the lexical analyzer.
lex spec.l
gcc lex.yy.c -ll
./a.out < spec.l
Example Lex Specification 2
%{
#include <stdio.h>
int ch = 0, wd = 0, nl = 0;
%}
delim     [ \t]+
%%
\n        { ch++; wd++; nl++; }
^{delim}  { ch+=yyleng; }
{delim}   { ch+=yyleng; wd++; }
.         { ch++; }
%%
int main(void)
{ yylex();
  printf("%8d%8d%8d\n", nl, wd, ch);
  return 0;
}
Callouts: delim is a regular definition; the lines between the two %% markers are the translation rules.
Example Lex Specification 3
%{
#include <stdio.h>
%}
digit     [0-9]
letter    [A-Za-z]
id        {letter}({letter}|{digit})*
%%
{digit}+  { printf("number: %s\n", yytext); }
{id}      { printf("ident: %s\n", yytext); }
.         { printf("other: %s\n", yytext); }
%%
int main(void)
{ yylex();
  return 0;
}
Callouts: digit, letter, and id are regular definitions; the lines between the two %% markers are the translation rules.
Example Lex Specification 4
%{ /* definitions of manifest constants */
#define LT (256)
…
%}
delim     [ \t\n]
ws        {delim}+
letter    [A-Za-z]
digit     [0-9]
id        {letter}({letter}|{digit})*
number    {digit}+(\.{digit}+)?(E[+\-]?{digit}+)?
%%
{ws}      { }
if        {return IF;}
then      {return THEN;}
else      {return ELSE;}
{id}      {yylval = install_id(); return ID;}
{number}  {yylval = install_num(); return NUMBER;}
"<"       {yylval = LT; return RELOP;}
"<="      {yylval = LE; return RELOP;}
"="       {yylval = EQ; return RELOP;}
"<>"      {yylval = NE; return RELOP;}
">"       {yylval = GT; return RELOP;}
">="      {yylval = GE; return RELOP;}
%%
Callouts: each return statement returns a token to the parser; yylval holds the token attribute; Install yytext as …
Design of a Lexical
Analyzer Generator
• Translate regular expressions to NFA
• Translate NFA to an efficient DFA (optional)
[Figure: regular expressions → NFA → DFA]
Nondeterministic
Finite Automata
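• An NFA is a 5-tuple (S, Σ, δ, s0, F) where
S is a finite set of states
Σ is a finite set of input symbols (the alphabet)
δ maps a state and an input symbol (or ε) to a set of states
s0 ∈ S is the start state
F ⊆ S is the set of accepting states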
Transition Graph
• An NFA can be
diagrammatically represented
by a labeled directed graph
called a transition graph
S = {0,1,2,3}
Σ = {a,b}
s0 = 0
F = {3}
[Transition graph: start → 0; 0 --a--> 0; 0 --b--> 0; 0 --a--> 1; 1 --b--> 2; 2 --b--> 3]
Transition Table
• The mapping δ of an NFA can be represented in a transition table
δ(0,a) = {0,1}
δ(0,b) = {0}
δ(1,b) = {2}
δ(2,b) = {3}

State   Input a   Input b
0       {0,1}     {0}
1                 {2}
2                 {3}
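One way to store such a table in a program is to encode each set of states as a bit mask. The sketch below is an illustration under that assumption, not the slides' own code; the names `StateSet`, `SET`, `delta`, and `nfa_move` are hypothetical.

```c
#include <stdint.h>

typedef uint32_t StateSet;                 /* bit i set <=> state i is in the set */
#define SET(s) ((StateSet)1u << (s))

enum { NSTATES = 4, NSYMS = 2 };           /* states 0..3, input symbols a and b */

/* delta[state][symbol]; column 0 is input a, column 1 is input b.
   An empty table entry is the empty set, encoded as 0. */
static const StateSet delta[NSTATES][NSYMS] = {
    /* 0 */ { SET(0) | SET(1), SET(0) },
    /* 1 */ { 0,               SET(2) },
    /* 2 */ { 0,               SET(3) },
    /* 3 */ { 0,               0      },
};

/* All states reachable from some state in T on input symbol c. */
StateSet nfa_move(StateSet T, char c)
{
    int sym = (c == 'a') ? 0 : (c == 'b') ? 1 : -1;
    StateSet result = 0;
    if (sym < 0)
        return 0;                          /* symbol outside the alphabet */
    for (int s = 0; s < NSTATES; s++)
        if (T & SET(s))
            result |= delta[s][sym];
    return result;
}
```

With this encoding, set union is a bitwise OR and the empty set is 0, which keeps the table lookup for a whole set of states to one short loop.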
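Design of a Lexical Analyzer Generator: RE to NFA to DFA
Lex specification with regular expressions
p1 { action1 }
p2 { action2 }
…
pn { actionn }
[Figure: an NFA with start state s0 and an ε-transition to each of N(p1), N(p2), …, N(pn), whose accepting states carry action1, action2, …, actionn; subset construction turns this NFA into a DFA]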
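From Regular Expression to NFA (Thompson's Construction)
ε       start → i --ε--> f
a       start → i --a--> f
r1|r2   start → i; ε-transitions from i into N(r1) and N(r2), and ε-transitions from both into f
r1r2    start → i; N(r1) followed by N(r2), ending in f
r*      start → i; ε-transitions from i into N(r) and from N(r) into f, an ε-transition from the end of N(r) back to its start, and an ε-transition from i directly to f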
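Combining the NFAs of a Set of Regular Expressions
a { action1 }
abb { action2 }
a*b+ { action3 }
[Figure: the three separate NFAs: 1 --a--> 2; 3 --a--> 4 --b--> 5 --b--> 6; 7 --b--> 8 with a loop on a at state 7 and a loop on b at state 8]
[Figure: the combined NFA: a new start state 0 with ε-transitions to states 1, 3, and 7]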
Simulating the Combined NFA Example 1
[Figure: the combined NFA; state 2 carries action1, state 6 carries action2, state 8 carries action3]
Input: a a b a
{0,1,3,7} --a--> {2,4,7} --a--> {7} --b--> {8} --a--> none
State 8 is accepting: action3
Must find the longest match:
Continue until no further moves are possible
When last state is accepting: execute action
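Simulating the Combined NFA Example 2
[Figure: the combined NFA; state 2 carries action1, state 6 carries action2, state 8 carries action3]
Input: a b b a
{0,1,3,7} --a--> {2,4,7} --b--> {5,8} --b--> {6,8} --a--> none
State 6 carries action2 and state 8 carries action3
When two or more accepting states are reached, the
first action given in the Lex specification is executed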
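The two simulation examples can be traced in code. The following is a minimal sketch in C that hard-codes the combined NFA for a, abb, and a*b+; the names `longest_match`, `move`, and `action_of` are hypothetical. As an assumption that goes slightly beyond the slides, the sketch remembers the most recent accepting set, so that a lexeme can still be reported when the last non-empty set is not accepting.

```c
#include <stddef.h>

typedef unsigned StateSet;                 /* bit i set <=> NFA state i is in the set */
#define SET(s) (1u << (s))

/* move on the combined NFA for a, abb, a*b+ (states 0..8).
   The only epsilon-transitions leave state 0, so no closure is needed
   once the start set {0,1,3,7} has been formed. */
static StateSet move(StateSet T, char c)
{
    StateSet U = 0;
    if (c == 'a') {
        if (T & SET(1)) U |= SET(2);
        if (T & SET(3)) U |= SET(4);
        if (T & SET(7)) U |= SET(7);
    } else if (c == 'b') {
        if (T & SET(4)) U |= SET(5);
        if (T & SET(5)) U |= SET(6);
        if (T & SET(7)) U |= SET(8);
        if (T & SET(8)) U |= SET(8);
    }
    return U;
}

/* Action of the first rule whose accepting state is in T, or 0 if none. */
static int action_of(StateSet T)
{
    if (T & SET(2)) return 1;              /* a    */
    if (T & SET(6)) return 2;              /* abb  */
    if (T & SET(8)) return 3;              /* a*b+ */
    return 0;
}

/* Hypothetical helper: scan the longest lexeme at the start of input.
   Returns the action number (0 if nothing matches) and stores the
   lexeme length in *len. */
int longest_match(const char *input, size_t *len)
{
    StateSet S = SET(0) | SET(1) | SET(3) | SET(7);   /* epsilon-closure({0}) */
    int last_action = 0;
    size_t i = 0;
    *len = 0;
    while (input[i] != '\0') {
        S = move(S, input[i]);
        if (S == 0)
            break;                         /* no further moves are possible */
        i++;
        if (action_of(S) != 0) {           /* remember the last accepting set */
            last_action = action_of(S);
            *len = i;
        }
    }
    return last_action;
}
```

Checking the rules in specification order inside `action_of` is what gives abb priority over a*b+ when both state 6 and state 8 are reached.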
Deterministic Finite
Automata
• A deterministic finite automaton is a
special case of an NFA
– No state has an ε-transition
– For each state s and input symbol a there
is at most one edge labeled a leaving s
• Each entry in the transition table is
a single state
– At most one path exists to accept a
string
– Simulation algorithm is simple
Example DFA
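[Figure: DFA with start state 0 and states 0, 1, 2, 3]
0 --a--> 1    0 --b--> 0
1 --a--> 1    1 --b--> 2
2 --a--> 1    2 --b--> 3
3 --a--> 1    3 --b--> 0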
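Because each table entry of a DFA is a single state, simulating it takes one lookup per input character. The sketch below assumes that the example DFA is the usual four-state DFA for (a|b)*abb with start state 0 and accepting state 3; the names `dtran` and `dfa_accepts` are hypothetical.

```c
#include <stdbool.h>

/* Transition table of the example DFA: dtran[state][symbol],
   column 0 is input a, column 1 is input b. */
static const int dtran[4][2] = {
    /* 0 */ { 1, 0 },
    /* 1 */ { 1, 2 },
    /* 2 */ { 1, 3 },
    /* 3 */ { 1, 0 },
};

/* Hypothetical helper: true if the DFA accepts the whole string x. */
bool dfa_accepts(const char *x)
{
    int state = 0;                         /* start state */
    for (; *x != '\0'; x++) {
        if (*x == 'a')
            state = dtran[state][0];
        else if (*x == 'b')
            state = dtran[state][1];
        else
            return false;                  /* symbol outside the alphabet */
    }
    return state == 3;                     /* assumed accepting state */
}
```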
Conversion of an NFA
into a DFA
• The subset construction algorithm converts
an NFA into a DFA using:
ε-closure(s) = {s} ∪ {t | s →ε … →ε t}
ε-closure(T) = ∪ over s ∈ T of ε-closure(s)
move(T,a) = {t | s →a t and s ∈ T}
• The algorithm produces:
Dstates is the set of states of the new
DFA consisting of sets of states of the
NFA
Dtran is the transition table of the new
DFA
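The two set operations can be sketched in C with bit masks. This is an illustration, not the slides' implementation: it assumes the 11-state Thompson NFA for (a|b)*abb, and the names `eps`, `on_a`, `on_b`, `eps_closure`, and `move` are hypothetical.

```c
#include <stddef.h>

typedef unsigned StateSet;                 /* bit i set <=> NFA state i is in the set */
#define SET(s) (1u << (s))

enum { NSTATES = 11 };                     /* assumed NFA for (a|b)*abb, states 0..10 */

/* eps[s]: states reachable from s over a single epsilon-edge. */
static const StateSet eps[NSTATES] = {
    [0] = SET(1) | SET(7), [1] = SET(2) | SET(4),
    [3] = SET(6), [5] = SET(6), [6] = SET(1) | SET(7),
};
/* on_a[s] and on_b[s]: states reachable from s on input a or b. */
static const StateSet on_a[NSTATES] = { [2] = SET(3), [7] = SET(8) };
static const StateSet on_b[NSTATES] = { [4] = SET(5), [8] = SET(9), [9] = SET(10) };

/* epsilon-closure(T): T plus every state reachable over epsilon-edges only. */
StateSet eps_closure(StateSet T)
{
    StateSet closure = T;
    StateSet work = T;                     /* states whose epsilon-edges are unexplored */
    while (work != 0) {
        StateSet next = 0;
        for (int s = 0; s < NSTATES; s++)
            if (work & SET(s))
                next |= eps[s];
        work = next & ~closure;            /* keep only newly discovered states */
        closure |= next;
    }
    return closure;
}

/* move(T, c): states reachable from some state in T on input symbol c. */
StateSet move(StateSet T, char c)
{
    const StateSet *edge = (c == 'a') ? on_a : (c == 'b') ? on_b : NULL;
    StateSet U = 0;
    if (edge == NULL)
        return 0;                          /* symbol outside the alphabet */
    for (int s = 0; s < NSTATES; s++)
        if (T & SET(s))
            U |= edge[s];
    return U;
}
```

For this NFA, `eps_closure(SET(0))` yields {0,1,2,4,7}, and applying `eps_closure(move(..., 'a'))` to that set yields {1,2,3,4,6,7,8}.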
Simulating an NFA using ε-closure and move
S := ε-closure({s0})
Sprev := ∅
a := nextchar()
while S ≠ ∅ do
    Sprev := S
    S := ε-closure(move(S,a))
    a := nextchar()
end do
if Sprev ∩ F ≠ ∅ then
    execute action in Sprev
    return "yes"
else return "no"
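The loop above translates almost line by line into C. The sketch below is one possible rendering, assuming the 11-state Thompson NFA for (a|b)*abb with accepting state 10 and treating the string terminator as the end of input; `nfa_simulate` and the table names are hypothetical, and the "execute action" step is left out.

```c
#include <stdbool.h>
#include <stddef.h>

typedef unsigned StateSet;                 /* bit i set <=> NFA state i is in the set */
#define SET(s) (1u << (s))
#define FINAL SET(10)                      /* assumed set F of accepting states */

enum { NSTATES = 11 };                     /* assumed NFA for (a|b)*abb */

static const StateSet eps[NSTATES] = {
    [0] = SET(1) | SET(7), [1] = SET(2) | SET(4),
    [3] = SET(6), [5] = SET(6), [6] = SET(1) | SET(7),
};
static const StateSet on_a[NSTATES] = { [2] = SET(3), [7] = SET(8) };
static const StateSet on_b[NSTATES] = { [4] = SET(5), [8] = SET(9), [9] = SET(10) };

static StateSet eps_closure(StateSet T)
{
    StateSet closure = T, work = T;
    while (work != 0) {
        StateSet next = 0;
        for (int s = 0; s < NSTATES; s++)
            if (work & SET(s))
                next |= eps[s];
        work = next & ~closure;            /* newly discovered states only */
        closure |= next;
    }
    return closure;
}

static StateSet move(StateSet T, char c)
{
    const StateSet *edge = (c == 'a') ? on_a : (c == 'b') ? on_b : NULL;
    StateSet U = 0;
    if (edge == NULL)
        return 0;                          /* end of input or unknown symbol */
    for (int s = 0; s < NSTATES; s++)
        if (T & SET(s))
            U |= edge[s];
    return U;
}

/* Hypothetical rendering of the simulation loop: "yes" is true, "no" is false. */
bool nfa_simulate(const char *input)
{
    StateSet S = eps_closure(SET(0));
    StateSet Sprev = 0;
    char a = *input;                       /* a := nextchar() */
    while (S != 0) {
        Sprev = S;
        S = eps_closure(move(S, a));
        if (a != '\0')
            a = *++input;                  /* a := nextchar() */
    }
    return (Sprev & FINAL) != 0;
}
```

Note that the loop stops at the first character with no move, so the answer describes the longest run of input the NFA could follow, not necessarily the whole string.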
The Subset
Construction Algorithm
Initially, ε-closure(s0) is the only state in Dstates and it is unmarked
while there is an unmarked state T in Dstates do
    mark T
    for each input symbol a ∈ Σ do
        U := ε-closure(move(T,a))
        if U is not in Dstates then
            add U as an unmarked state to Dstates
        end if
        Dtran[T,a] := U
    end do
end do
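The algorithm can be sketched in C as a worklist over an array of state sets. The example below is an illustration under stated assumptions: it uses the 11-state Thompson NFA for (a|b)*abb, and `subset_construction`, `dstates`, `dtran`, and `MAXD` are hypothetical names. Marking is implicit: every entry below the loop index `t` counts as marked.

```c
#include <stddef.h>

typedef unsigned StateSet;                 /* bit i set <=> NFA state i is in the set */
#define SET(s) (1u << (s))

enum { NSTATES = 11, MAXD = 32 };          /* MAXD is ample for this small NFA */

static const StateSet eps[NSTATES] = {
    [0] = SET(1) | SET(7), [1] = SET(2) | SET(4),
    [3] = SET(6), [5] = SET(6), [6] = SET(1) | SET(7),
};
static const StateSet on_a[NSTATES] = { [2] = SET(3), [7] = SET(8) };
static const StateSet on_b[NSTATES] = { [4] = SET(5), [8] = SET(9), [9] = SET(10) };

StateSet dstates[MAXD];                    /* Dstates: each DFA state is a set of NFA states */
int dtran[MAXD][2];                        /* Dtran: column 0 is input a, column 1 is input b */

static StateSet eps_closure(StateSet T)
{
    StateSet closure = T, work = T;
    while (work != 0) {
        StateSet next = 0;
        for (int s = 0; s < NSTATES; s++)
            if (work & SET(s))
                next |= eps[s];
        work = next & ~closure;
        closure |= next;
    }
    return closure;
}

static StateSet move(StateSet T, int sym)
{
    const StateSet *edge = (sym == 0) ? on_a : on_b;
    StateSet U = 0;
    for (int s = 0; s < NSTATES; s++)
        if (T & SET(s))
            U |= edge[s];
    return U;
}

/* Fills dstates and dtran; returns the number of DFA states. */
int subset_construction(void)
{
    int n = 0;
    dstates[n++] = eps_closure(SET(0));
    for (int t = 0; t < n; t++) {          /* entries with index >= t are unmarked */
        for (int sym = 0; sym < 2; sym++) {
            StateSet U = eps_closure(move(dstates[t], sym));
            int u = 0;
            while (u < n && dstates[u] != U)
                u++;                       /* look U up in Dstates */
            if (u == n)
                dstates[n++] = U;          /* add U as an unmarked state */
            dtran[t][sym] = u;
        }
    }
    return n;
}
```

For this NFA the sketch finds five DFA states, starting with {0,1,2,4,7}.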
Subset Construction Example 1
[Figure: NFA with states 0 to 10 and edges labeled a, b, and ε]
Dstates
A = {0,1,2,4,7}
B = {1,2,3,4,6,7,8}
C = {1,2,4,5,6,7}
D = {1,2,4,5,6,7,9}
E = {1,2,4,5,6,7,10}
[Figure: the resulting DFA with start state A]
State   a   b
A       B   C
B       B   D
C       B   C
D       B   E
E       B   C
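Subset Construction Example 2
[Figure: the combined NFA for a, abb, a*b+ with states 0 to 8; state 2 carries a1, state 6 carries a2, state 8 carries a3]
Dstates
A = {0,1,3,7}
B = {2,4,7}
C = {8}
D = {7}
E = {5,8}
F = {6,8}
[Figure: the resulting DFA with start state A]
State   a   b   Actions
A       B   C
B       D   E   a1
C           C   a3
D       D   C
E           F   a3
F           C   a2 a3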
[Figure: a DFA with states A, B, C, D, E (left) and a DFA in which A and C are merged into a single state AC (right)]
From Regular Expression to DFA Directly
• The "important states" of an NFA are those
with a non-ε out-transition, that is, if
move({s},a) ≠ ∅ for some a then s
is an important state
• The subset construction algorithm
uses only the important states
when it determines
ε-closure(move(T,a))
From Regular Expression to DFA Directly (Algorithm)
• Augment the regular expression r
with a special end symbol # to
make accepting states important:
the new expression is r#
• Construct a syntax tree for r#
• Traverse the tree to construct
functions nullable, firstpos,
lastpos, and followpos
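From Regular Expression to DFA Directly: Syntax Tree of (a|b)*abb#
[Figure: syntax tree of (a|b)*abb#. The root is a concatenation node whose right child is the leaf # (position 6). Down the left spine, further concatenation nodes have the right children b (5), b (4), and a (3). The innermost left child is the closure node *, whose child is the alternation node | with leaves a (1) and b (2). The numbers are the position numbers of the leaves.]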
From Regular Expression to DFA Directly: Annotating the Tree
• nullable(n): the subtree at node n
generates languages including the
empty string
• firstpos(n): the set of positions that can
match the first symbol of a string
generated by the subtree at node n
• lastpos(n): the set of positions that
can match the last symbol of a string
generated by the subtree at node n
• followpos(i): the set of positions
that can follow position i in the tree
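From Regular Expression to DFA Directly: Annotating the Tree
Node n: Leaf ε
    nullable(n) = true; firstpos(n) = ∅; lastpos(n) = ∅
Node n: Leaf at position i
    nullable(n) = false; firstpos(n) = {i}; lastpos(n) = {i}
Node n: c1 | c2
    nullable(n) = nullable(c1) or nullable(c2)
    firstpos(n) = firstpos(c1) ∪ firstpos(c2)
    lastpos(n) = lastpos(c1) ∪ lastpos(c2)
Node n: c1 c2 (cat-node)
    nullable(n) = nullable(c1) and nullable(c2)
    firstpos(n) = if nullable(c1) then firstpos(c1) ∪ firstpos(c2) else firstpos(c1)
    lastpos(n) = if nullable(c2) then lastpos(c1) ∪ lastpos(c2) else lastpos(c2)
Node n: c1* (star-node)
    nullable(n) = true; firstpos(n) = firstpos(c1); lastpos(n) = lastpos(c1)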
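From Regular Expression to DFA Directly: Syntax Tree of (a|b)*abb#
[Figure: the syntax tree with firstpos written to the left and lastpos to the right of each node. Among the annotations: root {1, 2, 3} and {6}; leaf a at position 3: {3} and {3}; closure node *: {1, 2} and {1, 2}; alternation node |: {1, 2} and {1, 2}]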
From Regular Expression to DFA Directly: followpos
for each node n in the tree do
    if n is a cat-node with left child c1 and right child c2 then
        for each i in lastpos(c1) do
            followpos(i) := followpos(i) ∪ firstpos(c2)
        end do
    else if n is a star-node
        for each i in lastpos(n) do
            followpos(i) := followpos(i) ∪ firstpos(n)
        end do
    end if
end do
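The traversal can be sketched in C as a post-order walk that annotates each node and updates followpos on the way. The example below is a sketch under assumptions: it builds the syntax tree of (a|b)*abb# by hand, encodes position sets as bit masks, and uses hypothetical names (`Node`, `annotate`, `build_example`). The nullable, firstpos, and lastpos rules are the standard ones for leaf, alternation, concatenation, and star nodes.

```c
#include <stdbool.h>
#include <stddef.h>

typedef unsigned PosSet;                   /* bit i set <=> position i is in the set */
#define POS(i) (1u << (i))

typedef enum { LEAF, CAT, OR, STAR } Kind;

typedef struct Node {
    Kind kind;
    int pos;                               /* leaf position (LEAF only) */
    struct Node *c1, *c2;                  /* children (c2 is unused for STAR) */
    bool nullable;
    PosSet firstpos, lastpos;
} Node;

PosSet followpos[8];                       /* indexed by position 1..6 */

/* Post-order traversal: annotate n and update followpos. */
void annotate(Node *n)
{
    if (n->kind == LEAF) {
        n->nullable = false;
        n->firstpos = n->lastpos = POS(n->pos);
        return;
    }
    annotate(n->c1);
    if (n->kind != STAR)
        annotate(n->c2);
    switch (n->kind) {
    case OR:
        n->nullable = n->c1->nullable || n->c2->nullable;
        n->firstpos = n->c1->firstpos | n->c2->firstpos;
        n->lastpos  = n->c1->lastpos  | n->c2->lastpos;
        break;
    case CAT:
        n->nullable = n->c1->nullable && n->c2->nullable;
        n->firstpos = n->c1->firstpos | (n->c1->nullable ? n->c2->firstpos : 0);
        n->lastpos  = n->c2->lastpos  | (n->c2->nullable ? n->c1->lastpos : 0);
        for (int i = 1; i < 8; i++)        /* cat-node rule */
            if (n->c1->lastpos & POS(i))
                followpos[i] |= n->c2->firstpos;
        break;
    case STAR:
        n->nullable = true;
        n->firstpos = n->c1->firstpos;
        n->lastpos  = n->c1->lastpos;
        for (int i = 1; i < 8; i++)        /* star-node rule */
            if (n->lastpos & POS(i))
                followpos[i] |= n->firstpos;
        break;
    default:
        break;
    }
}

/* Build the syntax tree of (a|b)*abb#, annotate it, and return the root. */
Node *build_example(void)
{
    static Node a1 = { LEAF, 1 }, b2 = { LEAF, 2 }, a3 = { LEAF, 3 };
    static Node b4 = { LEAF, 4 }, b5 = { LEAF, 5 }, end6 = { LEAF, 6 };
    static Node alt  = { OR,   0, &a1,   &b2 };
    static Node star = { STAR, 0, &alt,  NULL };
    static Node cat1 = { CAT,  0, &star, &a3 };
    static Node cat2 = { CAT,  0, &cat1, &b4 };
    static Node cat3 = { CAT,  0, &cat2, &b5 };
    static Node root = { CAT,  0, &cat3, &end6 };
    annotate(&root);
    return &root;
}
```

After `build_example()` returns, `followpos[1]` and `followpos[2]` both hold {1,2,3}, and `followpos[3]`, `followpos[4]`, `followpos[5]` hold {4}, {5}, {6}.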
From Regular Expression to DFA Directly: Algorithm
s0 := firstpos(root) where root is the root of the syntax tree
Dstates := {s0} and is unmarked
while there is an unmarked state T in Dstates do
    mark T
    for each input symbol a ∈ Σ do
        let U be the set of positions that are in followpos(p)
            for some position p in T,
            such that the symbol at position p is a
        if U is not empty and not in Dstates then
            add U as an unmarked state to Dstates
        end if
        Dtran[T,a] := U
    end do
end do
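A minimal sketch of this algorithm in C follows, assuming the followpos values and leaf symbols of (a|b)*abb# are already known and stored in small tables. The names `build_dfa`, `symbol_at`, `dstates`, and `dtran` are hypothetical, and the value -1 for "no transition" is a choice made for this sketch.

```c
typedef unsigned PosSet;                   /* bit i set <=> position i is in the set */
#define POS(i) (1u << (i))

enum { NPOS = 6, MAXD = 16 };              /* MAXD is ample for this small example */

/* followpos and the symbol at each position for (a|b)*abb#; index 0 is unused. */
static const PosSet followpos[NPOS + 1] = {
    0,
    POS(1) | POS(2) | POS(3),              /* position 1 */
    POS(1) | POS(2) | POS(3),              /* position 2 */
    POS(4), POS(5), POS(6),                /* positions 3, 4, 5 */
    0                                      /* position 6 is the end marker # */
};
static const char symbol_at[NPOS + 1] = { 0, 'a', 'b', 'a', 'b', 'b', '#' };

PosSet dstates[MAXD];                      /* Dstates: each DFA state is a set of positions */
int dtran[MAXD][2];                        /* Dtran: column 0 is a, column 1 is b; -1 = none */

/* Builds the DFA starting from firstpos(root); returns the number of states. */
int build_dfa(PosSet firstpos_root)
{
    static const char alphabet[2] = { 'a', 'b' };
    int n = 0;
    dstates[n++] = firstpos_root;
    for (int t = 0; t < n; t++) {          /* entries with index >= t are unmarked */
        for (int sym = 0; sym < 2; sym++) {
            PosSet U = 0;
            for (int p = 1; p <= NPOS; p++)
                if ((dstates[t] & POS(p)) && symbol_at[p] == alphabet[sym])
                    U |= followpos[p];
            if (U == 0) {
                dtran[t][sym] = -1;        /* empty set: no transition */
                continue;
            }
            int u = 0;
            while (u < n && dstates[u] != U)
                u++;                       /* look U up in Dstates */
            if (u == n)
                dstates[n++] = U;          /* add U as an unmarked state */
            dtran[t][sym] = u;
        }
    }
    return n;
}
```

Calling `build_dfa(POS(1) | POS(2) | POS(3))` produces four states; a state is accepting when it contains position 6, the position of #.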
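From Regular Expression to DFA Directly: Example
Node   followpos
1      {1, 2, 3}
2      {1, 2, 3}
3      {4}
4      {5}
5      {6}
6      -
[Figure: followpos drawn as a graph over the positions 1 to 6]
[Figure: the resulting DFA with start state {1,2,3}]
State        a            b
{1,2,3}      {1,2,3,4}    {1,2,3}
{1,2,3,4}    {1,2,3,4}    {1,2,3,5}
{1,2,3,5}    {1,2,3,4}    {1,2,3,6}
{1,2,3,6}    {1,2,3,4}    {1,2,3}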
Time-Space Tradeoffs
Automaton   Space (worst case)   Time (worst case)
NFA         O(|r|)               O(|r|×|x|)
DFA         O(2^|r|)             O(|x|)