[Figure: source program → Lexical Analyzer → token → Parser]
Token
Since a token can represent more than one lexeme, additional information must be held
for the specific lexeme. This additional information is called the attribute of the token.
For simplicity, a token may have a single attribute that holds the required information
for that token.
For identifiers, this attribute is a pointer to the symbol table, and the symbol table holds the actual attributes for
that token.
Some attributes:
<id,attr>     where attr is a pointer to the symbol table
<assgop,_>    no attribute is needed (if there is only one assignment operator)
<num,val>     where val is the actual value of the number
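As an illustration of the <name,attribute> pairs above, here is a minimal Python sketch. The Token class, the make_id helper and the list-based symbol_table are assumptions made for this example only; the index into the list plays the role of the pointer to the symbol table.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Token:
    kind: str                   # token name, e.g. "id", "num", "assgop"
    attr: Optional[int] = None  # the single attribute, or None if not needed


# Hypothetical symbol table: a list of identifier lexemes.
symbol_table = []


def make_id(lexeme):
    """Return an <id,attr> token whose attribute is an index into symbol_table."""
    if lexeme not in symbol_table:
        symbol_table.append(lexeme)
    return Token("id", symbol_table.index(lexeme))


# Tokens for a statement such as: newval := 42 ... newval
tokens = [make_id("newval"), Token("assgop"), Token("num", 42), make_id("newval")]
```

Both occurrences of the identifier share one symbol-table entry, while the assignment operator carries no attribute at all.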
Terminology of Languages
Alphabet: a finite set of symbols (e.g. ASCII characters)
String:
Finite sequence of symbols on an alphabet
The terms sentence and word are also used for string
ε is the empty string
|s| is the length of string s.
Operators on Strings:
Concatenation: xy represents the concatenation of strings x and y. sε = s and εs = s
s^n = s s s .. s (n times); s^0 = ε
Operations on Languages
Concatenation:
L1L2 = { s1s2 | s1 ∈ L1 and s2 ∈ L2 }
Union:
L1 ∪ L2 = { s | s ∈ L1 or s ∈ L2 }
Exponentiation:
L^0 = {ε}
L^1 = L
L^2 = LL
Kleene Closure:
L* = union of L^i for all i ≥ 0
Positive Closure:
L+ = union of L^i for all i ≥ 1
Example
L1 = {a,b,c,d}
L2 = {1,2}
L1L2 = {a1,a2,b1,b2,c1,c2,d1,d2}
L1 ∪ L2 = {a,b,c,d,1,2}
L1^3 = all strings of length three (using a,b,c,d)
L1* = all strings using the letters a,b,c,d, including the empty string
L1+ = the same set without the empty string
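The language operations in this example can be tried out on finite sets of Python strings. This is only a sketch: the function names concat, power and closure_up_to are made up here, and because the Kleene and positive closures are infinite, closure_up_to only collects the powers up to a chosen bound k.

```python
def concat(A, B):
    """Concatenation of two languages: every s1s2 with s1 in A and s2 in B."""
    return {x + y for x in A for y in B}


def power(L, n):
    """n-fold concatenation of L with itself; power(L, 0) is {""}, i.e. {ε}."""
    result = {""}
    for _ in range(n):
        result = concat(result, L)
    return result


def closure_up_to(L, k, positive=False):
    """Finite approximation of the closure: union of the powers of L up to k."""
    start = 1 if positive else 0
    out = set()
    for i in range(start, k + 1):
        out |= power(L, i)
    return out


L1 = {"a", "b", "c", "d"}
L2 = {"1", "2"}

concatenation = concat(L1, L2)   # {a1, a2, b1, b2, c1, c2, d1, d2}
union = L1 | L2                  # {a, b, c, d, 1, 2}
cubes = power(L1, 3)             # all strings of length three over a, b, c, d
```

The empty string appears in closure_up_to(L1, k) for any k, but not in closure_up_to(L1, k, positive=True), which mirrors the difference between the Kleene closure and the positive closure.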
CS416 Compiler Design
Regular Expressions
We use regular expressions to describe tokens of a programming
language.
A regular expression is built up of simpler regular expressions (using
defining rules)
Each regular expression denotes a language.
A language denoted by a regular expression is called a regular set.
Regular expression      Language it denotes
ε                       {ε}
a ∈ Σ                   {a}
(r1) | (r2)             L(r1) ∪ L(r2)
(r1) (r2)               L(r1) L(r2)
(r)*                    (L(r))*
(r)                     L(r)
(r)+ = (r)(r)*
(r)? = (r) | ε
ab*|c means (a(b)*)|(c)
Precedence: * highest, concatenation next, | lowest
Ex:
Σ = {0,1}
0|1 => {0,1}
(0|1)(0|1) => {00,01,10,11}
0* => {ε,0,00,000,0000,....}
(0|1)* => all strings of 0s and 1s, including the empty string
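These small expressions can be checked with Python's standard re module, whose syntax for |, * and parentheses coincides with the notation used here. Note that re is a backtracking matcher with many extensions beyond regular sets, so this is only a convenient way to experiment; the helper name in_language is made up for this sketch.

```python
import re


def in_language(pattern, s):
    """True if the whole string s belongs to the language denoted by pattern."""
    return re.fullmatch(pattern, s) is not None


# ab*|c is read as (a(b)*)|(c): either an a followed by any number of b's, or a single c.
checks = {
    ("ab*|c", "abbb"): in_language("ab*|c", "abbb"),
    ("ab*|c", "c"): in_language("ab*|c", "c"),
    ("ab*|c", "ac"): in_language("ab*|c", "ac"),
    ("(0|1)(0|1)", "10"): in_language("(0|1)(0|1)", "10"),
    ("(0|1)(0|1)", "0"): in_language("(0|1)(0|1)", "0"),
    ("0*", ""): in_language("0*", ""),
    ("(0|1)*", "0110"): in_language("(0|1)*", "0110"),
}
```

The string ac is rejected by ab*|c, which is what the precedence rule predicts: it would only be accepted if | bound more tightly than concatenation.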
Regular Definitions
Writing a regular expression for some languages can be difficult, because
their regular expressions can be quite complex. In those cases, we may
use regular definitions.
We can give names to regular expressions, and we can use these names as
symbols to define other regular expressions.
A regular definition is a sequence of definitions of the form:
d1 → r1
d2 → r2
...
dn → rn
where each di is a distinct name and each ri is a regular expression over the symbols in
Σ ∪ {d1,d2,...,di-1}, i.e. the basic symbols and the previously defined names.
Finite Automata
A recognizer for a language is a program that takes a string x, and answers yes if x
is a sentence of that language, and no otherwise.
We call the recognizer of the tokens a finite automaton.
A finite automaton can be deterministic (DFA) or non-deterministic (NFA).
This means that we may use a deterministic or a non-deterministic automaton as a
lexical analyzer.
Both deterministic and non-deterministic finite automata recognize regular sets.
Which one?
deterministic: faster recognizer, but it may take more space
non-deterministic: slower, but it may take less space
Deterministic automata are widely used in lexical analyzers.
First, we define regular expressions for tokens; then we convert them into a DFA to
get a lexical analyzer for our tokens.
Algorithm1: Regular Expression → NFA → DFA (two steps: first to NFA, then to DFA)
Algorithm2: Regular Expression → DFA (directly convert a regular expression into a DFA)
NFA (Example)
[Figure: transition graph and transition table of an example NFA]
In a DFA, for each symbol a and state s, there is at most one edge labeled a leaving s,
i.e. the transition function maps a state-symbol pair to a state (not to a set of states).
Implementing a DFA
Let us assume that the end of a string is marked with a special symbol
(say eos). The algorithm for recognition is as follows (an efficient
implementation):
s ← s0
c ← nextchar
while (c != eos) do
begin
    s ← move(s,c)
    c ← nextchar
end
if (s in F) then        { if s is an accepting state }
    return yes
else
    return no
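The recognition loop above translates almost line by line into Python. In this sketch the end of the Python string stands in for eos, the transition function is a dictionary keyed by (state, character), and a missing entry is treated as rejection; those representation choices, and the name dfa_accepts, are assumptions of the example. The sample table is the two-state DFA for (a|b)*a that these notes derive with the direct construction.

```python
def dfa_accepts(move, start, accepting, text):
    """Simulate a DFA on text and report whether it ends in an accepting state."""
    s = start
    for c in text:                # running out of characters plays the role of eos
        s = move.get((s, c))
        if s is None:             # assumption: no transition means the string is rejected
            return False
    return s in accepting


# DFA for (a|b)*a: start state S1, accepting state S2.
MOVE = {
    ("S1", "a"): "S2",
    ("S1", "b"): "S1",
    ("S2", "a"): "S2",
    ("S2", "b"): "S1",
}

result_aba = dfa_accepts(MOVE, "S1", {"S2"}, "aba")   # ends in S2
result_ab = dfa_accepts(MOVE, "S1", {"S2"}, "ab")     # ends in S1
```

Each input character costs one dictionary lookup, which is why the deterministic recognizer is the fast option.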
Implementing an NFA
S ← ε-closure({s0})
c ← nextchar
while (c != eos) do
begin
    S ← ε-closure(move(S,c))    { set of all states accessible from a state in S by a transition on c }
    c ← nextchar
end
if (S ∩ F != ∅) then
    return yes
else
    return no
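The same loop can be sketched in Python with sets of states. The NFA below is an assumed Thompson-style automaton for (a|b)*a with states 0 to 8; its edges are not listed in these notes and were chosen to agree with the ε-closure values used in the subset-construction trace. The empty string is used as the label of ε-edges, and the names eps_closure, move and nfa_accepts are made up for this example.

```python
EPS = ""  # label used for ε-edges (assumption of this sketch)

# Assumed NFA for (a|b)*a: (state, label) -> set of next states.
NFA = {
    (0, EPS): {1, 7},
    (1, EPS): {2, 4},
    (2, "a"): {3},
    (4, "b"): {5},
    (3, EPS): {6},
    (5, EPS): {6},
    (6, EPS): {1, 7},
    (7, "a"): {8},
}
START, ACCEPTING = 0, {8}


def eps_closure(states):
    """All states reachable from the given states using ε-edges only."""
    closure = set(states)
    stack = list(states)
    while stack:
        s = stack.pop()
        for t in NFA.get((s, EPS), ()):
            if t not in closure:
                closure.add(t)
                stack.append(t)
    return closure


def move(states, c):
    """All states reachable from the given states by one edge labeled c."""
    out = set()
    for s in states:
        out |= NFA.get((s, c), set())
    return out


def nfa_accepts(text):
    S = eps_closure({START})
    for c in text:
        S = eps_closure(move(S, c))
    return bool(S & ACCEPTING)    # accept if S ∩ F is not empty
```

Compared with the DFA loop, every input character now costs a pass over a whole set of states, which is the time-for-space trade-off mentioned earlier.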
[Figure: NFA for r1 | r2, built from N(r1) and N(r2)]
[Figure: NFA for r1 r2, built from N(r1) and N(r2)]
For regular expression r*
[Figure: NFA for r*, built from N(r)]
[Figure: step-by-step NFA construction for (a|b) * a, starting from (a | b)]
S0 = ε-closure({0}) = {0,1,2,4,7}
put S0 into DS as an unmarked state
mark S0
ε-closure(move(S0,a)) = ε-closure({3,8}) = {1,2,3,4,6,7,8} = S1
put S1 into DS
ε-closure(move(S0,b)) = ε-closure({5}) = {1,2,4,5,6,7} = S2
put S2 into DS
transfunc[S0,a] ← S1
transfunc[S0,b] ← S2
mark S1
ε-closure(move(S1,a)) = ε-closure({3,8}) = {1,2,3,4,6,7,8} = S1
ε-closure(move(S1,b)) = ε-closure({5}) = {1,2,4,5,6,7} = S2
transfunc[S1,a] ← S1
transfunc[S1,b] ← S2
mark S2
ε-closure(move(S2,a)) = ε-closure({3,8}) = {1,2,3,4,6,7,8} = S1
ε-closure(move(S2,b)) = ε-closure({5}) = {1,2,4,5,6,7} = S2
transfunc[S2,a] ← S1
transfunc[S2,b] ← S2
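The trace above can be reproduced with a short Python sketch of the subset construction. The NFA edges are an assumption: a Thompson-style automaton for (a|b)*a with states 0 to 8, chosen so that its ε-closures match the sets in the trace. DFA states are stored in a list, so index i corresponds to the name Si; the function and variable names are made up for this example.

```python
EPS = ""  # label used for ε-edges (assumption of this sketch)

# Assumed NFA for (a|b)*a: (state, label) -> set of next states.
NFA = {
    (0, EPS): {1, 7}, (1, EPS): {2, 4}, (2, "a"): {3}, (4, "b"): {5},
    (3, EPS): {6}, (5, EPS): {6}, (6, EPS): {1, 7}, (7, "a"): {8},
}
ALPHABET = ["a", "b"]


def eps_closure(states):
    closure, stack = set(states), list(states)
    while stack:
        s = stack.pop()
        for t in NFA.get((s, EPS), ()):
            if t not in closure:
                closure.add(t)
                stack.append(t)
    return frozenset(closure)


def move(states, c):
    out = set()
    for s in states:
        out |= NFA.get((s, c), set())
    return out


def subset_construction(start):
    s0 = eps_closure({start})
    dstates = [s0]        # DS: index i is the DFA state Si
    unmarked = [s0]
    transfunc = {}
    while unmarked:
        S = unmarked.pop(0)                   # mark S
        for c in ALPHABET:
            T = eps_closure(move(S, c))
            if not T:                         # no move on c from S
                continue
            if T not in dstates:
                dstates.append(T)             # put T into DS as unmarked
                unmarked.append(T)
            transfunc[(dstates.index(S), c)] = dstates.index(T)
    return dstates, transfunc


dstates, transfunc = subset_construction(0)
```

With this NFA the accepting NFA state 8 appears only in dstates[1], so S1 is the single accepting state of the resulting DFA.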
[Figure: transition graph of the resulting DFA with states S0, S1, S2 and edges labeled a and b]
[Figure: syntax tree of the augmented regular expression (a|b)*a#, with leaf positions a = 1, b = 2, a = 3, # = 4]
followpos
Then we define the function followpos for the positions (positions
assigned to leaves).
followpos(i) is the set of positions which can follow
the position i in the strings generated by
the augmented regular expression.
For example, in ( a | b) * a # the leaves are numbered a = 1, b = 2, a = 3, # = 4, and:
followpos(1) = {1,2,3}
followpos(2) = {1,2,3}
followpos(3) = {4}
followpos(4) = {}
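n = leaf labeled ε:
    nullable(n) = true
    firstpos(n) = ∅
    lastpos(n) = ∅
n = leaf labeled with position i:
    nullable(n) = false
    firstpos(n) = {i}
    lastpos(n) = {i}
n = c1 | c2:
    nullable(n) = nullable(c1) or nullable(c2)
    firstpos(n) = firstpos(c1) ∪ firstpos(c2)
    lastpos(n) = lastpos(c1) ∪ lastpos(c2)
n = c1 c2 (concatenation):
    nullable(n) = nullable(c1) and nullable(c2)
    firstpos(n) = if (nullable(c1)) firstpos(c1) ∪ firstpos(c2) else firstpos(c1)
    lastpos(n) = if (nullable(c2)) lastpos(c1) ∪ lastpos(c2) else lastpos(c2)
n = c1*:
    nullable(n) = true
    firstpos(n) = firstpos(c1)
    lastpos(n) = lastpos(c1)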
If firstpos and lastpos have been computed for each node, followpos
of each position can be computed by making one depth-first traversal
of the syntax tree.
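Example -- ( a | b) * a #
[Figure: syntax tree annotated with firstpos (green) and lastpos (blue) at each node]
root (concatenation with #): firstpos = {1,2,3}, lastpos = {4}
concatenation ( a | b) * a: firstpos = {1,2,3}, lastpos = {3}
# (position 4): firstpos = {4}, lastpos = {4}
*: firstpos = {1,2}, lastpos = {1,2}
a (position 3): firstpos = {3}, lastpos = {3}
|: firstpos = {1,2}, lastpos = {1,2}
a (position 1): firstpos = {1}, lastpos = {1}
b (position 2): firstpos = {2}, lastpos = {2}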
Then we can calculate followpos
followpos(1) = {1,2,3}
followpos(2) = {1,2,3}
followpos(3) = {4}
followpos(4) = {}
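A single depth-first traversal that computes nullable, firstpos and lastpos and fills in followpos along the way might look like the sketch below. Two things are assumptions here. First, the tree encoding as nested tuples and the function name analyze are invented for the example. Second, the two followpos rules used are the standard ones from the textbook construction and are not spelled out in these notes: at a concatenation node, every position in lastpos(c1) is followed by firstpos(c2); at a star node, every position in lastpos(c1) is followed by firstpos(c1).

```python
def analyze(node, followpos):
    """Return (nullable, firstpos, lastpos) of node and update followpos in place."""
    kind = node[0]
    if kind == "eps":                          # leaf labeled ε
        return True, set(), set()
    if kind == "leaf":                         # ("leaf", symbol, position)
        i = node[2]
        followpos.setdefault(i, set())
        return False, {i}, {i}
    if kind == "or":                           # ("or", c1, c2)
        n1, f1, l1 = analyze(node[1], followpos)
        n2, f2, l2 = analyze(node[2], followpos)
        return n1 or n2, f1 | f2, l1 | l2
    if kind == "cat":                          # ("cat", c1, c2)
        n1, f1, l1 = analyze(node[1], followpos)
        n2, f2, l2 = analyze(node[2], followpos)
        for i in l1:                           # assumed rule for concatenation
            followpos[i] |= f2
        first = f1 | f2 if n1 else f1
        last = l1 | l2 if n2 else l2
        return n1 and n2, first, last
    if kind == "star":                         # ("star", c1)
        _, f1, l1 = analyze(node[1], followpos)
        for i in l1:                           # assumed rule for star
            followpos[i] |= f1
        return True, f1, l1
    raise ValueError("unknown node kind: %r" % (kind,))


# Syntax tree of (a|b)*a# with positions a=1, b=2, a=3, #=4.
TREE = ("cat",
        ("cat",
         ("star", ("or", ("leaf", "a", 1), ("leaf", "b", 2))),
         ("leaf", "a", 3)),
        ("leaf", "#", 4))

followpos = {}
nullable_root, firstpos_root, lastpos_root = analyze(TREE, followpos)
```

For this tree the computed followpos table agrees with the four values listed above, and firstpos of the root is {1,2,3}.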
Example -- ( a | b) * a #   (positions: a = 1, b = 2, a = 3, # = 4)
followpos(1)={1,2,3}
followpos(2)={1,2,3}
followpos(3)={4}
followpos(4)={}
S1=firstpos(root)={1,2,3}
mark S1
a: followpos(1) ∪ followpos(3)={1,2,3,4}=S2     move(S1,a)=S2
b: followpos(2)={1,2,3}=S1                      move(S1,b)=S1
mark S2
a: followpos(1) ∪ followpos(3)={1,2,3,4}=S2     move(S2,a)=S2
b: followpos(2)={1,2,3}=S1                      move(S2,b)=S1
start state: S1
accepting states: {S2}
[Figure: transition graph of the DFA with states S1 and S2]
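The marking procedure of this example can be sketched in Python as follows. The followpos table, the symbol at each position and firstpos(root) are copied from the example; the function name build_dfa and the dictionary-based representation are assumptions of the sketch. A state is accepting when it contains the position of the end marker #.

```python
FOLLOWPOS = {1: {1, 2, 3}, 2: {1, 2, 3}, 3: {4}, 4: set()}
SYMBOL_AT = {1: "a", 2: "b", 3: "a", 4: "#"}
FIRSTPOS_ROOT = {1, 2, 3}
END_POS = 4   # position of the end marker #


def build_dfa(firstpos_root, followpos, symbol_at, end_pos):
    start = frozenset(firstpos_root)
    states = [start]          # states[0] is S1, states[1] is S2, ...
    unmarked = [start]
    move = {}
    alphabet = sorted(set(symbol_at.values()) - {symbol_at[end_pos]})
    while unmarked:
        S = unmarked.pop(0)                       # mark S
        for c in alphabet:
            # union of followpos(p) over all positions p in S labeled c
            T = set()
            for p in S:
                if symbol_at[p] == c:
                    T |= followpos[p]
            T = frozenset(T)
            if not T:
                continue
            if T not in states:
                states.append(T)
                unmarked.append(T)
            move[(S, c)] = T
    accepting = [S for S in states if end_pos in S]
    return states, move, accepting


states, move, accepting = build_dfa(FIRSTPOS_ROOT, FOLLOWPOS, SYMBOL_AT, END_POS)
```

Only two states are created here, one fewer than the subset construction produced for the same expression.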
Example -- ( a | ε) b c* #   (positions: a = 1, b = 2, c = 3, # = 4)
followpos(1)={2}
followpos(2)={3,4}
followpos(3)={3,4}
followpos(4)={}
S1=firstpos(root)={1,2}
mark S1
a: followpos(1)={2}=S2          move(S1,a)=S2
b: followpos(2)={3,4}=S3        move(S1,b)=S3
mark S2
b: followpos(2)={3,4}=S3        move(S2,b)=S3
mark S3
c: followpos(3)={3,4}=S3        move(S3,c)=S3
start state: S1
accepting states: {S3}
[Figure: transition graph of the DFA with states S1, S2 and S3]
[Figure: example DFA with states 1, 2 and 3]
G1 = {2}
G2 = {1,3}
move(1,a)=2
move(3,a)=2
move(1,b)=3
move(2,b)=3
[Figure: resulting minimized DFA]
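Groups:
{1,2,3} {4}
{1,2} {3} {4}     no more partitioning
Transitions on a: 1->2, 2->2, 3->4
Transitions on b: 1->3, 2->3, 3->3
[Figure: original DFA with states 1 to 4 and the minimized DFA with states {1,2}, {3}, {4}]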
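The grouping in this example can be reproduced with a small partition-refinement sketch. Several things are assumed: state 4 is taken to be the only accepting state (it forms its own initial group), its outgoing edges are left out because they are not listed here, and a missing edge is treated as going to no group. Since state 4 stays alone in its group, those omitted edges cannot change the result. The function names are made up for the example.

```python
def group_index(partition, state):
    """Index of the group containing state, or None for a missing transition."""
    for idx, group in enumerate(partition):
        if state in group:
            return idx
    return None


def minimize(states, alphabet, move, accepting):
    """Split states into groups until no group can be partitioned further."""
    accepting = set(accepting)
    partition = [g for g in (set(states) - accepting, accepting) if g]
    while True:
        refined = []
        for group in partition:
            buckets = {}
            for s in sorted(group):
                # states stay together only if every symbol leads to the same group
                key = tuple(group_index(partition, move.get((s, c))) for c in alphabet)
                buckets.setdefault(key, set()).add(s)
            refined.extend(buckets.values())
        if len(refined) == len(partition):    # no more partitioning
            return refined
        partition = refined


MOVE = {
    (1, "a"): 2, (2, "a"): 2, (3, "a"): 4,
    (1, "b"): 3, (2, "b"): 3, (3, "b"): 3,
}
groups = minimize([1, 2, 3, 4], ["a", "b"], MOVE, accepting={4})
```

Applied to the three-state DFA that the subset construction gave for (a|b)*a, the same function merges S0 and S2, leaving the two states that the direct construction produced.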
ne
new
newv
newva
newval
What is the end of a token? Is there any character which marks the end
of a token?
But