CompilerD L3
Token
• Token represents a set of strings described by a pattern.
– Identifier represents the set of strings that start with a letter and continue with letters and digits
– The actual string (e.g. newval) is called a lexeme.
– Tokens: identifier, number, addop, delimiter, …
• Since a token can represent more than one lexeme, additional information must be
held for that specific lexeme. This additional information is called the attribute of
the token.
• For simplicity, a token may have a single attribute which holds the required
information for that token.
– For identifiers, this attribute is a pointer to the symbol table, and the symbol table holds the actual
attributes for that token.
• Some attributes:
– <id,attr> where attr is a pointer to the symbol table
– <assgop,_> no attribute is needed (if there is only one assignment operator)
– <num,val> where val is the actual value of the number.
• The token type and its attribute together uniquely identify a lexeme.
• Regular expressions are widely used to specify patterns.
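The token, lexeme and attribute terminology above can be illustrated with a minimal sketch. The pattern set, the use of `:=` as the only assignment operator, and the list-based symbol table are assumptions made for the illustration; the attribute of an `id` token is modeled as an index into that table.

```python
import re

# Hypothetical token patterns, named after the <id,attr>, <assgop,_> and
# <num,val> examples. ":=" as the assignment operator is an assumption.
TOKEN_SPEC = [
    ("num", r"\d+"),
    ("id", r"[A-Za-z][A-Za-z0-9]*"),
    ("assgop", r":="),
    ("addop", r"[+-]"),
    ("ws", r"\s+"),
]
MASTER = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in TOKEN_SPEC))

def tokenize(text):
    """Return (tokens, symbol_table); each token is a (type, attribute) pair."""
    symbol_table = []          # identifier lexemes, in order of first appearance
    tokens = []
    pos = 0
    while pos < len(text):
        m = MASTER.match(text, pos)
        if m is None:
            raise ValueError(f"unexpected character {text[pos]!r} at {pos}")
        kind, lexeme = m.lastgroup, m.group()
        pos = m.end()
        if kind == "ws":
            continue                                   # whitespace is not a token
        if kind == "id":
            if lexeme not in symbol_table:
                symbol_table.append(lexeme)
            tokens.append(("id", symbol_table.index(lexeme)))   # pointer into the table
        elif kind == "num":
            tokens.append(("num", int(lexeme)))        # the actual value
        elif kind == "addop":
            tokens.append(("addop", lexeme))           # which additive operator
        else:
            tokens.append((kind, None))                # no attribute needed
    return tokens, symbol_table

tokens, table = tokenize("newval := oldval + 12")
print(tokens)
print(table)
```

Here the lexemes `newval` and `oldval` share the token type `id` and are told apart only by their attribute.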
Terminology of Languages
• Alphabet : a finite set of symbols (ASCII characters)
• String :
– Finite sequence of symbols on an alphabet
– The terms sentence and word are also used for string
– ε is the empty string
– |s| is the length of string s.
• Language: a set of strings over some fixed alphabet
– ∅, the empty set, is a language.
– {ε}, the set containing only the empty string, is a language
– The set of well-formed C programs is a language
– The set of all possible identifiers is a language.
• Operators on Strings:
– Concatenation: xy represents the concatenation of strings x and y. sε = s and εs = s
– s^n = s s s … s (n times), s^0 = ε
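As a small illustration of the string operators, the sketch below models ε as Python's empty string. The helper name `power` is an assumption introduced here.

```python
EPSILON = ""                      # the empty string ε

def power(s: str, n: int) -> str:
    """Return s^n: s concatenated with itself n times; s^0 is ε."""
    result = EPSILON
    for _ in range(n):
        result = result + s
    return result

s = "ab"
print(len(s))                     # |s|
print(power(s, 3))
print(power(s, 0) == EPSILON)
```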
Operations on Languages
• Concatenation:
– L1L2 = { s1s2 | s1 ∈ L1 and s2 ∈ L2 }
• Union
– L1 ∪ L2 = { s | s ∈ L1 or s ∈ L2 }
• Exponentiation:
– L^0 = {ε}, L^1 = L, L^2 = LL
• Kleene Closure
– L* = ⋃_{i ≥ 0} L^i (the union of L^i over all i ≥ 0)
• Positive Closure
– L+ = ⋃_{i ≥ 1} L^i (the union of L^i over all i ≥ 1)
Example
• L1 = {a,b,c,d} L2 = {1,2}
• L1L2 = {a1,a2,b1,b2,c1,c2,d1,d2}
• L1 ∪ L2 = {a,b,c,d,1,2}
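The language operations can be tried out on finite languages with a short sketch. Python sets of strings stand in for languages, and because L* is infinite in general, the closure helper only takes the union up to a chosen power. All function names are assumptions introduced for the illustration.

```python
def concat(l1, l2):
    """L1L2: every string of L1 followed by every string of L2."""
    return {s1 + s2 for s1 in l1 for s2 in l2}

def union(l1, l2):
    """L1 ∪ L2."""
    return l1 | l2

def lang_power(l, n):
    """L^n, with L^0 = {ε}."""
    result = {""}
    for _ in range(n):
        result = concat(result, l)
    return result

def closure_up_to(l, max_power, positive=False):
    """Finite approximation of L* (or L+): union of L^i for i up to max_power."""
    start = 1 if positive else 0
    result = set()
    for i in range(start, max_power + 1):
        result |= lang_power(l, i)
    return result

L1 = {"a", "b", "c", "d"}
L2 = {"1", "2"}
print(sorted(concat(L1, L2)))
print(sorted(union(L1, L2)))
print(sorted(closure_up_to({"0"}, 3)))
```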
Regular Expressions (Rules)
Regular expressions over alphabet Σ
• (r)+ = (r)(r)*
• (r)? = (r) | ε
Regular Expressions (cont.)
• We may remove parentheses by using precedence rules.
– * highest
– concatenation next
– | lowest
• ab*|c means (a(b)*)|(c)
• Ex:
– Σ = {0,1}
– 0|1 => {0,1}
– (0|1)(0|1) => {00,01,10,11}
– 0* => {ε, 0, 00, 000, 0000, ...}
– (0|1)* => all strings with 0 and 1, including the empty string
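The examples above can be checked mechanically. The sketch below uses Python's `re` module, whose `*`, concatenation and `|` follow the same precedence order, and enumerates all strings up to a length bound that match a pattern in full. The helper `language_up_to` is an assumption introduced here.

```python
import re
from itertools import product

def language_up_to(pattern, alphabet, max_len):
    """All strings over `alphabet` of length <= max_len that match `pattern` entirely."""
    regex = re.compile(pattern)
    return [
        "".join(chars)
        for n in range(max_len + 1)
        for chars in product(alphabet, repeat=n)
        if regex.fullmatch("".join(chars))
    ]

print(language_up_to(r"(0|1)(0|1)", "01", 3))
print(language_up_to(r"0*", "01", 4))          # the empty string is printed as ''
# ab*|c is parsed as (a(b)*)|(c): both patterns give the same strings.
print(language_up_to(r"ab*|c", "abc", 3))
print(language_up_to(r"(a(b)*)|(c)", "abc", 3))
```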
Regular Definitions
• Writing a regular expression for some languages can be difficult, because
the expression can become quite complex. In those cases, we may
use regular definitions.
• We can give names to regular expressions, and we can use these names
as symbols to define other regular expressions.
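One way to picture regular definitions is as a table of named patterns in which later entries refer to earlier names. The sketch below assumes the names `letter`, `digit`, `id` and `num`, which are chosen for the illustration, and expands them into Python `re` syntax.

```python
import re

# Each name stands for a regular expression and may be used in later definitions.
definitions = {}
definitions["letter"] = r"[A-Za-z]"
definitions["digit"] = r"[0-9]"
# id -> letter (letter | digit)*
definitions["id"] = "{letter}({letter}|{digit})*".format(**definitions)
# num -> digit+
definitions["num"] = "{digit}+".format(**definitions)

def matches(name, text):
    """True if `text` belongs to the language of the named definition."""
    return re.fullmatch(definitions[name], text) is not None

print(definitions["id"])
print(matches("id", "newval"), matches("id", "9lives"), matches("num", "42"))
```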
Finite Automata
• A recognizer for a language is a program that takes a string x, and answers “yes” if x
is a sentence of that language, and “no” otherwise.
• We call the recognizer of the tokens a finite automaton.
• A finite automaton can be: deterministic(DFA) or non-deterministic (NFA)
• This means that we may use a deterministic or non-deterministic automaton as a
lexical analyzer.
• Both deterministic and non-deterministic finite automata recognize regular sets.
• Which one?
– deterministic – faster recognizer, but it may take more space
– non-deterministic – slower, but it may take less space
– Deterministic automata are widely used in lexical analyzers.
• First, we define regular expressions for tokens; then we convert them into a DFA to
get a lexical analyzer for our tokens.
– Algorithm1: Regular Expression → NFA → DFA (two steps: first to NFA, then to DFA)
– Algorithm2: Regular Expression → DFA (directly convert a regular expression into a DFA)
Non-Deterministic Finite Automaton (NFA)
• A non-deterministic finite automaton (NFA) is a mathematical model
that consists of:
– S - a set of states
– Σ - a set of input symbols (alphabet)
– move – a transition function that maps state-symbol pairs to sets of states.
– s0 - a start (initial) state
– F – a set of accepting states (final states)
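The five components listed above map directly onto a small data structure. The sketch below is one possible encoding, not a definitive one; the example automaton (an NFA for (a|b)*ab with states 0, 1, 2) is an assumption chosen for the illustration, and ε-transitions are left out for brevity.

```python
from dataclasses import dataclass

@dataclass
class NFA:
    states: set        # S
    alphabet: set      # Σ
    move: dict         # (state, symbol) -> set of next states
    start: int         # s0
    accepting: set     # F

    def accepts(self, text):
        """Track every state the NFA could be in after each input symbol."""
        current = {self.start}
        for symbol in text:
            current = {t for s in current
                       for t in self.move.get((s, symbol), set())}
        return bool(current & self.accepting)

# Assumed example: an NFA for (a|b)*ab.
nfa = NFA(
    states={0, 1, 2},
    alphabet={"a", "b"},
    move={(0, "a"): {0, 1}, (0, "b"): {0}, (1, "b"): {2}},
    start=0,
    accepting={2},
)
print(nfa.accepts("babab"), nfa.accepts("aba"))
```

The pair (0, a) maps to two states, which is what makes this automaton non-deterministic.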
NFA (Example)
Deterministic Finite Automaton (DFA)
[Figure: transition diagram of a DFA with states 0, 1, 2 and edges labelled a and b]
The language recognized by this DFA is also (a|b)* a b
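A DFA has exactly one next state per state and symbol, so simulating it is a single table lookup per input character. The transition table below is an assumed DFA for (a|b)*ab with states 0, 1, 2; it is a plausible stand-in for the diagram, not a transcription of it.

```python
# Assumed transition table of a DFA recognizing (a|b)*ab.
DFA_MOVE = {
    (0, "a"): 1, (0, "b"): 0,
    (1, "a"): 1, (1, "b"): 2,
    (2, "a"): 1, (2, "b"): 0,
}

def dfa_accepts(text, move=DFA_MOVE, start=0, accepting=frozenset({2})):
    """Follow one transition per input symbol; reject on a missing entry."""
    state = start
    for symbol in text:
        if (state, symbol) not in move:
            return False              # symbol outside the alphabet
        state = move[(state, symbol)]
    return state in accepting

print(dfa_accepts("abab"), dfa_accepts("aba"))
```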
Converting a Regular Expression into an NFA
(Thompson's Construction)
• This is one way to convert a regular expression into an NFA.
• There can be other (more efficient) ways to do the conversion.
• Thompson's Construction is a simple and systematic method.
It guarantees that the resulting NFA will have exactly one final state
and one start state.
• Construction starts from the simplest parts (alphabet symbols).
To create an NFA for a complex regular expression, the NFAs of its sub-
expressions are combined to create its NFA.
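Thompson's Construction (cont.)
• To recognize an empty string ε
[Figure: NFA with start state i and final state f]
• NFA for r1 | r2
[Figure: NFA with start state i and final state f, built from N(r1) and N(r2)]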
Thompson's Construction (cont.)
• NFA for r1 r2
[Figure: NFA for r1 r2]
• NFA for r*
[Figure: NFA with start state i and final state f, built around N(r)]
Thompson's Construction (Example - (a|b)* a)
[Figures: NFAs built step by step for a, b, (a | b), (a|b)*, and (a|b)* a]
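The step-by-step construction of the NFA for (a|b)* a can be sketched in code. This is a minimal illustration under stated assumptions: fragments are stored as edge lists, `None` stands for ε, and concatenation links the two fragments with an extra ε-edge (textbook presentations usually merge the two states instead). All names are introduced here for the illustration.

```python
from itertools import count

EPS = None                      # label used for ε-transitions
_ids = count()                  # fresh state numbers

class Fragment:
    """An NFA fragment with exactly one start state and one final state."""
    def __init__(self, start, final, edges):
        self.start, self.final, self.edges = start, final, edges  # edges: (src, label, dst)

def symbol(a):
    i, f = next(_ids), next(_ids)
    return Fragment(i, f, [(i, a, f)])

def union(n1, n2):              # r1 | r2
    i, f = next(_ids), next(_ids)
    edges = n1.edges + n2.edges + [
        (i, EPS, n1.start), (i, EPS, n2.start),
        (n1.final, EPS, f), (n2.final, EPS, f)]
    return Fragment(i, f, edges)

def concat(n1, n2):             # r1 r2, linked by an ε-edge (assumption, see lead-in)
    return Fragment(n1.start, n2.final,
                    n1.edges + n2.edges + [(n1.final, EPS, n2.start)])

def star(n):                    # r*
    i, f = next(_ids), next(_ids)
    edges = n.edges + [
        (i, EPS, n.start), (n.final, EPS, f),
        (n.final, EPS, n.start), (i, EPS, f)]
    return Fragment(i, f, edges)

def eps_closure(states, edges):
    stack, seen = list(states), set(states)
    while stack:
        s = stack.pop()
        for src, label, dst in edges:
            if src == s and label is EPS and dst not in seen:
                seen.add(dst)
                stack.append(dst)
    return seen

def accepts(nfa, text):
    current = eps_closure({nfa.start}, nfa.edges)
    for ch in text:
        moved = {dst for src, label, dst in nfa.edges
                 if src in current and label == ch}
        current = eps_closure(moved, nfa.edges)
    return nfa.final in current

# (a|b)* a, built bottom-up from its sub-expressions
nfa = concat(star(union(symbol("a"), symbol("b"))), symbol("a"))
print(accepts(nfa, "abba"), accepts(nfa, "ab"))
```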
Converting an NFA into a DFA (Example)
[Figure: NFA with states 0 to 8 and transitions on a and b, and the resulting DFA with states S0, S1, S2]
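The NFA-to-DFA conversion is commonly done with the subset construction, sketched below. The edge table is an assumed reconstruction of a Thompson-style NFA for (a|b)* a with states 0 to 8, since the original diagram is not available; the function names are also assumptions.

```python
EPS = None  # label for ε-transitions

# Assumed reconstruction of the NFA (states 0..8, accepting state 8).
NFA_EDGES = {
    (0, EPS): {1, 7}, (1, EPS): {2, 4},
    (2, "a"): {3}, (4, "b"): {5},
    (3, EPS): {6}, (5, EPS): {6},
    (6, EPS): {1, 7}, (7, "a"): {8},
}
NFA_START, NFA_ACCEPT, ALPHABET = 0, {8}, ["a", "b"]

def eps_closure(states):
    """All NFA states reachable from `states` using ε-transitions only."""
    stack, closure = list(states), set(states)
    while stack:
        s = stack.pop()
        for t in NFA_EDGES.get((s, EPS), set()):
            if t not in closure:
                closure.add(t)
                stack.append(t)
    return frozenset(closure)

def move(states, symbol):
    return {t for s in states for t in NFA_EDGES.get((s, symbol), set())}

def subset_construction():
    """Each DFA state is a set of NFA states; returns (states, moves, accepting)."""
    start = eps_closure({NFA_START})
    dfa_states, unmarked, dfa_move = [start], [start], {}
    while unmarked:
        current = unmarked.pop()
        for symbol in ALPHABET:
            target = eps_closure(move(current, symbol))
            if not target:
                continue
            if target not in dfa_states:
                dfa_states.append(target)
                unmarked.append(target)
            dfa_move[(dfa_states.index(current), symbol)] = dfa_states.index(target)
    accepting = {i for i, s in enumerate(dfa_states) if s & NFA_ACCEPT}
    return dfa_states, dfa_move, accepting

dfa_states, dfa_move, accepting = subset_construction()
for i, s in enumerate(dfa_states):
    print(f"S{i} = {sorted(s)}")
print(dfa_move, accepting)
```

Under these assumptions the construction yields three DFA states, which the code numbers S0, S1, S2 in order of discovery.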
Converting Regular Expressions Directly to DFAs
• We may convert a regular expression into a DFA (without creating a
NFA first).
• First we augment the given regular expression by concatenating it with
a special symbol #.
r → (r)# augmented regular expression
• Then, we create a syntax tree for this augmented regular expression.
• In this syntax tree, all alphabet symbols (plus # and the empty string) in
the augmented regular expression will be on the leaves, and all inner
nodes will be the operators in that augmented regular expression.
• Then each alphabet symbol (plus #) will be numbered (position
numbers).
DFA based pattern matcher: Regular Expression → DFA (cont.)
Syntax tree of (a|b)* a #
[Figure: syntax tree with leaves a, b, a, # at positions 1, 2, 3, 4]
• each symbol is numbered (positions)
• each symbol is at a leaf
• inner nodes are operators
followpos
For example, (a | b)* a # with positions a = 1, b = 2, a = 3, # = 4
How to evaluate followpos
• Two rules define the function followpos:
• If firstpos and lastpos have been computed for each node, followpos
of each position can be computed by making one depth-first traversal
of the syntax tree.
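The depth-first computation can be sketched as follows. The code assumes the usual textbook form of the two rules: at a concatenation node with children c1 and c2, every position in firstpos(c2) is added to followpos(i) for each i in lastpos(c1); at a star node n, every position in firstpos(n) is added to followpos(i) for each i in lastpos(n). The tuple encoding of the tree is an assumption, and ε leaves are not handled.

```python
from collections import defaultdict

# Syntax tree of (a|b)*a# as nested tuples; leaves carry their position numbers.
TREE = ("cat",
        ("cat",
         ("star", ("or", ("leaf", "a", 1), ("leaf", "b", 2))),
         ("leaf", "a", 3)),
        ("leaf", "#", 4))

followpos = defaultdict(set)

def analyze(node):
    """Return (nullable, firstpos, lastpos) and fill followpos on the way up."""
    kind = node[0]
    if kind == "leaf":
        return False, {node[2]}, {node[2]}
    if kind == "or":
        n1, f1, l1 = analyze(node[1])
        n2, f2, l2 = analyze(node[2])
        return n1 or n2, f1 | f2, l1 | l2
    if kind == "star":
        _, f, l = analyze(node[1])
        for i in l:                      # star rule
            followpos[i] |= f
        return True, f, l
    # kind == "cat"
    n1, f1, l1 = analyze(node[1])
    n2, f2, l2 = analyze(node[2])
    for i in l1:                         # concatenation rule
        followpos[i] |= f2
    first = f1 | f2 if n1 else f1
    last = l1 | l2 if n2 else l2
    return n1 and n2, first, last

nullable, firstpos_root, lastpos_root = analyze(TREE)
print(dict(followpos))
print(firstpos_root)
```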
Example -- ( a | b) * a #
Example -- ( a | b) * a #
positions: a = 1, b = 2, a = 3, # = 4
S1=firstpos(root)={1,2,3}
mark S1
a: followpos(1) ∪ followpos(3)={1,2,3,4}=S2 move(S1,a)=S2
b: followpos(2)={1,2,3}=S1 move(S1,b)=S1
mark S2
a: followpos(1) ∪ followpos(3)={1,2,3,4}=S2 move(S2,a)=S2
b: followpos(2)={1,2,3}=S1 move(S2,b)=S1
[Figure: resulting DFA with states S1 and S2 and edges labelled a and b]
start state: S1
accepting states: {S2}
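The marking procedure of this example can be written as a short loop. The followpos table below is the one consistent with the sets in the example; the dictionary layout and function name are assumptions for the illustration.

```python
# Position data for (a|b)*a#.
SYMBOL_AT = {1: "a", 2: "b", 3: "a", 4: "#"}
FOLLOWPOS = {1: {1, 2, 3}, 2: {1, 2, 3}, 3: {4}, 4: set()}
FIRSTPOS_ROOT = frozenset({1, 2, 3})

def build_dfa(alphabet=("a", "b")):
    """DFA states are sets of positions; the start state is firstpos(root)."""
    states, unmarked, move = [FIRSTPOS_ROOT], [FIRSTPOS_ROOT], {}
    while unmarked:
        s = unmarked.pop()               # "mark" s
        for a in alphabet:
            # union of followpos(p) over the positions p in s that carry symbol a
            target = frozenset(
                q for p in s if SYMBOL_AT[p] == a for q in FOLLOWPOS[p])
            if not target:
                continue
            if target not in states:
                states.append(target)
                unmarked.append(target)
            move[(states.index(s), a)] = states.index(target)
    end_position = next(p for p, c in SYMBOL_AT.items() if c == "#")
    accepting = {i for i, s in enumerate(states) if end_position in s}
    return states, move, accepting

states, move, accepting = build_dfa()
print([sorted(s) for s in states])
print(move, accepting)
```

Index 0 corresponds to S1 and index 1 to S2; a state is accepting when it contains the position of #.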
Minimizing Number of States of a DFA
• partition the set of states into two groups:
– G1 : set of accepting states
– G2 : set of non-accepting states
DFA Minimization
[Figure: minimized DFA with states {1,3} and {2} and edges labelled a and b]
Minimizing DFA – Another Example
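[Figure: DFA with states 1, 2, 3, 4 and edges labelled a and b]
Groups: {1,2,3} {4}
Transitions of the states in {1,2,3}:
  a      b
1->2   1->3
2->2   2->3
3->4   3->3
{1,2,3} is split into {1,2} and {3}; no more partitioning
[Figure: minimized DFA with states {1,2}, {3}, {4}]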
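The partitioning in this example can be reproduced with a small refinement loop. The transitions of states 1 to 3 are taken from the example; the transitions of state 4 and the choice of {4} as the accepting group are assumptions (a singleton group cannot be split further, so they do not change the resulting groups). The function name is also an assumption.

```python
MOVE = {
    (1, "a"): 2, (1, "b"): 3,
    (2, "a"): 2, (2, "b"): 3,
    (3, "a"): 4, (3, "b"): 3,
    (4, "a"): 2, (4, "b"): 3,   # assumption: not given in the example
}
STATES, ALPHABET, ACCEPTING = {1, 2, 3, 4}, ("a", "b"), {4}

def minimize(states, alphabet, accepting, move):
    """Split groups until all states in a group go to the same groups on every symbol."""
    partition = [g for g in (frozenset(accepting), frozenset(states - accepting)) if g]
    while True:
        def group_of(state):
            return next(i for i, g in enumerate(partition) if state in g)
        refined = []
        for group in partition:
            buckets = {}
            for s in sorted(group):
                # states with the same key behave alike with respect to the current partition
                key = tuple(group_of(move[(s, a)]) for a in alphabet)
                buckets.setdefault(key, set()).add(s)
            refined.extend(frozenset(b) for b in buckets.values())
        if len(refined) == len(partition):
            return sorted(sorted(g) for g in refined)
        partition = refined

groups = minimize(STATES, ALPHABET, ACCEPTING, MOVE)
print(groups)
```

The sketch expects a total transition function, that is, an entry in `MOVE` for every state and symbol.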
Minimization using Myhill Nerode Theorem:
Some Other Issues in Lexical Analyzer (cont.)
• Skipping comments
– Normally we don’t return a comment as a token.
– We skip a comment, and return the next token (which is not a comment) to the parser.
– So, the comments are only processed by the lexical analyzer, and they don't complicate the
syntax of the language.
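A minimal sketch of comment skipping is shown below. It assumes C-style `/* ... */` comments and a small illustrative token set; neither is specified in the text. Comments and whitespace are matched like any other pattern but are never handed on to the parser.

```python
import re

# Assumed comment syntax: C-style /* ... */ ; the other patterns are illustrative.
TOKEN_RE = re.compile(r"""
    (?P<comment>/\*.*?\*/)        |
    (?P<ws>\s+)                   |
    (?P<num>\d+)                  |
    (?P<id>[A-Za-z][A-Za-z0-9]*)  |
    (?P<op>[-+*/=;])
""", re.VERBOSE | re.DOTALL)

def tokens(source):
    """Yield (type, lexeme) pairs, skipping comments and whitespace."""
    pos = 0
    while pos < len(source):
        m = TOKEN_RE.match(source, pos)
        if m is None:
            raise ValueError(f"unexpected character {source[pos]!r} at {pos}")
        pos = m.end()
        if m.lastgroup in ("comment", "ws"):
            continue                      # skipped: never returned to the parser
        yield m.lastgroup, m.group()

result = list(tokens("x = 1 /* set x */ + y; /* done */"))
print(result)
```

The comment pattern is listed before the operator pattern so that `/*` is recognized as the start of a comment rather than as a division operator.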