Lexical Analyzer
Interaction with Scanner & Parser
[Figure: Source Program → Lexical Analyzer → (Token) → Parser. The Parser sends "Get next token" requests back to the Lexical Analyzer, and both are connected to the Symbol Table.]
• Upon receiving a “Get next token” command from the parser, the lexical analyzer reads input characters until it can identify the next token.
• The lexical analyzer also strips out comments and white space (blanks, tabs, and newline characters) from the source program.
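The interaction above can be sketched as a small scanner object that the parser calls once per token. This is a minimal illustration, not the course's implementation: the class name `Scanner`, the token names `id`, `number` and `symbol`, and the choice of `//` as the comment marker are all assumptions made for the example.

```python
class Scanner:
    """Hands out one token per call, the way a parser would request them."""

    def __init__(self, text):
        self.text = text
        self.pos = 0

    def get_next_token(self):
        text, n = self.text, len(self.text)
        # Strip white space (blanks, tabs, newlines) and // comments.
        while self.pos < n:
            if text[self.pos] in " \t\n":
                self.pos += 1
            elif text.startswith("//", self.pos):
                while self.pos < n and text[self.pos] != "\n":
                    self.pos += 1
            else:
                break
        if self.pos >= n:
            return None  # end of input
        start = self.pos
        ch = text[start]
        if ch.isalpha():
            # Keep reading until the identifier ends.
            while self.pos < n and text[self.pos].isalnum():
                self.pos += 1
            return ("id", text[start:self.pos])
        if ch.isdigit():
            while self.pos < n and text[self.pos].isdigit():
                self.pos += 1
            return ("number", text[start:self.pos])
        # Any other character is returned as a one-character token.
        self.pos += 1
        return ("symbol", ch)
```

A parser would call `get_next_token()` repeatedly until it returns `None`; the comment and the white space never reach it.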
Why separate lexical analysis & parsing?
1. Simplicity in design.
2. Improves compiler efficiency.
3. Enhances compiler portability.
Token, Pattern & Lexemes
Token
A sequence of characters having a collective meaning is known as a token.
Categories of Tokens:
1. Identifier
2. Keyword
3. Operator
4. Special symbol
5. Constant
Pattern
The set of rules associated with a token is called a pattern.
Example: “non-empty sequence of digits”, “letter followed by letters and digits”
Lexemes
The sequence of characters in a source program matched with a pattern for a token is called a lexeme.
Example: Rate, DIET, count, Flag
Example: Token, Pattern & Lexemes
Example: total = sum + 45
Tokens:
total   Identifier1
=       Operator1
sum     Identifier2
+       Operator2
45      Constant1
Lexemes
Lexemes of identifier: total, sum
Lexemes of operator: =, +
Lexemes of constant: 45
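The example can be reproduced with a small table that pairs each token category with a pattern. This is only a sketch: the names `TOKEN_PATTERNS`, `MASTER` and `tokenize` are made up for the illustration, and the operator pattern covers just the two operators used in the example.

```python
import re

# Each token category is paired with the pattern that describes its lexemes.
TOKEN_PATTERNS = [
    ("Identifier", r"[A-Za-z][A-Za-z0-9]*"),  # letter followed by letters and digits
    ("Constant", r"[0-9]+"),                   # non-empty sequence of digits
    ("Operator", r"[=+]"),                     # only = and + in this sketch
]
MASTER = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in TOKEN_PATTERNS))


def tokenize(source):
    """Return (token, lexeme) pairs.

    Characters that match no pattern (such as blanks) are silently skipped
    in this sketch; a real scanner would report the unexpected ones.
    """
    pairs = []
    for match in MASTER.finditer(source):
        pairs.append((match.lastgroup, match.group()))
    return pairs
```

Calling `tokenize("total = sum + 45")` yields the identifier lexemes total and sum, the operator lexemes = and +, and the constant lexeme 45, in source order.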
Relationship between Token, Pattern & Lexemes
Identify Token and Non-Token
Line1: #include <stdio.h>
Line2: int maximum(int x, int y) {
Line3: // This will compare 2 numbers
Line4: if (x > y)
Line5: return x;
Line6: else {
Line7: return y;
Line8: }
Line9: }
Attributes of Tokens
E = M * C **2
<id, pointer to symbol table entry of E>
<assign-op>
<id, pointer to symbol table entry of M>
<mult-op>
<id, pointer to symbol table entry of C>
<exp-op>
<number, integer value 2>
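A rough sketch of how such attribute pairs could be produced is shown below. The function name `tokens_with_attributes` is hypothetical, the symbol table is simplified to a plain list, and the "pointer" to a symbol table entry is represented by a list index; none of this comes from the slides.

```python
import re


def tokens_with_attributes(source):
    """Return (tokens, symbol_table) for a tiny expression language."""
    symbol_table = []  # one entry per distinct identifier
    tokens = []
    operators = {"=": "assign-op", "*": "mult-op", "**": "exp-op"}
    # "\*\*" is listed before "\*" so that ** is matched as one lexeme.
    for lexeme in re.findall(r"[A-Za-z]\w*|\d+|\*\*|\*|=", source):
        if lexeme in operators:
            tokens.append((operators[lexeme],))      # operators carry no attribute
        elif lexeme.isdigit():
            tokens.append(("number", int(lexeme)))   # attribute: integer value
        else:
            if lexeme not in symbol_table:
                symbol_table.append(lexeme)
            # attribute: index of the identifier's symbol table entry
            tokens.append(("id", symbol_table.index(lexeme)))
    return tokens, symbol_table
```

For the input `E = M * C **2` the three identifiers receive the indices 0, 1 and 2, and the constant carries the integer value 2.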
Lexical errors
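• Some errors are beyond the power of the lexical analyzer to recognize:
• fi ( a==f(x)) …
• Here fi is a valid identifier, so the lexical analyzer cannot tell that it is a misspelling of the keyword if.
• However, it may be able to recognize errors like:
• d=2r
• Such errors are recognized when no pattern for tokens matches a character sequence.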
Error recovery
• Panic mode: successive characters are ignored until we reach a well-formed token.
• Delete one character from the remaining input.
• Insert a missing character into the remaining input.
• Replace a character by another character.
• Transpose two adjacent characters.
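Panic mode is the simplest of these strategies to sketch. In the illustration below, the pattern `TOKEN` and the function name `scan_with_panic_mode` are assumptions chosen for the example; the scanner drops one character at a time until some token pattern matches again, and records what it skipped.

```python
import re

# Hypothetical token patterns: identifiers, integers and a few one-character symbols.
TOKEN = re.compile(r"[A-Za-z]\w*|\d+|[=+\-*/()]")


def scan_with_panic_mode(source):
    """Return (lexemes, skipped_characters)."""
    tokens, skipped = [], []
    pos = 0
    while pos < len(source):
        if source[pos].isspace():
            pos += 1
            continue
        match = TOKEN.match(source, pos)
        if match:
            tokens.append(match.group())
            pos = match.end()
        else:
            # Panic mode: ignore this character and try again at the next one.
            skipped.append(source[pos])
            pos += 1
    return tokens, skipped
```

The other four strategies (delete, insert, replace, transpose) need a notion of which repair is most plausible, which this sketch does not attempt.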
Specification of tokens
Tokens are specified in terms of three concepts:
1. Strings
2. Language
3. Regular Expression
Strings and Languages
• An alphabet or character class is a finite set of symbols.
• A string over an alphabet is a finite sequence of symbols drawn from
that alphabet.
• A language is any countable set of strings over some fixed alphabet.
Strings Operations
Term: Definition
Prefix of s: A string obtained by removing zero or more trailing symbols of string s. e.g., ban is a prefix of banana.
Suffix of s: A string obtained by removing zero or more leading symbols of string s. e.g., nana is a suffix of banana.
Substring of s: A string obtained by removing a prefix and a suffix from s. e.g., nan is a substring of banana.
Proper prefix, suffix or substring of s: Any nonempty string x that is respectively a prefix, suffix or substring of s, such that s ≠ x.
Subsequence of s: A string obtained by removing zero or more not necessarily contiguous symbols from s. e.g., baaa is a subsequence of banana.
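These definitions translate almost directly into code. The helper names below (`prefixes`, `suffixes`, `substrings`, `proper`, `is_subsequence`) are made up for this sketch.

```python
def prefixes(s):
    """Remove zero or more trailing symbols."""
    return [s[:i] for i in range(len(s) + 1)]


def suffixes(s):
    """Remove zero or more leading symbols."""
    return [s[i:] for i in range(len(s) + 1)]


def substrings(s):
    """Remove a prefix and a suffix."""
    return {s[i:j] for i in range(len(s) + 1) for j in range(i, len(s) + 1)}


def proper(strings, s):
    """Keep only the nonempty strings that differ from s."""
    return [x for x in strings if x and x != s]


def is_subsequence(x, s):
    """True if x can be obtained from s by deleting symbols, not necessarily contiguous."""
    remaining = iter(s)
    # "ch in remaining" consumes the iterator up to the first match,
    # so the symbols of x must appear in s in the same order.
    return all(ch in remaining for ch in x)
```

For example, `proper(prefixes("ban"), "ban")` keeps only `b` and `ba`, because the empty string and `ban` itself are excluded.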
Try yourself
• Write the prefixes, suffixes, substrings, proper prefixes, proper suffixes and subsequences of the following string:
String: Compiler
Test yourself
Let L={0,1} and S={a,b,c}
1. Union: L U S = {0, 1, a, b, c}
2. Concatenation: L.S = {0a, 1a, 0b, 1b, 0c, 1c}
3. Kleene closure: L* = {ε, 0, 1, 00, ....}
4. Positive closure: L+ = {0, 1, 00, ....}
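The four operations can be tried out on small finite languages represented as Python sets. Since L* and L+ are infinite, the sketch below only enumerates strings up to a chosen length; the `max_len` parameter and the function names are assumptions of the example, not part of the definitions.

```python
def union(L, S):
    return L | S


def concatenation(L, S):
    return {x + y for x in L for y in S}


def kleene_closure(L, max_len):
    """Strings of L* whose length is at most max_len (L* itself is infinite)."""
    result = {""}          # zero concatenations give the empty string
    frontier = {""}
    while frontier:
        longer = {x + y for x in frontier for y in L if len(x + y) <= max_len}
        frontier = longer - result
        result |= frontier
    return result


def positive_closure(L, max_len):
    """Strings of L+ = L.L* whose length is at most max_len."""
    return {x + y for x in L for y in kleene_closure(L, max_len) if len(x + y) <= max_len}
```

With `L = {"0", "1"}`, the bounded positive closure is the bounded Kleene closure without the empty string, which matches the difference between items 3 and 4 above.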
Regular Expression & Regular
Definition
Rules to define regular expression
1. ε is a regular expression that denotes {ε}, the set containing the empty string.
2. If a is a symbol in Σ, then a is a regular expression that denotes {a}.
3. Suppose r and s are regular expressions denoting the languages L(r) and L(s). Then,
a. (r)|(s) is a regular expression denoting L(r) ∪ L(s)
b. (r)(s) is a regular expression denoting L(r)L(s)
c. (r)* is a regular expression denoting (L(r))*
d. (r) is a regular expression denoting L(r)
The language denoted by a regular expression is said to be a regular set.
Regular expression
• A regular expression is a sequence of characters that defines a pattern.
Notational shorthands
1. One or more instances: +
2. Zero or more instances: *
3. Zero or one instances: ?
4. Alphabets: Σ
Regular expression
• L = Zero or more occurrences of a = a*
a* = { 𝜖, a, aa, aaa, aaaa, aaaaa, ….. } (infinite)
Regular expression
• L = One or more occurrences of a = a+
a+ = { a, aa, aaa, aaaa, aaaaa, ….. } (infinite)
Precedence and associativity of operators
Operator            Precedence   Associativity
Unary operator *    1            left
Concatenation       2            left
Union |             3            left
Examples with Σ = { 0, 1}
• All binary strings including the empty string: (0|1)*
• All non-empty binary strings: (0|1)(0|1)*
• All binary strings of length at least 2, starting and ending with 0s: 0(0|1)*0
• All binary strings with at least four characters in which the fourth last character is always 0: (0|1)*0(0|1)(0|1)(0|1)
• All binary strings possessing exactly three 1s: 0*10*10*10*
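One way to gain confidence in such expressions is to compare each of them against a direct statement of the property on every short binary string. The sketch below does this with Python's `re` module; the names `EXAMPLES`, `binary_strings` and `mismatches` are made up for the illustration, and passing the check up to a bounded length is evidence, not a proof.

```python
import re
from itertools import product


def binary_strings(max_len):
    """Yield every string over {0, 1} of length 0 .. max_len."""
    for n in range(max_len + 1):
        for bits in product("01", repeat=n):
            yield "".join(bits)


# (regular expression, property it is supposed to describe)
EXAMPLES = [
    (r"(0|1)*", lambda s: True),
    (r"(0|1)(0|1)*", lambda s: len(s) >= 1),
    (r"0(0|1)*0", lambda s: len(s) >= 2 and s[0] == "0" and s[-1] == "0"),
    (r"(0|1)*0(0|1)(0|1)(0|1)", lambda s: len(s) >= 4 and s[-4] == "0"),
    (r"0*10*10*10*", lambda s: s.count("1") == 3),
]


def mismatches(max_len=8):
    """Return (pattern, string) pairs where the regex and the property disagree."""
    bad = []
    for pattern, holds in EXAMPLES:
        for s in binary_strings(max_len):
            if bool(re.fullmatch(pattern, s)) != holds(s):
                bad.append((pattern, s))
    return bad
```

`re.fullmatch` is used rather than `re.match` because a regular expression in this setting must describe the whole string, not just a prefix of it.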
Test yourself for the alphabet {0, 1}
• All strings with an odd number of 1’s: 0*10*(0*10*10*)*
• All non-empty strings in which 0’s, if present, occur only in triples (000), mixed with any number of 1’s: (000 | 1)+
• All binary strings of length at least 2 whose value is divisible by 4: (0|1)*00
Regular definition
• A regular definition gives names to certain regular expressions
and uses those names in other regular expressions.
• A regular definition is a sequence of definitions of the form:
d1 → r1
d2 → r2
……
dn → rn
where each di is a distinct name and each ri is a regular expression.
Example: Regular definition for identifier
letter → A|B|C|………..|Z|a|b|………..|z
digit → 0|1|…….|9
id → letter (letter | digit)*
Regular definition example
• Example: Unsigned Pascal numbers
3
5280
39.37
6.336E4
1.894E-4
2.56E+7
Regular Definition
digit → 0|1|…..|9
digits → digit digit*
optional_fraction → .digits | 𝜖
optional_exponent → (E(+|-|𝜖)digits) | 𝜖
num → digits optional_fraction optional_exponent
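The regular definition maps naturally onto named string fragments that are combined into one pattern. The sketch below uses Python's `re` syntax, where `?` stands for "the expression or 𝜖"; the function name `is_unsigned_number` is an assumption of the example.

```python
import re

# Each name mirrors one line of the regular definition.
digit = r"[0-9]"
digits = rf"{digit}{digit}*"
optional_fraction = rf"(\.{digits})?"        # .digits | empty
optional_exponent = rf"(E[+-]?{digits})?"    # (E(+|-|empty)digits) | empty
num = re.compile(rf"{digits}{optional_fraction}{optional_exponent}")


def is_unsigned_number(text):
    """True if the whole text is generated by num."""
    return num.fullmatch(text) is not None
```

All six sample numbers listed above are accepted, while strings such as `1.` or `.5` are rejected because the definition requires digits on both sides of the point.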
Transition Diagram
• A stylized flowchart is called a transition diagram.
• A circle is a state.
• An arrow (edge) is a transition.
• An arrow labeled start marks the start state.
• A double circle is a final state.
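Transition Diagram : Relational operator
[Figure: transition diagram for relational operators, summarized below]
State 0 (start): on < go to state 1; on = go to state 5; on > go to state 6
State 1: on = go to state 2; on > go to state 3; on other go to state 4
State 2: return (relop, LE)
State 3: return (relop, NE)
State 4: return (relop, LT)
State 5: return (relop, EQ)
State 6: on = go to state 7; on other go to state 8
State 7: return (relop, GE)
State 8: return (relop, GT)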
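A transition diagram like this one can be hard-coded as nested conditionals, one branch per state. The sketch below follows the state numbers 0 to 8 of the relational operator diagram; the function name `relop` and the convention of returning the position after the lexeme are choices made for the example. On an "other" edge the extra character is looked at but not consumed.

```python
def relop(text, pos=0):
    """Follow the diagram from state 0.

    Returns ((relop, attribute), next_position), or None if no
    relational operator starts at pos.
    """
    def peek(i):
        # "" stands for end of input and is treated as "other".
        return text[i] if i < len(text) else ""

    c = peek(pos)
    if c == "<":                                  # state 1
        nxt = peek(pos + 1)
        if nxt == "=":
            return ("relop", "LE"), pos + 2       # state 2
        if nxt == ">":
            return ("relop", "NE"), pos + 2       # state 3
        return ("relop", "LT"), pos + 1           # state 4: "other" is not consumed
    if c == "=":
        return ("relop", "EQ"), pos + 1           # state 5
    if c == ">":                                  # state 6
        if peek(pos + 1) == "=":
            return ("relop", "GE"), pos + 2       # state 7
        return ("relop", "GT"), pos + 1           # state 8: "other" is not consumed
    return None
```

Hard coding is fast and easy to follow for a handful of tokens, but every change to the token set means editing the conditionals by hand, which is the motivation for generating scanners automatically.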
Transition diagram : Unsigned number
[Figure: transition diagram for unsigned numbers. Edges are labeled start, digit, ".", E, "+ or -" and other.]
Example inputs:
5280
39.37
1.894 E - 4
2.56 E + 7
45 E + 6
96 E 2
Hard coding & automatic generation
Lexical analyzers
Finite Automata
• Finite automata are recognizers.
• An FA simply says “Yes” or “No” about each possible input string.
• A finite automaton is a mathematical model consisting of:
1. A set of states
2. A set of input symbols
3. A transition function move
4. An initial state
5. A set of final (accepting) states
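The five components can be written down as plain data, with a short loop acting as the recognizer. The sketch below is one possible encoding, assuming a deterministic automaton stored in a dictionary; the name `ABB` and the dictionary keys are illustrative, and the sample automaton is a four-state DFA for (a|b)*abb. The set of states and the input alphabet are implicit in the keys of `move`.

```python
def accepts(dfa, text):
    """Answer "Yes" (True) or "No" (False) for the input string."""
    state = dfa["start"]
    for symbol in text:
        state = dfa["move"].get((state, symbol))
        if state is None:        # no transition on this symbol
            return False
    return state in dfa["accepting"]


# A DFA for (a|b)*abb: transition function, initial state, accepting states.
ABB = {
    "start": "A",
    "accepting": {"D"},
    "move": {
        ("A", "a"): "B", ("A", "b"): "A",
        ("B", "a"): "B", ("B", "b"): "C",
        ("C", "a"): "B", ("C", "b"): "D",
        ("D", "a"): "B", ("D", "b"): "A",
    },
}
```

Because the transition function is just a table, the same `accepts` loop works for any DFA; only the data changes.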
Types of finite automata
• Types of finite automata are:
• Deterministic finite automata (DFA): have, for each state, exactly one edge leaving for each symbol.
• Nondeterministic finite automata (NFA): there are no restrictions on the edges leaving a state. There can be several edges with the same symbol as label, and some edges can be labeled with ε.
[Figure: an example DFA and an example NFA, each with states 1 to 4 and edges labeled a and b.]
DFA optimization
1. Construct an initial partition Π of the set of states S with two groups: the accepting states F and the non-accepting states S − F.
2. Apply the repartition procedure to Π to construct a new partition Π_new.
3. If Π_new = Π, let Π_final = Π and continue with step (4). Otherwise, repeat step (2) with Π := Π_new.
Repartition procedure:
for each group G of Π do begin
    partition G into subgroups such that two states s and t of G are in the same subgroup if and only if for all input symbols a, states s and t have transitions on a to states in the same group of Π.
    replace G in Π_new by the set of all subgroups formed.
end
4. Choose one state in each group of the final partition Π_final as the representative for that group. The representatives will be the states of the reduced DFA M'. Let s be a representative state, and suppose on input a there is a transition of the original DFA M from s to t. Let r be the representative of t's group. Then M' has a transition from s to r on a. Let the start state of M' be the representative of the group containing the start state of M, and let the accepting states of M' be the representatives that are in F, the set of accepting states of M.
5. If M' has a dead state d, then remove d from M'. Also remove any state not reachable from the start state.
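DFA optimization
Example: minimize the DFA with the transition table below, where E is the only accepting state.
States  a  b
A       B  C
B       B  D
C       B  C
D       B  E
E       B  C
• {A, B, C, D, E} is split into nonaccepting states {A, B, C, D} and accepting states {E}.
• {A, B, C, D} is split into {A, B, C} and {D}, since only D goes to E on b.
• {A, B, C} is split into {A, C} and {B}, since only B goes to D on b.
• Now no more splitting is possible.
• If we choose A as the representative for group (AC), then we obtain the reduced (optimized) transition table:
States  a  b
A       B  A
B       B  D
D       B  E
E       B  A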
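The repartition loop of the minimization algorithm can be sketched in a few lines. The code below is an illustration under stated assumptions: the transition function is total (every state has a transition on every symbol), it is stored in a dictionary keyed by (state, symbol), and the names `minimize_partition` and `MOVE` are made up. It returns only the final partition; choosing representatives and removing dead states are left out.

```python
def minimize_partition(states, symbols, move, accepting):
    """Return the final partition as a set of frozensets of states."""
    # Initial partition: non-accepting states and accepting states.
    partition = [g for g in (set(states) - set(accepting), set(accepting)) if g]
    while True:
        def group_of(state):
            return next(i for i, g in enumerate(partition) if state in g)

        new_partition = []
        for group in partition:
            # Two states stay together iff, on every symbol, they move
            # to states in the same group of the current partition.
            subgroups = {}
            for s in group:
                key = tuple(group_of(move[s, a]) for a in symbols)
                subgroups.setdefault(key, set()).add(s)
            new_partition.extend(subgroups.values())
        # Groups are only ever split, so an unchanged count means no change.
        if len(new_partition) == len(partition):
            return {frozenset(g) for g in new_partition}
        partition = new_partition


# Transition table of the five-state example (E is the accepting state).
MOVE = {
    ("A", "a"): "B", ("A", "b"): "C",
    ("B", "a"): "B", ("B", "b"): "D",
    ("C", "a"): "B", ("C", "b"): "C",
    ("D", "a"): "B", ("D", "b"): "E",
    ("E", "a"): "B", ("E", "b"): "C",
}
```

Running it on the example table groups A with C and leaves B, D and E on their own, which is the partition used to build the reduced table.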
Conversion from regular expression
to DFA
Rules to compute nullable, firstpos,
lastpos
• nullable(n)
• The subtree at node n generates a language that includes the empty string.
• firstpos(n)
• The set of positions that can match the first symbol of a string generated by the subtree at node n.
• lastpos(n)
• The set of positions that can match the last symbol of a string generated by the subtree at node n.
• followpos(i)
• The set of positions that can follow position i in the tree.
Rules to compute nullable, firstpos,
lastpos
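Node n: a leaf labeled by ε
    nullable(n): true
    firstpos(n): Φ
    lastpos(n): Φ
Node n: a leaf labeled with position i
    nullable(n): false
    firstpos(n): {i}
    lastpos(n): {i}
Node n: an or-node | with children c1 and c2
    nullable(n): nullable(c1) or nullable(c2)
    firstpos(n): firstpos(c1) ∪ firstpos(c2)
    lastpos(n): lastpos(c1) ∪ lastpos(c2)
Node n: a concatenation node . with children c1 and c2
    nullable(n): nullable(c1) and nullable(c2)
    firstpos(n): if (nullable(c1)) then firstpos(c1) ∪ firstpos(c2) else firstpos(c1)
    lastpos(n): if (nullable(c2)) then lastpos(c1) ∪ lastpos(c2) else lastpos(c2)
Node n: a star node * with child c1
    nullable(n): true
    firstpos(n): firstpos(c1)
    lastpos(n): lastpos(c1)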
Rules to compute followpos
1. If n is a concatenation node with left child c1 and right child c2 and i is a position in lastpos(c1), then all positions in firstpos(c2) are in followpos(i).
2. If n is a * node and i is a position in lastpos(n), then all positions in firstpos(n) are in followpos(i).
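The rules for nullable, firstpos, lastpos and followpos can be evaluated in a single bottom-up pass over the syntax tree. The sketch below assumes a tree encoded as nested tuples, with the hypothetical node kinds `leaf`, `or`, `cat` and `star`; ε leaves are left out for brevity. `TREE` encodes (a|b)*abb# with positions 1 to 6.

```python
def analyze(node, followpos):
    """Return (nullable, firstpos, lastpos) of node and fill in followpos."""
    kind = node[0]
    if kind == "leaf":                        # ("leaf", symbol, position)
        return False, {node[2]}, {node[2]}
    if kind == "or":                          # ("or", c1, c2)
        n1, f1, l1 = analyze(node[1], followpos)
        n2, f2, l2 = analyze(node[2], followpos)
        return n1 or n2, f1 | f2, l1 | l2
    if kind == "cat":                         # ("cat", c1, c2)
        n1, f1, l1 = analyze(node[1], followpos)
        n2, f2, l2 = analyze(node[2], followpos)
        for i in l1:                          # followpos rule 1
            followpos.setdefault(i, set()).update(f2)
        first = (f1 | f2) if n1 else f1
        last = (l1 | l2) if n2 else l2
        return n1 and n2, first, last
    if kind == "star":                        # ("star", c1)
        _, f1, l1 = analyze(node[1], followpos)
        for i in l1:                          # followpos rule 2
            followpos.setdefault(i, set()).update(f1)
        return True, f1, l1
    raise ValueError(f"unknown node kind: {kind}")


# Syntax tree for (a|b)*abb# with leaf positions 1..6.
TREE = ("cat",
        ("cat",
         ("cat",
          ("cat",
           ("star", ("or", ("leaf", "a", 1), ("leaf", "b", 2))),
           ("leaf", "a", 3)),
          ("leaf", "b", 4)),
         ("leaf", "b", 5)),
        ("leaf", "#", 6))
```

Calling `analyze(TREE, table)` with an empty dictionary `table` returns the values for the root and leaves the followpos sets in `table`.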
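Conversion from regular expression to DFA
Regular expression: (a|b)*abb#
Step 1: Construct syntax tree
[Figure: syntax tree for (a|b)*abb#. The leaves a, b, a, b, b, # are numbered 1 to 6 from left to right. Leaves 1 and 2 are joined by |, which is the child of *; concatenation nodes then attach leaves 3, 4, 5 and 6 in turn.]
Step 2: Nullable node
Here, * is the only nullable node.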
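Step 3: Calculate firstpos
[Figure: syntax tree for (a|b)*abb# annotated with firstpos at each node]
• Leaf with position i: firstpos = {i}
• | node and * node: firstpos = {1,2}
• Lowest concatenation node (children * and leaf 3): firstpos = {1,2,3}, since * is nullable
• Every concatenation node above it: firstpos = {1,2,3}, taken from its left child, which is not nullable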
Step 3: Calculate lastpos
[Figure: syntax tree for (a|b)*abb# annotated with lastpos at each node]
• Leaf with position i: lastpos = {i}
• | node and * node: lastpos = {1,2}
• Concatenation nodes, from the lowest up to the root: lastpos = {3}, {4}, {5}, {6}, since the right child of each is a leaf that is not nullable
Step 4: Calculate followpos
Applying followpos rule 1 at each concatenation node, from the root downward:
• i = 5: followpos(5) = {6}
• i = 4: followpos(4) = {5}
• i = 3: followpos(3) = {4}
• i = 1, 2: position 3 is added to followpos(1) and followpos(2)
Applying followpos rule 2 at the * node:
• i = 1, 2: positions 1 and 2 are added to followpos(1) and followpos(2)
Position  followpos
5         6
4         5
3         4
2         1,2,3
1         1,2,3
Conversion from regular expression to DFA
Initial state = of root = {1,2,3} Positio followpos
----- A n
5 6
State A 4 5
δ( (1,2,3),a) = followpos(1) U 3 4
followpos(3) 2 1,2,3
1 1,2,3
=(1,2,3) U (4) =
{1,2,3,4} ----- B
States a b
A={1,2,3 B A
δ( (1,2,3),b) = followpos(2) }
B={1,2,3,
4}
=(1,2,3) ----- A
Conversion from regular expression to DFA
State B
Positio followpos
δ( (1,2,3,4),a) = followpos(1) U followpos(3) n5 6
=(1,2,3) U (4) = {1,2,3,4} ----- B4 5
3 4
δ( (1,2,3,4),b) = followpos(2) U followpos(4) 2 1,2,3
=(1,2,3) U (5) = {1,2,3,5} ----- C1 1,2,3
State C
δ( (1,2,3,5),a) = followpos(1) U followpos(3)States a b
A={1,2,3 B A
=(1,2,3) U (4) = {1,2,3,4} -----B={1,2,3,
B } B C
4}
C={1,2,3, B D
5}
δ( (1,2,3,5),b) = followpos(2) U followpos(5) D={1,2,3,
6}
=(1,2,3) U (6) = {1,2,3,6} ----- D
Conversion from regular expression to DFA
State D Positio followpos
δ( (1,2,3,6),a) = followpos(1) U followpos(3) n
5 6
=(1,2,3) U (4) = {1,2,3,4} ----- B4 5
3 4
δ( (1,2,3,6),b) = followpos(2) 2 1,2,3
1 1,2,3
=(1,2,3) ----- A
b
a States a b
A={1,2,3 B A
a b b }
B={1,2,3, B C
A B C D 4}
C={1,2,3, B D
a 5}
a D={1,2,3, B A
b 6}
DFA
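The state-by-state computation can be automated once the followpos table is known. The sketch below is a minimal version, assuming the table and the symbol at each position are given as dictionaries; the names `build_dfa`, `FOLLOWPOS` and `SYMBOL_AT` are made up for the illustration, and DFA states are represented as frozensets of positions rather than letters.

```python
def build_dfa(start, followpos, symbol_at, alphabet):
    """Return (states, transitions); each state is a frozenset of positions."""
    start = frozenset(start)
    states, transitions, pending = [start], {}, [start]
    while pending:
        state = pending.pop()
        for a in alphabet:
            # Union of followpos(p) over the positions p in the state labeled a.
            target = frozenset().union(
                *(followpos[p] for p in state if symbol_at[p] == a)
            )
            if not target:
                continue
            transitions[state, a] = target
            if target not in states:
                states.append(target)
                pending.append(target)
    return states, transitions


# followpos table and position labels for (a|b)*abb#.
FOLLOWPOS = {1: {1, 2, 3}, 2: {1, 2, 3}, 3: {4}, 4: {5}, 5: {6}}
SYMBOL_AT = {1: "a", 2: "b", 3: "a", 4: "b", 5: "b", 6: "#"}
```

Starting from firstpos of the root, `build_dfa({1, 2, 3}, FOLLOWPOS, SYMBOL_AT, "ab")` discovers the same four states as the hand computation; a state is accepting when it contains the position of #.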
Conversion from regular expression to DFA
Construct a DFA for the following regular expression:
1. (c | d)*c#