Unit 1 (B)
Chapter 3
Lexical Analysis and
Lexical Analyzer Generators
Analp Pathak
Assistant Professor
Department of Computer Science Engineering
SRM IST, NCR Campus
Outline
• Lexical Analysis
• Role of Lexical Analyzer
• Recognition of Tokens
• Lexical Analyzer Generator: Lex/Flex
• Design Aspects of Lexical Analyzer
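Attributes of Tokens
• Token stream for the input y := 31 + 28*x:
<id, “y”> <assign, > <num, 31> <+, > <num, 28> <*, > <id, “x”>
• The lexical analyzer passes each token, together with its attribute value tokenval, to the parser.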
Example
• Consider Pascal Statement
– const pi = 3.1416;
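The statement above can be split into tokens with a small hand-written scanner. The following is only a sketch: the token kinds (T_CONST, T_ID, T_NUM, T_EQ, T_SEMI) and the function next_token are hypothetical names chosen for this illustration, not part of the slides.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical token kinds, used only in this sketch. */
typedef enum { T_CONST, T_ID, T_NUM, T_EQ, T_SEMI, T_ERROR, T_EOF } TokenKind;

typedef struct {
    TokenKind kind;
    char lexeme[32];   /* token attribute: the matched text */
} Token;

/* Scan one token starting at *p and advance *p past it. */
Token next_token(const char **p)
{
    Token t = { T_EOF, "" };
    const char *s = *p;
    size_t n = 0, copy;

    while (isspace((unsigned char)*s)) s++;            /* skip white space */
    if (*s == '\0') { *p = s; return t; }

    if (isalpha((unsigned char)*s)) {                  /* letter ( letter | digit )* */
        while (isalnum((unsigned char)s[n])) n++;
        t.kind = T_ID;
    } else if (isdigit((unsigned char)*s)) {           /* digits with optional fraction */
        while (isdigit((unsigned char)s[n])) n++;
        if (s[n] == '.' && isdigit((unsigned char)s[n + 1])) {
            n++;
            while (isdigit((unsigned char)s[n])) n++;
        }
        t.kind = T_NUM;
    } else {                                           /* single-character tokens */
        n = 1;
        t.kind = (*s == '=') ? T_EQ : (*s == ';') ? T_SEMI : T_ERROR;
    }

    copy = n < sizeof t.lexeme - 1 ? n : sizeof t.lexeme - 1;
    memcpy(t.lexeme, s, copy);
    t.lexeme[copy] = '\0';

    /* A keyword is an identifier-shaped lexeme that is reserved. */
    if (t.kind == T_ID && strcmp(t.lexeme, "const") == 0) t.kind = T_CONST;

    *p = s + n;
    return t;
}
```

Calling next_token repeatedly on "const pi = 3.1416;" yields the kinds T_CONST, T_ID, T_EQ, T_NUM, T_SEMI and then T_EOF, with "pi" and "3.1416" stored as lexeme attributes.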
Lexical Errors
• It is hard for a lexical analyzer to tell, without the aid of
other components, that there is a source-code error.
– Example: fi(a==f(x)) …
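• fi is a valid lexeme for the token id, so the lexical analyzer returns id to the parser and lets a later phase handle the error.
• The simplest recovery strategy is panic-mode recovery: delete successive characters from the remaining input until a well-formed token is found.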
• Other possible error-recovery actions are:
– Delete one character from the remaining input.
– Insert a missing character into the remaining input.
– Replace a character by another character.
– Transpose two adjacent characters.
• Minimum-distance error correction is a
convenient theoretical yardstick, but it is not generally
used in practice because it is too costly to implement.
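Panic-mode recovery can be sketched in a few lines of C. This is an illustration under an assumed, deliberately small set of token-starting characters; the helper names can_start_token and panic_skip are hypothetical.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Assumption for this sketch: a token (or the gap before one) may begin with
   a letter, a digit, white space, or one of the punctuation characters below.
   End of input also stops the skipping. */
static int can_start_token(char c)
{
    return c == '\0'
        || isalnum((unsigned char)c)
        || isspace((unsigned char)c)
        || strchr("=;()+-*/", c) != NULL;
}

/* Panic-mode recovery: delete successive characters from the remaining input
   until one is found that can begin a token. Returns the number deleted. */
size_t panic_skip(const char **p)
{
    size_t skipped = 0;
    while (!can_start_token(**p)) {
        (*p)++;
        skipped++;
    }
    return skipped;
}
```

A scanner would call panic_skip when no pattern matches at the current position and then resume normal scanning from the returned position.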
Input Buffering
• Buffering the input raises efficiency issues, because the
lexical analyzer reads the source one character at a time.
• Three general approaches to the implementation of a
lexical analyzer.
– Use a lexical-analyzer generator, such as the Lex compiler,
to produce the lexical analyzer from a regular-expression-
based specification. In this case, the generator provides
routines for reading and buffering the input.
– Write the lexical analyzer in a conventional systems-
programming language, using the I/O facilities of that
language to read the input.
– Write the lexical analyzer in assembly language and
explicitly manage the reading of input.
• Two-buffer input scheme to look ahead on the
input and identify tokens
• Buffer Pairs
• Sentinels (Guards)
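The buffer-pair scheme with sentinels can be sketched as follows. This is a minimal illustration, not a production reader: an in-memory string stands in for the source file, the half size HALF is kept tiny on purpose, '\0' is assumed never to occur in the source so it can serve as the sentinel, and the lexeme-begin pointer is left out. The names Scanner, scanner_init and next_char are hypothetical.

```c
#include <stddef.h>
#include <string.h>

#define HALF 8            /* tiny half size so that reloads happen often */
#define SENTINEL '\0'     /* assumption: the source text never contains '\0' */

typedef struct {
    char buf[2 * (HALF + 1)];  /* two halves, each followed by a sentinel slot */
    char *forward;             /* scanning pointer */
    const char *src;           /* remaining source text (stands in for a file) */
    size_t left;               /* characters of src not yet loaded */
} Scanner;

/* Load up to HALF characters into the half starting at `half`
   and mark the end of the loaded data with a sentinel. */
static void load(Scanner *s, char *half)
{
    size_t n = s->left < HALF ? s->left : HALF;
    memcpy(half, s->src, n);
    half[n] = SENTINEL;        /* end of half, or end of input if n < HALF */
    s->src += n;
    s->left -= n;
}

void scanner_init(Scanner *s, const char *text)
{
    s->src = text;
    s->left = strlen(text);
    load(s, s->buf);
    s->forward = s->buf;
}

/* Return the next character, or -1 at end of input. In the common case only
   one test (against the sentinel) is made per character; the buffer-boundary
   tests run only when the sentinel is actually hit. */
int next_char(Scanner *s)
{
    for (;;) {
        char c = *s->forward;
        if (c != SENTINEL) {
            s->forward++;
            return (unsigned char)c;
        }
        if (s->forward == s->buf + HALF) {                 /* end of first half */
            load(s, s->buf + HALF + 1);
            s->forward = s->buf + HALF + 1;
        } else if (s->forward == s->buf + 2 * HALF + 1) {  /* end of second half */
            load(s, s->buf);
            s->forward = s->buf;
        } else {
            return -1;             /* sentinel inside a half: end of input */
        }
    }
}
```

Without the sentinel, every advance of forward would need two tests: one for the end of a buffer half and one for the character itself.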
letter → A | B | … | Z | a | b | … | z
digit → 0 | 1 | … | 9
id → letter ( letter | digit )*
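The regular definition for id translates directly into a short loop. The sketch below assumes plain ASCII letters and digits, exactly as in the definitions; the function name match_id is hypothetical.

```c
#include <stddef.h>

/* letter -> A | B | ... | Z | a | b | ... | z */
static int is_letter(char c)
{
    return (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z');
}

/* digit -> 0 | 1 | ... | 9 */
static int is_digit(char c)
{
    return c >= '0' && c <= '9';
}

/* id -> letter ( letter | digit )*
   Returns the length of the longest prefix of s that is an id,
   or 0 if s does not start with a letter. */
size_t match_id(const char *s)
{
    size_t n;
    if (!is_letter(s[0])) return 0;
    n = 1;
    while (is_letter(s[n]) || is_digit(s[n])) n++;
    return n;
}
```

For example, match_id("x1y2;") returns 4, while match_id("9lives") returns 0 because an id cannot start with a digit.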
lex.yy.c → C compiler → a.out
input stream → a.out → sequence of tokens
Lex Specification
• A lex specification consists of three parts:
regular definitions, C declarations in %{ %}
%%
translation rules
%%
user-defined auxiliary procedures
• The translation rules are of the form:
p1 { action1 }
p2 { action2 }
…
pn { actionn }
Lex actions
• BEGIN: Switches the scanner to the named start condition. The
scanner starts in state 0 (INITIAL).
• ECHO: Copies the matched text (yytext) to the output unchanged.
• yytext: When the scanner matches a token, the lexeme is stored in
the null-terminated string yytext.
• yylex(): The scanner function. Calling yylex() starts (or resumes)
scanning the input.
• yywrap(): Called when the scanner reaches end of file. If it
returns 1, scanning stops; if it returns 0, the scanner continues
scanning, assuming more input has been set up.
• yyin: The FILE pointer from which the scanner reads its input. It
defaults to standard input.
• yyleng: The length (number of characters) of the lexeme stored in
yytext. Its value is the same as strlen(yytext).
Installing Software
• Download Flex 2.5.4a
• Download Bison 2.4.1
• Download DevC++
• Install Flex at "C:\GnuWin32"
• Install Bison at "C:\GnuWin32"
• Install DevC++ at "C:\Dev-Cpp"
• Open Environment Variables.
– Add "C:\GnuWin32\bin;C:\Dev-Cpp\bin;" to path.
Lex Program
(Write a Lex program to identify the identifiers and keywords in a sentence.)
%{
#include <stdio.h>
static int key_word = 0;    /* number of keywords seen */
static int identifier = 0;  /* number of identifiers seen */
%}
%%
"include"|"for"|"define"|"int"|"char"|"float"|"double" { key_word++; printf("keyword found: %s\n", yytext); }
[A-Za-z_][A-Za-z0-9_]* { identifier++; printf("identifier found: %s\n", yytext); }
.|\n { /* ignore every other character */ }
%%
int main(void)
{
    printf("enter the sentence: ");
    yylex();
    printf("keywords: %d\nidentifiers: %d\n", key_word, identifier);
    return 0;
}
int yywrap(void)
{
    return 1;  /* no further input */
}
regular expressions → NFA → DFA (the NFA-to-DFA step is optional)
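Nondeterministic Finite
Automata
• Definition: an NFA is a 5-tuple (S, Σ, δ, s0, F)
where
– S is a finite set of states
– Σ is the input alphabet
– δ is the transition function, mapping a state and an input symbol to a set of states
– s0 ∈ S is the start state
– F ⊆ S is the set of accepting states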
Transition Graph
• An NFA can be diagrammatically
represented by a labeled directed graph
called a transition graph
S = {0,1,2,3}
Σ = {a,b}
s0 = 0
F = {3}
[Figure: transition graph with start state 0 and edges 0 -a-> 0, 0 -b-> 0, 0 -a-> 1, 1 -b-> 2, 2 -b-> 3.]
Transition Table
• The mapping δ of an NFA can be
represented in a transition table

δ(0,a) = {0,1}
δ(0,b) = {0}
δ(1,b) = {2}
δ(2,b) = {3}

State   Input a   Input b
0       {0, 1}    {0}
1       ∅         {2}
2       ∅         {3}
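The transition table can be used directly to simulate the NFA by tracking the set of states it may be in. The sketch below stores each set as a bit mask (bit i stands for state i); the array name delta and the function nfa_accepts are hypothetical, and state 3, which has no row in the table, is given no outgoing moves.

```c
#include <stddef.h>

enum { NSTATES = 4, NSYMS = 2 };

/* delta[state][symbol] is a bit set of next states; bit i stands for state i.
   Column 0 is input a, column 1 is input b. */
static const unsigned delta[NSTATES][NSYMS] = {
    /* state 0 */ { (1u << 0) | (1u << 1), (1u << 0) },
    /* state 1 */ { 0,                     (1u << 2) },
    /* state 2 */ { 0,                     (1u << 3) },
    /* state 3 */ { 0,                     0         },
};

/* Return 1 if the NFA accepts x (a string over {a, b}), otherwise 0. */
int nfa_accepts(const char *x)
{
    unsigned current = 1u << 0;              /* start in {s0} = {0} */

    for (; *x != '\0'; x++) {
        unsigned next = 0;
        int sym, s;

        if (*x == 'a') sym = 0;
        else if (*x == 'b') sym = 1;
        else return 0;                       /* symbol outside the alphabet */

        for (s = 0; s < NSTATES; s++)        /* union of moves from every current state */
            if (current & (1u << s))
                next |= delta[s][sym];
        current = next;
    }
    return (current & (1u << 3)) != 0;       /* accept if the set contains a state of F = {3} */
}
```

With this table the function accepts exactly the strings ending in abb, for example "abb" and "babb", and rejects "ab" and "abba".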
NFA → DFA by subset construction (optional)
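[Figure: NFA fragments built from a regular expression, each with start state i and accepting state f: a single symbol a; alternation r1 | r2, combining N(r1) and N(r2) in parallel; concatenation r1 r2, chaining N(r1) and N(r2); closure r*, looping over N(r).]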
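a { action1 }
abb { action2 }
a*b+ { action3 }
[Figure: one NFA per pattern: 1 -a-> 2 for a; 3 -a-> 4 -b-> 5 -b-> 6 for abb; 7 -b-> 8 with an a-loop on 7 and a b-loop on 8 for a*b+. The combined NFA adds a new start state 0 with ε-transitions to 1, 3 and 7.]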
Input: a a b a
{0,1,3,7} -a-> {2,4,7} -a-> {7} -b-> {8} -a-> none
The last set, {8}, is accepting: action3
Must find the longest match:
Continue until no further moves are possible
When last state is accepting: execute action
Input: a b b a
{0,1,3,7} -a-> {2,4,7} -b-> {5,8} -b-> {6,8} -a-> none
The last set, {6,8}, contains two accepting states: 6 (action2) and 8 (action3)
When two or more accepting states are reached, the
first action given in the Lex specification is executed
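Both rules (longest match, then earliest rule) can be sketched by simulating the combined NFA for the patterns a, abb and a*b+. The code below is an illustration only: state sets are bit masks, the moves are written out by hand for states 0 to 8, and the names move, action_of and longest_match are hypothetical.

```c
#include <stddef.h>

#define BIT(s) (1u << (s))

/* Move on one input symbol from every NFA state in `set`. */
static unsigned move(unsigned set, char c)
{
    unsigned next = 0;
    if (c == 'a') {
        if (set & BIT(1)) next |= BIT(2);   /* pattern a    */
        if (set & BIT(3)) next |= BIT(4);   /* pattern abb  */
        if (set & BIT(7)) next |= BIT(7);   /* pattern a*b+ */
    } else if (c == 'b') {
        if (set & BIT(4)) next |= BIT(5);
        if (set & BIT(5)) next |= BIT(6);
        if (set & BIT(7)) next |= BIT(8);
        if (set & BIT(8)) next |= BIT(8);
    }
    return next;
}

/* Earliest-listed rule among the accepting states in `set`, or 0 if none. */
static int action_of(unsigned set)
{
    if (set & BIT(2)) return 1;   /* a    { action1 } */
    if (set & BIT(6)) return 2;   /* abb  { action2 } */
    if (set & BIT(8)) return 3;   /* a*b+ { action3 } */
    return 0;
}

/* Find the longest lexeme at the start of `input`. Stores its length in *len
   and returns the number of the action to execute (0 if nothing matches). */
int longest_match(const char *input, size_t *len)
{
    unsigned set = BIT(0) | BIT(1) | BIT(3) | BIT(7);   /* ε-closure of state 0 */
    int action = 0;
    size_t i;

    *len = 0;
    for (i = 0; input[i] != '\0'; i++) {
        int a;
        set = move(set, input[i]);
        if (set == 0) break;              /* no further moves are possible */
        a = action_of(set);
        if (a != 0) {                     /* remember the last accepting point */
            action = a;
            *len = i + 1;
        }
    }
    return action;
}
```

On the input aaba this returns action 3 with a lexeme of length 3 (aab); on abba it returns action 2 with a lexeme of length 3 (abb), because action2 is listed before action3.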
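Example DFA
[Figure: DFA with states 0, 1, 2, 3 and start state 0, containing the path 0 -a-> 1 -b-> 2 -b-> 3; the remaining edges, labeled a and b, lead back to earlier states.]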
Dstates
A = {0,1,3,7} (start)
B = {2,4,7}, labeled a1
C = {8}, labeled a3
D = {7}
E = {5,8}, labeled a3
F = {6,8}, labeled a2 a3
[Figure: transition diagram of the DFA over the states A to F.]
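The state sets listed in Dstates can be reproduced with a small subset construction. The sketch below assumes the combined NFA for the patterns a, abb and a*b+ (states 0 to 8, with ε-moves only out of state 0) and stores each set as a bit mask; the names move, subset_construction, dstates and dtran are hypothetical.

```c
#include <stddef.h>

#define BIT(s) (1u << (s))
enum { MAX_DSTATES = 16 };

/* NFA moves, written out by hand for this example. */
static unsigned move(unsigned set, char c)
{
    unsigned next = 0;
    if (c == 'a') {
        if (set & BIT(1)) next |= BIT(2);
        if (set & BIT(3)) next |= BIT(4);
        if (set & BIT(7)) next |= BIT(7);
    } else if (c == 'b') {
        if (set & BIT(4)) next |= BIT(5);
        if (set & BIT(5)) next |= BIT(6);
        if (set & BIT(7)) next |= BIT(8);
        if (set & BIT(8)) next |= BIT(8);
    }
    return next;
}

/* Build the DFA states. dstates[i] is the NFA state set of DFA state i;
   dtran[i][0] and dtran[i][1] are the target states on a and b, or -1 when
   there is no move. Returns the number of DFA states, or -1 if the table
   is too small. */
int subset_construction(unsigned dstates[MAX_DSTATES], int dtran[MAX_DSTATES][2])
{
    static const char symbols[2] = { 'a', 'b' };
    int count = 0;
    int i, j, k;

    dstates[count++] = BIT(0) | BIT(1) | BIT(3) | BIT(7);   /* ε-closure of {0} */

    for (i = 0; i < count; i++) {             /* states from i onward are unmarked */
        for (k = 0; k < 2; k++) {
            unsigned target = move(dstates[i], symbols[k]);
            dtran[i][k] = -1;
            if (target == 0) continue;        /* no move on this symbol */
            for (j = 0; j < count; j++)       /* is this set already a DFA state? */
                if (dstates[j] == target) break;
            if (j == count) {
                if (count == MAX_DSTATES) return -1;
                dstates[count++] = target;    /* new unmarked state */
            }
            dtran[i][k] = j;
        }
    }
    return count;
}
```

For this NFA the construction finds six states, in the order {0,1,3,7}, {2,4,7}, {8}, {7}, {5,8}, {6,8}, which correspond to A through F.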
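Minimizing the number of states of a DFA
[Figure: a DFA with states A, B, C, D, E and start state A (left) next to the equivalent minimized DFA with states A, B, D, E (right); both contain the path A -a-> B -b-> D -b-> E.]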
[Figure: DFA with state sets {1,2,3} (start), {1,2,3,4}, {1,2,3,5} and {1,2,3,6}, containing the path {1,2,3} -a-> {1,2,3,4} -b-> {1,2,3,5} -b-> {1,2,3,6}.]
Time-Space Tradeoffs
Automaton   Space (worst case)   Time (worst case)
NFA         O(|r|)               O(|r|×|x|)
DFA         O(2^|r|)             O(|x|)