
Unit 1 (B)


Unit - I

Chapter 3
Lexical Analysis and
Lexical Analyzer Generators

Analp Pathak
Assistant Professor
Department of Computer Science Engineering
SRM IST, NCR Campus

Outline
• Lexical Analysis
• Role of Lexical Analyzer
• Recognition of Tokens
• Lexical Analyzer Generator: Lex/Flex
• Design Aspects of Lexical Analyzer

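The Reason Why Lexical Analysis is a Separate Phase (Issues in Lexical Analysis)
• Simplifies the design of the compiler
– Without a separate scanner, LL(1) or LR(1) parsing with 1 lookahead would not be possible, because the parser would have to work on single characters rather than whole tokens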
• Provides efficient implementation
– Systematic techniques to implement lexical analyzers
by hand or automatically
– Stream buffering methods to scan input
• Improves portability
– Non-standard symbols and alternate character
encodings can be more easily translated

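Interaction of the Lexical Analyzer with the Parser
[Figure: the source program is read by the lexical analyzer. The parser issues "get next token" requests and receives a (token, tokenval) pair in reply. Both components report errors and both access the symbol table.]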

Attributes of Tokens

Source text: y := 31 + 28*x

The lexical analyzer passes each token to the parser together with its attribute tokenval:

<id, "y"> <assign, > <num, 31> <+, > <num, 28> <*, > <id, "x">

Tokens, Patterns, and Lexemes


• A token is a classification of lexical units
– For example: id and num
• Lexemes are the specific character strings that
make up a token
– For example: abc and 123
• Patterns are rules describing the set of lexemes
belonging to a token
– For example: “letter followed by letters and digits” and
“non-empty sequence of digits”

Example
• Consider Pascal Statement
– const pi = 3.1416;

Lexical Errors
• It is hard for a lexical analyzer to tell, without the aid of
other components, that there is a source-code error.
– Example: fi(a==f(x)) …
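• To the lexical analyzer, fi is a valid lexeme for the token id, so it returns id to the parser and leaves the error for a later phase to detect.
• The simplest recovery strategy is panic mode recovery: delete successive characters from the remaining input until a well-formed token is found.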
• Other possible error-recovery actions are:
– Delete one character from the remaining input.
– Insert a missing character into the remaining input.
– Replace a character by another character.
– Transpose two adjacent characters.
• Minimum-distance error correction (finding the fewest such transformations that turn the input into a valid program) is a convenient theoretical yardstick, but it is not generally used in practice because it is too costly to implement.

Input Buffering
• Efficiency Issues concerned with the buffering of
inputs.
• Three general approaches to the implementation of a
lexical analyzer.
– Use a lexical-analyzer generator, such as the Lex compiler,
to produce the lexical analyzer from a regular-expression-
based specification. In this case, the generator provides
routines for reading and buffering the input.
– Write the lexical analyzer in a conventional systems-
programming language, using the I/O facilities of that
language to read the input.
– Write the lexical analyzer in assembly language and
explicitly manage the reading of input.

Input Buffering
• Two-buffer input scheme to look ahead on the
input and identify tokens
• Buffer Pairs

Input Buffering
• Sentinels (Guards)
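A sentinel is a special character stored at the end of each buffer half, so that the scanner needs only one test per character in the common case; the end-of-buffer checks run only when the sentinel is seen. The sketch below is a minimal illustration of the buffer-pair scheme with sentinels, not a production scanner. It assumes that the NUL character never occurs in the source text, it reads from an in-memory string instead of a file, and it uses a deliberately tiny half size. The names Scanner, load_half, scanner_init, and next_char are hypothetical, and the lexeme-beginning pointer is left out.

```c
#include <stddef.h>
#include <string.h>

#define HALF 4            /* tiny half size so that reloads are easy to observe */
#define SENTINEL '\0'     /* assumption: never occurs in the source text */

typedef struct {
    char buf[2 * (HALF + 1)];  /* two halves, each followed by a sentinel slot */
    const char *src;           /* unread input (stands in for the source file) */
    size_t forward;            /* index of the next character to deliver */
} Scanner;

/* Fill one half from the source and terminate it with a sentinel. */
static void load_half(Scanner *s, int half) {
    char *p = s->buf + half * (HALF + 1);
    size_t n = strlen(s->src);
    if (n > HALF) n = HALF;
    memcpy(p, s->src, n);
    p[n] = SENTINEL;           /* lands inside the half if the input ran out */
    s->src += n;
}

void scanner_init(Scanner *s, const char *text) {
    s->src = text;
    s->forward = 0;
    load_half(s, 0);
}

/* Return the next character, or -1 at end of input. */
int next_char(Scanner *s) {
    for (;;) {
        char c = s->buf[s->forward];
        if (c != SENTINEL) {                     /* common case: a single test */
            s->forward++;
            return (unsigned char)c;
        }
        if (s->forward == HALF) {                /* end of first half */
            load_half(s, 1);
            s->forward = HALF + 1;
        } else if (s->forward == 2 * HALF + 1) { /* end of second half */
            load_half(s, 0);
            s->forward = 0;
        } else {
            return -1;                           /* sentinel inside a half: end of input */
        }
    }
}
```

A caller would invoke scanner_init once and then call next_char repeatedly until it returns -1. A real scanner would refill each half with a block read from the file and would also keep a lexeme-beginning pointer into the same buffer.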

Specification of Patterns for Tokens: Terminology
• An alphabet Σ is a finite set of symbols (characters)
• A string s is a finite sequence of symbols from Σ
– |s| denotes the length of string s
– ε denotes the empty string, thus |ε| = 0
• A language is a specific set of strings over some fixed alphabet Σ

Specification of Patterns for Tokens: String Operations
• The concatenation of two strings x and y is denoted by xy
• The exponentiation of a string s is defined by
$s^0 = \varepsilon$
$s^i = s^{i-1}s$ for $i > 0$
(note that $s\varepsilon = \varepsilon s = s$)

Specification of Patterns for Tokens: Language Operations
• Union
$L \cup M = \{ s \mid s \in L \text{ or } s \in M \}$
• Concatenation
$LM = \{ xy \mid x \in L \text{ and } y \in M \}$
• Exponentiation
$L^0 = \{\varepsilon\}$; $L^i = L^{i-1}L$
• Kleene closure
$L^* = \bigcup_{i=0}^{\infty} L^i$
• Positive closure
$L^+ = \bigcup_{i=1}^{\infty} L^i$

Specification of Patterns for Tokens: Regular Expressions
• Basis symbols:
– ε is a regular expression denoting language {ε}
– a ∈ Σ is a regular expression denoting {a}
• If r and s are regular expressions denoting languages L(r) and M(s) respectively, then
– r | s is a regular expression denoting L(r) ∪ M(s)
– rs is a regular expression denoting L(r)M(s)
– r* is a regular expression denoting L(r)*
– (r) is a regular expression denoting L(r)
• A language defined by a regular expression is called a regular set

Specification of Patterns for Tokens: Regular Definitions
• Naming convention for regular expressions:
$d_1 \to r_1$
$d_2 \to r_2$
$\vdots$
$d_n \to r_n$
where each $r_i$ is a regular expression over $\Sigma \cup \{d_1, d_2, \ldots, d_{i-1}\}$
• Each $d_j$ in $r_i$ is textually substituted in $r_i$

Specification of Patterns for Tokens: Regular Definitions
• Example:

letter → A | B | … | Z | a | b | … | z
digit → 0 | 1 | … | 9
id → letter ( letter | digit )*

• Recursion cannot be used; this definition is illegal:

digits → digit digits | digit

Specification of Patterns for Tokens: Notational Shorthands
• We frequently use the following shorthands:
r+ = rr*
r? = r | ε
[a-z] = a | b | c | … | z
• For example:
digit → [0-9]
num → digit+ (. digit+)? ( E (+|-)? digit+ )?
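To make the shorthand concrete, the pattern num → digit+ (. digit+)? ( E (+|-)? digit+ )? can be matched by hand with one helper for digit+ and two optional parts. This is only a sketch under the assumption that the input is a NUL-terminated C string; the function names digits and match_num are hypothetical. It returns the length of the longest prefix that matches num.

```c
#include <ctype.h>
#include <stddef.h>

/* Consume digit+ starting at s[i]; return the index just past the digits,
   or i itself if there is no digit at s[i]. */
static size_t digits(const char *s, size_t i) {
    while (isdigit((unsigned char)s[i])) i++;
    return i;
}

/* Length of the longest prefix of s that matches
   num -> digit+ (. digit+)? ( E (+|-)? digit+ )?   (0 if there is no match). */
size_t match_num(const char *s) {
    size_t end = digits(s, 0);
    if (end == 0) return 0;               /* the leading digit+ is mandatory */

    if (s[end] == '.') {                  /* optional fraction: (. digit+)? */
        size_t j = digits(s, end + 1);
        if (j > end + 1) end = j;         /* keep it only if digits follow the dot */
    }
    if (s[end] == 'E') {                  /* optional exponent: ( E (+|-)? digit+ )? */
        size_t k = end + 1;
        if (s[k] == '+' || s[k] == '-') k++;
        size_t j = digits(s, k);
        if (j > k) end = j;               /* keep it only if digits follow */
    }
    return end;
}
```

For instance, match_num("3.1416;") yields 6, while match_num("12.") yields 2 because the fraction part requires at least one digit after the dot.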

Regular Definitions and Grammars

Grammar
stmt → if expr then stmt
     | if expr then stmt else stmt
     | ε
expr → term relop term
     | term
term → id
     | num

Regular definitions
if → if
then → then
else → else
relop → < | <= | <> | > | >= | =
id → letter ( letter | digit )*
num → digit+ (. digit+)? ( E (+|-)? digit+ )?

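Implementing a Scanner Using Transition Diagrams

relop → < | <= | <> | > | >= | =

State      Input   Next state   Action
0 (start)  <       1
0          =       5            return(relop, EQ)
0          >       6
1          =       2            return(relop, LE)
1          >       3            return(relop, NE)
1          other   4 *          return(relop, LT)
6          =       7            return(relop, GE)
6          other   8 *          return(relop, GT)

id → letter ( letter | digit )*

State      Input             Next state   Action
9 (start)  letter            10
10         letter or digit   10
10         other             11 *         return(gettoken(), install_id())

(* marks an accepting state that retracts the input by one character, since the character read on the "other" edge is not part of the lexeme)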

Implementing a Scanner Using Transition Diagrams (Code)

token nexttoken()
{ while (1) {
    switch (state) {
    case 0: c = nextchar();
      if (c==blank || c==tab || c==newline) {
        state = 0;
        lexeme_beginning++;
      }
      else if (c=='<') state = 1;
      else if (c=='=') state = 5;
      else if (c=='>') state = 6;
      else state = fail();
      break;
    case 1:
      …
    case 9: c = nextchar();
      if (isletter(c)) state = 10;
      else state = fail();
      break;
    case 10: c = nextchar();
      if (isletter(c)) state = 10;
      else if (isdigit(c)) state = 10;
      else state = 11;
      break;
      …
    }
  }
}

/* Decides what other start state is applicable */
int fail()
{ forward = token_beginning;
  switch (start) {
  case 0: start = 9; break;
  case 9: start = 12; break;
  case 12: start = 20; break;
  case 20: start = 25; break;
  case 25: recover(); break;
  default: /* error */
  }
  return start;
}

A Language for Specifying Lexical Analyzers: The Lex and Flex Scanner Generators
• Lex and its newer cousin flex are scanner generators
• Systematically translate regular definitions into C source code for efficient scanning
• Generated code is easy to integrate in C applications

Creating a Lexical Analyzer with Lex and Flex
lex source program lex.l → lex or flex compiler → lex.yy.c
lex.yy.c → C compiler → a.out
input stream → a.out → sequence of tokens

Lex Specification
• A lex specification consists of three parts:
regular definitions, C declarations in %{ %}
%%
translation rules
%%
user-defined auxiliary procedures
• The translation rules are of the form:
p1 { action1 }
p2 { action2 }
…
pn { actionn }

Regular Expressions in Lex

x         match the character x
\.        match the character .
"string"  match the contents of string literally
.         match any character except newline
^         match beginning of a line
$         match the end of a line
[xyz]     match one character x, y, or z (use \ to escape -)
[^xyz]    match any character except x, y, and z
[a-z]     match one of a to z
r*        closure (match zero or more occurrences)
r+        positive closure (match one or more occurrences)
r?        optional (match zero or one occurrence)
r1r2      match r1 then r2 (concatenation)
r1|r2     match r1 or r2 (union)
(r)       grouping
r1/r2     match r1 when followed by r2
{d}       match the regular expression defined by d

Lex actions
• BEGIN: It switches the scanner to the named start condition. The lexical analyzer starts in state 0 (INITIAL).
• ECHO: It copies the matched text (yytext) to the output as it is.
• yytext: When the lexer recognizes a token in the input, the matching lexeme is stored in a null-terminated string called yytext.
• yylex(): Calling yylex() starts (or resumes) scanning the source program.
• yywrap(): It is called when the scanner encounters end of file. If it returns 1, scanning stops; if it returns 0, the scanner assumes that more input has been set up and continues scanning.
• yyin: It is the input file pointer from which the scanner reads the source program; by default it is standard input.
• yyleng: It stores the length (number of characters) of the lexeme held in yytext. The value in yyleng is the same as strlen(yytext).

Installing Software
• Download Flex 2.5.4a
• Download Bison 2.4.1
• Download DevC++
• Install Flex at "C:\GnuWin32"
• Install Bison at "C:\GnuWin32"
• Install DevC++ at "C:\Dev-Cpp"
• Open Environment Variables.
– Add "C:\GnuWin32\bin;C:\Dev-Cpp\bin;" to path.

Example Lex Specification 1

%{
#include <stdio.h>
%}
%%
[0-9]+   { printf("%s\n", yytext); }   /* translation rule; yytext contains the matching lexeme */
.|\n     { }
%%
int main( )
{ yylex( );                            /* invokes the lexical analyzer */
}
int yywrap( )
{ return 1;
}

Build and run:
lex spec.l
gcc lex.yy.c -ll
./a.out < spec.l

Execution of Lex Specification 1

[Screenshot: only the digits in the input are printed.]

Example Lex Specification 2

The definitions section declares the counters and the regular definition delim; the section between the %% marks holds the translation rules.

%{
#include <stdio.h>
int ch = 0, wd = 0, nl = 0;
%}
delim     [ \t]+
%%
\n        { ch++; wd++; nl++; }
^{delim}  { ch+=yyleng; }
{delim}   { ch+=yyleng; wd++; }
.         { ch++; }
%%
int main()
{ yylex();
  printf("%8d%8d%8d\n", nl, wd, ch);
  return 0;
}

Example Lex Specification 3

The definitions section holds the regular definitions digit, letter, and id; the section between the %% marks holds the translation rules.

%{
#include <stdio.h>
%}
digit     [0-9]
letter    [A-Za-z]
id        {letter}({letter}|{digit})*
%%
{digit}+  { printf("number: %s\n", yytext); }
{id}      { printf("ident: %s\n", yytext); }
.         { printf("other: %s\n", yytext); }
%%
int main()
{ yylex();
  return 0;
}

Example Lex Specification 4

%{ /* definitions of manifest constants */
#define LT (256)
…
%}
delim     [ \t\n]
ws        {delim}+
letter    [A-Za-z]
digit     [0-9]
id        {letter}({letter}|{digit})*
number    {digit}+(\.{digit}+)?(E[+\-]?{digit}+)?
%%
{ws}      { }
if        {return IF;}
then      {return THEN;}
else      {return ELSE;}
{id}      {yylval = install_id(); return ID;}
{number}  {yylval = install_num(); return NUMBER;}
"<"       {yylval = LT; return RELOP;}
"<="      {yylval = LE; return RELOP;}
"="       {yylval = EQ; return RELOP;}
"<>"      {yylval = NE; return RELOP;}
">"       {yylval = GT; return RELOP;}
">="      {yylval = GE; return RELOP;}
%%
int install_id()
…

Each return statement returns a token to the parser, and yylval carries the token attribute. install_id() installs yytext as an identifier in the symbol table.

Lex Program
(Write a program in Lex to identify identifiers and keywords in a sentence.)
%{
#include<stdio.h>
static int key_word=0;
static int identifier=0;
%}
%%
"include"|"for"|"define"|"int"|"char"|"float"|"double" {key_word++;printf("keyword found\n");}
[A-Za-z_][A-Za-z0-9_]* {identifier++;printf("identifier found\n");}
.|\n { /* skip everything else */ }
%%
int main()
{
printf("enter the sentence\n");
yylex();
printf("keywords are: %d\n and identifiers are: %d\n",key_word,identifier);
return 0;
}
int yywrap()
{
return 1;
}

Design of a Lexical Analyzer Generator
• Translate regular expressions to NFA
• Translate NFA to an efficient DFA
[Figure: regular expressions → NFA → DFA, where the NFA-to-DFA step is optional. Either simulate the NFA to recognize tokens, or simulate the DFA to recognize tokens.]

Nondeterministic Finite Automata
• Definition: an NFA is a 5-tuple (S, Σ, δ, s0, F) where
S is a finite set of states
Σ is a finite set of symbols, the input alphabet
δ is a mapping from S × Σ to a set of states
s0 ∈ S is the start state
F ⊆ S is the set of accepting (or final) states

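Transition Graph
• An NFA can be diagrammatically represented by a labeled directed graph called a transition graph

S = {0,1,2,3}
Σ = {a,b}
s0 = 0
F = {3}
[Figure: transition graph with edges 0 –a→ 0, 0 –b→ 0, 0 –a→ 1, 1 –b→ 2, 2 –b→ 3; state 0 is the start state and state 3 is accepting.]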

Transition Table
• The mapping δ of an NFA can be represented in a transition table

δ(0,a) = {0,1}
δ(0,b) = {0}
δ(1,b) = {2}
δ(2,b) = {3}

State   Input a   Input b
0       {0, 1}    {0}
1                 {2}
2                 {3}

The Language Defined by an NFA
• An NFA accepts an input string x iff there is some
path with edges labeled with symbols from x in
sequence from the start state to some accepting
state in the transition graph
• A state transition from one state to another on the
path is called a move
• The language defined by an NFA is the set of
input strings it accepts, such as (a|b)*abb for the
example NFA

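Design of a Lexical Analyzer Generator: RE to NFA to DFA
Lex specification with regular expressions:
p1 { action1 }
p2 { action2 }
…
pn { actionn }
[Figure: the specification is turned into one NFA. A new start state s0 has an ε-transition to each of N(p1), N(p2), …, N(pn), and the accepting state of N(pi) is labeled with actioni. The NFA can then be converted to a DFA by the (optional) subset construction.]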

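From Regular Expression to NFA (Thompson’s Construction)
[Figure: the NFA fragment built for each form of regular expression, each with one start state i and one accepting state f.]
ε: i –ε→ f
a: i –a→ f
r1 | r2: i has ε-transitions to the start states of N(r1) and N(r2), whose accepting states each have an ε-transition to f
r1 r2: N(r1) followed by N(r2); i is the start state of N(r1) and f is the accepting state of N(r2)
r*: i –ε→ N(r) –ε→ f, with an ε-transition from the end of N(r) back to its start and an ε-transition from i directly to f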

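Combining the NFAs of a Set of Regular Expressions
a { action1 }
abb { action2 }
a*b+ { action3 }
[Figure: the three separate NFAs and the combined NFA.]
NFA for a: 1 –a→ 2
NFA for abb: 3 –a→ 4 –b→ 5 –b→ 6
NFA for a*b+: 7 –a→ 7, 7 –b→ 8, 8 –b→ 8
Combined NFA: a new start state 0 with ε-transitions to 1, 3, and 7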

Simulating the Combined NFA Example 1
[Figure: the combined NFA, with accepting states 2 (action1), 6 (action2), and 8 (action3).]

Input: a a b a

{0,1,3,7} –a→ {2,4,7} –a→ {7} –b→ {8} –a→ none

The last set, {8}, is accepting: execute action3.

Must find the longest match:
Continue until no further moves are possible
When last state is accepting: execute action

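Simulating the Combined NFA Example 2
[Figure: the combined NFA, with accepting states 2 (action1), 6 (action2), and 8 (action3).]

Input: a b b a

{0,1,3,7} –a→ {2,4,7} –b→ {5,8} –b→ {6,8} –a→ none

The last set, {6,8}, contains the accepting states 6 (action2) and 8 (action3).

When two or more accepting states are reached, the first action given in the Lex specification is executed (here action2)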

Deterministic Finite Automata
• A deterministic finite automaton is a special case of an NFA
– No state has an ε-transition
– For each state s and input symbol a there is at most one edge labeled a leaving s
• Each entry in the transition table is a single state
– At most one path exists to accept a string
– Simulation algorithm is simple

Example DFA

A DFA that accepts (a|b)*abb

State          a   b
0 (start)      1   0
1              1   2
2              1   3
3 (accepting)  1   0
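Because every entry of a DFA transition table is a single state, simulation is a plain table lookup per input character. The sketch below assumes the four-state DFA for (a|b)*abb described above (start state 0, accepting state 3); the table is a reading of that example, and the name dfa_accepts is hypothetical.

```c
#include <stdbool.h>

/* Transition table of the example DFA: rows are states 0..3,
   column 0 is input a and column 1 is input b. */
static const int dtran[4][2] = {
    /* a  b */
    { 1, 0 },   /* state 0 (start) */
    { 1, 2 },   /* state 1 */
    { 1, 3 },   /* state 2 */
    { 1, 0 },   /* state 3 (accepting) */
};

/* Return true iff x is in the language (a|b)*abb. */
bool dfa_accepts(const char *x) {
    int state = 0;
    for (; *x != '\0'; x++) {
        if (*x != 'a' && *x != 'b') return false;  /* symbol outside the alphabet */
        state = dtran[state][*x == 'b'];           /* one lookup per character */
    }
    return state == 3;
}
```

The running time is proportional to the length of the input and does not depend on the size of the regular expression.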

Conversion of an NFA into a DFA
• The subset construction algorithm converts an NFA into a DFA using:
$\varepsilon\text{-closure}(s) = \{s\} \cup \{t \mid s \to_\varepsilon \cdots \to_\varepsilon t\}$
$\varepsilon\text{-closure}(T) = \bigcup_{s \in T} \varepsilon\text{-closure}(s)$
$\mathit{move}(T,a) = \{t \mid s \to_a t \text{ and } s \in T\}$
• The algorithm produces:
Dstates is the set of states of the new DFA consisting of sets of states of the NFA
Dtran is the transition table of the new DFA

-closure and move Examples


-closure({0}) = {0,1,3,7}
a
1 2 move({0,1,3,7},a) = {2,4,7}
 -closure({2,4,7}) = {2,4,7}
start
 a b b move({2,4,7},a) = {7}
0 3 4 5 6
a b -closure({7}) = {7}
 move({7},b) = {8}
7 b 8 -closure({8}) = {8}
move({8},a) = 
a a b a
none
0 2 7 8
1 4
3 7
7 Also used to simulate NFAs
52

Simulating an NFA using ε-closure and move
S := ε-closure({s0})
Sprev := ∅
a := nextchar()
while S ≠ ∅ do
    Sprev := S
    S := ε-closure(move(S,a))
    a := nextchar()
end do
if Sprev ∩ F ≠ ∅ then
    execute action in Sprev
    return "yes"
else return "no"
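The simulation can be sketched in C with each set of NFA states stored as a bit mask. The example below hard-codes the combined NFA for the rules a, abb, and a*b+ (states 0 to 8) and is an illustration under assumptions rather than the code a generator would emit: the names eps_closure, move, first_action, and longest_match are hypothetical, and, as a small refinement of the pseudocode, it remembers the most recent accepting set so that the longest match is still reported when the final non-empty set is not accepting.

```c
#include <stddef.h>

enum { NSTATES = 9 };
typedef unsigned StateSet;             /* bit i set <=> NFA state i is in the set */
#define BIT(i) (1u << (i))

/* Combined NFA for the rules  a | abb | a*b+  (states 0..8). */
static const StateSet eps[NSTATES]  = { BIT(1) | BIT(3) | BIT(7) };  /* 0 -eps-> 1,3,7 */
static const StateSet on_a[NSTATES] = { [1] = BIT(2), [3] = BIT(4), [7] = BIT(7) };
static const StateSet on_b[NSTATES] = { [4] = BIT(5), [5] = BIT(6), [7] = BIT(8), [8] = BIT(8) };
/* Rule number attached to each accepting state (0 = not accepting). */
static const int action_of[NSTATES] = { [2] = 1, [6] = 2, [8] = 3 };

static StateSet eps_closure(StateSet T) {
    StateSet result = T, old;
    do {                               /* add eps-successors until nothing changes */
        old = result;
        for (int s = 0; s < NSTATES; s++)
            if (old & BIT(s)) result |= eps[s];
    } while (result != old);
    return result;
}

static StateSet move(StateSet T, char c) {
    StateSet result = 0;
    for (int s = 0; s < NSTATES; s++)
        if (T & BIT(s))
            result |= (c == 'a') ? on_a[s] : (c == 'b') ? on_b[s] : 0;
    return result;
}

/* Lowest-numbered rule among the accepting states in S, or 0 if none. */
static int first_action(StateSet S) {
    int best = 0;
    for (int s = 0; s < NSTATES; s++)
        if ((S & BIT(s)) && action_of[s] && (best == 0 || action_of[s] < best))
            best = action_of[s];
    return best;
}

/* Return the length of the longest prefix of x matched by some rule and
   store the chosen rule number in *action (both 0 if nothing matches). */
size_t longest_match(const char *x, int *action) {
    StateSet S = eps_closure(BIT(0));
    size_t len = 0, i = 0;
    *action = 0;
    while (x[i] != '\0') {
        S = eps_closure(move(S, x[i]));
        if (S == 0) break;             /* no further moves are possible */
        i++;
        int act = first_action(S);
        if (act) { *action = act; len = i; }   /* remember the last accepting set */
    }
    return len;
}
```

On the input aaba this reports a match of length 3 with action 3, and on abba a match of length 3 with action 2, in line with Examples 1 and 2 of the combined NFA.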

The Subset Construction Algorithm
Initially, ε-closure(s0) is the only state in Dstates and it is unmarked
while there is an unmarked state T in Dstates do
    mark T
    for each input symbol a ∈ Σ do
        U := ε-closure(move(T,a))
        if U is not in Dstates then
            add U as an unmarked state to Dstates
        end if
        Dtran[T,a] := U
    end do
end do
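The algorithm can be sketched in C with bit masks standing for sets of NFA states. The example below is an assumption-laden illustration: it hard-codes an 11-state Thompson NFA for (a|b)*abb whose edge list is reconstructed from the figure of Subset Construction Example 1, it treats every state at an index below the loop counter as marked, and the name subset_construction is hypothetical.

```c
enum { NSTATES = 11, MAXD = 32 };
typedef unsigned StateSet;             /* bit i set <=> NFA state i is in the set */
#define BIT(i) (1u << (i))

/* Thompson NFA for (a|b)*abb: states 0..10, accepting state 10. */
static const StateSet eps[NSTATES] = {
    [0] = BIT(1) | BIT(7), [1] = BIT(2) | BIT(4),
    [3] = BIT(6), [5] = BIT(6), [6] = BIT(1) | BIT(7),
};
static const StateSet on_a[NSTATES] = { [2] = BIT(3), [7] = BIT(8) };
static const StateSet on_b[NSTATES] = { [4] = BIT(5), [8] = BIT(9), [9] = BIT(10) };

static StateSet eps_closure(StateSet T) {
    StateSet result = T, old;
    do {                               /* add eps-successors until nothing changes */
        old = result;
        for (int s = 0; s < NSTATES; s++)
            if (old & BIT(s)) result |= eps[s];
    } while (result != old);
    return result;
}

/* sym is 0 for input a and 1 for input b. */
static StateSet move(StateSet T, int sym) {
    StateSet result = 0;
    for (int s = 0; s < NSTATES; s++)
        if (T & BIT(s)) result |= sym ? on_b[s] : on_a[s];
    return result;
}

StateSet dstates[MAXD];   /* Dstates: each DFA state is a set of NFA states */
int dtran[MAXD][2];       /* Dtran: column 0 is input a, column 1 is input b */

/* Run the subset construction. Returns the number of DFA states,
   or -1 if more than MAXD states would be needed. */
int subset_construction(void) {
    int n = 0;
    dstates[n++] = eps_closure(BIT(0));
    for (int t = 0; t < n; t++) {              /* states below index t are marked */
        for (int sym = 0; sym < 2; sym++) {
            StateSet U = eps_closure(move(dstates[t], sym));
            int u = 0;
            while (u < n && dstates[u] != U) u++;   /* is U already in Dstates? */
            if (u == n) {                           /* no: add it, unmarked */
                if (n == MAXD) return -1;
                dstates[n++] = U;
            }
            dtran[t][sym] = u;
        }
    }
    return n;
}
```

For this NFA the function finds five DFA states, numbered 0 to 4 in order of discovery, which correspond to A = {0,1,2,4,7}, B, C, D, and E of Example 1.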

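Subset Construction Example 1
[Figure: NFA for (a|b)*abb with states 0 to 10 and edges 0 –ε→ 1, 0 –ε→ 7, 1 –ε→ 2, 1 –ε→ 4, 2 –a→ 3, 4 –b→ 5, 3 –ε→ 6, 5 –ε→ 6, 6 –ε→ 1, 6 –ε→ 7, 7 –a→ 8, 8 –b→ 9, 9 –b→ 10.]

Dstates
A = {0,1,2,4,7}
B = {1,2,3,4,6,7,8}
C = {1,2,4,5,6,7}
D = {1,2,4,5,6,7,9}
E = {1,2,4,5,6,7,10}

[Figure: resulting DFA with start state A and accepting state E.]
State   a   b
A       B   C
B       B   D
C       B   C
D       B   E
E       B   C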

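Subset Construction Example 2
[Figure: the combined NFA for a (action a1, accepting state 2), abb (action a2, accepting state 6), and a*b+ (action a3, accepting state 8).]

Dstates
A = {0,1,3,7}
B = {2,4,7}
C = {8}
D = {7}
E = {5,8}
F = {6,8}

[Figure: resulting DFA with start state A.]
State   a   b   Actions
A       B   C
B       D   E   a1
C           C   a3
D       D   C
E           F   a3
F           C   a2, a3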

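Minimizing the Number of States of a DFA
[Figure: the five-state DFA with states A, B, C, D, E (left) and the equivalent four-state DFA with states A, B, D, E (right), in which C has been merged into A.]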

Minimizing the Number of States of a DFA (Hopcroft Algorithm)

• Algorithm: Minimizing the number of states of a DFA.
• Input: A DFA D with set of states S, input alphabet Σ, start state s0, and set of accepting states F.
• Output: A DFA D' accepting the same language as D and having as few states as possible.
• Method:
1. Start with an initial partition Π with two groups, F and S – F, the accepting and non-accepting states of D.
2. Apply the following procedure to construct a new partition Πnew:
Initially, let Πnew = Π;
for ( each group G of Π ) do begin
    partition G into subgroups such that two states s and t are in the same subgroup if and only if for all input symbols a, states s and t have transitions on a to states in the same group of Π; /* at worst, a state will be in a subgroup by itself */
    replace G in Πnew by the set of all subgroups formed;
end
3. If Πnew = Π, let Πfinal = Π and continue with step (4). Otherwise, repeat step (2) with Πnew in place of Π.
4. Choose one state in each group of Πfinal as the representative for that group. The representatives will be the states of the minimum-state DFA D'.

Example
• Using the above algorithm, minimize the given DFA.

Transition table of the given DFA:
STATE   a   b
A       B   C
B       B   D
C       B   C
D       B   E
E       B   C

• Minimizing the states:
Π0 = (ABCD) (E)
Π1 = (ABC) (D) (E)
Π2 = (AC) (B) (D) (E)
Π3 = (AC) (B) (D) (E)

• Now, construct the minimum-state DFA. It has four states, corresponding to the four groups of Π3; let us pick A, B, D, and E as the representatives of these groups. The initial state is A, and the only accepting state is E. The table below shows the transition function for the DFA.

Example (Contd…)

Transition table of the given DFA:
STATE   a   b
A       B   C
B       B   D
C       B   C
D       B   E
E       B   C

Transition table of the minimized DFA:
STATE   a   b
A       B   A
B       B   D
D       B   E
E       B   A
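The partition refinement of the example can be reproduced with a short C sketch. It hard-codes the five-state DFA from the example (A to E numbered 0 to 4, with E accepting) and repeats step 2 until the number of groups stops growing, which is a valid stopping test because each pass can only split groups. The function name minimize and the array names are hypothetical, and this straightforward version makes no attempt at the O(n log n) bound.

```c
enum { NSTATES = 5, NSYMS = 2 };

/* Transition table of the example DFA; states A..E are numbered 0..4. */
static const int dtran[NSTATES][NSYMS] = {
    /* a  b */
    { 1, 2 },  /* A */
    { 1, 3 },  /* B */
    { 1, 2 },  /* C */
    { 1, 4 },  /* D */
    { 1, 2 },  /* E (accepting) */
};
static const int accepting[NSTATES] = { 0, 0, 0, 0, 1 };

/* Partition the states into groups of equivalent states.
   On return group[s] is the group number of state s; the result is the
   number of groups, that is, the number of states of the minimized DFA. */
int minimize(int group[NSTATES]) {
    int ngroups = 2;
    for (int s = 0; s < NSTATES; s++) group[s] = accepting[s];  /* F and S - F */
    for (;;) {
        int next[NSTATES], nnext = 0;
        for (int s = 0; s < NSTATES; s++) {
            next[s] = -1;
            /* s joins the subgroup of an earlier state t when both were in the
               same old group and, for every symbol, move to the same old group */
            for (int t = 0; t < s && next[s] < 0; t++) {
                int same = group[s] == group[t];
                for (int a = 0; a < NSYMS && same; a++)
                    same = group[dtran[s][a]] == group[dtran[t][a]];
                if (same) next[s] = next[t];
            }
            if (next[s] < 0) next[s] = nnext++;   /* s starts a new subgroup */
        }
        for (int s = 0; s < NSTATES; s++) group[s] = next[s];
        if (nnext == ngroups) return ngroups;     /* partition unchanged: done */
        ngroups = nnext;
    }
}
```

For the example DFA the function returns 4 and places A and C in the same group, matching the final partition (AC) (B) (D) (E).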

From Regular Expression to DFA Directly
• The important states of an NFA are those without an ε-transition, that is, if move({s},a) ≠ ∅ for some a then s is an important state
• The subset construction algorithm uses only the important states when it determines ε-closure(move(T,a))

From Regular Expression to DFA Directly (Algorithm)
• Augment the regular expression r with a
special end symbol # to make accepting
states important: the new expression is r#
• Construct a syntax tree for r#
• Traverse the tree to construct functions
nullable, firstpos, lastpos, and followpos

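From Regular Expression to DFA Directly: Syntax Tree of (a|b)*abb#
[Figure: syntax tree of (a|b)*abb#. The root is a concatenation node whose right child is the leaf # (position 6). Going down the left spine, further concatenation nodes have the right children b (position 5), b (position 4), and a (position 3). The leftmost subtree is a closure node * over an alternation node | with leaves a (position 1) and b (position 2). Position numbers are assigned to the leaves.]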

From Regular Expression to DFA Directly: Annotating the Tree
• nullable(n): the subtree at node n generates a language that includes the empty string
• firstpos(n): the set of positions that can match the first symbol of a string generated by the subtree at node n
• lastpos(n): the set of positions that can match the last symbol of a string generated by the subtree at node n
• followpos(i): the set of positions that can follow position i in the tree

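From Regular Expression to DFA Directly: Annotating the Tree

Leaf ε
    nullable(n): true
    firstpos(n): ∅
    lastpos(n): ∅
Leaf i
    nullable(n): false
    firstpos(n): {i}
    lastpos(n): {i}
Alternation node | with children c1 and c2
    nullable(n): nullable(c1) or nullable(c2)
    firstpos(n): firstpos(c1) ∪ firstpos(c2)
    lastpos(n): lastpos(c1) ∪ lastpos(c2)
Concatenation node • with children c1 and c2
    nullable(n): nullable(c1) and nullable(c2)
    firstpos(n): if nullable(c1) then firstpos(c1) ∪ firstpos(c2) else firstpos(c1)
    lastpos(n): if nullable(c2) then lastpos(c1) ∪ lastpos(c2) else lastpos(c2)
Closure node * with child c1
    nullable(n): true
    firstpos(n): firstpos(c1)
    lastpos(n): lastpos(c1)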

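From Regular Expression to DFA Directly: Syntax Tree of (a|b)*abb#
[Figure: the syntax tree annotated with firstpos to the left and lastpos to the right of each node; the closure node is marked nullable.]

Node                           firstpos     lastpos
a (position 1)                 {1}          {1}
b (position 2)                 {2}          {2}
| (alternation)                {1, 2}       {1, 2}
* (closure, nullable)          {1, 2}       {1, 2}
a (position 3)                 {3}          {3}
concatenation ending in a      {1, 2, 3}    {3}
b (position 4)                 {4}          {4}
concatenation ending in b      {1, 2, 3}    {4}
b (position 5)                 {5}          {5}
concatenation ending in b      {1, 2, 3}    {5}
# (position 6)                 {6}          {6}
concatenation ending in # (root)  {1, 2, 3}    {6}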

From Regular Expression to DFA Directly: followpos
for each node n in the tree do
    if n is a cat-node with left child c1 and right child c2 then
        for each i in lastpos(c1) do
            followpos(i) := followpos(i) ∪ firstpos(c2)
        end do
    else if n is a star-node
        for each i in lastpos(n) do
            followpos(i) := followpos(i) ∪ firstpos(n)
        end do
    end if
end do

From Regular Expression to DFA Directly: Algorithm
s0 := firstpos(root) where root is the root of the syntax tree
Dstates := {s0} and is unmarked
while there is an unmarked state T in Dstates do
    mark T
    for each input symbol a ∈ Σ do
        let U be the set of positions that are in followpos(p)
            for some position p in T,
            such that the symbol at position p is a
        if U is not empty and not in Dstates then
            add U as an unmarked state to Dstates
        end if
        Dtran[T,a] := U
    end do
end do

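From Regular Expression to DFA Directly: Example
Node   followpos
1      {1, 2, 3}
2      {1, 2, 3}
3      {4}
4      {5}
5      {6}
6      -

[Figure: followpos drawn as a directed graph over the positions 1 to 6, with an edge from i to j when j is in followpos(i).]

[Figure: resulting DFA with start state {1,2,3} and accepting state {1,2,3,6}.]
State       a           b
{1,2,3}     {1,2,3,4}   {1,2,3}
{1,2,3,4}   {1,2,3,4}   {1,2,3,5}
{1,2,3,5}   {1,2,3,4}   {1,2,3,6}
{1,2,3,6}   {1,2,3,4}   {1,2,3}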
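Given the followpos table, the direct construction is a small worklist loop. The sketch below hard-codes the positions and followpos sets of (a|b)*abb# from the example and stores each DFA state as a bit mask of positions. The names build_dfa, symbol_at, dstates, and dtran are hypothetical, and, as an assumption of this sketch, a missing transition is recorded as -1 instead of as an empty set.

```c
enum { NPOS = 6, MAXD = 64 };          /* at most 2^6 distinct position sets */
typedef unsigned PosSet;               /* bit p set <=> position p is in the set */
#define BIT(p) (1u << (p))

/* Symbol at each position of (a|b)*abb# (index 0 is unused). */
static const char symbol_at[NPOS + 1] = { 0, 'a', 'b', 'a', 'b', 'b', '#' };
/* followpos table from the example; followpos(6) is empty. */
static const PosSet followpos[NPOS + 1] = {
    [1] = BIT(1) | BIT(2) | BIT(3),
    [2] = BIT(1) | BIT(2) | BIT(3),
    [3] = BIT(4), [4] = BIT(5), [5] = BIT(6),
};

PosSet dstates[MAXD];   /* Dstates: each DFA state is a set of positions */
int dtran[MAXD][2];     /* column 0 is input a, column 1 is input b; -1 = none */

/* Build the DFA and return its number of states. A state is accepting
   when it contains position 6, the position of the end marker #. */
int build_dfa(void) {
    const char syms[2] = { 'a', 'b' };
    int n = 0;
    dstates[n++] = BIT(1) | BIT(2) | BIT(3);     /* s0 = firstpos(root) */
    for (int t = 0; t < n; t++) {                /* states below index t are marked */
        for (int k = 0; k < 2; k++) {
            PosSet U = 0;
            for (int p = 1; p <= NPOS; p++)      /* union of followpos(p) for p in T on this symbol */
                if ((dstates[t] & BIT(p)) && symbol_at[p] == syms[k])
                    U |= followpos[p];
            if (U == 0) { dtran[t][k] = -1; continue; }
            int u = 0;
            while (u < n && dstates[u] != U) u++;  /* is U already in Dstates? */
            if (u == n) dstates[n++] = U;
            dtran[t][k] = u;
        }
    }
    return n;
}
```

For this example the function produces four states, {1,2,3}, {1,2,3,4}, {1,2,3,5}, and {1,2,3,6}, numbered 0 to 3 in that order, with state 3 accepting.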

Time-Space Tradeoffs
Automaton   Space (worst case)   Time (worst case)
NFA         O(|r|)               O(|r| × |x|)
DFA         O(2^|r|)             O(|x|)

Here |r| is the length of the regular expression and |x| is the length of the input string.

Minimizing the number of states of a DFA with n states takes O(n log n) time.