SSC Module2 LexicalAnalysis

Chapter 3 (3.1 – 3.4)

Sunitha G, SJEC 1
Outline
 Role of lexical analyzer
 Input buffering
 Specification of tokens
 Recognition of tokens

The role of lexical analyzer
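[Figure: source program -> Lexical Analyzer -> (token) -> Parser -> to semantic analysis. The parser requests each token with getNextToken; the lexical analyzer and the parser both access the symbol table.]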

Why separate lexical analysis
and parsing
1. Simplicity of design
2. Improving compiler efficiency
3. Enhancing compiler portability

Tokens, Patterns and Lexemes
 A token is a pair consisting of a token name and an
optional token value
 A pattern is a description of the form that the lexemes
of a token may take
 A lexeme is a sequence of characters in the source
program that matches the pattern for a token

Example
Token       Informal description                    Sample lexemes
if          Characters i, f                         if
else        Characters e, l, s, e                   else
comparison  < or > or <= or >= or == or !=          <=, !=
id          Letter followed by letters and digits   pi, score, D2
number      Any numeric constant                    3.14159, 0, 6.02e23
literal     Anything but ", surrounded by "s        "core dumped"
if (a >= b)
    printf("total=%d\n", a);
Lexemes: if ( a >= b ) printf ( "total=%d\n" , a ) ;
Tokens: IF LP id relop id RP id LP literal COMMA id RP SEMI
Attributes for tokens
 E = M * C ** 2
 <id, pointer to symbol table entry for E>
 <assign-op>
 <id, pointer to symbol table entry for M>
 <mult-op>
 <id, pointer to symbol table entry for C>
 <exp-op>
 <number, integer value 2>
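
The token stream above can be sketched as a small C data structure. This is only an illustration: the type and constant names are hypothetical, and a symbol-table index stands in for the "pointer to symbol table entry", assuming E, M and C occupy slots 0, 1 and 2.

```c
/* Hypothetical token names, chosen to mirror the list above. */
typedef enum { ID, ASSIGN_OP, MULT_OP, EXP_OP, NUMBER } TokenName;

typedef struct {
    TokenName name;
    int has_attr;  /* 0 when the token carries no attribute value */
    int attr;      /* symbol-table index for ID, integer value for NUMBER */
} Token;

/* Tokens for  E = M * C ** 2  (assumed symbol-table slots: E=0, M=1, C=2). */
static const Token example_tokens[] = {
    { ID,        1, 0 },  /* <id, entry for E>        */
    { ASSIGN_OP, 0, 0 },  /* <assign-op>              */
    { ID,        1, 1 },  /* <id, entry for M>        */
    { MULT_OP,   0, 0 },  /* <mult-op>                */
    { ID,        1, 2 },  /* <id, entry for C>        */
    { EXP_OP,    0, 0 },  /* <exp-op>                 */
    { NUMBER,    1, 2 },  /* <number, integer value 2> */
};

enum { EXAMPLE_TOKEN_COUNT = sizeof example_tokens / sizeof example_tokens[0] };
```

Operators need no attribute because the token name alone identifies them; only id and number carry extra information.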

Lexical errors
 Some errors are beyond the power of the lexical analyzer
to recognize:
 fi (a == f(x)) …
(fi is a valid identifier, so only the parser can tell that if was misspelled)
 However, it can recognize errors such as:
 d = 2r
 Such errors are detected when no token pattern
matches a prefix of the remaining input

Error recovery
 Panic mode: successive characters are deleted from the
remaining input until a well-formed token is found
 Delete one character from the remaining input
 Insert a missing character into the remaining input
 Replace a character by another character
 Transpose two adjacent characters
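
Panic mode, the first strategy above, might look like the following C sketch. The set of characters that may begin a token is an assumption made for this example (letters, digits, underscore and a handful of operators), and the function names are hypothetical.

```c
#include <ctype.h>
#include <string.h>

/* Assumption: a token may begin with a letter, digit, '_' or one of these symbols. */
static int can_start_token(char c)
{
    if (c == '\0')
        return 0;
    return isalnum((unsigned char)c) || c == '_' || strchr("<>=!+-*/();", c) != NULL;
}

/* Panic mode: skip successive characters until one can begin a token.
   Returns the position at which scanning should resume. */
const char *panic_mode_skip(const char *p)
{
    while (*p != '\0' && !can_start_token(*p))
        p++;
    return p;
}
```

For instance, on the input `#~^x = 1` the function skips `#~^` and returns a pointer to `x`.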

Input buffering
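 Sometimes the lexical analyzer must look ahead several
symbols before it can decide which token to return
 In C: after reading -, = or <, we must examine the next
character to decide which token to return (e.g. - vs. ->,
= vs. ==, < vs. <=)
 In Fortran: DO 5 I = 1.25 is an assignment to the variable
DO5I, whereas DO 5 I = 1,25 begins a DO loop; the two
cannot be told apart until the . or , is read
 We need a two-buffer scheme to handle
large look-aheads safely

[Figure: buffer pair holding  E = M * C * * 2 eof]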

Sentinels
[Figure: buffer pair with a sentinel eof at the end of each buffer:  E = M eof * C * * 2 eof eof]

switch (*forward++) {
    case eof:
        if (forward is at end of first buffer) {
            reload second buffer;
            forward = beginning of second buffer;
        }
        else if (forward is at end of second buffer) {
            reload first buffer;
            forward = beginning of first buffer;
        }
        else /* eof within a buffer marks the end of input */
            terminate lexical analysis;
        break;
    cases for the other characters;
}
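
The pseudocode above can be turned into a small runnable C sketch. Several things here are assumptions made for illustration: the input comes from an in-memory string rather than a file, the sentinel is the NUL character (so the input must not contain NUL), the buffer halves are deliberately tiny, and the names Scanner, scanner_init, reload and next_char are hypothetical.

```c
#include <stddef.h>
#include <string.h>

#define BUF_SIZE 4      /* tiny on purpose, so that reloads actually happen */
#define SENTINEL '\0'   /* plays the role of eof in the pseudocode */

typedef struct {
    const char *src;               /* input not yet copied into a buffer */
    char buf[2 * (BUF_SIZE + 1)];  /* two halves, each with one sentinel slot */
    char *forward;                 /* next character to read */
} Scanner;

/* Copy up to BUF_SIZE characters into one half. A sentinel always sits in the
   half's last slot; if the input ran out, another follows the last character. */
static void reload(Scanner *s, char *half)
{
    size_t n = strlen(s->src);
    if (n > BUF_SIZE)
        n = BUF_SIZE;
    memcpy(half, s->src, n);
    half[n] = SENTINEL;
    half[BUF_SIZE] = SENTINEL;
    s->src += n;
}

void scanner_init(Scanner *s, const char *input)
{
    s->src = input;
    reload(s, s->buf);
    s->forward = s->buf;
}

/* Returns the next input character, or -1 once the input is exhausted. */
int next_char(Scanner *s)
{
    char *first = s->buf;
    char *second = s->buf + BUF_SIZE + 1;

    for (;;) {
        char c = *s->forward++;
        if (c != SENTINEL)
            return (unsigned char)c;       /* common case: a single test */
        if (s->forward == second) {        /* sentinel at end of first half */
            reload(s, second);
            s->forward = second;
        } else if (s->forward == second + BUF_SIZE + 1) {  /* end of second half */
            reload(s, first);
            s->forward = first;
        } else {                           /* sentinel inside a half: end of input */
            s->forward--;                  /* stay put so later calls also return -1 */
            return -1;
        }
    }
}
```

The point of the sentinel is visible in next_char: ordinary characters cost one comparison, and the buffer-boundary checks run only when a sentinel is read. This sketch does not handle retracting forward across a buffer boundary.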
Specification of tokens
 Strings and Languages
 Operations on Languages

Regular Expressions
 In compiler theory, regular expressions are used
to formalize the specification of tokens
 Regular expressions are a means of specifying regular
languages
 Example:
 letter_ (letter_ | digit)*
 Each regular expression is a pattern specifying the
form of strings

Regular Expressions
 ε is a regular expression, L(ε) = {ε}
 If a is a symbol in Σ, then a is a regular expression, L(a)
= {a}
 If r and s are regular expressions, then:
 (r) | (s) is a regular expression denoting the language
L(r) ∪ L(s)
 (r)(s) is a regular expression denoting the language
L(r)L(s)
 (r)* is a regular expression denoting (L(r))*
 (r) is a regular expression denoting L(r)
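
As a small worked illustration of these rules, assume the alphabet Σ = {a, b} (an assumption made for this example, not taken from the slides). The rules then give:

```latex
\begin{align*}
L(a \mid b) &= \{a, b\} \\
L((a \mid b)(a \mid b)) &= \{aa, ab, ba, bb\} \\
L(a^*) &= \{\varepsilon, a, aa, aaa, \dots\} \\
L((a \mid b)^*) &= \text{all strings of } a\text{'s and } b\text{'s, including } \varepsilon
\end{align*}
```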

Algebraic laws for regular expressions

Regular Definitions
d1 -> r1
d2 -> r2
...
dn -> rn

Example:
C Identifiers
letter_ -> A | B | … | Z | a | b | … | z | _
digit -> 0 | 1 | … | 9
id -> letter_ (letter_ | digit)*
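
A hand-written recognizer for this definition could look like the C sketch below. The function names are hypothetical, and the code assumes the default "C" locale, in which isalpha accepts exactly A-Z and a-z.

```c
#include <ctype.h>

/* letter_ -> A | ... | Z | a | ... | z | _ */
static int is_letter_(char c)
{
    return isalpha((unsigned char)c) || c == '_';
}

/* Returns 1 if the whole string s matches  id -> letter_ (letter_ | digit)* . */
int is_identifier(const char *s)
{
    if (!is_letter_(*s))
        return 0;                 /* must start with letter_ (also rejects "") */
    for (s++; *s != '\0'; s++) {
        if (!is_letter_(*s) && !isdigit((unsigned char)*s))
            return 0;             /* every later character is letter_ or digit */
    }
    return 1;
}
```

With this sketch, `D2` and `_tmp` are accepted, while `2r` is rejected because it starts with a digit.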

Example:
Unsigned numbers (integer or floating point)
Strings such as 5280, 0.01234, 6.336E4, or 1.89E-4

Extensions
 One or more instances: (r)+
 Zero or one instances: r?
 Character classes: [abc]

 Example: C Identifiers
letter_ -> [A-Za-z_]
digit -> [0-9]
id -> letter_ (letter_ | digit)*

Unsigned numbers
digit -> [0-9]
digits -> digit+
number -> digits (. digits)? ( E [+-]? digits )?
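
The number definition can be followed almost literally in code. The C sketch below returns the length of the longest prefix that matches; the names scan_digits and match_number are hypothetical, and only an uppercase E is accepted as the exponent marker, as in the definition above.

```c
#include <ctype.h>
#include <stddef.h>

/* digits -> digit+ : returns how many leading digits s has (0 if none). */
static size_t scan_digits(const char *s)
{
    size_t n = 0;
    while (isdigit((unsigned char)s[n]))
        n++;
    return n;
}

/* number -> digits (. digits)? (E [+-]? digits)?
   Returns the length of the longest matching prefix of s, or 0 if none. */
size_t match_number(const char *s)
{
    size_t n;
    size_t i = scan_digits(s);
    if (i == 0)
        return 0;                          /* a number must start with digits */

    if (s[i] == '.' && (n = scan_digits(s + i + 1)) > 0)
        i += 1 + n;                        /* optional fraction: . digits */

    if (s[i] == 'E') {                     /* optional exponent: E [+-]? digits */
        size_t j = i + 1;
        if (s[j] == '+' || s[j] == '-')
            j++;
        if ((n = scan_digits(s + j)) > 0)
            i = j + n;                     /* accept only if digits follow */
    }
    return i;
}
```

An optional part is consumed only when it is complete, so on `1.E5` the match is just `1`: the `.` is not followed by digits and is left for the next token.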
Recognition of tokens
 Starting point is the language grammar to understand
the tokens:
stmt -> if expr then stmt
| if expr then stmt else stmt

expr -> term relop term
| term
term -> id
| number

Recognition of tokens (cont.)
 The next step is to formalize the patterns:
digit -> [0-9]
digits -> digit+
number -> digits (. digits)? (E [+-]? digits)?
letter -> [A-Za-z_]
id -> letter (letter | digit)*
if -> if
then -> then
else -> else
relop -> < | > | <= | >= | = | <>
 We also need to handle whitespace:
ws -> (blank | tab | newline)+

Transition diagrams
 Transition diagram for relop

Transition diagrams (cont.)
 Transition diagram for reserved words and identifiers

letter (letter | digit)*

Transition diagrams (cont.)
 Transition diagram for unsigned numbers

digits (. digits)? (E [+-]? digits)?

Transition diagrams (cont.)
 Transition diagram for whitespace

(blank | tab | newline)+

Architecture of a transition-diagram-based lexical analyzer
TOKEN getRelop()
{
    TOKEN retToken = new(RELOP);
    while (1) { /* repeat character processing until a
                   return or failure occurs */
        switch (state) {
            case 0: c = nextchar();
                    if (c == '<') state = 1;
                    else if (c == '=') state = 5;
                    else if (c == '>') state = 6;
                    else fail(); /* lexeme is not a relop */
                    break;
            case 1: …

            case 8: retract();
                    retToken.attribute = GT;
                    return(retToken);
        }
    }
}
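
The fragment above leaves states 1 through 7 elided. One self-contained way to complete the idea is sketched below in C, for the pattern relop -> < | > | <= | >= | = | <>. The numbering of states 2, 3, 4 and 7, the RelopAttr type, and the use of a string cursor in place of nextchar() and retract() are all assumptions made for this example, not the slide's own code.

```c
#include <stddef.h>

typedef enum { LT, LE, EQ, NE, GT, GE, NOT_RELOP } RelopAttr;

/* Tries to scan a relop starting at *pp. On success, advances *pp past the
   lexeme and returns its attribute; on failure, leaves *pp unchanged. */
RelopAttr get_relop(const char **pp)
{
    const char *p = *pp;   /* plays the role of the forward pointer */
    int state = 0;
    char c;

    for (;;) {
        switch (state) {
        case 0:
            c = *p++;
            if (c == '<') state = 1;
            else if (c == '=') state = 5;
            else if (c == '>') state = 6;
            else return NOT_RELOP;           /* fail(): lexeme is not a relop */
            break;
        case 1:                              /* seen '<' */
            c = *p++;
            if (c == '=') state = 2;
            else if (c == '>') state = 3;
            else state = 4;
            break;
        case 2: *pp = p;     return LE;
        case 3: *pp = p;     return NE;
        case 4: *pp = p - 1; return LT;      /* retract one character */
        case 5: *pp = p;     return EQ;
        case 6:                              /* seen '>' */
            c = *p++;
            state = (c == '=') ? 7 : 8;
            break;
        case 7: *pp = p;     return GE;
        case 8: *pp = p - 1; return GT;      /* retract one character */
        default: return NOT_RELOP;
        }
    }
}
```

States 4 and 8 are the ones that retract: they are reached only after reading a character that is not part of the relop, so that character must be given back to the input.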