Unit 1-REGULAR LANGUAGES

The document discusses lexical analysis, detailing the roles of lexical analyzers and their interaction with parsers. It covers concepts such as tokens, lexemes, patterns, and attributes, as well as methods for error recovery and input buffering techniques. Additionally, it introduces regular expressions, their algebraic properties, and the recognition of tokens through grammar and transition diagrams.


Module 2

Chapter 3

LEXICAL ANALYSIS
• Role of lexical analyzer / interaction between lexer & parser
• Some additional tasks: eliminating comments, blanks, tabs and newline characters; providing line numbers associated with error messages; and making a copy of the source program with error messages.
• Token: a token is a pair consisting of a token name and an optional attribute value.
• Pattern: a pattern is a description of the form that the lexemes of a token may take.
• Lexeme: a lexeme is a sequence of characters in the source program that matches the pattern for a token and is identified by the lexical analyzer as an instance of that token.
In many programming languages, the following classes cover most or all of the tokens:
1. One token for each keyword. The pattern for a keyword is the same as the keyword itself.
2. Tokens for the operators, either individually or in classes such as the token comparison.
3. One token representing all identifiers.
4. One or more tokens representing constants, such as numbers and literals.
5. Tokens for each punctuation symbol, such as left and right parentheses, comma, and semicolon.
Attributes for tokens
The tokens of the statement E = M * C ** 2 are written below as a sequence of <token-name, attribute-value> pairs.
<id, pointer to symbol-table entry for E>
<assign_op>
<id, pointer to symbol-table entry for M>
<mult_op>
<id, pointer to symbol-table entry for C>
<exp_op>
<number, integer value 2>
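The pair representation above can be sketched in a few lines of Python. This is a hedged illustration, not the book's implementation: the TOKEN_SPEC table, the tokenize function, and the use of an insertion-order index as the symbol-table "pointer" are all assumptions made for the sketch.

```python
import re

# Token classes for the E = M * C ** 2 example; ** must precede * in the
# alternation so the longer operator wins.
TOKEN_SPEC = [
    ("number",    r"\d+"),
    ("id",        r"[A-Za-z_]\w*"),
    ("exp_op",    r"\*\*"),
    ("mult_op",   r"\*"),
    ("assign_op", r"="),
    ("ws",        r"\s+"),
]
MASTER = re.compile("|".join(f"(?P<{n}>{p})" for n, p in TOKEN_SPEC))

def tokenize(src):
    symtab = {}                    # symbol table: lexeme -> entry index
    pairs = []
    for m in MASTER.finditer(src):
        kind, lexeme = m.lastgroup, m.group()
        if kind == "ws":
            continue               # whitespace is stripped, not tokenized
        if kind == "id":
            entry = symtab.setdefault(lexeme, len(symtab))
            pairs.append(("id", entry))        # attribute: symbol-table entry
        elif kind == "number":
            pairs.append(("number", int(lexeme)))  # attribute: integer value
        else:
            pairs.append((kind, None))         # operators need no attribute
    return pairs

print(tokenize("E = M * C ** 2"))
```

Running the sketch reproduces the sequence above: three id tokens pointing at distinct symbol-table entries, the three operators, and the number 2 with its integer value as attribute.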
Lexical errors
Ex: fi ( a == f(x) ) …
Here fi could be a misspelling of the keyword if, or an undeclared function identifier; the lexical analyzer alone cannot tell which.
"Panic mode" recovery: delete successive characters from the remaining input until a well-formed token is found.
Other recovery methods are:
1. Delete one character from the remaining input.
2. Insert a missing character into the remaining input.
3. Replace a character by another character.
4. Transpose two adjacent characters.
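The four single-edit repair actions listed above can be sketched as a candidate generator, checking whether any one edit turns a bad lexeme into a keyword. The function name one_edit_candidates and the tiny keyword set are assumptions for the sketch.

```python
# Generate every string reachable from `lexeme` by one of the four repair
# actions: delete, insert, replace, transpose.
def one_edit_candidates(lexeme, alphabet="abcdefghijklmnopqrstuvwxyz"):
    cands = set()
    for i in range(len(lexeme)):
        cands.add(lexeme[:i] + lexeme[i+1:])              # 1. delete a character
        for ch in alphabet:
            cands.add(lexeme[:i] + ch + lexeme[i+1:])     # 3. replace a character
    for i in range(len(lexeme) + 1):
        for ch in alphabet:
            cands.add(lexeme[:i] + ch + lexeme[i:])       # 2. insert a character
    for i in range(len(lexeme) - 1):
        cands.add(lexeme[:i] + lexeme[i+1] + lexeme[i] + lexeme[i+2:])  # 4. transpose
    return cands

KEYWORDS = {"if", "then", "else"}
print(KEYWORDS & one_edit_candidates("fi"))   # transposing "fi" yields "if"
```

For the fi example above, the only keyword within one edit is if, found by transposing the two adjacent characters.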
INPUT BUFFERING
• To speed up reading of the source program.
• Need to look at least one additional character ahead. Ex: = is part of ==, < is part of <=.
• Two-buffer scheme is introduced:
  • to handle large lookaheads safely.
  • to reduce the amount of overhead required to process a single input character.
• Each buffer is of the same size N (the size of a disk block, e.g., 4096 bytes).
• If there are fewer than N characters in the input file, a special character "eof" marks the end of the source file.
Buffer Pairs: two pointers are used.
• lexemeBegin: marks the beginning of the current lexeme.
• forward: scans ahead until a pattern match is found.
Sentinels: a special character that cannot be part of the source program, e.g., "eof".
Without sentinels, for each character read we make two tests: one for end of buffer, and one for the actual character read. The sentinel combines them: we test only the character read, and check for end of buffer just in the case where it is "eof".
Lookahead code with sentinels
switch ( *forward++ ) {
case eof:
    if ( forward is at end of first buffer ) {
        reload second buffer;
        forward = beginning of second buffer;
    }
    else if ( forward is at end of second buffer ) {
        reload first buffer;
        forward = beginning of first buffer;
    }
    else /* eof within a buffer marks the end of input */
        terminate lexical analysis;
    break;
/* cases for the other characters */
}
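The pseudocode above can be simulated in Python. This is a sketch under stated assumptions: the TwoBuffers class name, the NUL character as the "eof" sentinel, and the deliberately tiny N = 4 (to force reloads) are not from the source.

```python
import io

EOF = "\0"   # sentinel: a character assumed never to appear in the source
N = 4        # buffer size (4096 in practice; tiny here to exercise reloads)

class TwoBuffers:
    """Buffer-pair sketch: halves [0..N] and [N+1..2N+1], each sentinel-terminated."""
    def __init__(self, stream):
        self.stream = stream
        self.buf = [EOF] * (2 * (N + 1))
        self.forward = 0
        self._reload(0)

    def _reload(self, half):
        """Fill one half with up to N characters, then place the sentinel."""
        start = half * (N + 1)
        data = self.stream.read(N)
        for i, ch in enumerate(data):
            self.buf[start + i] = ch
        self.buf[start + len(data)] = EOF

    def next_char(self):
        c = self.buf[self.forward]
        self.forward += 1
        if c == EOF:                           # only one test in the common case
            if self.forward - 1 == N:          # sentinel ending the first half
                self._reload(1)
                self.forward = N + 1
                return self.next_char()
            if self.forward - 1 == 2 * N + 1:  # sentinel ending the second half
                self._reload(0)
                self.forward = 0
                return self.next_char()
            return None                        # eof inside a half: real end of input
        return c

tb = TwoBuffers(io.StringIO("abcdefghij"))
out = []
while (ch := tb.next_char()) is not None:
    out.append(ch)
print("".join(out))   # abcdefghij
```

Note that the end-of-buffer test runs only when the sentinel is seen, matching the point of the scheme: ordinary characters cost a single comparison.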
SPECIFICATION OF TOKENS
Alphabet: finite set of symbols, ex: {0, 1}.
String: finite sequence of symbols drawn from that alphabet, ex: 000, 0011, …
Length of a string s, written |s|: the number of occurrences of symbols in s.
Empty string: denoted by ε.
Prefix, suffix, substring, proper prefix, subsequence of a string (a subsequence is obtained by deleting zero or more not necessarily consecutive positions).
Language: set of strings over some fixed alphabet.
Operations on languages
Ex: L = {A, B, …, Z, a, b, …, z}, D = {0, 1, …, 9}
L ∪ D = 62 strings of length 1 (all letters and digits)
LD = 520 strings of length 2 (52 × 10 = 520)
D2 = set of all strings of two digits
L* = set of all strings of letters, including the empty string ε
L+ = set of all strings of one or more letters
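The finite set sizes quoted above are easy to verify directly; a minimal check, with L as the 52 letters and D as the 10 digits:

```python
import string

L = set(string.ascii_letters)          # the 52 letters
D = set(string.digits)                 # the 10 digits

union = L | D                          # L ∪ D: single-character strings
LD = {l + d for l in L for d in D}     # concatenation: a letter then a digit
D2 = {a + b for a in D for b in D}     # D^2: all strings of two digits

print(len(union), len(LD), len(D2))    # 62 520 100
```

L* and L+ are infinite, so there is nothing to count there; they differ only in whether ε is included.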
Regular Expressions
• ε is a regular expression denoting {ε}, the set containing the empty string.
• a is a regular expression denoting {a}, for each symbol a in the alphabet.
• Suppose r and s are regular expressions denoting the languages L(r) and L(s); then,
  • (r|s) is a regular expression denoting L(r) ∪ L(s)
  • rs is a regular expression denoting L(r)L(s)
  • (r)* is a regular expression denoting (L(r))*
  • (r) is a regular expression denoting L(r)

Regular expression Example strings
a|b {a, b}
(a|b)(a|b) {aa, ab, ba, bb}
a* {ε, a, aa, aaa,…}
a|a*b {a, b, ab, aab, aaab,…}

Algebraic properties (laws) of regular expressions:
r|s = s|r (| is commutative)
r|(s|t) = (r|s)|t and r(st) = (rs)t (| and concatenation are associative)
r(s|t) = rs|rt and (s|t)r = sr|tr (concatenation distributes over |)
εr = rε = r (ε is the identity for concatenation)
r* = (r|ε)* (ε is guaranteed in a closure)
r** = r* (* is idempotent)
Regular definition
The process of giving names to certain regular expressions and using those names in subsequent expressions.
Ex: 0|1|2|…|9 can be written as D → 0|1|2|…|9, after which D may be used wherever the digits are needed.
Extensions of regular expressions
Zero or more: *
One or more: +
Zero or one: ?
Character class: [ ]
[abc] denotes the regular expression a|b|c
[a-z] denotes the regular expression a|b|..|z
1. Write regular definition for C identifier using extensions.
Letter_ → [a-zA-Z_]
Digit → [0-9]
Id → Letter_(Letter_|Digit)*

2. Write regular definition for unsigned number using extensions.
Digit → [0-9]
Digits → Digit+
Number → Digits(. Digits)?(E[+-]?Digits)?
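Expanding the names away, the two regular definitions above become plain regular expressions, which can be checked directly. A sketch using Python's re syntax (the variable names ID and NUMBER are assumptions):

```python
import re

# Id = Letter_(Letter_|Digit)* and Number = Digits(.Digits)?(E[+-]?Digits)?
# with the shorthand names substituted by their definitions.
ID     = re.compile(r"[a-zA-Z_][a-zA-Z_0-9]*")
NUMBER = re.compile(r"[0-9]+(\.[0-9]+)?(E[+-]?[0-9]+)?")

assert ID.fullmatch("_count1")
assert not ID.fullmatch("1count")          # identifiers cannot start with a digit
assert NUMBER.fullmatch("6336")            # the fraction and exponent are optional
assert NUMBER.fullmatch("1.89E-4")
print("definitions behave as expected")
```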
RECOGNITION OF TOKENS
• Consider the following grammar/regular definitions: a grammar for branching statements.
stmt → if expr then stmt
    | if expr then stmt else stmt
expr → term relop term | term
term → id | num
if → if
then → then
else → else
relop → < | <= | > | >= | = | <>
id → letter(letter|digit)*
num → digit+(.digit+)?(E[+-]?digit+)?
digit → [0-9]
letter → [a-zA-Z]
The regular definition for blank (white space) is,
delim → blank | tab | newline
ws → delim+
Transition Diagrams
• A collection of nodes or circles, called states.
• Edges are directed from one state of the transition diagram to another.
• Each edge is labeled by a symbol.
• Certain states are said to be accepting, or final.
• One state is designated the start state, or initial state.
• Retracting the forward pointer one position is denoted by *.
Ex: Write the transition diagram for relational operators.
Recognition of Reserved Words and Identifiers
There are two ways to handle reserved words that look like identifiers:
1) Install the reserved words in the symbol table initially.
 • installID( ): places the lexeme in the symbol table if it is not already there.
 • getToken( ): examines the symbol table entry for the lexeme found.
2) Create separate transition diagrams for each keyword.
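Method 1 above can be sketched in a few lines: with the keywords preinstalled, installID and getToken (names taken from the slide) need no special cases for reserved words. The dict-based symbol table and the three-keyword set are assumptions for the sketch.

```python
KEYWORDS = ["if", "then", "else"]

# Symbol table: lexeme -> (token name, entry index). Keywords go in first,
# each with a token name equal to the keyword itself.
symtab = {}
for kw in KEYWORDS:
    symtab[kw] = (kw, len(symtab))

def installID(lexeme):
    """Place the lexeme in the symbol table if it is not already there."""
    if lexeme not in symtab:
        symtab[lexeme] = ("id", len(symtab))
    return symtab[lexeme][1]

def getToken(lexeme):
    """Examine the symbol table entry for the lexeme found."""
    return symtab[lexeme][0]

installID("count")
print(getToken("if"), getToken("count"))   # if id
```

Because "if" was preinstalled with token name if, the identifier code path returns the keyword token automatically; only genuinely new lexemes get the id token.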
Transition diagram for white space:

Transition diagram for unsigned number:


Implementation of relop T.D
TOKEN getRelop( ) {
    TOKEN retToken = new(RELOP);
    while (1) { /* repeat character processing until a return or failure occurs */
        switch (state) {
        case 0:
            c = nextChar( );
            if ( c == '<' ) state = 1;
            else if ( c == '=' ) state = 5;
            else if ( c == '>' ) state = 6;
            else fail(); /* lexeme is not a relop */
            break;
        case 1: ...
        case 2: ...
        ...
        case 8:
            retract();
            retToken.attribute = GT;
            return(retToken);
        }
    }
}
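The elided cases of the pseudocode above can be filled in as a runnable Python sketch of the relop transition diagram. The state numbering follows the textbook diagram, but the function name get_relop, the (token, chars_consumed) return shape, and the NUL end-of-input marker are assumptions.

```python
def get_relop(s):
    """Return (relop token, characters consumed) if s starts with a relop, else None."""
    state, i = 0, 0

    def next_char():
        nonlocal i
        c = s[i] if i < len(s) else "\0"   # NUL stands in for end of input
        i += 1
        return c

    while True:
        if state == 0:
            c = next_char()
            if c == "<": state = 1
            elif c == "=": return ("EQ", i)
            elif c == ">": state = 6
            else: return None              # lexeme is not a relop
        elif state == 1:                   # saw '<'
            c = next_char()
            if c == "=": return ("LE", i)
            elif c == ">": return ("NE", i)
            else: return ("LT", i - 1)     # retract: '<' alone (the * state)
        elif state == 6:                   # saw '>'
            c = next_char()
            if c == "=": return ("GE", i)
            else: return ("GT", i - 1)     # retract: '>' alone (case 8 above)

print(get_relop("<= b"))   # ('LE', 2)
print(get_relop("> b"))    # ('GT', 1)
```

The two retracting returns correspond to the starred accepting states of the diagram: one lookahead character was consumed to decide, so the count is backed up by one.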
End of Module
