unit1

The document provides an overview of lexical analysis in compiler design, detailing the process of converting a stream of characters into tokens while removing whitespace and comments. It explains the functions of a lexical analyzer, the specification of tokens, and the use of regular expressions and finite automata for language recognition. Additionally, it discusses token attributes, error detection, and recovery strategies in lexical analysis.

Uploaded by

nirajdhanore04
Copyright
© All Rights Reserved

COMPILER DESIGN

(CST327)
LEXICAL ANALYSIS

Unit-1
OVERVIEW
Lexical Analysis and Tokens
• Process:
  ◦ The stream of characters is read from left to right and grouped into tokens; whitespace and comments are discarded.
  ◦ The source code is scanned one character at a time.
  ◦ When the scanner encounters whitespace, an operator symbol or a special symbol, it decides that a word has been completed.
  ◦ The Longest Match Rule is followed and rule priority is applied (a reserved word gets higher priority than an identifier).
• Output: a token stream.
• An error is generated:
  ◦ when an invalid token is found;
  ◦ when the name of an identifier matches an existing reserved word.
Lexical Analysis and Tokens
• The functions of the lexical analyzer:
  ◦ Strip comments, whitespace and newline characters from the source program.
  ◦ Tokenize the source program.
  ◦ Pass the tokens to the syntax analyzer.
  ◦ Store information in the symbol table.
  ◦ Invoke the error handler when required.


Specification of Tokens
• An alphabet is a finite set of symbols (Σ).
• A string over some alphabet is a finite sequence of symbols from that alphabet.
• A language is a set of strings over some fixed alphabet. It may contain a finite or an infinite number of strings.
Example,
• The binary alphabet is {0, 1}.
• 01100010 is a string of length 8 over this alphabet.
Language Recognition
 To recognize a language, automata are constructed.
 Finite automata accept the regular language
specified by regular expressions.
Tokens, Patterns and Lexemes
• The set of input strings for which the same token is produced as output is described by a rule called a pattern associated with the token.
• The pattern is said to match each string in the set.
• A lexeme is a sequence of characters in the source program that is matched by the pattern for a token.
• A token names a group of character strings that logically belong together.
Example,
int a = 12;
int is a lexeme for the token ‘keyword’, <keyword, int>
a is a lexeme for the token ‘identifier’, <id, 100>
= is a lexeme for the token ‘assignment operator’, <assign_op, =>
12 is a lexeme for the token ‘integer constant’, <constant, 101>
; is a lexeme for the token ‘special symbol’, <symbol, ;>
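The grouping of int a = 12; into token and lexeme pairs can be sketched with a small pattern-driven scanner. This is only an illustration under assumptions: the KEYWORDS set is a small hypothetical subset of C's reserved words, the token names are borrowed from the example, and each pair carries the lexeme itself rather than a symbol table pointer such as 100.

```python
import re

# Hypothetical subset of C reserved words, for illustration only.
KEYWORDS = {"int", "float", "char", "if", "else", "while", "return"}

# One pattern per token; names follow the example above.
TOKEN_SPEC = [
    ("constant",  r"\d+"),
    ("id",        r"[A-Za-z_][A-Za-z0-9_]*"),
    ("assign_op", r"="),
    ("symbol",    r";"),
    ("skip",      r"\s+"),          # whitespace is discarded
]
MASTER = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in TOKEN_SPEC))


def tokenize(source):
    """Return a list of (token, lexeme) pairs for the given source text."""
    tokens = []
    pos = 0
    while pos < len(source):
        match = MASTER.match(source, pos)
        if match is None:
            raise ValueError(f"invalid character {source[pos]!r} at position {pos}")
        kind, lexeme = match.lastgroup, match.group()
        pos = match.end()
        if kind == "skip":
            continue
        # Rule priority: a reserved word wins over the identifier pattern.
        if kind == "id" and lexeme in KEYWORDS:
            kind = "keyword"
        tokens.append((kind, lexeme))
    return tokens


print(tokenize("int a = 12;"))
# [('keyword', 'int'), ('id', 'a'), ('assign_op', '='), ('constant', '12'), ('symbol', ';')]
```

Checking the reserved-word set after the identifier pattern has matched is one common way to apply rule priority; listing each keyword as its own pattern ahead of the identifier pattern would be another.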
Tokens- Example 1
Source:
int i;
printf("Number is %d & incremented number is %d", i , ++i);

Lexeme                                        Token Type
int                                           Keyword
i                                             Identifier
;                                             Symbol
printf                                        Identifier / C function
(                                             Symbol
"Number is %d & incremented number is %d"     String
,                                             Symbol
i                                             Identifier
++                                            Operator
i                                             Identifier
)                                             Symbol
;                                             Symbol
Tokens- Example 2
Source:
int a, b, sum;
printf("\n Enter two numbers:");
scanf("%d %d",&a,&b);
sum=a+b;
printf("sum: %d", sum);

Lexeme                        Token
int                           Keyword
a                             Identifier
b                             Identifier
sum                           Identifier
;                             Symbol
printf                        Identifier / C function
(                             Symbol
)                             Symbol
scanf                         Identifier / C function
"\n Enter two numbers:"       String
"%d %d"                       String
&                             Symbol
=                             Operator
+                             Operator
"sum: %d"                     String
,                             Symbol
Attributes for Tokens
• More than one lexeme can match a pattern.
• The lexical analyzer must therefore provide additional information about the particular lexeme that was matched, so that subsequent phases of the compiler can tell which lexeme was actually seen.
For example, the pattern for a digit is [0-9].
Every digit from 0 to 9 matches the pattern for digit, yet it is essential for the code generator to know which digit was actually matched.
• Information about tokens and their associated attributes is stored in the symbol table.
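One way to picture token attributes is a scanner that returns a token name together with a pointer into a symbol table. The sketch below is an assumption-laden illustration: the Token class, the scan_with_attributes function and the starting slot number 100 are all hypothetical choices, and a plain dictionary stands in for the symbol table.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass
class Token:
    name: str                        # token name, e.g. "id" or "constant"
    attribute: Optional[int] = None  # symbol table slot, if the token has one


def scan_with_attributes(source, first_slot=100):
    """Return (tokens, symbol_table) for a whitespace-tolerant toy input."""
    symbol_table = {}   # slot number -> lexeme
    slot_of = {}        # lexeme -> slot number, so repeats share one entry
    tokens = []
    for lexeme in re.findall(r"[A-Za-z_]\w*|\d+|\S", source):
        if lexeme[0].isalpha() or lexeme[0] == "_":
            name = "id"
        elif lexeme.isdigit():
            name = "constant"
        else:
            tokens.append(Token(lexeme))   # operators and symbols: no attribute
            continue
        if lexeme not in slot_of:
            slot = first_slot + len(slot_of)
            slot_of[lexeme] = slot
            symbol_table[slot] = lexeme
        tokens.append(Token(name, slot_of[lexeme]))
    return tokens, symbol_table


tokens, table = scan_with_attributes("x = x + 7 ;")
print([(t.name, t.attribute) for t in tokens])
print(table)
```

Both occurrences of x share one slot, while the constant 7 gets its own entry, so later phases can recover exactly which lexeme was matched.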
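Lexical Analyser and Symbol Table- Example 1
printf(" String % d" , ++ i ++ &&& i ** a);
Compute the tokens.

The tokens are

printf   (   " String % d"   ,   ++   i   ++   &&   &   i   *   *   a   )   ;

By the Longest Match Rule, &&& is split into && followed by &, and ** is split into two * tokens because C has no ** operator.
There are 15 tokens in total.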
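The count of 15 can be checked with a small longest-match scanner. This is a sketch, not a full C lexer: the OPERATORS list is a hypothetical subset covering only the symbols in this statement, and string literals are assumed to contain no escaped quotes.

```python
# Hypothetical operator/symbol subset, enough for the statement above.
OPERATORS = ["&&", "++", "&", "*", "+", "(", ")", ",", ";"]


def scan(source):
    """Split source into lexemes, always preferring the longest operator."""
    longest_first = sorted(OPERATORS, key=len, reverse=True)
    tokens = []
    i = 0
    while i < len(source):
        ch = source[i]
        if ch.isspace():
            i += 1
        elif ch == '"':
            # String literal: everything up to the closing quote is one token.
            end = source.index('"', i + 1)
            tokens.append(source[i:end + 1])
            i = end + 1
        elif ch.isalpha() or ch == "_":
            j = i
            while j < len(source) and (source[j].isalnum() or source[j] == "_"):
                j += 1
            tokens.append(source[i:j])
            i = j
        else:
            for op in longest_first:
                if source.startswith(op, i):
                    tokens.append(op)
                    i += len(op)
                    break
            else:
                raise ValueError(f"invalid character {ch!r} at position {i}")
    return tokens


lexemes = scan('printf(" String % d" , ++ i ++ &&& i ** a);')
print(len(lexemes))   # 15
print(lexemes[7:9])   # ['&&', '&']
```

Trying two-character operators before one-character ones is what makes &&& come out as && then &.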
Lexical Analyser and Symbol Table- Example 2
• Specify the <token, attribute> set for the C statement a = (b + c) * 2 ;

<id, 100>
<=, > OR <operator, = address>
<(, >
<id, 101>
<+, >
<id, 102>
<), >
<*, >
<constant, 103>
<;, >
• The number after the comma (,) is the pointer to the symbol table entry.
• The symbol table entry at memory location 100 stores the identifier ‘a’, 101 stores ‘b’ and 102 stores ‘c’.
• The constant value 2 is stored at location 103.
Regular Expression
• A regular expression is a notation for specifying patterns.
• Programming language tokens can be described by regular languages.
• Notations used to represent regular expressions:
Notation   Description                                Example
{ }+       One or more repetitions                    {a}+      a repeated one or more times
[ ]        Character class (any one listed symbol)    [abc]     matches a or b or c
-          Range                                      [a-z]     matches any character from a to z
?          Zero or one instance                       a?
( )        Grouping                                   (a | b)
^          Except (negated character class)           [^a]      matches any character other than a
RE Examples- Questions
 Regular expression for strings of 0’s and 1’s not
accepting empty string is_____
 Regular expression for string abc___
 Regular expression for string containing substring
ab and any letters before and after is _______
 Regular expression for string containing small letters
and capital letters separated by @ is _____
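Candidate answers to the four questions can be tried out with Python's re module. These are one possible reading of each question, not the official answers: the dictionary keys and the accepts helper are hypothetical names, "letters" is assumed to mean the English alphabet, and the last pattern assumes small letters come before the @ and capital letters after it.

```python
import re

# One candidate regular expression per question (assumed interpretations).
CANDIDATES = {
    "binary_nonempty": r"[01]+",                  # (0 | 1)+
    "abc":             r"abc",
    "contains_ab":     r"[A-Za-z]*ab[A-Za-z]*",   # letter* ab letter*
    "lower_at_upper":  r"[a-z]+@[A-Z]+",
}


def accepts(name, text):
    """True when the whole text matches the named candidate expression."""
    return re.fullmatch(CANDIDATES[name], text) is not None


print(accepts("binary_nonempty", "0110"))     # True
print(accepts("binary_nonempty", ""))         # False
print(accepts("contains_ab", "xxabyy"))       # True
print(accepts("lower_at_upper", "abc@XYZ"))   # True
```

re.fullmatch is used so that the expression must describe the entire string, which mirrors how a regular expression defines a language.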
Transition Diagram
• A transition diagram keeps track of information about the characters that are seen as the forward pointer scans the input.
• Positions in a transition diagram are drawn as circles and are called states.
• The states are connected by arrows called edges.
• A double circle indicates an accepting state, a state in which a token has been found.
• A * on an accepting state indicates that input retraction must take place.
Transition diagrams are the graphical representation of finite automata (FA).
Finite Automata
• A finite automaton (FA) is a recognizer for the language described by a regular expression.
• The mathematical model of a finite automaton consists of:
  ◦ Finite set of input symbols (Σ)
  ◦ Finite set of states (Q)
  ◦ One start state (q0)
  ◦ One or more final states (qf)
  ◦ Transition function (δ)
• The transition function (δ) maps a state and an input symbol to the next state, δ: Q × Σ → Q
Example: an FA that accepts strings starting with a and ending with b, with any number of b's in between.
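The five components of the model can be written down directly as data. The sketch below assumes the example language is "a followed by one or more b" (that is, ab*b); the state names q0, q1, q2 and the fa_accepts function are hypothetical, and a missing entry in DELTA is treated as an implicit dead state.

```python
SIGMA = {"a", "b"}            # input symbols
Q = {"q0", "q1", "q2"}        # states
START = "q0"                  # start state
FINAL = {"q2"}                # final (accepting) states
DELTA = {                     # transition function, Q x SIGMA -> Q
    ("q0", "a"): "q1",
    ("q1", "b"): "q2",
    ("q2", "b"): "q2",
}


def fa_accepts(text):
    """Run the automaton over text and report whether it ends in a final state."""
    state = START
    for symbol in text:
        if symbol not in SIGMA:
            return False
        state = DELTA.get((state, symbol))
        if state is None:         # no edge for this symbol: reject
            return False
    return state in FINAL


for word in ["ab", "abbb", "a", "ba", "abab"]:
    print(word, fa_accepts(word))
```

Only ab and abbb are accepted here; a stops in a non-final state, and ba and abab hit a missing transition.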
Transition Diagram- Example 1
Transition diagram for relational operators in Java
language
< , <=, ==, !=, > and >=
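A transition diagram for these operators can be coded by hand with an explicit forward pointer. This is a sketch under assumptions: match_relop and its (lexeme, next position) return value are hypothetical, and the retraction step corresponds to an accepting state reached after reading one character too many.

```python
def match_relop(source, begin=0):
    """Return (lexeme, next_position) for a relational operator, or None."""
    forward = begin

    def next_char():
        nonlocal forward
        ch = source[forward] if forward < len(source) else ""  # "" marks end of input
        forward += 1
        return ch

    first = next_char()
    if first not in ("<", ">", "=", "!"):
        return None
    second = next_char()
    if second == "=":
        return source[begin:forward], forward    # <=, >=, ==, !=
    if first in ("<", ">"):
        forward -= 1                             # retract: second char is not ours
        return source[begin:forward], forward    # < or >
    return None                                  # lone = or ! is not a relop


print(match_relop("<=5"))     # ('<=', 2)
print(match_relop("<5"))      # ('<', 1)
print(match_relop("a>b", 1))  # ('>', 2)
print(match_relop("=5"))      # None
```

The retraction leaves the forward pointer on the character that ended the operator, so the next token starts there.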
Transition Diagram- Example 2
Transition diagram for identifiers of C language
The regular expression for identifier uses the letter and digit regular expressions,
letter → a | A | b | B | c | C | ....
digit → 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
identifier → ( _ | letter) (letter | digit | _ )*
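The identifier definition maps onto a two-state scan: one step for the first character, then a loop over letters, digits and underscores. The sketch below is illustrative; match_identifier is a hypothetical name and letter is assumed to mean the ASCII letters only.

```python
import string

LETTERS = set(string.ascii_letters)   # letter → a | A | b | B | ...
DIGITS = set(string.digits)           # digit  → 0 | 1 | ... | 9


def match_identifier(source, begin=0):
    """Return the identifier starting at begin, or None if there is none."""
    forward = begin
    # Start state: the first character must be _ or a letter.
    if forward >= len(source) or not (source[forward] in LETTERS or source[forward] == "_"):
        return None
    forward += 1
    # Loop state: consume (letter | digit | _)*.
    while forward < len(source) and (
        source[forward] in LETTERS or source[forward] in DIGITS or source[forward] == "_"
    ):
        forward += 1
    # Any other character ends the identifier; it is left unread.
    return source[begin:forward]


print(match_identifier("_count1 = 0"))   # _count1
print(match_identifier("9lives"))        # None
```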


Transition Diagram- Example 3
HOMEWORK
Transition diagram for numeric constants
• Integers: Int → digit (digit)*
• Floating point constants: Fconst → digit+ ((.)(digit)+)?
• Exponential numbers: exp → digit+ (.digit+)? E ( '+' | '-' )? digit+
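The three definitions translate almost symbol for symbol into Python regular expressions, which makes it easy to test sample constants. This is a sketch: the classify helper is a hypothetical name, and only an uppercase E is accepted because that is what the exp definition states.

```python
import re

DIGIT = r"[0-9]"
INT = re.compile(rf"{DIGIT}{DIGIT}*")                       # digit (digit)*
FCONST = re.compile(rf"{DIGIT}+(\.{DIGIT}+)?")              # digit+ ((.)(digit)+)?
EXP = re.compile(rf"{DIGIT}+(\.{DIGIT}+)?E[+-]?{DIGIT}+")   # digit+ (.digit+)? E (+|-)? digit+


def classify(lexeme):
    """Return the names of all definitions that match the whole lexeme."""
    definitions = (("Int", INT), ("Fconst", FCONST), ("exp", EXP))
    return [name for name, pattern in definitions if pattern.fullmatch(lexeme)]


print(classify("42"))        # ['Int', 'Fconst']
print(classify("3.14"))      # ['Fconst']
print(classify("6.02E+23"))  # ['exp']
print(classify("3."))        # []
```

Because the fractional part of Fconst is optional, a plain integer such as 42 matches both Int and Fconst as written; a scanner would need rule priority to choose between them.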


Recognition of Tokens

transition diagram for relop


Example -1 (TRY)
Transition diagram for accepting the data needed by a company for the given employees
Consider a company that wants to accept data from its employees. The data is described as follows:

• Name of the employee- Can have only letters. The name can start with a capital letter.
• Employee id- It has the format EMP_<number>
• Address- The address must start with one or more digits. Then a separator comma is allowed, followed by letters.
• E-mail- Letters followed by @, followed by the mail provider (like gmail, yahoo), then . and the domain name (like com, org)
• Salary- A floating point number is allowed
The regular definition for a new language to be designed is as follows:
• sletter → [a-z]
• cletter → [A-Z]
• digit → [0-9]
• underscore → _
• provider → gmail | yahoo | rediffmail | hotmail
• domain → com | org | edu
• name → cletter sletter+
• empid → EMP underscore (digit)+
• address → digit+ , sletter+ // cletter and combination with sletter is also valid
• email → (sletter | cletter)+ @ provider . domain
• salary → (digit)+ ((.)(digit)+)?
Lexical Errors
• Only some errors can be detected by the lexical analysis phase on its own.
• For example, consider a code fragment in C
if(a>b)
printf("a is greater");
esle
printf("b is greater");
• The lexical analyser encounters esle and cannot judge whether it is a misspelling of the keyword else or an identifier.
• esle is a valid identifier, so the lexical analyser must return the token for an identifier and let the later phases handle any error.
• If the lexical analyser is unable to proceed, the error detection and handling routine is invoked.
• Error recovery strategies are:
  ◦ Panic mode error recovery
  ◦ Deleting an extraneous character
  ◦ Inserting a missing character
  ◦ Replacing an incorrect character with the correct character
  ◦ Transposing two adjacent characters
Thank You

Happy Learning !
