Unit 1

The document provides an overview of lexical analysis in compiler design, detailing the process of converting a stream of characters into tokens while removing whitespace and comments. It explains the functions of a lexical analyzer, the specification of tokens, and the use of regular expressions and finite automata for language recognition. Additionally, it discusses token attributes, error detection, and recovery strategies in lexical analysis.

Uploaded by

nirajdhanore04

COMPILER DESIGN

(CST327)
LEXICAL ANALYSIS

Unit-1
OVERVIEW
Lexical Analysis and Tokens
• Process:
  • A stream of characters is read from left to right and grouped into tokens; whitespace and comments are removed.
  • The source code is scanned one character at a time.
  • When the scanner encounters whitespace, an operator symbol, or a special symbol, it decides that a word has been completed.
  • The longest match rule is followed and rule priority is applied (a reserved word gets higher priority).
• Output: token stream.
• An error is generated:
  • when an invalid token is found;
  • when the name of an identifier matches an existing reserved word.
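The process above can be sketched as a small scanner. This is a minimal illustration, not a full C lexer: the keyword set, the token class names, and the function name tokenize are assumptions made for the example. Python's regex alternation picks the first alternative that matches, so two-character operators are listed before their one-character prefixes to imitate the longest match rule.

```python
import re

# Hypothetical reserved words and token classes for a tiny C-like language.
KEYWORDS = {"int", "if", "else", "return"}
TOKEN_SPEC = [
    ("SKIP",       r"\s+|/\*.*?\*/"),            # whitespace and comments are dropped
    ("NUMBER",     r"[0-9]+"),
    ("IDENTIFIER", r"[A-Za-z_][A-Za-z0-9_]*"),
    ("OPERATOR",   r"==|<=|>=|!=|\+\+|[-+*/=<>]"),  # longer operators first
    ("SYMBOL",     r"[();,{}]"),
]
MASTER = re.compile("|".join(f"(?P<{n}>{p})" for n, p in TOKEN_SPEC), re.DOTALL)

def tokenize(source):
    """Return a list of (token_class, lexeme) pairs for the given source text."""
    tokens = []
    pos = 0
    while pos < len(source):
        m = MASTER.match(source, pos)
        if m is None:
            # No pattern matches here: report an invalid token.
            raise SyntaxError(f"invalid character {source[pos]!r} at offset {pos}")
        kind, lexeme = m.lastgroup, m.group()
        pos = m.end()
        if kind == "SKIP":
            continue
        # Rule priority: a reserved word wins over the identifier pattern.
        if kind == "IDENTIFIER" and lexeme in KEYWORDS:
            kind = "KEYWORD"
        tokens.append((kind, lexeme))
    return tokens

print(tokenize("int intx = 12; /* note */"))
```

Here intx is scanned as one identifier rather than the keyword int followed by x, which is the longest match rule at work.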
Lexical Analysis and Tokens
• The functions of the lexical analyzer:
  • Strip comments, whitespace, and newline characters from the source program.
  • Tokenize the source program.
  • Pass the tokens to the syntax analyzer.
  • Store information in the symbol table.
  • Invoke the error handler when required.

Specification of Tokens
• An alphabet (Σ) is a finite set of symbols.
• A string over an alphabet is a finite sequence of symbols drawn from that alphabet.
• A language is a set of strings over some fixed alphabet. It may contain a finite or an infinite number of strings.
Example:
  • The binary alphabet is {0, 1}.
  • 01100010 is a string of length 8 over this alphabet.
Language Recognition
• To recognize a language, automata are constructed.
• Finite automata accept the regular languages specified by regular expressions.
Tokens, Patterns and Lexemes
• A token is a name for a set of input strings that logically belong together, for example identifier or keyword. The same token is produced as output for every string in the set.
• This set of strings is described by a rule called a pattern associated with the token. The pattern is said to match each string in the set.
• A lexeme is a sequence of characters in the source program that is matched by the pattern for a token.
Example:
int a = 12;
• int is a lexeme for the token keyword: <keyword, int>
• a is a lexeme for the token identifier: <id, 100>
• = is a lexeme for the token assignment operator: <assign_op, =>
• 12 is a lexeme for the token integer constant: <constant, 101>
• ; is a lexeme for the token special symbol: <symbol, ;>
Tokens- Example 1
Source:
int i;
printf("Number is %d & incremented number is %d", i, ++i);

Lexeme                                        Token type
int                                           Keyword
i                                             Identifier
;                                             Symbol
printf                                        Identifier (C function)
(                                             Symbol
"Number is %d & incremented number is %d"     String
,                                             Symbol
i                                             Identifier
++                                            Operator
i                                             Identifier
)                                             Symbol
;                                             Symbol
Tokens- Example 2
Source:
int a, b, sum;
printf("\n Enter two numbers:");
scanf("%d %d",&a,&b);
sum=a+b;
printf("sum: %d", sum);

Lexeme                       Token
int                          Keyword
a                            Identifier
b                            Identifier
sum                          Identifier
;                            Symbol
printf                       Identifier (C function)
(                            Symbol
)                            Symbol
scanf                        Identifier (C function)
"\n Enter two numbers:"      String
"%d %d"                      String
&                            Symbol
=                            Operator
+                            Operator
"sum: %d"                    String
,                            Symbol
Attributes for Tokens
• More than one lexeme can match a pattern.
• In that case the lexical analyzer must provide additional information about the particular lexeme that was matched, so that subsequent phases of the compiler can tell the lexemes apart.
  For example, the pattern for a digit is [0-9]. Every digit from 0 to 9 matches this pattern, but the code generator needs to know which digit was actually matched.
• The lexical analyzer records this information about tokens as their associated attributes.
Lexical Analyser and Symbol Table- Example 1
Compute the tokens in the following statement:
printf(" String % d" , ++ i ++ &&& i ** a);

The tokens are:
printf   (   " String % d"   ,   ++   i   ++   &&   &   i   *   *   a   )   ;
There are 15 tokens in total.
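The count of 15 can be checked mechanically. The sketch below is an illustration under stated assumptions: it covers only the token kinds that occur in this statement, uses straight double quotes for the string literal, and the name split_tokens is made up for the example. Longer operators are listed before shorter ones so that &&& splits into && and &.

```python
import re

# Token patterns for this one statement only. Multi-character operators come
# before their one-character prefixes to imitate the longest match rule.
TOKEN_RE = re.compile(r'''
    (?P<STRING>"[^"\n]*")
  | (?P<IDENTIFIER>[A-Za-z_][A-Za-z0-9_]*)
  | (?P<OPERATOR>\+\+|&&|[+*&])
  | (?P<SYMBOL>[(),;])
''', re.VERBOSE)

def split_tokens(source):
    """Return the lexemes of source in order, skipping whitespace between them."""
    tokens, pos = [], 0
    while pos < len(source):
        if source[pos].isspace():
            pos += 1
            continue
        m = TOKEN_RE.match(source, pos)
        if m is None:
            raise SyntaxError(f"invalid character {source[pos]!r} at offset {pos}")
        tokens.append(m.group())
        pos = m.end()
    return tokens

statement = 'printf(" String % d" , ++ i ++ &&& i ** a);'
lexemes = split_tokens(statement)
print(len(lexemes), lexemes)
```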
Lexical Analyser and Symbol Table- Example 2
• Specify the <token, attribute> pairs for the C statement a = (b + c) * 2 ;

<id, 100>
<=, > or <operator, = address>
<(, >
<id, 101>
<+, >
<id, 102>
<), >
<*, >
<constant, 103>
<;, >
• The number after the comma is a pointer to the symbol table entry.
• The symbol table entry at location 100 stores the identifier a, 101 stores b, and 102 stores c.
• The constant value 2 is stored at location 103.
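A rough sketch of how such pairs could be produced is given below. It assumes the symbol table is a dictionary from lexeme to an index starting at 100, that operators and punctuation carry no attribute (shown as None), and that the function name scan and the group names are free choices for the example.

```python
import re

# One pattern per token kind that appears in the statement.
TOKEN_RE = re.compile(
    r"\s*(?:(?P<id>[A-Za-z_][A-Za-z0-9_]*)|(?P<constant>[0-9]+)|(?P<op>[=+*();]))"
)

def scan(source, base=100):
    """Return (pairs, symbol_table) for a statement built from ids, constants and operators."""
    symbol_table = {}          # lexeme -> index of its symbol table entry
    pairs = []
    pos = 0
    source = source.rstrip()
    while pos < len(source):
        m = TOKEN_RE.match(source, pos)
        if m is None:
            raise SyntaxError(f"invalid character at offset {pos}")
        pos = m.end()
        kind = m.lastgroup
        lexeme = m.group(kind)
        if kind == "op":
            pairs.append((lexeme, None))          # operators have no attribute here
        else:
            # Reuse the entry if the lexeme was seen before, else take the next slot.
            index = symbol_table.setdefault(lexeme, base + len(symbol_table))
            pairs.append((kind, index))
    return pairs, symbol_table

pairs, table = scan("a = (b + c) * 2 ;")
for pair in pairs:
    print(pair)
```

Storing the constant 2 in the same table as the identifiers follows the slide; a real compiler may keep a separate literal table.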
Regular Expression
• A regular expression is a notation for specifying patterns.
• Programming language tokens can be described by regular languages.
• Notations used to represent regular expressions:

Notation   Description                                   Example
{ }+       One or more repetitions                       {a}+      The string contains a repeated one or more times.
[ ]        Any one of the listed characters              [abc]     The string can contain a or b or c.
           (called a character class)
-          Range                                         [a-z]     The string can contain any character from a to z.
?          Zero or one instance                          a?
( )        Grouping                                      (a | b)
^          Except                                        [^a]      a is not included.
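The notations can be tried out with Python's re module. This is only an illustration: Python writes the one-or-more form as a+ rather than {a}+, and the helper name matches and the sample strings are assumptions for the example.

```python
import re

def matches(pattern, text):
    """True if the whole of text is matched by pattern."""
    return re.fullmatch(pattern, text) is not None

# (pattern, sample string, whether the sample should match)
examples = [
    ("a+",    "aaa", True),   # one or more repetitions of a
    ("a+",    "",    False),
    ("[abc]", "b",   True),   # character class: exactly one of a, b, c
    ("[abc]", "d",   False),
    ("[a-z]", "q",   True),   # range
    ("[a-z]", "Q",   False),
    ("a?",    "",    True),   # zero or one instance
    ("a?",    "aa",  False),
    ("(a|b)", "b",   True),   # grouping with alternation
    ("(a|b)", "c",   False),
    ("[^a]",  "z",   True),   # any single character except a
    ("[^a]",  "a",   False),
]

for pattern, text, expected in examples:
    print(f"{pattern!r:10} {text!r:8} {matches(pattern, text)}")
```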
RE Examples- Questions
• The regular expression for strings of 0's and 1's, not accepting the empty string, is _____
• The regular expression for the string abc is _____
• The regular expression for strings containing the substring ab, with any letters before and after it, is _____
• The regular expression for strings of small letters and capital letters separated by @ is _____
Transition Diagram
• A transition diagram keeps track of information about the characters that are seen as the forward pointer scans the input.
• Positions in a transition diagram are drawn as circles and are called states.
• The states are connected by arrows called edges.
• A double circle indicates an accepting state, a state in which a token has been found.
• An asterisk (*) on an accepting state indicates that input retraction must take place.
• Finite automata (FA) are used to represent transition diagrams.
Finite Automata
• A finite automaton (FA) is a recognizer for the language described by a regular expression.
• The mathematical model of a finite automaton consists of:
  • a finite set of input symbols (Σ)
  • a finite set of states (Q)
  • one start state (q0)
  • one or more final states (qf)
  • a transition function (δ)
• The transition function (δ) maps a state and an input symbol to the next state: δ : Q × Σ → Q
[Figure: FA that accepts strings starting with a, ending with b, and having any number of b's]
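The five parts of the model translate directly into data. The sketch below assumes the captioned automaton recognizes the regular expression a b+ (one a followed by one or more b's); that reading, the state names q1 and q2, and the function name accepts are assumptions for the example. A missing entry in the transition table is treated as a rejecting dead state.

```python
SIGMA = {"a", "b"}                 # input symbols
STATES = {"q0", "q1", "q2"}        # Q
START = "q0"                       # start state
FINAL = {"q2"}                     # final states
DELTA = {                          # transition function, Q x Sigma -> Q
    ("q0", "a"): "q1",
    ("q1", "b"): "q2",
    ("q2", "b"): "q2",
}

def accepts(text):
    """Run the automaton over text and report whether it ends in a final state."""
    state = START
    for ch in text:
        state = DELTA.get((state, ch))
        if state is None:          # no transition defined: reject
            return False
    return state in FINAL

for sample in ["ab", "abbb", "a", "aba", ""]:
    print(repr(sample), accepts(sample))
```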
Transition Diagram- Example 1
Transition diagram for relational operators in the Java language:
<, <=, ==, !=, > and >=
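A transition diagram for these six operators can be hand-coded as a function that looks one character ahead. This is a sketch, with the function name scan_relop and the (lexeme, next position) return shape chosen for the example. When the lookahead character is not =, it is left unread, which corresponds to input retraction in the diagram.

```python
def scan_relop(text, pos=0):
    """Return (lexeme, next_pos) for a relational operator at pos, or None."""
    if pos >= len(text):
        return None
    first = text[pos]
    lookahead = text[pos + 1] if pos + 1 < len(text) else ""
    if first in ("<", ">"):
        if lookahead == "=":
            return first + "=", pos + 2     # <= or >=
        return first, pos + 1               # < or >; the lookahead is not consumed
    if first in ("=", "!"):
        if lookahead == "=":
            return first + "=", pos + 2     # == or !=
        return None                         # a lone = or ! is not a relational operator
    return None

for sample in ["<=x", "<x", "==", "!=", ">", "=1", "a"]:
    print(repr(sample), scan_relop(sample))
```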
Transition Diagram- Example 2
Transition diagram for identifiers in the C language
The regular expression for identifier uses the letter and digit regular expressions:
letter → a | A | b | B | c | C | ...
digit → 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
identifier → ( _ | letter ) ( letter | digit | _ )*
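The identifier definition corresponds to a two-state diagram: the start state accepts an underscore or a letter, and the second state loops on letters, digits, and underscores. A minimal hand-coded version is sketched below; the function name scan_identifier and its return shape are assumptions for the example, and letter is limited to ASCII letters.

```python
import string

LETTERS = set(string.ascii_letters)
DIGITS = set(string.digits)

def scan_identifier(text, pos=0):
    """Return (lexeme, next_pos) for an identifier starting at pos, or None."""
    # Start state: the first character must be _ or a letter.
    if pos >= len(text) or not (text[pos] == "_" or text[pos] in LETTERS):
        return None
    end = pos + 1
    # Loop state: consume letters, digits and underscores.
    while end < len(text) and (
        text[end] == "_" or text[end] in LETTERS or text[end] in DIGITS
    ):
        end += 1
    # The character at end (if any) is left unread for the next token.
    return text[pos:end], end

for sample in ["sum1 = 0", "_x", "9lives", ""]:
    print(repr(sample), scan_identifier(sample))
```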

Transition Diagram- Example 3
HOMEWORK
Transition diagram for numeric constants
• Integers: Int → digit (digit)*
• Floating point constants: Fconst → digit+ ((.)(digit)+)?
• Exponential numbers: exp → digit+ (.digit+)? E ( '+' | '-' )? digit+
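The three definitions can be written as Python regular expressions and tried on sample lexemes. This is a sketch: the function name classify and the order of the checks are assumptions. Because Fconst as defined also matches plain integers (its fractional part is optional), the integer pattern is tested first.

```python
import re

DIGITS = r"[0-9]+"
INT = DIGITS                                      # digit (digit)*
FCONST = DIGITS + r"(\.[0-9]+)?"                  # digit+ ((.)(digit)+)?
EXP = DIGITS + r"(\.[0-9]+)?E[+-]?[0-9]+"         # digit+ (.digit+)? E (+|-)? digit+

def classify(lexeme):
    """Return the name of the first definition that matches the whole lexeme."""
    for name, pattern in (("Int", INT), ("Fconst", FCONST), ("exp", EXP)):
        if re.fullmatch(pattern, lexeme):
            return name
    return None

for sample in ["42", "3.14", "6.02E+23", "1E5", "1.", ".5"]:
    print(repr(sample), classify(sample))
```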

Recognition of Tokens

[Figure: transition diagram for relop]

Example -1(TRY)
Transition diagram for accepting the data needed by a company for its employees
A company wants to accept data from its employees. The data is described as follows:
• Name of the employee: can contain only letters. The name can start with a capital letter.
• Employee id: has the format EMP_<number>.
• Address: must start with one or more digits, followed by a comma separator, followed by letters.
• E-mail: letters, followed by @, followed by a mail provider (like gmail, yahoo), then . and a domain name (like com, org).
• Salary: a floating point number is allowed.
The regular definition for the new language to be designed is as follows:
• sletter → [a-z]
• cletter → [A-Z]
• digit → [0-9]
• underscore → _
• provider → gmail | yahoo | rediffmail | hotmail
• domain → com | org | edu
• name → cletter sletter+
• empid → EMP underscore (digit)+
• address → digit+ , sletter+ // cletter and combinations with sletter are also valid
• email → (sletter | cletter)+ @ provider . domain
• salary → (digit)+ ([.] (digit)+)?
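The regular definition can be tried out as Python regular expressions. The sketch below rests on several assumptions: the email pattern is reconstructed from the prose description (letters, @, provider, ., domain), salary is read as digits with an optional fractional part, optional whitespace is allowed after the comma in an address, and the names PATTERNS and is_valid are made up for the example.

```python
import re

PROVIDER = r"(?:gmail|yahoo|rediffmail|hotmail)"
DOMAIN = r"(?:com|org|edu)"

# Field name -> pattern built from sletter, cletter, digit and underscore.
PATTERNS = {
    "name":    r"[A-Z][a-z]+",                       # cletter sletter+
    "empid":   r"EMP_[0-9]+",                        # EMP underscore digit+
    "address": r"[0-9]+,\s*[a-z]+",                  # digit+ , sletter+
    "email":   r"[A-Za-z]+@" + PROVIDER + r"\." + DOMAIN,
    "salary":  r"[0-9]+(?:\.[0-9]+)?",               # digits, optional fraction
}

def is_valid(field, value):
    """True if value as a whole matches the pattern for the given field."""
    return re.fullmatch(PATTERNS[field], value) is not None

record = {
    "name": "Asha",
    "empid": "EMP_104",
    "address": "12, mainroad",
    "email": "asha@gmail.com",
    "salary": "45000.50",
}
for field, value in record.items():
    print(field, repr(value), is_valid(field, value))
```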
Lexical Errors
• Some errors can be detected only by the lexical analysis phase.
• For example, consider a code fragment in C:
if(a>b)
printf("a is greater");
esle
printf("b is greater");
• The lexical analyser encounters esle and cannot judge whether it is a misspelling of the keyword else or an identifier.
• esle is a valid identifier, so the lexical analyser must return the token for an identifier and let the later phases handle any error.
• If the lexical analyser is unable to proceed, the error handler is invoked.
• Error recovery strategies are:
  • Panic mode error recovery
  • Deleting an extraneous character
  • Inserting a missing character
  • Replacing an incorrect character with the correct character
  • Transposing two adjacent characters
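Panic mode recovery can be sketched as follows: when no token pattern matches at the current position, the scanner skips characters until a token can start again, records what it skipped, and carries on. The token patterns, the function name tokenize_with_recovery, and the (offset, text) error format are assumptions for this illustration.

```python
import re

# A deliberately small token set: identifiers, integers, a few operators and symbols.
TOKEN_RE = re.compile(r"[A-Za-z_][A-Za-z0-9_]*|[0-9]+|[-+*/=;()]")

def tokenize_with_recovery(source):
    """Return (tokens, errors); errors is a list of (offset, skipped_text)."""
    tokens, errors, pos = [], [], 0
    while pos < len(source):
        if source[pos].isspace():
            pos += 1
            continue
        m = TOKEN_RE.match(source, pos)
        if m:
            tokens.append(m.group())
            pos = m.end()
            continue
        # Panic mode: delete successive characters until a token can start.
        start = pos
        while (pos < len(source) and not source[pos].isspace()
               and not TOKEN_RE.match(source, pos)):
            pos += 1
        errors.append((start, source[start:pos]))
    return tokens, errors

print(tokenize_with_recovery("a = b @# 3;"))
print(tokenize_with_recovery("esle"))
```

The second call shows the point made about esle: it scans as an ordinary identifier, so no lexical error is reported.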
Thank You

Happy Learning !
