
Lexical Analysis

Amitabha Sanyal
(www.cse.iitb.ac.in/~as)

Department of Computer Science and Engineering,


Indian Institute of Technology, Bombay

January 2016
Introduction

The input program – as you see it.

main ()
{
int i,sum;
sum = 0;
for (i=1; i<=10; i++)
sum = sum + i;
printf("%d\n",sum);
}
Introduction

The same program – as the compiler initially sees it. A continuous
sequence of characters without any structure:

main⊔()←֓{←֓⊔int⊔i,sum;←֓⊔sum⊔=⊔0;←֓⊔for⊔(i=1;⊔i<=10;⊔i++)←֓
⊔sum⊔=⊔sum⊔+⊔i;←֓⊔printf("%d\n",sum);←֓}

⊔ – The blank space character
←֓ – The return character

How do you make the compiler see what you see?


Discovering the structure of the program

Step 1:

a. Break up this string into the smallest meaningful units.

main ⊔ ( ) ←֓ { ←֓ ⊔ int ⊔ i , sum ; ←֓ ⊔ sum ⊔ = ⊔ 0 ; ←֓
⊔ for ⊔ ( i = 1 ; ⊔ i <= 10 ; ⊔ i ++ ) ←֓
⊔ sum ⊔ = ⊔ sum ⊔ + ⊔ i ; ←֓
⊔ printf ( "%d\n" , sum ) ; ←֓ }

We get a sequence of lexemes or tokens.

Discovering the structure of the program

Step 1:

b. During this process, remove the ⊔ and the ←֓ characters.

main ( ) { int i , sum ; sum = 0 ; for (
i = 1 ; i <= 10 ; i ++ ) sum = sum + i
; printf ( "%d\n" , sum ) ; }

Steps 1a. and 1b. are interleaved.

This is lexical analysis or scanning.


Discovering the structure of the program

Step 2:

Now group the lexemes to form larger structures.

main ( ) { int i , sum ; sum = 0 ; for (
i = 1 ; i <= 10 ; i ++ ) sum = sum + i
; printf ( "%d\n" , sum ) ; }

fundef
 ├─ fname ─ identifier ─ main
 ├─ params ─ ( )
 └─ compound-stmt ─ { vdecl slist }
      ├─ vdecl ─ type varlist ;
      │    ├─ type ─ int
      │    └─ varlist ─ ...
      └─ slist ─ ...
Discovering the structure of the program

fundef
 ├─ fname ─ identifier ─ main
 ├─ params ─ ( )
 └─ compound-stmt ─ { vdecl slist }
      ├─ vdecl ─ type varlist ;
      │    ├─ type ─ int
      │    └─ varlist ─ varlist , var
      │         ├─ varlist ─ var ─ identifier
      │         └─ var ─ identifier ─ sum
      └─ slist ─ ...

This is syntax analysis or parsing.


Discovering the structure of the program

Why is structure finding done in two steps?


• The process of breaking a program into lexemes (scanning) is simpler
than parsing, so a separate technique is used to do it.
• It reduces the work to be done by the parser.
However, there are tools (ANTLR, for example) that do combine
scanning with parsing.
Lexemes, Tokens and Patterns

Definition: Lexical analysis is the operation of dividing the input
program into a sequence of lexemes (tokens).

Distinguish between
• lexemes – smallest logical units (words) of a program.
Examples – i, sum, for, 10, ++, "%d\n", <=.
• tokens – sets of similar lexemes, i.e. lexemes which have a common
syntactic description.
Examples –
identifier = {i, sum, buffer, . . . }
int constant = {1, 10, . . . }
addop = {+, -}
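
In C, for instance, the distinction could be modelled as follows (a
minimal sketch; the type and field names are illustrative, not taken
from these slides):

/* A token is a category; a lexeme is a concrete string in that category. */
enum token_kind { IDENTIFIER, INT_CONSTANT, ADDOP };

struct token {
    enum token_kind kind;    /* the token, e.g. IDENTIFIER   */
    const char     *lexeme;  /* the actual text, e.g. "sum"  */
};

/* Three different lexemes, but only two distinct tokens: */
struct token a = { IDENTIFIER, "i"   };
struct token b = { IDENTIFIER, "sum" };
struct token c = { ADDOP,      "+"   };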
Lexemes, Tokens and Patterns
What is the basis for grouping lexemes into tokens?
• Why can’t addop and mulop be combined? Why can’t + be a token
by itself?

Lexemes which play similar roles during syntax analysis are grouped into
a common token.
• Operators in addop and mulop have different roles – mulop has a
higher precedence than addop.
• Each keyword plays a different role – is therefore a token by itself.
• Each punctuation symbol and each delimiter is a token by itself.
• All comments are uniformly ignored. They are all grouped under the
same token.
• All identifiers are grouped in a common token.
Lexemes, Tokens and Patterns

Lexemes that are not passed to the later stages of a compiler:


• comments
• white spaces – tab, blanks and newlines
• White spaces are more like separators between lexemes.

These too have to be detected and then ignored.


Lexemes, Tokens and Patterns

Apart from the token itself, the lexical analyser also passes other
information regarding the token. These items of information are called
token attributes.

EXAMPLE

lexeme    <token, token value>
3         <const, 3>
A         <identifier, A>
if        <if, –>
=         <assignop, –>
>         <relop, >>
;         <semicolon, –>
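
Extending the earlier struct token sketch, the token value can be
carried in a tagged union, so the analyser passes one object per lexeme
(the field names are again illustrative assumptions):

struct token {
    int kind;                 /* CONST, IDENTIFIER, IF, ASSIGNOP, RELOP, ... */
    union {
        int         ival;     /* <const, 3>      : kind = CONST, ival = 3        */
        const char *sval;     /* <identifier, A> : kind = IDENTIFIER, sval = "A" */
        int         relkind;  /* <relop, >>      : which relational operator     */
    } value;                  /* left unused for tokens such as <if, –>          */
};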
Lexemes, Tokens and Patterns

The lexical analyser:


• detects the next lexeme
• categorises it into the right token
• passes to the syntax analyser
  • the token name, for further syntax analysis
  • the lexeme itself, in some form, for stages beyond syntax analysis
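
A hand-written analyser following exactly these steps might look like
the sketch below (one-character lookahead via ungetc; the token names,
the buffer and its size are assumptions made for this example):

#include <ctype.h>
#include <stdio.h>

enum { TK_EOF, TK_NUM, TK_ID, TK_OTHER };
char lexeme[64];                 /* text of the lexeme just detected */

int get_next_token(FILE *in)
{
    int c, n = 0;
    do { c = fgetc(in); } while (c == ' ' || c == '\t' || c == '\n');
    if (c == EOF) return TK_EOF;

    if (isdigit(c)) {                        /* categorise: integer constant */
        while (isdigit(c) && n < 63) { lexeme[n++] = c; c = fgetc(in); }
        ungetc(c, in);                       /* give back the lookahead char */
        lexeme[n] = '\0';
        return TK_NUM;
    }
    if (isalpha(c) || c == '_') {            /* categorise: identifier */
        while ((isalnum(c) || c == '_') && n < 63) { lexeme[n++] = c; c = fgetc(in); }
        ungetc(c, in);
        lexeme[n] = '\0';
        return TK_ID;
    }
    lexeme[0] = c; lexeme[1] = '\0';         /* operators, punctuation, ... */
    return TK_OTHER;
}

Note that the routine reads only as much input as one lexeme needs; the
syntax analyser calls it once per token.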
Example – tokens in Java
1. Identifier: A Javaletter followed by zero or more Javaletterordigits.
A Javaletter includes the characters a-z, A-Z, _ and $.
2. Constants:
2.1 Integer Constants – 4 byte (usual int) and 8 byte (long, ending
with an L) representations.
• Binary – 0b0000011,
• Octal – 027 (Note the leading 0),
• Hex – 0x0f28,
• Decimal – 1, -1
2.2 Floating point constants
• Float – 1.0345F, 1.04E-12f, .0345f, 1.04e-13f – ends with f
or F,
• Double – 5.6E-120D, 123.4d, 0.1 – ends with d or D, or does not
end with any of f, F, d, D
2.3 Boolean constants – true and false
2.4 Character constants – 'a', '\u0034' (Unicode hex), '\t'
2.5 String constants – "", "\"", "A string".
2.6 Null constant – null.
Example – tokens in Java

3. Delimiters: (, ), {, }, [, ] , ;, . and ,
4. Operators: =, >, <, . . . , >=
5. Keywords: abstract, boolean . . . volatile, while.
Lexemes, Tokens and Patterns

How does one describe the lexemes that make up the token identifier?
There are variants in different languages:

• A string of alphanumeric characters in which the first character is
a letter.
• A string of alphanumeric characters in which the first character is
a letter, with a length of at most 31.
• A string of letters, digits or underscores in which the first
character is a letter or an underscore, with a length of at most 31;
any characters after the 31st are ignored.

Such descriptions are called patterns. The description may be informal
or formal. Regular expressions are the most commonly used formal
patterns.
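
In the Lex notation used later in these slides, the three variants
could be written as follows; the bounded repetition {0,30} expresses
the length limit (this rendering is a sketch, not from the original):

[a-zA-Z][a-zA-Z0-9]*           any length, first character a letter
[a-zA-Z][a-zA-Z0-9]{0,30}      at most 31 characters
[a-zA-Z_][a-zA-Z0-9_]{0,30}    underscore allowed as first character

The third variant's rule that characters after the 31st are ignored is
not expressible in the pattern itself; the recognizer implements it.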
Lexemes, Tokens and Patterns

A pattern is used to
• specify tokens precisely
• build a recognizer from such specifications
Basic concepts and issues
Where does a lexical analyser fit into the rest of the compiler?
• The front end of most compilers is parser driven.
• When the parser needs the next token, it invokes the Lexical
Analyser.
• Instead of analysing the entire input string, the lexical analyser sees
enough of the input string to return a single token.
• The actions of the lexical analyser and parser are interleaved.

input program ──→ [ lexical analyser ] ⇄ [ parser ] ──→ rest of the compiler

(the parser requests the next token; the lexical analyser returns it)
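
In code, "parser driven" means the parser pulls one token at a time, as
in this sketch (yylex and yyparse are the conventional Lex/Yacc names;
the loop body is elided):

extern int yylex(void);   /* the lexical analyser: next token, 0 at end of input */

void yyparse(void)
{
    int tok;
    while ((tok = yylex()) != 0) {
        /* consume tok, extend the parse, decide what to expect next */
    }
}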
Creating a Lexical Analyzer

Two approaches:
1. Hand code – This is only of historical interest now.
• Possibly more efficient.
2. Use a generator – To generate the lexical analyser from a formal
description.
• The generation process is faster.
• Less prone to errors.
Automatic Generation of Lexical Analysers

• A formal description (specification) of the tokens of the source
language will consist of:
  • a regular expression describing each token, and
  • a code fragment called an action routine describing the action to
    be performed on identifying each token.
• Here is a description of whole numbers and identifiers in the form
accepted by the lexical analyser generator Lex; the {DIGIT} and
{LETTER} definitions precede the translation rules.
DIGIT   [0-9]
LETTER  [a-zA-Z]
%%
{DIGIT}+                    { yylval = atoi(yytext);
                              return NUM;
                            }
{LETTER}({LETTER}|{DIGIT})* { yylval = yytext; /* assumes yylval can hold a
                                                  string; a full spec uses %union */
                              return IDENTIFIER;
                            }
• The global variable yylval holds the token attribute (henceforth to
be called token value).
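
Assuming the widely available flex implementation of Lex, a
specification file, say scanner.l, is typically turned into a running
analyser with:

flex scanner.l      (generates lex.yy.c: DFA tables, driver and action routines)
cc lex.yy.c -lfl    (-lfl supplies a default main() and yywrap())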
Automatic Generation of Lexical Analysers

Lex can read this description and generate a lexical analyser for whole
numbers and identifiers. How?
• The generator puts together:
• A deterministic finite automaton (DFA) constructed from the token
specification.
• A code fragment called a driver routine which can traverse any DFA.
• Code for the action routines.
• These three things taken together constitute the generated lexical
analyser.
Automatic Generation of Lexical Analysers
• How is the lexical analyser generated from the description?
specification
  = regular expressions (processed into a DFA)
  + action routines     (copied verbatim)

input program ──→ [ DFA + action routines + driver routine ] ──→ tokens
                         generated lexical analyser

• Note that the driver routine is common for all generated lexical
analysers.
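
A minimal sketch of such a driver routine in C, for a DFA stored as an
uncompressed transition table (the table layout and all names here are
assumptions; Lex's actual tables are compressed):

#include <stdio.h>

#define NSTATES 32
#define REJECT  (-1)

int delta[NSTATES][256];    /* delta[s][c]: next state, built from the reg. exps. */
int accept_tok[NSTATES];    /* token announced in an accepting state, 0 otherwise */

/* Run the DFA from its start state to the longest match ("maximal munch")
   and return the token of the last accepting state passed through. */
int drive(FILE *in)
{
    int state = 0, token = 0, c;
    long after_match = ftell(in);

    while ((c = fgetc(in)) != EOF && delta[state][c] != REJECT) {
        state = delta[state][c];
        if (accept_tok[state]) {          /* longest match seen so far */
            token = accept_tok[state];
            after_match = ftell(in);
        }
    }
    fseek(in, after_match, SEEK_SET);     /* unread anything beyond the match */
    return token;                         /* 0 means no lexeme matched */
}

The routine never changes between generated analysers; only the two
tables and the action routines do.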
