Chapter 2 - Lexical Analysis

Uploaded by

om55500r

Chapter 2- LEXICAL ANALYSIS

2.1 OVERVIEW OF LEXICAL ANALYSIS

o To identify tokens, we need a method of describing the possible tokens that can appear in the input stream.
For this purpose we introduce regular expressions, a notation that can describe essentially all the tokens of a
programming language.
o Secondly, having decided what the tokens are, we need a mechanism to recognize them in the input stream.
This is done by token recognizers, which are designed using transition diagrams and finite automata.

2.2 ROLE OF LEXICAL ANALYSIS

Lexical analysis is the first phase of a compiler; the component that performs it is also known as the scanner
(text scanner). This phase reads the source code as a stream of characters and groups them into meaningful
lexemes. The lexical analyzer represents each lexeme as a token of the form:
<token-name, attribute-value>
In this way it converts the high-level input program into a sequence of tokens.
• Lexical analysis can be implemented with a deterministic finite automaton (DFA).
• The output is a sequence of tokens that is sent to the parser for syntax analysis.

Upon receiving a 'get next token' command from the parser, the lexical analyzer (LA) reads input characters until it
can identify the next token. The LA then returns to the parser a representation of the token it has found. The
representation is an integer code if the token is a simple construct such as a parenthesis, comma or colon.
The LA may also perform certain secondary tasks at the user interface. One such task is stripping out comments and
white space (blank, tab and newline characters) from the source program. Another is correlating error messages
from the compiler with the source program.

2.3 LEXICAL ANALYSIS VS PARSING:

Lexical analysis
o A scanner simply turns an input string (say, a file) into a list of tokens. These tokens represent things like
identifiers, parentheses, operators etc.
o The lexical analyzer (the "lexer") parses individual symbols from the source code file into tokens. From there, the
"parser" proper turns those whole tokens into sentences of your grammar.

Parsing
o A parser converts this list of tokens into a tree-like object that represents how the tokens fit together to form a
cohesive whole (sometimes referred to as a sentence).
o A parser does not give the nodes any meaning beyond structural cohesion. The next thing to do is extract meaning
from this structure (sometimes called contextual analysis).

For example, in the C language, a variable declaration line such as int value = 100; is divided into tokens, as
worked through in Section 2.4.

Alphabets
Any finite set of symbols is an alphabet. {0,1} is the binary alphabet, {0,1,2,3,4,5,6,7,8,9,A,B,C,D,E,F} is the
hexadecimal alphabet, and {a-z, A-Z} is the alphabet of English letters.

Strings
Any finite sequence of symbols from an alphabet is called a string. The length of a string is the total number of
symbol occurrences in it, e.g., the length of the string tutorialspoint is 14, denoted by |tutorialspoint| = 14. A
string containing no symbols, i.e. a string of zero length, is known as the empty string and is denoted by ε (epsilon).
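
The definitions above can be tried out directly. The following is a minimal Python sketch, assuming each symbol is a single character; the names BINARY, HEX, EPSILON and is_string_over are illustrative and do not come from the notes.

```python
# Alphabets from the text, modelled as Python sets of one-character symbols.
BINARY = set("01")
HEX = set("0123456789ABCDEF")

EPSILON = ""  # the empty string, written ε in the text


def is_string_over(s, alphabet):
    """Return True if every symbol of s belongs to the alphabet."""
    return all(symbol in alphabet for symbol in s)


print(len("tutorialspoint"))            # the length |tutorialspoint|
print(len(EPSILON))                     # the empty string has zero length
print(is_string_over("1011", BINARY))   # a string over {0,1}
print(is_string_over("12", BINARY))     # '2' is not a binary symbol
print(is_string_over("1F", HEX))        # a string over the hexadecimal alphabet
```

Note that the empty string counts as a string over every alphabet, because it contains no symbol that could fall outside it.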

Arithmetic Symbols     Addition(+), Subtraction(-), Modulo(%), Multiplication(*), Division(/)
Punctuation            Comma(,), Semicolon(;), Dot(.), Arrow(->)
Assignment             =
Special Assignment     +=, /=, *=, -=
Comparison             ==, !=, <, <=, >, >=
Preprocessor           #
Location Specifier     &
Logical                &, &&, |, ||, !
Shift Operator         >>, >>>, <<, <<<

2.4 TOKEN, LEXEME, PATTERN:

Token:
A lexical token is a sequence of characters that can be treated as a unit (a single logical entity) in the
grammar of a programming language.
Typical tokens are:
1) identifiers 2) keywords 3) operators 4) special symbols 5) constants

Pattern: The set of strings in the input for which the same token is produced as output is described by a
rule called a pattern, which is associated with the token.

For example, the C declarations

int value = 100; char name="hello";

contain tokens: each keyword, identifier, constant and operator is a token for the lexical analyzer. The first
declaration contains the tokens:

int (keyword), value (identifier), = (operator), 100 (constant) and ; (symbol).

Examples of tokens:
• Type tokens (id, number, real, ...)
• Keyword (alphabetic) tokens (if, void, return, ...)
• Punctuation tokens (',', ';', ...)
Keywords: for, while, if etc.
Identifiers: variable names, function names etc.
Operators: '+', '++', '-' etc.
Separators: ',', ';' etc.
Examples of non-tokens:
• Comments, preprocessor directives, macros, blanks, tabs, newlines etc.
Lexeme: The sequence of input characters that is matched by a pattern and forms a single token is called a
lexeme,
e.g. "float", "abs_zero_Kelvin", "=", "-", "273", ";".
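
To make the relationship between the three terms concrete, here is a small Python sketch in which each token name is paired with a pattern (a regular expression) and the text each pattern matches is the lexeme. The token names, the keyword subset and the function tokens_and_lexemes are assumptions made for illustration; they are not a complete description of C.

```python
import re

# Hypothetical token names paired with the patterns that describe their
# lexemes. Order matters: keywords are tried before identifiers so that
# "float" is not classified as an identifier.
PATTERNS = [
    ("keyword",    r"\b(?:int|float|char|if|while|for|return)\b"),
    ("identifier", r"[A-Za-z_][A-Za-z0-9_]*"),
    ("constant",   r"\d+"),
    ("operator",   r"[=+\-*/]"),
    ("separator",  r"[;,]"),
    ("skip",       r"\s+"),
]
MASTER = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in PATTERNS))


def tokens_and_lexemes(source):
    """Return (token, lexeme) pairs; white space is matched but not emitted."""
    pairs = []
    pos = 0
    while pos < len(source):
        match = MASTER.match(source, pos)
        if match is None:
            raise ValueError(f"no pattern matches at position {pos}")
        if match.lastgroup != "skip":
            pairs.append((match.lastgroup, match.group()))
        pos = match.end()
    return pairs


for token, lexeme in tokens_and_lexemes("float abs_zero_Kelvin = -273;"):
    print(f"{token:<10} {lexeme}")
```

The loop prints one line per lexeme, and the six lexemes it finds are the ones listed in the definition of a lexeme above.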

How Lexical Analyzer functions

1. Tokenization, i.e. dividing the program into valid tokens.
2. Removing white space characters.
3. Removing comments.
4. Helping to generate error messages by providing the row number and column number.
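
The four functions listed above can be sketched together in a few lines of Python. This is an illustrative scanner under simplifying assumptions, not a full C lexer: the pattern set is deliberately tiny, and the names TOKEN_RE, LexicalError and scan are made up for this example.

```python
import re

# Illustrative patterns only; a real C lexer needs many more.
TOKEN_RE = re.compile(r"""
    (?P<comment>//[^\n]*|/\*.*?\*/)      # line and block comments
  | (?P<space>\s+)                       # blanks, tabs, newlines
  | (?P<word>[A-Za-z_]\w*)               # keywords and identifiers
  | (?P<number>\d+)                      # integer constants
  | (?P<symbol>[-+*/=<>!]=?|[(){};,])    # operators and separators
""", re.VERBOSE | re.DOTALL)


class LexicalError(Exception):
    """Raised when no pattern matches the remaining input."""


def scan(source):
    """Yield (lexeme, row, column) and skip white space and comments."""
    row, col, pos = 1, 1, 0
    while pos < len(source):
        match = TOKEN_RE.match(source, pos)
        if match is None:
            raise LexicalError(
                f"unexpected {source[pos]!r} at row {row}, column {col}")
        text = match.group()
        if match.lastgroup not in ("comment", "space"):
            yield text, row, col
        # Advance the row and column counters past the matched text.
        newlines = text.count("\n")
        if newlines:
            row += newlines
            col = len(text) - text.rfind("\n")
        else:
            col += len(text)
        pos = match.end()


for lexeme, row, col in scan("int a; // counter\na = 10;"):
    print(row, col, lexeme)
```

Because the position counters are updated for skipped text as well, the row and column reported in an error message refer to the original source, not to the comment-free token stream.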

Example: description of tokens

Token       Lexeme                  Pattern
const       const                   const
if          if                      if
relation    <, <=, =, <>, >=, >     < or <= or = or <> or >= or >
id          pi                      letter followed by letters and digits
num         3.14                    any numeric constant
literal     "core"                  any characters between " and " except "
A pattern is a rule describing the set of lexemes that can represent a particular token in the source program.

2.5 LEXICAL ERRORS:

Lexical errors are the errors thrown by your lexer when it is unable to continue, which means that there is no
way to recognise a lexeme as a valid token for your lexer. Syntax errors, on the other hand, are thrown
by your parser when a given sequence of already recognised valid tokens does not match any of the right-hand
sides of your grammar rules. A simple panic-mode error handling system requires that we return to a high-level
parsing function when a parsing or lexical error is detected.

Error-recovery actions are:

i. Delete one character from the remaining input.
ii. Insert a missing character into the remaining input.
iii. Replace a character by another character.
iv. Transpose two adjacent characters.
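
As a rough illustration, the four recovery actions can be enumerated for a single misspelled lexeme. The sketch below assumes that the goal is to repair a keyword, that only lowercase letters are inserted or substituted, and that KEYWORDS is a small illustrative subset; none of these choices comes from the notes.

```python
import string

KEYWORDS = {"int", "if", "for", "while", "return"}  # illustrative subset
ALPHABET = string.ascii_lowercase


def single_edit_repairs(lexeme):
    """Return every string reachable from lexeme by one recovery action."""
    splits = [(lexeme[:i], lexeme[i:]) for i in range(len(lexeme) + 1)]
    deletes = {a + b[1:] for a, b in splits if b}
    inserts = {a + c + b for a, b in splits for c in ALPHABET}
    replaces = {a + c + b[1:] for a, b in splits if b for c in ALPHABET}
    transposes = {a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1}
    return deletes | inserts | replaces | transposes


def suggest_keywords(lexeme):
    """Keywords that a single recovery action would turn lexeme into."""
    return sorted(single_edit_repairs(lexeme) & KEYWORDS)


print(suggest_keywords("whlie"))   # repaired by transposing two characters
print(suggest_keywords("retrn"))   # repaired by inserting a character
print(suggest_keywords("iff"))     # repaired by deleting a character
```

A lexer that applies such repairs would normally pick the candidate needing the fewest actions; here every candidate needs exactly one.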

• The lexical analyzer identifies an error with the help of a finite automaton and the grammar of the language it is
built for, such as C or C++, and reports the row number and column number of the error.
Suppose we pass the statement
a=b+c;
through the lexical analyzer. It will generate a token sequence like this:
id=id+id;
where each id refers to its variable's entry in the symbol table, which holds all of its details.
For example, consider the program
int main()
{
// 2 variables
int a, b;
a = 10;
return 0;
}

All the valid tokens are:

'int' 'main' '(' ')' '{' 'int' 'a' ',' 'b' ';'
'a' '=' '10' ';' 'return' '0' ';' '}'

Note that the comment has been omitted.

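As another example, consider a printf statement with a single string argument, such as the printf("HELLO "); line
in Exercise 4.

There are 5 valid tokens in such a printf statement: printf, (, the string literal, ) and ;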

Exercise 1:
Count the number of tokens:
int main()
{
int a = 10, b = 20;
printf("sum is :%d",a+b);
return 0;
}
Answer: Total number of tokens: 27.
Exercise 2:
Count the number of tokens:
int max(int i);
• The lexical analyzer first reads int, finds it to be valid and accepts it as a token.
• max is read next and, once ( has been read, is found to be a valid function name; ( is itself a token.
• int is also a token, then i is another token, followed by ) and finally ;
Answer: Total number of tokens: 7
int, max, (, int, i, ), ;

Exercise 3:
Count the number of tokens and classify each one:
int studno=43609023;

Answer: Total number of tokens: 5

int (keyword), studno (identifier), = (operator), 43609023 (constant), ; (separator)

Exercise 4:
Count the number of tokens (the count for each line is shown on the right):
int main()                          4
{                                   1
int a = 10, b = 20;                 9
printf("sum is :%d" , a + b );      9
printf("HELLO ");                   5
printf( "\n") ;                     5
return 0;                           3
}                                   1
Answer: Total number of tokens: 37
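
The answers to the exercises can be checked mechanically. The following Python sketch counts tokens with one regular expression that covers only what these exercises need (string literals, identifiers and keywords, integer constants, and single-character operators and separators). The names TOKEN and count_tokens are assumptions for this example, and any character the pattern does not cover is silently skipped rather than reported as an error.

```python
import re

# A deliberately small pattern set, just enough for the exercises above.
# The string-literal alternative comes first so that a quoted string is
# counted as one token.
TOKEN = re.compile(r'"(?:\\.|[^"\\])*"|[A-Za-z_]\w*|\d+|[-+*/=(){};,]')


def count_tokens(source):
    """Count tokens; white space between tokens is never matched."""
    return len(TOKEN.findall(source))


EXERCISE_1 = r'''
int main()
{
int a = 10, b = 20;
printf("sum is :%d",a+b);
return 0;
}
'''

EXERCISE_4 = r'''
int main()
{
int a = 10, b = 20;
printf("sum is :%d" , a + b );
printf("HELLO ");
printf( "\n") ;
return 0;
}
'''

print(count_tokens(EXERCISE_1))
print(count_tokens("int max(int i);"))
print(count_tokens("int studno=43609023;"))
print(count_tokens(EXERCISE_4))
```

The four printed counts correspond to Exercises 1 to 4 in order and can be compared with the answers given above.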

2.6 Lexical Grammar and FSMs

To recognize a token described by a regular definition, the regular expression in the definition is often transformed
into a finite state machine (FSM). The resulting FSM has a finite number of states, including an initial state and a
set of accepting states.

For example, the regular expression a | b can be converted into the following FSM.

[Figure: FSM for a | b]

[Figure: FSM for R = (a|b)c, built from the FSM for (a|b)]

[Figure: FSM for R = (a|b)*c]

[Figure: FSM for R = a(bc)*]

The FSM for (bc)* would be represented with a loop on bc.

Concatenating the FSM for a with the FSM for (bc)* gives the FSM for a(bc)*.
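
Since the diagrams are not available here, the two machines for (a|b)*c and a(bc)* can be written down as transition tables and simulated. The sketch below is one possible deterministic encoding; the state numbers and the names accepts, STAR_C and A_BC_STAR are assumptions and need not match the original figures.

```python
def accepts(transitions, start, accepting, text):
    """Run a deterministic FSM; a missing transition rejects the input."""
    state = start
    for symbol in text:
        state = transitions.get((state, symbol))
        if state is None:
            return False
    return state in accepting


# (a|b)*c : loop on a and b in state 0, then c leads to accepting state 1.
STAR_C = {(0, "a"): 0, (0, "b"): 0, (0, "c"): 1}

# a(bc)* : a leads to accepting state 1; from there "bc" loops back to 1.
A_BC_STAR = {(0, "a"): 1, (1, "b"): 2, (2, "c"): 1}

for text in ["c", "abbac", "abca"]:
    print(text, accepts(STAR_C, 0, {1}, text))

for text in ["a", "abcbc", "ab"]:
    print(text, accepts(A_BC_STAR, 0, {1}, text))
```

In A_BC_STAR the loop on bc described in the text appears as the two transitions from state 1 to state 2 and back, so the machine is in the accepting state exactly after a, abc, abcbc and so on.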

Language Definition
o Appearance of a programming language:
  Vocabulary: regular expressions
  Syntax: Backus-Naur Form (BNF) or context-free grammar (CFG)
o To specify the syntax of a language: CFG and BNF
o Example: the if-else statement in C has the form
  statement → if ( expression ) statement else statement

Alphabet: An alphabet of a language is a set of symbols.
o Examples:
  {0,1} for a binary number system (language) = {0,1,100,101,...}
  {a,b,c} for language = {a,b,c,ac,abcc,...}
  {if,(,),else,...} for if statements = {if(a==1)goto10, if ...}

String: A string over an alphabet is a sequence of zero or more symbols from the alphabet.
o Examples: 0, 1, 10, 00, 11, 111, 0101, ... are strings over the alphabet {0,1}
o The null string is the string that does not contain any symbol of the alphabet.

Language: A language is a subset of all the strings over a given alphabet.
o Alphabets Ai and languages Li for Ai:
  A0={0,1}                L0={0,1,100,101,...}
  A1={a,b,c}              L1={a,b,c,ac,abcc,acb,...}
  A2={all of C tokens}    L2={all sentences of C programs}

Example 2.1. Grammar for expressions consisting of digits and plus and minus signs.
o Language of expressions L={9-5+2, 3-1, ...}
o The productions of the grammar for this language L are:
  list → list + digit
  list → list - digit
  list → digit
  digit → 0|1|2|3|4|5|6|7|8|9
o list, digit: grammar variables, grammar symbols
o 0,1,2,3,4,5,6,7,8,9,-,+: tokens, terminal symbols

Convention for specifying a grammar
o Terminal symbols: boldface strings such as if, num, id
o Nonterminal symbols (grammar symbols): italicized names such as list, digit, A, B

Grammar G=(N,T,P,S)
o N: a set of nonterminal symbols
o T: a set of terminal symbols, tokens
o P: a set of production rules
o S: a start symbol, S∈N

Example: Grammar G for a language L={9-5+2, 3-1, ...}

o G=(N,T,P,S)
  N={list, digit}
  T={0,1,2,3,4,5,6,7,8,9,-,+}
  P: list → list + digit
     list → list - digit
     list → digit
     digit → 0|1|2|3|4|5|6|7|8|9
  S=list
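
The four components of G map naturally onto a small data structure. The Python sketch below stores the example grammar and checks a few consistency conditions; the class name Grammar, its field names and the function is_well_formed are assumptions for illustration, and the requirement that N and T be disjoint is the usual convention rather than something stated in the notes.

```python
from typing import NamedTuple


class Grammar(NamedTuple):
    nonterminals: frozenset   # N
    terminals: frozenset      # T
    productions: tuple        # P, as (head, body) pairs
    start: str                # S


G = Grammar(
    nonterminals=frozenset({"list", "digit"}),
    terminals=frozenset("0123456789-+"),
    productions=(
        ("list", ("list", "+", "digit")),
        ("list", ("list", "-", "digit")),
        ("list", ("digit",)),
    ) + tuple(("digit", (d,)) for d in "0123456789"),
    start="list",
)


def is_well_formed(g):
    """Check S in N, N and T disjoint, and P using only symbols of N or T."""
    symbols = g.nonterminals | g.terminals
    return (
        g.start in g.nonterminals
        and not (g.nonterminals & g.terminals)
        and all(head in g.nonterminals and set(body) <= symbols
                for head, body in g.productions)
    )


print(is_well_formed(G))
print(len(G.productions))
```

Writing digit → 0|1|...|9 as ten separate (head, body) pairs makes explicit that the vertical bar is only shorthand for several productions with the same head.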
Parse Tree
A derivation can be conveniently represented by a derivation tree (parse tree).
o The root is labeled by the start symbol.
o Each leaf is labeled by a token or ε.
o Each interior node is labeled by a nonterminal symbol.
o When a production A→x1…xn is applied, nodes labeled x1…xn are made children of the node labeled A.
• root: the start symbol
• internal nodes: nonterminals
• leaf nodes: terminals

Example G:
  list → list + digit | list - digit | digit
  digit → 0|1|2|3|4|5|6|7|8|9
Leftmost derivation for 9-5+2:
  list ⇒ list+digit ⇒ list-digit+digit ⇒ digit-digit+digit ⇒ 9-digit+digit ⇒ 9-5+digit ⇒ 9-5+2
Rightmost derivation for 9-5+2:
  list ⇒ list+digit ⇒ list+2 ⇒ list-digit+2 ⇒ list-5+2 ⇒ digit-5+2 ⇒ 9-5+2
[Figure: parse tree for 9-5+2]
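
The link between the parse tree and the leftmost derivation can be reproduced in code. The sketch below builds the parse tree for the list grammar with a simple loop and then reads the leftmost derivation off the tree by always expanding the leftmost nonterminal. The tuple representation of tree nodes and the names parse_list, render and leftmost_derivation are assumptions made for this example.

```python
DIGITS = "0123456789"


def parse_list(text):
    """Parse text with  list -> list + digit | list - digit | digit."""
    def digit(ch):
        if ch not in DIGITS:
            raise SyntaxError(f"expected a digit, found {ch!r}")
        return ("digit", ch)

    if not text:
        raise SyntaxError("empty input")
    tree = ("list", digit(text[0]))
    rest = text[1:]
    # Left recursion becomes a loop: each "+ digit" or "- digit" wraps the
    # tree built so far, so the operators group to the left.
    while rest:
        if rest[0] not in "+-" or len(rest) < 2:
            raise SyntaxError(f"unexpected input {rest!r}")
        tree = ("list", tree, rest[0], digit(rest[1]))
        rest = rest[2:]
    return tree


def render(form):
    """Show a sentential form: node labels for nodes, text for terminals."""
    return "".join(x[0] if isinstance(x, tuple) else x for x in form)


def leftmost_derivation(tree):
    """Return the sentential forms of the leftmost derivation of tree."""
    form = [tree]                     # nodes are tuples, terminals are strings
    steps = [render(form)]
    while any(isinstance(x, tuple) for x in form):
        i = next(k for k, x in enumerate(form) if isinstance(x, tuple))
        form[i:i + 1] = form[i][1:]   # replace the node by its children
        steps.append(render(form))
    return steps


print(" => ".join(leftmost_derivation(parse_list("9-5+2"))))
```

Expanding the rightmost tuple instead of the leftmost one in leftmost_derivation would give the rightmost derivation of the same tree.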

Ambiguity
• A grammar is said to be ambiguous if it has more than one parse tree for some string of tokens.
• Example 2.5. Suppose a grammar G does not distinguish between lists and digits as the grammar in Example 2.1 does:
G : string → string + string | string - string | 0|1|2|3|4|5|6|7|8|9
With G, the string 9-5+2 has two parse trees, one grouping it as (9-5)+2 and one as 9-(5+2).
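
The ambiguity of this grammar can be demonstrated by enumerating every parse tree of a short string. The Python sketch below tries each operator as the root of the tree and recurses on both sides; the tuple representation of trees and the names parse_trees and value are assumptions made for this example, and the exhaustive search is only practical for short inputs.

```python
def parse_trees(text):
    """All parse trees of text under
    string -> string + string | string - string | 0|1|...|9."""
    trees = []
    if len(text) == 1 and text in "0123456789":
        trees.append(text)
    # Try every operator position as the root of the tree.
    for i, ch in enumerate(text):
        if ch in "+-":
            for left in parse_trees(text[:i]):
                for right in parse_trees(text[i + 1:]):
                    trees.append((left, ch, right))
    return trees


def value(tree):
    """Evaluate a parse tree built by parse_trees."""
    if isinstance(tree, str):
        return int(tree)
    left, op, right = tree
    if op == "+":
        return value(left) + value(right)
    return value(left) - value(right)


for tree in parse_trees("9-5+2"):
    print(tree, value(tree))
```

The two trees found for 9-5+2 evaluate to different numbers, which is why an ambiguous grammar is a practical problem and not only a formal one.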

Associativity of operators
An operator is said to be left associative if an operand with operators on both sides of it is taken by the operator to
its left, and right associative if the operand is taken by the operator to its right.
e.g. 9+5+2≡(9+5)+2 (left associative), a=b=c≡a=(b=c) (right associative)
• Left associative grammar: list → list + digit | list - digit | digit    digit → 0|1|…|9
• Right associative grammar: right → letter = right | letter    letter → a|b|…|z
Precedence of operators
We say that an operator (*) has higher precedence than another operator (+) if * takes its operands before + does.
• e.g. 9+5*2≡9+(5*2), 9*5+2≡(9*5)+2
• left associative operators: +, -, *, /
• right associative operators: =, **
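
Precedence and associativity together determine how an expression is grouped. The sketch below uses precedence climbing, a standard parsing technique that is not described in these notes, to insert the parentheses implied by an operator table. The table OPERATORS, the numeric precedence levels and the function parenthesize are assumptions for illustration; the input is assumed to be a well-formed expression over integers, and assignment (=) is left out to keep the example short.

```python
import re

# (precedence, associativity); a higher precedence binds tighter.
OPERATORS = {
    "+": (1, "left"), "-": (1, "left"),
    "*": (2, "left"), "/": (2, "left"),
    "**": (3, "right"),
}


def parenthesize(expression):
    """Return expression with explicit parentheses showing the grouping."""
    tokens = re.findall(r"\d+|\*\*|[-+*/]", expression)
    pos = 0

    def parse(min_prec):
        nonlocal pos
        left = tokens[pos]          # an operand (digits only in this sketch)
        pos += 1
        while pos < len(tokens) and OPERATORS[tokens[pos]][0] >= min_prec:
            op = tokens[pos]
            prec, assoc = OPERATORS[op]
            pos += 1
            # A left-associative operator must not absorb another operator
            # of the same precedence on its right, so raise the minimum.
            right = parse(prec + 1 if assoc == "left" else prec)
            left = f"({left}{op}{right})"
        return left

    return parse(1)


for source in ["9+5+2", "9+5*2", "9*5+2", "2**3**2"]:
    print(source, "is grouped as", parenthesize(source))
```

The only difference between the left and right associative cases is the minimum precedence passed to the recursive call, which decides whether an operator of equal precedence is taken by the operator on its left or on its right.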

Syntax of full expressions
