Lexical Analysis

COMPILER DESIGN STUDY MATERIAL - Lexical Analysis

Uploaded by

deeplyfwind7

Lexical Analysis

CSO844
Lexical Analyzer

• Functions
  • Grouping input characters into tokens
  • Stripping out comments and white space
  • Correlating error messages with the source program
• Issues (why separate lexical analysis from parsing?)
  • Simpler design
  • Compiler efficiency
  • Compiler portability (e.g. Linux to Windows)
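The Role of a Lexical Analyzer

[Figure: source program → lexical analyzer → parser. The lexical analyzer reads characters from the source program and may put a character back. The parser issues "get next" requests, and the lexical analyzer passes back a token and its attribute value. An arrow labeled id connects the lexical analyzer to the symbol table. Note on the figure: read the entire program into memory.]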
Lexical Analysis
• What do we want to do? Example:
    if (i == j)
        z = 0;
    else
        z = 1;
• The input is just a string of characters:
    \t if (i == j) \n \t \t z = 0;\n \t else \n \t \t z = 1;
• Goal: partition the input string into substrings
  • where the substrings are tokens
What’s a Token?
• A syntactic category
• In English:
• noun, verb, adjective, …
• In a programming language:
• Identifier, Integer, Keyword, Whitespace, …
What are Tokens For?
• Classify program substrings according to role
• Output of lexical analysis is a stream of tokens, which is the input to the parser
• Parser relies on token distinctions
• An identifier is treated differently than a keyword
Tokens
• Tokens correspond to sets of strings.
• Identifier: strings of letters or digits, starting with a letter
• Integer: a non-empty string of digits
• Keyword: “else” or “if” or “begin” or …
• Whitespace: a non-empty sequence of blanks, newlines, and tabs
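The four definitions above can be read as membership tests on a complete lexeme. The following is a minimal sketch, not a full scanner: the names TokenClass and classify are illustrative assumptions, and the keyword list holds only the three keywords the slide names.

```c
#include <ctype.h>
#include <string.h>

/* Token classes from the slide; the enum and its names are assumptions. */
typedef enum {
    TOK_IDENTIFIER, TOK_INTEGER, TOK_KEYWORD, TOK_WHITESPACE, TOK_UNKNOWN
} TokenClass;

/* Classify one complete lexeme according to the four definitions above. */
TokenClass classify(const char *s)
{
    static const char *keywords[] = { "if", "else", "begin" };
    size_t n = strlen(s);

    if (n == 0)
        return TOK_UNKNOWN;              /* every class is non-empty */

    /* Keywords are checked first: they also look like identifiers. */
    for (size_t i = 0; i < sizeof keywords / sizeof keywords[0]; i++)
        if (strcmp(s, keywords[i]) == 0)
            return TOK_KEYWORD;

    int all_digits = 1, all_space = 1, all_alnum = 1;
    for (size_t i = 0; i < n; i++) {
        unsigned char ch = (unsigned char)s[i];
        if (!isdigit(ch)) all_digits = 0;
        if (ch != ' ' && ch != '\n' && ch != '\t') all_space = 0;
        if (!isalnum(ch)) all_alnum = 0;
    }

    if (all_digits) return TOK_INTEGER;      /* non-empty string of digits */
    if (all_space)  return TOK_WHITESPACE;   /* blanks, newlines, tabs */
    if (all_alnum && isalpha((unsigned char)s[0]))
        return TOK_IDENTIFIER;               /* letters or digits, letter first */
    return TOK_UNKNOWN;
}
```

Checking keywords before identifiers matters here because every keyword also satisfies the identifier definition.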
Typical Tokens in a PL

• Symbols: +, -, *, /, =, <, >, ->, …

• Keywords: if, while, struct, float, int, …
• Integer and Real (floating point) literals
123, 123.45
• Char (string) literals
• Identifiers
• Comments
• White space
Tokens, Patterns and Lexemes

• Pattern: a rule that describes a set of strings
• Token: a set of strings that match the same pattern
• Lexeme: the actual sequence of characters that forms an instance of a token

Token     Sample lexemes       Pattern
if        if                   if
id        abc, n, count, …     letter followed by letters or digits
NUMBER    3.14, 1000           numerical constant
;         ;                    ;
Token Attribute

• E = C1 ** 10

Token    Attribute
ID       Index to symbol-table entry for E
=
ID       Index to symbol-table entry for C1
**
NUM      10
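The token/attribute pairs for E = C1 ** 10 can be produced by a small scanner. This is a sketch under stated assumptions: the names Token, next_token and install_id, the fixed-size symbol table, and the choice to store a number's value directly in the attribute field are all illustrative and do not come from the slides.

```c
#include <ctype.h>
#include <string.h>

typedef enum { T_ID, T_NUM, T_ASSIGN, T_POW, T_END, T_ERROR } TokenKind;

typedef struct {
    TokenKind kind;
    int attr;   /* T_ID: symbol-table index; T_NUM: value; otherwise 0 */
} Token;

#define MAX_SYMS 16
#define MAX_NAME 32
static char symtab[MAX_SYMS][MAX_NAME];   /* hypothetical symbol table */
static int nsyms = 0;

/* Return the index of the name, adding it if absent; -1 if it cannot be stored. */
static int install_id(const char *name, size_t len)
{
    for (int i = 0; i < nsyms; i++)
        if (strlen(symtab[i]) == len && strncmp(symtab[i], name, len) == 0)
            return i;
    if (nsyms == MAX_SYMS || len >= MAX_NAME)
        return -1;
    memcpy(symtab[nsyms], name, len);
    symtab[nsyms][len] = '\0';
    return nsyms++;
}

/* Scan one token starting at *p and advance *p past it. */
Token next_token(const char **p)
{
    Token t = { T_ERROR, 0 };

    while (**p == ' ' || **p == '\t' || **p == '\n')
        (*p)++;                                  /* strip white space */
    const char *s = *p;

    if (*s == '\0') { t.kind = T_END; return t; }

    if (isalpha((unsigned char)*s)) {            /* identifier */
        while (isalnum((unsigned char)**p)) (*p)++;
        int idx = install_id(s, (size_t)(*p - s));
        if (idx >= 0) { t.kind = T_ID; t.attr = idx; }
        return t;
    }
    if (isdigit((unsigned char)*s)) {            /* integer (no overflow check) */
        int v = 0;
        while (isdigit((unsigned char)**p)) { v = v * 10 + (**p - '0'); (*p)++; }
        t.kind = T_NUM; t.attr = v;
        return t;
    }
    if (s[0] == '*' && s[1] == '*') { *p += 2; t.kind = T_POW; return t; }
    if (s[0] == '=')                { *p += 1; t.kind = T_ASSIGN; return t; }

    (*p)++;                                      /* skip an unrecognized character */
    return t;
}
```

Calling next_token repeatedly on "E = C1 ** 10" yields ID (index 0), =, ID (index 1), **, NUM (10), matching the table above row by row.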
Lexical Error and Recovery
• Error detection
• Error reporting
• Error recovery
• Delete the current character and restart scanning at
the next character
• Delete the first character read by the scanner and
resume scanning at the character following it.
• How about runaway strings and comments?
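The simplest recovery strategy listed above, deleting the offending character and resuming at the next one, might look like the sketch below. The names ScanResult and scan_with_recovery are assumptions, the token set is reduced to words and integers, and runaway strings and comments are deliberately not handled.

```c
#include <ctype.h>
#include <stdio.h>

typedef struct { int tokens; int errors; } ScanResult;

/* Scan src, counting tokens; on an illegal character, report it with its
   line number, delete it, and restart scanning at the next character. */
ScanResult scan_with_recovery(const char *src)
{
    ScanResult r = { 0, 0 };
    int line = 1;
    const char *p = src;

    while (*p != '\0') {
        unsigned char c = (unsigned char)*p;
        if (c == '\n') {
            line++;                       /* track lines for error messages */
            p++;
        } else if (isspace(c)) {
            p++;
        } else if (isalpha(c)) {
            while (isalnum((unsigned char)*p)) p++;
            r.tokens++;
        } else if (isdigit(c)) {
            while (isdigit((unsigned char)*p)) p++;
            r.tokens++;
        } else {
            fprintf(stderr, "line %d: illegal character '%c'\n", line, c);
            r.errors++;
            p++;                          /* delete it and resume scanning */
        }
    }
    return r;
}
```

Keeping a line counter in the scanner is also what lets it correlate error messages with the source program.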
Specification of Tokens
• Regular expressions are an important notation for specifying lexeme
patterns. While they cannot express all possible patterns, they are very
effective in specifying those types of patterns that we actually need for
tokens.
Strings and Languages
• An alphabet is any finite set of symbols such as letters, digits, and punctuation.
  • The set {0, 1} is the binary alphabet.
• If x and y are strings, then the concatenation of x and y, denoted xy, is also a string. For example, if x = dog and y = house, then xy = doghouse.
• The empty string, denoted ε, is the identity under concatenation; that is, for any string s, εs = sε = s.
• A string over an alphabet is a finite sequence of symbols drawn from that alphabet.
• In language theory, the terms "sentence" and "word" are often used as synonyms for "string."
• |s| represents the length of a string s. Ex: banana is a string of length 6.
• The empty string ε is the string of length zero.
Strings and Languages (cont.)
• A language is any countable set of strings over some fixed alphabet.
• Let L = {A, . . . , Z}; then {"A", "B", "C", "BF", …, "ABZ", …} is considered the language defined by L.
• Abstract languages like ∅, the empty set, or {ε}, the set containing only the empty string, are languages under this definition.
Terms for Parts of Strings
Operations on Languages

Example:
Let L be the set of letters {A, B, . . . , Z, a, b, . . . , z} and let D be the set of digits {0, 1, . . . , 9}.
L and D are, respectively, the alphabets of uppercase and lowercase letters and of digits.
Other languages can be constructed from L and D using the language operators (union, concatenation, and closure).
Operations on Languages (cont.)
1. L ∪ D is the set of letters and digits; strictly speaking, the language with 62 (52 + 10) strings of length one, each of which is either one letter or one digit.
2. LD is the set of 520 (52 × 10) strings of length two, each consisting of one letter followed by one digit. Ex: A1, a1, B0, etc.
3. L^4 is the set of all 4-letter strings. Ex: aaba, bcef.
4. L* is the set of all strings of letters, including ε, the empty string.
5. L(L ∪ D)* is the set of all strings of letters and digits beginning with a letter.
6. D+ is the set of all strings of one or more digits.
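Items 2, 5 and 6 can be checked mechanically. The sketch below uses hypothetical helper names (count_LD, in_L_LD_star, in_D_plus) and assumes the default "C" locale, where isalpha accepts exactly the 52 ASCII letters.

```c
#include <ctype.h>
#include <stddef.h>

/* Count the strings in LD by enumerating every (letter, digit) pair. */
int count_LD(void)
{
    int count = 0;
    for (int c = 0; c < 128; c++)
        if (isalpha(c))
            for (int d = '0'; d <= '9'; d++)
                count++;
    return count;
}

/* Membership in L(L U D)*: a letter followed by any letters or digits. */
int in_L_LD_star(const char *s)
{
    if (!isalpha((unsigned char)s[0]))
        return 0;
    for (size_t i = 1; s[i] != '\0'; i++)
        if (!isalnum((unsigned char)s[i]))
            return 0;
    return 1;
}

/* Membership in D+: one or more digits. */
int in_D_plus(const char *s)
{
    if (s[0] == '\0')
        return 0;                       /* the empty string is not in D+ */
    for (size_t i = 0; s[i] != '\0'; i++)
        if (!isdigit((unsigned char)s[i]))
            return 0;
    return 1;
}
```

The enumeration in count_LD confirms the 52 × 10 = 520 figure from item 2.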
Regular Expressions
• The standard notation for regular languages is regular expressions.
• Atomic regular expression:

• Compound regular expression:

Regular Expressions (cont.)

Larger regular expressions are built from smaller ones. Let r and s be regular expressions denoting languages L(r) and L(s), respectively.
1. (r) | (s) is a regular expression denoting the language L(r) ∪ L(s).
2. (r)(s) is a regular expression denoting the language L(r)L(s).
3. (r)* is a regular expression denoting (L(r))*.
4. (r) is a regular expression denoting L(r). This last rule says that we can add additional pairs of parentheses around expressions without changing the language they denote.
Conversely, unnecessary parentheses can be dropped when * has the highest precedence, concatenation the next, and | the lowest. For example, we may replace the regular expression (a) | ((b)*(c)) by a | b*c.
Examples
Regular Definition
• C identifiers are strings of letters, digits, and underscores. The regular definition for the language of C identifiers:
  • letter → A | B | C | … | Z | a | b | … | z | _
  • digit → 0 | 1 | 2 | … | 9
  • id → letter ( letter | digit )*
• Unsigned numbers (integer or floating point) are strings such as 5280, 0.01234, 6.336E4, or 1.89E-4. The regular definition:
  • digit → 0 | 1 | 2 | … | 9
  • digits → digit digit*
  • optionalFraction → . digits | ε
  • optionalExponent → ( E ( + | - | ε ) digits ) | ε
  • number → digits optionalFraction optionalExponent
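The regular definition for number translates almost line by line into a hand-written matcher. The sketch below returns the length of the longest matching prefix; match_digits and match_number are illustrative names, not part of the slides.

```c
#include <ctype.h>
#include <stddef.h>

/* digits -> digit digit* : return how many digits start at s (0 if none). */
static size_t match_digits(const char *s)
{
    size_t n = 0;
    while (isdigit((unsigned char)s[n]))
        n++;
    return n;
}

/* number -> digits optionalFraction optionalExponent
   Return the length of the longest prefix of s that is a number, or 0. */
size_t match_number(const char *s)
{
    size_t n = match_digits(s);
    if (n == 0)
        return 0;

    /* optionalFraction -> . digits | epsilon */
    if (s[n] == '.') {
        size_t f = match_digits(s + n + 1);
        if (f > 0)
            n += 1 + f;         /* take the fraction only if digits follow */
    }

    /* optionalExponent -> ( E ( + | - | epsilon ) digits ) | epsilon */
    if (s[n] == 'E') {
        size_t k = n + 1;
        if (s[k] == '+' || s[k] == '-')
            k++;
        size_t e = match_digits(s + k);
        if (e > 0)
            n = k + e;          /* take the exponent only if digits follow */
    }
    return n;
}
```

When a fraction or exponent turns out to be incomplete (as in "12." or "12E"), the function falls back to the shorter match, which corresponds to choosing the ε alternative.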
RECOGNITION OF TOKENS
• Given the grammar of branching statement:
• The terminals of the grammar, which are if, then, else, relop, id, and number, are the names of tokens as used by the lexical analyzer.
• The patterns for the given tokens:
• The lexical analyzer also has the job of stripping out whitespace, by recognizing the "token" ws defined by:
Tokens, their patterns, and attribute values
Recognition of Tokens: Transition Diagram

Ex: RELOP = < | <= | = | <> | > | >=

Transition diagram (start state 0; # indicates input retraction):

State  Input   Next state  Action
0      <       1
0      =       5           return(relop, EQ)
0      >       6
1      =       2           return(relop, LE)
1      >       3           return(relop, NE)
1      other   4 #         return(relop, LT)
6      =       7           return(relop, GE)
6      other   8 #         return(relop, GT)
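The relop diagram can be coded directly, with one branch per edge. In this sketch the names Relop and scan_relop are assumptions, and retraction is modeled by reporting how many characters were consumed instead of pushing a character back.

```c
typedef enum { LT, LE, EQ, NE, GT, GE, NOT_RELOP } Relop;

/* Follow the relop transition diagram from state 0 on the text at s.
   *len receives the number of characters that belong to the lexeme. */
Relop scan_relop(const char *s, int *len)
{
    switch (s[0]) {
    case '<':                                      /* state 1 */
        if (s[1] == '=') { *len = 2; return LE; }  /* state 2 */
        if (s[1] == '>') { *len = 2; return NE; }  /* state 3 */
        *len = 1; return LT;                       /* state 4: lookahead not consumed */
    case '=':
        *len = 1; return EQ;                       /* state 5 */
    case '>':                                      /* state 6 */
        if (s[1] == '=') { *len = 2; return GE; }  /* state 7 */
        *len = 1; return GT;                       /* state 8: lookahead not consumed */
    default:
        *len = 0; return NOT_RELOP;                /* no edge leaves state 0 */
    }
}
```

States 4 and 8 are the two marked with # in the diagram: the character that ended the lexeme is looked at but left in the input.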
Recognition of Identifiers

• Ex2: ID = letter (letter | digit)*

Transition diagram (start state 9; # indicates input retraction):

State  Input            Next state  Action
9      letter           10
10     letter or digit  10
10     other            11 #        return(id)

Mapping transition diagrams into C code

switch (state) {
case 9:                      /* start state: an identifier must begin with a letter */
    c = nextchar();
    if (isletter(c)) state = 10; else state = failure();
    break;
case 10:                     /* stay here while letters or digits follow */
    c = nextchar();
    if (isletter(c) || isdigit(c)) state = 10; else state = 11;
    break;
case 11:                     /* accept: give back the lookahead character */
    retract(1); insert(id); return;
}
Recognition of Reserved Words
• Install the reserved words in the symbol table initially. A field of the symbol-table entry indicates that these strings are never ordinary identifiers, and tells which token they represent.
• Create separate transition diagrams for each keyword, for example a transition diagram for the reserved word then.
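The first approach, pre-installing reserved words in a table, can be sketched as a lookup performed after the identifier diagram accepts a lexeme. The names TokenName, Entry, reserved and get_token are assumptions, and a fixed array stands in for the symbol table.

```c
#include <string.h>

typedef enum { TK_IF, TK_THEN, TK_ELSE, TK_ID } TokenName;

typedef struct {
    const char *lexeme;
    TokenName token;      /* the token this reserved word represents */
} Entry;

/* Reserved words "installed initially"; a real symbol table would also
   hold ordinary identifiers alongside these entries. */
static const Entry reserved[] = {
    { "if",   TK_IF   },
    { "then", TK_THEN },
    { "else", TK_ELSE },
};

/* Decide which token a lexeme matched by the identifier pattern represents. */
TokenName get_token(const char *lexeme)
{
    for (size_t i = 0; i < sizeof reserved / sizeof reserved[0]; i++)
        if (strcmp(lexeme, reserved[i].lexeme) == 0)
            return reserved[i].token;
    return TK_ID;         /* not reserved: an ordinary identifier */
}
```

With this design the scanner needs only the single identifier diagram; the table lookup separates keywords from identifiers afterwards.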
The transition diagram for token number
• Multiple accepting states:
  • Accepting integer, e.g. 12
  • Accepting float, e.g. 12.31
  • Accepting float, e.g. 12.31E4
Finite Automata

• A transition diagram is a finite automaton
• Nondeterministic Finite Automaton (NFA)
  • A set of states
  • A set of input symbols
  • A transition function, move(), that maps state-symbol pairs to sets of states
  • A start state S0
  • A set of states F as accepting (final) states
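Example

[Figure: NFA with states 0, 1, 2, 3. From the start state 0, an edge labeled a leads to 1, an edge labeled b leads from 1 to 2, and an edge labeled b leads from 2 to 3. One further edge labeled b is drawn below the chain.]
The set of states = {0, 1, 2, 3}
Input symbols = {a, b}
Start state is S0; accepting state is S3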
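An NFA can be simulated by tracking the set of states it might be in after each input symbol. The sketch below makes a loud assumption: the figure's transitions are not fully recoverable, so it uses the common textbook NFA for (a|b)*abb, with self-loops on state 0 for both a and b plus the chain 0 -a-> 1 -b-> 2 -b-> 3. The helper names and the bit-set encoding are also illustrative.

```c
#include <stddef.h>

enum { NSTATES = 4, ACCEPT = 3 };

/* move(state, symbol): the set of next states, encoded as a bit set
   (bit i set means state i is in the set). Transitions are ASSUMED to be
   those of the textbook NFA for (a|b)*abb. */
static unsigned move(int state, char sym)
{
    switch (state) {
    case 0:
        if (sym == 'a') return (1u << 0) | (1u << 1);  /* stay in 0 or go to 1 */
        if (sym == 'b') return 1u << 0;                /* self-loop on b */
        return 0;
    case 1:
        return sym == 'b' ? 1u << 2 : 0;
    case 2:
        return sym == 'b' ? 1u << 3 : 0;
    default:
        return 0;                                      /* state 3 has no edges */
    }
}

/* Return 1 if some path through the NFA ends in the accepting state. */
int nfa_accepts(const char *input)
{
    unsigned current = 1u << 0;                        /* start in {S0} */

    for (size_t i = 0; input[i] != '\0'; i++) {
        unsigned next = 0;
        for (int s = 0; s < NSTATES; s++)
            if (current & (1u << s))
                next |= move(s, input[i]);             /* union over current states */
        current = next;
    }
    return (current & (1u << ACCEPT)) != 0;
}
```

Because move() returns a set, the simulation follows all nondeterministic choices at once instead of guessing one path and backtracking.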
