
Compiler Design

CSCE 354

Dr. Razauddin
University of Hail, Kingdom of Saudi Arabia

2024-2025
Chapter 2

Lexical Analysis: Application of Regular Expressions in Lexical Scanners
OVERVIEW
• To translate a program from one language into another, a
compiler must first pull it apart and understand its structure
and meaning, then put it together in a different way.

• The front end of the compiler performs analysis; the back end does synthesis. The analysis is usually broken up into:

Lexical analysis: breaking the input into individual words or "tokens";

Syntax analysis: parsing the phrase structure of the program;

Semantic analysis: calculating the program's meaning.


What is a Token?

A lexical token is a sequence of characters that can be treated as a unit in the grammar of a programming language.

Example of tokens:
• Type tokens (id, number, real, . . . )
• Punctuation tokens (IF, void, return, . . . )
• Alphabetic tokens (keywords)
Lexical analysis is the first phase of the compiler; the component that performs it is also known as a scanner.

It converts the high-level input program into a sequence of tokens.

1. Lexical analysis can be implemented with deterministic finite automata (DFAs), as the sketch below illustrates.
2. The output is a sequence of tokens that is sent to the parser for syntax analysis.
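To make point 1 concrete, here is a minimal sketch (not from the slides; the function name and states are illustrative) of a hand-coded DFA that accepts identifiers of the form letter (letter | digit)*:

```python
# A two-state DFA for the pattern letter (letter | digit)*.
def is_identifier(s: str) -> bool:
    state = "START"
    for ch in s:
        if state == "START":
            # First character must be a letter.
            state = "ID" if ch.isalpha() else "REJECT"
        elif state == "ID":
            # Subsequent characters may be letters or digits.
            state = "ID" if ch.isalnum() else "REJECT"
        else:
            return False        # dead state: no way to accept
    return state == "ID"        # "ID" is the only accepting state

print(is_identifier("sum"))     # True
print(is_identifier("n14"))     # True
print(is_identifier("4you"))    # False: starts with a digit
```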
LEXICAL TOKENS
 A lexical token is a sequence of characters that can be treated as a unit in the
grammar of a programming language.

 A programming language classifies lexical tokens into a finite set of token types.

 For example, some of the token types of a typical programming language are

Type   Examples                     Type     Examples

ID     foo  n14  last               COMMA    ,
NUM    73  0  00  515  082          NOTEQ    !=
REAL   66.1  .5  10.  1e67  5.5e-10 LPAREN   (
IF     if                           RPAREN   )
Punctuation tokens such as IF, VOID, RETURN constructed from alphabetic characters
are called reserved words and, in most languages, cannot be used as identifiers.

Examples of nontokens
comment /* try again */
preprocessor directive #include<stdio.h>

preprocessor directive #define NUMS 5, 6

macro NUMS
REGULAR EXPRESSIONS

Each regular expression stands for a set of strings.

Symbol:

For each symbol a in the alphabet of the language, the regular expression a denotes the
language containing just the string a.

Alternation:

Given two regular expressions M and N, the alternation operator, written as a vertical bar, makes a new regular expression M | N. A string is in the language of M | N if it is in the language of M or in the language of N. Thus, the language of a | b contains the two strings a and b.
Concatenation:
Given two regular expressions M and N, the concatenation operator · makes a new regular expression M · N. A string is in the language of M · N if it is the concatenation of any two strings α and β such that α is in the language of M and β is in the language of N. Thus, the regular expression (a | b) · a defines the language containing the two strings aa and ba.

Epsilon: The regular expression ε represents a language whose only string is the empty string. Thus, (a · b) | ε represents the language {"", "ab"}.

Repetition: Given a regular expression M, its Kleene closure is M*. A string is in M* if it is the concatenation of zero or more strings, all of which are in M. Thus, ((a | b) · a)* represents the infinite set {"", "aa", "ba", "aaaa", "baaa", "aaba", "baba", "aaaaaa", . . . }.
Using symbols, alternation, concatenation, epsilon, and Kleene closure we can
specify the set of ASCII characters corresponding to the lexical tokens of a
programming language.

Examples

(0 | 1)* · 0      Binary numbers that are multiples of two.

b*(abb*)*(a|ε)    Strings of a's and b's with no consecutive a's.

(a|b)*aa(a|b)*    Strings of a's and b's containing consecutive a's.
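As a quick check, these three patterns can be tried with Python's re module. Note that ε has no literal form in re syntax, so in this hedged translation (a|ε) is written a?:

```python
import re

even_binary = re.compile(r"(0|1)*0")         # multiples of two end in 0
no_consec_a = re.compile(r"b*(abb*)*a?")     # (a|epsilon) becomes a?
has_aa      = re.compile(r"(a|b)*aa(a|b)*")  # contains consecutive a's

print(bool(even_binary.fullmatch("1010")))   # True: ends in 0
print(bool(even_binary.fullmatch("101")))    # False: ends in 1
print(bool(no_consec_a.fullmatch("ababb")))  # True: no "aa" anywhere
print(bool(no_consec_a.fullmatch("aab")))    # False: has consecutive a's
print(bool(has_aa.fullmatch("babaab")))      # True: contains "aa"
```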


EXAMPLES
ab | c        (a · b) | c
[abcd]        (a | b | c | d)
[b-g]         [bcdefg]
[b-gM-Qkr]    [bcdefgMNOPQkr]
ε             The empty string.
""            Another way to write the empty string.
M | N         Alternation, choosing from M or N.
M · N         Concatenation, an M followed by an N.
MN            Another way to write concatenation.
M*            Repetition (zero or more times).
M+            Repetition (one or more times).
M?            Optional, zero or one occurrence of M.
[a-zA-Z]      Character set alternation.
.             A period stands for any single character except newline.
"a.+*"        Quotation: a string in quotes stands for itself literally.


Regular expressions specify languages by defining patterns for finite strings of symbols.

The grammar defined by regular expressions is known as a regular grammar. The language defined by a regular grammar is known as a regular language.

Operations

The various operations on languages are:

 Union of two languages L and M is written as

L U M = {s | s is in L or s is in M}

 Concatenation of two languages L and M is written as

LM = {st | s is in L and t is in M}

 The Kleene closure of a language L is written as

L* = the set of strings formed by concatenating zero or more strings from L.
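A small sketch of these operations on finite languages using Python sets (the helper name kleene and the length cutoff are illustrative assumptions, since L* itself is infinite):

```python
L = {"a", "b"}
M = {"c"}

union  = L | M                          # {s | s in L or s in M}
concat = {s + t for s in L for t in M}  # {st | s in L, t in M}

def kleene(lang, max_len):
    """Approximate L* by listing its strings up to max_len characters."""
    result = {""}                       # epsilon is always in L*
    frontier = {""}
    while frontier:
        # Extend each frontier string by one string from the language.
        frontier = {s + t for s in frontier for t in lang
                    if len(s + t) <= max_len}
        result |= frontier
    return result

print(union)                 # {'a', 'b', 'c'}
print(concat)                # {'ac', 'bc'}
print(sorted(kleene(L, 2)))  # ['', 'a', 'aa', 'ab', 'b', 'ba', 'bb']
```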
Notations
If r and s are regular expressions denoting the languages L(r) and
L(s), then

 Union : (r)|(s) is a regular expression denoting L(r) U L(s)

 Concatenation : (r)(s) is a regular expression denoting L(r)L(s)

 Kleene closure : (r)* is a regular expression denoting (L(r))*

 (r) is a regular expression denoting L(r)


Precedence and Associativity

 *, concatenation (·), and | (the pipe sign) are left-associative.

 * has the highest precedence.

 Concatenation (·) has the second-highest precedence.

 | (the pipe sign) has the lowest precedence of all.
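Python's re module follows the same precedence conventions, which makes for a quick sanity check (a sketch, not part of the slides):

```python
import re

# "ab|c" means "(ab)|c" because concatenation binds tighter than |.
p = re.compile(r"ab|c")
print(bool(p.fullmatch("ab")))    # True: matches the (ab) branch
print(bool(p.fullmatch("c")))     # True: matches the c branch
print(bool(p.fullmatch("ac")))    # False: would require a(b|c)

# "ab*" means "a(b*)" because * binds tighter than concatenation.
q = re.compile(r"ab*")
print(bool(q.fullmatch("abbb")))  # True
print(bool(q.fullmatch("abab")))  # False: would require (ab)*
```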


Representing occurrences of symbols using regular expressions

letter = [a-z] or [A-Z]

digit = 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 or [0-9]

sign = [+ | -]

Representation of language tokens using regular expressions

Decimal = (sign)?(digit)+

Identifier = (letter)(letter | digit)*
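A hedged translation of these two token patterns into Python re syntax (variable names are illustrative):

```python
import re

sign       = r"[+-]"
digit      = r"[0-9]"
letter     = r"[a-zA-Z]"
decimal    = re.compile(f"({sign})?({digit})+")          # (sign)?(digit)+
identifier = re.compile(f"({letter})({letter}|{digit})*")  # letter(letter|digit)*

print(bool(decimal.fullmatch("-42")))     # True
print(bool(decimal.fullmatch("4.2")))     # False: no dot in this pattern
print(bool(identifier.fullmatch("n14")))  # True
print(bool(identifier.fullmatch("14n")))  # False: must start with a letter
```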


Regular expressions for some tokens.
 Comments and white space are not reported back to the parser.

 Instead, they are discarded and the lexer resumes.

 The comments for this lexer begin with two dashes, contain only alphabetic characters, and end with a newline.

 Finally, a lexical specification should be complete, always matching some initial substring of the
input; we can always achieve this by having a rule that matches any single character (and in this case,
prints an "illegal character" error message and continues).

 These rules are a bit ambiguous.


 For example, does if8 match as a single identifier or as the two tokens if and 8?

 There are two important disambiguation rules used by Lex, JavaCC, SableCC, and other similar lexical-analyzer generators:

 Longest match: The longest initial substring of the input that can match any regular
expression is taken as the next token.

 Rule priority: For a particular longest initial substring, the first regular expression that
can match determines its token-type. This means that the order of writing down the
regular-expression rules has significance.


 Thus, if8 matches as an identifier by the longest-match rule, and if matches as a reserved word by rule priority.
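Below is a minimal sketch of a lexer loop applying both rules (the rule names and structure are illustrative, not how Lex is actually implemented). Because rules are tried in order, IF outranks ID on ties, yet a strictly longer ID match still wins:

```python
import re

RULES = [
    ("IF",  re.compile(r"if")),                 # listed first: higher priority
    ("ID",  re.compile(r"[a-zA-Z][a-zA-Z0-9]*")),
    ("NUM", re.compile(r"[0-9]+")),
    ("WS",  re.compile(r"[ \t\n]+")),           # matched but discarded
]

def next_token(text, pos):
    best = None  # (length, token_type, lexeme)
    for token_type, pattern in RULES:
        m = pattern.match(text, pos)
        # Longest match wins; on a tie the earlier rule wins, because a
        # later rule must be strictly longer to displace the current best.
        if m and (best is None or len(m.group()) > best[0]):
            best = (len(m.group()), token_type, m.group())
    return best

def tokenize(text):
    pos, tokens = 0, []
    while pos < len(text):
        best = next_token(text, pos)
        if best is None:
            raise SyntaxError(f"illegal character {text[pos]!r}")
        length, token_type, lexeme = best
        if token_type != "WS":                  # white space is not reported
            tokens.append((token_type, lexeme))
        pos += length
    return tokens

print(tokenize("if8"))   # [('ID', 'if8')]               -- longest match
print(tokenize("if 8"))  # [('IF', 'if'), ('NUM', '8')]  -- rule priority
```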
Exercise
• The input source code is converted into a sequence of tokens.
int sum = a + b;
• Steps in Lexical Analysis:
• Input Source Code:
The lexical analyzer (lexer) reads the source code as a sequence of
characters.
• Tokenization:
The lexer divides this sequence into a list of meaningful symbols, known
as tokens. Each token represents a basic element of the programming
language.
• int – keyword
• sum – identifier
• = – assignment operator
• a – identifier
• + – addition operator
• b – identifier
• ; – semicolon (delimiter)
• Output Tokens:
For the given source code, the lexer might produce the following tokens:
Key Points:
Whitespace and Comments: The lexer typically ignores whitespace and comments, as
they do not affect the meaning of the code.
Error Handling: If the lexer encounters an invalid sequence of characters that cannot
be classified into a valid token, it raises an error. For example, if the input contains an
unexpected character like #, which is not valid in the given context, an error is
reported.
• Output:
KEYWORD: int
ID: sum
ASSIGN: =
ID: a
PLUS: +
ID: b
SEMICOLON: ;
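One possible way to reproduce this token stream with a single combined pattern (the token names match the output above; the approach is a sketch, not the only implementation):

```python
import re

TOKEN_SPEC = [
    ("KEYWORD",   r"\bint\b"),
    ("ID",        r"[a-zA-Z_][a-zA-Z0-9_]*"),
    ("ASSIGN",    r"="),
    ("PLUS",      r"\+"),
    ("SEMICOLON", r";"),
    ("WS",        r"\s+"),
]
# One alternation with a named group per rule; order gives rule priority.
PATTERN = re.compile("|".join(f"(?P<{name}>{rx})" for name, rx in TOKEN_SPEC))

for m in PATTERN.finditer("int sum = a + b;"):
    if m.lastgroup != "WS":          # white space is discarded, not reported
        print(f"{m.lastgroup}: {m.group()}")
```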
Exercise
• The input source code is converted into a sequence of tokens.
def greet(name):
    print("Hello, " + name + "!")
greet("Alice")
• Steps in Lexical Analysis:
• Input Source Code:
The lexical analyzer (lexer) reads the source code as a sequence of
characters.
• Tokenization:
The lexer divides this sequence into a list of meaningful symbols, known as
tokens. Each token represents a basic element of the programming
language.
• def, print – keywords
• greet, name – identifiers
• + – operator
• ( ) : – punctuation
• "Hello, ", "!", "Alice" – string literals
• white space – delimiter (discarded)
• Output Tokens:
For the given source code, the lexer might produce the following tokens:
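As an aside (not from the slides), Python's standard tokenize module can display the real token stream for this example:

```python
import io
import tokenize

src = 'def greet(name):\n    print("Hello, " + name + "!")\ngreet("Alice")\n'

# generate_tokens takes a readline callable and yields TokenInfo tuples.
for tok in tokenize.generate_tokens(io.StringIO(src).readline):
    if tok.string.strip():                      # skip whitespace-only tokens
        print(tokenize.tok_name[tok.exact_type], repr(tok.string))
```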
By Students:
• Convert the following input source code into a sequence of tokens:
1. total = 3.14 * radius * radius;
2. if (x >= 10) {
y = x * 2;
}
3. x = 5 + 3;
4. a = b - 2;
