Lec - 2. Scanning (Lexical Analysis) Part 1

The document discusses the scanning process in compiler construction, focusing on lexical analysis and the formation of tokens from source code. It explains the categories of tokens, their attributes, and the use of regular expressions to define patterns for these tokens. Additionally, it covers practical issues related to scanners, including token records and the handling of comments and ambiguities in token recognition.

Uploaded by

Hesham MosaAd

Lecture 02 Scanning 1

COMPILER CONSTRUCTION

Principles and Practice

Kenneth C. Louden

Downloaded by Mamdouh Farghaly ([email protected])

2. Scanning (Lexical Analysis)

PART ONE

Contents
PART ONE
2.1 The Scanning Process
2.2 Regular Expressions

2.1 The Scanning Process

The Function of a Scanner
• Reads characters from the source code
and forms them into logical units called
tokens
• Tokens are logical entities, usually defined as an
enumerated type
– typedef enum
{IF, THEN, ELSE, PLUS, MINUS, NUM, ID,…}
TokenType;

The Categories of Tokens
• RESERVED WORDS
– Such as IF and THEN, which represent the strings of
characters “if” and “then”
• SPECIAL SYMBOLS
– Such as PLUS and MINUS, which represent the
characters “+” and “-”
• OTHER TOKENS
– Such as NUM and ID, which represent numbers and
identifiers

Relationship between a Token and Its
String
• The string of characters that a token represents is called
its string value or lexeme
• Some tokens have only one lexeme, such as
reserved words
• A token may have infinitely many lexemes,
such as the token ID

Relationship between a Token and Its
String
• Any value associated with a token is called an attribute of the
token
– The string value is an example of an attribute.
– A NUM token may have a string value such as “32767” and an actual
numeric value 32767
– A PLUS token has the string value “+” as well as the arithmetic
operation +
• The token can be viewed as the collection of all of its
attributes
– The scanner needs to compute only as many attributes as necessary to allow
further processing
– The numeric value of a NUM token need not be computed immediately

Some Practical Issues of the Scanner

• One structured data type, called a token record, collects all the
attributes of a token
– typedef struct
{TokenType tokenval;
char *stringval;
int numval;
} TokenRecord;

Some Practical Issues of the Scanner
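• The scanner returns the token value only and
places the other attributes in variables, where other
parts of the compiler can access them
– TokenType getToken(void);
• As an example of the operation of getToken,
consider the following line of C code:
a[index] = 4 + 2
[Figure: the input buffer holding the characters a [ i n d e x ] = 4 + 2 RET, shown at two stages of a call to getToken]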
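
The behaviour of getToken on this line can be sketched in C as follows. This is only a minimal illustration, not the scanner from the book: the names initScanner, nextChar, lexeme, current, and collect, and the token values ASSIGN, LBRACKET, RBRACKET, ENDLINE, and ERROR_TOK, are assumptions introduced here. The token value is returned, and the remaining attributes are left in the variable current, as described above.

```c
#include <ctype.h>
#include <stdlib.h>

/* Token values; only ID, NUM, PLUS and MINUS come from the slides. */
typedef enum { ID, NUM, PLUS, MINUS, ASSIGN, LBRACKET, RBRACKET,
               ENDLINE, ERROR_TOK } TokenType;

typedef struct {
    TokenType tokenval;
    char *stringval;
    int numval;
} TokenRecord;

static const char *nextChar = "";   /* hypothetical: next unread input character */
static char lexeme[64];             /* storage for the current string value */
TokenRecord current = { ERROR_TOK, lexeme, 0 };

void initScanner(const char *line) { nextChar = line; }

/* Collect characters while keep() accepts them; characters that do not
   fit in the buffer are consumed but not stored. */
static void collect(int (*keep)(int)) {
    size_t n = 0;
    while (keep((unsigned char)*nextChar)) {
        if (n < sizeof lexeme - 1) lexeme[n++] = *nextChar;
        nextChar++;
    }
    lexeme[n] = '\0';
}

TokenType getToken(void) {
    while (*nextChar == ' ' || *nextChar == '\t') nextChar++;  /* skip blanks */
    current.numval = 0;
    if (*nextChar == '\0' || *nextChar == '\n') {
        lexeme[0] = '\0';
        return current.tokenval = ENDLINE;
    }
    if (isalpha((unsigned char)*nextChar)) {        /* identifier */
        collect(isalnum);
        return current.tokenval = ID;
    }
    if (isdigit((unsigned char)*nextChar)) {        /* unsigned number */
        collect(isdigit);
        current.numval = atoi(lexeme);
        return current.tokenval = NUM;
    }
    lexeme[0] = *nextChar++;                        /* single-character symbol */
    lexeme[1] = '\0';
    switch (lexeme[0]) {
    case '+': return current.tokenval = PLUS;
    case '-': return current.tokenval = MINUS;
    case '=': return current.tokenval = ASSIGN;
    case '[': return current.tokenval = LBRACKET;
    case ']': return current.tokenval = RBRACKET;
    default:  return current.tokenval = ERROR_TOK;
    }
}
```

After initScanner("a[index] = 4 + 2"), successive calls to getToken in this sketch return ID, LBRACKET, ID, RBRACKET, ASSIGN, NUM, PLUS, NUM, and finally ENDLINE, with current.stringval and current.numval holding the attributes of the most recent token.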

2.2 Regular Expressions

Some Related Basic Concepts
• Regular expressions
– represent patterns of strings of characters.
• A regular expression r
– is completely defined by the set of strings it matches.
– This set is called the language of r, written L(r).
• The elements of the underlying character set
– are referred to as symbols.
• This set of legal symbols
– is called the alphabet and is written as the Greek symbol Σ.

Some Related Basic Concepts
• A regular expression r
– contains characters from the alphabet, used as
patterns: for example, a is the character a used as a pattern
• A regular expression r
– may contain special characters called meta-characters
or meta-symbols
• An escape character can be used to turn off the
special meaning of a meta-character.
– Typical escape conventions are the backslash and quotes

More About Regular Expressions
2.2.1 Definition of Regular Expressions
2.2.2 Extensions to Regular Expressions
2.2.3 Regular Expressions for Programming Language Tokens

2.2.1 Definition of Regular Expressions

Basic Regular Expressions
• Single characters from the alphabet
match themselves
– a matches the character a, written L(a) = {a}
– ε matches the empty string, written L(ε) = {ε}

Regular Expression Operations
• Choice among alternatives, indicated by
the meta-character |
• Concatenation, indicated by juxtaposition
• Repetition or “closure”, indicated by the
meta-character *

Choice Among Alternatives
• If r and s are regular expressions, then r|s is a
regular expression which matches any string that
is matched either by r or by s.
• In terms of languages, the language of r|s is the union
of the languages of r and s: L(r|s) = L(r) ∪ L(s)
• A simple example: L(a|b) = L(a) ∪ L(b) = {a, b}
• Choice can be extended to more than one
alternative.

Concatenation
• If r and s are regular expressions, then rs is their
concatenation, which matches any string that is the
concatenation of two strings, the first of which
matches r and the second of which matches s.
• In terms of generated languages, the concatenation
S1S2 of two sets of strings is the set of strings formed by
appending each string of S2 to each string of S1, so L(rs) = L(r)L(s).
• A simple example: (a|b)c matches ac and bc
• Concatenation can also be extended to more than
two regular expressions.

Repetition
• The repetition operation of a regular expression,
called (Kleene) closure, is written r*, where r is a
regular expression. The regular expression r*
matches any finite concatenation of strings, each
of which matches r.
• A simple example: a* matches the strings ε,
a, aa, aaa, …
• In terms of generated languages, given a set S of
strings, S* is an infinite union of sets, S* = {ε} ∪ S ∪ SS ∪ SSS ∪ …,
but each element of it is a finite concatenation of strings from S

Precedence of Operations and Use of Parentheses
• The standard convention
– Repetition * has the highest precedence
– Concatenation has the next highest
– Choice | has the lowest
• A simple example
– a|bc* is interpreted as a|(b(c*))
• Parentheses are used to indicate a different
precedence

Names for Regular Expressions
• Give a name to a long regular expression
– digit = 0|1|2|3|4|…|9
– (0|1|2|3|…|9)(0|1|2|3|…|9)* can then be written as
digit digit*

Definition of Regular Expression
• A regular expression is one of the following:
(1) A basic regular expression: a single legal character a from the
alphabet Σ, or the meta-character ε.
(2) The form r|s, where r and s are regular expressions
(3) The form rs, where r and s are regular expressions
(4) The form r*, where r is a regular expression
(5) The form (r), where r is a regular expression

• Parentheses do not change the language.

Examples of Regular Expressions
Example 1:
– Σ = {a, b, c} (the symbols from which strings may be built)
– The set of all strings over this alphabet that contain exactly one b
– (a|c)*b(a|c)*

Example 2:
– Σ = {a, b, c}
– The set of all strings that contain at most one b
– (a|c)* | (a|c)*b(a|c)*, or equivalently (a|c)*(b|ε)(a|c)*
– The same language may be generated by many different regular
expressions.

Examples of Regular Expressions
Example 3:
– Σ = {a, b}
– The set of strings consisting of a single b surrounded by the same
number of a’s
– S = {b, aba, aabaa, aaabaaa, …} = {aⁿbaⁿ | n ≥ 0}
– This set cannot be described by a regular expression.
• “Regular expressions can’t count.”
– Not all sets of strings can be generated by regular expressions.
– A regular set: a set of strings that is the language of some regular
expression, so called to distinguish it from other sets.
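
Although no regular expression describes this set, a few lines of ordinary code recognize it easily, because code can keep counters. The following is a minimal sketch in C; the function name isBalancedAroundB is made up for this illustration.

```c
#include <stddef.h>

/* Returns 1 if s consists of n a's, one b, and then exactly n more a's
   (n >= 0); returns 0 otherwise. */
int isBalancedAroundB(const char *s) {
    size_t left = 0, right = 0;
    while (*s == 'a') { left++; s++; }     /* count the leading a's */
    if (*s != 'b') return 0;               /* exactly one b must follow */
    s++;
    while (*s == 'a') { right++; s++; }    /* count the trailing a's */
    return *s == '\0' && left == right;    /* nothing else, and equal counts */
}
```

The two counters left and right are exactly what a regular expression lacks: they can grow without bound, whereas a pattern of fixed size can only remember a bounded amount of information about the a's it has already seen.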

Examples of Regular Expressions
Example 4:
– Σ = {a, b, c}
– The strings that contain no two consecutive b’s
– First attempt: ((a|c)* | (b(a|c))*)*
– Simplified: ((a|c) | (b(a|c)))* or (a | c | ba | bc)*
• Not yet the correct answer: these expressions cannot match a string that ends in b
The correct regular expression
– (a | c | ba | bc)* (b|ε)
– (b|ε) (a | c | ab | cb)*
– (notb | b notb)* (b|ε), where notb = a|c
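
These candidate expressions can be tried out with the POSIX regex library. The sketch below is an illustration under stated assumptions: fullMatch is a helper name invented here, <regex.h> is POSIX rather than ISO C, and since an empty alternative such as (b|ε) has no portable spelling in POSIX extended syntax, the optional b is written b? instead.

```c
#include <regex.h>
#include <stdio.h>
#include <string.h>

/* Returns 1 if the whole string s matches the POSIX extended regular
   expression pattern, 0 if it does not, and -1 if the pattern is too
   long or fails to compile. */
int fullMatch(const char *pattern, const char *s) {
    regex_t re;
    char anchored[256];
    int rc;
    if (strlen(pattern) + 5 > sizeof anchored) return -1;
    /* Anchor the pattern so that it must cover the entire string. */
    snprintf(anchored, sizeof anchored, "^(%s)$", pattern);
    if (regcomp(&re, anchored, REG_EXTENDED | REG_NOSUB) != 0) return -1;
    rc = regexec(&re, s, 0, NULL, 0);
    regfree(&re);
    return rc == 0;
}
```

For example, fullMatch("(a|c|ba|bc)*", "acb") yields 0, which shows why the first answer is incomplete, while fullMatch("(a|c|ba|bc)*b?", "acb") yields 1 and fullMatch("(a|c|ba|bc)*b?", "abba") yields 0.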

Examples of Regular Expressions
Example 5:
– Σ = {a, b, c}
– ((b|c)* a (b|c)* a)* (b|c)*
– Task: determine a concise English description of
the language
– Answer: the strings that contain an even number of a’s, since the
expression has the form (nota* a nota* a)* nota*, where nota = b|c

2.2.2 Extensions to Regular Expressions

List of New Operations
1) One or more repetitions
– r+
2) Any character
– the period “.”
3) A range of characters
– [0-9], [a-zA-Z]
4) Any character not in a given set
– ~(a|b|c): a character that is not a, b, or c
– [^abc] in Lex
5) Optional sub-expressions
– r?: the strings matched by r are optional
2.2.3 Regular Expressions for Programming Language Tokens

Numbers, Reserved Words, and Identifiers
Numbers
– nat = [0-9]+
– signedNat = (+|-)?nat
– number = signedNat(“.” nat)?(E signedNat)?
Reserved Words and Identifiers
– reserved = if | while | do | …
– letter = [a-zA-Z]
– digit = [0-9]
– identifier = letter(letter|digit)*
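
Each of these named definitions translates almost line for line into a small hand-written recognizer. The following C sketch is one possible rendering, not code from the lecture; the function names matchNat, matchSignedNat, matchNumber, and matchIdentifier are assumptions. Each function returns the length of the longest prefix of s that matches, or 0 if there is no match.

```c
#include <ctype.h>
#include <stddef.h>

/* nat = [0-9]+ */
static size_t matchNat(const char *s) {
    size_t n = 0;
    while (isdigit((unsigned char)s[n])) n++;
    return n;
}

/* signedNat = (+|-)?nat */
static size_t matchSignedNat(const char *s) {
    size_t sign = (s[0] == '+' || s[0] == '-') ? 1 : 0;
    size_t digits = matchNat(s + sign);
    return digits ? sign + digits : 0;
}

/* number = signedNat("." nat)?(E signedNat)? */
size_t matchNumber(const char *s) {
    size_t len = matchSignedNat(s), part;
    if (len == 0) return 0;
    /* Optional fraction: taken only if at least one digit follows the period. */
    if (s[len] == '.' && (part = matchNat(s + len + 1)) > 0) len += 1 + part;
    /* Optional exponent: taken only if a signedNat follows the E. */
    if (s[len] == 'E' && (part = matchSignedNat(s + len + 1)) > 0) len += 1 + part;
    return len;
}

/* identifier = letter(letter|digit)* */
size_t matchIdentifier(const char *s) {
    size_t n = 0;
    if (!isalpha((unsigned char)s[0])) return 0;
    while (isalnum((unsigned char)s[n])) n++;
    return n;
}
```

Each optional part ( … )? becomes an if that consumes input only when the whole sub-expression matches, so matchNumber("12.x") reports a length of 2 and leaves the period for the next token.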
Comments
Several forms:
– { this is a Pascal comment }      {(~})*}
– ; this is a Scheme comment
– -- this is an Ada comment         --(~newline)* newline
– /* this is a C comment */
cannot be written as ba(~(ab))*ab (with b and a standing for / and *), because ~ is restricted to single characters
one solution for ~(ab): b*(a*~(a|b)b*)*a*
Because of the complexity of such regular expressions, comments are usually
handled by ad hoc methods in actual scanners.
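
As an illustration of such an ad hoc method, a C comment can be skipped with a short loop instead of a regular expression. This is only a sketch; the function name skipCComment and its return convention are assumptions made for the example.

```c
#include <stddef.h>

/* If s begins with a C comment, returns a pointer to the first character
   after the closing delimiter; returns NULL if the comment is never
   closed; returns s unchanged if s does not begin with a comment. */
const char *skipCComment(const char *s) {
    if (s[0] != '/' || s[1] != '*') return s;     /* no comment here */
    s += 2;                                       /* skip the opening delimiter */
    while (*s != '\0') {
        if (s[0] == '*' && s[1] == '/') return s + 2;
        s++;
    }
    return NULL;                                  /* unterminated comment */
}
```

The loop looks one character ahead for the closing pair, which is the job the expression b*(a*~(a|b)b*)*a* performs much less readably.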
Ambiguity
Ambiguity: some strings can be matched
by several different regular expressions.
– When a string could be either an identifier or a keyword, the keyword
interpretation is preferred.
– When a string could be a single token or a sequence of several tokens,
the single token is preferred (the principle of
longest substring).
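
The keyword rule is often implemented by first scanning the longest possible identifier and only then checking it against a table of reserved words. The sketch below shows that lookup step in C; the table contents and the name reservedLookup are assumptions chosen for illustration.

```c
#include <stddef.h>
#include <string.h>

typedef enum { IF, THEN, ELSE, ID } TokenType;

/* Hypothetical reserved-word table: lexeme and corresponding token value. */
static const struct { const char *lexeme; TokenType token; } reservedWords[] = {
    { "if", IF }, { "then", THEN }, { "else", ELSE }
};

/* Given the complete lexeme of something that matched the identifier
   pattern, return the reserved-word token if there is one, else ID. */
TokenType reservedLookup(const char *lexeme) {
    size_t i;
    for (i = 0; i < sizeof reservedWords / sizeof reservedWords[0]; i++)
        if (strcmp(lexeme, reservedWords[i].lexeme) == 0)
            return reservedWords[i].token;
    return ID;
}
```

Because the lookup sees the whole lexeme, both rules are respected: "if" yields IF, while "ifx" is scanned as one longer lexeme and yields ID rather than IF followed by x.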
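White Space and Lookahead
White space:
– Delimiters: characters that are unambiguously part of
other tokens act as delimiters.
– whitespace = ( newline | blank | tab | comment )+
– Languages may be free format or fixed format
Lookahead:
– requires buffering of input characters and marking places for
backtracking
– FORTRAN example:
DO99I=1,10 (a DO loop with label 99 and loop variable I)
DO99I=1.10 (an assignment of 1.10 to the variable DO99I)
The scanner cannot tell the two apart until it reaches the comma or the period.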
