
Lexical Analysis

Chapter 3
Instructor: Mr. Faiz Rasool
Email: [email protected]
The Role of the Lexical Analyzer
• The main task of the lexical analyzer is to read the input characters of the source
program, group them into lexemes, and produce as output a sequence of tokens for
each lexeme in the source program.
• The stream of tokens is sent to the parser for syntax analysis.
• It is common for the lexical analyzer to interact with the symbol table as well.
• When the lexical analyzer discovers a lexeme constituting an identifier, it needs to
enter that lexeme into the symbol table.
• In some cases, information regarding the kind of identifier may be read from the
symbol table by the lexical analyzer to assist it in determining the proper token it
must pass to the parser.
• The interaction is implemented by having the parser call the lexical analyzer.
• The call, suggested by the getNextToken command, causes the lexical analyzer to
read characters from its input until it can identify the next lexeme and produce for
it the next token, which it returns to the parser.
[Figure: Interactions between the lexical analyzer and the parser]
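The calling convention can be pictured with a short C sketch. This is a minimal illustration only: the Token struct, the TokenName values, and the stub body of getNextToken are hypothetical names invented here, not code from the text.

#include <stdio.h>

/* Hypothetical token names; a real lexer defines many more. */
typedef enum { TOK_ID, TOK_NUMBER, TOK_IF, TOK_EOF } TokenName;

/* A token is a pair: an abstract name plus an optional attribute,
   e.g., a symbol-table index for identifiers. */
typedef struct {
    TokenName name;
    int       attribute;
} Token;

/* Stub lexical analyzer: a real one would read characters until it
   recognizes the next lexeme. Here it reports end of input at once. */
Token getNextToken(void) {
    Token t = { TOK_EOF, 0 };
    return t;
}

/* The parser drives the interaction: it repeatedly asks the lexical
   analyzer for the next token until end of input. */
int main(void) {
    Token t = getNextToken();
    while (t.name != TOK_EOF)
        t = getNextToken();
    puts("done");
    return 0;
}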
Certain Other Tasks of the Lexical Analyzer
• One such task is stripping out comments and whitespace (blank, newline, tab, and
perhaps other characters that are used to separate tokens in the input).
• Another task is correlating error messages generated by the compiler with the
source program.
• For instance, the lexical analyzer may keep track of the number of newline
characters seen, so it can associate a line number with each error message.
Lexical Analyzers as a Cascade of Two Processes

• a) Scanning consists of the simple processes that do not require tokenization of the
input, such as deletion of comments and compaction of consecutive whitespace
characters into one.
• b) Lexical analysis proper is the more complex portion, which produces tokens
from the output of the scanner.
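A rough C sketch of the scanning stage, under the assumption that the source language uses // line comments; it also counts newlines, which is how a lexer can attach line numbers to error messages as described above. The names and details are illustrative, not a fixed implementation.

#include <stdio.h>
#include <ctype.h>

static int line_number = 1;   /* lets later phases report source lines */

/* Scanning stage: delete // comments, compact runs of whitespace
   into a single blank, and count newlines as they are consumed. */
void scan(FILE *in, FILE *out) {
    int c;
    while ((c = getc(in)) != EOF) {
        if (c == '/') {
            int d = getc(in);
            if (d == '/') {                    /* comment: skip to line end */
                while ((c = getc(in)) != EOF && c != '\n')
                    ;
                if (c == '\n') line_number++;
                putc(' ', out);
                continue;
            }
            if (d != EOF) ungetc(d, in);       /* not a comment after all */
        }
        if (isspace(c)) {                      /* compact whitespace */
            if (c == '\n') line_number++;
            while ((c = getc(in)) != EOF && isspace(c))
                if (c == '\n') line_number++;
            if (c != EOF) ungetc(c, in);
            putc(' ', out);
            continue;
        }
        putc(c, out);
    }
}

int main(void) { scan(stdin, stdout); return 0; }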
Lexical Analysis Versus Parsing
• There are a number of reasons why the analysis portion of a compiler is
normally separated into lexical analysis and parsing (syntax analysis) phases.
1. Simplicity of design is the most important consideration. The separation of
lexical and syntactic analysis often allows us to simplify at least one of these
tasks. For example, a parser that had to deal with comments and whitespace as
syntactic units would be considerably more complex than one that can assume
comments and whitespace have already been removed by the lexical analyzer.
2. Compiler efficiency is improved. A separate lexical analyzer allows us to apply
specialized techniques that serve only the lexical task, not the job of parsing. In
addition, specialized buffering techniques for reading input characters can speed
up the compiler significantly.
3. Compiler portability is enhanced. Input-device-specific peculiarities can be
restricted to the lexical analyzer.
Tokens, Patterns, and Lexemes
• A token is a pair consisting of a token name and an optional attribute value. The
token name is an abstract symbol representing a kind of lexical unit, e.g., a
particular keyword, or a sequence of input characters denoting an identifier. The
token names are the input symbols that the parser processes. In what follows, we
shall generally write the name of a token in boldface. We will often refer to a
token by its token name.
• A pattern is a description of the form that the lexemes of a token may take. In the
case of a keyword as a token, the pattern is just the sequence of characters that
form the keyword. For identifiers and some other tokens, the pattern is a more
complex structure that is matched by many strings.
• A lexeme is a sequence of characters in the source program that matches the
pattern for a token and is identified by the lexical analyzer as an instance of that
token.
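Some classic examples of how the three notions relate:
• Token if: the pattern is the characters i, f; the only lexeme is if.
• Token id: the pattern is a letter followed by letters and digits; sample lexemes are pi, score, and D2.
• Token number: the pattern is any numeric constant; sample lexemes are 3.14159, 0, and 6.02e23.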
Attributes for Tokens
• When more than one lexeme can match a pattern, the lexical analyzer must
provide the subsequent compiler phases additional information about the particular
lexeme that matched.
• For example, the pattern for token number matches both 0 and 1, but it is
extremely important for the code generator to know which lexeme was found in
the source program.
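For example, the tokens for the Fortran statement E = M * C ** 2 can be written as:
<id, pointer to symbol-table entry for E> <assign_op> <id, pointer to symbol-table entry for M> <mult_op> <id, pointer to symbol-table entry for C> <exp_op> <number, integer value 2>
Here the id and number tokens carry attribute values, while the operator tokens need none.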
Lexical Errors
• It is hard for a lexical analyzer to tell, without the aid of other components, that
there is a source-code error.
• For instance, if the string fi is encountered for the first time in a C program in the
context fi ( a == f(x) ) ..., a lexical analyzer cannot tell whether fi is a misspelling
of the keyword if or an undeclared function identifier.
• Since fi is a valid lexeme for the token id, the lexical analyzer must return the
token id to the parser and let some other phase of the compiler — probably the
parser in this case — handle an error due to transposition of the letters.
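A minimal sketch of why the lexer must behave this way, assuming a hypothetical keyword table: the lookup simply fails for fi, so the only correct answer at this level is the token id.

#include <stdio.h>
#include <string.h>

/* Hypothetical keyword table; a real compiler lists every keyword. */
static const char *keywords[] = { "if", "else", "while", "return" };

/* Return 1 if the lexeme is a keyword, 0 if it should be an id. */
int isKeyword(const char *lexeme) {
    for (size_t i = 0; i < sizeof keywords / sizeof *keywords; i++)
        if (strcmp(lexeme, keywords[i]) == 0)
            return 1;
    return 0;
}

int main(void) {
    /* "fi" is not in the table, so the lexer returns token id and
       leaves the misspelling for the parser to diagnose. */
    printf("%s\n", isKeyword("fi") ? "keyword" : "id");   /* prints: id */
    return 0;
}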
Input Buffering
• Let us examine some ways that the simple but important task of reading the source program can
be sped up.
• This task is made difficult by the fact that we often have to look one or more characters beyond
the next lexeme before we can be sure we have the right lexeme.
• We shall introduce a two-buffer scheme that handles large lookaheads safely.
• We then consider an improvement involving "sentinels" that saves time checking for the ends of
buffers.
• A sentinel is a special value that represents eof.
Buffer Pairs
• Because of the amount of time taken to process characters and the large number of
characters that must be processed during the compilation of a large source program,
specialized buffering techniques have been developed to reduce the amount of
overhead required to process a single input character. An important scheme involves
two buffers that are alternately reloaded.

• Each buffer is of the same size N, and N is usually the size of a disk block, e.g., 4096
bytes. Using one system read command we can read N characters into a buffer, rather
than using one system call per character. If fewer than N characters remain in the
input file, then a special character, represented by eof, marks the end of the source
file and is different from any possible character of the source program.
Buffer Pairs
Two pointers to the input are maintained:
1. Pointer lexemeBegin marks the beginning of the current lexeme, whose extent
we are attempting to determine.
2. Pointer forward scans ahead until a pattern match is found.
• Once the next lexeme is determined, forward is set to the character at its right end.
• Then, after the lexeme is recorded as an attribute value of a token returned to the
parser, lexemeBegin is set to the character immediately after the lexeme just
found.
• Advancing forward requires that we first test whether we have reached the end of
one of the buffers, and if so, we must reload the other buffer from the input, and
move forward to the beginning of the newly loaded buffer.
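A C sketch of the two-buffer scheme with sentinels, assuming the source text never contains the byte '\0' (which can therefore serve as the eof sentinel). The names and reload policy follow the description above, but the code is an illustration, not a fixed implementation.

#include <stdio.h>

#define N 4096            /* buffer size: typically one disk block */
#define SENTINEL '\0'     /* assumes '\0' never occurs in real source text */

static char buf1[N + 1], buf2[N + 1];   /* one extra slot for the sentinel */
static char *lexemeBegin, *forward;
static FILE *src;

/* Read up to N characters with one system call, then plant the
   sentinel; a short read means true end of input lies in this buffer. */
static void reload(char *buf) {
    size_t n = fread(buf, 1, N, src);
    buf[n] = SENTINEL;
}

/* Advance forward one character. Thanks to the sentinel, the common
   case needs a single comparison instead of two end-of-buffer tests. */
static int nextChar(void) {
    int c = (unsigned char)*forward++;
    if (c == SENTINEL) {
        if (forward == buf1 + N + 1) {          /* end of buffer 1 */
            reload(buf2);
            forward = buf2;
        } else if (forward == buf2 + N + 1) {   /* end of buffer 2 */
            reload(buf1);
            forward = buf1;
        } else {
            return EOF;   /* sentinel inside a buffer: real end of input */
        }
        c = (unsigned char)*forward++;
        if (c == SENTINEL)
            return EOF;   /* the reload read 0 bytes: true end of input */
    }
    return c;
}

int main(void) {
    src = stdin;
    reload(buf1);
    forward = lexemeBegin = buf1;
    int c;
    while ((c = nextChar()) != EOF)
        putchar(c);
    return 0;
}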
Specification of Tokens
• Regular expressions are an important notation for specifying lexeme patterns.
While they cannot express all possible patterns, they are very effective in
specifying those types of patterns that we actually need for tokens.
• An alphabet is any finite set of symbols. Typical examples of symbols are letters,
digits, and punctuation. The set {0,1} is the binary alphabet. ASCII is an important
example of an alphabet; it is used in many software systems.
• A string over an alphabet is a finite sequence of symbols drawn from that
alphabet. In language theory, the terms "sentence" and "word" are often used as
synonyms for "string."
• The length of a string s, usually written |s|, is the number of occurrences of
symbols in s.
• The empty string, denoted e, is the string of length zero.
Terms for Parts of Strings
1. A prefix of string s is any string obtained by removing zero or more symbols
from the end of s. For example, ban, banana, and e are prefixes of banana.
2. A suffix of string s is any string obtained by removing zero or more symbols
from the beginning of s. For example, nana, banana, and e are suffixes of
banana.
3. A substring of s is obtained by deleting any prefix and any suffix from s. For
instance, banana, nan, and e are substrings of banana.
4. The proper prefixes, suffixes, and substrings of a string s are those prefixes,
suffixes, and substrings, respectively, of s that are neither e nor equal to s itself.
5. A subsequence of s is any string formed by deleting zero or more not
necessarily consecutive positions of s. For example, baan is a subsequence of
banana.
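Since the subsequence definition is the easiest of these to misread, here is a tiny C check of it; isSubsequence is a name made up for this illustration.

#include <stdio.h>

/* s is a subsequence of t if s can be obtained by deleting zero or
   more, not necessarily consecutive, positions of t. */
int isSubsequence(const char *s, const char *t) {
    while (*s && *t) {
        if (*s == *t)
            s++;          /* matched one symbol of s, in order */
        t++;
    }
    return *s == '\0';    /* all of s was matched */
}

int main(void) {
    printf("%d\n", isSubsequence("baan", "banana"));   /* prints: 1 */
    printf("%d\n", isSubsequence("naab", "banana"));   /* prints: 0 */
    return 0;
}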
Regular Expressions
• BASIS: There are two rules that form the basis.
1. e is a regular expression, and L(e) is {e}, that is, the language whose sole
member is the empty string.
2. If a is a symbol in the alphabet Σ, then a is a regular expression, and L(a) = {a}, that is, the
language with one string, of length one, with a in its one position.
• Note that by convention, we use italics for symbols, and boldface for their
corresponding regular expression.
• INDUCTION:
• There are four parts to the induction whereby larger regular expressions are built
from smaller ones. Suppose r and s are regular expressions denoting languages
L(r) and L(s), respectively.
INDUCTION
1. (r)|(s) is a regular expression denoting the language L(r) U L(s).
2. (r)(s) is a regular expression denoting the language L(r)L(s).
3. (r)* is a regular expression denoting (L(r))*.
4. (r) is a regular expression denoting L(r).
This last rule says that we can add additional pairs of parentheses around
expressions without changing the language they denote.
a) The unary operator * has highest precedence and is left associative.
b) Concatenation has second highest precedence and is left associative.
c) | has lowest precedence and is left associative.
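Under these conventions, for example, (a)|((b)*(c)) may be written a|b*c; both denote the string a together with all strings of zero or more b's followed by a c. As a practical aside, a pattern such as letter (letter | digit)* for identifiers can be tried with the POSIX regex library; this is a sketch using the standard regcomp/regexec calls, with made-up sample strings.

#include <regex.h>
#include <stdio.h>

int main(void) {
    /* letter (letter | digit)* as a POSIX extended regular expression */
    regex_t re;
    regcomp(&re, "^[A-Za-z][A-Za-z0-9]*$", REG_EXTENDED | REG_NOSUB);

    const char *samples[] = { "score", "D2", "2fast" };
    for (int i = 0; i < 3; i++)
        printf("%-6s %s\n", samples[i],
               regexec(&re, samples[i], 0, NULL, 0) == 0 ? "id" : "no match");
    /* prints: score id, D2 id, 2fast no match */
    regfree(&re);
    return 0;
}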
Continued…
