
Lecture 3: Introduction to Lexical Analysis

[Figure: compiler pipeline. Source code → Front-End → IR → Back-End → Object code; Lexical Analysis is marked within the Front-End.]
(from last lecture) Lexical Analysis:

• reads characters and produces sequences of tokens.

Today’s lecture:
Towards automated Lexical Analysis.
CSE 359 - Compiler Design, Masud Ibn Afjal, CSE, HSTU
The Big Picture
First step in any translation: determine whether the text to be translated
is well constructed in terms of the input language. Syntax is
specified with parts of speech - syntax checking matches parts of
speech against a grammar.
In natural languages, mapping words to parts of speech is idiosyncratic.
In formal languages, mapping words to parts of speech is syntactic:
• based on denotation
• makes this a matter of syntax
• reserved keywords are important

What does lexical analysis do?

Recognises the language’s parts of speech.
Lexical analyzer and the parser

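Lexical analysis is sometimes divided into two processes:

- Scanning: simple processing that does not require tokenising the input, such as deleting comments and compacting whitespace
- Lexical analysis proper: producing the sequence of tokens from the output of the scanner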
Tokens, Patterns, and Lexemes
• A token is a pair consisting of a token name and an optional attribute
value.
• A pattern is a description of the form that the lexemes of a token may
take. In the case of a keyword as a token, the pattern is just the
sequence of characters that form the keyword.
• A lexeme is a sequence of characters in the source program that
matches the pattern for a token and is identified by the lexical analyzer
as an instance of that token.

Example: printf("Total = %d\n", score);

• both printf and score are lexemes matching the pattern for token id
• "Total = %d\n" is a lexeme matching the pattern for token literal.

Token Examples
1. One token for each keyword.
2. Tokens for the operators, either individually or in classes such as the
token comparison mentioned in Fig. 3.2.
3. One token representing all identifiers.
4. One or more tokens representing constants, such as numbers and
literal strings.
5. Tokens for each punctuation symbol, such as left and right
parentheses, comma, and semicolon.
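
The five token categories above can be illustrated with a small classifier. This is a minimal sketch, not the lecture's implementation: the keyword set, the token names (lparen, comparison, ...) and the patterns are all assumptions made up for the example, and the attribute is simply the lexeme itself.

```python
import re
from typing import NamedTuple, Optional

class Token(NamedTuple):
    name: str                  # token name, e.g. "id" or "comparison"
    attribute: Optional[str]   # optional attribute value; here the lexeme

# Hypothetical keyword set and patterns, chosen only for illustration.
KEYWORDS = {"if", "else", "while"}
PUNCTUATION = {"(": "lparen", ")": "rparen", ",": "comma", ";": "semicolon"}
PATTERNS = [
    ("comparison", r"<=|>=|==|!=|<|>"),          # one token for a class of operators
    ("id",         r"[A-Za-z_][A-Za-z0-9_]*"),   # one token for all identifiers
    ("number",     r"[0-9]+"),                   # numeric constants
    ("literal",    r'"[^"]*"'),                  # literal strings
]

def classify(lexeme: str) -> Token:
    """Map a lexeme to its token by trying each category in turn."""
    if lexeme in KEYWORDS:                       # one token per keyword
        return Token(lexeme, None)
    if lexeme in PUNCTUATION:                    # one token per punctuation symbol
        return Token(PUNCTUATION[lexeme], None)
    for name, pattern in PATTERNS:
        if re.fullmatch(pattern, lexeme):
            return Token(name, lexeme)
    raise ValueError(f"no token matches {lexeme!r}")

# The lexemes of: printf("Total = %d\n", score);
lexemes = ["printf", "(", '"Total = %d\\n"', ",", "score", ")", ";"]
tokens = [classify(lx) for lx in lexemes]
```

Keywords are checked before the identifier pattern because a keyword such as while would otherwise also match the pattern for id.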

Some Definitions
• A vocabulary (alphabet) is a finite set of symbols.
• A string is any finite sequence of symbols from a vocabulary.
• A language is any set of strings over a fixed vocabulary.
• A grammar is a finite way of describing a language.
• A context-free grammar, G, is a 4-tuple, G=(S,N,T,P), where:
S: starting symbol
N: set of non-terminal symbols
T: set of terminal symbols
P: set of production rules
• The language generated by G is the set of all strings of terminals that can be derived from S.
• Example:
S=CatWord; N={CatWord}; T={miau};
P={CatWord → CatWord miau | miau}
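
As a rough illustration of the CatWord example, the sketch below enumerates the first few sentences of the grammar and recognises them. The function names are hypothetical, and separating the words with spaces is an assumption made for readability.

```python
def catword_sentences(max_words):
    """Yield the sentences with at most max_words words, shortest first."""
    sentence = ["miau"]                  # CatWord -> miau
    for _ in range(max_words):
        yield " ".join(sentence)
        sentence = sentence + ["miau"]   # CatWord -> CatWord miau

def is_catword(text):
    """Recognise a sentence: one or more repetitions of the terminal miau."""
    words = text.split()
    return len(words) >= 1 and all(word == "miau" for word in words)

sentences = list(catword_sentences(3))
```

The grammar is finite (one non-terminal, two productions) yet the language it describes is infinite, which is the point of the definition of a grammar given above.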

Terms for Parts of Strings
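
A small sketch of the usual terms, assuming the standard textbook definitions: a prefix is obtained by removing zero or more symbols from the end of a string, a suffix by removing symbols from the beginning, a substring by removing a prefix and a suffix, and a subsequence by deleting symbols at arbitrary positions. The function names are chosen here for illustration.

```python
from itertools import combinations

def prefixes(s):
    """All strings obtained by removing zero or more symbols from the end of s."""
    return [s[:i] for i in range(len(s) + 1)]

def suffixes(s):
    """All strings obtained by removing zero or more symbols from the start of s."""
    return [s[i:] for i in range(len(s) + 1)]

def substrings(s):
    """All strings obtained by deleting a prefix and a suffix of s."""
    return sorted({s[i:j] for i in range(len(s) + 1) for j in range(i, len(s) + 1)})

def subsequences(s):
    """All strings obtained by deleting zero or more symbols anywhere in s."""
    return sorted({"".join(c) for r in range(len(s) + 1) for c in combinations(s, r)})
```

For the string ban, bn is a subsequence but not a substring, because the deleted symbol lies in the middle.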

Example
(A simplified version from Lecture 2, Slide 6):
S=E; N={E,T,F}; T={+,*,(,),x}
P={E → T | E+T, T → F | T*F, F → (E) | x}
By repeated substitution we derive sentential forms:
E ⇒ E+T ⇒ T+T ⇒ F+T ⇒ x+T ⇒ x+T*F ⇒ x+F*F ⇒ x+x*F ⇒ x+x*x
This is an example of a leftmost derivation (at each step the
leftmost non-terminal is expanded).
To recognise a valid sentence we reverse this process.
• Exercise: what language is generated by the (non-context free) grammar:
S=S; N={A,B,S}; T={a,b,c};
P={S → abc | aAbc, Ab → bA, Ac → Bbcc, bB → Bb, aB → aa | aaA}
(for the curious: read about Chomsky's Hierarchy)
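
The leftmost derivation of x+x*x from the expression grammar above can be replayed mechanically. The sketch below is only an illustration: it assumes every grammar symbol is a single character, and the helper names are made up for the example.

```python
# Productions of the expression grammar: non-terminal -> list of alternatives.
PRODUCTIONS = {
    "E": ["T", "E+T"],
    "T": ["F", "T*F"],
    "F": ["(E)", "x"],
}

def leftmost_step(form, rhs):
    """Replace the leftmost non-terminal of a sentential form by rhs."""
    for i, symbol in enumerate(form):
        if symbol in PRODUCTIONS:
            if rhs not in PRODUCTIONS[symbol]:
                raise ValueError(f"{symbol} -> {rhs} is not a production")
            return form[:i] + rhs + form[i + 1:]
    raise ValueError("no non-terminal left to expand")

def derive(start, choices):
    """Apply the chosen right-hand sides one by one, recording each form."""
    forms = [start]
    for rhs in choices:
        forms.append(leftmost_step(forms[-1], rhs))
    return forms

forms = derive("E", ["E+T", "T", "F", "x", "T*F", "F", "x", "x"])
```

Reading the resulting list from the last form back to the first gives the sequence of reductions a recogniser would perform.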
Why all this?
• Why study lexical analysis?
– To avoid writing lexical analysers (scanners) by hand.
– To simplify specification and implementation.
– To understand the underlying techniques and technologies.
• We want to specify lexical patterns (to derive tokens):
– Some parts are easy:
• WhiteSpace → blank | tab | WhiteSpace blank | WhiteSpace tab
• Keywords and operators (if, then, =, +)
• Comments (/* followed by */ in C, // in C++, % in LaTeX, ...)
– Some parts are more complex:
• Identifiers (letter followed by - up to n - alphanumerics…)
• Numbers
We need a notation that could lead to an implementation!
Regular Expressions
Patterns form a regular language. A regular expression is a way of
specifying a regular language. It is a formula that describes a possibly
infinite set of strings.
Example:
identifier → letter_ ( letter_ | digit )*
Regular Expression (RE) (over a vocabulary V):
• ε is a RE denoting the set {ε}, which contains only the empty string.
• If a ∈ V then a is a RE denoting {a}.
• If r1, r2 are REs then:
– r1* denotes zero or more occurrences of r1;
– r1r2 denotes concatenation;
– r1 | r2 denotes either r1 or r2;
• Shorthands: [a-d] for a | b | c | d; r+ for rr*; r? for r | ε
Describe the languages denoted by the following REs:
a; a | b; a*; (a | b)*; (a | b)(a | b); (a*b*)*; (a | b)*baa
Read more: Examples 3.3 and 3.4, Figure 3.7, Exercise 3.3.2.
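
One way to explore what language a regular expression denotes is to list, by brute force, the short strings it matches. The sketch below does this with Python's re module, whose syntax covers the operators used here (concatenation, |, *, ?); the function name and the sample expressions are assumptions chosen for illustration and are deliberately not the ones from the exercise.

```python
import re
from itertools import product

def language_up_to(pattern, alphabet, max_len):
    """List every string over alphabet, of length <= max_len, that pattern matches in full."""
    matched = []
    for n in range(max_len + 1):
        for letters in product(alphabet, repeat=n):
            s = "".join(letters)
            if re.fullmatch(pattern, s):
                matched.append(s)
    return matched

ab_star = language_up_to(r"ab*", "ab", 3)      # an a followed by zero or more b
ab_pairs = language_up_to(r"(ab)*", "ab", 4)   # zero or more copies of ab
a_opt_b = language_up_to(r"a?b", "ab", 3)      # an optional a, then b
```

Raising max_len shows how a finite formula keeps producing new members of a possibly infinite set.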
Examples
• integer → (+ | - | ε) (0 | 1 | 2 | … | 9)+
• integer → (+ | - | ε) (0 | (1 | 2 | … | 9) (0 | 1 | 2 | … | 9)*)
• decimal → integer . (0 | 1 | 2 | … | 9)*
• identifier → [a-zA-Z] [a-zA-Z0-9]*

• Real-life application (perl regular expressions):

– [+-]?(\d+\.\d+|\d+\.|\.\d+)
– [+-]?(\d+\.\d+|\d+\.|\.\d+|\d+)([eE][+-]?\d+)?
(for more information read: % man perlre)
(Not all languages can be described by regular expressions.
But we don't care for now.)
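
The two Perl-style patterns can be tried out directly, since Python's re module accepts the same syntax for these constructs. This is a small sketch with the sign class written using an ASCII hyphen; the names FRACTION, NUMBER and the helper functions are hypothetical.

```python
import re

# First pattern: numbers that contain a decimal point.
FRACTION = re.compile(r"[+-]?(\d+\.\d+|\d+\.|\.\d+)")
# Second pattern: also allows plain integers and an optional exponent.
NUMBER = re.compile(r"[+-]?(\d+\.\d+|\d+\.|\.\d+|\d+)([eE][+-]?\d+)?")

def is_fraction(text):
    """True if the whole text matches the first pattern."""
    return FRACTION.fullmatch(text) is not None

def is_number(text):
    """True if the whole text matches the second pattern."""
    return NUMBER.fullmatch(text) is not None
```

fullmatch is used rather than search so that the whole input must be a number, which is what a token pattern is meant to express.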
Building a Lexical Analyser by hand
Based on the specifications of tokens through regular expressions we
can write a lexical analyser. One approach is to check case by case
and split into smaller problems that can be solved ad hoc. Example:
void get_next_token() {
    c = input_char();
    if (is_eof(c)) { token = (EOF, "eof"); return; }
    if (is_letter(c)) { recognise_id(); }
    else if (is_digit(c)) { recognise_number(); }
    else if (is_operator(c) || is_separator(c))
        { token = (c, c); }      // single char assumed
    else { token = (ERROR, c); }
    return;
}
...
do {
    get_next_token();
    print(token.class, token.attribute);
} while (token.class != EOF);

Can be efficient; but requires a lot of work and may be difficult to modify!
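
For comparison, here is a runnable sketch of the same case-by-case approach. It is an illustration under stated assumptions, not the lecture's code: the operator and separator sets are invented, whitespace skipping is added so that the example input can contain blanks, and identifiers and numbers are recognised inline rather than in separate recognise_id and recognise_number routines.

```python
OPERATORS = set("+-*/=<>")   # hypothetical single-character operators
SEPARATORS = set("(),;")     # hypothetical separators

def tokenize(text):
    """Hand-written scanner: dispatch on the first character of each token."""
    tokens = []
    i = 0
    while i < len(text):
        c = text[i]
        if c.isspace():                        # skip whitespace between tokens
            i += 1
        elif c.isalpha():                      # identifier: letter (letter | digit)*
            j = i + 1
            while j < len(text) and text[j].isalnum():
                j += 1
            tokens.append(("ID", text[i:j]))
            i = j
        elif c.isdigit():                      # number: digit+
            j = i + 1
            while j < len(text) and text[j].isdigit():
                j += 1
            tokens.append(("NUMBER", text[i:j]))
            i = j
        elif c in OPERATORS or c in SEPARATORS:
            tokens.append((c, c))              # single-character tokens assumed
            i += 1
        else:
            tokens.append(("ERROR", c))
            i += 1
    tokens.append(("EOF", "eof"))
    return tokens

tokens = tokenize("x1 = (y + 42);")
```

Every new token class needs another branch and another hand-coded loop, which is the maintenance cost the slide refers to.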
Building Lexical Analysers “automatically”
Idea: try the regular expressions one by one and find the longest match:
set (token.class, token.length) ← (NULL, 0)
// first
find max_length such that input matches T1 → RE1
if max_length > token.length
    set (token.class, token.length) ← (T1, max_length)
// second
find max_length such that input matches T2 → RE2
if max_length > token.length
    set (token.class, token.length) ← (T2, max_length)
…
// n-th
find max_length such that input matches Tn → REn
if max_length > token.length
    set (token.class, token.length) ← (Tn, max_length)
// error
if (token.class == NULL) { handle no_match }

Disadvantage: linearly dependent on the number of token classes, and
requires restarting the search for each regular expression.
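
The try-each-expression idea can be sketched with Python's re module. The token classes and their order are assumptions made for the example; ties go to the class listed first, which is how the keyword if wins over the identifier pattern. Note also that re returns the first match found by backtracking, which coincides with the longest match for the simple patterns used here but not for every regular expression.

```python
import re

# Hypothetical token classes, tried in order; earlier entries win ties.
TOKEN_CLASSES = [
    ("IF",     re.compile(r"if")),
    ("ID",     re.compile(r"[a-zA-Z][a-zA-Z0-9]*")),
    ("NUMBER", re.compile(r"[0-9]+")),
    ("ASSIGN", re.compile(r"=")),
    ("SPACE",  re.compile(r"[ \t]+")),
]

def next_token(text, pos):
    """Try every regular expression at pos and keep the longest match."""
    best_class, best_length = None, 0
    for token_class, regex in TOKEN_CLASSES:
        m = regex.match(text, pos)             # restart the search for each RE
        if m and m.end() - pos > best_length:
            best_class, best_length = token_class, m.end() - pos
    if best_class is None:
        raise ValueError(f"no token matches at position {pos}")
    return best_class, text[pos:pos + best_length]

def scan(text):
    """Split text into (class, lexeme) pairs, dropping whitespace."""
    pos, tokens = 0, []
    while pos < len(text):
        token_class, lexeme = next_token(text, pos)
        if token_class != "SPACE":
            tokens.append((token_class, lexeme))
        pos += len(lexeme)
    return tokens

tokens = scan("if iffy = 42")
```

The loop in next_token makes the cost per token proportional to the number of token classes, matching the disadvantage noted above.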
We study REs to automate scanner construction!
Consider the problem of recognising register names starting with r and
requiring at least one digit:
Register → r (0|1|2|…|9) (0|1|2|…|9)* (or, Register → r Digit Digit*)
The RE corresponds to a transition diagram:

[Transition diagram: start → S0; S0 → S1 on r; S1 → S2 on digit; S2 → S2 on digit]

The diagram depicts the actions that take place in the scanner.

• A circle represents a state; S0: start state; S2: final state (double circle)
• An arrow represents a transition; the label specifies the cause of the transition.
A string is accepted if following the transitions for its characters, starting from S0, ends in a final state
(for example, r345, r0, r29 are accepted; a, r, rab are not).
Towards Automation (finally!)
An easy (computerised) implementation of a transition diagram
is a transition table: a column for each input symbol and a
row for each state. An entry is a set of states that can be
reached from a state on some input symbol. E.g.:
state        'r'    digit
0             1      -
1             -      2
2 (final)     -      2
If we know the transition table and the final state(s) we can
build directly a recogniser that detects acceptance:
char = input_char();
state = 0;      // starting state
word = "";
while (char != EOF) {
    state = table(state, char);
    if (state == '-') return failure;
    word = word + char;
    char = input_char();
}
if (state == FINAL) return acceptance; else return failure;
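
A runnable sketch of the table-driven recogniser for the register example follows. The dictionary encoding of the table and the helper symbol_class are assumptions made for the illustration; a missing dictionary entry plays the role of the '-' entry.

```python
# Transition table for Register -> r Digit Digit*; a missing entry means no transition.
TABLE = {
    (0, "r"): 1,
    (1, "digit"): 2,
    (2, "digit"): 2,
}
FINAL_STATES = {2}

def symbol_class(ch):
    """Map an input character to a column of the table."""
    if ch == "r":
        return "r"
    if ch in "0123456789":
        return "digit"
    return "other"

def accepts(word):
    """Run the automaton over word and report whether it ends in a final state."""
    state = 0                                      # starting state
    for ch in word:
        state = TABLE.get((state, symbol_class(ch)))
        if state is None:                          # the '-' entry: reject
            return False
    return state in FINAL_STATES
```

The recogniser does a constant amount of work per input character, independent of how the regular expression was written.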
The Full Story!
The generalised transition diagram is a finite automaton. It can be:
• Deterministic, DFA; as in the example
• Non-Deterministic, NFA; more than 1 transition out of a state may
be possible on the same input symbol: think about: (a | b)* abb
Every regular expression can be converted to a DFA!
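
To make the non-deterministic case concrete, here is a sketch that simulates one possible NFA for (a | b)* abb by tracking the set of states it could be in. The state numbering and the dictionary encoding are assumptions made for the example.

```python
# NFA for (a|b)*abb: state 0 loops on a and b and "guesses" where abb starts.
NFA = {
    (0, "a"): {0, 1},    # two possible moves on the same symbol
    (0, "b"): {0},
    (1, "b"): {2},
    (2, "b"): {3},
}
START, ACCEPTING = 0, {3}

def nfa_accepts(word):
    """Track the set of all states the NFA could be in after each symbol."""
    current = {START}
    for ch in word:
        current = set().union(*(NFA.get((s, ch), set()) for s in current))
    return bool(current & ACCEPTING)
```

Each set of NFA states that can arise during the simulation behaves like a single state of a deterministic automaton, which is the intuition behind converting an NFA to a DFA.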
Summary: an introduction to lexical analysis was given.
Next time: More on finite automata and conversions.
Exercise: Produce the DFA for the RE (Q: what is it for?):
Register → r ((0|1|2) (Digit|ε) | (4|5|6|7|8|9) | (3|30|31))
Reading: Aho2, Sections 2.2, 3.1-3.4.

