
Compilers - Week 2

I- The goal of lexical analysis

The goal is to divide the code into lexical units, called substrings or lexemes,
such as keywords, variable names, and operators.
Example:
if(i == j)
Z = 0;
else
Z = 1;
This piece of code is seen by the lexical analyzer as follows:
\tif(i == j)\n\tZ = 0;\n\telse\n\tZ = 1;
II- What does a lexical analyzer do?
A lexical analyzer will recognize the substrings, and also classify them
according to their role into token classes.
Token classes
i- identifier: strings of letters or digits, starting with a letter
ii- integer: a non-empty string of digits
iii- keyword: a set of reserved words, such as “if”, “else”, and
“begin”
iv- whitespace: a non-empty sequence of blanks, new lines, and
tabs.
v- single-character token classes, where the class name is the lexeme itself:
1- (: "("
2- ): ")"
3- ; : ";"
4- = : "="
III- output of a lexical analyzer

The output of the lexical analyzer is a sequence of pairs where the ith
pair consists of the name of the class of the ith substring or lexeme,
and the lexeme itself. Each pair is known as a token.
Example:
If the input string is foo = 42, then the output of the lexical analyzer is:
<"Id", "foo"> <"op", "="> <"Int", "42">
\tif(i == j)\n\tZ = 0;\n\telse\n\tZ = 1;
Whitespace: \t\n\t\n\t\n\t
Keywords: if, else
Identifiers: i, j, Z
Numbers: 0, 1
Operator: ==
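To make the ⟨class, lexeme⟩ output concrete, here is a minimal Python sketch (not from the lecture; the patterns and the class names "Id", "op", "Int" simply mirror the foo = 42 example above):

```python
import re

# One regular expression per token class, tried in order.
# Class names follow the <"Id", "foo"> <"op", "="> <"Int", "42"> example.
TOKEN_SPEC = [
    ("Whitespace", r"[ \t\n]+"),
    ("Int",        r"[0-9]+"),
    ("Id",         r"[a-zA-Z][a-zA-Z0-9]*"),
    ("op",         r"==|="),
]

def tokenize(text):
    tokens = []
    pos = 0
    while pos < len(text):
        for name, pattern in TOKEN_SPEC:
            m = re.match(pattern, text[pos:])
            if m:
                if name != "Whitespace":      # whitespace is recognized but dropped
                    tokens.append((name, m.group(0)))
                pos += m.end()
                break
        else:
            raise ValueError(f"no token class matches at position {pos}")
    return tokens

print(tokenize("foo = 42"))
# [('Id', 'foo'), ('op', '='), ('Int', '42')]
```

Each pair in the result is one token: the class name followed by the lexeme itself, exactly as described above.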
IV- Lexical analysis example
In FORTRAN, whitespace is insignificant; that is, VAR1 is the same as
VA R1.
Example:
DO 5 I = 1, 25 : this is a loop that runs from the header -the DO
statement- all the way down to the statement that has a label of
5. It does so 25 times.
DO 5 I = 1.25 : here DO5I (or DO 5I) is the name of a variable, to
which the value 1.25 is assigned.
Notes:
i- the lexical analyzer scans the input string from left to
right, recognizing one token at a time.
ii- sometimes, it needs “lookahead” in order to decide
where one token ends and where the next token begins,
like the case in the previous example.
iii- the goal in the design of a lexical system is to minimize
the amount of "lookahead" needed.
Why does FORTRAN have this funny rule?
It turns out that on punch-card machines it was easy to add
extra blanks by accident, so this rule was added to the language
so that punch-card operators wouldn't have to redo their work
all the time.
V-Cases where lookahead is needed
1- on seeing "=", determine whether the token ends there (assignment)
or is followed by another "=" (the operator "==").
2- on seeing "e", determine whether it is the name of a variable or is
followed by "lse", which makes it the keyword "else" instead of an
identifier.
3- determine whether ">>" should be interpreted as two closing angle
brackets or as a stream operator.
Fun fact: for a long time, the only solution to this problem
was to insert a blank between the two brackets so they would
not be interpreted as a stream operator.
4- PL/1 (Programming Language One) was developed by IBM and was
meant to be very general, with as few constraints as possible.
In PL/1, keywords are not reserved, and that is a case where lookahead is
required.
VI- regular languages
- The lexical structure of a programming language is a set of token
classes where each class consists of some set of strings.
-Regular languages are used to specify which set of strings belongs to
each token class.
-Regular expressions are the syntax used to define regular languages.
Types of regular expressions:
1- base cases:
i- single character: 'c' = {"c"}: for any single character c, we
get a one-string language.
ii- ε = {""} is the language that contains exactly one string,
the empty string.
Note: ε ≠ ∅ (the language of ε is not the empty language)
2- compound expressions
Ways of building new regular expressions from other
regular expressions.
i- union (or): A + B = {a | a ∈ A} ∪ {b | b ∈ B}
A string a in the language of A, union a string b in the
language of B.
ii- concatenation (and): AB = {ab | a ∈ A ∧ b ∈ B}
(like a cross product: every string of A concatenated with every string of B)
iii- iteration: A* = ⋃_{i≥0} A^i (Kleene closure)
A^i is A concatenated with itself i times.
Note: A^0 is A concatenated with itself 0 times, which
is the language {ε}.
In short:
The regular expressions over Σ are the smallest set of
expressions including R = ε | 'c' | R + R | RR | R*
VII- building regular expressions
First, you have to define the alphabet Σ to be used.
Example:
Σ = {0, 1}
1* = ⋃_{i≥0} 1^i = "" + 1 + 11 + 111 + 1111 + ....
(1 + 0)1 = {ab | a ∈ (1 + 0) ∧ b ∈ 1} = {11, 01}
A string ab where a is drawn from (1 + 0) and b is drawn from 1.
0* + 1* = {0^i | i ≥ 0} ∪ {1^i | i ≥ 0}
(0 + 1)* = ⋃_{i≥0} (0 + 1)^i
(0 + 1) concatenated with itself i times, as follows:
"", (0 + 1), (0 + 1)(0 + 1), ..., all strings of 0's and 1's
Note: Σ* denotes the set of all strings you can form
out of the alphabet, so here (0 + 1)* = Σ*.
In short:
Regular expressions are syntax that is used to specify a regular
language which is a set of strings.
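The expressions above can be checked mechanically; here is a small sketch using Python's re module, where the course's "+" (union) is written "|" (these compiled names are my own, not from the notes):

```python
import re

# The course's "+" (union) is "|" in Python's regex syntax.
one_star   = re.compile(r"1*")       # 1* : "", 1, 11, 111, ...
union_cat  = re.compile(r"(1|0)1")   # (1 + 0)1 : {11, 01}
zero_one   = re.compile(r"0*|1*")    # 0* + 1* : all-0 strings or all-1 strings
sigma_star = re.compile(r"(0|1)*")   # (0 + 1)* : all strings of 0's and 1's

print(bool(union_cat.fullmatch("01")))   # True
print(bool(union_cat.fullmatch("10")))   # False
```

fullmatch checks membership of the whole string in the language, which is exactly the "set of strings" view of a regular expression.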
VIII- Formal languages
Let Σ be a set of characters (an alphabet); a formal language is any
set of strings over some alphabet.
Example:
Alphabet: English characters
Language: English sentences
An important concept for many formal languages is a meaning function
L which is a function that maps the strings in the language to their
meaning L(e) = M.
Example:
L: Exp → sets of strings
L(regular expression) = M, a set of strings
The meaning function maps a regular expression to the set of strings
that it denotes, for example;
𝐿(ε) = {""}
𝐿('𝑐') = {"𝑐"}
𝐿(𝐴 + 𝐵) = 𝐿(𝐴) ∪ 𝐿(𝐵)
First, we interpret A and B using L, then we take the union
of the result.
𝐿(𝐴𝐵) = {𝑎𝑏| 𝑎 ∈ 𝐿(𝐴) ^ 𝑏 ∈ 𝐿(𝐵)}
𝐿(𝐴*) = ⋃_{i≥0} 𝐿(𝐴^i)
Note:
Arguments to the meaning function (input) are regular
expressions and the outputs are the corresponding sets of
strings.
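The equations for L translate directly into a recursive evaluator. Here is a sketch (the tuple encoding and all names are my own, not from the notes), with L(A*) truncated to strings up to a length bound, since the full set is infinite:

```python
# Regular expressions as nested tuples:
# ("eps",), ("char", c), ("union", A, B), ("cat", A, B), ("star", A)
def L(e, max_len=3):
    """Meaning function: map a regular expression to its set of strings
    (truncated to strings of length <= max_len, since L(A*) is infinite)."""
    kind = e[0]
    if kind == "eps":
        return {""}
    if kind == "char":
        return {e[1]}
    if kind == "union":                      # L(A + B) = L(A) U L(B)
        return L(e[1], max_len) | L(e[2], max_len)
    if kind == "cat":                        # L(AB) = {ab | a in L(A), b in L(B)}
        return {a + b for a in L(e[1], max_len) for b in L(e[2], max_len)
                if len(a + b) <= max_len}
    if kind == "star":                       # L(A*) = union of L(A^i), i >= 0
        result, frontier = {""}, {""}
        while True:
            frontier = {a + b for a in frontier for b in L(e[1], max_len)
                        if len(a + b) <= max_len} - result
            if not frontier:
                return result
            result |= frontier

# (1 + 0)1 denotes {11, 01}:
print(sorted(L(("cat", ("union", ("char", "1"), ("char", "0")), ("char", "1")))))
# ['01', '11']
```

Note that the arguments are regular expressions (syntax) and the outputs are sets of strings (semantics), mirroring the definition of L above.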
Why use a meaning function?
1- It makes clear what is syntax and what is semantics.
2- It allows us to consider notation as a separate issue (e.g., Roman
numerals vs. Arabic numerals as two notations for the same numbers).
3- It allows different syntax for the same meaning, and hence we
discover that some kinds of syntax are better than others. This means
there are more expressions than there are meanings.
Note: syntax and semantics are not 1:1

-L is many to one, which helps in optimization (replacing a program
with a better equivalent that runs faster).
-L can never be one to many.

X- Lexical specification
1- keywords: ‘if’ + ‘else’ + ‘then’
They’re specified by having single quotes around them.
2- integers: non-empty strings of digits
digit = ‘0’ + ‘1’ + ‘2’ + ‘3’ + ‘4’ + ‘5’ + ‘6’ + ‘7’ + ‘8’ + ‘9’
integer = digit+ = digit digit* (one digit followed by 0 or more digits)
3- identifiers: strings of letters or digits, starting with a letter
letter = ‘a’ + ‘b’ + ‘c’ + ‘d’ +......
letter = [a-zA-Z]: a range like [a-z] is shorthand for the union of
all single-character regular expressions from ‘a’ (the first character
of the range) to ‘z’ (the last).
digit = ‘0’ + ‘1’ + ‘2’ + ‘3’ + ‘4’ + ‘5’ + ‘6’ + ‘7’ + ‘8’ + ‘9’
Identifiers: letter(letter + digit)*
4- whitespace: non-empty sequence of blanks, newlines, or tabs
whitespace = (‘ ‘ + ‘\n’ + ‘\t’)+
Examples:
1- [email protected]
letter+ ‘@’ letter+ ‘.’ letter+ ‘.’ letter+
2- how numbers are defined in the PASCAL programming language
num = digits opt_fraction opt_exponent
= digit+ ((‘.’ digit+) + ε) ((‘E’ (‘+’ + ‘-’ + ε) digit+) + ε)
Notes:
i- (‘.’ digit+) + ε is the same as (‘.’ digit+)?
ii- (‘E’ (‘+’ + ‘-’ + ε) digit+) + ε is the same as (‘E’ (‘+’ + ‘-’)? digit+)?
In short:
• Regular expressions describe many useful regular languages,
such as phone numbers, file names, and emails.
• At least one: A+ ≡ AA*
• Union: A | B ≡ A + B
• Option: A? ≡ A + ε
• Range: ‘a’ + ’b’ +…+ ’z’ ≡ [a-z]
• Excluded range: complement of [a-z] ≡ [^a-z]
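The specifications above translate almost directly into, e.g., Python's regex syntax ('|' for union, '?' for option, '+' for at-least-one). A sketch, with the Pascal-style number included (the variable names are my own):

```python
import re

digit      = r"[0-9]"
letter     = r"[a-zA-Z]"
keyword    = r"if|else|then"
integer    = digit + r"+"                                   # digit+
identifier = letter + r"(" + letter + r"|" + digit + r")*"  # letter(letter + digit)*
whitespace = r"( |\n|\t)+"                                  # (' ' + '\n' + '\t')+
# Pascal-style number: digit+ ('.' digit+)? ('E' ('+'|'-')? digit+)?
number = digit + r"+(\." + digit + r"+)?(E(\+|-)?" + digit + r"+)?"

print(bool(re.fullmatch(identifier, "x27")))    # True
print(bool(re.fullmatch(identifier, "27x")))    # False: must start with a letter
print(bool(re.fullmatch(number, "3.14E+10")))   # True
```

Building the larger patterns by string concatenation of the named pieces (digit, letter) mirrors how the lexical specification names its sub-expressions.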

XI- how to lexically analyze a program?


1. Write a regular expression for the lexemes of each token class
(numbers, identifiers, keywords, ….).
2. Construct R, matching all lexemes for all tokens
R = Keyword + Identifier + Number + … = R1 + R2 + …
3. Let input be 𝑥1..... 𝑥𝑛
For 1 ≤ 𝑖 ≤ 𝑛, check whether or not
𝑥1...... 𝑥𝑖 ∈ 𝐿(𝑅𝑗), for some j
4. If success, then we know that
𝑥1...... 𝑥𝑖 ∈ 𝐿(𝑅𝑗), for some j
5. Remove 𝑥1...... 𝑥𝑖 from input and go to (3)
Ambiguities to this algorithm
1- how much input is used?
Suppose we have two valid substrings as follows:
𝑥1...... 𝑥𝑖 ∈ 𝐿(𝑅)
𝑥1...... 𝑥𝑗 ∈ 𝐿(𝑅), where 𝑖 ≠ 𝑗
Which of these two inputs is used?
The answer is the larger one; for example, if these two
inputs are ‘=’ and ‘==’, then we consider the second
one.
In short: when faced with a choice of two different prefixes
of the input, either of which would be a valid token, we should
always choose the longer one; this rule is called
maximal munch.
2- Which token is used?
Suppose we have a substring that is valid in more than one
token class of a given lexical specification, as follows:
𝑥1...... 𝑥𝑖 ∈ 𝐿(𝑅𝑗)
𝑥1...... 𝑥𝑖 ∈ 𝐿(𝑅𝑘), where j ≠ k
For example, “if” ∈ L(keywords) and “if” ∈ L(identifiers).
This ambiguity is resolved by a priority ordering: “if” belongs
to the token class that is listed first (keywords are listed
before identifiers).
3- What if no rule matches?
𝑥1...... 𝑥𝑖 ∉ 𝐿(𝑅)
To handle this case, we write a regular expression for all
error strings not in the lexical specification, and this regular
expression is given the least priority.
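The three rules above (maximal munch, priority ordering, and a lowest-priority error rule) can be sketched together in Python; the rule names and patterns here are illustrative, not the lecture's:

```python
import re

# (class name, regex) in priority order; the error rule comes last.
RULES = [
    ("keyword",    r"if|else|then"),
    ("identifier", r"[a-zA-Z][a-zA-Z0-9]*"),
    ("integer",    r"[0-9]+"),
    ("operator",   r"==|=|\(|\)|;"),
    ("whitespace", r"[ \t\n]+"),
    ("error",      r"."),            # lowest priority: swallows one bad character
]

def tokenize(text):
    tokens, pos = [], 0
    while pos < len(text):
        # Maximal munch: take the longest prefix any rule matches;
        # ties go to the rule listed first (priority ordering).
        best_len, best = 0, None
        for name, pattern in RULES:
            m = re.match(pattern, text[pos:])
            if m and m.end() > best_len:
                best_len, best = m.end(), name
        tokens.append((best, text[pos:pos + best_len]))
        pos += best_len
    return [t for t in tokens if t[0] != "whitespace"]

print(tokenize("if(i == j) z = 0;"))
```

On "if", keyword and identifier match the same length, so the keyword rule wins by priority; on "ifx", the identifier match is longer, so maximal munch makes it an identifier.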
XII- Finite automaton
It’s a good implementation model for regular expressions.
A finite automaton consists of:
– An input alphabet Σ
– A set of states S
– A start state n
– A set of accepting states F ⊆ S
– A set of transitions state --input--> state
-An input is accepted if the automaton reaches the end of the input in an
accepting state.
-An input is rejected if the automaton ends in a state S ∉ F, or if the
machine gets stuck (never reaches the end of the input).
Example: a finite automaton that accepts only “1” (a start state A with
a single transition on 1 to an accepting state B).

Consider the following inputs: 1, 10, 0

Input 1: accepted!

Input 10: rejected (no transitions from B)

Input 0: rejected (no transitions from A with an input of 0)
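This automaton can be simulated directly from a transition table; a sketch (the state names A and B follow the example above):

```python
# Transition table for the automaton that accepts only "1":
# start state A, accepting state B, one transition A --1--> B.
TRANSITIONS = {("A", "1"): "B"}
START, ACCEPTING = "A", {"B"}

def accepts(text):
    state = START
    for ch in text:
        if (state, ch) not in TRANSITIONS:
            return False              # machine got stuck: reject
        state = TRANSITIONS[(state, ch)]
    return state in ACCEPTING         # accept only if we end in F

print(accepts("1"), accepts("10"), accepts("0"))
# True False False
```

The three printed results match the three cases above: "1" ends in B, "10" gets stuck in B, and "0" gets stuck in A.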


Notes:
i- the language of a finite automaton is the set of accepted
strings.
ii- A machine may also have ε-moves (free moves): the state changes
without consuming any input, and the machine is not obliged to take them.
Types of finite automaton
i- Deterministic finite automaton DFA:
-Allows one transition per input per state
-doesn’t allow ε-moves
ii- Nondeterministic finite automaton NFA:
-Allows multiple transitions for one input in a given state;
that is, the same input can cause transitions to
multiple states
-allows ε-moves
NFA vs DFA
1- NFA: (diagram omitted)
2- DFA: (diagram omitted)
-DFAs are faster to execute, since there are no choices to
consider (one transition per input per state)
-NFAs are exponentially smaller
Note: NFAs and DFAs recognize the same regular languages.

XIII- implementation of a lexical specification


1- A lexical specification is written as a set of regular expressions.
2- Each regular expression is converted into an NFA that recognizes
exactly the same language.
3- Each NFA is converted into its equivalent DFA.
4- Each DFA is implemented as a set of lookup tables.
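Step 3 is the classic subset construction: each DFA state is the set of NFA states the machine could be in. A Python sketch (the example NFA, for the expression 1*0, and all names are my own):

```python
# An NFA given as: nfa[state][symbol] -> set of successor states,
# with "" marking epsilon-moves. This (hypothetical) NFA recognizes 1*0.
nfa = {
    0: {"1": {0}, "0": {1}},
    1: {},
}
start, accepting = 0, {1}

def eps_closure(states, nfa):
    """All states reachable from `states` using epsilon-moves alone."""
    stack, seen = list(states), set(states)
    while stack:
        s = stack.pop()
        for t in nfa.get(s, {}).get("", set()):
            if t not in seen:
                seen.add(t)
                stack.append(t)
    return frozenset(seen)

def nfa_to_dfa(nfa, start, accepting, alphabet):
    """Subset construction: each DFA state is a frozenset of NFA states."""
    d_start = eps_closure({start}, nfa)
    d_trans, d_accept, todo = {}, set(), [d_start]
    while todo:
        S = todo.pop()
        if S in d_trans:
            continue
        if S & accepting:                # any accepting NFA state inside?
            d_accept.add(S)
        d_trans[S] = {}
        for a in alphabet:
            moved = set()
            for s in S:
                moved |= nfa.get(s, {}).get(a, set())
            T = eps_closure(moved, nfa)
            if T:                        # drop the dead (empty) state
                d_trans[S][a] = T
                todo.append(T)
    return d_start, d_trans, d_accept

def dfa_accepts(text, d_start, d_trans, d_accept):
    state = d_start
    for ch in text:
        if ch not in d_trans[state]:
            return False                 # stuck: reject
        state = d_trans[state][ch]
    return state in d_accept

d_start, d_trans, d_accept = nfa_to_dfa(nfa, start, accepting, {"0", "1"})
print(dfa_accepts("1110", d_start, d_trans, d_accept))  # True
```

The resulting d_trans dictionary is exactly the "set of lookup tables" of step 4: one row per DFA state, one column per input symbol.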

XIV- regular expression to NFA


Notation: NFA for regular expression M

Examples:
i- for ε: a single ε-move from the start state to a final state
ii- for input a: a single transition on a from the start state to a final state
Compound regular expressions:

i- AB: compose the two machines for A and B, connecting the final
state of A to the start state of B with an ε-move; the final state of A
is no longer a final state.

ii- A + B: add a new start state with ε-moves into the machines for A
and B, and ε-moves from their final states to a new final state.

iii- A*: add ε-moves so that we can go from the final state of A back
to the starting state, and an ε-move that skips the machine for A entirely.

XV- NFA to DFA conversion (see lecture 4-04)
XVI- implementation of a finite automaton (see lecture 4-05)
