
Compilers - Week 2

I- The goal of lexical analysis

The goal is to divide the code into lexical units, called substrings or lexemes,
such as keywords, variable names, and operators.
Example:
if(i == j)
Z = 0;
else
Z = 1;
This piece of code is seen by the lexical analyzer as follows:
\tif(i == j)\n\tZ = 0;\n\telse\n\tZ = 1;
II- What does a lexical analyzer do?
A lexical analyzer will recognize the substrings, and also classify them
according to their role into token classes.
Token classes
i- identifier: strings of letters or digits, starting with a letter
ii- integer: a non-empty string of digits
iii- keyword: a set of reserved words, such as “if”, “else”, and
“begin”
iv- whitespace: a non-empty sequence of blanks, new lines, and
tabs.
v- single-character token classes, where the class name is the lexeme itself:
1- (: "("
2- ): ")"
3- ; : ";"
4- = : "="
III- output of a lexical analyzer

The output of the lexical analyzer is a sequence of pairs where the ith
pair consists of the name of the class of the ith substring or lexeme,
and the lexeme itself. Each pair is known as a token.
Example:
If the input string is foo = 42, then the output of the lexical analyzer is:
<"Id", "foo"> <"op", "="> <"Int", "42">
\tif(i == j)\n\tZ = 0;\n\telse\n\tZ = 1;
Whitespace: \t\n\t\n\t\n\t
Keywords: if, else
Identifiers: i, j, Z
Numbers: 0, 1
Operator: ==
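To make the ⟨class, lexeme⟩ output concrete, here is a minimal Python sketch (not from the lecture; the patterns and the class names "Id", "op", "Int" simply mirror the foo = 42 example above):

```python
import re

# One regular expression per token class, tried in order.
# Class names follow the <"Id", "foo"> <"op", "="> <"Int", "42"> example.
TOKEN_SPEC = [
    ("Whitespace", r"[ \t\n]+"),
    ("Int",        r"[0-9]+"),
    ("Id",         r"[a-zA-Z][a-zA-Z0-9]*"),
    ("op",         r"==|="),
]

def tokenize(text):
    tokens = []
    pos = 0
    while pos < len(text):
        for name, pattern in TOKEN_SPEC:
            m = re.match(pattern, text[pos:])
            if m:
                if name != "Whitespace":      # whitespace is recognized but dropped
                    tokens.append((name, m.group(0)))
                pos += m.end()
                break
        else:
            raise ValueError(f"no token class matches at position {pos}")
    return tokens

print(tokenize("foo = 42"))
# [('Id', 'foo'), ('op', '='), ('Int', '42')]
```

Each pair in the result is one token: the class name followed by the lexeme itself, exactly as described above.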
IV- Lexical analysis example
In FORTRAN, whitespace is insignificant; that is, VAR1 is the same as
VA R1.
Example:
DO 5 I = 1, 25 : this is a loop that runs from the header -the DO
statement- all the way down to the statement that has a label of
5. It does so 25 times.
DO 5 I = 1.25 : here DO5I (or DO 5I) is the name of a variable, to
which the value 1.25 is assigned.
Notes:
i- the lexical analyzer scans the input string from left to
right, recognizing one token at a time.
ii- sometimes, it needs “lookahead” in order to decide
where one token ends and where the next token begins,
like the case in the previous example.
iii- the goal in the design of a lexical system is to minimize
the amount of "lookahead" needed.
Why does FORTRAN have this funny rule?
It turns out that on punch-card machines it was easy to add
extra blanks by accident, so this rule was added to the language
so that punch-card operators wouldn't have to redo their work
all the time.
V-Cases where lookahead is needed
1- on seeing "=", determine whether the token ends there (assignment)
or is followed by another "=" (the operator "==").
2- on seeing "e", determine whether it is the name of a variable or is
followed by "lse", which makes it the keyword "else" instead of an
identifier.
3- determine whether ">>" should be interpreted as two closing angle
brackets or as a stream operator.
Fun fact: for a long time, the only solution to this problem
was to insert a blank between the two brackets so they would
not be interpreted as a stream operator.
4- PL/1 (Programming Language One) was developed by IBM and was
meant to be very general, with as few constraints as possible.
In PL/1, keywords are not reserved, and that is a case where lookahead is
required.
VI- regular languages
- The lexical structure of a programming language is a set of token
classes where each class consists of some set of strings.
-Regular languages are used to specify which set of strings belongs to
each token class.
-Regular expressions are the syntax used to define regular languages.
Types of regular expressions:
1- base cases:
i- single character: 'c' = {"c"}: for any single character c, we
get a one-string language.
ii- ε = {""} is the language that contains exactly one string,
the empty string.
Note: ε ≠ ∅ (the language of ε is not the empty language)
2- compound expressions
Ways of building new regular expressions from other
regular expressions.
i- union (or): A + B = {a | a ∈ A} ∪ {b | b ∈ B}
A string a in the language of A, union a string b in the
language of B.
ii- concatenation (and): AB = {ab | a ∈ A ∧ b ∈ B}
(like a cross product: every string of A concatenated with every string of B)
iii- iteration: A* = ⋃_{i≥0} A^i (Kleene closure)
A^i is A concatenated with itself i times.
Note: A^0 is A concatenated with itself 0 times, which
is the language {ε}.
In short:
The regular expressions over Σ are the smallest set of
expressions including R = ε | 'c' | R + R | RR | R*
VII- building regular expressions
First, you have to define the alphabet Σ to be used.
Example:
Σ = {0, 1}
1* = ⋃_{i≥0} 1^i = "" + 1 + 11 + 111 + 1111 + ....
(1 + 0)1 = {ab | a ∈ (1 + 0) ∧ b ∈ 1} = {11, 01}
A string ab where a is drawn from (1 + 0) and b is drawn from 1.
0* + 1* = {0^i | i ≥ 0} ∪ {1^i | i ≥ 0}
(0 + 1)* = ⋃_{i≥0} (0 + 1)^i
(0 + 1) concatenated with itself i times, as follows:
"", (0 + 1), (0 + 1)(0 + 1), ..., all strings of 0's and 1's
Note: Σ* denotes the set of all strings you can form
out of the alphabet, so here (0 + 1)* = Σ*.
In short:
Regular expressions are syntax that is used to specify a regular
language which is a set of strings.
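The expressions above can be checked mechanically; here is a small sketch using Python's re module, where the course's "+" (union) is written "|" (these compiled names are my own, not from the notes):

```python
import re

# The course's "+" (union) is "|" in Python's regex syntax.
one_star   = re.compile(r"1*")       # 1* : "", 1, 11, 111, ...
union_cat  = re.compile(r"(1|0)1")   # (1 + 0)1 : {11, 01}
zero_one   = re.compile(r"0*|1*")    # 0* + 1* : all-0 strings or all-1 strings
sigma_star = re.compile(r"(0|1)*")   # (0 + 1)* : all strings of 0's and 1's

print(bool(union_cat.fullmatch("01")))   # True
print(bool(union_cat.fullmatch("10")))   # False
```

fullmatch checks membership of the whole string in the language, which is exactly the "set of strings" view of a regular expression.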
VIII- Formal languages
Let Σ be a set of characters (an alphabet); a formal language is any
set of strings over some alphabet.
Example:
Alphabet: English characters
Language: English sentences
An important concept for many formal languages is a meaning function
L which is a function that maps the strings in the language to their
meaning L(e) = M.
Example:
L: Exp → sets of strings
L(regular expression) = M, a set of strings
The meaning function maps a regular expression to the set of strings
that it denotes, for example;
𝐿(ε) = {""}
𝐿('𝑐') = {"𝑐"}
𝐿(𝐴 + 𝐵) = 𝐿(𝐴) ∪ 𝐿(𝐵)
First, we interpret A and B using L, then we take the union
of the result.
𝐿(𝐴𝐵) = {𝑎𝑏| 𝑎 ∈ 𝐿(𝐴) ^ 𝑏 ∈ 𝐿(𝐵)}
𝐿(𝐴*) = ⋃_{i≥0} 𝐿(𝐴^i)
Note:
Arguments to the meaning function (input) are regular
expressions and the outputs are the corresponding sets of
strings.
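The equations for L translate directly into a recursive evaluator. Here is a sketch (the tuple encoding and all names are my own, not from the notes), with L(A*) truncated to strings up to a length bound, since the full set is infinite:

```python
# Regular expressions as nested tuples:
# ("eps",), ("char", c), ("union", A, B), ("cat", A, B), ("star", A)
def L(e, max_len=3):
    """Meaning function: map a regular expression to its set of strings
    (truncated to strings of length <= max_len, since L(A*) is infinite)."""
    kind = e[0]
    if kind == "eps":
        return {""}
    if kind == "char":
        return {e[1]}
    if kind == "union":                      # L(A + B) = L(A) U L(B)
        return L(e[1], max_len) | L(e[2], max_len)
    if kind == "cat":                        # L(AB) = {ab | a in L(A), b in L(B)}
        return {a + b for a in L(e[1], max_len) for b in L(e[2], max_len)
                if len(a + b) <= max_len}
    if kind == "star":                       # L(A*) = union of L(A^i), i >= 0
        result, frontier = {""}, {""}
        while True:
            frontier = {a + b for a in frontier for b in L(e[1], max_len)
                        if len(a + b) <= max_len} - result
            if not frontier:
                return result
            result |= frontier

# (1 + 0)1 denotes {11, 01}:
print(sorted(L(("cat", ("union", ("char", "1"), ("char", "0")), ("char", "1")))))
# ['01', '11']
```

Note that the arguments are regular expressions (syntax) and the outputs are sets of strings (semantics), mirroring the definition of L above.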
Why use a meaning function?
1- It makes clear what is syntax and what is semantics.
2- It allows us to consider notation as a separate issue (e.g., Roman
numerals vs. Arabic numerals as two notations for the same numbers).
3- It allows different syntax for the same meaning, and hence we
discover that some kinds of syntax are better than others. This means
there are more expressions than there are meanings.
Note: syntax and semantics are not 1:1

-L is many to one, which helps in optimization (replacing a program
with a better equivalent that runs faster).
-L can never be one to many.

X- Lexical specification
1- keywords: ‘if’ + ‘else’ + ‘then’
They’re specified by having single quotes around them.
2- integers: non-empty strings of digits
digit = ‘0’ + ‘1’ + ‘2’ + ‘3’ + ‘4’ + ‘5’ + ‘6’ + ‘7’ + ‘8’ + ‘9’
integer = digit+ = digit digit* (one digit followed by 0 or more digits)
3- identifiers: strings of letters or digits, starting with a letter
letter = ‘a’ + ‘b’ + ‘c’ + ‘d’ +......
letter = [a-zA-Z]: a range like [a-z] is shorthand for the union of
all single-character regular expressions from ‘a’ (the first character
of the range) to ‘z’ (the last).
digit = ‘0’ + ‘1’ + ‘2’ + ‘3’ + ‘4’ + ‘5’ + ‘6’ + ‘7’ + ‘8’ + ‘9’
Identifiers: letter(letter + digit)*
4- whitespace: non-empty sequence of blanks, newlines, or tabs
whitespace = (‘ ‘ + ‘\n’ + ‘\t’)+
Examples:
1- [email protected]
letter+ ‘@’ letter+ ‘.’ letter+ ‘.’ letter+
2- how numbers are defined in the PASCAL programming language
num = digits opt_fraction opt_exponent
= digit+ ((‘.’ digit+) + ε) ((‘E’ (‘+’ + ‘-’ + ε) digit+) + ε)
Notes:
i- (‘.’ digit+) + ε is the same as (‘.’ digit+)?
ii- (‘E’ (‘+’ + ‘-’ + ε) digit+) + ε is the same as (‘E’ (‘+’ + ‘-’)? digit+)?
In short:
• Regular expressions describe many useful regular languages,
such as phone numbers, file names, and emails.
• At least one: A+ ≡ AA*
• Union: A | B ≡ A + B
• Option: A? ≡ A + ε
• Range: ‘a’ + ’b’ +…+ ’z’ ≡ [a-z]
• Excluded range: complement of [a-z] ≡ [^a-z]
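The specifications above translate almost directly into, e.g., Python's regex syntax ('|' for union, '?' for option, '+' for at-least-one). A sketch, with the Pascal-style number included (the variable names are my own):

```python
import re

digit      = r"[0-9]"
letter     = r"[a-zA-Z]"
keyword    = r"if|else|then"
integer    = digit + r"+"                                   # digit+
identifier = letter + r"(" + letter + r"|" + digit + r")*"  # letter(letter + digit)*
whitespace = r"( |\n|\t)+"                                  # (' ' + '\n' + '\t')+
# Pascal-style number: digit+ ('.' digit+)? ('E' ('+'|'-')? digit+)?
number = digit + r"+(\." + digit + r"+)?(E(\+|-)?" + digit + r"+)?"

print(bool(re.fullmatch(identifier, "x27")))    # True
print(bool(re.fullmatch(identifier, "27x")))    # False: must start with a letter
print(bool(re.fullmatch(number, "3.14E+10")))   # True
```

Building the larger patterns by string concatenation of the named pieces (digit, letter) mirrors how the lexical specification names its sub-expressions.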

XI- how to lexically analyze a program?


1. Write a regular expression for the lexemes of each token class
(numbers, identifiers, keywords, ….).
2. Construct R, matching all lexemes for all tokens
R = Keyword + Identifier + Number + … = R1 + R2 + …
3. Let input be 𝑥1..... 𝑥𝑛
For 1 ≤ 𝑖 ≤ 𝑛, check whether or not
𝑥1...... 𝑥𝑖 ∈ 𝐿(𝑅𝑗), for some j
4. If success, then we know that
𝑥1...... 𝑥𝑖 ∈ 𝐿(𝑅𝑗), for some j
5. Remove 𝑥1...... 𝑥𝑖 from input and go to (3)
Ambiguities to this algorithm
1- how much input is used?
Suppose we have two valid substrings as follows:
𝑥1...... 𝑥𝑖 ∈ 𝐿(𝑅)
𝑥1...... 𝑥𝑗 ∈ 𝐿(𝑅), where 𝑖 ≠ 𝑗
Which of these two inputs is used?
The answer is the larger one; for example, if these two
inputs are ‘=’ and ‘==’, then we consider the second
one.
In short: when faced with a choice of two different prefixes
of the input, either of which would be a valid token, we should
always choose the longer one; this rule is called
maximal munch.
2- Which token is used?
Suppose we have a substring that is valid in more than one
token class of a given lexical specification, as follows:
𝑥1...... 𝑥𝑖 ∈ 𝐿(𝑅𝑗)
𝑥1...... 𝑥𝑖 ∈ 𝐿(𝑅𝑘), where j ≠ k
For example, “if” ∈ L(keywords) and “if” ∈ L(identifiers).
This ambiguity is resolved by a priority ordering: “if” belongs
to the token class that is listed first (keywords are listed
before identifiers).
3- What if no rule matches?
𝑥1...... 𝑥𝑖 ∉ 𝐿(𝑅)
To handle this case, we write a regular expression for all
error strings not in the lexical specification, and this regular
expression is given the least priority.
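The three rules above (maximal munch, priority ordering, and a lowest-priority error rule) can be sketched together in Python; the rule names and patterns here are illustrative, not the lecture's:

```python
import re

# (class name, regex) in priority order; the error rule comes last.
RULES = [
    ("keyword",    r"if|else|then"),
    ("identifier", r"[a-zA-Z][a-zA-Z0-9]*"),
    ("integer",    r"[0-9]+"),
    ("operator",   r"==|=|\(|\)|;"),
    ("whitespace", r"[ \t\n]+"),
    ("error",      r"."),            # lowest priority: swallows one bad character
]

def tokenize(text):
    tokens, pos = [], 0
    while pos < len(text):
        # Maximal munch: take the longest prefix any rule matches;
        # ties go to the rule listed first (priority ordering).
        best_len, best = 0, None
        for name, pattern in RULES:
            m = re.match(pattern, text[pos:])
            if m and m.end() > best_len:
                best_len, best = m.end(), name
        tokens.append((best, text[pos:pos + best_len]))
        pos += best_len
    return [t for t in tokens if t[0] != "whitespace"]

print(tokenize("if(i == j) z = 0;"))
```

On "if", keyword and identifier match the same length, so the keyword rule wins by priority; on "ifx", the identifier match is longer, so maximal munch makes it an identifier.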
XII- Finite automaton
It’s a good implementation model for regular expressions.
A finite automaton consists of:
– An input alphabet Σ
– A set of states S
– A start state n
– A set of accepting states F ⊆ S
– A set of transitions state --input--> state
-An input is accepted if the automaton reaches the end of the input in an
accepting state.
-An input is rejected if the automaton ends in a state S ∉ F, or if the
machine gets stuck (never reaches the end of the input).
Example: a finite automaton that accepts only “1” (a start state A with
a single transition on 1 to an accepting state B).

Consider the following inputs: 1, 10, 0

Input 1: accepted!

Input 10: rejected (no transitions from B)

Input 0: rejected (no transitions from A with an input of 0)
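This automaton can be simulated directly from a transition table; a sketch (the state names A and B follow the example above):

```python
# Transition table for the automaton that accepts only "1":
# start state A, accepting state B, one transition A --1--> B.
TRANSITIONS = {("A", "1"): "B"}
START, ACCEPTING = "A", {"B"}

def accepts(text):
    state = START
    for ch in text:
        if (state, ch) not in TRANSITIONS:
            return False              # machine got stuck: reject
        state = TRANSITIONS[(state, ch)]
    return state in ACCEPTING         # accept only if we end in F

print(accepts("1"), accepts("10"), accepts("0"))
# True False False
```

The three printed results match the three cases above: "1" ends in B, "10" gets stuck in B, and "0" gets stuck in A.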


Notes:
i- the language of a finite automaton is the set of accepted
strings.
ii- A machine may also have ε-moves (free moves): the state changes
without consuming any input, and the machine is not obliged to take them.
Types of finite automaton
i- Deterministic finite automaton DFA:
-Allows one transition per input per state
-doesn’t allow ε-moves
ii- Nondeterministic finite automaton NFA:
-Allows multiple transitions for one input in a given state;
that is, the same input can cause transitions to
multiple states
-allows ε-moves
NFA vs DFA
1- NFA: (diagram omitted)
2- DFA: (diagram omitted)
-DFAs are faster to execute, since there are no choices to
consider (one transition per input per state)
-NFAs are exponentially smaller
Note: NFAs and DFAs recognize the same regular languages.

XIII- implementation of a lexical specification


1- A lexical specification is written as a set of regular expressions.
2- Each regular expression is converted into an NFA that recognizes
exactly the same language.
3- Each NFA is converted into its equivalent DFA.
4- Each DFA is implemented as a set of lookup tables.
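Step 3 is the classic subset construction: each DFA state is the set of NFA states the machine could be in. A Python sketch (the example NFA, for the expression 1*0, and all names are my own):

```python
# An NFA given as: nfa[state][symbol] -> set of successor states,
# with "" marking epsilon-moves. This (hypothetical) NFA recognizes 1*0.
nfa = {
    0: {"1": {0}, "0": {1}},
    1: {},
}
start, accepting = 0, {1}

def eps_closure(states, nfa):
    """All states reachable from `states` using epsilon-moves alone."""
    stack, seen = list(states), set(states)
    while stack:
        s = stack.pop()
        for t in nfa.get(s, {}).get("", set()):
            if t not in seen:
                seen.add(t)
                stack.append(t)
    return frozenset(seen)

def nfa_to_dfa(nfa, start, accepting, alphabet):
    """Subset construction: each DFA state is a frozenset of NFA states."""
    d_start = eps_closure({start}, nfa)
    d_trans, d_accept, todo = {}, set(), [d_start]
    while todo:
        S = todo.pop()
        if S in d_trans:
            continue
        if S & accepting:                # any accepting NFA state inside?
            d_accept.add(S)
        d_trans[S] = {}
        for a in alphabet:
            moved = set()
            for s in S:
                moved |= nfa.get(s, {}).get(a, set())
            T = eps_closure(moved, nfa)
            if T:                        # drop the dead (empty) state
                d_trans[S][a] = T
                todo.append(T)
    return d_start, d_trans, d_accept

def dfa_accepts(text, d_start, d_trans, d_accept):
    state = d_start
    for ch in text:
        if ch not in d_trans[state]:
            return False                 # stuck: reject
        state = d_trans[state][ch]
    return state in d_accept

d_start, d_trans, d_accept = nfa_to_dfa(nfa, start, accepting, {"0", "1"})
print(dfa_accepts("1110", d_start, d_trans, d_accept))  # True
```

The resulting d_trans dictionary is exactly the "set of lookup tables" of step 4: one row per DFA state, one column per input symbol.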

XIV- regular expression to NFA


Notation: NFA for regular expression M

Examples:
i- for ε: a single ε-move from the start state to a final state
ii- for input a: a single transition on a from the start state to a final state
Compound regular expressions:

i- AB: compose the two machines for A and B, connecting the final
state of A to the start state of B with an ε-move; the final state of A
is no longer a final state.

ii- A + B: add a new start state with ε-moves into the machines for A
and B, and ε-moves from their final states to a new final state.

iii- A*: add ε-moves so that we can go from the final state of A back
to the starting state, and an ε-move that skips the machine for A entirely.

XV- NFA to DFA conversion (see lecture 4-04)
XVI- implementation of a finite automaton (see lecture 4-05)
