Compilers - Week 2
The output of the lexical analyzer is a sequence of pairs where the ith
pair consists of the name of the class of the ith substring or lexeme,
and the lexeme itself. Each pair is known as a token.
Example:
If the input string is foo = 42, the output of the lexical analyzer is:
<“Id”, “foo”> <“Op”, “=”> <“Int”, “42”>
\tif(i == j)\n\tZ = 0;\n\telse\n\tZ = 1;
Whitespace: \t\n\t\n\t\n\t
Keywords: if, else
Identifiers: i, j, Z
Numbers: 0, 1
Operators: ==, =
Punctuation: (, ), ;
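The classification above can be reproduced with a small scanner. The following is a minimal sketch, not the course's implementation: the class names, the patterns, and the function name tokenize are assumptions chosen to match this example only.

```python
import re

# Token classes for this one example. Names and patterns are assumptions,
# not a full language specification.
TOKEN_SPEC = [
    ("Whitespace", r"[ \t\n]+"),
    ("Keyword", r"\b(?:if|else)\b"),
    ("Identifier", r"[A-Za-z][A-Za-z0-9]*"),
    ("Number", r"[0-9]+"),
    ("Operator", r"==|="),
    ("Punctuation", r"[();]"),
]
# One alternative per class; Python tries the alternatives in order.
MASTER = re.compile("|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPEC))


def tokenize(text):
    """Return a list of (class, lexeme) pairs, scanning left to right."""
    tokens = []
    pos = 0
    while pos < len(text):
        match = MASTER.match(text, pos)
        if match is None:
            raise ValueError(f"unexpected character {text[pos]!r} at position {pos}")
        tokens.append((match.lastgroup, match.group()))
        pos = match.end()
    return tokens


if __name__ == "__main__":
    for token in tokenize("\tif(i == j)\n\tZ = 0;\n\telse\n\tZ = 1;"):
        print(token)
```

Because the alternatives are tried in order, Keyword is listed before Identifier and "==" before "=", so that if is not reported as an identifier and == is not split into two assignments.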
IV- Lexical analysis example
In FORTRAN, whitespace is insignificant; that is, VAR1 is the same as
VA R1.
Example:
DO 5 I = 1, 25 : this is a loop that runs from the header (the DO
statement) down to the statement labeled 5. It executes 25 times,
with I running from 1 to 25.
DO 5 I = 1.25 : here DO5I (equivalently DO 5I) is the name of a
variable, to which the value 1.25 is assigned.
Notes:
i- The lexical analyzer scans the input string from left to
right, recognizing one token at a time.
ii- Sometimes it needs “lookahead” to decide where one token
ends and where the next token begins, as in the previous
example, where the two statements are identical up to the
character after the 1 (a comma in one, a period in the other).
iii- A goal in the design of a lexical system is to minimize
the amount of “lookahead”.
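The lookahead in the DO example can be sketched as follows. This is a deliberately simplified illustration, not a FORTRAN lexer: the function name classify_do_statement is hypothetical, and the only rule it encodes is the one from the example, namely that a comma after the = marks a loop header.

```python
def classify_do_statement(statement):
    """Classify a FORTRAN-style statement that starts with DO.

    Simplifying assumptions: blanks are insignificant, and a comma after
    the '=' is what distinguishes a loop header from an assignment.
    """
    compact = statement.replace(" ", "")  # blanks carry no meaning
    if not compact.startswith("DO") or "=" not in compact:
        return "other"
    left, right = compact.split("=", 1)
    # Lookahead: scan past the '=' to see whether a comma follows.
    if "," in right:
        return "loop"
    return f"assignment to {left}"


if __name__ == "__main__":
    print(classify_do_statement("DO 5 I = 1, 25"))  # loop
    print(classify_do_statement("DO 5 I = 1.25"))   # assignment to DO5I
```

The scanner cannot tell what DO means when it first reads it; it has to look an unbounded distance ahead, which is exactly what a lexical design tries to avoid.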
Why does FORTRAN have this funny rule?
It turns out that on punch card machines it was easy to add
extra blanks by accident. This rule was added to the language
so that punch card operators wouldn't have to redo their work
all the time.
V- Cases where lookahead is needed
1- On reading “=”, determine whether the operator ends there or is
followed by another “=” (making “==”).
2- On reading “e”, determine whether it is the name of a variable or
is followed by “lse”, which makes it the keyword else instead of an
identifier.
3- Determine whether “>>” should be interpreted as two closing
angle brackets (as in nested C++ templates) or as a stream operator.
Fun fact: for a long time, the only solution to this problem
was to insert a blank between the two brackets so that they
would not be interpreted as a stream operator.
4- PL/I (Programming Language One) was developed by IBM and
was meant to be very general, with as few constraints as
possible.
In PL/I, keywords are not reserved, so deciding whether a word is a
keyword or an identifier is another case where lookahead is required.
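The first case in the list above needs only one character of lookahead. Here is a minimal sketch, assuming a scanner that works on a string and an index; the function name scan_equals and the class names Assign and Equals are made up for the illustration.

```python
def scan_equals(text, pos):
    """Scan an operator that starts with '=' at text[pos].

    Returns (token_class, lexeme, next_pos).
    """
    if text[pos] != "=":
        raise ValueError(f"expected '=' at position {pos}")
    # Peek one character ahead without consuming it.
    if pos + 1 < len(text) and text[pos + 1] == "=":
        return ("Equals", "==", pos + 2)
    return ("Assign", "=", pos + 1)


if __name__ == "__main__":
    print(scan_equals("i == j", 2))  # ('Equals', '==', 4)
    print(scan_equals("Z = 0", 2))   # ('Assign', '=', 3)
```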
VI- Regular languages
- The lexical structure of a programming language is a set of token
classes where each class consists of some set of strings.
-Regular languages are used to specify which set of strings belongs to
each token class.
-Regular languages are defined through regular expressions (syntax)
Types of regular expressions:
1- base cases:
i- single character: ‘c’ = {“c”}: any single character c denotes
a language containing one string, “c”.
ii- ε = {“”} is the language that contains exactly one string, the
empty string.
Note: ε ≠ ∅. The language of ε contains one string, whereas the
empty language contains none.
2- compound expressions
Ways of building new regular expressions from other
regular expressions.
i- union (or): 𝐴 + 𝐵 = {𝑎 | 𝑎 ∈ 𝐴} ∪ {𝑏 | 𝑏 ∈ 𝐵}
The strings a in the language of A, together with the strings
b in the language of B.
ii- concatenation (and): 𝐴𝐵 = {𝑎𝑏 | 𝑎 ∈ 𝐴 ∧ 𝑏 ∈ 𝐵}
One string ab for every pair (a, b) in the cross product of
A and B.
iii- iteration: 𝐴* = ⋃_{𝑖≥0} 𝐴^𝑖 (Kleene closure)
where 𝐴^𝑖 is A concatenated with itself i times.
Note: 𝐴^0 is A concatenated with itself 0 times, which
is the language ε.
In short:
The regular expressions over Σ are the smallest set of
expressions including 𝑅 = ε | ‘𝑐’ | 𝑅 + 𝑅 | 𝑅𝑅 | 𝑅*, where 𝑐 ∈ Σ.
VII- Building regular expressions
First, define the alphabet Σ to be used.
Example:
Σ = {0, 1}
1* = ⋃_{𝑖≥0} 1^𝑖 = "" + 1 + 11 + 111 + 1111 + ....
(1 + 0)1 = {𝑎𝑏 | 𝑎 ∈ (1 + 0) ∧ 𝑏 ∈ 1} = {11, 01}
Strings ab where a is drawn from (1 + 0) and b is drawn from 1.
0* + 1* = {0^𝑖 | 𝑖 ≥ 0} ∪ {1^𝑖 | 𝑖 ≥ 0}
(0 + 1)* = ⋃_{𝑖≥0} (0 + 1)^𝑖
(0 + 1) concatenated with itself i times, as follows:
“”, (0 + 1), (0 + 1)(0 + 1), ..., (0 + 1)...(0 + 1): all strings of 0’s and
1’s.
Note: (0 + 1)* is also written Σ*; it denotes the set of all strings
you can form out of the alphabet.
In short:
Regular expressions are syntax that is used to specify a regular
language which is a set of strings.
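The four examples over Σ = {0, 1} can be checked with Python's re module. This is a sketch under one notational assumption: the notes write union as +, while Python writes it as |, so each pattern below is a hand translation. The helper name matches is hypothetical.

```python
import re


def matches(pattern, string):
    """True when the whole string belongs to the language of the pattern."""
    return re.fullmatch(pattern, string) is not None


# Notation in the notes      Python notation
ONES = r"1*"                 # 1*
ONE_OR_ZERO_THEN_ONE = r"(1|0)1"  # (1 + 0)1
ZEROS_OR_ONES = r"0*|1*"     # 0* + 1*
ALL_STRINGS = r"(0|1)*"      # (0 + 1)*, that is, Σ*

if __name__ == "__main__":
    print(matches(ONES, ""))                    # True: "" is 1 repeated 0 times
    print(matches(ONE_OR_ZERO_THEN_ONE, "01"))  # True
    print(matches(ONE_OR_ZERO_THEN_ONE, "10"))  # False: must end in 1
    print(matches(ZEROS_OR_ONES, "0011"))       # False: mixes 0's and 1's
    print(matches(ALL_STRINGS, "0011"))         # True
```

Python's + means "one or more" rather than union, which is why the translation matters when moving between the two notations.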
VIII- Formal languages
Let Σ be a set of characters (an alphabet). A formal language over Σ
is any set of strings over Σ.
Example:
Alphabet: English characters
Language: English sentences
Example:
Alphabet:
An important concept for many formal languages is the meaning function
L, which maps each string e in the language to its meaning M:
L(e) = M.
Example:
L: exp -> set of strings
L(regular expression) = M(a set of strings)
The meaning function maps a regular expression to the set of strings
that it denotes, for example:
𝐿(ε) = {""}
𝐿('𝑐') = {"𝑐"}
𝐿(𝐴 + 𝐵) = 𝐿(𝐴) ∪ 𝐿(𝐵)
First, we interpret A and B using L, then we take the union
of the result.
𝐿(𝐴𝐵) = {𝑎𝑏 | 𝑎 ∈ 𝐿(𝐴) ∧ 𝑏 ∈ 𝐿(𝐵)}
𝐿(𝐴*) = ⋃_{𝑖≥0} 𝐿(𝐴^𝑖)
Note:
Arguments to the meaning function (input) are regular
expressions and the outputs are the corresponding sets of
strings.
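The meaning function can be written almost line for line as a recursive program. The sketch below rests on two assumptions that are not in the notes: regular expressions are encoded as nested tuples (a made-up encoding), and because L(A*) is usually infinite, the function only returns the strings of length at most max_len.

```python
def language(expr, max_len):
    """Return the strings of L(expr) whose length is at most max_len.

    expr uses a hypothetical tuple encoding:
      ("eps",), ("char", c), ("union", A, B), ("concat", A, B), ("star", A)
    """
    kind = expr[0]
    if kind == "eps":
        return {""}
    if kind == "char":
        return {expr[1]} if max_len >= 1 else set()
    if kind == "union":
        # L(A + B) = L(A) ∪ L(B)
        return language(expr[1], max_len) | language(expr[2], max_len)
    if kind == "concat":
        # L(AB) = {ab | a ∈ L(A) and b ∈ L(B)}
        left = language(expr[1], max_len)
        right = language(expr[2], max_len)
        return {a + b for a in left for b in right if len(a + b) <= max_len}
    if kind == "star":
        # L(A*) = union of L(A^i) for i >= 0, cut off at max_len
        base = language(expr[1], max_len)
        result = {""}    # A^0
        frontier = {""}
        while frontier:
            # Append one more copy of A; keep only new, short-enough strings.
            frontier = {s + a for s in frontier for a in base
                        if len(s + a) <= max_len} - result
            result |= frontier
        return result
    raise ValueError(f"unknown expression kind: {kind}")


if __name__ == "__main__":
    zero_or_one = ("union", ("char", "0"), ("char", "1"))
    print(sorted(language(("star", zero_or_one), 2)))
    print(sorted(language(("concat", zero_or_one, ("char", "1")), 2)))
```

The input is always an expression (syntax) and the output is always a set of strings (semantics), which is the separation the meaning function is meant to make visible.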
Why use a meaning function?
1- It makes clear what is syntax and what is semantics.
2- It allows us to consider notation as a separate issue (for
example, Roman numerals versus Arabic numerals).
X- Lexical specification
1- keywords: ‘if’ + ‘else’ + ‘then’
They’re specified by having single quotes around them; a quoted string
such as ‘if’ abbreviates the concatenation ‘i’ ‘f’.
2- integers: non-empty strings of digits
digit = ‘0’ + ‘1’ + ‘2’ + ‘3’ + ‘4’ + ‘5’ + ‘6’ + ‘7’ + ‘8’ + ‘9’
digit^+ = digit digit* (one digit followed by 0 or more digits)
3- identifiers: strings of letters or digits, starting with a letter
letter = ‘a’ + ‘b’ + ‘c’ + ‘d’ +......
letter = [a-zA-Z]: a range such as [a-z] is the union of all
single-character regular expressions beginning with a (the first
character) and ending with z (the second character).
digit = ‘0’ + ‘1’ + ‘2’ + ‘3’ + ‘4’ + ‘5’ + ‘6’ + ‘7’ + ‘8’ + ‘9’
Identifiers: letter(letter + digit)*
4- whitespace: non-empty sequence of blanks, newlines, or tabs
Whitespace: (‘ ’ + ‘\n’ + ‘\t’)^+
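The four token classes above translate directly into Python regular expressions. This is a sketch, with the constant names and the helper in_class invented for the illustration; union is written | instead of +, and Python's postfix + means "one or more".

```python
import re

DIGIT = r"[0-9]"
LETTER = r"[a-zA-Z]"

KEYWORD = r"if|else|then"                        # 'if' + 'else' + 'then'
INTEGER = rf"{DIGIT}+"                           # digit digit*
IDENTIFIER = rf"{LETTER}(?:{LETTER}|{DIGIT})*"   # letter (letter + digit)*
WHITESPACE = r"(?: |\n|\t)+"                     # non-empty run of blanks, newlines, tabs


def in_class(pattern, lexeme):
    """True when the whole lexeme belongs to the token class."""
    return re.fullmatch(pattern, lexeme) is not None


if __name__ == "__main__":
    print(in_class(INTEGER, "42"))      # True
    print(in_class(IDENTIFIER, "x1"))   # True
    print(in_class(IDENTIFIER, "1x"))   # False: must start with a letter
    print(in_class(IDENTIFIER, "if"))   # True: also a keyword
```

The last line shows that if belongs to both the keyword and the identifier class, so a lexer built from this specification needs a priority rule between classes.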
Examples:
1- an email address with a three-part host name:
letter^+ ‘@’ letter^+ ‘.’ letter^+ ‘.’ letter^+
2- how numbers are defined in the Pascal programming language:
num = digits opt_fraction opt_exponent
digits = digit^+
opt_fraction = (‘.’ digits) + ε
opt_exponent = (‘E’ (‘+’ + ‘-’ + ε) digit^+) + ε
Notes:
i- (‘.’ digit^+) + ε is the same as (‘.’ digit^+)?
ii- (‘E’ (‘+’ + ‘-’ + ε) digit^+) + ε is the same as
(‘E’ (‘+’ + ‘-’)? digit^+)?
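The number definition can be tried out with Python's re module, using the ? shorthand from the notes. This is a sketch of the regular expression as written here, not a complete account of Pascal's numeric literals; the constant names and is_num are hypothetical.

```python
import re

DIGITS = r"[0-9]+"                        # digit^+
OPT_FRACTION = rf"(?:\.{DIGITS})?"        # ('.' digits)?
OPT_EXPONENT = rf"(?:E[+-]?{DIGITS})?"    # ('E' ('+' + '-')? digits)?
NUM = re.compile(DIGITS + OPT_FRACTION + OPT_EXPONENT)


def is_num(text):
    """True when the whole text matches num = digits opt_fraction opt_exponent."""
    return NUM.fullmatch(text) is not None


if __name__ == "__main__":
    for sample in ["42", "3.14", "6.02E+23", "1E-5", "3.", ".5", "1E"]:
        print(sample, is_num(sample))
```

The last three samples are rejected: a fraction needs digits after the point, a number needs digits before it, and an exponent needs at least one digit.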
In short:
•Regular expressions describe many useful regular languages,
such as phone numbers, file names, and emails.
• At least one: 𝐴^+ ≡ 𝐴𝐴*
• Union: 𝐴| 𝐵 ≡ 𝐴 + 𝐵
• Option: 𝐴? ≡ 𝐴 + ε
• Range: ‘a’ + ’b’ +…+ ’z’ ≡ [a-z]
• Excluded range: complement of [a-z] ≡ [^a-z]
2- DFA:
Examples:
i- for ε
ii-A + B:
iii-A*: