PoCD Chapter 01 Handouts 2024-25
Chapter 01 – Introduction to
Compilers
PoCD Team
School of Computer Science & Engineering
2024 - 25
Principles of Compiler Design
Figure:2.1
The input to a compiler may be produced by one or more preprocessors, and further processing of the compiler's output may be needed before runnable machine code is obtained.
Preprocessors produce input to compilers. They may perform the following functions:
Macro processing: A preprocessor may allow a user to define macros that are shorthands for longer constructs.
File inclusion: A preprocessor may include header files in the program text. For example, the C preprocessor causes the contents of the file global.h to replace the statement #include <global.h> when it processes a file containing this statement.
Assembler: Some compilers produce assembly code that is passed to an assembler for further processing. Other compilers perform the job of the assembler themselves, producing relocatable machine code that can be passed directly to the loader/link-editor.
Assembly code is a mnemonic version of machine code, in which names are used
instead of binary codes for operations, and names are also given to memory addresses.
3. Compiler:
A compiler is software that translates a program from one language into another. It takes as input a program written in its source language, a high-level language such as FORTRAN, C, or C++, and produces an equivalent output program in its target language, usually the machine language (sometimes called object code) of the computer on which the program is to be executed. This can be represented graphically as shown in Figure: 3.1
Figure:3.1
Why a compiler:
Initially, programs were written in machine language: numeric codes that represented the actual machine operations to be performed. For example,
C7 06 0000 0002
represents the instruction to move the value 2 to the location 0000. Writing such code is extremely time consuming and tedious, so this form of coding was replaced by assembly language, in which instructions and memory locations are given symbolic forms. For example,
MOV x, 2
An assembler translates the symbolic codes and memory locations of assembly language into the corresponding numeric codes of machine language. Assembly language improved the speed and accuracy of programming, but it is still not easy to write and is difficult to read and understand. Moreover, assembly language is extremely dependent on the particular machine for which it was written, so code written for one machine must be completely rewritten for another. There was therefore a need for a programming language resembling mathematical notation or natural language, independent of any one machine, yet itself capable of being translated by a program into executable code. For example, the previous instruction can be written in a concise, machine-independent form as
x=2
FORTRAN, the first machine-independent language, and its compiler were developed by a team at IBM between 1954 and 1957; most of the processes involved in translating programming languages were not well understood at that time. Noam Chomsky's study of the structure of natural languages later made compiler construction considerably easier and even capable of partial automation.
Other programs used together with the compiler include interpreters, assemblers, linkers, loaders, preprocessors, editors, debuggers, profilers, and project managers.
4. The translation process
A compiler consists of a number of phases, as shown in Figure: 4.1
Figure :4.1
The scanner:
This phase reads the source program, which arrives as a stream of characters, and performs lexical analysis: it groups the character sequence into meaningful units called tokens. Consider the C language assignment statement
a[index] = 4 + 2
This statement contains 12 nonblank characters but only 8 tokens.
The tokens produced are shown in Table :4.1
Table :4.1
Along with generating tokens, the scanner also enters identifiers into the symbol table and literals into the literal table.
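The grouping of characters into tokens can be sketched with a small regex-driven scanner. This is a minimal illustration, not the handout's implementation; the token names and patterns are our own choices.

```python
import re

# Illustrative token patterns for the running example (names are our own).
TOKEN_SPEC = [
    ("number",     r"[0-9]+"),
    ("identifier", r"[A-Za-z_][A-Za-z0-9_]*"),
    ("lbracket",   r"\["),
    ("rbracket",   r"\]"),
    ("assign",     r"="),
    ("plus",       r"\+"),
    ("skip",       r"\s+"),
]
MASTER = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in TOKEN_SPEC))

def scan(source):
    """Return (token_name, lexeme) pairs, skipping whitespace."""
    tokens = []
    for m in MASTER.finditer(source):
        if m.lastgroup != "skip":
            tokens.append((m.lastgroup, m.group()))
    return tokens

tokens = scan("a[index] = 4 + 2")
```

Running the scanner on the statement yields exactly the 8 tokens counted above: the 12 nonblank characters collapse into `a`, `[`, `index`, `]`, `=`, `4`, `+`, `2`.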
The parser:
The tokens generated by the scanner act as input to the parser, which performs syntax analysis: it determines the structural elements of the program and their relationships. The output of syntax analysis, shown in Figure: 4.2, is represented as a parse tree or syntax tree. The assignment above consists of a subscripted expression on the left and an integer arithmetic expression on the right, and this structure can be represented as a parse tree in the following form.
Figure:4.2
The internal nodes of the parse tree are labelled by the names of the structures they represent, and its leaves represent the sequence of tokens from the input. A parse tree is an inefficient representation of this structure, so the parser usually generates instead a syntax tree, also called an abstract syntax tree. The abstract syntax tree for the above assignment statement is shown in Figure: 4.3
Figure:4.3
In the syntax tree, many of the parse-tree nodes have disappeared. For example, once we know that an expression is a subscript operation, it is no longer necessary to keep the brackets [ and ].
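The abstract syntax tree for the assignment could be sketched with simple node classes. The node names (Assign, Subscript, Add, Num, Name) are our own illustrative choices, not fixed by the handout; note that the brackets themselves no longer appear as nodes.

```python
from dataclasses import dataclass
from typing import Union

@dataclass
class Num:
    value: int

@dataclass
class Name:
    ident: str

@dataclass
class Subscript:
    array: Name
    index: Name

@dataclass
class Add:
    left: "Expr"
    right: "Expr"

@dataclass
class Assign:
    target: Subscript
    value: "Expr"

Expr = Union[Num, Name, Subscript, Add]

# AST for: a[index] = 4 + 2
tree = Assign(Subscript(Name("a"), Name("index")), Add(Num(4), Num(2)))
```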
The semantic analyzer:
This phase deals with the meaning (semantics) of the program rather than its syntax or structure. The semantic analyzer checks the static features of a program, those that can be determined before execution but cannot be conveniently expressed as syntax and analyzed by the parser; the remaining, dynamic semantics determine runtime behavior. Figure: 4.4 shows the tree generated by the semantic analyzer for the assignment statement, with the meanings it associates.
Figure:4.4
The source code optimizer:
Compilers include code improvement, or optimization, steps. Optimization is performed after the semantic analysis phase, and many optimizations can be done at the source level. Individual compilers vary not only in the kinds of optimization performed but also in the placement of the optimization phases. For the assignment statement, a source-level optimization is to precompute the expression 4 + 2 at compile time, yielding the result 6. This optimization is performed directly on the syntax tree, as shown in Figure: 4.5, by collapsing the right-hand subtree of the root node to a constant value; this is called constant folding.
Figure:4.5
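Constant folding can be demonstrated on a tuple-based syntax tree. This is a hedged sketch under our own tree encoding, not the handout's data structure: each interior node is a tuple whose first element is the operator.

```python
# Represent the syntax tree as nested tuples: ("assign", target, value),
# ("+", left, right), etc.; leaves are ints or identifier strings.

def fold(node):
    """Collapse constant subexpressions (constant folding), bottom-up."""
    if not isinstance(node, tuple):
        return node                              # leaf: number or identifier
    op, *children = node
    children = [fold(c) for c in children]
    if op == "+" and all(isinstance(c, int) for c in children):
        return children[0] + children[1]         # precompute, e.g. 4 + 2 -> 6
    return (op, *children)

tree = ("assign", ("subscript", "a", "index"), ("+", 4, 2))
folded = fold(tree)
```

The right-hand subtree `("+", 4, 2)` collapses to the constant 6, exactly as in Figure: 4.5.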
Other optimizations are performed on intermediate code, a form of code representation between source code and object code; two common forms are three-address code and P-code.
In the three-address scheme, each instruction refers to at most three addresses (memory locations), as shown in Figure: 4.6
Figure:4.6
The intermediate code is also called the intermediate representation, or IR.
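The lowering of an expression tree into three-address code can be sketched as follows. The temporary names t1, t2, … and the tuple-based tree encoding are our own conventions for illustration; each emitted instruction names at most three addresses.

```python
import itertools

def lower(tree):
    """Lower a tuple-based expression tree to a list of
    three-address instructions; return (code, result_name)."""
    code = []
    temps = itertools.count(1)

    def walk(node):
        if not isinstance(node, tuple):
            return node                          # leaf: name or constant
        op, left, right = node
        l, r = walk(left), walk(right)
        temp = f"t{next(temps)}"
        code.append(f"{temp} = {l} {op} {r}")    # at most three addresses
        return temp

    result = walk(tree)
    return code, result

code, result = lower(("+", ("*", "b", "c"), 4))
```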
Figure:4.7
The target code optimizer:
In this phase the compiler improves the target code generated by the code generator. For example, the improvements can include:
• Choosing addressing modes to improve performance
Figure:4.8
globally to the program, and a constant or string will appear only once in this table. The literal table reduces the size of the program in memory by reusing constants and strings. It is also used by the code generator to construct symbolic addresses for literals and for entering data definitions in the target code file.
Intermediate code:
The intermediate code may be kept as an array of text strings, a temporary text file, or a linked list of structures, depending on the kind of representation (three-address code or P-code) and the kind of optimization performed.
Temporary files:
Early computers did not possess enough memory for an entire program to be kept in memory during compilation, so temporary files were used to hold the results of intermediate steps.
5. Chomsky Hierarchy:
When the first compilers were under development, Noam Chomsky's findings on natural languages made compiler construction easier and even partially automatable. Chomsky introduced a classification of languages according to the complexity of their grammars and the power of the algorithms needed to recognize them. The Chomsky hierarchy consists of four levels of grammars, called type 0, type 1, type 2, and type 3, as shown in Figure: 5.1. Type 2, or context-free, grammars are the standard way to represent the structure of programming languages. Regular expressions and finite automata correspond to type 3, or regular, grammars and are closely associated with type 2.
Figure :5.1
Passes:
The compiler may process the entire source program several times, each traversal being referred to as a pass, before generating code. Compilers may be one-pass or multi-pass, depending on the level of optimization.
Figure: 8.1
Upon receiving a "get next token" command from the parser, the lexical analyzer reads input characters until it can identify the next token. The main function of lexical analysis is to read input characters and produce tokens, interacting with the symbol table as it does so. Its other functions are stripping out comments and white space from the source program, correlating error messages from the compiler with the source program, and macro expansion.
Sometimes lexical analyzers are divided into a cascade of two phases, the first called scanning and the second lexical analysis proper. Some advantages of keeping these phases separate are:
1.Simplicity
2.Efficiency
3.Portability
Terminology:
Lexeme: A sequence of input characters that comprises a single token is called a lexeme.
Eg: float, total, =
Token: A lexeme identified as valid using predefined rules.
Eg: Identifier, String, Keyword
Pattern: A rule describing the set of strings that form a token.
Some tokens, and the lexemes and patterns corresponding to them, are shown in Table: 8.1
Table:8.1
Attributes for tokens:
The lexical analyzer must provide the subsequent phases of the compiler with additional information about the particular lexeme that matched. The lexical analyzer collects information about tokens into their associated attributes. Usually a token has only a single attribute: a pointer to the symbol-table entry in which the information about the token is kept. For example, for a token of type identifier we may require both the lexeme and its associated line number; both pieces of information can be stored in the symbol-table entry for the identifier.
Consider the statement E = M * C ** 2
The tokens and associated attribute values are as follows
<id, pointer to symbol table entry E>
<assign_op,>
<id, pointer to symbol table entry M>
<mul_op,>
<id, pointer to symbol table entry C>
<exp_op,>
<num, integer value 2>
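The token/attribute pairs for this statement can be produced by a small sketch in which identifier attributes are indices into a symbol table. The token names follow the list above; the scanner itself and the use of integer indices as "pointers" are our own simplifications.

```python
import re

# Hedged sketch: token/attribute pairs for "E = M * C ** 2".
# The symbol table maps each identifier to an entry index,
# standing in for a pointer to the symbol-table entry.
symtab = {}

def token_stream(source):
    tokens = []
    for lexeme in re.findall(r"\*\*|[A-Za-z]\w*|\d+|[=*]", source):
        if lexeme == "=":
            tokens.append(("assign_op", None))
        elif lexeme == "**":
            tokens.append(("exp_op", None))
        elif lexeme == "*":
            tokens.append(("mul_op", None))
        elif lexeme.isdigit():
            tokens.append(("num", int(lexeme)))        # attribute: integer value
        else:
            entry = symtab.setdefault(lexeme, len(symtab))
            tokens.append(("id", entry))               # attribute: symtab index
    return tokens

tokens = token_stream("E = M * C ** 2")
```

Note that the operator tokens carry no attribute, while `id` and `num` tokens do, mirroring the list above.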
9. Lexical errors:
There are three main types of compile-time errors: lexical, syntactic, and semantic errors.
A lexical error is an input that must be rejected by the scanner. Types of lexical errors include:
A sequence of characters that cannot be scanned into any valid token.
Misspelling of identifiers, keywords, or operators.
For example, suppose the string ‘fi’ is encountered in a C program as
fi(a==f(x))…..
The scanner cannot tell whether fi is a misspelling of the keyword if or an undeclared function identifier. Since fi is a valid lexeme for the token identifier, the scanner must return the token id to the parser, and it does so. The error is therefore not identified by the scanner; a later phase of the compiler, probably the parser, will recognize it.
The scanner uses a method called panic-mode error recovery whenever none of the patterns for tokens matches any prefix of the remaining input and the scanner is unable to proceed. In this method, characters are deleted successively from the remaining input until the lexical analyzer can find a well-formed token at the beginning of what remains.
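Panic-mode recovery can be sketched as follows. The toy token pattern is our own assumption; whenever no token pattern matches at the current position, one character is discarded and scanning continues.

```python
import re

# A toy token pattern (identifiers, numbers, a few operators) -- our own choice.
TOKEN = re.compile(r"[A-Za-z_]\w*|\d+|[=+()]")

def scan_with_panic(source):
    """Panic-mode sketch: when no token pattern matches a prefix of the
    remaining input, delete characters until one does."""
    tokens, errors, pos = [], [], 0
    while pos < len(source):
        if source[pos].isspace():
            pos += 1
            continue
        m = TOKEN.match(source, pos)
        if m:
            tokens.append(m.group())
            pos = m.end()
        else:
            errors.append(source[pos])   # discard one offending character
            pos += 1
    return tokens, errors

tokens, errors = scan_with_panic("x = @# 42")
```

The scanner recovers and still produces the well-formed tokens around the garbage characters.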
10.Regular Expression:
Basic Definitions:
Alphabets, Languages and Grammar
An alphabet is a finite, non-empty set of symbols, usually denoted by Σ
Examples:
∑ = {0, 1} , the binary alphabet
∑ = {a, b, c, ….z}, set of all lower case letters
Strings
Strings are finite sequences of symbols chosen from the alphabet. Example: 0110 is a
string from binary alphabet ∑ = {0, 1}
Strings are denoted by u, v, w, x, y, z. Example u = 10110
Empty String
The string with zero occurrences of symbols. It is denoted by Є
Length of a string
The number of symbols in a string
Notation: |w|
|00101| = 5
u = 101 and |u| = 3
For the empty string, the length is 0
Reverse of a string
w = ab then wR = ba
SoCSE Page | 18
Principles of Compiler Design
Concatenation of strings
If w1 and w2 are two strings, then their concatenation is w1.w2
w.Є = Є.w = w for all w
In general, w1.w2 ≠ w2.w1
The length of a concatenation is the sum of the lengths of the two strings: |w1.w2| = |w1| + |w2|
Powers of an alphabet
If ∑ = {a, b, c}
Then
∑0 = {Є}
∑1 = {a, b, c}
∑2 = {aa, bb, cc, ab, ac, ba, bc, ca, cb}
And so on…
∑* = ∑0 U ∑1 U ∑2 U ∑3 U ….
∑* = {a, b, c} * = {Є, a, b, ab, aa, abb, aab,…}
The closure ∑* contains Є, but the positive closure ∑+ does not.
∑+ = ∑* - {Є}
∑+ = ∑1 U ∑2 U ∑3 U ….
∑* = ∑+ U {Є}
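The powers of an alphabet and a finite prefix of the Kleene closure can be checked in Python; the helper `power` is our own.

```python
from itertools import product

def power(sigma, k):
    """All strings of length k over alphabet sigma (i.e. sigma to the power k)."""
    return {"".join(p) for p in product(sigma, repeat=k)}

sigma = {"a", "b", "c"}
sigma2 = power(sigma, 2)                 # should have 3*3 = 9 strings

# Finite prefix of the Kleene closure: sigma* restricted to length <= 2.
star_upto_2 = set().union(*(power(sigma, k) for k in range(3)))
```

As in the text, ∑2 for a three-symbol alphabet contains 9 strings, and the closure prefix contains Є (the empty string) plus the 3 + 9 longer strings.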
Languages
A set of strings, all of which are chosen from some ∑* where ∑ is a particular alphabet, is called a language.
If ∑ is an alphabet and L ⊆ ∑*, then L is a language over ∑.
Examples:
The language of all strings consisting of n 0's followed by n 1's for some n ≥ 0 is
{ Є, 01, 0011, 000111, … }
The set of strings of 0’s and 1’s with an equal number of each is
{ Є , 01, 0011, 0101, 1001,…}
∑* is a language for any alphabet ∑
Ø, the empty language, is a language over any alphabet
{ Є }, the language consisting of only the empty string, is also a language over any alphabet.
Note that Ø ≠ { Є }: the first contains no strings, while the second contains one.
A regular expression r represents a pattern of strings of characters and is completely defined by the set of strings that it matches. This set is called the language generated by the regular expression, written L(r). The language depends on the character set in use, generally the ASCII characters or some subset of them; it may also be generic, in which case the elements of the set are referred to as symbols. This set of symbols is called the alphabet and is denoted by the Greek symbol Σ. A regular expression may itself consist of a character from the alphabet, but with a different meaning: within a regular expression, all symbols indicate patterns. (The context will be made clear during the discussion.)
A regular expression may also contain characters that have special meanings. Such characters are called metacharacters or metasymbols. A metacharacter may not be a legal character in the alphabet, and it then needs to be distinguished from the other characters by an escape character, which "turns off" the special meaning; examples are the backslash and quotes.
First we describe the basic regular expressions, and then the operations that generate new regular expressions from existing ones.
Basic Regular Expressions
1. Single characters from the alphabet, which match themselves.
Example: for a character a from Σ, we indicate that the regular expression a matches the character a by writing L(a) = {a}.
Special cases:
2. The empty string ℇ (epsilon), the string that contains no characters; the language generated is L(ℇ) = {ℇ}.
3. The empty set Φ, which matches no string at all; this language is the empty set { }, and we write L(Φ) = { }.
2. Concatenation:
The concatenation of two regular expressions r and s is written rs, and it matches any string that is the concatenation of two strings, the first of which matches r and the second of which matches s.
1. The regular expression ab matches only the string ab, i.e. L(ab) = L(a)L(b) = {ab}
2. The regular expression (a|b)c matches the strings ac and bc (parentheses will be discussed in later sessions), i.e. L((a|b)c) = L(a|b)L(c) = {a,b}{c} = {ac, bc}
Concatenation extends to more than two regular expressions.
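Both examples can be verified with Python's re module, which uses the same concatenation and alternation notation; the helper `language_of` is our own.

```python
import re

def language_of(pattern, candidates):
    """The subset of candidate strings matched exactly by the regular expression."""
    return {s for s in candidates if re.fullmatch(pattern, s)}

# L(ab) = {ab}; L((a|b)c) = {ac, bc}
l_ab  = language_of("ab", {"a", "b", "ab", "ba"})
l_abc = language_of("(a|b)c", {"ac", "bc", "c", "abc"})
```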
3. Repetition: The repetition operation on a regular expression, also called Kleene closure, is written r*, where r is a regular expression. r* matches any finite concatenation of strings, each of which matches r. For example,
a* matches ℇ, a, aa, aaa, aaaa, …
For a set of strings S, * can be defined as
S* = {ℇ} U S U SS U SSS U …
Example:
(a|bb)* matches ℇ, a, aa, aaa, abb, bba, bbbb, aabb, and so on.
In terms of languages, L((a|bb)*) = L(a|bb)* = {a,bb}* = {ℇ, a, aa, aaa, abb, bba, bbbb, aabb, aabbbaa, bbaaa, …}
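The (a|bb)* example can be checked directly, since Python's re module supports the same * operator; note that a lone b is rejected because it is neither an a nor a complete bb.

```python
import re

star = re.compile(r"(a|bb)*")

def in_language(s):
    """True when s is a finite concatenation of 'a's and 'bb's."""
    return star.fullmatch(s) is not None

matches = [s for s in ["", "a", "abb", "bba", "bbbb", "b", "ab"] if in_language(s)]
```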
Note:
The same language can be generated by many different regular expressions; the aim is to find a small and efficient one.
Not all sets of strings that we can describe can be generated by a regular expression.
Example:
1. Unsigned decimal Integer:
digit=0|1|2|3|4|5|6|7|8|9
digitdigit*
2. Signed Integer:
digit=0|1|2|3|4|5|6|7|8|9
sign=+|-|ℇ
SignedInteger=sign digit digit*
3. Keywords:
Key_word=int | float | char
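Examples 1–3 can be transcribed into Python's regex syntax and tested; the definitions are the handout's, while the translation (escaping the + sign, writing the empty-sign alternative as an empty branch) is ours.

```python
import re

digit        = "[0-9]"
unsigned_int = re.compile(digit + digit + "*")            # digit digit*
signed_int   = re.compile("(\\+|-|)" + digit + digit + "*")  # sign = + | - | ℇ
keyword      = re.compile("int|float|char")

ok_unsigned = bool(unsigned_int.fullmatch("042"))
ok_signed   = bool(signed_int.fullmatch("-17")) and bool(signed_int.fullmatch("7"))
ok_keyword  = bool(keyword.fullmatch("float")) and keyword.fullmatch("floating") is None
```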
More Examples:
4. KLETU student three-digit roll number (ranging from 001 to 999)
d1= 0|1|2|3|4|5|6|7|8|9
d2= 1|2|3|4|5|6|7|8|9
Roll_no=00d2 | 0d2d1 |d2d1d1
6. Identifiers:
letter=a| b| c| d|………|z| A|B|C|………. |Z
digit=0 | 1 |2 | 3 |………. | 9
Identifier=letter(letter|digit)*
7. Set of all strings containing exactly one b over the alphabet Σ={a,b,c}
(a|c)*b(a|c)*
8. Set of all strings ending with abb over the alphabet Σ={a,b}
(a|b)*abb
9. Set of all strings starting with 0 and ending with 1 over the alphabet Σ= {0,1}
0(0|1)*1
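Examples 7–9 can be checked with Python's re module; the test strings are our own.

```python
import re

one_b    = re.compile(r"(a|c)*b(a|c)*")   # exactly one b over {a,b,c}
ends_abb = re.compile(r"(a|b)*abb")       # ends with abb over {a,b}
zero_one = re.compile(r"0(0|1)*1")        # starts with 0, ends with 1

checks = [
    bool(one_b.fullmatch("aacbcc")),
    one_b.fullmatch("abcb") is None,      # two b's: rejected
    bool(ends_abb.fullmatch("babb")),
    bool(zero_one.fullmatch("01011")),
    zero_one.fullmatch("10") is None,     # must start with 0
]
```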
1. Signed Integers:
digit=[0-9]
sign=+|-
SignedInteger=(sign)?digit+
2. Identifiers:
letter=[a-zA-Z]
digit=[0-9]
Identifier=letter(letter|digit)*
3. if and for
keyword=if | for
4. Regular expression for the language of strings starting and ending with a and having any
combination of b's in between
R = a b* a
5. Assume we would like our password to contain all of the following, but in no particular
order:
At least one digit [0-9]
At least one lowercase character [a-z]
At least one uppercase character [A-Z]
At least one special character [*.! @#$%^&(){}[]:;<>,.?/~_+-=|\]
At least 8 characters in length, but no more than 32.
(?=.*[0-9])(?=.*[a-z])(?=.*[A-Z])(?=.*[*.!@#$%^&(){}[]:;<>,.?/~_+-=|\]).{8,32}
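The password rule can be tested in Python. The lookahead structure (one `(?=...)` per requirement, followed by the length bound) is the standard way to express unordered requirements; the special-character class below is the handout's list, escaped for Python's regex syntax, and the sample passwords are our own.

```python
import re

password = re.compile(
    r"(?=.*[0-9])(?=.*[a-z])(?=.*[A-Z])"
    r"(?=.*[*.!@#$%^&(){}\[\]:;<>,.?/~_+\-=|\\])"
    r".{8,32}"
)

good      = bool(password.fullmatch("Secret#2024"))
no_upper  = password.fullmatch("secret#2024") is None
too_short = password.fullmatch("Ab1!") is None
```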
6. The valid pin code of India must satisfy the following conditions.
It can be only six digits.
It should not start with zero.
First digit of the pin code must be from 1 to 9.
Next five digits of the pin code may range from 0 to 9.
It should allow only one white space, but after three digits.
[1-9][0-9]{2}\s{0,1}[0-9]{3}
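The PIN code pattern can be verified with Python's re module; the sample codes are our own illustrations.

```python
import re

# First digit 1-9, two more digits, at most one space, then three digits.
pincode = re.compile(r"[1-9][0-9]{2}\s{0,1}[0-9]{3}")

valid   = [bool(pincode.fullmatch(s)) for s in ["110001", "110 001"]]
invalid = [pincode.fullmatch(s) is None for s in ["010001", "11 0001", "1100011"]]
```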
~*~*~*~*~*~*~*~*~*~*~*~