0% found this document useful (0 votes)
16 views

compilers_intro_jan2025

The document outlines the course CS 333 Compiler Design for Spring 2025, detailing its objectives, outcomes, and structure. It emphasizes the importance of compilers in computer engineering, the phases of compilation, and the complexity of software systems. Additionally, it covers the roles of lexical and syntax analysis in compiler design, along with examples and definitions relevant to programming languages.

Uploaded by

btech10903.22
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
16 views

compilers_intro_jan2025

The document outlines the course CS 333 Compiler Design for Spring 2025, detailing its objectives, outcomes, and structure. It emphasizes the importance of compilers in computer engineering, the phases of compilation, and the complexity of software systems. Additionally, it covers the roles of lexical and syntax analysis in compiler design, along with examples and definitions relevant to programming languages.

Uploaded by

btech10903.22
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 60

CS 333 Compiler Design

SP 2025

Dr. Indrajit Mukherjee ([email protected])


Dr. Bikas Kanti Sarkar ([email protected])
Dr. Supratim Biswas ([email protected])

Department of Computer Science and Engineering


BIT Mesra
Course Information

Read the basic course information : Course Objectives, Course Outcomes, CO-PO-PSO mapping; syllabus, from the home page.

Course Objectives
This course enables the students to:
1. Understand the need of compiler in Computer Engineering
2. Provide a thorough understanding of design, working, and implementation of
programming languages
3. Trace the major concept areas of language translation and compiler design
4. Create an awareness of the functioning and complexity of modern compilers

Course Outcomes
After the completion of this course, students will be:
1. Analyze the need of compiler for interfacing between user and machine
2. Explain the role of several phases of compilation process
3. Create an awareness of the function and complexity of modern compilers
4. Outline the major concept areas of languages translation and Compiler design
5. Develop a comprehensive Compiler for a given language
6. Apply knowledge for developing tool for natural language processing
Course Conduct

4 contact hrs / week : 3 lectures and 1 tutorial

Prescribed Evaluation scheme will be followed

Prepare for fill in the blanks type of questions in quizzes

Attendance norms of the institute apply – do not approach
the instructors for relaxation in attendance

Direct positive correlation observed in FLAT between active
presence and skill enhancement / better performace

Course and the associated laboratory course CS 334 will
be synchronized to reinforce the concepts and skills.
Distinctive Features of a Compilers Course

Unique course in CSE discipline. It bridges deep theoretical concepts to the most
recent advances in architectures and operating systems

Awareness of compiler features helps in increased programming productivity
(invest time and effort to know the capabilities of your compiler)

Two drivers for Research in compilers - design of new programming languages
and invention of new architectures.

Research has produced spectacular enhancements in technology. Designing a
working compiler 3 decades back required several human years.

Today, given m different languages and n different architectures, m*n compilers
can be generated in hours / days.

Research in compilers continues, specially for automatic optimization, automatic
parallelization of sequential programs.

The hupe leaps in hardware technology have not been efficiently expolited till date
because software tools have not been able to catch up with the native
computational powers of comtemporary High Performance Architectures.
Complexity of Software Systems
The following table gives an idea about the size and complexity of large
software systems from 3 highly popular domains.

The information is only meant to be indicative as the figures are significantly


larger today.

Name Category 1st No of No of Lines of White No of Main


Rel Dirs Files Code spaces Langs Language

Linux OS 1991 3830 45884 20659032 36% 18 C


4.9-rc.5 Kernel

Gcc Compiler 1987 5500 79499 10638200 33% 33 C, C++


6.2.0 Framework

Mysql DBMS 1995 1533 10656 3901791 29% 27 C++, C


5.7.16
What is a Compiler ?

A compiler is essentially a Translator from a High Level Language (HLL)
such as C, C++, Java to a Low Level Language (LLL) such as assembly
or machine language.

Source program Target Code


compiler
High Level Language Low level Language


The difference in the level of abstractions used by HLL and LLL are huge.
HLLs focus on application domain computations while LLLs depend on
the target architecture.

The translation process used in a compiler is non-trivial. A compiler slowly
brings down the HLL computations to instructions supprted by an
architecture.
Another view of a Compiler

Source program compiler Target Code
Front Back
End End

Intermediate Code

The front-end of a compiler performs HLL specific analyses and
produces semantically equivalent code (intermediate code)
whose abstraction level is significantly lower than HLL but still
higher than LLL.

The back-end of a compiler analyses the intermediate code
produced by the front-end and is responsible for generating
target code.
Compiler Phase and Pass

Phase and pass are two commonly used terms in compiler
design.

The process of compilation is carried out in distinct logical
steps, which are called phases.

A compiler typically has 7 phases : lexical analysis, syntax
analysis, semantic analysis, intermediate code generation,
run time environment, code optimization and code
generation

A pass denotes a processing of the entire source code or its
equivalent form. Industrial quality compilers use large
number of passes in their translation process.
Compiler and Support Software

A compiler is not an end to end software. It uses several
system software tools to accomplish its goal.
Source program
Preprocessed assembly
code code
pre-processor compiler assembler

Expand Macro,.. HLL to LLL assembly


to m/c code
object
executable code
in memory
m/c code
code linker
OS loader
Create process; Relocation,
loads code
give resources linking
in memory

output Object code; library


input
Other Models of Translation

A compiler is not the only way to translate code from HLL to
LLL.

Another mechanism, known as Intrepretation, is also used
popularly. Such software are called Interpreters.

In principle a HLL may be either compiled or interpreted, but
in practice some languages are compiled and some are
interpreted or same use both the models.

Difference between Compiler and Interpreter

Identify languages that are interpreted and languages that
are compiled.
THEME 1


COMPILERS : EXAMPLE DRIVEN

APPROACH
Simple C program
#include <stdio.h>

int main()

{

int a[1000], i, j;
 int sum = 10000;
 for (i = 0; i < 1000; i++) a[i] = i;
 for (i = 0; i < 1000; i++)

for (j = 0; j < i*i; j++)
 a[i]
= a[i] + a[j];
 printf(" sum : %d \n", sum);

}

C program – Manual Analysis

Let us manually analyse “firstprog.c”

What is the role played by the statement ? #include <stdio.h>

Someone has to supply the prototype of the i/o function printf()
in order to check whether the call is valid

Let us assume there exists a s/w program that will do this task.

In real world this is task of a tool called C pre-processor, named
as “cpp” which is called by C compiler when it encounters
“#include... “ and similar other statements.

For the manual processing we shall ignore this statement for
the present.
C program – Manual Analysis

The remaining statements in “firstprog.c”


The pretty indented display of the text in the earlier slide was due to an
editor which interpreted certain characters before displaying.

The contents of the raw text in the file is shown below, where the symbol,
↲ represents newline and↔ denotes a single white space. This is how a
compiler sees the program at the first instance.

↲int↔main()↲{↲int↔↔a[1000],↔i,↔j;↲↔↔int
↔sum↔=↔10000;↲↔↔for↔(i↔=↔0;↔i↔<↔1
000;↔i++)↔a[i]↔=↔i;↲ ...............................

The first task is then to break the stream of characters shown above into
meaningful units of language C.
Basic Elements of C program


Breaking the program string into meaningful words of the language give the
smallest logical units of the language.


Why is “main” a word and not “main()” is decided by the programming
language (PL) specifications and not by a compiler.

The elements in the red boxes are called tokens (lexemes) and a compiler first
converts the character stream into a sequence of tokens.

] , i , j ; int sum = 10000

; for ( i = 0 ; i < 1000

] , i , j ; int sum = 10000

int main ( ) { int a [ 1000


Lexical Analysis


The process of partitioning a program into its constituent smallest logical words
is called as Lexical Analysis.


Distinction between lexemes and tokens

Lexeme : instance of the smallest word

Examples : int main a sum for 1000 < ++ * = .......

Token : groups of similar lexemes

Examples : identifier = {i j a sum }

keyword = {main int for }

integer = {0 1000 10000}

operator = {< ++ * = }
Lexical Analysis

Lexemes Tokens and Patterns


How to describe the lexemes that form the token identifier

 Different PL have variations in the specification of an identifier :

string of alphanumeric characters with an alphabet as the first character

string of alphanumeric characters with an alphabet as the first character but
length is restricted to 31 characters

string of alphabet or numerals or underline (_) characters with an alphabet or
underline as the first character with length ≤ 31; characters beyond 31 are
ignored

Decscriptions as given above, may be formal or informal are called as patterns.
A pattern when described formally gives a precise description of the underlying
token. We are already familiar with the formal notation of regular expression
which us used to describe patterns.

The benefit of formal pattern description is that one can directly design a
recognizer (DFA) from the patterns.
Structure of a Program

Lexical analysis partitions an input program into a stream of tokens (reduces
 the size of the input).


The immediate next task is to determine if the consecutive tokens, when
 grouped together, describe some structure of the PL.


A PL defines the linguistic structures that constitute a program in the language.
For example, a C program is a collection of functions and a function has a
hierarchical structure as shown below.

The structure of func-defn is defined using sub-structures, ret-type, func-name,
arguments and body, as shown below.

func-defn

 ret-type func-name ( arguments ) { body }


Structure of a Program

Let us intuitively apply the ideas to the C program to discover its structure.


The input “int main() {“


partially matches the program structure as displayed in the tree below.

func-defn


ret-type func-name ( arguments ) { body }

int main ε


Structure of a Program

The structure of a function body is further elaborated into a block-statement

which in turn is specified by one or more declaration statements followed by a
list of statements.

func-defn

ret-type func-name ( arguments ) { body }

int main block-stmt


ε


decl-stmts stmt-list


Structure of a Program

The structure of a single declaration statement is used to further extend the
tree under decl-stmts – which comprises of a type specifier followed by a list of

variable declarations.
func-defn

ret-type func-name ( arguments ) { body }

int main block-stmt


ε

decl-stmts stmt-list

decl-stmt decl-stmts

 type var-list ;


The tree gets extended when the input examined is :

int main ( ) { int a [ 1000 ] , i , is shown in the next page.

Structure of a Program
func-defn

 ret-type func-name ( arguments ) { body }

int main block-stmt


ε

decl-stmts stmt-list

decl-stmt decl-stmts

type var-list ;

int var , var-list

array var , var-list


id [ num ] id var , var-list


a 1000 i
Tree for Validating Program Structure

  process of discovering language constructs in a stream of input tokens is


The
called Syntax Analysis. Syntax analysis uses lexical analysis to convert
lexemes
 to tokens and works on tokens only.

A tree is a natural data structure to represent the structure of a program and is
constructed incrementally as input tokens are examined one by one.

Such a tree, often called a Parse Tree or an Abstract Syntax Tree (AST), is
used for representation of discovered programming langauge structures.

The tokens lie at the leaf nodes of the tree while the internal nodes are the
names of language features rooted at this node.

The structural linguistic features have to be clearly specified to the compiler
designer.

A CFG meets the requirement for specifying the syntactic aspects of language
constructs.
CFG for Syntax Specification

  an illustration, we write a CFG underlying the structure validation process


As

G = (N, T, P, S) where N = { func-defn, ret-type, func-name, arguments, body,

block-stmt, decl-stmts, stmt-list, decl-stmt, type, var-list, var, array}

T has the following symbols and tokens : ( ) { } int , ; id [ ] num

The nonterminal func-defn is the start symbol S

The production rules follow :

func-defn →ret-type func-name ( arguments ) { body }

ret-type → int | void | ε func-name → id

arguments → arg-list | ε body → block-stmt | ε

block-stmt → decl-stmts stmt-list decl-stmts → decl-stmt | ε

decl-stmt →type var-list ; var-list →var , varlist

var → array | id array → id [num ]
Understanding the Meaning


Semantic Analysis


Semantic Analysis


Semantic Analysis


Intermediate Code Generation


Intermediate Code Generation


Intermediate Code Generation


Intermediate Code Generation


Intermediate Code Generation


Run Time Environment


Run Time Environment


Run Time Environment


Run Time Environment


Compiler Optimizations

Intermediate code optimization is an optional phase which is enabled by
an user if so required.

Optimization
Front end Phase Back end
Int code Int code

Source

program Target
program


Purpose is to improve the quality of intermediate code generated by the
front end of the compiler.

Illustrate the capability of this phase through examples.
Compiler Optimizations

Consider the following C program.
int main()

{ int a[1000], i, j;
int sum = 10000;
for (i = 0; i < 1000; i++) a[i] = i;
for (i = 0; i < 1000; i++)
for (j = 0; j < i*i; j++) a[i] = a[i] + a[j];
printf(" sum : %d \n", sum);
}
Compiler Optimizations

Compile this program with
$ gcc -S -fverbose-asm firstprog.c
int main()

{ int a[1000], i, j;
int sum = 10000;
for (i = 0; i < 1000; i++) a[i] = i;
for (i = 0; i < 1000; i++)
for (j = 0; j < i*i; j++) a[i] = a[i] + a[j];
printf(" sum : %d \n", sum);
}
Compiler Optimizations

Compile this program with
$ gcc -S -fverbose-asm firstprog.c -o unopt.s
Now compile using the O2 switch

 $ gcc -S -fverbose-asm firstprog.c -O2 -o opt.s



Compare the two assembly code side by side and
observe.

You may be surprised by the second assembly code.
Manual Optimization
Consider the following C++ program :
int main()
{ int a = 2, b = 3, c = 40, d, i, j;
int x[10]={10, 20, 30, 40, 50, 60, 70, 80, 90, 100};
for ( i = 1; i < 11; i+=2 )
{ if ( i%2) { d = a * b; x[i] = x[i-1] + d; c = b*100; }
else {d = a * a; x[(i+d)%10] = x[i]+x[i-1];} };
if (a < 2) for (j = 1; j < 11; j++) x[(j+5)%10] = x[j] + 5;
else for (j = 0; j < 10; j++) cout << x[j] << " ";
cout << endl;
return 0;
}
Constant Evaluation
Task 1 : Evaluate compile time constants. What are the savings ?
int main()
{ int a = 2, b = 3, c = 40, d, i, j; int x[10]={10, 20, 30, 40, 50, 60, 70, 80, 90, 100};
for ( i = 1; i < 11; i+=2 )
{ if ( i%2)
{ d = 6; // d = a * b;
x[i] = x[i-1] + d;
c = 300; // c = b*100;
}
else { d = 4; // d = a * a;
x[(i+d)%10] = x[i]+x[i-1];} };
if (False ) for (j = 1; j < 11; j++) x[(j+5)%10]=x[j]+5; // 2 < 2
else for (j = 0; j < 10; j++) cout << x[j] << " "; cout << endl;
return 0;
}
Constant Propagation
Task 2 : Evaluate compile time constants. What are the savings ?
int main()
{ int a = 2, b = 3, c = 40, d, i, j; int x[10]={10, 20, 30, 40, 50, 60, 70, 80, 90, 100};
for ( i = 1; i < 11; i+=2 )
{ if ( i%2)
{ d = 6; // d = a * b;
x[i] = x[i-1] + 6; // substitute 6 for d
c = 300; // c = b*100;
}
else { d = 4; // d = a * a;
X [ (i+4) % 10] = x[i] + x[i-1];} };
if (False ) for (j = 1; j < 11; j++) x[(j+5)%10]=x[j]+5; // 2 < 2
else for (j = 0; j < 10; j++) cout << x[j] << " "; cout << endl;
return 0;
}
Loop Invariant Code Motion
Task 3 : Move loop invariant code out of loops. What are the savings ?
int main()
{ int a = 2, b = 3, c = 40, d, i, j; int x[10]={10, 20, 30, 40, 50, 60, 70, 80, 90, 100};
d = 6; // d = a * b;
c = 300; // c = b*100;
d = 4; // d = a * a;
for ( i = 1; i < 11; i+=2 )
{ if ( i%2)
{ x[i] = x[i-1] + 6; }
else { x[(i+4) % 10] = x[i] + x[i-1];} };
if (False ) for (j = 1; j < 11; j++) x[(j+5)%10]=x[j]+5; // 2 < 2
else for (j = 0; j < 10; j++) cout << x[j] << " "; cout << endl;
return 0;
}
Live Variables & Dead Code Removal
Task 4 : Find variables that are live and remove dead code. What are the
savings?
int main()
{ int a = 2, b = 3, c = 40, d, i, j;
int x[10]={10, 20, 30, 40, 50, 60, 70, 80, 90, 100};
for ( i = 1; i < 11; i+=2 )
{ if ( i%2)
{ x[i] = x[i-1] + 6; }
else { x[(i+4) % 10] = x[i] + x[i-1];} };
if (False ) for (j = 1; j < 11; j++) x[(j+5)%10]=x[j]+5; // 2 < 2
else for (j = 0; j < 10; j++) cout << x[j] << " "; cout << endl;
return 0;
}
Remove Unreachable Code
Task 5 : Find variables that are unreachable, evaluate conditional
expressions in control flow constructs and remove unreachable code. What
are the savings?
int main()
{ int a = 2, b = 3, c = 40, d, i, j;
int x[10]={10, 20, 30, 40, 50, 60, 70, 80, 90, 100};
for ( i = 1; i < 11; i+=2 )
{ if ( i%2)
{ x[i] = x[i-1] + 6; }
else { x[(i+4) % 10] = x[i] + x[i-1];} };
for (j = 0; j < 10; j++) cout << x[j] << " "; cout << endl;
return 0;
}
Remove Unused Definitions
Task 6 : Find the definitions that are not used in the program. What are the
savings?
int main()
{ // int a = 2, b = 3, c = 40, d,
int i, j;
int x[10]={10, 20, 30, 40, 50, 60, 70, 80, 90, 100};
for ( i = 1; i < 11; i+=2 )
{ if ( i%2)
{ x[i] = x[i-1] + 6; }
else { x[(i+4) % 10] = x[i] + x[i-1];} };
for (j = 0; j < 10; j++) cout << x[j] << " "; cout << endl;
return 0;
}
Final Optimized Program
Task 6 : Find the definitions that are not used in the program. What
are the savings?
int main()
{ int i, j;
int x[10]={10, 20, 30, 40, 50, 60, 70, 80, 90, 100};
for ( i = 1; i < 11; i+=2 )
{ if ( i%2)
{ x[i] = x[i-1] + 6; }
else { x[(i+4) % 10] = x[i] + x[i-1];} };
for (j = 0; j < 10; j++) cout << x[j] << " "; cout << endl;
return 0;
}
Code Generation


Code Generation


Compiler Concepts Learned Through Example


THEME 2


STRUCTURE OF A COMPILER

Design of a Compiler


Regular expressions: Finite Automata : Lexical Analyser


Context free grammars : Pushdown automata : Parser or
Syntax Analyser


Nested Symbol Tables : Scope and lifetime analysis :

Data structure and algorithms


Memory layout and activation records : Runtime

environments : Architecture and OS
Design of a Compiler

Semantic Analysis [Context Sensitive Issues such as Type
checking; intermediate code generation] : Syntax directed
translation scheme : Tree, graph; Data structure and algorithms.

Parameter passing mechanisms : Semantic Analyser :
Programming Languages.

Optimization of intermediate code : Optional pass : graph, lattice
theory, solution of equations; Discrete Structures.

Code generation : Compiler back-end design: Algorithms and
complexity; Architecture, Assembly language, OS.


Theory and Practice are intimately connected
THEME 3


Experimentation With Gnu C Compiler

Open Source Production Quality Compiler
Simple C program
#include <stdio.h>

int main()

{

int a[1000], i, j;
 int sum = 10000;
 for (i = 0; i < 1000; i++) a[i] = i;
 for (i = 0; i < 1000; i++)

for (j = 0; j < i*i; j++)
 a[i]
= a[i] + a[j];
 printf(" sum : %d \n", sum);

}

Compiler Features

Most compilers provide a host of features for programming ease and
enhanced productivity.

Illustrated with Gnu C compiler (gcc) & C++ compiler (g++)

List a few compiler options that are generally useful. Read

the online manual, man gcc or info gcc for more details.

-fverbose-asm -S -c -o

-v -O2 -E

-fump-tree-all and many many others

Experiments to illustrate the working of a compiler through a simple
example.


Checking Correctness
The output produced after the 6 optimizations are
performed in the order discussed are as follows.
10 16 30 36 50 56 70 76 90 96 original source
10 16 30 36 50 56 70 76 90 96 after optimization 1
10 16 30 36 50 56 70 76 90 96 after optimizations 1,2
10 16 30 36 50 56 70 76 90 96 after optimization 1 to 3
10 16 30 36 50 56 70 76 90 96 after optimization 1 to 4
10 16 30 36 50 56 70 76 90 96 after optimization 1 to 5
10 16 30 36 50 56 70 76 90 96 after optimization 1 to 6
10 16 30 36 50 56 70 76 90 96 after optimization by
gcc
Compiling for Dumps

Other necessary command line switches



-O2 -fdump-tree-all

-O3 enables -ftree-vectorize. Other flags must be enabled explicitl

Other useful options



Suffixing -all to all dump switches

-S to stop the compilation with assembly generation

-fverbose-asm to see more detailed assembly dump

You might also like