Compiler Design
May 2024
Addis Ababa,
Ethiopia
Objective of the Course
To learn basic techniques used in compiler construction such as lexical analysis, top-
down and bottom-up parsing, context-sensitive analysis, and intermediate code
generation.
To learn basic data structures used in compiler construction such as abstract syntax trees,
symbol tables, three-address code, and stack machines.
To learn software tools used in compiler construction such as lexical analyzer generators,
and parser generators.
Chapter One:
Introduction to Compiling
What is a Compiler?
A program that reads a program written in one language and translates it into an
equivalent program in another language.
Compiler vs Interpreter
Compiler: translates the entire source program into machine-readable instructions once,
before execution.
Interpreter: translates and executes the source instructions each time the program
is run.
Applications of compiler technology
⚫ Parsers for HTML in web browser
⚫ Machine code generation for high level languages
⚫ Software testing
⚫ Program optimization
⚫ Malicious code detection
⚫ Design of new computer architectures
Cousins of the Compiler
Preprocessor:
⚫ produces input for compiler
⚫ file inclusion, language extension, etc.
Assembler
⚫ assembly language into machine code
⚫ output of an assembler is called an object file
Linker
⚫ links and merges various object files to make an executable file.
⚫ determines the memory locations where the code will be loaded
Loader
⚫ loads executable files into memory and executes them.
⚫ It calculates the size of a program (instructions and data) and creates memory
space for it.
⚫ It initializes various registers to initiate execution.
Cross-Compiler
⚫ a compiler that runs on one platform (A) and generates executable code for another
platform (B).
Source-to-source Compiler
⚫ a compiler that translates the source code of one programming language into another.
Phases of a Compiler
Analysis
⚫ Machine Independent/Language Dependent
Synthesis
⚫ Machine Dependent/Language Independent
Analysis of the Source Program
1. Lexical / Linear Analysis (scanning)
⚫ Scans the source code as a stream of characters
⚫ Represent lexemes in the form of tokens as:
<token-name, attribute-value>
⚫ Token
smallest meaningful element that a compiler understands.
⚫ E.g. identifiers, keywords, literals, operators, and special symbols.
⚫ Blanks, new lines, comments will be removed from the source program.
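As a sketch, the scanning step can be mimicked with a small regex-based tokenizer; the token classes, keyword set, and regular expressions below are illustrative assumptions, not a full lexer.

```python
import re

# Each recognized lexeme is emitted as a <token-name, attribute-value> pair.
TOKEN_SPEC = [
    ("NUMBER", r"\d+"),           # numeric literal
    ("ID",     r"[A-Za-z_]\w*"),  # identifier (or keyword, resolved below)
    ("OP",     r"[+\-*/=]"),      # operator
    ("SKIP",   r"[ \t\n]+"),      # blanks and newlines are discarded
]
KEYWORDS = {"if", "while", "int", "float"}  # assumed toy keyword set

def tokenize(source):
    pattern = "|".join(f"(?P<{name}>{rx})" for name, rx in TOKEN_SPEC)
    tokens = []
    for m in re.finditer(pattern, source):
        kind, lexeme = m.lastgroup, m.group()
        if kind == "SKIP":
            continue              # strip whitespace, as the lexer does
        if kind == "ID" and lexeme in KEYWORDS:
            kind = "KEYWORD"      # keywords are reserved identifiers
        tokens.append((kind, lexeme))
    return tokens

print(tokenize("int x = 42 + y"))
```

A real scanner would also track line numbers and report invalid characters.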
3. Semantic Analysis
Certain checks are performed to make sure that the components of the program fit
together meaningfully.
Unlike parsing, this phase checks for semantic errors in the source program (e.g. type
mismatch)
- Type checking of various programming language constructs is one of the most
important tasks.
Stores type information in the symbol table or the syntax tree.
- Types of variables, function parameters, array dimensions, etc.
5. Code Optimization
Transforms the intermediate code by removing inefficiencies
Improves the code
a. Improvement may be in time, space, or power consumption.
It may change the structure of the program.
6. Code Generation
Converts intermediate code to machine code.
Must handle all aspects of machine architecture
Storage allocation decisions are made
a. Register allocation and assignment
Chapter 2:
Lexical Analysis
What is Lexical Analysis
The first phase of a compiler
The input is a high-level language program
The output is a sequence of tokens
Strips off blanks, tabs, newlines, and comments from the source program
Keeps track of line numbers
Classes of Tokens
⚫ Identifiers: names chosen by the programmer
⚫ Keywords: names already in the programming language
⚫ Separators: punctuation characters
⚫ Operators: symbols that operate on arguments and produce results
⚫ Literals: numeric, textual literals
Chapter 3
Syntax Analysis
Every language has rules for the syntactic structure of well-formed programs.
Takes the stream of tokens from the lexical analyzer and produces a parse tree.
Grammars
Every programming language has grammar rules
Parsers or syntax analyzers are generated for a particular grammar
CFGs are used for the syntax specification of programming languages
Context Free Grammar (CFG)
Is denoted as G = (N, T, P, S), where:
N : finite set of non-terminals
T : finite set of terminals
S ∈ N : the start symbol
P : finite set of productions, each of the form A → α, where A ∈ N and α ∈ (N ∪ T)*
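The four components can be sketched as plain data, here for the toy grammar E → E + E | id; the representation below is illustrative and not from any parser library.

```python
# G = (N, T, P, S) as plain Python data
N = {"E"}                              # non-terminals
T = {"id", "+"}                        # terminals
S = "E"                                # start symbol
P = {"E": [["E", "+", "E"], ["id"]]}   # productions: E -> E + E | id

# Apply one production: rewrite the first (leftmost) occurrence
# of the given non-terminal in a sentential form.
def apply(form, nonterminal, body):
    i = form.index(nonterminal)
    return form[:i] + body + form[i + 1:]

step1 = apply([S], "E", P["E"][0])     # E => E + E
step2 = apply(step1, "E", P["E"][1])   # => id + E
step3 = apply(step2, "E", P["E"][1])   # => id + id
print(step3)
```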
Derivations
Derivation of a terminal string from a non-terminal
A production is applied at each step of the derivation
Example: with the productions E → E + E and E → id, the derivation
E ⇒ E + E ⇒ id + E ⇒ id + id applies E → E + E, E → id, and E → id at steps 1, 2, and 3
respectively; it is read as "E derives id + id".
Derivation Trees
Derivations can be displayed as trees
Internal nodes of the tree are all non-terminals
Leaves are all terminals
The yield of a derivation tree is the list of the labels of all the leaves read from left to
right.
Leftmost and Rightmost Derivations
Leftmost Derivation
⚫ Apply a production only to the leftmost variable at every step
⚫ S → aAS | a | SS
⚫ A → SbA | ba
⚫ S ⇒ aAS ⇒ aSbAS ⇒ aabAS ⇒ aabbaS ⇒ aabbaa
Rightmost Derivation
⚫ Apply production to the rightmost variable at every step
⚫ S ⇒ aAS ⇒ aAa ⇒ aSbAa ⇒ aSbbaa ⇒ aabbaa
Parsing
The process of constructing a parse tree for a sentence generated by a given grammar.
2 types of parsers
⚫ Top down parsing (predictive parsers)
LL(1)
⚫ Bottom up parsing (shift-reduce parsers)
LR(1)
LL(1) Grammar
L – left-to-right scan of the input
L – leftmost derivation
1 – one token of lookahead
First( ) and Follow( )
⚫ First(α) is the set of terminals that can begin a string derived from α; Follow(A) is
the set of terminals that can appear immediately after the non-terminal A.
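One possible sketch of the First( ) computation, for a deliberately small grammar; the grammar, the "eps" marker for the empty string, and the lack of left recursion are illustrative assumptions.

```python
# Toy grammar: S -> A b | c ;  A -> a | eps
productions = {
    "S": [["A", "b"], ["c"]],
    "A": [["a"], ["eps"]],      # "eps" marks the empty string
}
terminals = {"a", "b", "c"}

def first(symbol, prods, terms):
    """Set of terminals that can begin a string derived from symbol."""
    if symbol in terms or symbol == "eps":
        return {symbol}
    result = set()
    for body in prods[symbol]:
        for sym in body:
            f = first(sym, prods, terms)
            result |= f - {"eps"}
            if "eps" not in f:  # this symbol cannot vanish, stop here
                break
        else:
            result.add("eps")   # every symbol in the body can vanish
    return result

print(sorted(first("S", productions, terminals)))
```

Since First(A) contains eps, First(S) also picks up b from the body A b.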
LR Parsing
LR(k) - Left to right scanning with Rightmost derivation in reverse, k being the number
of lookahead tokens.
Types of LR Parsers
LR(0), SLR(1), LALR(1), CLR(1)
LL vs LR
LL: starts with the root non-terminal on the stack; builds the parse tree top-down.
LR: ends with the root non-terminal on the stack; builds the parse tree bottom-up.
Chapter 4
Semantic Analysis
Syntax Directed Translation
Attaching actions to the grammar rules (productions).
Actions are executed during the compilation
⚫ Not during the generation of the compiler
Actions are executed according to the parsing mechanism.
Syntax Directed Definitions
Is a generalization of a context free grammar
Is a CFG with attributes and rules
Attributes are associated with grammar symbols and rules with productions
Attributes may be:
⚫ Numbers
⚫ Types
⚫ Strings etc
Syntax Directed Definition- Example
Production       Semantic Rules
L → E            print(E.val)
E → E1 + T       E.val = E1.val + T.val
E → T            E.val = T.val
T → T1 * F       T.val = T1.val * F.val
T → F            T.val = F.val
F → ( E )        F.val = E.val
F → digit        F.val = digit.lexval
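The semantic rules in this table can be mimicked by a tiny recursive-descent evaluator, where each function returns the .val attribute of its non-terminal; the single-character tokenization and grammar coverage are minimal, illustrative assumptions.

```python
def parse_E(tokens):
    val, tokens = parse_T(tokens)
    while tokens and tokens[0] == "+":
        rhs, tokens = parse_T(tokens[1:])
        val = val + rhs                 # E.val = E1.val + T.val
    return val, tokens                  # E.val = T.val when no "+"

def parse_T(tokens):
    val, tokens = parse_F(tokens)
    while tokens and tokens[0] == "*":
        rhs, tokens = parse_F(tokens[1:])
        val = val * rhs                 # T.val = T1.val * F.val
    return val, tokens                  # T.val = F.val when no "*"

def parse_F(tokens):
    if tokens[0] == "(":
        val, tokens = parse_E(tokens[1:])
        return val, tokens[1:]          # skip ")" ; F.val = E.val
    return int(tokens[0]), tokens[1:]   # F.val = digit.lexval

val, _ = parse_E(list("3*2+1"))
print(val)
```

Note how * binds tighter than + simply because T sits below E in the grammar.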
mknode(op, left, right)
⚫ Creates an operator node with label op &
⚫ Two fields containing pointers to left and right
mkleaf(id, entry)
⚫ Creates an identifier node with label id &
⚫ A field containing entry, ptr to symbol table entry for the identifier
mkleaf(num, val)
⚫ Create a number node with label num &
⚫ A field containing val, the value for the number
Chapter 5
Type Checking
What are Types ?
Types:
⚫ Describe the values computed during the execution of the program
Type Errors:
⚫ Improper or inconsistent operations during program execution
Type-safety:
⚫ Absence of type errors
Type Checking
Semantic checks to enforce the type safety of the program
Semantic Checks
⚫ Static – done during compilation
⚫ Dynamic – done during run-time
Examples
⚫ Unary and binary operators
⚫ Number and type of arguments
⚫ Return statement with return type
⚫ Compatible assignment
Static Checking
The compiler must check the semantic conventions of the source language
Static Checking: ensures that certain kinds of errors are detected and reported
Example
Type Checks: incompatible operands
Flow-of-Control Checks
Uniqueness Checks
Name-Related Checks
Type Checking of Expressions
E → literal { E.type = char }
E → num { E.type = int }
E → id { E.type = lookup(id.entry) }
E → E1 mod E2 { E.type = if E1.type = int and E2.type = int
then int
else type_error }
E → E1[E2] { E.type = if E2.type = int and
E1.type = array(s,t) then t else type_error }
Type Checking of Statements
S → id = E { S.type = if id.type = E.type then
void else type_error }
S → if E then S1 { S.type = if E.type = boolean then
S1.type else type_error }
S → while E do S1 { S.type = if E.type = boolean then
S1.type else type_error }
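A minimal sketch of these type rules; the symbol-table contents and the tuple encoding of expressions and statements are illustrative assumptions.

```python
# Assumed toy symbol table: arrays are encoded as ("array", element_type).
symbol_table = {"i": "int", "b": "boolean", "arr": ("array", "int")}

def type_of_expr(e):
    if isinstance(e, int):
        return "int"                               # E -> num
    if isinstance(e, str):
        return symbol_table.get(e, "type_error")   # E -> id (lookup)
    op, a, b = e
    ta, tb = type_of_expr(a), type_of_expr(b)
    if op == "mod":                                # E -> E1 mod E2
        return "int" if ta == tb == "int" else "type_error"
    if op == "index":                              # E -> E1[E2]
        if tb == "int" and isinstance(ta, tuple) and ta[0] == "array":
            return ta[1]                           # array(s, t) yields t
        return "type_error"
    return "type_error"

def type_of_stmt(s):
    kind = s[0]
    if kind == "assign":                           # S -> id = E
        _, name, e = s
        return "void" if symbol_table[name] == type_of_expr(e) else "type_error"
    if kind in ("if", "while"):                    # S -> if/while E then S1
        _, cond, body = s
        return type_of_stmt(body) if type_of_expr(cond) == "boolean" else "type_error"
    return "type_error"

print(type_of_stmt(("assign", "i", ("mod", 7, 2))))
print(type_of_stmt(("while", "b", ("assign", "i", 1))))
```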
Chapter 6
Three Address Code
Is a generic form and can be implemented as:
⚫ Quadruples
⚫ Triples
⚫ Indirect Triples
⚫ Tree
⚫ DAG
Example: a = b + c * d. (Exercise: a + b * c - d / (b * c) ?)
t1 = c * d
t2 = b + t1
a = t2
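The translation above can be sketched as a post-order walk over an expression tree, emitting one three-address instruction per operator; the tuple encoding of the tree and the temp-naming scheme are illustrative assumptions.

```python
counter = 0
def new_temp():
    """Return a fresh temporary name t1, t2, ..."""
    global counter
    counter += 1
    return f"t{counter}"

def gen(node, code):
    """Emit TAC for node into code; return the name holding its value."""
    if isinstance(node, str):        # a variable or constant leaf
        return node
    op, left, right = node
    l = gen(left, code)              # operands are generated first
    r = gen(right, code)
    t = new_temp()
    code.append(f"{t} = {l} {op} {r}")
    return t

# a = b + c * d, with the expression as ("+", "b", ("*", "c", "d"))
code = []
result = gen(("+", "b", ("*", "c", "d")), code)
code.append(f"a = {result}")
print("\n".join(code))
```

The deeper subtree c * d is emitted first, reproducing the t1, t2 sequence shown above.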
Declarations
Involves allocation of space in memory &
Entry of type and name in symbol table
An offset variable (initially offset = 0) tracks the next free address relative to the
base address
int a; float b;
Allocation process: { offset = 0 }
int a;
id.type = int
id.width = 2
offset = offset + id.width { offset = 2 }
float b;
id.type=float
id.width=4
offset = offset +id.width { offset = 6 }
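The allocation walk above can be sketched as follows, using the widths from the example (int = 2 bytes, float = 4 bytes); the table layout is an illustrative assumption.

```python
WIDTH = {"int": 2, "float": 4}   # widths as used in the example

def allocate(declarations):
    """Assign each declared name a type and an offset from the base address."""
    offset = 0
    table = {}
    for name, ty in declarations:
        table[name] = {"type": ty, "offset": offset}
        offset += WIDTH[ty]      # offset = offset + id.width
    return table, offset         # final offset = total space needed

table, total = allocate([("a", "int"), ("b", "float")])
print(table, total)
```

This reproduces the trace: a at offset 0, b at offset 2, total space 6.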
Chapter 8
Introduction to Code Optimization
Techniques
Common sub-expression elimination
⚫ Eliminates re-computation of an expression that was already computed previously
Strength reduction
⚫ Replacement of expensive expressions with simple ones
Code movement
⚫ Moving a block of code outside a loop
Dead code elimination
⚫ Eliminates code statements that are either never executed or unreachable
Register Allocation
Registers hold values
Example
⚫ a=c+d
⚫ e=a+b
⚫ f=e–1
Assuming a and e are dead (no longer used) after their last use:
the register holding a can be reused for e, and likewise for f
so a, e, and f can all be allocated to one register (r1)
⚫ r1 = r2 + r3
⚫ r1 = r1 + r4
⚫ r1 = r1 – 1
Peephole Optimization
Replacing short instruction sequences within a small sliding window (the peephole)
with faster or shorter equivalents
Common Techniques:
Elimination of redundant loads and stores
⚫ Eg.
⚫ r2 = r1 + 5
⚫ I = r2
⚫ r3 = I
⚫ r4 = r3 * 3
⚫ Here r3 = I reloads a value already in r2, so the last instruction can become r4 = r2 * 3.
Constant folding
⚫ Eg.
⚫ r2 = 3 * 2 (folded to r2 = 6 at compile time)
Constant Propagation
⚫ Eg.
⚫ r1 = 3
⚫ r2 = r1 * 2
Copy Propagation
⚫ Eg.
⚫ r2 = r1
⚫ r3 = r1 + r2
⚫ r2 = 5;
Elimination of useless instructions
⚫ Eg.
⚫ r1 = r1 + 0 and r1 = r1 * 1 have no effect and can be removed
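Two of the rules above, constant folding and elimination of useless instructions, can be sketched as a textual peephole pass; the instruction syntax and patterns are illustrative assumptions, and a real pass would match many more shapes.

```python
import re

def peephole(instrs):
    out = []
    for ins in instrs:
        # Constant folding: x = c1 * c2  ->  x = (c1*c2)
        m = re.fullmatch(r"(\w+) = (\d+) \* (\d+)", ins)
        if m:
            out.append(f"{m.group(1)} = {int(m.group(2)) * int(m.group(3))}")
            continue
        # Useless instruction: x = x + 0 or x = x * 1 is dropped
        m = re.fullmatch(r"(\w+) = (\w+) (\+ 0|\* 1)", ins)
        if m and m.group(1) == m.group(2):
            continue
        out.append(ins)          # anything else passes through unchanged
    return out

print(peephole(["r2 = 3 * 2", "r1 = r1 + 0", "r1 = r1 * 1"]))
```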