
Compiler Construction

About Me

Amir Ali,

PhD Candidate,
Xi'an Jiaotong University, China
MS (CCS), SEECS, NUST, Pakistan

Email: [email protected]
Aims of Course

• Any program written in a programming language must be translated before it
  can be executed. This translation is typically accomplished by a software
  system called a compiler.
• This module aims to introduce students to the principles and techniques used
  to perform this translation and the issues that arise in the construction of a
  compiler.
Learning Outcomes

• A student successfully completing this module should be able to:


• understand the principles governing all phases of the
compilation process.
• understand the role of each of the basic components of a
standard compiler.
• show awareness of the problems of and methods and
techniques applied to each phase of the compilation process.
• apply standard techniques to solve basic problems that arise in
compiler construction.
• understand how the compiler can take advantage of particular
processor characteristics to generate good code.
Books

• Aho, Lam, Sethi, Ullman. “Compilers: Principles, Techniques and


Tools”, 2nd edition. (Aho2) The 1st edition (by Aho, Sethi, Ullman –
Aho1), the “Dragon Book”, has been a classic for over 20 years.
• Cooper & Torczon. “Engineering a Compiler”
• Other books:
• Hunter et al. “The essence of Compilers” (Prentice-Hall)
• Grune et al. “Modern Compiler Design” (Wiley)
Syllabus

• Introduction
• Lexical Analysis (scanning)
• Syntax Analysis (parsing)
• Semantic Analysis
• Intermediate Representations
• Storage Management
• Code Generation
• Code Optimisation
Why Take this Course

Reason #1: understand compilers and languages


• understand the code structure
• understand language semantics
• understand relation between source code and generated
machine code
• become a better programmer

Why Take this Course

Reason #2: nice balance of theory and practice


• Theory
• mathematical models: regular expressions, automata,
grammars, graphs
• algorithms that use these models
• Practice
• Apply theoretical notions to build a real compiler

Why Take this Course

Reason #3: programming experience


• write a large program which manipulates complex data
structures
• learn more about C++/C#/Java

Definitions – Compilers : Language processors
(compile: collect material into a list, volume)
• What is a compiler?
• A program that accepts as input a program text in a certain
language and produces as output a program text in another
language, while preserving the meaning of that text (Grune et al,
2000).
• A program that reads a program written in one language (source
language) and translates it into an equivalent program in another
language (target language) (Aho et al)
• What is an interpreter?
• A program that reads a source program and produces the results
of executing this source.
• This course deals with compilers, though many of the same issues also arise with interpreters!
Compilation - Big Picture
Source Code

    int expr( int n )
    {
        int d;
        d = 4*n*n*(n+1)*(n+1);
        return d;
    }

• Optimized for human readability
• Matches human notions of grammar
• Uses named constructs such as variables and procedures
Assembly Code

    .globl _expr
    _expr:
        pushl %ebp
        movl %esp,%ebp
        subl $24,%esp
        movl 8(%ebp),%eax
        movl %eax,%edx
        leal 0(,%edx,4),%eax
        movl %eax,%edx
        imull 8(%ebp),%edx
        movl 8(%ebp),%eax
        incl %eax
        imull %eax,%edx
        movl 8(%ebp),%eax
        incl %eax
        imull %eax,%edx
        movl %edx,-4(%ebp)
        movl -4(%ebp),%edx
        movl %edx,%eax
        jmp L2
        .align 4
    L2:
        leave
        ret

• Optimized for hardware
• Consists of machine instructions
• Uses registers and unnamed memory locations
• Much harder for humans to understand
How to translate

• The generated machine code must execute precisely the same computation
  as the source code
• Is there a unique translation? No!
• Is there an algorithm for an “ideal translation”? No!

• Translation is a complex process:
  • source language and generated code are very different
  • we need to structure the translation
How to translate

If the target program is an executable machine-language
program, it can then be called by the user to process
inputs and produce outputs.
How to translate
• C is typically compiled
• Lisp is typically interpreted
• Java is compiled to bytecodes, which are then interpreted by a
  Virtual Machine (perhaps across the network)
Structure of a Compiler

• Up to this point we have treated a compiler as a single box that maps a source program into a
  semantically equivalent target program. If we open up this box a little, we see that there are two parts to
  this mapping: analysis and synthesis.

• The analysis part breaks up the source program into constituent pieces and imposes a grammatical
  structure on them, detecting lexical, grammatical, and syntactic errors along the way.
• It then uses this structure to create an intermediate representation of the source program.
• If the analysis part detects that the source program is either syntactically ill formed or semantically
  unsound, it must provide informative messages so the user can take corrective action.
• The analysis part also collects information about the source program and stores it in a data structure
  called a symbol table, which is passed along with the intermediate representation to the synthesis part.
• The synthesis part constructs the desired target program from the intermediate representation and the
  information in the symbol table.

• The analysis part is often called the front end of the compiler; the synthesis part is the back end.
Structure of a Compiler

source code → [ Front End ] → IR → [ Back End ] → machine code
                    ↓ errors

Front end: maps legal source code into IR
• Recognizes legal (& illegal) programs
• Reports errors in a useful way
• Produces IR & preliminary storage map

Back end: maps IR into target machine code

Front End

source code → [ scanner ] → tokens → [ parser ] → IR
                    ↓ errors

Modules:
1. Scanner
2. Parser
Front End
Scanner Example
• Maps character stream into words – basic unit of syntax

  x = x + y   becomes   <id,x> <assign,=> <id,x> <op,+> <id,y>

• Produces pairs – a word (lexeme) and its part of speech (token type)
Parser
• Recognizes context-free syntax and reports
errors
• Guides context-sensitive (“semantic”)
analysis
• Builds IR for source program
Context-Free Grammars

• Context-free syntax is specified with a grammar


  G = (S, N, T, P)
• S is the start symbol
• N is a set of non-terminal symbols
• T is a set of terminal symbols or words
• P is a set of productions or rewrite rules
Context-Free Grammars
Grammar for expressions:

1. goal → expr
2. expr → expr op term
3.      | term
4. term → number
5.      | id
6. op   → +
7.      | -

For this CFG:
S = goal
T = { number, id, +, - }
N = { goal, expr, term, op }
P = { 1, 2, 3, 4, 5, 6, 7 }
Context-Free Grammars
• Given a CFG, we can derive sentences by repeated substitution
• Consider the sentence (expression) x + 2 – y

Production   Result
             goal
1            expr
2            expr op term
5            expr op y
7            expr – y
2            expr op term – y
4            expr op 2 – y
6            expr + 2 – y
3            term + 2 – y
5            x + 2 – y
Context-Free Grammars
The Front End
• To recognize a valid sentence in some CFG, we reverse this
  process and build up a parse
• A parse can be represented by a tree: parse tree or syntax tree

Production   Result
             goal
1            expr
2            expr op term
5            expr op y
7            expr – y
2            expr op term – y
4            expr op 2 – y
6            expr + 2 – y
3            term + 2 – y
5            x + 2 – y
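As an aside (not on the original slides), building up a parse can be sketched as a small recursive-descent recognizer. This is an illustrative Python sketch, assuming the left recursion in expr → expr op term has been rewritten as a loop, which accepts the same language:

```python
# Minimal recognizer for the expression grammar:
#   goal -> expr ; expr -> expr op term | term
#   term -> number | id ; op -> + | -
# Left recursion is rewritten as iteration: expr -> term (op term)*

def recognize(tokens):
    pos = 0

    def term():
        nonlocal pos
        if pos < len(tokens) and tokens[pos][0] in ("number", "id"):
            pos += 1
            return True
        return False

    def expr():
        nonlocal pos
        if not term():
            return False
        while pos < len(tokens) and tokens[pos][0] == "op":
            pos += 1          # consume + or -
            if not term():
                return False
        return True

    return expr() and pos == len(tokens)

# The sentence x + 2 - y as a token stream:
tokens = [("id", "x"), ("op", "+"), ("number", "2"), ("op", "-"), ("id", "y")]
print(recognize(tokens))   # True
```

A real parser would also build the parse tree while recognizing; here only acceptance is checked.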
A language-processing system

• The task of collecting the source program is
  sometimes entrusted to a separate program, called a
  preprocessor. The preprocessor may also expand
  shorthands, called macros, into source-language
  statements.
• The compiler compiles the program and translates it
  into an assembly program (low-level language).
• An assembler then translates the assembly program
  into machine code.
• A linker tool is used to link all the parts of the
  program together for execution (executable machine
  code).
• A loader loads all of them into memory and then the
  program is executed.
Phases of a Compiler
• The compilation process is a sequence of various phases.

• Each phase takes input from its previous stage, has its own
representation of source program, and feeds its output to the next
phase of the compiler.

• In practice, several phases may be grouped together,
  and the intermediate representations between the grouped phases
  need not be constructed explicitly.
Lexical Analysis
• The first phase of a compiler is called lexical analysis or scanning.
• It reads the stream of characters making up the source program
  and groups the characters into meaningful sequences called
  lexemes.
• For each lexeme, the lexical analyzer produces as output a token
  of the form <token-name, attribute-value> that it passes on to the
  subsequent phase, syntax analysis.

• In the above token, the first component token-name is an
  abstract symbol that is used during syntax analysis, and the
  second component attribute-value points to an entry in the
  symbol table for this token. Information from the symbol-table
  entry is needed for semantic analysis and code generation.

Lexical Analysis
• For example: position = initial + rate * 60
• position is a lexeme that would be mapped into a token <id, 1>, where id is an abstract symbol standing
  for identifier and 1 points to the symbol-table entry for position.
• The assignment symbol = is a lexeme that is mapped into the token <=>. Since this token needs no
  attribute-value, we have omitted the second component. We could have used any abstract symbol such as
  assign for the token-name, but for notational convenience we have chosen to use the lexeme itself as the
  name of the abstract symbol.
• initial is a lexeme that is mapped into the token <id, 2>, where 2 points to the symbol-table entry for
  initial.
• + is a lexeme that is mapped into the token <+>.
• rate is a lexeme that is mapped into the token <id, 3>, where 3 points to the symbol-table entry for rate.
• * is a lexeme that is mapped into the token <*>.
• 60 is a lexeme that is mapped into the token <60> (later we may represent it as <number, 4>).
• Blanks separating the lexemes would be discarded by the lexical analyzer.
• In this representation, the token names =, +, and * are abstract symbols for the assignment, addition, and
  multiplication operators, respectively.
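As an illustrative sketch (not from the slides), a scanner producing exactly these <token-name, attribute-value> pairs for the statement above could look like this in Python; the table layout and function names are assumptions for illustration:

```python
import re

# Illustrative scanner: maps the character stream to <token-name, attribute> pairs.
# Identifiers get a symbol-table index as their attribute; operators need none.
TOKEN_SPEC = [
    ("id",     r"[A-Za-z_]\w*"),
    ("number", r"\d+"),
    ("op",     r"[=+*]"),
    ("skip",   r"\s+"),
]
PATTERN = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in TOKEN_SPEC))

def scan(source):
    symtab, tokens = [], []
    for m in PATTERN.finditer(source):
        kind, lexeme = m.lastgroup, m.group()
        if kind == "skip":
            continue                      # blanks are discarded
        if kind == "id":
            if lexeme not in symtab:
                symtab.append(lexeme)     # enter identifier into the symbol table
            tokens.append(("id", symtab.index(lexeme) + 1))
        elif kind == "number":
            tokens.append(("number", int(lexeme)))
        else:
            tokens.append((lexeme,))      # operators carry no attribute-value
    return tokens, symtab

tokens, symtab = scan("position = initial + rate * 60")
print(tokens)
# [('id', 1), ('=',), ('id', 2), ('+',), ('id', 3), ('*',), ('number', 60)]
```

The symbol table here ends up as ["position", "initial", "rate"], matching the indices 1, 2, 3 used on the slide.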
Lexical Analysis
Syntax Analysis

• The second phase of the compiler is syntax analysis or parsing.
• The parser uses the first components of the tokens produced by the lexical analyzer to create a tree-like
  intermediate representation that depicts the grammatical structure of the token stream.
• A typical representation is a syntax tree in which each interior node represents an operation and the
  children of the node represent the arguments of the operation.
• The tree has an interior node labeled * with <id, 3> as its left child
  and the integer 60 as its right child. The node <id, 3> represents
  the identifier rate. The node labeled * makes it explicit that we must first multiply the value of rate by 60.
• The node labeled + indicates that we must add the result of this multiplication to the value of initial. The
  root of the tree, labeled =, indicates that we must store the result of this addition into the location for the
  identifier position.
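Encoded as nested tuples (an illustrative encoding, not the slides' notation), this syntax tree and its evaluation order can be sketched as:

```python
# Syntax tree for: position = initial + rate * 60
# Each interior node is (operator, left, right); leaves are tokens.
# Nesting encodes precedence: * is applied first, then +, then =.
tree = ("=",
        ("id", 1),                       # position
        ("+",
         ("id", 2),                      # initial
         ("*",
          ("id", 3),                     # rate
          ("number", 60))))

def postorder(node):
    """Yield operators in evaluation order (children before parents)."""
    if node[0] in ("=", "+", "*"):
        yield from postorder(node[1])
        yield from postorder(node[2])
        yield node[0]

print(list(postorder(tree)))   # ['*', '+', '='] — multiply, then add, then store
```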
Semantic Analysis

• The semantic analyzer uses the syntax tree and the information in the symbol table to check the source
  program for semantic consistency with the language definition.
• It also gathers type information and saves it in either the syntax tree or the symbol table, for subsequent
  use during intermediate-code generation.
• An important part of semantic analysis is type checking, where the compiler checks that each operator
  has matching operands.
• For example, many programming language definitions require an array index to be an integer; the
  compiler must report an error if a floating-point number is used to index an array.
• The language specification may permit some type conversions called coercions. For example, a binary
  arithmetic operator may be applied to either a pair of integers or to a pair of floating-point numbers. If the
  operator is applied to a floating-point number and an integer, the compiler may convert, or coerce, the
  integer into a floating-point number.
• Also, the semantic analyzer keeps track of identifiers, their types and expressions, and whether identifiers
  are declared before use. The semantic analyzer produces an annotated syntax tree as its output.
Intermediate Code Generation

• After syntax and semantic analysis of the source program, many compilers generate an explicit low-level
  or machine-like intermediate representation, which we can think of as a program for an abstract machine.
  This intermediate representation should have two important properties: it should be easy to
  produce and it should be easy to translate into the target machine.
• In later lectures, we consider an intermediate form called three-address code, which consists of a
  sequence of assembly-like instructions with three operands per instruction. Each operand can act like a
  register.
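For the running example, three-address code can be produced by a bottom-up walk of the syntax tree. This sketch is illustrative: the temporary names t1, t2, … and the inttofloat coercion placement are assumptions in the spirit of the example, not a fixed specification:

```python
# Illustrative three-address code generator: walks the syntax tree bottom-up,
# emitting one instruction per operator into fresh temporaries t1, t2, ...
code, temp_count = [], 0

def new_temp():
    global temp_count
    temp_count += 1
    return f"t{temp_count}"

def gen(node):
    kind = node[0]
    if kind == "id":
        return f"id{node[1]}"            # refer to the symbol-table entry
    if kind == "number":
        t = new_temp()
        # coercion recorded by semantic analysis becomes an explicit instruction
        code.append(f"{t} = inttofloat({node[1]})")
        return t
    op, left, right = node
    l, r = gen(left), gen(right)
    t = new_temp()
    code.append(f"{t} = {l} {op} {r}")   # at most three addresses per instruction
    return t

# Right-hand side of: position = initial + rate * 60
tree = ("+", ("id", 2), ("*", ("id", 3), ("number", 60)))
code.append(f"id1 = {gen(tree)}")
print("\n".join(code))
# t1 = inttofloat(60)
# t2 = id3 * t1
# t3 = id2 + t2
# id1 = t3
```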
Code Optimization

• The machine-independent code-optimization phase attempts to improve the intermediate code so that
  better target code will result. Usually better means faster, but other objectives may be desired, such as
  shorter code, or target code that consumes less power. For example, a straightforward algorithm generates
  the intermediate code, using an instruction for each operator in the tree representation that comes from the
  semantic analyzer.

• The optimizer can deduce that the conversion of 60 from integer to floating point can be done once and
  for all at compile time, so the inttofloat operation can be eliminated by replacing the integer 60 by the
  floating-point number 60.0.
• Moreover, t3 is used only once to transmit its value to id1, so the optimizer can transform the code into a
  shorter sequence.
Code Generation

• The code generator takes as input an intermediate representation of the source program and maps it into
  the target language. If the target language is machine code, registers or memory locations are selected for
  each of the variables used by the program.
• Then, the intermediate instructions are translated into sequences of machine instructions that perform the
  same task. A crucial aspect of code generation is the assignment of registers to hold variables.
• For example, using registers R1 and R2, the intermediate code from the optimization phase might get
  translated into machine code.
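A toy code generator for the optimized two-instruction sequence might assign R1 and R2 like this; the instruction set (LDF/MULF/ADDF/STF) and register-allocation policy are illustrative assumptions, not a real target machine:

```python
# Illustrative code generator: translates optimized three-address code
#   t1 = id3 * 60.0 ; id1 = id2 + t1
# into a toy floating-point machine using registers R1 and R2.

def codegen(tac):
    asm, reg_of, free = [], {}, ["R1", "R2"]

    def load(operand):
        if operand in reg_of:
            return reg_of[operand]           # value already sits in a register
        if operand.replace(".", "").isdigit():
            return f"#{operand}"             # constant becomes an immediate operand
        reg = free.pop()
        asm.append(f"LDF {reg}, {operand}")  # load a variable from memory
        reg_of[operand] = reg
        return reg

    for target, left, op, right in tac:      # each entry: target = left op right
        opcode = {"*": "MULF", "+": "ADDF"}[op]
        l, r = load(left), load(right)
        asm.append(f"{opcode} {l}, {l}, {r}")
        reg_of[target] = l                   # result stays in the left register
        if target.startswith("id"):          # program variable: store it back
            asm.append(f"STF {target}, {l}")
    return asm

tac = [("t1", "id3", "*", "60.0"), ("id1", "id2", "+", "t1")]
for line in codegen(tac):
    print(line)
# LDF R2, id3
# MULF R2, R2, #60.0
# LDF R1, id2
# ADDF R1, R1, R2
# STF id1, R1
```

Note how the allocator keeps t1 in R2 across the two instructions instead of spilling it to memory; choosing which values to keep in registers is exactly the crucial aspect mentioned above.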
Symbol Table

• It is a data structure maintained throughout all the phases of a compiler.
• All the identifiers' names along with their types are stored here.
• The symbol table makes it easier for the compiler to quickly search for an identifier record and retrieve it.
• The symbol table is also used for scope management.

The Grouping of Phases into Passes

The discussion of phases deals with the logical organization of a compiler. In an implementation, activities from
several phases may be grouped together into a pass that reads an input file and writes an output file. For example, the
front-end phases of lexical analysis, syntax analysis, semantic analysis, and intermediate code generation might be
grouped together into one pass. Code optimization might be an optional pass. Then there could be a back-end pass
consisting of code generation for a particular target machine.

Compiler-Construction Tools
There are tools available for each stage/phase of the compiler (we will see them and try to get hands-on experience)
Symbol Table
Full Compiler Structure
Qualities of a Good Compiler

What qualities would you want in a compiler?


• generates correct code (first and foremost!)
• generates fast code
• conforms to the specifications of the input language
• copes with essentially arbitrary input size, variables, etc.
• compilation time (linearly) proportional to size of source
• good diagnostics
• consistent optimisations
• works well with the debugger
Principles of Compilation
The compiler must:
• preserve the meaning of the program being compiled.
• “improve” the source code in some way.
Other issues (depending on the setting):
• Speed (of compiled code)
• Space (size of compiled code)
• Feedback (information provided to the user)
• Debugging (transformations obscure the relationship between source and target code)
• Compilation time efficiency (fast or slow compiler?)
Uses of Compiler Technology
• Most common use: translate a high-level program to object code
• Program Translation: binary translation, hardware synthesis, …
• Optimizations for computer architectures:
• Improve program performance, take into account hardware parallelism, RISC, CISC,
etc…
• Automatic parallelisation
• Performance instrumentation: e.g., -pg option of cc or gcc
• Interpreters: e.g., Python, Ruby, Perl, Matlab, sh, …
• Software productivity tools
• Bound Checking, Type Checking, Debugging aids: e.g., Purify (memory-management
  errors, etc.)
• Security: Java VM uses compiler analysis to prove “safety” of Java code.
• Text formatters, just-in-time compilation for Java, power management,
global distributed computing, …
Key: Ability to extract properties of a source program (analysis) and
transform it to construct a target program (synthesis)
