Compiler Design and Implementation
Description: This course is intended to give the students a thorough knowledge of compiler
design techniques and tools for modern computer programming languages. This course covers
advanced topics such as data-flow and control-flow analysis, code generation, and program
analysis and optimization.
Prerequisites: Students should be familiar with the concepts in theory of computation (e.g.,
regular sets and context-free grammars); design and analysis of algorithms (e.g., asymptotic
complexity, divide and conquer and dynamic-programming techniques); and have strong
programming skills using dynamic, pointer-based data structures in C or C++. Students should
also be familiar with basic concepts of imperative
programming languages such as scoping rules, parameter passing disciplines and recursion.
Grading: We will assign 10% of the class grade to homework assignments, 10% to the
programming projects, 10% to the midterm test, and 70% to the final exam. The final exam is
comprehensive.
Lectures: Below is a description of the contents. We may change the order to accommodate the
materials you need for the projects.
Language Processors
Simply stated, a compiler is a program that can read a program in one language (the source
language) and translate it into an equivalent program in another language (the target
language). An important role of the compiler is to report any errors in the source program that it
detects during the translation process. If the target program is an executable machine-language
program, it can then be called by the user to process inputs and produce outputs. An interpreter
is another common kind of language processor. Instead of producing a target program as a
translation, an interpreter appears to directly execute the operations specified in the source
program on inputs supplied by the user.
The machine-language target program produced by a compiler is usually much faster than an
interpreter at mapping inputs to outputs. An interpreter, however, can usually give better error
diagnostics than a compiler, because it executes the source program statement by statement.
Example 1: Java language processors combine compilation and interpretation. A Java
source program may first be compiled into an intermediate form called bytecodes. The bytecodes
are then interpreted by a virtual machine. A benefit of this arrangement is that bytecodes
compiled on one machine can be interpreted on another machine, perhaps across a network. In
order to achieve faster processing of inputs to outputs, some Java compilers, called just-in-time
compilers, translate the bytecodes into machine language immediately before they run the
intermediate program to process the input.
Lexical Analysis
The first phase of a compiler is called lexical analysis or scanning. The lexical analyzer reads the
stream of characters making up the source program and groups the characters into meaningful
sequences called lexemes. For each lexeme, the lexical analyzer produces as output a token of the
form (token-name, attribute-value) that it passes on to the subsequent phase, syntax analysis. In
the token, the first component token-name is an abstract symbol that is used during syntax
analysis, and the second component attribute-value points to an entry in the symbol table for this
token. Information from the symbol-table entry is needed for semantic analysis and code
generation. For example, suppose a source program contains the assignment statement
position = initial + rate * 60
Syntax Analysis
The second phase of the compiler is syntax analysis or parsing. The parser uses the first
components of the tokens produced by the lexical analyzer to create a tree-like intermediate
representation that depicts the grammatical structure of the token stream. A typical
representation is a syntax tree in which each interior node represents an operation and the
children of the node represent the arguments of the operation. A syntax tree for the token stream
(1.2) is shown as the output of the syntactic analyzer. This tree shows the order in which the
operations in the assignment position = initial + rate * 60 are to be performed. The tree
has an interior node labeled * with (id, 3) as its left child and the integer 60 as its right child. The
node (id, 3) represents the identifier rate. The node labeled * makes it explicit that we must first
multiply the value of rate by 60.
Semantic Analysis
The semantic analyzer uses the syntax tree and the information in the symbol table to check the
source program for semantic consistency with the language definition. It also gathers type
information and saves it in either the syntax tree or the symbol table, for subsequent use during
intermediate-code generation. An important part of semantic analysis is type checking, where the
compiler checks that each operator has matching operands. For example, many programming
language definitions require an array index to be an integer; the compiler must report an error if a
floating-point number is used to index an array.
Definition of Grammars
A context-free grammar has four components:
1. A set of terminal symbols, sometimes referred to as "tokens." The terminals are the
elementary symbols of the language defined by the grammar.
2. A set of nonterminals, sometimes called "syntactic variables." Each nonterminal represents a
set of strings of terminals, in a manner we shall describe.
3. A set of productions, where each production consists of a nonterminal, called the head or left
side of the production, an arrow, and a sequence of terminals and/or nonterminals, called the
body or right side of the production.
4. A designation of one of the nonterminals as the start symbol.
Derivations
A grammar derives strings by beginning with the start symbol and repeatedly replacing a
nonterminal by the body of a production for that nonterminal. The terminal strings that can be
derived from the start symbol form the language defined by the grammar. The ten productions
for the nonterminal digit allow it to stand for any of the terminals 0, 1, . . . , 9. From production
(2.3), a single digit by itself is a list. Productions (2.1) and (2.2) express the rule that any list
followed by a plus or minus sign and then another digit makes up a new list.
Tree Terminology
Tree data structures figure prominently in compiling.
• A tree consists of one or more nodes. Nodes may have labels, which in this book typically will
be grammar symbols. When we draw a tree, we often represent the nodes by these labels only.
• Exactly one node is the root. All nodes except the root have a unique parent; the root has no
parent. When we draw trees, we place the parent of a node above that node and draw an edge
between them. The root is then the highest (top) node.
• If node N is the parent of node M, then M is a child of N. The children of one node are called
siblings. They have an order, from the left, and when we draw trees, we order the children of a
given node in this manner.