Overview of Compilers
A Language Processing System
(source program → preprocessor → compiler → assembler → linker/loader → target machine code)
The Structure of a Compiler
(The phases: lexical analysis → syntax analysis → semantic analysis → intermediate code generation → code optimization → code generation, with the symbol table and error handling shared by all phases.)
Example: position = initial + rate * 60
Output of the lexical analyzer: <id, 1> <=> <id, 2> <+> <id, 3> <*> <60>
1. "position" is a lexeme mapped into the token <id, 1>, where id is an abstract symbol standing for identifier and 1 points to the symbol-table entry for position. The symbol-table entry for an identifier holds information about the identifier, such as its name and type.
2. = is a lexeme that is mapped into the token <=>. Since this token needs no attribute value, we have omitted the second component. For notational convenience, the lexeme itself is used as the name of the abstract symbol.
3. "initial" is a lexeme that is mapped into the token <id, 2>, where 2 points to the symbol-table entry for initial.
4. + is a lexeme that is mapped into the token <+>.
5. "rate" is a lexeme mapped into the token <id, 3>, where 3 points to the symbol-table entry for rate.
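As an illustration, here is a minimal Python sketch of a lexical analyzer producing this token stream (all names are hypothetical, not from any particular compiler); identifiers are installed in a symbol table and their token carries the table index as its attribute value:

```python
import re

# A minimal, hypothetical tokenizer for the running example.
TOKEN_SPEC = [
    ("id",     r"[A-Za-z_]\w*"),
    ("number", r"\d+"),
    ("assign", r"="),
    ("plus",   r"\+"),
    ("times",  r"\*"),
    ("ws",     r"\s+"),
]

def tokenize(source):
    symtab = []                      # symbol table: one record per name
    pos = 0
    while pos < len(source):
        for name, pattern in TOKEN_SPEC:
            m = re.match(pattern, source[pos:])
            if m:
                lexeme = m.group()
                pos += len(lexeme)
                if name == "id":
                    if lexeme not in symtab:
                        symtab.append(lexeme)   # install new identifier
                    yield ("id", symtab.index(lexeme) + 1)
                elif name != "ws":   # whitespace is stripped, not returned
                    yield (name, lexeme)
                break
        else:
            raise SyntaxError(f"illegal character at position {pos}")

print(list(tokenize("position = initial + rate * 60")))
# [('id', 1), ('assign', '='), ('id', 2), ('plus', '+'),
#  ('id', 3), ('times', '*'), ('number', '60')]
```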
Syntax Analysis (Parser): the second phase of the compiler
• The parser uses the first components of the tokens produced by the lexical
analyzer to create a tree-like intermediate representation that depicts the
grammatical structure of the token stream.
• A typical representation is a syntax tree in which each interior node
represents an operation and the children of the node represent the arguments
of the operation.
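For the running example position = initial + rate * 60, the syntax tree has the assignment at the root, with * binding more tightly than +:

```
        =
       / \
 <id,1>   +
         / \
   <id,2>   *
           / \
     <id,3>   60
```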
Semantic Analysis: Third phase of the compiler
• The semantic analyzer uses the syntax tree and the information in the
symbol table to check the source program for semantic consistency with
the language definition.
• It also gathers type information and saves it in either the syntax tree or the symbol table, for subsequent use during intermediate-code generation.
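In the running example, if rate is declared float, the semantic analyzer finds * applied to a float and the integer 60, so it inserts a conversion node (inttofloat) into the syntax tree:

```
        =
       / \
 <id,1>   +
         / \
   <id,2>   *
           / \
     <id,3>   inttofloat
                  |
                  60
```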
Intermediate Code Generation: three-address
code
After syntax and semantic analysis of the source program, many compilers
generate an explicit low-level or machine-like intermediate representation (a
program for an abstract machine). This intermediate representation should
have two important properties:
• It should be easy to produce and
• It should be easy to translate into the target machine.
The intermediate form considered here is called three-address code: a sequence of assembly-like instructions with three operands per instruction, where each operand can act like a register.
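For the running example, a compiler might emit the following three-address code; each instruction has at most one operator on the right side, and the compiler generates temporary names (t1, t2, t3) to hold intermediate values:

```
t1 = inttofloat(60)
t2 = id3 * t1
t3 = id2 + t2
id1 = t3
```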
Code Optimization: To generate better target code
• The machine-independent code-optimization phase attempts to improve the
intermediate code so that better target code will result.
• Usually better means:
• faster, shorter code, or target code that consumes less power.
• The optimizer can deduce that the conversion of 60 from integer to floating point can be done once and for all at compile time, so the inttofloat operation can be eliminated by replacing the integer 60 by the floating-point number 60.0. Moreover, t3 is used only once, to transmit its value to id1, so t3 can be eliminated as well (see the sequence below).
• There are simple optimizations that significantly improve the running time of
the target program without slowing down compilation too much .
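Applying both observations to the three-address code above yields the shorter sequence:

```
t1 = id3 * 60.0
id1 = id2 + t1
```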
Code Generation: takes as input an intermediate representation
of the source program and maps it into the target language
• If the target language is machine code, registers or memory
locations are selected for each of the variables used by the
program.
• Then, the intermediate instructions are translated into sequences of
machine instructions that perform the same task.
• A crucial aspect of code generation is the judicious assignment of registers to hold variables.
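For the optimized intermediate code above, the target code might look like the following sketch (a hypothetical machine: F marks floating-point instructions, # marks an immediate constant, and R1 and R2 are registers chosen by the code generator):

```
LDF  R2, id3
MULF R2, R2, #60.0
LDF  R1, id2
ADDF R1, R1, R2
STF  id1, R1
```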
Symbol-Table Management:
• The symbol table is a data structure containing a
record for each variable name, with fields for the
attributes of the name.
• The data structure should be designed to allow the
compiler to find the record for each name quickly and
to store or retrieve data from that record quickly.
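A minimal Python sketch of such a data structure, assuming a hash table keyed on the name so that lookup and update are both fast:

```python
# A minimal symbol-table sketch: one record per name, with fields for
# the attributes of that name; a dict gives expected O(1) access.
class SymbolTable:
    def __init__(self):
        self._records = {}

    def insert(self, name, **attrs):
        # Create the record if needed, then store/update its attributes.
        self._records.setdefault(name, {}).update(attrs)

    def lookup(self, name):
        return self._records.get(name)

st = SymbolTable()
st.insert("position", type="float")
st.insert("rate", type="float")
print(st.lookup("rate"))   # {'type': 'float'}
```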
The Grouping of Phases into Passes
• Deals with the logical organization of a compiler
• For example, the front-end phases of lexical analysis,
syntax analysis, semantic analysis, and intermediate
code generation can be grouped together into one pass.
• Code optimization might be an optional pass.
• Then there could be a back-end pass consisting of code
generation for a particular target machine.
Compiler Construction Tools
• Parser generators that automatically produce syntax
analyzers from a grammatical description of a
programming language.
• Scanner generators that produce lexical analyzers from
a regular-expression description of the tokens of a
language.
• Syntax-directed translation engines that produce
collections of routines for walking a parse tree and
generating intermediate code.
• Code generators that translate each operation of the intermediate language into the machine language for a target machine.
• Data flow analysis engines that facilitate the gathering
of information about how values are transmitted from
one part of a program to each other part. Data-flow
analysis is a key part of code optimization.
• Compiler Construction Toolkits that provide an
integrated set of routines for constructing various
phases of a compiler.
The Evolution of Programming Languages
The Move to Higher-level Languages
• First-generation - machine languages.
• Second-generation - assembly languages.
• Third-generation - higher-level languages like Fortran,
Cobol, Lisp, C, C++, C#, and Java.
• Fourth-generation languages are languages designed for
specific applications like SQL for database queries and
Postscript for text formatting.
• Fifth-generation language has been applied to logic and
constraint-based languages like Prolog
• An imperative language specifies how a computation is to be done.
• Languages such as C, C++, C#, and Java are imperative languages.
• In imperative languages there is a notion of program state and
statements that change the state.
• Functional languages such as ML and Haskell and constraint logic
languages such as Prolog are often considered to be declarative
languages (what computation is to be done).
• Scripting languages are interpreted languages with high-level
operators designed for "gluing together" computations.
• Awk, JavaScript, Perl, PHP, Python, Ruby, and Tcl are popular examples of scripting languages.
Impact on Compilers
• Compiler writers must develop new algorithms and translations to support new programming-language features.
• New translation algorithms must take advantage of new hardware capabilities.
• An input program may contain millions of lines of code.
• A compiler must translate correctly the potentially infinite set of
programs that could be written in the source language.
• The problem of generating the optimal target code from a source
program is undecidable.
The Science of Building a Compiler
Modeling in Compiler Design and Implementation
• The study of compilers is about how we design the
right mathematical models and choose the right
algorithms.
• Some of the fundamental models are finite-state machines, regular expressions, and context-free languages.
The Science of Code Optimization
• The term "optimization" in compiler design refers to
the attempts that a compiler makes to produce code
that is more efficient than the obvious code.
• Compiler optimizations must meet the following design objectives:
• optimization must be correct, that is, it must preserve the meaning of the compiled program;
• optimization must improve the performance of programs;
• the compilation time must be kept reasonable; and
• the engineering effort required must be manageable.
The Science of Code Optimization
• Optimizations can speed up execution time and also conserve power.
• Compilation time should be short to support a
rapid development and debugging cycle.
• A compiler is a complex system; we must keep
the system simple to assure that the engineering
and maintenance costs of the compiler are
manageable.
Applications of Compiler Technology
Implementation of High-Level Programming Languages
• Higher-level programming languages are easier to program in, but are less
efficient, that is, the target programs generated run more slowly.
• Programmers using a low-level language have more control over a
computation and can produce more efficient code.
• The register keyword in the C programming language is an early example of
the interaction between compiler technology and language evolution.
• Example: the token names and associated attribute values produced by a lexical analyzer for the Fortran statement E = M * C ** 2 are:
• <id, pointer to symbol-table entry for E>
• <assign_op>
• <id, pointer to symbol-table entry for M>
• <mult_op>
• <id, pointer to symbol-table entry for C>
• <exp_op>
• <number, integer value 2>
Lexical Errors
• fi ( a == f(x) ) . . .
• a lexical analyzer cannot tell whether fi is a misspelling of the keyword if or an undeclared function identifier.
• Since fi is a valid lexeme for the token id, the lexical analyzer must return the token id to the parser and let some other phase of the compiler handle the error.
Lexical Errors
• Suppose the lexical analyzer is unable to proceed because none of the patterns for tokens matches any prefix of the remaining input.
• The simplest recovery strategy is "panic mode" recovery: delete successive characters from the remaining input until the lexical analyzer can find a well-formed token at the beginning of what input is left (e.g., iffff).
• Other possible error-recovery actions are:
1. Delete one character from the remaining input (e.g., iff).
2. Insert a missing character into the remaining input (e.g., pritf).
3. Replace a character by another character (e.g., ef).
4. Transpose two adjacent characters (e.g., fi for if).
Input Buffering: increases the speed of reading the source program
• A two-buffer scheme that handles large lookaheads safely is called buffer pairs.
• Specialized buffering techniques have been developed to reduce the
amount of overhead required to process a single input character.
Buffer Pairs
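The buffer-pair figure is not reproduced here; the following Python sketch conveys the idea (an illustrative simplification, not the classic pointer layout): two N-byte halves are refilled alternately, and a sentinel at the end of each half lets the scanner detect "end of half" and "end of input" with a single test per character.

```python
import io

N = 4096                 # size of one buffer half, e.g. one disk block
SENTINEL = "\0"          # stands in for the eof sentinel; assumes the
                         # source text itself never contains '\0'

class BufferPair:
    def __init__(self, f):
        self.f = f
        self.buf = ["", ""]          # the two buffer halves
        self.half = 0                # which half the forward pointer is in
        self.pos = 0
        self._refill(0)

    def _refill(self, half):
        self.buf[half] = self.f.read(N) + SENTINEL   # sentinel ends the half

    def next_char(self):
        ch = self.buf[self.half][self.pos]
        self.pos += 1
        if ch != SENTINEL:
            return ch                # the common case: one test, no refill
        if len(self.buf[self.half]) <= N:
            return None              # short half: real end of input
        self.half ^= 1               # full half consumed: switch halves
        self.pos = 0
        self._refill(self.half)      # refill the other half and continue
        return self.next_char()

bp = BufferPair(io.StringIO("position = initial + rate * 60"))
print("".join(iter(bp.next_char, None)))   # echoes the input back once
```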
Regular Expressions
• Induction: larger regular expressions are built from smaller ones.
Let r and s be regular expressions denoting the languages L(r) and L(s), respectively. Then:
1. (r)|(s) is a regular expression denoting the language L(r) ∪ L(s).
2. (r)(s) is a regular expression denoting the language L(r)L(s).
3. (r)* is a regular expression denoting (L(r))*.
4. (r) is a regular expression denoting L(r).
This last rule says that we can add additional pairs of
parentheses around expressions without changing
the language they denote.
• For example, we may replace the regular expression (a) | ((b)*(c)) by a | b*c.
Example: Let Σ = {a, b}.
• The regular expression a|b denotes the language {a, b}.
• (a|b)(a|b) denotes {aa, ab, ba, bb}, the language of all strings of length two over the alphabet Σ.
• a* denotes the language consisting of all strings of zero or more a's, that is, {ε, a, aa, aaa, ...}.
• (a|b)* denotes the set of all strings consisting of zero or more instances of a or b, that is, all strings of a's and b's: {ε, a, b, aa, ab, ba, bb, aaa, ...}.
• a|a*b denotes the language {a, b, ab, aab, aaab, ...}, that is, the string a and all strings of zero or more a's followed by a b.
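These examples can be checked directly with Python's re module (fullmatch anchors the pattern to the whole string):

```python
import re

print(bool(re.fullmatch(r"a|b", "a")))          # True:  a|b denotes {a, b}
print(bool(re.fullmatch(r"(a|b)(a|b)", "ba")))  # True:  all strings of length two
print(bool(re.fullmatch(r"a*", "")))            # True:  a* includes the empty string
print(bool(re.fullmatch(r"a|a*b", "aab")))      # True:  a, or zero or more a's then b
print(bool(re.fullmatch(r"a|a*b", "ba")))       # False: not in the language
```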
Regular Set: a language that can be defined by a regular expression is called a regular set.
Recognition of Tokens
• The terminals of the grammar, which are if, then, else, relop, id, and number, are the names of tokens as used by the lexical analyzer.
• The lexical analyzer also has the job of stripping out whitespace, by recognizing the "token" ws defined by:
ws → ( blank | tab | newline )+
• Tokens are recognized using transition diagrams.
(Transition diagram for relop: from the start state 0, < followed by = returns relop LE; < followed by > returns relop NE; < followed by any other character retracts one character and returns relop LT; = returns relop EQ; > followed by = returns relop GE; > followed by any other character retracts and returns relop GT. A * marks a state that must retract one character of lookahead.)
(Transition diagram for id: a letter, then zero or more letters or digits; any other character retracts and returns ( getToken(), installID() ).)
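A minimal Python sketch of the relop diagram (state numbering omitted; retraction is modeled by simply not consuming the lookahead character):

```python
def relop(src, i):
    """Return (token, attribute, next_index) if a relop starts at src[i],
    else None. Slicing src[i+1:i+2] yields "" at end of input, so the
    lookahead test is safe there too."""
    if src[i] == "<":
        if src[i+1:i+2] == "=": return ("relop", "LE", i + 2)
        if src[i+1:i+2] == ">": return ("relop", "NE", i + 2)
        return ("relop", "LT", i + 1)          # other: retract one character
    if src[i] == "=":
        return ("relop", "EQ", i + 1)
    if src[i] == ">":
        if src[i+1:i+2] == "=": return ("relop", "GE", i + 2)
        return ("relop", "GT", i + 1)          # other: retract one character
    return None

print(relop("a<=b", 1))   # ('relop', 'LE', 3)
print(relop("a<b", 1))    # ('relop', 'LT', 2)
```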
Two ways of handling reserved words that look like identifiers
• Two questions remain.
1. How do we distinguish between identifiers and keywords such as if,
then and else, which also match the pattern in the transition diagram?
2. What is (getToken(), installID())?
1) Install the reserved words in the symbol table initially.
installID() checks whether the lexeme is already in the symbol table. If it is not present, the lexeme is installed (placed in the symbol table) as an id token. In either case a pointer to the entry is returned.
getToken() examines the symbol-table entry for the lexeme found and returns the token name (see the sketch below).
2) Create a separate transition diagram for each keyword.
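A minimal Python sketch of strategy 1, with hypothetical installID()/getToken() helpers; preloading the reserved words means a single symbol-table lookup distinguishes keywords from ordinary identifiers:

```python
symtab = {}                            # lexeme -> token name
for kw in ("if", "then", "else"):      # install reserved words first
    symtab[kw] = kw

def installID(lexeme):
    # Install the lexeme as an ordinary identifier if it is not present;
    # either way, return the (key of the) symbol-table entry.
    symtab.setdefault(lexeme, "id")
    return lexeme

def getToken(lexeme):
    # Examine the entry and return its token name: a keyword name for
    # reserved words, "id" for everything else.
    return symtab[lexeme]

print(getToken(installID("if")))     # 'if'   (reserved word)
print(getToken(installID("fi")))     # 'id'   (ordinary identifier)
```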
(Transition diagram for the token number: unsigned numbers such as 12, 12.31, 12.31E4, and 12.31E+45. The optional fraction and exponent parts give the diagram multiple accepting states.)
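The same pattern written as a regular expression, digit+ (. digit+)? (E (+|-)? digit+)?, can be checked with Python's re module; the two optional parts are what give the diagram its multiple accepting states:

```python
import re

# Unsigned numbers: digits, optional fraction, optional signed exponent.
number = re.compile(r"\d+(\.\d+)?(E[+-]?\d+)?")

for lexeme in ("12", "12.31", "12.31E4", "12.31E+45"):
    print(lexeme, bool(number.fullmatch(lexeme)))   # all True
```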
Architecture of a Transition-Diagram-Based Lexical Analyzer
• There are several ways that a collection of transition diagrams can be used to build a lexical analyzer:
1. Try the transition diagrams one at a time: if a diagram fails, retract the input pointer and start the next diagram.
2. Run the various transition diagrams "in parallel," feeding the next input character to all of them and allowing each one to make whatever transitions it requires.
Parameter Passing
• Actual parameters
• Formal parameters
1. Call-by-value
2. Call-by-reference
3. Call-by-name: actual parameters are substituted literally for the formal parameters (as if a macro were expanded for each actual parameter) in the code of the callee.
Aliasing
• Two formal parameters can refer to the same location; such variables are called aliases of one another.
• Ex: this can happen when the same array a is passed to a procedure p through two different parameters.
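A minimal Python sketch of aliasing; both formal parameters are bound to the same array object, so a write through one is visible through the other:

```python
def p(x, y):
    # x and y are aliases when the caller passes the same array for both.
    x[0] = 99          # writing through x ...
    return y[0]        # ... is visible through y

a = [1, 2, 3]
print(p(a, a))         # 99: x and y name the same location
```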
The Role of the Parser
(Figure: the parser obtains tokens from the lexical analyzer by calling getNextToken; both interact with the symbol table.)
• Verifies that the tokens can be generated by the grammar of the source language
• Reports syntax errors
• Recovers from commonly occurring errors
• Constructs a parse tree and passes it to the rest of the compiler
Three Types of Parsers
• We categorize the parsers into three groups:
1. Universal parsers: can parse any grammar, but are too inefficient to use in production compilers.
2. Top-down parsers: the parse tree is created top to bottom, starting from the root.
3. Bottom-up parsers: the parse tree is created bottom to top, starting from the leaves.
• Both top-down and bottom-up parsers
scan the input from left to right
(one symbol at a time).
• Efficient top-down and bottom-up parsers can be
implemented only for sub-classes of context-free
grammars.
– LL for top-down parsing
– LR for bottom-up parsing
Syntax Error Handling
• Common programming errors can occur at many different levels:
1. Lexical errors: misspelling of identifiers, keywords, or operators.
2. Syntactic errors: misplaced semicolons or extra or missing braces.
3. Semantic errors: type mismatches between operators and operands.
4. Logical errors: incorrect reasoning, e.g., use of = instead of ==.
Error-recovery strategies of the error handler in a parser
• Panic-Mode Recovery
• Phrase-Level Recovery
• Error Productions
• Global Correction
Panic-Mode Recovery
• On discovering an error, the parser discards input symbols
one at a time until one of a designated set of Synchronizing
tokens is found.
• Synchronizing tokens are usually delimiters.
• Ex: semicolon or } whose role in the source program is clear
and unambiguous.
• It often skips a considerable amount of input without
checking it for additional errors.
Advantages:
• Simplicity
• Is guaranteed not to go into an infinite loop
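A minimal Python sketch of panic-mode recovery, assuming a token list and a designated set of synchronizing tokens (here ; and }):

```python
SYNC = {";", "}"}      # synchronizing tokens: delimiters with a clear role

def panic_mode(tokens, i):
    """On an error at tokens[i], discard input symbols one at a time until
    a synchronizing token is found; return the index where parsing resumes."""
    while i < len(tokens) and tokens[i] not in SYNC:
        i += 1                    # discard one input symbol
    return i + 1                  # resume just past the synchronizing token

print(panic_mode(["x", "+", "+", "y", ";", "z"], 1))   # 5: resume at 'z'
```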
Phrase-Level Recovery
• A parser may perform local correction on the remaining input, i.e., it may replace a prefix of the remaining input by some string that allows the parser to continue.
• Ex: replace a comma by a semicolon, or insert a missing semicolon.
• The choice of local correction is left to the compiler designer.
• It has been used in several error-repairing compilers, as it can correct any input string.
Error Productions
• We can augment the grammar for the language at hand with productions that generate the erroneous constructs.
• Then we can use the grammar augmented by these error productions to construct a parser.
• If an error production is used by the parser, we can generate appropriate error diagnostics to indicate the erroneous construct that has been recognized in the input.
Global Correction
• We use algorithms that perform a minimal sequence of changes to obtain a globally least-cost correction.
• Given an incorrect input string x and grammar G, these algorithms will find a parse tree for a related string y such that the number of insertions, deletions, and changes of tokens required to transform x into y is as small as possible.
• It is too costly to implement in terms of time and space, so these techniques are currently only of theoretical interest.
Writing a Grammar
E → E + T | T
T → T * F | F
F → ( E ) | id
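This grammar is left-recursive (E → E + T), so a top-down parser cannot use it directly; the following Python sketch assumes the standard iterative rewrite E → T (+ T)*, T → F (* F)* and parses a token list (identifiers appear as the literal token "id") by recursive descent:

```python
def parse(tokens):
    toks = tokens + ["$"]    # end-of-input marker
    i = 0

    def peek():
        return toks[i]

    def eat(t):
        nonlocal i
        if toks[i] != t:
            raise SyntaxError(f"expected {t}, got {toks[i]}")
        i += 1

    def E():                 # E -> T (+ T)*
        T()
        while peek() == "+":
            eat("+"); T()

    def T():                 # T -> F (* F)*
        F()
        while peek() == "*":
            eat("*"); F()

    def F():                 # F -> ( E ) | id
        if peek() == "(":
            eat("("); E(); eat(")")
        else:
            eat("id")

    E()
    eat("$")                 # the whole input must be consumed
    return True

print(parse(["id", "+", "id", "*", "id"]))   # True
```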
Left Factoring
Example:
S → i E t S | i E t S e S | a
E → b
Sol:
S → i E t S S′ | a
S′ → e S | ε
E → b
Top-Down Parsing
• Types: recursive-descent parsing and predictive parsing.
• Example configuration of a predictive parser: stack E$, input id + id * id $.
Shift-Reduce Parsing
• Reduction: replace a handle at the top of the stack by the nonterminal of the corresponding production.
• Shift: shift the next input symbol from the input onto the top of the stack.