Compiler Design

The document discusses the Analysis-Synthesis model of compilation. The front-end analysis phases include lexical, syntax and semantic analysis. The back-end phases, such as code generation and optimization, are called the synthesis phase because they synthesize the target language from the intermediate representation created by the front end. The model has well-defined phases, each handling a logical activity, and allows code reuse. Lexical analysis is the first phase; it breaks the input into tokens. Lexical analyzers are generated using tools like Lex and Flex, which take regular expressions as input and generate a DFA to recognize tokens.


• Advantages of the model

Also known as the Analysis-Synthesis model of compilation

- Front-end phases are known as analysis phases
- Back-end phases are known as synthesis phases
- Each phase has well-defined work
- Each phase handles a logical activity in the process of compilation

The Analysis-Synthesis model:

• The front-end phases are lexical, syntax and semantic analysis. These form the "analysis phase", since each of them performs some kind of analysis. The back-end phases are called the "synthesis phase" because they synthesize the intermediate and the target language, and hence the program, from the representation created by the front-end phases. The advantages are that not only can a lot of code be reused, but also, since the compiler is well structured, it is easy to maintain and debug.
How does Lex work?
• Regular expressions describe the languages that can be recognized by finite automata.
• Translate each token's regular expression into a non-deterministic finite automaton (NFA).
• Convert the NFA into an equivalent DFA.
• Minimize the DFA to reduce the number of states.
• Emit code driven by the DFA tables.


Lexical Analysis
Difficulties in the Implementation of
Lexical analysers
• Lexemes in a fixed position: fixed-format vs. free-format languages
• Handling of blanks
  – in Pascal, blanks separate identifiers
  – in Fortran, blanks are important only in literal strings; for example, the variable counter is the same as count er
• Another example
  DO 10 I = 1.25   is read as   DO10I=1.25   (an assignment to DO10I)
  DO 10 I = 1,25   is read as   DO10I=1,25   (a DO loop)
Recognition of reserved keywords and identifiers
• To reduce the number of states, enter keywords into the symbol table as if they were identifiers.
• When the LA consults the symbol table to find the correct lexical value to return, it discovers that this identifier is really a keyword, and the symbol table entry has the proper token code to return.
Transition diagram for unsigned
numbers
Implementation of Transition Diagram
Another transition diagram for
Unsigned Numbers

• A more complex transition diagram is difficult to implement and may give rise to errors during coding; however, there are ways to implement it better.
Lexical Analyzer Generators

• The process of constructing a lexical analyzer can be automated.
• Input to the generator:
  – List of regular expressions in priority order
  – Associated actions for each regular expression
• Output of the generator:
  – A program that reads the input character stream and breaks it into tokens
  – Reports lexical errors (unexpected characters), if any
• Two popular lexical analyzer generators are:
  – Flex: generates a lexical analyzer in C or C++. It is a more modern version of the original Lex tool that was part of the AT&T Bell Labs version of Unix.
    • An open-source implementation of the original UNIX lex utility
  – JLex: written in Java; generates a lexical analyzer in Java
• Lex:
  – It is a lexical analyzer generator.
  – The input notation for the Lex tool is referred to as the Lex language.
  – The tool itself is the Lex compiler.
  – Behind the scenes, the Lex compiler transforms the input patterns into a transition diagram and generates code, in a file called lex.yy.c, that simulates this transition diagram.
• An input file, which we call lex.l, is written in the Lex language and describes the lexical analyzer to be generated.
• The Lex compiler transforms lex.l into a C program, in a file that is always named lex.yy.c.
• That file is compiled by the C compiler into a file called a.out, as always.
• The C compiler's output is a working lexical analyzer that can take a stream of input characters and produce a stream of tokens.
• The compiled output, referred to as a.out, is normally used as a subroutine of the parser.
Fig: Creating a lexical analyzer with Lex
Structure of Lex programs:
• A Lex program has the following form:

  declarations
  %%
  translation rules
  %%
  auxiliary functions (optional)

• The declaration section includes declarations of
  – variables, manifest constants (identifiers declared to stand for a constant), and
  – regular definitions.
• The translation rules each have the form
  pattern { action }
  – Actions contain C code: a single statement or a code block.
Lex Pattern Examples

  abc             Matches the string "abc"
  [a-z A-Z]       Matches any lowercase or uppercase letter
  Dog.*cat        Matches any string starting with Dog and ending with cat
  (ab)+           Matches one or more occurrences of "ab" concatenated
  [^a-z]+         Matches any string of one or more characters that does not include lowercase a-z
  [ + -]? [0-9]+  Matches any string of one or more digits with an optional prefix of + or -
Lex input example
• Filename: example1.l

  %%
  "HI"    printf("Hello World");
  .       ;
  %%

  (The second rule's action is an empty C statement; it does nothing for any other character encountered in the input.)
Executing the Lex File
• lex example1.l
  (Processes the Lex file to generate a scanner, which gets saved as lex.yy.c)
• cc lex.yy.c -ll
  (Compiles the scanner and grabs main() from the Lex library, -ll)
• ./a.out
  (Runs the scanner, taking input from standard input)
Fig: Lex program for the relational operators as
tokens
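The figure itself is not reproduced in these notes; a minimal sketch of such a specification might look like the following (the token codes LT, LE, EQ, NE, GT, GE and the header tokens.h are assumptions made for this sketch):

  %{
  #include "tokens.h"   /* hypothetical header defining LT, LE, EQ, NE, GT, GE */
  %}
  %%
  "<"     { return LT; }
  "<="    { return LE; }
  "="     { return EQ; }
  "<>"    { return NE; }
  ">"     { return GT; }
  ">="    { return GE; }
  %%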
• For a trivial example, consider a program to delete from the input all blanks or tabs at the ends of lines.

  %%
  [ \t]+$   ;

  is all that is required.
  – The program contains a %% delimiter to mark the beginning of the rules, and one rule.
  – This rule contains a regular expression which matches one or more instances of the characters blank or tab (written \t for visibility, in accordance with the C language convention) just prior to the end of a line.
• To change any remaining string of blanks or tabs to a single blank, add another rule:

  %%
  [ \t]+$   ;
  [ \t]+    printf(" ");

  – The first rule matches all strings of blanks or tabs at the end of lines, and
  – the second rule matches all remaining strings of blanks or tabs.
• Lex can be used alone for simple transformations, or for analysis and statistics gathering on a lexical level.
• It is particularly easy to interface Lex and Yacc.
  – Lex programs recognize only regular expressions;
  – Yacc writes parsers that accept a large class of context-free grammars, but they require a lower-level analyzer to recognize input tokens.
  – Thus, a combination of Lex and Yacc is often appropriate.
  – When used as a preprocessor for a later parser generator, Lex is used to partition the input stream, and the parser generator assigns structure to the resulting pieces.
• The general format of Lex source is:
{definitions}
%%
{rules}
%%
{user subroutines}

• The definitions and the user subroutines are often omitted. The second %% is optional, but the first is required to mark the beginning of the rules. The absolute minimum Lex program is thus

  %%

  (no definitions, no rules), which translates into a program that copies the input to the output unchanged.
• If the action is merely a single C expression, it
can just be given on the right side of the line; if it
is compound, or takes more than a line, it should
be enclosed in braces.

• The operator characters are

"\[]^-?.*+|()$/{}%<>
and if they are to be used as text characters, an
escape should be used.
• The quotation mark operator (") indicates that
whatever is contained between a pair of quotes is to
be taken as text characters. Thus
xyz"++"
matches the string xyz++ when it appears.
– Note that a part of a string may be quoted. It is harmless
but unnecessary to quote an ordinary text character; the
expression
"xyz++"
is the same as the one above.
• An operator character may also be turned into
a text character by preceding it with \ as in
xyz\+\+
which is another, less readable, equivalent of
the given expressions.
• In character classes, the ^ operator must appear as
the first character after the left bracket; it indicates
that the resulting string is to be complemented with
respect to the computer character set. Thus

[^abc]
matches all characters except a, b, or c, including all
special or control characters; or
[^a-zA-Z]
is any character which is not a letter.
• Lex Actions:
  – When a specified expression is matched, Lex executes the corresponding action.
  – Note that there is a default action, which consists of copying the input to the output.
    • This is performed on all strings not otherwise matched.
    • Thus the Lex user who wishes to absorb the entire input, without producing any output, must provide rules to match everything.
• In more complex actions, the user will often want to know the actual text that matched some expression like [a-z]+. Lex leaves this text in an external character array named yytext. Thus, to print the name found, a rule like

  [a-z]+   printf("%s", yytext);

  will print the string in yytext.
• Ambiguous Source Rules
  – Lex can handle ambiguous specifications. When more than one expression can match the current input, Lex chooses as follows:
    1) The longest match is preferred.
    2) Among rules which matched the same number of characters, the rule given first is preferred.
• Thus, suppose the rules

  integer   keyword action ...;
  [a-z]+    identifier action ...;

  are given in that order.
  – If the input is integers, it is taken as an identifier, because [a-z]+ matches 8 characters while integer matches only 7.
  – If the input is integer, both rules match 7 characters, and the keyword rule is selected because it was given first.
  – Anything shorter (e.g. int) will not match the expression integer, and so the identifier interpretation is used.
• Remember that Lex is turning the rules into a
program. Any source not intercepted by Lex is
copied into the generated program.
Parsing
Syntax Analysis: What does it do?
• Error reporting and recovery
• Modelling using context-free grammars
• Recognition using push-down automata / table-driven parsers
What a syntax analyser cannot do:
• Check whether variables are of types on which the operations are allowed
• Check whether a variable has been declared before use
• Check whether a variable has been initialized
• These issues will be handled in semantic analysis
Limitations of Regular Languages
• Can regular expressions be used to describe language syntax precisely and conveniently?
• Many languages are not regular; for example, strings of balanced parentheses: "for every opening parenthesis there must be a closing parenthesis" cannot be described using a regular expression.
  - (((( … ))))
  - { (^i )^i | i ≥ 0 }
  - There is no regular expression for this language.
  - A finite automaton may repeat states; however, it cannot remember the number of times it has been to a particular state.
• Many programming languages have an
inherently recursive structure that can be
defined by Context Free Grammars (CFG)
rather intuitively.
Syntax Definition:
• Context-free grammars
  - a set of tokens (terminal symbols)
  - a set of non-terminal symbols
  - a set of productions of the form: non-terminal → string of terminals & non-terminals
  - a start symbol
  <T, N, P, S>
• A grammar derives strings by beginning with a start symbol and repeatedly replacing a non-terminal by the right-hand side of a production for that non-terminal.

• The strings that can be derived from the start symbol of a grammar G form the language L(G) defined by the grammar.
Examples
• String of balanced parentheses:
  S → ( S ) S | ε

• Grammar for a string of digits separated by + or -:
  list → list + digit | list - digit | digit
  digit → 0 | 1 | … | 9

• A derivation of 9-5+2:
  list ⇒ list + digit
       ⇒ list - digit + digit
       ⇒ digit - digit + digit
       ⇒ 9 - digit + digit
       ⇒ 9 - 5 + digit
       ⇒ 9 - 5 + 2

  Therefore, the string 9-5+2 belongs to the language specified by the grammar.

It is interesting to know that the name "context-free grammar" comes from the fact that the use of a production X → γ does not depend on the context of X.
• The simple reason is that non-terminals appear by themselves to the left of the arrow in context-free rules:
  A → γ
• The rule A → γ says that A may be replaced by γ anywhere, regardless of where A occurs.
• On the other hand, we could define a context as a pair of strings (α, β) such that a rule would apply only if α occurs before and β occurs after the non-terminal A. We would write this as
  α A β → α γ β

Derivations
• The construction of a parse tree can be made
precise by taking a derivational view, in which
productions are treated as rewriting rules.
• At each step, we choose a non-terminal to
replace. Different choices can lead to different
derivations.

• Two derivations are of interest:
  1. Leftmost: replace the leftmost non-terminal (NT) at each step
  2. Rightmost: replace the rightmost NT at each step
• For an ambiguous grammar, different derivations can produce different parse trees, and different parse trees imply different evaluation orders!
• Parse Trees
  – The derivations can be represented in a tree-like fashion.
  – The interior nodes contain the non-terminals used during the derivation.
Context free grammars Versus Regular
Expressions
• Every construct that can be described by a
regular expression can be described by a
grammar, but not vice versa.
• Alternatively, every regular language is a
context free language, but not vice-versa.
• We can mechanically construct a grammar that recognizes the same language as a non-deterministic finite automaton (NFA).
• It uses the following construction:
  – For each state i of the NFA, create a non-terminal Ai.
  – If state i has a transition to state j on input a, add the production Ai → a Aj. If state i goes to state j on input ε, add the production Ai → Aj.
  – If i is an accepting state, add Ai → ε.
  – If i is the start state, make Ai the start symbol of the grammar.
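For example, for a small NFA of our own choosing with states 0 and 1, transitions 0 → 0 on a, 0 → 0 on b, 0 → 1 on a, start state 0 and accepting state 1 (it accepts strings of a's and b's ending in a), the construction gives:

  A0 → a A0 | b A0 | a A1
  A1 → ε

with A0 as the start symbol.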
Parsing Techniques
• There are two primary parsing techniques:
– Top-down
– Bottom-up
Top-down parsers:
• A top-down parser starts at the root of the parse tree and grows towards the leaves.
• At each node, the parser picks a production and tries to match the input.
• However, the parser may pick the wrong production, in which case it will need to backtrack.
• Some grammars are backtrack-free.
Bottom-up parsers:
• A bottom-up parser starts at the leaves and grows toward the root of the parse tree.
• As input is consumed, the parser encodes possibilities in an internal state.
• The bottom-up parser starts in a state valid for legal first tokens.
LL(1) Parser
• The first L in LL(1) stands for scanning the
input from left to right
• The second L stands for producing a leftmost
derivation
• The “1” stands for using one input symbol of
lookahead at each step to make parsing action
decisions.
Recursive Descent Parsing
• It consists of a set of procedures, one for each
non terminal.
• Execution begins with the procedure for the
start symbol, which halts and announces
success if it scans the entire input string.
void A( )
{
   choose an A-production, A → X1 X2 … Xk;
   for ( i = 1 to k ) {
      if ( Xi is a non-terminal )
         call procedure Xi( );
      else if ( Xi equals the current input symbol a )
         advance the input to the next symbol;
      else /* an error has occurred */ ;
   }
}

Fig.: A typical procedure for a non-terminal in a top-down parser


• Predictive parsers, that is, recursive-descent parsers needing no backtracking, can be constructed for a class of grammars called LL(1).
• A grammar G is LL(1) if and only if whenever A → α | β are two distinct productions of G, the following conditions hold:
  1. For no terminal a do both α and β derive strings beginning with a. (Equivalently, FIRST(α) and FIRST(β) are disjoint sets.)
  2. At most one of α and β can derive the empty string.
  3. If ε is in FIRST(β), then FIRST(α) and FOLLOW(A) are disjoint sets, and likewise if ε is in FIRST(α).
• Consider the grammar
  E  → i E'
  E' → + i E' | ε

• A recursive-descent parser for this grammar in C (E' is written Eprime):

  #include <stdio.h>

  char l;                      /* lookahead symbol */

  void E(void);
  void Eprime(void);           /* E' in the grammar */
  void match(char t);

  void E(void)
  {
      if (l == 'i') {
          match('i');
          Eprime();
      }
  }

  void Eprime(void)
  {
      if (l == '+') {
          match('+');
          match('i');
          Eprime();
      } else
          return;
  }

  void match(char t)
  {
      if (l == t)
          l = getchar();
      else
          printf("error");
  }

  int main(void)
  {
      l = getchar();           /* read the first input symbol */
      E();
      if (l == '$')
          printf("parsing success");
      return 0;
  }
Non-recursive predictive parsing
• A non-recursive predictive parser can be built by maintaining a stack explicitly, rather than implicitly via recursive calls.
• The table-driven parser has an input buffer, a
stack containing a sequence of grammar
symbols, a parsing table constructed by a parsing
algorithm and an output stream.
• The input buffer contains the string to be parsed
followed by the endmarker $.
• We reuse the symbol $ to mark the bottom of the
stack, which initially contains the start symbol of
the grammar on top.
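As a concrete illustration, here is a minimal sketch of such a table-driven parser in C for the example grammar E → iE', E' → +iE' | ε used earlier (the fixed-size stack, the sample input string, and the use of the letter 'Q' to stand for E' are assumptions made for this sketch; the parse-table entries are hard-coded as if/else tests):

  #include <stdio.h>

  /* Non-recursive predictive parser for:  E -> i E'   E' -> + i E' | epsilon
     Symbols: 'E', 'Q' (standing for E'); terminals 'i', '+', endmarker '$'. */

  static char stack[100];
  static int top = -1;

  static void push(char c) { stack[++top] = c; }
  static void pop(void)    { top--; }

  int main(void)
  {
      const char *input = "i+i+i$";    /* sample input; '$' marks the end */
      int ip = 0;

      push('$');                       /* bottom-of-stack marker */
      push('E');                       /* start symbol on top */

      while (stack[top] != '$') {
          char X = stack[top];
          char a = input[ip];

          if (X == 'i' || X == '+') {              /* terminal: must match input */
              if (X == a) { pop(); ip++; }
              else { printf("error\n"); return 1; }
          } else if (X == 'E') {                   /* M[E, i] = E -> i E' */
              if (a == 'i') { pop(); push('Q'); push('i'); }
              else { printf("error\n"); return 1; }
          } else {                                 /* X == 'Q', i.e. E' */
              if (a == '+') { pop(); push('Q'); push('i'); push('+'); }  /* E' -> + i E' */
              else pop();                          /* take E' -> epsilon; errors surface at the final check */
          }
      }

      if (input[ip] == '$') printf("parsing success\n");
      else printf("error\n");
      return 0;
  }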
Bottom Up Parsing
• We can think of bottom-up parsing as the process
of “reducing” a string w to the start symbol of the
grammar.

• At each reduction step, a specific substring matching the body of a production is replaced by the non-terminal at the head of that production.
• The key decisions during bottom-up parsing are about when to reduce and about what production to apply, as the parse proceeds.
• Handle Pruning: A “handle” is a substring that
matches the body of a production, and whose
reduction represents one step along the
reverse of a rightmost derivation.

• Shift-reduce parsing: It is a form of bottom-up parsing in which a stack holds grammar symbols and an input buffer holds the rest of the string to be parsed.
• While the primary operations are shift and
reduce, there are actually four possible
actions a shift-reduce parser can make:
1. Shift- Shift the next input symbol onto the top of
the stack.
2. Reduce- The right end of the string to be reduced must be at the top of the stack. Locate the left end of the string within the stack and decide with what non-terminal to replace the string.
3. Accept- Announce successful completion of
parsing.
4. Error-Discover a syntax error and call an error
recovery routine.
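As an illustration (the grammar and input are chosen here for the example), a shift-reduce parse of id + id with the grammar E → E + E | id could proceed as follows, showing the stack, the remaining input and the action taken:

  $            id + id $     shift
  $ id         + id $        reduce by E → id
  $ E          + id $        shift
  $ E +        id $          shift
  $ E + id     $             reduce by E → id
  $ E + E      $             reduce by E → E + E
  $ E          $             accept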
Parser Generators
• Parser generators exist for LL(1) and LALR(1) grammars. For example:
  – LALR(1): YACC, Bison, CUP
  – LL(1): ANTLR
  – Recursive descent: JavaCC
• The structure of a yacc file is
  declarations
  %%
  translation rules
  %%
  supporting C/C++ functions
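A minimal sketch of a yacc file in this format (the grammar, the hand-written yylex(), and the single-digit DIGIT token are assumptions chosen for illustration):

  %{
  #include <stdio.h>
  #include <ctype.h>
  int yylex(void);
  int yyparse(void);
  void yyerror(const char *s) { fprintf(stderr, "%s\n", s); }
  %}
  %token DIGIT
  %%
  line : expr '\n'        { printf("= %d\n", $1); }
       ;
  expr : expr '+' term    { $$ = $1 + $3; }
       | term
       ;
  term : DIGIT
       ;
  %%
  int yylex(void) {                  /* trivial scanner: digits and literal characters */
      int c = getchar();
      if (c == EOF) return 0;        /* 0 signals end of input to yacc */
      if (isdigit(c)) { yylval = c - '0'; return DIGIT; }
      return c;
  }
  int main(void) { return yyparse(); }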
Semantic Analysis
Syntax Directed Translation
• It is a systematic process of assigning meanings to programs, which can also be viewed as the computation of some special information (attributes) associated with the symbols of the grammar.
• To accomplish the task of syntax-directed translation, there are two general approaches:
  – Syntax-Directed Definitions (SDD)
  – Syntax-Directed Translation (SDT) Schemes
• The conceptual view of syntax-directed translation can be presented as
  Input string → Parse tree → Dependency graph → Evaluation order for semantic rules
Syntax Analyzer as translator

Fig.: Syntax-Directed Translation — the parser takes a continuous stream of tokens, together with syntax and translation rules, and produces an abstract syntax tree, a syntax tree, intermediate code, etc.
Syntax-Directed Translation
• The grammar symbols are associated with
attributes to associate information with the
programming language constructs that they
represent.
• Values of these attributes are evaluated by the
semantic rules associated with the grammar
productions.
• Any attribute may hold almost any information
– It can hold a string, a number, a memory location, etc
• Evaluation of these semantic rules:
  – May put information into the symbol table
  – May perform type checking
  – May issue error messages
  – May perform some other activities
Syntax-Directed Definitions and Translation Schemes
• When we associate semantic rules with
productions, we use two notations:

– Syntax-Directed Definitions
– Translation Schemes
• Syntax-Directed Definitions:
  – Give high-level specifications for translations
  – Hide many implementation details, such as the order of evaluation of semantic actions
  – We associate a production rule with a set of semantic actions, and we do not say when they will be evaluated.
• Syntax-Directed Translation Schemes:
  – Indicate the order of evaluation of the semantic actions associated with a production rule.
  – In other words, translation schemes give a little more information about implementation details.
Syntax Directed Definitions
• An SDD is a generalization of a context-free grammar in which:
  – Each grammar symbol is associated with a set of attributes.
  – This set of attributes for a grammar symbol can be of the following categories:
    • Synthesized attributes
    • Inherited attributes
  – Each production rule is associated with a set of semantic rules.
• In an SDD, each production A → α is associated with a set of semantic rules of the form
  b = f(c1, c2, …, cn)
  where f is a function and b can be one of the following:
  – b is a synthesized attribute of A and c1, c2, …, cn are attributes of the grammar symbols in the production (A → α), or
  – b is an inherited attribute of one of the grammar symbols in α (on the right side of the production), and c1, c2, …, cn are attributes of the grammar symbols in the production (A → α).
• Terminals have only synthesized attributes, whose values are provided by the scanner (the lexical analyzer).
• The start non-terminal typically has no inherited attributes.
• We may also allow function calls as semantic rules; they are called "side effects".
Annotated Parse Tree
• A parse tree showing the values of attributes at each node is called an annotated parse tree.
• The process of computing the attribute values at the nodes is called annotation (or decoration) of the parse tree.
• The order of these computations depends on the dependency graph induced by the semantic rules.
• Values of attributes in the nodes of an annotated parse tree are either
  – initialized to constant values by the lexical analyzer, or
  – determined by the semantic rules.
Evaluating Attributes
• If a syntax-directed definition employs only synthesized attributes, the evaluation of the attributes can be done in a bottom-up fashion.
• Inherited attributes would require more arbitrary "traversals" of the annotated parse tree.
• A dependency graph suggests possible evaluation orders for an annotated parse tree.
Attribute Grammar
• An attribute grammar is a formal way to define
attributes for the productions of a formal grammar,
associating these attributes with values.
Example of a Syntax-Directed Definition
• Here,
  – The SDD is based on the grammar for arithmetic expressions, which evaluates expressions terminated by an endmarker n.
  – Grammar symbols: L, E, T, F, n, +, *, (, ), digit
  – Non-terminals E, T, F have an attribute called val.
  – Terminal digit has an attribute called lexval, whose value is provided by the lexical analyzer.

  PRODUCTION        SEMANTIC RULE
  L → E n           print(E.val)
  E → E1 + T        E.val = E1.val + T.val
  E → T             E.val = T.val
  T → T1 * F        T.val = T1.val * F.val
  T → F             T.val = F.val
  F → ( E )         F.val = E.val
  F → digit         F.val = digit.lexval
Synthesized and Inherited attributes
• Synthesized attributes are computed in a bottom-up fashion, from the leaves upwards.
• Inherited attributes flow down from the parent or siblings to the node in question.
Evaluation Order for SDDs
• “Dependency graphs” are a useful tool for
determining an evaluation order for the
attribute instances in a given parse tree.
• While an annotated parse tree shows the
values of attributes, a dependency graph
helps us determine how those values can be
computed.
Dependency graphs
• A dependency graph depicts the flow of information
among the attributes instances in a particular parse
tree; an edge from one attribute instance to
another means that the value of the first is needed
to compute the second.
• Edges express constraints implied by the semantic
rules.
  – Consider the following production and rule:
    PRODUCTION       SEMANTIC RULE
    E → E1 + T       E.val = E1.val + T.val
• Here, val is a synthesized attribute.
• As a convention, we show the parse tree edges
as dotted lines, while the edges of the
dependency graph are solid.
Ordering the Evaluation of Attributes
• The dependency graph characterizes the possible orders in which we can evaluate the attributes at the various nodes of a parse tree.
• If the dependency graph has an edge from node M to node N, then the attribute corresponding to M must be evaluated before the attribute of N.
• A topological sort of a directed graph is a linear ordering of its vertices such that for each directed edge x → y from a vertex x to a vertex y, x comes before y in the ordering.
• (For the dependency graph in the accompanying figure, not reproduced here, there are other topological sorts as well, such as 1,3,5,2,4,6,7,8,9.)
• There are two important classes of SDDs:
  1. S-attributed definitions: An SDD is S-attributed if every attribute is synthesized.
  2. L-attributed definitions: Each attribute must be either
     a. synthesized, or
     b. inherited, but with the rules limited as follows.
     Suppose there is a production A → X1 X2 … Xn, and that there is an inherited attribute Xi.a computed by a rule associated with this production. Then the rule may use only:
     • inherited attributes associated with the head A;
     • either inherited or synthesized attributes associated with the occurrences of the symbols X1, X2, …, Xi-1 located to the left of Xi;
     • inherited or synthesized attributes associated with this occurrence of Xi itself, but only in such a way that there are no cycles in the dependency graph formed by the attributes of this Xi.
– Given the syntax-directed definition below with the synthesized attribute val, draw the annotated parse tree for the expression (3+4)*(5+6).

  PRODUCTION        SEMANTIC RULE
  L → E n           print(E.val)
  E → E1 + T        E.val = E1.val + T.val
  E → T             E.val = T.val
  T → T1 * F        T.val = T1.val * F.val
  T → F             T.val = F.val
  F → ( E )         F.val = E.val
  F → digit         F.val = digit.lexval
Applications of Syntax-Directed
Translation
• One important application of SDT is the construction of syntax trees.
• Since some compilers use syntax trees as an intermediate representation, a common form of SDD turns its input string into a tree.
• To complete the translation to intermediate code, the compiler may then walk the syntax tree, using another set of rules that are in effect an SDD on the syntax tree rather than the parse tree.
• We consider two SDDs for constructing syntax trees for expressions:
  – an S-attributed definition, which is suitable for use during bottom-up parsing, and
  – an L-attributed definition, which is suitable for use during top-down parsing.
Construction of syntax trees
• We implement the nodes of a syntax tree by
objects with a suitable number of fields. Each
object will have an op field that is the label of the
node.
• The objects will have additional fields as follows:
  – If the node is a leaf, an additional field holds the lexical value for the leaf. The construction function Leaf(op, val) creates a leaf object. Alternatively, if the nodes are viewed as records, Leaf returns a pointer to a new record for a leaf.
  – If the node is an interior node, there are as many additional fields as the node has children in the syntax tree. A construction function Node takes two or more arguments: Node(op, c1, c2, …, ck) creates an object with first field op and k additional fields for the k children c1, …, ck.
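A sketch of the semantic rules that build such a tree for a simple expression grammar, using the Leaf and Node constructors just described (the exact grammar below is an assumption, following the usual textbook formulation):

  E → E1 + T    { E.node = new Node('+', E1.node, T.node); }
  E → E1 - T    { E.node = new Node('-', E1.node, T.node); }
  E → T         { E.node = T.node; }
  T → ( E )     { T.node = E.node; }
  T → id        { T.node = new Leaf(id, id.entry); }
  T → num       { T.node = new Leaf(num, num.val); }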
Intermediate Code Generation
• In the analysis-synthesis model of a compiler,
the front end analyzes a source program and
creates an intermediate representation, from
which the back-end generates target code.
Why intermediate code?
• While generating machine code directly from source code is possible, it entails two problems:
  – With m languages and n target machines, we would need to write m front ends, m*n optimizers, and m*n code generators.
  – The code optimizer, which is one of the largest and most difficult-to-write components of a compiler, could not be reused.
• By converting source code to an intermediate code, a machine-independent code optimizer may be written.
• This means just m front ends, n code generators, and 1 optimizer.
• Directed Acyclic Graphs for Expressions
  – Like the syntax tree for an expression, a DAG has leaves corresponding to atomic operands and interior nodes corresponding to operators.
  – The difference is that a node N in a DAG has more than one parent if N represents a common subexpression.
    • In a syntax tree, the tree for the common subexpression would be replicated as many times as the subexpression appears in the original expression.
  – A DAG not only represents expressions more succinctly, it gives the compiler important clues regarding the generation of efficient code to evaluate the expressions.
The Value-Number method for
construction of a DAG
Three Address Code
• In three-address code, there is at most one operator on the right side of an instruction; that is, no built-up arithmetic expressions are permitted.
• Thus a source-language expression like x + y*z might be translated into the sequence of three-address instructions

  t1 = y * z
  t2 = x + t1

  where t1 and t2 are compiler-generated temporary names.
• This unraveling of multi-operator arithmetic expressions and of nested flow-of-control statements makes three-address code desirable for target-code generation and optimization.
• Three address code can be implemented using
records with fields for the addresses;
Quadruples, triples and indirect triples.
• Quadruples: A quadruple has four fields,
which we call op, arg1, arg2 and result.
– The op field contains an internal code for the
operator.
– For instance, the three-address instruction x = y +z
is represented by placing + in op, y in arg1, z in
arg2, and x in result
• There are some exceptions to this rule:
  – Instructions with unary operators like x = minus y or x = y do not use arg2. Note that for a copy statement like x = y, op is =, while for most other operations, the assignment operator is implied.

– Conditional and unconditional jumps put the


target label in result.
• Triples: A triple has only three fields, which we
call op, arg1, and arg2.
– Using triples, we refer to the result of an
operation x op y by its position, rather than by an
explicit temporary name.

– Thus, instead of the temporary t1 in Fig. 6.10 (b),


a triple representation would refer to position (0).
Parenthesized numbers represent pointers into
the triple structure itself.
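For instance, the two instructions t1 = y * z and t2 = x + t1 from the earlier x + y*z example would look as follows in the two representations (the column layout is illustrative):

  Quadruples:                        Triples:
     op   arg1  arg2  result            op   arg1  arg2
  0  *    y     z     t1            0   *    y     z
  1  +    x     t1    t2            1   +    x     (0)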
• A benefit of quadruples over triples can be
seen in an optimizing compiler, where
instructions are often moved around.
• With quadruples, if we move an instruction
that computes a temporary t, then the
instructions that use t require no change.
• With triples, the result of an operation is
referred to by its position, so moving an
instruction may require us to change all
references to that result.
• Indirect triples : Indirect triples consist of a listing
of pointers to triples, rather than a listing of
triples themselves

• With indirect triples, an optimizing compiler can


move an instruction by reordering the instruction
list, without affecting the triples themselves.
• Static checking includes type checking, which
ensures that operators are applied to
compatible operands.

• It also includes any syntactic checks that


remain after parsing.

• For example, static checking assures that a


break-statement in C is enclosed within a
while-, for-, or switch-statement; an error is
reported if such an enclosing statement does
not exist
Types
• The applications of types can be grouped under
checking and translation:

  – Type checking uses logical rules to reason about the behavior of a program at run time.
    • Specifically, it ensures that the types of the operands match the types expected by an operator. For example, the && operator in Java expects its two operands to be booleans; the result is also of type boolean.
  – Translation applications: From the type of a name, a compiler can determine the storage that will be needed for that name at run time.
Storage layout for local names
• From the type of a name, we can determine
the amount of storage that will be needed for
the name at run time.
Type Conversions
• Consider expressions like x + i, where x is of type
float and i is of type integer.

• Since the representation of integers and floating-


point numbers is different within a computer and
different machine instructions are used for
operations on integers and floats, the compiler
may need to convert one of the operands of + to
ensure that both operands are of the same type
when the addition occurs.
• Suppose that integers are converted to floats
when necessary, using a unary operator
(float)
• For example, the integer 2 is converted to a
float in the code for the expression 2 * 3 .14:
t1 = (float) 2;
t2 = t1 * 3.14;
• We introduce another attribute E.type, whose value is either integer or float.
• The rule associated with E → E1 + E2 builds on the pseudocode

  if ( E1.type = integer and E2.type = integer )
      E.type = integer;
  else if ( E1.type = float and E2.type = integer )
      …
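The slide elides the remaining cases; assuming integer and float are the only types involved, the rule presumably continues in the same style:

  else if ( E1.type = integer and E2.type = float )   E.type = float;
  else if ( E1.type = float   and E2.type = float )   E.type = float;
  else                                                E.type = type_error;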
• Type conversion rules vary from language to
language.
• The rules for Java in Fig. 6.25 distinguish
between widening conversions, which are
intended to preserve information, and
narrowing conversions, which can lose
information.
• The widening rules are given by the hierarchy
in Fig. 6.25(a): any type lower in the hierarchy
can be widened to a higher type.
• Thus, a char can be widened to an int or to a
float, but a char cannot be widened to a short.

• The narrowing rules are illustrated by the graph in Fig. 6.25(b): a type s can be narrowed to a type t if there is a path from s to t.
• Note that char, short, and byte are pairwise convertible to each other.
• Conversion from one type to another is said to
be implicit if it is done automatically by the
compiler.
• Implicit type conversions, also called
coercions, are limited in many languages to
widening conversions.
• Conversion is said to be explicit if the
programmer must write something to cause
the conversion.
• Explicit conversions are also called casts.
• The semantic action for checking E → E1 + E2 uses two functions:
  1. max(t1, t2) takes two types t1 and t2 and returns the maximum of the two types in the widening hierarchy. It declares an error if either t1 or t2 is not in the hierarchy.
  2. widen(a, t, w) generates type conversions if needed to widen an address a of type t into a value of type w. It returns a itself if t and w are the same type. Otherwise, it generates an instruction to do the conversion, places the result in a temporary, and returns that temporary as the result.
• Pseudocode for widen, assuming that the only types are integer and float, appears in Fig. 6.26.
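Fig. 6.26 is not reproduced in these notes; a sketch of widen along the lines just described (textbook-style pseudocode, with Temp and gen taken as given helper routines) is:

  Addr widen(Addr a, Type t, Type w)
  {
      if ( t = w ) return a;                 /* no conversion needed */
      else if ( t = integer and w = float ) {
          temp = new Temp();
          gen(temp '=' '(float)' a);         /* emit: temp = (float) a */
          return temp;
      }
      else error;
  }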
Code Optimization Techniques
1. Constant Folding
   – It refers to the technique of evaluating, at compile time itself, expressions whose operands are known to be constant.
   – E.g. a = (22/7) * d
2. Constant Propagation
   – In constant propagation, if a variable is assigned a constant value, then subsequent uses of that variable can be replaced by the constant as long as no intervening assignment has changed the value of the variable.
   – E.g. pi = 3.14
          r = 5
          area = pi * r * r
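To make the effect concrete (numbers worked out here for illustration): in the folding example, 22/7 is evaluated once at compile time, so no division remains in the generated code (note that with integer operands 22/7 would fold to 3; a floating-point constant such as 3.142857 is presumably intended). In the propagation example, the uses of pi and r are replaced by their constant values, giving area = 3.14 * 5 * 5, which constant folding can then reduce to area = 78.5.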
3. Common Sub-expression Elimination
   – This technique eliminates the recomputation of a redundant expression, reusing the previously computed value each time the expression is encountered again.
   – Example:
     Before optimization:        After optimization:
       T1 = 4*i                    T1 = 4*i
       T2 = a[T1]                  T2 = a[T1]
       T3 = 4*j                    T3 = 4*j
       T4 = 4*i                    T5 = n
       T5 = n                      T6 = b[T1] + T5
       T6 = b[T4] + T5
4. Code Movement
   – It is a technique of moving a block of code outside a loop when doing so makes no difference to the result.
   – Example:
     Before optimization:          After optimization:
       for(int i=0;i<n;i++)          x = y + z;
       {                             for(int i=0;i<n;i++)
           x = y + z;                {
           a[i] = 6*i;                   a[i] = 6*i;
       }                             }
5. Dead Code Elimination
   – This method eliminates code statements which are never executed, are unreachable, or whose output is never used.
   – Example:
     Before optimization:          After optimization:
       i = 0;                        i = 0;
       if (i == 1)
       {
           a = x + 5;
       }
6. Strength Reduction
   – It is the replacement of expensive expressions with cheaper and simpler ones.
   – Example:
     Before optimization:   B = A * 2;
     After optimization:    B = A + A;
Basic Blocks and Flow Graphs
• Helps in the identification of loops in the code for the code-optimization process.
• The first job is to partition a sequence of three-address instructions into basic blocks.
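A small illustration (the fragment is made up for this example): in the three-address sequence

  (1) i = 1
  (2) t1 = 4 * i
  (3) a[t1] = 0
  (4) i = i + 1
  (5) if i <= 10 goto (2)
  (6) ...

instruction (1) is a leader (the first instruction), (2) is a leader (the target of the jump in (5)), and (6) is a leader (it immediately follows a jump). The basic blocks are therefore {1}, {2,3,4,5} and {6,…}, and the edge from block {2…5} back to itself in the flow graph identifies the loop.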
Code Generation
Code Generation
• Three primary tasks of a code generator:
  – Instruction selection: choosing appropriate target-machine instructions to implement the IR statements.
  – Register allocation and assignment: deciding what values to keep in which registers.
  – Instruction ordering: deciding in what order to schedule the execution of instructions.
A simple Target Machine Model
• Our target computer models a three-address machine with load and store operations, computation operations, jump operations, and conditional jumps.
• The underlying computer is a byte-addressable machine with n general-purpose registers R0, R1, …, Rn-1.
• We assume the following kinds of instructions are available:
  – Load operations: The instruction LD dst, addr loads the value in location addr into location dst.
  – Store operations: The instruction ST x, r stores the value in register r into the location x.
  – Computation operations of the form OP dst, src1, src2, where OP is an operator like ADD or SUB, and dst, src1, and src2 are locations.
  – Unconditional jumps: The instruction BR L causes control to branch to the machine instruction with label L. (BR stands for branch.)
  – Conditional jumps of the form Bcond r, L, where r is a register, L is a label, and cond stands for any of the common tests on values in the register r.
    • For example, BLTZ r, L causes a jump to label L if the value in register r is less than zero, and allows control to pass to the next machine instruction if not.
Addressing Modes
• We assume our target machine has a variety of addressing modes:
  – In instructions, a location can be a variable name x referring to the memory location that is reserved for x.
  – A location can also be an indexed address of the form a(r), where a is a variable and r is a register.
    • For example, the instruction LD R1, a(R2) has the effect of setting R1 = contents(a + contents(R2)), where contents(x) denotes the contents of the register or memory location represented by x.
  – A memory location can be an integer indexed by a register. For example, LD R1, 100(R2) has the effect of setting R1 = contents(100 + contents(R2)), that is, of loading into R1 the value in the memory location obtained by adding 100 to the contents of register R2.
  – Two indirect addressing modes: *r means the memory location found in the location represented by the contents of register r, and *100(r) means the memory location found in the location obtained by adding 100 to the contents of r.
    • For example, LD R1, *100(R2) has the effect of setting R1 = contents(contents(100 + contents(R2))), that is, of loading into R1 the value in the memory location stored in the memory location obtained by adding 100 to the contents of register R2.
  – Immediate constant addressing mode: The constant is prefixed by #. The instruction LD R1, #100 loads the integer 100 into register R1, and ADD R1, R1, #100 adds the integer 100 into register R1.
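For instance (the register choices are made up for this sketch), the three-address statement x = y + z could be implemented on this machine as:

  LD  R1, y          // R1 = y
  LD  R2, z          // R2 = z
  ADD R1, R1, R2     // R1 = R1 + R2
  ST  x, R1          // x = R1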
Program and Instruction Costs
• Cost of an instruction = 1 + the costs associated with the addressing modes of the operands.
• Addressing modes involving registers have zero additional cost, while those involving a memory location or a constant have an additional cost of one.
• We assume the cost of a target-language program on a given input is the sum of the costs of the individual instructions executed when the program is run on that input.
Example:
• The instruction LD R0, R1 copies the contents of register R1 into register R0. This instruction has a cost of one because no additional memory words are required.
• The instruction LD R0, M loads the contents of memory location M into register R0. The cost is two.
• The instruction LD R1, *100(R2) loads into register R1 the value given by contents(contents(100 + contents(R2))). The cost is two.
• (Total costs of the three alternative code sequences in the accompanying example, not reproduced here: 2 + 2 + 1 + 2 = 7, 2 + 2 + 2 + 2 = 8, and 2 + 2 + 2 + 2 = 8.)