Compiler Design
Module I
Various Phases of a Compiler
The compilation process is a sequence of various phases. Each phase takes input from its previous
stage, has its own representation of source program, and feeds its output to the next phase of the
compiler. Let us understand the phases of a compiler.
Lexical Analysis
The first phase of the compiler, the scanner, works as a text scanner. This phase scans the source code as a stream
of characters and converts it into meaningful lexemes. The lexical analyzer represents these lexemes
in the form of tokens as:
<token-name, attribute-value>
Syntax Analysis
The next phase is called syntax analysis or parsing. It takes the tokens produced by lexical
analysis as input and generates a parse tree (or syntax tree). In this phase, token arrangements are
checked against the source code grammar, i.e. the parser checks if the expression made by the
tokens is syntactically correct.
Semantic Analysis
Semantic analysis checks whether the parse tree constructed follows the rules of the language. For
example, it checks that values are assigned between compatible data types, and flags errors such as adding a string to an integer.
Also, the semantic analyzer keeps track of identifiers, their types and expressions; whether
identifiers are declared before use or not etc. The semantic analyzer produces an annotated syntax
tree as an output.
Code Optimization
The next phase optimizes the intermediate code. Optimization can be viewed as
something that removes unnecessary code lines and arranges the sequence of statements in order
to speed up the program execution without wasting resources (CPU, memory).
Code Generation
In this phase, the code generator takes the optimized representation of the intermediate code and
maps it to the target machine language. The code generator translates the intermediate code into
a sequence of (generally) re-locatable machine code. Sequence of instructions of machine code
performs the task as the intermediate code would do.
Symbol Table
It is a data structure maintained throughout all the phases of a compiler. All the identifiers' names
along with their types are stored here. The symbol table makes it easier for the compiler to quickly
search the identifier record and retrieve it. The symbol table is also used for scope management.
Lexical analysis
Lexical analysis is the first phase of the compiler; the component that performs it is also known as a scanner. It converts
the high-level input program into a sequence of tokens.
• Lexical Analysis can be implemented with the Deterministic finite Automata.
• The output is a sequence of tokens that is sent to the parser for syntax analysis
Example of Non-Tokens:
• Comments, preprocessor directive, macros, blanks, tabs, newline, etc.
Lexeme: The sequence of characters matched by a pattern to form the corresponding
token or a sequence of input characters that comprises a single token is called a lexeme.
eg- “float”, “abs_zero_Kelvin”, “=”, “-”, “273”, “;” .
The lexical analysis process typically proceeds in the following stages:
1. Input preprocessing: This stage involves cleaning up the input text and preparing it
for lexical analysis, for example stripping comments and extra whitespace.
2. Tokenization: This is the process of breaking the input text into a sequence of
tokens. This is usually done by matching the characters in the input text against
a set of patterns or regular expressions that define the different types of tokens.
3. Token classification: In this stage, the lexer determines the type of each token.
For example, in a programming language, the lexer might classify keywords,
identifiers, operators, and punctuation symbols as separate token types.
4. Token validation: In this stage, the lexer checks that each token is valid
according to the rules of the programming language. For example, it might
check that a variable name is a valid identifier, or that an operator has the
correct syntax.
5. Output generation: In this final stage, the lexer generates the output of the
lexical analysis process, which is typically a list of tokens. This list of tokens can
then be passed to the next stage of compilation or interpretation.
• The lexical analyzer identifies errors with the help of the automaton and the grammar of the
given language on which it is based (e.g., C, C++), and reports the row number and
column number of the error.
Suppose we pass a statement through the lexical analyzer: a = b + c ; It will
generate a token sequence like this: id = id + id ; where each id refers to its
variable in the symbol table, which holds all of its details. For example, consider the program
int main()
{
// 2 variables
int a, b;
a = 10;
return 0;
}
All the valid tokens are:
'int' 'main' '(' ')' '{' 'int' 'a' ',' 'b' ';'
'a' '=' '10' ';' 'return' '0' ';' '}'
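As a rough illustration of how such a token stream can be produced, the sketch below splits a C-like statement into keyword, identifier, number, and punctuation tokens. It is only a toy scanner written for this example; the token categories, the keyword list, and the assumption that lexemes are separated by blanks are simplifications, not part of any real compiler.

#include <cctype>
#include <iostream>
#include <string>
#include <utility>
#include <vector>

// A toy scanner: classifies each blank-separated lexeme of a C-like fragment
// as a keyword, identifier, number, or punctuation token. Illustrative only.
int main() {
    std::string src = "int a , b ; a = 10 ; return 0 ;";   // lexemes separated by blanks
    std::vector<std::string> keywords = {"int", "return"};
    std::vector<std::pair<std::string, std::string>> tokens; // <token-name, lexeme>
    std::string lexeme;
    for (size_t i = 0; i <= src.size(); ++i) {
        if (i == src.size() || std::isspace(static_cast<unsigned char>(src[i]))) {
            if (!lexeme.empty()) {
                std::string kind = "punctuation";
                if (std::isalpha(static_cast<unsigned char>(lexeme[0]))) kind = "identifier";
                if (std::isdigit(static_cast<unsigned char>(lexeme[0]))) kind = "number";
                for (const auto& kw : keywords)
                    if (lexeme == kw) kind = "keyword";
                tokens.push_back({kind, lexeme});
            }
            lexeme.clear();
        } else {
            lexeme += src[i];
        }
    }
    for (const auto& t : tokens)
        std::cout << "<" << t.first << ", " << t.second << ">\n";
}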
Error recovery strategies:
Panic mode recovery: On discovering an error, the parser discards input symbols one at a time until a synchronizing
token is found. The synchronizing tokens are usually delimiters, such as a
semicolon or end. This method has the advantage of simplicity and does not go into an infinite loop.
When multiple errors in the same statement are rare, this method is quite useful.
Phrase level recovery: On discovering an error, the parser performs local correction on the remaining input that
allows it to continue. Example: insert a missing semicolon or delete an extraneous semicolon,
etc.
Error productions:
The parser is constructed using augmented grammar with error productions. If an error
production is used by the parser, appropriate error diagnostics can be generated to indicate the
erroneous constructs recognized by the input.
Global correction:
Given an incorrect input string x and grammar G, certain algorithms can be used to find
a parse tree for a string y, such that the number of insertions, deletions and changes of tokens
is as small as possible. However, these methods are in general too costly in terms of time and
space.
Context-Free Grammars:
The syntax of a programming language is described by context-free grammar (CFG).
CFG consists of a set of terminals, a set of non-terminals, a start symbol, and a set of
productions.
Notation: A → α, where A is a single non-terminal (A ∈ V) and α ∈ (V ∪ T)*.
Ambiguity
A grammar that produces more than one parse tree for some sentence is said to be
ambiguous.
Eg- consider a grammar
S -> aS | Sa | a
Now for string aaa, we will have 4 parse trees, hence ambiguous
Symbol Table :
Symbol table is an important data structure created and maintained by compilers in order to store
information about the occurrence of various entities such as variable names, function names,
objects, classes, interfaces, etc. Symbol table is used by both the analysis and the synthesis parts
of a compiler.
A symbol table may serve the following purposes depending upon the language in hand:
• To store the names of all entities in a structured form at one place.
• To verify if a variable has been declared.
• To implement type checking, by verifying assignments and expressions in the source
code are semantically correct.
• To determine the scope of a name (scope resolution).
A symbol table is simply a table which can be either linear or a hash table. It maintains an entry
for each name in a format such as <symbol name, type, attribute>.
For example, for a variable declaration like int a; the table holds the name a, its type int, and its other attributes.
Implementation
If a compiler is to handle a small amount of data, then the symbol table can be implemented as an
unordered list, which is easy to code but suitable only for small tables. A symbol table
can be implemented in one of the following ways: a linear (sorted or unsorted) list, a binary search tree, or a hash table.
Operations
A symbol table, either linear or hash, should provide the following operations.
insert()
This operation is more frequently used by the analysis phase, i.e., the first half of the compiler where
tokens are identified and names are stored in the table. This operation is used to add information
in the symbol table about unique names occurring in the source code. The format or structure in
which the names are stored depends upon the compiler in hand.
An attribute for a symbol in the source code is the information associated with that symbol. This
information contains the value, state, scope, and type about the symbol. The insert() function takes
the symbol and its attributes as arguments and stores the information in the symbol table.
For example, the declaration
int a;
can be recorded by a call such as insert(a, int).
lookup(symbol)
This method returns 0 (zero) if the symbol does not exist in the symbol table. If the symbol exists
in the symbol table, it returns its attributes stored in the table.
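A minimal sketch of these two operations, assuming a hash-table implementation keyed by the identifier's name (the fields kept in the Attributes record are illustrative, not a fixed format):

#include <iostream>
#include <string>
#include <unordered_map>

// Illustrative symbol-table record: the symbol's type and scope level.
struct Attributes {
    std::string type;
    int scopeLevel;
};

class SymbolTable {
    std::unordered_map<std::string, Attributes> table;   // hash table keyed by name
public:
    void insert(const std::string& name, const Attributes& attr) {
        table[name] = attr;                               // add or update the entry for this name
    }
    const Attributes* lookup(const std::string& name) const {
        auto it = table.find(name);
        return it == table.end() ? nullptr : &it->second; // null (0) if the symbol is absent
    }
};

int main() {
    SymbolTable st;
    st.insert("a", {"int", 0});                           // e.g., for the declaration: int a;
    if (const Attributes* a = st.lookup("a"))
        std::cout << "a : " << a->type << ", scope " << a->scopeLevel << "\n";
}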
Scope Management
A compiler maintains two types of symbol tables: a global symbol table which can be accessed by
all the procedures and scope symbol tables that are created for each scope in the program.
Token
For example, for the statement int a = 10; the tokens are int (keyword), a (identifier), = (operator), 10 (constant) and ; (punctuation –
semicolon).
Answer – Total number of tokens = 5
Lexeme
It is a sequence of characters in the source code that are matched by given predefined
language rules for every lexeme to be specified as a valid token.
Example:
main is lexeme of type identifier(token)
(,),{,} are lexemes of type punctuation(token)
Pattern
A pattern describes the form that the lexemes of a token may take. For each token type, the interpretation of the token, example lexemes, and the pattern are:
• Keyword – Token: all the reserved keywords of the language (main, printf, etc.); Lexemes: int, goto; Pattern: the exact sequence of characters that make up the keyword.
• Identifier – Token: the name of a variable, function, etc.; Lexemes: main, a; Pattern: must start with an alphabet, followed by alphabets or digits.
• Operator – Token: all the operators are considered tokens; Lexemes: +, =; Pattern: +, =.
• Literal – Token: a grammar rule or boolean literal; Lexemes: "Welcome to GeeksforGeeks!"; Pattern: any string of characters (except ") between " and ".
Types of lexical error that can occur in a lexical analyzer are as follows:
1. Exceeding length of identifier or numeric constants.
Example:
• C++
#include <iostream>
int main() {
return 0;
This is a lexical error since signed integer lies between −2,147,483,648 and
2,147,483,647
printf("Geeksforgeeks");$
return 0;
This is a lexical error since an illegal character $ appears at the end of the statement.
3. Unmatched string
Example:
• C++
#include <iostream>
int main() {
    /* comment
    cout<<"GFG!";
    return 0;
}
This is a lexical error since the ending of the comment "*/" is not present but the beginning is
present.
4. Spelling Error
A misspelled keyword or identifier in the source program, for example writing 'mian' instead of 'main', is detected during lexical analysis.
Input Buffering
To recognize lexemes, the lexical analyzer maintains two pointers, a begin pointer (bp) and a
forward pointer (fp), into the input being scanned.
Initially both the pointers point to the first character of the input string as shown
below.
The forward pointer (fp) moves ahead to search for the end of the lexeme. As soon as a blank space is
encountered, it indicates the end of the lexeme. In the above example, as soon as fp encounters
a blank space, the lexeme "int" is identified. When fp encounters white space, it ignores it and
moves ahead; then both the begin pointer (bp) and the forward pointer (fp) are set at the next token.
The input characters are thus read from secondary storage, but reading from secondary storage in
this way is costly, hence a buffering technique is used. A block of data is first read into a buffer and
then scanned by the lexical analyzer. There are two methods used in this context: the One Buffer
Scheme and the Two Buffer Scheme. These are explained as following
below.
1. One Buffer Scheme: In this scheme, only one buffer is used to store the input
string. The problem with this scheme is that if a lexeme is very long then it
crosses the buffer boundary; to scan the rest of the lexeme the buffer has to be
refilled, which overwrites the first part of the
lexeme.
2. Two Buffer Scheme: To overcome the problem of the one buffer scheme, in this
method two buffers are used to store the input string. The first buffer and second
buffer are scanned alternately; when the end of the current buffer is reached, the other
buffer is filled. The only problem with this method is that if the length of a lexeme is
longer than the length of a buffer, then the input cannot be scanned
completely. Initially both bp and fp point to the first character of the first
buffer. Then fp moves towards the right in search of the end of the lexeme. As soon as a
blank character is recognized, the string between bp and fp is identified as the
corresponding token. To identify the boundary of the first buffer, an end-of-buffer
character is placed at the end of the first buffer. Similarly, the end of the second buffer
is recognized by the end-of-buffer mark present at the end of the second buffer.
When fp encounters the first eof, one recognizes the end of the first buffer and
hence filling of the second buffer is started. In the same way, when the second eof is
reached, it indicates the end of the second buffer. Alternately, both buffers can be
filled until the end of the input program, and the stream of tokens is identified.
This eof character introduced at the end is called a sentinel and is used to
identify the end of the
buffer.
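The sketch below shows the idea of sentinel-terminated buffer pairs. The buffer size, the sentinel character, and the way the next block is "read from secondary storage" are assumptions made only for illustration.

#include <iostream>

// Two halves of one array act as the two buffers; each half ends with a
// sentinel so the scanner can detect a buffer boundary with a single test.
const int N = 8;                        // illustrative buffer size
char buf[2 * N + 2];
const char SENTINEL = '\0';             // stands in for the eof mark

// Pretend to refill one half of the buffer pair from "secondary storage".
void refill(int half, const char*& src) {
    int base = half * (N + 1);
    int i = 0;
    for (; i < N && *src; ++i, ++src) buf[base + i] = *src;
    buf[base + i] = SENTINEL;           // place the sentinel after the data
}

int main() {
    const char* input = "int a = 10 ; return 0 ;";
    refill(0, input);
    int fp = 0;                         // forward pointer
    while (true) {
        char c = buf[fp];
        if (c == SENTINEL) {
            if (fp == N && *input)         { refill(1, input); fp = N + 1; continue; } // end of buffer 1
            if (fp == 2 * N + 1 && *input) { refill(0, input); fp = 0;     continue; } // end of buffer 2
            break;                      // real end of input
        }
        std::cout << c;                 // a real scanner would build lexemes here
        ++fp;
    }
    std::cout << "\n";
}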
SPECIFICATION OF TOKENS
"string." The length of a string s, usually written |s|, is the number of occurrences of symbols
in s. For example, banana is a string of length six. The empty string, denoted ε, is the string of
length zero.
Operations on strings
The following string-related terms are commonly used:
1. A prefix of string s is any string obtained by removing zero or more symbols from the
end of string s. For example, ban is a prefix of banana.
2. A suffix of string s is any string obtained by removing zero or more symbols from the
beginning of s. For example, nana is a suffix of banana.
3. A substring of s is obtained by deleting any prefix and any suffix from s. For example,
nan is a substring of banana.
4. The proper prefixes, suffixes, and substrings of a string s are those prefixes, suffixes, and
substrings, respectively, of s that are not ε and not equal to s itself.
5. A subsequence of s is any string formed by deleting zero or more not necessarily consecutive
symbols of s. For example, baan is a subsequence of banana.
Operations on languages:
The following are the operations that can be applied to languages:
1. Union
2. Concatenation
3. Kleene closure
4. Positive closure
The following example shows the operations on languages. Let L = {0, 1} and S = {a, b, c}:
• Union: L ∪ S = {0, 1, a, b, c}
• Concatenation: LS = {0a, 0b, 0c, 1a, 1b, 1c}
• Kleene closure: L* = the set of all strings of 0s and 1s, including ε
• Positive closure: L+ = the set of all strings of 0s and 1s, excluding ε
Regular grammar
Regular Expressions
· Each regular expression r denotes a language L(r).
· Here are the rules that define the regular expressions over some alphabet Σ and the
languages that those expressions denote:
1.ε is a regular expression, and L(ε) is { ε }, that is, the language whose sole member is the
empty string.
2. If ‘a’ is a symbol in Σ, then ‘a’ is a regular expression, and L(a) = {a}, that is, the language
with one string, of length one, with ‘a’ in its one position.
3. Suppose r and s are regular expressions denoting the languages L(r) and L(s). Then,
a) (r)|(s) is a regular expression denoting the language L(r) ∪ L(s).
b) (r)(s) is a regular expression denoting the language L(r)L(s).
c) (r)* is a regular expression denoting (L(r))*.
d) (r) is a regular expression denoting L(r).
4.The unary operator * has highest precedence and is left associative.
5.Concatenation has second highest precedence and is left associative.
Regular Expressions
Regular Expressions are used to denote regular languages. An expression is regular if:
• ɸ is a regular expression for regular language ɸ.
• ɛ is a regular expression for regular language {ɛ}.
• If a ∈ Σ (Σ represents the input alphabet), a is regular expression with language
{a}.
• If a and b are regular expression, a + b is also a regular expression with
language {a,b}.
• If a and b are regular expression, ab (concatenation of a and b) is also regular.
• If a is regular expression, a* (0 or more times a) is also regular.
Note : Two regular expressions are equivalent if languages generated by them are
same. For example, (a+b*)* and (a+b)* generate same language. Every string
which is generated by (a+b*)* is also generated by (a+b)* and vice versa.
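As a small illustration of how regular expressions specify token patterns in practice, the sketch below uses the C++ std::regex library; the two patterns chosen (identifier and integer constant) are assumptions for this example only.

#include <iostream>
#include <regex>
#include <string>
#include <vector>

int main() {
    // Regular expressions for two token classes: identifiers and integer constants.
    std::regex identifier("[A-Za-z_][A-Za-z0-9_]*");   // letter (letter | digit)*
    std::regex integer("[0-9]+");                      // digit digit*

    std::vector<std::string> lexemes = {"abs_zero_Kelvin", "273", "2x", "float"};
    for (const auto& lx : lexemes) {
        if (std::regex_match(lx, integer))
            std::cout << lx << " : integer constant\n";
        else if (std::regex_match(lx, identifier))
            std::cout << lx << " : identifier\n";
        else
            std::cout << lx << " : no match\n";
    }
}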
Module II
Syntax Analysis
Syntax Analysis or Parsing is the second phase, i.e. after lexical analysis. It checks the
syntactical structure of the given input, i.e. whether the given input is in the correct syntax
(of the language in which the input has been written) or not. It does so by building a data
structure, called a Parse tree or Syntax tree. The parse tree is constructed by using the
pre-defined Grammar of the language and the input string. If the given input string can be
produced with the help of the syntax tree (in the derivation process), the input string is
found to be in the correct syntax. If not, the error is reported by the syntax analyzer.
Syntax analysis, also known as parsing, is a process in compiler design where the
compiler checks if the source code follows the grammatical rules of the programming
language. This is typically the second stage of the compilation process, following lexical
analysis.
The main goal of syntax analysis is to create a parse tree or abstract syntax tree (AST) of
the source code, which is a hierarchical representation of the source code that reflects
the grammatical structure of the program.
There are several types of parsing algorithms used in syntax analysis, including:
• LL parsing: This is a top-down parsing algorithm that starts with the root of the
parse tree and constructs the tree by successively expanding non-terminals. LL
parsing is known for its simplicity and ease of implementation.
• LR parsing: This is a bottom-up parsing algorithm that starts with the leaves of
the parse tree and constructs the tree by successively reducing handles (substrings that
match production bodies) to non-terminals. LR parsing is more powerful than LL parsing
and can handle a larger class of grammars.
• LR(1) parsing: This is a variant of LR parsing that uses lookahead to
disambiguate the grammar.
• LALR parsing: This is a variant of LR parsing that uses a reduced set of
lookahead symbols to reduce the number of states in the LR parser.
• Once the parse tree is constructed, the compiler can perform semantic analysis
to check if the source code makes sense and follows the semantics of the
programming language.
• The parse tree or AST can also be used in the code generation phase of the
compiler design to generate intermediate code or machine code.
The pushdown automata (PDA) is used to design the syntax analysis phase.
The Grammar for a Language consists of Production rules.
Example: Suppose Production rules for the Grammar of a language are:
S -> cAd
A -> bc|a
And the input string is “cad”.
Now the parser attempts to construct a syntax tree from this grammar for the given input
string. It uses the given production rules and applies those as needed to generate the
string. To generate string “cad” it uses the rules as shown in the given
diagram:
In step (iii) above, the production rule A->bc was not a suitable one to apply (because the
string produced is “cbcd” not “cad”), here the parser needs to backtrack, and apply the next
production rule available with A which is shown in step (iv), and the string “cad” is produced.
Thus, the given input can be produced by the given grammar, therefore the input is correct
in syntax. But backtrack was needed to get the correct syntax tree, which is really a
complex process to implement.
There is an easier way to avoid this backtracking, based on the FIRST and FOLLOW sets
discussed later.
This way, the parser accomplishes two tasks, i.e., parsing the code, looking for errors and
generating a parse tree as the output of the phase.
Parsers are expected to parse the whole code even if some errors exist in the program. Parsers use
error recovering strategies, which we will learn later in this chapter.
Grammar
A grammar is a set of structural rules which describe a language. Grammars assign structure
to any sentence. The term also refers to the study of these rules, and this field includes
morphology, phonology, and syntax. A grammar is capable of describing much of the syntax
of programming languages.
• A non-terminal symbol should appear on the left side of at least one production
• The goal (start) symbol should never appear on the right side of the ::= of any production
• A rule is recursive if its LHS appears in its RHS
Context Free Grammar
A context-free grammar (CFG) is a grammar in which every production has a single non-terminal
on its left-hand side. The rules in a context-free grammar are mainly recursive. A syntax analyser
checks whether a specific program satisfies all the rules of the context-free grammar or not. If it
does meet these rules, the syntax analyser may create a parse tree for that programme.
expression -> expression + term
expression -> expression – term
expression -> term
term -> term * factor
term -> term / factor
term -> factor
factor -> ( expression )
factor -> id
Grammar Derivation
Grammar derivation is a sequence of grammar rule which transforms the start symbol into
the string. A derivation proves that the string belongs to the grammar’s language.
Left-most Derivation
When the sentential form of input is scanned and replaced in left to right sequence, it is
known as left-most derivation. The sentential form which is derived by the left-most
derivation is called the left-sentential form.
Right-most Derivation
In a right-most derivation, the input is scanned and replaced with production rules from right to
left. The sentential form which is derived from the
right-most derivation is known as the right-sentential form.
Parsing
The parser is that phase of the compiler which takes a token string as input and with the
help of existing grammar, converts it into the corresponding Intermediate
Representation(IR). The parser is also known as Syntax Analyzer.
Classification of Parser
Types of Parser:
The parser is mainly classified into two categories, i.e. Top-down Parser, and Bottom-up
Parser. These are explained below:
Top-Down Parser:
1. The top-down parser is the parser that generates parse for the given input
string with the help of grammar productions by expanding the non-terminals i.e.
it starts from the start symbol and ends on the terminals. It uses left most
derivation.
Further, the top-down parser is classified into two types:
1. Recursive descent parser
2. Predictive parser, also called non-recursive, LL(1), or table-driven parser
Recursive Descent Parsing –
1. Whenever a non-terminal is expanded for the first time, go with its first alternative
and compare it with the given input string.
2. If a match does not occur, go with the second alternative and compare it with
the given input string.
3. If a match is not found again, go with the next alternative, and so on.
4. If matching succeeds for at least one alternative, the input string is
parsed successfully (a small sketch follows below).
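A minimal recursive-descent sketch with backtracking for the earlier example grammar S → cAd, A → bc | a; the function and variable names are illustrative.

#include <iostream>
#include <string>

std::string input;   // the string being parsed
size_t pos = 0;      // current position (the look-ahead)

// Try to match one terminal character; advance on success.
bool term(char c) {
    if (pos < input.size() && input[pos] == c) { ++pos; return true; }
    return false;
}

// A -> bc | a : try the first alternative, backtrack and try the next on failure.
bool A() {
    size_t save = pos;
    if (term('b') && term('c')) return true;   // alternative 1: bc
    pos = save;                                // backtrack
    return term('a');                          // alternative 2: a
}

// S -> cAd
bool S() {
    size_t save = pos;
    if (term('c') && A() && term('d')) return true;
    pos = save;
    return false;
}

int main() {
    input = "cad";
    bool ok = S() && pos == input.size();      // accept only if all input is consumed
    std::cout << (ok ? "accepted" : "rejected") << "\n";
}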
LL(1) or Table Driven or Predictive Parser –
1. In LL(1), the first L stands for Left to Right and the second L stands for Left-most
Derivation. 1 stands for the number of look-ahead tokens used by the parser while
parsing a sentence.
2. LL(1) parsing is constructed from the grammar which is free from left recursion,
common prefix, and ambiguity.
3. LL(1) parser depends on 1 look ahead symbol to predict the production to
expand the parse tree.
4. This parser is Non-Recursive.
Bottom-up Parser:
A bottom-up parser generates the parse tree for the given input string
with the help of grammar productions by reducing the string step by step, i.e. it starts from
the terminals (the leaves) and ends at the start symbol. It uses the reverse of the rightmost
derivation.
Further Bottom-up parser is classified into two types: LR parser, and Operator
precedence parser.
• LR parser is the bottom-up parser that generates the parse tree for the given
string by using unambiguous grammar. It follows the reverse of the rightmost
derivation.
LR parser is of four types:
(a)LR(0)
(b)SLR(1)
(c)LALR(1)
(d)CLR(1)
Ambiguity
A grammar is said to be ambiguous if there exists more than one leftmost derivation, more than
one rightmost derivation, or more than one parse tree for a given input string. If the grammar is
not ambiguous then it is called unambiguous.
Example:
1. S → aSb | SS
2. S → ε
For the string aabb, the above grammar generates two parse trees:
If the grammar has ambiguity, then it is not good for compiler construction. No method
can automatically detect and remove the ambiguity, but ambiguity can be removed by rewriting
the whole grammar without ambiguity.
Back-tracking
Top- down parsers start from the root node (start symbol) and match the input string against the
production rules to replace them (if matched). To understand this, take the following example of
CFG:
S → rXd | rZd
X → oa | ea
Z → ai
For an input string: read, a top-down parser, will behave like this:
It will start with S from the production rules and will match its yield to the left-most letter of the
input, i.e. ‘r’. The very production of S (S → rXd) matches with it. So the top-down parser advances
to the next input letter (i.e. ‘e’). The parser tries to expand non-terminal ‘X’ and checks its
production from the left (X → oa). It does not match with the next input symbol. So the top-down
parser backtracks to obtain the next production rule of X, (X → ea).
Now the parser matches all the input letters in an ordered manner. The string is accepted.
Predictive Parser
Predictive parser is a recursive descent parser, which has the capability to predict which production
is to be used to replace the input string. The predictive parser does not suffer from backtracking.
To accomplish its tasks, the predictive parser uses a look-ahead pointer, which points to the next
input symbols. To make the parser back-tracking free, the predictive parser puts some constraints
on the grammar and accepts only a class of grammar known as LL(k) grammar.
Predictive parsing uses a stack and a parsing table to parse the input and generate a parse tree.
Both the stack and the input contain an end symbol $ to denote that the stack is empty and the
input is consumed. The parser refers to the parsing table to take any decision on the input and
stack element combination.
In recursive descent parsing, the parser may have more than one production to choose from for a
single instance of input, whereas in predictive parser, each step has at most one production to
choose. There might be instances where there is no production matching the input string, making
the parsing procedure to fail.
LL Parser
An LL Parser accepts LL grammar. LL grammar is a subset of context-free grammar but with some
restrictions to get the simplified version, in order to achieve easy implementation. LL grammar can
be implemented by means of both algorithms namely, recursive-descent or table-driven.
The LL parser is denoted as LL(k). The first L in LL(k) stands for parsing the input from left to right, the second
L in LL(k) stands for left-most derivation, and k itself represents the number of look-aheads.
Generally k = 1, so LL(k) may also be written as LL(1).
LL(1) Parsing: Here the 1st L represents that the scanning of the Input will be done from
Left to Right manner and the second L shows that in this parsing technique we are going
to use Left most Derivation Tree. And finally, the 1 represents the number of look-ahead,
which means how many symbols are you going to see when you want to make a
decision.
Essential conditions to check first are as follows:
1. The grammar is free from left recursion.
2. The grammar should not be ambiguous.
3. The grammar has to be left factored so that the grammar becomes deterministic.
These conditions are necessary but not sufficient for a grammar to be LL(1).
Algorithm to construct LL(1) Parsing Table:
Step 1: First check all the essential conditions mentioned above and go to step 2.
Step 2: Calculate First() and Follow() for all non-terminals.
1. First(): If there is a variable, and from that variable we try to derive all the
strings, then the beginning terminal symbols are called the First.
2. Follow(): What is the Terminal Symbol which follows a variable in the process of
derivation.
Step 3: For each production A –> α. (A tends to alpha)
1. Find First(α) and for each terminal in First(α), make entry A –> α in the table.
2. If First(α) contains ε (epsilon) as terminal, then find the Follow(A) and for each
terminal in Follow(A), make entry A –> ε in the table.
3. If the First(α) contains ε and Follow(A) contains $ as terminal, then make entry A
–> ε in the table for the $.
To construct the parsing table, we have two functions: First() and Follow().
In the table, rows will contain the Non-Terminals and the column will contain the Terminal
Symbols. All the Null Productions of the Grammars will go under the Follow elements
and the remaining productions will lie under the elements of the First set.
Now, let’s understand with an example.
Example-1: Consider the Grammar:
E --> TE'
E' --> +TE' | ε
T --> FT'
T' --> *FT' | ε
F --> id | (E)
*ε denotes epsilon
Step1 – The grammar satisfies all properties in step 1
Step 2 – calculating first() and follow()
Find their First and Follow sets:
                        First           Follow
E  –> TE'               { id, ( }       { $, ) }
E' –> +TE'/ε            { +, ε }        { $, ) }
T  –> FT'               { id, ( }       { +, $, ) }
T' –> *FT'/ε            { *, ε }        { +, $, ) }
F  –> id/(E)            { id, ( }       { +, *, $, ) }
The LL(1) parsing table has one row for each non-terminal and one column for each of the terminals id, +, *, (, ), and $.
As you can see, all the null (ε) productions are placed under the Follow set of that symbol,
and all the remaining productions are placed under the First of that symbol.
Note: Every grammar is not feasible for LL(1) Parsing table. It may be possible that one
cell may contain more than one production.
Let’s see an example.
Bottom up Parsing
Bottom-up parsing starts from the leaf nodes of a tree and works in upward direction till it reaches the
root node.
Operator precedence parsing is a kind of shift-reduce parsing method. It is applied to a small class
of operator grammars.
Operator precedence relations can only be established between the terminals of the grammar; non-terminals
are ignored.
a ⋖ b means that terminal "a" has lower precedence than terminal "b".
a ≐ b means that terminals "a" and "b" have the same precedence.
a ⋗ b means that terminal "a" has higher precedence than terminal "b".
Parsing Action
1. E → E+T/T
2. T → T*F/F
3. F → id
Given string:
1. w = id + id * id
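One possible sequence of shift-reduce actions for this string under the grammar above (reducing a handle as soon as it appears on top of the stack, but shifting * before reducing E → E + T so that * binds more tightly) is:

Stack            Input            Action
$                id + id * id $   shift
$ id             + id * id $      reduce F → id
$ F              + id * id $      reduce T → F
$ T              + id * id $      reduce E → T
$ E              + id * id $      shift
$ E +            id * id $        shift
$ E + id         * id $           reduce F → id
$ E + F          * id $           reduce T → F
$ E + T          * id $           shift
$ E + T *        id $             shift
$ E + T * id     $                reduce F → id
$ E + T * F      $                reduce T → T * F
$ E + T          $                reduce E → E + T
$ E              $                accept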
LR parsing is divided into four parts: LR (0) parsing, SLR parsing, CLR parsing and LALR parsing.
LR (1) Parsing
Various steps involved in the LR (1) Parsing:
Augment Grammar
Augmented grammar G` will be generated if we add one more production in the given grammar G.
It helps the parser to identify when to stop the parsing and announce the acceptance of the input.
Example
Given grammar
1. S → AA
2. A → aA | b
The augmented grammar G` is:
1. S` → S
2. S → AA
3. A → aA | b
An LR (0) item is useful to indicate how much of the input has been scanned up to a given point
in the process of parsing.
Example
Given grammar:
1. S → AA
2. A → aA | b
Add Augment Production and insert '•' symbol at the first position for every production in G
1. S` → •S
2. S → •AA
3. A → •aA
4. A → •b
I0 State:
Add Augment production to the I0 State and Compute the Closure
I0= S`→•S
S → •AA
Add all productions starting with "A" in modified I0 State because "•" is followed by the non-
terminal. So, the I0 State becomes.
I0= S`→•S
S→•AA
A→•aA
A → •b
I1= Go to (I0, S) = S` → S•
I2= Go to (I0, A) = closure(S → A•A)
Add all productions starting with A into the I2 State because "•" is followed by the non-
terminal. So, the I2 State becomes
I2=S→A•A
A→•aA
A → •b
I3= Go to (I0, a) = A→a•A
A→•aA
A → •b
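A sketch of the closure computation used to build these item sets, for the augmented grammar S` → S, S → AA, A → aA | b. The representation of an item as a production number plus a dot position is an assumption made for illustration.

#include <iostream>
#include <set>
#include <string>
#include <vector>

// A production "A -> X Y" is stored as a head plus a body of grammar symbols.
struct Production { std::string head; std::vector<std::string> body; };

// An LR(0) item is a production number plus the position of the dot.
struct Item {
    int prod;
    int dot;
    bool operator<(const Item& o) const { return prod < o.prod || (prod == o.prod && dot < o.dot); }
};

std::vector<Production> G = {
    {"S'", {"S"}},                 // augmented production
    {"S",  {"A", "A"}},
    {"A",  {"a", "A"}},
    {"A",  {"b"}},
};

bool isNonTerminal(const std::string& s) { return s == "S'" || s == "S" || s == "A"; }

// closure(I): while the dot stands before a non-terminal B, add B -> .gamma for
// every production of B, until nothing new can be added.
std::set<Item> closure(std::set<Item> I) {
    bool changed = true;
    while (changed) {
        changed = false;
        for (const Item& it : std::set<Item>(I)) {        // iterate over a snapshot
            if (it.dot < (int)G[it.prod].body.size()) {
                const std::string& B = G[it.prod].body[it.dot];
                if (isNonTerminal(B))
                    for (int p = 0; p < (int)G.size(); ++p)
                        if (G[p].head == B && I.insert({p, 0}).second) changed = true;
            }
        }
    }
    return I;
}

int main() {
    std::set<Item> I0 = closure({{0, 0}});                // closure of { S' -> .S }
    for (const Item& it : I0) {
        std::cout << G[it.prod].head << " ->";
        for (int i = 0; i <= (int)G[it.prod].body.size(); ++i) {
            if (i == it.dot) std::cout << " .";
            if (i < (int)G[it.prod].body.size()) std::cout << " " << G[it.prod].body[i];
        }
        std::cout << "\n";
    }
}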
In SLR (1) parsing, we place the reduce move only in the FOLLOW of the left-hand side of the production.
Example
1. S -> •Aa
2. A->αβ•
1. Follow(S) = {$}
2. Follow (A) = {a}
SLR ( 1 ) Grammar
S→E
E→E+T|T
T→T*F|F
F → id
Add Augment Production and insert '•' symbol at the first position for every production in G
S`→•E
E→•E+T
E→•T
T→•T*F
T→•F
F → •id
I0 State:
Add all productions starting with E in to I0 State because "." is followed by the non-terminal. So,
the I0 State becomes
I0= S`→•E
E→•E+T
E → •T
Add all productions starting with T and F in modified I0 State because "." is followed by the non-
terminal. So, the I0 State becomes.
I0= S`→•E
E→•E+T
E→•T
T→•T*F
T→•F
F → •id
Add all productions starting with T and F in I5 State because "." is followed by the non-terminal.
So, the I5 State becomes
I5= E→E+•T
T→•T*F
T→•F
F → •id
Add all productions starting with F in I6 State because "." is followed by the non-terminal. So, the
I6 State becomes
I6= T→T*•F
F → •id
In CLR (1) parsing, we place the reduce move only under the lookahead symbols.
The lookahead is used to determine where to place the reduce move for the final item.
The lookahead added for the augmented production is always the $ symbol.
Example
CLR ( 1 ) Grammar
1. S → AA
2. A → aA
3. A → b
Add Augment Production, insert '•' symbol at the first position for every production in G and also
add the lookahead.
1. S` → •S, $
2. S → •AA, $
3. A → •aA, a/b
4. A → •b, a/b
I0 State:
Add all productions starting with S in to I0 State because "." is followed by the non-terminal. So,
the I0 State becomes
I0= S`→•S,$
S → •AA, $
Add all productions starting with A in modified I0 State because "." is followed by the non-
terminal. So, the I0 State becomes.
I0= S`→•S,$
S→•AA,$
A→•aA,a/b
A → •b, a/b
Add all productions starting with A in I2 State because "." is followed by the non-terminal. So, the
I2 State becomes
I2= S→A•A,$
A→•aA,$
A → •b, $
Add all productions starting with A in I3 State because "." is followed by the non-terminal. So, the
I3 State becomes
I3= A→a•A,a/b
A→•aA,a/b
A → •b, a/b
Add all productions starting with A in I6 State because "." is followed by the non-terminal. So, the
I6 State becomes
I6= A→a•A,$
A→•aA,$
A → •b, $
In the LALR (1) parsing, the LR (1) items which have same productions but different look ahead are
combined to form a single set of items
LALR (1) parsing is the same as CLR (1) parsing; the only difference is in the parsing table.
Example
LALR ( 1 ) Grammar
1. S → AA
2. A → aA
3. A → b
Add Augment Production, insert '•' symbol at the first position for every production in G and also
add the look ahead.
1. S` → •S, $
2. S → •AA, $
3. A → •aA, a/b
4. A → •b, a/b
I0 State:
Add all productions starting with S in to I0 State because "•" is followed by the non-terminal. So,
the I0 State becomes
I0= S`→•S,$
S → •AA, $
Add all productions starting with A in modified I0 State because "•" is followed by the non-
terminal. So, the I0 State becomes.
I0= S`→•S,$
S→•AA,$
A → •aA, a/b
A → •b, a/b
I1= Go to (I0, S) = closure (S` → S•, $) = S` → S•, $
I2= Go to (I0, A) = closure ( S → A•A, $ )
Add all productions starting with A in I2 State because "•" is followed by the non-terminal. So, the
I2 State becomes
I2= S → A•A, $
A → •aA, $
A → •b, $
Add all productions starting with A in I3 State because "•" is followed by the non-terminal. So, the
I3 State becomes
I3= A → a•A, a/b
A → •aA, a/b
A → •b, a/b
Add all productions starting with A in I6 State because "•" is followed by the non-terminal. So, the
I6 State becomes
I6 = A → a•A, $
A → •aA, $
A → •b, $
If we analyze, the LR (0) items of I3 and I6 are the same, but they differ only in their lookahead.
I3 = { A → a•A, a/b
A → •aA, a/b
A → •b, a/b
}
I6= { A → a•A, $
A → •aA, $
A → •b, $
}
Clearly I3 and I6 are the same in their LR (0) items but differ in their lookahead, so we can combine
them and call the result I36.
I4 and I7 are the same but differ only in their lookahead, so we can combine them and call the result
I47.
I8 and I9 are the same but differ only in their lookahead, so we can combine them and call the result
I89.
For a synthesized attribute, the production must have the non-terminal as its head; for an
inherited attribute, the production must have the non-terminal as a symbol in its body.
Dependency Graph
A dependency graph is used to represent the flow of information among the attributes in a parse
tree. In a parse tree, a dependency graph basically helps to determine the evaluation order for the
attributes. The main aim of the dependency graphs is to help the compiler to check for various types
of dependencies between statements in order to prevent them from being executed in the incorrect
sequence, i.e. in a way that affects the program’s meaning. This is the main aspect that helps in
identifying the program’s numerous parallelizable components.
It assists us in determining the impact of a change and the objects that are affected by it. Drawing
edges to connect dependent actions can be used to create a dependency graph. These arcs result
in partial ordering among operations and also result in preventing a program from running in parallel.
Although use-definition chaining is a type of dependency analysis, it results in unduly cautious data
dependence estimations. On a shared control path, there may be four types of dependencies between
statements i and j.
Dependency graphs, like other directed networks, have nodes or vertices depicted as boxes or
circles with names, as well as arrows linking them in their obligatory traversal direction. Dependency
graphs are commonly used in scientific literature to describe semantic links, temporal and causal
dependencies between events, and the flow of electric current in electronic circuits. Drawing
dependency graphs is so common in computer science that we’ll want to employ tools that automate
the process based on some basic textual instructions from us.
Types of dependencies:
Dependencies are broadly classified into the following categories:
1. Data Dependencies:
When a statement computes data that is later utilized by another statement. A data dependence is a state in which an
instruction must wait for a result from a preceding instruction before it can complete its execution. A
data dependence will trigger a stoppage in the flowing services of a processor pipeline or block the
parallel issuing of instructions in a superscalar processor in high-performance processors using
pipeline or superscalar approaches.
2. Control Dependencies:
Control Dependencies are those that come from a program’s well-ordered control flow. A scenario
in which a program instruction executes if the previous instruction evaluates in a fashion that permits
it to execute is known as control dependence.
3. Flow Dependency:
In computer science, a flow dependence occurs when a program statement refers to the data of a
previous statement.
4. Antidependence:
When an instruction needs a value that is later modified, this is known as anti-dependency, or write-
after-read (WAR). The order of such instructions cannot be interchanged, nor can they be performed in
parallel (potentially changing the instruction ordering), because this would change the value read by the
earlier instruction.
5. Output-Dependency:
An output dependence, also known as write-after-write (WAW), happens when the sequence in
which instructions are executed has an impact on a variable's final output value. Two instructions that
write the same variable are output dependent; altering their order would affect the final value of that
variable, hence these instructions cannot be run in parallel.
6. Control-Dependency:
If the outcome of A determines whether B should be performed or not, an instruction B has a control
dependence on a previous instruction A. For example, if S1 is a conditional test and S2 executes only
when that test succeeds, then S2 has a control dependence on S1; an instruction S3 that is always
executed regardless of the result of S1 has no control dependence on S1.
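The different kinds of dependence can be seen in a small fragment like the one below; the statements and variable names are purely illustrative.

#include <iostream>

int main() {
    int a = 0, b = 0, c = 0;

    a = b + 1;      // S1
    c = a * 2;      // S2: flow (read-after-write) dependence on S1 -- c uses the a written by S1
    b = 7;          // S3: anti (write-after-read) dependence on S1 -- S1 read b before S3 writes it
    a = c - 1;      // S4: output (write-after-write) dependence on S1 -- both write a,
                    //     so reordering S1 and S4 would change the final value of a

    if (a > 0)      // S5
        std::cout << c << "\n";   // S6: control dependence on S5 -- it runs only if S5's test is true
}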
Example of Dependency Graph:
Design the dependency graph for the following production with semantic rule E.val = E1.val * E2.val:
E -> E1 * E2
Here E.val depends on E1.val and E2.val, so the dependency graph has an edge from E1.val to E.val and an edge from E2.val to E.val.
Evaluation Order
Evaluation order for SDD includes how the SDD(Syntax Directed Definition) is evaluated with the
help of attributes, dependency graphs, semantic rules, and S and L attributed definitions. SDD helps
in the semantic analysis in the compiler so it’s important to know about how SDDs are evaluated
and their evaluation order. This section provides detailed information about SDD evaluation. It
requires some basic knowledge of grammars, productions, parse trees, annotated parse trees, and
synthesized and inherited attributes.
Terminologies:
• Parse Tree: A parse tree is a tree that represents the syntax of the production hierarchically.
• Annotated Parse Tree: Annotated Parse tree contains the values and attributes at each node.
• Synthesized Attributes: When the evaluation of a node's attribute is based on the attributes of its children.
• Inherited Attributes: When the evaluation of a node's attribute is based on the attributes of its parent or
siblings.
Dependency Graphs:
A dependency graph provides information about the order of evaluation of attributes with the help of
edges. It is used to determine the order of evaluation of attributes according to the semantic rules
of the production. An edge from the first node attribute to the second node attribute gives the
information that first node attribute evaluation is required for the evaluation of the second node
attribute. Edges represent the semantic rules of the corresponding production.
Dependency Graph Rules: A node in the dependency graph corresponds to a node of the parse
tree for each attribute. An edge (from the first node to the second node) of the dependency graph represents
that the attribute of the first node is evaluated before the attribute of the second node.
Production Table
3. A1 ⇢ B        A1.syn = B.syn
Node – Attribute
1    digit.lexval
2    digit.lexval
3    digit.lexval
4    B.syn
5    B.syn
6    B.syn
7    A1.syn
8    A.syn
9    A1.inh
10   S.val
Table-2: dependency-graph edges (from node, to node, semantic rule)
1 → 4    B.syn = digit.lexval
2 → 5    B.syn = digit.lexval
3 → 6    B.syn = digit.lexval
4 → 7    A1.syn = B.syn
8 → 9    A1.inh = A.syn
S-Attributed Definitions:
S-attributed SDD can have only synthesized attributes. In this type of definitions semantic rules are
placed at the end of the production only. Its evaluation is based on bottom up parsing.
Example: S ⇢ AB { S.x = f(A.x | B.x) }
L-Attributed Definitions:
L-attributed SDD can have both synthesized and inherited (restricted inherited as attributes can only
be taken from the parent or left siblings). In this type of definition, semantics rules can be placed
anywhere in the RHS of the production. Its evaluation is based on inorder (topological sorting).
Example: S ⇢ AB {A.x = S.x + 2} or S ⇢ AB { B.x = f(A.x | B.x) } or S ⇢ AB { S.x = f(A.x | B.x) }
Note:
• Every S-attributed grammar is also L-attributed.
• For L-attributed definitions, an in-order (depth-first, left-to-right) traversal of the annotated parse tree is used.
• For S-attributed definitions, the reverse of the rightmost derivation (a bottom-up order) is used.
Semantic Rules with controlled side effects:
Side effects are the program fragments contained within semantic rules. These side effects in
SDD can be controlled in two ways: permit incidental side effects, or constrain the admissible
evaluation orders to have the same translation as any admissible order.
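As a small illustration of S-attributed evaluation, the sketch below computes a synthesized val attribute by a single bottom-up (post-order) pass over an expression tree. The node structure and the toy grammar E → E + E | digit are assumptions made for the example.

#include <iostream>
#include <memory>

// Parse-tree node for the toy grammar E -> E + E | digit.
struct Node {
    char op = 'd';                 // '+' for an interior node, 'd' for a digit leaf
    int lexval = 0;                // digit.lexval for leaves
    std::unique_ptr<Node> left, right;
};

// Synthesized attribute: E.val is computed only from the children's values,
// so a single post-order (bottom-up) pass suffices -- this is S-attributed evaluation.
int val(const Node* n) {
    if (n->op == 'd') return n->lexval;                    // E.val = digit.lexval
    return val(n->left.get()) + val(n->right.get());       // E.val = E1.val + E2.val
}

int main() {
    // Annotated parse tree for 3 + 4
    auto root = std::make_unique<Node>();
    root->op = '+';
    root->left = std::make_unique<Node>();
    root->left->lexval = 3;
    root->right = std::make_unique<Node>();
    root->right->lexval = 4;
    std::cout << "E.val = " << val(root.get()) << "\n";    // prints 7
}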
Type checking
Type checking is the process of verifying and enforcing constraints of types in values. A compiler
must check that the source program should follow the syntactic and semantic conventions of the
source language and it should also check the type rules of the language. It allows the programmer
to limit what types may be used in certain circumstances and assigns types to values. The type-
checker determines whether these values are used appropriately or not.
It checks the type of objects and reports a type error in the case of a violation, and incorrect types
are corrected. Whatever the compiler we use, while it is compiling the program, it has to follow the
type rules of the language. Every language has its own set of type rules for the language. We know
that the information about data types is maintained and computed by the compiler.
The information about data types like INTEGER, FLOAT, CHARACTER, and all the other data types
is maintained and computed by the compiler. The compiler contains modules, where the type
checker is a module of a compiler and its task is type checking.
Conversion
Conversion from one type to another type is known as implicit if it is to be done automatically by the
compiler. Implicit type conversions are also called Coercion and coercion is limited in many
languages.
Example: An integer may be converted to a real but real is not converted to an integer.
Conversion is said to be Explicit if the programmer writes something to do the Conversion.
Tasks:
• has to check that indexing is applied only to an array
• has to check the range of the data types used
• INTEGER (int) has a range of −32,768 to +32,767
• FLOAT has a range of 1.2E−38 to 3.4E+38.
Languages like Pascal and C have static type checking. Type checking is used to check the
correctness of the program before its execution. The main purpose of type-checking is to check the
correctness and data type assignments and type-casting of the data types, whether it is syntactically
correct or not before their execution.
Static Type-Checking is also used to determine the amount of memory needed to store the variable.
Overloading:
An Overloading symbol is one that has different operations depending on its context.
Overloading is of two types:
• Operator Overloading
• Function Overloading
Operator Overloading: In mathematics, in the arithmetic expression "x+y" the addition
operator '+' is overloaded because '+' in "x+y" denotes different operations when 'x' and 'y' are
integers, complex numbers, reals, or matrices.
Example: In Ada, the parentheses ‘()’ are overloaded, the ith element of the expression A(i) of an
Array A has a different meaning such as a ‘call to function ‘A’ with argument ‘i’ or an explicit
conversion of expression i to type ‘A’. In most languages the arithmetic operators are overloaded.
Function Overloading: The Type Checker resolves the Function Overloading based on types of
arguments and Numbers.
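For instance, in C++ the type checker resolves an overloaded function name from the number and types of its arguments, roughly as in the sketch below (the function name add is illustrative).

#include <iostream>
#include <string>

// Two functions share the name "add"; the type checker picks one
// based on the argument types (overload resolution).
int add(int x, int y)                         { return x + y; }
std::string add(std::string x, std::string y) { return x + y; }

int main() {
    std::cout << add(2, 3) << "\n";                                   // resolves to add(int, int)
    std::cout << add(std::string("ab"), std::string("cd")) << "\n";   // resolves to add(string, string)
}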
Type Systems
• A type system is a collection of rules that assign types to program constructs (more constraints
added to checking the validity of the programs, violation of such constraints indicate errors).
• A language's type system specifies which operations are valid for which types.
Type checks: The compiler checks that names and values are used in accordance with
type rules of the language.
Indexing checks: The compiler checks that indexing is applied only to an array.
Function call checks: The compiler checks that a function (or procedure) is applied to
the correct number and type of arguments.
Flow-of-control checks: The compiler checks that if a statement causes the flow of
control to leave a construct, then there is a place to transfer this flow. For instance when
using break in C
Type Expressions
• A type expression is either a basic type or is formed by applying an operator called a type constructor
to a type expression. The sets of basic types and constructors depend on the language to be checked.
• A basic type is a type expression. Typical basic types for a language include boolean, char, integer, float, and void.
o Arrays : If T is a type expression, then array(I, T) is a type expression denoting the type of an
array with elements of type T and index set I. I is often a range of integers. Ex. int a[25] ;
o Products : If T1 and T2 are type expressions, then their Cartesian product T1 x T2 is a type
expression. x associates to the left and that it has higher precedence. Products are introduced
for completeness; they can be used to represent a list or tuple of types (e.g., for function
parameters).
o Records : A record is a data structure with named fields. A type expression can be formed by
applying the record type constructor to the field names and their types.
o Pointers : If T is a type expression, then pointer (T) is a type expression denoting the type
"pointer to an object of type T".
o Functions : Mathematically, a function maps elements of a domain set D to a range set R, F : D -> R.
A type expression can be formed by using the type constructor -> for function types. We write s -> t
for "function from type s to type t".
• Type expressions may contain variables whose values are themselves type expressions.
Example
• The array type int [2][3] can be written as the type expression array(2, array(3, integer)). This type can be
represented by a tree in which the operator array takes two parameters, a number and a type.
Structural Equivalence
• Type expressions are built from basic types and constructors, a natural concept of equivalence
between two type expressions is structural equivalence. i.e., two expressions are either the same
basic type or formed by applying the same constructor to structurally equivalent types. (If type names
are not expanded, two type expressions are structurally equivalent only when they are identical.)
• For example, the type expression integer is equivalent only to integer because they are the same
basic type.
• Similarly, pointer (integer) is equivalent only to pointer (integer) because the two are formed by
applying the same constructor, pointer, to equivalent types.
• The algorithm recursively compares the structure of type expressions without checking for cycles, so
it can be applied to a tree representation. It assumes that the only type constructors are for arrays,
products, pointers, and functions.
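A sketch of that recursive comparison, assuming type expressions built from a basic type and the array, pointer, and product constructors; the representation chosen here is only one possibility.

#include <iostream>
#include <memory>
#include <string>

// A type expression: either a basic type or a constructor applied to type expressions.
struct TypeExpr {
    std::string ctor;                 // "basic", "array", "pointer", "product"
    std::string basicName;            // for ctor == "basic", e.g. "integer"
    int size = 0;                     // for ctor == "array"
    std::shared_ptr<TypeExpr> t1, t2; // component types
};

// Structural equivalence: same basic type, or same constructor applied
// to structurally equivalent component types. (No cycle detection here.)
bool sequiv(const TypeExpr* a, const TypeExpr* b) {
    if (a->ctor != b->ctor) return false;
    if (a->ctor == "basic")   return a->basicName == b->basicName;
    if (a->ctor == "array")   return a->size == b->size && sequiv(a->t1.get(), b->t1.get());
    if (a->ctor == "pointer") return sequiv(a->t1.get(), b->t1.get());
    if (a->ctor == "product") return sequiv(a->t1.get(), b->t1.get()) && sequiv(a->t2.get(), b->t2.get());
    return false;
}

int main() {
    auto integer = std::make_shared<TypeExpr>(); integer->ctor = "basic"; integer->basicName = "integer";
    auto p1 = std::make_shared<TypeExpr>(); p1->ctor = "pointer"; p1->t1 = integer;
    auto p2 = std::make_shared<TypeExpr>(); p2->ctor = "pointer"; p2->t1 = integer;
    std::cout << (sequiv(p1.get(), p2.get()) ? "equivalent" : "not equivalent") << "\n";
}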
Names Equivalence
• In some languages, types can be given names (Data type name). For example, in the Pascal program
fragment.
Type Checking
• The identifier link is declared to be a name for the type ↑cell. The variables next, last, p, q, and r are not
all declared using the same type name, so the question is whether their types are equivalent. Using a type graph:
o Every time a type constructor or basic type is seen, a new node is created.
o Two type expressions are equivalent if they are represented by the same node in the type
graph.
Names Equivalence
• The identifier link is declared to be a name for the type ↑cell. New type names np and nqr have been
introduced.
• Since next and last are declared with the same type name, they are treated as having equivalent
types. Similarly, q and r are treated as having equivalent types because the same implicit type name
nqr is used for both.
• However, p, q, and next do not have equivalent types, since they all have types with different
names.
Pascal Program-Fragment
Note that the type name cell has three parents, all labeled pointer. An equal sign appears between
the type name link and the node in the type graph to which it refers.
Type Conversion
• Converting type casts
– No code needed for structural equivalence
– Run-time semantic error for intersecting values
– Possible conversion of low-level representations, e.g., float to integer
• Non-converting type casts
– E.g., an array of characters reinterpreted as pointers or integers, bit manipulation of floats
• Run Time Environment establishes relationships between names and data objects.
• The allocation and de-allocation of data objects are managed by the Run Time Environment
• If the procedure is recursive, several of its activations may be alive at the same time. Each call of a
procedure leads to an activation that may manipulate data objects allocated for its use.
• Elementary data types, such as characters, integers, and reals, can be represented by equivalent data
objects at the machine level.
• However, aggregates, such as arrays, strings, and structures, are usually represented by collections
of primitive objects.
Source Language Issues
• Procedure
• Activation Trees
• Control Stack
• Bindings of Names
Storage Organization
• The executing target program runs in its own logical address space in which each program value has
a location. The management and organization of this logical address space is shared between the
compiler, operating system, and target machine. The operating system maps the logical addresses into physical addresses.
• The run-time representation of an object program in the logical address space consists of data and
program areas.
• The run time storage is subdivided to hold code and data as follows:
o The generated target code
o Data objects
o A counterpart of the control stack to keep track of procedure activations
• The size of the generated target code is fixed at compile time, so the compiler can place the
executable target code in a statically determined area Code, usually in the low end of memory.
• The size of some program data objects, such as global constants, and data generated by the compiler,
such as information to support garbage collection, may be known at compile time, and these data
objects can be placed in another statically determined area called Static.
• One reason for statically allocating as many data objects as possible is that the addresses of these
objects can be compiled into the target code.
Activation Trees
• The lifetime of an activation is the sequence of steps present in the execution of the procedure.
• If 'a' and 'b' are two procedures, then their activations will be non-overlapping (when one is called
after the other has finished) or nested (when one is called inside the other).
• A procedure is recursive if a new activation begins before an earlier activation of the same
procedure has ended.
• An activation tree shows the way control enters and leaves activations.
• The node for a is the parent of the node for b if and only if control flows from activation a to b.
• The node for a is to the left of the node for b if and only if the lifetime of a occurs before the lifetime
of b.
main()
{
    int n;
    readarray();
    quicksort(1, n);
}
quicksort(int m, int n)
{
    int i = partition(m, n);
    quicksort(m, i-1);
    quicksort(i+1, n);
}
Control Stack
• We can use a stack , called a control stack to keep track of live procedure activations.
• The idea is to push the node for an activation onto the control stack as the activation begins and to pop
the node when the activation ends.
• Then the contents of the control stack are related to paths to the root of the activation tree.
• When node n is at the top of the control stack, the stack contains the nodes along the path from n to
the root.
Example
• The activation tree shows what has been reached when control enters the activation represented by
q(2, 3). Activations with labels r, p(1, 9), p(1, 3), and q(1, 3) have executed to completion, so the figure
contains dashed lines to their nodes. The solid lines mark the path from q(2, 3) to the root.
Activation Records
• Procedure calls and returns are usually managed by a run-time stack called the control stack. Each
live activation has an activation record (sometimes called a frame) on the control stack. The contents
of an activation record typically include:
o Temporary values, such as those arising from the evaluation of expressions, in cases where
they cannot be held in registers.
o Local data belonging to the procedure whose activation record this is.
o A saved machine status, with information about the state of the machine just before the call to
the procedure. This information typically includes the return address and the contents of
registers that were used by the calling procedure and that must be restored when the return
occurs.
o An "access link" may be needed to locate data needed by the called procedure but found
o Space for the return value of the called function, if any. Again, not all called procedures return
a value, and if one does, we may prefer to place that value in a register for efficiency.
o The actual parameters used by the calling procedure. Commonly, these values are not placed
in the activation record but rather in registers, when possible, for greater efficiency.
Parameter Passing
The communication medium among procedures is known as parameter passing. The values of the
variables from a calling procedure are transferred to the called procedure by some mechanism.
R- value
The value of an expression is called its r-value. The value contained in a single variable also
becomes an r-value if it appears on the right side of the assignment operator.
R-value can always be assigned to some other variable.
L-value
The location of the memory(address) where the expression is stored is known as the l-value of that
expression.
It always appears on the left side of the assignment operator.
• Call by Value
• Call by reference
• Copy restore
• Call by name
Call by Value
• In call by value, the calling procedure passes the r-value of the actual parameters and the
compiler puts that into the called procedure's activation record.
• Formal parameters hold the values passed by the calling procedure; thus any changes made
to the formal parameters do not affect the actual parameters.
Call by Reference
• In call by reference, the formal and actual parameters refer to the same memory location.
• The l-value of the actual parameter is copied to the activation record of the called function. Thus
the called function has the address of the actual parameter.
• If an actual parameter does not have an l-value (e.g., i+3), then it is evaluated in a
new temporary location and the address of that location is passed.
• Any changes made to the formal parameter are reflected in the actual parameter (because
changes are made at the address), as shown in the sketch below.
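The difference between the two mechanisms is easy to see in a short C++ sketch (the function names are illustrative):

#include <iostream>

void byValue(int x)      { x = 99; }   // x holds a copy of the r-value; the caller's variable is untouched
void byReference(int& x) { x = 99; }   // x is bound to the caller's l-value; the change is visible outside

int main() {
    int a = 1, b = 1;
    byValue(a);
    byReference(b);
    std::cout << "a = " << a << ", b = " << b << "\n";   // prints a = 1, b = 99
}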
Call by Name
• In call by name the actual parameters are substituted for formals in all the places formals
occur in the procedure.
• It is also referred to as lazy evaluation because parameters are evaluated only when
needed.
Symbol Table
• Symbol tables are data structures that are used by compilers to hold information about
source-program constructs. The information is collected incrementally by the analysis phases
of a compiler and used by the synthesis phases to generate the target code.
• Entries in the symbol table contain information about an identifier such as its character string
(or lexeme) , its type, its position in storage, and any other relevant information.
• The symbol table, which stores information about the entire source program, is used by all
phases of the compiler.
• An essential function of a compiler is to record the variable names used in the source
program and collect information about various attributes of each name.
• These attributes may provide information about the storage allocated for a name, its type, its
scope.
• A symbol table can be implemented in one of the following ways:
• Linear (sorted or unsorted) list
• Binary Search Tree
• Hash table
• Among the above all, symbol tables are mostly implemented as hash tables, where the
source code symbol itself is treated as a key for the hash function and the return value is the
information about the symbol.
• A symbol table may serve the following purposes depending upon the language in hand:
• To store the names of all entities in a structured form at one place.
• To verify if a variable has been declared.
• To implement type checking, by verifying assignments and expressions.
• To determine the scope of a name (scope resolution).
Symbol-Table Entries
• A compiler uses a symbol table to keep track of scope and binding information about names.
The symbol table is searched every time a name is encountered in the source text.
• Changes to the table occur if a new name or new information about an existing name is
discovered. A linear list is the simplest to implement, but its performance is poor. Hashing
schemes provide better performance.
• The symbol table grows dynamically as the source text is processed, even though the set of declarations in the program is fixed.
• Each entry in the symbol table is for the declaration of a name.
• The format of entries need not be uniform.
• The following information about identifiers is stored in the symbol table (a sketch of such an entry follows this list):
• The name.
• The data type.
• The block level.
• Its scope (local, global).
• Pointer / address
• Its offset from base pointer
• Function name, parameter and variable.
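As a minimal sketch (the field names and sizes are illustrative, not prescribed by any particular compiler), an entry holding the attributes listed above might be declared in C as:

struct sym_entry {
    char  name[64];      /* the identifier's lexeme                    */
    int   type;          /* data type, e.g. a small integer type code  */
    int   block_level;   /* nesting level of the declaring block       */
    int   scope;         /* local or global                            */
    void *address;       /* pointer / address of the allocated storage */
    int   offset;        /* offset from the base pointer               */
};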
Implicit Deallocation
• Implicit deallocation requires cooperation between the user program and the run-
time package, because the latter needs to know when a storage block is no longer in use.
• This cooperation is implemented by fixing the format of storage blocks.
• Reference counts:
o We keep track of the number of blocks that point directly to the present block. If this
count ever drops to 0, then the block can be deallocated because it cannot be referred
to.
o In other words, the block has become garbage that can be collected. Maintaining reference counts requires extra work each time a pointer is created, copied, or destroyed, and reference counting cannot reclaim cyclic (self-referencing) garbage.
• Marking techniques:
o In marking (mark-and-sweep) techniques, allocation is suspended periodically; all blocks reachable from the program's pointers are marked, and any block left unmarked is garbage that can be deallocated.
moDule- iV
Intermediate code generation
Intermediate Code Representation Techniques
In the analysis-synthesis model of a compiler, the front end of a compiler translates a source program into a machine-independent intermediate code; then the back end of the compiler uses this intermediate code to generate
the target code (which can be understood by the machine). The benefits of using machine-independent
intermediate code are:
• Because of the machine-independent intermediate code, portability is enhanced. For example, if a compiler translates the source language directly to the target machine language without generating intermediate code, then a full native compiler is required for each new machine, because the compiler itself must be modified according to each machine's specifications.
• Retargeting is facilitated.
• It is easier to apply source code modification to improve the performance of source code by optimizing
the intermediate code.
Instead of translating the source code directly into object code for the target machine, a compiler can produce a middle-level representation, referred to as intermediate code or intermediate text. There are three common intermediate code representations, as follows −
Postfix Notation:
Also known as reverse Polish notation or suffix notation. The ordinary (infix) way of writing the sum of a
and b is with the operator in the middle: a + b. The postfix notation for the same expression places the operator at the right end, as ab+. In general, if e1 and e2 are any postfix expressions and + is any binary operator, the result of applying + to the values denoted by e1 and e2 is denoted in postfix notation by e1 e2 +. No
parentheses are needed in postfix notation because the position and arity (number of arguments) of the
operators permit only one way to decode a postfix expression. In postfix notation, the operator follows the
operand.
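For instance, the infix expression (a + b) * c becomes ab+c* in postfix. A minimal C sketch (the operand letters and their values are illustrative) of how such a string is evaluated with a stack:

#include <stdio.h>

/* Evaluate the postfix string "ab+c*" with a = 2, b = 3, c = 4.
   Operands are pushed; each operator pops its two arguments and
   pushes the result. */
int eval_postfix(const char *p, int a, int b, int c) {
    int stack[16], top = -1;
    for (; *p; p++) {
        switch (*p) {
        case 'a': stack[++top] = a; break;
        case 'b': stack[++top] = b; break;
        case 'c': stack[++top] = c; break;
        default: {                        /* binary operator */
            int y = stack[top--];
            int x = stack[top--];
            stack[++top] = (*p == '+') ? x + y : x * y;
        }
        }
    }
    return stack[top];
}

int main(void) {
    printf("%d\n", eval_postfix("ab+c*", 2, 3, 4));   /* prints 20, i.e. (2+3)*4 */
    return 0;
}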
Syntax Tree:
A syntax tree is nothing more than a condensed form of a parse tree. The operator and keyword nodes of the parse tree are moved to their parents, and a chain of single productions is replaced by a single link. In the syntax tree, the internal nodes are operators and the leaf nodes are operands. To form a syntax tree, put parentheses in the expression; this makes it easy to recognize which operand should be evaluated first.
Three-Address Code
The three-address code is a sequence of statements of the form A := B op C, where A, B, and C are either programmer-defined names, constants, or compiler-generated temporary names, and op stands for an operator, such as a fixed- or floating-point arithmetic operator or a logical operator on Boolean-valued data.
The reason for the name “three address code” is that each statement generally includes three addresses, two
for the operands, and one for the result.
There are three ways of representing three-address code statements, as follows −
Quadruples representation − Three-address statements can be represented by records with fields for the operator and the operands. It is possible to use a record structure with four fields: the first holds the operator ‘op’, the next two hold operands 1 and 2 respectively, and the last one holds the result. This representation of three-address code is called a quadruple representation.
Triples representation − The contents of the operand 1, operand 2, and result fields are generally pointers to the symbol-table records for the names represented by these fields. Therefore, temporary names must be entered into the symbol table as they are generated.
This can be avoided by letting the position of a statement define the temporary value it computes. If this is done, a record structure with only three fields is enough to represent a three-address statement: the first holds the operator and the next two hold the values of operand 1 and operand 2, respectively. Such a representation is known as a triple representation.
Indirect Triples Representation − The indirect triple representation uses an extra array to list the pointer to
the triples in the desired sequence. This is known as indirect triple representation.
The triple representation for the statement x := (a + b) * -c is as follows −
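Using uminus for the unary minus, a minimal sketch of the triples for this statement is:

(0)   +        a      b
(1)   uminus   c
(2)   *        (0)    (1)
(3)   =        x      (2)

Each triple is referred to by its position, so no explicit temporary names are needed.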
In these productions, nonterminal B represents a boolean expression and non-terminal S represents a statement.
This grammar generalizes the running example of while expressions that we introduced in Example 5.19. As in that example, both B and S have a synthesized attribute code, which gives the translation into three-address instructions. For simplicity, we build up the translations B.code and S.code as strings, using syntax-directed definitions. The semantic rules defining the code attributes could be implemented instead by building up syntax trees and then emitting code during a tree traversal.
The translation of if (B) S1 consists of B.code followed by S1.code, as illustrated in Fig. 6.35(a). Within B.code are jumps based on the value of B. If B is true, control flows to the first instruction of S1.code, and if B is false, control flows to the instruction immediately following S1.code.
The labels for the jumps in B.code and S.code are managed using inherited attributes. With a boolean expression B, we
associate two labels: B.true, the label to which control flows if B is true, and B.false, the label to which control flows
if B is false. With a statement S, we associate an inherited attribute S.next denoting a label for the instruction
immediately after the code for S. In some cases, the instruction immediately following S.code is a jump to some
label L. A jump to a jump to L from within S.code is avoided using S.next.
The syntax-directed definition in Fig. 6.36-6.37 produces three-address code for boolean expressions in the context of
if-, if-else-, and while-statements.
Boolean expressions
Boolean expressions are composed of the boolean operators (which we denote &&, ||, and !, using the C convention for the operators AND, OR, and NOT, respectively) applied to elements that are boolean variables or relational expressions. Relational expressions are of the form E1 rel E2, where E1 and E2 are arithmetic expressions.
We use the attribute rel.op to indicate which of the six comparison operators <, <=, =, !=, >, or >= is represented by rel. As is customary, we assume that || and && are left-associative, and that || has lowest precedence, then &&, then !.
Given the expression B1 || B2, if we determine that B1 is true, then we can conclude that the entire expression is true without having to evaluate B2.
Similarly, given B1 && B2, if B1 is false, then the entire expression is false. The semantic definition of the programming language determines whether all parts of a boolean expression must be evaluated. If the language definition permits (or requires) portions of a boolean expression to go unevaluated, then the compiler can optimize the evaluation of boolean expressions by computing only enough of an expression to determine its value. Thus, in an expression such as B1 || B2, neither B1 nor B2 is necessarily evaluated fully. If either B1 or B2 is an expression with side effects (e.g., it contains a function that changes a global variable), then an unexpected answer may be obtained.
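For instance, a minimal sketch of the jumping-code translation of the statement if ( x < 100 || x > 200 && x != y ) x = 0; (the variable names and label numbers are illustrative) is:

        if x < 100 goto L2
        ifFalse x > 200 goto L1
        ifFalse x != y goto L1
L2:     x = 0
L1:     ...

If x < 100 is true, the whole condition is true and control jumps directly to L2 without evaluating the rest of the expression; otherwise both x > 200 and x != y must hold before control reaches L2.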
Procedure Calls
A procedure is similar to a function; technically, it is an important and frequently used programming construct. In this part of compiler design we want to generate good code for procedure calls and returns. For simplicity, we assume that parameters are passed by value.
A procedure call is a simple statement that includes the procedure name, parentheses with actual parameter
names or values, and a semicolon at the end.
The types of the actual parameters must match the types of the formal parameters created when the
procedure was first declared (if any). The compiler will refuse to compile the compilation unit in which the call
is made if this is not done.
General Form Procedure_Name(actual_parameter_list);
-- where commas separate the parameters
Calling Sequence
The translation of a call involves a calling sequence: a series of actions performed at the beginning and at the end of each call. In a calling sequence, the following actions occur:
• Space is made available for the activation record when a procedure is called.
• To allow the called method to access data in enclosing blocks, set the environment
pointers.
• The caller procedure's state is saved so that it can resume execution following the
call.
• Save the return address as well. It is the location to which the called procedure
must transfer after it has been completed.
• Finally, for the called procedure, generate a jump to the beginning of the code.
The three-address instruction ‘call f, n’ invokes a procedure, where f represents the name of the procedure and n represents the number of parameters.
Now, let’s first take an example of a program to understand function definition and a function call.
main()
{
swap(x,y); //calling function
}
In the above program we have the main function and inside the main function a function call swap(x,y), where
x and y are actual arguments. We also have a function definition for swap, swap(int a, int b) where parameters
a and b are formal parameters.
In the three-address code, a function call is unraveled into the evaluation of parameters in preparation for a
call, followed by the call itself. Let’s understand this with another example:
n= f(a[i])
Here, f is a function applied to a[i], an element of the array of integers a. The function returns a value, and that value is stored in ‘n’, which is also an integer variable.
a → array of integers
f → function from integer to integer
Three address codes for the above function can be written as:
t1= i*4
t2=a[t1]
param t2
t3= call f,1
n=t3
t1= i*4
In this instruction, we compute the offset of element a[i], assuming each integer in the array occupies 4 bytes.
t2=a[t1]
In this instruction, we fetch the value at a particular index of array a. Since t1 contains the offset, t2 will contain the value a[i]. The two instructions above compute the value of the expression a[i] and then store it in t2.
param t2
The value t2 is passed as a parameter of function f(a[i])
t3= call f,1
This instruction is a function call, where 1 represents the number of parameters in the call. It can vary for different function calls, but here it is 1. The called function will return some value, and the value
is stored in t3.
n=t3
The returned value will be assigned to variable n.
Let's see the productions for function definition and function call. Several nonterminals, such as D, F, S, E, and A, are used in the grammar below.
D → define T id ( F ) { S }
F → 𝜖 | T id, F
S → return E ;
E → id ( A )
A→𝜖|E,A
In D→ define T id (F) {S}, the nonterminal D is for declaration, and T is for type. In the function declaration,
we are going to define the type of the function(T), function name(id), parameters and code to be executed.
(F) represent parameters/arguments of the function and {S} is code to be executed.
Now let’s see what can be a formal parameter.
F → 𝜖 | T id, F
Here the parameter can be empty(𝜖) or of some type(T) followed by the name(id). F at the end represents
the sequence of formal parameters. For example, add(int x, int y, int w,.........).
S → return E;
Here S is code(set of statements to be executed) which will return a value of an expression(E).
E → id ( A )
Expression has some function call with the actual parameters. id represents the name of function and (A)
represents actual parameters. Actual parameters can be generated by the nonterminal A.
A → 𝜖 | E, A.
An actual parameter can be an expression E, and A can also generate a sequence of such parameters. For example, the add function can take multiple parameters, as in add(x, y, z, ...).
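As a hedged illustration (the names inc, add, and one are made up for this example), a small definition such as

define int inc ( int x ) { return add ( x , one ) ; }

matches D → define T id ( F ) { S } with T = int, id = inc, and F deriving int x; the body matches S → return E ;, where the call add(x, one) matches E → id ( A ) and the actual parameters x and one are produced by repeated use of A → E , A.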
Code optimization
Source of Optimizations
Optimization is a program transformation technique, which tries to improve the code by making it consume
less resources (i.e. CPU, Memory) and deliver high speed.
In optimization, high-level general programming constructs are replaced by very efficient low-level
programming codes. A code optimizing process must follow the three rules given below:
• The output code must not, in any way, change the meaning of the program.
• Optimization should increase the speed of the program and, if possible, the program should demand fewer resources.
• Optimization should itself be fast and should not delay the overall compiling process.
Efforts for an optimized code can be made at various levels of compiling the process.
• At the beginning, users can change/rearrange the code or use better algorithms to write the
code.
• After generating intermediate code, the compiler can modify the intermediate code by improving address calculations and loops.
• While producing the target machine code, the compiler can make use of memory hierarchy
and CPU registers.
Optimization can be categorized broadly into two types : machine independent and machine dependent.
Machine-independent Optimization
In this optimization, the compiler takes in the intermediate code and transforms a part of the code that does
not involve any CPU registers and/or absolute memory locations. For example:
do
{
item = 10;
value = value + item;
} while(value<100);
This code involves a repeated assignment to the identifier item. If we move that assignment out of the loop:
item = 10;
do
{
value = value + item;
} while(value<100);
we not only save CPU cycles but also obtain code that can be used on any processor.
Machine-dependent Optimization
Machine-dependent optimization is done after the target code has been generated and when the code is
transformed according to the target machine architecture. It involves CPU registers and may have absolute
memory references rather than relative references. Machine-dependent optimizers put efforts to take
maximum advantage of memory hierarchy.
Basic blocks play an important role in identifying variables, which are being used more than once in a single
basic block. If any variable is being used more than once, the register memory allocated to that variable
need not be emptied unless the block finishes execution.
The optimization process can be applied to a basic block. During optimization, the set of expressions computed by the block must not be changed.
There are two types of basic block optimization. These are as follows:
1. Structure-Preserving Transformations
2. Algebraic Transformations
1. Structure-preserving transformations:
(a) Common sub-expression elimination
A common sub-expression need not be computed over and over again. Instead, it can be computed once and its value stored, to be referenced when the expression is encountered again.
1. a : = b + c
2. b : = a - d
3. c : = b + c
4. d : = a - d
In the above code, the second and fourth statements compute the same expression (a - d). So the block can be transformed as follows:
1. a : = b + c
2. b : = a - d
3. c : = b + c
4. d : = b
(b) Dead-code elimination
o It is possible that a program contains a large amount of dead code.
o This can happen when variables are declared and defined and then forgotten; such variables serve no purpose.
o Suppose the statement x := y + z appears in a block and x is dead, that is, never subsequently used. Then this statement can be safely removed without changing the value of the basic block.
(c) Renaming temporary variables
A statement t := b + c, where t is a temporary variable, can be changed to u := b + c, where u is a new temporary variable. All instances of t can then be replaced with u without changing the value of the basic block.
(d) Interchange of two independent adjacent statements
1. t1 : = b + c
2. t2 : = x + y
These two statements can be interchanged without affecting the value of the block, provided the value of t1 does not affect the value of t2.
2. Algebraic transformations:
o In an algebraic transformation, we can change the set of expressions into an algebraically equivalent set. Thus expressions such as x := x + 0 or x := x * 1 can be eliminated from a basic block without changing the set of expressions computed.
o Constant folding is a related class of optimizations. Here, at compile time, we evaluate constant expressions and replace them by their values. Thus the expression 5 * 2.7 would be replaced by 13.5.
o Sometimes unexpected common sub-expressions are generated by relational operators such as <=, >=, <, >, and =.
o Sometimes the associative law is applied to expose common sub-expressions without changing the value of the basic block. For example, if the source code has the assignments
1. a:= b + c
2. e:= c + d + b
the intermediate code may be transformed into
1. a:= b + c
2. t:= c + d
3. e:= t + b
Loop Optimization
Loop optimization is the most valuable machine-independent optimization because a program spends the bulk of its execution time in its inner loops.
If we decrease the number of instructions in an inner loop, the running time of a program may be improved even if we increase the amount of code outside that loop. Three important loop-optimization techniques are:
1. Code motion
2. Induction-variable elimination
3. Strength reduction
1.Code Motion:
Code motion is used to decrease the amount of code inside a loop. This transformation takes a statement or expression that can be moved outside the loop body without affecting the semantics of the program and hoists it out of the loop.
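For example, in the following minimal sketch (assuming the variable limit is not modified inside the loop), the loop-invariant expression limit - 2 is evaluated on every iteration:

while (i <= limit - 2)
{
    /* loop body that does not change limit */
}

After code motion it is computed only once, before the loop:

t = limit - 2;
while (i <= t)
{
    /* same loop body */
}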
2.Induction-Variable Elimination
Induction-variable elimination is used to eliminate or replace induction variables in inner loops.
It can reduce the number of additions in a loop. It improves both code space and run-time performance.
In the corresponding flow graph, we can replace the assignment t4 := 4*j by t4 := t4 - 4. The only problem that arises is that t4 does not have a value when we enter block B2 for the first time, so we place the assignment t4 := 4*j on entry to block B2.
3.Reduction in Strength
o Strength reduction is used to replace expensive operations by cheaper ones on the target machine.
o Addition of a constant is cheaper than a multiplication. So we can replace multiplication with an
addition within the loop.
o Multiplication is cheaper than exponentiation. So we can replace exponentiation with multiplication
within the loop.
Example:
1. while (i<10)
2. {
3. j= 3 * i+1;
4. a[j]=a[j]-2;
5. i=i+2;
6. }
After strength reduction, the multiplication inside the loop is replaced by an addition:
1. s= 3*i+1;
2. while (i<10)
3. {
4. j=s;
5. a[j]= a[j]-2;
6. i=i+2;
7. s=s+6;
8. }
Based on the local information a compiler can perform some optimizations. For example, consider the
following code:
1. x = a + b;
2. x=6*3
o In this code, the first assignment to x is useless: the value computed for x there is never used in the program.
o At compile time the expression 6*3 will be computed, simplifying the second assignment statement to
x = 18;
Some optimization needs more global information. For example, consider the following code:
1. a = 1;
2. b = 2;
3. c = 3;
4. if (....) x = a + 5;
5. else x = b + 4;
6. c = x + 1;
In this code, the assignment to c at line 3 is useless, and the expression x + 1 at line 6 can be simplified to 7, since x is 6 on both branches.
But it is less obvious how a compiler can discover these facts by looking only at one or two consecutive statements. A more global analysis is required, so that the compiler knows, at each point in the program, which definitions reach that point and which values the variables may hold.
Data flow analysis is used to discover this kind of property. The data flow analysis can be performed on the
program's control flow graph (CFG).
The control flow graph of a program is used to determine those parts of a program to which a particular value
assigned to a variable might propagate.
An iterative algorithm –
An iterative algorithm is the most common way to solve the data-flow analysis equations. In this algorithm we keep two states for each block: the in-state and the out-state. The algorithm starts with an approximation of the in-state of each block; the out-states are then computed by applying the transfer functions to the in-states, and the in-states are updated by applying the join operations. The latter two steps are repeated until we reach the fixpoint: the situation in which the in-states no longer change. The iteration can be performed in several orders:
Random order –
This iteration order is not aware of whether the data-flow equations solve a forward or backward data-flow problem, and hence its performance is relatively poor compared to specialized iteration orders.
Post order –
This is a typical iteration order for backward data-flow problems. A node is visited after all its successor nodes have been visited; it is usually implemented with a depth-first strategy.
Reverse post order –
This is a typical iteration order for forward data-flow problems. A node is visited before any of its successor nodes, except when the successor is reached by a back edge.
Example –
line 1: if (cond)
line 2:     a = 5;
line 3: else
line 4:     a = 3;
line 5: endif
line 6:
line 7: b = a;
Example description:
From the above example, we can observe that the reaching definitions of the variable a at line 7 are the assignments a = 5 at line 2 and a = 3 at line 4.
A flow graph is a directed graph. It contains the flow-of-control information for a set of basic blocks.
A control flow graph is used to depict how program control passes among the blocks. It is useful for loop optimization.
o Block B1 is the initial node. Block B2 immediately follows B1, so there is an edge from B1 to B2.
o The target of the jump from the last statement of B1 is the first statement of B2, so there is an edge from B1 (last statement) to B2 (first statement).
o B2 is a successor of B1, and B1 is a predecessor of B2.
moDule- V
Code Generation and Instruction Selection
The final phase in compiler model is the code generator. It takes as input an intermediate
representation of the source program and produces as output an equivalent target program. The code
generation techniques presented below can be used whether or not an optimizing phase occurs before code
generation.
Issues
The following issues arise during the code generation phase:
1. Input to the code generator:
The input to the code generator is the intermediate representation of the source program produced by the front end, together with the information in the symbol table that is used to determine the run-time addresses of the data objects.
2. Target program:
The target program is the output of the code generator. The output can be:
a) Assembly language: It is easier to generate and to check by hand, but it requires an additional assembly step.
b) Relocatable machine language: It allows subprograms to be compiled separately and then linked and loaded together for execution.
c) Absolute machine language: It can be placed in a fixed location in memory and can be executed immediately.
3. Memory management
o During the code-generation process, the symbol-table entries have to be mapped to actual memory addresses, and labels have to be mapped to instruction addresses.
o Mapping names in the source program to addresses of data is done cooperatively by the front end and the code generator.
o Local variables are stack-allocated in the activation record, while global variables are kept in the static area.
4. Instruction selection:
o Nature of instruction set of the target machine should be complete and uniform.
o When you consider the efficiency of target machine then the instruction speed and machine idioms
are important factors.
o The quality of the generated code can be determined by its speed and size.
Example:
1. a:= b + c
2. d:= a + e
Inefficient assembly code is:
1. MOV b, R0   (R0 ← b)
2. ADD c, R0   (R0 ← c + R0)
3. MOV R0, a   (a ← R0)
4. MOV a, R0   (R0 ← a)
5. ADD e, R0   (R0 ← e + R0)
6. MOV R0, d   (d ← R0)
Here the fourth statement is redundant, since R0 already contains the value of a after statement 3.
5. Register allocation
Register can be accessed faster than memory. The instructions involving operands in register are shorter
and faster than those involving in memory operand.
Register allocation: In register allocation, we select the set of variables that will reside in register.
Register assignment: In Register assignment, we pick the register that contains variable.
Certain machines require even-odd pairs of registers for some operands and results.
For example, for the division instruction
1. D x, y
Where,
x is the dividend, held in the even register of an even/odd register pair, and
y is the divisor
6. Evaluation order
The efficiency of the target code can be affected by the order in which the computations are performed. Some computation orders need fewer registers to hold intermediate results than others.
Basic Blocks:
A basic block is a straight-line code sequence with no branches in except to the entry and no branches out except at the end. A basic block is a set of statements that always execute one after the other, in sequence.
The first task is to partition a sequence of three-address code into basic blocks. A new basic block is begun
with the first instruction and instructions are added until a jump or a label is met. In the absence of a jump,
control moves further consecutively from one instruction to another.
Output: a list of basic blocks, with each three-address statement in exactly one block.
Method: First identify the leaders in the code. The rules for finding leaders are as follows:
• The first statement is a leader.
• Any statement that is the target of a conditional or unconditional goto is a leader.
• Any statement that immediately follows a conditional or unconditional goto is a leader.
For each leader, its basic block consists of the leader and all statements up to, but not including, the next leader or the end of the program.
Consider the following source code for dot product of two vectors a and b of length 10:
1. begin
2. prod :=0;
3. i:=1;
4. do begin
5. prod :=prod+ a[i] * b[i];
6. i :=i+1;
7. end
8. while i <= 10
9. end
The three address code for the above source program is given below:
B1
1. (1) prod := 0
2. (2) i := 1
B2
1. (3) t1 := 4* i
2. (4) t2 := a[t1]
3. (5) t3 := 4* i
4. (6) t4 := b[t3]
5. (7) t5 := t2*t4
6. (8) t6 := prod+t5
7. (9) prod := t6
8. (10) t7 := i+1
9. (11) i := t7
10. (12) if i<=10 goto (3)
There are two types of basic block transformations. These are as follows:
1. Structure-Preserving Transformations
2. Algebraic Transformations
In the case of algebraic transformation, we basically change the set of expressions into an
algebraically equivalent set.
x:= x + 0
or x:= x *1
This can be eliminated from a basic block without changing the set of expressions.
Flow Graphs:
• Flow graph is a directed graph containing the flow-of-control information for the set of basic blocks making
up a program.
The nodes of the flow graph are basic blocks. It has a distinguished initial node.
• E.g.: Flow graph for the vector dot product is given as follows:
• B1 is the initial node. B2 immediately follows B1, so there is an edge from B1 to B2. The target of jump
from last statement of B1 is the first statement B2, so there is an edge from B1 (last statement) to B2 (first
statement).
Loops
• A loop is a collection of nodes in a flow graph such that
1. All nodes in the collection are strongly connected.
2. The collection of nodes has a unique entry.
Register Allocation
Registers are the fastest locations in the memory hierarchy, but unfortunately this resource is limited: registers are among the most constrained resources of the target processor. Register allocation is an NP-complete problem. However, the problem can be reduced to graph coloring to achieve allocation and assignment. Therefore, a good register allocator computes an effective approximate solution to a hard problem.
Figure – Input-Output
The register allocator determines which values will reside in the register and which register will hold each of
those values. It takes as its input a program with an arbitrary number of registers and produces a program
with a finite register set that can fit into the target machine.
Allocation vs Assignment:
Allocation: –
Maps an unlimited namespace onto the register set of the target machine.
• Reg. to Reg. Model: Maps virtual registers to physical registers but spills the excess to memory.
• Mem. to Mem. Model: Maps some subset of the memory location to a set of names that models the
physical register set.
Allocation ensures that code will fit the target machine’s reg. set at each instruction.
Assignment: –
Maps an allocated name set to the physical register set of the target machine.
• Assumes allocation has been done so that code will fit into the set of physical registers.
• No more than ‘k’ values are designated into the registers, where ‘k’ is the no. of physical registers.
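Since register allocation is commonly reduced to graph coloring, the following minimal C sketch (the interference graph, the number of registers K, and all names are illustrative assumptions) colors each virtual register greedily with one of K physical registers, marking a node as spilled when no color is free:

#include <stdio.h>
#include <string.h>

#define N 5   /* number of virtual registers (nodes)   */
#define K 3   /* number of physical registers (colors) */

int main(void)
{
    /* adj[i][j] = 1 when virtual registers i and j are live at the same
       time, i.e. they interfere and must get different physical registers. */
    int adj[N][N] = {
        {0,1,1,0,0},
        {1,0,1,1,0},
        {1,1,0,0,0},
        {0,1,0,0,1},
        {0,0,0,1,0},
    };
    int color[N];

    for (int i = 0; i < N; i++) {
        int used[K];
        memset(used, 0, sizeof used);
        for (int j = 0; j < i; j++)          /* colors taken by already-colored neighbours */
            if (adj[i][j] && color[j] >= 0)
                used[color[j]] = 1;
        color[i] = -1;                        /* -1 means "spill to memory" */
        for (int c = 0; c < K; c++)
            if (!used[c]) { color[i] = c; break; }
        if (color[i] >= 0)
            printf("v%d -> R%d\n", i, color[i]);
        else
            printf("v%d -> spilled\n", i);
    }
    return 0;
}

Real allocators order the nodes more carefully (for example by simplifying low-degree nodes first) and estimate spill costs, but the coloring idea is the same.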
Code Generation
Code Generator
A code generator is expected to have an understanding of the target machine’s runtime environment and its
instruction set. The code generator should take the following things into consideration to generate the code:
• Target language : The code generator has to be aware of the nature of the target language
for which the code is to be transformed. That language may facilitate some machine-specific
instructions to help the compiler generate the code in a more convenient way. The target
machine can have either CISC or RISC processor architecture.
• IR Type : Intermediate representation has various forms. It can be in Abstract Syntax Tree
(AST) structure, Reverse Polish Notation, or 3-address code.
• Selection of instruction : The code generator takes Intermediate Representation as input
and converts (maps) it into target machine’s instruction set. One representation can have many
ways (instructions) to convert it, so it becomes the responsibility of the code generator to
choose the appropriate instructions wisely.
• Register allocation : A program has a number of values to be maintained during execution. The target machine's architecture may not allow all of these values to be kept in registers. The code generator decides which values to keep in registers and which registers to use for them.
• Ordering of instructions : At last, the code generator decides the order in which the
instruction will be executed. It creates schedules for instructions to execute them.
Descriptors
The code generator has to track both the registers (for availability) and addresses (location of values) while
generating the code. For both of them, the following two descriptors are used:
• Register descriptor : Register descriptor is used to inform the code generator about the
availability of registers. Register descriptor keeps track of values stored in each register.
Whenever a new register is required during code generation, this descriptor is consulted for
register availability.
• Address descriptor : Values of the names (identifiers) used in the program might be stored
at different locations while in execution. Address descriptors are used to keep track of memory
locations where the values of identifiers are stored. These locations may include CPU
registers, heaps, stacks, memory or a combination of the mentioned locations.
The code generator keeps both descriptors updated in real time. For example, for a load statement LD R1, x, the code generator:
• changes the register descriptor of R1 so that it holds only x, and
• changes the address descriptor of x by adding register R1 as one of its locations.
Code Generation
Basic blocks consist of a sequence of three-address instructions. The code generator takes these sequences of instructions as input.
Note : If the value of a name is found at more than one place (register, cache, or memory), the register's value is preferred over the cache and main memory; likewise, the cache's value is preferred over main memory. Main memory is given the least preference.
getReg : Code generator uses getReg function to determine the status of available registers and the location
of name values. getReg works as follows:
• If variable Y is already in register R, it uses that register.
• Else if some register R is available, it uses that register.
• Else if both the above options are not possible, it chooses a register that requires minimal
number of load and store instructions.
For an instruction x = y OP z, the code generator may perform the following actions. Let us assume that L is
the location (preferably register) where the output of y OP z is to be saved:
• Call function getReg, to decide the location of L.
• Determine the present location (register or memory) of y by consulting the Address Descriptor
of y. If y is not presently in register L, then generate the following instruction to copy the value
of y to L:
MOV y’, L
where y’ represents the copied value of y.
• Determine the present location of z using the same method used in step 2 for y and generate
the following instruction:
OP z’, L
where z’ represents the copied value of z.
• Now L contains the value of y OP z, that is intended to be assigned to x. So, if L is a register,
update its descriptor to indicate that it contains the value of x. Update the descriptor of x to
indicate that it is stored at location L.
• If y and z have no further use, their registers can be given back to the system, as illustrated in the sketch below.
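For instance, applying these steps to a single statement x = y + z (assuming getReg chooses R0 for L and that y and z currently reside only in memory), the generated code might be:

MOV y, R0    (copy y into L = R0, since y is not already in R0)
ADD z, R0    (R0 now holds y + z)

The register descriptor of R0 is then updated to show that it holds the value of x, and the address descriptor of x records R0 as its current location.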
Other code constructs, such as loops and conditional statements, are translated into assembly language in the usual way.
Characteristics
The following are some characteristics of DAG.
• DAG is a type of data structure used to represent the structure of basic blocks.
• Its main aim is to perform the transformation on basic blocks.
• The leaf nodes of the directed acyclic graph represent a unique identifier that can be a variable
or a constant.
• The non-leaf nodes represent an operator symbol.
• Moreover, the nodes are also given a string of identifiers to use as labels for the computed
value.
• Transitive closure and transitive reduction are defined differently in DAG.
• DAG has defined topological ordering.
o Each node contains a list of attached identifiers to hold the computed values.
Method:
Step 1: If node(y) is undefined, create a leaf labeled y and let node(y) be this node. For case(i), if node(z) is undefined, create a leaf labeled z and let node(z) be that node.
Step 2:
For case(i), determine whether there is a node labeled OP whose left child is node(y) and right child is node(z); if not, create such a node. Let n be this node.
For case(ii), check whether there is a node labeled OP with the single child node(y); if not, create such a node. Let n be this node.
Output:
Delete x from the list of identifiers attached to node(x), if any. Append x to the list of attached identifiers of the node n found in Step 2, and finally set node(x) to n.
Example:
Consider the following three address statement:
1. S1:= 4 * i
2. S2:= a[S1]
3. S3:= 4 * i
4. S4:= b[S3]
5. S5:= S2 * S4
6. S6:= prod + S5
7. prod:= S6
8. S7:= i+1
9. i := S7
10. if i<= 20 goto (1)
The order in which computations are done can affect the cost of resulting object code. For example,
consider the following basic block:
t1 : = a + b
t2 : = c + d
t3 : = e - t2
t4 : = t1 - t3
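As a hedged illustration on a hypothetical two-register machine using the MOV/ADD/SUB style shown earlier, evaluating the statements in the order written forces t1 to be stored and reloaded, whereas reordering them as t2, t3, t1, t4 avoids that memory traffic:

Order as written:
MOV a, R0
ADD b, R0        (R0 = t1)
MOV c, R1
ADD d, R1        (R1 = t2)
MOV R0, t1       (t1 is spilled because both registers are needed)
MOV e, R0
SUB R1, R0       (R0 = t3)
MOV t1, R1       (t1 is reloaded)
SUB R0, R1       (R1 = t4)
MOV R1, t4

Reordered as t2, t3, t1, t4:
MOV c, R0
ADD d, R0        (R0 = t2)
MOV e, R1
SUB R0, R1       (R1 = t3)
MOV a, R0
ADD b, R0        (R0 = t1)
SUB R1, R0       (R0 = t4)
MOV R0, t4

The reordered version needs two fewer instructions because no intermediate result has to be stored and reloaded.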
Peephole Optimization
A statement-by-statement code-generations strategy often produces target code that contains redundant
instructions and suboptimal constructs. The quality of such target code can be improved by applying
“optimizing” transformations to the target program.
A simple but effective technique for improving the target code is peephole optimization, a method for trying to improve the performance of the target program by examining a short sequence of target instructions (called the peephole) and replacing these instructions with a shorter or faster sequence, whenever possible.
The peephole is a small, moving window on the target program. The code in the peephole need not
be contiguous, although some implementations do require this. It is characteristic of peephole optimization
that each improvement may spawn opportunities for additional improvements.
Peephole optimization is an optimization technique by which code is optimized to improve the machine's performance. More formally, peephole optimization is performed on a small set of compiler-generated instructions; this small set of instructions is known as the peephole or window.
Redundant Load and Store Elimination
In this optimization, redundant load and store operations are removed. For example, loading and storing values in registers can be optimized.
For example,
a= b+c
d= a+e
It is implemented on the register(R0) as
MOV b, R0; instruction to copy b to the register
ADD c, R0; instruction to Add c to the register, the register is now b+c
MOV R0, a; instruction to Copy the register(b+c) to a
MOV a, R0; instruction to Copy a to the register
ADD e, R0 ;instruction to Add e to the register, the register is now a(b+c)+e
MOV R0, d; instruction to Copy the register to d
This can be optimized by removing load and store operation, like in third instruction value in
register R0 is copied to a, and it again loaded to R0 in the next step for further operation. The
optimized implementation will be:
MOV b, R0; instruction to Copy b to the register
ADD c, R0; instruction to Add c to the register, which is now b+c (a)
MOV R0, a; instruction to Copy the register to a
ADD e, R0; instruction to Add e to the register, which is now b+c+e [(a)+e]
MOV R0, d; instruction to Copy the register to d
Strength Reduction
In strength reduction optimization, operators that consume higher execution time are replaced by
the operators consuming less execution time. Like multiplication and division, operators can be
replaced by shift operators.
Initial code:
b = a * 2;
Optimized code:
b = a << 1;
//left shifting the bit by one gives the same result
Initial code:
b = a / 2;
Optimized code:
b = a >> 1;
// right shifting the bit by one will give the same result
Simplify Algebraic Expressions
Algebraic expressions that are useless or written inefficiently are transformed.
For example:
a=a+0
a=a*1
a=a/1
a=a-0
//All these above expression are causing calculation overhead.
// These can be removed for optimization
Replace Slower Instructions With Faster
Slower instructions can be replaced with faster ones, and registers play an important role. For
example, a register supporting unit increment operation will perform better than adding one to the
register. The same can be done with many other operations, like multiplication.
ADD #1
SUB #1
//The above instruction can be replaced with
// INC R
// DEC R
//If the register supports increment and decrement
Dead Code Elimination
Dead code can be eliminated to improve the system's performance; resources are freed and less memory is needed.
int dead(void)
{
int a=1;
int b=5;
int c=a+b;
return c;
// c will be returned
// The remaining part of code is dead code, never reachable
int k=1;
k=k*2;
k=k+b;
return k;
// This dead code can be removed for optimization
}
Moreover, null sequences and useless operations can be deleted too.
Operations:
Insert ()
o Insert () operation is more frequently used in the analysis phase when the tokens are identified and
names are stored in the table.
o The insert() operation is used to insert the information in the symbol table like the unique name
occurring in the source code.
o In the source code, the attribute for a symbol is the information associated with that symbol. The
information contains the state, value, type and scope about the symbol.
o The insert() function takes the symbol and its attribute value as arguments.
For example, the declaration
1. int x;
would be processed by the compiler as insert(x, int).
lookup()
In the symbol table, the lookup() operation is used to search for a name. It is used to determine:
o whether the symbol exists in the table,
o whether it is declared before it is used,
o whether the name is used in the current scope, and
o whether the symbol has been declared more than once.
The basic format of the operation is:
1. lookup (symbol)
Hash Table –
• In hashing scheme, two tables are maintained – a hash table and symbol table and are the most
commonly used method to implement symbol tables.
• A hash table is an array with an index range: 0 to table size – 1. These entries are pointers pointing
to the names of the symbol table.
• To search for a name we use a hash function that will result in an integer between 0 to table size –
1.
• Insertion and lookup can be made very fast – O(1).
• The advantage is that searching is quick; the disadvantage is that hashing is more complicated to implement. A minimal sketch of such a scheme is given below.
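The following C sketch is only illustrative: the table size, the hash function, and the field names are assumptions, not part of any particular compiler.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define TABLE_SIZE 211                  /* illustrative prime table size */

/* A chained hash-table symbol table: the hash of the lexeme selects a
   bucket, and each bucket holds a linked list of entries. */
struct entry {
    char         *name;                 /* lexeme                         */
    int           type;                 /* small integer type code        */
    struct entry *next;                 /* next entry in the same bucket  */
};

static struct entry *table[TABLE_SIZE];

static unsigned hash(const char *s) {
    unsigned h = 0;
    while (*s)
        h = h * 31 + (unsigned char)*s++;
    return h % TABLE_SIZE;
}

/* lookup(): search for a name; returns NULL if it was never inserted. */
struct entry *lookup(const char *name) {
    for (struct entry *e = table[hash(name)]; e != NULL; e = e->next)
        if (strcmp(e->name, name) == 0)
            return e;
    return NULL;
}

/* insert(): record a new name together with its type attribute. */
struct entry *insert(const char *name, int type) {
    unsigned h = hash(name);
    struct entry *e = malloc(sizeof *e);
    e->name = malloc(strlen(name) + 1);
    strcpy(e->name, name);
    e->type = type;
    e->next = table[h];                 /* prepend to this bucket's chain */
    table[h] = e;
    return e;
}

int main(void) {
    insert("x", 1);                                        /* e.g. int x; */
    printf("%s\n", lookup("x") ? "declared" : "undeclared");
    printf("%s\n", lookup("y") ? "declared" : "undeclared");
    return 0;
}

The example program below then illustrates how names declared in nested scopes would populate a hierarchy of such tables.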
int value=10;
void sum_num()
{
int num_1;
int num_2;
{
int num_3;
int num_4;
}
int num_5;
{
int num_6;
int num_7;
}
}
void sum_id()
{
int id_1;
int id_2;
{
int id_3;
int id_4;
}
int id_5;
}
The global symbol table contains one global variable of integer type and two procedure names, which can be accessed by all the child scopes. The names mentioned in the sum_num symbol table (and all its child tables) are not available to the sum_id table and its child tables.
The hierarchy of the data structures implemented for the symbol table is stored in the semantic analyzer. A
name is searched in the symbol table using the following hierarchy
• First, a symbol will be searched in the current scope, i.e., the current symbol table.
• If the name is found, the search is complete; otherwise, it is searched for in the parent symbol table, until
• either the name is found or the global symbol table has been searched (a minimal lookup sketch follows this list).
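A minimal C sketch of this hierarchical lookup (the scope structure, the fixed array size, and the names are illustrative assumptions; a real table would hold full entries rather than bare names):

#include <string.h>
#include <stddef.h>

struct scope {
    struct scope *parent;        /* enclosing scope; NULL for the global table */
    int           count;         /* number of names declared in this scope     */
    const char   *names[32];     /* declared names (illustrative fixed size)   */
};

/* Search the current scope first, then each enclosing scope, ending with
   the global symbol table; return NULL if the name is undeclared anywhere. */
const char *lookup_scoped(const struct scope *s, const char *name)
{
    for (; s != NULL; s = s->parent)
        for (int i = 0; i < s->count; i++)
            if (strcmp(s->names[i], name) == 0)
                return s->names[i];
    return NULL;
}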
Types Of Errors
Now we'll take a look at some common types of errors. There are three kinds of errors:
1. Compile-time errors
2. Runtime errors
3. Logical errors
Compile-time Errors:-
Compile-time errors are detected during compilation; they include lexical, syntactic, and semantic errors.
1. Lexical Errors
Misspellings of identifiers, keywords, or operators fall into this category. These errors are detected during the lexical-analysis phase. When a series of characters does not satisfy the pattern of any token, a lexical error occurs. These mistakes might occur as a result of spelling mistakes or the appearance of illegal characters.
In general, lexical errors occur when:
• Identifiers or numeric constants are too long.
• Illegal characters appear.
• Strings are unmatched.
Example:
class Factorial{
public static void main(String args[]){
int i,fact=1;
int number=5;
for(i=1;i<=number;i++){
fact=fact*i;
}
System.out.println("Factorial of "+number+" is: "+fact);$$$
}
}
Here, we have used a non-readable syntax after the print statement which results in an error and it
comes under lexical error. So, the error will look like:
error: illegal start of expression
2. Syntactic Errors
These problems arise during the syntax-analysis phase. They occur when there is an imbalance in the parentheses or when some operators are missing, for example a missing semicolon or an unbalanced parenthesis.
Typically Syntactic errors look like this:
• Discrepancies in the structure
• The operator is absent.
• Misspelled Keywords
• Parenthesis that aren't balanced
Example:
class Factorial{
public static void main(String args[]){
int i,fact=1;
int number == 5;
for(i=1;i<=number;i++){
fact=fact*i;
}
System.out.println("Factorial of "+number+" is: "+fact);
}
}
The above code has a syntactic error, since we need to use = (the assignment operator) in the statement, but an invalid expression (==) is used instead.
3. Semantic Errors
These errors are detected at compile time, during the semantic-analysis phase. They occur when operators, variables, or undeclared identifiers are used incorrectly, such as a type conflict between an operator and its operands or an incompatible value assignment.
Some common semantic errors are type mismatches, use of undeclared variables, and multiple declarations of a variable in the same scope.
Example:
class Factorial{
public static void main(String args[]){
int i;
int number = 5;
for(i=1;i<=number;i++){
fact=fact*i;
}
System.out.println("Factorial of "+number+" is: "+fact);
}
}
The above code will produce an error as we didn’t declare the variable “fact” which generates a semantic
error. And hence the output looks like:
error: cannot find symbol
Runtime Errors:-
A run-time error occurs during the execution of a program and is most commonly caused by incorrect system parameters or improper input data. Examples include a lack of memory to run the application, a memory conflict with other software, or division by zero.
These errors also occur when a user supplies input that the running program cannot handle.
Logical Errors:-
Logic errors occur when a program runs without terminating abnormally but does not behave as intended. A logic error can produce unexpected or unwanted output or behavior, even if it is not immediately identified as such. Examples include unreachable code or an unintended infinite loop.
Now, We'll look at several error recovery mechanisms in a compiler as we've got a good knowledge of the
different types of errors.
The compiler's simplest requirement is to simply stop, issue a message, and halt compiling. To cope with errors in the code, the parser can use one of several typical error-recovery mechanisms; some of the most prevalent recovery strategies are described below.
Global Corrections:-
The parser looks over the entire program and tries to figure out what it's supposed to accomplish, then finds
the closest match that's error-free.
When given an incorrect input (statement) X, it generates a parse tree for the closest error-free statement Y.
This method may allow the parser to make minimum changes to the source code, but it has yet to be deployed
in practice due to its complexity (time and space).