Compiler Design


Module - I
Various Phases of a Compiler
The compilation process is a sequence of various phases. Each phase takes input from its previous
stage, has its own representation of source program, and feeds its output to the next phase of the
compiler. Let us understand the phases of a compiler.

Lexical Analysis
The first phase of the compiler works as a text scanner. This phase scans the source code as a stream
of characters and converts it into meaningful lexemes. The lexical analyzer represents these lexemes
in the form of tokens as:

<token-name, attribute-value>
Syntax Analysis
The next phase is called syntax analysis or parsing. It takes the tokens produced by lexical
analysis as input and generates a parse tree (or syntax tree). In this phase, token arrangements are
checked against the source code grammar, i.e. the parser checks if the expression made by the
tokens is syntactically correct.

Semantic Analysis
Semantic analysis checks whether the parse tree constructed follows the rules of the language. For
example, it checks that values are assigned between compatible data types and flags errors such as
adding a string to an integer. The semantic analyzer also keeps track of identifiers, their types and
expressions, and whether identifiers are declared before use. It produces an annotated syntax
tree as an output.

Intermediate Code Generation


After semantic analysis, the compiler generates intermediate code for the source program. It
represents a program for some abstract machine and lies between the high-level language and the
machine language. This intermediate code should be generated in such a way that it is easy to
translate into the target machine code.

Code Optimization
The next phase performs code optimization on the intermediate code. Optimization removes
unnecessary code and rearranges the sequence of statements in order to speed up program
execution without wasting resources (CPU, memory).

Code Generation
In this phase, the code generator takes the optimized representation of the intermediate code and
maps it to the target machine language. The code generator translates the intermediate code into
a sequence of (generally) relocatable machine code. This sequence of machine instructions
performs the same task as the intermediate code.

Symbol Table
It is a data structure maintained throughout all the phases of a compiler. All identifier names
along with their types are stored here. The symbol table makes it easier for the compiler to quickly
search for an identifier record and retrieve it. The symbol table is also used for scope management.
Lexical analysis

Lexical Analysis is the first phase of the compiler, also known as the scanner. It converts
the high-level input program into a sequence of tokens.
• Lexical analysis can be implemented with deterministic finite automata.
• The output is a sequence of tokens that is sent to the parser for syntax analysis.

What is a token? A lexical token is a sequence of characters that can be treated as a
unit in the grammar of the programming language. Examples of tokens:
• Type tokens (id, number, real, . . . )
• Punctuation tokens (IF, void, return, . . . )
• Alphabetic tokens (keywords)
Keywords; examples - for, while, if, etc.
Identifiers; examples - variable names, function names, etc.
Operators; examples - '+', '++', '-', etc.
Separators; examples - ',', ';', etc.

Example of Non-Tokens:
• Comments, preprocessor directive, macros, blanks, tabs, newline, etc.
Lexeme: The sequence of characters matched by a pattern to form the corresponding
token or a sequence of input characters that comprises a single token is called a lexeme.
eg- “float”, “abs_zero_Kelvin”, “=”, “-”, “273”, “;” .

How Lexical Analyzer works-


1. Input preprocessing: This stage involves cleaning up the input text and
preparing it for lexical analysis. This may include removing comments,
whitespace, and other non-essential characters from the input text.

2. Tokenization: This is the process of breaking the input text into a sequence of
tokens. This is usually done by matching the characters in the input text against
a set of patterns or regular expressions that define the different types of tokens.
3. Token classification: In this stage, the lexer determines the type of each token.
For example, in a programming language, the lexer might classify keywords,
identifiers, operators, and punctuation symbols as separate token types.

4. Token validation: In this stage, the lexer checks that each token is valid
according to the rules of the programming language. For example, it might
check that a variable name is a valid identifier, or that an operator has the
correct syntax.

5. Output generation: In this final stage, the lexer generates the output of the
lexical analysis process, which is typically a list of tokens. This list of tokens can
then be passed to the next stage of compilation or interpretation.

• The lexical analyzer identifies errors with the help of the automaton and the grammar of
the given language on which it is based (such as C or C++), and reports the row number and
column number of the error.
Suppose we pass the statement a = b + c ; through the lexical analyzer. It will
generate a token sequence like this: id = id + id ; where each id refers to its
variable in the symbol table, which references all its details. For example, consider the program
int main()
{
// 2 variables
int a, b;
a = 10;
return 0;
}
All the valid tokens are:
'int' 'main' '(' ')' '{' 'int' 'a' ',' 'b' ';'
'a' '=' '10' ';' 'return' '0' ';' '}'
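
The preprocessing/tokenization/classification steps above can be illustrated with a small hand-written scanner. The sketch below is only an illustration under simplifying assumptions (a tiny keyword set, single-character operators and separators, and the Token/nextToken names are invented for the example); it tokenizes a fragment of the example program above.

#include <cctype>
#include <cstddef>
#include <iostream>
#include <set>
#include <string>

// Kinds of tokens the toy scanner can emit.
enum class Kind { Keyword, Identifier, Number, Operator, Separator, End };

struct Token {
    Kind kind;
    std::string lexeme;   // the matched characters
};

// A few reserved words; a real scanner would list every keyword of the language.
const std::set<std::string> keywords = {"int", "return", "if", "while"};

// Read one token starting at position pos of src (whitespace is skipped).
Token nextToken(const std::string& src, std::size_t& pos) {
    while (pos < src.size() && std::isspace(static_cast<unsigned char>(src[pos]))) ++pos;
    if (pos >= src.size()) return {Kind::End, ""};
    char c = src[pos];
    if (std::isalpha(static_cast<unsigned char>(c)) || c == '_') {   // identifier or keyword
        std::size_t start = pos;
        while (pos < src.size() && (std::isalnum(static_cast<unsigned char>(src[pos])) || src[pos] == '_')) ++pos;
        std::string word = src.substr(start, pos - start);
        return {keywords.count(word) ? Kind::Keyword : Kind::Identifier, word};
    }
    if (std::isdigit(static_cast<unsigned char>(c))) {               // numeric constant
        std::size_t start = pos;
        while (pos < src.size() && std::isdigit(static_cast<unsigned char>(src[pos]))) ++pos;
        return {Kind::Number, src.substr(start, pos - start)};
    }
    ++pos;
    if (c == '=' || c == '+' || c == '-' || c == '*' || c == '/') return {Kind::Operator, std::string(1, c)};
    return {Kind::Separator, std::string(1, c)};                     // ; , ( ) { } etc.
}

int main() {
    std::string src = "int a, b; a = 10; return 0;";
    std::size_t pos = 0;
    for (Token t = nextToken(src, pos); t.kind != Kind::End; t = nextToken(src, pos))
        std::cout << "'" << t.lexeme << "' ";                        // prints each lexeme in order
    std::cout << "\n";
}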

Interface with input parser and symbol table

Role of the parser :


In the syntax analysis phase, a compiler verifies whether or not the tokens generated by
the lexical analyzer are grouped according to the syntactic rules of the language. This is
done by a parser. The parser obtains a string of tokens from the lexical analyzer and
verifies that the string can be generated by the grammar of the source language. It detects and reports
any syntax errors and produces a parse tree from which intermediate code can be
generated.

1. It verifies the structure generated by the tokens based on the grammar.


2. It constructs the parse tree.
3. It reports the errors.
4. It performs error recovery.

Issues :

Parser cannot detect errors such as:


1. Variable re-declaration
2. Variable initialization before use
3. Data type mismatch for an operation.
The above issues are handled by Semantic Analysis phase.

Syntax error handling :


Programs can contain errors at many different levels. For example :
1. Lexical, such as misspelling an identifier, keyword or operator.
2. Syntactic, such as an arithmetic expression with unbalanced parentheses.
3. Semantic, such as an operator applied to an incompatible operand.
4. Logical, such as an infinitely recursive call.

Functions of error handler :


1. It should report the presence of errors clearly and accurately.
2. It should recover from each error quickly enough to be able to detect subsequent errors.
3. It should not significantly slow down the processing of correct programs.

Error recovery strategies :


The different strategies that a parser uses to recover from a syntactic error are:
1. Panic mode
2. Phrase level
3. Error productions
4. Global correction

Panic mode recovery:

On discovering an error, the parser discards input symbols one at a time until a synchronizing
token is found. The synchronizing tokens are usually delimiters, such as a semicolon or end.
This method has the advantage of simplicity and does not go into an infinite loop. When multiple
errors in the same statement are rare, this method is quite useful.
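
A minimal sketch of panic-mode recovery, assuming a simple token-stream interface (TokenStream, peek() and advance() are names invented for this illustration): on an error, the parser simply discards tokens until it sees a synchronizing delimiter such as ';' or '}'.

#include <cstddef>
#include <string>
#include <vector>

// Hypothetical token-stream interface used only for this sketch.
struct TokenStream {
    std::vector<std::string> tokens;
    std::size_t pos = 0;
    bool atEnd() const { return pos >= tokens.size(); }
    const std::string& peek() const { return tokens[pos]; }
    void advance() { ++pos; }
};

// On a syntax error, discard input symbols until a synchronizing token is found.
void panicModeRecover(TokenStream& ts) {
    while (!ts.atEnd() && ts.peek() != ";" && ts.peek() != "}")
        ts.advance();          // skip the offending symbols
    if (!ts.atEnd())
        ts.advance();          // consume the synchronizing token and resume parsing after it
}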

Phrase level recovery:

On discovering an error, the parser performs local correction on the remaining input that
allows it to continue. Example: Insert a missing semicolon or delete an extraneous semicolon
etc.

Error productions:

The parser is constructed using an augmented grammar with error productions. If an error
production is used by the parser, appropriate error diagnostics can be generated to indicate the
erroneous constructs recognized in the input.

Global correction:

Given an incorrect input string x and grammar G, certain algorithms can be used to find
a parse tree for a string y, such that the number of insertions, deletions and changes of tokens
is as small as possible. However, these methods are in general too costly in terms of time and
space.

Context-Free Grammars:
The syntax of a programming language is described by context-free grammar (CFG).
CFG consists of a set of terminals, a set of non-terminals, a start symbol, and a set of
productions.
Notation – A → α, where A is a single variable (A ∈ V)
and α ∈ (V ∪ T)*
Ambiguity
A grammar that produces more than one parse tree for some sentence is said to be
ambiguous.
E.g., consider the grammar
S -> aS | Sa | a
For the string aaa, we will have 4 parse trees, hence the grammar is ambiguous.
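As an added check, each of the four parse trees corresponds to one leftmost derivation of aaa:
S ⇒ aS ⇒ aaS ⇒ aaa     (S → aS, then S → aS, then S → a)
S ⇒ aS ⇒ aSa ⇒ aaa     (S → aS, then S → Sa, then S → a)
S ⇒ Sa ⇒ aSa ⇒ aaa     (S → Sa, then S → aS, then S → a)
S ⇒ Sa ⇒ Saa ⇒ aaa     (S → Sa, then S → Sa, then S → a)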

Symbol Table :
Symbol table is an important data structure created and maintained by compilers in order to store
information about the occurrence of various entities such as variable names, function names,
objects, classes, interfaces, etc. Symbol table is used by both the analysis and the synthesis parts
of a compiler.
A symbol table may serve the following purposes depending upon the language in hand:
• To store the names of all entities in a structured form at one place.
• To verify if a variable has been declared.
• To implement type checking, by verifying assignments and expressions in the source
code are semantically correct.
• To determine the scope of a name (scope resolution).
A symbol table is simply a table which can be either linear or a hash table. It maintains an entry
for each name in the following format:

<symbol name, type, attribute>

For example, if a symbol table has to store information about the following variable declaration:

static int interest;

then it should store the entry such as:

<interest, int, static>

The attribute clause contains the entries related to the name.

Implementation
If a compiler is to handle a small amount of data, then the symbol table can be implemented as an
unordered list, which is easy to code but suitable only for small tables. A symbol table
can be implemented in one of the following ways:

• Linear (sorted or unsorted) list


• Binary Search Tree
• Hash table
Among all, symbol tables are mostly implemented as hash tables, where the source code symbol
itself is treated as a key for the hash function and the return value is the information about the
symbol.

Operations
A symbol table, either linear or hash, should provide the following operations.

insert()
This operation is more frequently used by the analysis phase, i.e., the first half of the compiler where
tokens are identified and names are stored in the table. This operation is used to add information
in the symbol table about unique names occurring in the source code. The format or structure in
which the names are stored depends upon the compiler in hand.
An attribute for a symbol in the source code is the information associated with that symbol. This
information contains the value, state, scope, and type about the symbol. The insert() function takes
the symbol and its attributes as arguments and stores the information in the symbol table.
For example:

int a;

should be processed by the compiler as:


insert(a, int);
lookup()
lookup() operation is used to search a name in the symbol table to determine:

• if the symbol exists in the table.


• if it is declared before it is being used.
• if the name is used in the scope.
• if the symbol is initialized.
• if the symbol is declared multiple times.
The format of lookup() function varies according to the programming language. The basic format
should match the following:

lookup(symbol)

This method returns 0 (zero) if the symbol does not exist in the symbol table. If the symbol exists
in the symbol table, it returns its attributes stored in the table.

Scope Management
A compiler maintains two types of symbol tables: a global symbol table which can be accessed by
all the procedures and scope symbol tables that are created for each scope in the program.
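
A minimal sketch of such a table, assuming one hash table per scope kept on a stack (the SymbolTable class, SymbolInfo struct, and method names are illustrative assumptions, not a standard API):

#include <iostream>
#include <optional>
#include <string>
#include <unordered_map>
#include <vector>

struct SymbolInfo {
    std::string type;        // e.g. "int"
    std::string attribute;   // e.g. "static"
};

class SymbolTable {
    // One hash table per scope; the back of the vector is the innermost scope.
    std::vector<std::unordered_map<std::string, SymbolInfo>> scopes;
public:
    SymbolTable() { scopes.emplace_back(); }   // the global scope
    void enterScope() { scopes.emplace_back(); }
    void exitScope()  { scopes.pop_back(); }

    // insert(): add a unique name of the current scope with its attributes.
    bool insert(const std::string& name, const SymbolInfo& info) {
        return scopes.back().emplace(name, info).second;   // false if re-declared
    }

    // lookup(): search the innermost scope first, then the enclosing ones.
    std::optional<SymbolInfo> lookup(const std::string& name) const {
        for (auto it = scopes.rbegin(); it != scopes.rend(); ++it) {
            auto found = it->find(name);
            if (found != it->end()) return found->second;
        }
        return std::nullopt;   // plays the role of "returns 0 if the symbol does not exist"
    }
};

int main() {
    SymbolTable table;
    table.insert("interest", {"int", "static"});   // static int interest;
    table.enterScope();
    table.insert("a", {"int", ""});                // int a; in a local scope
    std::cout << (table.lookup("interest") ? "found\n" : "missing\n");
    table.exitScope();
}

Here lookup() searches the innermost scope first, mirroring the scope-resolution rule described above; a real compiler would store richer attributes (value, state, scope, type) in each entry.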

Token, Lexeme and Patterns

Token

It is basically a sequence of characters that is treated as a unit as it cannot be further
broken down. In programming languages like the C language, keywords (int, char, float,
const, goto, continue, etc.), identifiers (user-defined names), operators (+, -, *, /),
delimiters/punctuators like comma (,), semicolon (;), braces ({ }), etc., and strings can be
considered as tokens. This phase recognizes three types of tokens: Terminal Symbols
(TRM) - keywords and operators, Literals (LIT), and Identifiers (IDN).
Let’s understand now how to calculate tokens in a source code (C language):
Example 1:
int a = 10; //Input Source code

Tokens
int (keyword), a (identifier), = (operator), 10 (constant) and ; (punctuation - semicolon)
Answer – Total number of tokens = 5

Lexeme

It is a sequence of characters in the source code that are matched by given predefined
language rules for every lexeme to be specified as a valid token.
Example:
main is lexeme of type identifier(token)
(,),{,} are lexemes of type punctuation(token)

Pattern

It specifies a set of rules that a scanner follows to create a token.


Example of Programming Language (C, C++):
For a keyword to be identified as a valid token, the pattern is the sequence of characters
that make the keyword.
For an identifier to be identified as a valid token, the pattern is the predefined rule that it
must start with a letter, followed by letters or digits.

Difference between Token, Lexeme, and Pattern

Criteria: Definition
Token – It is basically a sequence of characters that is treated as a unit as it cannot be further broken down.
Lexeme – It is a sequence of characters in the source code that is matched by the predefined language rules for some token.
Pattern – It specifies a set of rules that a scanner follows to create a token.

Criteria: Interpretation of type Keyword
Token – all the reserved keywords of that language (main, printf, etc.)
Lexeme – int, goto
Pattern – the sequence of characters that make the keyword.

Criteria: Interpretation of type Identifier
Token – name of a variable, function, etc.
Lexeme – main, a
Pattern – it must start with a letter, followed by letters or digits.

Criteria: Interpretation of type Operator
Token – all the operators are considered tokens.
Lexeme – +, =
Pattern – +, =

Criteria: Interpretation of type Punctuation
Token – each kind of punctuation is considered a token, e.g. semicolon, bracket, comma, etc.
Lexeme – (, ), {, }
Pattern – (, ), {, }

Criteria: Interpretation of type Literal
Token – a grammar rule or boolean literal.
Lexeme – “Welcome to GeeksforGeeks!”
Pattern – any string of characters (except ‘ ‘) between ” and “

Difficulties in Lexical Analysis


When the token pattern does not match the prefix of the remaining input, the lexical
analyzer gets stuck and has to recover from this state to analyze the remaining input. In
simple words, a lexical error occurs when a sequence of characters does not match the
pattern of any token. Such errors are detected during the lexical analysis phase.

Types of Lexical Error:

Types of lexical error that can occur in a lexical analyzer are as follows:
1. Exceeding length of identifier or numeric constants.
Example:
• C++
#include <iostream>

using namespace std;

int main() {
    int a = 2147483647 + 1;
    return 0;
}

This is a lexical error since signed integer lies between −2,147,483,648 and
2,147,483,647

2. Appearance of illegal characters


Example:
• C++
#include <iostream>

using namespace std;


int main() {
    printf("Geeksforgeeks");$
    return 0;
}

This is a lexical error since an illegal character $ appears at the end of the statement.

3. Unmatched string
Example:
• C++
#include <iostream>

using namespace std;

int main() {
    /* comment
    cout<<"GFG!";
    return 0;
}

This is a lexical error since the ending of comment “*/” is not present but the beginning is
present.

4. Spelling Error
• C++
#include <iostream>

using namespace std;

int main() {
    int 3num = 1234; /* spelling error as an identifier
                        cannot start with a number */
    return 0;
}

5. Replacing a character with an incorrect character.

• C++
#include <iostream>

using namespace std;

int main() {
    int x = 12$34; /* lexical error as '$' doesn't
                      belong within the 0-9 range */
    return 0;
}

Other lexical errors include


6. Removal of the character that should be present.
• C++
#include <iostream> /* missing 'o' character,
                       hence lexical error */

using namespace std;

int main() {
    cout<<"GFG!";
    return 0;
}

7. Transposition of two characters.


• C++
#include <iostream>

using namespace std;

int mian()
{
    /* the spelling of main here ('mian') would be treated as a lexical
       error and won't be considered as an identifier;
       transposition of the characters 'i' and 'a' */
    cout << "GFG!";
    return 0;
}

Input Buffering in Compiler Design


The lexical analyzer scans the input from left to right one character at a time. It uses two
pointers, begin ptr (bp) and forward ptr (fp), to keep track of the portion of the input
scanned. Initially both pointers point to the first character of the input string.
The forward pointer moves ahead to search for the end of the lexeme. As soon as a blank space is
encountered, it indicates the end of the lexeme. In the above example, as soon as the forward
pointer (fp) encounters a blank space, the lexeme "int" is identified. When fp encounters white
space, it ignores it and moves ahead; then both the begin pointer (bp) and forward pointer (fp)
are set to the next token. The input characters are read from secondary storage, but reading
this way from secondary storage is costly, hence a buffering technique is used. A block of data
is first read into a buffer and then scanned by the lexical analyzer. There are two methods used
in this context: the One Buffer Scheme and the Two Buffer Scheme. These are explained below.
1. One Buffer Scheme: In this scheme, only one buffer is used to store the input
string. The problem with this scheme is that if a lexeme is very long, it crosses the
buffer boundary; to scan the rest of the lexeme the buffer has to be refilled, which
overwrites the first part of the lexeme.
2. Two Buffer Scheme: To overcome the problem of the one buffer scheme, in this
method two buffers are used to store the input string. The first and second buffers
are scanned alternately; when the end of the current buffer is reached, the other
buffer is filled. The only problem with this method is that if the length of the lexeme
is longer than the length of a buffer, the input cannot be scanned completely.
Initially both bp and fp point to the first character of the first buffer. Then fp moves
towards the right in search of the end of the lexeme. As soon as a blank character is
recognized, the string between bp and fp is identified as the corresponding token.
To identify the boundary of the first buffer, an end-of-buffer character is placed at
the end of the first buffer; similarly, the end of the second buffer is recognized by
the end-of-buffer mark present at its end. When fp encounters the first eof, the end
of the first buffer is recognized and filling of the second buffer starts. In the same
way, when the second eof is encountered, it indicates the end of the second buffer.
The two buffers are filled alternately until the end of the input program, and the
stream of tokens is identified.
The eof character introduced at the end is called a sentinel, and it is used to
identify the end of the buffer.
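
A minimal sketch of the two-buffer scheme with sentinels, assuming the input comes from a std::istream and using '\0' as the sentinel character (the buffer size N and the reload()/advance() helpers are names made up for this illustration):

#include <fstream>
#include <iostream>

const int N = 4096;                 // size of each buffer half
char buf[2 * N + 2];                // two halves, each followed by a sentinel slot
char *lexemeBegin, *forward;        // the bp and fp pointers described above

// Reload one half of the buffer and plant the sentinel just after the data read.
void reload(std::istream& in, char* half) {
    in.read(half, N);
    half[in.gcount()] = '\0';       // '\0' acts as the eof sentinel here
}

// Return the next character, switching halves whenever a sentinel is reached.
char advance(std::istream& in) {
    char c = *forward++;
    if (c != '\0') return c;
    if (forward == buf + N + 1) {            // sentinel at the end of the first half
        reload(in, buf + N + 1);
        return *forward++;
    } else if (forward == buf + 2 * N + 2) { // sentinel at the end of the second half
        forward = buf;
        reload(in, buf);
        return *forward++;
    }
    return '\0';                             // sentinel in mid-buffer: real end of input
}

int main() {
    std::ifstream in("program.c");  // hypothetical source file
    reload(in, buf);                // fill the first half before scanning starts
    forward = lexemeBegin = buf;
    for (char c = advance(in); c != '\0'; c = advance(in))
        std::cout << c;             // a real scanner would group characters into lexemes here
}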

SPECIFICATION OF TOKENS

There are 3 specifications of tokens:


1) Strings
2) Language
3) Regular expression

Strings and Languages


• An alphabet or character class is a finite set of symbols.
• A string over an alphabet is a finite sequence of symbols drawn from that alphabet.
• A language is any countable set of strings over some fixed alphabet.
In language theory, the terms "sentence" and "word" are often used as synonyms for

"string." The length of a string s, usually written |s|, is the number of occurrences of symbols
in s. For example, banana is a string of length six. The empty string, denoted ε, is the string of
length zero.

Operations on strings
The following string-related terms are commonly used:
1. A prefix of string s is any string obtained by removing zero or more symbols from the
end of string s. For example, ban is a prefix of banana.

2. A suffix of string s is any string obtained by removing zero or more symbols from the
beginning of s. For example, nana is a suffix of banana.

3. A substring of s is obtained by deleting any prefix and any suffix from s. For example,
nan is a substring of banana.

4. The proper prefixes, suffixes, and substrings of a string s are those prefixes, suffixes, and
substrings, respectively, of s that are not ε and not equal to s itself.
5. A subsequence of s is any string formed by deleting zero or more not necessarily consecutive
positions of s
6. For example, baan is a subsequence of banana.

Operations on languages:
The following are the operations that can be applied to languages:
1. Union
2. Concatenation
3. Kleene closure
4. Positive closure

The following example shows the operations on strings: Let L={0,1} and S={a,b,c}

Regular grammar

Regular Expressions
· Each regular expression r denotes a language L(r).

· Here are the rules that define the regular expressions over some alphabet Σ and the
languages that those expressions denote:

1.ε is a regular expression, and L(ε) is { ε }, that is, the language whose sole member is the
empty string.
2. If ‘a’ is a symbol in Σ, then ‘a’ is a regular expression, and L(a) = {a}, that is, the language
with one string, of length one, with ‘a’ in its one position.
3.Suppose r and s are regular expressions denoting the languages L(r) and L(s). Then, a) (r)|(s)
is a regular expression denoting the language L(r) U L(s).
b) (r)(s) is a regular expression denoting the language L(r)L(s). c) (r)* is a regular expression
denoting (L(r))*.
d) (r) is a regular expression denoting L(r).
4.The unary operator * has highest precedence and is left associative.
5.Concatenation has second highest precedence and is left associative.
Regular Expressions
Regular Expressions are used to denote regular languages. An expression is regular if:
• ɸ is a regular expression for regular language ɸ.
• ɛ is a regular expression for regular language {ɛ}.
• If a ∈ Σ (Σ represents the input alphabet), a is regular expression with language
{a}.
• If a and b are regular expression, a + b is also a regular expression with
language {a,b}.
• If a and b are regular expression, ab (concatenation of a and b) is also regular.
• If a is regular expression, a* (0 or more times a) is also regular.

Regular Expression → Regular Language

• set of vowels: (a∪e∪i∪o∪u) → {a, e, i, o, u}

• a followed by 0 or more b: (a.b*) → {a, ab, abb, abbb, abbbb, ….}

• any number of vowels followed by any number of consonants: v*.c* (where v denotes vowels and c denotes consonants) → { ε, a, aou, aiou, b, abcd, ….}, where ε represents the empty string (in case of 0 vowels and 0 consonants)

Regular Grammar : A grammar is regular if it has rules of form A -> a or A -> aB


or A -> ɛ where ɛ is a special symbol called NULL.

Regular Languages : A language is regular if it can be expressed in terms of


regular expression.

Closure Properties of Regular Languages


Union : If L1 and L2 are two regular languages, their union L1 ∪ L2 will also be
regular. For example, L1 = {a^n | n ≥ 0} and L2 = {b^n | n ≥ 0}
L3 = L1 ∪ L2 = {a^n ∪ b^n | n ≥ 0} is also regular.
Intersection : If L1 and L2 are two regular languages, their intersection L1 ∩ L2
will also be regular. For example,
L1 = {a^m b^n | n ≥ 0 and m ≥ 0} and L2 = {a^m b^n ∪ b^n a^m | n ≥ 0 and m ≥ 0}
L3 = L1 ∩ L2 = {a^m b^n | n ≥ 0 and m ≥ 0} is also regular.
Concatenation : If L1 and L2 are two regular languages, their concatenation
L1.L2 will also be regular. For example,
L1 = {a^n | n ≥ 0} and L2 = {b^n | n ≥ 0}
L3 = L1.L2 = {a^m b^n | m ≥ 0 and n ≥ 0} is also regular.
Kleene Closure : If L1 is a regular language, its Kleene closure L1* will also be
regular. For example,
L1 = (a ∪ b)
L1* = (a ∪ b)*
Complement : If L(G) is a regular language, its complement L'(G) will also be
regular. The complement of a language can be found by subtracting the strings which are
in L(G) from the set of all possible strings. For example,
L(G) = {a^n | n > 3}
L'(G) = {a^n | n <= 3}

Note : Two regular expressions are equivalent if languages generated by them are
same. For example, (a+b*)* and (a+b)* generate same language. Every string
which is generated by (a+b*)* is also generated by (a+b)* and vice versa.
Module - II
Syntax Analysis
Syntax Analysis or Parsing is the second phase, i.e. after lexical analysis. It checks the
syntactical structure of the given input, i.e. whether the given input is in the correct syntax
(of the language in which the input has been written) or not. It does so by building a data
structure, called a Parse tree or Syntax tree. The parse tree is constructed by using the
pre-defined Grammar of the language and the input string. If the given input string can be
produced with the help of the syntax tree (in the derivation process), the input string is
found to be in the correct syntax. if not, the error is reported by the syntax analyzer.
Syntax analysis, also known as parsing, is a process in compiler design where the
compiler checks if the source code follows the grammatical rules of the programming
language. This is typically the second stage of the compilation process, following lexical
analysis.
The main goal of syntax analysis is to create a parse tree or abstract syntax tree (AST) of
the source code, which is a hierarchical representation of the source code that reflects
the grammatical structure of the program.
There are several types of parsing algorithms used in syntax analysis, including:
• LL parsing: This is a top-down parsing algorithm that starts with the root of the
parse tree and constructs the tree by successively expanding non-terminals. LL
parsing is known for its simplicity and ease of implementation.
• LR parsing: This is a bottom-up parsing algorithm that starts with the leaves of
the parse tree and constructs the tree by successively reducing the input to non-terminals. LR
parsing is more powerful than LL parsing and can handle a larger class of
grammars.
• LR(1) parsing: This is a variant of LR parsing that uses lookahead to
disambiguate the grammar.
• LALR parsing: This is a variant of LR parsing that uses a reduced set of
lookahead symbols to reduce the number of states in the LR parser.
Once the parse tree is constructed, the compiler can perform semantic analysis
to check if the source code makes sense and follows the semantics of the
programming language. The parse tree or AST can also be used in the code
generation phase of the compiler design to generate intermediate code or machine
code.
A pushdown automaton (PDA) is used to design the syntax analysis phase.
The Grammar for a Language consists of Production rules.
Example: Suppose Production rules for the Grammar of a language are:
S -> cAd
A -> bc|a
And the input string is “cad”.
Now the parser attempts to construct a syntax tree from this grammar for the given input
string. It uses the given production rules and applies those as needed to generate the
string. To generate string “cad” it uses the rules as shown in the given
diagram:

In step (iii) above, the production rule A->bc was not a suitable one to apply (because the
string produced is “cbcd” not “cad”), here the parser needs to backtrack, and apply the next
production rule available with A which is shown in step (iv), and the string “cad” is produced.
Thus, the given input can be produced by the given grammar, therefore the input is correct
in syntax. But backtrack was needed to get the correct syntax tree, which is really a
complex process to implement.
There is an easier way to solve this, using the concepts of FIRST and FOLLOW sets,
which are discussed later.

Advantages :

Advantages of using syntax analysis in compiler design include:


• Structural validation: Syntax analysis allows the compiler to check if the source
code follows the grammatical rules of the programming language, which helps
to detect and report errors in the source code.
• Improved code generation: Syntax analysis can generate a parse tree or
abstract syntax tree (AST) of the source code, which can be used in the code
generation phase of the compiler design to generate more efficient and
optimized code.
• Easier semantic analysis: Once the parse tree or AST is constructed, the
compiler can perform semantic analysis more easily, as it can rely on the
structural information provided by the parse tree or AST.

Disadvantages:

Disadvantages of using syntax analysis in compiler design include:


• Complexity: Parsing is a complex process, and the quality of the parser can
greatly impact the performance of the resulting code. Implementing a parser for
a complex programming language can be a challenging task, especially for
languages with ambiguous grammars.
• Reduced performance: Syntax analysis can add overhead to the compilation
process, which can reduce the performance of the compiler.
• Limited error recovery: Syntax analysis algorithms may not be able to recover
from errors in the source code, which can lead to incomplete or incorrect parse
trees and make it difficult for the compiler to continue the compilation process.
• Inability to handle all languages: Not all languages have formal grammars, and
some languages may not be easily parseable.
Syntax Analyzers
A syntax analyzer or parser takes the input from a lexical analyzer in the form of token streams.
The parser analyzes the source code (token stream) against the production rules to detect any
errors in the code. The output of this phase is a parse tree.

This way, the parser accomplishes two tasks, i.e., parsing the code, looking for errors and
generating a parse tree as the output of the phase.
Parsers are expected to parse the whole code even if some errors exist in the program. Parsers use
error recovery strategies, which were discussed earlier.

Grammar
A grammar is a set of structural rules which describe a language. Grammars assign structure
to any sentence. The term also refers to the study of these rules, a field that includes
morphology, phonology, and syntax. A grammar is capable of describing much of the syntax
of programming languages.

Rules of Form Grammar

• A non-terminal symbol should appear to the left of ::= in at least one production
• The goal symbol should never appear to the right of ::= in any production
• A rule is recursive if its LHS appears in its RHS
Context Free Grammar
A context-free grammar (CFG) has productions whose left-hand side is a single non-terminal. The
rules in a context-free grammar are mainly recursive. A syntax analyser checks whether a specific
program satisfies all the rules of the context-free grammar or not. If it does meet these rules, the
syntax analyser may create a parse tree for that program.
expression -> expression + term
expression -> expression - term
expression -> term
term -> term * factor
term -> term / factor
term -> factor
factor -> ( expression )
factor -> id

Grammar Derivation
Grammar derivation is a sequence of grammar rule which transforms the start symbol into
the string. A derivation proves that the string belongs to the grammar’s language.

Left-most Derivation

When the sentential form of input is scanned and replaced in left to right sequence, it is
known as left-most derivation. The sentential form which is derived by the left-most
derivation is called the left-sentential form.

Right-most Derivation

In right-most derivation, the sentential form of the input is scanned and replaced with
production rules from right to left. The sentential form which is derived from the
rightmost derivation is known as the right-sentential form.

Parsing
The parser is that phase of the compiler which takes a token string as input and with the
help of existing grammar, converts it into the corresponding Intermediate
Representation(IR). The parser is also known as Syntax Analyzer.
Classification of Parser

Types of Parser:
The parser is mainly classified into two categories, i.e. Top-down Parser, and Bottom-up
Parser. These are explained below:
Top-Down Parser:
The top-down parser is the parser that generates the parse tree for the given input
string with the help of grammar productions by expanding the non-terminals, i.e.
it starts from the start symbol and ends on the terminals. It uses left-most
derivation.
Further, the top-down parser is classified into 2 types:
1. Recursive descent parser
2. Predictive parser (non-recursive parser, LL(1) parser, or table-driven parser)
Recursive Descent Parsing –
1. Whenever a non-terminal is expanded for the first time, go with the first alternative
and compare it with the given input string.
2. If a match does not occur, go with the second alternative and compare it with
the given input string.
3. If a match is not found again, go with the next alternative, and so on.
4. If a match occurs for at least one alternative, then the input string is
parsed successfully.
LL(1) or Table Driven or Predictive Parser –
1. In LL(1), the first L stands for Left to Right and the second L stands for Left-most
Derivation. 1 stands for the number of look-ahead tokens used by the parser while
parsing a sentence.
2. LL(1) parsing is constructed from the grammar which is free from left recursion,
common prefix, and ambiguity.
3. LL(1) parser depends on 1 look ahead symbol to predict the production to
expand the parse tree.
4. This parser is Non-Recursive.

Bottom-up Parser:
Bottom-up Parser is the parser that generates the parse tree for the given input string
with the help of grammar productions by reducing the input, i.e. it starts from
the terminals (leaves) and ends at the start symbol. It uses the reverse of the rightmost
derivation.
Further Bottom-up parser is classified into two types: LR parser, and Operator
precedence parser.
• LR parser is the bottom-up parser that generates the parse tree for the given
string by using unambiguous grammar. It follows the reverse of the rightmost
derivation.
LR parser is of four types:
(a)LR(0)
(b)SLR(1)
(c)LALR(1)
(d)CLR(1)

Ambiguity

A grammar is said to be ambiguous if there exists more than one leftmost derivation or more than
one rightmost derivation or more than one parse tree for the given input string. If the grammar is
not ambiguous then it is called unambiguous.

Example:

1. S → aSb | SS
2. S → ε

For the string aabb, the above grammar generates two parse trees:
If the grammar has ambiguity then it is not good for compiler construction. No method
can automatically detect and remove the ambiguity, but you can remove ambiguity by re-writing
the whole grammar without ambiguity.

Top down parsing


The types of top-down parsing are depicted below:

Recursive Descent Parsing


Recursive descent is a top-down parsing technique that constructs the parse tree from the top and
the input is read from left to right. It uses procedures for every terminal and non-terminal entity.
This parsing technique recursively parses the input to make a parse tree, which may or may not
require back-tracking. But the grammar associated with it (if not left factored) cannot avoid back-
tracking. A form of recursive-descent parsing that does not require any back-tracking is known
as predictive parsing.
This parsing technique is regarded recursive as it uses context-free grammar which is recursive in
nature.

Back-tracking
Top- down parsers start from the root node (start symbol) and match the input string against the
production rules to replace them (if matched). To understand this, take the following example of
CFG:

S → rXd | rZd
X → oa | ea
Z → ai

For an input string: read, a top-down parser, will behave like this:
It will start with S from the production rules and will match its yield to the left-most letter of the
input, i.e. ‘r’. The very production of S (S → rXd) matches with it. So the top-down parser advances
to the next input letter (i.e. ‘e’). The parser tries to expand non-terminal ‘X’ and checks its
production from the left (X → oa). It does not match with the next input symbol. So the top-down
parser backtracks to obtain the next production rule of X, (X → ea).
Now the parser matches all the input letters in an ordered manner. The string is accepted.
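
A minimal backtracking recursive-descent sketch for exactly this grammar (S → rXd | rZd, X → oa | ea, Z → ai); the function-per-non-terminal structure and the global cursor are assumptions made for the illustration, not a general parser generator.

#include <cstddef>
#include <iostream>
#include <string>

std::string input;       // the string being parsed, e.g. "read"
std::size_t pos = 0;     // current cursor into the input

bool term(char c) {                       // match a single terminal
    if (pos < input.size() && input[pos] == c) { ++pos; return true; }
    return false;
}

bool X() {                                // X → oa | ea
    std::size_t save = pos;
    if (term('o') && term('a')) return true;
    pos = save;                           // backtrack and try the next alternative
    if (term('e') && term('a')) return true;
    pos = save; return false;
}

bool Z() {                                // Z → ai
    std::size_t save = pos;
    if (term('a') && term('i')) return true;
    pos = save; return false;
}

bool S() {                                // S → rXd | rZd
    std::size_t save = pos;
    if (term('r') && X() && term('d')) return true;
    pos = save;                           // backtrack to retry with the second alternative
    if (term('r') && Z() && term('d')) return true;
    pos = save; return false;
}

int main() {
    input = "read";
    bool ok = S() && pos == input.size(); // accept only if the whole input is consumed
    std::cout << (ok ? "accepted" : "rejected") << "\n";   // prints "accepted"
}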

Predictive Parser
Predictive parser is a recursive descent parser, which has the capability to predict which production
is to be used to replace the input string. The predictive parser does not suffer from backtracking.
To accomplish its tasks, the predictive parser uses a look-ahead pointer, which points to the next
input symbols. To make the parser back-tracking free, the predictive parser puts some constraints
on the grammar and accepts only a class of grammar known as LL(k) grammar.

Predictive parsing uses a stack and a parsing table to parse the input and generate a parse tree.
Both the stack and the input contain an end symbol $ to denote that the stack is empty and the
input is consumed. The parser refers to the parsing table to take any decision on the input and
stack element combination.
In recursive descent parsing, the parser may have more than one production to choose from for a
single instance of input, whereas in predictive parser, each step has at most one production to
choose. There might be instances where there is no production matching the input string, making
the parsing procedure to fail.

LL Parser
An LL Parser accepts LL grammar. LL grammar is a subset of context-free grammar but with some
restrictions to get the simplified version, in order to achieve easy implementation. LL grammar can
be implemented by means of both algorithms namely, recursive-descent or table-driven.
LL parser is denoted as LL(k). The first L in LL(k) is parsing the input from left to right, the second
L in LL(k) stands for left-most derivation and k itself represents the number of look aheads.
Generally k = 1, so LL(k) may also be written as LL(1).

Predictive parsing LL(1) grammar

LL(1) Parsing: Here the 1st L represents that the scanning of the Input will be done from
Left to Right manner and the second L shows that in this parsing technique we are going
to use Left most Derivation Tree. And finally, the 1 represents the number of look-ahead,
which means how many symbols are you going to see when you want to make a
decision.
Essential conditions to check first are as follows:
1. The grammar is free from left recursion.
2. The grammar should not be ambiguous.
3. The grammar has to be left factored in so that the grammar is deterministic
grammar.
These conditions are necessary but not sufficient for a grammar to be LL(1).
Algorithm to construct LL(1) Parsing Table:
Step 1: First check all the essential conditions mentioned above and go to step 2.
Step 2: Calculate First() and Follow() for all non-terminals.
1. First(): if from a variable we try to derive all possible strings, the set of
terminal symbols that can appear at the beginning of those strings is called First.
2. Follow(): the set of terminal symbols which can follow a variable in the process of
derivation.
Step 3: For each production A –> α. (A tends to alpha)
1. Find First(α) and for each terminal in First(α), make entry A –> α in the table.
2. If First(α) contains ε (epsilon) as terminal, then find the Follow(A) and for each
terminal in Follow(A), make entry A –> ε in the table.
3. If the First(α) contains ε and Follow(A) contains $ as terminal, then make entry A
–> ε in the table for the $.
To construct the parsing table, we use the two functions First() and Follow().
In the table, rows will contain the Non-Terminals and the column will contain the Terminal
Symbols. All the Null Productions of the Grammars will go under the Follow elements
and the remaining productions will lie under the elements of the First set.
Now, let’s understand with an example.
Example-1: Consider the Grammar:
E --> TE'
E' --> +TE' | ε
T --> FT'
T' --> *FT' | ε
F --> id | (E)

*ε denotes epsilon
Step1 – The grammar satisfies all properties in step 1
Step 2 – calculating first() and follow()
Find their First and Follow sets:

Production          First          Follow

E –> TE’            { id, ( }      { $, ) }
E’ –> +TE’/ε        { +, ε }       { $, ) }
T –> FT’            { id, ( }      { +, $, ) }
T’ –> *FT’/ε        { *, ε }       { +, $, ) }
F –> id/(E)         { id, ( }      { *, +, $, ) }

Step 3 – making parser table


Now, the LL(1) Parsing Table is:

        id          +            *            (          )          $

E       E –> TE’                              E –> TE’
E’                  E’ –> +TE’                           E’ –> ε    E’ –> ε
T       T –> FT’                              T –> FT’
T’                  T’ –> ε      T’ –> *FT’              T’ –> ε    T’ –> ε
F       F –> id                               F –> (E)

As you can see that all the null productions are put under the Follow set of that symbol
and all the remaining productions lie under the First of that symbol.
Note: Not every grammar is feasible for an LL(1) parsing table. It may be possible that one
cell contains more than one production.
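
A minimal table-driven LL(1) driver for the table above, with the entries hard-coded; the map-based table representation and the string encoding of symbols are illustrative assumptions, not a standard tool.

#include <cstddef>
#include <iostream>
#include <map>
#include <set>
#include <stack>
#include <string>
#include <vector>

using Symbol = std::string;
using Rhs = std::vector<Symbol>;                        // production body; an empty body means ε

int main() {
    // Parsing table M[non-terminal][lookahead] = production body, copied from the table above.
    std::map<Symbol, std::map<Symbol, Rhs>> M;
    M["E"]["id"] = {"T", "E'"};       M["E"]["("] = {"T", "E'"};
    M["E'"]["+"] = {"+", "T", "E'"};  M["E'"][")"] = {};  M["E'"]["$"] = {};
    M["T"]["id"] = {"F", "T'"};       M["T"]["("] = {"F", "T'"};
    M["T'"]["*"] = {"*", "F", "T'"};  M["T'"]["+"] = {};  M["T'"][")"] = {};  M["T'"]["$"] = {};
    M["F"]["id"] = {"id"};            M["F"]["("] = {"(", "E", ")"};

    std::set<Symbol> nonterminals = {"E", "E'", "T", "T'", "F"};
    std::vector<Symbol> input = {"id", "+", "id", "*", "id", "$"};   // id + id * id

    std::stack<Symbol> st;
    st.push("$"); st.push("E");                         // start symbol on top of $
    std::size_t ip = 0;

    while (!st.empty()) {
        Symbol top = st.top();
        if (!nonterminals.count(top)) {                 // terminal (or $): must match the input
            if (top != input[ip]) { std::cout << "error at token " << ip << "\n"; return 1; }
            st.pop(); ++ip;
            if (top == "$") break;
        } else {                                        // non-terminal: consult the table
            auto row = M[top];
            if (!row.count(input[ip])) { std::cout << "error: no table entry\n"; return 1; }
            st.pop();
            const Rhs& body = row[input[ip]];
            for (auto it = body.rbegin(); it != body.rend(); ++it)   // push the body in reverse
                st.push(*it);
            std::cout << top << " -> ";                              // print the production used,
            for (const auto& s : body) std::cout << s << " ";        // tracing the leftmost derivation
            if (body.empty()) std::cout << "ε";
            std::cout << "\n";
        }
    }
    std::cout << "input accepted\n";
}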

Bottom up Parsing
Bottom-up parsing starts from the leaf nodes of a tree and works in upward direction till it reaches the
root node.

The image given below depicts the bottom-up parsers available.


Shift-Reduce Parsing
Shift-reduce parsing uses two unique steps for bottom-up parsing. These steps are known as shift-
step and reduce-step.
• Shift step: The shift step refers to the advancement of the input pointer to the next
input symbol, which is called the shifted symbol. This symbol is pushed onto the
stack. The shifted symbol is treated as a single node of the parse tree.
• Reduce step : When the parser finds a complete grammar rule (RHS) and replaces it
with its (LHS), it is known as reduce-step. This occurs when the top of the stack contains a
handle. To reduce, a POP function is performed on the stack which pops off the handle
and replaces it with LHS non-terminal symbol.
LR Parser
The LR parser is a non-recursive, shift-reduce, bottom-up parser. It uses a wide class of context-
free grammar which makes it the most efficient syntax analysis technique. LR parsers are also
known as LR(k) parsers, where L stands for left-to-right scanning of the input stream; R stands for
the construction of right-most derivation in reverse, and k denotes the number of lookahead
symbols to make decisions.
There are three widely used algorithms available for constructing an LR parser:

• SLR(1) – Simple LR Parser:


o Works on smallest class of grammar
o Few number of states, hence very small table
o Simple and fast construction
• LR(1) – LR Parser:
o Works on complete set of LR(1) Grammar
o Generates large table and large number of states
o Slow construction
• LALR(1) – Look-Ahead LR Parser:
o Works on intermediate size of grammar
o Number of states are same as in SLR(1)
Operator Precedence Grammars
A grammar that is used to define mathematical operators is called an operator
grammar or operator precedence grammar. Such grammars have the restriction that
no production has either an empty right-hand side (null productions) or two adjacent non-
terminals in its right-hand side.
Examples –
This is an example of operator grammar:
E->E+E/E*E/id
However, the grammar given below is not an operator grammar because two non-
terminals are adjacent to each other:
S->SAS/a
A->bSb/b
We can convert it into an operator grammar, though:
S->SbSbS/SbS/a
A->bSb/b

Operator precedence parsing is a kind of shift-reduce parsing method. It is applied to a small class
of operator grammars.

A grammar is said to be operator precedence grammar if it has two properties:

o No R.H.S. of any production has a ε (i.e., there are no null productions).


o No two non-terminals are adjacent.

Operator precedence can only be established between the terminals of the grammar. It ignores the
non-terminals.

There are the three operator precedence relations:


a ⋗ b means that terminal "a" has the higher precedence than terminal "b".

a ⋖ b means that terminal "a" has the lower precedence than terminal "b".

a ≐ b means that the terminal "a" and "b" both have same precedence.

Parsing Action

o Add the $ symbol at both ends of the given input string.


o Now scan the input string from left to right until a ⋗ is encountered.
o Scan towards the left over all the equal precedences until the first left-most ⋖ is encountered.
o Everything between the left-most ⋖ and the right-most ⋗ is a handle.
o $ on $ means parsing is successful.
Example
Grammar:

1. E → E+T/T
2. T → T*F/F
3. F → id

Given string:

1. w = id + id * id

Let us consider a parse tree for it as follows:
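
The parse tree itself is not reproduced here. As an added illustration, inserting the precedence relations into id + id * id and repeatedly reducing the handle between the left-most ⋖ and the right-most ⋗ proceeds as follows (non-terminals are ignored, as noted above):

$ ⋖ id ⋗ + ⋖ id ⋗ * ⋖ id ⋗ $      each id is a handle; reduce each id by F → id
$ ⋖ + ⋖ * ⋗ $                     id * id is the handle; reduce by T → T * F
$ ⋖ + ⋗ $                         id + id * id is now the handle; reduce by E → E + T
$   $                             $ on $: parsing is successful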

LR parsers (SLR, CLR, LALR)


LR parsing is one type of bottom up parsing. It is used to parse the large class of grammars.

In the LR parsing, "L" stands for left-to-right scanning of the input.

"R" stands for constructing a right most derivation in reverse.


"K" is the number of input symbols of the look ahead used to make number of parsing decision.

LR parsing is divided into four parts: LR (0) parsing, SLR parsing, CLR parsing and LALR parsing.

LR (1) Parsing
Various steps involved in the LR (1) Parsing:

o For the given input string write a context free grammar.


o Check the ambiguity of the grammar.
o Add Augment production in the given grammar.
o Create Canonical collection of LR (0) items.
o Draw the DFA (deterministic finite automaton).
o Construct a LR (1) parsing table.

Augment Grammar
Augmented grammar G` will be generated if we add one more production in the given grammar G.
It helps the parser to identify when to stop the parsing and announce the acceptance of the input.

Example
Given grammar

1. S → AA
2. A → aA | b

The Augment grammar G` is represented by

1. S`→ S
2. S → AA
3. A → aA | b

Canonical Collection of LR(0) items


An LR (0) item is a production of G with a dot at some position on the right side of the production.

LR(0) items are useful to indicate how much of the input has been scanned up to a given point
in the process of parsing.

In LR (0) parsing, we place the reduce move in the entire row.

Example
Given grammar:

1. S → AA
2. A → aA | b

Add Augment Production and insert '•' symbol at the first position for every production in G

1. S` → •S
2. S → •AA
3. A → •aA
4. A → •b

I0 State:
Add Augment production to the I0 State and Compute the Closure

I0 = Closure (S` → •S)


Add all productions starting with S in to I0 State because "•" is followed by the non-terminal. So,
the I0 State becomes

I0= S`→•S
S → •AA

Add all productions starting with "A" in modified I0 State because "•" is followed by the non-
terminal. So, the I0 State becomes.

I0= S`→•S
S→•AA
A→•aA
A → •b

I1= Go to (I0, S) = closure (S` → S•) = S` → S•


Here, the Production is reduced so close the State.

I1= S` → S•

I2= Go to (I0, A) = closure (S → A•A)

Add all productions starting with A in to I2 State because "•" is followed by the non-terminal. So,
the I2 State becomes

I2=S→A•A
A→•aA
A → •b

Go to (I2,a) = Closure (A → a•A) = (same as I3)

Go to (I2, b) = Closure (A → b•) = (same as I4)

I3= Go to (I0,a) = Closure (A → a•A)

Add productions starting with A in I3.

A→a•A
A→•aA
A → •b

Go to (I3, a) = Closure (A→a•A)=(same as I3)


Go to (I3, b) = Closure (A → b•) = (same as I4)

I4= Go to (I0, b) = closure (A → b•) = A → b•


I5= Go to (I2, A) = Closure (S → AA•) = S → AA•
I6= Go to (I3, A) = Closure (A → aA•) = A → aA•
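
The Closure computation used repeatedly above can also be written as a small program. The sketch below is for illustration only, for the same grammar (S` → S, S → AA, A → aA | b); the Item struct, the use of 'Z' to stand in for S`, and the single-character symbol encoding are assumptions of the example.

#include <cctype>
#include <cstddef>
#include <iostream>
#include <map>
#include <set>
#include <string>
#include <tuple>

// An LR(0) item: a production with a dot position in its body.
struct Item {
    char head;            // left-hand side non-terminal
    std::string body;     // right-hand side
    std::size_t dot;      // position of the • in the body
    bool operator<(const Item& o) const {
        return std::tie(head, body, dot) < std::tie(o.head, o.body, o.dot);
    }
};

// Grammar: S` -> S, S -> AA, A -> aA | b   (upper-case letters are non-terminals; 'Z' stands for S`)
const std::multimap<char, std::string> grammar = {
    {'Z', "S"}, {'S', "AA"}, {'A', "aA"}, {'A', "b"}
};

std::set<Item> closure(std::set<Item> items) {
    bool changed = true;
    while (changed) {
        changed = false;
        for (const Item& it : std::set<Item>(items)) {
            if (it.dot >= it.body.size()) continue;                      // dot at the end: nothing to add
            char x = it.body[it.dot];
            if (!std::isupper(static_cast<unsigned char>(x))) continue;  // dot is not before a non-terminal
            auto range = grammar.equal_range(x);                         // add X -> •γ for every production of X
            for (auto p = range.first; p != range.second; ++p)
                changed |= items.insert({x, p->second, 0}).second;
        }
    }
    return items;
}

int main() {
    std::set<Item> I0 = closure({{'Z', "S", 0}});      // I0 = Closure(S` -> •S)
    for (const Item& it : I0)
        std::cout << it.head << " -> " << it.body.substr(0, it.dot) << "." << it.body.substr(it.dot) << "\n";
}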

SLR (1) Parsing


SLR (1) refers to simple LR parsing. It is the same as LR (0) parsing; the only difference is in the parsing
table. To construct the SLR (1) parsing table, we use the canonical collection of LR (0) items.

In SLR (1) parsing, we place the reduce move only in the Follow of the left-hand side.

Various steps involved in the SLR (1) Parsing:

o For the given input string write a context free grammar


o Check the ambiguity of the grammar
o Add Augment production in the given grammar
o Create Canonical collection of LR (0) items
o Draw the DFA (deterministic finite automaton)
o Construct a SLR (1) parsing table

Example

1. S -> •Aa
2. A->αβ•

1. Follow(S) = {$}
2. Follow (A) = {a}

SLR ( 1 ) Grammar
S→E
E→E+T|T
T→T*F|F
F → id

Add Augment Production and insert '•' symbol at the first position for every production in G

S`→•E
E→•E+T
E→•T
T→•T*F
T→•F
F → •id

I0 State:

Add Augment production to the I0 State and Compute the Closure

I0 = Closure (S` → •E)

Add all productions starting with E in to I0 State because "." is followed by the non-terminal. So,
the I0 State becomes

I0= S`→•E
E→•E+T
E → •T

Add all productions starting with T and F in modified I0 State because "." is followed by the non-
terminal. So, the I0 State becomes.

I0= S`→•E
E→•E+T
E→•T
T→•T*F
T→•F
F → •id

I1= Go to (I0, E) = closure (S`→E•,E→ E• + T)


I2= Go to (I0, T) = closure (E → T•, T → T• * F)
I3= Go to (I0, F) = Closure ( T → F• ) = T → F•
I4= Go to (I0, id) = closure ( F → id•) = F → id•
I5= Go to (I1, +) = Closure (E → E +•T)

Add all productions starting with T and F in I5 State because "." is followed by the non-terminal.
So, the I5 State becomes

I5= E→E+•T
T→•T*F
T→•F
F → •id

Go to (I5, F) = Closure (T → F•) = (same as I3)


Go to (I5, id) = Closure (F → id•) = (same as I4)

I6= Go to (I2, *) = Closure (T → T * •F)

Add all productions starting with F in I6 State because "." is followed by the non-terminal. So, the
I6 State becomes

I6= T→T*•F
F → •id

Go to (I6, id) = Closure (F → id•) = (same as I4)

I7= Go to (I5, T) = Closure (E → E + T•) = E → E + T•


I8= Go to (I6, F) = Closure (T → T * F•) = T → T * F•

CLR (1) Parsing


CLR refers to canonical lookahead LR. CLR parsing uses the canonical collection of LR (1) items to build
the CLR (1) parsing table. The CLR (1) parsing table produces more states as compared to
the SLR (1) parsing.

In CLR (1) parsing, we place the reduce move only under the lookahead symbols.

Various steps involved in the CLR (1) Parsing:

o For the given input string write a context free grammar


o Check the ambiguity of the grammar
o Add Augment production in the given grammar
o Create Canonical collection of LR (1) items
o Draw the DFA (deterministic finite automaton)
o Construct a CLR (1) parsing table

LR (1) item

LR (1) item is a collection of LR (0) items and a look ahead symbol.

LR (1) item = LR (0) item + look ahead

The look ahead is used to determine where we place the final (reduce) item.

The look ahead is always $ for the augment production.

Example
CLR ( 1 ) Grammar

1. S → AA
2. A → aA
3. A → b

Add Augment Production, insert '•' symbol at the first position for every production in G and also
add the lookahead.

1. S` → •S, $
2. S → •AA, $
3. A → •aA, a/b
4. A → •b, a/b

I0 State:

Add Augment production to the I0 State and Compute the Closure

I0 = Closure (S` → •S)

Add all productions starting with S in to I0 State because "." is followed by the non-terminal. So,
the I0 State becomes

I0= S`→•S,$
S → •AA, $

Add all productions starting with A in modified I0 State because "." is followed by the non-
terminal. So, the I0 State becomes.
I0= S`→•S,$
S→•AA,$
A→•aA,a/b
A → •b, a/b

I1= Go to (I0, S) = closure (S` → S•, $) = S` → S•, $


I2= Go to (I0, A) = closure ( S → A•A, $ )

Add all productions starting with A in I2 State because "." is followed by the non-terminal. So, the
I2 State becomes

I2= S→A•A,$
A→•aA,$
A → •b, $

I3= Go to (I0, a) = Closure ( A → a•A, a/b )

Add all productions starting with A in I3 State because "." is followed by the non-terminal. So, the
I3 State becomes

I3= A→a•A,a/b
A→•aA,a/b
A → •b, a/b

Go to (I3, a) = Closure (A → a•A, a/b) = (same as I3)


Go to (I3, b) = Closure (A → b•, a/b) = (same as I4)

I4= Go to (I0, b) = closure ( A → b•, a/b) = A → b•, a/b


I5= Go to (I2, A) = Closure (S → AA•, $) =S → AA•, $
I6= Go to (I2, a) = Closure (A → a•A, $)

Add all productions starting with A in I6 State because "." is followed by the non-terminal. So, the
I6 State becomes

I6= A→a•A,$
A→•aA,$
A → •b, $

Go to (I6, a) = Closure (A → a•A, $) = (same as I6)


Go to (I6, b) = Closure (A → b•, $) = (same as I7)

I7= Go to (I2, b) = Closure (A → b•, $) = A → b•, $


I8= Go to (I3, A) = Closure (A → aA•, a/b) = A → aA•, a/b
I9= Go to (I6, A) = Closure (A → aA•, $) = A → aA•, $
LALR (1) Parsing:
LALR refers to the lookahead LR. To construct the LALR (1) parsing table, we use the canonical
collection of LR (1) items.

In the LALR (1) parsing, the LR (1) items which have same productions but different look ahead are
combined to form a single set of items

LALR (1) parsing is same as the CLR (1) parsing, only difference in the parsing table.

Example
LALR ( 1 ) Grammar

1. S → AA
2. A → aA
3. A → b

Add Augment Production, insert '•' symbol at the first position for every production in G and also
add the look ahead.

1. S` → •S, $
2. S → •AA, $
3. A → •aA, a/b
4. A → •b, a/b

I0 State:

Add Augment production to the I0 State and Compute the Closure

I0 = Closure (S` → •S)

Add all productions starting with S in to I0 State because "•" is followed by the non-terminal. So,
the I0 State becomes

I0= S`→•S,$
S → •AA, $

Add all productions starting with A in modified I0 State because "•" is followed by the non-
terminal. So, the I0 State becomes.

I0= S`→•S,$
S→•AA,$
A → •aA, a/b
A → •b, a/b
I1= Go to (I0, S) = closure (S` → S•, $) = S` → S•, $
I2= Go to (I0, A) = closure ( S → A•A, $ )

Add all productions starting with A in I2 State because "•" is followed by the non-terminal. So, the
I2 State becomes

I2= S → A•A, $
A → •aA, $
A → •b, $

I3= Go to (I0, a) = Closure ( A → a•A, a/b )

Add all productions starting with A in I3 State because "•" is followed by the non-terminal. So, the
I3 State becomes

I3= A → a•A, a/b


A → •aA, a/b
A → •b, a/b

Go to (I3, a) = Closure (A → a•A, a/b) = (same as I3)


Go to (I3, b) = Closure (A → b•, a/b) = (same as I4)

I4= Go to (I0, b) = closure ( A → b•, a/b) = A → b•, a/b


I5= Go to (I2, A) = Closure (S → AA•, $) =S → AA•, $
I6= Go to (I2, a) = Closure (A → a•A, $)

Add all productions starting with A in I6 State because "•" is followed by the non-terminal. So, the
I6 State becomes

I6 = A → a•A, $
A → •aA, $
A → •b, $

Go to (I6, a) = Closure (A → a•A, $) = (same as I6)


Go to (I6, b) = Closure (A → b•, $) = (same as I7)

I7= Go to (I2, b) = Closure (A → b•, $) = A → b•, $


I8= Go to (I3, A) = Closure (A → aA•, a/b) = A → aA•, a/b
I9= Go to (I6, A) = Closure (A → aA•, $) = A → aA•, $

If we analyze then LR (0) items of I3 and I6 are same but they differ only in their lookahead.

I3 = { A → a•A, a/b
A → •aA, a/b
A → •b, a/b
}
I6= { A → a•A, $
A → •aA, $
A → •b, $
}

Clearly I3 and I6 are the same in their LR (0) items but differ in their lookahead, so we can combine
them and call the result I36.

I36 = { A → a•A, a/b/$


A → •aA, a/b/$
A → •b, a/b/$
}

The I4 and I7 are the same but differ only in their look ahead, so we can combine them and call
the result I47.

I47 = {A → b•, a/b/$}

The I8 and I9 are the same but differ only in their look ahead, so we can combine them and call
the result I89.

I89 = {A → aA•, a/b/$}


Module - III
Syntax Directed Definitions
A syntax-directed definition (SDD) is a context-free grammar together with attributes and rules.
Attributes are associated with grammar symbols and rules are associated with productions. If X is a
symbol and a is one of its attributes, then we write X.a to denote the value of a at a particular parse-
tree node labeled X. If we implement the nodes of the parse tree by records or objects, then the
attributes of X can be implemented by data fields in the records that represent the nodes
for X. Attributes may be of any kind: numbers, types, table references, or strings, for instance. The
strings may even be long sequences of code, say code in the intermediate language used by a
compiler.

Inherited and Synthesized Attributes


We shall deal with two kinds of attributes for nonterminals:
1. A synthesized attribute for a nonterminal A at a parse-tree node N is defined by a semantic rule
associated with the production at N. Note that the production must have A as its head. A synthesized
attribute at node N is defined only in terms of attribute values at the children of N and at N itself.
2. An inherited attribute for a nonterminal B at a parse-tree node N is defined by a semantic rule
associated with the production at the parent of N. Note that the production must have B as a symbol
in its body. An inherited attribute at node N is defined only in terms of attribute values at N's parent,
N itself, and N's siblings.
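
As a small added illustration (the standard desk-calculator example, with names chosen here rather than taken from the text above), an SDD that uses only synthesized attributes attaches a rule to each production; val is a synthesized attribute of E, T and F, and lexval is the value supplied by the lexical analyzer for digit:

E → E1 + T     { E.val = E1.val + T.val }
E → T          { E.val = T.val }
T → T1 * F     { T.val = T1.val * F.val }
T → F          { T.val = F.val }
F → ( E )      { F.val = E.val }
F → digit      { F.val = digit.lexval }

Each rule computes the attribute of the head from attributes in the body, so every attribute here is synthesized; an inherited attribute would instead be set by a rule attached to the parent, for example passing a declared type down to a list of identifiers.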

S.No    Synthesized Attributes vs. Inherited Attributes

1. Synthesized: an attribute is said to be a synthesized attribute if its parse-tree node value is determined by the attribute values at the child nodes.
   Inherited: an attribute is said to be an inherited attribute if its parse-tree node value is determined by the attribute values at the parent and/or sibling nodes.

2. Synthesized: the production must have the non-terminal as its head.
   Inherited: the production must have the non-terminal as a symbol in its body.

3. Synthesized: a synthesized attribute at node n is defined only in terms of attribute values at the children of n and at n itself.
   Inherited: an inherited attribute at node n is defined only in terms of attribute values of n’s parent, n itself, and n’s siblings.

4. Synthesized: it can be evaluated during a single bottom-up traversal of the parse tree.
   Inherited: it can be evaluated during a single top-down and sideways traversal of the parse tree.

5. Synthesized: synthesized attributes can be contained by both terminals and non-terminals.
   Inherited: inherited attributes can’t be contained by terminals; they are contained only by non-terminals.

6. Synthesized: synthesized attributes are used by both S-attributed SDTs and L-attributed SDTs.
   Inherited: inherited attributes are used only by L-attributed SDTs.

Dependency Graph
A dependency graph is used to represent the flow of information among the attributes in a parse
tree. In a parse tree, a dependency graph basically helps to determine the evaluation order for the
attributes. The main aim of the dependency graphs is to help the compiler to check for various types
of dependencies between statements in order to prevent them from being executed in the incorrect
sequence, i.e. in a way that affects the program’s meaning. This is the main aspect that helps in
identifying the program’s numerous parallelizable components.

It assists us in determining the impact of a change and the objects that are affected by it. A dependency
graph is created by drawing edges that connect dependent actions. These edges impose a partial ordering
among the operations and restrict which parts of a program may run in parallel. Use-definition chaining is
one form of dependency analysis, though it can produce unduly cautious estimates of data dependence.
Between two statements i and j on a common control path, the following kinds of dependence may hold.

Dependency graphs, like other directed graphs, have nodes or vertices, depicted as boxes or
circles with names, as well as arrows linking them in their required traversal direction. Dependency
graphs are commonly used in scientific literature to describe semantic links, temporal and causal
dependencies between events, and the flow of electric current in electronic circuits. Drawing
dependency graphs is so common in computer science that we’ll want to employ tools that automate
the process based on some basic textual instructions from us.

Types of dependencies:
Dependencies are broadly classified into the following categories:
1. Data Dependencies:
When a statement computes data that is later utilized by another statement. A state in which
instruction must wait for a result from a preceding instruction before it can complete its execution. A
data dependence will trigger a stoppage in the flowing services of a processor pipeline or block the
parallel issuing of instructions in a superscalar processor in high-performance processors using
pipeline or superscalar approaches.
2. Control Dependencies:
Control Dependencies are those that come from a program’s well-ordered control flow. A scenario
in which a program instruction executes if the previous instruction evaluates in a fashion that permits
it to execute is known as control dependence.
3. Flow Dependency:
In computer science, a flow dependence occurs when a program statement refers to the data of a
previous statement.
4. Antidependence:
When an instruction needs a value that a later instruction modifies, this is known as an anti-dependency, or
write-after-read (WAR). The order of the two instructions cannot be swapped, nor can they be performed in
parallel (which could in effect change the instruction ordering), because doing so could change the value
that the earlier instruction reads (see the C fragment after this list).
5. Output-Dependency:
An output dependence, also known as write-after-write (WAW), happens when the sequence in which
instructions are executed has an impact on a variable's final value. Two instructions that write the same
variable are output dependent; altering their order would affect the final value, hence they cannot be run
in parallel (the C fragment after this list shows an example).
6. Control-Dependency:
An instruction B has a control dependence on a preceding instruction A if the outcome of A determines
whether B should be performed. For instance, an instruction S2 guarded by a conditional S1 has a control
dependence on S1, whereas an instruction S3 that is always executed regardless of the outcome of S1 does
not depend on S1.
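As a concrete illustration of these dependence kinds, the following small C fragment (written only for this
example; the variable names are arbitrary) exhibits flow, anti, output, and control dependences:

/* S1 */ a = b + c;      /* defines a                                         */
/* S2 */ d = a * 2;      /* flow (RAW): S2 reads the a written by S1          */
/* S3 */ a = e + 1;      /* anti (WAR): S3 overwrites a after S2 has read it;
                            output (WAW): S3 and S1 both write a              */
/* S4 */ if (d > 0)
/* S5 */     f = a;      /* control: S5 executes only if S4 permits it        */

Because of these dependences, S1, S2, and S3 must keep their relative order, and S5 cannot be hoisted
above S4.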
Example of Dependency Graph:
Design the dependency graph for the following grammar:

PRODUCTIONS             SEMANTIC RULES

E -> E1 + E2            E.val = E1.val + E2.val

E -> E1 * E2            E.val = E1.val * E2.val

Required dependency graph for the above grammar is represented as –


Dependency Graph for the above example

1. Synthesized attributes are represented by .val.


2. Hence, E.val, E1.val, and E2.val have synthesized attributes.
3. Dependencies are shown by black arrows.
4. Arrows from E1 and E2 show that the value of E depends upon E1 and E2.

Uses of Dependency Graph:


• The primary idea behind dependency graphs is for the compiler to check for various types of
dependencies between statements in order to prevent them from being executed in the
incorrect sequence, i.e. in a way that affects the program’s meaning.
• This aids it in identifying the program’s numerous parallelizable components.
• Automated software installers: They go around the graph seeking software packages that are
needed but haven’t been installed yet. The coupling of the packages determines the reliance.
• Instructions scheduling uses dependency in a wider way.
• Dependency graphs are widely used in Dead code elimination.

Evaluation Order
Evaluation order for SDD includes how the SDD(Syntax Directed Definition) is evaluated with the
help of attributes, dependency graphs, semantic rules, and S and L attributed definitions. SDD helps
in the semantic analysis in the compiler so it’s important to know about how SDDs are evaluated
and their evaluation order. This section provides detailed information about SDD evaluation. It requires
some basic knowledge of grammars, productions, parse trees, annotated parse trees, and synthesized and
inherited attributes.

Terminologies:
• Parse Tree: A parse tree is a tree that represents the syntax of the production hierarchically.
• Annotated Parse Tree: Annotated Parse tree contains the values and attributes at each node.
• Synthesized Attributes: The evaluation of a node's attribute is based on its children.
• Inherited Attributes: The evaluation of a node's attribute is based on its parent and/or siblings.

Dependency Graphs:
A dependency graph provides information about the order of evaluation of attributes with the help of
edges. It is used to determine the order of evaluation of attributes according to the semantic rules
of the production. An edge from the first node attribute to the second node attribute gives the
information that first node attribute evaluation is required for the evaluation of the second node
attribute. Edges represent the semantic rules of the corresponding production.
Dependency Graph Rules: A node of the dependency graph corresponds to a node of the parse tree for
each attribute. An edge (from a first node to a second node) of the dependency graph represents that the
attribute of the first node is evaluated before the attribute of the second node.

Ordering the Evaluation of Attributes:


The dependency graph provides the evaluation order of attributes of the nodes of the parse tree. An
edge( i.e. first node to the second node) in the dependency graph represents that the attribute of
the second node is dependent on the attribute of the first node for further evaluation. This order of
evaluation gives a linear order called topological order.
There is no way to evaluate SDD on a parse tree when there is a cycle present in the graph and
due to the cycle, no topological order exists.

Production Table

S.No.   Production        Semantic Rules

1.      S ⇢ A & B         S.val = A.syn + B.syn

2.      A ⇢ A1 # B        A.syn = A1.syn * B.syn
                          A1.inh = A.syn

3.      A1 ⇢ B            A1.syn = B.syn

4.      B ⇢ digit         B.syn = digit.lexval


Annotated Parse Tree For 1#2&3

Explanation of dependency graph:


Node number in the graph represents the order of the evaluation of the associated attribute.
Edges in the graph represent that the second value is dependent on the first value.
Table-1 represents the attributes corresponding to each node.
Table-2 represents the semantic rules corresponding to each edge.
Table-1

Node Attribute

1 digit.lexval

2 digit.lexval

3 digit.lexval

4 B.syn

5 B.syn

6 B.syn

7 A1.syn

8 A.syn

9 A1.inh

10 S.val

Table-2

From    To      Corresponding Semantic Rule (from the production table)

1       4       B.syn = digit.lexval
2       5       B.syn = digit.lexval
3       6       B.syn = digit.lexval
4       7       A1.syn = B.syn
5       8       A.syn = A1.syn * B.syn
6       10      S.val = A.syn + B.syn
7       8       A.syn = A1.syn * B.syn
8       10      S.val = A.syn + B.syn
8       9       A1.inh = A.syn

Bottom up and Top down Evaluation of Attributes


L- and S-attributed definitions

S-Attributed Definitions:
S-attributed SDD can have only synthesized attributes. In this type of definitions semantic rules are
placed at the end of the production only. Its evaluation is based on bottom up parsing.
Example: S ⇢ AB { S.x = f(A.x | B.x) }

L-Attributed Definitions:
L-attributed SDDs can have both synthesized and inherited attributes (with the restriction that an inherited
attribute may depend only on the parent or on left siblings). In this type of definition, semantic rules can be
placed anywhere on the RHS of the production. Evaluation follows a depth-first, left-to-right traversal of the
parse tree (a topological order of the dependency graph).
Example: S ⇢ AB {A.x = S.x + 2} or S ⇢ AB { B.x = f(A.x | B.x) } or S ⇢ AB { S.x = f(A.x | B.x) }

Note:
• Every S-attributed grammar is also L-attributed.
• For L-attributed definitions, evaluation follows an in-order (depth-first, left-to-right) traversal of the
annotated parse tree.
• For S-attributed definitions, evaluation follows the reverse of a rightmost derivation (bottom-up parsing).
• Semantic rules with controlled side effects: side effects are the program fragments contained within
semantic rules. Side effects in an SDD can be controlled in two ways: permit only incidental side effects, or
constrain the admissible evaluation orders so that every admissible order yields the same translation.
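As a concrete illustration of bottom-up (S-attributed) evaluation, the sketch below computes the synthesized
attribute val for the tiny grammar E → E + T | T, T → digit by walking a syntax tree from the leaves upward.
The node layout and names are assumptions made only for this example, not part of any standard interface.

#include <stdio.h>

/* A syntax-tree node; 'val' holds the synthesized attribute. */
struct Node {
    char op;                       /* '+' for an interior node, 'd' for a digit leaf */
    int  val;                      /* E.val / T.val                                  */
    struct Node *left, *right;
};

/* Bottom-up evaluation: children are evaluated before the parent,
   just as a bottom-up parser applies the rule E.val = E1.val + T.val
   when it reduces by E -> E1 + T.                                     */
int eval(struct Node *n) {
    if (n->op == 'd')              /* T -> digit : T.val = digit.lexval */
        return n->val;
    n->val = eval(n->left) + eval(n->right);
    return n->val;
}

int main(void) {
    struct Node d1 = {'d', 2, 0, 0}, d2 = {'d', 3, 0, 0};
    struct Node plus = {'+', 0, &d1, &d2};   /* tree for 2 + 3 */
    printf("%d\n", eval(&plus));             /* prints 5       */
    return 0;
}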

Type checking
Type checking is the process of verifying and enforcing constraints of types in values. A compiler
must check that the source program should follow the syntactic and semantic conventions of the
source language and it should also check the type rules of the language. It allows the programmer
to limit what types may be used in certain circumstances and assigns types to values. The type-
checker determines whether these values are used appropriately or not.
It checks the types of objects and reports a type error in the case of a violation; where the language permits,
mismatches may be repaired by an implicit conversion. Whichever compiler we use, while compiling the
program it has to follow the type rules of the language, and every language has its own set of type rules.
The information about data types like INTEGER, FLOAT, CHARACTER, and all the other data types
is maintained and computed by the compiler. The compiler contains modules, where the type
checker is a module of a compiler and its task is type checking.

Conversion
Conversion from one type to another type is known as implicit if it is to be done automatically by the
compiler. Implicit type conversions are also called Coercion and coercion is limited in many
languages.
Example: An integer may be converted to a real but real is not converted to an integer.
Conversion is said to be Explicit if the programmer writes something to do the Conversion.
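For instance, in C an int operand is coerced to double automatically, while a narrowing conversion is normally
written explicitly by the programmer. A minimal illustration (not tied to any particular compiler):

#include <stdio.h>

int main(void) {
    int    i = 7;
    double d = i + 2.5;         /* implicit (coercion): i is converted to 7.0   */
    int    t = (int)(d * 2.0);  /* explicit: the cast truncates 19.0 down to 19 */
    printf("%f %d\n", d, t);    /* prints 9.500000 19                           */
    return 0;
}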

Tasks:
• It has to ensure that indexing is applied only to an array.
• It has to check the ranges of the data types used; for example:
• INTEGER (int) has a range of -32,768 to +32,767.
• FLOAT has a range of about 1.2E-38 to 3.4E+38.

Types of Type Checking:


There are two kinds of type checking:
• Static Type Checking.
• Dynamic Type Checking.
Static Type Checking:
Static type checking is defined as type checking performed at compile time. It checks the type
variables at compile-time, which means the type of the variable is known at the compile time. It
generally examines the program text during the translation of the program. Using the type rules of
a system, a compiler can infer from the source text that a function (fun) will be applied to an operand
(a) of the right type each time the expression fun(a) is evaluated.

Examples of Static checks include:

Type-checks: A compiler should report an error if an operator is applied to an incompatible operand.


For example, if an array variable and function variable are added together.
Flow-of-control checks: Statements that cause the flow of control to leave a construct must have
someplace to which to transfer the flow of control. For example, a break statement in C causes
control to leave the smallest enclosing while, for, or switch statement; an error occurs if such an
enclosing statement does not exist.
Uniqueness checks: There are situations in which an object must be defined exactly once. For example,
in Pascal an identifier must be declared uniquely, labels in a case statement must be distinct, and
elements in a scalar type may not be repeated.
Name-related checks: Sometimes the same name may appear two or more times. For example in
Ada, a loop may have a name that appears at the beginning and end of the construct. The compiler
must check that the same name is used at both places.

The Benefits of Static Type Checking:

• Runtime error protection.
• It catches syntactic errors like spurious words or extra punctuation.
• It catches wrong names, such as misspellings of predefined names (e.g. Math).
• It detects incorrect argument types.
• It catches the wrong number of arguments.
• It catches wrong return types, like return "70" from a function that's declared to return an int.

Dynamic Type Checking:


Dynamic type checking is type checking done at run time. In dynamic type checking, types are associated
with values, not variables. In implementations of dynamically type-checked languages, runtime objects
generally carry a type tag, i.e., a reference to a type containing its type information. Dynamic typing is more
flexible: a static type system always restricts what can be conveniently expressed, and dynamic typing
results in more compact programs since types need not be spelled out. Programming with a static type
system often requires more design and implementation effort.

Languages like Pascal and C have static type checking. Type checking is used to check the
correctness of the program before its execution. The main purpose of type-checking is to check the
correctness and data type assignments and type-casting of the data types, whether it is syntactically
correct or not before their execution.
Static Type-Checking is also used to determine the amount of memory needed to store the variable.

The design of the type-checker depends on:


• Syntactic Structure of language constructs.
• The Expressions of languages.
• The rules for assigning types to constructs (semantic rules).

The Position of the Type checker in the Compiler:

Type checking in Compiler


The token stream from the lexical analyzer is passed to the PARSER, which generates a syntax tree. Once
the program (source code) has been converted into a syntax tree, the type checker plays a crucial role: by
examining the syntax tree it can tell whether each data type is applied to the correct kind of value. The type
checker verifies the tree, applying any required corrections or conversions, and the resulting syntax tree is
then used for INTERMEDIATE CODE generation.

Overloading:
An Overloading symbol is one that has different operations depending on its context.
Overloading is of two types:
• Operator Overloading
• Function Overloading

Operator Overloading: In mathematics, the addition operator '+' in the arithmetic expression "x+y" is
overloaded because '+' denotes different operations depending on whether 'x' and 'y' are integers,
complex numbers, reals, or matrices.
Example: In Ada, the parentheses '()' are overloaded: the expression A(i) may mean the ith element of an
array A, a call to a function 'A' with argument 'i', or an explicit conversion of expression i to type 'A'. In
most languages the arithmetic operators are overloaded.

Function Overloading: The Type Checker resolves the Function Overloading based on types of
arguments and Numbers.
Type Systems

• A type system is a collection of rules that assign types to program constructs (more constraints

added to checking the validity of the programs, violation of such constraints indicate errors).

• A languages type system specifies which operations are valid for which types.

• Type systems provide a concise formalization of the semantic checking rules.

• Type rules are defined on the structure of expressions.

• Type rules are language specific.

Type checks: The compiler checks that names and values are used in accordance with
type rules of the language.

Type conversions: Detection of implicit type conversions.

Dereferencing checks: The compiler checks that dereferencing is applied only to a


pointer.

Indexing checks: The compiler checks that indexing is applied only to an array.

Function call checks: The compiler checks that a function (or procedure) is applied to
the correct number and type of arguments.

Uniqueness checks: In many cases an identifier must be declared exactly once.

Flow-of-control checks: The compiler checks that if a statement causes the flow of
control to leave a construct, then there is a place to transfer this flow. For instance when
using break in C
Type Expressions

• A type expression is either a basic type or is formed by applying an operator called a type constructor

to a type expression. The sets of basic types and constructors depend on the language to be checked.

The following are some of type expressions:

• A basic type is a type expression. Typical basic types for a language include boolean, char, integer,

float, and void(the absence of a value). type_error is a special basic type.

• A type constructor applied to type expressions. Constructors include:

o Arrays : If T is a type expression, then array(I, T) is a type expression denoting the type of an

array with elements of type T and index set I. I is often a range of integers. Ex. int a[25] ;

o Products : If T1 and T2 are type expressions, then their Cartesian product T1 x T2 is a type

expression. x associates to the left and that it has higher precedence. Products are introduced

for completeness; they can be used to represent a list or tuple of types (e.g., for function

parameters).

o Records : A record is a data structure with named fields. A type expression can be formed by

applying the record type constructor to the field names and their types.

o Pointers : If T is a type expression, then pointer (T) is a type expression denoting the type

"pointer to an object of type T". For example: int a; int *p=&a;

o Functions : Mathematically, a function maps elements of one set (the domain) to another set (the range),

F : D -> R. A type expression can be formed by using the type constructor -> for function types. We

write s -> t for "function from type s to type t".

• Type expressions may contain variables whose values are themselves type expressions.

Example

• The array type int [2][3] can be written as the type expression array(2, array(3, integer)). This type is

represented by a tree in which the operator array takes two parameters, a number and a type (a small C
sketch of such a representation follows).
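A type expression such as array(2, array(3, integer)) is naturally represented as a tree whose interior nodes
are constructors. The following sketch shows one possible C representation; the type and function names
are chosen only for this illustration:

#include <stdlib.h>

enum Kind { INTEGER, FLOAT, ARRAY, POINTER, PRODUCT, FUNCTION };

/* A type expression: a basic type, or a constructor applied to
   one or two operand type expressions.                          */
struct TypeExpr {
    enum Kind kind;
    int size;                        /* index-set size, used by ARRAY only */
    struct TypeExpr *left, *right;   /* operands of the constructor        */
};

static struct TypeExpr *mk(enum Kind k, int size,
                           struct TypeExpr *l, struct TypeExpr *r) {
    struct TypeExpr *t = malloc(sizeof *t);
    t->kind = k; t->size = size; t->left = l; t->right = r;
    return t;
}

int main(void) {
    struct TypeExpr *intty = mk(INTEGER, 0, NULL, NULL);
    /* int a[2][3]  ==>  array(2, array(3, integer)) */
    struct TypeExpr *a = mk(ARRAY, 2, mk(ARRAY, 3, intty, NULL), NULL);
    /* int *p       ==>  pointer(integer)            */
    struct TypeExpr *p = mk(POINTER, 0, intty, NULL);
    (void)a; (void)p;
    return 0;
}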

Equivalence of Type Expressions


If two type expressions are equal then return a certain type else return type_error.
Key Ideas:
The main difficulty arises from the fact that most modern languages allow the naming of user-defined
types.
For instance, in C and C++ this is achieved by the typedef statement.
When checking equivalence of named types, we have two possibilities.
• Structural Equivalence
• Names Equivalence

Structural Equivalence

• Type expressions are built from basic types and constructors, a natural concept of equivalence

between two type expressions is structural equivalence. i.e., two expressions are either the same

basic type or formed by applying the same constructor to structurally equivalent types. That is, when type
names are treated as standing for themselves, two type expressions are structurally equivalent if and only
if they are identical.

• For example, the type expression integer is equivalent only to integer because they are the same

basic type.

• Similarly, pointer (integer) is equivalent only to pointer (integer) because the two are formed by

applying the same constructor pointer to equivalent types.

• The algorithm recursively compares the structure of type expressions without checking for cycles, so it

can be applied to a tree representation. It assumes that the only type constructors are for arrays,

products, pointers, and functions (a recursive sketch is given below).
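Continuing the TypeExpr sketch above, a recursive structural-equivalence test simply compares the
constructor at each node and recurses on the operands; as assumed in the text, cycles are not handled:

/* Uses struct TypeExpr and enum Kind from the earlier sketch.
   Two type expressions are structurally equivalent if they are the same
   basic type, or the same constructor applied to structurally equivalent
   operands (with the same index-set size in the case of arrays).          */
int sequiv(struct TypeExpr *s, struct TypeExpr *t) {
    if (s == t) return 1;                        /* same node              */
    if (s == 0 || t == 0 || s->kind != t->kind) return 0;
    if (s->kind == INTEGER || s->kind == FLOAT) return 1;
    if (s->kind == ARRAY && s->size != t->size) return 0;
    return sequiv(s->left, t->left) && sequiv(s->right, t->right);
}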

Names Equivalence

• In some languages, types can be given names. For example, in the following Pascal program fragment:

• The identifier link is declared to be a name for the type cell. The variables next, last, p, q, r are not

identical type, because the type depends on the implementation.

• Type graph is constructed to check the name equivalence.

o Every time a type constructor or basic type is seen, a new node is created.

o Every time a new type name is seen, a leaf is created.

o Two type expressions are equivalent if they are represented by the same node in the type

graph.

Example: Consider Pascal program fragment

Names Equivalence

• The identifier link is declared to be a name for the type cell. New type names np and nqr have been

introduced.

• since next and last are declared with the same type name, they are treated as having equivalent

types. Similarly, q and r are treated as having equivalent types because the same implicit type name

is associated with them.

• However, p, q, and next do not have equivalent types, since they all have types with different

names.
Pascal Program-Fragment

Note that the type name cell has three parents, all labeled pointer. An equal sign appears between
the type name link and the node in the type graph to which it refers.

Type Conversion
• Converting type casts
– No code needed for structural equivalence
– Run-time semantic error for intersecting values
– Possible conversion of low-level representations, e.g., float to integer
• Non-converting type casts
– E.g., an array of characters reinterpreted as pointers or integers, or bit manipulation of floats (a small C illustration follows)
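In C, for example, a cast from float to int is a converting cast that generates conversion code, while a
non-converting reinterpretation of the same bits can be written with memcpy. A small illustrative sketch:

#include <stdio.h>
#include <string.h>
#include <stdint.h>

int main(void) {
    float f = 2.5f;

    int converted = (int)f;            /* converting cast: the value becomes 2    */

    uint32_t bits;                     /* non-converting: the same 32 bits,       */
    memcpy(&bits, &f, sizeof bits);    /* reinterpreted as an unsigned integer    */

    printf("%d 0x%08x\n", converted, (unsigned)bits);   /* 2 0x40200000 */
    return 0;
}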

Run Time System

• Run Time Environment establishes relationships between names and data objects.

• The allocation and de-allocation of data objects are managed by the Run Time Environment

• Each execution of a procedure is referred to as an activation of the procedure.

• If the procedure is recursive, several of its activations may be alive at the same time. Each call of a

procedure leads to an activation that may manipulate data objects allocated for its use.

• The representation of a data object at run time is determined by its type.

• Elementary data types, such as characters, integers, and reals can be represented by equivalent data

objects in the target machine.

• However, aggregates, such as arrays, strings , and structures, are usually represented by collections

of primitive objects.
Source Language Issues

• Procedure

• Activation Trees

• Control Stack

• The Scope of a Declaration

• Bindings of Names

Storage Organization

• The executing target program runs in its own logical address space in which each program value has

a location. The management and organization of this logical address space is shared between the

compiler, operating system, and target machine. The operating system maps the logical addresses

into physical addresses, which are usually spread throughout memory.

• The run-time representation of an object program in the logical address space consists of data and

program areas.

• The run time storage is subdivided to hold code and data as follows:

o The generated target code

o Data objects

o Control stack (which keeps track of information about procedure activations)

• The size of the generated target code is fixed at compile time, so the compiler can place the

executable target code in a statically determined area Code, usually in the low end of memory.

• The size of some program data objects, such as global constants, and data generated by the compiler,

such as information to support garbage collection, may be known at compile time, and these data

objects can be placed in another statically determined area called Static.

• One reason for statically allocating as many data objects as possible is that the addresses of these

objects can be compiled into the target code.

• In early versions of Fortran, all data objects could be allocated statically.

Activation Trees

• Each execution of procedure is referred to as an activation of the procedure.

• The lifetime of an activation is the sequence of steps in the execution of the procedure.
• If 'a' and 'b' are two procedures, then their activations will be either non-overlapping (when one is called

after the other) or nested (when one is called within the other).

• A procedure is recursive if a new activation begins before an earlier activation of the same

procedure has ended.

• An activation tree shows the way control enters and leaves activations.

Rules to Construct an Activation Tree

• Each node represents an activation of a procedure.

• The root node represents the activation of the main program.

• The node for a is the parent of the node for b if and only if control flows from activation a to b.

• The node for a is to the left of the node for b if and only if the lifetime of a occurs before the lifetime

of b.

Sample Code for Quick sort


main()
{
    int n;
    readarray();            /* read n elements into a global array */
    quicksort(1, n);
}

quicksort(int m, int n)
{
    if (n > m) {             /* nothing to sort otherwise */
        int i = partition(m, n);
        quicksort(m, i - 1);
        quicksort(i + 1, n);
    }
}

The activation tree for this program


Activation Tree

Control Stack

• We can use a stack , called a control stack to keep track of live procedure activations.

• The idea is to push the node for activation onto the control stack as the activation begins and to pop

the node when the activation ends.

• Then the contents of the control stack are related to paths to the root of the activation tree.

• When node n is at the top of the control stack, the stack contains the nodes along the path from n to

the root.
Example

• The figure shows the activation tree that has been reached when control enters the activation represented
by q(2, 3). Activations with labels r, p(1, 9), p(1, 3), and q(1, 3) have executed to completion, so the figure

contains dashed lines to their nodes. The solid lines mark the path from q(2, 3) to the root.

Control stack contains nodes along a path to the root


Activation Records

• Procedure calls and returns are usually managed by a run-time stack called the control stack. Each

live activation has an activation record (sometimes called a frame) on the control stack. The contents

of activation records vary with the language being implemented.

• The following are the contents in an activation record

o Temporary values, such as those arising from the evaluation of expressions, in cases where

those temporaries cannot be held in registers.

o Local data belonging to the procedure whose activation record this is.

o A saved machine status, with information about the state of the machine just before the call to

the procedure. This information typically includes the return address and the contents of

registers that were used by the calling procedure and that must be restored when the return

occurs.

o An "access link" may be needed to locate data needed by the called procedure but found

elsewhere, e.g., in another activation record.

o A control link, pointing to the activation record of the caller.

o Space for the return value of the called function, if any. Again, not all called procedures return

a value, and if one does, we may prefer to place that value in a register for efficiency.
o The actual parameters used by the calling procedure. Commonly, these values are not placed

in the activation record but rather in registers.

Parameter Passing and Symbol Table

Parameter Passing
The communication medium among procedures is known as parameter passing. The values of the
variables from a calling procedure are transferred to the called procedure by some mechanism.

R- value

The value of an expression is called its r-value. The value contained in a single variable also
becomes an r-value if it appears on the right side of the assignment operator.
An r-value can always be assigned to some other variable.

L-value

The memory location (address) at which the value of an expression is stored is known as the l-value of that
expression.
It always appears on the left side of the assignment operator.

Different ways of passing the parameters to the procedure

• Call by Value

• Call by reference

• Copy restore

• Call by name

Call by Value
• In call by value, the calling procedure passes the r-values of the actual parameters and the
compiler puts them into the called procedure's activation record.
• The formal parameters hold the values passed by the calling procedure; thus any changes made
to the formal parameters do not affect the actual parameters.
Call by Value

Call by Reference
• In call by reference, the formal and actual parameters refer to the same memory location.
• The l-value of each actual parameter is copied into the activation record of the called function, so
the called function has the addresses of the actual parameters.
• If an actual parameter does not have an l-value (e.g. i+3), it is evaluated into a new temporary
location and the address of that location is passed.
• Any change made to a formal parameter is reflected in the actual parameter (because the changes
are made at the address). A small C sketch contrasting call by value and call by reference is given below.

Call by Reference
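C itself passes parameters by value; call by reference is obtained by explicitly passing addresses. The
contrast is visible in the two swap variants below (an illustrative sketch only):

#include <stdio.h>

/* Call by value: the formals a and b are copies in swap_val's activation
   record, so the actual parameters x and y are left unchanged.            */
void swap_val(int a, int b) { int t = a; a = b; b = t; }

/* Call by reference (simulated with pointers): the l-values (addresses) of
   the actuals are passed, so the assignments change x and y themselves.    */
void swap_ref(int *a, int *b) { int t = *a; *a = *b; *b = t; }

int main(void) {
    int x = 1, y = 2;
    swap_val(x, y);   printf("%d %d\n", x, y);   /* prints 1 2 */
    swap_ref(&x, &y); printf("%d %d\n", x, y);   /* prints 2 1 */
    return 0;
}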

Call by Copy Restore


• In call by copy-restore, the compiler copies the values of the actual parameters into the formal
parameters when the procedure is called, and copies them back into the actual parameters when
control returns to the calling procedure.
• The r-values are passed, and on return the r-values of the formals are copied into the l-values of the actuals.
Call by Copy

Call by Name
• In call by name, the actual parameters are substituted for the formals in all the places the formals
occur in the procedure.
• It is also referred to as lazy evaluation because a parameter is evaluated only when it is needed.

Call by Name

Symbol Table
• Symbol tables are data structures that are used by compilers to hold information about
source-program constructs. The information is collected incrementally by the analysis phases
of a compiler and used by the synthesis phases to generate the target code.
• Entries in the symbol table contain information about an identifier such as its character string
(or lexeme) , its type, its position in storage, and any other relevant information.
• The symbol table, which stores information about the entire source program, is used by all
phases of the compiler.
• An essential function of a compiler is to record the variable names used in the source
program and collect information about various attributes of each name.
• These attributes may provide information about the storage allocated for a name, its type, its
scope.
• A symbol table can be implemented in one of the following ways:
• Linear (sorted or unsorted) list
• Binary Search Tree
• Hash table
• Among all of the above, symbol tables are mostly implemented as hash tables, where the
source-code symbol itself is treated as a key for the hash function and the return value is the
information about the symbol (a minimal sketch follows this list).
• A symbol table may serve the following purposes depending upon the language in hand:
• To store the names of all entities in a structured form at one place.
• To verify if a variable has been declared.
• To implement type checking, by verifying assignments and expressions.
• To determine the scope of a name (scope resolution).
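A minimal sketch of such a hash-table symbol table is given below, using chaining to resolve collisions.
The entry layout, table size, and hash function are choices made only for this illustration:

#include <stdlib.h>
#include <string.h>

#define TABLE_SIZE 211              /* a small prime, chosen arbitrarily here */

struct Symbol {                     /* one symbol-table entry                 */
    char *name;                     /* the lexeme                             */
    char *type;                     /* e.g. "int", "float"                    */
    int   scope;                    /* block level                            */
    struct Symbol *next;            /* chaining for collisions                */
};

static struct Symbol *table[TABLE_SIZE];

static unsigned hash(const char *s) {            /* simple multiplicative hash */
    unsigned h = 0;
    while (*s) h = h * 31 + (unsigned char)*s++;
    return h % TABLE_SIZE;
}

struct Symbol *lookup(const char *name) {        /* NULL means "not declared"  */
    for (struct Symbol *p = table[hash(name)]; p; p = p->next)
        if (strcmp(p->name, name) == 0) return p;
    return NULL;
}

struct Symbol *insert(const char *name, const char *type, int scope) {
    unsigned h = hash(name);
    struct Symbol *p = malloc(sizeof *p);
    p->name = strdup(name); p->type = strdup(type);   /* strdup is POSIX */
    p->scope = scope; p->next = table[h];
    table[h] = p;
    return p;
}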

Symbol-Table Entries
• A compiler uses a symbol table to keep track of scope and binding information about names.
The symbol table is searched every time a name is encountered in the source text.
• Changes to the table occur if a new name or new information about an existing name is
discovered. A linear list is the simplest to implement, but its performance is poor. Hashing
schemes provide better performance.
• The symbol table grows dynamically; its size cannot be fixed at compile time.
• Each entry in the symbol table is for the declaration of a name.
• The format of entries need not be uniform.
• The following information about identifiers are stored in symbol table.
• The name.
• The data type.
• The block level.
• Its scope (local, global).
• Pointer / address
• Its offset from base pointer
• Function name, parameter and variable.

Storage Allocation Information


• Information about the storage locations that will be bound to names at run time is kept in the
symbol table.
• Static and dynamic allocation can be done.
• Storage is allocated for code, data, stack, and heap.
• COMMON blocks in Fortran are loaded separately.

The List Data Structure for Symbol Tables


• The compiler plans out the activation record for each procedure.
• The simplest and easiest to implement data structure for a symbol table is a linear list of
records.
• We use a single array, or equivalently several arrays, to store names and
their associated information.
• If the symbol table contains n names, then to find the data about a name we search n/2 names on
average, so the cost of an inquiry is proportional to n.
List of Data Structure

Dynamic Storage Allocation


• The techniques needed to implement dynamic storage allocation depend mainly on how
storage is deallocated. If deallocation is implicit, then the run-time support package is
responsible for determining when a storage block is no longer needed. There is less for
the compiler to do if deallocation is done explicitly by the programmer.

Explicit Allocation of Fixed-Sized Blocks


• The simplest form of dynamic allocation involves blocks of a fixed size.
• Allocation and deallocation can be done quickly with little or no storage overhead.
• Suppose that blocks are to be drawn from a contiguous area of storage. Initialization of the
area is done by using a portion of each block for a link to the next block.
• A pointer available points to the first block. Allocation consists of taking a block off the list and
deallocation consists of putting the block back on the list.
A deallocated block is added to the list of available blocks (a small C sketch of this free-list scheme follows)
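The free-list scheme described above can be sketched in a few lines of C; the pool size, block size, and the
use of the first bytes of each free block to hold the link are choices made only for this illustration:

#include <stddef.h>

#define BLOCK_SIZE 32
#define NUM_BLOCKS 128

static char  pool[NUM_BLOCKS][BLOCK_SIZE];   /* contiguous area of storage */
static void *available;                      /* head of the free list      */

/* Initialization: use a portion of each block to link it to the next. */
void init_pool(void) {
    for (int i = 0; i < NUM_BLOCKS - 1; i++)
        *(void **)pool[i] = pool[i + 1];
    *(void **)pool[NUM_BLOCKS - 1] = NULL;
    available = pool[0];
}

/* Allocation: take a block off the front of the list. */
void *alloc_block(void) {
    void *b = available;
    if (b) available = *(void **)b;
    return b;
}

/* Deallocation: put the block back on the list. */
void free_block(void *b) {
    *(void **)b = available;
    available = b;
}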

Explicit Allocation of Variable-Sized Blocks


• When blocks are allocated and deallocated, storage can become fragmented; that is, the
heap may consist of alternating blocks that are in use and blocks that are free.
• This situation can occur if a program allocates five blocks and then deallocates the second
and fourth.
• Fragmentation is of no consequence if blocks are of fixed size, but it matters if they are of variable size,
because we might be unable to allocate a block larger than any one of the free blocks, even though
the total free space is sufficient.
• First fit, worst fit and best fit are some methods for allocating variable-sized blocks.

Variable Sized Block

Implicit Deallocation
• Implicit deallocation requires cooperation between the user program and the run-
time package, because the latter needs to know when a storage block is no longer in use.
• This cooperation is implemented by fixing the format of storage blocks.
Implicit Deallocation

• Reference counts:

o We keep track of the number of blocks that point directly to the present block. If this

count ever drops to 0, then the block can be deallocated because it cannot be referred

to.

o In other words, the block has become garbage that can be collected. Maintaining

reference counts can be costly in time.

• Marking techniques:

o An alternative approach is to suspend temporarily execution of the user program and

use the frozen pointers to determine which blocks are in use.

moDule- iV
Intermediate code generation
Intermediate Code Representation Techniques
In the analysis-synthesis model of a compiler, the front end of a compiler translates a source program into an
independent intermediate code, then the back end of the compiler uses this intermediate code to generate
the target code (which can be understood by the machine). The benefits of using machine-independent
intermediate code are:

• Because the intermediate code is machine independent, portability is enhanced. For example, if a
compiler translated the source language directly to its target machine language without the option of
generating intermediate code, then a full native compiler would be required for each new machine,
because the compiler itself would have to be modified according to the machine specifications.
• Retargeting is facilitated.
• It is easier to apply source code modification to improve the performance of source code by optimizing
the intermediate code.

Instead of translating the source code directly into object code for the target machine, a compiler can produce a
middle-level language code, which is referred to as intermediate code or intermediate text. There are three
common intermediate code representations, as follows −
Postfix Notation:
Also known as reverse Polish notation or suffix notation. The ordinary (infix) way of writing the sum of a
and b is with an operator in the middle: a + b The postfix notation for the same expression places the
operator at the right end as ab +. In general, if e1 and e2 are any postfix expressions, and + is any binary
operator, the result of applying + to the values denoted by e1 and e2 is postfix notation by e1e2 +. No
parentheses are needed in postfix notation because the position and arity (number of arguments) of the
operators permit only one way to decode a postfix expression. In postfix notation, the operator follows the
operand.

Example 1: The postfix representation of the expression (a + b) * c is : ab + c *

Example 2: The postfix representation of the expression (a – b) * (c + d) + (a – b) is : ab – cd + *ab -+

Syntax Tree:
A syntax tree is nothing more than a condensed form of a parse tree. The operator and keyword nodes of the parse
tree are moved to their parents, and a chain of single productions is replaced by a single link. In the syntax tree, the
internal nodes are operators and the leaf nodes are operands. To form a syntax tree, put parentheses in the expression;
this way it is easy to recognize which operand should be evaluated first.

Three-Address Code
Three-address code is a sequence of statements of the form A = B op C, where A, B, and C are either
programmer-defined names, constants, or compiler-generated temporary names, and op stands for an
operator such as a fixed- or floating-point arithmetic operator or a logical operator on Boolean-valued data.
The reason for the name "three-address code" is that each statement generally contains three addresses,
two for the operands and one for the result.
There are three ways of representing three-address statements, as follows −
Quadruples representation − Three-address statements can be described by records with fields for the operator
and the operands. A record structure with four fields is used: the first holds the operator 'op', the next two hold
operands 1 and 2 respectively, and the last one holds the result. This representation of three-address statements
is called a quadruple representation.
Triples representation − In quadruples, the contents of the operand 1, operand 2, and result fields are generally
pointers to the symbol-table records for the names these fields denote, so temporary names must be entered into
the symbol table as they are generated.
This can be avoided by letting the position of a statement itself stand for its temporary value. If this is done,
a record structure with three fields is enough to describe a three-address statement: the first holds the
operator and the next two hold the values of operand 1 and operand 2 respectively. Such a representation is
known as a triple representation.
Indirect triples representation − The indirect triple representation uses an extra array to list pointers to
the triples in the desired execution order. This is known as indirect triple representation.
The indirect triple representation for the statement x := (a + b) * -c / d is as follows −

Statement    Location    Operator    Operand 1    Operand 2

(0)          (1)         +           a            b

(1)          (2)         uminus      c

(2)          (3)         *           (0)          (1)

(3)          (4)         /           (2)          d

(4)          (5)         :=          x            (3)
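A quadruple is commonly held as a record with four fields. The sketch below shows one possible C layout,
together with the quadruples that might be emitted for x := (a + b) * -c; the field names and the temporary
names t1, t2, t3 are assumptions made only for this example:

#include <stdio.h>

/* One three-address statement in quadruple form:
   an operator, two operand fields, and a result field. */
struct Quad {
    const char *op, *arg1, *arg2, *result;
};

int main(void) {
    /* x := (a + b) * -c */
    struct Quad code[] = {
        { "+",      "a",  "b",  "t1" },
        { "uminus", "c",  "",   "t2" },
        { "*",      "t1", "t2", "t3" },
        { ":=",     "t3", "",   "x"  },
    };
    for (int i = 0; i < 4; i++)
        printf("(%d) %-7s %-3s %-3s %s\n", i,
               code[i].op, code[i].arg1, code[i].arg2, code[i].result);
    return 0;
}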

Intermediate Code Generation for


Control Flow
This section considers the translation of boolean expressions into three-address code in the context of statements
such as those generated by the following grammar:

In these productions, nonterminal B represents a boolean expression and non-terminal S represents a statement.

This grammar generalizes the running example of while expressions that we introduced in Example 5.19. As in that
example, both B and S have a synthesized attribute code, which gives the translation into three-address instructions.
For simplicity, we build up the translations B.code and S.code as strings, using syntax-directed definitions. The
semantic rules defining the code attributes could be implemented instead by building up syntax trees and then emitting
code during a tree traversal.

The translation of if (B) S1 consists of B.code followed by S1.code, as illustrated in Fig. 6.35(a). Within B.code are
jumps based on the value of B. If B is true, control flows to the first instruction of S1.code, and if B is false, control
flows to the instruction immediately following S1.code.
The labels for the jumps in B.code and S.code are managed using inherited attributes. With a boolean expression B, we
associate two labels: B.true, the label to which control flows if B is true, and B.false, the label to which control flows
if B is false. With a statement S, we associate an inherited attribute S.next denoting a label for the instruction
immediately after the code for S. In some cases, the instruction immediately following S.code is a jump to some
label L. A jump to a jump to L from within S.code is avoided using S.next.

The syntax-directed definition in Fig. 6.36-6.37 produces three-address code for boolean expressions in the context of
if-, if-else-, and while-statements.
Boolean expressions
Boolean expressions are composed of the boolean operators (which we denote &&, ||, and !, using the C convention
for the operators AND, OR, and NOT, respectively) applied to elements that are boolean variables or relational
expressions. Relational expressions are of the form E1 rel E2, where E1 and E2 are arithmetic expressions. In this section,

we consider boolean expressions generated by the following grammar:

We use the attribute rel.op to indicate which of the six comparison operators <, <=, =, !=, >, or >= is represented by
rel. As is customary, we assume that || and && are left-associative, and that || has lowest precedence, then &&, then !.

Given the expression B1 || B2, if we determine that B1 is true, then we can conclude that the entire expression is true
without having to evaluate B2.
Similarly, given B1 && B2, if B1 is false, then the entire expression is false. The semantic definition of the programming
language determines whether all parts of a boolean expression must be evaluated. If the language definition permits (or
requires) portions of a boolean expression to go unevaluated, then the compiler can optimize the evaluation of boolean
expressions by computing only enough of an expression to determine its value. Thus, in an expression such as B1 ||
B2, neither B1 nor B2 is necessarily evaluated fully. If either B1 or B2 is an expression with side effects (e.g., it contains
a function that changes a global variable), then an unexpected answer may be obtained.
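For example, applying this style of short-circuit translation to the statement if (x < 100 || x > 200 && x != y) x = 0;
gives three-address code along the following lines (one possible translation, before redundant gotos are removed;
here L2 plays the role of B.true and L1 the role of S.next):

        if x < 100 goto L2
        goto L3
L3:     if x > 200 goto L4
        goto L1
L4:     if x != y goto L2
        goto L1
L2:     x = 0
L1:     ...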

Procedure Calls
A procedure is similar to a function; technically, it is an important and frequently used programming
construct. This section considers how a compiler generates good code for procedure calls and returns. For
simplicity, we assume that parameters are passed by value.
A procedure call is a simple statement that includes the procedure name, parentheses with actual parameter
names or values, and a semicolon at the end.
The types of the actual parameters must match the types of the formal parameters created when the
procedure was first declared (if any). The compiler will refuse to compile the compilation unit in which the call
is made if this is not done.
General Form Procedure_Name(actual_parameter_list);
-- where commas separate the parameters

Calling Sequence
A call's translation is a list of actions performed at the beginning and at the end of each procedure call. In a calling
sequence, the following actions occur:
• Space is made available for the activation record when a procedure is called.

• Evaluate the called procedure's argument.

• To allow the called method to access data in enclosing blocks, set the environment
pointers.

• The caller procedure's state is saved so that it can resume execution following the
call.

• Save the return address as well. It is the location to which the called procedure
must transfer after it has been completed.

• Finally, for the called procedure, generate a jump to the beginning of the code.

Intermediate Code for Procedures


Let there be a function f(a1, a2, a3, a4), a function f with four parameters a1,a2,a3,a4.
Three address code for the above procedure call(f(a1, a2, a3, a4)).
param a1
param a2
param a3
param a4
call f, n

'call' is the call instruction with operands f and n, where f represents the name of the procedure and n represents
the number of parameters.
Now, let’s first take an example of a program to understand function definition and a function call.
main()
{
swap(x,y); //calling function
}

void swap(int a, int b) // called function


{
// set of statements
}

In the above program we have the main function and inside the main function a function call swap(x,y), where
x and y are actual arguments. We also have a function definition for swap, swap(int a, int b) where parameters
a and b are formal parameters.
In the three-address code, a function call is unraveled into the evaluation of parameters in preparation for a
call, followed by the call itself. Let’s understand this with another example:
n= f(a[i])
Here, f is a function containing an array of integers a[i]. This function will return some value and the value is
stored in ‘n’ which is also an integer variable.
a → array of integers
f → function from integer to an integer.
Three address codes for the above function can be written as:
t1= i*4
t2=a[t1]
param t2
t3= call f,1
n=t3

t1= i*4
In this instruction, we are calculating the value of i which can be passed as index value for array a.
t2=a[t1]
In this instruction, we are getting value at a particular index in array a. Since t1 contains an index, here t2 will
contain a value. The above two expressions are used to compute the value of the expression(a[i]) and then
store it in t2.
param t2
The value t2 is passed as a parameter of function f(a[i])
t3= call f,1
This instruction is a function call, where position 1 represents the number of parameters in the function call.
It can vary for different function calls but here it is 1. The calling function will return some value and the value
is stored in t3.
n=t3
The returned value will be assigned to variable n.
Let's see the production with function definition and function call. Several nonterminals like D, F, S, E, A are
used to represent intermediate code.
D → define T id ( F ) { S }
F → 𝜖 | T id, F
S → return E ;
E → id ( A )
A→𝜖|E,A

In D→ define T id (F) {S}, the nonterminal D is for declaration, and T is for type. In the function declaration,
we are going to define the type of the function(T), function name(id), parameters and code to be executed.
(F) represent parameters/arguments of the function and {S} is code to be executed.
Now let’s see what can be a formal parameter.
F → 𝜖 | T id, F
Here the parameter can be empty(𝜖) or of some type(T) followed by the name(id). F at the end represents
the sequence of formal parameters. For example, add(int x, int y, int w,.........).
S → return E;
Here S is code(set of statements to be executed) which will return a value of an expression(E).
E → id ( A )
Expression has some function call with the actual parameters. id represents the name of function and (A)
represents actual parameters. Actual parameters can be generated by the nonterminal A.
A → 𝜖 | E, A.
An actual parameter can be generated from an expression E, and A can also generate a sequence of actual parameters.
For example, the add function may take several parameters, as in add(x, y, z, ...).

Code optimization
Source of Optimizations
Optimization is a program transformation technique, which tries to improve the code by making it consume
less resources (i.e. CPU, Memory) and deliver high speed.

In optimization, high-level general programming constructs are replaced by very efficient low-level
programming codes. A code optimizing process must follow the three rules given below:

• The output code must not, in any way, change the meaning of the program.
• Optimization should increase the speed of the program and if possible, the program should
demand less number of resources.
• Optimization should itself be fast and should not delay the overall compiling process.
Efforts for an optimized code can be made at various levels of compiling the process.

• At the beginning, users can change/rearrange the code or use better algorithms to write the
code.
• After generating intermediate code, the compiler can modify the intermediate code by
address calculations and improving loops.
• While producing the target machine code, the compiler can make use of memory hierarchy
and CPU registers.
Optimization can be categorized broadly into two types : machine independent and machine dependent.

Machine-independent Optimization
In this optimization, the compiler takes in the intermediate code and transforms a part of the code that does
not involve any CPU registers and/or absolute memory locations. For example:

do
{
item = 10;
value = value + item;
} while(value<100);

This code involves a repeated assignment to the identifier item; if we rewrite it this way:

item = 10;
do
{
value = value + item;
} while(value<100);

it not only saves CPU cycles but can also be used on any processor.
Machine-dependent Optimization
Machine-dependent optimization is done after the target code has been generated and when the code is
transformed according to the target machine architecture. It involves CPU registers and may have absolute
memory references rather than relative references. Machine-dependent optimizers put efforts to take
maximum advantage of memory hierarchy.

Optimization of Basic Blocks


Source codes generally have a number of instructions, which are always executed in sequence and are
considered as the basic blocks of the code. These basic blocks do not have any jump statements among
them, i.e., when the first instruction is executed, all the instructions in the same basic block will be executed
in their sequence of appearance without losing the flow control of the program.
A program can have various constructs as basic blocks, like IF-THEN-ELSE, SWITCH-CASE conditional
statements and loops such as DO-WHILE, FOR, and REPEAT-UNTIL, etc.

Basic block identification


We may use the following algorithm to find the basic blocks in a program:
• Search header statements of all the basic blocks from where a basic block starts:
o First statement of a program.
o Statements that are target of any branch (conditional/unconditional).
o Statements that follow any branch statement.
• Header statements and the statements following them form a basic block.
• A basic block does not include any header statement of any other basic block.
Basic blocks are important concepts from both code generation and optimization point of view.

Basic blocks play an important role in identifying variables that are used more than once in a single
basic block. If a variable is used more than once, the register allocated to that variable need not be
freed until the block finishes execution.

Control Flow Graph


Basic blocks in a program can be represented by means of control flow graphs. A control flow graph depicts
how program control is passed among the blocks. It is a useful tool that helps in optimization by helping to
locate any unwanted loops in the program.

Optimization of Basic Blocks:

Optimization process can be applied on a basic block. While optimization, we don't need to change
the set of expressions computed by the block.

There are two type of basic block optimization. These are as follows:

1. Structure-Preserving Transformations
2. Algebraic Transformations

1. Structure preserving transformations:

The primary Structure-Preserving Transformation on basic blocks is as follows:

o Common sub-expression elimination


o Dead code elimination
o Renaming of temporary variables
o Interchange of two independent adjacent statements

(a) Common sub-expression elimination:

A common sub-expression need not be computed over and over again. Instead, it can be computed once
and kept in store, from where it is referenced when it is encountered again.

1. a : = b + c
2. b : = a - d
3. c : = b + c
4. d : = a - d

In the above block, the second and fourth statements compute the same expression, a - d. So the
block can be transformed as follows:

1. a : = b + c
2. b : = a - d
3. c : = b + c
4. d : = b
(b) Dead-code elimination
o It is possible that a program contains a large amount of dead code.
o This can happen when variables are declared and defined but are never used afterwards; in that case they
serve no purpose.
o Suppose the statement x := y + z appears in a block and x is a dead symbol, that is, it is never
subsequently used. Then this statement can safely be removed without changing the value of the basic
block.

(c) Renaming temporary variables

A statement t := b + c can be changed to u := b + c, where t is a temporary variable and u is a new
temporary variable. All the instances of t can then be replaced with u without changing the value of the
basic block.

(d) Interchange of statement

Suppose a block has the following two adjacent statements:

1. t1 : = b + c
2. t2 : = x + y

These two statements can be interchanged without affecting the value of block when value of t1
does not affect the value of t2.

2. Algebraic transformations:

o In an algebraic transformation, we change the set of expressions into an algebraically equivalent set.
Thus statements such as x := x + 0 or x := x * 1 can be eliminated from a basic block without changing
the set of expressions it computes.
o Constant folding is a related class of optimizations: at compile time we evaluate constant expressions
and replace them by their values. Thus the expression 5 * 2.7 would be replaced by 13.5.
o Sometimes unexpected common sub-expressions are generated by relational operators such as <=,
>=, <, > and =.
o Sometimes the associative law is applied to expose common sub-expressions without changing the
value of the basic block. For example, if the source code has the assignments

1. a:= b + c
2. e:= c +d +b

The following intermediate code may be generated:

1. a:= b + c
2. t:= c +d
3. e:= t + b
Loop Optimization
Loop optimization is the most valuable machine-independent optimization, because a program's inner loops
account for the bulk of its running time.

If we decrease the number of instructions in an inner loop, then the running time of the program may be
improved even if we increase the amount of code outside that loop.

For loop optimization the following three techniques are important:

1. Code motion
2. Induction-variable elimination
3. Strength reduction

1.Code Motion:
Code motion is used to decrease the amount of code in loop. This transformation takes a statement or
expression which can be moved outside the loop body without affecting the semantics of the program.

For example, in the while statement below, the expression limit-2 is loop invariant:

while (i<=limit-2) /*statement does not change limit*/

After code motion the result is as follows:

a = limit-2;
while (i<=a) /*statement does not change limit or a*/

2.Induction-Variable Elimination
Induction-variable elimination is used to remove or replace induction variables in an inner loop.
It can reduce the number of additions in a loop and improves both code space and run-time performance.

In the corresponding flow graph, we can replace the assignment t4 := 4*j by t4 := t4 - 4. The only problem that arises is that t4 does not have a value when we enter block B2 for the first time, so we place the assignment t4 := 4*j on the entry to block B2.

3.Reduction in Strength
o Strength reduction is used to replace expensive operations by cheaper ones on the target machine.
o Addition of a constant is cheaper than a multiplication. So we can replace multiplication with an
addition within the loop.
o Multiplication is cheaper than exponentiation. So we can replace exponentiation with multiplication
within the loop.

Example:

1. while (i<10)
2. {
3. j= 3 * i+1;
4. a[j]=a[j]-2;
5. i=i+2;
6. }

After strength reduction the code will be:

1. s= 3*i+1;
2. while (i<10)
3. {
4. j=s;
5. a[j]= a[j]-2;
6. i=i+2;
7. s=s+6;
8. }

In the above code, it is cheaper to compute s = s + 6 on each iteration than to compute j = 3*i + 1.

Global data flow analysis


o To optimize the code efficiently, the compiler collects information about the whole program and distributes this information to each block of the flow graph. This process is known as data-flow analysis.
o Certain optimizations can only be achieved by examining the entire program; they cannot be achieved by examining just a portion of it.
o For this kind of optimization, use-definition chaining is one particular problem.
o Here, for each use of a variable, we try to find out which definitions of that variable may apply at that statement.

Based on the local information a compiler can perform some optimizations. For example, consider the
following code:

1. x = a + b;
2. x=6*3

o In this code, the first assignment to x is useless: the value computed for x is never used in the program.
o At compile time the expression 6*3 will be evaluated, simplifying the second assignment statement to x = 18;

Some optimization needs more global information. For example, consider the following code:

1. a = 1;
2. b = 2;
3. c = 3;
4. if (....) x = a + 5;
5. else x = b + 4;
6. c = x + 1;

In this code, the assignment at line 3 is useless, and the expression x + 1 can be simplified to 7.
But it is far less obvious how a compiler can discover these facts by looking only at one or two consecutive statements. A more global analysis is required, so that the compiler knows the following things at each point in the program:

o Which variables are guaranteed to have constant values


o Which variables will be used before being redefined

Data flow analysis is used to discover this kind of property. The data flow analysis can be performed on the
program's control flow graph (CFG).

The control flow graph of a program is used to determine those parts of a program to which a particular value
assigned to a variable might propagate.

Solution to Iterative Dataflow Equations

Terminologies for Iterative Algorithm :


1. Data flow analysis –
It is a technique for gathering information about the set of values that may be calculated at various points in a computer program.

2. Control Flow Graph (CFG) –


It is used to determine those parts of a program to which a particular value assigned to a variable might
propagate.

3. Naive approach (Kildall’s method) –


The easiest way to perform data-flow analysis of a program is to set up data-flow equations for each node of the control-flow graph and solve them by repeatedly calculating the output from the input locally at each node, until the whole system stabilizes, i.e. reaches a fixed point.

4. The efficiency of the above algorithm –


The efficiency of this algorithm for solving the data-flow equations is influenced by the order in which the nodes are visited, and it also depends on whether the data-flow equations are used for forward or backward data-flow analysis over the Control Flow Graph.

5. An iterative algorithm –
An iterative algorithm is the most common way to solve the data-flow equations. In this algorithm we have two sets per block: an in-state and an out-state. The algorithm starts with an approximation of the in-state of each block; the out-states are then computed by applying the transfer functions to the in-states, and the in-states are updated by applying the join operation to the out-states of the predecessors. The latter two steps are repeated until we reach the fixed point: the situation in which the in-states (and hence the out-states) no longer change.

Iteration orders for solving data flow equations :


A few iteration orders for solving data-flow equations are discussed below as follows.

Random order –
This iteration order is not aware of whether the data-flow equations solve a forward or a backward data-flow problem. Hence, its performance is relatively poor compared to the specialized iteration orders.

Post order –
This iteration order is typically used for backward data-flow problems. A node is visited after all its successor nodes have been visited; it is usually implemented with a depth-first strategy.

Reverse post order –
This iteration order is typically used for forward data-flow problems. A node is visited before any of its successor nodes is visited, except when the successor is reached by a back edge.

Forward and backward data-flow analysis –

Consider an arbitrary point 'p'. In a forward analysis, we are reasoning about facts up to 'p', considering only the predecessors of the node at 'p'. In a backward analysis, we are reasoning about facts from 'p' onward, considering only the successors.

Example –

line 1: if b==4 then

line 2: a = 5;

line 3: else

line 4: a = 3;

line 5: endif

line 6:

line 7: if a < 4 then

...// rest of the code

Example description:
From the above example, we can observe that the reaching definitions of the variable a at line 7 are the assignments a = 5 at line 2 and a = 3 at line 4.
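
A minimal sketch in C of the iterative (Kildall-style) algorithm for reaching definitions follows; the block layout, the bitset representation of definitions, and the fixed limits are assumptions made for illustration only.

#include <stdbool.h>

#define MAX_BLOCKS 32
#define MAX_DEFS   64                    /* one bit per definition; assumes at most 64 */

typedef unsigned long long bitset;

struct block {
    bitset gen, kill;                    /* local data-flow facts         */
    bitset in, out;                      /* sets being computed           */
    int    preds[MAX_BLOCKS], npreds;    /* indices of predecessor blocks */
};

/* Iterate OUT[B] = gen[B] U (IN[B] - kill[B]) until nothing changes (the fixed point). */
void reaching_definitions(struct block *b, int nblocks)
{
    for (int i = 0; i < nblocks; i++) { b[i].in = 0; b[i].out = b[i].gen; }

    bool changed = true;
    while (changed) {
        changed = false;
        for (int i = 0; i < nblocks; i++) {
            bitset in = 0;
            for (int p = 0; p < b[i].npreds; p++)        /* join: union over predecessors */
                in |= b[b[i].preds[p]].out;
            bitset out = b[i].gen | (in & ~b[i].kill);   /* transfer function             */
            b[i].in = in;
            if (out != b[i].out) { b[i].out = out; changed = true; }
        }
    }
}

Visiting the blocks in reverse post order, as described above, usually makes this loop converge in fewer passes for a forward problem such as reaching definitions.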

Dealing with Aliases


In general, alias analysis determines whether or not separate memory references point to the same area of
memory. This allows the compiler to determine what variables in the program will be affected by a statement.
For example, consider the following section of code that accesses members of structures:
p.foo = 1;

q.foo = 2;

i = p.foo + 3;

There are three possible alias cases here:


1. The variables p and q cannot alias (i.e., they never point to the same memory location).
2. The variables p and q must alias (i.e., they always point to the same memory location).
3. It cannot be conclusively determined at compile time if p and q alias or not.
If p and q cannot alias, then i = p.foo + 3; can be changed to i = 4. If p and q must alias, then i = p.foo + 3; can
be changed to i = 5 because p.foo + 3 = q.foo + 3. In both cases, we are able to perform optimizations from
the alias knowledge (assuming that no other thread updating the same locations can interleave with the
current thread, or that the language memory model permits those updates to be not immediately visible to
the current thread in absence of explicit synchronization constructs). On the other hand, if it is not known if p
and q alias or not, then no optimizations can be performed and the whole of the code must be executed to
get the result. Two memory references are said to have a may-alias relation if their aliasing is unknown.
Performing alias analysis
In alias analysis, we divide the program's memory into alias classes. Alias classes are disjoint sets of
locations that cannot alias to one another. For the discussion here, it is assumed that the optimizations done
here occur on a low-level intermediate representation of the program. This is to say that the program has
been compiled into binary operations, jumps, moves between registers, moves from registers to memory,
moves from memory to registers, branches, and function calls/returns.
Type-based alias analysis
If the language being compiled is type safe, the compiler's type checker is correct, and the language lacks
the ability to create pointers referencing local variables, (such as ML, Haskell, or Java) then some useful
optimizations can be made.[1] There are many cases where we know that two memory locations must be in
different alias classes:
1. Two variables of different types cannot be in the same alias class since it is a property of strongly
typed, memory reference-free (i.e., references to memory locations cannot be changed directly)
languages that two variables of different types cannot share the same memory location
simultaneously.
2. Allocations local to the current stack frame cannot be in the same alias class as any previous
allocation from another stack frame. This is the case because new memory allocations must be
disjoint from all other memory allocations.
3. Each record field of each record type has its own alias class, in general, because the typing discipline
usually only allows for records of the same type to alias. Since all records of a type will be stored in
an identical format in memory, a field can only alias to itself.
4. Similarly, each array of a given type has its own alias class.

Flow-based alias analysis


Analysis based on flow can be applied to programs in a language with references or type-casting. Flow-based analysis can be used in lieu of, or to supplement, type-based analysis. In flow-based analysis, new alias
classes are created for each memory allocation, and for every global and local variable whose address has
been used. References may point to more than one value over time and thus may be in more than one alias
class. This means that each memory location has a set of alias classes instead of a single alias class.

Data Flow Analysis of Structured Flow Graphs


A compiler first converts the source code of a program into an intermediate code, which is then divided into basic blocks. After dividing the intermediate code into basic blocks, the flow of control among the basic blocks is represented by a flow graph.

A flow graph is a directed graph. It contains the flow-of-control information for the set of basic blocks.

A control flow graph is used to depict how program control is passed among the blocks. It is useful in loop optimization.

Properties of Flow Graphs


1. The control flow graph is process-oriented.
2. A control flow graph shows how program control is passed among the blocks.
3. The control flow graph depicts all of the paths that can be traversed during the execution of a
program.
4. It can be used in software optimization to find unwanted loops.
Representation of Flow Graphs
Flow graphs are directed graphs. The nodes/blocks of the control flow graph are the basic blocks of
the program. There are two designated blocks in Control Flow Graph:
1. Entry Block: The entry block allows the control to enter in the control flow graph.
2. Exit Block: Control flow leaves through the exit block.

An edge can flow from one block A to another block B if:

1. there is a conditional or unconditional jump from the end of A to the start of B, or
2. B immediately follows A in the original order of the three-address code and A does not end in an unconditional jump.
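
As a rough sketch of this representation in C, each basic block can carry an index range into the three-address code plus successor and predecessor lists; all struct and field names below are illustrative assumptions.

#define MAX_SUCCS 2          /* a basic block ends in at most a two-way branch */

struct basic_block {
    int first_instr, last_instr;            /* index range into the three-address code */
    struct basic_block *succ[MAX_SUCCS];    /* fall-through and/or jump target          */
    int nsucc;
    struct basic_block **pred;              /* predecessor list                         */
    int npred;
};

struct flow_graph {
    struct basic_block *entry;              /* designated entry block */
    struct basic_block *exit;               /* designated exit block  */
    struct basic_block *blocks;
    int nblocks;
};

/* Add an edge A -> B, e.g. for a jump from the end of A to the start of B. */
void add_edge(struct basic_block *a, struct basic_block *b)
{
    a->succ[a->nsucc++] = b;
    /* updating b's predecessor list is omitted here for brevity */
}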

Flow graph for the vector dot product is given as follows:

o Block B1 is the initial node. Block B2 immediately follows B1 and B1 does not end in a jump, so there is an edge from B1 to B2.
o The target of the jump in the last statement of B2 is the first statement of B2, so there is also an edge from B2 to itself.
o B2 is a successor of B1 and B1 is a predecessor of B2.
moDule- V
Code Generation and Instruction Selection
The final phase in compiler model is the code generator. It takes as input an intermediate
representation of the source program and produces as output an equivalent target program. The code
generation techniques presented below can be used whether or not an optimizing phase occurs before code
generation.

Issues
The following issues arise during the code generation phase:

1. Input to code generator


2. Target program
3. Memory management
4. Instruction selection
5. Register allocation
6. Evaluation order
1. Input to the code generator
o The input to the code generator contains the intermediate representation of the source program and
the information of the symbol table. The source program is produced by the front end.
o Intermediate representation has the several choices:
a) Postfix notation
b) Syntax tree
c) Three address code
o We assume the front end produces a low-level intermediate representation, i.e. the values of names in it can be directly manipulated by machine instructions.
o The code generation phase requires complete, error-free intermediate code as its input.

2. Target program:
The target program is the output of the code generator. The output can be:

a) Assembly language: It allows subprogram to be separately compiled.

b) Relocatable machine language: It makes the process of code generation easier.

c) Absolute machine language: It can be placed in a fixed location in memory and can be executed
immediately.

3. Memory management
o During the code generation process the symbol table entries have to be mapped to actual memory addresses, and labels have to be mapped to instruction addresses.
o Mapping names in the source program to addresses of data is done cooperatively by the front end and the code generator.
o Local variables are allocated on the stack in the activation record, while global variables are placed in a static area.

4. Instruction selection:
o The instruction set of the target machine should be complete and uniform.
o When considering the efficiency of the target machine, instruction speed and machine idioms are important factors.
o The quality of the generated code can be determined by its speed and size.

Example:

The Three address code is:

1. a:= b + c
2. d:= a + e
Inefficient assembly code is:

1. MOV b, R0    ; R0 ← b
2. ADD c, R0    ; R0 ← c + R0
3. MOV R0, a    ; a ← R0
4. MOV a, R0    ; R0 ← a   (redundant: a is already in R0)
5. ADD e, R0    ; R0 ← e + R0
6. MOV R0, d    ; d ← R0

5. Register allocation
Registers can be accessed faster than memory. Instructions involving register operands are shorter and faster than those involving memory operands.

The following sub problems arise when we use registers:

Register allocation: In register allocation, we select the set of variables that will reside in registers at each point in the program.

Register assignment: In register assignment, we pick the specific register in which each variable will reside.

Certain machines require even-odd register pairs for some operands and results.

For example:

Consider the following division instruction of the form:

1. D x, y

Where,

x is the dividend even register in even/odd register pair

y is the divisor

The even register is used to hold the remainder.

The odd register is used to hold the quotient.

6. Evaluation order
The efficiency of the target code can be affected by the order in which computations are performed. Some computation orders need fewer registers to hold intermediate results than others.

Basic Blocks and Flow Graphs

Basic Blocks:
A basic block is a straight-line code sequence that has no branches in except to the entry and no branches out except at the end. A basic block is a set of statements that always execute one after another, in sequence.
The first task is to partition a sequence of three-address code into basic blocks. A new basic block is begun
with the first instruction and instructions are added until a jump or a label is met. In the absence of a jump,
control moves further consecutively from one instruction to another.

Basic block construction:


Algorithm: Partition into basic blocks

Input: It contains the sequence of three address statements

Output: it contains a list of basic blocks with each three address statement in exactly one block

Method: First identify the leader in the code. The rules for finding leaders are as follows:

o The first statement is a leader.


o Statement L is a leader if it is the target of a conditional or unconditional goto statement, e.g. if ... goto L or goto L.
o Statement L is a leader if it immediately follows a conditional or unconditional goto statement.

For each leader, its basic block consists of the leader and all statements up to, but not including, the next leader or the end of the program.
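
A minimal sketch in C of the leader-finding rules above; the instruction representation (an array with explicit jump targets) is an assumption made only for this illustration.

#include <stdbool.h>

struct instr {
    bool is_jump;       /* conditional or unconditional goto            */
    int  target;        /* index of the instruction jumped to, if any   */
};

/* Mark leaders: the first instruction, every jump target, and every
   instruction that immediately follows a jump. Each basic block then
   runs from one leader up to, but not including, the next leader. */
void find_leaders(const struct instr *code, int n, bool *leader)
{
    for (int i = 0; i < n; i++) leader[i] = false;
    if (n > 0) leader[0] = true;                    /* rule 1 */
    for (int i = 0; i < n; i++) {
        if (code[i].is_jump) {
            leader[code[i].target] = true;          /* rule 2: target of a goto       */
            if (i + 1 < n) leader[i + 1] = true;    /* rule 3: statement after a goto */
        }
    }
}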

Consider the following source code for dot product of two vectors a and b of length 10:

1. begin
2. prod :=0;
3. i:=1;
4. do begin
5. prod :=prod+ a[i] * b[i];
6. i :=i+1;
7. end
8. while i <= 10
9. end

The three address code for the above source program is given below:

B1

1. (1) prod := 0
2. (2) i := 1

B2

1. (3) t1 := 4* i
2. (4) t2 := a[t1]
3. (5) t3 := 4* i
4. (6) t4 := b[t3]
5. (7) t5 := t2*t4
6. (8) t6 := prod+t5
7. (9) prod := t6
8. (10) t7 := i+1
9. (11) i := t7
10. (12) if i<=10 goto (3)

Basic block B1 contains the statement (1) to (2)

Basic block B2 contains the statement (3) to (12)

Transformations on Basic blocks:


A number of transformations can be applied to a basic block without changing the set of expressions computed by the block.

There are two types of basic block transformations. These are as follows:

1. Structure-Preserving Transformations

Structure preserving transformations can be achieved by the following methods:

1. Common sub-expression elimination


2. Dead code elimination
3. Renaming of temporary variables
4. Interchange of two independent adjacent statements

2. Algebraic Transformations

In the case of algebraic transformation, we basically change the set of expressions into an
algebraically equivalent set.

For example, an expression such as

x:= x + 0

or x:= x *1

This can be eliminated from a basic block without changing the set of expressions.

Flow Graphs:
• Flow graph is a directed graph containing the flow-of-control information for the set of basic blocks making
up a program.
The nodes of the flow graph are basic blocks. It has a distinguished initial node.
• E.g.: Flow graph for the vector dot product is given as follows:

Fig. 4.2 Flow graph for program

• B1 is the initial node. B2 immediately follows B1 and B1 does not end in a jump, so there is an edge from B1 to B2. The target of the jump in the last statement of B2 is the first statement of B2, so there is also an edge from B2 (last statement) to B2 (first statement).

• B1 is the predecessor of B2, and B2 is a successor of B1.

Loops
• A loop is a collection of nodes in a flow graph such that
1. All nodes in the collection are strongly connected.
2. The collection of nodes has a unique entry.

• A loop that contains no other loops is called an inner loop.

Register Allocation
Registers are the fastest locations in the memory hierarchy, but unfortunately this resource is limited: registers are among the most constrained resources of the target processor. Register allocation is an NP-complete problem. However, the problem can be reduced to graph coloring to achieve allocation and assignment.
Therefore a good register allocator computes an effective approximate solution to a hard problem.

Figure – Input-Output
The register allocator determines which values will reside in the register and which register will hold each of
those values. It takes as its input a program with an arbitrary number of registers and produces a program
with a finite register set that can fit into the target machine.

Allocation vs Assignment:
Allocation: –
Maps an unlimited name space onto the register set of the target machine.
• Reg. to Reg. Model: Maps virtual registers to physical registers but spills the excess to memory.
• Mem. to Mem. Model: Maps some subset of the memory locations to a set of names that models the physical register set.
Allocation ensures that code will fit the target machine’s register set at each instruction.
Assignment: –
Maps an allocated name set to the physical register set of the target machine.
• Assumes allocation has been done so that code will fit into the set of physical registers.
• No more than ‘k’ values are kept in registers at any point, where ‘k’ is the no. of physical registers.

General register allocation is an NP-complete problem:


It can be solved in polynomial time when (no. of required registers) <= (no. of available physical registers).
An assignment can then be produced in linear time using interval-graph coloring.

Local Register Allocation And Assignment:


Allocation just inside a basic block is called Local Reg. Allocation. Two approaches for local reg. allocation:
Top-down approach and bottom-up approach.
Top-Down Approach is a simple approach based on ‘Frequency Count’. Identify the values which should be
kept in registers and which should be kept in memory.
Algorithm:
1. Compute a priority for each virtual register.
2. Sort the registers into priority order.
3. Assign registers in priority order.
4. Rewrite the code.
Moving beyond single Blocks:
• More complicated because the control flow enters the picture.
• Liveness and Live Ranges: A live range consists of a set of definitions and uses of a value that are related to each other; two values whose live ranges overlap cannot share a register.
The following is a way to find live ranges in a block: a live range is represented as an interval [i, j], where i is the point of definition and j is the last use.
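
As a rough illustration, live ranges of this interval form can be computed for a single block with one forward pass that records the first definition and last use of each virtual register; the instruction format below is an assumption made for the sketch.

struct tac {              /* three-address instruction: def := use1 op use2 */
    int def, use1, use2;  /* virtual register numbers, -1 if absent          */
};

struct interval { int start, end; };   /* [i, j]: point of definition and last use */

void live_intervals(const struct tac *code, int n,
                    struct interval *iv, int nregs)
{
    for (int r = 0; r < nregs; r++) { iv[r].start = -1; iv[r].end = -1; }
    for (int i = 0; i < n; i++) {
        if (code[i].def >= 0 && iv[code[i].def].start < 0)
            iv[code[i].def].start = i;                     /* first definition   */
        if (code[i].use1 >= 0) iv[code[i].use1].end = i;   /* latest use so far  */
        if (code[i].use2 >= 0) iv[code[i].use2].end = i;
    }
}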

Global Register Allocation and Assignment:


1. The main issue of a register allocator is minimizing the impact of spill code;
• Execution time for spill code.
• Code space for spill operation.
• Data space for spilled values.
2. Global allocation can’t guarantee an optimal solution for the execution time of spill code.
3. Prime differences between Local and Global Allocation:
• The structure of a global live range is naturally more complex than the local one.
• Within a global live range, distinct references may execute a different number of times. (When basic
blocks form a loop)
4. To make the decision about allocation and assignments, the global allocator mostly uses graph coloring
by building an interference graph.
5. Register allocator then attempts to construct a k-coloring for that graph where ‘k’ is the no. of physical
registers.
• In case, the compiler can’t directly construct a k-coloring for that graph, it modifies the underlying
code by spilling some values to memory and tries again.
• Spilling actually simplifies that graph which ensures that the algorithm will halt.
6. Global Allocator uses several approaches, however, we’ll see top-down and bottom-up allocations
strategies. Subproblems associated with the above approaches.
• Discovering Global live ranges.
• Estimating Spilling Costs.
• Building an Interference graph.

Discovering Global Live Ranges:


How to discover Live range for a variable?

Figure – Discovering live ranges in a single block


In the diagram, the register Rarp is defined at program point 1 and its last use is at program point 11; therefore the live range of Rarp, LRarp, is [1, 11]. The other live ranges follow similarly.
Figure – Discovering Live Ranges
Estimating Global Spill Cost:
• Essential for taking a spill decision which includes – address computation, memory operation cost,
and estimated execution frequency.
• For performance benefits, spilled values are typically kept in the activation record.
• Some embedded processors offer scratchpad memory to hold such spilled values.
• Negative Spill Cost: A live range consisting of a load immediately followed by a store to the same address can be removed entirely, so spilling it has a negative cost.
• Infinite Spill Cost: A live range should have infinite spill cost if no other live range ends between its
definition and its use.

Interference and Interference Graph:

Figure – Building Interference Graph from Live Ranges


From the diagram it can be observed that the live range LRa starts in the first basic block and ends in the last basic block, so it shares an edge with every other live range, i.e. LRb, LRc and LRd. However, LRb, LRc and LRd do not overlap with any live range other than LRa, so they each share an edge only with LRa.
Building an Allocator:
• Note that finding a k-coloring of a graph is an NP-complete problem, so we need an approximation.
• One approach is to split live ranges into non-trivial chunks (typically at the most heavily used points).
Top-Down Colouring:
1. Tries to color live range in an order determined by some ranking functions i.e. priority based.
2. If no color is available for a live range, the allocator invokes either spilling or splitting to handle
uncolored ones.
3. Live ranges having k or more neighbors are called constrained nodes and are difficult to handle.
4. The unconstrained nodes are comparatively easy to handle.
5. Handling Spills: When no color is found for some live ranges, spilling is needed to be done, but this
may not be a final/ultimate solution of course.
6. Live Range Splitting: For uncolored ones, split the live range into sub-ranges, those may have fewer
interferences than the original one so that some of them can be colored at least.
Chaitin’s Idea:
• Choose an arbitrary node of degree < k and push it onto the stack.
• Remove that node and all its edges from the graph. (This may decrease the degrees of some other nodes.) If at some point every remaining node has degree >= k, some node has to be spilled.
• If no vertex needs to be spilled, successively pop vertices off the stack and color each with a color not used by its neighbors (reusing colors as far as possible).
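
A compact sketch in C of Chaitin's simplify-and-select idea over an interference graph stored as an adjacency matrix; spill handling is reduced to a failure return, and all names and limits are illustrative assumptions.

#include <stdbool.h>

#define MAX_N 64

/* adj[i][j] is true when live ranges i and j interfere. Returns false when
   some node would have to be spilled instead of colored. */
bool chaitin_color(bool adj[MAX_N][MAX_N], int n, int k, int color[MAX_N])
{
    int  stack[MAX_N], top = 0;
    bool removed[MAX_N] = { false };

    /* Simplify: repeatedly remove a node whose current degree is < k. */
    for (int pushed = 0; pushed < n; pushed++) {
        int pick = -1;
        for (int i = 0; i < n && pick < 0; i++) {
            if (removed[i]) continue;
            int deg = 0;
            for (int j = 0; j < n; j++)
                if (!removed[j] && adj[i][j]) deg++;
            if (deg < k) pick = i;
        }
        if (pick < 0) return false;   /* every remaining node has degree >= k: spill */
        removed[pick] = true;
        stack[top++] = pick;
    }

    /* Select: pop nodes and give each the lowest color not used by its neighbors. */
    for (int i = 0; i < n; i++) color[i] = -1;
    while (top > 0) {
        int  v = stack[--top];
        bool used[MAX_N] = { false };
        for (int j = 0; j < n; j++)
            if (adj[v][j] && color[j] >= 0) used[color[j]] = true;
        for (int c = 0; c < k; c++)
            if (!used[c]) { color[v] = c; break; }
    }
    return true;
}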
Coalescing copies to reduce degree:
The compiler can use the interference graph to coalesce two live ranges that are connected by a copy and do not otherwise interfere. Coalescing removes the copy instruction and can reduce the degree of the combined node.

Figure – Coalescing Live Ranges


Comparing Top-Down and Bottom-Up allocator:
• Top-down allocator could adopt the ‘spill and iterate’ philosophy used in bottom-up ones.
• ‘Spill and iterate’ trades additional compile time for an allocation that potentially, uses less spill code.
• Top-Down uses priority ranking to order all the constrained nodes. (However, it colors the
unconstrained nodes in arbitrary order)
• Bottom-up constructs an order in which most nodes are colored in a graph where they are
unconstrained.

Code Generation
Code Generator
A code generator is expected to have an understanding of the target machine’s runtime environment and its
instruction set. The code generator should take the following things into consideration to generate the code:
• Target language : The code generator has to be aware of the nature of the target language
for which the code is to be transformed. That language may facilitate some machine-specific
instructions to help the compiler generate the code in a more convenient way. The target
machine can have either CISC or RISC processor architecture.
• IR Type : Intermediate representation has various forms. It can be in Abstract Syntax Tree
(AST) structure, Reverse Polish Notation, or 3-address code.
• Selection of instruction : The code generator takes Intermediate Representation as input
and converts (maps) it into target machine’s instruction set. One representation can have many
ways (instructions) to convert it, so it becomes the responsibility of the code generator to
choose the appropriate instructions wisely.
• Register allocation : A program has a number of values to be maintained during the
execution. The target machine’s architecture may not allow all of the values to be kept in the
CPU memory or registers. Code generator decides what values to keep in the registers. Also,
it decides the registers to be used to keep these values.
• Ordering of instructions : At last, the code generator decides the order in which the
instruction will be executed. It creates schedules for instructions to execute them.
Descriptors
The code generator has to track both the registers (for availability) and addresses (location of values) while
generating the code. For both of them, the following two descriptors are used:
• Register descriptor : Register descriptor is used to inform the code generator about the
availability of registers. Register descriptor keeps track of values stored in each register.
Whenever a new register is required during code generation, this descriptor is consulted for
register availability.
• Address descriptor : Values of the names (identifiers) used in the program might be stored
at different locations while in execution. Address descriptors are used to keep track of memory
locations where the values of identifiers are stored. These locations may include CPU
registers, heaps, stacks, memory or a combination of the mentioned locations.
Code generator keeps both the descriptor updated in real-time. For a load statement, LD R1, x, the code
generator:

• updates the Register Descriptor R1 that has value of x and


• updates the Address Descriptor (x) to show that one instance of x is in R1.

Code Generation
Basic blocks comprise a sequence of three-address instructions, and the code generator takes these sequences of instructions as input.
Note: If the value of a name is available in more than one place (register, cache, or memory), the register’s value will be preferred over the cache and main memory; likewise the cache’s value will be preferred over main memory. Main memory is given the least preference.
getReg : Code generator uses getReg function to determine the status of available registers and the location
of name values. getReg works as follows:
• If variable Y is already in register R, it uses that register.
• Else if some register R is available, it uses that register.
• Else if both the above options are not possible, it chooses a register that requires minimal
number of load and store instructions.
For an instruction x = y OP z, the code generator may perform the following actions. Let us assume that L is
the location (preferably register) where the output of y OP z is to be saved:
• Call function getReg, to decide the location of L.
• Determine the present location (register or memory) of y by consulting the Address Descriptor
of y. If y is not presently in register L, then generate the following instruction to copy the value
of y to L:
MOV y’, L
where y’ represents the copied value of y.
• Determine the present location of z using the same method used in step 2 for y and generate
the following instruction:
OP z’, L
where z’ represents the copied value of z.
• Now L contains the value of y OP z, that is intended to be assigned to x. So, if L is a register,
update its descriptor to indicate that it contains the value of x. Update the descriptor of x to
indicate that it is stored at location L.
• If y and z have no further use, their registers can be given back to the system.
Other code constructs like loops and conditional statements are transformed into assembly language in the usual way.
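
A heavily reduced sketch in C of the action for a single instruction x = y OP z, with the register and address descriptors collapsed into small arrays; getReg here is only a stand-in for the heuristic described above, spilling is omitted, and every name in the sketch is illustrative.

#include <stdio.h>
#include <string.h>

#define NREGS 4

static char reg_holds[NREGS][8];   /* register descriptor: the name each register holds      */
static int  addr_in_reg[128];      /* address descriptor: register holding a variable, or -1 */

void init_descriptors(void)
{
    memset(reg_holds, 0, sizeof reg_holds);
    for (int v = 0; v < 128; v++) addr_in_reg[v] = -1;
}

/* getReg stand-in: reuse y's register if it has one, otherwise any free register. */
static int getReg(char y)
{
    if (addr_in_reg[(int)y] >= 0) return addr_in_reg[(int)y];
    for (int r = 0; r < NREGS; r++)
        if (reg_holds[r][0] == '\0') return r;
    return 0;   /* a real allocator would pick a victim register and spill it */
}

/* Emit code for x = y OP z and update both descriptors. */
void gen_binop(char x, char op, char y, char z)
{
    int L = getReg(y);
    if (addr_in_reg[(int)y] != L)                  /* y not already in L: load it */
        printf("MOV %c, R%d\n", y, L);
    printf("%s %c, R%d\n", op == '+' ? "ADD" : "SUB", z, L);

    /* L now holds x: update the register and address descriptors.
       (A full generator would also invalidate whatever L held before.) */
    snprintf(reg_holds[L], sizeof reg_holds[L], "%c", x);
    addr_in_reg[(int)x] = L;
}

Calling init_descriptors() and then gen_binop('a', '+', 'b', 'c') followed by gen_binop('d', '+', 'a', 'e') would emit MOV b, R0; ADD c, R0; ADD e, R0 without reloading a, because the descriptors record that a is already in R0 (stores back to memory are not modeled in this sketch).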

DAG Representation of Programs


What is DAG Representation?
DAG representation, or Directed Acyclic Graph representation, is used to represent the structure of basic blocks. A basic block is a set of statements that execute one after another in sequence. The DAG shows how values flow through the statements of a basic block and supports basic-block optimization algorithms; in particular, it is an efficient way of identifying common subexpressions.
A DAG is built from the three-address code produced by intermediate code generation in order to apply optimization techniques to a basic block.

DAG in Compiler Design


In the compilation process, high-level code must be transformed into low-level code, and the object code generated must retain the exact meaning of the source code. A DAG is used to depict the structure of a basic block; it helps to see the flow of values within the block and offers some degree of optimization too.
A DAG is used in compiler design to optimize the basic block. It is constructed using Three Address Code.
Then after construction, multiple transformations are applied such as dead code elimination, and common
subexpression elimination. DAG's are useful in compilers because topological ordering can be defined in the
case of DAGs, which is crucial for construction of object level code. Transitive reduction as well as closure
are uniquely defined for DAGs.

Characteristics
The following are some characteristics of DAG.
• DAG is a type of data structure used to represent the structure of basic blocks.
• Its main aim is to perform the transformation on basic blocks.
• The leaf nodes of the directed acyclic graph represent a unique identifier that can be a variable
or a constant.
• The non-leaf nodes represent an operator symbol.
• Moreover, the nodes are also given a string of identifiers to use as labels for the computed
value.
• Transitive closure and transitive reduction are defined differently in DAG.
• DAG has defined topological ordering.

Algorithm for construction of DAG


Input:It contains a basic block

Output: It contains the following information:

o Each node contains a label. For leaves, the label is an identifier.

o Each node contains a list of attached identifiers to hold the computed values.

1. Case (i) x:= y OP z


2. Case (ii) x:= OP y
3. Case (iii) x:= y

Method:

Step 1:

If y operand is undefined then create node(y).

If z operand is undefined then for case(i) create node(z).

Step 2:

For case (i), check whether there is a node(OP) whose left child is node(y) and whose right child is node(z). If not, create such a node. Let n be this node.

For case (ii), check whether there is a node(OP) with a single child node(y). If not, create such a node. Let n be this node.

For case (iii), node n will simply be node(y).

Output:
Delete x from the list of attached identifiers of node(x), if any. Append x to the list of attached identifiers for the node n found in step 2. Finally, set node(x) to n.
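
The heart of the construction is the check-or-create step for nodes: reusing an existing node is exactly how a common sub-expression is detected. The following C sketch illustrates it for case (i); the node table and field names are illustrative assumptions.

#include <string.h>

#define MAX_NODES 256

struct dag_node {
    char op;              /* operator, or 0 for a leaf                     */
    int  left, right;     /* child node indices, -1 for leaves             */
    char label[16];       /* leaf identifier, or attached identifier list  */
};

static struct dag_node nodes[MAX_NODES];
static int nnodes;

/* Leaf for an identifier or constant: reuse it if it already exists. */
int leaf_for(const char *name)
{
    for (int i = 0; i < nnodes; i++)
        if (nodes[i].op == 0 && strcmp(nodes[i].label, name) == 0)
            return i;
    nodes[nnodes].op = 0;
    nodes[nnodes].left = nodes[nnodes].right = -1;
    strncpy(nodes[nnodes].label, name, sizeof nodes[nnodes].label - 1);
    nodes[nnodes].label[sizeof nodes[nnodes].label - 1] = '\0';
    return nnodes++;
}

/* Interior node for case (i), x := y OP z: return an existing node(OP) with
   the same children if there is one, otherwise create it. */
int node_for(char op, int left, int right)
{
    for (int i = 0; i < nnodes; i++)
        if (nodes[i].op == op && nodes[i].left == left && nodes[i].right == right)
            return i;                           /* common sub-expression found */
    nodes[nnodes].op = op;
    nodes[nnodes].left = left;
    nodes[nnodes].right = right;
    nodes[nnodes].label[0] = '\0';
    return nnodes++;
}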

Example:
Consider the following three address statement:

1. S1 := 4 * i
2. S2 := a[S1]
3. S3 := 4 * i
4. S4 := b[S3]
5. S5 := S2 * S4
6. S6 := prod + S5
7. prod := S6
8. S7 := i + 1
9. i := S7
10. if i <= 20 goto (1)

Stages in DAG Construction:


Code Generation From DAGS
The advantage of generating code for a basic block from its DAG representation is that from a DAG we can more easily see how to rearrange the order of the final computation sequence than we can starting from a linear sequence of three-address statements or quadruples.

Rearranging the order

The order in which computations are done can affect the cost of resulting object code. For example,
consider the following basic block:
t1 : = a + b
t2 : = c + d
t3 : = e - t2
t4 : = t1 - t3

Generated code sequence for basic block:


MOV a , R0
ADD b , R0
MOV c , R1
ADD d , R1
MOV R0 , t1
MOV e , R0
SUB R1 , R0
MOV t1 , R1
SUB R0 , R1
MOV R1 , t4

Rearranged basic block:


Now t1 occurs immediately before t4.
t2 : = c + d
t3 : = e - t2
t1 : = a + b
t4 : = t1 - t3
Revised code sequence:
MOV c , R0
ADD d , R0
MOV e , R1
SUB R0 , R1
MOV a , R0
ADD b , R0
SUB R1 , R0
MOV R0 , t4
In this order, two instructions MOV R0 , t1 and MOV t1 , R1 have been saved.

Peephole Optimization
A statement-by-statement code-generations strategy often produces target code that contains redundant
instructions and suboptimal constructs. The quality of such target code can be improved by applying
“optimizing” transformations to the target program.

A simple but effective technique for improving the target code is peephole optimization, a method for trying to improve the performance of the target program by examining a short sequence of target instructions (called the peephole) and replacing these instructions by a shorter or faster sequence, whenever possible.

The peephole is a small, moving window on the target program. The code in the peephole need not
be contiguous, although some implementations do require this. It is characteristic of peephole optimization
that each improvement may spawn opportunities for additional improvements.

Peephole optimization is an optimization technique by which code is optimized to improve the machine's performance. More formally, peephole optimization is an optimization technique performed on a small set of compiler-generated instructions; the small set is known as the peephole or window.

Characteristics of peephole optimizations:


• Redundant-instructions elimination
• Flow-of-control optimizations
• Algebraic simplifications
• Use of machine idioms
• Unreachable code elimination

Peephole Optimization Techniques


There are various peephole optimization techniques.

Redundant Load and Store

In this optimization, the redundant operations are removed. For example, loading and storing
values on registers can be optimized.
For example,
a= b+c
d= a+e
It is implemented on the register(R0) as
MOV b, R0; instruction to copy b to the register
ADD c, R0; instruction to Add c to the register, the register is now b+c
MOV R0, a; instruction to Copy the register(b+c) to a
MOV a, R0; instruction to Copy a to the register
ADD e, R0 ;instruction to Add e to the register, the register is now a(b+c)+e
MOV R0, d; instruction to Copy the register to d
This can be optimized by removing load and store operation, like in third instruction value in
register R0 is copied to a, and it again loaded to R0 in the next step for further operation. The
optimized implementation will be:
MOV b, R0; instruction to Copy b to the register
ADD c, R0; instruction to Add c to the register, which is now b+c (a)
MOV R0, a; instruction to Copy the register to a
ADD e, R0; instruction to Add e to the register, which is now b+c+e [(a)+e]
MOV R0, d; instruction to Copy the register to d
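
The redundant pair MOV R0, a followed by MOV a, R0 shown above can be caught mechanically by sliding a two-instruction window over the code. A minimal sketch in C follows, with a deliberately simplified instruction format; the struct and field names are assumptions.

#include <stdbool.h>
#include <string.h>

struct insn {
    char op[4];             /* "MOV", "ADD", ...  */
    char src[8], dst[8];
    bool deleted;
};

/* Delete the second MOV of a "MOV x, R; MOV R, x" pair: it only reloads or
   restores what the first instruction has just moved. */
void peephole_redundant_moves(struct insn *code, int n)
{
    for (int i = 0; i + 1 < n; i++) {
        struct insn *a = &code[i], *b = &code[i + 1];
        if (a->deleted || b->deleted) continue;
        if (strcmp(a->op, "MOV") == 0 && strcmp(b->op, "MOV") == 0 &&
            strcmp(a->src, b->dst) == 0 && strcmp(a->dst, b->src) == 0)
            b->deleted = true;      /* keep the first move, drop the redundant one */
    }
}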

Strength Reduction

In strength reduction optimization, operators that consume higher execution time are replaced by
the operators consuming less execution time. Like multiplication and division, operators can be
replaced by shift operators.
Initial code:
b = a * 2;
Optimized code:
b = a << 1;
//left shifting the bit
Initial code:
b = a / 2;
Optimized code:
b = a >> 1;
// right shifting the bit by one will give the same result

Simplify Algebraic Expressions

The algebraic expressions that are useless or written inefficiently are transformed.
For example:
a=a+0
a=a*1
a=a/1
a=a-0
// All of the above expressions cause unnecessary calculation overhead.
// They can be removed (or reduced to simple copies) during optimization.
Replace Slower Instructions With Faster

Slower instructions can be replaced with faster ones, and registers play an important role. For
example, a register supporting unit increment operation will perform better than adding one to the
register. The same can be done with many other operations, like multiplication.
ADD #1
SUB #1
// The above instructions can be replaced with
// INC R
// DEC R
// if the register supports increment and decrement

Let’s see another example, using Java bytecode:


Here X is loaded twice and then multiplied. We can use the dup instruction instead: it copies the value on top of the stack (so X need not be loaded again), and then we can perform the operation. The shorter sequence is faster and is preferred over the slower one.
aload X
aload X
mul
// The above instructions can be replaced with
aload X
dup
mul

Dead code Elimination

The dead code can be eliminated to improve the system's performance; resources will be free and
less memory will be needed.
int dead(void)
{
int a=1;
int b=5;
int c=a+b;
return c;
// c will be returned
// The remaining part of code is dead code, never reachable
int k=1;
k=k*2;
k=k+b;
return k;
// This dead code can be removed for optimization
}
Moreover, null sequences and useless operations can be deleted too.

Symbol Table Management


Symbol Table is a vital data structure created and maintained by the compiler to keep track of the semantics
of variables. A symbol table stores information about the scope and necessary details on names and
instances of several entities, like names of variables and functions, classes, objects, etc. Debugging tools will have a challenging time determining addresses or understanding anything about the program if the symbol table has been stripped before the translation into an executable.
A symbol table serves the following objectives:
• It verifies whether a variable has been declared.
• It stores all entities' names in a structured form in a single place.
• It determines the scope of a name.
• It implements type checking by verifying whether the assignments and expressions in the source
code are semantically correct or not.
• It may be included in a process output to be used later during debugging sessions.
• It serves as a resource for creating a diagnostic report during or after program execution.

Format of Symbol Table


A symbol table is simply a table, which can be either a linear table or a hash table. Each item in the symbol table has attributes such as its name, kind, type, and so on.

Operations:

The symbol table provides the following operations:

Insert ()
o Insert () operation is more frequently used in the analysis phase when the tokens are identified and
names are stored in the table.
o The insert() operation is used to insert the information in the symbol table like the unique name
occurring in the source code.
o The attribute of a symbol is the information associated with that symbol, such as its state, value, type and scope.
o The insert() function takes the symbol and its attributes as arguments.

For example:
1. int x;

Should be processed by the compiler as:

1. insert (x, int)

lookup()
In the symbol table, lookup() operation is used to search a name. It is used to determine:

o The existence of symbol in the table.


o The declaration of the symbol before it is used.
o Check whether the name is used in the scope.
o Initialization of the symbol.
o Checking whether the name is declared multiple times.

The basic format of lookup() function is as follows:

1. lookup (symbol)

This format varies according to the programming language.

It is used by various phases of the compiler as follows:-


1. Lexical Analysis: Creates new table entries in the table, for example like entries about
tokens.
2. Syntax Analysis: Adds information regarding attribute type, scope, dimension, line of
reference, use, etc in the table.
3. Semantic Analysis: Uses available information in the table to check for semantics i.e. to
verify that expressions and assignments are semantically correct(type checking) and
update it accordingly.
4. Intermediate Code generation: Refers symbol table for knowing how much and what type
of run-time is allocated and table helps in adding temporary variable information.
5. Code Optimization: Uses information present in the symbol table for machine-dependent
optimization.
6. Target Code generation: Generates code by using address information of identifier
present in the table.

Items stored in Symbol table:


• Variable names and constants
• Procedure and function names
• Literal constants and strings
• Compiler generated temporaries
• Labels in source languages

Information used by the compiler from Symbol table:


• Data type and name
• Declaring procedures
• Offset in storage
• If structure or record then, a pointer to structure table.
• For parameters, whether parameter passing by value or by reference
• Number and type of arguments passed to function
• Base Address

Implementation of Symbol table –


Following are commonly used data structures for implementing symbol table:-
List –
• In this method, an array is used to store names and associated information.
• A pointer “available” is maintained at end of all stored records and new names are added in the
order as they arrive
• To search for a name we scan from the beginning of the list up to the “available” pointer; if the name is not found we report the error “use of an undeclared name”.
• While inserting a new name we must ensure that it is not already present; otherwise we report the error “multiply defined name”.
• Insertion is fast O(1), but lookup is slow for large tables – O(n) on average
• The advantage is that it takes a minimum amount of space.

Hash Table –
• In the hashing scheme, two tables are maintained – a hash table and a symbol table; this is the most commonly used method of implementing symbol tables.
• A hash table is an array with an index range: 0 to table size – 1. These entries are pointers pointing
to the names of the symbol table.
• To search for a name we use a hash function that will result in an integer between 0 to table size –
1.
• Insertion and lookup can be made very fast – O(1).
• The advantage is that searching is quick; the disadvantage is that hashing is more complicated to implement.
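
A minimal sketch in C of the hashing scheme: a fixed-size bucket array of chained entries with insert() and lookup(); the hash function, table size and entry fields are illustrative assumptions, and the name strings are assumed to outlive the table (e.g. they live in the lexer's string storage).

#include <stdlib.h>
#include <string.h>

#define TABLE_SIZE 211

struct entry {
    const char *name;          /* identifier              */
    const char *type;          /* attribute, e.g. "int"   */
    struct entry *next;        /* chaining for collisions */
};

static struct entry *table[TABLE_SIZE];

static unsigned hash(const char *s)
{
    unsigned h = 0;
    while (*s) h = h * 31 + (unsigned char)*s++;
    return h % TABLE_SIZE;
}

/* insert(x, int): add the name with its attribute (assumed not already present). */
void insert(const char *name, const char *type)
{
    unsigned h = hash(name);
    struct entry *e = malloc(sizeof *e);
    e->name = name;
    e->type = type;
    e->next = table[h];
    table[h] = e;
}

/* lookup(symbol): return the entry, or NULL if the name was never declared. */
struct entry *lookup(const char *name)
{
    for (struct entry *e = table[hash(name)]; e != NULL; e = e->next)
        if (strcmp(e->name, name) == 0)
            return e;
    return NULL;
}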

Binary Search Tree –


• Another approach to implementing a symbol table is to use a binary search tree, i.e. each entry has two link fields, a left child and a right child.
• Names are inserted as nodes so that the tree always satisfies the binary search tree property.
• Insertion and lookup are O(log2 n) on average.

Data Structures in Symbol Table:


Data Structures used for the implementation of symbol tables are-
1.Binary Search Tree
2.Hash Tables
3.Linear search
A compiler maintains two kinds of symbol tables: a global symbol table and scope symbol tables. The global symbol table can be accessed by all the procedures and all scope symbol tables, while a separate scope symbol table is created for each scope in the program. The scope of a name can be determined by arranging the tables in a hierarchy, as shown below:

int value=10;

void sum_num()
{
int num_1;
int num_2;

{
int num_3;
int num_4;
}

int num_5;
{
int num_6;
int num_7;
}
}

void sum_id()
{
int id_1;
int id_2;

{
int id_3;
int id_4;
}

int id_5;
}

The above code can be arranged hierarchically in the following order:

The global symbol table contains one global variable of integer type and two procedure names, which can be accessed by all the child nodes. The names declared in the sum_num symbol table (and all its child tables) are not available to the sum_id table and its child tables.
The hierarchy of the data structures implemented for the symbol table is maintained by the semantic analyzer. A name is searched in the symbol tables using the following hierarchy:
• First, the symbol is searched in the current scope, i.e., the current symbol table.
• If the name is found, the search is complete; otherwise it is searched in the parent symbol table, and so on, until
• either the name is found or the global symbol table has been searched.
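
The hierarchical search described above can be sketched in C by giving each scope's table a pointer to its parent: scope_lookup searches the current scope first and then walks outward until the global table has been searched. All names in the sketch are illustrative.

#include <string.h>

struct sym {
    const char *name;
    const char *type;
    struct sym *next;
};

struct scope {
    struct sym   *symbols;     /* entries declared in this scope       */
    struct scope *parent;      /* enclosing scope, NULL for the global */
};

/* Search the current scope first, then each enclosing scope in turn. */
struct sym *scope_lookup(struct scope *s, const char *name)
{
    for (; s != NULL; s = s->parent)
        for (struct sym *e = s->symbols; e != NULL; e = e->next)
            if (strcmp(e->name, name) == 0)
                return e;
    return NULL;   /* searched up to and including the global table: not found */
}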

Error Handling and Recovery


A compiler error occurs whenever the compiler fails to compile a line of code or an algorithm, due to a bug either in the code or in the compiler itself. First we'll look at the various kinds of compiler errors, and then we'll discuss the recovery procedures used to handle them.
The goals of the Error Handling process are to identify each error, display it to the user, and then develop
and apply a recovery strategy to handle the error. The program's processing time should not be slow during
this entire operation.
Features of an Error handler:
1. Error Detection
2. Error Reporting
3. Error Recovery
Error handler = Error Detection + Error Report + Error Recovery
The blank entries in the symbol table are Errors. The parser should be able to discover and report program
errors. The parser can handle any errors that occur and continue parsing the rest of the input. Although the
parser is primarily responsible for error detection, faults can arise at any point during the compilation process.

Types Of Errors
Now we'll take a look at some common types of errors. There are three kinds of errors:
1. Compile-time errors
2. Runtime errors
3. Logical errors

Compile Time Errors:-


Compile-time errors appear during the compilation process before the program is executed. This can be
due to a syntax error or a missing file reference that stops the application from compiling properly.
Types of Compile Time Errors:-
Now, we will have a look at the major three kinds of compile-time errors.
1. Lexical Phase errors
2. Syntactic phase errors
3. Semantic errors

1. Lexical Phase Errors

Misspellings of identifiers, keywords, or operators fall into this category. A lexical error occurs when a series of characters does not match the pattern of any token. These mistakes might result from spelling errors or the appearance of illegal characters.
In general, lexical errors occur when:
• identifiers or numeric constants are too long;
• illegal characters appear in the input;
• a string is not properly terminated.

Example:
class Factorial{
public static void main(String args[]){
int i,fact=1;
int number=5;
for(i=1;i<=number;i++){
fact=fact*i;
}
System.out.println("Factorial of "+number+" is: "+fact);$$$
}
}
Here, we have placed unrecognizable characters ($$$) after the print statement, which results in a lexical error. The reported error looks like:
error: illegal start of expression

2. Syntactic Phase Errors

These problems arise during the syntax analysis phase. They occur when parentheses are unbalanced or when operators or separators are missing, for example a missing semicolon or an unbalanced parenthesis.
Typically Syntactic errors look like this:
• Errors in the structure of the program
• A missing operator
• Misspelled keywords
• Unbalanced parentheses

Example:
class Factorial{
public static void main(String args[]){
int i,fact=1;
int number == 5;
for(i=1;i<=number;i++){
fact=fact*i;
}
System.out.println("Factorial of "+number+" is: "+fact);
}
}
The above code has a syntactic error: we need to use = (the assignment operator) in the declaration, but we have used the comparison operator (==) instead.

3. Semantic Errors
These errors are detected at compile time, during the semantic analysis step. They occur when operators or variables are used incorrectly, or when variables are used without being declared, for example a type conflict between an operator and its operand or an incompatible value assignment.
Some of the semantic errors are

• Inappropriate types of operands


• Variables that were not declared
• Actual arguments do not match a formal argument.

Example:
class Factorial{
public static void main(String args[]){
int i;
int number = 5;
for(i=1;i<=number;i++){
fact=fact*i;
}
System.out.println("Factorial of "+number+" is: "+fact);
}
}
The above code produces an error because the variable “fact” is used without being declared, which is a semantic error. The reported error looks like:
error: cannot find symbol

Runtime Errors:-
A run-time error occurs during the execution of a program and is most commonly caused by incorrect system parameters or improper input data. Examples include a lack of memory to run an application or a memory conflict with other software. These errors generally cannot be detected by the compiler; they appear only when the program is run.

Logical Errors:-
Logic errors occur when a program executes incorrectly yet does not terminate abnormally. A logic error can result in unexpected or unwanted output or other behavior, even if it is not immediately identified as such. Examples include code that is unreachable or an unintended infinite loop.
Now, We'll look at several error recovery mechanisms in a compiler as we've got a good knowledge of the
different types of errors.

Modes of Error Recovery

The simplest requirement on a compiler is that it simply stop, issue a message, and halt compilation. To cope with errors in the code, the parser can instead implement one of the typical error-recovery mechanisms below, which are among the most prevalent recovery strategies.

Panic Mode Recovery:-


This method involves discarding input symbols one at a time until one of a designated set of synchronizing tokens is found. Delimiters such as a semicolon or an opening or closing parenthesis are typical synchronizing tokens.
For example,
int a, $b, sum, 5z;
Since some of the declared variables start with invalid symbols ($) or digits (5), panic mode recovery will discard input until a synchronizing token such as a comma or semicolon is found.
The benefit is that it's simple to implement and ensures that you won't get stuck in an infinite loop. The
drawback is that a significant amount of data is skipped without being checked for additional problems.

Statement Mode Recovery:-


When a parser detects an error, it attempts to rectify the problem so that the rest of the statement's inputs
allow the parser to continue parsing. Inserting a missing semicolon, replacing a comma with a semicolon,
and so forth. Designers of parsers must exercise caution here, as one incorrect correction could result in an
unending cycle.
One drawback is that it is difficult to deal with situations in which the real error occurred before the point of detection.
Error Productions:-
If the compiler designer is aware of common errors, these kinds of faults can be handled by augmenting the grammar with error productions that generate the erroneous constructs. However, such a grammar is extremely difficult to maintain.
Error productions are used to recover syntactic phase errors.

Global Corrections:-
The parser looks over the entire program and tries to figure out what it's supposed to accomplish, then finds
the closest match that's error-free.
When given an incorrect input (statement) X, it generates a parse tree for the closest error-free statement Y.
This method may allow the parser to make minimum changes to the source code, but it has yet to be deployed
in practice due to its complexity (time and space).

Using Symbol Table:-


For semantic errors, the compiler consults the symbol table for the relevant identifiers and performs automatic type conversion if the data types of two operands are incompatible.
Example:
int data1 = 10;
String data2 = "12";
int data3 = data1 + data2;
So, after compiling this statement, it will result in the following parse tree:
          (+)   (int + string: type conflict)
         /   \
        /     \
       /       \
 data1 (int)   data2 (string)
data1 is fixed by converting it into a string and then concatenating the two strings. Although there is an error in this statement, the compiler makes an assumption and resolves it.
The diagram given below depicts in which phases the recovery methods can be applied.
