CD Notes Final

Compiler:
A compiler is a translator that takes a high-level language program as input and produces an equivalent low-level language program, i.e. machine or assembly language, as output.
 A compiler is more intelligent than an assembler: it checks all kinds of limits, ranges, errors, etc.
 However, its translation takes more time and occupies a larger part of memory, because a compiler goes through the entire program and then translates the entire program into machine code.

Language processing systems (using Compiler): We know a computer is a logical assembly of software and hardware. The hardware understands a language that is hard for us to grasp, so we tend to write programs in a high-level language that is much easier for us to comprehend and maintain. These programs then go through a series of transformations so that they can readily be used by machines. This is where language processing systems come in handy.
A translator or language processor is a program that translates an input
program written in a programming language into an equivalent program in
another language. The compiler is a type of translator, which takes a program
written in a high-level programming language as input and translates it into an
equivalent program in low-level languages such as machine language or
assembly language. The program written in a high-level language is known as
a source program, and the program converted into low-level language is
known as an object (or target) program. Moreover, the compiler traces the
errors in the source program and generates the error report. Without
compilation, no program written in a high-level language can be executed.
After compilation, only the program in machine language is loaded into the
memory for execution. For every programming language, we have a different
compiler; however, the basic tasks performed by every compiler are the same.
 High-Level Language: A program that contains preprocessor directives such as #include or #define is written in a high-level language (HLL). High-level languages are closer to humans but far from machines. These (#) tags are called preprocessor directives; they direct the pre-processor about what to do.
 Pre-Processor: The pre-processor removes all the #include directives by including the referenced files (file inclusion) and expands all the #define directives (macro expansion). It performs file inclusion, augmentation, macro-processing, etc.
 Assembly Language: It’s neither in binary form nor high level. It is
an intermediate state that is a combination of machine instructions
and some other useful data needed for execution.
 Assembler: For every platform (Hardware + OS) we have a separate assembler; assemblers are not universal, since each platform has its own. The output of the assembler is called an object file. It translates assembly language into machine code.
 Interpreter: An interpreter converts high-level language into low-level machine language, just like a compiler, but the two differ in how they read the input. The compiler reads the input in one go, does the processing, and executes the source code, whereas the interpreter does the same line by line. A compiler scans the entire program and translates it as a whole into machine code, whereas an interpreter translates the program one statement at a time. Interpreted programs are usually slower than compiled ones.


 Relocatable Machine Code: Code that can be loaded at any memory location and run. The addresses within the program are arranged so that the code still works when it is moved.
Linker
A linker is a computer program that links and merges various object files together in order to make an executable file. These files may have been produced by separate assemblers. The major task of a linker is to search for and locate the referenced modules/routines in a program and to determine the memory locations where this code will be loaded, so that the program instructions have absolute references.
Loader
The loader is the part of the operating system that is responsible for loading executable files into memory and executing them. It calculates the size of a program (instructions and data) and creates memory space for it. It initializes various registers to initiate execution.
Phases of a Compiler:
There are two major phases of compilation, which in turn have many parts.
Each of them takes input from the output of the previous level and works in a
coordinated way.

Analysis Phase: Known as the front end of the compiler, the analysis phase of the compiler reads the source program, divides it into core parts and then checks for lexical, grammar and syntax errors. The analysis phase generates an intermediate representation of the source program and a symbol table, which are fed to the synthesis phase as input. An intermediate representation is created from the given source code by:
1. Lexical Analyzer
2. Syntax Analyzer
3. Semantic Analyzer
4. Intermediate Code Generator
Lexical analyzer divides the program into “tokens”, the Syntax analyzer
recognizes “sentences” in the program using the syntax of the language and
the Semantic analyzer checks the static semantics of each construct.
Intermediate Code Generator generates “abstract” code.
Synthesis Phase:
Known as the back end of the compiler, the synthesis phase generates the target program with the help of the intermediate code representation and the symbol table. An equivalent target program is created from the intermediate representation. It has two parts:
1. Code Optimizer
2. Code Generator
Code Optimizer optimizes the abstract code, and the final Code Generator
translates abstract intermediate code into specific machine instructions.

Cross Compiler
A simple compiler works on one system only. But what happens if we need a compiler that can compile code for another platform? To perform such compilation, the cross compiler is used.
A cross compiler is a compiler capable of creating executable code for a
platform other than the one on which the compiler is running. For example, a
cross compiler executes on machine X and produces machine code for
machine Y.

How is a compiler different from a cross-compiler?

A native compiler is a compiler that generates code for the same platform on which it runs. A cross compiler, on the other hand, is a compiler that generates executable code for a platform other than the one on which the compiler is running.

Where is the cross compiler used?

 In bootstrapping, a cross-compiler is used for transitioning to a new platform. When developing software for a new platform, a cross-compiler is used to compile necessary tools such as the operating system and a native compiler.
 For microcontrollers, we use a cross compiler because a microcontroller doesn't support an operating system.
 It is useful for embedded computers, which have limited computing resources.
 To compile for a platform where it is not practical to do the compiling, a cross-compiler is used.
 When direct compilation on the target platform is not feasible, we can use a cross compiler.
 It helps to keep the target environment separate from the build environment.

Source-to-source Compiler
A compiler that takes the source code of one programming language and translates it into the source code of another programming language is called a source-to-source compiler.

Comparison of Compiler and Interpreter:

1. Compiler: The compiler scans the whole program in one go. Interpreter: Translates the program one statement at a time.
2. Compiler: As it scans the code in one go, the errors (if any) are shown at the end together. Interpreter: Since it scans the code one line at a time, errors are shown line by line.
3. Compiler: The main advantage of compilers is their execution time. Interpreter: Due to interpreters being slow in executing the object code, they are preferred less.
4. Compiler: It converts the source code into object code. Interpreter: It does not convert source code into object code; instead it scans it line by line.
5. Compiler: It does not require the source code for later execution. Interpreter: It requires the source code for later execution.
6. Compiler: Execution of the program takes place only after the whole program is compiled. Interpreter: Execution of the program happens after every line is checked or evaluated.
7. Compiler: The machine code is stored in disk storage. Interpreter: Machine code is not stored anywhere.
8. Compiler: Compilers more often take a large amount of time for analyzing the source code. Interpreter: In comparison, interpreters take less time for analyzing the source code.
9. Compiler: It is more efficient. Interpreter: It is less efficient.
10. Compiler: CPU utilization is more. Interpreter: CPU utilization is less.
Eg. Compiler-based programming languages: C, C++, C#, etc. Interpreter-based programming languages: Python, Ruby, Perl, SNOBOL, MATLAB, etc.

Phases of a Compiler

We basically have two phases of compilers, namely the Analysis phase and
Synthesis phase. The analysis phase creates an intermediate representation
from the given source code. The synthesis phase creates an equivalent target
program from the intermediate representation.

Symbol Table – It is a data structure used and maintained by the compiler, consisting of all the identifiers' names along with their types. It helps the compiler to function smoothly by finding the identifiers quickly.
The analysis of a source program is divided into mainly three phases. They are:
1. Linear Analysis-
This involves a scanning phase where the stream of characters is read
from left to right. It is then grouped into various tokens having a
collective meaning.
2. Hierarchical Analysis-
In this analysis phase, based on a collective meaning, the tokens are
categorized hierarchically into nested groups.
3. Semantic Analysis-
This phase is used to check whether the components of the source
program are meaningful or not.
The compiler has two modules namely the front end and the back end. Front-
end constitutes the Lexical analyzer, semantic analyzer, syntax analyzer, and
intermediate code generator. And the rest are assembled to form the back end.
1. Lexical Analyzer –
It is also called a scanner. It takes the output of the preprocessor
(which performs file inclusion and macro expansion) as the input
which is in a pure high-level language. It reads the characters from
the source program and groups them into lexemes (sequence of
characters that “go together”). Each lexeme corresponds to a token.
Tokens are defined by regular expressions which are understood by
the lexical analyzer. It also detects lexical errors (e.g., erroneous characters) and removes comments and white space.
2. Syntax Analyzer – It is sometimes called a parser. It constructs the
parse tree. It takes all the tokens one by one and uses Context-Free
Grammar to construct the parse tree.
Why Grammar?
The rules of programming can be entirely represented in a few
productions. Using these productions we can represent what the
program actually is. The input has to be checked whether it is in the
desired format or not.
The parse tree is also called the derivation tree. Parse trees are
generally constructed to check for ambiguity in the given grammar.
There are certain rules associated with the derivation tree.

 Any identifier is an expression


 Any number can be called an expression
 Performing any operations in the given expression will
always result in an expression. For example, the sum of two
expressions is also an expression.
 The parse tree can be compressed to form a syntax tree
Syntax error can be detected at this level if the input is not in accordance with
the grammar.
 Semantic Analyzer – It verifies the parse tree, whether it’s
meaningful or not. It furthermore produces a verified parse tree. It
also does type checking, Label checking, and Flow control checking.
 Intermediate Code Generator – It generates intermediate code, which is a form that can be readily executed by a machine, for example three-address code. Intermediate code is converted to machine language using the last two phases, which are platform dependent. Up to intermediate code, the process is the same for every compiler; after that, it depends on the platform. To build a new compiler we don't need to build it from scratch: we can take the intermediate code from an already existing compiler and build the last two parts.
 Code Optimizer – It transforms the code so that it consumes fewer
resources and produces more speed. The meaning of the code being
transformed is not altered. Optimization can be categorized into two
types: machine-dependent and machine-independent.
 Target Code Generator – The main purpose of the target code generator is to write code that the machine can understand, and to perform register allocation, instruction selection, etc. The output is dependent on the type of assembler. This is the final stage of compilation. The optimized code is converted into relocatable machine code, which then forms the input to the linker and loader.

1 ANALYSIS OF THE SOURCE PROGRAM


In compiling, analysis consists of three phases:
1. Lexical Analysis
2. Syntax Analysis
3. Semantic Analysis
Lexical Analysis (Scanning)
In a compiler, linear analysis is called lexical analysis or scanning. The lexical analysis phase reads the characters in the source program and groups them into tokens, which are sequences of characters having a collective meaning.
EXAMPLE
position : = initial + rate * 60

This can be grouped into the following tokens;


1. The identifier position.
2. The assignment symbol : =
3. The identifier initial
4. The plus sign
5. The identifier rate
6. The multiplication sign
7. The number 60
Blanks separating characters of these tokens are normally eliminated during
lexical analysis.
Syntax Analysis (Parsing)
Hierarchical Analysis is called parsing or syntax analysis.
It involves grouping the tokens of the source program into grammatical phrases that are used by the compiler to synthesize output. They are represented using a syntax tree.
A syntax tree is the tree generated as a result of syntax analysis in which the
interior nodes are the operators and the exterior nodes are the operands. This
analysis shows an error when the syntax is incorrect.
Semantic Analysis
This phase checks the source program for semantic errors and gathers type
information for subsequent code generation phase.
An important component of semantic analysis is type checking.
Here the compiler checks that each operator has operands that are permitted by
the source language specification.
Lexical Analysis
Lexical analysis is the first phase of a compiler. It takes the modified source code from the language preprocessor, written in the form of sentences. The lexical analyzer breaks this text into a series of tokens, removing any whitespace or comments in the source code.
If the lexical analyzer finds a token invalid, it generates an error. The lexical analyzer works closely with the syntax analyzer. It reads character streams from the source code, checks for legal tokens, and passes the data to the syntax analyzer when demanded.

Tokens
Lexemes are said to be a sequence of characters (alphanumeric) in a token. There
are some predefined rules for every lexeme to be identified as a valid token. These
rules are defined by grammar rules, by means of a pattern. A pattern explains
what can be a token, and these patterns are defined by means of regular
expressions.
In a programming language, keywords, constants, identifiers, strings, numbers, operators and punctuation symbols can be considered as tokens.
For example, in C language, the variable declaration line
int value = 100;
contains the tokens:
int (keyword), value (identifier), = (operator), 100 (constant) and ; (symbol).
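The grouping described above can be illustrated with a tiny scanner. The following Python sketch is only an illustration (the token names and regular expressions are assumptions chosen for this example, not the rules of any particular compiler); it tokenizes the declaration shown above:

import re

# Token specification: each pair is (token name, regular expression).
TOKEN_SPEC = [
    ("KEYWORD",    r"\bint\b|\bfloat\b|\breturn\b"),   # keywords get priority over identifiers
    ("IDENTIFIER", r"[A-Za-z_][A-Za-z0-9_]*"),
    ("CONSTANT",   r"\d+"),
    ("OPERATOR",   r"[=+\-*/]"),
    ("SYMBOL",     r"[;(){},]"),
    ("SKIP",       r"\s+"),                             # whitespace is discarded
]
MASTER_RE = re.compile("|".join("(?P<%s>%s)" % pair for pair in TOKEN_SPEC))

def tokenize(source):
    """Return the list of (token, lexeme) pairs found in the source string."""
    tokens = []
    for match in MASTER_RE.finditer(source):
        kind = match.lastgroup
        if kind != "SKIP":                              # whitespace/comments are dropped
            tokens.append((kind, match.group()))
    return tokens

print(tokenize("int value = 100;"))
# [('KEYWORD', 'int'), ('IDENTIFIER', 'value'), ('OPERATOR', '='),
#  ('CONSTANT', '100'), ('SYMBOL', ';')]

Listing the KEYWORD pattern before IDENTIFIER mirrors the rule-priority idea discussed later: a lexeme that matches a reserved word is recognized as that keyword rather than as an identifier.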
Specifications of Tokens
Let us understand how the language theory undertakes the following terms:
Alphabets
Any finite set of symbols {0,1} is a set of binary alphabets,
{0,1,2,3,4,5,6,7,8,9,A,B,C,D,E,F} is a set of Hexadecimal alphabets, {a-z, A-Z}
is a set of English language alphabets.
Strings
Any finite sequence of alphabets (characters) is called a string. Length of the
string is the total number of occurrence of alphabets, e.g., the length of the string
tutorialspoint is 14 and is denoted by |tutorialspoint| = 14. A string having no
alphabets, i.e. a string of zero length is known as an empty string and is denoted
by ε (epsilon).
Language
A language is considered as a finite set of strings over some finite set of alphabets.
Computer languages are considered as finite sets, and mathematically set
operations can be performed on them. Finite languages can be described by means
of regular expressions.

Lexeme

It is a sequence of characters in the source code that is matched against the predefined language rules to be identified as a valid token.
Example:
main is a lexeme of type identifier (token)
(, ), {, } are lexemes of type punctuation (token)

Pattern

It specifies a set of rules that a scanner follows to create a token.


Example of Programming Language (C, C++):
For a keyword to be identified as a valid token, the pattern is the sequence of characters that make up the keyword.
For an identifier to be identified as a valid token, the pattern is the predefined rule that it must start with a letter, followed by letters or digits.

Token, Lexeme and Pattern compared:

Definition –
Token: a sequence of characters that is treated as a unit, as it cannot be further broken down.
Lexeme: a sequence of characters in the source code that is matched by the predefined language rules for every lexeme to be specified as a valid token.
Pattern: a set of rules that a scanner follows to create a token.

Interpretation of type Keyword –
Token: all the reserved keywords of that language (main, printf, etc.)
Lexeme: int, goto
Pattern: the sequence of characters that make up the keyword.

Interpretation of type Identifier –
Token: name of a variable, function, etc.
Lexeme: main, a
Pattern: it must start with a letter, followed by letters or digits.

Interpretation of type Operator –
Token: all the operators are considered tokens.
Lexeme: +, =
Pattern: +, =

Interpretation of type Punctuation –
Token: each kind of punctuation is considered a token, e.g. semicolon, bracket, comma, etc.
Lexeme: (, ), {, }
Pattern: (, ), {, }

Interpretation of type Literal –
Token: a grammar rule or boolean literal.
Lexeme: "Welcome to GeeksforGeeks!"
Pattern: any string of characters (except ' ') between " and "
How Lexical Analyzer works-
1. Input preprocessing: This stage involves cleaning up the input text
and preparing it for lexical analysis. This may include removing
comments, whitespace, and other non-essential characters from the
input text.
2. Tokenization: This is the process of breaking the input text into a
sequence of tokens. This is usually done by matching the characters
in the input text against a set of patterns or regular expressions that
define the different types of tokens.
3. Token classification: In this stage, the lexer determines the type of
each token. For example, in a programming language, the lexer
might classify keywords, identifiers, operators, and punctuation
symbols as separate token types.
4. Token validation: In this stage, the lexer checks that each token is
valid according to the rules of the programming language. For
example, it might check that a variable name is a valid identifier, or
that an operator has the correct syntax.
5. Output generation: In this final stage, the lexer generates the output
of the lexical analysis process, which is typically a list of tokens.
This list of tokens can then be passed to the next stage of compilation
or interpretation.

The lexical analyzer identifies errors with the help of the automaton and the grammar of the given language on which it is based, such as C or C++, and gives the row number and column number of the error.
The lexical analyzer also follows rule priority, where a reserved word, e.g., a keyword, of the language is given priority over user input. That is, if the lexical analyzer finds a lexeme that matches an existing reserved word, it is recognized as that reserved word rather than as a user-defined identifier.
The output of Lexical Analysis Phase:
The output of the lexical analyzer serves as input to the syntax analyzer as a sequence of tokens, not a series of lexemes, because during the syntax analysis phase the individual lexeme is not important; what matters is the category or class (token) to which the lexeme belongs.
Regular Expressions
The lexical analyzer needs to scan and identify only a finite set of valid
string/token/lexeme that belong to the language in hand. It searches for the pattern
defined by the language rules.
Regular expressions have the capability to express finite languages by defining a
pattern for finite strings of symbols. The grammar defined by regular expressions
is known as regular grammar. The language defined by regular grammar is
known as regular language.
Regular expression is an important notation for specifying patterns. Each pattern
matches a set of strings, so regular expressions serve as names for a set of strings.
Programming language tokens can be described by regular languages. The
specification of regular expressions is an example of a recursive definition.
Regular languages are easy to understand and have efficient implementation.
There are a number of algebraic laws that are obeyed by regular expressions,
which can be used to manipulate regular expressions into equivalent forms.
Operations
The various operations on languages are:
 Union of two languages L and M is written as
L U M = {s | s is in L or s is in M}
 Concatenation of two languages L and M is written as
LM = {st | s is in L and t is in M}
 The Kleene Closure of a language L is written as
L* = Zero or more occurrence of language L.
Notations
If r and s are regular expressions denoting the languages L(r) and L(s), then
 Union : (r)|(s) is a regular expression denoting L(r) U L(s)
 Concatenation : (r)(s) is a regular expression denoting L(r)L(s)
 Kleene closure : (r)* is a regular expression denoting (L(r))*
 (r) is a regular expression denoting L(r)
Precedence and Associativity
 *, concatenation (.), and | (pipe sign) are left associative
 * has the highest precedence
 Concatenation (.) has the second highest precedence.
 | (pipe sign) has the lowest precedence of all.
Representing valid tokens of a language in regular expression
If x is a regular expression, then:
x* means zero or more occurrence of x.
i.e., it can generate { e, x, xx, xxx, xxxx, … }
x+ means one or more occurrence of x.
i.e., it can generate { x, xx, xxx, xxxx … } or x.x*
x? means at most one occurrence of x
i.e., it can generate either {x} or {e}.
[a-z] is all lower-case alphabets of English language.
[A-Z] is all upper-case alphabets of English language.
[0-9] is all natural digits used in mathematics.
Representing occurrences of symbols using regular expressions
letter = [a – z] or [A – Z]
digit = 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 or [0-9]
sign = [ + | - ]
Representation of language tokens using regular expressions
Decimal = (sign)?(digit)+
Identifier = (letter)(letter | digit)*
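As a quick check of these token definitions, the sketch below writes the same patterns in Python's re syntax (which differs slightly from the notation above); the variable names and test strings are assumptions made for illustration:

import re

letter = r"[a-zA-Z]"                 # letter = [a-z] | [A-Z]
digit  = r"[0-9]"                    # digit  = 0 | 1 | ... | 9
sign   = r"[+-]"                     # sign   = [ + | - ]

decimal    = sign + "?" + digit + "+"                    # Decimal    = (sign)?(digit)+
identifier = letter + "(" + letter + "|" + digit + ")*"  # Identifier = (letter)(letter | digit)*

def matches(pattern, s):
    """True if the whole string s is generated by the regular expression."""
    return re.fullmatch(pattern, s) is not None

print(matches(decimal, "-42"))        # True
print(matches(identifier, "rate1"))   # True
print(matches(identifier, "1rate"))   # False: an identifier must start with a letter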
The only problem left with the lexical analyzer is how to verify the validity of a
regular expression used in specifying the patterns of keywords of a language. A
well-accepted solution is to use finite automata for verification.
Finite Automata
Finite automata is a state machine that takes a string of symbols as input and
changes its state accordingly. Finite automata is a recognizer for regular
expressions. When a regular expression string is fed into finite automata, it
changes its state for each literal. If the input string is successfully processed and
the automata reaches its final state, it is accepted, i.e., the string just fed was said
to be a valid token of the language in hand.
The mathematical model of finite automata consists of:
 Finite set of states (Q)
 Finite set of input symbols (Σ)
 One Start state (q0)
 Set of final states (qf)
 Transition function (δ)
The transition function (δ) maps a state and an input symbol to a state: δ : Q × Σ → Q
Finite Automata Construction
Let L(r) be a regular language recognized by some finite automata (FA).
 States : States of an FA are represented by circles. The state name is written inside the circle.
 Start state : The state from which the automaton starts is known as the start state. The start state has an arrow pointing towards it.
 Intermediate states : All intermediate states have at least two arrows; one pointing to them and another pointing out from them.
 Final state : If the input string is successfully parsed, the automaton is expected to be in this state. A final state is represented by double circles. It may have any number of arrows pointing to it and pointing out from it.
 Transition : The transition from one state to another happens when a desired symbol in the input is found. Upon transition, the automaton can either move to the next state or stay in the same state. Movement from one state to another is shown as a directed arrow, where the arrow points to the destination state. If the automaton stays in the same state, an arrow pointing from the state to itself is drawn.
Example : We assume FA accepts any three digit binary value ending in digit 1.
FA = {Q(q0, qf), Σ(0,1), q0, qf, δ}

1) Deterministic Finite Automata (DFA):


A DFA consists of a 5-tuple {Q, Σ, q, F, δ}.
Q : set of all states.
Σ : set of input symbols. ( Symbols which machine takes as input )
q : Initial state. ( Starting state of a machine )
F : set of final state.
δ : Transition Function, defined as δ : Q X Σ --> Q.
In a DFA, for a particular input character, the machine goes to one state only.
A transition function is defined on every state for every input symbol. Also in
DFA null (or ε) move is not allowed, i.e., DFA cannot change state without
any input character.

For example, below DFA with Σ = {0, 1} accepts all strings ending with 0.

Figure: DFA with Σ = {0, 1}


One important thing to note is, there can be many possible DFAs for a
pattern. A DFA with a minimum number of states is generally preferred.
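A minimal sketch of how such a DFA can be run in code is shown below. The two states and the transition table are assumptions written down for the "strings ending with 0" example; q0 is the start state and q1 is the only final state:

# DFA over {0, 1} that accepts all strings ending with 0.
TRANSITIONS = {
    ("q0", "0"): "q1", ("q0", "1"): "q0",
    ("q1", "0"): "q1", ("q1", "1"): "q0",
}
START, FINALS = "q0", {"q1"}

def dfa_accepts(string):
    """Run the DFA on the string; accept if it halts in a final state."""
    state = START
    for symbol in string:
        state = TRANSITIONS[(state, symbol)]   # exactly one move per input symbol
    return state in FINALS

print(dfa_accepts("1010"))   # True  - the string ends with 0
print(dfa_accepts("101"))    # False - the string ends with 1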
2) Nondeterministic Finite Automata (NFA): An NFA is similar to a DFA except for the following additional features:
1. Null (or ε) moves are allowed, i.e., it can move forward without reading symbols.
2. Ability to transit to any number of states for a particular input.
However, these features don't add any power to the NFA. If we compare the two in terms of power, both are equivalent.
Due to the above additional features, the NFA has a different transition function; the rest is the same as for the DFA.
δ: Transition Function
δ: Q × (Σ U {ε}) --> 2^Q
As you can see, since the transition function is defined for any input including null (or ε), the NFA can go to any number of states. For example, below is an NFA for the above problem.

Figure: NFA
One important thing to note is, in NFA, if any path for an input string leads
to a final state, then the input string is accepted. For example, in the above
NFA, there are multiple paths for the input string “00”. Since one of the paths
leads to a final state, “00” is accepted by the above NFA.
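The rule "accept if any path reaches a final state" can be simulated by keeping the set of all states the NFA could currently be in, as in the sketch below. The NFA used here is an assumption for illustration: it accepts binary strings ending with 0, written nondeterministically (on reading 0, state q0 guesses whether that 0 is the last symbol):

# NFA over {0, 1} that accepts strings ending with 0.
NFA_TRANSITIONS = {
    ("q0", "0"): {"q0", "q1"},     # guess: the 0 just read may be the last one
    ("q0", "1"): {"q0"},
    # q1 has no outgoing moves
}
START, FINALS = "q0", {"q1"}

def nfa_accepts(string):
    """Accept if at least one path over the input ends in a final state."""
    current = {START}                                  # all states reachable so far
    for symbol in string:
        nxt = set()
        for state in current:
            nxt |= NFA_TRANSITIONS.get((state, symbol), set())
        current = nxt
    return bool(current & FINALS)

print(nfa_accepts("00"))   # True  - one of the paths ends in the final state q1
print(nfa_accepts("01"))   # False - no path ends in a final state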

Input Buffering in Compiler Design

The lexical analyzer scans the input from left to right one character at a time. It uses two pointers, begin ptr (bp) and forward ptr (fp), to keep track of the portion of the input scanned.

Initially both pointers point to the first character of the input string, as shown below.
The forward ptr moves ahead to search for the end of the lexeme. As soon as a blank space is encountered, it indicates the end of the lexeme. In the example above, as soon as the forward ptr (fp) encounters a blank space, the lexeme "int" is identified. When fp encounters white space, it ignores it and moves ahead; then both the begin ptr (bp) and forward ptr (fp) are set at the next token. The input characters are thus read from secondary storage, but reading from secondary storage in this way is costly, hence a buffering technique is used: a block of data is first read into a buffer and then scanned by the lexical analyzer. There are two methods used in this context: the One Buffer Scheme and the Two Buffer Scheme. These are explained below.

1. One Buffer Scheme: In this scheme, only one buffer is used to store the input string. The problem with this scheme is that if a lexeme is very long, it crosses the buffer boundary; to scan the rest of the lexeme the buffer has to be refilled, which overwrites the first part of the lexeme.
2. Two Buffer Scheme: To overcome the problem of the one buffer scheme, in this method two buffers are used to store the input string. The first buffer and second buffer are scanned alternately; when the end of the current buffer is reached, the other buffer is filled. The only problem with this method is that if the length of a lexeme is longer than the length of a buffer, the input cannot be scanned completely. Initially both bp and fp point to the first character of the first buffer. Then fp moves towards the right in search of the end of the lexeme. As soon as a blank character is recognized, the string between bp and fp is identified as the corresponding token. To identify the boundary of the first buffer, an end-of-buffer character is placed at the end of the first buffer. Similarly, the end of the second buffer is recognized by the end-of-buffer mark present at the end of the second buffer. When fp encounters the first eof, one can recognize the end of the first buffer, and hence filling of the second buffer is started. In the same way, when the second eof is obtained, it indicates the end of the second buffer. Alternately, both buffers can be filled until the end of the input program is reached and the stream of tokens is identified. The eof character introduced at the end is called a sentinel, which is used to identify the end of the buffer.
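A minimal sketch of the sentinel idea is given below; the buffer size, the eof sentinel value and the refill logic are simplifications assumed for illustration. The scanner tests for ordinary characters first, and only when it sees the sentinel does it decide whether to refill the other buffer or stop:

EOF = "\0"          # sentinel placed at the end of each buffer
BUF_SIZE = 8        # deliberately tiny so that refills actually happen

def scan_with_sentinels(source):
    """Yield the characters of source, refilling two buffers alternately."""
    chunks = [source[i:i + BUF_SIZE] for i in range(0, len(source), BUF_SIZE)]
    chunks.append("")                         # empty chunk marks the real end of input
    buffers = [chunks[0] + EOF, ""]           # buffer 0 is filled first
    active, fp, next_chunk = 0, 0, 1
    while True:
        ch = buffers[active][fp]
        if ch != EOF:                         # the common case: an ordinary character
            yield ch
            fp += 1
        elif next_chunk < len(chunks) and chunks[next_chunk]:
            other = 1 - active                # sentinel hit: refill the other buffer
            buffers[other] = chunks[next_chunk] + EOF
            active, fp, next_chunk = other, 0, next_chunk + 1
        else:
            return                            # sentinel hit and no input left: done

print("".join(scan_with_sentinels("int value = 100;")))   # prints: int value = 100;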
Syntax Analysis
Syntax analysis or parsing is the second phase of a compiler.
We have seen that a lexical analyzer can identify tokens with the help of regular expressions and pattern rules. But a lexical analyzer cannot check the syntax of a given sentence due to the limitations of regular expressions. Regular expressions cannot check balanced tokens, such as parentheses. Therefore, this phase uses context-free grammar (CFG), which is recognized by push-down automata.
CFG, on the other hand, is a superset of Regular Grammar, as depicted below:

It implies that every Regular Grammar is also context-free, but there exist some problems that are beyond the scope of Regular Grammar. CFG is a helpful tool in describing the syntax of programming languages.
Context-Free Grammar
In this section, we will first see the definition of context-free grammar and
introduce terminologies used in parsing technology.
A context-free grammar has four components:
 A set of non-terminals (V). Non-terminals are syntactic variables
that denote sets of strings. The non-terminals define sets of strings
that help define the language generated by the grammar.
 A set of tokens, known as terminal symbols (Σ). Terminals are the
basic symbols from which strings are formed.
 A set of productions (P). The productions of a grammar specify the manner in which the terminals and non-terminals can be combined to form strings. Each production consists of a non-terminal called the left side of the production, an arrow, and a sequence of tokens and/or non-terminals, called the right side of the production.
 One of the non-terminals is designated as the start symbol (S); from
where the production begins.
The strings are derived from the start symbol by repeatedly replacing a non-
terminal (initially the start symbol) by the right side of a production, for that non-
terminal.
Example
We take the problem of palindrome language, which cannot be described by
means of Regular Expression. That is, L = { w | w = wR } is not a regular language.
But it can be described by means of CFG, as illustrated below:
G = ( V, Σ, P, S )
Where:
V = { Q, Z, N }
Σ = { 0, 1 }
P = { Q → Z | Q → N | Q → 0 | Q → 1 | Q → ℇ | Z → 0Q0 | N → 1Q1 }
S={Q}
This grammar describes palindrome language, such as: 1001, 11100111, 00100,
1010101, 11111, etc.
Syntax Analyzers
A syntax analyzer or parser takes the input from a lexical analyzer in the form of
token streams. The parser analyzes the source code (token stream) against the
production rules to detect any errors in the code. The output of this phase is
a parse tree.

This way, the parser accomplishes two tasks, i.e., parsing the code, looking for
errors and generating a parse tree as the output of the phase.
Parsers are expected to parse the whole code even if some errors exist in the
program. Parsers use error recovering strategies, which we will learn later in this
chapter.
Derivation
A derivation is basically a sequence of production rules, in order to get the input
string. During parsing, we take two decisions for some sentential form of input:
 Deciding the non-terminal which is to be replaced.
 Deciding the production rule, by which, the non-terminal will be
replaced.
To decide which non-terminal to be replaced with production rule, we can have
two options.
Left-most Derivation
If the sentential form of an input is scanned and replaced from left to right, it is
called left-most derivation. The sentential form derived by the left-most
derivation is called the left-sentential form.
Right-most Derivation
If we scan and replace the input with production rules, from right to left, it is
known as right-most derivation. The sentential form derived from the right-most
derivation is called the right-sentential form.
Example
Production rules:
E→E+E
E→E*E
E → id
Input string: id + id * id
The left-most derivation is:
E→E*E
E→E+E*E
E → id + E * E
E → id + id * E
E → id + id * id
Notice that the left-most side non-terminal is always processed first.
The right-most derivation is:
E→E+E
E→E+E*E
E → E + E * id
E → E + id * id
E → id + id * id
Parse Tree
A parse tree is a graphical depiction of a derivation. It is convenient to see how
strings are derived from the start symbol. The start symbol of the derivation
becomes the root of the parse tree. Let us see this by an example from the last
topic.
We take the left-most derivation of id + id * id
The left-most derivation is:
E→E*E
E→E+E*E
E → id + E * E
E → id + id * E
E → id + id * id
Step 1:

E→E*E

Step 2:

E→E+E*E

Step 3:
E → id + E * E

Step 4:

E → id + id * E

Step 5:

E → id + id * id

In a parse tree:
 All leaf nodes are terminals.
 All interior nodes are non-terminals.
 In-order traversal gives original input string.
A parse tree depicts associativity and precedence of operators. The deepest sub-
tree is traversed first, therefore the operator in that sub-tree gets precedence over
the operator which is in the parent nodes.
Ambiguity
A grammar G is said to be ambiguous if it has more than one parse tree (left or
right derivation) for at least one string.
Example
E→E+E
E→E–E
E → id
For the string id + id – id, the above grammar generates two parse trees:

A language for which every grammar is ambiguous is said to be inherently ambiguous. Ambiguity in a grammar is not good for compiler construction. No method can detect and remove ambiguity automatically, but it can be removed either by re-writing the whole grammar without ambiguity, or by setting and following associativity and precedence constraints.
Associativity
If an operand has operators on both sides, the side on which the operator takes
this operand is decided by the associativity of those operators. If the operation is
left-associative, then the operand will be taken by the left operator or if the
operation is right-associative, the right operator will take the operand.
Example
Operations such as Addition, Multiplication, Subtraction, and Division are left
associative. If the expression contains:
id op id op id
it will be evaluated as:
(id op id) op id
For example, (id + id) + id
Operations like Exponentiation are right associative, i.e., the order of evaluation
in the same expression will be:
id op (id op id)
For example, id ^ (id ^ id)
Precedence
If two different operators share a common operand, the precedence of operators
decides which will take the operand. That is, 2+3*4 can have two different parse
trees, one corresponding to (2+3)*4 and another corresponding to 2+(3*4). By
setting precedence among operators, this problem can be easily removed. As in
the previous example, mathematically * (multiplication) has precedence over +
(addition), so the expression 2+3*4 will always be interpreted as:
2 + (3 * 4)
These methods decrease the chances of ambiguity in a language or its grammar.
Left Recursion
A grammar becomes left-recursive if it has any non-terminal ‘A’ whose
derivation contains ‘A’ itself as the left-most symbol. Left-recursive grammar is
considered to be a problematic situation for top-down parsers. Top-down parsers
start parsing from the Start symbol, which in itself is non-terminal. So, when the
parser encounters the same non-terminal in its derivation, it becomes hard for it
to judge when to stop parsing the left non-terminal and it goes into an infinite
loop.
Example:
(1) A => Aα | β

(2) S => Aα | β
A => Sd
(1) is an example of immediate left recursion, where A is a non-terminal symbol and α and β represent strings of grammar symbols.
(2) is an example of indirect left recursion.

A top-down parser will first parse the A, which in-turn will yield a string
consisting of A itself and the parser may go into a loop forever.
Removal of Left Recursion
One way to remove left recursion is to use the following technique:
The production
A => Aα | β
is converted into following productions
A => βA'
A'=> αA' | ε
This does not impact the strings derived from the grammar, but it removes
immediate left recursion.
Second method is to use the following algorithm, which should eliminate all
direct and indirect left recursions.
START

Arrange non-terminals in some order like A1, A2, A3,…, An

for each i from 1 to n


{
for each j from 1 to i-1
{
replace each production of the form Ai ⟹ Aj𝜸
with Ai ⟹ δ1𝜸 | δ2𝜸 | δ3𝜸 | … | δn𝜸
where Aj ⟹ δ1 | δ2 | … | δn are the current Aj productions
}
}
eliminate immediate left-recursion
END
Example
The production set
S => Aα | β
A => Sd
after applying the above algorithm, should become
S => Aα | β
A => Aαd | βd
and then, remove immediate left recursion using the first technique.
A => βdA'
A' => αdA' | ε
Now none of the production has either direct or indirect left recursion.
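A minimal sketch of the first technique (removing immediate left recursion of the form A => Aα | β) is given below. The representation of productions as lists of symbols, and the use of "'" and "ε" as plain strings, are assumptions made for this illustration:

def remove_immediate_left_recursion(nonterminal, productions):
    """Rewrite A -> A a1 | ... | A am | b1 | ... | bn
    as      A -> b1 A' | ... | bn A'   and   A' -> a1 A' | ... | am A' | ε."""
    recursive, others = [], []
    for rhs in productions:                      # each rhs is a list of symbols
        if rhs and rhs[0] == nonterminal:
            recursive.append(rhs[1:])            # the alpha part
        else:
            others.append(rhs)                   # the beta part
    if not recursive:                            # no immediate left recursion
        return {nonterminal: productions}
    new_nt = nonterminal + "'"
    return {
        nonterminal: [beta + [new_nt] for beta in others],
        new_nt: [alpha + [new_nt] for alpha in recursive] + [["ε"]],
    }

# A -> A α | β   becomes   A -> β A'   and   A' -> α A' | ε
print(remove_immediate_left_recursion("A", [["A", "α"], ["β"]]))
# {'A': [['β', "A'"]], "A'": [['α', "A'"], ['ε']]}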
Left Factoring
If more than one grammar production rules has a common prefix string, then the
top-down parser cannot make a choice as to which of the production it should
take to parse the string in hand.
Example
If a top-down parser encounters a production like
A ⟹ αβ | α𝜸 | …
Then it cannot determine which production to follow to parse the string as both
productions are starting from the same terminal (or non-terminal). To remove this
confusion, we use a technique called left factoring.
Left factoring transforms the grammar to make it useful for top-down parsers. In this technique, we make one production for each common prefix and the rest of the derivation is added by new productions.
Example
The above productions can be written as
A => αA'
A'=> β | 𝜸 | …
Now the parser has only one production per prefix which makes it easier to take
decisions.
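A minimal sketch of left factoring for alternatives that share a one-symbol common prefix is shown below (the grammar representation and the single-symbol prefix are simplifying assumptions; real left factoring factors the longest common prefix):

def left_factor(nonterminal, productions):
    """Rewrite A -> α β | α γ | ...  as  A -> α A'  and  A' -> β | γ | ..."""
    prefix = productions[0][0]
    if not all(rhs and rhs[0] == prefix for rhs in productions):
        return {nonterminal: productions}        # no common prefix: leave unchanged
    new_nt = nonterminal + "'"
    tails = [rhs[1:] if len(rhs) > 1 else ["ε"] for rhs in productions]
    return {nonterminal: [[prefix, new_nt]], new_nt: tails}

# A -> α β | α γ   becomes   A -> α A'   and   A' -> β | γ
print(left_factor("A", [["α", "β"], ["α", "γ"]]))
# {'A': [['α', "A'"]], "A'": [['β'], ['γ']]}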
First and Follow Sets
An important part of parser table construction is to create the FIRST and FOLLOW sets. These sets provide the actual position of any terminal in the derivation. They are used to create the parsing table, where the decision to fill the entry T[A, t] with some production rule α is made.
First Set
This set is created to know what terminal symbol is derived in the first position
by a non-terminal. For example,
α→tβ
That is α derives t (terminal) in the very first position. So, t ∈ FIRST(α).
Algorithm for calculating the First set
Look at the definition of the FIRST(α) set:
 if α is a terminal, then FIRST(α) = { α }.
 if α is a non-terminal and α → ℇ is a production, then ℇ is in FIRST(α).
 if α is a non-terminal and α → 𝜸1 𝜸2 𝜸3 … 𝜸n, then t is in FIRST(α) if t is in FIRST(𝜸1); if ℇ is in FIRST(𝜸1), then FIRST(𝜸2) is also considered, and so on.
First set can be seen as: FIRST(α) = { t | α ⇒* t β }, together with ℇ if α ⇒* ℇ

Follow Set
Likewise, we calculate what terminal symbol immediately follows a non-terminal
α in production rules. We do not consider what the non-terminal can generate but
instead, we see what would be the next terminal symbol that follows the
productions of a non-terminal.
Algorithm for calculating the Follow set:
 if α is the start symbol, then $ is in FOLLOW(α).
 if α is a non-terminal and has a production α → AB, then FIRST(B), except ℇ, is in FOLLOW(A).
 if α is a non-terminal and has a production α → AB, where B ⇒ ℇ, then FOLLOW(α) is in FOLLOW(A).
Follow set can be seen as: FOLLOW(α) = { t | S ⇒* β α t δ }
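The FIRST and FOLLOW rules above can be turned into small fixed-point computations. The sketch below computes FIRST for a standard LL expression grammar; the grammar, its encoding as a dictionary, and the "ε" marker are assumptions made for this illustration. FOLLOW is computed by a very similar pass that uses these FIRST sets.

EPS = "ε"
GRAMMAR = {                       # E -> T E',  E' -> + T E' | ε,  T -> F T', ...
    "E":  [["T", "E'"]],
    "E'": [["+", "T", "E'"], [EPS]],
    "T":  [["F", "T'"]],
    "T'": [["*", "F", "T'"], [EPS]],
    "F":  [["(", "E", ")"], ["id"]],
}

def first_sets(grammar):
    """Compute FIRST for every non-terminal by iterating until nothing changes."""
    first = {nt: set() for nt in grammar}
    changed = True
    while changed:
        changed = False
        for nt, productions in grammar.items():
            for rhs in productions:
                for symbol in rhs:
                    f = first[symbol] if symbol in grammar else {symbol}
                    before = len(first[nt])
                    first[nt] |= f - {EPS}
                    changed |= len(first[nt]) != before
                    if EPS not in f:          # this symbol cannot vanish: stop here
                        break
                else:                         # every symbol of the rhs can derive ε
                    if EPS not in first[nt]:
                        first[nt].add(EPS)
                        changed = True
    return first

print(first_sets(GRAMMAR))
# FIRST(E) = {'(', 'id'}, FIRST(E') = {'+', 'ε'}, FIRST(T') = {'*', 'ε'}, ...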
Limitations of Syntax Analyzers
Syntax analyzers receive their inputs, in the form of tokens, from lexical analyzers. Lexical analyzers are responsible for the validity of the tokens supplied to the syntax analyzer. Syntax analyzers have the following drawbacks -
 it cannot determine if a token is valid,
 it cannot determine if a token is declared before it is being used,
 it cannot determine if a token is initialized before it is being used,
 it cannot determine if an operation performed on a token type is
valid or not.
These tasks are accomplished by the semantic analyzer, which we shall study in
Semantic Analysis.

Parser

A parser is the part of the compiler that is used to break the data coming from the lexical analysis phase into smaller elements.

A parser takes input in the form of sequence of tokens and produces output in the
form of parse tree.

Parsing is of two types: top down parsing and bottom up parsing.

Top down parsing

o Top down parsing is also known as recursive parsing or predictive parsing.
o Top down parsing is used to construct a parse tree for an input string.
o In top down parsing, the parsing starts from the start symbol and transforms it into the input symbols.

Parse Tree representation of input string "acdb" is as follows:


Bottom up parsing
o Bottom up parsing is also known as shift-reduce parsing.
o Bottom up parsing is used to construct a parse tree for an input string.
o In bottom up parsing, the parsing starts with the input symbols and constructs the parse tree up to the start symbol by tracing out the rightmost derivation of the string in reverse.

Example

Production

1. E→T
2. T→T*F
3. T → id
4. F→T
5. F → id

Parse Tree representation of input string "id * id" is as follows:


Bottom up parsing is classified into various types of parsing. These are as follows:

1. Shift-Reduce Parsing
2. Operator Precedence Parsing
3. Table Driven LR Parsing

a. LR( 1 )
b. SLR( 1 )
c. CLR ( 1 )
d. LALR( 1 )

Top-Down Parsing vs. Bottom-Up Parsing:

1. Top-Down Parsing: It is a parsing strategy that first looks at the highest level of the parse tree and works down the parse tree by using the rules of grammar. Bottom-Up Parsing: It is a parsing strategy that first looks at the lowest level of the parse tree and works up the parse tree by using the rules of grammar.
2. Top-Down Parsing: It attempts to find the left-most derivation for an input string. Bottom-Up Parsing: It can be defined as an attempt to reduce the input string to the start symbol of the grammar.
3. Top-Down Parsing: We start parsing from the top (the start symbol of the parse tree) down to the leaf nodes of the parse tree, in a top-down manner. Bottom-Up Parsing: We start parsing from the bottom (the leaf nodes of the parse tree) up to the start symbol of the parse tree, in a bottom-up manner.
4. Top-Down Parsing: This parsing technique uses Left Most Derivation. Bottom-Up Parsing: This parsing technique uses Right Most Derivation (in reverse).
5. Top-Down Parsing: The main decision is to select what production rule to use in order to construct the string. Bottom-Up Parsing: The main decision is to select when to use a production rule to reduce the string to get the starting symbol.
6. Top-Down Parsing: Example: Recursive Descent parser. Bottom-Up Parsing: Example: Shift Reduce parser.

Recursive Descent Parsing –

1. When a non-terminal is expanded for the first time, go with its first alternative and compare it with the given input string.
2. If a match does not occur, go with the second alternative and compare it with the given input string.
3. If a match is not found again, go with the next alternative, and so on.
4. If a match occurs for at least one alternative, then the input string is parsed successfully.
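A minimal recursive-descent sketch in this style is given below, for the small grammar E -> T + E | T and T -> int * T | int | ( E ) (the grammar and the token spelling "int" are assumptions chosen for this example). Each function tries its alternatives in order and returns the position reached, or None if no alternative matches:

def parse_E(tokens, pos):
    """E -> T + E | T"""
    p = parse_T(tokens, pos)                       # first alternative: T + E
    if p is not None and p < len(tokens) and tokens[p] == "+":
        q = parse_E(tokens, p + 1)
        if q is not None:
            return q
    return parse_T(tokens, pos)                    # second alternative: T

def parse_T(tokens, pos):
    """T -> int * T | int | ( E )"""
    if pos < len(tokens) and tokens[pos] == "int":
        if pos + 1 < len(tokens) and tokens[pos + 1] == "*":
            q = parse_T(tokens, pos + 2)           # first alternative: int * T
            if q is not None:
                return q
        return pos + 1                             # second alternative: int
    if pos < len(tokens) and tokens[pos] == "(":   # third alternative: ( E )
        q = parse_E(tokens, pos + 1)
        if q is not None and q < len(tokens) and tokens[q] == ")":
            return q + 1
    return None                                    # no alternative matched

def accepts(tokens):
    end = parse_E(tokens, 0)
    return end == len(tokens)                      # the whole input must be consumed

print(accepts(["int", "+", "int", "*", "int"]))    # True
print(accepts(["int", "+", "+"]))                  # False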
LL(1) or Table Driven or Predictive Parser –
1. In LL(1), the first L stands for Left-to-Right scanning and the second L stands for Left-most Derivation. The 1 stands for the number of look-ahead tokens used by the parser while parsing a sentence.
2. LL(1) parsing is constructed from the grammar which is free from
left recursion, common prefix, and ambiguity.
3. LL(1) parser depends on 1 look ahead symbol to predict the
production to expand the parse tree.
4. This parser is Non-Recursive.

LR Parser

LR parsing is one type of bottom up parsing. It is used to parse the large class of
grammars.

In the LR parsing, "L" stands for left-to-right scanning of the input.

"R" stands for constructing a right most derivation in reverse.

"K" is the number of input symbols of the look ahead used to make number of
parsing decision.

LR parsing is divided into four parts: LR (0) parsing, SLR parsing, CLR parsing
and LALR parsing.

LR algorithm:

The LR algorithm requires a stack, input, output and a parsing table. In all types of LR parsing, the input, output and stack are the same, but the parsing table is different.
Fig: Block diagram of LR parser

Input buffer is used to indicate end of input and it contains the string to be parsed
followed by a $ Symbol.

A stack is used to contain a sequence of grammar symbols with a $ at the bottom


of the stack.

Parsing table is a two dimensional array. It contains two parts: Action part and
Go To part.
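The driver loop is the same for every table-driven LR parser; only the table contents differ between LR(0), SLR, CLR and LALR. The sketch below is an illustration only: the ACTION and GOTO tables were written by hand for the tiny grammar S -> ( S ) | x (an assumption for this example), with production 1 being S -> ( S ) and production 2 being S -> x:

# 1: S -> ( S )      2: S -> x        (plus the augment production S' -> S)
PRODUCTIONS = {1: ("S", 3), 2: ("S", 1)}    # production number -> (left side, rhs length)

ACTION = {   # (state, lookahead) -> ("shift", state) | ("reduce", production) | ("accept",)
    (0, "("): ("shift", 2), (0, "x"): ("shift", 3),
    (1, "$"): ("accept",),
    (2, "("): ("shift", 2), (2, "x"): ("shift", 3),
    (3, ")"): ("reduce", 2), (3, "$"): ("reduce", 2),
    (4, ")"): ("shift", 5),
    (5, ")"): ("reduce", 1), (5, "$"): ("reduce", 1),
}
GOTO = {(0, "S"): 1, (2, "S"): 4}

def lr_parse(tokens):
    """Generic LR driver: a stack of states driven by the ACTION and GOTO parts."""
    stack, tokens, i = [0], tokens + ["$"], 0
    while True:
        entry = ACTION.get((stack[-1], tokens[i]))
        if entry is None:
            return False                          # syntax error
        if entry[0] == "shift":
            stack.append(entry[1]); i += 1
        elif entry[0] == "reduce":
            lhs, length = PRODUCTIONS[entry[1]]
            del stack[len(stack) - length:]       # pop one state per right-hand-side symbol
            stack.append(GOTO[(stack[-1], lhs)])  # then consult the GOTO part
        else:
            return True                           # accept

print(lr_parse(["(", "(", "x", ")", ")"]))   # True
print(lr_parse(["(", "x"]))                  # False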

LR (1) Parsing

Various steps involved in the LR (1) Parsing:

o For the given input string write a context free grammar.


o Check the ambiguity of the grammar.
o Add Augment production in the given grammar.
o Create Canonical collection of LR (0) items.
o Draw the DFA (deterministic finite automaton) for the canonical collection of items.
o Construct a LR (1) parsing table.

Augment Grammar

The augmented grammar G` is generated by adding one more production to the given grammar G. It helps the parser to identify when to stop parsing and announce the acceptance of the input.

Example

Given grammar
1. S → AA
2. A → aA | b

The Augment grammar G` is represented by

1. S`→ S
2. S → AA
3. A → aA | b

Comparison of SLR, Canonical LR (CLR) and LALR parsers:

1. SLR parsers are the easiest to implement. CLR parsers are difficult to implement. LALR parsers are more difficult to implement than SLR parsers but less difficult than CLR parsers.
2. SLR parsers make use of the canonical collection of LR(0) items for constructing the parsing tables. CLR parsers use the LR(1) collection of items for constructing the parsing tables. LALR parsers use the LR(1) collection of items, with items having the same core merged into a single item set.
3. SLR parsers do not do any lookahead, i.e., they look ahead zero symbols. CLR parsers look ahead one symbol. LALR parsers look ahead one symbol.
4. SLR parsers are cost-effective to construct in terms of time and space. CLR parsers are expensive to construct in terms of time and space. The cost of constructing LALR parsers is intermediate between SLR and CLR parsers.
5. SLR parsers have hundreds of states. CLR parsers have thousands of states. LALR parsers have hundreds of states, the same number of states as SLR parsers.
6. SLR parsers use FOLLOW information to guide reductions. CLR parsers use the lookahead symbol to guide reductions. LALR parsers use the lookahead symbol to guide reductions.
7. SLR parsers may fail to produce a table for certain classes of grammars on which the other two succeed. CLR parsers work on a very large class of grammars. LALR parsers work on a very large class of grammars.
8. Every SLR(1) grammar is an LR(1) grammar and an LALR(1) grammar. Every LR(1) grammar may not be an SLR(1) grammar. Every LALR(1) grammar may not be an SLR(1) grammar, but every LALR(1) grammar is an LR(1) grammar.
9. A shift-reduce or reduce-reduce conflict may arise in an SLR parsing table. In a CLR parsing table such conflicts may also arise, but the chances are less than in SLR parsing tables. In an LALR parsing table a shift-reduce conflict cannot arise, but a reduce-reduce conflict may arise.
10. The SLR parser is the least powerful. The CLR parser is the most powerful among the family of canonical bottom-up parsers. The LALR parser is intermediate in power between the SLR and CLR parsers.

Q. What are the two conflicts encountered in the LR parsing technique?
Ans. The shift-reduce conflict and the reduce-reduce conflict.


Error Recovery
A parser should be able to detect and report any error in the program. It is
expected that when an error is encountered, the parser should be able to handle it
and carry on parsing the rest of the input. Mostly it is expected from the parser to
check for errors but errors may be encountered at various stages of the
compilation process. A program may have the following kinds of errors at various
stages:
 Lexical : name of some identifier typed incorrectly
 Syntactical : missing semicolon or unbalanced parenthesis
 Semantical : incompatible value assignment
 Logical : code not reachable, infinite loop
There are four common error-recovery strategies that can be implemented in the
parser to deal with errors in the code.
Panic mode
When a parser encounters an error anywhere in the statement, it ignores the rest
of the statement by not processing input from erroneous input to delimiter, such
as semi-colon. This is the easiest way of error-recovery and also, it prevents the
parser from developing infinite loops.
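A minimal sketch of panic-mode recovery at the statement level is shown below: after an error the parser discards tokens up to a synchronizing delimiter, here the semicolon (the statement form 'id = id ;' and the helper names are assumptions made for this example), and then resumes parsing:

SYNC = ";"      # synchronizing delimiter assumed for this sketch

def parse_assignment(tokens, i):
    """Recognize the statement 'id = id ;' and return the index just past the ';'."""
    if tokens[i:i + 4] == ["id", "=", "id", ";"]:
        return i + 4
    raise SyntaxError("malformed assignment starting at token %d" % i)

def parse_statements(tokens):
    """Parse statement by statement, skipping to ';' after each error (panic mode)."""
    i, errors = 0, 0
    while i < len(tokens):
        try:
            i = parse_assignment(tokens, i)
        except SyntaxError:
            errors += 1
            while i < len(tokens) and tokens[i] != SYNC:
                i += 1                            # panic: discard erroneous tokens
            i += 1                                # skip the delimiter and resume
    return errors

tokens = ["id", "=", "id", ";",    # a correct statement
          "id", "+", ";",          # an erroneous statement: skipped up to ';'
          "id", "=", "id", ";"]    # parsing resumes here
print(parse_statements(tokens))    # 1 error reported, parsing continued to the end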
Statement mode
When a parser encounters an error, it tries to take corrective measures so that the
rest of inputs of statement allow the parser to parse ahead. For example, inserting
a missing semicolon, replacing comma with a semicolon etc. Parser designers
have to be careful here because one wrong correction may lead to an infinite loop.
Error productions
Some common errors are known to the compiler designers that may occur in the
code. In addition, the designers can create augmented grammar to be used, as
productions that generate erroneous constructs when these errors are encountered.
Global correction
The parser considers the program in hand as a whole and tries to figure out what
the program is intended to do and tries to find out a closest match for it, which is
error-free. When an erroneous input (statement) X is fed, it creates a parse tree
for some closest error-free statement Y. This may allow the parser to make
minimal changes in the source code, but due to the complexity (time and space)
of this strategy, it has not been implemented in practice yet.
Abstract Syntax Trees
Parse tree representations are not easy for the compiler to work with, as they contain more details than actually needed. Take the following parse tree as an example:

If watched closely, we find most of the leaf nodes are single child to their parent
nodes. This information can be eliminated before feeding it to the next phase. By
hiding extra information, we can obtain a tree as shown below:

Abstract tree can be represented as:

ASTs are important data structures in a compiler, containing the least unnecessary information. ASTs are more compact than a parse tree and can be easily used by a compiler.
Semantic Analysis is the third phase of Compiler. Semantic Analysis makes
sure that declarations and statements of program are semantically correct. It is
a collection of procedures which is called by parser as and when required by
grammar. Both syntax tree of previous phase and symbol table are used to
check the consistency of the given code. Type checking is an important part
of semantic analysis where compiler makes sure that each operator has
matching operands.
Semantic Analyzer:
It uses syntax tree and symbol table to check whether the given program is
semantically consistent with language definition. It gathers type information
and stores it in either syntax tree or symbol table. This type information is
subsequently used by compiler during intermediate-code generation.
Semantic Errors:
Errors recognized by semantic analyzer are as follows:
 Type mismatch
 Undeclared variables
 Reserved identifier misuse
Functions of Semantic Analysis:
1. Type Checking –
Ensures that data types are used in a way consistent with their
definition.
2. Label Checking –
A program should contain valid label references.
3. Flow Control Check –
Keeps a check that control structures are used in a proper
manner.(example: no break statement outside a loop)
Example:
float x = 10.1;
float y = x*30;
In the above example, the integer 30 will be type-cast to float 30.0 before the multiplication by the semantic analyzer.
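A minimal sketch of this kind of check is shown below: a hypothetical type checker walks a tiny expression tree, computes operand types, and notes where an int-to-float conversion has to be inserted (the tuple-based node representation and the function names are assumptions made for this illustration):

def check(node, symbol_table):
    """Return the type of an expression node, inserting IntToFloat where needed."""
    kind = node[0]
    if kind == "num":                       # ("num", 30) or ("num", 10.1)
        return "float" if isinstance(node[1], float) else "int"
    if kind == "id":                        # ("id", "x") - the type comes from the symbol table
        return symbol_table[node[1]]
    if kind == "mul":                       # ("mul", left, right)
        left = check(node[1], symbol_table)
        right = check(node[2], symbol_table)
        if left == right:
            return left
        if {left, right} == {"int", "float"}:
            print("inserting IntToFloat conversion")   # repairable mismatch: coerce the int
            return "float"
        raise TypeError("incompatible operand types: %s * %s" % (left, right))
    raise ValueError("unknown node kind: %s" % kind)

# float y = x * 30;   with   float x;
print(check(("mul", ("id", "x"), ("num", 30)), {"x": "float"}))
# prints the conversion note, then: float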
Static and Dynamic Semantics:
1. Static Semantics –
It is named so because these checks are performed at compile time. The static semantics and the meaning of the program during execution are only indirectly related.
2. Dynamic Semantic Analysis –
It defines the meaning of different units of program like expressions
and statements. These are checked at runtime unlike static semantics.
Syntax Directed Translation in Compiler Design
Parser uses a CFG(Context-free-Grammar) to validate the input string and
produce output for the next phase of the compiler. Output could be either a
parse tree or an abstract syntax tree. Now to interleave semantic analysis with
the syntax analysis phase of the compiler, we use Syntax Directed Translation.

Conceptually, with both syntax-directed definition and translation schemes, we


parse the input token stream, build the parse tree, and then traverse the tree as
needed to evaluate the semantic rules at the parse tree nodes. Evaluation of the
semantic rules may generate code, save information in a symbol table, issue
error messages, or perform any other activities. The translation of the token
stream is the result obtained by evaluating the semantic rules.
Definition
Syntax Directed Translation has augmented rules to the grammar that facilitate
semantic analysis. SDT involves passing information bottom-up and/or top-
down to the parse tree in form of attributes attached to the nodes. Syntax-
directed translation rules use 1) lexical values of nodes, 2) constants & 3)
attributes associated with the non-terminals in their definitions.
The general approach to Syntax-Directed Translation is to construct a parse tree
or syntax tree and compute the values of attributes at the nodes of the tree by
visiting them in some order. In many cases, translation can be done during
parsing without building an explicit tree.
Example

E -> E+T | T
T -> T*F | F
F -> INTLIT
This is a grammar to syntactically validate an expression having additions and
multiplications in it. Now, to carry out semantic analysis we will augment SDT
rules to this grammar, in order to pass some information up the parse tree and
check for semantic errors, if any. In this example, we will focus on the
evaluation of the given expression, as we don’t have any semantic assertions to
check in this very basic example.

E -> E+T { E.val = E.val + T.val } PR#1


E -> T { E.val = T.val } PR#2
T -> T*F { T.val = T.val * F.val } PR#3
T -> F { T.val = F.val } PR#4
F -> INTLIT { F.val = INTLIT.lexval } PR#5
For understanding translation rules further, we take the first SDT augmented to
[ E -> E+T ] production rule. The translation rule in consideration has val as an
attribute for both the non-terminals – E & T. Right-hand side of the translation
rule corresponds to attribute values of the right-side nodes of the production rule
and vice-versa. Generalizing, SDT are augmented rules to a CFG that associate
1) set of attributes to every node of the grammar and 2) a set of translation rules
to every production rule using attributes, constants, and lexical values.
Let’s take a string to see how semantic analysis happens – S = 2+3*4. Parse tree
corresponding to S would be

To evaluate the translation rules, we can employ one depth-first traversal of the parse tree. This works because, for a grammar having all synthesized attributes, SDT rules don't impose any specific order of evaluation as long as children's attributes are computed before their parents'. Otherwise, we would have to figure out the best-suited plan to traverse the parse tree and evaluate all the attributes in one or more traversals. For better understanding, we will move bottom-up, in left-to-right fashion, for computing the translation rules of our example.
The above diagram shows how semantic analysis could happen. The flow of
information happens bottom-up and all the children’s attributes are computed
before parents, as discussed above. Right-hand side nodes are sometimes
annotated with subscript 1 to distinguish between children and parents.
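A minimal sketch of this bottom-up evaluation of synthesized attributes is given below: each node of the already-built parse tree for 2 + 3 * 4 computes its val attribute from its children, exactly as rules PR#1–PR#5 prescribe. The tuple encoding of the tree is an assumption made for this illustration:

# The parse tree for 2 + 3 * 4, written as ("rule", children...) tuples.
tree = ("E+T",
        ("E->T", ("T->F", ("F", 2))),
        ("T*F", ("T->F", ("F", 3)), ("F", 4)))

def val(node):
    """Evaluate the synthesized attribute 'val' bottom-up (children before parents)."""
    rule = node[0]
    if rule == "F":                           # F.val = INTLIT.lexval          (PR#5)
        return node[1]
    if rule in ("T->F", "E->T"):              # T.val = F.val,  E.val = T.val  (PR#4, PR#2)
        return val(node[1])
    if rule == "T*F":                         # T.val = T.val * F.val          (PR#3)
        return val(node[1]) * val(node[2])
    if rule == "E+T":                         # E.val = E.val + T.val          (PR#1)
        return val(node[1]) + val(node[2])

print(val(tree))   # 14, i.e. the value of 2 + 3 * 4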
Additional Information
Synthesized Attributes are such attributes that depend only on the attribute
values of children nodes.
Thus [ E -> E+T { E.val = E.val + T.val } ] has a synthesized attribute val
corresponding to node E. If all the semantic attributes in an augmented grammar
are synthesized, one depth-first search traversal in any order is sufficient for the
semantic analysis phase.
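Because every attribute in the grammar above is synthesized, the rules PR#1–PR#5 can be evaluated in a single bottom-up pass. The sketch below is a minimal illustration of this idea, assuming a hand-written recursive-descent evaluator over a string of single-digit integer literals, + and * (the left recursion of the grammar is handled with loops); each parse function returns the synthesized val attribute of its non-terminal.

 C

/* Minimal sketch: evaluating the S-attributed SDT
 * E -> E+T | T,  T -> T*F | F,  F -> INTLIT
 * Each parse function returns the synthesized attribute val.
 * Assumes the input is a valid expression over single-digit literals. */
#include <stdio.h>

static const char *p;               /* cursor into the input string */

static int parse_F(void)            /* F -> INTLIT  { F.val = INTLIT.lexval } */
{
    return *p++ - '0';
}

static int parse_T(void)            /* T -> T*F | F */
{
    int val = parse_F();            /* T.val = T.val * F.val   (PR#3, PR#4) */
    while (*p == '*') {
        p++;
        val = val * parse_F();
    }
    return val;
}

static int parse_E(void)            /* E -> E+T | T */
{
    int val = parse_T();            /* E.val = E.val + T.val   (PR#1, PR#2) */
    while (*p == '+') {
        p++;
        val = val + parse_T();
    }
    return val;
}

int main(void)
{
    p = "2+3*4";
    printf("%d\n", parse_E());      /* prints 14 */
    return 0;
}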
Inherited Attributes are such attributes that depend on parent and/or sibling’s
attributes.
Thus [ Ep -> E+T { Ep.val = E.val + T.val, T.val = Ep.val } ], where E and Ep are
the same non-terminal, annotated to distinguish the parent from the child,
has an inherited attribute val corresponding to node T.

Advantages of Syntax Directed Translation:


Ease of implementation: SDT is a simple and easy-to-implement method for
translating a programming language. It provides a clear and structured way to
specify translation rules using grammar rules.
Separation of concerns: SDT separates the translation process from the parsing
process, making it easier to modify and maintain the compiler. It also separates
the translation concerns from the parsing concerns, allowing for more modular
and extensible compiler designs.
Efficient code generation: SDT enables the generation of efficient code by
optimizing the translation process. It allows for the use of techniques such as
intermediate code generation and code optimization.

Disadvantages of Syntax Directed Translation:

Limited expressiveness: SDT has limited expressiveness in comparison to
other translation methods, such as attribute grammars. This limits the types of
translations that can be performed using SDT.
Inflexibility: SDT can be inflexible in situations where the translation rules are
complex and cannot be easily expressed using grammar rules.
Limited error recovery: SDT is limited in its ability to recover from errors
during the translation process. This can result in poor error messages and may
make it difficult to locate and fix errors in the input program.

S-attributed and L-attributed SDTs in Syntax Directed Translation


Before coming to S-attributed and L-attributed SDTs, here is a brief
intro to the types of attributes.
Attributes may be of two types – Synthesized or Inherited.
1. Synthesized attributes – A Synthesized attribute is an
attribute of the non-terminal on the left-hand side of a
production. Synthesized attributes represent information that
is being passed up the parse tree. The attribute can take value
only from its children (Variables in the RHS of the production).
For example, let’s say A -> BC is a production of a grammar, and
A’s attribute is dependent on B’s attributes or C’s attributes;
then it will be a synthesized attribute.
2. Inherited attributes – An attribute of a nonterminal on the
right-hand side of a production is called an inherited attribute.
The attribute can take value either from its parent or from its
siblings (variables in the LHS or RHS of the production). For
example, let’s say A -> BC is a production of a grammar, and
B’s attribute is dependent on A’s attributes or C’s attributes;
then it will be an inherited attribute.
Now, let’s discuss about S-attributed and L-attributed SDT.
1. S-attributed SDT :
 If an SDT uses only synthesized attributes, it is called
as S-attributed SDT.
 S-attributed SDTs are evaluated in bottom-up
parsing, as the values of the parent nodes depend
upon the values of the child nodes.
 Semantic actions are placed at the rightmost end of the
RHS.
2. L-attributed SDT:
 If an SDT uses both synthesized attributes and
inherited attributes with a restriction that inherited
attribute can inherit values from left siblings only, it is
called as L-attributed SDT.
 Attributes in L-attributed SDTs are evaluated in a
depth-first, left-to-right traversal.
 Semantic actions may be placed anywhere in the RHS, as
the sketch below illustrates.
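In hand-written recursive-descent parsers, inherited attributes are commonly passed down as function parameters and synthesized attributes are returned as values. The sketch below illustrates this for a hypothetical declaration grammar (D -> T L, T -> int | float, L -> id , L | id); it is only an illustration of the idea, not part of the notes’ running example.

 C

/* Minimal sketch of an L-attributed SDT evaluated during recursive descent.
 * Hypothetical grammar:  D -> T L      { L.inh = T.type }
 *                        T -> int      { T.type = INT }
 *                        T -> float    { T.type = FLOAT }
 *                        L -> id , L   { add(id, L.inh); L1.inh = L.inh }
 *                        L -> id       { add(id, L.inh) }
 * The inherited attribute L.inh travels down as a parameter;
 * the synthesized attribute T.type travels up as a return value. */
#include <stdio.h>
#include <string.h>

typedef enum { TYPE_INT, TYPE_FLOAT } Type;

static void add_to_symtab(const char *name, Type t)   /* stand-in for a real symbol table */
{
    printf("%s : %s\n", name, t == TYPE_INT ? "int" : "float");
}

static Type parse_T(const char *keyword)               /* returns the synthesized T.type */
{
    return strcmp(keyword, "int") == 0 ? TYPE_INT : TYPE_FLOAT;
}

static void parse_L(const char **ids, int n, Type inh) /* inh is the inherited L.inh */
{
    if (n == 0) return;
    add_to_symtab(ids[0], inh);                        /* L -> id        (uses L.inh)     */
    parse_L(ids + 1, n - 1, inh);                      /* L -> id , L    (L1.inh = L.inh) */
}

int main(void)
{
    const char *ids[] = { "a", "b", "c" };
    Type t = parse_T("float");                         /* D -> T L */
    parse_L(ids, 3, t);                                /* pass T.type down as L.inh */
    return 0;
}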

Type Checking in Compiler Design

Type checking is the process of verifying and enforcing constraints of types
on values. A compiler must check that the source program follows both the
syntactic and semantic conventions of the source language, and it must also
check the type rules of the language. Type checking allows the programmer to
limit what types may be used in certain circumstances and assigns types to
values. The type-checker determines whether these values are used
appropriately or not. It checks the types of objects and reports a type error in
the case of a violation; where the language permits, implicit conversions may
be inserted to repair a mismatch. Whatever compiler we use, while it is
compiling the program it has to follow the type rules of the language, and
every language has its own set of type rules. Information about data types
such as INTEGER, FLOAT, and CHARACTER is maintained and computed by
the compiler. The type checker is the module of the compiler whose task is
type checking.

Conversion

Conversion from one type to another type is known as implicit if it is
done automatically by the compiler. Implicit type conversions are also
called coercion, and coercion is limited in many languages.
Example: an integer may be converted to a real, but a real is not converted to
an integer.
Conversion is said to be explicit if the programmer writes something (such as
a cast) to perform the conversion; an example of each case follows.
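A minimal C illustration of both cases (the variable names are arbitrary):

 C

#include <stdio.h>

int main(void)
{
    int i = 7;
    double d = i;          /* implicit conversion (coercion): int -> double */

    double x = 9.99;
    int t = (int) x;       /* explicit conversion: the programmer writes a cast */

    printf("%f %d\n", d, t);   /* prints 7.000000 9 */
    return 0;
}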
Tasks:
1. The type checker has to ensure that indexing is applied only to arrays.
2. It has to check the ranges of the data types used; for example, a 16-bit
INTEGER (int) has a range of -32,768 to +32,767, and FLOAT has a range
of roughly 1.2E-38 to 3.4E+38.

Types of Type Checking:

There are two kinds of type checking:


1. Static Type Checking.
2. Dynamic Type Checking.

Static Type Checking:

Static type checking is defined as type checking performed at compile time. It
checks the types of variables at compile time, which means the type of each
variable is known at compile time. It generally examines the program text
during the translation of the program. Using the type rules of a system, a
compiler can infer from the source text that a function (fun) will be applied to
an operand (a) of the right type each time the expression fun(a) is evaluated.
Examples of Static checks include:
 Type-checks: A compiler should report an error if an operator is
applied to an incompatible operand. For example, if an array
variable and function variable are added together.
 Flow-of-control checks: Statements that cause the flow of
control to leave a construct must have someplace to which to
transfer the flow of control. For example, a break statement in C
causes control to leave the smallest enclosing while, for, or switch
statement; an error occurs if such an enclosing statement does not
exist.
 Uniqueness checks: There are situations in which an object must
be defined only once. For example, in Pascal an identifier must be
declared uniquely, labels in a case statement must be distinct, and
elements of an enumerated (scalar) type may not be repeated.
 Name-related checks: Sometimes the same name may appear two
or more times. For example in Ada, a loop may have a name that
appears at the beginning and end of the construct. The compiler
must check that the same name is used at both places.
The Benefits of Static Type Checking:
1. Runtime Error Protection.
2. It catches syntactic errors like spurious words or extra punctuation.
3. It catches wrong names like Math and Predefined Naming.
4. Detects incorrect argument types.
5. It catches the wrong number of arguments.
6. It catches wrong return types, like return “70”, from a function that’s
declared to return an int.

Dynamic Type Checking:

Dynamic type checking is defined as type checking done at run
time. In dynamic type checking, types are associated with values, not
variables. In implementations of dynamically type-checked languages, runtime
objects generally carry a type tag, which is a reference to a type containing
its type information. Dynamic typing is more flexible: a static type system
always restricts what can be conveniently expressed, and dynamic typing
results in more compact programs since it does not require types to be
spelled out. Programming with a static type system often requires more
design and implementation effort.
Languages like Pascal and C have static type checking. Type checking is used
to check the correctness of the program before its execution. The main
purpose of type checking is to verify the correctness of data-type
assignments and type casts before the program executes.
Static Type-Checking is also used to determine the amount of memory
needed to store the variable.
The design of the type-checker depends on:
1. Syntactic Structure of language constructs.
2. The Expressions of languages.
3. The rules for assigning types to constructs (semantic rules).
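As a concrete, hypothetical illustration of these ingredients, the sketch below type-checks a tiny expression tree with only int and float and a single coercion rule (int may be widened to float). The Node structure and check_expr function are invented names for this sketch, not the type system of any particular language.

 C

/* Minimal sketch of a type checker for a tiny expression language. */
#include <stdio.h>
#include <stdlib.h>

typedef enum { T_INT, T_FLOAT, T_ERROR } Type;
typedef enum { K_NUM_INT, K_NUM_FLOAT, K_ADD } Kind;

typedef struct Node {
    Kind kind;
    struct Node *left, *right;       /* used only by K_ADD */
} Node;

static Type check_expr(Node *n)
{
    switch (n->kind) {
    case K_NUM_INT:   return T_INT;
    case K_NUM_FLOAT: return T_FLOAT;
    case K_ADD: {
        Type l = check_expr(n->left);
        Type r = check_expr(n->right);
        if (l == T_ERROR || r == T_ERROR) return T_ERROR;
        if (l == T_INT && r == T_INT)     return T_INT;
        /* coercion rule assumed here: int operands are widened to float */
        return T_FLOAT;
    }
    }
    return T_ERROR;
}

int main(void)
{
    Node i   = { K_NUM_INT,   NULL, NULL };
    Node f   = { K_NUM_FLOAT, NULL, NULL };
    Node sum = { K_ADD, &i, &f };
    printf("type of (int + float): %s\n",
           check_expr(&sum) == T_FLOAT ? "float" : "int");
    return 0;
}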

The Position of the Type checker in the Compiler:

Type checking in Compiler

The token stream from the lexical analyzer is passed to the PARSER. The
PARSER generates a syntax tree. When a program (source code) is
converted into a syntax tree, the type-checker plays a crucial role. By
examining the syntax tree, it can tell whether each data type is handling the
correct variable or not. The type-checker checks the tree and, where
modifications (such as implicit conversions) are needed, it applies them. It
produces a (possibly annotated) syntax tree, and after that, INTERMEDIATE
CODE generation is done.

Intermediate code links the source program to the machine program.
Intermediate code is generated because the compiler can’t generate machine code
directly in one pass. Therefore, it first converts the source program into intermediate
code, from which efficient machine code can then be generated. The intermediate
code can be represented in the form of postfix notation, a syntax tree, a directed acyclic
graph, three address code, quadruples, or triples.
If we divide the compiler stages into two parts, i.e., front end and back end, then this
phase comes in between.

Three address code in Compiler

Three address code is a type of intermediate code which is easy to generate
and can be easily converted to machine code. It makes use of at most three
addresses and one operator to represent an expression, and the value computed at
each instruction is stored in a temporary variable generated by the compiler. The
compiler decides the order of operations when it generates the three address code.

Three address code is used in compilers for:

Optimization: Three address code is often used as an intermediate
representation during the optimization phases of the compilation process.
It allows the compiler to analyze the code and perform optimizations that
can improve the performance of the generated code.
Code generation: Three address code can also be used as an intermediate
representation of code during the code generation phase of the compilation
process. The three address code allows the compiler to generate code that is
specific to the target platform, while also ensuring that the generated code is
correct and efficient.
Debugging: Three address code can be helpful in debugging the code generated
by the compiler. Since three address code is a low-level language, it is often
easier to read and understand than the final generated code. Developers can use
the three address code to trace the execution of the program and identify errors
or issues that may be present.
Language translation: Three address code can also be used to translate code
from one programming language to another. By translating code to a common
intermediate representation, it becomes easier to translate the code to multiple
target languages.
General representation –
a = b op c
where a, b, and c represent operands such as names, constants, or compiler-generated
temporaries, and op represents the operator.
Example-1: Convert the expression a * – (b + c) into three address code.
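One possible three-address sequence, using compiler-generated temporaries t1–t3:
t1 = b + c
t2 = uminus t1
t3 = a * t2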

Example-2: Write three address code for following code


for(i = 1; i<=10; i++)
{
a[i] = x * 5;
}
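One possible translation, assuming 4-byte array elements and a label L for the loop body (both assumptions are made only for this sketch):
i = 1
L: t1 = x * 5
t2 = i * 4
a[t2] = t1
i = i + 1
if i <= 10 goto L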

Implementation of Three Address Code –


There are 3 representations of three address code namely
1. Quadruple
2. Triples
3. Indirect Triples
1. Quadruple – It is a structure which consists of 4 fields, namely op, arg1, arg2,
and result. op denotes the operator, arg1 and arg2 denote the two operands,
and result is used to store the result of the expression.
Advantage –
 Easy to rearrange code for global optimization.
 One can quickly access the value of temporary variables using the symbol table.
Disadvantage –
 Contains a lot of temporaries.
 Temporary variable creation increases time and space complexity.
Example – Consider expression a = b * – c + b * – c. The three address code is:
t1 = uminus c
t2 = b * t1
t3 = uminus c
t4 = b * t3
t5 = t2 + t4
a = t5
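These instructions can be tabulated as quadruples as follows (the instruction numbering is only illustrative):

      op       arg1    arg2    result
(0)   uminus   c               t1
(1)   *        b       t1      t2
(2)   uminus   c               t3
(3)   *        b       t3      t4
(4)   +        t2      t4      t5
(5)   =        t5              a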

2. Triples – This representation doesn’t make use of an extra temporary variable to
represent a single operation; instead, when a reference to another triple’s value is
needed, a pointer to that triple is used. So, it consists of only three fields, namely
op, arg1 and arg2.
Disadvantage –
 Temporaries are implicit, and it is difficult to rearrange code.
 It is difficult to optimize because optimization involves moving
intermediate code. When a triple is moved, any other triple referring to
it must be updated as well. (With quadruples, by contrast, one can
directly access a temporary through its symbol table entry.)
Example – Consider expression a = b * – c + b * – c
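The same expression written as triples, where a parenthesized number refers to the value of an earlier triple:

      op       arg1    arg2
(0)   uminus   c
(1)   *        b       (0)
(2)   uminus   c
(3)   *        b       (2)
(4)   +        (1)     (3)
(5)   =        a       (4)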
3. Indirect Triples – This representation makes use of pointers to a listing of
all references to computations, which is made separately and stored. It is similar
in utility to the quadruple representation but requires less space.
Temporaries are implicit, and the code is easier to rearrange.
Example – Consider expression a = b * – c + b * – c
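With indirect triples, a separate statement list holds pointers to the triples, so reordering the statements does not disturb the triples themselves (the triple addresses 35–40 below are arbitrary):

Statement list:             Triples:
(0) -> (35)                 (35)   uminus   c
(1) -> (36)                 (36)   *        b      (35)
(2) -> (37)                 (37)   uminus   c
(3) -> (38)                 (38)   *        b      (37)
(4) -> (39)                 (39)   +        (36)   (38)
(5) -> (40)                 (40)   =        a      (39)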

Question – Write quadruple, triples and indirect triples for following expression
: (x + y) * (y + z) + (x + y + z)
Explanation – The three address code is:
t1 = x + y
t2 = y + z
t3 = t1 * t2
t4 = t1 + z
t5 = t3 + t4
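Written as quadruples, this becomes (triples and indirect triples follow the same pattern, referring to the positions (0)–(4) directly or through a separate pointer list):

      op    arg1    arg2    result
(0)   +     x       y       t1
(1)   +     y       z       t2
(2)   *     t1      t2      t3
(3)   +     t1      z       t4
(4)   +     t3      t4      t5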
Intermediate Code Generation in Compiler Design
In the analysis-synthesis model of a compiler, the front end of a compiler
translates a source program into an independent intermediate code, then the
back end of the compiler uses this intermediate code to generate the target
code (which can be understood by the machine). The benefits of using
machine-independent intermediate code are:
 Because of the machine-independent intermediate code, portability
is enhanced. For example, if a compiler translates the source
language directly to its target machine language, without the option of
generating intermediate code, then for each new machine a full
native compiler is required, because the compiler itself must be
modified according to the machine specifications.
 Retargeting is facilitated.
 It is easier to apply source-code-improving transformations by
optimizing the intermediate code.
If we generate machine code directly from source code, then for n target
machines we need n optimizers and n code generators, but with a
machine-independent intermediate code we need only one optimizer (and one
code generator per machine).
Intermediate code can be either language-specific (e.g., bytecode for Java) or
language-independent (e.g., three-address code). The following are commonly used
intermediate code representations:
1. Postfix Notation: Also known as reverse Polish notation or suffix
notation. The ordinary (infix) way of writing the sum of a and b is
with an operator in the middle: a + b The postfix notation for the
same expression places the operator at the right end as ab +. In
general, if e1 and e2 are any postfix expressions and + is any binary
operator, the result of applying + to the values denoted by e1 and e2
is written in postfix notation as e1 e2 +. No parentheses are needed in
postfix notation because the position and arity (number of arguments) of the
operators permit only one way to decode a postfix expression. In
postfix notation, the operator follows its operands.
Example 1: The postfix representation of the expression (a + b) * c
is: ab + c *
Example 2: The postfix representation of the expression (a – b) * (c
+ d) + (a – b) is: ab – cd + * ab – +

2. Three-Address Code: A statement involving no more than three
references (two for operands and one for the result) is known as a
three-address statement. A sequence of three-address statements is known
as three-address code. A three-address statement is of the form x = y op
z, where x, y, and z have addresses (memory locations). Sometimes
a statement might contain fewer than three references, but it is still
called a three-address statement.
Example: The three-address code for the expression a + b * c + d is:
T1 = b * c
T2 = a + T1
T3 = T2 + d
where T1, T2, T3 are temporary variables.
There are 3 ways to represent a Three-Address Code in compiler
design:
i) Quadruples
ii) Triples
iii) Indirect Triples

3. Syntax Tree: A syntax tree is nothing more than a condensed form
of a parse tree. The operator and keyword nodes of the parse tree are
moved to their parents, and a chain of single productions is replaced
by a single link. In the syntax tree, internal nodes are operators
and leaf nodes are operands. To form a syntax tree, put parentheses
in the expression; this way it’s easy to recognize which operand
should come first.
Example: x = (a + b * c) / (a – b * c)
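A text sketch of the corresponding syntax tree:

            =
          /   \
         x     /
             /   \
            +     -
           / \   / \
          a   * a   *
             / \   / \
            b   c b   c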

Advantages of Intermediate Code Generation:


Easier to implement: Intermediate code generation can simplify the code
generation process by reducing the complexity of the input code, making it
easier to implement.
Facilitates code optimization: Intermediate code generation can enable the
use of various code optimization techniques, leading to improved performance
and efficiency of the generated code.
Platform independence: Intermediate code is platform-independent, meaning
that it can be translated into machine code or bytecode for any platform.
Code reuse: Intermediate code can be reused in the future to generate code for
other platforms or languages.
Easier debugging: Intermediate code can be easier to debug than machine
code or bytecode, as it is closer to the original source code.

Disadvantages of Intermediate Code Generation:

Increased compilation time: Intermediate code generation can significantly
increase the compilation time, making it less suitable for real-time or time-
critical applications.
Additional memory usage: Intermediate code generation requires additional
memory to store the intermediate representation, which can be a concern for
memory-limited systems.
Increased complexity: Intermediate code generation can increase the
complexity of the compiler design, making it harder to implement and
maintain.
Reduced performance: The process of generating intermediate code can
result in code that executes slower than code generated directly from the
source code.

Code Optimization in Compiler Design


The code optimization in the synthesis phase is a program transformation
technique, which tries to improve the intermediate code by making it
consume fewer resources (i.e. CPU, Memory) so that faster-running machine
code will result. The compiler optimization process should meet the following
objectives:
 The optimization must be correct; it must not, in any way, change
the meaning of the program.
 Optimization should increase the speed and performance of the
program.
 The compilation time must be kept reasonable.
 The optimization process should not delay the overall compiling
process.

When to Optimize?

Optimization of the code is often performed at the end of the development
stage since it reduces readability and adds code that is used to increase the
performance.

Why Optimize?

Optimizing an algorithm is beyond the scope of the code-optimization
phase, so the compiler optimizes the program as written, which may involve
reducing the size of the code. Optimization helps to:
 Reduce the space consumed and increase the speed of
compilation.
 Manually analyzing datasets takes a lot of time, so we use
software like Tableau for data analysis; similarly, manually
performing the optimization is tedious and is better done
using a code optimizer.
 Optimized code often promotes re-usability.
Types of Code Optimization: The optimization process can be broadly
classified into two types :
1. Machine Independent Optimization: This code optimization phase
attempts to improve the intermediate code to get a better target
code as the output. The part of the intermediate code which is
transformed here does not involve any CPU registers or absolute
memory locations.
2. Machine Dependent Optimization: Machine-dependent
optimization is done after the target code has been generated and
when the code is transformed according to the target machine
architecture. It involves CPU registers and may have absolute
memory references rather than relative references. Machine-dependent
optimizers put effort into taking maximum advantage of the memory
hierarchy.
Code Optimization is done in the following different ways:

1. Compile Time Evaluation:


 C

(i) A = 2*(22.0/7.0)*r
Perform 2*(22.0/7.0)*r at compile time.
(ii) x = 12.4
y = x/2.3
Evaluate x/2.3 as 12.4/2.3 at compile time.

2. Variable Propagation:

 C

//Before Optimization
c = a * b
x = a
// ...
d = x * b + 4

//After Optimization
c = a * b
x = a
// ...
d = a * b + 4

3. Constant Propagation:
If the value of a variable is known to be a constant, replace the variable
with the constant. (A variable may not always hold a constant, so this has
to be verified first.)
Example:

 C

(i) A = 2*(22.0/7.0)*r
Performs 2*(22.0/7.0)*r at compile time.
(ii) x = 12.4
y = x/2.3
Evaluates x/2.3 as 12.4/2.3 at compile time.
(iii) int k=2;
if(k) go to L3;
It is evaluated as :
go to L3 ( Because k = 2 which implies condition is always true)
4. Constant Folding:

Consider an expression a = b op c. If the values of b and c are
constants, then the value of a can be computed at compile time.
Example:

 C

#define k 5
x = 2 * k
y = k + 5

This can be computed at compile time and the values of x and y are :
x = 10
y = 10

Note: Difference between Constant Propagation and Constant Folding:


 In Constant Propagation, the variable is substituted with its
assigned constant, whereas in Constant Folding, expressions
whose values can be computed at compile time are evaluated
and replaced by their results.
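As a minimal sketch of how a compiler might perform constant folding on an expression tree (the Node structure, node kinds, and fold function are invented names for illustration, not any particular compiler's API):

 C

/* Sketch: fold a binary node whose children are both integer constants. */
#include <stdlib.h>

typedef enum { N_CONST, N_ADD, N_MUL } NodeKind;

typedef struct Node {
    NodeKind kind;
    int value;                       /* valid when kind == N_CONST */
    struct Node *left, *right;       /* valid for N_ADD / N_MUL    */
} Node;

static Node *fold(Node *n)
{
    if (n == NULL || n->kind == N_CONST)
        return n;

    n->left  = fold(n->left);        /* fold subtrees first (bottom-up) */
    n->right = fold(n->right);

    if (n->left->kind == N_CONST && n->right->kind == N_CONST) {
        int l = n->left->value, r = n->right->value;
        n->value = (n->kind == N_ADD) ? l + r : l * r;
        n->kind  = N_CONST;          /* replace the operation by its result */
        n->left = n->right = NULL;   /* children are no longer needed */
    }
    return n;
}

int main(void)
{
    Node a   = { N_CONST, 2, NULL, NULL };
    Node b   = { N_CONST, 5, NULL, NULL };
    Node mul = { N_MUL,   0, &a, &b };     /* represents 2 * 5 */
    Node *r  = fold(&mul);
    return r->value == 10 ? 0 : 1;         /* folded to the constant 10 */
}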

5. Copy Propagation:

It is an extension of constant propagation.
 After the copy x = a, use a in place of x until a (or x) is assigned
another value or expression.
 It helps in reducing run time, as it eliminates redundant copies.
Example :

 C

//Before Optimization
c = a * b
x = a
// ...
d = x * b + 4

//After Optimization
c = a * b
x = a
// ...
d = a * b + 4

6. Common Sub Expression Elimination:

 In the above example, after copy propagation, a * b and x * b denote the
same computation, so a * b is a common sub-expression that can be
computed once and reused, as the example below shows.
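A small before/after illustration, with arbitrary variable names and t as the compiler-generated temporary:

 C

//Before Optimization
c = a * b
d = a * b + 4

//After Common Sub Expression Elimination
t = a * b
c = t
d = t + 4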

7. Dead Code Elimination:

 Copy propagation often turns assignment statements
into dead code.
 A variable is said to be dead if it is never used after its last
definition.
 In order to find the dead variables, a data flow analysis should be
done.
Example:

 C

c = a * b
x = a
// ...
d = a * b + 4

//After elimination :
c = a * b
// ...
d = a * b + 4

8. Unreachable Code Elimination:

 First, a Control Flow Graph should be constructed.
 A block which does not have an incoming edge is an
unreachable code block.
 After constant propagation and constant folding, the unreachable
branches can be eliminated.

 C++
#include <iostream>
using namespace std;

int main() {
    int num;
    num = 10;
    cout << "GFG!";
    return 0;
    cout << num; // unreachable code
}

// after elimination of unreachable code
int main() {
    int num;
    num = 10;
    cout << "GFG!";
    return 0;
}

9. Function Inlining:

 Here, a function call is replaced by the body of the function itself.


 This saves a lot of time in copying all the parameters, storing the
return address, etc.

10. Function Cloning:

 Here, specialized codes for a function are created for different
calling parameters.
 Example: Function Overloading

11. Induction Variable and Strength Reduction:

 An induction variable is used in the loop for the following kind of
assignment: i = i + constant. It is a kind of loop optimization
technique.
 Strength reduction means replacing a high-strength (costly) operator
with a low-strength (cheaper) one.
Examples:

 C
Example 1:
Multiplication by a power of 2 can be replaced by the shift-left operator, which is
less expensive than multiplication:
a = a * 16
// Can be modified as:
a = a << 4

Example 2:
i = 1;
while (i < 10)
{
    y = i * 4;
    i = i + 1;
}

//After Strength Reduction
t = 4;
while (t < 40)
{
    y = t;
    t = t + 4;
}

Loop Optimization Techniques:

1. Code Motion or Frequency Reduction:


 The evaluation frequency of expression is reduced.
 The loop invariant statements are brought out of the loop.
Example:

 C

a = 200;
while (a > 0)
{
    b = x + y;
    if (a % b == 0)
        printf("%d", a);
    a = a - 1;
}

//This code can be further optimized as

a = 200;
b = x + y;
while (a > 0)
{
    if (a % b == 0)
        printf("%d", a);
    a = a - 1;
}

2. Loop Jamming:
 Two or more loops are combined into a single loop. It helps in
reducing the loop overhead and thus the running time.
Example:

 C

// Before loop jamming


for(int k=0;k<10;k++)
{
x = k*2;
}

for(int k=0;k<10;k++)
{
y = k+3;
}

//After loop jamming


for(int k=0;k<10;k++)
{
x = k*2;
y = k+3;
}

3. Loop Unrolling:
 It helps in optimizing the execution time of the program by
reducing the number of iterations.
 It increases the program’s speed by eliminating the loop control
and test instructions.
Example:

 C

//Before Loop Unrolling


for(int i=0;i<2;i++)
{
printf("Hello");
}

//After Loop Unrolling

printf("Hello");
printf("Hello");

Where to apply Optimization?

Now that we have learned the need for optimization and its two types, let’s
see where to apply these optimizations.
 Source program: Optimizing the source program involves making
changes to the algorithm or changing the loop structures. The user
is the actor here.
 Intermediate Code: Optimizing the intermediate code involves
changing the address calculations and transforming the procedure
calls involved. Here compiler is the actor.
 Target Code: Optimizing the target code is done by the compiler.
Usage of registers, and select and move instructions are part of the
optimization involved in the target code.
 Local Optimization: Transformations are applied to small basic
blocks of statements. Techniques followed are Local Value
Numbering and Tree Height Balancing.
 Regional Optimization: Transformations are applied to Extended
Basic Blocks. Techniques followed are Super Local Value
Numbering and Loop Unrolling.
 Global Optimization: Transformations are applied to large
program segments that include functions, procedures, and loops.
Techniques followed are Live Variable Analysis and Global Code
Replacement.
 Interprocedural Optimization: As the name indicates, the
optimizations are applied inter procedurally. Techniques followed
are Inline Substitution and Procedure Placement.

Advantages of Code Optimization:


Improved performance: Code optimization can result in code that executes
faster and uses fewer resources, leading to improved performance.
Reduction in code size: Code optimization can help reduce the size of the
generated code, making it easier to distribute and deploy.
Increased portability: Code optimization can result in code that is more
portable across different platforms, making it easier to target a wider range
of hardware and software.
Reduced power consumption: Code optimization can lead to code that
consumes less power, making it more energy-efficient.
Improved maintainability: Code optimization can result in code that is
easier to understand and maintain, reducing the cost of software
maintenance.

Disadvantages of Code Optimization:

Increased compilation time: Code optimization can significantly increase
the compilation time, which can be a significant drawback when developing
large software systems.
Increased complexity: Code optimization can result in more complex code,
making it harder to understand and debug.
Potential for introducing bugs: Code optimization can introduce bugs into
the code if not done carefully, leading to unexpected behavior and errors.
Difficulty in assessing the effectiveness: It can be difficult to determine the
effectiveness of code optimization, making it hard to justify the time and
resources spent on the process.
