Compiler Design Notes CSE

The document explains the process of converting high-level programming languages (HLL) to low-level machine languages using a Language Processing System, which includes components like compilers, assemblers, and linkers. It also compares compilers and interpreters, detailing the steps involved in compilation, including lexical analysis, syntax analysis, and semantic analysis. Additionally, it discusses concepts such as deterministic and non-deterministic grammars, ambiguous grammars, and methods for minimizing context-free grammars (CFG).

LANGUAGE PROCESSING SYSTEM

Users code in a high-level language while the machine can understand only low-level machine
language. To convert HLL to LLL, we need to use a Language Processing system –

The compiler is the component of the Language Processing System that changes pure HLL into assembly-level
language. Let us understand this chain of processing –

1. First, we write a code in HLL like C.


2. Next, we send this code to the pre-processor. In this stage, the pre-processor executes the
pre-processor directives (lines beginning with #) and converts the code to pure HLL. Basically, the
pre-processor imports the contents of header files named in statements like #include<stdio.h>
and #include<stdlib.h>, and expands macros defined with #define.
3. Next, the pure HLL is sent to the compiler. This is where the code is converted to
assembly-level language. Assembly language is low level and specifies how data is
stored in and moved between registers.
4. After this comes the assembler that will take the assembly language code and convert it to the
lowest level machine code.
5. Finally, the linker will take all the modules of the machine codes and then link them together.
Then, this linked code will be loaded and the code will be executed.

COMPILER vs INTERPRETER

Another alternative to a compiler is an interpreter. An interpreter takes the HLL code line by line,
translates it to a lower-level form and executes it before moving to the next line. Thus,
compared to a compiler, an interpreter does not store any intermediate/assembly code, and there
is no need for a linker/loader.
COMPILER BREAKDOWN

A compiler consists of the following steps –

Here, the lexical analyzer, syntax analyzer and semantic analyzer are present in the “front-end” while
the rest of the blocks are present in the “back-end”.

Lexical Analyzer

This is the part of the compiler that reads a program and converts it into lexemes (words/tokens). It also
removes whitespaces and comments. For example, let us assume the following code –

In this case, the lexical analyzer will divide the code line into the following tokens –
Syntax Analyzer (Parser)

A parser takes the tokens from the lexical analyzer and creates a parse tree. This parse tree is formed
with the help of the Context-free grammar that is provided for the system.

Semantic Analyzer

This part checks whether the parse tree is meaningful – for example, whether the operands of an
operator have compatible types and whether identifiers are used consistently.

Intermediate Code Generator

This generates an intermediate code, a machine-independent form that is easy to translate further.
The intermediate code is the same regardless of the target platform; only the last two phases are platform
dependent. This means the same front-end (and its intermediate code) can, in principle, be reused with
back-ends for different target machines.

Code Optimizer

This transforms the code into a more optimized form, so that it consumes fewer resources and runs
faster. The optimization can be either machine dependent or machine independent.

Target Code Generator

This is the final step which returns the final assembly code. The assembly code being generated is
dependent on the type of assembler.

LEXICAL ANALYZER

Lexical analysis (also called tokenization or lexing) is a process of converting a sequence of characters
into a sequence of tokens. A sequence of characters in the input string that matches the pattern of a
token is called a lexeme. A Lexical analyzer (Lexer) will take an input string and yield output in the form
of tuples –
(𝑻𝒐𝒌𝒆𝒏, 𝑳𝒆𝒙𝒆𝒎𝒆)
For example, let us take the code line as follows –
int x = a*b;

The Lexer will return the output as follows –

• (datatype, int)
• (identifier, x)
• (operator, =)
• (identifier, a)
• (operator, *)
• (identifier, b)
• (separator, ;)
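
As an illustration, here is a minimal sketch of such a lexer in Python; the token-class names and the regular expression for each class are assumptions chosen to match the example above.

```python
import re

# One named group per token class; order matters: keywords before identifiers.
TOKEN_SPEC = [
    ('datatype',   r'\bint\b'),
    ('identifier', r'[A-Za-z_]\w*'),
    ('operator',   r'[=*+\-/]'),
    ('separator',  r';'),
    ('skip',       r'\s+'),        # whitespace is removed, not emitted
]
PATTERN = re.compile('|'.join(f'(?P<{name}>{pat})' for name, pat in TOKEN_SPEC))

def lex(code):
    """Yield (token, lexeme) tuples, dropping whitespace."""
    for m in PATTERN.finditer(code):
        if m.lastgroup != 'skip':
            yield (m.lastgroup, m.group())

print(list(lex("int x = a*b;")))
# [('datatype', 'int'), ('identifier', 'x'), ('operator', '='),
#  ('identifier', 'a'), ('operator', '*'), ('identifier', 'b'),
#  ('separator', ';')]
```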

Apart from the above function, the lexer also is responsible for –

• Removing comments
• Removing Whitespaces
• Correlating errors with source-code line numbers.

Lexer Implementation

To implement a lexer, we need to define the Lexical Grammar. The lexer will recognize and work on
strings based on the rules of the Lexical Grammar. For example, let us assume that the grammar of a
password is given as follows –

𝑅𝑒𝑔𝑢𝑙𝑎𝑟 𝐸𝑥𝑝𝑟𝑒𝑠𝑠𝑖𝑜𝑛 → 𝐴(𝐴 + 𝑆 + 𝐷)^7 (𝐴 + 𝑆 + 𝐷 + 𝜖)^7


Where, 𝐴 → 𝑎𝑙𝑝ℎ𝑎, 𝑆 → 𝑠𝑝𝑒𝑐𝑖𝑎𝑙 𝑠𝑦𝑚𝑏𝑜𝑙, 𝐷 → 𝐷𝑖𝑔𝑖𝑡. Thus, the rules of the password are as follows

• The password must start with an alphabet.


• After that, the next characters can be either alphabet, digit or special character
• Minimum length of the password should be 8 (1+7)
• Maximum length of the password should be 15 (1+7+7)
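
A quick way to sanity-check such a rule is to encode it as a regular expression in code. Below is a sketch in Python; the concrete character classes chosen for A, D and S are assumptions.

```python
import re

ALPHA, DIGIT, SPECIAL = r'[A-Za-z]', r'[0-9]', r'[@#$%&*]'   # assumed classes
ANY = f'(?:{ALPHA}|{DIGIT}|{SPECIAL})'
# A (A+S+D)^7 (A+S+D+eps)^7 : one alphabet, 7 mandatory chars, up to 7 more
PASSWORD = re.compile(f'{ALPHA}{ANY}{{7}}{ANY}{{0,7}}')

print(bool(PASSWORD.fullmatch('a1234567')))      # True: length 8, starts with alphabet
print(bool(PASSWORD.fullmatch('12345678')))      # False: must start with an alphabet
print(bool(PASSWORD.fullmatch('a' + '1' * 15)))  # False: length 16 > 15
```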

Question
Answer

We can re-write the token expressions as follows –

𝑇1 = (𝑎 + 𝜖)(𝑏 + 𝑐)∗ 𝑎
𝑇2 = (𝑏 + 𝜖)(𝑎 + 𝑐)∗ 𝑏
𝑇3 = (𝑐 + 𝜖)(𝑏 + 𝑎)∗ 𝑐
Now, the string we have is 𝒃𝒃𝒂𝒂𝒄𝒂𝒃𝒄. If we put this string through the different token expressions,
we can see that –

𝑇1 → 𝑏𝑏𝑎
𝑇2 → 𝑏𝑏
𝑇3 → 𝑏𝑏𝑎𝑎𝑐
As the question instructs to take the token expression with the longest prefix, we must take T3. Hence,
the correct option is Option D.

LEXICAL GRAMMAR

Lexical grammar is usually a CFG. A CFG is a Type-2 grammar wherein any production of the form 𝛼 → 𝛽
has a single variable as 𝛼, while 𝛽 can be any string of terminals and variables. The process of deriving a string from the grammar
is called derivation, and the graphical representation of the derivation is called a Parse (Syntax) Tree.
For example, let us take the grammar as follows –

𝐸 → 𝐸 + 𝐸 | 𝐸 ∗ 𝐸 | 𝐸 = 𝐸 | 𝑖𝑑
If we want to derive the string 𝑖𝑑 = 𝑖𝑑 + 𝑖𝑑 ∗ 𝑖𝑑, then the parse tree will be –
Each intermediate string that appears during the derivation is called a Sentential form.

If at every step of the derivation the leftmost non-terminal is the one replaced, it is called the Left Most
Derivation (LMD). On the other hand, if at every step the rightmost non-terminal is replaced, it is called the
Right Most Derivation (RMD). The above sentential form is the LMD. To get the RMD for the same tree, we
can write –

NON – DETERMINISTIC GRAMMAR

If the alternatives of a production share a common prefix, the grammar is called a Non-Deterministic Grammar, as there
is no deterministic way to decide which alternative to use while deriving the string. For example,

𝐴 → 𝛼𝛽1 | 𝛼𝛽2
Now, we know that the first symbol is 𝛼, but we don't know which alternative to choose after it. To change it to a
deterministic grammar, we can write it as follows –

𝐴 → 𝛼𝐵
𝐵 → 𝛽1 | 𝛽2
In short, a parser using a Non-deterministic grammar will have to backtrack at some point. To avoid
backtracking, we need to convert the Non-Deterministic grammar to a Deterministic Grammar.

NOTE

• Deterministic Grammar is also called Left Factored Grammar.


• Non-Deterministic Grammar is also called Non-Left Factored Grammar
• The process to convert NDG to DG is called Left Factoring.

Question

Convert the following grammar to Left Factored Grammar

𝑆 → 𝑎𝑆𝑏 | 𝑎𝑏𝑆 | 𝑎𝑏
Answer

𝑆 → 𝑎𝐴
𝐴 → 𝑆𝑏 | 𝑏𝐵
𝐵 →𝑆|𝜖

If the variable on the LHS appears on the RHS as well, it is called recursive grammar. In addition to
that, if the recursive variable appearing on the RHS is at the extreme left, then it is called Left recursive
grammar. If the recursive variable appears on the extreme right, then it is called Right recursive
grammar.

𝑆 → 𝑆𝑎 (𝐿𝑒𝑓𝑡 𝑅𝐺)
𝑆 → 𝑎𝑆 (𝑅𝑖𝑔ℎ𝑡 𝑅𝐺)
𝑆 → 𝑎𝑆𝑏 (𝐺𝑒𝑛𝑒𝑟𝑎𝑙 𝑅𝐺)
The expressive power of Left RG and Right RG is the same.

NOTE

Normally, a Left RG can send a (top-down) parser into an infinite loop, so it is usually a good idea to first change a Left RG
to a Right RG.

Question

Convert the Left RG to Right RG.

𝐴 → 𝐴𝛼 | 𝛽1 | 𝛽2
Answer

𝐴 → 𝛽1 𝐵 | 𝛽2 𝐵
𝐵 → 𝛼𝐵 | 𝜖
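
The same transformation can be written as a small routine. Below is a sketch in Python; the tuple-of-alternatives grammar encoding is an assumption made for the example.

```python
def remove_left_recursion(grammar, A):
    """Rewrite A -> A a | b as A -> b A', A' -> a A' | eps.
    Alternatives are tuples of symbols; () denotes epsilon."""
    rec  = [alt[1:] for alt in grammar[A] if alt and alt[0] == A]   # the alphas
    base = [alt for alt in grammar[A] if not alt or alt[0] != A]    # the betas
    if not rec:
        return grammar
    B = A + "'"
    grammar[A] = [beta + (B,) for beta in base]
    grammar[B] = [alpha + (B,) for alpha in rec] + [()]
    return grammar

g = {'A': [('A', 'α'), ('β1',), ('β2',)]}
print(remove_left_recursion(g, 'A'))
# {'A': [('β1', "A'"), ('β2', "A'")], "A'": [('α', "A'"), ()]}
```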

Question

Convert the Left RG to Right RG.

𝐴 → 𝐴𝛼1 | 𝐴𝛼2 | 𝛽
Answer

𝐴 → 𝛽𝐵
𝐵 → 𝛼1 𝐵 | 𝛼2 𝐵 | 𝜖

Question

Convert the Left RG to Right RG.

𝑆 → 𝑆𝑆𝑆 | 0
Answer

𝑆 → 0𝐴
𝐴 → 𝑆𝑆𝐴 | 𝜖

Question

Convert the Left RG to Right RG.

𝑆 → 𝑆1𝑆 | 0
Answer

𝑆 → 0𝐴
𝐴 → 1𝑆𝐴 | 𝜖

Question

Convert the Left RG to Right RG.

𝑆 → 𝑆12 | 0
Answer

𝑆 → 0𝐴
𝐴 → 12𝐴 | 𝜖
Question

Convert the Left RG to Right RG.

𝑆 → 𝑆0𝑆1 | 0 | 1
Answer

𝑆 → 0𝐴 | 1𝐴
𝐴 → 0𝑆1𝐴 | 𝜖

Question

Convert the Left RG to Right RG.

𝐸 → 𝐸 + 𝑇 | 𝑇
𝑇 → 𝑇 ∗ 𝐹 | 𝐹
𝐹 → (𝐸) | 𝑖𝑑

Answer

𝐸 → 𝑇𝐴
𝐴 → +𝑇𝐴 | 𝜖
𝑇 → 𝐹𝐵
𝐵 → ∗ 𝐹𝐵 | 𝜖
𝐹 → (𝐸) | 𝑖𝑑

AMBIGUOUS GRAMMAR

A CFG in which some string has more than one derivation tree is said to be ambiguous.
For example,

𝑆 → 𝑎𝑆 | 𝑆𝑎 | 𝑎
In this case, we can draw two derivation trees as follows –
Both trees yield the same string 𝒂𝒂, and hence the above grammar is ambiguous.
We can write the same grammar as –

𝑆 → 𝑎𝑆 | 𝑎
Now the grammar is unambiguous. Here are a few things to note about ambiguous grammar –

• If a grammar is both left and right recursive (for the same non-terminal), it will be ambiguous. However, the
converse is not true: an ambiguous grammar need not be both left and right recursive.
• Some ambiguous grammars cannot be converted to an unambiguous grammar at all. For
example, {𝑎^𝑛 𝑏^𝑚 𝑐^𝑚 𝑑^𝑛} ∪ {𝑎^𝑛 𝑏^𝑛 𝑐^𝑚 𝑑^𝑚} is a CFL for which no unambiguous grammar exists (it is
inherently ambiguous).

Question

Is the following grammar ambiguous or unambiguous?

𝑆 → 𝑆𝑆 | 0 | 1
Answer

Since the production 𝑆 → 𝑆𝑆 is both left and right recursive, the grammar is ambiguous.

MINIMIZATION OF CFG

We can follow the given steps to simplify/minimize the given CFG –

• Removal of NULL productions


• Removal of Unit productions
• Removal of Useless symbols

Removal of NULL Productions

A NULL production is of the form 𝐴 → 𝜖. During a derivation, a null production simply erases a non-terminal,
so it adds nothing to the derived string and only lengthens the derivation. In short, a NULL production doesn't
functionally add to the string derivation. Therefore, it is advisable to
remove NULL productions to simplify the CFG. For example, let us take the CFG as follows –

𝑆 → 𝐴𝑏𝐵
𝐴 → 𝑎 | 𝜖
𝐵 → 𝑏 | 𝜖

If 𝐴 → 𝜖, then 𝑆 → 𝑏𝐵. Similarly, if 𝐵 → 𝜖, then 𝑆 → 𝐴𝑏. Finally, if both 𝐴 → 𝜖 and 𝐵 → 𝜖, then 𝑆 → 𝑏.


Thus, we get –

𝑆 → 𝐴𝑏𝐵 | 𝑏𝐵 | 𝐴𝑏 | 𝑏
𝐴→𝑎
𝐵→𝑏
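
Here is a sketch of this removal in code; it handles directly-nullable variables only (it does not propagate nullability through chains like 𝐴 → 𝐵), which suffices for these examples.

```python
from itertools import combinations

def remove_null(grammar):
    """Drop eps-alternatives; alternatives are tuples, () denotes epsilon."""
    nullable = {X for X, alts in grammar.items() if () in alts}
    new = {}
    for X, alts in grammar.items():
        out = set()
        for alt in alts:
            if alt == ():
                continue
            spots = [i for i, s in enumerate(alt) if s in nullable]
            # add every variant with some subset of nullable symbols dropped
            for k in range(len(spots) + 1):
                for drop in combinations(spots, k):
                    v = tuple(s for i, s in enumerate(alt) if i not in drop)
                    if v:
                        out.add(v)
        new[X] = sorted(out)
    return new

g = {'S': [('A', 'b', 'B')], 'A': [('a',), ()], 'B': [('b',), ()]}
print(remove_null(g))
# {'S': [('A','b'), ('A','b','B'), ('b',), ('b','B')], 'A': [('a',)], 'B': [('b',)]}
```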
Question

For the given CFG, remove the NULL productions.

𝑆 → 𝐴𝐵
𝐴 → 𝑎 | 𝜖
𝐵 → 𝑏 | 𝜖

Answer

𝑆′ → 𝑆 | 𝜖
𝑆 → 𝐴𝐵 | 𝐵 | 𝐴
𝐴→𝑎
𝐵→𝑏

Removal of Unit Productions

A production with a single variable on the RHS (of the form 𝐴 → 𝐵) is called a Unit production. A unit
production is wasteful; its RHS can instead be substituted into the productions that use it. For example

𝑆 → 𝐴𝑎
𝐴 →𝑎|𝐵
𝐵→𝑑
Here, 𝐵 → 𝑑 is a unit production. Thus, we can remove it as follows –

𝑆 → 𝐴𝑎
𝐴→𝑎|𝑑

Question

Remove the unit productions in the given CFG –

Answer
𝑆 → 𝑎𝐴𝑏
𝐴→𝑑|𝑐|𝑏|𝑎

Removal of Useless Symbols

These are analogous to the dead and unreachable states in FA. There can be two types of useless
productions –

• Which can’t be reached by the start symbol (Unreachable)


• Which don’t derive any terminal and hence can’t be terminated (Dead)

For example –

In the first CFG, 𝐶 → 𝑑 is an unreachable production and hence can be removed. In the second CFG,
the variable 𝐵 can be reached from 𝑆 but it doesn’t derive any strings/terminals. So, that is also useless.

Question

Remove the useless productions –

Answer

𝑆 → 𝑎𝐴𝐵 | 𝑏𝐴
𝐴 → 𝑎𝐵 | 𝑏
𝐵→𝑑

NORMALIZATION OF CFG

This is done to transform the CFG to be more compiler friendly. We have 2 normal forms –

• Chomsky Normal Form (CNF)


• Greibach Normal Form (GNF)
Both CNF and GNF have the same expressive power. Also, every CFG can be expressed in the CNF and
GNF form. Additionally, to perform normalization we need the grammar to be in its minimized form.

Chomsky Normal Form (CNF)

Every production in CNF must be of the form –

𝑨 → 𝑩𝑪 | 𝒂
Where 𝐴, 𝐵, 𝐶 are variables and 𝑎 is a terminal. For example, let us take the CFG as follows –

𝑆 → 𝑎𝑆𝑏 | 𝑎𝑏
For a CNF, we can have either 2 variables or 1 terminal in the RHS. So, we can write the CFG as follows

𝑆 → 𝐴𝑆 ′ | 𝐴𝐵
𝑆 ′ → 𝑆𝐵
𝐴→𝑎
𝐵→𝑏

We can see that the above grammar is in CNF

Question

Convert to CNF form –

Answer

𝑆 → 𝑆′𝐵 | 𝐵𝐵
𝑆′ → 𝐶𝐴
𝐴→𝑎|𝑏
𝐵→𝑏
𝐶→𝑎
Suppose we need to create a string 𝒂𝒂𝒃 using the above grammar, then we can write the sentential
forms as follows –

• 𝑆 → 𝑆′𝐵
• 𝑆 → 𝐶𝐴𝐵
• 𝑆 → 𝑎𝐴𝐵
• 𝑆 → 𝑎𝑎𝐵
• 𝑆 → 𝑎𝑎𝑏
Thus, we can see that to generate a string of length 3, we have 5 sentential forms. Therefore, we can
conclude that in general for CNF, if we need to generate a string of length 𝒏, we would have a total of
𝟐𝒏 − 𝟏 sentential forms.

Greibach Normal Form (GNF)

Here, every production needs to be of the form –

𝐴 → 𝑎𝛼
Where 𝐴 is a variable, 𝑎 is a terminal and 𝛼 ∈ 𝑉_𝑁^∗ (a possibly empty string of variables). For example, let us take the CFG as follows –

𝑆 → 𝑎𝑆𝑏 | 𝑎𝑏
Then, we can write GNF as follows –

𝑆 → 𝑎𝑆𝐵 | 𝑎𝐵
𝐵→𝑏

We can see that the above grammar is in GNF

Question

Convert to GNF form –

Answer

𝑆 → 𝑎𝐴𝐵 | 𝑏𝐵
𝐴→𝑎|𝑏
𝐵→𝑏
Unlike CNF, if we need to generate a string of length 𝒏 in a GNF grammar, then we need 𝒏 sentential
forms.
SYNTAX ANALYZER (PARSER)

This is the next step after the Lexical analysis. Parsers can be of various types –

TOP-DOWN PARSERS

Top-down parsers start parsing the input by constructing the parse tree from the root node,
gradually moving down to the leaf nodes. They use left-most derivations for construction. For any
top-down parser, the CFG from which it is constructed must be –

• Non-left recursive grammar (as the parser might go into an infinite loop)
• Unambiguous

The top-down parsers can be of 2 types –

• Brute Force (requires back tracking)


• Predictive Parser (doesn’t support back tracking)

Brute Force Parsers

Brute force parsers are simple: check a string by trial and error. If the string doesn't match in the
CFG, then backtrack and try another derivation. It is a basic but time-consuming and lengthy process.
These can be constructed from both deterministic and non-deterministic CFG. For example, let us
take the CFG as follows –

Now, suppose we need to construct a parse tree for the word 𝒘 = 𝒄𝒂𝒅. Then, we can start as follows

We got this using the left-most derivation. However, this is not the word we want. Hence, we now
backtrack and can re-write the parse tree as follows –

Question

Draw the parse tree from a Brute force parser to select the word 𝒘 = 𝒂𝒅𝒅𝒄 using the following CFG

Answer
Since brute force tries many derivations before succeeding, it has a time complexity of 𝑶(𝟐^𝒏).

Predictive Parsers

A predictive parser will use rules to predict the next incoming symbol. This parser doesn’t support
back tracking. Therefore, it can only be constructed from deterministic grammar. Since this parser
doesn’t have the option of back tracking, we can’t have any scope for error. Thus, before beginning we
need to understand 2 functions –

• FIRST (X) – the set of terminals that can begin the strings derived from 𝑿. In other words, it
returns the set of terminals with which a string derived from 𝑿 can start.
• FOLLOW (X) – the set of terminals that can appear immediately after 𝑿 in some sentential form;
it is needed when 𝑿 derives 𝝐. The FOLLOW function is applied only to variables, not to terminals.

For example, suppose we have the grammar –

𝑆 → 𝐴𝐵𝐶
𝐴 → 𝑎𝑏
𝐵 → 𝑐𝑎
𝐶→𝑏
Then, we can write –

𝐹𝐼𝑅𝑆𝑇(𝑎𝑏𝑐) = {𝑎}
𝐹𝐼𝑅𝑆𝑇(𝑐𝑏𝑎) = {𝑐}
𝐹𝐼𝑅𝑆𝑇(𝐴) = {𝑎} ; 𝐹𝑂𝐿𝐿𝑂𝑊(𝐴) = {𝑐}
𝐹𝐼𝑅𝑆𝑇(𝐵) = {𝑐} ; 𝐹𝑂𝐿𝐿𝑂𝑊(𝐵) = {𝑏}
𝐹𝐼𝑅𝑆𝑇(𝑆) = {𝑎} ; 𝐹𝑂𝐿𝐿𝑂𝑊(𝑆) = {$}
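
Both functions can be computed mechanically by the standard fixed-point iteration. Below is a sketch in Python for the grammar above; the dict-of-tuples grammar encoding and the 'eps' marker are assumptions.

```python
def first_follow(grammar, start):
    """Fixed-point computation of FIRST and FOLLOW.
    grammar maps a variable to a list of alternatives (tuples of symbols)."""
    nonterm = set(grammar)
    first = {X: set() for X in grammar}
    follow = {X: set() for X in grammar}
    follow[start].add('$')
    changed = True
    while changed:
        changed = False
        for X, alts in grammar.items():
            for alt in alts:
                # FIRST(X): scan the RHS while its prefix can derive eps
                nullable = True
                for sym in alt:
                    f = first[sym] if sym in nonterm else {sym}
                    add = (f - {'eps'}) - first[X]
                    if add:
                        first[X] |= add
                        changed = True
                    if 'eps' not in f:
                        nullable = False
                        break
                if nullable and 'eps' not in first[X]:
                    first[X].add('eps')
                    changed = True
                # FOLLOW: walk the RHS right-to-left, carrying a trailer set
                trailer = set(follow[X])
                for sym in reversed(alt):
                    if sym in nonterm:
                        add = trailer - follow[sym]
                        if add:
                            follow[sym] |= add
                            changed = True
                        f = first[sym]
                        if 'eps' in f:
                            trailer = trailer | (f - {'eps'})
                        else:
                            trailer = set(f)
                    else:
                        trailer = {sym}
    return first, follow

G = {'S': [('A', 'B', 'C')], 'A': [('a', 'b')], 'B': [('c', 'a')], 'C': [('b',)]}
fi, fo = first_follow(G, 'S')
print(fi['S'], fo['A'], fo['B'], fo['S'])   # {'a'} {'c'} {'b'} {'$'}
```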

Question

Find the First and Follow for the following CFG –

𝑆 →𝑎|𝑏|𝜖
Answer

𝐹𝐼𝑅𝑆𝑇(𝑆) = {𝑎, 𝑏, 𝜖}
𝐹𝑂𝐿𝐿𝑂𝑊(𝑆) = {$}

Question

Find the First and Follow for the following CFG –


Answer

𝐹𝐼𝑅𝑆𝑇(𝑆) = {𝑎, 𝑏}
𝐹𝐼𝑅𝑆𝑇(𝐴) = {𝜖}
𝐹𝐼𝑅𝑆𝑇(𝐵) = {𝜖}
𝐹𝑂𝐿𝐿𝑂𝑊(𝑆) = {$}
𝐹𝑂𝐿𝐿𝑂𝑊(𝐴) = {$}
𝐹𝑂𝐿𝐿𝑂𝑊(𝐵) = {$}

Question

Find the First and Follow for the following CFG –

Answer

𝐹𝐼𝑅𝑆𝑇(𝑆) = {𝑎, 𝑐, 𝑑}
𝐹𝐼𝑅𝑆𝑇(𝐴) = {𝑎, 𝑏}
𝐹𝐼𝑅𝑆𝑇(𝐵) = {𝑐, 𝑑}
𝐹𝑂𝐿𝐿𝑂𝑊(𝑆) = {$}
𝐹𝑂𝐿𝐿𝑂𝑊(𝐴) = {𝑏}
𝐹𝑂𝐿𝐿𝑂𝑊(𝐵) = {𝑎}

Question

Find the First and Follow for the following CFG –


Answer

𝐹𝐼𝑅𝑆𝑇(𝑆) = {𝑎, 𝑏, 𝑑, 𝑒}
𝐹𝐼𝑅𝑆𝑇(𝐴) = {𝑎, 𝑏}
𝐹𝐼𝑅𝑆𝑇(𝐵) = {𝑑, 𝑒}
𝐹𝑂𝐿𝐿𝑂𝑊(𝑆) = {$}
𝐹𝑂𝐿𝐿𝑂𝑊(𝐴) = {$, 𝑎}
𝐹𝑂𝐿𝐿𝑂𝑊(𝐵) = {$, 𝑎, 𝑏}

Question

Find the First and Follow for the following CFG –

Answer

𝐹𝐼𝑅𝑆𝑇(𝑆) = {𝑏, 𝑎}
𝐹𝐼𝑅𝑆𝑇(𝐴) = {𝑏, 𝜖}
𝐹𝐼𝑅𝑆𝑇(𝐵) = {𝑐}
𝐹𝑂𝐿𝐿𝑂𝑊(𝑆) = {$}
𝐹𝑂𝐿𝐿𝑂𝑊(𝐴) = {𝑎}
𝐹𝑂𝐿𝐿𝑂𝑊(𝐵) = {$}

Question

Find the First and Follow for the following CFG –

Answer

𝐹𝐼𝑅𝑆𝑇(𝑆) = {𝑎, 𝜖}
𝐹𝐼𝑅𝑆𝑇(𝐴) = {𝑎, 𝜖}
𝐹𝐼𝑅𝑆𝑇(𝐵) = {𝑎, 𝜖}
𝐹𝑂𝐿𝐿𝑂𝑊(𝑆) = {$}
𝐹𝑂𝐿𝐿𝑂𝑊(𝐴) = {𝑎, $}
𝐹𝑂𝐿𝐿𝑂𝑊(𝐵) = {𝑎, $}

NOTE

We can easily find the FIRST of a symbol, but FOLLOW is a little trickier. So here is a cheat
sheet –

LL1 PARSER

First and foremost, we need to have a CFG that is non-left recursive, unambiguous and deterministic.
Once we have that, we can proceed with the LL1 parser derivation. Let us take an example of a CFG as
follows –

𝐸 → 𝑇𝐸′
𝐸 ′ → +𝑇𝐸 ′ | 𝜖
𝑇 → 𝐹𝑇′
𝑇 ′ →∗ 𝐹𝑇 ′ | 𝜖
𝐹 → (𝐸) | 𝑖𝑑
This CFG is non-left recursive, unambiguous and deterministic. Now, let us assume that we need to
make the parse tree for the string –

𝑊 = 𝑖𝑑 + 𝑖𝑑 ∗ 𝑖𝑑
From the grammar, we can observe that –

𝑇𝑒𝑟𝑚𝑖𝑛𝑎𝑙𝑠 = {+,∗, (, ), 𝑖𝑑}


𝑉𝑎𝑟𝑖𝑎𝑏𝑙𝑒𝑠 = {𝐸, 𝐸 ′ , 𝑇, 𝑇 ′ , 𝐹}
We now draw a table with these values as shown below –
∗ + ( ) 𝒊𝒅 $
E
E’
T
T’
F

From the grammar, we can see that –

𝐹𝐼𝑅𝑆𝑇(𝐸) = {(, 𝑖𝑑} 𝐹𝑂𝐿𝐿𝑂𝑊(𝐸) = {$, )}


𝐹𝐼𝑅𝑆𝑇(𝐸′) = {+, 𝜖} 𝐹𝑂𝐿𝐿𝑂𝑊(𝐸′) = {$, )}
𝐹𝐼𝑅𝑆𝑇(𝑇) = {(, 𝑖𝑑} 𝐹𝑂𝐿𝐿𝑂𝑊(𝑇) = {+, $, )}
𝐹𝐼𝑅𝑆𝑇(𝑇′) = {∗, 𝜖} 𝐹𝑂𝐿𝐿𝑂𝑊(𝑇′) = {+, $, )}
𝐹𝐼𝑅𝑆𝑇(𝐹) = {( , 𝑖𝑑} 𝐹𝑂𝐿𝐿𝑂𝑊(𝐹) = {∗, +, $, )}
Now, we can start filling the table. To fill the table, we first take the productions one-by-one as follows

𝛼→𝛽 𝑤ℎ𝑒𝑟𝑒 𝛽 ≠ 𝜖
For each such production, we take the row 𝛼 and the columns of the terminals returned by 𝐹𝐼𝑅𝑆𝑇(𝛽),
and we fill the production into those cells. In our case, the first production is –

𝐸 → 𝑇𝐸′
Thus, we fill the above production in the row 𝐸 and column will be {(, 𝑖𝑑}. Thus, we get –

∗ + ( ) 𝒊𝒅 $
E 𝐸 → 𝑇𝐸′ 𝐸 → 𝑇𝐸′
E’
T
T’
F

Similarly, we can do the same for the rest of the productions and fill the table as follows –

∗ + ( ) 𝒊𝒅 $
E 𝐸 → 𝑇𝐸′ 𝐸 → 𝑇𝐸′
E’ 𝐸 ′ → +𝑇𝐸 ′
T 𝑇 → 𝐹𝑇′ 𝑇 → 𝐹𝑇′
T’ 𝑇 ′ →∗ 𝐹𝑇 ′
F 𝐹 → (𝐸) 𝐹 → (𝐸)

Now, we still need to consider the case where –

𝛼→𝜖
For such a production, we take the row 𝛼 and the columns of the terminals returned by
𝐹𝑂𝐿𝐿𝑂𝑊(𝛼), and we fill the production into those cells. Thus, we get –
∗ + ( ) 𝒊𝒅 $
E 𝐸 → 𝑇𝐸′ 𝐸 → 𝑇𝐸′
E’ 𝐸 ′ → +𝑇𝐸 ′ 𝐸′ → 𝜖 𝐸′ → 𝜖
T 𝑇 → 𝐹𝑇′ 𝑇 → 𝐹𝑇′
T’ 𝑇 ′ →∗ 𝐹𝑇 ′ 𝑇′ → 𝜖 𝑇′ → 𝜖 𝑇′ → 𝜖
F 𝐹 → (𝐸) 𝐹 → 𝑖𝑑

Now that we have filled this table, we can start creating the parse tree. First, we break the given string
into tokens as shown –

Now, we use left-most derivation to match the tokens one by one. The first token is 𝒊𝒅. So, we
can use the productions under the 𝑖𝑑 column to draw the parse tree as follows –

Just like that, we will do the same for the rest of the tokens as well. Thus, we get –

NOTE

There are 3 parts of the name of the parser –


• The first letter represents the direction in which the input is scanned (L = left to right).
• The second letter represents the derivation used (L = left-most).
• The number represents the number of lookahead tokens taken at a time.

Since the parser here scans the input from left to right, uses left-most derivation and takes
one lookahead token at a time, the name becomes LL1.
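
The whole table-driven procedure can also be run as a short program. Below is a minimal sketch in Python of the LL1 parse of 𝑖𝑑 + 𝑖𝑑 ∗ 𝑖𝑑 using the table filled in above; the tuple encoding of productions (with the empty tuple for 𝜖) is an assumption.

```python
TABLE = {
    ('E',  '('): ('T', "E'"), ('E',  'id'): ('T', "E'"),
    ("E'", '+'): ('+', 'T', "E'"), ("E'", ')'): (), ("E'", '$'): (),
    ('T',  '('): ('F', "T'"), ('T',  'id'): ('F', "T'"),
    ("T'", '*'): ('*', 'F', "T'"), ("T'", '+'): (), ("T'", ')'): (), ("T'", '$'): (),
    ('F',  '('): ('(', 'E', ')'), ('F',  'id'): ('id',),
}
NONTERM = {'E', "E'", 'T', "T'", 'F'}

def ll1_parse(tokens):
    stack = ['$', 'E']                     # start symbol above the end marker
    tokens = list(tokens) + ['$']
    i = 0
    while stack:
        top = stack.pop()
        if top in NONTERM:
            prod = TABLE.get((top, tokens[i]))
            if prod is None:
                return False               # empty table cell: syntax error
            stack.extend(reversed(prod))   # push the RHS, leftmost on top
        elif top == tokens[i]:
            i += 1                         # terminal matched, consume token
        else:
            return False
    return i == len(tokens)

print(ll1_parse(['id', '+', 'id', '*', 'id']))   # True
```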

NOTE

In the LL1 table, each cell can have just one production. If there is a case where a cell has more
than one entry, then the grammar is not LL1.

Question

Is the following grammar in LL1?

Answer

We can see that this grammar is non-left recursive, unambiguous and deterministic. Therefore, the
initial conditions are satisfied. Next, we can write –

𝐹𝐼𝑅𝑆𝑇(𝑆) = {𝑎, 𝑏, 𝜖} 𝐹𝑂𝐿𝐿𝑂𝑊(𝑆) = {$}


𝐹𝐼𝑅𝑆𝑇(𝐴) = {𝑏, 𝜖} 𝐹𝑂𝐿𝐿𝑂𝑊(𝐴) = {𝑎, $}
𝐹𝐼𝑅𝑆𝑇(𝐵) = {𝑎, 𝜖} 𝐹𝑂𝐿𝐿𝑂𝑊(𝐵) = {$}
We can draw the LL1 table as follows –

a b $
S 𝑆 → 𝐴𝐵 𝑆 → 𝐴𝐵
A 𝐴→𝜖 𝐴 → 𝑏𝐴 𝐴→𝜖
B 𝐵 → 𝑎𝐵 𝐵→𝜖

Since every cell has just one production, we can conclude that the grammar is in LL1.

Question

Is the following LL1 grammar?


Answer

The grammar is not LL1 as the cell (𝑆, 𝑎) will have multiple values.

NOTE

Question

Is the following grammar LL1 or not?

𝑆 → 𝑎𝐴𝑏𝐵
𝐴→𝑎|𝜖
𝐵 →𝑏|𝜖
Answer

The grammar is LL1.

Question

Is the following grammar LL1 or not?

Answer

The grammar is LL1.

Question

Is the following grammar LL1 or not?


Answer

The grammar is LL1.

Question

Is the following grammar LL1 or not?

Answer

The grammar is NOT LL1.

BOTTOM – UP PARSER

This is more detailed and has higher power when compared to top-down parsers. Compilers of most high-level
languages like C, Python, Java etc. use a bottom-up parser. As the name suggests, the parser
creates the parse tree starting from the leaves and proceeding towards the root.

To begin with, let us take the grammar as shown –

First step, we need to re-write this grammar as follows –

𝑆′ → . 𝑆
𝑆 → . 𝐴𝐴
𝐴 → . 𝑎𝐴
𝐴 → .𝑏
Here, the (.) (dot) operator represents the position up to which we have scanned the production. Since
initially we haven't scanned anything yet, the dot is at the beginning.
Also, the group of these productions is called a closure and is represented as follows –

Now, we will start to scan the productions one symbol at a time and create a sort of tree of closures.
This can be done as follows –

Let us now unpack what just happened. First, we write the original closure I0. From I0, we parse
based on each symbol – both terminals and non-terminals. Here is a step-by-step process of the
parsing –

1. First, choose a terminal/non-terminal 𝒙 based on which you are parsing.


2. Find all the productions of the form 𝜶 → . 𝒙𝜷
3. Write those productions in a new closure.
4. If 𝜷 is a non – terminal, then write all the productions of the form 𝜷 → 𝒚 in the closure as well.

Now that a step-by-step procedure has been established, we can start with the process. First, let us
parse using 𝑺. We can see that there is only one production to be considered –

𝑆′ → . 𝑆
Therefore, we write a new closure I1 with the production 𝑺′ → 𝑺. (the dot moved past 𝑺). As we can see, there is no 𝜷, so we
can skip step 4. Thus, there is only 1 production in closure I1.

Now, we consider a parse using 𝑨. In this case, we again have just one production –

𝑆 → . 𝐴𝐴
So, we transform the production to 𝑺 → 𝑨. 𝑨 and write it in closure I2. However, we can see that
𝜷 = 𝑨, which is a non-terminal. So, we need to include all the productions with 𝑨 on the LHS. Thus,
closure I2 has a total of 3 productions.

Following this process, we can get a total of 7 closures (I0 – I6)


Once we have the closures, we can then proceed to create the parse table as shown below –

ACTION GOTO
𝑎 𝑏 $ 𝑆 𝐴
I0
I1
I2
I3
I4
I5
I6

Basically, we write ACTION and GOTO as the 2 sections in the table wherein the terminals are listed
under ACTION and the non-terminals are listed under GOTO. This is because we terminate at the
terminals and we usually proceed to the next step for a non-terminal. Each row represents each of the
closures we had created.

To fill this table, let us start with I0. We can see from the closure diagram that upon encountering 𝒂,
we shift from I0 to I3. This is represented as 𝑺𝟑. Similarly, upon encountering 𝒃, we shift from I0 to I4.
This is represented as 𝑺𝟒. Now, we fill the table accordingly.

ACTION GOTO
𝑎 𝑏 $ 𝑆 𝐴
I0 S3 S4
I1
I2 S3 S4
I3 S3 S4
I4
I5
I6

Just like the terminals, we now check for non-terminals as well. We can see that upon 𝑆, I0 goes to I1.
This is represented simply by 𝟏. We don’t write S1 for the non-terminals in the GOTO section. Thus,
the table now becomes –

ACTION GOTO
𝑎 𝑏 $ 𝑆 𝐴
I0 S3 S4 1 2
I1
I2 S3 S4 5
I3 S3 S4 6
I4
I5
I6

After this, we divert our attention to the original grammar –


From here, we assign each production as a rule as follows –

𝑹𝟏 ∶ 𝑆 → 𝐴𝐴
𝑹𝟐 ∶ 𝐴 → 𝑎𝐴
𝑹𝟑 ∶ 𝐴 → 𝑏
We can see that rules R1, R2 and R3 correspond to closures I5, I6 and I4 respectively. Hence, we will
fill those out in the table as follows –

ACTION GOTO
𝑎 𝑏 $ 𝑆 𝐴
I0 S3 S4 1 2
I1
I2 S3 S4 5
I3 S3 S4 6
I4 R3 R3 R3
I5 R1 R1 R1
I6 R2 R2 R2

Finally, we can see that the start symbol (𝑺′) is present in closure I1. So, we will fill the row I1 and
column $ with the word ACCEPT. Basically, if after parsing the string we end up on ACCEPT, then the
string is accepted by the parser.

ACTION GOTO
𝑎 𝑏 $ 𝑆 𝐴
I0 S3 S4 1 2
I1 accept
I2 S3 S4 5
I3 S3 S4 6
I4 R3 R3 R3
I5 R1 R1 R1
I6 R2 R2 R2

Phew! The table is now complete and now we can proceed with the next step which is to use a string
for testing. Let us assume a string as 𝒂𝒃𝒂𝒃. From intuition, we know that this string is part of the CFG
and hence should be accepted by the parser. As usual, we first get the tokens of the string –

To perform the analysis, the parser actually uses a stack. Let us assume such a stack with the initial
value as 0
Here, the number 0 denotes the initial closure I0. Similarly, a number 3 will represent closure I3 and
so on. Now, we traverse the input string.

The first element is 𝒂. As per the stack, we are currently in closure I0. Thus, we can see that at closure
I0, we have encountered 𝒂. Thus, as per the table we need to shift to I3 (𝑺𝟑). Thus, the stack now
becomes –

Now, the next symbol coming is 𝒃. Thus, as per the stack we are in closure I3 and now we have
encountered a 𝒃. Looking at the table, we need to perform 𝑺𝟒. Hence, the stack becomes –

Next, we have the symbol 𝒃. As per the table, when we encounter 𝒃 in closure I4, the output is Rule 3
(𝑹𝟑). This is an interesting case. Here, we are not shifting but rather this is where we perform
reduction – which is where we go up a level in the tree. Here are the rules of the reduction –

• Move up the tree as per the rule.
• If the rule is of the form 𝜶 → 𝜷, then pop 𝟐 ∗ 𝒍𝒆𝒏(𝜷) elements from the stack (each grammar
symbol on the stack is paired with a state).
• Push the reduced non-terminal onto the stack.

In our case, we have encountered Rule 3 which is –

𝐴→𝑏
Therefore, the 𝒃 symbol will get reduced to 𝑨 non-terminal. At the same time, we know that 𝒍𝒆𝒏(𝒃) =
𝟏. Thus, we pop 2 elements from the stack and push 𝑨 into it. So, the tree will now become something
as shown below –

At the same time, the stack has now become –

Now, we are at closure I3 and have encountered symbol 𝑨. As per the table, we GOTO closure I6.

The next symbol is 𝒂 and we are at closure I6. As per the table, we need to perform reduction via Rule
2. As per R2, we have –

𝐴 → 𝑎𝐴
That means both 𝒂 and 𝑨 get reduced to 𝑨. We move up a level in the graph. Also, 𝒍𝒆𝒏(𝒂𝑨) = 𝟐.
Thus, we pop 4 elements from the stack. The graph and the stack are now modified as follows –

At this point, we are at closure I0 and the symbol encountered is 𝑨. Thus, we GOTO closure I2.

The next symbol is 𝒂 and at closure I2, we encounter a 𝒂 and then move on to closure I3.

The next symbol to appear is 𝒃 which upon closure I3 will shift to closure I4.

At this point, all the symbols of the string have been pushed onto the stack at some point or the other.
So all that is left now is reduction. Now, the last symbol to be encountered is $. As per the table, the
closure I4 on $ will give reduction via Rule 3. Thus, the tree and stack become –

On closure I3, we encountered 𝑨 and thus we GOTO to closure I6.

Now, in closure 6 if we encounter $, then we need to reduce via Rule 2. Thus, we get –
In closure 2, if we encounter 𝑨, we GOTO closure I5.

Now, if we encounter a $ in closure I5, we need to reduce via Rule 1. Hence, we get –

On closure I0, when we encounter 𝑺, we need to GOTO closure I1. Now, if we encounter $ at closure
I1, we can see that we get ACCEPT.

OOOFF!!! We are done finally. We can see that we started with the entire string and built the tree in a
bottom – up fashion. At the same time, since we reached ACCEPT, it means that the string is a valid
string for the given CFG.
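
The same parse can be run as a program. Below is a minimal sketch of the table-driven shift-reduce loop in Python, using the ACTION/GOTO table built above. For simplicity the stack holds states only, so a reduction by 𝜶 → 𝜷 pops 𝒍𝒆𝒏(𝜷) entries rather than the 𝟐 ∗ 𝒍𝒆𝒏(𝜷) of the hand trace (which pushed symbol/state pairs).

```python
# R1: S -> AA, R2: A -> aA, R3: A -> b
ACTION = {
    (0, 'a'): ('shift', 3), (0, 'b'): ('shift', 4),
    (1, '$'): ('accept',),
    (2, 'a'): ('shift', 3), (2, 'b'): ('shift', 4),
    (3, 'a'): ('shift', 3), (3, 'b'): ('shift', 4),
    (4, 'a'): ('reduce', 3), (4, 'b'): ('reduce', 3), (4, '$'): ('reduce', 3),
    (5, 'a'): ('reduce', 1), (5, 'b'): ('reduce', 1), (5, '$'): ('reduce', 1),
    (6, 'a'): ('reduce', 2), (6, 'b'): ('reduce', 2), (6, '$'): ('reduce', 2),
}
GOTO = {(0, 'S'): 1, (0, 'A'): 2, (2, 'A'): 5, (3, 'A'): 6}
RULES = {1: ('S', 2), 2: ('A', 2), 3: ('A', 1)}   # rule no. -> (LHS, len(RHS))

def lr_parse(tokens):
    stack = [0]                              # start in closure I0
    tokens = list(tokens) + ['$']
    i = 0
    while True:
        act = ACTION.get((stack[-1], tokens[i]))
        if act is None:
            return False                     # empty cell: syntax error
        if act[0] == 'accept':
            return True
        if act[0] == 'shift':
            stack.append(act[1])             # consume the token, push state
            i += 1
        else:                                # reduce, then take the GOTO
            lhs, n = RULES[act[1]]
            del stack[-n:]
            stack.append(GOTO[(stack[-1], lhs)])

print(lr_parse('abab'))   # True  -- the string traced above
print(lr_parse('ab'))     # False -- derives a single A, not S
```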

BOTTOM – UP PARSER THEORY

We just performed the process of developing a BUP, so we should define a few terms as well.

• Handle – A substring of a (right-)sentential form that matches the RHS of a production is called a handle.


• Pruning – The process of searching and replacing a handle with its LHS part is called pruning.

In the example we had taken previously, there was a production –


𝑆 → 𝐴𝐴
Thus, 𝑨𝑨 is a handle while the process in the last step where we replaced 𝐴𝐴 with 𝑺 is called pruning.

Now, we can see that we performed basically 2 main operations in the entire process – Shifting to a
new closure and Reducing based on a rule. Therefore, BUPs are also called Shift-Reduce Parsers.

The parser we have just created scans the string from left to right and, at the same time,
performs a reverse rightmost derivation. It also uses no lookahead symbols. Therefore, this
parser is called an LR(0) parser.

Since we are doing Reverse Rightmost Derivation, we can use this parser on CFG with Left-recursion as
well. Additionally, suppose we have a production as follows –

𝑆 → 𝑎𝐴 | 𝑎𝐵
The CFG with the above production is non-deterministic since when we encounter 𝒂, we don't know
whether the next symbol will be 𝑨 or 𝑩. However, that is a problem only when using leftmost derivation. In
case we are using (reverse) rightmost derivation, we don't have any problems. In conclusion, LR(0) can
work on left-recursive and non-deterministic CFGs. Therefore, a BUP has higher expressive power
when compared to a TDP.

Question

Check if the given CFG is LR(0) or not

Answer

For this case, the canonical closures will be as follows –

𝑆 → .𝐸
𝐸 → .𝑇 + 𝐸
𝐸 → .𝑇
𝑇 → . 𝑖𝑑
At this point itself, we can see that after taking the GOTO on 𝑻 from I0, the resulting closure contains both a shift
(on +, from 𝐸 → 𝑇. + 𝐸) and a reduction (𝐸 → 𝑇.). The parse table will therefore have multiple values in a cell and thus, the CFG is NOT LR(0).

Question

Check if the given CFG is LR(0) or not

Answer

For this case, the canonical closures will be as follows –

In this case, I1 has a reduction and a shift. However, the reduction 𝑆 → 𝐸. is just the accept
condition, so it does not cause a clash. On the other hand, in I2 there is a reduction and a shift.
Therefore, there is an SR conflict and hence the grammar is not LR(0).

Question

Check if the given grammar is LR(0) or not.

Answer

This is not a LR(0) grammar.

SLR(1) PARSER

Let us take the case of the grammar shown below –

𝐸 →𝑇+𝐸|𝑇
𝑇 → 𝑖𝑑
Now, for this grammar we can proceed to form the canonical closure –

From this, we can draw the parse table as follows –

ACTION GOTO
+ 𝑖𝑑 $ 𝐸 𝑇
I0 S3 1 2
I1 ACCEPT
I2 S4/R2 R2 R2
I3 R3 R3 R3
I4 S3 5 2
I5 R1 R1 R1
We can see that the cell in row I2 and column + has 2 values. This is an SR conflict. Thus, the given
grammar is not in LR(0) form. However, we are now going to learn about a new parser – the SLR(1) parser.

The process for SLR(1) is the same as LR(0) however as the name suggests, it has a single lookahead
symbol. This means that “the reduction rules are applied to the Follow of the LHS”.

In our case, we have –

𝐹𝑂𝐿𝐿𝑂𝑊(𝐸) = {$}
𝐹𝑂𝐿𝐿𝑂𝑊(𝑇) = {$, +}
We can see that Rules R1 and R2 have 𝑬 as the LHS while Rule R3 has 𝑻 as the LHS. Thus, we can say
that –

• R1 and R2 will only be filled in the $ column


• R3 will be filled only in {$, +} columns

Thus, the parse table now becomes –

ACTION GOTO
+ 𝑖𝑑 $ 𝐸 𝑇
I0 S3 1 2
I1 ACCEPT
I2 S4 R2
I3 R3 R3
I4 S3 5 2
I5 R1

Now there are no conflicts, and thus this CFG is accepted by an SLR(1) parser. Hence, we can say that
the expressive power of SLR(1) is greater than the expressive power of LR(0).

Question

Check if the given grammar is in LR(0) and SLR(1)

Answer

𝑆′ → . 𝑆
𝑆 → . 𝐴𝑎
𝑆 → . 𝑏𝐴𝑐
𝑆 → . 𝑑𝑐
𝑆 → . 𝑏𝑑𝑎
𝐴 → .𝑑
We can draw the closures as follows –

For LR(0), we can write the parse table as follows –

ACTION GOTO
𝑎 𝑏 𝑐 𝑑 $ 𝑆 𝐴
I0 S3 S4 1 2
I1 ACCEPT
I2 S7
I3 S6 8
I4 R5 R5 S5/R5 R5 R5
I5 R3 R3 R3 R3 R3
I6 S10/R5 R5 R5 R5 R5
I7 R1 R1 R1 R1 R1
I8 S9
I9 R2 R2 R2 R2 R2
I10 R4 R4 R4 R4 R4

We can see that this grammar is not in LR(0). Now, we need to check for SLR(1). To do so, first we need
to get the Follow functions as follows –

𝐹𝑂𝐿𝐿𝑂𝑊(𝑆) = {$}
𝐹𝑂𝐿𝐿𝑂𝑊(𝐴) = {𝑎, 𝑐}
Hence, the SLR(1) parse table becomes –

ACTION GOTO
𝑎 𝑏 𝑐 𝑑 $ 𝑆 𝐴
I0 S3 S4 1 2
I1 ACCEPT
I2 S7
I3 S6 8
I4 R5 S5/R5
I5 R3
I6 S10/R5 R5
I7 R1
I8 S9
I9 R2
I10 R4

We can see that there are still SR conflicts. Hence, this grammar is NOT in SLR(1) as well.

NOTE

In general, the Venn diagram of the different types of parsers is given as follows –

We can see that LL(1) has a lot of power, but every LL(1) grammar is LR(1). Therefore, it just proves
that the BUP have a higher power when compared to TDP.

Question

Check if the given grammar is in LR(0) and SLR(1)

Answer

The CFG is SLR(1) but not LR(0).

Question

Check if the given grammar is in LR(0) and SLR(1)


Answer

The CFG is neither LR(0) nor SLR(1).

Question

Check if the given grammar is in LR(0) and SLR(1)

Answer

The CFG is neither LR(0) nor SLR(1).

CLR(1) PARSER

This is the most powerful parser. In short, if a grammar is not in CLR(1), then it can't be handled by any
of the LR parsers. To understand the CLR(1) parser, let us take an example of a CFG as follows –

𝑆′ → . 𝑆
𝑆 → . 𝐶𝐶
𝐶 → . 𝑐𝐶
𝐶 → .𝑑
If we were making SLR(1), then we would have found the follow for S and C. However, we can see that
𝑪 occurs a total of 3 times in the RHS of the productions. Each of these occurrences have a different
follow. Let us label these occurrences as follows –

𝑆′ → . 𝑆
𝑆 → . 𝐶1 𝐶2
𝐶1 → . 𝑐𝐶3
𝐶1 → . 𝑑
We can see that even though 𝐶1 = 𝐶2 = 𝐶3 = 𝐶, their follows will be different –

𝐹𝑂𝐿𝐿𝑂𝑊(𝐶1 ) = {𝑐, 𝑑}
𝐹𝑂𝐿𝐿𝑂𝑊(𝐶2 ) = {$}
𝐹𝑂𝐿𝐿𝑂𝑊(𝐶3 ) = {𝑐, 𝑑}
Here is the interesting part. If we were constructing SLR(1), we would be writing the Reduction rules
in the columns 𝒄, 𝒅, $ since SLR(1) would have calculated 𝐹𝑂𝐿𝐿𝑂𝑊(𝐶) = {𝑐, 𝑑, $}. On the other hand,
for CLR(1), we can see that each occurrence has a different follow set. Hence, when we fill the table,
CLR(1) will have fewer SR/RR conflicts compared to SLR(1). This is the reason CLR(1) is the most
accommodating, has the highest expressive power and also has the fewest reduce entries in the parse table.
To create CLR(1), we need to include each production's follow set in the closure. For example, if a closure
has the production 𝜶 → 𝜷, then we write –

𝜶 → 𝜷 , 𝑭𝑶𝑳𝑳𝑶𝑾(𝜶)
Thus, for the example grammar, the initial closure will be as follows –

One thing to note here is that 𝐶1, 𝐶2 and 𝐶3 are basically the same as 𝑪. Since this is the first example,
I am differentiating them so that it is easy to understand which occurrence of 𝐶 we are talking about.
After this, we continue drawing the canonical closures as we have –

Since we have 𝐶1 = 𝐶2 = 𝐶3 = 𝐶, we can modify the canonical closure as follows –


Using the above closure, we can write the parse table as follows –

ACTION GOTO
𝑐 𝑑 $ 𝑆 𝐶
I0 S3 S4 1 2
I1 ACCEPT
I2 S6 S7 5
I3 S3 S4 8
I4 R3 R3
I5 R1
I6 S6 S7 9
I7 R3
I8 R2 R2
I9 R2

As we can see, there are no SR/RR conflicts; thus the CFG is CLR(1).

We can see that there are a couple of closures that have the same productions but have different
follows. These closures are –

𝐼3 − 𝐼6
𝐼4 − 𝐼7
𝐼8 − 𝐼9
Thus, we can combine these closures into each other. That will give us the following parse table –

ACTION GOTO
𝑐 𝑑 $ 𝑆 𝐶
I0 S3 S47 1 2
I1 ACCEPT
I2 S6 S7 5
I36 S36 S47 89
I47 R3 R3 R3
I5 R1
I89 R2 R2 R2

As we can see, we are able to perform the combination without any conflicts. Thus, we can say that
this grammar is also LALR(1).

NOTE

If a grammar is in CLR(1) and there is either no possibility of combination, or there is a possibility of
combination and it causes no conflicts, then the CFG is also in LALR(1). If there is a possibility of
combination but it results in a conflict, then the CFG is not in LALR(1).
Question

Check if the given grammar is CLR(1) and LALR(1) or not.

Answer

This is CLR(1) but not LALR(1).

OPERATOR PRECEDENCE GRAMMAR

These are parsers that are used to manipulate and work on mathematical expressions. It is a less
complex grammar that can be made on both ambiguous and unambiguous grammars. Every CFG is
not OPG. OPG are CFG which have the following properties –

• They don’t contain NULL productions


• There are no adjacent non-terminals on the RHS of the productions.

For example –

SEMANTIC ANALYSIS

This is the next step in the compiler process. To perform a Semantic Analysis, we need to first define
something called Syntax Directed Translation (SDT). We can say that –

𝑆𝐷𝑇 = 𝐺𝑟𝑎𝑚𝑚𝑎𝑟 + 𝑆𝑒𝑚𝑎𝑛𝑡𝑖𝑐 𝑅𝑢𝑙𝑒𝑠 + 𝑆𝑒𝑚𝑎𝑛𝑡𝑖𝑐 𝐴𝑐𝑡𝑖𝑜𝑛𝑠


The semantic analyzer is quite tough to learn in depth, so in GATE they usually ask very specific types of
questions. We will have a look at these questions along the way. Also note that SDT can be used
for purposes other than semantic analysis.

Question
Answer

First off, let us notice that against each of the productions, we have some expressions inside curly
braces. These are the semantic statements. These statements are usually of 2 types –

• Semantic Rules – If there is an 𝒊𝒇 (condition) in the semantic statement, then it is called a
semantic rule.
• Semantic Action – If there is no 𝒊𝒇 statement, then it is called a semantic action.

In the question, we don’t have any semantic rules, so there is no condition check here. Now, let us
learn how to interpret the semantic action. For example, let us take the production –

𝑆 → 𝑆1 # 𝑇 {𝑆. 𝑣𝑎𝑙 = 𝑆. 𝑣𝑎𝑙 ∗ 𝑇. 𝑣𝑎𝑙}

As per the semantic action, we can say that the production will multiply 𝑆 and 𝑇 and store the result
back in 𝑆. One small thing to note: as per the question, a sub-scripted non-terminal is the same
non-terminal at that instant, so basically 𝑆 = 𝑆1.

Thus, we can write –

𝑥 # 𝑦 ⟹ 𝑥 ∗ 𝑦
Similarly, we can see that the % (mod) sign represents division. Thus, the expression becomes –

20 ∗ 10 ÷ 5 ∗ 8 ÷ 2 ÷ 2
Here comes the important part. Now that we have the operator definitions, we need the OPERATOR
PRECEDENCE AND ASSOCIATIVITY.

To get the precedence, we use common sense. When we build the parse tree, we go from the root node to the
leaf nodes. However, when we calculate values, we go from the leaf nodes to the root. Hence, the
operator that is evaluated first when we go from bottom to top has higher precedence. In our case,

Pr(%) > Pr(#)


Once we have the precedence, we need associativity. To get associativity, check the production
recursion which has the operator. In our case, we have –

𝑆 → 𝑆1 # 𝑇
𝑇 → 𝑇1 % 𝑅
Since these both productions are left recursive, we can say that both operators have a left-to-right
associativity.
Now, we can solve the expression –

20 ∗ (10 ÷ 5) ∗ ((8 ÷ 2) ÷ 2)

The final value will be 80.

Question

Answer

We can see from the grammar that –

• # is multiplication and is left-to-right associative


• & is addition and is left-to-right associative

We can also say that since & comes first when going from bottom-to-top, & has higher priority when
compared to #.

2 ∗ (3 + 5) ∗ (6 + 4)
The final answer will be 160.

Question
Answer

In this case, the best thing will be to make the parse tree as follows –

Now, we go from bottom to top to get the value of the expression. Shown below is the path we take
for valuation –

Paths 1, 4, 6, 7 and 8 will print as per the semantic actions mentioned in the grammar. Thus,
we get –

𝑭𝒊𝒏𝒂𝒍 𝑬𝒙𝒑𝒓𝒆𝒔𝒔𝒊𝒐𝒏 = 𝟐 𝟑 𝟒 ∗ +


If we look carefully, this grammar is used to perform infix-to-postfix conversion.

Question
Answer

Here,

• − means subtraction and ∗ means multiplication.


• − is right associative while ∗ is left associative.
• − has higher precedence when compared to ∗

With this information, the expression becomes –

𝐸 = (4 − (2 − 4)) ∗ 2 = 𝟏𝟐

Question

Answer

If we notice, the semantic actions mentioned here are not for evaluation but rather for making a node.
So, this grammar is used to create a tree. To start, first let us draw a rough tree and parse sequence as
follows –
Let us now go step-by-step.

Step 1 – We go from 𝑛𝑢𝑚(2) to 𝐹. As per the grammar, the action here is as follows –

This means that we need to make a node with left and right pointers as NULL, the value as 𝒊𝒅(𝟐) and
its parent pointer will be 𝑭.

In steps 2 and 3, the parent pointer first changes to 𝑇 and then to 𝐸. In step 4, just like Step 1 another
node is created –

Then in Step 5 the parent pointer for this node is changed to 𝑇 and then in Step 6, another node for
value 4 is created. So now, we have 3 nodes in total –

In the next step, we merge using the production 𝑇 → 𝑇 ∗ 𝐹. The semantic action in this case will
be –

Hence, we need to create a node with the left pointer as the 𝑇 pointer, the right pointer as the 𝐹
pointer and the value as ∗. The parent pointer will be the 𝑇 pointer.

Similarly, we do the final step and get the final tree as follows –
CLASSIFICATION OF ATTRIBUTES

Based on the process of evaluation, attributes are classified into two types –

• Synthesized – The attributes whose values are calculated based on their children values.
• Inherited – The attributes whose values are calculated based on their parent or left sibling.

For example, let us take the following production –

𝐴 → 𝑋𝑌𝑍
If the value of 𝑨 is calculated by using the values of 𝑋, 𝑌 or 𝑍, then 𝑨 is a synthesized attribute. If the
value of 𝒀 is calculated by using the values of 𝐴 or 𝑋, then 𝒀 is termed an inherited attribute.

INTERMEDIATE CODE GENERATION


Let us take one expression as follows –

((𝑎 + 𝑎) + (𝑎 + 𝑎)) + ((𝑎 + 𝑎) + (𝑎 + 𝑎))

We can convert this to postfix using a stack as we had seen in DS. Using syntax tree, we can represent
this expression as follows –

As we can see, the syntax tree has a lot of repetition. In short, we don’t need to create new nodes for
each of the 𝑎 + 𝑎 operations. Hence, we reduce this to get the Directed Acyclic Graph (DAG).

Now, let us take another expression as follows –

𝐹 = −(𝑎 + 𝑏) ∗ (𝑐 + 𝑑) + (𝑎 + 𝑏 + 𝑐)
Here, we can also express it using Three Address Code (TAC). In a three address code, each statement can
have at most three addresses (operands). Hence, for this case the TAC will be –
The TAC is also represented in the form of Quadruples –


Here, we utilize a lot of space, but at the same time we have the flexibility to move the instructions up
and down. To save space, we can represent the code using triples –

Here, the statement numbers are used as operands. Even though the space used is less, there is
no flexibility to change the statement order.

Static Single Assignment Form

This is a property of the intermediate code where each variable is assigned only once. So, existing
variables are split into versions. For example, let us say we have the following code –

𝑎 =𝑏+𝑐
𝑎 =2∗𝑎
For SSA form, we want the assignment to a variable be done just once. Therefore, we split 𝑎 into two
versions as follows –

𝑎 =𝑏+𝑐
𝑎1 = 2 ∗ 𝑎
Now this is in SSA.
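
A sketch of this renaming for straight-line code is below; note that, unlike the example above, the common convention renames every assignment (a1, a2, …), and the (lhs, operands) statement encoding is an assumption.

```python
def to_ssa(stmts):
    """stmts: list of (lhs, rhs_operands); returns the SSA-renamed list."""
    version = {}
    out = []
    for lhs, rhs in stmts:
        # uses refer to the newest version of each variable seen so far
        rhs = [f'{v}{version[v]}' if v in version else v for v in rhs]
        version[lhs] = version.get(lhs, 0) + 1   # each assignment = new version
        out.append((f'{lhs}{version[lhs]}', rhs))
    return out

print(to_ssa([('a', ['b', 'c']), ('a', ['2', 'a'])]))
# [('a1', ['b', 'c']), ('a2', ['2', 'a1'])]
```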

Constant Folding

In an expression, constant sub-expressions can be evaluated at compile time to reduce run
time. This is called constant folding. Suppose we have the expression –

𝑎 =𝑏+𝑐+2+3∗4
Using constant folding, we can simplify the expression as follows –

𝑎 = 𝑏 + 𝑐 + 14

Constant Propagation

Suppose we have expressions as follows –

𝑝𝑖 = 3.1415
𝑐 = 2 ∗ 𝑝𝑖 ∗ 𝑟
We know that the value of PI will not change as it is a constant. So, instead of declaring it as a variable,
we can directly plug it in the expression. This is called constant propagation. Combining with constant
folding, we can improve the expression as follows –

𝑐 = 6.283 ∗ 𝑟

Strength Reduction

In this case, the same expression can be implemented in more than one way. Thus, it makes sense for
the compiler to use the form that has the lowest cost. This is called strength reduction. For
example, let us take the 2 expressions –

𝑦 =2∗𝑥
𝑦=𝑥+𝑥
In both cases, 𝑦 will have the same value. However, multiplication is generally costlier than
addition. Thus, the compiler would compile the second expression to save
time and resources.

Redundant Code Elimination

As we have seen, there is more than one way to execute an expression, which can lead to
redundant code lines. It is better to remove those expressions. This is called
Redundant Code Elimination. For example, let us look at the expressions below –

𝑥 = 𝑎+𝑏
𝑦 =𝑏+𝑎
Instead of performing addition twice, we can simply write –

𝑥 = 𝑎+𝑏
𝑦=𝑥

Algebraic Simplification

Use the basic laws of math to simplify the expressions. For example,
𝑥 =𝑎∗1
𝑦 = 𝑏+0
We can save out on both the operations and simply write –

𝑥=𝑎
𝑦=𝑏

Loop Optimization

As the name suggests, we need to optimize the loops in the program. To do so, we first need to perform
the following steps –

• First, we convert the High level code to 3-address code


• Then, we get the leader statements in the 3-address code
o The first statement is always the leader
o The statement right after jump statement is a leader
o The target statements in the jump statements are also leaders
• Now, we divide the 3-address code lines into blocks where the statements between 2 leaders
is called a block.
• Finally, we draw the Control Flow Graph to see the relation between the blocks

For example, let us take the code –

First, we convert the HLL to 3-address codes –

In this case, Lines 1,3,4,9 will be the leaders. Hence, we can make the blocks as follows –
Now, we can draw the CFG as follows –
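
The leader rules can also be written as a short routine. Below is a sketch in Python; the instruction strings and the jump table are hypothetical stand-ins for a three-address program.

```python
def basic_blocks(code, jumps):
    """code: list of instructions (1-based positions);
       jumps: maps each jump instruction's line to its target line."""
    leaders = {1}                              # rule 1: first statement
    for line, target in jumps.items():
        leaders.add(target)                    # rule 2: jump targets
        if line + 1 <= len(code):
            leaders.add(line + 1)              # rule 3: statement after a jump
    order = sorted(leaders)
    return [code[s - 1:(order[i + 1] - 1) if i + 1 < len(order) else len(code)]
            for i, s in enumerate(order)]

code = ['i = 0',             # 1  leader: first statement
        'if i >= n goto 6',  # 2
        't = i * 4',         # 3  leader: follows a jump
        'i = i + 1',         # 4
        'goto 2',            # 5
        'x = t']             # 6  leader: jump target
print(basic_blocks(code, {2: 6, 5: 2}))
# [['i = 0'], ['if i >= n goto 6'], ['t = i * 4', 'i = i + 1', 'goto 2'], ['x = t']]
```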

Loop Jamming

This is the process where we combine 2 loops into one loop if they run over the same index
range. For example, let us take the case –

We can see that the second loop and the outer first loop have the same indexing. Thus, we can perform
loop jamming here as shown below –
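
In code, the transformation looks like this (the arrays and loop bodies are invented for the example):

```python
n = 5
a, b = [0] * n, [0] * n

# Before: two separate passes over the same index range
for i in range(n):
    a[i] = i * 2
for i in range(n):
    b[i] = i + 1

# After loop jamming: one fused loop does both assignments per iteration
for i in range(n):
    a[i] = i * 2
    b[i] = i + 1
```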
Loop Unrolling

This is where we reduce the number of loop iterations by doing more work per iteration. For example, let us consider the
code as follows –

In this case, we need to perform 100 iterations. Instead, we can perform Loop unrolling and get the
following code –

In this case, we will be performing just 50 iterations and getting the same output.
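
As an illustration (the loop body, summing the index, is invented for the example):

```python
total = 0
for i in range(100):           # before: 100 iterations
    total += i

unrolled = 0
for i in range(0, 100, 2):     # after unrolling by a factor of 2: 50 iterations
    unrolled += i
    unrolled += i + 1

print(total == unrolled)       # True: same output, half the loop overhead
```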

Code Movement

Basically, move all the code that does not depend on the loop outside the loop (this is also known as loop-invariant code motion).
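
A small illustration (x, y and the loop body are invented for the example):

```python
n, x, y = 5, 3, 4
r = [0] * n

for i in range(n):     # before: x * y is recomputed on every iteration
    r[i] = x * y + i

t = x * y              # after code movement: the invariant product is hoisted
for i in range(n):
    r[i] = t + i
```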

Live and Dead variables

The basic idea is that a variable may finish its role before the end of
the program, and it can then relinquish its memory location. While a variable's value is still going to be used,
it is said to be live. When a variable has finished its role in the program, it is termed dead. For
example,

1. 𝑝 =𝑞+𝑟
2. 𝑠 =𝑝+𝑞
3. 𝑢 =𝑠∗𝑣
4. 𝑣 =𝑟+𝑢
5. 𝑞 =𝑣+𝑟
For these 5 statements, we can use a table to get the live and dead variables as follows –

𝒑 𝒒 𝒓 𝒖 𝒗 𝒔
1. Dead Live Live Dead Live Dead
2. Live Live Live Dead Live Dead
3. Dead Dead Live Dead Live Live
4. Dead Dead Live Live Live Dead
5. Dead Dead Live Dead Live Dead
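
The table can be reproduced with a simple backward pass. Below is a sketch in Python; the (defined, used) encoding of the five statements is an assumption, and nothing is assumed live after the last statement.

```python
# statements 1..5 as (variable defined, variables used)
stmts = [('p', {'q', 'r'}), ('s', {'p', 'q'}), ('u', {'s', 'v'}),
         ('v', {'r', 'u'}), ('q', {'v', 'r'})]

live = set()                              # nothing assumed live after line 5
rows = []
for defined, used in reversed(stmts):
    live = (live - {defined}) | used      # kill the definition, add the uses
    rows.append(live.copy())

for i, row in enumerate(reversed(rows), 1):
    print(i, sorted(row))                 # live variables at each statement
# 1 ['q', 'r', 'v']
# 2 ['p', 'q', 'r', 'v']
# 3 ['r', 's', 'v']
# 4 ['r', 'u']
# 5 ['r', 'v']
```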
QUESTION BANK
Question 1

Question 2

Question 3
Question 4

Question 5

Question 6
Question 7

Question 8

Question 9
Question 10

Question 11

Question 12
Question 13

Question 14
Question 15

Question 16

Question 17
Question 18

Question 19

Question 20
Question 21

Question 22

Question 23
Question 24

Question 25

Question 26
Question 27

Question 28

Question 29
Question 30

Question 31

Question 32
Question 33

Question 34

Question 35
Question 36

Question 37

Question 38

Question 39
Question 40

Question 41

Question 42
Question 43

Question 44

Question 45
Question 46

Question 47

Question 48

Question 49
Question 50

Question 51

Question 52
Question 53

Question 54

Question 55
Question 56

Question 57

Question 58

Question 59
Question 60

Question 61

Question 62
Question 63

Question 64

Question 65
Question 66

Question 67

Question 68
Question 69

Question 70

Question 71
Question 72

Question 73

Question 74

Question 75
Question 76

Question 77

Question 78
ANSWER KEY
1 C 17 B 33 C 49 B 65 D
2 B 18 A 34 A 50 A 66 C
3 C 19 C 35 A 51 D 67 B
4 A 20 A 36 5 52 C 68 B
5 A 21 C 37 D 53 C 69 B
6 C 22 A 38 D 54 D 70 B
7 A 23 C 39 C 55 D 71 A
8 C 24 C 40 8 56 C 72 6
9 25 C 41 D 57 A 73 C
10 C 26 D 42 B 58 A 74 B
11 D 27 31 43 A 59 A 75 10
12 B 28 C 44 9 60 B 76 8
13 D 29 A 45 B 61 B 77 B
14 C 30 D 46 B 62 80 78 C
15 C 31 A 47 5 63 C
16 C 32 C 48 A 64 B
