Compiler Design Notes CSE
Users code in a high-level language (HLL) while the machine can understand only low-level machine
language (LLL). To convert HLL to LLL, we need to use a Language Processing system –
The compiler is the part of the Language Processing system which changes pure HLL into Assembly-
level language. Let us understand this chain of processing –
COMPILER vs INTERPRETER
An alternative to the compiler is an interpreter. An interpreter takes the HLL code line by line,
translates it to a lower-level form and executes it immediately, then moves to the next line. Thus,
compared to a compiler, an interpreter does not store any intermediate/assembly code, so there
is no need for a loader/linker.
COMPILER BREAKDOWN
Here, the lexical analyzer, syntax analyzer and semantic analyzer are present in the “front-end” while
the rest of the blocks are present in the “back-end”.
Lexical Analyzer
This is the part of the compiler that reads a program and converts it into Lexemes (words/tokens). It also
removes whitespaces and comments. For example, let us assume the following code –
In this case, the lexical analyzer will divide the code line into the following tokens –
Syntax Analyzer (Parser)
A parser takes the tokens from the lexical analyzer and creates a parse tree. This parse tree is formed
with the help of the Context-free grammar that is provided for the system.
Semantic Analyzer
This part checks whether the parse tree is meaningful, i.e. it verifies the semantics of the program
(for example, type compatibility and declarations) on top of the grammar check.
Intermediate Code Generator
This generates an intermediate code, which is a form that can be readily lowered to machine code.
The intermediate code is largely independent of the source language and the target machine; only
the last two phases are platform dependent. This means that, in principle, the same intermediate
code can be handed to back-ends for different target machines.
Code Optimizer
This is used to transform the code so that it consumes fewer resources and runs faster.
The optimization can be either machine dependent or machine independent.
Code Generator
This is the final step, which emits the final assembly code. The assembly code being generated
depends on the target assembler/machine.
LEXICAL ANALYZER
Lexical analysis (also called tokenization or lexing) is a process of converting a sequence of characters
into a sequence of tokens. A sequence of characters in the input string that matches the pattern of a
token is called a lexeme. A Lexical analyzer (Lexer) will take an input string and yield output in the form
of tuples –
(𝑻𝒐𝒌𝒆𝒏, 𝑳𝒆𝒙𝒆𝒎𝒆)
For example, let us take the code line as follows –
int x = a*b;
• (datatype, int)
• (identifier, x)
• (operator, =)
• (identifier, a)
• (operator, *)
• (identifier, b)
• (separator, ;)
Apart from the above, the lexer is also responsible for –
• Removing comments
• Removing whitespaces
• Correlating errors with the source code line numbers.
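Below is a minimal sketch of such a tokenizer in Python, assuming the hypothetical token categories used in the example above (datatype, identifier, operator, separator); a real lexer would typically be generated from the lexical grammar by a tool such as lex/flex.

```python
import re

# Token categories and their patterns, tried in order (keywords before identifiers).
TOKEN_SPEC = [
    ("datatype",   r"\b(?:int|float|char)\b"),
    ("identifier", r"[A-Za-z_]\w*"),
    ("operator",   r"[=+\-*/]"),
    ("separator",  r";"),
    ("skip",       r"\s+"),                # whitespace is discarded, not emitted
]
MASTER = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in TOKEN_SPEC))

def tokenize(code):
    """Yield (token, lexeme) tuples, dropping whitespace."""
    for m in MASTER.finditer(code):
        if m.lastgroup != "skip":
            yield (m.lastgroup, m.group())

print(list(tokenize("int x = a*b;")))
# [('datatype', 'int'), ('identifier', 'x'), ('operator', '='),
#  ('identifier', 'a'), ('operator', '*'), ('identifier', 'b'), ('separator', ';')]
```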
Lexer Implementation
To implement a lexer, we need to define the Lexical Grammar. The lexer will recognize and work on
strings based on the rules of the Lexical Grammar. For example, let us assume that the grammar of a
password is given as follows –
Question
Answer
T1 = (a + ε)(b + c)*a
T2 = (b + ε)(a + c)*b
T3 = (c + ε)(b + a)*c
Now, the string we have is 𝒃𝒃𝒂𝒂𝒄𝒂𝒃𝒄. If we put this string through the different token expressions,
we can see that –
T1 → bba
T2 → bb
T3 → bbaac
As the question instructs to take the token expression with the longest prefix, we must take T3. Hence,
the correct option is Option D.
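As a sketch, the longest-prefix rule ("maximal munch") can be checked in Python by anchoring each token expression at the start of the string and keeping the longest match; the regex spellings of T1–T3 below are my own translation of the expressions above.

```python
import re

TOKENS = {
    "T1": re.compile(r"a?[bc]*a"),   # (a + eps)(b + c)* a
    "T2": re.compile(r"b?[ac]*b"),   # (b + eps)(a + c)* b
    "T3": re.compile(r"c?[ab]*c"),   # (c + eps)(b + a)* c
}

def longest_prefix(s):
    """Return (token, lexeme) for the longest prefix of s matched by any token."""
    best_lexeme, best_token = "", None
    for name, pat in TOKENS.items():
        m = pat.match(s)                 # match() anchors at the start of s
        if m and len(m.group()) > len(best_lexeme):
            best_lexeme, best_token = m.group(), name
    return best_token, best_lexeme

print(longest_prefix("bbaacabc"))        # ('T3', 'bbaac'), the longest prefix
```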
LEXICAL GRAMMAR
Lexical grammar is usually given as a CFG. A CFG is a Type-2 grammar wherein any production of the form α → β
has a single variable as α, while β can be any string of terminals and variables. The process of deriving a string
from the grammar is called derivation, and the graphical representation of the derivation is called a Parse (Syntax) Tree.
For example, let us take the grammar as follows –
𝐸 → 𝐸 + 𝐸 | 𝐸 ∗ 𝐸 | 𝐸 = 𝐸 | 𝑖𝑑
If we want to create the string E = id + id ∗ id, then the parse tree will be –
If we write out the intermediate steps involved in the derivation, each step is called a sentential form.
If in every step the leftmost non-terminal is expanded first, it is called Left Most
Derivation. On the other hand, if in every step the rightmost non-terminal is expanded first, it is called
Right Most Derivation. The above sentential form is the LMD. To get the RMD for the same tree, we
can write –
If productions share a common prefix, the grammar is called a Non-Deterministic Grammar, as there is no
deterministic way to choose which production to use while deriving the string. For example,
A → αβ1 | αβ2
Now, we know that the first symbol is α, but we don't know which symbol comes next. To convert this to a
deterministic grammar (this is called left factoring), we can write it as follows –
𝐴 → 𝛼𝐵
𝐵 → 𝛽1 | 𝛽2
In short, a parser using a Non-deterministic grammar will have to backtrack at some point. To avoid
backtracking, we need to convert the Non-Deterministic grammar to a Deterministic Grammar.
NOTE
Question
𝑆 → 𝑎𝑆𝑏 | 𝑎𝑏𝑆 | 𝑎𝑏
Answer
𝑆 → 𝑎𝐴
𝐴 → 𝑆𝑏 | 𝑏𝐵
B → S | ε
If the variable on the LHS appears on the RHS as well, it is called recursive grammar. In addition to
that, if the recursive variable appearing on the RHS is at the extreme left, then it is called Left recursive
grammar. If the recursive variable appears on the extreme right, then it is called Right recursive
grammar.
𝑆 → 𝑆𝑎 (𝐿𝑒𝑓𝑡 𝑅𝐺)
𝑆 → 𝑎𝑆 (𝑅𝑖𝑔ℎ𝑡 𝑅𝐺)
𝑆 → 𝑎𝑆𝑏 (𝐺𝑒𝑛𝑒𝑟𝑎𝑙 𝑅𝐺)
The expressive power of Left RG and Right RG is the same.
NOTE
Normally, a Left RG can send a top-down parser into an infinite loop, so it is usually a good idea to first
change a Left RG into a Right RG.
Question
𝐴 → 𝐴𝛼 | 𝛽1 | 𝛽2
Answer
𝐴 → 𝛽1 𝐵 | 𝛽2 𝐵
𝐵 → 𝛼𝐵 | 𝜖
Question
𝐴 → 𝐴𝛼1 | 𝐴𝛼2 | 𝛽
Answer
𝐴 → 𝛽𝐵
𝐵 → 𝛼1 𝐵 | 𝛼2 𝐵 | 𝜖
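The transformation above (A → Aα1 | Aα2 | β becomes A → βB, B → α1B | α2B | ε) is mechanical, so it can be sketched in a few lines of Python; representing productions as plain strings of single-character symbols is an assumption made for brevity.

```python
def remove_left_recursion(head, productions):
    """Remove immediate left recursion from head -> productions.
    Symbols are single characters; "eps" denotes the empty string."""
    new = head + "'"
    alphas = [p[len(head):] for p in productions if p.startswith(head)]  # A -> A alpha
    betas  = [p for p in productions if not p.startswith(head)]          # A -> beta
    if not alphas:
        return {head: productions}       # no left recursion: nothing to do
    return {
        head: [b + new for b in betas],
        new:  [a + new for a in alphas] + ["eps"],
    }

# S -> Sa | 0  becomes  S -> 0S' ; S' -> aS' | eps
print(remove_left_recursion("S", ["Sa", "0"]))
# {'S': ["0S'"], "S'": ["aS'", 'eps']}
```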
Question
𝑆 → 𝑆𝑆𝑆 | 0
Answer
𝑆 → 0𝐴
𝐴 → 𝑆𝑆𝐴 | 𝜖
Question
𝑆 → 𝑆1𝑆 | 0
Answer
𝑆 → 0𝐴
𝐴 → 1𝑆𝐴 | 𝜖
Question
𝑆 → 𝑆12 | 0
Answer
𝑆 → 0𝐴
𝐴 → 12𝐴 | 𝜖
Question
𝑆 → 𝑆0𝑆1 | 0 | 1
Answer
𝑆 → 0𝐴 | 1𝐴
𝐴 → 0𝑆1𝐴 | 𝜖
Question
Answer
𝐸 → 𝑇𝐴
𝐴 → +𝑇𝐴 | 𝜖
𝑇 → 𝐹𝐵
𝐵 → ∗ 𝐹𝐵 | 𝜖
𝐹 → (𝐸) | 𝑖𝑑
AMBIGUOUS GRAMMAR
If a CFG exists where any string can have more than one derivation tree then it is said to be ambiguous.
For example,
𝑆 → 𝑎𝑆 | 𝑆𝑎 | 𝑎
In this case, we can draw two derivation graphs as follows –
In this case, both the graphs result in the same string 𝒂𝒂 and hence, the above grammar is ambiguous.
We can write the same grammar as –
𝑆 → 𝑎𝑆 | 𝑎
Now the grammar is unambiguous. Here are a few things to note about ambiguous grammar –
• If a grammar is both left and right recursive, then it will be ambiguous. However, the
converse is not true: an ambiguous grammar need not be both left and right recursive.
• Some ambiguous grammars can't be converted to unambiguous grammars. For
example, {aⁿbᵐcᵐdⁿ} ∪ {aⁿbⁿcᵐdᵐ} is a CFL that has no unambiguous grammar (it is
inherently ambiguous).
Question
𝑆 → 𝑆𝑆 | 0 | 1
Answer
Since the grammar is both left and right recursive, it is ambiguous.
MINIMIZATION OF CFG
A NULL production is of the form A → ε. During derivation, a null production simply erases a non-terminal
and contributes nothing to the derived string. In short, a NULL production doesn't functionally add to the
string derivation. Therefore, it is advisable to remove NULL productions to simplify the CFG. For example,
let us take the CFG as follows –
𝑆 → 𝐴𝑏𝐵 | 𝑏𝐵 | 𝐴𝑏 | 𝑏
𝐴→𝑎
𝐵→𝑏
Question
Answer
S′ → S | ε
𝑆 → 𝐴𝐵 | 𝐵 | 𝐴
𝐴→𝑎
𝐵→𝑏
A production whose RHS is a single variable is called a Unit production. A unit production adds
nothing useful; its RHS can instead be substituted into the productions that use it. For example
–
𝑆 → 𝐴𝑎
A → a | B
𝐵→𝑑
Here, A → B is the unit production. Thus, we can remove it by substituting B's production as follows –
𝑆 → 𝐴𝑎
𝐴→𝑎|𝑑
Question
Answer
𝑆 → 𝑎𝐴𝑏
𝐴→𝑑|𝑐|𝑏|𝑎
Useless productions are analogous to the dead and unreachable states in FA. There can be two types of
useless productions –
For example –
In the first CFG, 𝐶 → 𝑑 is an unreachable production and hence can be removed. In the second CFG,
the variable 𝐵 can be reached from 𝑆 but it doesn’t derive any strings/terminals. So, that is also useless.
Question
Answer
𝑆 → 𝑎𝐴𝐵 | 𝑏𝐴
𝐴 → 𝑎𝐵 | 𝑏
𝐵→𝑑
NORMALIZATION OF CFG
This is done to transform the CFG into a more compiler-friendly form. We have 2 normal forms – the
Chomsky Normal Form (CNF) and the Greibach Normal Form (GNF). In CNF, every production is of the form
A → BC | a
where A, B, C are variables and a is a terminal. For example, let us take the CFG as follows –
𝑆 → 𝑎𝑆𝑏 | 𝑎𝑏
For a CNF, we can have either 2 variables or 1 terminal in the RHS. So, we can write the CFG as follows
–
𝑆 → 𝐴𝑆 ′ | 𝐴𝐵
𝑆 ′ → 𝑆𝐵
𝐴→𝑎
𝐵→𝑏
Question
Answer
𝑆 → 𝑆′𝐵 | 𝐵𝐵
𝑆′ → 𝐶𝐴
𝐴→𝑎|𝑏
𝐵→𝑏
𝐶→𝑎
Suppose we need to create a string 𝒂𝒂𝒃 using the above grammar, then we can write the sentential
forms as follows –
• 𝑆 → 𝑆′𝐵
• 𝑆 → 𝐶𝐴𝐵
• 𝑆 → 𝑎𝐴𝐵
• 𝑆 → 𝑎𝑎𝐵
• 𝑆 → 𝑎𝑎𝑏
Thus, we can see that to generate a string of length 3, we have 5 sentential forms. Therefore, we can
conclude that in general for CNF, if we need to generate a string of length 𝒏, we would have a total of
𝟐𝒏 − 𝟏 sentential forms.
In GNF, every production is of the form
A → aα
where A is a variable, a is a terminal and α ∈ Vₙ* (a possibly empty string of variables). For example, let us take the CFG as follows –
𝑆 → 𝑎𝑆𝑏 | 𝑎𝑏
Then, we can write GNF as follows –
𝑆 → 𝑎𝑆𝐵 | 𝑎𝐵
𝐵→𝑏
Question
Answer
𝑆 → 𝑎𝐴𝐵 | 𝑏𝐵
𝐴→𝑎|𝑏
𝐵→𝑏
Unlike CNF, if we need to generate a string of length 𝒏 in a GNF grammar, then we need 𝒏 sentential
forms.
SYNTAX ANALYZER (PARSER)
This is the next step after the Lexical analysis. Parsers can be of various types –
TOP-DOWN PARSERS
Top-down parsers start parsing the input and construct the parse tree from the root node,
gradually moving down to the leaf nodes. These use left-most derivations for construction. For any
top-down parser, the CFG from which it is constructed must be –
• Non-left recursive grammar (as the parser might go into an infinite loop)
• Unambiguous
Brute force parsers are simple: you check a string by trial and error. If the string doesn't match in the
CFG, then backtrack and try another production. It is a basic but time-consuming process.
These can be constructed from both deterministic and non-deterministic CFG. For example, let us
take the CFG as follows –
Now, suppose we need to construct a parse tree for the word 𝒘 = 𝒄𝒂𝒅. Then, we can start as follows
–
We got this using the left – most derivation. However, this is not the word we want. Hence, we now
backtrack and can re-write the parse tree as follows –
Question
Draw the parse tree from a Brute force parser to select the word 𝒘 = 𝒂𝒅𝒅𝒄 using the following CFG
–
Answer
Since brute force tries every alternative with backtracking, it is slow: its time complexity is O(2ⁿ).
Predictive Parsers
A predictive parser uses rules to predict the next incoming symbol. This parser doesn't support
backtracking. Therefore, it can only be constructed from a deterministic grammar. Since this parser
doesn't have the option of backtracking, there is no scope for error while picking a production. Thus,
before beginning we need to understand 2 functions –
• FIRST (X) – The set of terminals with which a string derived from X can start.
• FOLLOW (X) – The set of terminals that can appear immediately after X in some sentential
form; these are the columns in which we apply X's ε-production (used when X = ε). The FOLLOW
function can only be applied to variables, not to terminals.
For example, let us take the grammar –
𝑆 → 𝐴𝐵𝐶
𝐴 → 𝑎𝑏
𝐵 → 𝑐𝑎
𝐶→𝑏
Then, we can write –
FIRST(abc) = {a}
FIRST(cba) = {c}
FIRST(A) = {a} ; FOLLOW(A) = {c}
FIRST(B) = {c} ; FOLLOW(B) = {b}
FIRST(S) = {a} ; FOLLOW(S) = {$}
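FIRST and FOLLOW can be computed by a fixed-point iteration over the productions. The sketch below is one possible Python encoding (my own, not from the notes), assuming a grammar dict where lowercase symbols are terminals and "eps" would mark an ε-production; it reproduces the sets computed above.

```python
GRAMMAR = {
    "S": [("A", "B", "C")],
    "A": [("a", "b")],
    "B": [("c", "a")],
    "C": [("b",)],
}
START = "S"

def is_terminal(sym):
    return sym not in GRAMMAR

def compute_first(grammar):
    first = {v: set() for v in grammar}
    changed = True
    while changed:
        changed = False
        for head, bodies in grammar.items():
            for body in bodies:
                for sym in body:
                    add = {sym} if is_terminal(sym) else first[sym] - {"eps"}
                    if not add <= first[head]:
                        first[head] |= add
                        changed = True
                    if is_terminal(sym) or "eps" not in first[sym]:
                        break          # this symbol cannot vanish: stop here
                else:                  # every symbol was nullable
                    if "eps" not in first[head]:
                        first[head].add("eps")
                        changed = True
    return first

def compute_follow(grammar, first):
    follow = {v: set() for v in grammar}
    follow[START].add("$")             # $ follows the start symbol
    changed = True
    while changed:
        changed = False
        for head, bodies in grammar.items():
            for body in bodies:
                for i, sym in enumerate(body):
                    if is_terminal(sym):
                        continue
                    trailer, nullable = set(), True
                    for nxt in body[i + 1:]:
                        nxt_first = {nxt} if is_terminal(nxt) else first[nxt]
                        trailer |= nxt_first - {"eps"}
                        if "eps" not in nxt_first:
                            nullable = False
                            break
                    if nullable:       # nothing visible after sym: inherit FOLLOW(head)
                        trailer |= follow[head]
                    if not trailer <= follow[sym]:
                        follow[sym] |= trailer
                        changed = True
    return follow

first = compute_first(GRAMMAR)
follow = compute_follow(GRAMMAR, first)
print(first["A"], follow["A"])         # {'a'} {'c'}, matching the sets above
```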
Question
𝑆 →𝑎|𝑏|𝜖
Answer
𝐹𝐼𝑅𝑆𝑇(𝑆) = {𝑎, 𝑏, 𝜖}
𝐹𝑂𝐿𝐿𝑂𝑊(𝑆) = {$}
Question
𝐹𝐼𝑅𝑆𝑇(𝑆) = {𝑎, 𝑏}
𝐹𝐼𝑅𝑆𝑇(𝐴) = {𝜖}
𝐹𝐼𝑅𝑆𝑇(𝐵) = {𝜖}
𝐹𝑂𝐿𝐿𝑂𝑊(𝑆) = {$}
𝐹𝑂𝐿𝐿𝑂𝑊(𝐴) = {$}
𝐹𝑂𝐿𝐿𝑂𝑊(𝐵) = {$}
Question
Answer
𝐹𝐼𝑅𝑆𝑇(𝑆) = {𝑎, 𝑐, 𝑑}
𝐹𝐼𝑅𝑆𝑇(𝐴) = {𝑎, 𝑏}
𝐹𝐼𝑅𝑆𝑇(𝐵) = {𝑐, 𝑑}
𝐹𝑂𝐿𝐿𝑂𝑊(𝑆) = {$}
𝐹𝑂𝐿𝐿𝑂𝑊(𝐴) = {𝑏}
𝐹𝑂𝐿𝐿𝑂𝑊(𝐵) = {𝑎}
Question
𝐹𝐼𝑅𝑆𝑇(𝑆) = {𝑎, 𝑏, 𝑑, 𝑒}
𝐹𝐼𝑅𝑆𝑇(𝐴) = {𝑎, 𝑏}
𝐹𝐼𝑅𝑆𝑇(𝐵) = {𝑑, 𝑒}
𝐹𝑂𝐿𝐿𝑂𝑊(𝑆) = {$}
𝐹𝑂𝐿𝐿𝑂𝑊(𝐴) = {$, 𝑎}
𝐹𝑂𝐿𝐿𝑂𝑊(𝐵) = {$, 𝑎, 𝑏}
Question
Answer
𝐹𝐼𝑅𝑆𝑇(𝑆) = {𝑏, 𝑎}
𝐹𝐼𝑅𝑆𝑇(𝐴) = {𝑏, 𝜖}
𝐹𝐼𝑅𝑆𝑇(𝐵) = {𝑐}
𝐹𝑂𝐿𝐿𝑂𝑊(𝑆) = {$}
𝐹𝑂𝐿𝐿𝑂𝑊(𝐴) = {𝑎}
𝐹𝑂𝐿𝐿𝑂𝑊(𝐵) = {$}
Question
Answer
𝐹𝐼𝑅𝑆𝑇(𝑆) = {𝑎, 𝜖}
𝐹𝐼𝑅𝑆𝑇(𝐴) = {𝑎, 𝜖}
𝐹𝐼𝑅𝑆𝑇(𝐵) = {𝑎, 𝜖}
𝐹𝑂𝐿𝐿𝑂𝑊(𝑆) = {$}
𝐹𝑂𝐿𝐿𝑂𝑊(𝐴) = {𝑎, $}
𝐹𝑂𝐿𝐿𝑂𝑊(𝐵) = {𝑎, $}
NOTE
We can easily find the FIRST of an element, but FOLLOW is trickier. So here is a cheat
sheet –
LL1 PARSER
First and foremost, we need to have a CFG that is non-left recursive, unambiguous and deterministic.
Once we have that, we can proceed with the LL1 parser derivation. Let us take an example of a CFG as
follows –
𝐸 → 𝑇𝐸′
𝐸 ′ → +𝑇𝐸 ′ | 𝜖
𝑇 → 𝐹𝑇′
T′ → *FT′ | ε
𝐹 → (𝐸) | 𝑖𝑑
This CFG is non-left recursive, unambiguous and deterministic. Now, let us assume that we need to
make the parse tree for the string –
𝑊 = 𝑖𝑑 + 𝑖𝑑 ∗ 𝑖𝑑
From the grammar, consider every production of the form –
α → β  where β ≠ ε
For such a production, we take the row α and the columns given by FIRST(β), and we fill the production
into those cells. In our case, the first production is –
E → TE′
Here FIRST(TE′) = {(, id}, so we fill the above production in row E under the columns ( and id. Thus, we get –
    |     *     |     +      |    (     |    )    |   id    |    $
E   |           |            | E → TE′  |         | E → TE′ |
E′  |           |            |          |         |         |
T   |           |            |          |         |         |
T′  |           |            |          |         |         |
F   |           |            |          |         |         |
Similarly, we can do the same for the rest of the productions and fill the table as follows –
    |     *     |     +      |    (     |    )    |   id    |    $
E   |           |            | E → TE′  |         | E → TE′ |
E′  |           | E′ → +TE′  |          |         |         |
T   |           |            | T → FT′  |         | T → FT′ |
T′  | T′ → *FT′ |            |          |         |         |
F   |           |            | F → (E)  |         | F → id  |
Next, consider every production of the form –
α → ε
For such a production, we take the row α and the columns given by FOLLOW(α), and we fill the
production into those cells. Thus, we get –
    |     *     |     +      |    (     |    )    |   id    |    $
E   |           |            | E → TE′  |         | E → TE′ |
E′  |           | E′ → +TE′  |          | E′ → ε  |         | E′ → ε
T   |           |            | T → FT′  |         | T → FT′ |
T′  | T′ → *FT′ | T′ → ε     |          | T′ → ε  |         | T′ → ε
F   |           |            | F → (E)  |         | F → id  |
Now that we have filled this table, we can start creating the parse tree. First, we break the given string
into tokens as shown –
Now, we try to use Left – most derivation to get the tokens one-by-one. The first token is 𝒊𝒅. So, we
can use the productions under the 𝑖𝑑 column to draw the parse tree as follows –
Just like that, we will do the same for the rest of the tokens as well. Thus, we get –
NOTE
Since the parser scans the input tape from Left to right, uses Left-most derivation and looks at
one token at a time, the name becomes LL1.
NOTE
In the LL1 table, each cell can hold just one production. If any cell ends up with more
than 1 entry, then the grammar is not LL1.
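Once the table is built, the parse itself is a simple stack loop. Below is a sketch in Python of that loop for the table above; the dict encoding of the table is an assumption made for illustration.

```python
# The LL1 table from above, keyed by (variable, lookahead); each entry is the
# right-hand side to push (an empty list means epsilon).
TABLE = {
    ("E", "("): ["T", "E'"], ("E", "id"): ["T", "E'"],
    ("E'", "+"): ["+", "T", "E'"], ("E'", ")"): [], ("E'", "$"): [],
    ("T", "("): ["F", "T'"], ("T", "id"): ["F", "T'"],
    ("T'", "*"): ["*", "F", "T'"], ("T'", "+"): [], ("T'", ")"): [], ("T'", "$"): [],
    ("F", "("): ["(", "E", ")"], ("F", "id"): ["id"],
}

def ll1_parse(tokens):
    tokens = tokens + ["$"]
    stack = ["$", "E"]                 # start symbol on top of the stack
    i = 0
    while stack:
        top = stack.pop()
        if top == tokens[i]:           # terminal (or $): must match the input
            i += 1
        elif (top, tokens[i]) in TABLE:
            stack.extend(reversed(TABLE[(top, tokens[i])]))  # expand the variable
        else:
            return False               # empty cell: parse error
    return i == len(tokens)

print(ll1_parse(["id", "+", "id", "*", "id"]))   # True
```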
Question
Answer
We can see that this grammar is non-left recursive, unambiguous and deterministic. Therefore, the
initial conditions are satisfied. Next, we can write –
    |    a    |    b    |   $
S   | S → AB  | S → AB  |
A   | A → ε   | A → bA  | A → ε
B   | B → aB  |         | B → ε
Since every cell has at most one production, we can conclude that the grammar is LL1.
Question
The grammar is not LL1 as the cell (𝑆, 𝑎) will have multiple values.
NOTE
Question
𝑆 → 𝑎𝐴𝑏𝐵
𝐴→𝑎|𝜖
𝐵 →𝑏|𝜖
Answer
Question
Answer
Question
Question
Answer
BOTTOM – UP PARSER
This is more detailed and has a higher power when compared to Top-down parsers. Most of the high-
level languages like C, Python, Java etc. use a Bottom-up parser. As the name suggests, the parser
creates the parse tree by starting from the leaves and proceeding towards the root. Let us take an
example grammar, augmented with a new start symbol S′ –
𝑆′ → . 𝑆
𝑆 → . 𝐴𝐴
𝐴 → . 𝑎𝐴
𝐴 → .𝑏
Here, the dot (.) operator represents the position up to which we have scanned the production. Since
initially we haven't scanned anything yet, the dot is at the beginning.
Also, the group of these productions is called a closure and is represented as follows –
Now, we will start to scan the productions one symbol at a time and create a sort of tree of closures.
This can be done as follows –
Let us now decode what just happened. First, we write the original closure I0. From I0, we are parsing
based on each symbol – both terminals and non-terminals. Here is a step-by-step process of the
parsing –
Now that a step-by-step procedure has been established, we can start with the process. First, let us
parse using 𝑺. We can see that there is only one production to be considered –
𝑆′ → . 𝑆
Therefore, we write a new closure I1 with the production 𝑺′ → 𝑺. As we can see, there is no 𝜷, so we
can skip step 4. Thus, there is only 1 production in closure I1.
Now, we consider a parse using 𝑨. In this case, we again have just one production –
𝑆 → . 𝐴𝐴
So, we transform the production to S → A.A and write it in closure I2. However, we can see that
β = A, which is a non-terminal. So, we need to include all such productions with A on the LHS. Thus,
closure I2 has a total of 3 productions.
      |      ACTION      |   GOTO
      |  a  |  b  |  $   |  S  |  A
I0    |     |     |      |     |
I1    |     |     |      |     |
I2    |     |     |      |     |
I3    |     |     |      |     |
I4    |     |     |      |     |
I5    |     |     |      |     |
I6    |     |     |      |     |
Basically, we write ACTION and GOTO as the 2 sections of the table, wherein the terminals are listed
under ACTION and the non-terminals are listed under GOTO. This is because shifts and reductions
happen on terminals, while non-terminals are only entered after a reduction (via GOTO). Each row
represents one of the closures we created.
To fill this table, let us start with I0. We can see from the closure diagram that upon encountering 𝒂,
we shift from I0 to I3. This is represented as 𝑺𝟑. Similarly, upon encountering 𝒃, we shift from I0 to I4.
This is represented as 𝑺𝟒. Now, we fill the table accordingly.
      |      ACTION      |   GOTO
      |  a  |  b  |  $   |  S  |  A
I0    | S3  | S4  |      |     |
I1    |     |     |      |     |
I2    | S3  | S4  |      |     |
I3    | S3  | S4  |      |     |
I4    |     |     |      |     |
I5    |     |     |      |     |
I6    |     |     |      |     |
Just like the terminals, we now check for non-terminals as well. We can see that upon 𝑆, I0 goes to I1.
This is represented simply by 𝟏. We don’t write S1 for the non-terminals in the GOTO section. Thus,
the table now becomes –
      |      ACTION      |   GOTO
      |  a  |  b  |  $   |  S  |  A
I0    | S3  | S4  |      |  1  |  2
I1    |     |     |      |     |
I2    | S3  | S4  |      |     |  5
I3    | S3  | S4  |      |     |  6
I4    |     |     |      |     |
I5    |     |     |      |     |
I6    |     |     |      |     |
𝑹𝟏 ∶ 𝑆 → 𝐴𝐴
𝑹𝟐 ∶ 𝐴 → 𝑎𝐴
𝑹𝟑 ∶ 𝐴 → 𝑏
We can see that rules R1, R2 and R3 correspond to closures I5, I6 and I4 respectively. Hence, we will
fill those out in the table as follows –
      |      ACTION      |   GOTO
      |  a  |  b  |  $   |  S  |  A
I0    | S3  | S4  |      |  1  |  2
I1    |     |     |      |     |
I2    | S3  | S4  |      |     |  5
I3    | S3  | S4  |      |     |  6
I4    | R3  | R3  | R3   |     |
I5    | R1  | R1  | R1   |     |
I6    | R2  | R2  | R2   |     |
Finally, we can see that the start symbol (𝑺′) is present in closure I1. So, we will fill the row I1 and
column $ with the word ACCEPT. Basically, if after parsing the string we end up on ACCEPT, then the
string is accepted by the parser.
      |      ACTION         |   GOTO
      |  a  |  b  |  $      |  S  |  A
I0    | S3  | S4  |         |  1  |  2
I1    |     |     | accept  |     |
I2    | S3  | S4  |         |     |  5
I3    | S3  | S4  |         |     |  6
I4    | R3  | R3  | R3      |     |
I5    | R1  | R1  | R1      |     |
I6    | R2  | R2  | R2      |     |
Phew! The table is now complete and now we can proceed with the next step which is to use a string
for testing. Let us assume a string as 𝒂𝒃𝒂𝒃. From intuition, we know that this string is part of the CFG
and hence should be accepted by the parser. As usual, we first get the tokens of the string –
To perform the analysis, the parser actually uses a stack. Let us assume such a stack with the initial
value as 0
Here, the number 0 denotes the initial closure I0. Similarly, a number 3 will represent closure I3 and
so on. Now, we traverse the input string.
The first element is 𝒂. As per the stack, we are currently in closure I0. Thus, we can see that at closure
I0, we have encountered 𝒂. Thus, as per the table we need to shift to I3 (𝑺𝟑). Thus, the stack now
becomes –
Now, the next symbol coming is 𝒃. Thus, as per the stack we are in closure I3 and now we have
encountered a 𝒃. Looking at the table, we need to perform 𝑺𝟒. Hence, the stack becomes –
Next, we have the symbol 𝒃. As per the table, when we encounter 𝒃 in closure I4, the output is Rule 3
(𝑹𝟑). This is an interesting case. Here, we are not shifting but rather this is where we perform
reduction – which is where we go up a level in the tree. Here are the rules of the reduction –
𝐴→𝑏
Therefore, the b symbol gets reduced to the non-terminal A. At the same time, we know that len(b) = 1,
and since the stack stores a state entry for every symbol pushed, we pop 2 × 1 = 2 elements from the
stack and push A onto it. So, the tree will now become something as shown below –
Now, we are at closure I3 and have encountered symbol 𝑨. As per the table, we GOTO closure I6.
The next symbol is 𝒂 and we are at closure I6. As per the table, we need to perform reduction via Rule
2. As per R2, we have –
𝐴 → 𝑎𝐴
That means both a and A get reduced to A, and we move up a level in the graph. Also, len(aA) = 2,
so we pop 2 × 2 = 4 elements from the stack. The graph and the stack are now modified as follows –
At this point, we are at closure I0 and the symbol encountered is 𝑨. Thus, we GOTO closure I2.
The next symbol is a; upon encountering a at closure I2, we shift to closure I3.
The next symbol to appear is 𝒃 which upon closure I3 will shift to closure I4.
At this point, all the symbols of the string have been pushed onto the stack at some point or the other.
So all that is left now is reduction. Now, the last symbol to be encountered is $. As per the table, the
closure I4 on $ will give reduction via Rule 3. Thus, the tree and stack become –
Now, in closure 6 if we encounter $, then we need to reduce via Rule 2. Thus, we get –
In closure 2, if we encounter 𝑨, we GOTO closure I5.
Now, if we encounter a $ in closure I5, we need to reduce via Rule 1. Hence, we get –
On closure I0, when we encounter 𝑺, we need to GOTO closure I1. Now, if we encounter $ at closure
I1, we can see that we get ACCEPT.
OOOFF!!! We are done finally. We can see that we started with the entire string and built the tree in a
bottom – up fashion. At the same time, since we reached ACCEPT, it means that the string is a valid
string for the given CFG.
We just performed the process of developing a BUP, so we should define a few terms as well.
Now, we can see that we performed basically 2 main operations in the entire process – Shifting to a
new closure and Reducing based on a rule. Therefore, BUPs are also called Shift-Reduce Parsers.
The parser we have just created scans the string from left to right and, at the same time, performs
a reverse rightmost derivation. It also doesn't use any lookahead symbols. Therefore, this parser
is called an LR(0) parser.
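The whole shift-reduce procedure above can be driven directly by the two tables. Below is a minimal Python sketch of such a driver, encoding the ACTION/GOTO table we just built as dicts (the encoding is my own). One design difference: the walkthrough pushes a symbol and a state for every shift, so it pops two stack entries per RHS symbol; this sketch pushes only states, so a reduction pops exactly len(RHS) entries.

```python
# ('s', n) = shift to state n, ('r', k) = reduce by rule k, 'acc' = accept.
ACTION = {
    (0, "a"): ("s", 3), (0, "b"): ("s", 4),
    (1, "$"): "acc",
    (2, "a"): ("s", 3), (2, "b"): ("s", 4),
    (3, "a"): ("s", 3), (3, "b"): ("s", 4),
    (4, "a"): ("r", 3), (4, "b"): ("r", 3), (4, "$"): ("r", 3),
    (5, "a"): ("r", 1), (5, "b"): ("r", 1), (5, "$"): ("r", 1),
    (6, "a"): ("r", 2), (6, "b"): ("r", 2), (6, "$"): ("r", 2),
}
GOTO = {(0, "S"): 1, (0, "A"): 2, (2, "A"): 5, (3, "A"): 6}
RULES = {1: ("S", 2), 2: ("A", 2), 3: ("A", 1)}   # rule -> (LHS, len(RHS))

def lr_parse(tokens):
    tokens = tokens + ["$"]
    stack, i = [0], 0                      # stack of states, input cursor
    while True:
        act = ACTION.get((stack[-1], tokens[i]))
        if act == "acc":
            return True
        if act is None:
            return False                   # empty cell: parse error
        kind, n = act
        if kind == "s":                    # shift: consume the token, push state n
            stack.append(n)
            i += 1
        else:                              # reduce: pop |RHS| states, then GOTO
            lhs, length = RULES[n]
            del stack[len(stack) - length:]
            stack.append(GOTO[(stack[-1], lhs)])

print(lr_parse(list("abab")))              # True, as in the walkthrough
```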
Since we are doing Reverse Rightmost Derivation, we can use this parser on CFG with Left-recursion as
well. Additionally, suppose we have a production as follows –
𝑆 → 𝑎𝐴 | 𝑎𝐵
The CFG with the above production is non-deterministic, since when we encounter a, we don't know
whether the next symbol will be A or B. However, that is a problem only when using Leftmost Derivation;
with (reverse) Rightmost Derivation there is no such problem. In conclusion, LR(0) can
work on left-recursive and non-deterministic CFGs. Therefore, a BUP has higher expressive power
when compared to a TDP.
Question
Answer
𝑆 → .𝐸
𝐸 → .𝑇 + 𝐸
𝐸 → .𝑇
𝑇 → . 𝑖𝑑
At this point itself, we can see that if we apply 𝑻 to I0, we can either get a shift to I2 or a reduction.
This will result in the parse table having multiple values in a cell and thus, the CFG is NOT LR(0).
Question
Answer
In this case, I1 has a reduction and a shift. However, the reduction S → E. is just the accept
condition, so it causes no clash. On the other hand, in I2 there is a reduction and a shift.
Therefore, there is an SR conflict and hence the grammar is not LR(0).
Question
Answer
SLR(1) PARSER
𝐸 →𝑇+𝐸|𝑇
𝑇 → 𝑖𝑑
Now, for this grammar we can proceed to form the canonical closure –
      |        ACTION           |   GOTO
      |   +    |  id  |   $     |  E  |  T
I0    |        |  S3  |         |  1  |  2
I1    |        |      | ACCEPT  |     |
I2    | S4/R2  |  R2  |  R2     |     |
I3    |  R3    |  R3  |  R3     |     |
I4    |        |  S3  |         |  5  |  2
I5    |  R1    |  R1  |  R1     |     |
We can see that the cell in row I2 and column + has 2 values. This is an SR conflict. Thus, the given
grammar is not in LR(0) form. However, we are now going to learn about a new parser – the SLR(1) parser.
The process for SLR(1) is the same as LR(0); however, as the name suggests, it uses a single lookahead
symbol. This means that "the reduction rules are written only in the columns of the FOLLOW of the LHS".
𝐹𝑂𝐿𝐿𝑂𝑊(𝐸) = {$}
𝐹𝑂𝐿𝐿𝑂𝑊(𝑇) = {$, +}
We can see that Rules R1 and R2 have E as the LHS while Rule R3 has T as the LHS. Thus, R1 and R2
are written only under FOLLOW(E) = {$}, while R3 is written only under FOLLOW(T) = {$, +} –
      |      ACTION          |   GOTO
      |  +   |  id  |   $    |  E  |  T
I0    |      |  S3  |        |  1  |  2
I1    |      |      | ACCEPT |     |
I2    |  S4  |      |  R2    |     |
I3    |  R3  |      |  R3    |     |
I4    |      |  S3  |        |  5  |  2
I5    |      |      |  R1    |     |
Now, there are no conflicts and thus, this CFG is accepted under SLR(1) parser. Thus, we can say that
the expressive power of SLR(1) is greater than the expressive power of LR(0).
Question
Answer
𝑆′ → . 𝑆
𝑆 → . 𝐴𝑎
𝑆 → . 𝑏𝐴𝑐
𝑆 → . 𝑑𝑐
𝑆 → . 𝑏𝑑𝑎
𝐴 → .𝑑
We can draw the closures as follows –
      |              ACTION                  |   GOTO
      |   a    |  b  |   c    |  d  |   $    |  S  |  A
I0    |        | S3  |        | S4  |        |  1  |  2
I1    |        |     |        |     | ACCEPT |     |
I2    |  S7    |     |        |     |        |     |
I3    |        |     |        | S6  |        |     |  8
I4    |  R5    | R5  | S5/R5  | R5  |  R5    |     |
I5    |  R3    | R3  |  R3    | R3  |  R3    |     |
I6    | S10/R5 | R5  |  R5    | R5  |  R5    |     |
I7    |  R1    | R1  |  R1    | R1  |  R1    |     |
I8    |        |     |  S9    |     |        |     |
I9    |  R2    | R2  |  R2    | R2  |  R2    |     |
I10   |  R4    | R4  |  R4    | R4  |  R4    |     |
We can see that this grammar is not in LR(0). Now, we need to check for SLR(1). To do so, first we need
to get the Follow functions as follows –
𝐹𝑂𝐿𝐿𝑂𝑊(𝑆) = {$}
𝐹𝑂𝐿𝐿𝑂𝑊(𝐴) = {𝑎, 𝑐}
Hence, the SLR(1) parse table becomes –
      |              ACTION                  |   GOTO
      |   a    |  b  |   c    |  d  |   $    |  S  |  A
I0    |        | S3  |        | S4  |        |  1  |  2
I1    |        |     |        |     | ACCEPT |     |
I2    |  S7    |     |        |     |        |     |
I3    |        |     |        | S6  |        |     |  8
I4    |  R5    |     | S5/R5  |     |        |     |
I5    |        |     |        |     |  R3    |     |
I6    | S10/R5 |     |  R5    |     |        |     |
I7    |        |     |        |     |  R1    |     |
I8    |        |     |  S9    |     |        |     |
I9    |        |     |        |     |  R2    |     |
I10   |        |     |        |     |  R4    |     |
We can see that there are still SR conflicts. Hence, this grammar is NOT in SLR(1) as well.
NOTE
In general, the Venn diagram of the different types of parsers is given as follows –
We can see that every LL(1) grammar is also LR(1), but not the other way around. This again shows
that BUPs have higher power when compared to TDPs.
Question
Answer
Question
Question
Answer
CLR(1) PARSER
This is the most powerful of the parsers discussed here. In short, if a grammar is not CLR(1), then it
can't be handled by any of these parsers. To understand the CLR(1) parser, let us take an example of a
CFG as follows –
𝑆′ → . 𝑆
𝑆 → . 𝐶𝐶
𝐶 → . 𝑐𝐶
𝐶 → .𝑑
If we were building SLR(1), we would have found the FOLLOW sets for S and C. However, we can see that
C occurs a total of 3 times on the RHS of the productions, and each of these occurrences has a different
follow set. Let us label these occurrences as follows –
𝑆′ → . 𝑆
𝑆 → . 𝐶1 𝐶2
𝐶1 → . 𝑐𝐶3
𝐶1 → . 𝑑
We can see that even though C1 = C2 = C3 = C, their follow sets are different –
FOLLOW(C1) = {c, d}
FOLLOW(C2) = {$}
FOLLOW(C3) = {c, d}
Here comes the interesting part. If we were constructing SLR(1), we would write the Reduction rules
in the columns c, d, $, since SLR(1) computes FOLLOW(C) = {c, d, $}. On the other hand,
in CLR(1), each occurrence carries its own follow set; hence, when we fill the table,
CLR(1) has fewer SR/RR conflicts compared to SLR(1). This is the reason CLR(1) is the most
accommodating, has the highest expressive power and also has the fewest entries in the parse table.
To create CLR(1), we need to include each production's follow set (lookahead) in the closure. For
example, if a closure has the production α → β, then we write –
α → β , FOLLOW(α)
Thus, for the example grammar, the initial closure will be as follows –
One thing to note here is that C1, C2 and C3 are basically the same as C. Since this is the first example,
I am differentiating them so that it is easy to see which occurrence of C we are talking about.
After this, we continue drawing the canonical closures as we have –
      |      ACTION          |   GOTO
      |  c  |  d  |   $      |  S  |  C
I0    | S3  | S4  |          |  1  |  2
I1    |     |     |  ACCEPT  |     |
I2    | S6  | S7  |          |     |  5
I3    | S3  | S4  |          |     |  8
I4    | R3  | R3  |          |     |
I5    |     |     |   R1     |     |
I6    | S6  | S7  |          |     |  9
I7    |     |     |   R3     |     |
I8    | R2  | R2  |          |     |
I9    |     |     |   R2     |     |
As we can see, there are no SR/RR conflicts; thus the CFG is accepted by the CLR(1) parser.
LALR(1) PARSER
We can see that there are a couple of closures that have the same core productions but different
follow sets. These closures are –
I3 – I6
I4 – I7
I8 – I9
Thus, we can merge each of these pairs into a single closure. That gives us the following parse table –
      |      ACTION            |   GOTO
      |  c   |  d   |   $      |  S  |  C
I0    | S36  | S47  |          |  1  |  2
I1    |      |      |  ACCEPT  |     |
I2    | S36  | S47  |          |     |  5
I36   | S36  | S47  |          |     |  89
I47   | R3   | R3   |   R3     |     |
I5    |      |      |   R1     |     |
I89   | R2   | R2   |   R2     |     |
As we can see, we are able to perform the combination without any conflicts. Thus, we can say that
this grammar is also LALR(1).
NOTE
OPERATOR PRECEDENCE PARSER
These are parsers that are used to work on mathematical expressions. An operator precedence
grammar (OPG) is a less complex kind of grammar that can be built from both ambiguous and
unambiguous grammars. Not every CFG is an OPG. OPGs are CFGs which have the following properties –
For example –
SEMANTIC ANALYSIS
This is the next step in the compiler process. To perform a Semantic Analysis, we need to first define
something called Syntax Directed Translation (SDT). We can say that –
Question
Answer
First off, let us notice that against each of the productions, we have some expressions inside curly
braces. These are the semantic statements. These statements are usually of 2 types –
In the question, we don’t have any semantic rules, so there is no condition check here. Now, let us
learn how to interpret the semantic action. For example, let us take the production –
As per the semantic action, we can say that the production will multiply S and T and store the result
back in S. One small thing to note is that, as per the question, sub-scripted non-terminals denote the
same non-terminal, so basically S = S1. In short, # represents multiplication:
x # y → x = x ∗ y
Similarly, we can see that the % (mod) sign represents division. Thus, the expression becomes –
20 ∗ 10 ÷ 5 ∗ 8 ÷ 2 ÷ 2
Here comes the important part. Now that we have the operator definitions, we need the OPERATOR
PRECEDENCE AND ASSOCIATIVITY.
To get the precedence, we use common sense. When we make a tree, we go from root node to the
leaf nodes. However, when we calculate values, we go from leaf nodes to the root nodes. Hence, the
operator that comes first when we go from bottom to top has higher precedence. In our case,
𝑆 → 𝑆1 # 𝑇
𝑇 → 𝑇1 % 𝑅
Since these both productions are left recursive, we can say that both operators have a left-to-right
associativity.
Now, we can solve the expression –
20 ∗ (10 ÷ 5) ∗ ((8 ÷ 2) ÷ 2)
Question
Answer
We can also say that since & comes first when going from bottom-to-top, & has higher priority when
compared to #.
2 ∗ (3 + 5) ∗ (6 + 4)
The final answer will be 160.
Question
Answer
In this case, the best thing will be to make the parse tree as follows –
Now, we go from bottom to top to get the value of the expression. Shown below is the path we take
for valuation –
Paths 1, 4, 6, 7 and 8 will print as per the semantic actions mentioned in the grammar. Thus,
we get –
Question
Answer
Here,
𝐸 = (4 − (2 − 4)) ∗ 2 = 𝟏𝟐
Question
Answer
If we notice, the semantic actions mentioned here are not for evaluation but rather for making a node.
So, this grammar is used to create a tree. To start, first let us draw a rough tree and parse sequence as
follows –
Let us now go step-by-step.
Step 1 – We go from 𝑛𝑢𝑚(2) to 𝐹. As per the grammar, the action here is as follows –
This means that we need to make a node with left and right pointers as NULL, the value as 𝒊𝒅(𝟐) and
its parent pointer will be 𝑭.
In steps 2 and 3, the parent pointer first changes to 𝑇 and then to 𝐸. In step 4, just like Step 1 another
node is created –
Then in Step 5 the parent pointer for this node is changed to 𝑇 and then in Step 6, another node for
value 4 is created. So now, we have 3 nodes in total –
In step 7, we are going to merge using the production T → T ∗ F. The semantic action in this case will
be –
Hence, we need to create a node with the left pointer as the 𝑇 pointer, the right pointer as the 𝐹
pointer and the value as ∗. The parent pointer will be the 𝑇 pointer.
Similarly, we do the final step and get the final tree as follows –
CLASSIFICATION OF ATTRIBUTES
Based on the process of evaluation, attributes are classified into two types –
• Synthesized – The attributes whose values are calculated based on their children values.
• Inherited – The attributes whose values are calculated based on their parent or left sibling.
𝐴 → 𝑋𝑌𝑍
If the value of A is calculated by using the values of X, Y or Z, then A is a synthesized attribute. If the
value of Y is calculated by using the values of A or X, then it will be termed an inherited attribute.
We can convert this to postfix using a stack as we had seen in DS. Using syntax tree, we can represent
this expression as follows –
As we can see, the syntax tree has a lot of repetition. In short, we don’t need to create new nodes for
each of the 𝑎 + 𝑎 operations. Hence, we reduce this to get the Directed Acyclic Graph (DAG).
𝐹 = −(𝑎 + 𝑏) ∗ (𝑐 + 𝑑) + (𝑎 + 𝑏 + 𝑐)
Here, we can also express it using the Three-Address Code (TAC). In a three-address code, each
statement can have at most 3 addresses (operands). Hence, for this case the TAC will be –
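Since the original listing is not reproduced here, the following is one possible rendering as triples, where (n) denotes the result of statement n (matching the remark below that statement numbers are used as operands):

(1)  +       a    b
(2)  uminus  (1)
(3)  +       c    d
(4)  *       (2)  (3)
(5)  +       (1)  c
(6)  +       (4)  (5)
(7)  =       F    (6)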
Here, the statement numbers are used as operands. Even though this uses less space, there is
no flexibility to change the statement order.
Static Single Assignment (SSA)
This is a property of the intermediate code where each variable is assigned only once, so existing
variables are split into versions. For example, let us say we have the following code –
𝑎 =𝑏+𝑐
𝑎 =2∗𝑎
For SSA form, we want the assignment to a variable be done just once. Therefore, we split 𝑎 into two
versions as follows –
𝑎 =𝑏+𝑐
𝑎1 = 2 ∗ 𝑎
Now this is in SSA.
Constant Folding
In an expression, there is a chance that the constants in the expression can be simplified to reduce run
time. This is called constant folding. Suppose we have the expression –
𝑎 =𝑏+𝑐+2+3∗4
Using constant folding, we can simplify the expression as follows –
𝑎 = 𝑏 + 𝑐 + 14
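As a sketch, constant folding can be done by a bottom-up pass over the expression tree; the tuple-based tree representation below is my own assumption made for illustration.

```python
import operator

OPS = {"+": operator.add, "*": operator.mul}

def fold(node):
    """Fold constants in a tree of the form (op, left, right); leaves are
    numbers or variable names."""
    if not isinstance(node, tuple):
        return node                       # leaf: a constant or a variable
    op, l, r = node
    l, r = fold(l), fold(r)
    if isinstance(l, (int, float)) and isinstance(r, (int, float)):
        return OPS[op](l, r)              # both sides constant: evaluate now
    return (op, l, r)

# a = b + c + 2 + 3*4, parsed here as b + (c + (2 + 3*4))
expr = ("+", "b", ("+", "c", ("+", 2, ("*", 3, 4))))
print(fold(expr))                         # ('+', 'b', ('+', 'c', 14))
```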
Constant Propagation
𝑝𝑖 = 3.1415
𝑐 = 2 ∗ 𝑝𝑖 ∗ 𝑟
We know that the value of PI will not change as it is a constant. So, instead of declaring it as a variable,
we can directly plug it in the expression. This is called constant propagation. Combining with constant
folding, we can improve the expression as follows –
𝑐 = 6.283 ∗ 𝑟
Strength Reduction
In this case, we can implement the same expression in more than 1 way. Thus, it would make sense to
have the compiler use the expression that has the lowest cost. This is called strength reducing. For
example, let us take the 2 expressions –
𝑦 =2∗𝑥
𝑦=𝑥+𝑥
In both cases, 𝑦 will have the same value. However, we know that since multiplication is repetitive
addition, it will be more costly. Thus, the compiler would just compile the second expression to save
time and resources.
Redundant Code Elimination
As we have seen, there is more than one way to execute an expression, which can lead to redundant
code lines. It is better to remove those expressions. This is called Redundant Code Elimination. For
example, let us look at the code below –
𝑥 = 𝑎+𝑏
𝑦 =𝑏+𝑎
Instead of performing addition twice, we can simply write –
𝑥 = 𝑎+𝑏
𝑦=𝑥
Algebraic Simplification
Use the basic laws of math to simplify the expressions. For example,
𝑥 =𝑎∗1
𝑦 = 𝑏+0
We can save out on both the operations and simply write –
𝑥=𝑎
𝑦=𝑏
Loop Optimization
As the name suggests, we need to optimize the loops in the program. To do so, we first need to perform
the following steps –
In this case, Lines 1,3,4,9 will be the leaders. Hence, we can make the blocks as follows –
Now, we can draw the CFG as follows –
Loop Jamming
This is the process where we combine 2 loops into one loop if they share the same index range and
number of iterations. For example, let us take the case –
We can see that the second loop and the outer first loop have the same indexing. Thus, we can perform
loop jamming here as shown below –
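Since the original code figure is not reproduced here, the following Python sketch illustrates the idea on two hypothetical loops over the same index range:

```python
n = 4
a = [0] * n
b = [0] * n

# Before jamming: two separate loops over the same index range.
for i in range(n):
    a[i] = i * i
for i in range(n):
    b[i] = a[i] + 1

# After jamming: one loop with both bodies, halving the loop overhead.
for i in range(n):
    a[i] = i * i
    b[i] = a[i] + 1
```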
Loop Unrolling
This is the case where we reduce the number of iterations of a loop by doing more work per iteration.
For example, let us consider the code as follows –
In this case, we need to perform 100 iterations. Instead, we can perform Loop unrolling and get the
following code –
In this case, we will be performing just 50 iterations and getting the same output.
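Again, the original code figure is not reproduced here; a Python sketch of unrolling a hypothetical 100-iteration loop by a factor of 2 looks like this:

```python
data = list(range(100))

# Original: 100 iterations, one update each.
total = 0
for i in range(100):
    total += data[i]

# Unrolled by 2: 50 iterations, two updates each, same result.
total = 0
for i in range(0, 100, 2):
    total += data[i]
    total += data[i + 1]
```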
Code Movement
Basically, move all the code that is not dependent on the loop outside the loop.
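A small Python sketch of the idea, with a hypothetical loop-invariant computation hoisted out of the loop:

```python
n, y, z = 5, 3, 7
out = []

# Before: y * z is recomputed every iteration although it never changes.
for i in range(n):
    scale = y * z
    out.append(scale * i)

# After code movement: the loop-invariant computation is done once.
scale = y * z
out = []
for i in range(n):
    out.append(scale * i)
```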
Live Variable Analysis
The basic idea is that a variable can finish its role before the end of the program and can then
relinquish control of its memory location. When a variable's value will still be used, it is said to be
live. When a variable has finished its role in the program, it is termed dead. For
example,
1. 𝑝 =𝑞+𝑟
2. 𝑠 =𝑝+𝑞
3. 𝑢 =𝑠∗𝑣
4. 𝑣 =𝑟+𝑢
5. 𝑞 =𝑣+𝑟
For these 5 statements, we can use a table to get the live and dead variables as follows –
    |  p   |  q   |  r   |  u   |  v   |  s
 1. | Dead | Live | Live | Dead | Live | Dead
 2. | Live | Live | Live | Dead | Live | Dead
 3. | Dead | Dead | Live | Dead | Live | Live
 4. | Dead | Dead | Live | Live | Live | Dead
 5. | Dead | Dead | Live | Dead | Live | Dead
QUESTION BANK
[Questions 1–78 appeared as images in the original notes and are not reproduced here; their answers are listed in the key below.]
ANSWER KEY
1 C 17 B 33 C 49 B 65 D
2 B 18 A 34 A 50 A 66 C
3 C 19 C 35 A 51 D 67 B
4 A 20 A 36 5 52 C 68 B
5 A 21 C 37 D 53 C 69 B
6 C 22 A 38 D 54 D 70 B
7 A 23 C 39 C 55 D 71 A
8 C 24 C 40 8 56 C 72 6
9 25 C 41 D 57 A 73 C
10 C 26 D 42 B 58 A 74 B
11 D 27 31 43 A 59 A 75 10
12 B 28 C 44 9 60 B 76 8
13 D 29 A 45 B 61 B 77 B
14 C 30 D 46 B 62 80 78 C
15 C 31 A 47 5 63 C
16 C 32 C 48 A 64 B