Compiler Design - Module 1-Notes
Module No. – 1
Contents :-
Definition of compiler, interpreter and its differences, the phases of a compiler, role of lexical analyzer,
regular expressions, finite automata, from regular expressions to finite automata, pass and phases of
translation, bootstrapping, LEX-lexical analyzer generator. PARSING: Parsing, role of parser, context free
grammar, derivations, parse trees, ambiguity, elimination of left recursion, left factoring, eliminating
ambiguity from dangling-else grammar, classes of parsing, top down parsing - backtracking, recursive
descent parsing, predictive parsers, LL(1) grammars.
Introduction to Compiler
o A compiler is a translator that converts the high-level language into the machine language.
o High-level language is written by a developer and machine language can be understood by the
processor.
o The compiler is also used to show errors to the programmer.
o The main purpose of a compiler is to translate the code written in one language into another language without changing the meaning of the program.
o When you execute a program written in a HLL programming language, it is executed in two parts.
o In the first part, the source program is compiled and translated into the object program (low-level language).
o In the second part, the object program is translated into the target program through the assembler.
The compilation process contains the sequence of various phases. Each phase takes source program in one
representation and produces output in another representation. Each phase takes input from its previous stage.
We basically have two phases of compilers, namely the Analysis phase and Synthesis phase. The analysis
phase creates an intermediate representation from the given source code. The synthesis phase creates an
equivalent target program from the intermediate representation.
A compiler is a software program that converts the high-level source code written in a programming
language into low-level machine code that can be executed by the computer hardware. The process of
converting the source code into machine code involves several phases or stages, which are collectively
known as the phases of a compiler. The typical phases of a compiler are:
1. Lexical Analysis: The first phase of a compiler is lexical analysis, also known as scanning.
This phase reads the source code and breaks it into a stream of tokens, which are the basic units
of the programming language. The tokens are then passed on to the next phase for further
processing.
2. Syntax Analysis: The second phase of a compiler is syntax analysis, also known as parsing.
This phase takes the stream of tokens generated by the lexical analysis phase and checks
whether they conform to the grammar of the programming language. The output of this phase
is usually an Abstract Syntax Tree (AST).
3. Semantic Analysis: The third phase of a compiler is semantic analysis. This phase checks
whether the code is semantically correct, i.e., whether it conforms to the language’s type
system and other semantic rules.
4. Intermediate Code Generation: The fourth phase of a compiler is intermediate code
generation. This phase generates an intermediate representation of the source code that can be
easily translated into machine code.
5. Optimization: The fifth phase of a compiler is optimization. This phase applies various
optimization techniques to the intermediate code to improve the performance of the generated
machine code.
6. Code Generation: The final phase of a compiler is code generation. This phase takes the
optimized intermediate code and generates the actual machine code that can be executed by the
target hardware.
Symbol Table – It is a data structure being used and maintained by the compiler, consisting of all the
identifier’s names along with their types. It helps the compiler to function smoothly by finding the
identifiers quickly.
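To make the idea concrete, here is a minimal C sketch of such a structure (illustrative only; the fixed sizes, the hash function, and the insert/lookup names are assumptions for this sketch, not any standard interface):

#include <stdio.h>
#include <string.h>

#define TABLE_SIZE 211                 /* assumed capacity for this sketch */

struct symbol {
    char name[32];                     /* identifier's name */
    char type[16];                     /* identifier's type, e.g. "int" */
    int  in_use;
};

static struct symbol table[TABLE_SIZE];

/* simple hash on the identifier's characters */
static unsigned hash(const char *s) {
    unsigned h = 0;
    while (*s) h = h * 31 + (unsigned char)*s++;
    return h % TABLE_SIZE;
}

/* insert a name with its type, using linear probing */
void insert(const char *name, const char *type) {
    unsigned i = hash(name);
    while (table[i].in_use && strcmp(table[i].name, name) != 0)
        i = (i + 1) % TABLE_SIZE;
    strcpy(table[i].name, name);
    strcpy(table[i].type, type);
    table[i].in_use = 1;
}

/* lookup returns the type, or NULL if the identifier is unknown */
const char *lookup(const char *name) {
    unsigned i = hash(name);
    while (table[i].in_use) {
        if (strcmp(table[i].name, name) == 0) return table[i].type;
        i = (i + 1) % TABLE_SIZE;
    }
    return NULL;
}

int main(void) {
    insert("count", "int");
    insert("rate", "float");
    printf("count : %s\n", lookup("count"));   /* prints: count : int */
    return 0;
}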
The analysis of a source program is divided into mainly three phases. They are:
1. Linear Analysis-
This involves a scanning phase where the stream of characters is read from left to right. It is
then grouped into various tokens having a collective meaning.
2. Hierarchical Analysis-
In this analysis phase, based on a collective meaning, the tokens are categorized hierarchically
into nested groups.
3. Semantic Analysis-
This phase is used to check whether the components of the source program are meaningful or
not.
The compiler has two modules, namely the front end and the back end. The front end constitutes the lexical analyzer, syntax analyzer, semantic analyzer, and intermediate code generator, and the rest are assembled to form the back end.
Lexical Analyzer –
It is also called a scanner. It takes the output of the preprocessor (which performs file
inclusion and macro expansion) as the input which is in a pure high-level language. It
reads the characters from the source program and groups them into lexemes (sequence
of characters that “go together”). Each lexeme corresponds to a token. Tokens are
defined by regular expressions which are understood by the lexical analyzer. It also
removes lexical errors (e.g., erroneous characters), comments, and white space.
Syntax Analyzer – It is sometimes called a parser. It constructs the parse tree. It takes all the tokens one
by one and uses Context-Free Grammar to construct the parse tree.
Why Grammar?
The rules of programming can be entirely represented in a few productions. Using these productions we
can represent what the program actually is. The input has to be checked whether it is in the desired format
or not.
The parse tree is also called the derivation tree. Parse trees are generally constructed to check for
ambiguity in the given grammar. There are certain rules associated with the derivation tree.
Any identifier is an expression
Any number can be called an expression
Performing any operations in the given expression will always result in an
expression. For example, the sum of two expressions is also an expression.
The parse tree can be compressed to form a syntax tree
Syntax error can be detected at this level if the input is not in accordance with the grammar.
Semantic Analyzer – It verifies the parse tree, whether it’s meaningful or not. It
furthermore produces a verified parse tree. It also does type checking, Label checking,
and Flow control checking.
Intermediate Code Generator – It generates intermediate code, which is a form that can be readily executed by a machine. We have many popular intermediate codes; example – three-address code. Intermediate code is converted to machine language using the last two phases, which are platform dependent.
Up to the intermediate code, the process is the same for every compiler; after that, it depends on the platform. To build a new compiler we don't need to build it from scratch: we can take the intermediate code from an already existing compiler and build the last two parts.
Code Optimizer – It transforms the code so that it consumes fewer resources and
produces more speed. The meaning of the code being transformed is not altered.
Optimization can be categorized into two types: machine-dependent and machine-
independent.
Target Code Generator – The main purpose of the target code generator is to write code that the machine can understand; it also performs register allocation, instruction selection, etc. The output depends on the type of assembler. This is the final stage of compilation. The optimized code is converted into relocatable machine code, which then forms the input to the linker and loader.
All these six phases are associated with the symbol table manager and error handler as shown in the above
block diagram.
Define Tokens , Lexeme , Pattern
Tokens :-
It is basically a sequence of characters that are treated as a unit as it cannot be further broken down. In
programming languages like C language- keywords (int, char, float, const, goto, continue, etc.) identifiers
(user-defined names), operators (+, -, *, /), delimiters/punctuators like comma (,), semicolon(;), braces
({ }), etc. , strings can be considered as tokens. This phase recognizes three types of tokens: Terminal
Symbols (TRM)- Keywords and Operators, Literals (LIT), and Identifiers (IDN).
Example 1:
Consider the statement: int a = 10;
Tokens:
int (keyword), a (identifier), = (operator), 10 (constant) and ; (punctuation-semicolon)
Answer – Total number of tokens = 5
Lexeme :-
It is a sequence of characters in the source code that is matched against the predefined language rules (patterns) for it to be recognized as a valid token.
Example:
main is lexeme of type identifier(token)
(,),{,} are lexemes of type punctuation(token)
Pattern :-
It specifies a set of rules that a scanner follows to create a token.
Example of Programming Language (C, C++):
For a keyword to be identified as a valid token, the pattern is the sequence of characters that make the
keyword.
For an identifier to be identified as a valid token, the pattern is the predefined rule that it must start with an alphabet, followed by alphabets or digits.
For example, for the token type Keyword, the interpretation of the type is all the reserved keywords of that language, the pattern is the sequence of characters that make the keyword, and sample lexemes are int and goto; identifiers such as main, printf, etc. follow the identifier pattern instead.
Compiler Passes
A pass is a complete traversal of the source program. There are two kinds of compilers with respect to how they traverse the source program: multi-pass and one-pass.
Multi-pass Compiler
o Multi pass compiler is used to process the source code of a program several times.
o In the first pass, compiler can read the source program, scan it, extract the tokens and store the result
in an output file.
o In the second pass, compiler can read the output file produced by first pass, build the syntactic tree
and perform the syntactical analysis. The output of this phase is a file that contains the syntactical
tree.
o In the third pass, the compiler reads the output file produced by the second pass and checks whether the tree follows the rules of the language or not. The output of the semantic analysis phase is the annotated syntax tree.
o Such passes continue until the target output is produced.
One-pass Compiler
o One-pass compiler is used to traverse the program only once. The one-pass compiler passes only
once through the parts of each compilation unit. It translates each part into its final machine code.
o In the one-pass compiler, when a line of source is processed, it is scanned and the tokens are extracted.
o Then the syntax of the line is analyzed and the tree structure is built. After the semantic part, the code is generated.
o The same process is repeated for each line of code until the entire program is compiled.
Bootstrapping
A compiler can be characterized by three languages:
1. Source Language
2. Target Language
3. Implementation Language
1. Create a compiler SCAA for a subset S of the desired language L, using language "A"; this compiler runs on machine A.
2. Create a compiler LCSA for language L, written in the subset S of L.
3. Compile LCSA using the compiler SCAA to obtain LCAA. LCAA is a compiler for language L, which runs on machine A and produces code for machine A.
The process described by the T-diagrams is called bootstrapping.
OR
Bootstrapping is a process in which a simple language is used to translate a more complicated program, which in turn may handle an even more complicated program, and so on. Writing a compiler for any high-level language is a complicated process, and it takes a lot of time to write one from scratch. Hence, a simple language is used to generate the target code in stages. To clearly understand the bootstrapping technique, consider the following scenario. Suppose we want to write a cross compiler for a new language X. The implementation language of this compiler is, say, Y and the target code being generated is in language Z. That is, we create XYZ. Now if the existing compiler Y runs on machine M and generates code for M, then it is denoted as YMM. Now if we run XYZ using YMM, then we get a compiler XMZ, that is, a compiler for source language X that generates target code in language Z and runs on machine M. The diagram for the above scenario is a T-diagram. Example: we can create compilers of many different forms. Now we will generate one.
We want a compiler which takes C language and generates assembly language as output, with the availability of a machine that runs assembly language.
Step-1: First, a compiler for a small subset of C, say C0, is written in assembly language (compiler 1).
Step-2: Then, using the small subset C0 as the implementation language, a compiler for the full source language C is written (compiler 2).
Step-3: Finally, we compile the second compiler: using compiler 1, compiler 2 is compiled.
Step-4: Thus we get a compiler written in ASM which compiles C and generates code in ASM.
Bootstrapping is the process of writing a compiler for a programming language using the language
itself. In other words, it is the process of using a compiler written in a particular programming
language to compile a new version of the compiler written in the same language.
The process of bootstrapping typically involves several stages. In the first stage, a minimal
version of the compiler is written in a different language, such as assembly language or C.
This minimal version of the compiler is then used to compile a slightly more complex version
of the compiler written in the target language. This process is repeated until a fully functional
version of the compiler is written in the target language.
There are several advantages to bootstrapping. One advantage is that it ensures that the
compiler is compatible with the language it is designed to compile. This is because the
compiler is written in the same language, so it is better able to understand and interpret the
syntax and semantics of the language.
Another advantage is that it allows for greater control over the optimization and code
generation process. Since the compiler is written in the target language, it can be optimized to
generate code that is more efficient and better suited to the target platform.
However, bootstrapping also has some disadvantages. One disadvantage is that it can be a
time-consuming process, especially for complex languages or compilers. It can also be more
difficult to debug a bootstrapped compiler, since any errors or bugs in the compiler will affect
the subsequent versions of the compiler.
Overall, bootstrapping is an important technique in compiler design that allows for greater control over the
optimization and code generation process, while ensuring compatibility between the compiler and the target
language.
Advantages:
1. Bootstrapping ensures that the compiler is compatible with the language it is designed to compile, as
it is written in the same language.
2. It allows for greater control over the optimization and code generation process.
3. It provides a high level of confidence in the correctness of the compiler because it is self-hosted.
Disadvantages:
1. Bootstrapping can be a time-consuming process, especially for complex languages or compilers.
2. Debugging a bootstrapped compiler can be challenging, since any errors or bugs in the compiler will affect the subsequent versions of the compiler.
3. Bootstrapping requires that a minimal version of the compiler be written in a different language, which can introduce compatibility issues between the two languages.
Overall, bootstrapping is a useful technique in compiler design, but it requires careful planning and execution to ensure that the benefits outweigh the drawbacks.
Lexical Analysis is the first phase, in which the compiler scans the source code. The scan proceeds from left to right, character by character, and groups these characters into tokens.
Here, the character stream from the source program is grouped into meaningful sequences by identifying the tokens. The lexical analyzer makes an entry for each token in the symbol table and passes the token on to the next phase.
The primary functions of this phase are grouping characters into lexemes, producing the corresponding tokens, removing comments and white space, and reporting lexical errors.
Example:
x = y + 10
Tokens:
x : identifier
= : assignment operator
y : identifier
+ : addition operator
10 : number
Syntax analysis is all about discovering structure in code. It determines whether or not a text follows the expected format. The main aim of this phase is to check whether the source code written by the programmer is syntactically correct or not.
Syntax analysis is based on the rules of the specific programming language; it constructs the parse tree with the help of tokens. It also determines the structure of the source language and the grammar or syntax of the language.
Example
(a+b)*c
In the parse tree:
Interior node: a record with an operator field and two fields for children
Leaf: a record with two or more fields; one for the token and others for information about the token
Semantic analysis checks the semantic consistency of the code. It performs the following tasks:
Ensure that the components of the program fit together meaningfully
Gathers type information and checks for type compatibility
Checks operands are permitted by the source language
Helps you to store type information gathered and save it in symbol table or syntax tree
Allows you to perform type checking
In the case of a type mismatch, where there are no exact type correction rules satisfying the desired operation, a semantic error is shown
Example
float x = 20.2;
float y = x*30;
In the above code, the semantic analyzer will typecast the integer 30 to float 30.0 before multiplication
Once the semantic analysis phase is over, the compiler generates intermediate code for the target machine. It
represents a program for some abstract machine.
Intermediate code is between the high-level and machine level language. This intermediate code needs to be
generated in such a manner that makes it easy to translate it into the target machine code.
Functions on Intermediate Code generation:
Example
For example, for the statement total := count + rate * 5, the three-address code is:
t1 := int_to_float(5)
t2 := rate * t1
t3 := count + t2
total := t3
The next phase is optimization of the intermediate code. This phase removes unnecessary code lines and
arranges the sequence of statements to speed up the execution of the program without wasting resources.
The main goal of this phase is to improve on the intermediate code to generate a code that runs faster and
occupies less space.
Example:
Consider the following code
a = int_to_float(10)
b = c * a
d = e + b
f = d
can become
b = c * 10.0
f = e + b
Example:
a = b + 60.0
would possibly be translated into register-based code such as:
MOVF b, R1
ADDF #60.0, R1
MOVF R1, a
The most common errors are invalid character sequences in scanning, invalid token sequences in parsing, and scope errors and type errors in semantic analysis.
The error may be encountered in any of the above phases. After finding errors, the phase needs to deal with
the errors to continue with the compilation process. These errors need to be reported to the error handler
which handles the error to perform the compilation process. Generally, the errors are reported in the form of
message.
Summary
Compiler operates in various phases each phase transforms the source program from one
representation to another
Six phases of compiler design are 1) Lexical analysis 2) Syntax analysis 3) Semantic analysis 4)
Intermediate code generator 5) Code optimizer 6) Code Generator
Lexical Analysis is the first phase when compiler scans the source code
Syntax analysis is all about discovering structure in text
Semantic analysis checks the semantic consistency of the code
Once the semantic analysis phase is over, the compiler generates intermediate code for the target
machine
Code optimization phase removes unnecessary code lines and arranges the sequence of statements
Code generation phase gets input from the code optimization phase and produces the target code or
object code as a result
A symbol table contains a record for each identifier with fields for the attributes of the identifier
Error handling routine handles error and reports during many phases
Finite Automata (FA) is the simplest machine used to recognize patterns. On a transition, the automaton can either move to the next state or stay in the same state.
An FA gives one of two verdicts for an input string: accept or reject. When the input string is successfully processed and the automaton ends in a final state, the string is accepted.
Formally, a finite automaton is defined by five tuples (Q, ∑, δ, q0, F), where:
Q: finite set of states
∑: finite set of input symbols
q0: initial state
F: set of final states
δ: transition function, δ: Q x ∑ → Q
DFA :-
DFA stands for Deterministic Finite Automata. Deterministic refers to the uniqueness of the computation.
In DFA, the input character goes to one state only. DFA doesn't accept the null move that means the DFA
cannot change state without any input character.
In a DFA, the transition function is δ: Q x ∑ → Q.
Example
∑ = {0, 1}
q0 is the initial state
F = {q2}
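Since the transition diagram for this example is not reproduced in these notes, the following C sketch assumes, purely for illustration, a DFA with Q = {q0, q1, q2} over ∑ = {0, 1} that accepts strings containing at least two 1s, with F = {q2}. A DFA is naturally implemented as a transition table indexed by state and input symbol:

#include <stdio.h>
#include <string.h>

/* delta[state][symbol]: rows are q0..q2, columns are inputs 0 and 1 */
static const int delta[3][2] = {
    {0, 1},   /* q0: on 0 stay, on 1 go to q1 */
    {1, 2},   /* q1: on 0 stay, on 1 go to q2 */
    {2, 2},   /* q2: accepting and absorbing  */
};

/* returns 1 if the DFA accepts the string of '0'/'1' characters */
int dfa_accepts(const char *w) {
    int state = 0;                       /* q0 is the initial state */
    for (size_t i = 0; i < strlen(w); i++)
        state = delta[state][w[i] - '0'];
    return state == 2;                   /* F = {q2} */
}

int main(void) {
    printf("%d\n", dfa_accepts("0101"));  /* 1: contains two 1s */
    printf("%d\n", dfa_accepts("0010"));  /* 0: only one 1      */
    return 0;
}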
NDFA :-
NDFA refers to Non-Deterministic Finite Automata. On a given input, it can move to any number of states. NDFA accepts the NULL (∈) move, which means it can change state without reading a symbol.
NDFA also has the same five tuples as DFA, but it has a different transition function:
δ: Q x ∑ → 2^Q (the power set of Q)
o A regular expression is a sequence of patterns that defines a string. It is used to denote regular
languages.
o It is also used to match character combinations in strings. String searching algorithms use this
pattern to find the operations on a string.
o In a regular expression, x* means zero or more occurrences of x. It can generate {∈, x, xx, xxx,
xxxx,.....}
o In a regular expression, x+ means one or more occurrences of x. It can generate {x, xx, xxx, xxxx,.....}
Union: If L and M are two regular languages, then their union L U M is also a regular language.
L U M = {s | s is in L or s is in M}
Concatenation: If L and M are two regular languages, then their concatenation L.M is also a regular language.
L.M = {st | s is in L and t is in M}
Kleene closure: If L is a regular language, then its Kleene closure L* will also be a regular language.
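For instance, taking L = {a} and M = {b}: L U M = {a, b}, the concatenation L.M = {ab}, and the Kleene closure L* = {∈, a, aa, aaa, .....}; each of these is again a regular language.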
Example
Solution:
The strings of language L start with "a" followed by at least three b's. That is, the strings are like abbba, abbbbbba, abbbbbbbb, abbbb.....a
Compilers are tools used to translate a high-level programming language into a low-level programming language. A simple compiler works on one system only; when we need a compiler that can produce code for another platform, the cross compiler is introduced.
A cross compiler is a compiler capable of creating executable code for a platform other than the one on
which the compiler is running. For example, a cross compiler executes on machine X and produces machine
code for machine Y.
A cross compiler is a compiler capable of creating executable code for a platform other than the one on
which the compiler is running. In paravirtualization, one computer runs multiple operating systems and a
cross compiler could generate an executable for each of them from one main source.
The native compiler is a compiler that generates code for the same platform on which it runs and on the
other hand, a Cross compiler is a compiler that generates executable code for a platform other than one on
which the compiler is running.
The compiler writer can use some specialized tools that help in implementing various phases of a compiler.
These tools assist in the creation of an entire compiler or its parts. Some commonly used compiler
construction tools include:
1. Parser Generator – It produces syntax analyzers (parsers) from an input based on a grammatical description of the programming language, i.e., a context-free grammar. It is useful because the syntax analysis phase is highly complex and consumes more manual and compilation time. Example: YACC.
2.Scanner Generator – It generates lexical analyzers from the input that consists of regular expression description
based on tokens of a language. It generates a finite automaton to recognize the regular expression. Example: Lex
3. Syntax directed translation engines – They generate intermediate code in three-address format from an input that consists of a parse tree. These engines have routines to traverse the parse tree and produce the intermediate code. Each node of the parse tree is associated with one or more translations.
4. Automatic code generators – It generates the machine language for a target machine. Each operation of the
intermediate language is translated using a collection of rules and then is taken as an input by the code generator. A
template matching process is used. An intermediate language statement is replaced by its equivalent machine language
statement using templates.
5. Data-flow analysis engines – It is used in code optimization. Data flow analysis is a key part of the code
optimization that gathers the information, that is the values that flow from one part of a program to another.
6. Compiler construction toolkits – It provides an integrated set of routines that aids in building compiler
components or in the construction of various phases of compiler.
Optimization Tools: These tools help in optimizing the generated code for efficiency and
performance. They can perform various optimizations such as dead code elimination, loop
optimization, and register allocation.
Debugging Tools: These tools help in debugging the compiler itself or the programs that are being
compiled. They can provide debugging information such as symbol tables, call stacks, and runtime
errors.
Profiling Tools: These tools help in profiling the compiler or the compiled code to identify
performance bottlenecks and optimize the code accordingly.
Documentation Tools: These tools help in generating documentation for the compiler and the
programming language being compiled. They can generate documentation for the syntax, semantics,
and usage of the language.
Language Support: Compiler construction tools are designed to support a wide range of
programming languages, including high-level languages such as C++, Java, and Python, as well as
low-level languages such as assembly language.
Cross-Platform Support: Compiler construction tools may be designed to work on multiple
platforms, such as Windows, Mac, and Linux.
User Interface: Some compiler construction tools come with a user interface that makes it easier for
developers to work with the compiler and its associated tools.
LEX
o Lex is a program that generates lexical analyzers. It is used with the YACC parser generator.
o A lexical analyzer is a program that transforms an input stream into a sequence of tokens.
o Lex reads the input stream and produces C source code (lex.yy.c) as output, which implements the lexical analyzer.
A Lex program is separated into three sections by %% delimiters. The general format of a Lex source file is as follows:
{ definitions }
%%
{ rules }
%%
{ user subroutines }
User subroutines are auxiliary procedures needed by the actions. The subroutine can be loaded with the
lexical analyzer and compiled separately.
Lex is a tool which generates the Lexical analyser. Lex tool takes input as the regular expression and forms a
DFA corresponding to that regular expression.
Declaration
%%
Rules
%%
Auxiliary functions
a) Declaration:-
Auxiliary declarations start with %{ and end with %}. All statements written between these markers are copied directly into lex.yy.c (the generated lexical analyser).
Regular definitions in lex are used to name patterns that are then used in the rules part. This section is also used to set options such as noyywrap.
Example :-
%{
#include <stdio.h> /* DECLARATION */
%}
%option noyywrap
%%
%%
int main(void) {
    yylex();
    return 0;
}
CODE 1.1
b) Rules :-
Rules are written between the first %% and the second %%. Each rule consists of a pattern (a regular expression) and an action (C code) that is executed when the pattern matches.
c) Auxiliary Function :-
In addition to the code generated by the lex tool, if we want to define our own functions, then the auxiliary functions section is used [see CODE 1.1].
a) int yylex():-
yylex is the main scanner function. Lex creates yylex but does not call it; we need to call yylex in main to run the lexical analyser.
-> Otherwise, we need to compile the lex.yy.c file with the option -ll, which links in a default main.
Note:- If no action returns, yylex takes input recursively until end of file is reached. In the case of console input (stdin), we can give end of file by pressing Ctrl+D.
If yylex returns and we call yylex again, scanning resumes from the place where we left off.
The default action for input that matches no pattern is simply ECHO, i.e., the matched text is copied to the output (like the echo command in Linux, a built-in command that displays the lines of text passed as arguments).
b) int yywrap() :-
Lex calls the yywrap function whenever it encounters the end of a file. If yywrap returns a non-zero value, yylex terminates and returns to main (with value zero). If the programmer wants to scan more than one input file, then yywrap should return zero, stating that the work is not finished; meanwhile, inside yywrap we can change the file pointer so as to read another file.
Note:- Lex by default does not define yywrap, so we are obliged to define the yywrap function ourselves; otherwise linking fails with an error about yywrap. Alternatively, we can use %option noyywrap to let lex define yywrap internally.
This internal implementation returns a non-zero value, stating that the work is finished [see CODE 1.1].
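As an illustrative sketch of the multi-file case described above (the file name second.txt is purely an assumed example), the following yywrap could be placed in the user subroutines section of the Lex file:

int yywrap(void) {
    static int opened_second = 0;             /* have we switched files already? */
    if (!opened_second) {
        FILE *f = fopen("second.txt", "r");   /* hypothetical second input file */
        if (f) {
            yyin = f;                         /* redirect the scanner to the next file */
            opened_second = 1;
            return 0;                         /* zero: work is not finished, keep scanning */
        }
    }
    return 1;                                 /* non-zero: yylex terminates */
}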
Lex provides some predefined variables:
1. yyin
2. yytext
3. yyleng
1) yyin :-
yyin is a variable of type FILE * and points to the file which we want to give as input to the lexical analyser.
If we set yyin to some file, then the scanner will read the character stream from that file.
Example:-
//AUXILIARY FUNCTION
int main(void) {
    FILE *f = fopen("Temporary", "r");
    if (f) {
        yyin = f;
    }
    yylex();
    return 0;
}
-> Inside lex.yy.c:
if (!yyin) {
    yyin = stdin;
}
This shows that if yyin is not pointed at any file, then yyin gets initialised to stdin (standard input).
2) yytext :-
yytext is of char * type and it contains the lexeme most recently matched. On every pattern match, its value is overwritten again and again.
Example :-
%{
#include <stdio.h>
%}
%option noyywrap
DIGIT [0-9]+
%%
{DIGIT} { printf("Matched number: %s\n", yytext); } /* assumed rule: print the matched lexeme */
%%
int main(void) {
    yylex();
    return 0;
}
Note:- In this example we define DIGIT as a lex pattern variable; to access the variable in the rules section we use {}.
3) yyleng :-
yyleng is of int type; it stores the length of the matched lexeme, i.e., the size of yytext.
Example1. - LEX program to count the number of words in the given input (completed here with an assumed declaration of the counter i and a final printf).
%{
#include <stdio.h>
int i = 0;
%}
/* Rules Section*/
%%
[a-zA-Z0-9]+ {i++;} /* Rule for counting
number of words*/
.|\n { ; } /* ignore everything else */
%%
int yywrap(void){ return 1; }
int main()
{
// The function that starts the analysis
yylex();
printf("Number of words: %d\n", i);
return 0;
}
Example2. - LEX program to count the number of vowels and consonants in a given string .
%{
int vow_count=0;
int const_count =0;
%}
%%
[aeiouAEIOU] {vow_count++;}
[a-zA-Z] {const_count++;}
%%
int yywrap(){ return 1; }
int main()
{
printf("Enter the string of vowels and consonants:");
yylex();
printf("Number of vowels are: %d\n", vow_count);
printf("Number of consonants are: %d\n", const_count);
return 0;
}
Example3. - LEX program to check whether the given string is a palindrome.
%{
#include <stdio.h>
int i, j, flag;
%}
/* Rule Section */
%%
[a-zA-Z0-9]+ {
for (i = 0, j = yyleng - 1; i <= j; i++, j--) {
if (yytext[i] == yytext[j]) {
flag = 1;
}
else {
flag = 0;
break;
}
}
if (flag == 1)
printf("Given string is Palindrome");
else
printf("Given string is not Palindrome");
}
%%
// driver code
int main()
{
printf("Enter a string :");
yylex();
return 0;
}
int yywrap()
{
return 1;
}
Parser :-
A parser is the component of the compiler that breaks the data coming from the lexical analysis phase into smaller elements. A parser takes input in the form of a sequence of tokens and produces output in the form of a parse tree.
Top-Down Parsing is based on leftmost derivation, whereas Bottom-Up Parsing traces the reverse of a rightmost derivation.
The process of constructing the parse tree which starts from the root and goes down to the leaf is Top-Down
Parsing.
1. Top-Down Parsers require a grammar which is free from ambiguity and left recursion.
2. Top-Down Parsers use leftmost derivation to construct a parse tree.
3. They do not allow a grammar with common prefixes.
Parse Tree
1. Leftmost Derivation-
The process of deriving a string by expanding the leftmost non-terminal at each step is called
as leftmost derivation.
The geometrical representation of leftmost derivation is called as a leftmost derivation tree.
Example-
S → aB / bA
A → aS / bAA / a
B → bS / aBB / b
(Unambiguous Grammar)
Let us consider a string w = aaabbabbba
Leftmost Derivation-
S → aB
→ aaBB (Using B → aBB)
→ aaaBBB (Using B → aBB)
→ aaabBB (Using B → b)
→ aaabbSB (Using B → bS)
→ aaabbaBB (Using S → aB)
→ aaabbabB (Using B → b)
→ aaabbabbS (Using B → bS)
→ aaabbabbbA (Using S → bA)
→ aaabbabbba (Using A → a)
2. Rightmost Derivation-
The process of deriving a string by expanding the rightmost non-terminal at each step is called
as rightmost derivation.
The geometrical representation of rightmost derivation is called as a rightmost derivation tree.
S → aB / bA
A → aS / bAA / a
B → bS / aBB / b
(Unambiguous Grammar)
Let us consider a string w = aaabbabbba
Rightmost Derivation-
S → aB
→ aaBB (Using B → aBB)
→ aaBaBB (Using B → aBB)
→ aaBaBbS (Using B → bS)
→ aaBaBbbA (Using S → bA)
→ aaBaBbba (Using A → a)
→ aaBabbba (Using B → b)
→ aaaBBabbba (Using B → aBB)
→ aaaBbabbba (Using B → b)
→ aaabbabbba (Using B → b)
NOTES
For unambiguous grammars, leftmost derivation and rightmost derivation represent the same
parse tree.
For ambiguous grammars, leftmost derivation and rightmost derivation may represent different
parse trees.
Here, the given grammar was unambiguous. That is why the leftmost derivation and the rightmost derivation represent the same parse tree.
Leftmost Derivation Tree = Rightmost Derivation Tree
2. Without Backtracking:
1. Whenever a non-terminal is expanded for the first time, go with the first alternative and compare it with the given input string.
2. If matching does not occur, go with the second alternative and compare it with the given input string.
3. If matching is not found again, go with the next alternative, and so on.
4. If matching occurs for at least one alternative, then the input string is parsed successfully.
1. In LL(1), the first L stands for Left-to-right scanning and the second L stands for Left-most derivation. The 1 stands for the number of look-ahead tokens used by the parser while parsing a sentence.
2. LL(1) parsing is constructed from a grammar which is free from left recursion, common prefixes, and ambiguity.
3. An LL(1) parser depends on 1 look-ahead symbol to predict the production used to expand the parse tree.
4. This parser is non-recursive.
Features :
Predictive parsing: Top-down parsers often use predictive parsing techniques, in which the parser predicts the next symbol in the input based on the current state of the parse stack and the production rules of the grammar. This permits the parser to quickly determine whether a particular input string is valid under the grammar.
LL parsing: LL parsing is a particular type of top-down parsing that uses a left-to-right scan of the input and a leftmost derivation of the grammar. This form of parsing is commonly used in programming language compilers.
Recursive descent parsing: Recursive descent parsing is another type of top-down parsing that uses a set of recursive procedures to match the non-terminals of the grammar. Each non-terminal has a corresponding procedure that is responsible for parsing that non-terminal (a minimal sketch follows this list).
Backtracking: Top-down parsers may use backtracking to explore multiple parsing paths when the grammar is ambiguous or when a parsing error occurs. This can be expensive in terms of computation time and memory usage, so many top-down parsers use strategies to reduce the need for backtracking.
Memoization: Memoization is a technique used to cache intermediate parsing results and avoid repeated computation. Some top-down parsers use memoization to reduce the amount of backtracking required.
Lookahead: Top-down parsers may also use lookahead to predict the next symbol in the input based on a fixed number of input symbols. This can improve parsing speed and decrease the amount of backtracking required.
Error recovery: Top-down parsers may use error recovery techniques to deal with syntax errors in the input. These techniques may include inserting or deleting symbols to match the grammar, or skipping over erroneous symbols to continue parsing the input.
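To make the recursive descent idea concrete, here is a minimal C sketch (an illustration, not part of the original notes) for the left-recursion-free expression grammar E → TE', E' → +TE' / ∈, T → FT', T' → *FT' / ∈, F → id, where the single character 'i' stands for the token id. Each non-terminal becomes one procedure:

#include <stdio.h>
#include <stdlib.h>

static const char *ip;                  /* pointer to the next input symbol */

static void E(void); static void Eprime(void);
static void T(void); static void Tprime(void);
static void F(void);

static void reject(void) { printf("rejected\n"); exit(1); }
static void match(char c) { if (*ip == c) ip++; else reject(); }

static void E(void)      { T(); Eprime(); }                                  /* E  -> T E'      */
static void Eprime(void) { if (*ip == '+') { match('+'); T(); Eprime(); } }  /* E' -> +TE' | ∈  */
static void T(void)      { F(); Tprime(); }                                  /* T  -> F T'      */
static void Tprime(void) { if (*ip == '*') { match('*'); F(); Tprime(); } }  /* T' -> *FT' | ∈  */
static void F(void)      { match('i'); }                                     /* F  -> id ('i')  */

int main(void) {
    ip = "i+i*i";                       /* input: id + id * id */
    E();
    if (*ip == '\0') printf("accepted\n"); else reject();
    return 0;
}

Because this grammar is free from left recursion and common prefixes, one symbol of lookahead (*ip) is enough and no backtracking occurs.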
Advantages:
Easy to Understand: Top-down parsers are easy to understand and implement, making them a good choice
for small to medium-sized grammars.
Efficient: Some types of top-down parsers, such as LL(1) and predictive parsers, are efficient and can
handle larger grammars.
Flexible: Top-down parsers can be easily modified to handle different types of grammars and programming
languages.
Disadvantages:
Limited Power: Top-down parsers have limited power and may not be able to handle all types of grammars,
particularly those with complex structures or ambiguous rules.
Left-Recursion: Top-down parsers can suffer from left-recursion, which can make the parsing process more
complex and less efficient.
Look-Ahead Restrictions: Some top-down parsers, such as LL(1) parsers, have restrictions on the number
of look-ahead symbols they can use, which can limit their ability to handle certain types of grammars.
Bottom up parsing
Bottom-up parsing constructs the parse tree from the leaves (the input string) up to the root (the start symbol) by repeatedly reducing substrings that match the right-hand side of a production.
Production
1. E→T
2. T→T*F
3. T → id
4. F→T
5. F → id
1. Shift-Reduce Parsing
2. Operator Precedence Parsing
3. Table Driven LR Parsing
a. LR( 0 )
b. SLR( 1 )
c. CLR( 1 )
d. LALR( 1 )
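Using the productions listed above, one possible shift-reduce parse of the input string id * id proceeds as follows (the stack grows to the right; this trace is an illustration assuming the reductions T → id, F → id, T → T * F and E → T):

Stack        Input        Action
$            id * id $    shift id
$ id         * id $       reduce T → id
$ T          * id $       shift *
$ T *        id $         shift id
$ T * id     $            reduce F → id
$ T * F      $            reduce T → T * F
$ T          $            reduce E → T
$ E          $            accept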
Ambiguous Grammar-
A grammar is said to be ambiguous if, for some string generated by it, it produces more than one-
Parse tree
Or derivation tree
Or syntax tree
Or leftmost derivation
Or rightmost derivation
Example-
E → E + E / E x E / id
(Ambiguous Grammar)
Consider the string w = id + id x id.
Reason 1:-
Since two parse trees exist for string w, the grammar is ambiguous.
Reason 2:-
Since two syntax trees exist for string w, the grammar is ambiguous.
Reason 3:-
Since two leftmost derivations exist for string w, the grammar is ambiguous.
Reason 4:-
Since two rightmost derivations exist for string w, the grammar is ambiguous.
Unambiguous Grammar-
A grammar is said to be unambiguous if, for every string generated by it, it produces exactly one-
Parse tree
Or derivation tree
Or syntax tree
Or leftmost derivation
Or rightmost derivation
E→E+T/T
T→TxF/F
F → id
Unambiguous Grammar
Difference between Ambiguous and Unambiguous Grammar-
For ambiguous grammar, leftmost derivation and rightmost derivation represent different parse trees. For unambiguous grammar, leftmost derivation and rightmost derivation represent the same parse tree.
Ambiguous grammar contains a smaller number of non-terminals. Unambiguous grammar contains a larger number of non-terminals.
For ambiguous grammar, the length of the parse tree is less. For unambiguous grammar, the length of the parse tree is large.
Example (Ambiguous Grammar)-
E → E + E / E x E / id
Example (Unambiguous Grammar)-
E → E + T / T
T → T x F / F
F → id
Example 1 - Check whether the given grammar is ambiguous or not-
S → AB / C
A → aAb / ab
B → cBd / cd
C → aCd / aDd
D → bDc / bc
Solution :-
Let us consider the string w = aabbccdd.
It can be derived from S → AB as well as from S → C, giving two different parse trees.
Since two different parse trees exist for string w, therefore the given grammar is ambiguous.
Example 2 - Check whether the following grammar is ambiguous or not-
E→E+T/T
T→TxF/F
F → id
Solution-
There exists no string belonging to the language of grammar which has more than one parse tree.
Since a unique parse tree exists for all the strings, therefore the given grammar is unambiguous.
Example 3 –
S → aSbS / bSaS / ∈
Solution-
w = abab
Since two different parse trees exist for string w, therefore the given grammar is ambiguous.
To convert an ambiguous grammar into its corresponding unambiguous grammar, two types of constraints are implemented:
Precedence Constraints
Associativity Constraints
Rule-01:
The level at which the production is present defines the priority of the operator contained in it.
The higher the level of the production, the lower the priority of operator.
The lower the level of the production, the higher the priority of operator.
Rule-02:
The associativity constraint is implemented using the following rules-
A left associative operator is implemented using a left recursive production.
A right associative operator is implemented using a right recursive production.
Problem-01:
R → R + R / R . R / R* / a / b
Solution-
To convert the given grammar into its corresponding unambiguous grammar, we implement the precedence
and associativity constraints.
We have the operators +, . and * and the operands a, b,
where the priority order (highest to lowest) is: * , then . , then + , and the binary operators + and . are left associative.
E→E+T/T
T→T.F/F
F → F* / G
G→a/b
Unambiguous Grammar
OR
E→E+T/T
T→T.F/F
F → F* / a / b
Unambiguous Grammar
Problem-02:
Convert the following ambiguous grammar into its corresponding unambiguous grammar-
bexp → bexp or bexp / bexp and bexp / not bexp / T / F
where bexp represents a Boolean expression, T represents True and F represents False.
Solution-
To convert the given grammar into its corresponding unambiguous grammar, we implement the precedence and associativity constraints.
We have the operators or, and, not and the operands T, F,
where the priority order (highest to lowest) is: not, then and, then or, and the binary operators or and and are left associative.
Using the precedence and associativity rules, we write the corresponding unambiguous grammar as-
bexp → bexp or M / M
M → M and N / N
N → not N / G
G→T/F
Unambiguous Grammar
OR
bexp → bexp or M / M
M → M and N / N
N → not N / T / F
Unambiguous Grammar
In the field of parsing algorithms, backtracking plays a crucial role in handling ambiguity and in making alternative choices during the parsing process.
Specifically, in top-down parsers, which start with the initial grammar symbol and recursively expand non-terminals, backtracking allows the parser to explore different options when the chosen production fails to match the input.
This section discusses the concept of backtracking in top-down parsing, its significance in handling ambiguity, and its impact on the parsing process.
To improve the performance of top-down parsers, various optimization techniques can be employed,
including left-factoring, left-recursion elimination, and the use of lookahead tokens to predict the correct
production choice without excessive backtracking.
In top-down parsing with backtracking, the parser attempts multiple rules or productions to identify a match for the input string, backtracking at every step of the derivation if needed. So, if an applied production does not give the needed input string, or does not match it, the parser can undo that step.
Example- Consider the grammar
S → aAd
A → bc | b
Make the parse tree for the string w = abd. Also, show how backtracking is required when the wrong alternative is chosen.
Solution
Expanding S → aAd and trying the first alternative A → bc derives the string "abcd", which does not match "abd". The parser therefore backtracks, undoes the expansion of A, and tries A → b, which derives "abd" and matches the input.
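A small C sketch of this backtracking behaviour (illustrative only; a simplified, ordered-choice version): try_A first attempts the alternative A → bc and, on failure, falls back to A → b, exactly the undo-and-retry step described above.

#include <stdio.h>

static const char *input;

/* Try to match non-terminal A starting at position pos.
   Returns the new position on success, -1 on failure. */
static int try_A(int pos) {
    if (input[pos] == 'b' && input[pos + 1] == 'c')   /* first alternative: A -> bc */
        return pos + 2;
    if (input[pos] == 'b')                            /* backtrack and try: A -> b  */
        return pos + 1;
    return -1;
}

/* S -> a A d */
static int try_S(void) {
    int pos = 0;
    if (input[pos] != 'a') return 0;
    pos = try_A(pos + 1);
    if (pos < 0) return 0;
    return input[pos] == 'd' && input[pos + 1] == '\0';
}

int main(void) {
    input = "abd";
    printf("%s\n", try_S() ? "accepted" : "rejected");  /* accepted via A -> b */
    return 0;
}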
Top-down parsing is a technique where a parser starts with the start symbol of the grammar and attempts to
derive the input string by recursively expanding non-terminals. The process involves selecting a production
rule that matches the current input, expanding non-terminals, and backtracking when necessary. Let’s
explore the key aspects of backtracking in top-down parsing:
1. Ambiguity Resolution: Ambiguity arises when there are multiple choices available for expanding a
non-terminal or selecting a production rule. Backtracking allows the parser to explore these choices
systematically, backtracking to previous decision points and trying alternative paths if the current
choice fails. This iterative exploration continues until a successful match is found or all possibilities
are exhausted, leading to a parsing error.
2. Decision Points: At each step in the parsing process, the parser makes decisions such as choosing a
non-terminal to expand or selecting a production rule. These decision points serve as potential
backtracking points. If a chosen production fails to match the input, the parser backtracks to the
previous decision point and explores an alternative choice, allowing for a different path of parsing.
3. Recursive Expansion: During the parsing process, the chosen production rule is recursively
expanded by applying the rules to its non-terminals. This expansion continues until a terminal
symbol is reached or further expansion is not possible. If a successful match is not found, the parser
backtracks to the previous decision point to try an alternative choice.
4. Successful Match or Parsing Error: The parsing process concludes either when a successful match
is found, indicating that the input string conforms to the grammar, or when all possible alternatives
are exhausted, resulting in a parsing error.
1. Choose a non-terminal: At each step, the parser chooses a non-terminal from the current production
rule to expand.
2. Apply a production: The parser selects a production rule for the chosen non-terminal that matches
the current input. If multiple choices are available, it may need to try each alternative.
3. Recursive expansion: The chosen production is recursively expanded by applying the rules to its
non-terminals. This process continues until a terminal symbol is reached or until further expansion is
not possible.
4. Backtrack on failure: If a selected production fails to match the input, the parser backtracks to the
previous decision point, undoing the previous expansion and selecting an alternative choice if
available.
5. Repeat until success or failure: The parser repeats the above steps, trying different alternatives and
backtracking as necessary until it either successfully matches the entire input or exhausts all possible
alternatives, resulting in a parsing error.
Advantages of Backtracking
Flexibility: Backtracking allows a top-down parser to handle grammars where the correct production cannot be predicted in advance, by systematically exploring the alternative choices.
Simplicity: A backtracking parser is conceptually simple to implement, since it only needs decision points and the ability to undo expansions.
Disadvantages of Backtracking
Performance Impact: Backtracking can lead to inefficient parsing, particularly in cases where there
are numerous backtracking points or long ambiguous sequences in the input. In such scenarios,
alternative parsing techniques may be more efficient.
Complexity: Managing backtracking points and tracking alternative choices can introduce additional
complexity to the parsing algorithm, requiring careful implementation and optimization.
Predictive parsing is a straightforward form of recursive descent parsing, and it does not require backtracking. Instead, it can determine which production must be selected to derive the input string. Predictive parsing selects the correct production by looking at the input string; it is allowed to look at a fixed number of input symbols from the input string.
The main components of a predictive parser are:
Stack
It holds the grammar symbols that are yet to be matched.
Input Buffer
It includes the input string that the predictive parser needs to parse.
Parsing Table
With the entries present in this table, it becomes effortless for the top-down parser to choose the production to be applied. The input buffer and the stack both include the end marker '$'. It denotes the bottom of the stack and the end of the input string in the input buffer. In the beginning, the grammar symbol on top of the stack, above $, is the start symbol.
The parser first looks at the grammar symbol present on the top of the stack, say 'X', and compares it with the current input symbol, say 'a', present in the input buffer.
If X is a non-terminal, then the parser selects a production of X from the parse table, consulting the entry M[X, a].
In case X is a terminal, then the parser checks it for a match with the current symbol 'a'.
This is how predictive parsing recognizes the right production. In this way, it successfully derives the input string.
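A condensed C sketch of this table-driven loop is shown below, for the toy LL(1) grammar S → aSb / c (a grammar assumed here only for illustration). The parse table M is folded into the two if branches, M[S, a] = S → aSb and M[S, c] = S → c, and a production is pushed onto the stack in reverse order:

#include <stdio.h>
#include <string.h>

static char stack[100];
static int top = -1;
static void push(char c) { stack[++top] = c; }

/* returns 1 if w is accepted, 0 otherwise */
static int parse(const char *w) {
    int i = 0, n = (int)strlen(w);
    push('$');                            /* $ marks the bottom of the stack */
    push('S');                            /* start symbol sits on top of $   */
    while (top >= 0) {
        char X = stack[top--];            /* grammar symbol on top of the stack */
        char a = (i < n) ? w[i] : '$';    /* current input symbol */
        if (X == 'S') {                   /* non-terminal: consult entry M[X, a] */
            if (a == 'a') { push('b'); push('S'); push('a'); }   /* S -> aSb */
            else if (a == 'c') { push('c'); }                    /* S -> c   */
            else return 0;                /* empty table entry: syntax error */
        } else {                          /* terminal or $: must match the input */
            if (X == '$') return a == '$';    /* accept only at end of input */
            if (X != a) return 0;
            i++;
        }
    }
    return 0;
}

int main(void) {
    printf("%d\n", parse("aacbb"));   /* 1: aacbb is in the language */
    printf("%d\n", parse("acbb"));    /* 0: rejected */
    return 0;
}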
LL Parser
The LL parser is a predictive parser; it does not require backtracking. An LL(1) parser accepts only LL(1) grammars.
The first L in LL(1) implies that the parser scans the input string from left to right.
The second L denotes the leftmost derivation for the input string.
The '1' in LL(1) indicates that the parser looks ahead at a single input symbol from the input string.
First Function-
Rule-01:
For a production rule X → ∈,
First(X) = { ∈ }
Rule-02:
For any terminal symbol 'a',
First(a) = { a }
Rule-03:
For a production rule X → Y1Y2Y3,
Calculating First(X)-
If ∈ ∉ First(Y1), then First(X) = First(Y1).
If ∈ ∈ First(Y1), then First(X) = { First(Y1) – ∈ } ∪ First(Y2Y3).
Calculating First(Y2Y3)-
If ∈ ∉ First(Y2), then First(Y2Y3) = First(Y2).
If ∈ ∈ First(Y2), then First(Y2Y3) = { First(Y2) – ∈ } ∪ First(Y3).
Follow Function-
Rule-01:
For the start symbol S, place $ in Follow(S).
Rule-02:
For any production rule A → αB,
Follow(B) = Follow(A)
Rule-03:
For any production rule A → αBβ,
If ∈ ∉ First(β), then Follow(B) = First(β).
If ∈ ∈ First(β), then Follow(B) = { First(β) – ∈ } ∪ Follow(A).
Important Notes-
Note-01:
∈ may appear in the First function of a non-terminal, but it will never appear in the Follow function of a non-terminal.
Note-02:
Before calculating the first and follow functions, eliminate Left Recursion from the grammar, if
present.
Note-03:
We calculate the follow function of a non-terminal by looking where it is present on the RHS of a
production rule.
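For a quick illustration of these rules, consider the grammar S → Ab, A → aB / ∈, B → b. Then First(B) = { b }, First(A) = { a , ∈ } (Rule-01 contributes ∈), and First(S) = { a , b } (since ∈ ∈ First(A), First(S) = { First(A) – ∈ } ∪ First(b)). For the follow functions, Follow(S) = { $ }, Follow(A) = First(b) = { b }, and Follow(B) = Follow(A) = { b }.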
PRACTICE PROBLEMS BASED ON CALCULATING FIRST AND FOLLOW-
Example1. - Calculate the first and follow functions for the given grammar-
S → aBDh
B → cC
C → bC / ∈
D → EF
E→g/∈
F→f/∈
Solution-
First Functions-
First(S) = { a }
First(B) = { c }
First(C) = { b , ∈ }
First(D) = { g , f , ∈ }
First(E) = { g , ∈ }
First(F) = { f , ∈ }
Follow Functions-
Follow(S) = { $ }
Follow(B) = { First(D) – ∈ } ∪ First(h) = { g , f , h }
Follow(C) = Follow(B) = { g , f , h }
Follow(D) = First(h) = { h }
Follow(E) = { First(F) – ∈ } ∪ Follow(D) = { f , h }
Follow(F) = Follow(D) = { h }
Example2. - Calculate the first and follow functions for the given grammar-
S → AaAb / BbBa
A→∈
B→∈
Solution-
First Functions-
First(S) = { a , b }
First(A) = { ∈ }
First(B) = { ∈ }
Follow Functions-
Follow(S) = { $ }
Follow(A) = { a , b }
Follow(B) = { a , b }
Example3. - Calculate the first and follow functions for the given grammar-
S → ACB / CbB / Ba
A → da / BC
B→g/∈
C→h/∈
Solution-
First Functions-
First(S) = { d , g , h , ∈ , b , a }
First(A) = { d , g , h , ∈ }
First(B) = { g , ∈ }
First(C) = { h , ∈ }
Follow Functions-
Follow(S) = { $ }
Follow(A) = { h , g , $ }
Follow(B) = { $ , a , h , g }
Follow(C) = { g , $ , b , h }
If the RHS of more than one production starts with the same symbol, then such a grammar is called a grammar with common prefixes.
Example- A → aB / aC (both alternatives begin with the common prefix 'a')
This kind of grammar creates a problematic situation for top down parsers.
Top down parsers can not decide which production must be chosen to parse the string in hand.
Left Factoring-
Left factoring is a process by which the grammar with common prefixes is transformed to make it useful for
Top down parsers.
In left factoring, we make one production for each common prefix, and the rest of the derivation is added by new productions.
The grammar obtained after the process of left factoring is called a Left Factored Grammar.
PRACTICE PROBLEMS BASED ON LEFT FACTORING-
Problem-01:
Do left factoring in the following grammar-
S → iEtS / iEtSeS / a
E → b
Solution-
S → iEtSS’ / a
S’ → eS / ∈
E→b
Problem-02:
Do left factoring in the following grammar-
A → aAB / aBc / aAc
Solution-
Step-01:
A → aA’
A’ → AB / Bc / Ac
Step-02:
A → aA’
A’ → AD / Bc
D→B/c
Problem-03:
Do left factoring in the following grammar-
S → bSSaaS / bSSaSb / bSb / a
Solution-
Step-01:
S → bSS’ / a
S’ → SaaS / SaSb / b
Step-02:
S → bSS’ / a
S’ → SaA / b
A → aS / Sb
Problem-04:
Do left factoring in the following grammar-
S → aSSbS / aSaSb / abb / b
Solution-
Step-01:
S → aS’ / b
S’ → SSbS / SaSb / bb
Step-02:
S → aS’ / b
S’ → SA / bb
A → SbS / aSb
Problem-05:
S → a / ab / abc / abcd
Solution-
Step-01:
S → aS’
S’ → b / bc / bcd / ∈
Step-02:
S → aS’
S’ → bA / ∈
A → c / cd / ∈
Step-03:
S → aS’
S’ → bA / ∈
A → cB / ∈
B→d/∈
Problem-06:
S → aAd / aB
A → a / ab
B → ccd / ddc
Solution-
S → aS’
S’ → Ad / B
A → aA’
A’ → b / ∈
B → ccd / ddc
Recursion-
Recursion can be classified into following three types-
1. Left Recursion
2. Right Recursion
3. General Recursion
1. Left Recursion-
A production of a grammar is said to have left recursion if the leftmost variable of its RHS is the same as the variable of its LHS.
A grammar containing a production having left recursion is called a Left Recursive Grammar.
Example-
S → Sa / ∈
Left recursion is eliminated by converting the grammar into a right recursive grammar.
If we have the left recursive pair of productions-
A → Aα / β
(where β does not begin with A), then we can eliminate left recursion by replacing the pair of productions with-
A → βA’
A’ → αA’ / ∈
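For instance, applying this rule to the earlier example S → Sa / ∈ (here α = a and β = ∈) gives S → S’, S’ → aS’ / ∈, which is right recursive and generates the same language a*.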
2. Right Recursion-
A production of a grammar is said to have right recursion if the rightmost variable of its RHS is the same as the variable of its LHS.
A grammar containing a production having right recursion is called a Right Recursive Grammar.
Example-
S → aS / ∈
Right recursion does not create any problem for the Top down parsers.
Therefore, there is no need of eliminating right recursion from the grammar.
3. General Recursion-
The recursion which is neither left recursion nor right recursion is called as general recursion.
Example-
S → aSb / ∈
Problem-01:
A → ABd / Aa / a
B → Be / b
Solution-
A → aA’
A’ → BdA’ / aA’ / ∈
B → bB’
B’ → eB’ / ∈
Problem-02:
Consider the following grammar and eliminate left recursion-
E→E+E/ExE/a
Solution-
E → aA
A → +EA / xEA / ∈
Problem-03:
E→E+T/T
T→TxF/F
F → id
Solution-
E → TE’
E’ → +TE’ / ∈
T → FT’
T’ → xFT’ / ∈
F → id
Problem-04:
S → (L) / a
L→L,S/S
Solution-
S → (L) / a
L → SL’
L’ → ,SL’ / ∈
Problem-05:
S → S0S1S / 01
Solution-
S → 01A
A → 0S1SA / ∈
Problem-06:
S→A
A → Ad / Ae / aB / ac
B → bBc / f
Solution-
S→A
A → aBA’ / acA’
A’ → dA’ / eA’ / ∈
B → bBc / f
Problem-07:
A → AAα / β
Solution-
A → βA’
A’ → AαA’ / ∈
Problem-08:
A → Ba / Aa / c
B → Bb / Ab / d
Solution-
This is a case of indirect left recursion.
Step-01:
First, let us eliminate direct left recursion from the productions of A. We get-
A → BaA’ / cA’
A’ → aA’ / ∈
Now, the given grammar becomes-
A → BaA’ / cA’
A’ → aA’ / ∈
B → Bb / Ab / d
Step-02:
Substituting the productions of A in B → Ab, we get the following grammar-
A → BaA’ / cA’
A’ → aA’ / ∈
B → Bb / BaA’b / cA’b / d
Step-03:
Now, eliminating left recursion from the productions of B, we get the following grammar-
A → BaA’ / cA’
A’ → aA’ / ∈
B → cA’bB’ / dB’
B’ → bB’ / aA’bB’ / ∈
Problem-09:
X → XSb / Sa / b
S → Sb / Xa / a
Solution-
Step-01:
First, let us eliminate direct left recursion from the productions of X. We get-
X → SaX’ / bX’
X’ → SbX’ / ∈
Now, the given grammar becomes-
X → SaX’ / bX’
X’ → SbX’ / ∈
S → Sb / Xa / a
Step-02:
Substituting the productions of X in S → Xa, we get the following grammar-
X → SaX’ / bX’
X’ → SbX’ / ∈
S → Sb / SaX’a / bX’a / a
Step-03:
Now, eliminating left recursion from the productions of S, we get the following grammar-
X → SaX’ / bX’
X’ → SbX’ / ∈
S → bX’aS’ / aS’
S’ → bS’ / aX’aS’ / ∈
Problem-10:
S → Aa / b
A → Ac / Sd / ∈
Solution-
Step-01:
Substituting the productions of S in A → Sd, we get the following grammar-
S → Aa / b
A → Ac / Aad / bd / ∈
Step-02:
Now, eliminating left recursion from the productions of A, we get the following grammar-
S → Aa / b
A → bdA’ / A’
A’ → cA’ / adA’ / ∈