
Compiler Design

Module No. – 1

Contents :-
Definition of compiler, interpreter and its differences, the phases of a compiler, role of lexical analyzer,
regular expressions, finite automata, from regular expressions to finite automata, pass and phases of
translation, bootstrapping, LEX-lexical analyzer generator. PARSING: Parsing, role of parser, context free
grammar, derivations, parse trees, ambiguity, elimination of left recursion, left factoring, eliminating
ambiguity from dangling-else grammar, classes of parsing, top down parsing - backtracking, recursive
descent parsing, predictive parsers, LL(1) grammars.


Introduction to Compiler

o A compiler is a translator that converts the high-level language into the machine language.
o High-level language is written by a developer and machine language can be understood by the
processor.
o Compiler is used to show errors to the programmer.
o The main purpose of compiler is to change the code written in one language without changing the
meaning of the program.
o When you execute a program written in a high-level language, the execution proceeds in two parts.
o In the first part, the source program is compiled and translated into an object program (low-level
language).
o In the second part, the object program is translated into the target program by the assembler.

Fig: Execution process of source program in Compiler


Compiler Phases

The compilation process consists of a sequence of phases. Each phase takes the source program in one
representation and produces output in another representation; each phase receives its input from the previous stage.

We basically have two phases of compilers, namely the Analysis phase and Synthesis phase. The analysis
phase creates an intermediate representation from the given source code. The synthesis phase creates an
equivalent target program from the intermediate representation.

A compiler is a software program that converts the high-level source code written in a programming
language into low-level machine code that can be executed by the computer hardware. The process of
converting the source code into machine code involves several phases or stages, which are collectively
known as the phases of a compiler. The typical phases of a compiler are:

1. Lexical Analysis: The first phase of a compiler is lexical analysis, also known as scanning.
This phase reads the source code and breaks it into a stream of tokens, which are the basic units
of the programming language. The tokens are then passed on to the next phase for further
processing.
2. Syntax Analysis: The second phase of a compiler is syntax analysis, also known as parsing.
This phase takes the stream of tokens generated by the lexical analysis phase and checks
whether they conform to the grammar of the programming language. The output of this phase
is usually an Abstract Syntax Tree (AST).
3. Semantic Analysis: The third phase of a compiler is semantic analysis. This phase checks
whether the code is semantically correct, i.e., whether it conforms to the language’s type
system and other semantic rules.
4. Intermediate Code Generation: The fourth phase of a compiler is intermediate code
generation. This phase generates an intermediate representation of the source code that can be
easily translated into machine code.
5. Optimization: The fifth phase of a compiler is optimization. This phase applies various
optimization techniques to the intermediate code to improve the performance of the generated
machine code.
6. Code Generation: The final phase of a compiler is code generation. This phase takes the
optimized intermediate code and generates the actual machine code that can be executed by the
target hardware.

Symbol Table – It is a data structure used and maintained by the compiler, containing a record for each
identifier along with its type and other attributes. It helps the compiler function smoothly by locating
identifiers quickly.
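As an illustration, the sketch below shows one minimal way such a table could be organized in C. The struct layout, the insert/lookup helper names, and the linear search are assumptions made for this example; real compilers typically use hash tables and also record scope, size, and offset information.

#include <stdio.h>
#include <string.h>

/* Minimal symbol-table sketch: one record per identifier,
   holding its name and type. */
#define MAX_SYMBOLS 100

struct symbol {
    char name[32];     /* identifier name     */
    char type[16];     /* e.g. "int", "float" */
};

static struct symbol table[MAX_SYMBOLS];
static int n_symbols = 0;

/* Insert an identifier with its type; returns its index. */
int insert(const char *name, const char *type) {
    strncpy(table[n_symbols].name, name, sizeof table[n_symbols].name - 1);
    strncpy(table[n_symbols].type, type, sizeof table[n_symbols].type - 1);
    return n_symbols++;
}

/* Look an identifier up by name; returns its index, or -1 if absent. */
int lookup(const char *name) {
    for (int i = 0; i < n_symbols; i++)
        if (strcmp(table[i].name, name) == 0)
            return i;
    return -1;
}

int main(void) {
    insert("count", "int");
    insert("rate", "float");
    printf("rate found at index %d\n", lookup("rate"));
    return 0;
}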
The analysis of a source program is divided into mainly three phases. They are:
1. Linear Analysis-
This involves a scanning phase where the stream of characters is read from left to right. It is
then grouped into various tokens having a collective meaning.
2. Hierarchical Analysis-
In this analysis phase, based on a collective meaning, the tokens are categorized hierarchically
into nested groups.
3. Semantic Analysis-
This phase is used to check whether the components of the source program are meaningful or
not.

The compiler has two modules, namely the front end and the back end. The front end comprises the
lexical analyzer, syntax analyzer, semantic analyzer, and intermediate code generator. The
remaining phases are assembled to form the back end.
Lexical Analyzer –

It is also called a scanner. It takes the output of the preprocessor (which performs file
inclusion and macro expansion) as input, which is in a pure high-level language. It
reads the characters from the source program and groups them into lexemes (sequences
of characters that "go together"). Each lexeme corresponds to a token. Tokens are
defined by regular expressions which are understood by the lexical analyzer. It also
reports lexical errors (e.g., erroneous characters) and removes comments and white space.

Syntax Analyzer – It is sometimes called a parser. It constructs the parse tree. It takes all the tokens one
by one and uses Context-Free Grammar to construct the parse tree.

Why Grammar?

The rules of programming can be entirely represented in a few productions. Using these productions we
can represent what the program actually is. The input has to be checked whether it is in the desired format
or not.

The parse tree is also called the derivation tree. Parse trees are generally constructed to check for
ambiguity in the given grammar. There are certain rules associated with the derivation tree.
 Any identifier is an expression
 Any number can be called an expression
 Performing any operations in the given expression will always result in an
expression. For example, the sum of two expressions is also an expression.
 The parse tree can be compressed to form a syntax tree

Syntax error can be detected at this level if the input is not in accordance with the grammar.

 Semantic Analyzer – It verifies the parse tree, whether it’s meaningful or not. It
furthermore produces a verified parse tree. It also does type checking, Label checking,
and Flow control checking.
 Intermediate Code Generator – It generates intermediate code, which is a form that
can be readily executed by an abstract machine. There are many popular intermediate codes;
three-address code is one example. Intermediate code is converted to machine language
by the last two phases, which are platform dependent.
Up to the intermediate code, compilation is the same for every compiler; after that, it
depends on the platform. To build a new compiler we don't need to build it from
scratch: we can take the intermediate code from an already existing compiler and build
the last two parts.
 Code Optimizer – It transforms the code so that it consumes fewer resources and
produces more speed. The meaning of the code being transformed is not altered.
Optimization can be categorized into two types: machine-dependent and machine-
independent.

 Target Code Generator – The main purpose of the Target Code generator is to write a
code that the machine can understand and also register allocation, instruction selection,
etc. The output is dependent on the type of assembler. This is the final stage of
compilation. The optimized code is converted into relocatable machine code which then
forms the input to the linker and loader.

All these six phases are associated with the symbol table manager and error handler as shown in the above
block diagram.

The advantages of using a compiler to translate high-level programming languages into machine
code are:

1. Portability: Compilers allow programs to be written in a high-level programming language,


which can be executed on different hardware platforms without the need for modification. This
means that programs can be written once and run on multiple platforms, making them more
portable.
2. Optimization: Compilers can apply various optimization techniques to the code, such as loop
unrolling, dead code elimination, and constant propagation, which can significantly improve
the performance of the generated machine code.
3. Error Checking: Compilers perform a thorough check of the source code, which can detect
syntax and semantic errors at compile-time, thereby reducing the likelihood of runtime errors.
4. Maintainability: Programs written in high-level languages are easier to understand and
maintain than programs written in low-level assembly language. Compilers help in translating
high-level code into machine code, making programs easier to maintain and modify.
5. Productivity: High-level programming languages and compilers help in increasing the
productivity of developers. Developers can write code faster in high-level languages, which can
be compiled into efficient machine code.

Example:
Define Token, Lexeme, and Pattern
Tokens :-
It is basically a sequence of characters that are treated as a unit as it cannot be further broken down. In
programming languages like C language- keywords (int, char, float, const, goto, continue, etc.) identifiers
(user-defined names), operators (+, -, *, /), delimiters/punctuators like comma (,), semicolon(;), braces
({ }), etc. , strings can be considered as tokens. This phase recognizes three types of tokens: Terminal
Symbols (TRM)- Keywords and Operators, Literals (LIT), and Identifiers (IDN).
Example 1:

int a = 10; //Input Source code

Tokens
int (keyword), a(identifier), =(operator), 10(constant) and ;(punctuation-semicolon)
Answer – Total number of tokens = 5

Lexeme :-
It is a sequence of characters in the source code that is matched against the predefined rules of the
language and thereby classified as a valid token.
Example:
main is lexeme of type identifier(token)
(,),{,} are lexemes of type punctuation(token)

Pattern :-
It specifies a set of rules that a scanner follows to create a token.
Example of Programming Language (C, C++):
For a keyword to be identified as a valid token, the pattern is the sequence of characters that make up the
keyword.
For an identifier to be identified as a valid token, the pattern is the predefined rule that it must start with
a letter, followed by letters or digits.

Difference between Token, Lexeme, and Pattern

Criteria | Token | Lexeme | Pattern
Definition | A sequence of characters that is treated as a unit, as it cannot be broken down further. | A sequence of characters in the source code that is matched against the predefined language rules and classified as a valid token. | A set of rules that the scanner follows to create a token.
Interpretation of type Keyword | All the reserved keywords of the language (main, printf, etc.) | int, goto | The sequence of characters that make up the keyword.
Interpretation of type Identifier | The name of a variable, function, etc. | main, a | Must start with a letter, followed by letters or digits.
Interpretation of type Operator | All the operators are considered tokens. | +, = | +, =
Interpretation of type Punctuation | Each kind of punctuation is considered a token, e.g. semicolon, bracket, comma, etc. | (, ), {, } | (, ), {, }
Interpretation of type Literal | A grammar rule or boolean literal. | "Welcome to GeeksforGeeks!" | Any string of characters (except ' ') between " and ".

Compiler Passes

A pass is a complete traversal of the source program. A compiler may traverse the source program in one pass or in several passes.

Multi-pass Compiler
o A multi-pass compiler processes the source code of a program several times.
o In the first pass, the compiler reads the source program, scans it, extracts the tokens and stores the result
in an output file.
o In the second pass, the compiler reads the output file produced by the first pass, builds the syntactic tree
and performs the syntactic analysis. The output of this phase is a file that contains the syntactic
tree.
o In the third pass, the compiler reads the output file produced by the second pass and checks whether the tree
follows the rules of the language. The output of the semantic analysis phase is the annotated syntax
tree.
o These passes continue until the target output is produced.
One-pass Compiler
o A one-pass compiler traverses the program only once. The one-pass compiler passes only
once through the parts of each compilation unit. It translates each part into its final machine code.
o In a one-pass compiler, when a source line is processed, it is scanned and the tokens are extracted.
o Then the syntax of each line is analyzed and the tree structure is built. After the semantic part, the
code is generated.
o The same process is repeated for each line of code until the entire program is compiled.

Bootstrapping

o Bootstrapping is widely used in the compilation development.


o Bootstrapping is used to produce a self-hosting compiler. Self-hosting compiler is a type of compiler
that can compile its own source code.
o Bootstrap compiler is used to compile the compiler and then you can use this compiled compiler to
compile everything else as well as future versions of itself.

A compiler can be characterized by three languages:

1. Source Language
2. Target Language
3. Implementation Language

A T-diagram denotes a compiler by its three languages; here we write a compiler with source S, target T, implemented in I as C(S, I, T).

Follow these steps to produce a compiler for a new language L on machine A:

1. Create a compiler C(S, A, A) for a subset S of the desired language L, written in language A, running
on machine A.

2. Create a compiler C(L, S, A) for the full language L, written in the subset S of L.

3. Compile C(L, S, A) using the compiler C(S, A, A) to obtain C(L, A, A). C(L, A, A) is a compiler for language L, which runs
on machine A and produces code for machine A.
The process described by the T-diagrams is called bootstrapping.

OR
Bootstrapping is a process in which a simple language is used to translate a more complicated program,
which in turn may handle a still more complicated program, and so on. Writing a compiler for any high-level
language is a complicated process, and it takes a lot of time to write one from scratch, so a simple language
is used to generate the target code in stages.

To clearly understand the bootstrapping technique, consider the following scenario. Suppose we want to
write a cross compiler for a new language X. The implementation language of this compiler is Y and the
target code generated is in language Z; that is, we create C(X, Y, Z). Now if an existing compiler for Y runs
on machine M and generates code for M, it is denoted C(Y, M, M). If we run C(X, Y, Z) using C(Y, M, M),
then we get a compiler C(X, M, Z): a compiler for source language X that generates target code in language
Z and runs on machine M. The following diagram illustrates this scenario.

Example: We can create compilers of many different forms. Here we will generate a compiler which takes
the C language and generates assembly language as output, given a machine that runs assembly language.

 Step-1: First we write a compiler for a small subset of C, call it C0, in assembly language.

 Step-2: Then, using this small subset C0 as the implementation language, a compiler for the full source language C is written.

 Step-3: Finally we compile the second compiler: using compiler 1, compiler 2 is compiled.

 Step-4: Thus we get a compiler written in ASM which compiles C and generates code in ASM.

Bootstrapping is the process of writing a compiler for a programming language using the language
itself. In other words, it is the process of using a compiler written in a particular programming
language to compile a new version of the compiler written in the same language.

 The process of bootstrapping typically involves several stages. In the first stage, a minimal
version of the compiler is written in a different language, such as assembly language or C.
This minimal version of the compiler is then used to compile a slightly more complex version
of the compiler written in the target language. This process is repeated until a fully functional
version of the compiler is written in the target language.

 There are several advantages to bootstrapping. One advantage is that it ensures that the
compiler is compatible with the language it is designed to compile. This is because the
compiler is written in the same language, so it is better able to understand and interpret the
syntax and semantics of the language.

 Another advantage is that it allows for greater control over the optimization and code
generation process. Since the compiler is written in the target language, it can be optimized to
generate code that is more efficient and better suited to the target platform.

 However, bootstrapping also has some disadvantages. One disadvantage is that it can be a
time-consuming process, especially for complex languages or compilers. It can also be more
difficult to debug a bootstrapped compiler, since any errors or bugs in the compiler will affect
the subsequent versions of the compiler.

Overall, bootstrapping is an important technique in compiler design that allows for greater control over the
optimization and code generation process, while ensuring compatibility between the compiler and the target
language.

As for the advantages and disadvantages of bootstrapping in compiler design:

Advantages:

1. Bootstrapping ensures that the compiler is compatible with the language it is designed to compile, as
it is written in the same language.

2. It allows for greater control over the optimization and code generation process.

3. It provides a high level of confidence in the correctness of the compiler because it is self-hosted.
Disadvantages:

1. It can be a time-consuming process, especially for complex languages or compilers.

2. Debugging a bootstrapped compiler can be challenging since any errors or bugs in the compiler will
affect the subsequent versions of the compiler.

3. Bootstrapping requires that a minimal version of the compiler be written in a different language,
which can introduce compatibility issues between the two languages.

4. Overall, bootstrapping is a useful technique in compiler design, but it requires careful planning and
execution to ensure that the benefits outweigh the drawbacks.

Phase 1: Lexical Analysis

Lexical analysis is the first phase, in which the compiler scans the source code. The scan proceeds left to right,
character by character, grouping the characters into tokens.
Here, the character stream from the source program is grouped into meaningful sequences by identifying the
tokens. The corresponding entries are made into the symbol table, and each token is passed to the next
phase.
The primary functions of this phase are:

 Identify the lexical units in a source code


 Classify lexical units into classes like constants and reserved words, and enter them in different tables;
comments in the source program are ignored
 Identify tokens which are not a part of the language

Example:

x = y + 10

Tokens

x identifier
= assignment operator
y identifier
+ addition operator
10 number

Phase 2: Syntax Analysis

Syntax analysis is all about discovering structure in code. It determines whether or not a text follows the
expected format. The main aim of this phase is to check whether the source code written by the
programmer follows the grammar of the language.
Syntax analysis is based on the rules of the specific programming language: it constructs the parse
tree with the help of tokens, and it determines the structure of the source language and the grammar or syntax of the
language.
Here, is a list of tasks performed in this phase:

 Obtain tokens from the lexical analyzer


 Checks if the expression is syntactically correct or not
 Report all syntax errors
 Construct a hierarchical structure which is known as a parse tree

Example

Any identifier/number is an expression

If x is an identifier and y+10 is an expression, then x= y+10 is a statement.

Consider parse tree for the following example

(a+b)*c

In Parse Tree

 Interior node: a record with an operator field and two fields for children
 Leaf: a record with two or more fields; one for the token and others for information about the token
 Ensure that the components of the program fit together meaningfully
 Gathers type information and checks for type compatibility
 Checks operands are permitted by the source language

Phase 3: Semantic Analysis


Semantic analysis checks the semantic consistency of the code. It uses the syntax tree of the previous phase
along with the symbol table to verify that the given source code is semantically consistent. It also checks
whether the code is conveying an appropriate meaning.
Semantic Analyzer will check for Type mismatches, incompatible operands, a function called with improper
arguments, an undeclared variable, etc.
Functions of the semantic analysis phase are:

 Helps you to store type information gathered and save it in symbol table or syntax tree
 Allows you to perform type checking
 In the case of type mismatch, where there are no exact type correction rules which satisfy the desired
operation a semantic error is shown
 Collects type information and checks for type compatibility
 Checks if the source language permits the operands or not

Example
float x = 20.2;
float y = x*30;

In the above code, the semantic analyzer will typecast the integer 30 to float 30.0 before multiplication

Phase 4: Intermediate Code Generation

Once the semantic analysis phase is over, the compiler generates intermediate code for the target machine. It
represents a program for some abstract machine.
Intermediate code is between the high-level and machine level language. This intermediate code needs to be
generated in such a manner that makes it easy to translate it into the target machine code.
Functions of intermediate code generation:

 It should be generated from the semantic representation of the source program


 Holds the values computed during the process of translation
 Helps you to translate the intermediate code into target language
 Allows you to maintain precedence ordering of the source language
 It holds the correct number of operands of the instruction

Example

For example,

total = count + rate * 5

Intermediate code using the three-address code method is:

t1 := int_to_float(5)

t2 := rate * t1

t3 := count + t2
total := t3
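To make this concrete, here is a hedged C sketch of how three-address instructions like the ones above might be represented as quadruples (operator, up to two operands, result). The quad struct and emit helper are illustrative names invented for this sketch, not part of any particular compiler.

#include <stdio.h>

/* One three-address instruction as a quadruple. */
struct quad {
    const char *op;     /* "+", "*", "int_to_float", or "" for a copy */
    const char *arg1;
    const char *arg2;   /* empty for unary and copy instructions      */
    const char *result; /* temporary (t1, t2, ...) or variable name   */
};

static void emit(struct quad q) {
    if (q.arg2[0])                  /* binary: t := a op b */
        printf("%s := %s %s %s\n", q.result, q.arg1, q.op, q.arg2);
    else if (q.op[0])               /* unary:  t := op(a)  */
        printf("%s := %s(%s)\n", q.result, q.op, q.arg1);
    else                            /* copy:   t := a      */
        printf("%s := %s\n", q.result, q.arg1);
}

int main(void) {
    /* total = count + rate * 5, as in the example above */
    emit((struct quad){"int_to_float", "5",     "",   "t1"});
    emit((struct quad){"*",            "rate",  "t1", "t2"});
    emit((struct quad){"+",            "count", "t2", "t3"});
    emit((struct quad){"",             "t3",    "",   "total"});
    return 0;
}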

Phase 5: Code Optimization

The next phase is code optimization, performed on the intermediate code. This phase removes unnecessary lines of code and
rearranges the sequence of statements to speed up the execution of the program without wasting resources.
The main goal of this phase is to improve on the intermediate code to generate a code that runs faster and
occupies less space.

The primary functions of this phase are:

 It helps you to establish a trade-off between execution and compilation speed


 Improves the running time of the target program
 Generates streamlined code still in intermediate representation
 Removing unreachable code and getting rid of unused variables
 Moving statements that are not altered by the loop out of the loop (loop-invariant code motion)

Example:
Consider the following code

a = int_to_float(10)
b = c * a
d = e + b
f = d

can become

b = c * 10.0
f = e + b

Phase 6: Code Generation


Code generation is the last and final phase of a compiler. It gets input from the code optimization phase and
produces the target code or object code as a result. The objective of this phase is to allocate storage and
generate relocatable machine code.
It also allocates memory locations for the variables. The instructions in the intermediate code are converted
into machine instructions. This phase converts the optimized intermediate code into the target language.
The target language is the machine code. Therefore, all the memory locations and registers are also selected
and allotted during this phase. The code generated by this phase is executed to take inputs and generate the
expected outputs.

Example:
a = b + 60.0
would possibly be translated into register-based instructions:

MOVF b, R1
ADDF #60.0, R1
MOVF R1, a

Symbol Table Management:


A symbol table contains a record for each identifier with fields for the attributes of the identifier. This
component makes it easier for the compiler to search the identifier record and retrieve it quickly. The
symbol table also helps with scope management. The symbol table and error handler interact with all
the phases, and the symbol table is updated correspondingly.

Error Handling Routine:


In the compiler design process, errors may occur in any of the below-given phases:

 Lexical analyzer: Wrongly spelled tokens


 Syntax analyzer: Missing parenthesis
 Intermediate code generator: Mismatched operands for an operator
 Code Optimizer: When the statement is not reachable
 Code Generator: When the memory is full or proper registers are not allocated
 Symbol tables: Error of multiple declared identifiers

The most common errors are invalid character sequences in scanning, invalid token sequences in parsing, and
scope and type errors in semantic analysis.
An error may be encountered in any of the above phases. After finding an error, the phase must deal with
it to continue with the compilation process. Errors are reported to the error handler,
which handles them so that compilation can proceed. Generally, the errors are reported in the form of
messages.

Summary

 Compiler operates in various phases each phase transforms the source program from one
representation to another
 Six phases of compiler design are 1) Lexical analysis 2) Syntax analysis 3) Semantic analysis 4)
Intermediate code generator 5) Code optimizer 6) Code Generator
 Lexical Analysis is the first phase when compiler scans the source code
 Syntax analysis is all about discovering structure in text
 Semantic analysis checks the semantic consistency of the code
 Once the semantic analysis phase is over, the compiler generates intermediate code for the target
machine
 The code optimization phase removes unnecessary lines of code and rearranges the sequence of statements
 The code generation phase gets input from the code optimization phase and produces the target code or
object code as a result
 A symbol table contains a record for each identifier with fields for the attributes of the identifier
 Error handling routine handles error and reports during many phases

Finite state machine

A finite state machine is used to recognize patterns.


A finite automaton takes a string of symbols as input and changes its state accordingly. When a desired
symbol is found in the input, a transition occurs.

During a transition, the automaton can either move to another state or stay in the same state.

An FA either accepts or rejects an input string: when the input string has been processed successfully and the
automaton has reached a final state, the string is accepted; otherwise it is rejected.

A finite automaton consists of the following:

Q: finite set of states

∑: finite set of input symbols

q0: initial state

F: set of final states

δ: transition function

The transition function can be defined as

δ: Q x ∑ →Q

FA is classified in two ways:

 DFA (deterministic finite automata)


 NDFA (non-deterministic finite automata)

DFA :-

DFA stands for Deterministic Finite Automata. "Deterministic" refers to the uniqueness of the computation:
in a DFA, each input character takes the machine to exactly one state. A DFA does not accept null moves,
which means the DFA cannot change state without consuming an input character.

A DFA is defined by the five tuples {Q, ∑, q0, F, δ}:

Q: set of all states

∑: finite set of input symbols

q0: initial state

F: set of final states

δ: transition function, δ: Q x ∑ → Q

Example

An example of a deterministic finite automaton:

Q = {q0, q1, q2}

∑ = {0, 1}

initial state: q0

F = {q2}
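The example above lists Q, ∑, q0 and F but not the transition function δ, so the C sketch below assumes one for illustration: with the same states, this DFA accepts binary strings that end in "01". The two-dimensional array encodes δ: Q x ∑ → Q directly, one row per state and one column per input symbol.

#include <stdio.h>

/* Assumed transition function: accepts strings over {0,1}
   that end in "01". Rows are states q0..q2, columns are inputs 0,1. */
static const int delta[3][2] = {
    /*        on 0  on 1 */
    /* q0 */ { 1,    0 },
    /* q1 */ { 1,    2 },
    /* q2 */ { 1,    0 },
};

int accepts(const char *s) {
    int state = 0;                       /* start in q0            */
    for (; *s; s++)
        state = delta[state][*s - '0'];  /* one move per symbol    */
    return state == 2;                   /* accept iff we end in F */
}

int main(void) {
    printf("1101 -> %s\n", accepts("1101") ? "accepted" : "rejected");
    printf("110  -> %s\n", accepts("110")  ? "accepted" : "rejected");
    return 0;
}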

NDFA :-

NDFA refers to Non-Deterministic Finite Automata. From a given state, an NDFA may transition to any number of states for a
particular input. An NDFA accepts the NULL move, which means it can change state without reading a symbol.

An NDFA also has five tuples, the same as a DFA, but it has a different transition function.

The transition function of an NDFA can be defined as:

δ: Q x ∑ → 2^Q

Example :-

An example of a non-deterministic finite automaton:

Q = {q0, q1, q2}

∑ = {0, 1}

initial state: q0

F = {q2}
Regular expression

o A regular expression is a pattern that defines a set of strings. It is used to denote regular
languages.
o It is also used to match character combinations in strings; string searching algorithms use such
patterns to find operations on strings.
o In a regular expression, x* means zero or more occurrences of x. It can generate {ε, x, xx, xxx,
xxxx, ...}
o In a regular expression, x+ means one or more occurrences of x. It can generate {x, xx, xxx, xxxx, ...}

Operations on Regular Language

The various operations on regular language are:

Union: If L and M are two regular languages, then their union L U M is also regular.

1. L U M = {s | s is in L or s is in M}

Intersection: If L and M are two regular languages, then their intersection is also regular.

1. L ⋂ M = {s | s is in L and s is in M}

Kleene closure: If L is a regular language, then its Kleene closure L* is also a regular language.

1. L* = zero or more occurrences of language L.

Example

Write the regular expression for the language:

L = {ab^n w : n ≥ 3, w ∈ (a, b)+}

Solution:

Strings of the language L start with "a" followed by at least three b's, and end with at least one "a" or "b";
the strings look like abbba, abbbbbba, abbbbbbbb, abbbb.....a

So the regular expression is:


r = abbb b* (a+b)+

Here + is the positive closure, i.e. (a+b)+ = (a+b)* - ε
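As a quick sanity check of this regular expression, here is a small C sketch that tests membership in L directly; the function name in_L is invented for this example. It uses the fact that, over the alphabet {a, b}, the tail b*(a+b)+ simply means "at least one more a or b after the first three b's".

#include <stdio.h>

/* Membership test for L = { a b^n w : n >= 3, w in (a+b)+ },
   i.e. the regular expression a b b b b* (a+b)+. */
int in_L(const char *s) {
    if (s[0] != 'a' || s[1] != 'b' || s[2] != 'b' || s[3] != 'b')
        return 0;                     /* must begin with a b b b    */
    s += 4;
    if (!*s) return 0;                /* (a+b)+ : at least one more */
    for (; *s; s++)                   /* ... and nothing but a or b */
        if (*s != 'a' && *s != 'b') return 0;
    return 1;
}

int main(void) {
    printf("abbba -> %d\n", in_L("abbba"));  /* 1 */
    printf("abbbb -> %d\n", in_L("abbbb"));  /* 1 */
    printf("abbb  -> %d\n", in_L("abbb"));   /* 0: w is empty   */
    printf("abba  -> %d\n", in_L("abba"));   /* 0: only two b's */
    return 0;
}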

What is Cross Compiler?

Compilers are tools used to translate a high-level programming language into a low-level programming
language. A simple compiler works on one system only; when we need a compiler that can generate code
for another platform, a cross compiler is used.

A cross compiler is a compiler capable of creating executable code for a platform other than the one on
which the compiler is running. For example, a cross compiler executes on machine X and produces machine
code for machine Y.

Where is the cross compiler used?

 In bootstrapping, a cross-compiler is used for transitioning to a new platform. When developing


software for a new platform, a cross-compiler is used to compile necessary tools such as the
operating system and a native compiler.
 For microcontrollers, we use a cross compiler because they don't support an operating system.
 It is useful for embedded computers which have limited computing resources.
 To compile for a platform where it is not practical to do the compiling, a cross-compiler is used.
 When direct compilation on the target platform is not feasible, we can use the cross compiler.
 It helps to keep the target environment separate from the build environment.

A compiler is characterized by three languages:

1. The source language S that is compiled.


2. The target language T that is generated.
3. The implementation language I that is used for writing the compiler.
How does cross-compilation work?

A cross compiler is a compiler capable of creating executable code for a platform other than the one on
which the compiler is running. In paravirtualization, one computer runs multiple operating systems and a
cross compiler could generate an executable for each of them from one main source.

How compiler is different from a cross-compiler?

The native compiler is a compiler that generates code for the same platform on which it runs and on the
other hand, a Cross compiler is a compiler that generates executable code for a platform other than one on
which the compiler is running.

Explain in details various compiler writing tools.

The compiler writer can use some specialized tools that help in implementing various phases of a compiler.
These tools assist in the creation of an entire compiler or its parts. Some commonly used compiler
construction tools include:

1.Parser Generator – It produces syntax analyzers (parsers) from the input that is based on a grammatical description
of programming language or on a context-free grammar. It is useful as the syntax analysis phase is highly complex
and consumes more manual and compilation time.

2.Scanner Generator – It generates lexical analyzers from the input that consists of regular expression description
based on tokens of a language. It generates a finite automaton to recognize the regular expression. Example: Lex
3. Syntax directed translation engines – It generates intermediate code with three address format from the input that
consists of a parse tree. These engines have routines to traverse the parse tree and then produces the intermediate code.
In this, each node of the parse tree is associated with one or more translations.

4. Automatic code generators – It generates the machine language for a target machine. Each operation of the
intermediate language is translated using a collection of rules and then is taken as an input by the code generator. A
template matching process is used. An intermediate language statement is replaced by its equivalent machine language
statement using templates.

5. Data-flow analysis engines – It is used in code optimization. Data flow analysis is a key part of the code
optimization that gathers the information, that is the values that flow from one part of a program to another.

6. Compiler construction toolkits – It provides an integrated set of routines that aids in building compiler
components or in the construction of various phases of compiler.

Features of compiler construction tools :

 Optimization Tools: These tools help in optimizing the generated code for efficiency and
performance. They can perform various optimizations such as dead code elimination, loop
optimization, and register allocation.
 Debugging Tools: These tools help in debugging the compiler itself or the programs that are being
compiled. They can provide debugging information such as symbol tables, call stacks, and runtime
errors.
 Profiling Tools: These tools help in profiling the compiler or the compiled code to identify
performance bottlenecks and optimize the code accordingly.
 Documentation Tools: These tools help in generating documentation for the compiler and the
programming language being compiled. They can generate documentation for the syntax, semantics,
and usage of the language.
 Language Support: Compiler construction tools are designed to support a wide range of
programming languages, including high-level languages such as C++, Java, and Python, as well as
low-level languages such as assembly language.
 Cross-Platform Support: Compiler construction tools may be designed to work on multiple
platforms, such as Windows, Mac, and Linux.
 User Interface: Some compiler construction tools come with a user interface that makes it easier for
developers to work with the compiler and its associated tools.
LEX
o Lex is a program that generates a lexical analyzer. It is used with the YACC parser generator.
o The lexical analyzer is a program that transforms an input stream into a sequence of tokens.
o Lex reads the input specification and produces C source code that implements the lexical analyzer.

The function of Lex is as follows:


o First, the programmer writes a Lex program lex.l in the Lex language. The Lex compiler then translates the
lex.l program into a C program lex.yy.c.
o Next, the C compiler compiles lex.yy.c into an object program a.out.
o a.out is the lexical analyzer that transforms an input stream into a sequence of tokens.

Lex file format

A Lex program is separated into three sections by %% delimiters. The format of a Lex source file is as follows:

1. { definitions }
2. %%
3. { rules }
4. %%
5. { user subroutines }

Definitions include declarations of constants, variables and regular definitions.

Rules define statements of the form p1 {action1} p2 {action2} .... pn {actionn}.


where pi describes a regular expression and actioni describes what action the lexical analyzer
should take when pattern pi matches a lexeme.

User subroutines are auxiliary procedures needed by the actions. The subroutine can be loaded with the
lexical analyzer and compiled separately.

Lex is a tool which generates the Lexical analyser. Lex tool takes input as the regular expression and forms a
DFA corresponding to that regular expression.

Lex Programs has following section:-

Declaration

%%

RULES and ACTIONS

%%

Auxiliary function

a) Declaration:-

Two types of declarations are available in Lex:

-> Auxiliary declarations

-> Regular declarations

Auxiliary declarations start with %{ and end with %}. All the statements written in this part are copied directly into
lex.yy.c [the lexical analyser].

Regular declarations in Lex are used to define named patterns used in the rules part. They are also used to define
options such as noyywrap.

Example :-
%{

#include<iostream> //DECLARATION

using namespace std;

%}

%option noyywrap

%%

int {cout<<"Integer detected...";} //RULES

%%

int main(){ //AUXILIARY FUNCTION

yylex();

return 0;

}

CODE 1.1

b) Rules :-

Rules consist of two parts.

-> Pattern to be matched

-> Action to be taken

[see code 1.1]

c) Auxiliary Functions :-

If, in addition to the code generated by the Lex tool, we want to define some functions, the auxiliary function
section is used [see code 1.1].

Functions of Lex File

a) int yylex():-

yylex is the main scanner function. Lex creates yylex but does not call it; we need to call yylex in main in order
to run the lexical analyser.

-> Otherwise we need to compile the lex.yy.c file with the option -ll.

Note:- If yylex does not return, it consumes input until end of file is reached. For console
input (stdin) we can signal end of file by pressing Ctrl+D.

If yylex returns and we call yylex again, then we get the input from the place we left off.

By default, input that matches no pattern is simply echoed (ECHO). (The echo command in Linux is a built-in command
that allows users to display lines of text or strings that are passed as arguments.)

b) int yywrap() :-
Lex calls the yywrap function whenever it encounters the end of a file. If yywrap
returns a non-zero value, yylex terminates and returns to main [with value zero]. If the programmer wants to
scan more than one input file, then yywrap should return zero, stating that the work is not finished. Meanwhile, inside
yywrap we can change the file pointer so as to read another file.

Note:- Lex by default does not define yywrap, so we are obliged to define the yywrap
function ourselves; otherwise linking fails with an undefined-reference error. Alternatively we can use %option
noyywrap to have Lex define yywrap internally.

This internal implementation returns a non-zero value, stating that the work is finished. [see code 1.1]

Variables in lex tools

There are three major variables in Lex:

1. yyin

2. yytext

3. yyleng

1) yyin :-

yyin is a variable of type FILE * and points to the file which we want to input to the lexical analyser.

If we set yyin to some file, then it will read the character stream from that file.

Example:-

//AUXILIARY FUNCTION

int main(void){

FILE *f = fopen("Temporary", "r");

if(f){

yyin = f;

yylex();

}

return 0;

}

By default the value of yyin is stdin.

-> Inside lex.yy.c:

if(!yyin)
{

yyin = stdin;

}

This shows that if yyin is not pointed at any file, then yyin gets initialised to stdin [standard
input].

2) yytext :

yytext is of type char* and contains the matched lexeme. On every pattern match its previous
value is overwritten.

Example :-

%{

#include<iostream>

using namespace std;

%}

%option noyywrap

DIGIT [0-9]+

%%

{DIGIT} {cout<<"Digit Found "<<yytext;}

%%

int main(void){

yylex();

return 1;

}

Note:- In this example we define DIGIT as a Lex pattern variable; to access the variable in the pattern section
we use {}.

3) yyleng :-

yyleng is of type int; it stores the length of the matched lexeme, i.e. the size of yytext.

Example1 :- Write a lex program to count the number of words.

/*lex program to count number of words*/


%{
#include<stdio.h>
#include<string.h>
int i = 0;
%}

/* Rules Section*/
%%
([a-zA-Z0-9])+ {i++;} /* Rule for counting
number of words*/

"\n" {printf("%d\n", i); i = 0;}


%%

int yywrap(void){ return 1; }

int main()
{
// The function that starts the analysis
yylex();

return 0;
}

Output :-

Example2. - LEX program to count the number of vowels and consonants in a given string .

%{
int vow_count=0;
int const_count =0;
%}

%%
[aeiouAEIOU] {vow_count++;}
[a-zA-Z] {const_count++;}
%%
int yywrap(){ return 1; }
int main()
{
printf("Enter the string of vowels and consonants:");
yylex();
printf("Number of vowels are: %d\n", vow_count);
printf("Number of consonants are: %d\n", const_count);
return 0;
}

Output:-

Example3: Lex program to check whether given string is Palindrome or Not .

/* Lex program to check whether
a given string is Palindrome or Not */

%{
int i, j, flag;
%}

/* Rule Section */
%%
[a-zA-Z0-9]* {
for (i = 0, j = yyleng - 1; i <= j; i++, j--) {
if (yytext[i] == yytext[j]) {
flag = 1;
}
else {
flag = 0;
break;
}
}
if (flag == 1)
printf("Given string is Palindrome");
else
printf("Given string is not Palindrome");
}
%%

// driver code
int main()
{
printf("Enter a string :");
yylex();
return 0;
}
int yywrap()
{
return 1;
}

Output:-

Parser :-
The parser is the component of the compiler that breaks the data coming from the lexical analysis phase into smaller elements.

A parser takes input in the form of sequence of tokens and produces output in the form of parse tree.

Parsing is of two types: top down parsing and bottom up parsing.


Top down parsing

o Top down parsing is also known as recursive parsing or predictive parsing.


o Top down parsing is used to construct a parse tree for an input string.
o In top down parsing, the parsing starts from the start symbol and transforms it into the input
string.

Parse Tree representation of input string "acdb" is as follows:

Top-Down Parsing is based on Left Most Derivation whereas Bottom-Up Parsing is dependent on Reverse
Right Most Derivation.

The process of constructing the parse tree which starts from the root and goes down to the leaf is Top-Down
Parsing.

1. Top-down parsers are constructed from grammars which are free from ambiguity and left recursion.
2. Top-Down Parsers uses leftmost derivation to construct a parse tree.
3. It does not allow Grammar With Common Prefixes.

Parse Tree

 The process of deriving a string is called as derivation.


 The geometrical representation of a derivation is called as a parse tree or derivation tree.

1. Leftmost Derivation-

 The process of deriving a string by expanding the leftmost non-terminal at each step is called
as leftmost derivation.
 The geometrical representation of leftmost derivation is called as a leftmost derivation tree.

Example-

Consider the following grammar-

S → aB / bA

A → aS / bAA / a

B → bS / aBB / b

(Unambiguous Grammar)

Let us consider a string w = aaabbabbba

Now, let us derive the string w using leftmost derivation.

Leftmost Derivation-

S → aB

→ aaBB (Using B → aBB)


→ aaaBBB (Using B → aBB)

→ aaabBB (Using B → b)

→ aaabbB (Using B → b)

→ aaabbaBB (Using B → aBB)

→ aaabbabB (Using B → b)

→ aaabbabbS (Using B → bS)

→ aaabbabbbA (Using S → bA)

→ aaabbabbba (Using A → a)

2. Rightmost Derivation-

 The process of deriving a string by expanding the rightmost non-terminal at each step is called
as rightmost derivation.
 The geometrical representation of rightmost derivation is called as a rightmost derivation tree.

Consider the following grammar-

S → aB / bA

A → aS / bAA / a

B → bS / aBB / b

(Unambiguous Grammar)
Let us consider a string w = aaabbabbba

Now, let us derive the string w using rightmost derivation.

Rightmost Derivation-

S → aB
→ aaBB (Using B → aBB)
→ aaBaBB (Using B → aBB)
→ aaBaBbS (Using B → bS)
→ aaBaBbbA (Using S → bA)
→ aaBaBbba (Using A → a)
→ aaBabbba (Using B → b)
→ aaaBBabbba (Using B → aBB)
→ aaaBbabbba (Using B → b)
→ aaabbabbba (Using B → b)

NOTES

 For unambiguous grammars, Leftmost derivation and Rightmost derivation represents the same
parse tree.
 For ambiguous grammars, Leftmost derivation and Rightmost derivation represents different
parse trees.

Here,
 The given grammar was unambiguous.
 That is why, leftmost derivation and rightmost derivation represents the same parse tree.
Leftmost Derivation Tree = Rightmost Derivation Tree

Properties Of Parse Tree-

 Root node of a parse tree is the start symbol of the grammar.


 Each leaf node of a parse tree represents a terminal symbol.
 Each interior node of a parse tree represents a non-terminal symbol.
 Parse tree is independent of the order in which the productions are used during derivations.

Classification of Top-Down Parsing –

1. With Backtracking: Brute Force Technique

2. Without Backtracking:

1. Recursive Descent Parsing


2. Predictive Parsing or Non-Recursive Parsing or LL(1) Parsing or Table Driven Parsing

Recursive Descent Parsing –

1. When a non-terminal is expanded for the first time, go with the first alternative and compare it with
the given input string.
2. If a match does not occur, go with the second alternative and compare it with the given input string.
3. If a match is not found again, go with the next alternative, and so on.
4. If a match occurs for at least one alternative, then the input string is parsed successfully (see the sketch after this list).
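The C sketch below illustrates these steps for the small grammar S → aAd, A → bc | b (the same grammar used in the backtracking example later in these notes). The helper names match and parse, and the global input cursor, are assumptions of this sketch. On input "abd", A first tries the alternative bc, fails, restores the cursor, and then succeeds with b.

#include <stdio.h>

/* Recursive-descent sketch with backtracking for
       S -> a A d
       A -> b c | b                               */
static const char *input;
static int pos;                       /* current input cursor     */

static int match(char c) {
    if (input[pos] == c) { pos++; return 1; }
    return 0;
}

static int A(void) {
    int save = pos;                   /* remember where we were   */
    if (match('b') && match('c'))
        return 1;                     /* first alternative: b c   */
    pos = save;                       /* backtrack                */
    return match('b');                /* second alternative: b    */
}

static int S(void) {
    return match('a') && A() && match('d');
}

static int parse(const char *s) {
    input = s; pos = 0;
    return S() && input[pos] == '\0'; /* whole string consumed?   */
}

int main(void) {
    printf("abcd -> %s\n", parse("abcd") ? "parsed" : "rejected");
    printf("abd  -> %s\n", parse("abd")  ? "parsed" : "rejected");
    return 0;
}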

LL(1) or Table Driven or Predictive Parser –

1. In LL(1), the first L stands for Left-to-right scanning and the second L stands for Left-most derivation; 1 stands for the
number of look-ahead tokens used by the parser while parsing a sentence.
2. LL(1) parsing is constructed from a grammar which is free from left recursion, common prefixes,
and ambiguity.
3. The LL(1) parser depends on 1 look-ahead symbol to predict the production used to expand the parse tree.
4. This parser is non-recursive (a table-driven sketch follows this list).
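Below is a hedged C sketch of a table-driven LL(1) parser for the illustrative grammar S → ( S ) S | ε, which is free from left recursion, common prefixes, and ambiguity. Since FIRST(S) = { (, ε } and FOLLOW(S) = { ), $ }, the parsing table is M[S, (] = (S)S and M[S, )] = M[S, $] = ε; for brevity the table is hard-coded in the if/else chain below rather than stored as an array.

#include <stdio.h>

/* Non-recursive, table-driven LL(1) parse of S -> ( S ) S | epsilon. */
int parse(const char *w) {
    char stack[256];
    int top = 0, i = 0;
    stack[top++] = '$';
    stack[top++] = 'S';                    /* push start symbol       */
    while (top > 0) {
        char X = stack[--top];             /* pop the stack top       */
        char a = w[i] ? w[i] : '$';        /* one-symbol lookahead    */
        if (X == 'S') {                    /* non-terminal: use table */
            if (a == '(') {                /* M[S,(] = (S)S, reversed */
                stack[top++] = 'S';
                stack[top++] = ')';
                stack[top++] = 'S';
                stack[top++] = '(';
            } else if (a != ')' && a != '$') {
                return 0;                  /* empty table entry       */
            }                              /* else S -> epsilon       */
        } else {                           /* terminal or $           */
            if (X != a) return 0;          /* must match lookahead    */
            if (a != '$') i++;             /* consume input symbol    */
        }
    }
    return w[i] == '\0';                   /* all input consumed      */
}

int main(void) {
    printf("(()) -> %s\n", parse("(())") ? "accepted" : "rejected");
    printf("(()  -> %s\n", parse("(()")  ? "accepted" : "rejected");
    return 0;
}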

Features :

Predictive parsing: Top-down parsers often use predictive parsing techniques, in which the parser predicts
the next symbol in the input based on the current state of the parse stack and the
production rules of the grammar. This permits the parser to quickly determine whether a particular input string
is valid under the grammar.

LL parsing: LL parsing is a specific type of top-down parsing that uses a left-to-right scan of the input
and a leftmost derivation of the grammar. This form of parsing is commonly used in programming language
compilers.
Recursive descent parsing: Recursive descent parsing is another type of top-down parsing that uses a
set of recursive procedures to match the non-terminals of the grammar. Each non-terminal has a
corresponding procedure that is responsible for parsing that non-terminal.

Backtracking: Top-down parsers may use backtracking to explore multiple parsing paths when the
grammar is ambiguous or when a parsing error occurs. This can be expensive in terms of
computation time and memory usage, so many top-down parsers use strategies to reduce the
need for backtracking.

Memoization: Memoization is a technique used to cache intermediate parsing results and avoid
repeated computation. Some top-down parsers use memoization to reduce the amount of backtracking
required.

Lookahead: Top-down parsers may also use lookahead to predict the next symbol in the input based
on a fixed number of input symbols. This can improve parsing speed and reduce the amount of
backtracking required.

Error recovery: Top-down parsers may use error recovery techniques to deal with syntax errors
in the input. These techniques may include inserting or deleting symbols to match the
grammar, or skipping over erroneous symbols to continue parsing the input.

Advantages:

Easy to Understand: Top-down parsers are easy to understand and implement, making them a good choice
for small to medium-sized grammars.

Efficient: Some types of top-down parsers, such as LL(1) and predictive parsers, are efficient and can
handle larger grammars.

Flexible: Top-down parsers can be easily modified to handle different types of grammars and programming
languages.

Disadvantages:

Limited Power: Top-down parsers have limited power and may not be able to handle all types of grammars,
particularly those with complex structures or ambiguous rules.

Left-Recursion: Top-down parsers can suffer from left-recursion, which can make the parsing process more
complex and less efficient.

Look-Ahead Restrictions: Some top-down parsers, such as LL(1) parsers, have restrictions on the number
of look-ahead symbols they can use, which can limit their ability to handle certain types of grammars.
Bottom up parsing

o Bottom up parsing is also known as shift-reduce parsing.


o Bottom up parsing is used to construct a parse tree for an input string.
o In bottom up parsing, the parsing starts with the input symbols and constructs the parse tree up to
the start symbol by tracing out the rightmost derivation of the string in reverse.

Production

1. E→T
2. T→T*F
3. T → id
4. F→T
5. F → id

Parse Tree representation of input string "id * id" is as follows:

Bottom up parsing is classified in to various parsing. These are as follows:

1. Shift-Reduce Parsing
2. Operator Precedence Parsing
3. Table Driven LR Parsing

a. LR(0)
b. SLR(1)
c. CLR(1)
d. LALR(1)
Ambiguous Grammar-

A grammar is said to be ambiguous if, for some string generated by it, it produces more than one-

 Parse tree
 Or derivation tree
 Or syntax tree
 Or leftmost derivation
 Or rightmost derivation

Consider the following grammar-

E → E + E / E x E / id

Ambiguous Grammar

Let us consider a string w generated by the grammar-

w = id + id x id

Now, let us draw the parse trees for this string w.

Reason 1:-

Since two parse trees exist for string w, therefore the grammar is ambiguous.
Reason 2:-

Let us consider a string w generated by the grammar-

w = id + id x id

Now, let us draw the syntax trees for this string w.

Since two syntax trees exist for string w, therefore the grammar is ambiguous.

Reason 3:-

Let us consider a string w generated by the grammar-

w = id + id x id

Now, let us write the leftmost derivations for this string w.

Leftmost derivation 1: E → E + E → id + E → id + E x E → id + id x E → id + id x id

Leftmost derivation 2: E → E x E → E + E x E → id + E x E → id + id x E → id + id x id

Since two leftmost derivations exist for string w, therefore the grammar is ambiguous.
Reason 4:-

Let us consider a string w generated by the grammar-

w = id + id x id

Now, let us write the rightmost derivations for this string w.

Rightmost derivation 1: E → E + E → E + E x E → E + E x id → E + id x id → id + id x id

Rightmost derivation 2: E → E x E → E x id → E + E x id → E + id x id → id + id x id

Since two rightmost derivations exist for string w, therefore the grammar is ambiguous.

Unambiguous Grammar-

A grammar is said to be unambiguous if, for every string generated by it, it produces exactly one-

 Parse tree
 Or derivation tree
 Or syntax tree
 Or leftmost derivation
 Or rightmost derivation

Consider the following grammar-

E→E+T/T

T→TxF/F

F → id

Unambiguous Grammar

This grammar is an example of unambiguous grammar.


Difference Between Ambiguous and Unambiguous Grammar-

Criteria | Ambiguous Grammar | Unambiguous Grammar
Definition | For at least one string generated by it, the grammar produces more than one parse tree (or derivation tree, syntax tree, leftmost derivation, or rightmost derivation). | For all strings generated by it, the grammar produces exactly one parse tree (or derivation tree, syntax tree, leftmost derivation, or rightmost derivation).
Derivations | Leftmost derivation and rightmost derivation represent different parse trees. | Leftmost derivation and rightmost derivation represent the same parse tree.
Non-terminals | Contains a smaller number of non-terminals. | Contains a larger number of non-terminals.
Parse tree | Length of the parse tree is smaller. | Length of the parse tree is larger.
Speed | Faster than unambiguous grammar in the derivation of a tree. | Slower than ambiguous grammar in the derivation of a tree.
Example | E → E + E / E x E / id | E → E + T / T, T → T x F / F, F → id

PROBLEMS BASED ON CHECKING WHETHER GRAMMAR IS AMBIGUOUS-

Example1 . Check whether the given grammar is ambiguous or not-


S → AB / C
A → aAb / ab
B → cBd / cd
C → aCd / aDd

D → bDc / bc

Solution :-

Let us consider a string w generated by the given grammar-

w = aabbccdd

Now, let us draw parse trees for this string w.

Parse tree 1 (via S → AB): S → AB → aAbB → aabbB → aabbcBd → aabbccdd

Parse tree 2 (via S → C): S → C → aCd → aaDdd → aabDcdd → aabbccdd

Since two different parse trees exist for string w, therefore the given grammar is ambiguous.

Example 2 - Check whether the given grammar is ambiguous or not-

E→E+T/T

T→TxF/F

F → id

Solution-
 There exists no string belonging to the language of grammar which has more than one parse tree.
 Since a unique parse tree exists for all the strings, therefore the given grammar is unambiguous.

Example 3 –

Check whether the given grammar is ambiguous or not-

S → aSbS / bSaS / ε

Solution-

Let us consider a string w generated by the given grammar-

w = abab

Now, let us draw parse trees for this string w.

Parse tree 1 (leftmost derivation): S → aSbS → abS → abaSbS → ababS → abab

Parse tree 2 (leftmost derivation): S → aSbS → abSaSbS → abaSbS → ababS → abab

Since two different parse trees exist for string w, therefore the given grammar is ambiguous.

Converting Ambiguous Grammar Into Unambiguous Grammar-


The ambiguity from the grammar may be removed using the following methods-

 By fixing the grammar


 By adding grouping rules
 By using semantics and choosing the parse that makes the most sense
 By adding the precedence rules or other context sensitive parsing rules

Removing Ambiguity By Precedence & Associativity Rules-

An ambiguous grammar may be converted into an unambiguous grammar by implementing-

 Precedence Constraints
 Associativity Constraints

These constraints are implemented using the following rules-

Rule-01:

The precedence constraint is implemented using the following rules-

 The level at which the production is present defines the priority of the operator contained in it.
 The higher the level of the production, the lower the priority of operator.
 The lower the level of the production, the higher the priority of operator.

Rule-02:
The associativity constraint is implemented using the following rules-

 If the operator is left associative, induce left recursion in its production.


 If the operator is right associative, induce right recursion in its production.

PROBLEMS BASED ON CONVERSION INTO UNAMBIGUOUS GRAMMAR-

Problem-01:

Convert the following ambiguous grammar into unambiguous grammar-

R → R + R / R . R / R* / a / b

where * is the Kleene closure and . is concatenation.

Solution-

To convert the given grammar into its corresponding unambiguous grammar, we implement the precedence
and associativity constraints.

We have-

 Given grammar consists of the following operators-

+,.,*

 Given grammar consists of the following operands-

a,b

The priority order is-

(a , b) > * > . > +

where-

 . operator is left associative


 + operator is left associative
Using the precedence and associativity rules, we write the corresponding unambiguous grammar as-

E→E+T/T

T→T.F/F

F → F* / G

G→a/b

Unambiguous Grammar

OR

E→E+T/T

T→T.F/F

F → F* / a / b

Unambiguous Grammar

Problem-02:

Convert the following ambiguous grammar into unambiguous grammar-

bexp → bexp or bexp / bexp and bexp / not bexp / T / F

where bexp represents Boolean expression, T represents True and F represents False.

Solution-

To convert the given grammar into its corresponding unambiguous grammar, we implement the precedence
and associativity constraints.

We have-

 Given grammar consists of the following operators-


or , and , not

 Given grammar consists of the following operands-

T,F

The priority order is-

(T , F) > not > and > or

where-

 and operator is left associative
 or operator is left associative

Using the precedence and associativity rules, we write the corresponding unambiguous grammar as-

bexp → bexp or M / M

M → M and N / N

N → not N / G

G→T/F

Unambiguous Grammar

OR

bexp → bexp or M / M

M → M and N / N

N → not N / T / F

Unambiguous Grammar

Backtracking in TOP-DOWN Parsing


Backtracking in top-down parsing provides flexibility to handle ambiguous grammars or situations where
the parser encounters uncertainty in choosing the correct production rule. However, it can lead to inefficient
parsing, especially when the grammar has many backtracking points or when the input has long ambiguous
sequences.

In such cases, alternative parsing techniques, such as predictive parsing or bottom-up parsing, may be more
efficient. In the field of parsing algorithms, backtracking plays a crucial role in handling ambiguity and
making alternative choices during the parsing process.

Specifically, in top-down parsers, which start with the initial grammar symbol and recursively expand non-
terminals, backtracking allows the parser to explore different options when the chosen production fails to
match the input.

This section examines the concept of backtracking in top-down parsing, its significance in handling ambiguity, and its impact on the parsing process.

To improve the performance of top-down parsers, various optimization techniques can be employed,
including left-factoring, left-recursion elimination, and the use of lookahead tokens to predict the correct
production choice without excessive backtracking.

In top-down parsing with backtracking, the parser attempts multiple rules or productions to find a match for the input string, backtracking at every step of the derivation. If an applied production does not yield the required string, the parser undoes that expansion and tries another alternative. Example 1 below illustrates this, followed by a small code sketch.

Example1 − Consider the Grammar

S→aAd

A→bc|b

Make the parse tree for the string abd. Also, show the parse tree when backtracking is required because the wrong alternative is chosen.

Solution

The derivation for the string abd will be −

S ⇒ a A d ⇒ abd (Required String)

If bc is substituted in place of non-terminal A then the string obtained will be abcd.

S ⇒ a A d ⇒ abcd (Wrong Alternative)

Figure (a) represents S → aAd


Figure (b) represents an expansion of the tree with production A → bc, which gives the string abcd; this does not match the string abd. So, the parser backtracks and chooses the other alternative, i.e., A → b in figure (c), which matches abd.
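
The behaviour in figures (a)–(c) can be sketched in code. The following is a minimal Python sketch (an illustration, not part of the original notes) for exactly this grammar; parse_A yields every way A can match, so a failed choice simply falls through to the next alternative-

# Backtracking top-down parse of S -> aAd , A -> bc | b (Example 1).
def parse_A(s, pos):
    """Yield every position at which A could finish, one per alternative."""
    if s[pos:pos+2] == "bc":   # Alternative 1: A -> bc
        yield pos + 2
    if s[pos:pos+1] == "b":    # Alternative 2: A -> b
        yield pos + 1

def parse_S(s):
    if s[:1] != "a":           # S must start with terminal 'a'
        return False
    for end in parse_A(s, 1):  # try A's alternatives in order
        if s[end:end+1] == "d" and end + 1 == len(s):
            return True        # remaining input matches 'd': success
        # otherwise fall through: undo this choice and try the next one
    return False               # all alternatives exhausted: parsing error

print(parse_S("abd"))   # True  (A -> bc fails, backtrack, A -> b succeeds)
print(parse_S("abcd"))  # True  (A -> bc succeeds on the first try)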

Understanding Backtracking in Top-Down Parsing

Top-down parsing is a technique where a parser starts with the start symbol of the grammar and attempts to
derive the input string by recursively expanding non-terminals. The process involves selecting a production
rule that matches the current input, expanding non-terminals, and backtracking when necessary. Let’s
explore the key aspects of backtracking in top-down parsing:

1. Ambiguity Resolution: Ambiguity arises when there are multiple choices available for expanding a
non-terminal or selecting a production rule. Backtracking allows the parser to explore these choices
systematically, backtracking to previous decision points and trying alternative paths if the current
choice fails. This iterative exploration continues until a successful match is found or all possibilities
are exhausted, leading to a parsing error.

2. Decision Points: At each step in the parsing process, the parser makes decisions such as choosing a
non-terminal to expand or selecting a production rule. These decision points serve as potential
backtracking points. If a chosen production fails to match the input, the parser backtracks to the
previous decision point and explores an alternative choice, allowing for a different path of parsing.

3. Recursive Expansion: During the parsing process, the chosen production rule is recursively
expanded by applying the rules to its non-terminals. This expansion continues until a terminal
symbol is reached or further expansion is not possible. If a successful match is not found, the parser
backtracks to the previous decision point to try an alternative choice.
4. Successful Match or Parsing Error: The parsing process concludes either when a successful match
is found, indicating that the input string conforms to the grammar, or when all possible alternatives
are exhausted, resulting in a parsing error.

Key Steps for Backtracking in a Top-Down Parser

1. Choose a non-terminal: At each step, the parser chooses a non-terminal from the current production
rule to expand.

2. Apply a production: The parser selects a production rule for the chosen non-terminal that matches
the current input. If multiple choices are available, it may need to try each alternative.

3. Recursive expansion: The chosen production is recursively expanded by applying the rules to its
non-terminals. This process continues until a terminal symbol is reached or until further expansion is
not possible.

4. Backtrack on failure: If a selected production fails to match the input, the parser backtracks to the
previous decision point, undoing the previous expansion and selecting an alternative choice if
available.

5. Repeat until success or failure: The parser repeats the above steps, trying different alternatives and
backtracking as necessary until it either successfully matches the entire input or exhausts all possible
alternatives, resulting in a parsing error.

Advantages of Backtracking

Backtracking in top-down parsing provides several benefits, including:

 Ambiguity Handling: Backtracking enables the parser to handle ambiguous grammars by systematically exploring alternative choices and selecting the correct production rule.
 Flexibility: By allowing alternative choices, backtracking provides flexibility in resolving parsing ambiguities and making informed decisions during the parsing process.

Disadvantages of Backtracking

 Performance Impact: Backtracking can lead to inefficient parsing, particularly in cases where there
are numerous backtracking points or long ambiguous sequences in the input. In such scenarios,
alternative parsing techniques may be more efficient.

 Complexity: Managing backtracking points and tracking alternative choices can introduce additional
complexity to the parsing algorithm, requiring careful implementation and optimization.

Classification of Top-Down Parsing –

1. With Backtracking: Brute Force Technique

2. Without Backtracking:

1. Recursive Descent Parsing


2. Predictive Parsing or Non-Recursive Parsing or LL(1) Parsing or Table-Driven Parsing

Recursive Descent Parsing –

1. Whenever a non-terminal is expanded for the first time, go with its first alternative and compare it with the given input string.
2. If matching does not occur, go with the second alternative and compare it with the given input string.
3. If matching is not found again, go with the next alternative, and so on.
4. If matching occurs for at least one alternative, the input string is parsed successfully (see the sketch below).
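
As an illustration (not from the original notes), here is a minimal recursive descent parser in Python for a simplified version (T → id) of the left-recursion-free expression grammar derived later in Problem-03 of the left recursion section. There is one procedure per non-terminal; for this grammar a single lookahead token picks the alternative, so no backtracking is needed-

class RDParser:
    """Recursive descent parser for E -> T E' , E' -> + T E' | e , T -> id."""
    def __init__(self, tokens):
        self.tokens = tokens + ["$"]   # '$' marks the end of the input
        self.i = 0

    def look(self):
        return self.tokens[self.i]

    def match(self, t):                # consume one expected terminal
        if self.look() != t:
            raise SyntaxError("expected %s, got %s" % (t, self.look()))
        self.i += 1

    def E(self):                       # E -> T E'
        self.T()
        self.E2()

    def E2(self):                      # E' -> + T E' | epsilon
        if self.look() == "+":
            self.match("+")
            self.T()
            self.E2()
        # else: take the epsilon alternative and simply return

    def T(self):                       # T -> id
        self.match("id")

def accepts(tokens):
    p = RDParser(tokens)
    p.E()
    return p.look() == "$"             # accept only if all input is consumed

print(accepts(["id", "+", "id"]))      # True
print(accepts(["id", "id"]))           # False: the second id is never consumed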

LL(1) or Table-Driven or Predictive Parser –

1. In LL(1), the first L stands for Left-to-right scanning of the input and the second L stands for Left-most derivation. The 1 stands for the number of lookahead tokens used by the parser while parsing a sentence.
2. LL(1) parsing is constructed from a grammar which is free from left recursion, common prefixes, and ambiguity.
3. The LL(1) parser depends on 1 lookahead symbol to predict the production used to expand the parse tree.
4. This parser is non-recursive.

Features of Top-Down Parsing :-

Predictive parsing: Top-down parsers often use predictive parsing techniques, in which the parser predicts the next symbol of the input based on the current state of the parse stack and the production rules of the grammar. This permits the parser to quickly determine whether a particular input string is valid under the grammar.

LL parsing: LL parsing is a specific type of top-down parsing that uses a left-to-right scan of the input and a leftmost derivation of the grammar. This form of parsing is commonly used in programming language compilers.

Recursive descent parsing: Recursive descent parsing is another type of top-down parsing that uses a set of recursive procedures to match the non-terminals of the grammar. Each non-terminal has a corresponding procedure that is responsible for parsing that non-terminal.

Backtracking: Top-down parsers may use backtracking to explore multiple parsing paths when the grammar is ambiguous or when a parsing error occurs. This can be costly in terms of computation time and memory usage, so many top-down parsers use strategies to reduce the need for backtracking.

Memoization: Memoization is a technique used to cache intermediate parsing results and avoid repeated computation. Some top-down parsers use memoization to reduce the amount of backtracking required.

Lookahead: Top-down parsers may use lookahead to predict the next symbol in the input based on a fixed number of input symbols. This can improve parsing speed and reduce the amount of backtracking required.

Error recovery: Top-down parsers may use error recovery techniques to handle syntax errors in the input. These techniques may include inserting or deleting symbols to match the grammar, or skipping over erroneous symbols to continue parsing the input.

Advantages:

Easy to Understand: Top-down parsers are easy to understand and implement, making them a good choice
for small to medium-sized grammars.
Efficient: Some types of top-down parsers, such as LL(1) and predictive parsers, are efficient and can
handle larger grammars.

Flexible: Top-down parsers can be easily modified to handle different types of grammars and programming
languages.

Disadvantages:

Limited Power: Top-down parsers have limited power and may not be able to handle all types of grammars,
particularly those with complex structures or ambiguous rules.

Left-Recursion: Top-down parsers can suffer from left-recursion, which can make the parsing process more
complex and less efficient.

Look-Ahead Restrictions: Some top-down parsers, such as LL(1) parsers, have restrictions on the number
of look-ahead symbols they can use, which can limit their ability to handle certain types of grammars.

Predictive Top-down Parsing :-

Predictive parsing is a straightforward form of recursive descent parsing that does not require backtracking. Instead, it can determine which production must be selected to derive the input string. Predictive parsing selects the correct production by looking ahead into the input string; it is allowed to look at a fixed number of input symbols from the input string.

Components of Predictive top-down parsing

Following are the components of predictive top-down parsing:-

Stack

A predictive parser maintains a stack containing a sequence of grammar symbols.

Input Buffer

It holds the input string that the predictive parser needs to parse.

Parsing Table

With the entries present in this table, it becomes straightforward for the top-down parser to choose the production to be applied. The input buffer and the stack both include the end marker '$', which denotes the bottom of the stack and the end of the input string in the input buffer. In the beginning, the grammar symbol on top of '$' in the stack is the start symbol.

Steps to perform predictive parsing

The parser first looks at the grammar symbol present on the top of the stack, say 'X', and compares it with the current input symbol, say 'a', present in the input buffer.

 If X is a non-terminal, the parser selects a production of X from the parse table, according to the entry M[X, a].
 If X is a terminal, the parser matches it against the current symbol 'a'.

This is how predictive parsing selects the right production and, in this way, successfully derives the input string.
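
These steps translate directly into the standard table-driven loop. Below is a minimal Python sketch (an illustration, not from the notes), assuming a hand-built table M for the toy grammar E → TE', E' → +TE' / ∈, T → id; the table entries were filled by hand from the FIRST/FOLLOW sets of this grammar-

# Table-driven predictive parsing loop (a sketch). "e" denotes epsilon.
M = {
    ("E",  "id"): ["T", "E'"],
    ("E'", "+"):  ["+", "T", "E'"],
    ("E'", "$"):  [],               # E' -> e
    ("T",  "id"): ["id"],
}
NONTERMINALS = {"E", "E'", "T"}

def ll1_parse(tokens):
    tokens = tokens + ["$"]         # end marker in the input buffer
    stack = ["$", "E"]              # end marker below the start symbol
    i = 0
    while stack:
        X, a = stack.pop(), tokens[i]
        if X == "$" and a == "$":
            return True             # stack and input exhausted together
        if X in NONTERMINALS:
            if (X, a) not in M:
                return False        # empty table entry: syntax error
            stack.extend(reversed(M[(X, a)]))  # push RHS, leftmost on top
        elif X == a:
            i += 1                  # terminal matched: advance the input
        else:
            return False            # terminal mismatch
    return False

print(ll1_parse(["id", "+", "id"]))   # True
print(ll1_parse(["id", "+"]))         # False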

LL Parser

The LL parser is a predictive parser. It does not require backtracking. The LL(1) parser accepts only LL(1) grammars.

 The first L in LL(1) implies that the parser scans the input string from left to right.
 The second L denotes the leftmost derivation for the input string.
 The '1' in LL(1) indicates that the parser looks ahead a single input symbol from the input string.

An LL(1) grammar contains neither left recursion nor ambiguity.

Rules to calculate FIRST and FOLLOW –

Rules For Calculating First Function-

First(α) is the set of terminal symbols that begin the strings derived from α.

Rule-01:

For a production rule X → ∈,

First(X) = { ∈ }

Rule-02:

For any terminal symbol ‘a’,


First(a) = { a }

Rule-03:

For a production rule X → Y1Y2Y3,

Calculating First(X)

 If ∈ ∉ First(Y1), then First(X) = First(Y1)
 If ∈ ∈ First(Y1), then First(X) = { First(Y1) – ∈ } ∪ First(Y2Y3)

Calculating First(Y2Y3)

 If ∈ ∉ First(Y2), then First(Y2Y3) = First(Y2)
 If ∈ ∈ First(Y2), then First(Y2Y3) = { First(Y2) – ∈ } ∪ First(Y3)

Similarly, we can make the expansion for any production rule X → Y1Y2Y3…..Yn.
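
These rules amount to a fixed-point computation. A minimal Python sketch follows (an illustration, not part of the notes), using the grammar of Example 1 further below; the empty string "" plays the role of ∈-

GRAMMAR = {                     # grammar of Example 1 below
    "S": [["a", "B", "D", "h"]],
    "B": [["c", "C"]],
    "C": [["b", "C"], []],      # [] is the epsilon alternative
    "D": [["E", "F"]],
    "E": [["g"], []],
    "F": [["f"], []],
}

def is_terminal(sym):
    return sym not in GRAMMAR

def first_of_string(symbols, FIRST):
    """First(Y1 Y2 ... Yn) as defined by Rule-03; "" stands for epsilon."""
    out = set()
    for Y in symbols:
        f = {Y} if is_terminal(Y) else FIRST[Y]
        out |= f - {""}
        if "" not in f:
            return out          # Yi cannot derive epsilon: stop here
    out.add("")                 # every Yi can derive epsilon
    return out

def compute_first():
    FIRST = {A: set() for A in GRAMMAR}
    changed = True
    while changed:              # repeat until no FIRST set grows
        changed = False
        for A, alts in GRAMMAR.items():
            for alt in alts:
                f = first_of_string(alt, FIRST)
                if not f <= FIRST[A]:
                    FIRST[A] |= f
                    changed = True
    return FIRST

print(compute_first()["D"])     # {'g', 'f', ''} i.e. { g , f , e }; set order may vary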

Follow Function-

Follow(α) is the set of terminal symbols that can appear immediately to the right of α in some sentential form.

Rules For Calculating Follow Function-

Rule-01:

For the start symbol S, place $ in Follow(S).



Rule-02:

For any production rule A → αB,

Follow(B) = Follow(A)

Rule-03:

For any production rule A → αBβ,

 If ∈ ∉ First(β), then Follow(B) = First(β)
 If ∈ ∈ First(β), then Follow(B) = { First(β) – ∈ } ∪ Follow(A)
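
These rules are also computed to a fixed point. Continuing the sketch above (same assumptions; it reuses GRAMMAR, is_terminal, first_of_string and compute_first)-

def compute_follow(start="S"):
    FIRST = compute_first()
    FOLLOW = {A: set() for A in GRAMMAR}
    FOLLOW[start].add("$")                 # Rule-01: place $ in Follow(start)
    changed = True
    while changed:
        changed = False
        for A, alts in GRAMMAR.items():
            for alt in alts:
                for i, B in enumerate(alt):
                    if is_terminal(B):
                        continue
                    beta = alt[i + 1:]     # what follows B in this rule
                    f = first_of_string(beta, FIRST)
                    add = f - {""}         # Rule-03: First(beta) minus epsilon
                    if "" in f:            # beta can vanish (or is empty):
                        add |= FOLLOW[A]   # Rule-02: also add Follow(A)
                    if not add <= FOLLOW[B]:
                        FOLLOW[B] |= add
                        changed = True
    return FOLLOW

print(compute_follow()["E"])               # {'f', 'h'}, as in Example 1 below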

Important Notes-

Note-01:

 ∈ may appear in the first function of a non-terminal.
 ∈ will never appear in the follow function of a non-terminal.

Note-02:

 Before calculating the first and follow functions, eliminate Left Recursion from the grammar, if
present.

Note-03:

 We calculate the follow function of a non-terminal by looking where it is present on the RHS of a
production rule.
PRACTICE PROBLEMS BASED ON CALCULATING FIRST AND FOLLOW-

Example1. - Calculate the first and follow functions for the given grammar-

Or check whether the following grammar is LL(1) or not.

S → aBDh

B → cC

C → bC / ∈

D → EF

E→g/∈

F→f/∈

Solution-

The first and follow functions are as follows-

First Functions-

 First(S) = { a }
 First(B) = { c }
 First(C) = { b , ∈ }
 First(D) = { First(E) – ∈ } ∪ First(F) = { g , f , ∈ }
 First(E) = { g , ∈ }
 First(F) = { f , ∈ }

Follow Functions-

 Follow(S) = { $ }
 Follow(B) = { First(D) – ∈ } ∪ First(h) = { g , f , h }
 Follow(C) = Follow(B) = { g , f , h }
 Follow(D) = First(h) = { h }
 Follow(E) = { First(F) – ∈ } ∪ Follow(D) = { f , h }
 Follow(F) = Follow(D) = { h }

Since every non-terminal either has a single production, or the FIRST sets of its alternatives (using the FOLLOW set where ∈ is derivable) are pairwise disjoint, every LL(1) table entry is singly defined. Hence the grammar is LL(1).

Example2. - Calculate the first and follow functions for the given grammar-

Or check whether the following grammar is LL(1) or not.

S → AaAb / BbBa
A→∈

B→∈

Solution-

The first and follow functions are as follows-

First Functions-

 First(S) = { First(A) – ∈ } ∪ First(a) ∪ { First(B) – ∈ } ∪ First(b) = { a , b }
 First(A) = { ∈ }
 First(B) = { ∈ }

Follow Functions-

 Follow(S) = { $ }
 Follow(A) = First(a) ∪ First(b) = { a , b }
 Follow(B) = First(b) ∪ First(a) = { a , b }

Since First(AaAb) = { a } and First(BbBa) = { b } are disjoint, the two alternatives of S never compete for the same table entry, and A and B each have a single production. Hence the grammar is LL(1).



Example3. - Calculate the first and follow functions for the given grammar-

Or check whether the following grammar is LL(1) or not.

S → ACB / CbB / Ba

A → da / BC

B→g/∈

C→h/∈

Solution-

The first and follow functions are as follows-

First Functions-

 First(S) = { First(A) – ∈ } ∪ { First(C) – ∈ } ∪ First(B) ∪ First(b) ∪ { First(B) – ∈ } ∪ First(a) = { d , g , h , ∈ , b , a }
 First(A) = First(d) ∪ { First(B) – ∈ } ∪ First(C) = { d , g , h , ∈ }
 First(B) = { g , ∈ }
 First(C) = { h , ∈ }

Follow Functions-

 Follow(S) = { $ }
 Follow(A) = { First(C) – ∈ } ∪ { First(B) – ∈ } ∪ Follow(S) = { h , g , $ }
 Follow(B) = Follow(S) ∪ First(a) ∪ { First(C) – ∈ } ∪ Follow(A) = { $ , a , h , g }
 Follow(C) = { First(B) – ∈ } ∪ Follow(S) ∪ First(b) ∪ Follow(A) = { g , $ , b , h }

Here the FIRST sets of the alternatives of S overlap (for example, h ∈ First(ACB) and h ∈ First(CbB), and g ∈ First(ACB) and g ∈ First(Ba)), so the LL(1) table has multiply-defined entries. Hence the grammar is not LL(1).
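
A multiply-defined table entry is exactly what the conclusion above detects. The following Python sketch (illustrative; it reuses the FIRST/FOLLOW code from the earlier sketches, reassigning the global GRAMMAR variable so those helpers see the grammar under test) builds the LL(1) table and reports conflicts-

def build_ll1_table(grammar, start):
    global GRAMMAR
    GRAMMAR = grammar                      # the FIRST/FOLLOW sketches read this
    FIRST, FOLLOW = compute_first(), compute_follow(start)
    table, conflicts = {}, []
    for A, alts in grammar.items():
        for alt in alts:
            f = first_of_string(alt, FIRST)
            terminals = f - {""}
            if "" in f:                    # the alternative can vanish:
                terminals |= FOLLOW[A]     # place it under Follow(A) too
            for a in terminals:
                if (A, a) in table:
                    conflicts.append((A, a))   # multiply-defined entry
                table[(A, a)] = alt
    return table, conflicts

g3 = {"S": [["A", "C", "B"], ["C", "b", "B"], ["B", "a"]],
      "A": [["d", "a"], ["B", "C"]],
      "B": [["g"], []],
      "C": [["h"], []]}
_, conflicts = build_ll1_table(g3, "S")
print(conflicts)   # non-empty, e.g. ('S', 'h'): the grammar is not LL(1)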




Grammar With Common Prefixes-

If the RHS of more than one production starts with the same symbol, then such a grammar is called a Grammar With Common Prefixes.

Example-

A → αβ1 / αβ2 / αβ3

(Grammar with common prefixes)

 This kind of grammar creates a problematic situation for top-down parsers.
 Top-down parsers cannot decide which production must be chosen to parse the string in hand.

To remove this confusion, we use left factoring.

Left Factoring-

Left factoring is a process by which the grammar with common prefixes is transformed to make it useful for
Top down parsers.

In left factoring,

 We make one production for each common prefix.
 The common prefix may be a terminal or a non-terminal or a combination of both.
 The rest of the derivation is added by new productions.

The grammar obtained after the process of left factoring is called a Left Factored Grammar.
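
Mechanically, one pass of left factoring groups the alternatives by their leading symbol and introduces a new non-terminal for each group; repeating the pass reproduces the step-by-step factoring in the problems below. A minimal Python sketch (an illustration; it factors a single leading symbol per pass, and the name left_factor is made up here)-

def left_factor(lhs, alternatives, suffix="'"):
    """One factoring pass. alternatives: lists of symbols; [] is epsilon."""
    groups = {}
    for alt in alternatives:
        key = alt[0] if alt else ""        # group by the leading symbol
        groups.setdefault(key, []).append(alt)
    rules = {lhs: []}
    for key, alts in groups.items():
        if len(alts) == 1 or key == "":
            rules[lhs].extend(alts)        # no common prefix: keep as is
        else:
            new = lhs + suffix             # fresh non-terminal, e.g. S'
            rules[lhs].append([key, new])  # S  -> a S'
            rules[new] = [alt[1:] for alt in alts]  # S' -> the rests
    return rules

# Problem-01 below:  S -> iEtS / iEtSeS / a
print(left_factor("S", [["i","E","t","S"], ["i","E","t","S","e","S"], ["a"]]))
# {'S': [['i', "S'"], ['a']], "S'": [['E','t','S'], ['E','t','S','e','S']]}
# Running further passes on S' completes the factoring, as in the problems.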
PRACTICE PROBLEMS BASED ON LEFT FACTORING-

Problem-01:

Do left factoring in the following grammar-

S → iEtS / iEtSeS / a

E→b

Solution-

The left factored grammar is-

S → iEtSS’ / a

S’ → eS / ∈

E→b

Problem-02:

Do left factoring in the following grammar-

A → aAB / aBc / aAc

Solution-

Step-01:

A → aA’

A’ → AB / Bc / Ac

Again, this is a grammar with common prefixes.


Step-02:

A → aA’

A’ → AD / Bc

D→B/c

This is a left factored grammar.


Problem-03:

Do left factoring in the following grammar-

S → bSSaaS / bSSaSb / bSb / a

Solution-

Step-01:

S → bSS’ / a

S’ → SaaS / SaSb / b

Again, this is a grammar with common prefixes.

Step-02:

S → bSS’ / a

S’ → SaA / b

A → aS / Sb

This is a left factored grammar.

Problem-04:

Do left factoring in the following grammar-

S → aSSbS / aSaSb / abb / b

Solution-
Step-01:

S → aS’ / b

S’ → SSbS / SaSb / bb

Again, this is a grammar with common prefixes.

Step-02:

S → aS’ / b

S’ → SA / bb

A → SbS / aSb

This is a left factored grammar.

Problem-05:

Do left factoring in the following grammar-

S → a / ab / abc / abcd

Solution-

Step-01:

S → aS’

S’ → b / bc / bcd / ∈

Again, this is a grammar with common prefixes.

Step-02:
S → aS’

S’ → bA / ∈

A → c / cd / ∈

Again, this is a grammar with common prefixes.

Step-03:

S → aS’

S’ → bA / ∈

A → cB / ∈

B→d/∈

This is a left factored grammar.

Problem-06:

Do left factoring in the following grammar-

S → aAd / aB

A → a / ab

B → ccd / ddc

Solution-

The left factored grammar is-

S → aS’

S’ → Ad / B

A → aA’

A’ → b / ∈

B → ccd / ddc

Recursion-
Recursion can be classified into following three types-

1. Left Recursion
2. Right Recursion
3. General Recursion

1. Left Recursion-

 A production of a grammar is said to have left recursion if the leftmost variable of its RHS is the same as the variable on its LHS.
 A grammar containing a production having left recursion is called a Left Recursive Grammar.

Example-

S → Sa / ∈

(Left Recursive Grammar)

 Left recursion is considered to be a problematic situation for top-down parsers.
 Therefore, left recursion has to be eliminated from the grammar.
Elimination of Left Recursion

Left recursion is eliminated by converting the grammar into a right recursive grammar.

If we have the left-recursive pair of productions-

A → Aα / β

(Left Recursive Grammar)

where β does not begin with an A.

Then, we can eliminate left recursion by replacing the pair of productions with-

A → βA’

A’ → αA’ / ∈

(Right Recursive Grammar)

This right recursive grammar generates the same language as the left recursive grammar.
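
This substitution is easy to mechanize. A minimal Python sketch follows (illustrative, not from the notes; alternatives are lists of symbols and [] stands for ∈)-

def eliminate_immediate_left_recursion(A, alternatives, suffix="'"):
    """Apply  A -> A alpha / beta  ==>  A -> beta A' ,  A' -> alpha A' / e."""
    alphas = [alt[1:] for alt in alternatives if alt and alt[0] == A]
    betas  = [alt for alt in alternatives if not alt or alt[0] != A]
    if not alphas:
        return {A: alternatives}           # no left recursion: unchanged
    A2 = A + suffix                        # fresh non-terminal A'
    return {
        A:  [beta + [A2] for beta in betas],            # A  -> beta A'
        A2: [alpha + [A2] for alpha in alphas] + [[]],  # A' -> alpha A' / e
    }

# The pair A -> Aa / b:
print(eliminate_immediate_left_recursion("A", [["A", "a"], ["b"]]))
# {'A': [['b', "A'"]], "A'": [['a', "A'"], []]}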

2. Right Recursion-

 A production of a grammar is said to have right recursion if the rightmost variable of its RHS is the same as the variable on its LHS.
 A grammar containing a production having right recursion is called a Right Recursive Grammar.

Example-

S → aS / ∈

(Right Recursive Grammar)

 Right recursion does not create any problem for top-down parsers.
 Therefore, there is no need of eliminating right recursion from the grammar.
3. General Recursion-

 The recursion which is neither left recursion nor right recursion is called general recursion.

Example-

S → aSb / ∈

PRACTICE PROBLEMS BASED ON LEFT RECURSION ELIMINATION-

Problem-01:

Consider the following grammar and eliminate left recursion-

A → ABd / Aa / a

B → Be / b

Solution-

The grammar after eliminating left recursion is-

A → aA’

A’ → BdA’ / aA’ / ∈

B → bB’

B’ → eB’ / ∈

Problem-02:
Consider the following grammar and eliminate left recursion-

E→E+E/ExE/a

Solution-

The grammar after eliminating left recursion is-

E → aA

A → +EA / xEA / ∈

Problem-03:

Consider the following grammar and eliminate left recursion-

E→E+T/T

T→TxF/F

F → id

Solution-

The grammar after eliminating left recursion is-

E → TE’

E’ → +TE’ / ∈

T → FT’

T’ → xFT’ / ∈

F → id
Problem-04:

Consider the following grammar and eliminate left recursion-

S → (L) / a

L→L,S/S

Solution-

The grammar after eliminating left recursion is-

S → (L) / a

L → SL’

L’ → ,SL’ / ∈

Problem-05:

Consider the following grammar and eliminate left recursion-

S → S0S1S / 01

Solution-

The grammar after eliminating left recursion is-

S → 01A

A → 0S1SA / ∈

Problem-06:

Consider the following grammar and eliminate left recursion-

S→A

A → Ad / Ae / aB / ac
B → bBc / f

Solution-

The grammar after eliminating left recursion is-

S→A

A → aBA’ / acA’

A’ → dA’ / eA’ / ∈

B → bBc / f

Problem-07:

Consider the following grammar and eliminate left recursion-

A → AAα / β

Solution-

The grammar after eliminating left recursion is-

A → βA’

A’ → AαA’ / ∈

Problem-08:

Consider the following grammar and eliminate left recursion-

A → Ba / Aa / c

B → Bb / Ab / d

Solution-
This is a case of indirect left recursion.

Step-01:

First let us eliminate left recursion from A → Ba / Aa / c

Eliminating left recursion from here, we get-

A → BaA’ / cA’

A’ → aA’ / ∈

Now, given grammar becomes-

A → BaA’ / cA’

A’ → aA’ / ∈

B → Bb / Ab / d

Step-02:

Substituting the productions of A in B → Ab, we get the following grammar-

A → BaA’ / cA’

A’ → aA’ / ∈

B → Bb / BaA’b / cA’b / d

Step-03:

Now, eliminating left recursion from the productions of B, we get the following grammar-

A → BaA’ / cA’

A’ → aA’ / ∈

B → cA’bB’ / dB’
B’ → bB’ / aA’bB’ / ∈

This is the final grammar after eliminating left recursion.
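
The substitute-then-eliminate procedure used in Problems 08–10 is the classical general algorithm: fix an order of the non-terminals, substitute earlier non-terminals into leading positions, then remove the immediate left recursion. An illustrative Python sketch (it reuses eliminate_immediate_left_recursion from the sketch given earlier)-

def eliminate_left_recursion(grammar, order):
    """grammar: non-terminal -> list of alternatives (lists of symbols)."""
    g = {A: [list(alt) for alt in alts] for A, alts in grammar.items()}
    for i, Ai in enumerate(order):
        for Aj in order[:i]:
            new_alts = []
            for alt in g[Ai]:
                if alt and alt[0] == Aj:   # Ai -> Aj gamma: substitute Aj
                    new_alts += [d + alt[1:] for d in g[Aj]]
                else:
                    new_alts.append(alt)
            g[Ai] = new_alts
        g.update(eliminate_immediate_left_recursion(Ai, g.pop(Ai)))
    return g

# Problem-08:  A -> Ba / Aa / c ,  B -> Bb / Ab / d  (order: A, then B)
print(eliminate_left_recursion(
    {"A": [["B", "a"], ["A", "a"], ["c"]],
     "B": [["B", "b"], ["A", "b"], ["d"]]},
    order=["A", "B"]))
# Reproduces: A -> BaA'/cA' ; A' -> aA'/e ; B -> cA'bB'/dB' ; B' -> bB'/aA'bB'/e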

Problem-09:

Consider the following grammar and eliminate left recursion-

X → XSb / Sa / b

S → Sb / Xa / a

Solution-

This is a case of indirect left recursion.

Step-01:

First let us eliminate left recursion from X → XSb / Sa / b

Eliminating left recursion from here, we get-

X → SaX’ / bX’

X’ → SbX’ / ∈

Now, given grammar becomes-

X → SaX’ / bX’

X’ → SbX’ / ∈

S → Sb / Xa / a

Step-02:
Substituting the productions of X in S → Xa, we get the following grammar-

X → SaX’ / bX’

X’ → SbX’ / ∈

S → Sb / SaX’a / bX’a / a

Step-03:

Now, eliminating left recursion from the productions of S, we get the following grammar-

X → SaX’ / bX’

X’ → SbX’ / ∈

S → bX’aS’ / aS’

S’ → bS’ / aX’aS’ / ∈

This is the final grammar after eliminating left recursion.

Problem-10:

Consider the following grammar and eliminate left recursion-

S → Aa / b

A → Ac / Sd / ∈

Solution-

This is a case of indirect left recursion.

Step-01:

First let us eliminate left recursion from S → Aa / b

This is already free from left recursion.


Step-02:

Substituting the productions of S in A → Sd, we get the following grammar-

S → Aa / b

A → Ac / Aad / bd / ∈

Step-03:

Now, eliminating left recursion from the productions of A, we get the following grammar-

S → Aa / b

A → bdA’ / A’

A’ → cA’ / adA’ / ∈

This is the final grammar after eliminating left recursion.
