
Compiler Construction

Hafiza Aniqa
Compiler:
A compiler is a software tool that translates high-level programming language source code
into low-level machine code that can be executed directly by the computer's hardware.
 A compiler is a specific kind of translator: it translates information from one representation to another.
 Not every translator is a compiler; an application that converts, say, a Word file into a PDF is a translator but not a compiler.
Issues in compilation:
No algorithm exists for an ideal translation; translation is a complex process. To manage this
complexity, the translation is carried out in multiple phases.
Types of compilers:
Compilers can be divided into two main categories:
1. Single-pass compiler – the source code is processed in a single pass, meaning the compiler
reads the source code, performs the necessary analysis, and generates the target code in one go.
2. Multi-pass compiler – several intermediate representations are created and the parse tree
is processed several times. A multi-pass compiler breaks the program into smaller parts that
are handled in successive passes.
Types of multi-pass compiler:
A multi-pass compiler can be further divided into two categories:
1. Two-pass compiler
2. Three-pass compiler
Two-pass compiler:
In this type of compilation, the program is translated in two passes: first by the front end and
then by the back end.

Front end:
The algorithms employed in the front end have polynomial time complexity. The front end maps
legal source code into an intermediate representation (IR).
Phases of Front End:
The front end consists of the following phases:
 Lexical analysis
 Syntax analysis / parser
 Semantic analysis
 Intermediate code generator

Back End:
The core problems of the back end are NP-complete. The back end of the compiler translates the
intermediate representation (IR) into target machine code. It decides which values to keep in
registers in order to avoid memory accesses. It is also responsible for instruction selection,
to produce fast and compact code.
Steps on the Intermediate Representation:
Translating the intermediate representation into machine code involves the following steps:
1. Instruction selection
2. Register allocation
3. Instruction scheduling

Register Allocation:
Registers in the CPU play an important role in providing high-speed access to operands. The
number of registers in a CPU is small, and some of them are pre-allocated for specialized uses,
such as the program counter, and are not available to the back end. Optimal register allocation
is NP-complete.
Instruction Scheduling:
The back end performs instruction scheduling to avoid hardware stalls and interlocks. Optimal
scheduling is NP-complete in nearly all cases.
Phases of Back End:
The back end consists of the following phases:
 Code optimization
 Target code generator
Modules of Front End:
The front end determines whether a program presented to it is legal or illegal. It consists of two modules:
 Scanner
 Parser
Scanner:
The first phase of the front end is lexical analysis, also called scanning. It takes the program
(source code) as input and converts it into a sequence of tokens. This sequence of tokens is then
given to the parser as input.
Token:
A lexical token is a sequence of characters that can be treated as a unit in the grammar of a
programming language. We call the pair <token type, word> a token.
Types of tokens:
Keywords – e.g. for, if, while, void, etc.
Identifiers – e.g. variable names, function names, class names, etc.
Symbols – e.g. +, -, %, etc.
Non-Tokens:
Preprocessor directives, macros, comments, tabs, newlines, etc. are all non-tokens.
Token definition:
Tokens can be defined using regular languages because regular languages:
 are based on a simple and useful theory,
 are easy to understand,
 and have efficient implementations.
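As an illustration of these ideas, here is a minimal hand-written scanner sketch in C (the keyword list and the input string are assumptions for the example, not taken from any particular language). It partitions the input into words and classifies each one as one of the <token type, word> pairs described above:

    #include <ctype.h>
    #include <stdio.h>
    #include <string.h>

    /* Minimal hand-written scanner sketch (illustrative assumptions). */
    static const char *keywords[] = { "for", "if", "void", "while" };

    static int is_keyword(const char *w) {
        for (size_t i = 0; i < sizeof keywords / sizeof *keywords; i++)
            if (strcmp(w, keywords[i]) == 0) return 1;
        return 0;
    }

    int main(void) {
        const char *src = "for x1 = 42 + y";   /* assumed sample input */
        const char *p = src;
        char word[64];

        while (*p) {
            if (isspace((unsigned char)*p)) { p++; continue; }  /* non-token */
            if (isalpha((unsigned char)*p)) {          /* [a-zA-Z][a-zA-Z0-9]* */
                int n = 0;
                while (isalnum((unsigned char)*p)) word[n++] = *p++;
                word[n] = '\0';
                printf("<%s, %s>\n",
                       is_keyword(word) ? "keyword" : "identifier", word);
            } else if (isdigit((unsigned char)*p)) {   /* [0-9]+ */
                int n = 0;
                while (isdigit((unsigned char)*p)) word[n++] = *p++;
                word[n] = '\0';
                printf("<number, %s>\n", word);
            } else {                                   /* single-char symbol */
                printf("<symbol, %c>\n", *p++);
            }
        }
        return 0;
    }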
Parser:
The second phase of the front end is also known as syntax analysis. It takes the sequence of
tokens from the previous phase (lexical analysis) as input, recognizes the structure specified
by the context-free grammar, and converts it into an intermediate representation (IR). If there
are any errors, it also reports them.
Context-Free Grammar (CFG):
The syntax of most programming languages is specified using a context-free grammar. A context-free
grammar consists of the following:
 S – start symbol
 N – non-terminals
 T – terminals
 P – set of production rules
Terminals: A symbol that cannot be replaced by any other symbol is called a terminal (a constant).
Terminals are denoted by small letters (a, b, c).
Non-terminals: A symbol that must be replaced by other symbols is called a non-terminal (a
variable). Non-terminals are denoted by capital letters (X, S, Y).
Productions: The grammatical rules are often called productions.
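For example, in the grammar S → aSb | ε, the start symbol is S, the non-terminals are N = {S},
the terminals are T = {a, b}, and P consists of the two productions shown. This grammar generates
strings such as ab and aabb.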
Parse representation:
A parse can be represented by using:
 Parse tree
 Syntax tree
 Abstract Syntax tree
These representations help in understanding the structure of the source code.
Parse tree:
A parse tree is a hierarchical representation of the terminals and non-terminals. It is also
known as a derivation tree. A parse tree is created by a parser.
Example:
Tree for string = baab
S→AA
A→AA | bA |Ab | a
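One leftmost derivation of baab from this grammar, which the parse tree encodes, is:
S ⇒ AA ⇒ bAA ⇒ baA ⇒ baAb ⇒ baab
(applying A → bA, A → a, A → Ab, and A → a in turn).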

Syntax Tree:
A syntax tree is a tree in which each leaf node represents an operand and each interior node
represents an operator.
Example:
3*4+5
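With the usual precedence (multiplication binds tighter than addition), 3*4+5 groups as (3*4)+5,
so the tree has + at the root:

        +
       / \
      *   5
     / \
    3   4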

Abstract Syntax Tree:
The parse tree often contains a lot of unneeded information. Compilers often use an abstract
syntax tree (AST) to get rid of this unneeded information.
Three-pass compiler:
An intermediate stage is used for code improvement or optimization. This middle end is introduced
between the front end and the back end; it analyses the IR and rewrites (transforms) it. Its
primary goal is to reduce the running time of the compiled code. It is generally termed the
optimizer.

Lexical analysis:
The scanner is the first component of the front end and the parser is the second. The task of the
scanner is to take a program (source code), written in some language such as Java or C++, as a
stream of characters and break that stream into tokens. This activity is known as lexical
analysis. The lexical analyser partitions the input string into substrings called words and
classifies them according to their role.
Specifying tokens:
We do not know what kind of token we are going to read after seeing only its first character; for
example, a token that starts with i can be either an identifier or a keyword. Regular languages
are the most popular formalism for specifying tokens because:
 they are based on a simple and useful theory,
 are easy to understand,
 and have efficient implementations.
Language:
A language over an alphabet ∑ is a set of strings (finite sequences of characters) over ∑. For
lexical analysis we care about regular languages.
Regular Language:
A regular language is one way of defining a language: a language that can be expressed using a
regular expression is called a regular language. A regular expression represents a set of strings
in an algebraic fashion. The tokens we want to recognize are encoded using regular expressions.
If A is a regular expression, then L(A) refers to the language denoted by A.
Acceptor:
We need a mechanism to determine whether an input string w belongs to the language L(R) denoted
by a regular expression R. Such a mechanism is called an acceptor. The acceptor is based on a
finite automaton.
Finite automaton:
A finite automaton, also known as a finite state machine, is a computational model for describing
and designing a system with a finite number of states. A finite automaton has a very limited
amount of memory. It has the following characteristics:
 An input alphabet (∑)
 A set of states
 A start (initial) state
 A set of transitions
 A set of accepting (final) states

Types:
Finite automata are divided into two main categories, each with its own subcategories:
1. Finite automata without output
 Deterministic Finite Automaton (DFA)
 Non-deterministic Finite Automaton (NFA)
 ε-Finite Automaton (ε-NFA)
2. Finite automata with output
 Moore Machine
 Mealy Machine

Table encoding of a FA:
A FA can be encoded as a table called a transition table. Each row represents a state, and each
column represents a character of the alphabet. Each cell of the table holds the next state. This
encoding makes the implementation of a FA simple and efficient.
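As a sketch of this encoding, the following C program hard-codes the transition table of a small
DFA that accepts strings over {a, b} ending in "ab" (the automaton itself is an assumed example).
Rows are states, columns are input symbols, and each cell holds the next state:

    #include <stdio.h>

    /* Transition-table encoding of a DFA for strings ending in "ab". */
    enum { SYM_A = 0, SYM_B = 1 };

    static const int table[3][2] = {
        /*            a  b           */
        /* state 0 */ {1, 0},   /* nothing useful seen yet   */
        /* state 1 */ {1, 2},   /* last character was 'a'    */
        /* state 2 */ {1, 0},   /* last two chars were "ab"  */
    };
    static const int start = 0;
    static const int accepting = 2;

    static int run_dfa(const char *input) {
        int state = start;
        for (const char *p = input; *p; p++)
            state = table[state][*p == 'a' ? SYM_A : SYM_B];
        return state == accepting;
    }

    int main(void) {
        printf("abab -> %s\n", run_dfa("abab") ? "accept" : "reject");
        printf("abba -> %s\n", run_dfa("abba") ? "accept" : "reject");
        return 0;
    }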

Deterministic Finite Automaton (DFA):
In a Deterministic Finite Automaton (DFA), there is exactly one transition per input symbol per
state, and there are no ε-moves.

Non-deterministic Finite Automaton (NFA):
A Non-deterministic Finite Automaton (NFA) can have multiple transitions for one input symbol in
a given state. It can also have ε-moves.

Comparison of DFA and NFA:
NFAs and DFAs recognize the same set of languages (the regular languages). DFAs are easy to
implement. For a given language, the NFA is often simpler than the DFA, but the DFA can be
exponentially larger than the NFA.

NFA construction:
An NFA can be constructed from a regular expression using an algorithm called Thompson's
construction, which first appeared in CACM in 1968. The algorithm builds an NFA for each piece of
the RE and then combines the NFAs using ε-moves.
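For example, to build an NFA for ab|c, Thompson's construction builds trivial NFAs for a, b, and
c, joins the NFAs for a and b with an ε-move to form ab, and then adds a new start state and a
new accepting state connected by ε-moves to both alternatives to form ab|c.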

DFA construction:
An NFA is converted into a DFA using an algorithm called subset construction. In this technique,
each state of the DFA represents a set of states of the original NFA. The DFA uses its states to
keep track of all the possible states the NFA could be in after reading each input symbol.
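A minimal sketch of the underlying idea, assuming a small hand-coded NFA for (a|b)*ab: the C
program below tracks the set of NFA states as a bitmask while reading input. Subset construction
simply tabulates these same state sets ahead of time as DFA states:

    #include <stdio.h>

    /* Tiny hand-coded NFA for (a|b)*ab (an assumed example). A set of
       NFA states is a bitmask: bit s is set when state s is possible. */
    #define NSTATES 3
    enum { SYM_A = 0, SYM_B = 1 };

    /* nfa[state][symbol] = bitmask of successor states */
    static const unsigned nfa[NSTATES][2] = {
        /* state 0: loops on a and b; on a may also move to 1 */
        { (1u << 0) | (1u << 1), (1u << 0) },
        /* state 1: moves to 2 on b */
        { 0, 1u << 2 },
        /* state 2 (accepting): no moves */
        { 0, 0 },
    };

    /* All states reachable from any state in `set` on symbol `sym`. */
    static unsigned move(unsigned set, int sym) {
        unsigned out = 0;
        for (int s = 0; s < NSTATES; s++)
            if (set & (1u << s)) out |= nfa[s][sym];
        return out;
    }

    int main(void) {
        const char *input = "abab";
        unsigned cur = 1u << 0;              /* start with state set {0} */
        for (const char *p = input; *p; p++)
            cur = move(cur, *p == 'a' ? SYM_A : SYM_B);
        printf("\"%s\": %s\n", input,
               (cur & (1u << 2)) ? "accept" : "reject");
        return 0;
    }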
DFA minimization:
The generated DFA may have a large number of states, which can be reduced using Hopcroft's
algorithm. This algorithm groups equivalent states together to reduce the overall number of
states while preserving the language recognized by the DFA.
Lexical analyser generators:
Two popular lexical analyser generators are:
 Flex – generates lexical analysers in C and C++; it is the modern version of the original Lex
tool, which was part of the AT&T Bell Labs version of UNIX.
 JLex – written in Java; generates lexical analysers in Java.
Flex:
To use Flex, one provides a specification file as input. Flex reads this file and produces an
output file containing the lexical analyser code in C or C++.
The input specification file consists of three sections, separated by the symbol "%%":
1. C or C++ and Flex definitions
%%
2. Token definitions and actions
%%
3. User code
It is customary to use the ".l" extension for Flex input files, e.g. "flex.l".
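A minimal Flex specification following this layout might look like the sketch below (the
particular token set is an assumption for illustration). Running "flex flex.l" generates
lex.yy.c, which is then compiled with a C compiler:

    %{
    /* Section 1: C definitions copied verbatim into the generated scanner */
    #include <stdio.h>
    %}
    %%
    [0-9]+                  { printf("<number, %s>\n", yytext); }
    [a-zA-Z_][a-zA-Z0-9_]*  { printf("<identifier, %s>\n", yytext); }
    [ \t\n]+                { /* whitespace is a non-token: skip it */ }
    .                       { printf("<symbol, %s>\n", yytext); }
    %%
    /* Section 3: user code */
    int yywrap(void) { return 1; }
    int main(void)   { yylex(); return 0; }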
Semantic Analysis:
This is the third phase of the compiler, following lexical and syntax analysis. It checks whether
the declarations and statements in the program make sense and are logically correct, i.e. that
the code is meaningful according to the rules of the programming language. It catches logical
errors that cannot be caught by syntax analysis alone.
Using Symbol Table:
Semantic analysis uses a symbol table, which is like a dictionary that keeps track of all the variables,
functions, and other elements in the program. This helps in checking the consistency and correctness of the
code.
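A minimal symbol-table sketch in C (the entry layout and function names are illustrative
assumptions, not from the notes): declarations are recorded, and later uses are looked up to
check that every name was declared:

    #include <stdio.h>
    #include <string.h>

    /* Each entry records a name and its declared type, so later uses
       can be checked against the declaration. */
    struct symbol {
        char name[32];
        char type[16];   /* e.g. "int", "float" */
    };

    static struct symbol table[100];
    static int count = 0;

    /* Returns the entry for `name`, or NULL if it was never declared. */
    static struct symbol *lookup(const char *name) {
        for (int i = 0; i < count; i++)
            if (strcmp(table[i].name, name) == 0) return &table[i];
        return NULL;
    }

    static void declare(const char *name, const char *type) {
        strcpy(table[count].name, name);
        strcpy(table[count].type, type);
        count++;
    }

    int main(void) {
        declare("x", "int");
        /* Using an undeclared variable is a semantic error: */
        printf("x declared: %s\n", lookup("x") ? "yes" : "no");
        printf("y declared: %s\n", lookup("y") ? "yes" : "no");
        return 0;
    }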
Error Detection: If the code contains semantic errors, such as using a variable before it's declared or
assigning the wrong type of value to a variable, semantic analysis detects these errors and reports them.
Ensuring Consistency: The goal of semantic analysis is to ensure that the code is not only syntactically
correct but also logically consistent and meaningful.
Parser:
Parsing is the second phase of the front end. The parser checks the tokens (the stream of words)
and their parts of speech for grammatical correctness. A scanner based on regular expressions
cannot detect syntax errors. Not every sequence of tokens is a program, so the parser must
distinguish between valid and invalid token sequences.
Parsing:
Parsing is the process of discovering a derivation for some sentence of a language. The
mathematical model of syntax is a grammar G. The syntax of most programming languages is
represented by a context-free grammar (CFG). The syntax of C/C++ and Java is heavily derived
from Algol-60.
Derivation:
A derivation is a sequence of production-rule applications used to obtain the input string.
During parsing we make two decisions for an input string:
 deciding which non-terminal to replace, and
 deciding which production rule to replace the non-terminal by.
Types of derivation:
On the basis of the decisions made for an input string, there are two types of derivation:
 Leftmost derivation – replacing the leftmost non-terminal at each step.
 Rightmost derivation – replacing the rightmost non-terminal at each step.
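For example, with the grammar S → AB, A → a, B → b, a leftmost derivation of ab is
S ⇒ AB ⇒ aB ⇒ ab, while the rightmost derivation is S ⇒ AB ⇒ Ab ⇒ ab.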
Parse tree:
A derivation can be represented in a tree-like fashion called a parse tree. In a parse tree:
 All leaf nodes are terminals.
 All interior nodes are non-terminals.
 An in-order traversal gives the original input string.
Precedence:
If two different operators share a common operand, the precedence of the operators decides which
one takes the operand. To add precedence to a grammar, create a non-terminal for each level of
precedence.
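For example, the usual expression grammar E → E + T | T, T → T * F | F, F → id uses one
non-terminal per precedence level, so * binds more tightly than +.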
Ambiguous grammar:
A grammar G is said to be ambiguous if it has more than one parse tree (equivalently, more than
one leftmost or rightmost derivation) for at least one string.
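For example, the grammar E → E + E | E * E | id is ambiguous: the string id + id * id has two
parse trees, one grouping it as (id + id) * id and the other as id + (id * id).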
Parsing techniques:
Parsing techniques are classified into top-down parsing and bottom-up parsing, described below.
Top-down parsing:
A top-down parser builds the parse tree starting at the root and growing towards the leaves. At
each node, the parser picks a production rule and tries to match it against the input string. In
simple terms, it starts from the start symbol of the grammar and works towards deriving the input
string.
Types:
Top-down parsing is further divided into two categories:
 Recursive descent parsing (see the sketch below)
 Predictive parsing (LL(1))
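Here is a recursive-descent parser sketch in C for a tiny expression grammar (the grammar is an
assumed example): E → T { + T }, T → F { * F }, F → digit | ( E ). There is one function per
non-terminal, and each function matches its productions against the input while evaluating the
expression:

    #include <ctype.h>
    #include <stdio.h>
    #include <stdlib.h>

    static const char *p;          /* current position in the input */

    static int parse_expr(void);   /* forward declaration: F uses E */

    static void fail(const char *msg) {
        fprintf(stderr, "syntax error: %s at '%s'\n", msg, p);
        exit(1);
    }

    /* F -> digit | '(' E ')' */
    static int parse_factor(void) {
        if (isdigit((unsigned char)*p)) return *p++ - '0';
        if (*p == '(') {
            p++;
            int v = parse_expr();
            if (*p++ != ')') fail("expected ')'");
            return v;
        }
        fail("expected digit or '('");
        return 0;
    }

    /* T -> F { '*' F } */
    static int parse_term(void) {
        int v = parse_factor();
        while (*p == '*') { p++; v *= parse_factor(); }
        return v;
    }

    /* E -> T { '+' T } */
    static int parse_expr(void) {
        int v = parse_term();
        while (*p == '+') { p++; v += parse_term(); }
        return v;
    }

    int main(void) {
        p = "3*4+5";
        int v = parse_expr();
        if (*p != '\0') fail("trailing input");
        printf("value = %d\n", v);   /* prints 17: '*' binds tighter */
        return 0;
    }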
Bottom-up parsing:
Bottom-up parsing starts at the leaf nodes and grows towards the root of the parse tree. It
handles a large class of grammars. It is also known as shift-reduce parsing: the process involves
shifting input symbols onto a stack and then reducing them according to the grammar rules, hence
the name "shift-reduce".
Types:
Bottom-up parsing can be further divided into the following subcategories:
 Operator precedence parsing
 LR parsing
Final term:
Parsing:
Parsing is the second phase of the front end in a two-pass compiler. It takes the sequence of
tokens from the previous phase, i.e. lexical analysis. Parsing is the process of deriving a
string from a given grammar. It is also known as syntax analysis.
Context-free grammar:
The parser uses the CFG to check whether a given string belongs to a particular grammar or not.
Types of parsers:
A parser can be one of the following types:
 Top-down parser
 Bottom-up parser
These can be further divided as described below.
Top-down parsing:
A top-down parser builds the parse tree starting at the root and growing towards the leaves. At
each node, the parser picks a production rule and tries to match it against the input string. In
simple terms, it starts from the start symbol of the grammar and works towards deriving the input
string.
Types:
Top-down parsing is further divided into two categories:
 Recursive descent parsing
 Predictive parsing (LL(1))
LL(1):
The LL(1) parsing technique uses the FIRST and FOLLOW sets of the grammar symbols.
First:
FIRST(A) contains all the terminals that can appear in the first position of some string derived
from A.
Note:
 FIRST(terminal) = terminal
 FIRST(ε) = ε
Capital letters denote non-terminals; small letters denote terminals.
Follow:
FOLLOW(A) contains the set of all terminals that can appear immediately to the right of A in
some derivation.
Note:
The FOLLOW set of the start symbol contains $.
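For example, for the grammar S → AB, A → a | ε, B → b:
FIRST(A) = {a, ε}, FIRST(B) = {b}, FIRST(S) = {a, b} (since A can derive ε),
FOLLOW(S) = {$}, FOLLOW(A) = {b}, and FOLLOW(B) = {$}.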
Bottom-up parsing:
Bottom-up parsing starts at the leaf nodes and grows towards the root of the parse tree. It
handles a large class of grammars. It is also known as shift-reduce parsing: the process involves
shifting input symbols onto a stack and then reducing them according to the grammar rules, hence
the name "shift-reduce".
Types of bottom-up parsing:
 Operator precedence parsing
 LR parsing
