Unit 5 SP
Phases of a Compiler
A compiler has two main phases, namely the Analysis phase and the Synthesis phase. The analysis phase creates an intermediate representation from the given source code. The synthesis phase creates an equivalent target program from that intermediate representation.
Symbol Table – It is a data structure used and maintained by the compiler, consisting of all the identifiers' names along with their types. It helps the compiler function smoothly by letting it find identifiers quickly.
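As a concrete sketch, a minimal symbol-table entry could be represented in C as follows (the field names and sizes here are illustrative assumptions, not part of these notes):

/* A minimal symbol-table entry: an identifier's name and type. */
struct symbol {
    char name[64];  /* identifier name */
    char type[16];  /* its type, e.g. "int" or "float" */
    int scope;      /* nesting level of the enclosing scope */
};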
The analysis of a source program is divided into three main phases. They are:
1. Linear Analysis –
This involves a scanning phase where the stream of characters is read from left to right and grouped into tokens having a collective meaning.
2. Hierarchical Analysis –
In this phase, the tokens are grouped hierarchically into nested collections with a collective meaning.
3. Semantic Analysis-
This phase is used to check whether the components of the source program are
meaningful or not.
The compiler has two modules, namely the front end and the back end. The front end consists of the lexical analyzer, syntax analyzer, semantic analyzer, and intermediate code generator; the remaining phases together form the back end.
1. Lexical Analyzer –
It is also called a scanner. It takes as input the output of the preprocessor (which performs file inclusion and macro expansion), which is in a pure high-level language. It reads the characters of the source program and groups them into lexemes (sequences of characters that "go together"). Each lexeme corresponds to a token. Tokens are defined by regular expressions, which the lexical analyzer understands. It also detects lexical errors (e.g., erroneous characters) and removes comments and white space.
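For instance, identifier and number tokens are often described with regular expressions like the following (a standard textbook formulation, shown purely for illustration):

id     → letter ( letter | digit )*
number → digit digit*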
2. Syntax Analyzer – It is sometimes called a parser. It constructs the parse tree. It
takes all the tokens one by one and uses Context-Free Grammar to construct the
parse tree.
Why Grammar?
The syntactic rules of a programming language can be represented by a few productions. Using these productions we can describe what a valid program is, and check whether the input is in the desired format.
The parse tree is also called the derivation tree. Parse trees are generally
constructed to check for ambiguity in the given grammar. There are certain rules
associated with the derivation tree.
Any identifier is an expression
Any number can be called an expression
Performing any operations in the given expression will always result in an
expression. For example, the sum of two expressions is also an expression.
The parse tree can be compressed to form a syntax tree
Syntax errors can be detected at this level if the input does not conform to the grammar.
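As an illustration, a small expression grammar capturing the rules above might be written as follows (a standard textbook grammar, not taken from these notes):

E → E + E
E → E * E
E → ( E )
E → id
E → num

Under these productions, id + num is an expression, and so is ( id + num ) * id. Note that this grammar is ambiguous: the string id + num * id has two distinct parse trees, which is exactly the kind of problem that constructing parse trees helps to detect.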
3. Semantic Analyzer – It verifies the parse tree, checking whether it is meaningful or not, and produces a verified parse tree. It also performs type checking, label checking, and flow-control checking.
4. Intermediate Code Generator – It generates intermediate code, a form that can be readily translated into target machine code. There are many popular intermediate representations; three-address code is a common example. The intermediate code is converted to machine language by the last two phases, which are platform dependent.
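For example, the assignment a = b + c * d could be translated into three-address code along these lines (the temporaries t1 and t2 are illustrative names):

t1 = c * d
t2 = b + t1
a = t2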
Up to the intermediate code, compilation is the same for every compiler out there; after that, it depends on the platform. To build a new compiler we do not need to build it from scratch: we can take the intermediate code from an already existing compiler and build only the last two parts.
5. Code Optimizer – It transforms the code so that it consumes fewer resources and runs faster, without altering the meaning of the code. Optimization can be categorized into two types: machine-dependent and machine-independent.
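A small machine-independent example is constant folding, which evaluates constant expressions at compile time (shown below in illustrative three-address code):

Before optimization:
t1 = 3 * 4
a = t1 + 2
After optimization:
a = 14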
6. Target Code Generator – The main purpose of the target code generator is to produce code that the machine can understand, performing register allocation, instruction selection, and so on. The output is dependent on the type of assembler. This is the final stage of compilation: the optimized code is converted into relocatable machine code, which then forms the input to the linker and loader.
Lexical Analysis
Lexical Analysis is the first phase of the compiler, also known as scanning. It converts the high-level input program into a sequence of tokens.
Lexical analysis can be implemented with a deterministic finite automaton.
The output is a sequence of tokens that is sent to the parser for syntax analysis.
What is a token?
A lexical token is a sequence of characters that can be treated as a unit in the grammar of a programming language.
Examples of tokens:
Type tokens (id, number, real, ...)
Punctuation tokens (';', ',', '(', ')', ...)
Alphabetic tokens (keywords)
Keywords; examples: for, while, if, etc.
Identifiers; examples: variable names, function names, etc.
Operators; examples: '+', '++', '-', etc.
Separators; examples: ',', ';', etc.
Example of Non-Tokens:
Comments, preprocessor directive, macros, blanks, tabs, newline, etc.
Lexeme: The sequence of characters matched by a pattern to form the corresponding token, i.e., the sequence of input characters that comprises a single token, is called a lexeme. E.g., “float”, “abs_zero_Kelvin”, “=”, “-”, “273”, “;”.
How the Lexical Analyzer Functions
The lexical analyzer identifies errors with the help of its automaton and the grammar of the given language on which it is based (e.g., C or C++), and reports the row number and column number of each error.
Suppose we pass the statement a = b + c; through the lexical analyzer. It will generate a token sequence like this:
id = id + id;
where each id refers to its variable's entry in the symbol table, which holds all of its details.
For example, consider the program:
int main()
{
    // 2 variables
    int a, b;
    a = 10;
    return 0;
}
All the valid tokens are:
'int' 'main' '(' ')' '{' 'int' 'a' ',' 'b' ';'
'a' '=' '10' ';' 'return' '0' ';' '}'
You can observe that the comment has been omitted from the token stream.
The syntax analysis phase is the second phase of a compiler. It takes input from the lexical analyzer and provides an output that serves as input to the semantic analyzer.
The syntax analyzer, also referred to as the parser, reads the string of tokens from the lexical analyzer and confirms that it can be generated from the grammar of the source language.
The syntax analyzer then forwards this parse tree to the rest of the front end for processing.
Besides building the parse tree, the syntax analyzer also collects information about each token and stores it in the symbol table. Along with this, it also performs:
Type checking.
Semantic analysis.
Intermediate code generation.
Types of Parsing
The three common types of parsing are as follows:
1. Universal Parsing
2. Top-down Parsing
3. Bottom-up Parsing
Universal Parsing
Universal parsing methods can parse any grammar, but they are far too inefficient to be used in a production compiler. So usually only two methods are used for parsing: top-down and bottom-up.
Top-down Parsing
In the top-down method, the parser builds the parse tree starting from the top. That means it starts from the root of the parse tree and traverses down towards the leaves.
Bottom-up Parsing
In the bottom-up method, the parser builds the parse tree starting from the bottom. This implies it starts from the leaves of the parse tree and traverses upwards to the root.
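To make the difference concrete, consider the grammar E → E + E | id and the input id + id (a standard textbook example, not from these notes):

Top-down (from the root to the leaves, a leftmost derivation):
E ⇒ E + E ⇒ id + E ⇒ id + id
Bottom-up (from the leaves to the root, a sequence of reductions):
id + id → E + id → E + E → E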
Note: Whichever type of parsing the parser chooses, it starts scanning the parse tree from the left and continues traversing the tree towards the right. Remember, it scans only one symbol or node at a time.
However, the main task of the parser is to detect syntactic errors efficiently.
Error handling in the parser involves reporting the location of the error in the program: the error handler must point to the line at which the error occurred, which helps in detecting and fixing it.
Context-Free Grammar
In this section, we will first see the definition of context-free grammar and introduce
terminologies used in parsing technology.
A context-free grammar has four components:
A set of non-terminals (V). Non-terminals are syntactic variables that denote
sets of strings. The non-terminals define sets of strings that help define the
language generated by the grammar.
A set of tokens, known as terminal symbols (Σ). Terminals are the basic
symbols from which strings are formed.
A set of productions (P). The productions of a grammar specify the manner in which the terminals and non-terminals can be combined to form strings. Each production consists of a non-terminal called the left side of the production, an arrow, and a sequence of tokens and/or non-terminals, called the right side of the production.
One of the non-terminals is designated as the start symbol (S), from which the production begins.
Strings are derived from the start symbol by repeatedly replacing a non-terminal (initially the start symbol) with the right side of a production for that non-terminal.
Example
We take the problem of the palindrome language, which cannot be described by means of a regular expression. That is, L = { w | w = w^R } is not a regular language. But it can be described by means of a CFG, as illustrated below:
G = ( V, Σ, P, S )
Where:
V = { Q, Z, N }
Σ = { 0, 1 }
P = { Q → Z | N | 0 | 1 | ε, Z → 0Q0, N → 1Q1 }
S = Q
This grammar describes the palindrome language, generating strings such as 1001, 11100111, 00100, 1010101, 11111, etc.
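For example, the string 00100 can be derived from this grammar as follows:
Q ⇒ Z ⇒ 0Q0 ⇒ 0Z0 ⇒ 00Q00 ⇒ 00100 (the last step uses Q → 1 for the middle symbol)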
Syntax Analyzers
A syntax analyzer or parser takes the input from a lexical analyzer in the form of token
streams. The parser analyzes the source code (token stream) against the production
rules to detect any errors in the code. The output of this phase is a parse tree.
This way, the parser accomplishes two tasks, i.e., parsing the code, looking for errors
and generating a parse tree as the output of the phase.
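For instance, for the statement a = b + c (tokenized as id = id + id), the resulting parse tree can be sketched roughly as follows (a simplified view, not a full derivation tree):

      =
     / \
   id   +
       / \
     id   id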
What is LEX?
It is a tool which automatically generates a lexical analyzer (a finite automaton). It takes a LEX source program as its input and produces a lexical analyzer as its output. The lexical analyzer then converts the input string entered by the user into tokens.
LEX is a program generator designed for the lexical processing of character input streams. It can generate anything from a simple text-search program that looks for patterns in its input to the scanner of a C compiler. In programs with structured input, two tasks occur over and over: dividing the input into meaningful units, and discovering the relationships among those units. For a C program, the units are variable names, constants, strings, and so on. This division into units (called tokens) is known as lexical analysis, or LEXING. LEX helps by taking a set of descriptions of possible tokens and producing a routine called a lexical analyzer, LEXER, or scanner.
A LEX source program consists of two parts:
Auxiliary Definitions
Translation Rules
Auxiliary Definitions
These denote regular expressions of the form:
D1 = R1
D2 = R2
...
Dn = Rn
where each Di is a distinct name and each Ri is a regular expression over the input alphabet and the previously defined names.
Translation Rules
These are a set of rules or actions which tell the lexical analyzer what it has to do or what it has to return to the parser on encountering a token.
They consist of statements of the form:
P1 {Action1}
P2 {Action2}
...
Pn {Actionn}
where:
Pi → a pattern or regular expression consisting of input alphabets and auxiliary definition names.
Actioni → a piece of code that gets executed whenever the corresponding token is recognized. Each Actioni specifies a set of statements to be executed whenever the regular expression or pattern Pi matches the input string.
Example
Translation Rules for "Keywords"
begin {return 1}
We can see that if the lexical analyzer is given the input "begin", it will recognize the token "begin" and return 1 as its integer code to the parser.
Translation Rules for "Identifiers"
letter (letter + digit)* {Install(); return 6}
If the lexical analyzer is given a token which is an "identifier", the action taken is to install (store) the name in the symbol table and return 6 as the integer code to the parser.
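A minimal sketch in C of what such an Install() routine might look like is given below; the table layout and the function's signature are illustrative assumptions (in a real LEX-generated scanner the matched lexeme is available in the global yytext):

#include <string.h>

#define MAXSYMS 256

static char symtab[MAXSYMS][64]; /* illustrative symbol table */
static int nsyms = 0;

/* Store the lexeme in the symbol table if it is not already there,
   and return its index. */
int Install(const char *lexeme) {
    for (int i = 0; i < nsyms; i++)
        if (strcmp(symtab[i], lexeme) == 0)
            return i; /* already installed */
    if (nsyms >= MAXSYMS)
        return -1;    /* table full */
    strncpy(symtab[nsyms], lexeme, 63);
    symtab[nsyms][63] = '\0';
    return nsyms++;
}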
YACC
A parser generator is a program that takes as input a specification of a syntax,
and produces as output a procedure for recognizing that language. Historically,
they are also called compiler-compilers.
YACC (yet another compiler-compiler) is an LALR(1) (LookAhead, Left-to-right,
Rightmost derivation producer with 1 lookahead token) parser generator. YACC
was originally designed for being complemented by Lex.
Input File:
YACC input file is divided into three parts.
/* definitions */
....
%%
/* rules */
....
%%
/* auxiliary routines */
....
Input File: Definition Part:
The definition part includes information about the tokens used in the syntax
definition:
%token NUMBER
%token ID
Yacc automatically assigns numbers to tokens, but this can be overridden by specifying an explicit number after the token name in its declaration.
The definition part can include C code external to the definition of the parser
and variable declarations, within %{ and %} in the first column.
It can also include the specification of the starting symbol in the grammar:
%start nonterminal
Input File: Rule Part:
The rules part contains the grammar of the language in a modified BNF form, with the action for each rule given as C code enclosed in { }.
Input File: Auxiliary Routines Part:
The auxiliary routines part is plain C code. If yylex() is not defined there, the Lex-generated scanner should be included with:
#include "lex.yy.c"
The YACC input file conventionally has the extension .y.
Output Files:
Yacc's main output is a C source file (conventionally y.tab.c) containing the generated parser. If called with the –d option in the command line, Yacc also produces a header file y.tab.h with all its specific definitions (particularly important are the token definitions, to be included, for example, in a Lex input file).
Example:
Yacc File (.y)
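As a minimal sketch, a Yacc file for a toy grammar that adds single-digit numbers might look like this, following the three-part layout shown above (the grammar and the hand-written yylex() are illustrative, not from the original):

%{
#include <stdio.h>
#include <ctype.h>
int yylex(void);
void yyerror(const char *s) { fprintf(stderr, "%s\n", s); }
%}
%token DIGIT
%%
line : expr '\n'     { printf("%d\n", $1); }
     ;
expr : expr '+' term { $$ = $1 + $3; }
     | term
     ;
term : DIGIT
     ;
%%
/* A tiny hand-written scanner used in place of a Lex-generated one. */
int yylex(void) {
    int c = getchar();
    if (isdigit(c)) { yylval = c - '0'; return DIGIT; }
    return c;
}
int main(void) { return yyparse(); }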
What Is An Interpreter?
Just like a compiler, an interpreter translates a high-level language into a low-level form. The difference is that an interpreter executes the source program directly, typically line by line, instead of producing a separate target program.
Advantages Of Interpreter
1. Cross-Platform → With an interpreted language we share the source code directly, which can run on any system that has the interpreter, without incompatibility issues.
2. Easier To Debug → Debugging is easier with an interpreter, since it reads the code line by line and reports errors on the spot. Also, a client with the source code can easily debug or modify the code if needed.
3. Less Memory and Steps → Unlike a compiler, an interpreter does not generate new separate files, so it does not take extra memory and we do not need to perform an extra step to execute the source code; it is executed on the fly.
4. Execution Control → An interpreter reads the code line by line, so you can stop execution and edit the code at any point, which is not possible in a compiled language. However, after being stopped, execution will start from the beginning when you run the code again.
Disadvantages Of Interpreter
1. Slower → An interpreter is often slower than a compiler, as it reads, analyzes, and converts the code line by line.
2. Interpreter required → A client, or anyone with the shared source code, needs to have an interpreter installed on their system in order to execute the code.
3. Less Secure → Unlike compiled languages, an interpreter does not generate any executable file, so to share the program with others we need to share our source code, which is neither secure nor private. So it is not good for any company or corporation concerned about privacy.
Types of Errors
There are five different types of errors in C.
1. Syntax Error
2. Run Time Error
3. Logical Error
4. Semantic Error
5. Linker Error
1. Syntax Error
Syntax errors occur when a programmer makes typing mistakes or typos in the code. In other words, syntax errors occur when a programmer does not follow the set of rules defined for the syntax of the C language.
Syntax errors are sometimes also called compilation errors because they are always
detected by the compiler. Generally, these errors can be easily identified and rectified
by programmers.
void main() {
    var = 5; // we did not declare the data type of the variable
}
Output:
Since a value is assigned to the variable var without defining the variable's data type, the compiler throws a syntax error.
void main() {
    for (int i = 0;) { // incorrect syntax of the for loop
        printf("Scaler Academy");
    }
}
Output:
A for loop needs three clauses (initialization; condition; update) to run. Since we entered only one, the compiler throws a syntax error.
2. Run Time Error
Errors that occur during the execution (or running) of a program are called Run Time
Errors. These errors occur after the program has been compiled successfully. When a
program is running, and it is not able to perform any particular operation, it means that
we have encountered a run time error. For example, while a certain program is running,
if it encounters the square root of -1 in the code, the program will not be able to
generate an output because calculating the square root of -1 is not possible. Hence, the
program will produce an error.
Run time errors can be a little tricky to identify because the compiler can not detect
these errors. They can only be identified once the program is running. Some of the most
common run time errors are: division by zero, array index out of bounds, string index out of bounds, etc.
Run time errors can occur because of various reasons. Some of the reasons are:
1. Mistakes in the Code: Let us say that during the execution of a while loop, the programmer forgets to enter a break statement. This will cause the loop to run forever, resulting in a run time error.
2. Memory Leaks: If a programmer creates an array in the heap but forgets to delete the
array's data, the program might start leaking memory, resulting in a run time error.
3. Mathematically Incorrect Operations: Dividing a number by zero, or calculating the
square root of -1 will also result in a run time error.
4. Undefined Variables: If a programmer forgets to define a variable in the code, the
program will generate a run time error.
Example 1:
// A program that calculates the square root of integers
#include <stdio.h>
#include <math.h>

int main() {
    for (int i = 4; i >= -2; i--) {
        printf("%f", sqrt(i));
        printf("\n");
    }
    return 0;
}
Output (the exact representation of the invalid results depends on the compiler):
2.000000
1.732051
1.414214
1.000000
0.000000
-1.#IND00
-1.#IND00
or, on other systems:
2.000000
1.732051
1.414214
1.000000
0.000000
-nan
-nan
In the above example, we used a for loop to calculate the square roots of seven integers. But because we also tried to calculate the square roots of two negative numbers, the program generated two errors (the IND written above stands for "Indeterminate"). These errors are run time errors. -nan is similar to IND.
Example 2:
#include <stdio.h>

void main() {
    int var = 2147483649;
    printf("%d", var);
}
Output:
-2147483647
This is an integer overflow error. The maximum value an int can hold in C is 2147483647. Since in the above example we assigned 2147483649 to the variable var, the value wraps around (2147483649 - 2^32 = -2147483647), and we get -2147483647 as the output.
3. Logical Error
Sometimes, we do not get the output we expected after the compilation and execution
of a program. Even though the code seems error free, the output generated is different
from the expected one. These types of errors are called Logical Errors. Logical errors
are those errors in which we think that our code is correct, the code compiles without
any error and gives no error while it is running, but the output we get is different from
the output we expected.
In 1999, NASA lost a spacecraft due to a logical error. This happened because of miscalculations between English (imperial) and metric units: the software was coded to work with one system of units but was fed data in the other.
For Example:
#include <stdio.h>

void main() {
    float a = 10;
    float b = 5;
    if (b = 0) { // we wrote = instead of ==
        printf("Division by zero is not possible");
    } else {
        printf("The output is: %f", a/b);
    }
}
Output:
The output is: inf
INF signifies a division by zero error. In the above example, we wanted to check in the if condition whether the variable b was equal to zero. But instead of using the equal-to comparison operator (==), we used the assignment operator (=). Because of this, b became 0, the if condition evaluated to false, and the else clause got executed, dividing a by zero.
4. Semantic Error
Errors that occur because the compiler is unable to understand the written code are
called Semantic Errors. A semantic error will be generated if the code makes no sense
to the compiler, even though it is syntactically correct. It is like using the wrong word in
the wrong place in the English language. For example, adding a string to an integer will
generate a semantic error.
Semantic errors are different from syntax errors, as syntax errors signify that the
structure of a program is incorrect without considering its meaning. On the other hand,
semantic errors signify the incorrect implementation of a program by considering the
meaning of the program.
The most commonly occurring semantic errors are: use of uninitialized variables, type incompatibility, and array index out of bounds.
Example 1:
#include <stdio.h>

void main() {
    int a, b, c;
    a * b = c; // This will generate a semantic error
}
Output:
A compilation error such as "lvalue required as left operand of assignment". When we have an expression on the left-hand side of an assignment operator (=), the program generates a semantic error. Even though the code is syntactically correct, the compiler cannot make sense of it.
Example 2:
#include <stdio.h>

void main() {
    int arr[5] = {5, 10, 15, 20, 25};
    int arraySize = sizeof(arr)/sizeof(arr[0]);
    for (int i = 0; i <= arraySize; i++) // note: <= runs one element past the end
        printf("%d\n", arr[i]);
}
Output:
5
10
15
20
25
32764
In the above example, the loop printed six values while the array arr only has five elements. Because we tried to access the sixth element of the array, we got a semantic error, and the program printed a garbage value.
5. Linker Error
The linker is a program that takes the object files generated by the compiler and combines them into a single executable file. Linker errors are encountered when the executable file of the code cannot be generated even though the code compiles successfully. This error occurs when an object file is unable to link with the main object file. We can run into a linker error if we have imported an incorrect header file in the code, have a wrong function declaration, etc.
For Example:
#include <stdio.h>

void Main() { // note the capital M
    int var = 10;
    printf("%d", var);
}
Output:
A linker error such as "undefined reference to `main'". In the above code, as we wrote Main() instead of main(), the program generated a linker error. This happens because every C program must have a main() function. Since the above program does not define main(), the linker cannot produce an executable, and we get an error. This is one of the most common types of linker error.
Difference between Unix and Windows
1. Licensing: UNIX is an open-source system that can be used under the General Public License; Windows is proprietary software owned by Microsoft.
2. User Interface: UNIX has a text-based interface, making it harder to grasp for newcomers; Windows has a Graphical User Interface, making it simpler to use.
3. Processing: UNIX supports multiprocessing; Windows supports multithreading.
4. File System: UNIX uses the Unix File System (UFS); Windows uses the File Allocation Table (FAT32) and the New Technology File System (NTFS).
5. Security: UNIX is more secure, as all changes to the system require explicit user permission; Windows is less secure compared to UNIX.
6. Data Backup & Recovery: Creating a backup and recovery system is tedious in UNIX, but it is improving with the introduction of new distributions; Windows has an integrated backup and recovery system that makes it simpler to use.
8. Hardware: Hardware support is limited in UNIX systems, and some hardware might not have drivers built for it; on Windows, drivers are available for almost all hardware.
9. Reliability: Unix and its distributions are well known for being very stable to run; although Windows has been stable in recent years, it is yet to match the stability provided by Unix systems.