UNIT-1
A programming language defines a set of instructions that are combined to perform a
specific task on the CPU (Central Processing Unit). The term mainly refers
to high-level languages such as C, C++, Pascal, Ada, COBOL, etc.
Each programming language contains a unique set of keywords and syntax, which are used to
create a set of instructions. Thousands of programming languages have been developed to
date, but each language has its specific purpose. These languages vary in the level of
abstraction they provide from the hardware. Some programming languages provide less or no
abstraction while some provide higher abstraction. Based on the levels of abstraction, they can
be classified into two categories:
o Low-level language
o High-level language
In terms of abstraction from the hardware, machine language provides no abstraction,
assembly language provides little abstraction, and high-level languages provide a higher
level of abstraction.
Low-level language
The low-level language is a programming language that provides no abstraction from the
hardware, and it is represented in the form of 0s and 1s, i.e., machine instructions. The
languages that come under this category are machine-level language and assembly
language.
Machine-level language
The machine-level language consists of a set of instructions in binary form, i.e., 0s and 1s.
Since computers can understand only machine instructions in binary digits, the instructions
given to the computer can only be in binary code. Creating a program in a machine-level
language is very difficult, as it is not easy for programmers to write in machine instructions.
It is error-prone, hard to understand, and its maintenance cost is very high. A machine-level
language is also not portable: each computer has its own machine instructions, so a program
written on one computer will not run on another.
Different processor architectures use different machine codes; for example, a PowerPC
processor uses a RISC architecture, which requires different code than an Intel x86 processor,
which uses a CISC architecture.
Assembly Language
The assembly language contains human-readable commands such as mov, add, and sub.
The problems we faced with machine-level language are reduced to some extent by using
this extended form of machine-level language. Since assembly language instructions are
written in English-like words such as mov, add, and sub, it is easier to write and understand.
Since computers can only understand machine-level instructions, we require a translator
that converts the assembly code into machine code. The translator used for this purpose is
known as an assembler.
The assembly language code is still not portable, because it refers to specific processor
registers, and different computers have different register sets.
Assembly code is also no faster than machine code: assembly language sits above machine
language in the hierarchy, which means it has some abstraction from the hardware, while
machine language has zero abstraction.
The following are the differences between machine-level language and assembly language:
The machine-level language comes at the The assembly language comes above the
lowest level in the hierarchy, so it has zero machine language means that it has less
abstraction level from the hardware. abstraction level from the hardware.
It does not require any translator as the In assembly language, the assembler is used to
machine code is directly executed by the convert the assembly code into machine code.
computer.
High-Level Language
The high-level language is a programming language that allows a programmer to write
programs that are independent of a particular type of computer. High-level languages are
considered high-level because they are closer to human languages than machine-level
languages.
When writing a program in a high-level language, the programmer's whole attention can be
paid to the logic of the problem.
o The high-level language is easy to read, write, and maintain, as it is written in English-like
words.
o The high-level languages are designed to overcome the main limitation of low-level
languages, namely the lack of portability. The high-level language is portable; i.e., these
languages are machine-independent.
The following are the differences between low-level language and high-level language:
o Translation: A low-level language requires an assembler to convert the assembly code into
machine code, whereas a high-level language requires a compiler to convert its instructions
into machine code.
o Portability: Machine code cannot run on all machines, so a low-level language is not
portable, whereas high-level code can run on all platforms, so it is portable.
o Maintenance: Debugging and maintenance are harder in a low-level language and easier
in a high-level language.
Compiler
A compiler is a translator that converts a program written in a high-level language into
machine code, processing the whole program at once.
Interpreter
An interpreter is a translator that translates and executes a high-level program statement by
statement, without producing a separate machine-code file.
Assembler
An assembler is a translator that is used to translate assembly language code into machine
language code.
Phases of Compiler
A compiler operates in various phases; each phase transforms the source program from one
representation to another. Every phase takes its input from the previous stage and feeds its
output to the next phase of the compiler.
There are six phases in a compiler, and each of these phases helps in converting the
high-level language into machine code. The phases of a compiler are:
1. Lexical analysis
2. Syntax analysis
3. Semantic analysis
4. Intermediate code generator
5. Code optimizer
6. Code generator
Together, these phases convert the source code by dividing it into tokens, creating parse
trees, and transforming and optimizing the code step by step.
Phase 1: Lexical Analysis
Lexical analysis is the first phase, in which the compiler scans the source code. The scan
proceeds left to right, character by character, grouping the characters into tokens.
Here, the character stream from the source program is grouped into meaningful sequences
by identifying the tokens. The lexical analyzer makes entries for the corresponding tokens
in the symbol table and passes each token to the next phase.
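As a rough illustration of this phase, the following C sketch scans a hardcoded statement
left to right and groups its characters into tokens. The token classes ID, NUM, and OP are
our own simplification, not from any particular compiler:
#include <stdio.h>
#include <ctype.h>

int main(void) {
    const char *src = "total = count + rate * 5;";
    const char *p = src;
    while (*p) {
        if (isspace((unsigned char)*p)) {
            p++;                                  /* skip whitespace */
        } else if (isalpha((unsigned char)*p) || *p == '_') {
            const char *start = p;                /* identifier lexeme */
            while (isalnum((unsigned char)*p) || *p == '_') p++;
            printf("ID(%.*s) ", (int)(p - start), start);
        } else if (isdigit((unsigned char)*p)) {
            const char *start = p;                /* numeric constant */
            while (isdigit((unsigned char)*p)) p++;
            printf("NUM(%.*s) ", (int)(p - start), start);
        } else {
            printf("OP(%c) ", *p);                /* operator or punctuation */
            p++;
        }
    }
    printf("\n");
    return 0;
}
Running it on the statement above prints ID(total) OP(=) ID(count) OP(+) ID(rate) OP(*)
NUM(5) OP(;).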
Phase 2: Syntax Analysis
Syntax analysis (parsing) is the second phase. It checks whether the tokens produced by the
lexical analyzer follow the grammar of the source language and arranges them into a parse
tree. In the parse tree:
Interior node: a record with an operator field and fields for its children
Leaf: a record with two or more fields; one for the token and the others for information
about the token
This phase:
Ensures that the components of the program fit together meaningfully
Gathers type information and checks for type compatibility
Checks that the operands are permitted by the source language
Phase 3: Semantic Analysis
Semantic analysis checks the semantic consistency of the code. It uses the syntax tree of the
previous phase along with the symbol table to verify that the given source code is semantically
consistent. It also checks whether the code is conveying an appropriate meaning.
The semantic analyzer will check for type mismatches, incompatible operands, a function
called with improper arguments, an undeclared variable, etc.
Functions of the semantic analysis phase are:
Stores the type information gathered in the symbol table or the syntax tree
Performs type checking
Reports a semantic error when there is a type mismatch for which no type-correction
(coercion) rule satisfies the desired operation
Collects type information and checks for type compatibility
Checks whether the source language permits the operands or not
Example
float x = 20.2;
float y = x*30;
In the above code, the semantic analyzer will typecast the integer 30 to the float 30.0 before
the multiplication.
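The same coercion can be observed in C itself. In this minimal sketch, the compiler
implicitly converts the integer constant 30 to floating point before the multiplication,
exactly as described above:
#include <stdio.h>

int main(void) {
    float x = 20.2f;
    /* The integer constant 30 is implicitly converted to floating point
       before the multiplication, just as the semantic analyzer above
       inserts an int-to-float conversion. */
    float y = x * 30;
    printf("y = %f\n", y);
    return 0;
}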
Phase 4: Intermediate Code Generation
Once the semantic analysis phase is over, the compiler generates intermediate code for the
target machine. The intermediate code represents a program for some abstract machine.
Intermediate code is between the high-level and machine level language. This intermediate
code needs to be generated in such a manner that makes it easy to translate it into the target
machine code.
Functions of intermediate code generation:
It should be generated from the semantic representation of the source program
It holds the values computed during the process of translation
It makes it easy to translate the intermediate code into the target language
It maintains the precedence ordering of the source language
It holds the correct number of operands for each instruction
Example
For the statement
total = count + rate * 5
the intermediate code, using the three-address code method, is:
t1 := int_to_float(5)
t2 := rate * t1
t3 := count + t2
total := t3
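Such three-address instructions are often stored as quadruples (operator, two arguments,
result). The C sketch below, whose field and variable names are illustrative only, stores and
prints the sequence above:
#include <stdio.h>

/* A three-address instruction stored as a quadruple; the field names
   are illustrative, not from any particular compiler. */
typedef struct {
    const char *op;      /* operator, e.g. "*", "+", "int_to_float" */
    const char *arg1;    /* first operand */
    const char *arg2;    /* second operand ("" if unused) */
    const char *result;  /* temporary or variable receiving the value */
} Quad;

int main(void) {
    /* total = count + rate * 5, as in the example above */
    Quad code[] = {
        {"int_to_float", "5",     "",   "t1"},
        {"*",            "rate",  "t1", "t2"},
        {"+",            "count", "t2", "t3"},
        {"copy",         "t3",    "",   "total"},
    };
    for (size_t i = 0; i < sizeof code / sizeof code[0]; i++) {
        const Quad *q = &code[i];
        if (q->arg2[0] != '\0')  /* binary operation */
            printf("%s := %s %s %s\n", q->result, q->arg1, q->op, q->arg2);
        else                     /* unary operation or copy */
            printf("%s := %s(%s)\n", q->result, q->op, q->arg1);
    }
    return 0;
}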
Phase 5: Code Optimization
The next phase is optimization of the intermediate code. This phase removes unnecessary
code lines and arranges the sequence of statements to speed up the execution of the program
without wasting resources. The main goal of this phase is to improve on the intermediate
code to generate code that runs faster and occupies less space.
The primary functions of this phase are:
It helps you to establish a trade-off between execution speed and compilation speed
It improves the running time of the target program
It generates streamlined code that is still in the intermediate representation
It removes unreachable code and gets rid of unused variables
It moves statements that are not altered inside a loop out of the loop
Example:
Consider the following code:
a = intofloat(10)
b = c * a
d = e + b
f = d
It can become:
b = c * 10.0
f = e + b
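The effect of this transformation can be pictured in C. In the hypothetical sketch below, the
two functions compute the same value; the second corresponds to the optimized intermediate
code, with the constant folded in and the redundant copies removed:
#include <stdio.h>

/* Before optimization: mirrors the intermediate code above; 'a' holds a
   compile-time constant, and 'd' and 'f' are redundant copies. */
static float unoptimized(float c, float e) {
    float a = (float)10;   /* a = intofloat(10) */
    float b = c * a;       /* b = c * a */
    float d = e + b;       /* d = e + b */
    float f = d;           /* f = d */
    return f;
}

/* After optimization: constant folding and copy propagation leave only
   the two statements shown above. */
static float optimized(float c, float e) {
    float b = c * 10.0f;   /* b = c * 10.0 */
    return e + b;          /* f = e + b */
}

int main(void) {
    printf("%f %f\n", unoptimized(2.0f, 3.0f), optimized(2.0f, 3.0f));
    return 0;
}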
Phase 6: Code Generation
Code generation is the last and final phase of a compiler. It takes its input from the code
optimization phase and produces the target code or object code as a result. The objective of
this phase is to allocate storage and generate relocatable machine code.
It also allocates memory locations for the variables. The instructions in the intermediate
code are converted into machine instructions. This phase converts the optimized
intermediate code into the target language.
The target language is the machine code. Therefore, all the memory locations and registers are
also selected and allotted during this phase. The code generated by this phase is executed to
take inputs and generate expected outputs.
Example:
a = b + 60.0
would possibly be translated into register-based instructions such as:
MOVF b, R1
ADDF #60.0, R1
MOVF R1, a
Symbol Table Management
A symbol table contains a record for each identifier with fields for the attributes of the
identifier. This component makes it easier for the compiler to search the identifier record and
retrieve it quickly. The symbol table also helps with scope management. The symbol table
and the error handler interact with all the phases, and the symbol table is updated
correspondingly.
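A minimal sketch of such a table, assuming a simple linear array of records (real compilers
typically use hash tables and nested scopes), might look like this in C:
#include <stdio.h>
#include <string.h>

/* A record for each identifier, with fields for its attributes. */
typedef struct {
    char name[32];   /* identifier name */
    char type[16];   /* attribute: data type */
    int  scope;      /* attribute: scope level */
} Symbol;

static Symbol table[100];
static int nsyms = 0;

/* Insert a new identifier record into the table. */
static void insert(const char *name, const char *type, int scope) {
    if (nsyms < 100) {
        strcpy(table[nsyms].name, name);
        strcpy(table[nsyms].type, type);
        table[nsyms].scope = scope;
        nsyms++;
    }
}

/* Search for an identifier record, most recent entry first. */
static Symbol *lookup(const char *name) {
    for (int i = nsyms - 1; i >= 0; i--)
        if (strcmp(table[i].name, name) == 0)
            return &table[i];
    return NULL;
}

int main(void) {
    insert("count", "int", 0);
    insert("rate", "float", 0);
    Symbol *s = lookup("rate");
    if (s != NULL)
        printf("%s : %s (scope %d)\n", s->name, s->type, s->scope);
    return 0;
}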
Lexical Error
Lexical errors are errors that your lexer throws when it is unable to continue. This means that there's no
way to recognize a lexeme as a valid token for your lexer. If you consider a lexer to be a finite state
machine that accepts valid input strings, errors are any input strings that do not result in that finite state
machine reaching an accepting state.
Lexical errors occur during the lexical analysis phase, in which the program is converted
into a stream of tokens and identifiers are recognized by matching patterns.
A lexical error is a sequence of characters that does not match the pattern of any token.
Such errors are detected while the program is being scanned.
A lexical phase error can be:
A spelling error
Exceeding the length limit of an identifier or numeric constant
The appearance of an illegal character
Replacement of a character with an incorrect character
Transposition of two characters
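For illustration, the short C sketch below flags the appearance of an illegal character, one of
the errors listed above; the set of legal characters used here is deliberately simplified and
our own choice:
#include <stdio.h>
#include <ctype.h>
#include <string.h>

int main(void) {
    const char *line = "int a@ = 5;";  /* '@' is not a legal character */
    for (int i = 0; line[i] != '\0'; i++) {
        unsigned char c = (unsigned char)line[i];
        /* legal set here: letters, digits, and a few punctuation marks */
        if (!isalnum(c) && strchr(" =+-*/;(),_", c) == NULL)
            printf("lexical error: illegal character '%c' at column %d\n",
                   c, i + 1);
    }
    return 0;
}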
Syntactic Error
In computer science, a syntactic error is an error in the syntax of a sequence of characters or
tokens intended to be written in a specific programming language. This type of error appears
during the syntax analysis phase and is therefore detected at compile time.
Semantic Error
This type of error appears during the semantic analysis phase, and such errors are detected
during the compilation process. This is the phase where the declared identifiers are verified.
The majority of compile-time errors are scope and declaration errors, for example,
undeclared identifiers or multiply declared identifiers. Semantic errors can also occur when
an invalid variable or operator is used, or when operations are performed in the incorrect
order.
There can be different types of compilation errors depending on the program you’ve written.
Some examples of semantic errors are:
Operands of incompatible types
Variable not declared
The failure to match the actual argument with the formal argument
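The small C program below is a runnable sketch of these cases; each commented-out line,
if enabled, would produce a typical semantic (compile-time) error:
#include <stdio.h>

static int square(int n) { return n * n; }

int main(void) {
    int declared = 4;
    printf("%d\n", square(declared));

    /* Each line below, if uncommented, would produce a typical
       semantic (compile-time) error:

       undeclared = 5;     // undeclared variable
       int declared = 1;   // identifier declared twice in one scope
       square("four");     // actual argument does not match formal argument
    */
    return 0;
}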
One-Pass Compiler
A one-pass compiler reads the code only once and translates it at the same time. It passes
through the parts of each compilation unit only once, translating each part into its final
machine code. In a one-pass compiler, as each source line is processed, it is scanned and its
tokens are extracted. This is in contrast to a multi-pass compiler, which transforms the
program into one or more intermediate representations in steps between the source program
and the machine program, and which converts the whole compilation unit in each sequential
pass.
A one-pass compiler is fast, since all the compiler code is loaded into memory at once. It
can process the source text without the overhead of the operating system having to shut
down one process and start another. A one-pass compiler tends to impose some restrictions
on the program: constants, types, variables, and procedures must be defined before they are
used.
Multi-Pass Compiler
A multi-pass compiler can process the source code of a program multiple times. In the first
pass, the compiler can read the source code, scan it, extract the tokens and save the result in
an output file.
In the second pass, the compiler can read the output file produced by the first pass, build the
syntactic tree and implement the syntactical analysis. The output of this phase is a file that
includes the syntactical tree.
In the third pass, the compiler can read the output file produced by the second pass and
check whether the tree follows the rules of the language. The output of the semantic
analysis phase is the annotated syntax tree. The passes continue until the target output is
produced.
Comparison between One-Pass and Multi-Pass Compiler:
o A one-pass compiler reads the code only once and translates it at the same time, whereas
a multi-pass compiler reads the code multiple times, each time changing it into numerous
forms.
o One-pass compilers are faster; multi-pass compilers are slower, as more passes mean
more execution time.
o A one-pass compiler performs less efficient code optimization and code generation; a
multi-pass compiler performs better code optimization and code generation.
o A one-pass compiler is also called a "narrow compiler," as it has limited scope; a
multi-pass compiler is also called a "wide compiler," as it can scan every portion of the
program.
o A one-pass compiler requires large memory; in a multi-pass compiler, the memory
occupied by one pass can be reused by a subsequent pass, so the compiler needs less
memory.
Lexical Analysis
Lexical analysis is the starting phase of the compiler. It takes the modified source code,
written in the form of sentences, from the language preprocessor. The lexical analyzer is
responsible for breaking these sentences into a series of tokens, removing the whitespace in
the source code. If the lexical analyzer encounters an invalid token, it generates an error. It
reads the stream of characters, identifies the legal tokens, and passes the data to the syntax
analyzer when asked for it.
Terminologies
There are three terminologies-
Token
Pattern
Lexeme
Token: It is a sequence of characters that represents a unit of information in the source code.
Pattern: The rule that describes the form of a token is known as a pattern.
Lexeme: A sequence of characters in the source code, as per the matching pattern of a token,
is known as lexeme. It is also called the instance of a token.
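For example, in the statement int rate = 5; the lexeme rate matches the identifier pattern and
is reported as an identifier token, while the lexeme 5 matches the number pattern and is
reported as a number token (the token names here are illustrative).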
Roles of the lexical analyzer:
It removes the white spaces and comments from the source program.
It correlates error messages with the source program.
It helps to identify the tokens.
It reads the input characters from the source code.
Advantages of lexical analysis:
It helps browsers format and display a web page with the help of parsed data.
It is a necessary step toward creating a compiled binary executable.
It helps to create a more efficient and specialised processor for the task.
Disadvantages of lexical analysis:
It requires additional runtime overhead to generate the lexer table and construct the tokens.
Much effort is needed to debug and develop the lexer and its token descriptions.
Significant time is required to read the source code and partition it into tokens.
Lexical Analyzer: Input Buffering
Without buffering, the lexical analyzer would have to access secondary memory each time to
identify tokens, which is time-consuming and costly. So, the input string is stored in a buffer
and then scanned by the lexical analyzer.
The lexical analyzer scans the input string from left to right, one character at a time, to
identify tokens. It uses two pointers to scan tokens −
Begin Pointer (bptr) − It points to the beginning of the string to be read.
Look-Ahead Pointer (lptr) − It moves ahead to search for the end of the token.
Example − For the statement int a, b;
Both pointers start at the beginning of the string, which is stored in the buffer.
The character ("blank space") beyond the token ("int") has to be examined before the token
("int") can be determined.
After processing the token ("int"), both pointers are set to the start of the next token ('a'),
and this process is repeated for the whole program.
A buffer can be divided into two halves. If the look-ahead pointer reaches the end of the
first half, the second half is filled with new characters to be read. If the look-ahead pointer
reaches the right end of the second half, the first half is refilled with new characters, and
so on.
Sentinels − A sentinel is a special character (such as eof) placed at the end of each buffer
half. Each time the forward pointer is advanced, a single check against the sentinel shows
whether the end of a buffer half has been reached; if it has, the other half is reloaded.
Buffer Pairs − A specialized buffering technique can decrease the overhead needed to
process an input character. It uses two buffers, each of N characters, which are reloaded
alternately.
Two pointers, lexemeBegin and forward, are maintained. lexemeBegin points to the start of
the current lexeme being discovered, while forward scans ahead until a match for a pattern
is found. Once a lexeme is found, lexemeBegin is set to the character immediately after the
lexeme just found, and forward is set to the character at its right end.
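The C sketch below shows the reload logic of this buffer-pair scheme, assuming a
deliberately tiny half size N and using '\0' as the sentinel in place of an eof character; the
buffer layout and function names are our own:
#include <stdio.h>
#include <string.h>

#define N 8   /* size of each buffer half (deliberately tiny here) */

/* Reading from a string stands in for reading from a file. */
static const char *input = "int a, b; float rate;";
static size_t inpos = 0;

static char buf[2 * (N + 1)];  /* two halves, each with a sentinel slot */
static char *forward;

/* Reload one half of the buffer and place the sentinel after the data. */
static void fill(int half) {
    char *dst = buf + half * (N + 1);
    size_t n = strlen(input + inpos);
    if (n > N) n = N;
    memcpy(dst, input + inpos, n);
    inpos += n;
    dst[n] = '\0';             /* sentinel marks the end of this half */
}

/* Advance the forward pointer; one sentinel test per character decides
   whether a half must be reloaded. */
static char next_char(void) {
    if (*forward == '\0') {
        if (forward == buf + N) {                /* end of first half */
            fill(1);
            forward = buf + N + 1;
        } else if (forward == buf + 2 * N + 1) { /* end of second half */
            fill(0);
            forward = buf;
        } else {
            return '\0';                         /* true end of input */
        }
    }
    return *forward++;
}

int main(void) {
    fill(0);
    forward = buf;
    for (char c = next_char(); c != '\0'; c = next_char())
        putchar(c);
    putchar('\n');
    return 0;
}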
Bootstrapping:
Bootstrapping is a process in which a simple language is used to translate a more
complicated program, which in turn may handle an even more complicated program, and
so on.
Writing a compiler for any high-level language is a complicated process, and it takes a lot
of time to write one from scratch. Hence, a simple language is used to generate the target
code in stages. To clearly understand the bootstrapping technique, consider the following
scenario.
Suppose we want to write a cross compiler for a new language X. The implementation
language of this compiler is Y, and the target code being generated is in language Z; that is,
we create XYZ. Now, if an existing compiler for Y runs on machine M and generates code
for M, it is denoted as YMM. If we run XYZ using YMM, we get a compiler XMZ, that is,
a compiler for source language X that generates target code in language Z and runs on
machine M.
The following diagram illustrates the above scenario.
Example:
Compilers can be created in many different forms. Here we generate a compiler that takes
C language as input and produces assembly language as output, given a machine that runs
assembly language.
Step-1: First we write a compiler for a small subset of C (call it C0) in assembly language;
this is compiler 1.
Step-2: Then, using this subset C0, we write a compiler for the full C language; this is
compiler 2.
Step-3: Finally, we compile the second compiler: using compiler 1, compiler 2 is compiled.
Step-4: Thus we get a compiler written in ASM which compiles C and generates code
in ASM.
Compiler Construction Tools
The compiler writer can use some specialized tools that help in implementing the various
phases of a compiler. These tools assist in the creation of an entire compiler or its parts.
Some commonly used compiler construction tools include:
1. Parser Generator –
It produces syntax analyzers (parsers) from input based on a grammatical description of a
programming language or a context-free grammar. It is useful because the syntax analysis
phase is highly complex and would otherwise consume a great deal of manual effort and
time.
Example: Yacc, Bison
2. Scanner Generator –
It generates lexical analyzers from the input that consists of regular expression description based
on tokens of a language. It generates a finite automaton to recognize the regular expression.
Example: Lex
3. Syntax directed translation engines –
It generates intermediate code with three address format from the input that consists of a parse
tree. These engines have routines to traverse the parse tree and then produces the intermediate
code. In this, each node of the parse tree is associated with one or more translations.
4. Automatic code generators –
It generates the machine language for a target machine. Each operation of the intermediate
language is translated using a collection of rules and then is taken as an input by the code
generator. A template matching process is used. An intermediate language statement is replaced
by its equivalent machine language statement using templates.
5. Data-flow analysis engines –
These are used in code optimization. Data-flow analysis is a key part of code optimization
that gathers information about the values that flow from one part of a program to another.
6. Compiler construction toolkits –
It provides an integrated set of routines that aids in building compiler components or in the
construction of various phases of compiler.
LEX
LEX is a tool that automatically generates a lexical analyzer (a finite automaton). It takes a
LEX source program as its input and produces a lexical analyzer as its output. The lexical
analyzer then converts the input string entered by the user into tokens.
LEX is a program generator designed for lexical processing of character input streams. It
can be used to build anything from a simple text-search program that looks for patterns in
its input file to a component of a C compiler that transforms a program into optimized code.
Use of Lex
• lex.l is an input file written in a language that describes the generation of a lexical
analyzer. The lex compiler transforms lex.l into a C program known as lex.yy.c.
• lex.yy.c is compiled by the C compiler to a file called a.out.
• The output of the C compiler is the working lexical analyzer, which takes a stream of input
characters and produces a stream of tokens.
• yylval is a global variable which is shared by the lexical analyzer and the parser to return
the name and an attribute value of a token.
• The attribute value can be numeric code, pointer to symbol table or nothing.
• Another tool for lexical analyzer generation is Flex.
Structure of Lex Programs
A Lex program has the following form:
declarations
%%
translation rules
%%
auxiliary functions
Declarations: This section includes declarations of variables and constants, and regular
definitions.
Translation rules: This section contains regular expressions and code segments.
Form: Pattern {Action}
Pattern is a regular expression or regular definition.
Action refers to the segment of code to run when the pattern matches.
Auxiliary functions: This section holds additional functions that are used in the actions.
These functions are compiled separately and loaded with the lexical analyzer.
The lexical analyzer produced by Lex reads its input one character at a time until a valid
match for a pattern is found.
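As a concrete illustration of this structure, here is a minimal Lex specification sketch
(the patterns and actions are illustrative only) that counts the identifiers and numbers in its
input; it could be built with a command like lex count.l && cc lex.yy.c -o count (the file
name is assumed):
%{
/* Declarations section: C code and counters used by the actions. */
#include <stdio.h>
int ids = 0, nums = 0;
%}
%%
[a-zA-Z_][a-zA-Z0-9_]*   { ids++; }   /* identifier pattern and action */
[0-9]+                   { nums++; }  /* number pattern and action */
.|\n                     { ; }        /* ignore everything else */
%%
/* Auxiliary functions section. */
int yywrap(void) { return 1; }

int main(void) {
    yylex();
    printf("identifiers: %d, numbers: %d\n", ids, nums);
    return 0;
}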