Unit - 1

The document discusses compilers, which translate source code written in high-level programming languages into machine-readable object code. It describes the main components of a compiler - the lexer, parser, semantic analyzer, intermediate code generator, optimizer, and code generator. Compilers translate the entire source code at once before execution, while interpreters translate and execute code line-by-line. The document outlines the different phases and components of a compiler and their functions, such as parsing code with a context-free grammar and generating optimized machine code.


INTRODUCTION TO COMPILERS

A computer understands instructions in machine code, i.e. in the form of 0s and 1s. It is a monotonous task to write a computer program directly in machine code, so programs are mostly written in high-level languages like Java, C++ and Python; this text is called source code. Source code cannot be executed directly by the computer and must be converted into machine language. Hence, a special translator system software, called a language processor, is used to translate a program written in a high-level language into machine code; the translated program is called the object program (or object code).

So, in short:

A language processor is a special type of computer software that translates source code into machine code.

The language processors can be any of the following three types:

COMPILER

The language processor that reads the complete source program written in high level language as a whole in
one go and translates it into an equivalent program in machine language is called as a Compiler.

An important role of the compiler is to report any errors in the source program that it detects during the translation process. So, in a compiler, the source code is translated to object code successfully only if it is free of errors. When there are errors in the source code, the compiler reports them, with line numbers, at the end of compilation. The errors must be removed before the compiler can successfully recompile the source code.

Example: C, C++, Java.

Source Program --> COMPILER --> Target Program

Input --> TARGET PROGRAM --> Output

ASSEMBLER

The compiler may produce an assembly language program as its output, because assembly language is easier
to debug.

The assembler is used to translate a program written in assembly language into machine code. The source program, containing assembly language instructions, is the input to the assembler. The output generated by the assembler is the object code, or machine code, understandable by the computer.
INTERPRETER

An interpreter is a language processor that translates a single statement of the source program into machine code and executes it immediately, before moving on to the next line. If there is an error in a statement, the interpreter terminates its translating process at that statement and displays an error message. The interpreter moves on to the next line for execution only after removal of the error. An interpreter directly executes instructions written in a programming or scripting language without previously converting them to object code or machine code.

Example: Perl, Python and Matlab.

Source Program + Input --> INTERPRETER --> Output

A LANGUAGE PROCESSING SYSTEM

A source program may be divided into modules stored in separate files. The task of collecting the source program is sometimes entrusted to a separate program, called a preprocessor. The preprocessor may also expand shorthands, called macros, into source language statements. The modified source program is then fed to a compiler. The compiler may produce an assembly language program as its output, because assembly language is easier to produce as output and easier to debug. The assembly language is then processed by a program called an assembler that produces relocatable machine code as its output.

Large programs are often compiled in pieces, so the relocatable machine code may have to be linked together with other relocatable object files and library files into the code that actually runs on the machine.

The linker resolves external memory references, where the code in one file may refer to a location in another file. The loader then puts all the executable object files together into memory for execution.
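To make the pipeline concrete, here is a minimal sketch assuming a GCC-style toolchain (the file name is illustrative; the flags shown are the standard ones that stop the driver after each stage):

    /* pipeline_demo.c -- a small program for walking through the stages
     * of the language-processing system described above:
     *
     *   gcc -E pipeline_demo.c -o pipeline_demo.i   (preprocessor: expands #include and macros)
     *   gcc -S pipeline_demo.i -o pipeline_demo.s   (compiler: produces assembly language)
     *   gcc -c pipeline_demo.s -o pipeline_demo.o   (assembler: relocatable machine code)
     *   gcc pipeline_demo.o -o pipeline_demo        (linker/loader: executable program)
     */
    #include <stdio.h>

    #define GREETING "hello"   /* the preprocessor expands this macro */

    int main(void) {
        printf("%s, world\n", GREETING);
        return 0;
    }

Running the four commands in order produces the preprocessed source, the assembly file, the relocatable object file, and finally the executable, mirroring the preprocessor, compiler, assembler, and linker/loader stages.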

DIFFERENCE BETWEEN COMPILER AND INTERPRETER

COMPILER | INTERPRETER
A compiler converts the entire source code of a programming language into executable machine code for a CPU. | An interpreter takes a source program and runs it line by line, translating each line as it comes to it.
A compiler takes a large amount of time to analyze the entire source code, but the overall execution time of the program is comparatively faster. | An interpreter takes less time to analyze the source code, but the overall execution time of the program is slower.
A compiler generates its error messages only after scanning the whole program, so debugging is comparatively hard, as the error can be present anywhere in the program. | Debugging is easier, as the interpreter continues translating the program until the error is met.
Generates intermediate object code. | No intermediate object code is generated.
Examples: C, C++, Java | Examples: Python, Perl

A Hybrid Compiler

Source Program --> TRANSLATOR --> Intermediate Program

Intermediate Program + Input --> VIRTUAL MACHINE --> Output
A hybrid compiler combines compilation and interpretation; the Java language processor is an example. A Java source program is first compiled into an intermediate form called bytecode. The bytecode is then interpreted by a virtual machine.

THE STRUCTURE OF A COMPILER

A compiler's structure is based on two parts:

1. Analysis
2. Synthesis

The analysis part breaks up the source program into constituent pieces and imposes a grammatical structure on them. It then uses this structure to create an intermediate representation of the source program. If the analysis part detects that the source program is either syntactically ill formed or semantically unsound, then it must provide informative messages so the user can take corrective action.

The analysis part also collects information about the source program and stores it in a data structure called a symbol table, which is passed along with the intermediate representation to the synthesis part. The analysis part is also known as the front end of the compiler.

The synthesis part constructs the desired target program from the intermediate representation and the information in the symbol table. It is also known as the back end of the compiler.
PHASES OF COMPILER
Translation of an assignment statement

Lexical analyzer:

It is also called a scanner. It takes the output of the preprocessor (which performs file inclusion and macro expansion) as its input, which is in pure high-level language. It reads the characters from the source program and groups them into lexemes (sequences of characters that "go together"). Each lexeme corresponds to a token. Tokens are defined by regular expressions, which are understood by the lexical analyzer. It also strips out comments and white space and reports lexical errors (e.g., erroneous characters).
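As a rough sketch of how a scanner works, the following C program groups characters into identifier, number, and operator lexemes (the token names and the 64-character lexeme limit are illustrative choices, not part of any particular compiler):

    #include <ctype.h>
    #include <stdio.h>

    /* Illustrative token kinds; a real scanner distinguishes many more. */
    typedef enum { TOK_ID, TOK_NUM, TOK_OP, TOK_EOF } TokenKind;

    typedef struct {
        TokenKind kind;
        char lexeme[64];
    } Token;

    /* Read one token from src starting at *pos, skipping white space. */
    Token next_token(const char *src, size_t *pos) {
        Token t = { TOK_EOF, "" };
        size_t n = 0;
        while (isspace((unsigned char)src[*pos])) (*pos)++;  /* discard white space */
        char c = src[*pos];
        if (c == '\0') return t;
        if (isalpha((unsigned char)c) || c == '_') {         /* identifier: letter (letter|digit)* */
            t.kind = TOK_ID;
            while (n < 63 && (isalnum((unsigned char)src[*pos]) || src[*pos] == '_'))
                t.lexeme[n++] = src[(*pos)++];
        } else if (isdigit((unsigned char)c)) {              /* constant: digit+ */
            t.kind = TOK_NUM;
            while (n < 63 && isdigit((unsigned char)src[*pos]))
                t.lexeme[n++] = src[(*pos)++];
        } else {                                             /* single-character operator */
            t.kind = TOK_OP;
            t.lexeme[n++] = src[(*pos)++];
        }
        t.lexeme[n] = '\0';
        return t;
    }

    int main(void) {
        const char *src = "position = initial + rate * 60";
        size_t pos = 0;
        for (Token t = next_token(src, &pos); t.kind != TOK_EOF; t = next_token(src, &pos))
            printf("<%d, \"%s\">\n", t.kind, t.lexeme);
        return 0;
    }

Each printed pair corresponds to a token: the abstract token kind plus the lexeme that produced it.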

Syntax Analyzer:

It is sometimes called a parser. It constructs the parse tree: it takes the tokens one by one and uses a context-free grammar to construct the parse tree.

Why Grammar?

The syntactic rules of a programming language can be represented entirely by a few productions. Using these productions, the parser can determine what the program actually is. The input has to be checked for whether it is in the desired format or not.

The parse tree is also called the derivation tree. Parse trees are generally constructed to check for ambiguity
in the given grammar.
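To see how productions drive parsing, here is a minimal recursive-descent parser in C for arithmetic expressions (a sketch; the grammar below is a standard expression grammar with left recursion removed so it can be parsed top-down, and direct evaluation stands in for building an explicit parse tree):

    #include <stdio.h>

    /* Grammar:
     *   expr   -> term  (('+' | '-') term)*
     *   term   -> factor (('*' | '/') factor)*
     *   factor -> DIGIT | '(' expr ')'
     */
    static const char *p;          /* current position in the input */

    static int expr(void);

    static int factor(void) {
        if (*p == '(') {           /* factor -> '(' expr ')' */
            p++;
            int v = expr();
            if (*p == ')') p++;    /* a real parser would report a missing ')' */
            return v;
        }
        return *p++ - '0';         /* factor -> DIGIT */
    }

    static int term(void) {
        int v = factor();
        while (*p == '*' || *p == '/') {
            char op = *p++;
            int r = factor();
            v = (op == '*') ? v * r : v / r;
        }
        return v;
    }

    static int expr(void) {
        int v = term();
        while (*p == '+' || *p == '-') {
            char op = *p++;
            int r = term();
            v = (op == '+') ? v + r : v - r;
        }
        return v;
    }

    int main(void) {
        p = "2+3*(4-1)";
        printf("2+3*(4-1) = %d\n", expr());   /* prints 11 */
        return 0;
    }

Each production of the grammar corresponds to one function, and the pattern of calls made while parsing mirrors the shape of the parse tree.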

Semantic Analyzer:

It verifies whether the parse tree is meaningful, and it produces a verified parse tree. It also performs type checking, label checking and flow-control checking.

Intermediate Code Generator:

It generates intermediate code, a form that can be readily translated into target machine code. There are many popular intermediate codes; three-address code is one example. Intermediate code is converted to machine language by the last two phases, which are platform dependent.
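For instance, the assignment position = initial + rate * 60 might be translated into three-address code like this (t1 and t2 are compiler-generated temporary names; the exact form varies between compilers):

    t1 = rate * 60
    t2 = initial + t1
    position = t2

Each three-address instruction has at most one operator on the right-hand side, which is what makes the form easy to translate into machine instructions.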

Up to intermediate code generation, the phases are the same for every compiler; after that, they depend on the platform. To build a compiler for a new platform, one does not need to build it from scratch: the intermediate code from an already existing compiler can be reused, and only the last two phases need to be built.

Code Optimizer:

It transforms the code so that it consumes fewer resources and runs faster, without altering the meaning of the code being transformed. Optimization can be categorized into two types: machine dependent and machine independent.
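A small illustration in three-address code (the instruction sequences are hypothetical): a naive translation of position = initial + rate * 60, before and after machine-independent optimization.

    Before:
        t1 = 60
        t2 = rate * t1
        t3 = initial + t2
        position = t3

    After:
        t1 = rate * 60
        position = initial + t1

The optimizer uses the constant 60 directly in the multiplication and eliminates the copy through t3, so the program computes the same value with fewer instructions and temporaries.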
Target Code Generator:

The main purpose of the target code generator is to produce code that the machine can understand; it also handles register allocation, instruction selection, etc. The output depends on the type of assembler. This is the final stage of compilation. The optimized code is converted into relocatable machine code, which then forms the input to the linker and loader.

Symbol Table Management:

The symbol table is an important data structure created and maintained by compilers in order to store information about the occurrence of various entities such as variable names, function names, objects, classes, interfaces, etc. The symbol table is used by both the analysis and the synthesis parts of a compiler.

A symbol table may serve the following purposes depending upon the language in hand:

1. To store the names of all entities in a structured form in one place.
2. To verify if a variable has been declared.
3. To implement type checking, by verifying that assignments and expressions in the source code are semantically correct.
4. To determine the scope of a name (scope resolution).

A symbol table is simply a table which can be either linear or a hash table. It maintains an entry for each name
in the following format:

<symbol name, type, attribute>

For example, if a symbol table has to store information about the following variable declaration:

static int interest;

then it should store the entry such as:

<interest, int, static>
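A minimal sketch in C of a linear symbol table holding such entries (the fixed sizes and the insert/lookup interface are illustrative simplifications):

    #include <stdio.h>
    #include <string.h>

    /* One entry per name: <symbol name, type, attribute>. */
    typedef struct {
        char name[32];
        char type[16];
        char attribute[16];
    } Symbol;

    static Symbol table[256];   /* a linear table; compilers often use a hash table */
    static int count = 0;

    void insert(const char *name, const char *type, const char *attr) {
        strcpy(table[count].name, name);
        strcpy(table[count].type, type);
        strcpy(table[count].attribute, attr);
        count++;
    }

    /* Returns the entry for name, or NULL if it was never declared. */
    Symbol *lookup(const char *name) {
        for (int i = 0; i < count; i++)
            if (strcmp(table[i].name, name) == 0)
                return &table[i];
        return NULL;
    }

    int main(void) {
        insert("interest", "int", "static");   /* from: static int interest; */
        Symbol *s = lookup("interest");
        if (s) printf("<%s, %s, %s>\n", s->name, s->type, s->attribute);
        return 0;
    }

A failed lookup is exactly how the compiler detects the use of an undeclared variable (purpose 2 above).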

COMPILER-CONSTRUCTION TOOLS
THE EVOLUTION OF PROGRAMMING LANGUAGES
APPLICATIONS OF COMPILER TECHNOLOGY

There are various applications of compiler technology in several areas of computer science:

1. Implementation of High-Level Programming Languages

A high-level programming language defines a programming abstraction: the programmer expresses an algorithm using the language, and the compiler must translate that program to the target language. Generally, higher-level programming languages are easier to program in, but are less efficient; that is, the target programs run more slowly. Programmers using a low-level language have more control over a computation and can, in principle, produce more efficient code. Unfortunately, lower-level programs are harder to write and, worse still, less portable, more prone to errors, and harder to maintain.

The many shifts in the popular choice of programming languages have been in the direction of increased levels
of abstraction. C was the predominant systems programming language of the 80's; many of the new projects started in the 90's chose C++; Java, introduced in 1995, gained popularity quickly in the late 90's. The new programming-language features introduced in each round spurred new research in compiler optimization.

2. Optimizations for Computer Architectures

The rapid evolution of computer architectures has also led to an insatiable demand for new compiler
technology. Almost all high-performance systems take advantage of the same two basic techniques:
parallelism and memory hierarchies.

Parallelism can be found at several levels: at the instruction level, where multiple operations are executed simultaneously, and at the processor level, where different threads of the same application are run on different processors.

Memory hierarchies are a response to the basic limitation that we can build very fast storage or very large storage, but not storage that is both fast and large.

3. Design of New Computer Architectures

In the early days of computer architecture design, compilers were developed after the machines were
built. That has changed. Since programming in high-level languages is the norm, the performance of a
computer system is determined not only by its raw speed but also by how well compilers can exploit its features.
Thus, in modern computer architecture development, compilers are developed in the processor-design stage,
and compiled code, running on simulators, is used to evaluate the proposed architectural features.

RISC

One of the best-known examples of how compilers influenced the design of computer architecture was the
invention of the RISC (Reduced Instruction-Set Computer) architecture. Prior to this invention, the trend was
to develop progressively complex instruction sets intended to make assembly programming easier; these
architectures were known as CISC (Complex Instruction-Set Computer). For example, CISC instruction sets
include complex memory-addressing modes to support data-structure accesses and procedure-invocation
instructions that save registers and pass parameters on the stack.

Compiler optimizations often can reduce these instructions to a small number of simpler operations by
eliminating the redundancies across complex instructions. Thus, it is desirable to build simple instruction sets;
compilers can use them effectively and the hardware is much easier to optimize.

Most general-purpose processor architectures, including PowerPC, SPARC, MIPS, Alpha, and PA-RISC, are
based on the RISC concept.

Specialized Architectures

Over the last three decades, many architectural concepts have been proposed. They include data flow
machines, vector machines, VLIW (Very Long Instruction Word) machines, SIMD (Single Instruction,
Multiple Data) arrays of processors, systolic arrays, multiprocessors with shared memory, and multiprocessors
with distributed memory. The development of each of these architectural concepts was accompanied by the
research and development of corresponding compiler technology.

4. Program Translations

Compilation is not only a translation from a high-level language to the machine level; the same technology can be applied to translate between different kinds of languages. The following are some of the important applications of program-translation techniques.

Binary Translation

Compiler technology can be used to translate the binary code for one machine into binary code for another, allowing a machine to run programs originally compiled for another instruction set. Binary translation technology has been used by various computer companies to increase the availability of software for their machines; for example, it has allowed newer processors to run legacy MC 68040 code.

Hardware Synthesis

Hardware designs are typically described at the register transfer level (RTL), where variables represent
registers and expressions represent combinational logic. Hardware-synthesis tools translate RTL descriptions
automatically into gates, which are then mapped to transistors and eventually to a physical layout. Unlike
compilers for programming languages, these tools often take hours optimizing the circuit. Techniques to
translate designs at higher levels, such as the behaviour or functional level, also exist.

Database Query Interpreters

Query languages, especially SQL (Structured Query Language), are used to search databases. Database queries consist of predicates containing relational and Boolean operators. They can be interpreted or compiled into commands to search a database for records satisfying that predicate.
Compiled Simulation

Simulation is a general technique used in many scientific and engineering disciplines to understand a
phenomenon or to validate a design. Inputs to a simulator usually include the description of the design and
specific input parameters for that particular simulation run. Simulations can be very expensive.

Instead of writing a simulator that interprets the design, it is faster to compile the design to produce machine
code that simulates that particular design natively. Compiled simulation can run orders of magnitude faster
than an interpreter-based approach. Compiled simulation is used in many state-of-the-art tools that simulate
designs written in Verilog or VHDL.

5. Software Productivity Tools

Programs are arguably the most complicated engineering artifacts ever produced; they consist of many details,
every one of which must be correct before the program will work completely. As a result, errors are rampant
in programs; errors may crash a system, produce wrong results, render a system vulnerable to security attacks,
or even lead to catastrophic failures in critical systems. Testing is the primary technique for locating errors in
programs.

Type Checking

Type checking is an effective and well-established technique to catch inconsistencies in programs. It can be
used to catch errors, for example, where an operation is applied to the wrong type of object, or if parameters
passed to a procedure do not match the signature of the procedure. Program analysis can go beyond finding
type errors by analysing the flow of data through a program.
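For example, a C compiler's type checker rejects misuses like the ones commented out below (an illustrative sketch; the exact diagnostics vary by compiler):

    #include <stdio.h>

    int add(int a, int b) { return a + b; }

    int main(void) {
        int   n = 3;
        char *s = "four";
        (void)s;   /* silence the unused-variable warning */

        /* Both lines below are type errors the compiler catches:
         *   n = s;        error: assigning a char* to an int
         *   add(n, s);    error: parameter 2 expects int, got char*
         */
        printf("%d\n", add(n, n));
        return 0;
    }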

Bounds Checking

It is easier to make mistakes when programming in a lower-level language than in a higher-level one. For example, many security breaches in systems are caused by buffer overflows in programs written in C. Because C does not have array-bounds checks, it is up to the programmer to ensure that arrays are not accessed out of bounds. If a program fails to check that user-supplied data fits in a buffer, it may be tricked into storing that data outside the buffer. An attacker can then craft input data that causes the program to misbehave and compromises the security of the system. Techniques have been developed to find buffer overflows in programs, but with limited success.
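The problem, and the usual manual fix, look like this in C (a sketch; the buffer size and input are illustrative):

    #include <stdio.h>
    #include <string.h>

    int main(void) {
        char buf[8];
        const char *user_input = "a string longer than eight bytes";

        /* Unsafe: strcpy performs no bounds check, so user_input would
         * overflow buf and overwrite adjacent memory:
         *
         *   strcpy(buf, user_input);
         *
         * Safer: copy at most sizeof(buf)-1 bytes and terminate explicitly. */
        strncpy(buf, user_input, sizeof(buf) - 1);
        buf[sizeof(buf) - 1] = '\0';
        printf("%s\n", buf);
        return 0;
    }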

Memory Management Tools

Garbage collection is another excellent example of the trade-off between efficiency and a combination of ease
of programming and software reliability. Automatic memory management obliterates all memory management errors (e.g., "memory leaks"), which are a major source of problems in C and C++ programs.
Various tools have been developed to help programmers find memory management errors. For example,
Purify is a widely used tool that dynamically catches memory management errors as they occur.
ROLE OF LEXICAL ANALYZER AND INPUT BUFFERS

A Model of a compiler front end

Lexical Analyser

A lexical analyser reads characters from the input and groups them into "token objects".

In a compiler, the lexical analyser reads the characters of the source program, groups them into lexically meaningful units called lexemes, and produces as output tokens representing these lexemes. This stream of tokens is sent to the parser for syntax analysis. It is common for the lexical analyser to interact with the symbol table as well: when the lexical analyser discovers a lexeme constituting an identifier, it needs to enter that lexeme into the symbol table.

For example, given the statement position = initial + rate * 60, the lexical analyser groups the characters into the lexemes position, =, initial, +, rate, * and 60, and produces a token for each.

So, the main responsibilities of the lexical analyser are:

1. It eliminates comments and white space (blank, newline, tab and other characters that are used to separate tokens in the input).
2. It examines the lexemes one by one and reports any lexical errors that appear.
3. It correlates error messages generated by the compiler with the source program.
As another example, in most programming languages the following classes cover most or all of the tokens:

1. One token for each keyword.
2. Tokens for the operators.
3. One token representing all identifiers.
4. One or more tokens representing constants.
5. Tokens for each punctuation symbol, such as left and right parentheses, comma, and semicolon.

Lexical analysers are sometimes divided into a cascade of two processes:

1. Scanning: the simple processes that do not require tokenization of the input, such as deletion of comments and compaction of consecutive white-space characters into one.
2. Lexical analysis: the more complex portion, which produces tokens from the output of the scanner.

Lexical Analysis Versus Parsing


Token: A token consists of two components, a token name and an attribute value. The token name is an abstract symbol representing a kind of lexical unit: a particular keyword, or an identifier.

Pattern: A description of the form that the lexemes of a token may take. For a keyword, the pattern is just the keyword itself; for identifiers and numbers it is usually given as a regular expression.

Lexeme: A sequence of characters in the source program that matches the pattern for a token and is identified by the lexical analyser as an instance of that token.

Input Buffering:
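Reading the source one character at a time is slow, so the lexical analyser reads large blocks of the source into a buffer. In the standard two-buffer scheme, the buffer is divided into two halves of N characters each, and a sentinel character marks the end of each half; two pointers, lexemeBegin and forward, delimit the lexeme being scanned, and the sentinel lets the scanner test for a buffer boundary and for end of input with a single comparison per character. A rough sketch in C (buffer size, sentinel choice, and function names are illustrative):

    #include <stdio.h>

    #define N 4096                   /* size of each buffer half */
    #define SENTINEL '\0'            /* stands in for the textbook eof mark */

    static char buffer[2 * (N + 1)]; /* two halves, each with a sentinel slot */
    static char *forward = buffer;   /* scans ahead to find the end of a lexeme */

    /* Reload one half from the source file and re-plant its sentinel. */
    static void reload(FILE *src, char *half) {
        size_t n = fread(half, 1, N, src);
        half[n] = SENTINEL;
    }

    /* Advance the scanning pointer with a single sentinel test per character.
     * Call reload(src, buffer) once before the first call, to prime the
     * first half. */
    static int next_char(FILE *src) {
        char c = *forward++;
        if (c != SENTINEL) return c;
        if (forward == buffer + N + 1) {           /* ran off the first half */
            reload(src, buffer + N + 1);           /* forward already points there */
            return next_char(src);
        }
        if (forward == buffer + 2 * (N + 1)) {     /* ran off the second half */
            reload(src, buffer);
            forward = buffer;
            return next_char(src);
        }
        return EOF;                                /* sentinel mid-half: end of input */
    }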
INTRODUCTION TO REGULAR EXPRESSIONS
Alphabet:
An alphabet is any finite set of symbols, such as letters, digits, and punctuation. The set {0, 1} is the binary alphabet.
String:
A string over an alphabet is a finite sequence of symbols drawn from that alphabet. The length of a string s, written |s|, is the number of occurrences of symbols in s. The empty string, denoted ε, is the string of length zero.
Language:
A language is any countable set of strings over some fixed alphabet.
Operations on languages:
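For languages L and M, the most important operations are:

1. Union: L ∪ M = { s | s is in L or s is in M }
2. Concatenation: LM = { st | s is in L and t is in M }
3. Kleene closure: L*, the set of strings formed by concatenating zero or more strings of L
4. Positive closure: L+, the same but with one or more strings of L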
Regular expression:
In general:
1. A regular expression is a way of representing a regular language.
2. It is an expression built from strings and operators.
3. A regular language is one generated by a regular grammar.
4. A language is said to be regular if there exists a finite automaton that accepts it.
5. A language is said to be regular if there exists a regular expression that represents it.
Formally, the regular expressions over an alphabet are defined inductively: ε is a regular expression, each symbol a of the alphabet is a regular expression, and if r and s are regular expressions then so are (r)|(s) (union), (r)(s) (concatenation) and (r)* (Kleene closure). In simple terms, a regular expression is a compact notation for the set of strings that make up a regular language.
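As a practical illustration, the POSIX regex library in C can test lexemes against the usual identifier pattern (the pattern and test strings here are illustrative):

    #include <regex.h>
    #include <stdio.h>

    int main(void) {
        regex_t re;
        /* letter (letter | digit)*: the classic identifier pattern */
        const char *pattern = "^[A-Za-z_][A-Za-z0-9_]*$";
        const char *tests[] = { "rate", "x1", "60", "_tmp", "9lives" };

        if (regcomp(&re, pattern, REG_EXTENDED) != 0) return 1;
        for (int i = 0; i < 5; i++)
            printf("%-8s %s\n", tests[i],
                   regexec(&re, tests[i], 0, NULL, 0) == 0 ? "identifier" : "no match");
        regfree(&re);
        return 0;
    }

This is the same idea a lexical-analyzer generator such as Lex automates: each token is specified by a regular expression, and the generator produces the matching code.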
TRANSITION DIAGRAMS AND LEX
SPECIFICATION AND RECOGNITION OF TOKENS
FINITE AUTOMATA
DESIGN OF A LEXICAL ANALYZER GENERATOR
