
Overview Of Language Processors
Presented by: Foram Thakor
Overview
● Language Processors
● Language processing systems
● Language processing activities
● Fundamentals of language processing
● Toy compiler
● Search data structure
● Allocation data structure
Language Processors
● A language processor is software that bridges a specification gap or an execution gap.
● The program given as input to a language processor is referred to as the source program, and its output as the target program.
● A language translator bridges an execution gap to the machine language of a computer system.
● In short, language processors are tools that translate human-readable code into machine-executable instructions.
Language Processing Systems
● Any computer system is made of hardware and software. The hardware understands only machine language, which humans cannot easily understand.
● So we write programs in a high-level language, which is easier for us to understand and remember.
● These programs are then fed into a series of tools and OS components to get the desired code that can be used by the machine.
● This is known as a Language Processing System.
● The high-level language is converted into binary language in various phases.
Language Processing Systems
● A compiler is a program that converts high-level language to assembly
language.
● Similarly, an assembler is a program that converts the assembly language to
machine-level language.
● A linker tool is used to link all the parts of the program together for execution
(executable machine code).
● A loader loads all of them into memory and then the program is executed.
Language Processing Systems
Different kinds of Language Processors are as follows:
1. Preprocessor
● A preprocessor, generally considered a part of the compiler, is a tool that produces input for the compiler.
● It processes source code before it is compiled.
● Tasks:
○ Macro Expansion: Replaces macros with their definitions.
○ File Inclusion: Incorporates the content of other files.
○ Conditional Compilation: Compiles code based on certain conditions.
● Use: Simplifies coding by allowing code reuse and conditional compilation.
● Example Languages: C, C++ (using directives like #include, #define), as illustrated below.
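A minimal sketch of what the C preprocessor does. The header name util.h and the macros are made up for illustration; the comments show (roughly) the text the compiler receives after file inclusion, macro expansion, and conditional compilation.

/* File: util.h (hypothetical header, shown only to illustrate file inclusion) */
#define MAX_SIZE 100

/* File: main.c (source as written by the programmer) */
#include "util.h"              /* file inclusion: the text of util.h is pasted in here */
#define SQUARE(x) ((x) * (x))  /* macro definition */
#define DEBUG 1

int main(void) {
    int area = SQUARE(MAX_SIZE);   /* macro expansion: becomes ((100) * (100)) */
#if DEBUG                          /* conditional compilation: this block is kept */
    area = area + 0;               /* only because DEBUG is non-zero */
#endif
    return area > 0 ? 0 : 1;
}

/* After preprocessing (roughly), the compiler sees:
 *   int main(void) {
 *       int area = ((100) * (100));
 *       area = area + 0;
 *       return area > 0 ? 0 : 1;
 *   }
 */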
Language Processing Systems
2. Compiler

● A compiler reads the whole source code at once, creates tokens, checks semantics, generates intermediate code, translates the whole program, and may involve many passes.
● It translates high-level programming languages (e.g., C, C++, Java) into machine code.
● Tasks: Lexical Analysis, Syntax Analysis, Semantic Analysis, Optimization, Code Generation.
Language Processing Systems
3. Interpreter

● An interpreter, like a compiler, translates high-level language into low-level machine language.
● An interpreter reads a statement from the input, converts it to an intermediate code, executes it, then takes the next statement in sequence.
● If an error occurs, an interpreter stops execution and reports it, whereas a compiler reads the whole program even if it encounters several errors.
● It executes high-level code line by line without converting it into machine code beforehand.
Language Processing Systems
4. Assembler

● An assembler translates assembly language programs into machine code.


● The output of an assembler is called an object file, which contains a
combination of machine instructions as well as the data required to place
these instructions in memory.
● Used in system programming for tasks requiring hardware control.
Language Processing Systems
5. Linker

● A linker is a computer program that links and merges various object files together in order to make an executable file.
● All these files might have been compiled by separate assemblers.
● The major task of a linker is to search and locate referenced modules/routines in a program and to determine the memory locations where this code will be loaded, making the program instructions have absolute references.
Language Processing Systems
6. Loader
● The loader is a part of the operating system and is responsible for loading executable files into memory and executing them.
● It calculates the size of a program (instructions and data) and creates memory space for it. It initializes various registers to initiate execution.
7. Cross-compiler
● A compiler that runs on platform (A) and is capable of generating executable code for platform (B) is called a cross-compiler.
8. Source-to-source Compiler
● A compiler that takes the source code of one programming language and translates it into the source code of another programming language is called a source-to-source compiler.
Language Processing Systems
Language Processing Activities
● Language processing activities arise due to the difference between the manner in which a software designer describes ideas and the manner in which those ideas are implemented.
● The designer expresses the ideas in terms related to the application domain of the software.
● To implement these ideas, their description has to be interpreted in terms related to the execution domain of the computer system.
● We use the term "Semantics" to represent the "rules of the domain".
● The term "Semantic Gap" represents the difference between the semantics of the two domains.
Language Processing Activities
● The semantic gap has many consequences, some of the important ones
being:
○ Large Development Cycles (time)
○ Large Development Efforts
○ Poor-quality software
● These issues can be tackled by using Software Engineering steps:
○ Specification, Design and coding steps
○ PL implementation
Language processors after the introduction of PL domain:
Language Processing Activities
● There are mainly two types of language processing activities which bridge the semantic gap between the source language and the target language:
1. Program generation activities
2. Program execution activities
● A program generation activity aims at automatic generation of a program.
● Its source language is a specification language of an application domain and the target language is typically a procedure-oriented PL.
● A program execution activity organizes the execution of a program written in a PL on a computer system.
● Its source language could be a procedure-oriented language or a problem-oriented language.
Program Generation Activities
● This activity generates a program from its specification. The program generation activity bridges the specification gap.
● A program generator is a software system that accepts the specification of the program to be generated and generates the program in a target programming language.
● The program generator introduces a new domain between the application domain and the programming language domain, called the program generator domain.
Program Execution Activities
● This activity aims at bridging the execution gap by organizing the execution of
a program written in a programming language on a computer system.
● A program may be executed through Translation and Interpretation.
Working:
● This activity involves the actual execution of the program generated by the program generator. The program execution activity bridges the execution gap.
● The program execution activity involves loading the program into the memory
of the computer system, interpreting the program instructions, and executing
them on the computer system.
Program Execution Activities (Contd.)
● During program execution, the computer system reads the program
instructions from memory and executes them one by one, performing the
necessary computations and producing the desired results.
● The program execution activity involves various components of the computer
system, including the processor, memory, input/output devices, and other
system resources required to execute the program.

In summary, the program generation activity generates a program from its specification, while the program execution activity involves the actual execution of the program on the computer system.
Program Execution Activities (Contd.)
1. Program Translation:
● The program translation model bridges the execution gap by translating a
program written in a programming language called the source program into an
equivalent program in the machine language called the target program.
● A program must be translated before it can be executed.
● A translated program may be saved in a file, so that the saved program can be executed repeatedly.
Program Translation
Program Execution Activities (Contd.)
2. Program Interpretation
● The interpreter reads the source program and stores it in its memory. During interpretation it takes a statement, determines its meaning, and performs actions that implement it. This includes computational and input-output actions.
● The CPU uses a program counter (PC) to note the address of the next instruction to
be executed.
● This Instruction is subjected to the instruction execution cycle consisting of the
following steps:
1. Fetch the instruction.
2. Decode the instruction to determine the operation to be performed, and also its
operands.
3. Execute the Instruction
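A minimal sketch of the fetch-decode-execute cycle described above, assuming a tiny hypothetical instruction set (ADD, PRINT, HALT) held in an array; it is not any particular interpreter, just an illustration of a loop driven by a program counter.

#include <stdio.h>

/* Hypothetical instruction set, for illustration only. */
typedef enum { ADD, PRINT, HALT } OpCode;
typedef struct { OpCode op; int operand; } Instr;

int main(void) {
    /* A tiny "program": acc += 2; acc += 3; print; halt. */
    Instr program[] = { {ADD, 2}, {ADD, 3}, {PRINT, 0}, {HALT, 0} };
    int acc = 0;      /* accumulator */
    int pc  = 0;      /* program counter: index of the next instruction */

    for (;;) {
        Instr i = program[pc++];        /* 1. fetch  */
        switch (i.op) {                 /* 2. decode */
            case ADD:   acc += i.operand; break;            /* 3. execute */
            case PRINT: printf("acc = %d\n", acc); break;
            case HALT:  return 0;
        }
    }
}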
Program Interpretation
Fundamentals of Language Processing

Phases
● Phases refer to the logical stages in the process of translating source code to executable code.
● They break down the compilation process into manageable and modular tasks.
● Each phase focuses on a specific aspect of compilation, such as syntax or semantics.
● The modular approach makes debugging and maintenance easier.

Passes
● Passes refer to the number of times the compiler scans the entire source code or intermediate representation.
● They allow for thorough analysis and optimization of the code.
● Each pass may involve multiple phases or focus on specific optimization tasks.
● Multi-pass compilers are generally more efficient in terms of the final code quality.
Fundamentals of Language Processing (Contd.)
Language Processor = Analysis of SP + Synthesis of TP
Analysis Phase:
● Known as the front-end of the compiler, the analysis phase of the compiler
reads the source program, divides it into core parts and then checks for
lexical, grammar and syntax errors.
● The analysis phase generates an intermediate representation of the source
program and symbol table, which should be fed to the Synthesis phase as
input.
● It is responsible for understanding the structure and meaning of the source
code.
Analysis Phase
● This phase can be further divided into several sub-phases:
1. Lexical Analysis:
● Purpose: Convert the sequence of characters in the source code into tokens.
● Activities: Scanning and tokenizing the source code.
● Output: Tokens (basic syntactic units like keywords, identifiers, literals, operators).
● Example: Converting int a = 5; into tokens [int, a, =, 5, ;].
● The lexical analyzer builds a descriptor, called a token, for each lexeme and represents it as:
■ <token-name, attribute-value>
● We represent a token as the entry "Code #no", where "Code" is Id or Op for an identifier or operator respectively, and "no" indicates the entry number of the identifier or operator in the symbol table or operator table.
Example of Lexical Analysis
● Consider following code:

i: integer;

a, b: real;

a= b + i;

The statement a= b+i is represented as a string of token

a = b + i

Id#1 Op#1 Id#2 Op#2 Id#3
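A rough sketch of how a scanner might turn the statement a = b + i into the Id/Op tokens shown above. It assumes the identifiers are numbered in the order a, b, i and that the operator table holds = and + (matching the string above); the names and structure are illustrative, not the toy compiler's actual code.

#include <stdio.h>
#include <string.h>
#include <ctype.h>

/* Illustrative symbol and operator tables, already filled for this example. */
static const char *symtab[] = { "a", "b", "i" };   /* Id#1, Id#2, Id#3 */
static const char  optab[]  = { '=', '+' };        /* Op#1, Op#2       */

static int sym_index(const char *name) {           /* 1-based entry number */
    for (int k = 0; k < 3; k++)
        if (strcmp(symtab[k], name) == 0) return k + 1;
    return 0;
}

int main(void) {
    const char *src = "a = b + i";
    for (const char *p = src; *p; p++) {
        if (isspace((unsigned char)*p)) continue;      /* skip blanks        */
        if (isalpha((unsigned char)*p)) {              /* scan an identifier */
            char name[16]; int n = 0;
            while (isalpha((unsigned char)*p)) name[n++] = *p++;
            name[n] = '\0'; p--;
            printf("Id#%d ", sym_index(name));
        } else {                                       /* scan an operator   */
            for (int k = 0; k < 2; k++)
                if (optab[k] == *p) printf("Op#%d ", k + 1);
        }
    }
    printf("\n");    /* prints: Id#1 Op#1 Id#2 Op#2 Id#3 */
    return 0;
}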


Analysis Phase(Contd.)
2. Syntax Analysis (Parsing):

● Purpose: Determine the grammatical structure of the source code.


● Activities: Constructing a parse tree from the tokens.
● Output: Parse tree or abstract syntax tree (AST).
● Example: Verifying that int a = 5; follows the rules of the language's grammar.
● Syntax analysis processes the string of tokens to determine its grammatical structure and builds an intermediate representation of that structure.
● A tree structure is used to represent this intermediate code.
Example of Syntax Analysis
● Consider how the statement a = b + i can be represented in tree form, as sketched below:
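The tree diagram from the original slide is not reproduced in this text version; roughly, it has '=' at the root, as shown in the comment below. The Node struct is only one illustrative way to represent such parse-tree nodes, not the toy compiler's actual definition.

#include <stdio.h>

/* Parse tree for  a = b + i  (sketch of the slide's figure):
 *        =
 *       / \
 *      a   +
 *         / \
 *        b   i
 */
typedef struct Node {
    char         label;            /* '=', '+', or an identifier name */
    struct Node *left, *right;     /* children (NULL for leaves)      */
} Node;

int main(void) {
    Node a = {'a', 0, 0}, b = {'b', 0, 0}, i = {'i', 0, 0};
    Node plus   = {'+', &b, &i};
    Node assign = {'=', &a, &plus};
    /* An in-order walk reproduces the source expression: a = b + i */
    printf("%c %c %c %c %c\n", assign.left->label, assign.label,
           plus.left->label, plus.label, plus.right->label);
    return 0;
}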
Analysis Phase(Contd.)
3. Semantic Analysis:
● Purpose: Ensure the code makes logical sense.
● Activities: Type checking, scope resolution, and verifying that operations are
semantically valid.
● Output: Annotated syntax tree.
● Example: Checking that variables are declared before use and that operations are
performed on compatible types.
● While processing a declaration statement, it adds information concerning the type,
length and dimensionality of a symbol to the symbol table.
● While processing an imperative statement, it determines the sequence of actions
that would have to be performed for implementing the meaning of the statement and
represents them in the intermediate code.
Example of Semantic Analysis
● Consider the tree structure for the statement a = b + i.
● If a node is an operand, the type of the operand is added to the descriptor field of that operand. While evaluating the expression, the type of b is real and the type of i is int, so i is converted to real (denoted i*), as sketched below.
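A minimal sketch of the type rule applied above: when an int operand meets a real operand, the int side is converted and the result is real. The names (result_type and the Type enum) are assumptions made for this illustration.

#include <stdio.h>

typedef enum { T_INT, T_REAL } Type;

/* Semantic rule for '+': if the operand types differ, the int operand is
 * converted to real (the i -> i* step above) and the result is real. */
static Type result_type(Type lhs, Type rhs) {
    if (lhs == T_REAL || rhs == T_REAL) return T_REAL;
    return T_INT;
}

int main(void) {
    Type b = T_REAL, i = T_INT;
    Type sum = result_type(b, i);
    printf("b + i has type %s\n", sum == T_REAL ? "real" : "int");  /* real */
    if (i != sum) printf("insert conversion: i -> i*\n");
    return 0;
}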
Analysis Phase(Contd.)
4. Intermediate Code(IC) Generation:

● Purpose: Convert the high-level source code into an intermediate


representation (IR).
● Activities: Generating a platform-independent intermediate code.
● Output: Intermediate code.
● Example: Transforming int a = 5; into an intermediate form like three-address
code.
● IR contains intermediate code and table.
Example of Intermediate Code Generation:
Symbol table:

Intermediate code:
Convert (id#1) to real, giving (id#4)
Add(id#4) to (id#3), giving (id#5)
Store (id#5) in (id#2)
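A sketch of one way the three IC instructions above could be held in memory, using a quadruple-like record (operation plus operand entry numbers). The field names and the Quad layout are assumptions for illustration, not the toy compiler's exact format.

#include <stdio.h>

/* Each IC instruction: an operation plus symbol-table entry numbers (0 = unused). */
typedef enum { CONVERT, ADD, STORE } ICOp;
typedef struct { ICOp op; int arg1; int arg2; int result; } Quad;

int main(void) {
    /* The intermediate code shown above for  a = b + i : */
    Quad ic[] = {
        { CONVERT, 1, 0, 4 },   /* Convert (id#1) to real, giving (id#4) */
        { ADD,     4, 3, 5 },   /* Add (id#4) to (id#3), giving (id#5)   */
        { STORE,   5, 2, 0 },   /* Store (id#5) in (id#2)                */
    };
    const char *names[] = { "CONVERT", "ADD", "STORE" };
    for (int k = 0; k < 3; k++)
        printf("%s id#%d id#%d id#%d\n",
               names[ic[k].op], ic[k].arg1, ic[k].arg2, ic[k].result);
    return 0;
}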
What is the need of Intermediate Code?
● If source code can be translated directly into its target machine code, why do we need to translate it into an intermediate code first, which is then translated into the target code? Let us see the reasons why we need an intermediate code.
● If a compiler translates the source language to its target machine language without
having the option for generating intermediate code, then for each new machine, a
full native compiler is required.
● Intermediate code eliminates the need of a new full compiler for every unique
machine by keeping the analysis portion same for all the compilers.
● The second part of compiler, synthesis, is changed according to the target machine.
● It becomes easier to apply the source code modifications to improve code
performance by applying code optimization techniques on the intermediate code.
Synthesis Phase
● The synthesis phase, also known as the back-end of a compiler, is
responsible for translating the intermediate representation of a program into
machine code that can be executed by the target processor.
Synthesis Phase (Contd.)
● The key tasks it performs are as follows:
1. Code Optimization
● Purpose: Improve the intermediate code to enhance performance and reduce resource usage.
● Activities: Removing redundant code, optimizing loops, and enhancing
resource utilization.
● Output: Optimized intermediate code.
● Example: Eliminating dead code and simplifying expressions to reduce the
number of instructions.
Synthesis Phase(Contd.)
2. Code Generation
● Purpose: Translate the optimized intermediate code into machine code.
● Activities:
○ Instruction Selection: Mapping intermediate instructions to the target machine
instructions.
○ Register Allocation: Assigning variables and temporary values to machine registers.
○ Instruction Scheduling: Reordering instructions to minimize pipeline stalls and
improve execution speed.
● Output: Machine code instructions.
● Example: Converting an intermediate representation of an addition operation
to the appropriate ADD machine instruction.
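A hedged sketch of the instruction-selection step: it walks the quadruple-style IC from the earlier example and prints target instructions. CONV_R, ADD_R, MOVEM and the single accumulator AREG are made-up names used only for this illustration, not a real instruction set.

#include <stdio.h>

typedef enum { CONVERT, ADD, STORE } ICOp;
typedef struct { ICOp op; int arg1; int arg2; int result; } Quad;

/* Map one IC operation to a hypothetical target instruction, assuming a single
 * accumulator register AREG that holds the current intermediate result. */
static void gen(Quad q) {
    switch (q.op) {
        case CONVERT: printf("CONV_R AREG, id#%d\n", q.arg1); break;
        case ADD:     printf("ADD_R  AREG, id#%d\n", q.arg2); break;
        case STORE:   printf("MOVEM  AREG, id#%d\n", q.arg2); break;
    }
}

int main(void) {
    Quad ic[] = { {CONVERT,1,0,4}, {ADD,4,3,5}, {STORE,5,2,0} };
    for (int k = 0; k < 3; k++) gen(ic[k]);   /* prints three target instructions */
    return 0;
}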
Synthesis Phase(Contd.)
3. Creation of Data Structures

● Purpose: Set up runtime structures required during program execution.


● Activities:
○ Symbol Table Management: Storing information about variables, functions, scopes,
and types.
○ Activation Records: Managing function calls and local variables using stack frames.
○ Control Flow Graphs (CFG): Representing the flow of control within the program.
● Output: Symbol tables, activation records, and CFGs.
● Example: Creating an activation record for a function call to manage local
variables and return address.
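A hedged sketch of what an activation record (stack frame) might hold for a function call; real layouts are machine- and compiler-specific, so the fields below are only representative, not any particular ABI.

#include <stdio.h>

/* Representative fields of an activation record; actual layouts vary. */
typedef struct ActivationRecord {
    void *return_address;      /* where to resume in the caller          */
    void *dynamic_link;        /* pointer to the caller's record         */
    int   parameters[4];       /* incoming arguments (fixed for sketch)  */
    int   locals[8];           /* local variables of the callee          */
} ActivationRecord;

int main(void) {
    ActivationRecord frame = { 0 };
    frame.parameters[0] = 42;                   /* pass one argument */
    frame.locals[0] = frame.parameters[0] + 1;  /* callee uses a local */
    printf("local = %d\n", frame.locals[0]);    /* 43 */
    return 0;
}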
Intermediate Representation (IR)
● An IR serves as a bridge between the source code and the target machine code, providing an abstraction that enables various optimizations and transformations.
● Intermediate code can be represented in a variety of ways, each with its own benefits.
● High-Level IR
○ A high-level intermediate representation is very close to the source language itself. It can be easily generated from the source code, and we can easily apply code modifications to it to enhance performance, but it is less preferred for target machine optimization.
● Low-Level IR
○ This one is close to the target machine, which makes it suitable for register and memory allocation, instruction set selection, etc. It is good for machine-dependent optimization.
Toy Compiler
Toy Compiler (Contd.) – Frontend
● It includes three kinds of analysis:
1. Lexical analysis
2. Syntax analysis
3. Semantic analysis
● Functions of the Front End:
1. Determine the validity of a source statement from the viewpoint of the analysis.
2. Determine the 'content' of a source statement.
3. Construct a suitable representation of the source statement for use by subsequent analysis functions or by the synthesis phase of the language processor.
● The result is stored in the following forms:
1. Table of information
2. IC: description of the source statement
● Output of the Front End:
● IR: Intermediate Representation, which consists of two things: 1. Table of information. 2. Intermediate Code (IC).
Toy Compiler (Contd.) – Frontend
Toy Compiler (Contd.) – Backend
● The main tasks it performs:
1. Memory Allocation
● The memory requirement of an identifier is computed from its size, length, type, and dimensionality, and then memory is allocated to it.
● The address of the memory area is then entered in the symbol table.
2. Code Generation
● Uses knowledge of the target architecture:
1. Knowledge of the instruction set
2. Addressing modes of the target computer
● Important issues affecting code generation:
○ Determine where intermediate results should be kept, i.e. in memory or in a register.
○ Determine which instructions should be used for type conversion operations.
○ Determine which addressing mode should be used for accessing a variable.
Toy Compiler (Contd.) – Backend
Symbol Table
● Symbol table is an important data structure created and maintained by
compilers in order to store information about the occurrence of various entities
such as variable names, function names, objects, classes, interfaces, etc.
● Symbol table is used by both the analysis and the synthesis parts of a
compiler.
● A language processor uses the symbol table to maintain the information about
attributes of symbols used in a source program.
Symbol Table (Contd.)
● It performs the following four kinds of operations on the symbol table:
1. Add a symbol and its attributes: Make a new entry in the symbol table.
2. Locate a symbol’s entry: Find a symbol’s entry in the symbol table.
3. Delete a symbol’s entry: Remove the symbol’s information from the table.
4. Access a symbol’s entry: Access the entry and set, modify or copy its
attribute information.
● The symbol table consists of a set of entries organized in memory.
Symbol Table (Contd.)
● A symbol table may serve the following purposes depending upon the language in
hand:
1. To store the names of all entities in a structured form at one place.
2. To verify if a variable has been declared.
3. To implement type checking, by verifying assignments and expressions in the
source code are semantically correct.
4. To determine the scope of a name (scope resolution).
● A symbol table is simply a table which can be either linear or a hash table. It
maintains an entry for each name in the following format:
■ <symbol name, type, attribute>
Symbol Table (Contd.)
For example,

if a symbol table has to store information about the following variable declaration:
static int interest;

then it should store the entry such as:

<interest, int, static>

The attribute clause contains the entries related to the name.


Implementation
● If a compiler has to handle only a small amount of data, then the symbol table can be implemented as an unordered list, which is easy to code but suitable only for small tables.
● A symbol table can be implemented in one of the following ways:
○ Linear (sorted or unsorted) list
○ Binary Search Tree
○ Hash table
● Among all, symbol tables are mostly implemented as hash tables, where the
source code symbol itself is treated as a key for the hash function and the
return value is the information about the symbol.
Operations – insert()
● A symbol table, either linear or hash, should provide the following operations.
1. insert()
● This operation is more frequently used by analysis phase, i.e., the first half of
the compiler where tokens are identified and names are stored in the table.
● This operation is used to add information in the symbol table about unique
names occurring in the source code.
● The format or structure in which the names are stored depends upon the
compiler in hand.
● An attribute for a symbol in the source code is the information associated with
that symbol.
Operations – insert() (Contd.)
● This information contains the value, state, scope, and type of the symbol. The insert() function takes the symbol and its attributes as arguments and stores the information in the symbol table.
● For example:

int a;

should be processed by the compiler as:

insert(a, int);
Operations – lookup()
● lookup() operation is used to search a name in the symbol table to determine:
○ if the symbol exists in the table.
○ if it is declared before it is being used.
○ if the name is used in the scope.
○ if the symbol is initialized.
○ if the symbol is declared multiple times.
● The format of lookup() function varies according to the programming language.
● The basic format should match the following: lookup(symbol)
● This method returns 0 (zero) if the symbol does not exist in the symbol table.
● If the symbol exists in the symbol table, it returns its attributes stored in the table.
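A minimal linear-list symbol table illustrating the insert() and lookup() operations described above. The entry format (a name and a type string only) and the return conventions are a simplification made for this sketch.

#include <stdio.h>
#include <string.h>

#define MAX_SYMS 100

typedef struct { char name[32]; char type[16]; } Entry;

static Entry table[MAX_SYMS];
static int   count = 0;

/* lookup(symbol): returns the 1-based entry number, or 0 if the symbol is absent. */
static int lookup(const char *name) {
    for (int k = 0; k < count; k++)
        if (strcmp(table[k].name, name) == 0) return k + 1;
    return 0;
}

/* insert(symbol, attributes): adds a new entry if the name is not already present. */
static int insert(const char *name, const char *type) {
    if (lookup(name)) return 0;              /* already declared           */
    strcpy(table[count].name, name);
    strcpy(table[count].type, type);
    return ++count;                          /* entry number of new symbol */
}

int main(void) {
    insert("a", "int");                      /* e.g. processing  int a;    */
    printf("a -> entry %d\n", lookup("a"));  /* found: prints 1            */
    printf("x -> entry %d\n", lookup("x"));  /* not found: prints 0        */
    return 0;
}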
Data Structures for Language Processing
● Two kinds of data structures can be used for organizing its entries:
● Linear data structure: Entries in the symbol table occupy adjoining areas
of memory. This property is used to facilitate search.
● Non-linear data structure: Entries in the symbol table do not occupy
contiguous areas of memory. The entries are searched and accessed using
pointers.
Symbol Table Entry Formats
● Each entry in the symbol table consists of fields that accommodate the attributes of one symbol. The symbol field stores the symbol to which the entry pertains.
● The symbol field is the key field, which forms the basis for a search in the table.
● The following entry formats can be used for accommodating the attributes:
● Fixed length entries: Each entry in the symbol table has fields for all
attributes specified in the programming language.
● Variable-length entries: The entry occupied by a symbol has fields only for
the attributes specified for symbols of its class.
● Hybrid entries: A hybrid entry has fixed-length part and a variable-length
part.
Search Data Structures
● A search data structure (search structure) is used to create and organize various tables of information, and is mainly used during the analysis of the program.
● The important features of search data structures include the following:
● An entry in a search data structure is essentially a set of fields referred to as a record.
● Every entry in a search structure contains two parts: fixed and variable. The value in the fixed part determines the information to be stored in the variable part of the entry.
Operations On Search Data structures
● Insert Operation:
● To add the entry of a newly found symbol during language processing.
● Search Operation:
● To enable and support search and locate activity for the entry of symbol
● Delete Operation:
● To delete the entry of a symbol, especially when the processor identifies it as a redundant declaration.
Sequential Search Organization – Linear Search
● In a sequential search organization, during the search for a symbol, the probability of accessing any active entry in the table is the same.
● For an unsuccessful search, the symbol can be entered into the table using an 'add' operation.
● It operates by checking each element in a list or array one by one from the
beginning to the end until the target element is found or the entire list has
been searched.
● If found: If the current element matches the target, return the index of the
element.
● If not found: If the current element does not match the target, move to the
next element.
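A sketch of sequential search with the "add on unsuccessful search" behaviour described above; the table size and names are illustrative assumptions.

#include <stdio.h>
#include <string.h>

#define MAX 50
static char table[MAX][32];
static int  occupied = 0;

/* Sequential search: probe entries 0..occupied-1 one by one.
 * On an unsuccessful search, add the symbol at the end and return its index. */
static int search_or_add(const char *sym) {
    for (int k = 0; k < occupied; k++)
        if (strcmp(table[k], sym) == 0) return k;     /* found          */
    strcpy(table[occupied], sym);                     /* not found: add */
    return occupied++;
}

int main(void) {
    printf("%d\n", search_or_add("alpha"));   /* 0: new entry   */
    printf("%d\n", search_or_add("beta"));    /* 1: new entry   */
    printf("%d\n", search_or_add("alpha"));   /* 0: found again */
    return 0;
}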
Linear Link List
● Linear list organization is the simplest and easiest way to implement the
symbol tables.
● It can be constructed using single array or equivalently several arrays that
store names and their associated information.
● During insertion of a new name, we must scan the list to ensure whether it is
a new entry or not.
● If an entry is found during the scan, it may update the associated information
but no new entries are made.
● The advantage of using a list is that it takes the minimum possible space. On the other hand, it may suffer in performance for larger values of 'n' and 'm', where 'n' is the number of names and 'm' is the amount of information associated with each name.
Self Organizing List
● Searching in symbol table takes most of the time during symbol table
management process.
● The pointer field called 'LINK' is added to each record, and the search is
controlled by the order indicated by the ‘LINK’.
● A pointer called 'FIRST' can be used to designate the position of the first
record on the linked list, and each 'LINK' field indicates the next record on the
list.
● A self-organizing list is advantageous over a simple list implementation in the sense that frequently referenced names are likely to be near the front of the list.
● If access is random, the self-organizing list will cost extra time and space.
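A rough sketch of a self-organizing list built on a singly linked list with FIRST and LINK, as described above. A move-to-front rule is assumed here as the reordering heuristic; it is one common choice, not the only one.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

typedef struct Rec {
    char        name[32];
    struct Rec *link;          /* LINK field: next record in search order */
} Rec;

static Rec *first = NULL;      /* FIRST: head of the self-organizing list */

/* Search for a name; if found, move its record to the front so that
 * frequently referenced names stay near the head of the list. */
static Rec *search(const char *name) {
    Rec *prev = NULL;
    for (Rec *r = first; r; prev = r, r = r->link) {
        if (strcmp(r->name, name) == 0) {
            if (prev) { prev->link = r->link; r->link = first; first = r; }
            return r;
        }
    }
    return NULL;
}

static void insert(const char *name) {        /* add a new record at the front */
    Rec *r = malloc(sizeof *r);
    strcpy(r->name, name);
    r->link = first;
    first = r;
}

int main(void) {
    insert("i"); insert("a"); insert("b");    /* list: b -> a -> i      */
    search("i");                              /* "i" moves to the front */
    for (Rec *r = first; r; r = r->link) printf("%s ", r->name);
    printf("\n");                             /* prints: i b a          */
    return 0;
}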
Binary Search Organization
● Binary search is a highly efficient algorithm for finding an element in a sorted
list or array.
● It operates by repeatedly dividing the search interval in half, making it
significantly faster than linear search for large datasets.
● Find Middle:
● Calculate the middle index (mid) of the current search interval.
● Compare:
● Compare the target element with the element at the middle index.
○ If the target is equal to the middle element, the search is complete.
○ If the target is less than the middle element, adjust the high pointer to mid - 1 and repeat.
○ If the target is greater than the middle element, adjust the low pointer to mid + 1 and
repeat.
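A sketch of binary search over a sorted table of symbol names, following the find-middle/compare/adjust steps listed above; the sample names are arbitrary.

#include <stdio.h>
#include <string.h>

/* Binary search over a table of names kept in sorted order.
 * Returns the index of the target, or -1 if it is not present. */
static int binary_search(const char *table[], int n, const char *target) {
    int low = 0, high = n - 1;
    while (low <= high) {
        int mid = low + (high - low) / 2;        /* find middle         */
        int cmp = strcmp(target, table[mid]);    /* compare             */
        if (cmp == 0) return mid;                /* equal: found        */
        if (cmp < 0)  high = mid - 1;            /* target is smaller   */
        else          low  = mid + 1;            /* target is greater   */
    }
    return -1;                                   /* unsuccessful search */
}

int main(void) {
    const char *names[] = { "alpha", "beta", "delta", "gamma", "omega" };
    printf("%d\n", binary_search(names, 5, "delta"));   /* prints 2  */
    printf("%d\n", binary_search(names, 5, "zeta"));    /* prints -1 */
    return 0;
}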
Search Trees
● Symbol tables can also be organized as binary search trees, with two pointer fields, namely 'LEFT' and 'RIGHT', in each record, pointing to the left and right subtrees respectively.
● The left subtree of a record contains only records with names less than the current record's name.
● The right subtree of a node contains only records with names greater than the current name.
● The advantage of using a search tree organization is that it proves efficient in search operations, which are the operations performed most often on symbol tables.
● A binary search tree gives better search performance than a list organization, at the cost of some additional implementation difficulty.
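A minimal binary search tree of records with LEFT and RIGHT pointers, as described above; insertion and search maintain the ordering property on names. The record layout is a simplification for this sketch.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

typedef struct Rec {
    char        name[32];
    struct Rec *left, *right;    /* LEFT and RIGHT subtree pointers */
} Rec;

/* Insert a name, keeping smaller names in the left subtree
 * and greater names in the right subtree. */
static Rec *insert(Rec *root, const char *name) {
    if (root == NULL) {
        Rec *r = calloc(1, sizeof *r);
        strcpy(r->name, name);
        return r;
    }
    int cmp = strcmp(name, root->name);
    if (cmp < 0)      root->left  = insert(root->left, name);
    else if (cmp > 0) root->right = insert(root->right, name);
    return root;                 /* duplicates are ignored */
}

static Rec *search(Rec *root, const char *name) {
    while (root) {
        int cmp = strcmp(name, root->name);
        if (cmp == 0) return root;
        root = (cmp < 0) ? root->left : root->right;
    }
    return NULL;
}

int main(void) {
    Rec *root = NULL;
    root = insert(root, "m");
    root = insert(root, "d");
    root = insert(root, "t");
    printf("%s\n", search(root, "t") ? "found" : "absent");   /* found  */
    printf("%s\n", search(root, "x") ? "found" : "absent");   /* absent */
    return 0;
}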
Hash Table Organization
● A hash table, also known as a hash map, is a data structure that maps keys to values using a hash function.
● It is based on a concept called hashing, which involves using a hash function to convert keys into indices in an array, where the corresponding values are stored.
● Hash Function:
○ A function that takes an input (or 'key') and returns an index in the hash table array.
● Indexing:
○ The key is passed through the hash function, and the resulting index determines
where the value associated with the key will be stored in the array.
● Collision Handling:
○ When two keys hash to the same index, a collision occurs. Various strategies are
used to handle collisions.
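A sketch of a hash table for symbol names, using a simple string hash and chaining as the collision-handling strategy. The hash function, bucket count, and field names are assumptions made for this illustration.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define BUCKETS 31

typedef struct Node {
    char         name[32];
    char         type[16];
    struct Node *next;           /* chaining: collisions share a bucket list */
} Node;

static Node *bucket[BUCKETS];

/* A simple string hash; any reasonable hash function would do here. */
static unsigned hash(const char *s) {
    unsigned h = 5381;
    while (*s) h = h * 33 + (unsigned char)*s++;
    return h % BUCKETS;          /* indexing: key -> bucket index */
}

static void put(const char *name, const char *type) {
    Node *n = malloc(sizeof *n);
    strcpy(n->name, name);
    strcpy(n->type, type);
    unsigned k = hash(name);
    n->next = bucket[k];         /* insert at the head of the chain */
    bucket[k] = n;
}

static const char *get(const char *name) {
    for (Node *n = bucket[hash(name)]; n; n = n->next)
        if (strcmp(n->name, name) == 0) return n->type;
    return NULL;
}

int main(void) {
    put("interest", "int");
    put("rate", "real");
    printf("interest: %s\n", get("interest"));            /* int */
    printf("missing : %s\n", get("x") ? get("x") : "-");  /* -   */
    return 0;
}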
Allocation Data Structures
● The allocation strategy is an important factor in efficient utilization of memory for objects, defining their scope and lifetimes using static, stack, or heap allocation.
1. Heap
● A heap is a specialized tree-based data structure that satisfies the heap property.
● It is a complete binary tree: the tree is completely filled on all levels except possibly the lowest, which is filled from left to right.
● Heaps are typically used to implement priority queues and for efficient sorting (heapsort).
● As an allocation data structure, a heap is a kind of non-linear structure that permits allocation and deallocation of entities in any (random) order, as needed.
Allocation Data Structure (Contd.)
● Heap Operations:
○ Insert: Add a new element to the heap and restore the heap property.
○ Extract (Max or Min): Remove and return the root element and restore the heap property.
○ Peek (Max or Min): Return the root element without removing it.
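A small array-based max-heap sketch showing the insert, extract-max, and peek operations listed above; the array layout (children of index i at 2i+1 and 2i+2) reflects the complete-tree shape described earlier. The fixed capacity is an assumption for the sketch.

#include <stdio.h>

#define CAP 64
static int heap[CAP];
static int size = 0;

static void swap(int *a, int *b) { int t = *a; *a = *b; *b = t; }

/* Insert: place the element at the end, then sift it up to restore the heap property. */
static void insert(int v) {
    int i = size++;
    heap[i] = v;
    while (i > 0 && heap[(i - 1) / 2] < heap[i]) {
        swap(&heap[(i - 1) / 2], &heap[i]);
        i = (i - 1) / 2;
    }
}

static int peek(void) { return heap[0]; }    /* root = maximum */

/* Extract max: remove the root, move the last element up, then sift it down. */
static int extract_max(void) {
    int top = heap[0];
    heap[0] = heap[--size];
    for (int i = 0; ; ) {
        int l = 2 * i + 1, r = 2 * i + 2, m = i;
        if (l < size && heap[l] > heap[m]) m = l;
        if (r < size && heap[r] > heap[m]) m = r;
        if (m == i) break;
        swap(&heap[i], &heap[m]);
        i = m;
    }
    return top;
}

int main(void) {
    insert(4); insert(9); insert(1);
    printf("%d\n", peek());          /* 9 */
    printf("%d\n", extract_max());   /* 9 */
    printf("%d\n", extract_max());   /* 4 */
    return 0;
}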

2. Stack
● A stack is a linear data structure that follows a last-in, first-out (LIFO) policy for its allocation and deallocation.
● This makes only the last element of the stack accessible at any time.
● Implementing a stack requires a Stack Base (SB) pointer that points to the first entry of the stack, and a Top of Stack (TOS) pointer that points to the last entry allocated on the stack.
Allocation Data Structure (Contd.)
● Stack Operations
○ Push: Add an element to the top of the stack.
○ Pop: Remove and return the top element of the stack.
○ Peek (Top): Return the top element without removing it.
○ IsEmpty: Check if the stack is empty.
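A sketch of the stack operations listed above, using an array whose base plays the role of SB and an index that plays the role of TOS; the capacity and names are illustrative assumptions.

#include <stdio.h>

#define CAP 100
static int stack[CAP];       /* stack[0] is the stack base (SB)        */
static int tos = -1;         /* TOS index: -1 means the stack is empty */

static int  is_empty(void)  { return tos < 0; }
static void push(int v)     { if (tos < CAP - 1) stack[++tos] = v; }
static int  pop(void)       { return stack[tos--]; }   /* caller checks is_empty() first */
static int  peek(void)      { return stack[tos]; }

int main(void) {
    push(10); push(20);
    printf("%d\n", peek());      /* 20 */
    printf("%d\n", pop());       /* 20 */
    printf("%d\n", pop());       /* 10 */
    printf("%d\n", is_empty());  /* 1  */
    return 0;
}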
