Overview of language processors
Presented by: Foram Thakor
Overview
● Language Processors
● Language processing systems
● Language processing activities
● Fundamentals of language processing
● Toy compiler
● Search data structure
● Allocation data structure
Language Processors
● A Language Processor (LP) is software that bridges a specification gap or an
execution gap.
● The program input to an LP is referred to as the Source Program, and its output
as the Target Program.
● A language translator bridges an execution gap to the machine language of a
computer system.
● Language processors are essential tools that translate human-readable code
into a form that computers can execute, such as machine instructions.
Language Processing Systems
● Any computer system is made of hardware and software. The hardware
understands a language that humans cannot easily understand.
● So we write programs in high-level language, which is easier for us to
understand and remember.
● These programs are then fed into a series of tools and OS components to get
the desired code that can be used by the machine.
● This is known as Language Processing System.
● The high-level language is converted into binary language in various phases.
● A compiler is a program that converts high-level language to assembly
language.
● Similarly, an assembler is a program that converts the assembly language to
machine-level language.
● A linker tool is used to link all the parts of the program together for execution
(executable machine code).
● A loader loads all of them into memory and then the program is executed.
Different kinds of Language Processors are as follows:
1. Preprocessor
● A preprocessor, generally considered a part of the compiler, is a tool that
produces input for compilers.
● It processes source code before it is compiled.
● Tasks:
○ Macro Expansion: Replaces macros with their definitions.
○ File Inclusion: Incorporates content of other files.
○ Conditional Compilation: Compiles code based on certain conditions.
● Use: Simplifies coding by allowing code reuse and conditional compilation.
● Example Languages: C, C++ (using directives like #include, #define).
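The macro expansion task can be illustrated with a minimal sketch. It assumes only object-like macros of the form `#define NAME body`; the function name `expand_macros` is hypothetical, and a real C preprocessor handles far more (function-like macros, #include, conditional compilation).

```python
import re

def expand_macros(source):
    """Expand object-like '#define NAME body' macros, line by line."""
    macros = {}
    output = []
    for line in source.splitlines():
        match = re.match(r"\s*#define\s+(\w+)\s+(.*)", line)
        if match:
            # Remember the definition and drop the directive from the output.
            macros[match.group(1)] = match.group(2).strip()
            continue
        for name, body in macros.items():
            # \b keeps MAX from matching inside a longer name such as MAXIMUM.
            line = re.sub(rf"\b{re.escape(name)}\b", lambda _m, b=body: b, line)
        output.append(line)
    return "\n".join(output)

program = "#define MAX 10\nint a[MAX];\nint MAXIMUM = MAX + 1;"
print(expand_macros(program))
```

The output of this step is plain source text with the macro replaced, which is what the compiler proper receives as input.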
2. Compiler
● A compiler reads the whole source code at once, creates tokens, checks
semantics, generates intermediate code, translates the whole program and
may involve many passes.
● It translates high-level programming languages (e.g., C, C++, Java) into
lower-level code such as machine code (or bytecode, in the case of Java).
● Tasks: Lexical Analysis, Syntax Analysis, Semantic Analysis, Optimization,
Code Generation.
3. Interpreter
● An interpreter reads one statement of the source program at a time, translates
it and executes it before moving on to the next statement, instead of
translating the whole program first.
4. Linker
● A linker is a computer program that links and merges various object files
to make an executable file.
● These files might have been produced by separate assemblers.
● The major task of a linker is to search for and locate the referenced
modules/routines in a program and to determine the memory locations where
this code will be loaded, so that the program instructions have absolute references.
5. Loader
● The loader is a part of the operating system and is responsible for loading
executable files into memory and executing them.
● It calculates the size of a program (instructions and data) and creates memory
space for it. It initializes various registers to initiate execution.
6. Cross-compiler
● A compiler that runs on platform (A) and is capable of generating executable code
for platform (B) is called a cross-compiler.
7. Source-to-source Compiler
● A compiler that takes the source code of one programming language and
translates it into the source code of another programming language is called
a source-to-source compiler.
Language Processing Activities
● Language processing activities arise due to the differences between the
manner in which the software designer describes the ideas and the manner in
which the ideas are implemented.
● The designer expresses the idea related to the application domain of the
software.
● To implement these ideas, the description of the ideas has to be interpreted
as related to the execution domain of the computer system.
● We use the term “Semantics” to represent the “rules of the domain”.
● The term “Semantic Gap” represents the difference between the semantics
of two domains.
● The semantic gap has many consequences, some of the important ones
being:
○ Large Development Cycles (time)
○ Large Development Efforts
○ Poor-quality software
● These issues can be tackled by using Software Engineering steps:
○ Specification, Design and coding steps
○ PL implementation
Language processors after the introduction of PL domain:
● There are mainly two types of language processing activities that bridge the
semantic gap between the source language and the target language:
1. Program generation activities
2. Program execution activities
● A program generation activity aims at automatic generation of a program.
● The source language is a specification language of an application domain and
the target language is typically a procedure oriented PL.
● A program execution activity organizes the execution of a program written in a
PL on a computer system.
● Its source language could be a procedure oriented language or a problem
oriented language.
Program Generation Activities
● This activity generates a program from its specification. The program
generation activity bridges the specification gap.
● A program generator is a system program that accepts the specification of
the program to be generated and generates that program in the target
programming language.
● The program generator introduces a new domain between the application and
the programming language domain called the program generator domain.
Program Execution Activities
● This activity aims at bridging the execution gap by organizing the execution of
a program written in a programming language on a computer system.
● A program may be executed through Translation and Interpretation.
Working:
● This activity involves the actual execution of the program, for example one
produced by a program generator. The program execution activity bridges the
execution gap.
● The program execution activity involves loading the program into the memory
of the computer system, interpreting the program instructions, and executing
them on the computer system.
Program Execution Activities (Contd.)
● During program execution, the computer system reads the program
instructions from memory and executes them one by one, performing the
necessary computations and producing the desired results.
● The program execution activity involves various components of the computer
system, including the processor, memory, input/output devices, and other
system resources required to execute the program.
Phases
● Phases refer to the logical stages in the process of translating source code to
executable code.
● Break down the compilation process into manageable and modular tasks.
● Each phase focuses on a specific aspect of compilation, such as syntax or
semantics.
● Modular approach makes debugging and maintenance easier.
Passes
● Passes refer to the number of times the compiler scans the entire source code or
intermediate representation.
● Allow for thorough analysis and optimization of the code.
● Each pass may involve multiple phases or focus on specific optimization tasks.
● Multi-pass compilers are generally more efficient in terms of the final code quality.
Fundamentals of Language Processing
Language Processor = Analysis of SP + Synthesis of TP
Analysis Phase:
● Known as the front-end of the compiler, the analysis phase of the compiler
reads the source program, divides it into core parts and then checks for
lexical, grammar and syntax errors.
● The analysis phase generates an intermediate representation of the source
program and symbol table, which should be fed to the Synthesis phase as
input.
● It is responsible for understanding the structure and meaning of the source
code.
● This phase can be further divided into several sub-phases:
1. Lexical Analysis:
● Purpose: Convert the sequence of characters in the source code into tokens.
● Activities: Scanning and tokenizing the source code.
● Output: Tokens (basic syntactic units like keywords, identifiers, literals, operators).
● Example: Converting int a = 5; into tokens [int, a, =, 5, ;].
● The lexical analyzer builds a descriptor, called a token, for each lexeme, in
the form:
■ <token-name, attribute-value>
● We represent a token as an entry “Code #no”, where “Code” is Id or Op for an
identifier or operator respectively, and “no” is the number of the entry for that
identifier or operator in the symbol table or operator table.
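The scanning step can be sketched with regular expressions. This is a minimal illustration, not a full lexer: the token class names (KEYWORD, ID, NUM, OP) and the patterns are assumptions chosen for the example, and characters that match no pattern are silently skipped.

```python
import re

# Hypothetical token classes; order matters, so keywords are tried before identifiers.
TOKEN_SPEC = [
    ("KEYWORD", r"\b(?:int|real|integer)\b"),
    ("ID", r"[A-Za-z_]\w*"),
    ("NUM", r"\d+"),
    ("OP", r"[=+\-*/;:,]"),
    ("SKIP", r"\s+"),
]
PATTERN = re.compile("|".join(f"(?P<{name}>{rx})" for name, rx in TOKEN_SPEC))

def tokenize(source):
    """Return a list of (token-name, lexeme) pairs, ignoring whitespace."""
    tokens = []
    for match in PATTERN.finditer(source):
        if match.lastgroup != "SKIP":
            tokens.append((match.lastgroup, match.group()))
    return tokens

print(tokenize("int a = 5;"))
```

Each pair plays the role of the <token-name, attribute-value> descriptor; a real lexical analyzer would store an index into the symbol table rather than the lexeme itself.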
Example of Lexical Analysis
● Consider following code:
i: integer;
a, b: real;
a = b + i;
Intermediate code (i, a and b are symbol table entries 1, 2 and 3; entries 4
and 5 are temporaries):
Convert (id#1) to real, giving (id#4)
Add (id#4) to (id#3), giving (id#5)
Store (id#5) in (id#2)
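As a rough sketch of how such intermediate code could be produced, the function below handles only a single assignment of the form target = left + right. The name `translate_assignment` and the table layout (name mapped to entry number and type) are assumptions for this example.

```python
def translate_assignment(target, left, right, table):
    """Emit intermediate code for 'target = left + right'.

    table maps each name to (symbol table entry number, type).
    """
    code = []
    # Temporaries are numbered after the declared symbols.
    next_entry = max(number for number, _ in table.values()) + 1
    operands = []
    for name in (left, right):
        number, type_ = table[name]
        if type_ == "integer" and table[target][1] == "real":
            # Mixed-mode arithmetic: convert the integer operand first.
            code.append(f"Convert (id#{number}) to real, giving (id#{next_entry})")
            number, next_entry = next_entry, next_entry + 1
        operands.append(number)
    code.append(f"Add (id#{operands[1]}) to (id#{operands[0]}), giving (id#{next_entry})")
    code.append(f"Store (id#{next_entry}) in (id#{table[target][0]})")
    return code

symbols = {"i": (1, "integer"), "a": (2, "real"), "b": (3, "real")}
for line in translate_assignment("a", "b", "i", symbols):
    print(line)
```

The type check that triggers the Convert step is the kind of work semantic analysis performs before intermediate code is emitted.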
What is the need of Intermediate Code?
● Source code can be translated directly into target machine code, so why
translate it into intermediate code first? The reasons are as follows.
● If a compiler translates the source language to its target machine language without
having the option for generating intermediate code, then for each new machine, a
full native compiler is required.
● Intermediate code eliminates the need of a new full compiler for every unique
machine by keeping the analysis portion same for all the compilers.
● The second part of compiler, synthesis, is changed according to the target machine.
● Code optimization techniques can be applied to the intermediate code, which
makes it easier to improve the performance of the generated code.
Synthesis Phase
● The synthesis phase, also known as the back-end of a compiler, is
responsible for translating the intermediate representation of a program into
machine code that can be executed by the target processor.
Synthesis Phase (Contd.)
● The key tasks it performs are as follows:
1. Code Optimization
● Improve the intermediate code to enhance performance and reduce resource
usage.
● Activities: Removing redundant code, optimizing loops, and enhancing
resource utilization.
● Output: Optimized intermediate code.
● Example: Eliminating dead code and simplifying expressions to reduce the
number of instructions.
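Two of these activities, simplifying expressions and eliminating dead code, can be sketched on a tiny straight-line intermediate form. The tuple format (dest, op, arg1, arg2) and the function names are assumptions for this example; it does not handle branches or loops.

```python
def fold_constants(code):
    """Replace operations on known constants with their computed value."""
    ops = {"+": lambda x, y: x + y, "*": lambda x, y: x * y}
    known = {}  # names currently known to hold a constant
    out = []
    for dest, op, a, b in code:
        a = known.get(a, a)
        b = known.get(b, b)
        if op in ops and isinstance(a, int) and isinstance(b, int):
            known[dest] = ops[op](a, b)
            out.append((dest, "const", known[dest], None))
        else:
            known.pop(dest, None)  # dest no longer holds a known constant
            out.append((dest, op, a, b))
    return out

def remove_dead(code, live_out):
    """Drop instructions whose result is never used (backward scan)."""
    live = set(live_out)
    kept = []
    for dest, op, a, b in reversed(code):
        if dest in live:
            kept.append((dest, op, a, b))
            live.discard(dest)
            live.update(x for x in (a, b) if isinstance(x, str))
    return kept[::-1]

program = [
    ("t1", "+", 2, 3),
    ("t2", "*", "t1", "x"),
    ("t3", "+", "t1", 1),   # t3 is never used afterwards
    ("y", "+", "t2", 0),
]
optimized = remove_dead(fold_constants(program), live_out={"y"})
print(optimized)
```

Here four instructions shrink to two: the constant 2 + 3 is folded into its use, and the unused temporaries disappear.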
Synthesis Phase(Contd.)
2. Code Generation
● Purpose: Translate the optimized intermediate code into machine code.
● Activities:
○ Instruction Selection: Mapping intermediate instructions to the target machine
instructions.
○ Register Allocation: Assigning variables and temporary values to machine registers.
○ Instruction Scheduling: Reordering instructions to minimize pipeline stalls and
improve execution speed.
● Output: Machine code instructions.
● Example: Converting an intermediate representation of an addition operation
to the appropriate ADD machine instruction.
Synthesis Phase(Contd.)
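3. Creation of Data Structures
● If a symbol table has to store information about the following variable
declaration, the entry must record the name along with its type and storage class:
static int interest;
Operations – insert()
● insert() adds a symbol and its attributes to the symbol table. For example,
the declaration
int a;
is processed as:
insert(a, int);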
Operations – lookup()
● lookup() operation is used to search a name in the symbol table to determine:
○ if the symbol exists in the table.
○ if it is declared before it is being used.
○ if the name is used in the scope.
○ if the symbol is initialized.
○ if the symbol is declared multiple times.
● The format of lookup() function varies according to the programming language.
● The basic format should match the following: lookup(symbol)
● This method returns 0 (zero) if the symbol does not exist in the symbol table.
● If the symbol exists in the symbol table, it returns its attributes stored in the table.
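A minimal sketch of a symbol table with these operations follows. The class name `SymbolTable`, the dictionary layout and the `storage` attribute are assumptions for illustration; only the lookup(symbol) contract (0 when absent, attributes when present) comes from the description above.

```python
class SymbolTable:
    """Toy symbol table keyed by symbol name (no scopes)."""

    def __init__(self):
        self._entries = {}

    def insert(self, symbol, type_, **attributes):
        # Reject a second declaration of the same name.
        if symbol in self._entries:
            raise ValueError(f"{symbol!r} is declared multiple times")
        self._entries[symbol] = {"type": type_, **attributes}

    def lookup(self, symbol):
        # Returns 0 when the symbol is absent, otherwise its attributes.
        return self._entries.get(symbol, 0)

table = SymbolTable()
table.insert("a", "int")
table.insert("interest", "int", storage="static")
print(table.lookup("interest"))
print(table.lookup("b"))
```

A production compiler would also track scope and initialization state, which this sketch leaves out.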
Data Structures for Language Processing
● Two kinds of data structures can be used for organizing symbol table entries:
● Linear data structure: Entries in the symbol table occupy adjoining areas
of memory. This property is used to facilitate search.
● Non-linear data structure: Entries in the symbol table do not occupy
contiguous areas of memory. The entries are searched and accessed using
pointers.
Symbol Table Entry Formats
● Each entry in the symbol table comprises fields that accommodate the
attributes of one symbol. The symbol field stores the symbol to which the
entry pertains.
● The symbol field is the key field, which forms the basis for a search in the table.
● The following entry formats can be used for accommodating the attributes:
● Fixed length entries: Each entry in the symbol table has fields for all
attributes specified in the programming language.
● Variable-length entries: The entry occupied by a symbol has fields only for
the attributes specified for symbols of its class.
● Hybrid entries: A hybrid entry has fixed-length part and a variable-length
part.
Search Data Structures
● Search data structures (search structures) are used to create and organize
various tables of information and are mainly used during the analysis of the
program.
● The important features of search data structures include the following:
● An entry in search data structure is essentially a set of fields referred to as a
record.
● Every entry in search structure contains two parts: fixed and variable. The
value in fixed part determines the information to be stored in the variable
part of the entry.
Operations On Search Data structures
● Insert Operation:
● To add the entry of a newly found symbol during language processing.
● Search Operation:
● To search for and locate the entry of a symbol.
● Delete Operation:
● To delete the entry of a symbol, especially when the processor identifies it as
a redundant declaration.
Sequential Search Organization – Linear Search
● In sequential search organization, all active entries in the table have the
same probability of being accessed during the search for a symbol.
● For an unsuccessful search, the symbol can be entered using an ‘add’
operation into table.
● It operates by checking each element in a list or array one by one from the
beginning to the end until the target element is found or the entire list has
been searched.
● If found: If the current element matches the target, return the index of the
element.
● If not found: If the current element does not match the target, move to the
next element.
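The procedure can be sketched as follows. The function names are hypothetical, and returning -1 for an unsuccessful search is a convention assumed here rather than something the description prescribes.

```python
def search(table, symbol):
    """Check entries one by one; return the index of symbol, or -1."""
    for index, entry in enumerate(table):
        if entry == symbol:
            return index
    return -1

def search_or_add(table, symbol):
    """On an unsuccessful search, add the symbol at the end of the table."""
    index = search(table, symbol)
    if index == -1:
        table.append(symbol)
        index = len(table) - 1
    return index

symbols = ["i", "a", "b"]
print(search(symbols, "b"))
print(search_or_add(symbols, "total"))
```

In the worst case every entry is examined, so the cost grows linearly with the number of entries.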
Linear List
● Linear list organization is the simplest and easiest way to implement the
symbol tables.
● It can be constructed using single array or equivalently several arrays that
store names and their associated information.
● During insertion of a new name, we must scan the list to check whether it is
a new entry or not.
● If an entry is found during the scan, its associated information may be
updated, but no new entry is made.
● The advantage of using a list is that it takes the minimum possible space. On
the other hand, performance may suffer for larger values of 'n' and 'm',
where 'n' is the number of names and 'm' is the information associated with each name.
Self Organizing List
● Searching takes most of the time spent on symbol table management.
● The pointer field called 'LINK' is added to each record, and the search is
controlled by the order indicated by the ‘LINK’.
● A pointer called 'FIRST' can be used to designate the position of the first
record on the linked list, and each 'LINK' field indicates the next record on the
list.
● A self-organizing list is advantageous over a simple list implementation in
the sense that frequently referenced names are likely to be near the top of
the list.
● If the access pattern is random, the self-organizing list incurs extra cost in
time and space without this benefit.
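One way to realize this is the move-to-front heuristic, sketched below. The heuristic itself is an assumption (the description does not say how the list reorganizes), and the class and method names are hypothetical; `first` and `link` mirror the FIRST pointer and LINK field.

```python
class ListRecord:
    def __init__(self, name, info, link=None):
        self.name = name
        self.info = info
        self.link = link  # LINK: next record on the list

class SelfOrganizingList:
    def __init__(self):
        self.first = None  # FIRST: first record on the list

    def insert(self, name, info):
        self.first = ListRecord(name, info, self.first)

    def search(self, name):
        prev, cur = None, self.first
        while cur is not None and cur.name != name:
            prev, cur = cur, cur.link
        if cur is None:
            return None
        if prev is not None:
            # Move-to-front: unlink the record and relink it at FIRST.
            prev.link = cur.link
            cur.link = self.first
            self.first = cur
        return cur.info

    def names(self):
        result, cur = [], self.first
        while cur is not None:
            result.append(cur.name)
            cur = cur.link
        return result

table = SelfOrganizingList()
for name in ["a", "b", "c"]:
    table.insert(name, "int")
print(table.names())
table.search("a")
print(table.names())
```

After the search, "a" sits at the head of the list, so a repeated reference to it is found in one step.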
Binary Search Organization
● Binary search is a highly efficient algorithm for finding an element in a sorted
list or array.
● It operates by repeatedly dividing the search interval in half, making it
significantly faster than linear search for large datasets.
● Find Middle:
● Calculate the middle index (mid) of the current search interval.
● Compare:
● Compare the target element with the element at the middle index.
○ If the target is equal to the middle element, the search is complete.
○ If the target is less than the middle element, adjust the high pointer to mid - 1 and repeat.
○ If the target is greater than the middle element, adjust the low pointer to mid + 1 and
repeat.
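These steps can be sketched as an iterative function. The name `binary_search` and the -1 return value for a missing target are assumptions for this example; the input must already be sorted.

```python
def binary_search(items, target):
    """Return the index of target in the sorted list items, or -1."""
    low, high = 0, len(items) - 1
    while low <= high:
        mid = (low + high) // 2      # find middle
        if items[mid] == target:     # compare
            return mid
        if target < items[mid]:
            high = mid - 1           # continue in the lower half
        else:
            low = mid + 1            # continue in the upper half
    return -1

names = ["a", "b", "i", "rate", "total"]
print(binary_search(names, "rate"))
print(binary_search(names, "x"))
```

Because the interval halves on every iteration, the number of comparisons grows with the logarithm of the table size.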
Search Trees
● Symbol tables can also be organized as binary tree organization with two
pointer fields, namely, 'LEFT' and 'RIGHT' in each record that points to the
left and right sub trees respectively.
● The left subtree of the record contains only records with names less than
the current records name.
● The right subtree of the node will contain only records with name variables
greater than the current name.
● The advantage of using search tree organization is that it proves efficient in
searching operations, which are the most performed operations over the
symbol tables.
● A binary search tree gives better search performance than list organization,
at the cost of a more difficult implementation.
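A minimal sketch of this organization is shown below, without balancing. The class and function names are hypothetical; `left` and `right` correspond to the LEFT and RIGHT pointer fields.

```python
class TreeRecord:
    def __init__(self, name, info):
        self.name = name
        self.info = info
        self.left = None   # LEFT: names less than this record's name
        self.right = None  # RIGHT: names greater than this record's name

def insert(root, name, info):
    """Insert a record and return the (possibly new) root."""
    if root is None:
        return TreeRecord(name, info)
    if name < root.name:
        root.left = insert(root.left, name, info)
    elif name > root.name:
        root.right = insert(root.right, name, info)
    else:
        root.info = info  # name already present: update its information
    return root

def search(root, name):
    """Return the information stored for name, or None if absent."""
    while root is not None and root.name != name:
        root = root.left if name < root.name else root.right
    return None if root is None else root.info

root = None
for name, type_ in [("i", "integer"), ("a", "real"), ("rate", "real")]:
    root = insert(root, name, type_)
print(search(root, "a"))
print(search(root, "b"))
```

Search is fast only while the tree stays reasonably balanced; names inserted in sorted order degrade it to a linear list.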
Hash Table Organization
● A hash table, also known as a hash map, is a data structure that maps keys to
values using a hash function.
● It is based on a concept called hashing, which uses a hash function to
convert keys into indices in an array, where the corresponding values are stored.
● Hash Function:
○ A function that takes an input (or 'key') and returns an index in the hash table array.
● Indexing:
○ The key is passed through the hash function, and the resulting index determines
where the value associated with the key will be stored in the array.
● Collision Handling:
○ When two keys hash to the same index, a collision occurs. Various strategies are
used to handle collisions.
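The sketch below shows one such strategy, separate chaining, where each array slot holds a list of the pairs that hash to it. The choice of chaining, the class name and the deliberately simple hash function (sum of character codes) are assumptions for illustration.

```python
class HashTable:
    def __init__(self, size=8):
        self.buckets = [[] for _ in range(size)]

    def _index(self, key):
        # Toy hash function: sum of character codes modulo the table size.
        return sum(ord(ch) for ch in key) % len(self.buckets)

    def put(self, key, value):
        bucket = self.buckets[self._index(key)]
        for position, (existing, _) in enumerate(bucket):
            if existing == key:
                bucket[position] = (key, value)  # overwrite an existing key
                return
        bucket.append((key, value))  # new key, or a collision: chain it

    def get(self, key, default=None):
        for existing, value in self.buckets[self._index(key)]:
            if existing == key:
                return value
        return default

table = HashTable()
table.put("ab", "int")
table.put("ba", "real")  # collides with "ab" under this hash function
print(table.get("ab"), table.get("ba"))
```

Both keys land in the same bucket, yet each is still retrieved correctly because the chain is searched by key.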
Allocation Data Structures
● Allocation strategy is an important factor in the efficient use of memory for
objects; it defines their scope and lifetime through static, stack, or heap
allocation.
1. Heap
● In memory allocation, a heap is a non-linear data structure that permits
allocation and deallocation of entities in any (random) order as needed.
● The same name is also used for a specialized tree-based data structure that
satisfies the heap property: in a max-heap every parent is greater than or
equal to its children, and in a min-heap every parent is less than or equal to them.
● Such a heap is a complete binary tree, meaning the tree is completely filled
on all levels except possibly the lowest, which is filled from left to right.
● Tree-based heaps are typically used to implement priority queues and for
efficient sorting (heapsort).
Allocation Data Structure (Contd.)
● Heap Operations:
○ Insert: Add a new element to the heap and restore the heap property.
○ Extract (Max or Min): Remove and return the root element and restore the heap property.
○ Peek (Max or Min): Return the root element without removing it.
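These operations can be tried with Python's standard heapq module, which maintains a min-heap inside an ordinary list. The priority values are made up for the example.

```python
import heapq

heap = []
for priority in [5, 1, 8, 3]:
    heapq.heappush(heap, priority)   # Insert: add and restore the heap property

smallest = heap[0]                   # Peek (Min): the root, not removed
extracted = heapq.heappop(heap)      # Extract (Min): remove and return the root

print(smallest, extracted, heap[0])
```

Insert and Extract each take time logarithmic in the heap size, while Peek is constant time because the root is always at index 0.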
2. Stack
● Stack is a linear data structure that satisfies last-in first-out (LIFO) policy for its
allocation and deallocation.
● This makes only the last element of the stack accessible at any time.
● Implementing a stack requires a Stack Base (SB) pointer that points to the
first entry of the stack, and a Top of Stack (TOS) pointer that points to the
last entry allocated on the stack.
Allocation Data Structure (Contd.)
● Stack Operations
○ Push: Add an element to the top of the stack.
○ Pop: Remove and return the top element of the stack.
○ Peek (Top): Return the top element without removing it.
○ IsEmpty: Check if the stack is empty.
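A minimal array-based sketch of these operations follows. The fixed capacity, the class name and the convention that TOS equals SB - 1 for an empty stack are assumptions for this example.

```python
class Stack:
    def __init__(self, capacity):
        self.cells = [None] * capacity
        self.sb = 0             # Stack Base: index of the first entry
        self.tos = self.sb - 1  # Top of Stack: index of the last entry (SB - 1 when empty)

    def is_empty(self):
        return self.tos < self.sb

    def push(self, value):
        if self.tos + 1 >= len(self.cells):
            raise OverflowError("stack is full")
        self.tos += 1
        self.cells[self.tos] = value

    def pop(self):
        if self.is_empty():
            raise IndexError("pop from empty stack")
        value = self.cells[self.tos]
        self.cells[self.tos] = None
        self.tos -= 1
        return value

    def peek(self):
        if self.is_empty():
            raise IndexError("peek at empty stack")
        return self.cells[self.tos]

stack = Stack(4)
stack.push("a")
stack.push("b")
print(stack.peek(), stack.pop(), stack.pop(), stack.is_empty())
```

Entries leave in the reverse of the order in which they were pushed, which is the LIFO policy in action.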