

NOTES

SUBJECT :
Compiler Construction
CLASS :
BSCS 6th Semester
WRITTEN BY :
(CR) KASHIF MALIK

COURSE OUTLINE COMPILER CONSTRUCTION



PAST PAPER COMPILER CONSTRUCTION



INTRODUCTION TO COMPILERS

A compiler is software that converts a program written in a high-level language (the source language) into a low-level language (the object/target/machine language, i.e., 0s and 1s).

A translator or language processor is a program that translates an input program written in a programming language into an
equivalent program in another language. The compiler is a type
of translator, which takes a program written in a high-level
programming language as input and translates it into an
equivalent program in low-level languages such as machine
language or assembly language.
The program written in a high-level language is known as a
source program, and the program converted into a low-level
language is known as an object (or target) program. Without
compilation, no program written in a high-level language can be
executed. For every programming language, we have a different
compiler; however, the basic tasks performed by every compiler
are the same. The process of translating the source code into
machine code involves several stages, including lexical analysis,
syntax analysis, semantic analysis, code generation, and
optimization.

High-Level Programming Language


A high-level programming language is a language that provides an abstraction over the attributes of the computer. High-level programming is more convenient for the user when writing a program.
Low-Level Programming Language

A low-level programming language is a language that is close to the machine: it provides little or no abstraction from the computer's instruction set and memory model, so the programmer must manage machine-level details directly.
Stages of Compiler Design
1. Lexical Analysis: The first stage of compiler design is
lexical analysis, also known as scanning. In this stage, the
compiler reads the source code character by character and
breaks it down into a series of tokens, such as keywords,
identifiers, and operators. These tokens are then passed on
to the next stage of the compilation process.
2. Syntax Analysis: The second stage of compiler design is
syntax analysis, also known as parsing. In this stage, the
compiler checks the syntax of the source code to ensure
that it conforms to the rules of the programming language.
The compiler builds a parse tree, which is a hierarchical
representation of the program’s structure, and uses it to
check for syntax errors.
3. Semantic Analysis: The third stage of compiler design is
semantic analysis. In this stage, the compiler checks the
meaning of the source code to ensure that it makes sense.
The compiler performs type checking, which ensures that
variables are used correctly and that operations are
performed on compatible data types. The compiler also
checks for other semantic errors, such as undeclared
variables and incorrect function calls.
4. Code Generation: The fourth stage of compiler design is
code generation. In this stage, the compiler translates the
parse tree into machine code that can be executed by the
computer. The code generated by the compiler must be
efficient and optimized for the target platform.

5. Optimization: The final stage of compiler design is


optimization. In this stage, the compiler analyzes the
generated code and makes optimizations to improve its
performance. The compiler may perform optimizations
such as constant folding, loop unrolling, and function
inlining.
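
For instance, constant folding evaluates an expression whose operands are all known at compile time and replaces it with the result. A minimal before/after illustration in C (the variable name is just an example):

/* before optimization */
int seconds_per_day = 60 * 60 * 24;

/* after constant folding, the compiler effectively emits */
int seconds_per_day = 86400;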

• Cross Compiler: a compiler that runs on a machine 'A' and produces code for another machine 'B'. It is capable of creating code for a platform other than the one on which the compiler itself is running.
• Source-to-source Compiler or transcompiler or transpiler
is a compiler that translates source code written in one
programming language into the source code of another
programming language.

Language Processing Systems


We know a computer is a logical assembly of software and hardware. The hardware knows a language that is hard for us to grasp; consequently, we tend to write programs in a high-level language that is much less complicated for us to comprehend and maintain. These programs then go through a series of transformations so that they can readily be used by machines. This is where language processing systems come in handy.

• High-Level Language: A program that contains preprocessor directives such as #include or #define is said to be written in a high-level language (HLL). HLLs are closer to humans but far from machines. These (#) tags are called preprocessor directives; they direct the pre-processor about what to do.
• Pre-Processor: The pre-processor removes all the #include directives by including the referenced files (file inclusion) and expands all the #define directives using macro expansion. It performs file inclusion, augmentation, macro-processing, etc.
• Assembly Language: It is neither in binary form nor high level. It is an intermediate state that is a combination of machine instructions and some other useful data needed for execution.
• Assembler: For every platform (hardware + OS) we have an assembler. Assemblers are not universal, since there is one for each platform. The output of the assembler is called an object file. It translates assembly language to machine code.
• Interpreter: An interpreter converts high-level language
into low-level machine language, just like a compiler. But
they are different in the way they read the input. The
Compiler in one go reads the inputs, does the processing,
and executes the source code whereas the interpreter does
the same line by line. A compiler scans the entire program
and translates it as a whole into machine code whereas an
interpreter translates the program one statement at a time.
• Relocatable Machine Code: Code that can be loaded at any point in memory and run. The addresses within the program are arranged in such a way that the code still works when the program is moved in memory.
• Loader/Linker: The loader/linker converts the relocatable code into absolute code and tries to run the program, resulting in a running program or an error message (or sometimes both). The linker combines a variety of object files into a single executable file; the loader then loads it into memory and executes it.
Types of Compiler
There are mainly three types of compilers.
• Single Pass Compilers
• Two Pass Compilers

• Multipass Compilers
Single Pass Compiler
When all the phases of the compiler are present inside a single
module, it is simply called a single-pass compiler. It performs
the work of converting source code to machine code.
Two Pass Compiler
A two-pass compiler is a compiler in which the program is processed twice: once by the front end and once by the back end.
Multipass Compiler
When several intermediate codes are created in a program and a
syntax tree is processed many times, it is called Multi pass
Compiler. It breaks codes into smaller programs.

2 - THE PHASES OF A COMPILER



Phase 1: Lexical Analysis


Lexical analysis is the first phase, in which the compiler scans the source code. The scan proceeds left to right, character by character, grouping the characters into tokens.

Example:

x = y + 10

Tokens:

x     identifier
=     assignment operator
y     identifier
+     addition operator
10    number

Here, the character stream from the source program is grouped into meaningful sequences by identifying the tokens. The lexical analyzer makes an entry for the corresponding tokens in the symbol table and passes each token on to the next phase. The primary functions of this phase are:
• Identify the lexical units in the source code
• Classify lexical units into classes like constants and reserved words, and enter them in different tables; ignore comments in the source program
• Identify tokens which are not part of the language

Phase 2: Syntax Analysis


Syntax analysis is all about discovering structure in code. It determines whether or not a text follows the expected format. The main aim of this phase is to check whether the source code written by the programmer is syntactically correct.
Syntax analysis is based on the rules of the specific programming language, and it constructs the parse tree with the help of the tokens. It also determines the structure of the source language and the grammar or syntax of the language. Here is a list of tasks performed in this phase:
• Obtain tokens from the lexical analyzer
• Checks if the expression is syntactically correct or not
• Report all syntax errors
• Construct a hierarchical structure which is known as a parse
tree
Example
Any identifier/number is an expression
If x is an identifier and y+10 is an expression, then x= y+10 is a
statement.
Consider parse tree for the following example
(a+b)*c

In the parse tree:
• Interior node: a record with an operator field and two fields for its children
• Leaf: a record with two or more fields, one for the token and the others for information about the token

Phase 3: Semantic Analysis


Semantic analysis checks the semantic consistency of the code. It uses the syntax tree of the previous phase along with the symbol table to verify that the given source code is semantically consistent and that the components of the program fit together meaningfully. It also checks for type mismatches, incompatible operands, functions called with improper arguments, undeclared variables, etc.
The functions of the semantic analysis phase are:

• Helps you to store the type information gathered and save it in the symbol table or syntax tree
• Allows you to perform type checking
• In the case of type mismatch, where there are no exact type
correction rules which satisfy the desired operation a
semantic error is shown
• Collects type information and checks for type compatibility
• Checks if the source language permits the operands or not
Example

float x = 20.2;
float y = x * 30;

Here the semantic analyzer type-checks the expression x * 30: since x is a float and 30 is an integer literal, an implicit int-to-float conversion is applied to 30.

Phase 4: Intermediate Code Generation


Once the semantic analysis phase is over, the compiler generates intermediate code for the target machine. It represents a program for some abstract machine.
Intermediate code is between the high-level and machine level
language. This intermediate code needs to be generated in such a
manner that makes it easy to translate it into the target machine
code.
Functions of intermediate code generation:
• It should be generated from the semantic representation of the source program
• Holds the values computed during the process of translation
• Helps you to translate the intermediate code into the target language
• Allows you to maintain the precedence ordering of the source language
• It holds the correct number of operands for each instruction

Example
For example:

total = count + rate * 5

The intermediate code, using the three-address code method, is:

t1 := int_to_float(5)
t2 := rate * t1
t3 := count + t2
total := t3
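
To make the idea of three-address code concrete, each instruction is commonly stored as a quadruple (operator, two operands, result). The following C sketch is illustrative only; the struct and field names are assumptions made for the example, not part of any particular compiler:

#include <stdio.h>
#include <string.h>

/* one three-address instruction: result := arg1 op arg2 */
struct quad {
    const char *op;      /* operator, e.g. "*", "+", "int_to_float", "copy" */
    const char *arg1;    /* first operand (a name or a temporary)           */
    const char *arg2;    /* second operand, or NULL if the op is unary      */
    const char *result;  /* temporary or variable receiving the result      */
};

int main(void) {
    /* the intermediate code for: total = count + rate * 5 */
    struct quad code[] = {
        { "int_to_float", "5",     NULL, "t1"    },
        { "*",            "rate",  "t1", "t2"    },
        { "+",            "count", "t2", "t3"    },
        { "copy",         "t3",    NULL, "total" },
    };
    for (int i = 0; i < 4; i++) {
        if (code[i].arg2)
            printf("%s := %s %s %s\n", code[i].result, code[i].arg1,
                   code[i].op, code[i].arg2);
        else if (strcmp(code[i].op, "copy") == 0)
            printf("%s := %s\n", code[i].result, code[i].arg1);
        else
            printf("%s := %s(%s)\n", code[i].result, code[i].op, code[i].arg1);
    }
    return 0;
}

Running this prints exactly the four three-address statements listed above.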

Phase 5: Code Optimization


The next phase is code optimization of the intermediate code. This phase removes unnecessary code lines and arranges the sequence of statements to speed up the execution of the program without wasting resources. The main goal of this phase is to improve on the intermediate code so as to generate code that runs faster and occupies less space.
The primary functions of this phase are:
• It helps you to establish a trade-off between execution and
compilation speed
• Improves the running time of the target program
• Generates streamlined code still in intermediate
representation
• Removing unreachable code and getting rid of unused
variables

• Moving statements that are not altered inside a loop (loop-invariant computations) out of the loop


Example:
Consider the following code:

a = int_to_float(10)
b = c * a
d = e + b
f = d

It can become:

b = c * 10.0
f = e + b
Phase 6: Code Generation
Code generation is the last and final phase of a compiler. It gets its input from the code optimization phase and produces the target code or object code as a result. The objective of this phase is to allocate storage and generate relocatable machine code. It also allocates memory locations for the variables. The instructions in the intermediate code are converted into machine instructions. This phase converts the optimized intermediate code into the target language.
The target language is the machine code. Therefore, all the memory locations and registers are also selected and allotted during this phase. The code generated by this phase is executed to take inputs and generate the expected outputs.

Example:
a = b + 60.0

might be translated into register-based code such as:

MOVF b, R1
ADDF #60.0, R1
MOVF R1, a

Symbol Table Management


A symbol table contains a record for each identifier with fields
for the attributes of the identifier. This component makes it
easier for the compiler to search the identifier record and
retrieve it quickly. The symbol table also helps with scope management. The symbol table and error handler interact with all the phases, and the symbol table is updated correspondingly.
Error Handling Routine
In the compiler design process, errors may occur in all of the phases given below:
• Lexical analyzer: Wrongly spelled tokens
• Syntax analyzer: Missing parenthesis
• Intermediate code generator: Mismatched operands for an
operator
• Code Optimizer: When the statement is not reachable
• Code Generator: When the memory is full or proper
registers are not allocated
• Symbol tables: Error of multiple declared identifiers


GROUPING OF PHASES
1. Front End phases: The front end consists of those phases, or parts of phases, that are source-language dependent and target-machine independent. These generally consist of lexical analysis, semantic analysis, syntactic analysis, symbol table creation, and intermediate code generation. A small part of code optimization can also be included in the front-end part. The front end also includes the error handling that goes along with each of these phases.

Front End Phases

2. Back End phases: The portions of compilers that


depend on the target machine and do not depend on
the source language are included in the back end. In
the back end, code generation and necessary
features of code optimization phases, along with
error handling and symbol table operations are also
included.

Back End Phases

Grouping
Several phases are grouped together into a pass, so that the pass can read an input file and write an output file.
1. One-Pass – In one pass, all the phases are grouped into a single pass; all six phases are included in that one pass.
2. Two-Pass – In two passes, the phases are divided into two parts, i.e., the analysis or front-end part of the compiler and the synthesis or back-end part of the compiler.

Purpose of One-Pass Compiler

A one-pass compiler generates machine instructions as it scans the source, treating the program as a stream of instructions, and keeps a list of instructions whose machine addresses must be backpatched once those addresses are generated. It processes the program in a single pass: as each source line is handled, it is checked and its tokens are extracted.

Purpose of Two-Pass Compiler

A two-pass compiler uses its first pass to enter into its symbol table a list of identifiers together with the memory locations to which these identifiers relate. A second pass then replaces mnemonic operation codes with their machine-language equivalents and replaces uses of identifiers with their machine addresses. In the second pass, the compiler reads the result file produced by the first pass, builds the syntax tree and performs the syntactic analysis. The result of this stage is a file that contains the syntax tree.

• COMPILER CONSTRUCTION TOOLS


The compiler writer can use some specialized tools that help in
implementing various phases of a compiler. These tools assist in
the creation of an entire compiler or its parts. Some commonly
used compiler construction tools include:
1. Parser Generator – It produces syntax analyzers (parsers)
from the input that is based on a grammatical description of
programming language or on a context-free grammar. It is
useful as the syntax analysis phase is highly complex and
consumes more manual and compilation time. Example:
PIC, EQM

2. Scanner Generator – It generates lexical analyzers from


the input that consists of regular expression description
based on tokens of a language. It generates a finite
automaton to recognize the regular expression. Example:
Lex

3. Syntax directed translation engines – It generates


intermediate code with three address format from the input
that consists of a parse tree. These engines have routines to
traverse the parse tree and then produces the intermediate
code. In this, each node of the parse tree is associated with
one or more translations.
4. Automatic code generators – It generates the machine
language for a target machine. Each operation of the
intermediate language is translated using a collection of
rules and then is taken as an input by the code generator. A
template matching process is used. An intermediate
language statement is replaced by its equivalent machine
language statement using templates.
5. Data-flow analysis engines – These are used in code optimization. Data-flow analysis is a key part of code optimization; it gathers information about the values that flow from one part of a program to another.

6. Compiler construction toolkits – It provides an integrated


set of routines that aids in building compiler components or
in the construction of various phases of compiler.

Features of compiler construction tools :

• Lexical Analyzer Generator: This tool helps in


generating the lexical analyzer or scanner of the
compiler. It takes as input a set of regular expressions
that define the syntax of the language being compiled
and produces a program that reads the input source code
and tokenizes it based on these regular expressions.
• Parser Generator: This tool helps in generating the
parser of the compiler. It takes as input a context-free
grammar that defines the syntax of the language being
compiled and produces a program that parses the input
tokens and builds an abstract syntax tree.
• Code Generation Tools: These tools help in generating
the target code for the compiler. They take as input the
abstract syntax tree produced by the parser and produce
code that can be executed on the target machine.
• Optimization Tools: These tools help in optimizing the
generated code for efficiency and performance. They can
perform various optimizations such as dead code
elimination, loop optimization, and register allocation.
• Debugging Tools: These tools help in debugging the
compiler itself or the programs that are being compiled.
They can provide debugging information such as symbol
tables, call stacks, and runtime errors.
• Profiling Tools: These tools help in profiling the
compiler or the compiled code to identify performance
bottlenecks and optimize the code accordingly.

• Documentation Tools: These tools help in generating


documentation for the compiler and the programming
language being compiled. They can generate
documentation for the syntax, semantics, and usage of
the language.
• Language Support: Compiler construction tools are
designed to support a wide range of programming
languages, including high-level languages such as C++,
Java, and Python, as well as low-level languages such as
assembly language.

• SYNTAX DEFINITION AND TRANSLATION


Syntax Definition:

Syntax refers to the structure or form of expressions in a


programming language. It defines how programs should be
written in terms of symbols, keywords, and their arrangement.
Syntax is typically defined using formal languages, and one
widely used formalism is Backus-Naur Form (BNF) or Extended
Backus-Naur Form (EBNF).
Example of BNF:

<expression> ::= <term> "+" <expression>


| <term> "-" <expression>
| <term>

<term> ::= <factor> "*" <term>


| <factor> "/" <term>
| <factor>

<factor> ::= "(" <expression> ")"


| <number>

In this example, <expression>, <term>, and <factor> are


nonterminal symbols, and +, -, *, /, (, ), and <number> are
terminal symbols. This BNF defines the syntax for basic
arithmetic expressions.
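
Because every alternative for <expression> and <term> begins with the same nonterminal, the grammar above can be handled by a simple recursive-descent recognizer. The C sketch below is only illustrative: it treats an unsigned integer literal as <number>, reports success or failure, and its function and variable names are assumptions made for the sketch.

#include <ctype.h>
#include <stdio.h>
#include <stdlib.h>

static const char *input;   /* expression being parsed            */
static int pos;             /* index of the next unread character */

static void error(void) { printf("syntax error at position %d\n", pos); exit(1); }

static void expression(void);

static void factor(void) {
    if (input[pos] == '(') {                 /* <factor> ::= "(" <expression> ")" */
        pos++;
        expression();
        if (input[pos] != ')') error();
        pos++;
    } else if (isdigit((unsigned char)input[pos])) {   /* <number> */
        while (isdigit((unsigned char)input[pos])) pos++;
    } else {
        error();
    }
}

static void term(void) {                     /* <term> ::= <factor> (("*"|"/") <term>)? */
    factor();
    if (input[pos] == '*' || input[pos] == '/') {
        pos++;
        term();
    }
}

static void expression(void) {               /* <expression> ::= <term> (("+"|"-") <expression>)? */
    term();
    if (input[pos] == '+' || input[pos] == '-') {
        pos++;
        expression();
    }
}

int main(void) {
    input = "(1+2)*3";
    pos = 0;
    expression();
    if (input[pos] != '\0') error();
    printf("%s parses as an <expression>\n", input);
    return 0;
}

Each nonterminal of the BNF becomes one procedure, which is exactly the idea behind recursive descent parsing discussed later in these notes.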
Translation:

Translation involves converting the source code written in a


high-level programming language into an equivalent target code,
often in the form of machine code or intermediate code. The
translation process can be broken down into multiple phases,
including lexical analysis, syntax analysis, semantic analysis,
intermediate code generation, code optimization, and code
generation.

Example of Translation:
Consider a simple assignment statement in a programming
language:

int result = a + b * c;

1. Lexical Analysis (Tokenization):


- Identify tokens: int, result, =, a, +, b, *, c, ;.

2. Syntax Analysis (Parsing):


- Use the syntax rules to create a parse tree representing the
syntactic structure of the statement.

3. Semantic Analysis:
- Ensure that the statement adheres to the language's semantic
rules. For example, check if the variables are declared before
use.

4. Intermediate Code Generation:


- Generate an intermediate representation of the code. For
simplicity, let's consider a three-address code:

t1 = b * c
t2 = a + t1
result = t2

5. Code Optimization (Optional)


- Optimize the intermediate code for better performance.

6. Code Generation:
- Translate the intermediate code into the target machine code or
another intermediate representation.

• PARSING
The process of transforming the data from one format to another
is called Parsing. This process can be accomplished by the
parser. The parser is a component of the translator that helps to
organise linear text structure following the set of defined rules
which is known as grammar.

Types of Parsing:

There are two types of Parsing:


• The Top-down Parsing
• The Bottom-up Parsing

• Top-down Parsing: When the parser generates a parse tree by top-down expansion, tracing out the left-most derivation of the input, it is called top-down parsing. Top-down parsing starts with the start symbol and ends at the terminals. Such parsing is also known as predictive parsing.

• Recursive Descent Parsing: Recursive descent parsing is a type of top-down parsing technique. This technique uses a procedure for every terminal and non-terminal entity. It reads the input from left to right and constructs the parse tree from the top down. As the technique works recursively, it is called recursive descent parsing.
• Back-tracking: A parsing technique that starts from the root node with an initial production; if the derivation fails, it restarts the process with different rules.

• Bottom-up Parsing: Bottom-up parsing works just the reverse of top-down parsing. It traces out the rightmost derivation of the input in reverse, working from the input symbols up until it reaches the start symbol.

• Shift-Reduce Parsing: Shift-reduce parsing works in two steps: the shift step and the reduce step.
• Shift step: The shift step indicates the advance of the input pointer to the next input symbol, which is shifted onto the stack.
• Reduce step: When the parser finds a complete right-hand side of a grammar rule on the stack, it replaces it with the corresponding left-hand side non-terminal.
• LR Parsing: The LR parser is one of the most efficient syntax analysis techniques, as it works with context-free grammars. In LR parsing, L stands for left-to-right scanning of the input, and R stands for constructing a rightmost derivation in reverse.

• A Translator for Simple Expressions

1. Abstract and Concrete Syntax
2. Adapting the Translation Scheme
3. Procedures for the Nonterminals
4. Simplifying the Translator
5. The Complete Program
• LEXICAL ANALYSIS
Lexical Analysis is the first phase of the compiler also known as
a scanner. It converts the High level input program into a
sequence of Tokens.
• Lexical Analysis can be implemented with the
Deterministic finite Automata.
• The output is a sequence of tokens that is sent to the
parser for syntax analysis

What is a token?

A lexical token is a sequence of characters that can be treated as


a unit in the grammar of the programming languages. Example
of tokens:
• Type token (id, number, real, . . . )
• Punctuation tokens (IF, void, return, . . . )
• Alphabetic tokens (keywords)
Keywords; Examples-for, while, if etc.
Identifier; Examples-Variable name, function name, etc.
Operators; Examples '+', '++', '-' etc.
Separators; Examples ',' ';' etc

Example of Non-Tokens:
• Comments, preprocessor directive, macros, blanks, tabs,
newline, etc.
Lexeme: The sequence of characters matched by a pattern to
form the corresponding token or a sequence of input characters
that comprises a single token is called a lexeme. eg- “float”,
“abs_zero_Kelvin”, “=”, “-”, “273”, “;” .

How Lexical Analyzer works-


1. Input preprocessing: This stage involves cleaning up the
input text and preparing it for lexical analysis. This may
include removing comments, whitespace, and other
nonessential characters from the input text.
2. Tokenization: This is the process of breaking the input text
into a sequence of tokens. This is usually done by matching

the characters in the input text against a set of patterns or


regular expressions that define the different types of tokens.
3. Token classification: In this stage, the lexer determines the
type of each token. For example, in a programming
language, the lexer might classify keywords, identifiers,
operators, and punctuation symbols as separate token types.
4. Token validation: In this stage, the lexer checks that each
token is valid according to the rules of the programming
language. For example, it might check that a variable name
is a valid identifier, or that an operator has the correct
syntax.
5. Output generation: In this final stage, the lexer generates
the output of the lexical analysis process, which is typically
a list of tokens. This list of tokens can then be passed to the
next stage of compilation or interpretation.

• INPUT BUFFERING
Input buffering is a technique that allows the compiler to read
input in larger chunks, which can improve performance and
reduce overhead.
1. The basic idea behind input buffering is to read a block of
input from the source code into a buffer, and then process
that buffer before reading the next block.

The lexical analyzer scans the input from left to right one
character at a time. It uses two pointers begin ptr(bp) and
forward ptr(fp) to keep track of the pointer of the input scanned.

Initially both the pointers point to the first character of the input string as shown below

The forward pointer moves ahead to search for the end of the lexeme. As soon as a blank space is encountered, it indicates the end of the lexeme. In the example above, as soon as the forward pointer (fp) encounters a blank space, the lexeme "int" is identified. When fp encounters white space, it ignores it and moves ahead; then both the begin pointer (bp) and forward pointer (fp) are set at the next token. The input characters are thus read from secondary storage, but reading from secondary storage in this way is costly, so a buffering technique is used: a block of data is first read into a buffer and then scanned by the lexical analyzer. Two methods are used in this context: the one-buffer scheme and the two-buffer scheme. These are explained below.

1. One Buffer Scheme: In this scheme, only one buffer is used to store the input string. The problem with this scheme is that if a lexeme is very long, it crosses the buffer boundary; to scan the rest of the lexeme the buffer has to be refilled, which overwrites the first part of the lexeme.

2. Two Buffer Scheme: To overcome the problem of the one-buffer scheme, this method uses two buffers to store the input string. The first and second buffers are scanned alternately; when the end of the current buffer is reached, the other buffer is filled. The only problem with this method is that if the length of a lexeme is longer than the length of a buffer, the input cannot be scanned completely.
Initially both bp and fp point to the first character of the first buffer. Then fp moves to the right in search of the end of the lexeme; as soon as a blank character is recognized, the string between bp and fp is identified as the corresponding token. To identify the boundary of the first buffer, an end-of-buffer character is placed at its end; similarly, the end of the second buffer is recognized by the end-of-buffer mark at its end. When fp encounters the first eof, it recognizes the end of the first buffer, and filling of the second buffer is started; in the same way, the second eof indicates the end of the second buffer. The two buffers are filled alternately until the end of the input program, and the stream of tokens is identified. This eof character introduced at the end is called a sentinel and is used to identify the end of a buffer.
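
A rough sketch of the two-buffer scheme with sentinels in C is shown below. It is illustrative only: the buffer size N, the fill_half() helper and the use of '\0' as the eof/sentinel mark are assumptions made for the sketch.

#include <stdio.h>

#define N 4096                      /* size of each buffer half                */

static char buf[2 * N + 2];         /* two halves, each followed by a sentinel */
static char *forward;               /* the forward (lookahead) pointer         */

/* hypothetical helper: read up to N bytes into one half and place the
   sentinel (here '\0' stands for the eof mark) right after the data */
static void fill_half(char *half, FILE *src) {
    size_t n = fread(half, 1, N, src);
    half[n] = '\0';
}

/* advance forward by one character, reloading a half when a sentinel is hit */
static char next_char(FILE *src) {
    char c = *forward++;
    if (c == '\0') {                              /* hit a sentinel        */
        if (forward == buf + N + 1) {             /* end of first half     */
            fill_half(buf + N + 1, src);          /* reload second half    */
            forward = buf + N + 1;
        } else if (forward == buf + 2 * N + 2) {  /* end of second half    */
            fill_half(buf, src);                  /* reload first half     */
            forward = buf;
        } else {
            return '\0';                          /* real end of input     */
        }
        c = *forward++;
    }
    return c;
}

int main(void) {
    fill_half(buf, stdin);          /* prime the first half */
    forward = buf;
    long count = 0;
    while (next_char(stdin) != '\0')
        count++;
    printf("%ld characters scanned\n", count);
    return 0;
}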

• Specification & Recognition of Tokens


First know about Lexical Analysis:
1. The lexical analyzer breaks syntaxes into a series of
tokens, by removing any whitespace or comments in the
source code.
2. If the lexical analyzer finds a token invalid, it generates an
error. It reads character streams from the source code,
checks for legal tokens, and passes the data to the syntax
analyzer when it demands.

What is a Token?
In a programming language, keywords, constants, identifiers, strings, numbers, operators and punctuation symbols can be considered tokens. For example, in the C language, the variable declaration line

int value = 100;

contains the tokens: int (keyword), value (identifier), = (operator), 100 (constant) and ; (symbol).
Lexeme Token

= EQUAL_OP

* MULT_OP

, COMMA

( LEFT_PAREN
Specifications of Tokens:
Let us understand how the language theory undertakes the
following terms:
1. Alphabets
2. Strings
3. Special symbols
4. Language
5. Longest match rule
6. Operations
7. Notations
8. Representing valid tokens of a language in regular
expression
9. Finite automata
1. Alphabets: Any finite set of symbols
• {0,1} is a set of binary alphabets,
• {0,1,2,3,4,5,6,7,8,9,A,B,C,D,E,F} is a set of Hexadecimal
alphabets,
• {a-z, A-Z} is a set of English language alphabets.
2. Strings: Any finite sequence of alphabets is called a string.
3. Special symbols: A typical high-level language contains the
following symbols:

Arithmetic symbols     Addition (+), Subtraction (-), Multiplication (*), Division (/)
Punctuation            Comma (,), Semicolon (;), Dot (.)
Assignment             =
Special assignment     +=, -=, *=, /=
Comparison             ==, !=, <, <=, >, >=
Preprocessor           #

4. Language: A language is considered as a finite set of strings over some finite set of alphabets.
5. Longest match rule: When the lexical analyzer reads the source code, it scans the code letter by letter, and when it encounters a whitespace, operator symbol, or special symbol it decides that a word is complete.
6. Operations: The various operations on languages are:
1. Union of two languages L and M is written as, L U M
= {s | s is in L or s is in M}
2. Concatenation of two languages L and M is written as,
LM = {st | s is in L and t is in M}
3. The Kleene Closure of a language L is written as, L*
= Zero or more occurrence of language L.
7. Notations: If r and s are regular expressions denoting the
languages L(r) and L(s), then

1. Union : L(r)UL(s)
2. Concatenation : L(r)L(s)
3. Kleene closure : (L(r))*
8. Representing valid tokens of a language in regular expressions: If x is a regular expression, then:
• x* means zero or more occurrences of x.
• x+ means one or more occurrences of x.
9. Finite automata: A finite automaton is a state machine that takes a string of symbols as input and changes its state accordingly. If the input string is successfully processed and the automaton reaches its final state, it is accepted. The mathematical model of a finite automaton consists of:
• Finite set of states (Q)
• Finite set of input symbols (Σ)
• One Start state (q0)
• Set of final states (qf)
• Transition function (δ)
The transition function (δ) maps the finite set of state (Q) to a
finite set of input symbols (Σ), Q × Σ ➔ Q

• SPECIFYING LEXICAL ANALYZERS



• LEX
o Lex is a program that generates a lexical analyzer. It is used with the YACC parser generator.
o The lexical analyzer is a program that transforms an input stream into a sequence of tokens.
o Lex reads its input specification and produces the lexical analyzer as C source code.
The function of Lex is as follows:
o First, the lexical analyzer specification is written in a file lex.l in the Lex language. Then the Lex compiler runs on the lex.l program and produces a C program lex.yy.c.
o Finally, the C compiler compiles the lex.yy.c program and produces an object program a.out.
o a.out is the lexical analyzer that transforms an input stream into a sequence of tokens.

Lex file format



A Lex program is separated into three sections by %% delimiters. The format of a Lex source is as follows:
1. { definitions }
2. %%
3. { rules }
4. %%
5. { user subroutines }
Definitions include declarations of constant, variable and regular
definitions.
Rules define statements of the form p1 {action1} p2 {action2} ... pn {actionn},
where pi describes a regular expression and actioni describes what action the lexical analyzer should take when pattern pi matches a lexeme.
User subroutines are auxiliary procedures needed by the actions.
The subroutine can be loaded with the lexical analyzer and
compiled separately.
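
As a rough illustration of how the pieces fit together, the scanner that Lex generates (lex.yy.c) exposes a yylex() function and a yytext buffer holding the matched lexeme. A minimal C driver that repeatedly asks the scanner for tokens might look like the sketch below; the assumption that yylex() returns 0 at end of input follows the usual Lex/YACC convention, and the token codes printed are whatever the user's rules return:

#include <stdio.h>

/* provided by the Lex-generated scanner in lex.yy.c */
extern int yylex(void);      /* returns the next token code, 0 at end of input */
extern char *yytext;         /* text of the lexeme that was just matched       */

int main(void) {
    int token;
    /* keep asking the scanner for tokens until the input is exhausted */
    while ((token = yylex()) != 0) {
        printf("token code %d, lexeme \"%s\"\n", token, yytext);
    }
    return 0;
}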

• INCORPORATION OF SYMBOL TABLE


A Symbol Table is an important data structure created and maintained by the compiler in order to keep track of the semantics of variables, i.e., it stores scope and binding information about names, and information about instances of various entities such as variable and function names, classes, objects, etc.
• It is built in the lexical and syntax analysis phases.
• The information is collected by the analysis phases of the
compiler and is used by the synthesis phases of the
compiler to generate code.
• It is used by the compiler to achieve compile-time
efficiency.
• It is used by various phases of the compiler as follows:-
1. Lexical Analysis: Creates new table entries in the
table, for example like entries about tokens.
2. Syntax Analysis: Adds information regarding
attribute type, scope, dimension, line of reference, use,
etc in the table.
3. Semantic Analysis: Uses available information in the
table to check for semantics i.e. to verify that
expressions and assignments are semantically
correct(type checking) and update it accordingly.
4. Intermediate Code generation: Refers symbol table
for knowing how much and what type of run-time is
allocated and table helps in adding temporary variable
information.
5. Code Optimization: Uses information present in the
symbol table for machine-dependent optimization.
6. Target Code generation: Generates code by using
address information of identifier present in the table.

Items stored in Symbol table:


• Variable names and constants
• Procedure and function names
• Literal constants and strings
• Compiler generated temporaries
• Labels in source languages
Information used by the compiler from Symbol table:
• Data type and name
• Declaring procedures
• Offset in storage
• If structure or record then, a pointer to structure table.
• For parameters, whether parameter passing by value or by
reference
• Number and type of arguments passed to function
• Base Address
• Operations of Symbol table – The basic operations defined on a symbol table include insert(), to add a new name together with its attributes, and lookup(), to search for a name and retrieve its information.
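
A minimal sketch of such a table in C, using open hashing with chaining, is shown below. The table size, the hash function and the stored attribute (here just a type string) are illustrative assumptions:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define TABLE_SIZE 211

struct symbol {
    const char *name;        /* identifier (a real table would copy the string) */
    const char *type;        /* attribute, e.g. "int", "float", "function"      */
    struct symbol *next;     /* chaining for hash collisions                    */
};

static struct symbol *table[TABLE_SIZE];

static unsigned hash(const char *s) {
    unsigned h = 0;
    while (*s) h = h * 31 + (unsigned char)*s++;
    return h % TABLE_SIZE;
}

/* lookup(): return the entry for name, or NULL if it was never declared */
static struct symbol *lookup(const char *name) {
    for (struct symbol *p = table[hash(name)]; p != NULL; p = p->next)
        if (strcmp(p->name, name) == 0)
            return p;
    return NULL;
}

/* insert(): add a new identifier together with its type attribute */
static void insert(const char *name, const char *type) {
    unsigned h = hash(name);
    struct symbol *p = malloc(sizeof *p);
    p->name = name;
    p->type = type;
    p->next = table[h];
    table[h] = p;
}

int main(void) {
    insert("sum", "int");
    insert("average", "float");
    struct symbol *s = lookup("average");
    if (s != NULL)
        printf("%s has type %s\n", s->name, s->type);
    return 0;
}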

• ABSTRACT STACK MACHINE


An abstract stack machine is a theoretical model used in
compiler construction to design and optimize the execution of
programming languages. It abstracts the underlying hardware
and represents the computation using a stack-based approach.

In this model, operands are pushed onto a stack, and operations


pop operands from the stack, perform the operation, and push
the result back onto the stack. This simplicity allows for easier
code generation and optimization during the compilation
process.

Compiler developers use the abstract stack machine as an


intermediate representation between the source code and the
target machine code. This abstraction helps in the optimization

and translation of high-level programming constructs into


efficient low-level code.

Ultimately, the abstract stack machine provides a clear and


uniform way to express computations during the compilation
process, facilitating the implementation of compilers for diverse
programming languages.
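
A tiny illustrative sketch of such a machine in C is given below: operands are pushed, and each operator pops two values and pushes the result. The instruction set (PUSH, ADD, MUL) and the expression being evaluated (10 + 20 * 3) are assumptions made for the example:

#include <stdio.h>

enum opcode { PUSH, ADD, MUL };

struct instr { enum opcode op; int value; };   /* value is used only by PUSH */

int main(void) {
    /* stack code for the expression 10 + 20 * 3 */
    struct instr code[] = {
        { PUSH, 10 }, { PUSH, 20 }, { PUSH, 3 }, { MUL, 0 }, { ADD, 0 }
    };
    int n = (int)(sizeof code / sizeof code[0]);
    int stack[64], top = -1;

    for (int pc = 0; pc < n; pc++) {
        int a, b;
        switch (code[pc].op) {
        case PUSH: stack[++top] = code[pc].value; break;
        case ADD:  b = stack[top--]; a = stack[top--]; stack[++top] = a + b; break;
        case MUL:  b = stack[top--]; a = stack[top--]; stack[++top] = a * b; break;
        }
    }
    printf("result = %d\n", stack[top]);   /* prints 70 */
    return 0;
}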

• FINITE AUTOMATA
Finite automata is a state machine that takes a string of symbols
as input and changes its state accordingly. Finite automata is a
recognizer for regular expressions. When a regular expression
string is fed into finite automata, it changes its state for each
literal. If the input string is successfully processed and the
automata reaches its final state, it is accepted, i.e., the string just
fed was said to be a valid token of the language in hand.
The mathematical model of finite automata consists of:
• Finite set of states (Q)
• Finite set of input symbols (Σ)
• One Start state (q0)
• Set of final states (qf)
• Transition function (δ)
The transition function (δ) maps the finite set of state (Q) to a
finite set of input symbols (Σ), Q × Σ ➔ Q
Finite Automata Construction

Let L(r) be a regular language recognized by some finite


automata (FA).
• States : States of FA are represented by circles. State names
are written inside circles.
• Start state : The state from where the automata starts, is
known as the start state. Start state has an arrow pointed
towards it.
• Intermediate states : All intermediate states have at least two
arrows; one pointing to and another pointing out from them.
• Final state : If the input string is successfully parsed, the
automata is expected to be in this state. Final state is
represented by double circles. It may have any odd number of
arrows pointing to it and even number of arrows pointing out
from it. The number of odd arrows are one greater than even,
i.e. odd = even+1.
• Transition : The transition from one state to another state
happens when a desired symbol in the input is found. Upon
transition, automata can either move to the next state or stay in
the same state. Movement from one state to another
is shown as a directed arrow, where the arrows points to the
destination state. If automata stays on the same state, an arrow
pointing from a state to itself is drawn. Example : We assume
FA accepts any three digit binary value ending in digit 1. FA =
{Q(q0, qf), Σ(0,1), q0, qf, δ}

• DETERMINISTIC FINITE AUTOMATA

• DFA consists of 5 tuples {Q, Σ, q, F, δ}.


• Q : set of all states.
• Σ : set of input symbols. ( Symbols which machine takes as
input )
• q : Initial state. ( Starting state of a machine )
• F : set of final state.
• δ : Transition Function, defined as δ : Q X Σ --> Q.

In a DFA, for a particular input character, the machine goes to


one state only. A transition function is defined on every state for
every input symbol. Also in DFA null (or ε) move is not allowed,
i.e., DFA cannot change state without any input character.
• For example, construct a DFA which accept a language of
all strings ending with ‘a’.
Given: Σ = {a,b}, q = {q0}, F={q1}, Q = {q0, q1}
• First, consider a language set of all the possible acceptable
strings in order to construct an accurate state transition
diagram.

L = {a, aa, aaa, aaaa, aaaaa, ba, bba, bbbaa, aba, abba, aaba,
abaa}
The above is a simple subset of the possible acceptable strings; there can be many other strings which end with 'a' and contain only the symbols {a, b}.

• Fig 1. State Transition Diagram for DFA with Σ = {a, b}

• Strings not accepted are, ab, bb, aab, abbb, etc.

State transition table for above automaton,


State \ Symbol    a     b
q0                q1    q0
q1                q1    q0

One important thing to note is, there can be many possible


DFAs for a pattern. A DFA with a minimum number of states is
generally preferred.
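
A minimal sketch in C of simulating this DFA from its transition table is shown below; the encoding of 'a' as column 0 and 'b' as column 1, and the test string, are just for illustration:

#include <stdio.h>

int main(void) {
    /* transition[state][symbol]: symbol 0 = 'a', 1 = 'b' */
    int transition[2][2] = {
        { 1, 0 },   /* from q0: on 'a' go to q1, on 'b' stay in q0 */
        { 1, 0 },   /* from q1: on 'a' stay in q1, on 'b' go to q0 */
    };
    const char *input = "abba";
    int state = 0;                           /* start state q0 */

    for (const char *p = input; *p; p++)
        state = transition[state][*p == 'a' ? 0 : 1];

    printf("\"%s\" is %s\n", input,
           state == 1 ? "accepted" : "rejected");   /* q1 is the final state */
    return 0;
}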

• NON-DETERMINISTIC FINITE AUTOMATA


NFA is similar to DFA except following additional features:

1. Null (or ε) move is allowed i.e., it can move forward without


reading symbols.
2. Ability to transmit to any number of states for a particular
input.

However, these above features don’t add any power to NFA. If


we compare both in terms of power, both are equivalent.

Due to the above additional features, an NFA has a different transition function; the rest is the same as a DFA.
Transition function: δ: Q × (Σ ∪ {ε}) → 2^Q
As you can see from the transition function, for any input (including ε) the NFA can move to any number of states.
For example, below is an NFA for the above problem.

Fig 2. State Transition Diagram for NFA with Σ = {a, b}

State Transition Table for above Automaton,



State \ Symbol    a           b
q0                {q0, q1}    {q0}
q1                ∅           ∅

One important thing to note is that in an NFA, if any path for an input string leads to a final state, then the input string is accepted. For example, in the above NFA there are multiple paths for the input string "aa"; since one of those paths leads to the final state, "aa" is accepted by the NFA.
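
Because an NFA can be in several states at once, a simulator keeps a set of current states. A small illustrative sketch in C, representing the state set as a bit mask (bit 0 for q0, bit 1 for q1), is shown below; the test string is arbitrary:

#include <stdio.h>

int main(void) {
    const char *input = "bba";
    unsigned states = 1u << 0;           /* start in {q0} */

    for (const char *p = input; *p; p++) {
        unsigned next = 0;
        if (states & (1u << 0)) {        /* transitions out of q0 */
            if (*p == 'a') next |= (1u << 0) | (1u << 1);   /* q0 -a-> {q0, q1} */
            if (*p == 'b') next |= (1u << 0);               /* q0 -b-> {q0}     */
        }
        /* q1 has no outgoing transitions */
        states = next;
    }
    /* accept if any path reached the final state q1 */
    printf("\"%s\" is %s\n", input,
           (states & (1u << 1)) ? "accepted" : "rejected");
    return 0;
}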

• CONVERSION FROM REGULAR EXPRESSION TO NFA
An ε-NFA is similar to the NFA but differs slightly in allowing the epsilon move. This automaton replaces the transition function with one that allows the empty string ε as a possible input. Transitions made without consuming an input symbol are called ε-transitions. In state diagrams, they are usually labeled with the Greek letter ε. ε-transitions provide a convenient way of modeling systems whose current states are not precisely known: i.e., if we are modeling a system and it is not clear whether the current state (after processing some input string) should be q or q', then we can add an ε-transition between these two states, thus putting the automaton in both states simultaneously.

One way to implement regular expressions is to convert them into a finite automaton known as an ε-NFA (epsilon-NFA). An ε-NFA is a type of automaton that allows for the use of "epsilon" transitions, which do not consume any input. This means that the automaton can move from one state to another without consuming any characters from the input string.

The process of converting a regular expression into an ε-NFA is as follows:
1. Create a single start state for the automaton, and mark it as
the initial state.
2. For each character in the regular expression, create a new
state and add an edge between the previous state and the
new state, with the character as the label.
3. For each operator in the regular expression (such as “*” for
zero or more, “+” for one or more, and “?” for zero or one),
create new states and add the appropriate edges to represent
the operator.
4. Mark the final state as the accepting state, which is the state
that is reached when the regular expression is fully
matched.

Common regular expressions used in constructing an ε-NFA:

Example: Create an ε-NFA for the regular expression (a/b)*a



• DESIGN OF LEXICAL ANALYZER GENERATOR

Designing a lexical analyzer generator involves creating a tool


that takes a formal description of lexical rules and generates a
lexical analyzer (lexer) for a specific programming language.
Here's a simplified outline of the design process:

1. Input Specification:
- Define a formal language to specify lexical rules, often using
regular expressions or similar formalisms.
- Allow users to provide a description of tokens, patterns, and
associated actions.

2.Lexer Generator Input Format:


- Define a clear syntax for users to specify token patterns,
possibly using regular expressions, along with any associated
semantic actions.

3. Token Definition:
- Define structures to represent tokens and their attributes.

- Specify any necessary metadata associated with tokens.

4. Lexical Rules Processing:


- Implement algorithms to process the lexical rules and convert
them into a format suitable for code generation.

5. Code Generation:
- Generate code for the lexical analyzer based on the processed
lexical rules.
- Output code should typically be in a language like C, Java, or
another programming language.

6. Error Handling:
- Implement mechanisms to handle errors gracefully, providing
meaningful error messages when lexical errors are
encountered.

7. Integration with Parser:


- Design the generated lexical analyzer to integrate seamlessly
with the parser, possibly by providing functions to retrieve the
next token.

8. Performance Considerations:
- Optimize generated code for speed and efficiency.

- Consider techniques like DFA (Deterministic Finite


Automaton) minimization to reduce the size of the state
transition table.

9. Customization Options:
- Allow users to customize the generated lexer, for example, by
specifying additional code snippets to be included.

10. Documentation:
- Provide comprehensive documentation for users, explaining
the input format, customization options, and integration steps.

11. Testing and Debugging:


- Include tools for testing the generated lexer, such as providing
sample inputs and expected outputs.
- Enable debugging features to help users diagnose issues in
their lexical rules.

12. Cross-Platform Support:


- Ensure that the generated lexer is portable and can be easily
integrated into different environments and platforms.

Remember that designing a lexical analyzer generator can be


complex, and the effectiveness of the tool depends on the clarity
of the input specification, the efficiency of code generation, and
the quality of error handling. Additionally, studying existing

lexer generator tools like Lex or Flex can provide valuable


insights into best practices and implementation details.

• PATTERN MATCHING
Pattern matching in compiler construction involves identifying
and recognizing specific patterns within the source code. This is
crucial for tasks like lexical analysis, where tokens need to be
matched against predefined patterns. Here's a simple example
using regular expressions for pattern matching in a lexical
analyzer:

Let's consider a simplified scenario where we're building a


lexical analyzer for a programming language that includes
identifiers, numbers, and arithmetic operators.

1. Token Definitions:
- Identifiers: Any sequence of letters and digits, starting with a
letter.
- Numbers: Integer or floating-point numbers.
- Arithmetic Operators: '+', '-', '*', '/'

2. Regular Expressions:
- Identifier pattern: [a-zA-Z][a-zA-Z0-9]*
- Number pattern: \d+(\.\d+)?
- Arithmetic operator patterns: +, -, *, /

3. Example Source Code:


sum = 10 + 20 * 3;
average = sum / 2.0;

4. Lexical Analysis:
- The lexical analyzer scans the source code character by
character.
- It uses the defined regular expressions to match and identify
tokens.

5. Identified Tokens:
- For the given source code, the lexical analyzer might produce the following sequence of tokens:
- Identifier: sum
- Assignment operator: =
- Number: 10
- Arithmetic operator: +
- Number: 20
- Arithmetic operator: *
- Number: 3
- Separator: ;
- Identifier: average
- Assignment operator: =
- Identifier: sum
- Arithmetic operator: /
- Number: 2.0
- Separator: ;

6. Usage in Compiler Construction:


- The identified tokens are then used by subsequent compiler
phases, such as parsing and semantic analysis.

In this example, pattern matching with regular expressions is a


fundamental part of the lexical analysis phase. The defined
patterns help recognize and extract meaningful tokens from the
source code. This process lays the groundwork for building a
compiler that can understand and process a programming
language.
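
As a rough illustration of how such pattern matching can be hand-coded, the C sketch below scans a string and classifies identifiers, numbers and the four arithmetic operators according to the patterns listed above. The printed token names and the treatment of '=' and ';' as "other" characters are assumptions made for the sketch:

#include <ctype.h>
#include <stdio.h>
#include <string.h>

static void scan(const char *src) {
    const char *p = src;
    while (*p) {
        if (isspace((unsigned char)*p)) { p++; continue; }
        if (isalpha((unsigned char)*p)) {              /* [a-zA-Z][a-zA-Z0-9]* */
            const char *start = p;
            while (isalnum((unsigned char)*p)) p++;
            printf("IDENTIFIER  %.*s\n", (int)(p - start), start);
        } else if (isdigit((unsigned char)*p)) {       /* \d+(\.\d+)?          */
            const char *start = p;
            while (isdigit((unsigned char)*p)) p++;
            if (*p == '.' && isdigit((unsigned char)p[1])) {
                p++;
                while (isdigit((unsigned char)*p)) p++;
            }
            printf("NUMBER      %.*s\n", (int)(p - start), start);
        } else if (strchr("+-*/", *p)) {               /* arithmetic operators */
            printf("OPERATOR    %c\n", *p++);
        } else {                                       /* '=', ';', anything else */
            printf("OTHER       %c\n", *p++);
        }
    }
}

int main(void) {
    scan("sum = 10 + 20 * 3; average = sum / 2.0;");
    return 0;
}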

• OPTIMIZATION OF DFA-BASED PATTERN MATCHERS
In this section we present three algorithms that have been used
to implement and optimize pattern matchers constructed from
regular expressions.

1. The first algorithm is useful in a Lex compiler, because it


constructs a DFA directly from a regular expression, without
constructing an intermediate NFA. The resulting DFA also may
have fewer states than the DFA constructed via an NFA.

2. The second algorithm minimizes the number of states of any


DFA, by combining states that have the same future behavior.
The algorithm itself is quite efficient, running in time O(n log n),
where n is the number of states of the DFA.

3. The third algorithm produces more compact representations of


transition tables than the standard, two-dimensional table.

Important States of an NFA: To begin our discussion of how to go directly from a regular expression to a DFA, we must first dissect the NFA construction of Algorithm 3.23 and consider the roles played by various states. We call a state of an NFA important if it has a non-ε out-transition. Notice that the subset construction (Algorithm 3.20) uses only the important states in a set T when it computes ε-closure(move(T, a)), the set of states reachable from T on input a. That is, the set of states move(s, a) is nonempty only if state s is important. During the subset construction, two sets of NFA states can be identified (treated as if they were the same set) if they:

1. Have the same important states, and

2. Either both have accepting states or neither does.



When the NFA is constructed from a regular expression, we can


say more about the important states. The only important states
are those introduced as initial states in the basis part for a
particular symbol position in the regular expression. That is,
each important state corresponds to a particular operand in the
regular expression.

The constructed NFA has only one accepting state, but this state,
having no out-transitions, is not an important state. By
concatenating a unique right end marker # to a regular
expression r, we give the accepting state for r a transition on #,
making it an important state of the NFA for ( r ) # . In other
words, by using the augmented regular expression ( r ) # , we
can forget about accepting states as the subset construction
proceeds; when the construction is complete, any state with a
transition on # must be an accepting state.

The important states of the NFA correspond directly to the


positions in the regular expression that hold symbols of the
alphabet. It is useful, as we shall see, to present the regular
expression by its syntax tree, where the leaves correspond to
operands and the interior nodes correspond to operators. An
interior node is called a cat-node, or-node, or star-node if it is
labeled by the concatenation operator (dot), union operator |, or
star operator *, respectively.

Example: Figure 3.56 shows the syntax tree for the regular expression of our running example. Cat-nodes are represented by circles.
Leaves in a syntax tree are labeled by ε or by an alphabet symbol. To each leaf not labeled ε, we attach a unique integer. We refer to this integer as the position of the leaf and also as a position of its symbol. Note that a symbol can have several positions; for instance, a has positions 1 and 3 in Fig. 3.56. The positions in the syntax tree correspond to the important states of the constructed NFA.

Example: Figure 3.57 shows the NFA for the same regular expression as Fig. 3.56, with the important states numbered and other states represented by letters. The numbered states in the NFA and the positions in the syntax tree correspond in a way we shall soon see.

• THE ROLE OF PARSERS


In the syntax analysis phase, a compiler verifies whether or
not the tokens generated by the lexical analyzer are grouped
according to the syntactic rules of the language. This is done
by a parser. The parser obtains a string of tokens from the
lexical analyzer and verifies that the string can be the
grammar for the source language. It detects and reports any
syntax errors and produces a parse tree from which
intermediate code can be generated.

• CONTEXT FREE GRAMMAR


Context free grammar is a formal grammar which is used to generate all
possible strings in a given formal language.
Context free grammar G can be defined by four tuples as:
1. G= (V, T, P, S)
Where,
G describes the grammar
T describes a finite set of terminal symbols.

V describes a finite set of non-terminal symbols


P describes a set of production rules
S is the start symbol.

Two parse trees that describe CFGs that generate the string "x
+ y * z"

• TOP DOWN AND BOTTOM UP PARSING


Parser
Parser is a compiler that is used to break the data into smaller
elements coming from lexical analysis phase.
A parser takes input in the form of sequence of tokens and
produces output in the form of parse tree.
Parsing is of two types: top down parsing and bottom up parsing.

Top-down parsing
o Top-down parsing is also known as recursive parsing or predictive parsing.
o Top-down parsing is used to construct a parse tree for an input string.
o In top-down parsing, the parsing starts from the start symbol and transforms it into the input symbols.

Parse Tree representation of input string "acdb" is as follows



Bottom-up parsing
o Bottom-up parsing is also known as shift-reduce parsing.
o Bottom-up parsing is used to construct a parse tree for an input string.
o In bottom-up parsing, the parsing starts with the input symbols and constructs the parse tree up to the start symbol by tracing out the rightmost derivation of the string in reverse.
Example
Production
1. E → T
2. T → T * F
3. T → id
4. F → T
5. F → id
Parse Tree representation of input string "id * id" is as follows:
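
Using the productions above, a worked shift-reduce trace for the input string "id * id" shows how the parser builds up from the leaves toward the start symbol:

Stack        Input         Action
$            id * id $     shift id
$ id         * id $        reduce by T → id
$ T          * id $        shift *
$ T *        id $          shift id
$ T * id     $             reduce by F → id
$ T * F      $             reduce by T → T * F
$ T          $             reduce by E → T
$ E          $             accept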

Bottom-up parsing is classified into various types of parsing. These are as follows:
1. Shift-Reduce Parsing
2. Operator Precedence Parsing
3. Table Driven LR Parsing
   a. LR(1)
   b. SLR(1)
   c. CLR(1)
   d. LALR(1)
Hamza zahoor whatsapp 0341-8377-917

• OPERATOR PRECEDENCE PARSING

Operator precedence parsing is a kind of shift-reduce parsing method. It is applied to a small class of operator grammars.

A grammar is said to be an operator precedence grammar if it has two properties:
• No R.H.S. of any production has an ε.
• No two non-terminals are adjacent.
Operator precedence can only be established between the terminals of the grammar; it ignores the non-terminals.

There are three operator precedence relations:
a ⋗ b means that terminal "a" has higher precedence than terminal "b".
a ⋖ b means that terminal "a" has lower precedence than terminal "b".
a ≐ b means that terminals "a" and "b" have the same precedence.
Precedence table:
Parsing Action
o At both ends of the given input string, add the $ symbol.
o Now scan the input string from left to right until a ⋗ is encountered.
o Scan towards the left over all the equal precedences until the first leftmost ⋖ is encountered.
o Everything between the leftmost ⋖ and the rightmost ⋗ is a handle.
o $ on $ means parsing is successful.
Example
Grammar:
1. E → E+T/T
2. T → T*F/F
3. F → id
Given string:
1. w = id + id * id
Let us consider a parse tree for it as follows:
On the basis of the above tree, we can design the following operator precedence table (the table figure is not reproduced here). Now let us process the string with the help of the above precedence table (the step-by-step trace is not reproduced here; a small sketch of the parsing loop follows).
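Since the precedence table and the trace are not reproduced in these notes, the following is a minimal C sketch (an illustration, not part of the original material) of the terminal-only operator precedence loop for this grammar. The relations encoded in the table below for id, +, * and $ are the usual ones assumed for this example.

/* Sketch: operator precedence parsing of "id + id * id"
   for the grammar E -> E+T | T, T -> T*F | F, F -> id.
   Terminals are encoded as single characters: 'i' (id), '+', '*', '$'. */
#include <stdio.h>

#define LT '<'   /* a has lower precedence than b  (shift)  */
#define GT '>'   /* a has higher precedence than b (reduce) */
#define EQ '='   /* equal precedence                         */
#define AC 'A'   /* accept                                   */
#define ER 'E'   /* error                                    */

static int idx(char t) {            /* map terminal to table index */
    switch (t) { case 'i': return 0; case '+': return 1;
                 case '*': return 2; case '$': return 3; }
    return -1;
}

/* Precedence table: rows = terminal on top of stack, cols = lookahead. */
static const char table[4][4] = {
    /*           id    +     *     $   */
    /* id */ {  ER,   GT,   GT,   GT },
    /* +  */ {  LT,   GT,   LT,   GT },
    /* *  */ {  LT,   GT,   GT,   GT },
    /* $  */ {  LT,   LT,   LT,   AC },
};

int main(void) {
    const char *input = "i+i*i$";          /* id + id * id, followed by $ */
    char stack[64] = { '$' };              /* stack holds terminals only  */
    int top = 0, ip = 0;

    for (;;) {
        char a   = input[ip];
        char rel = table[idx(stack[top])][idx(a)];
        if (rel == AC) { printf("accept\n"); return 0; }
        if (rel == LT || rel == EQ) {              /* shift */
            stack[++top] = a; ip++;
            printf("shift %c\n", a);
        } else if (rel == GT) {                    /* reduce a handle */
            char popped;
            do { popped = stack[top--]; }
            while (table[idx(stack[top])][idx(popped)] != LT);
            printf("reduce handle ending with %c\n", popped);
        } else { printf("error\n"); return 1; }
    }
}

The trace printed by this sketch mirrors the parsing actions described above: terminals are shifted while the relation is ⋖, and a handle is reduced as soon as the relation becomes ⋗.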
• LR PARSERS
The LR parser is a bottom-up parser for context-free grammars that is very widely used by programming language compilers and other associated tools. An LR parser reads its input from left to right and produces a rightmost derivation in reverse. It is called a bottom-up parser because it attempts to reach the top-level grammar productions by building up from the leaves. LR parsers are the most powerful of all deterministic parsers used in practice.
Description of LR parser:
In the term LR(k) parser, L refers to left-to-right scanning of the input, R refers to the rightmost derivation in reverse, and k refers to the number of unconsumed "lookahead" input symbols that are used in making parsing decisions. Typically, k is 1 and is often omitted. A context-free grammar is called LR(k) if an LR(k) parser exists for it. The parser reduces the sequence of tokens from the left; read from the top down, the derivation order expands the non-terminals first.
1. Initially the stack is empty, and we are looking to reduce by the rule S' → S$.
2. A "." in a rule represents how much of that rule's right-hand side is already on the stack.
3. A dotted item, or simply an item, is a production rule with a dot indicating how much of the RHS has been recognized so far.
The closure of an item is used to see what production rules can be used to expand the current structure. It is computed as follows.
Rules for computing the closure:
1. The first item from the given grammar rule adds itself to the first closed set.
2. If an item of the form A → α . B γ is present in the closure, where the symbol B after the dot is a non-terminal, add B's production rules to the set, each with the dot preceding the first symbol of its right-hand side.
3. Repeat step 2 for the new items added.
LR parser algorithm:
The LR parsing algorithm is the same for all LR parsers, but the parsing table is different for each parser. It consists of the following components.
1. Input Buffer – it contains the given string, which ends with a $ symbol.
2. Stack – the combination of the state on top of the stack and the current input symbol is used to index the parsing table in order to take parsing decisions.
Parsing Table:
The parsing table is divided into two parts, the Action table and the Goto table. The action table specifies the action to perform for the given current state and the current terminal in the input stream. There are four cases used in the action table, as follows.
1. Shift action – the present terminal is removed from the input stream and the state n is pushed onto the stack, becoming the new present state.
2. Reduce action – the rule number m is written to the output stream.
3. For each symbol on the right-hand side of rule m, one state is popped from the stack.
4. The non-terminal on the left-hand side of rule m and the state now on top of the stack are used to look up a new state in the goto table, which is pushed onto the stack and becomes the new current state.
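As a concrete illustration of this loop (not taken from the notes), here is a minimal table-driven LR driver in C. The toy grammar used is (1) E -> E + id, (2) E -> id, and the ACTION and GOTO tables below were built by hand for this grammar only, so they are assumptions for illustration.

/* Minimal table-driven LR parsing loop for the toy grammar
   (1) E -> E + id   (2) E -> id,  input: id + id + id $      */
#include <stdio.h>

enum { ID, PLUS, END };                         /* terminal codes */

typedef struct { char kind; int arg; } Act;     /* 's' shift, 'r' reduce, 'a' accept, 'e' error */

static const Act action[5][3] = {
    /*            id        +         $      */
    /* 0 */ { {'s',2}, {'e',0}, {'e',0} },
    /* 1 */ { {'e',0}, {'s',3}, {'a',0} },
    /* 2 */ { {'e',0}, {'r',2}, {'r',2} },
    /* 3 */ { {'s',4}, {'e',0}, {'e',0} },
    /* 4 */ { {'e',0}, {'r',1}, {'r',1} },
};
static const int goto_E[5]   = { 1, -1, -1, -1, -1 };  /* GOTO[state][E]        */
static const int rhs_len[3]  = { 0, 3, 1 };            /* RHS lengths of rules 1, 2 */

int main(void) {
    int input[] = { ID, PLUS, ID, PLUS, ID, END };
    int stack[64] = { 0 }, top = 0, ip = 0;            /* stack of states, start state 0 */

    for (;;) {
        Act a = action[stack[top]][input[ip]];
        switch (a.kind) {
        case 's':                                /* shift: push the new state       */
            stack[++top] = a.arg; ip++;
            printf("shift -> state %d\n", a.arg);
            break;
        case 'r':                                /* reduce by rule a.arg            */
            top -= rhs_len[a.arg];               /* pop one state per RHS symbol    */
            stack[top + 1] = goto_E[stack[top]]; /* GOTO on the LHS non-terminal E  */
            top++;
            printf("reduce by rule %d\n", a.arg);
            break;
        case 'a': printf("accept\n"); return 0;
        default:  printf("syntax error\n"); return 1;
        }
    }
}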

LR parser diagram (figure not shown).
• REMOVING AMBIGUITIES IN CONTEXT FREE GRAMMAR
Grammars which have more than one derivation tree or parse tree for some string are ambiguous grammars. Such grammars cannot be handled by deterministic parsers without first being rewritten.
Example –
1. Consider the production shown below:
S -> aSbS | bSaS | ε
Say we want to generate the string "abab" from the above grammar. We can observe that the given string can be derived using two parse trees. So, the above grammar is ambiguous.
The grammars which have only one derivation tree or parse tree are called unambiguous grammars.
2. Consider the productions shown below:
S -> AB
A -> Aa | a
B -> b
For the string "aab" we have only one parse tree for the above grammar (figure not shown).
It is important to note that there is no general algorithm to decide whether a grammar is ambiguous or not. We need to build the parse trees for a given input string that belongs to the language produced by the grammar and then decide whether the grammar is ambiguous or unambiguous based on the number of parse trees obtained, as discussed above.
Removal of Ambiguity:
We can remove ambiguity solely on the basis of the following two properties.
1. Precedence –
If different operators are used, we will consider the precedence of the operators. The three important characteristics are:
1. The level at which the production is present denotes the
priority of the operator used.
2. The production at higher levels will have operators with
less priority. In the parse tree, the nodes which are at top
levels or close to the root node will contain the lower
priority operators.
3. The production at lower levels will have operators with
higher priority. In the parse tree, the nodes which are at
lower levels or close to the leaf nodes will contain the
higher priority operators.
2. Associativity –
If the same precedence operators are in production, then we will
have to consider the associativity.
• If the associativity is left to right, then we have to introduce left recursion in the production. The parse tree will also be left recursive and grow on the left side.
+, -, *, / are left associative operators.
• If the associativity is right to left, then we have to introduce right recursion in the productions. The parse tree will also be right recursive and grow on the right side.
^ is a right associative operator.
• PARSER GENERATOR
A parser generator is a tool used in compiler construction to
automatically generate parsers for a given formal grammar.
Here's a brief overview of how a parser generator works:

1. *Grammar Specification:*
- Define the grammar of the programming language using a
formalism such as BNF (Backus-Naur Form) or EBNF
(Extended Backus-Naur Form). This grammar describes the
syntactic structure of the language.

2. *Parser Generator Input:*
- Input the formal grammar into the parser generator tool.

3. *Parser Generation Process:*
- The parser generator processes the formal grammar and generates source code for a parser in a programming language like C, Java, or others.
4. *Types of Parsers:*
- Parser generators can produce different types of parsers, such
as LL parsers, LR parsers, or LALR parsers, depending on the
algorithm used.

5. *Table Construction:*
- The generator constructs parsing tables based on the grammar. These tables guide the parsing process, specifying how the parser should behave for different input symbols and states.
6. *Code Output:*
- The parser generator outputs source code for the generated
parser. This code typically includes functions or methods to
parse the input source code and build a parse tree or abstract
syntax tree.

7. *Integration with Lexical Analysis:*
- The generated parser is often integrated with a lexer (lexical analyzer) to create a complete syntax analysis phase.
8. *Error Handling:*
- The generated parser includes mechanisms for error detection
and reporting, helping to provide meaningful feedback to
developers when syntax errors are encountered.

9. *Integration into the Compiler:*
- The generated parser becomes a crucial component of the compiler, handling the syntactic analysis of the source code.
10. *Optimizations:*
- Some parser generators may include options for optimization, allowing users to fine-tune the generated parser for better performance.
Popular parser generators include tools like Yacc (Yet Another Compiler Compiler), Bison, ANTLR (ANother Tool for Language Recognition), and JavaCC (Java Compiler Compiler). Using a parser generator simplifies the process of creating a parser for a programming language, as developers can focus on defining the grammar rather than writing the intricate parsing code manually.
• The Parser Generator Yacc

A translator can be constructed using Yacc in the manner illustrated in Fig. 4.57. First, a file, say translate.y, containing a Yacc specification of the translator is prepared. The UNIX system command

yacc translate.y

transforms the file translate.y into a C program called y.tab.c using the LALR method outlined in Algorithm 4.63. The program y.tab.c is a representation of an LALR parser written in C, along with other C routines that the user may have prepared. The LALR parsing table is compacted as described in Section 4.7. By compiling y.tab.c along with the ly library that contains the LR parsing program, using the command
cc y.tab.c -ly

we obtain the desired object program a.out that performs the translation specified by the original Yacc program. If other procedures are needed, they can be compiled or loaded with y.tab.c, just as with any C program.
A Yacc source program has three parts: declarations, translation rules, and supporting C routines, separated by %% lines.
• PROBLEMS SOLVING
In compiler construction, various challenges and problems arise
that developers need to address to create effective and efficient
compilers. Here are some common problems encountered and
solved in the process:

1. *Ambiguity in Grammar:*
- *Problem:* Ambiguous grammars can lead to multiple interpretations of the same input, causing challenges for parser generators.
- *Solution:* Refine the grammar to remove ambiguity or
use techniques like associativity and precedence to clarify
parsing rules.

2. *Error Handling:*
- *Problem:* Detecting and recovering from errors in the
source code is crucial for providing meaningful feedback to
developers.
- *Solution:* Implement robust error-handling mechanisms
within the lexer and parser to identify and report errors
gracefully.

3. *Efficient Parsing:*
- *Problem:* Some parsing algorithms can be
computationally expensive, affecting compiler performance.
- *Solution:* Choose parsing algorithms carefully (e.g., LL,
LR, LALR) based on the characteristics of the language
grammar. Implement optimizations to enhance parsing speed.

4. *Symbol Table Management:*
- *Problem:* Efficiently managing symbol tables for variables, functions, and other identifiers is crucial for semantic analysis.
- *Solution:* Design and implement data structures and
algorithms for symbol table management, supporting scope
resolution, type checking, and other semantic tasks.

5. *Optimizations:*
- *Problem:* Generating efficient target code while
maintaining correctness can be challenging.
- *Solution:* Implement various optimization techniques,
including constant folding, loop optimization, and register
allocation, to enhance the performance of the generated code.

6. *Handling Language Features:*
- *Problem:* Implementing complex language features, such as nested functions or advanced data types, can be challenging.
- *Solution:* Break down language features into smaller
tasks and implement them incrementally. Leverage modular
design principles to handle different language constructs.

7. *Code Generation for Different Architectures:*
- *Problem:* Generating code that is optimal for diverse target architectures requires careful consideration.
- *Solution:* Design code generation modules that can be customized for different target architectures. Consider platform-specific optimizations.

8. *Debugging Information:*
- *Problem:* Generating debug information that aids
developers in identifying issues in the source code can be
complex.
- *Solution:* Implement mechanisms to include debugging
information in the generated code, such as line numbers and
variable names, to facilitate debugging.

9. *Security Concerns:*
- *Problem:* Compilers need to be designed with security in
mind to prevent vulnerabilities like buffer overflows or injection
attacks.
- *Solution:* Implement secure coding practices, conduct
rigorous testing, and incorporate security checks in the compiler
construction process.

10. *Cross-Platform Compatibility:*
- *Problem:* Ensuring that the compiler works seamlessly across different platforms can be challenging.
- *Solution:* Adopt platform-independent design principles and conduct thorough testing on various platforms to ensure compatibility.

Addressing these problems requires a combination of theoretical understanding, practical implementation skills, and careful consideration of the specific requirements of the target language and architecture.

• SYNTAX DIRECTED DEFINITION

Syntax Directed Definition (SDD) is a kind of abstract specification. It is a generalization of context-free grammar in which each grammar production X → α is associated with a set of semantic rules of the form s = f(b1, b2, …, bk), where s is an attribute obtained from the function f. The attribute can be a string, a number, a type, or a memory location. Semantic rules are fragments of code which are usually embedded at the end of a production and enclosed in curly braces ({ }).

Example:

E --> E1 + T { E.val = E1.val + T.val }

Annotated Parse Tree – The parse tree containing the values of attributes at each node for a given input string is called an annotated or decorated parse tree.

Features –
• High level specification
• Hides implementation details
• Explicit order of evaluation is not specified
Types of attributes – There are two types of attributes:
1. Synthesized Attributes – These are the attributes which derive their values from their children nodes, i.e. the value of a synthesized attribute at a node is computed from the values of the attributes at the children of that node in the parse tree.
Example:
E --> E1 + T { E.val = E1.val + T.val }
In this, E.val derives its value from E1.val and T.val.
Computation of Synthesized Attributes –
• Write the SDD using appropriate semantic rules for each production in the given grammar.
• The annotated parse tree is generated and attribute values are computed in a bottom-up manner.
• The value obtained at the root node is the final output.
Example: Consider the following grammar

S --> E
E --> E1 + T
E --> T
T --> T1 * F
T --> F
F --> digit

The SDD for the above grammar can be written as follows (SDD table not shown). Let us assume an input string 4 * 5 + 6 for computing synthesized attributes. The annotated parse tree for the input string is not reproduced here.
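Since the annotated parse tree is not reproduced, the following small C sketch (an assumed illustration, not from the notes) computes the synthesized attribute val for the same input 4 * 5 + 6 by walking a hand-built tree in post-order; the root receives the value 26, exactly as the annotated tree would show.

/* Computing the synthesized attribute `val` for 4 * 5 + 6 bottom-up. */
#include <stdio.h>

typedef struct Node {
    char op;                 /* '+', '*', or 0 for a digit leaf */
    int  val;                /* synthesized attribute            */
    struct Node *left, *right;
} Node;

/* Post-order walk: children first, then the parent's semantic rule. */
static int eval(Node *n) {
    if (n->op == 0) return n->val;            /* F -> digit : F.val = digit   */
    int l = eval(n->left), r = eval(n->right);
    n->val = (n->op == '+') ? l + r : l * r;  /* E.val = E1.val + T.val, etc. */
    return n->val;
}

int main(void) {
    Node four = {0, 4, 0, 0}, five = {0, 5, 0, 0}, six = {0, 6, 0, 0};
    Node mul  = {'*', 0, &four, &five};        /* T -> T1 * F */
    Node add  = {'+', 0, &mul, &six};          /* E -> E1 + T */
    printf("E.val = %d\n", eval(&add));        /* prints E.val = 26 */
    return 0;
}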
2. Inherited Attributes – These are the attributes which derive their values from their parent or sibling nodes, i.e. the values of inherited attributes are computed from the values of parent or sibling nodes.
Example:

A --> BCD { C.in = A.in, C.type = B.type }


Computation of Inherited Attributes –
• Construct the SDD using semantic actions.
• The annotated parse tree is generated and attribute values are computed in a top-down manner.
Example: Consider the following grammar

S --> T L
T --> int
T --> float
T --> double
L --> L1, id
L --> id
The SDD for the above grammar can be written as follows (table not shown).

Let us assume an input string int a, c for computing inherited attributes. The annotated parse tree for the input string is not reproduced here.
• CONSTRUCTION OF SYNTAX TREE
A syntax tree is a tree in which each leaf node represents an operand, while each interior node represents an operator. The syntax tree is a condensed (abbreviated) form of the parse tree. The syntax tree is usually used when representing a program in a tree structure.

Rules of Constructing a Syntax Tree

A syntax tree's nodes can all be represented as records with several fields. One field of an operator node identifies the operator, while the remaining fields contain pointers to the operand nodes. The operator is also known as the node's label. The nodes of the syntax tree for expressions with binary operators are created using the following functions. Each function returns a pointer to the node that was most recently created.

1. mknode(op, left, right): creates an operator node with label op and two fields containing the pointers left and right.
2. mkleaf(id, entry): creates an identifier node with label id and a field entry, which is a reference to the identifier's symbol-table entry.
3. mkleaf(num, val): creates a number node with label num and a field containing the number's value, val.
For example, to make a syntax tree for the expression a - 4 + c, these functions are called in sequence, and p1, p2, …, p5 denote the pointers to the nodes returned by the calls; the leaves for the identifiers 'a' and 'c' hold references to their symbol-table entries.
Example 1: Syntax tree for the string a – b * c + d (figure not shown).
Example 2: Syntax tree for the string a * (b + c) – d / 2 (figure not shown).
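The following is a minimal C sketch of the mknode/mkleaf interface described above, used to build the tree for a - 4 + c. Because C has no overloading, the two mkleaf variants are split into mkleaf_id and mkleaf_num, and a string stands in for a real symbol-table pointer; these are simplifications assumed only for illustration.

/* mknode/mkleaf sketch and the build sequence for a - 4 + c. */
#include <stdio.h>
#include <stdlib.h>

typedef struct Node {
    char label;                    /* operator, 'i' for id, 'n' for num    */
    const char *entry;             /* stand-in for a symbol-table pointer  */
    int value;                     /* numeric value for num leaves         */
    struct Node *left, *right;     /* operand pointers for operator nodes  */
} Node;

Node *mknode(char op, Node *left, Node *right) {   /* operator node */
    Node *p = malloc(sizeof *p);
    p->label = op; p->entry = NULL; p->value = 0;
    p->left = left; p->right = right;
    return p;
}

Node *mkleaf_id(const char *entry) {               /* identifier leaf */
    Node *p = mknode('i', NULL, NULL);
    p->entry = entry;
    return p;
}

Node *mkleaf_num(int val) {                        /* number leaf */
    Node *p = mknode('n', NULL, NULL);
    p->value = val;
    return p;
}

int main(void) {
    /* Syntax tree for a - 4 + c, built in the order described above. */
    Node *p1 = mkleaf_id("a");
    Node *p2 = mkleaf_num(4);
    Node *p3 = mknode('-', p1, p2);
    Node *p4 = mkleaf_id("c");
    Node *p5 = mknode('+', p3, p4);
    printf("root operator: %c\n", p5->label);      /* prints '+' */
    return 0;
}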
• BOTTOM-UP EVALUATION OF S-ATTRIBUTED DEFINITIONS
S-attributed SDT:
• If an SDT uses only synthesized attributes, it is called an S-attributed SDT.
• S-attributed SDTs are evaluated in bottom-up parsing, as the values of the parent nodes depend upon the values of the child nodes.
• Semantic actions are placed at the rightmost place of the RHS.
In compiler construction, the evaluation of S-attributed
definitions involves attributing values to the nodes of a syntax
tree, where the attributes are synthesized or inherited from the
child nodes to the parent nodes. Bottom-up evaluation, also
known as bottom-up parsing, starts from the leaves (the
terminals or basic elements of the language) and works its way
up to the root of the syntax tree.

Here are the key steps in the bottom-up evaluation of S-attributed definitions:

1. *Syntax Tree Construction:*
- Build the syntax tree for the input program based on the grammar rules. The leaves of the tree represent the basic elements of the language (terminals), and the interior nodes represent higher-level language constructs.

2. *Attribute Computation:*
- At each node of the syntax tree, compute the synthesized and
inherited attributes according to the S-attributed definitions
associated with the grammar rules.

3. *Bottom-Up Evaluation:*
- Start the evaluation process from the leaves of the syntax tree
and proceed towards the root.
- For each node, compute its synthesized attributes based on the
values inherited from its children. If the node has inherited
attributes, they are computed first.
- Continue this process until the attributes for the root node are
computed.

4. *Inherited Attributes:*
- If a node has inherited attributes, compute them based on the
values inherited from its parent or other ancestors. These
attributes may depend on values computed during the bottom-
up evaluation of its siblings.

5. *Semantic Actions:*
- Incorporate semantic actions into the evaluation process. These
actions are associated with the grammar rules and are executed
during attribute computation. They are responsible for assigning
values to attributes.
6. *Example:*
- Consider a simple S-attributed definition for a programming
language construct like arithmetic expressions. The synthesized
attribute might be the value of the expression, and the inherited
attribute might represent additional information.
- An S-attributed rule could be:

E -> E1 + T { E.val = E1.val + T.val }
E -> T { E.val = T.val }
T -> T1 * F { T.val = T1.val * F.val }
T -> F { T.val = F.val }
F -> ( E ) { F.val = E.val }
F -> num { F.val = num.val }

- Here, val is the synthesized attribute representing the value of the corresponding non-terminal.

Bottom-up evaluation is often associated with bottom-up parsing techniques such as LR (left-to-right, rightmost derivation) parsing. This approach is common in compiler construction tools like Yacc and Bison, which use bottom-up parsing to build syntax trees and evaluate S-attributed definitions during the construction process.
• L-ATTRIBUTED DEFINITION

➢ If an SDT uses both synthesized attributes and inherited attributes, with the restriction that an inherited attribute can inherit values from the parent and left siblings only, it is called an L-attributed SDT.
➢ Attributes in L-attributed SDTs are evaluated in a depth-first, left-to-right parsing manner.
➢ Semantic actions are placed anywhere in RHS.
➢ Example: S -> ABC. Here the attribute of B can obtain its value either from the parent S or from its left sibling A, but it cannot inherit from its right sibling C. The same goes for A and C – A can only get its value from its parent, and C can get its value from S, A, and B as well, because C is the rightmost symbol in the given production.

• TOP DOWN TRANSLATION

Top-down Parsing: When the parser generates the parse tree by expanding from the top and tracing out a leftmost derivation of the input, the parsing is called top-down parsing. Top-down parsing initiates with the start symbol and ends at the terminals. Such parsing is also known as predictive parsing.
Recursive Descent Parsing: Recursive descent parsing is a type of top-down parsing technique. This technique follows the process for every terminal and non-terminal entity. It reads the input from left to right and constructs the parse tree from the top down. As the technique works recursively, it is called recursive descent parsing.
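A small sketch of this idea follows. The grammar S -> a A b, A -> c d | c is an assumption chosen so that its language contains the string "acdb" used in the earlier top-down parsing example; each non-terminal becomes one C function, and the functions call each other recursively while reading the input from left to right.

/* Minimal recursive-descent sketch for the assumed grammar
       S -> a A b
       A -> c d | c                                          */
#include <stdio.h>

static const char *input = "acdb";   /* string to parse        */
static int pos = 0;                  /* current input position */

static int match(char t) {           /* consume terminal t if it is next */
    if (input[pos] == t) { pos++; return 1; }
    return 0;
}

static int A(void) {                 /* A -> c d | c */
    if (!match('c')) return 0;
    match('d');                      /* one symbol of lookahead decides
                                        between the two alternatives     */
    return 1;
}

static int S(void) {                 /* S -> a A b */
    return match('a') && A() && match('b');
}

int main(void) {
    if (S() && input[pos] == '\0') printf("accepted\n");
    else                           printf("rejected\n");
    return 0;
}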
Back-tracking: In this technique, parsing starts from the root node with some production; if the derivation fails, the parser backtracks and restarts the process with different production rules.

• BOTTOM UP EVALUATION
Bottom-up Parsing: Bottom-up parsing works just the reverse of top-down parsing. It traces the rightmost derivation of the input in reverse, until it reaches the start symbol.
Shift-Reduce Parsing: Shift-reduce parsing works in two steps: the shift step and the reduce step.
Shift step: The shift step indicates the advancing of the input pointer to the next input symbol, which is shifted onto the stack.
Reduce step: When the parser finds the complete right-hand side of a grammar rule on the stack, it replaces it with the non-terminal on the left-hand side of that rule.
LR Parsing: The LR parser is one of the most efficient syntax analysis techniques, as it works with context-free grammars. In LR parsing, L stands for left-to-right scanning of the input, and R stands for constructing a rightmost derivation in reverse.

• RECURSION
Recursion is defined as a process which calls itself directly
or indirectly and the corresponding function is called a
recursive function.
Recursion plays a crucial role in various aspects of compiler
construction, particularly in parsing and semantic analysis. Here
are some key areas where recursion is commonly used:

1. *Parsing:*
- *Recursive Descent Parsing:* In top-down parsing,
recursive descent parsing involves implementing parsing
functions that correspond to grammar rules. Each parsing
function may call other parsing functions recursively to handle
subexpressions.
- *Bottom-Up Parsing:* In bottom-up parsing, techniques like
LR parsing involve recognizing and reducing parts of the input
by applying production rules. LR parsers often use recursive
techniques to handle the reduction steps.

2. *Abstract Syntax Trees (AST):*


- *Tree Traversal:* Recursive algorithms are commonly
employed for traversing and processing abstract syntax trees.
This is essential for performing various semantic analysis tasks
and code generation.
- *Semantic Actions:* During parsing, semantic actions are
often associated with grammar rules. Recursive traversal of the
parse tree allows the execution of these semantic actions,
facilitating the construction of AST nodes and the generation of
intermediate code.
3. *Semantic Analysis:*
- *Symbol Table Construction:* Recursive algorithms are
used to traverse the parse tree and construct the symbol table.
Symbol tables store information about identifiers, types, and
other semantic details.
- *Type Checking:* Recursive algorithms may be employed
to perform type checking by traversing the AST and ensuring that
expressions and operations are used in a type-safe manner.

4. *Code Generation:*
- *Expression Evaluation:* Recursive algorithms are utilized
for evaluating expressions during code generation. This involves
traversing the AST and generating machine code or intermediate
code for arithmetic and logical operations.
- *Function Calls:* Code generation for function calls often
involves recursion, as the compiler needs to generate code for the
called function and handle the return values.

5. *Optimization:*
- *Recursive Algorithms for Analysis:* Some optimization
techniques, such as data flow analysis or constant folding, use
recursive algorithms to analyze and transform code for improved
performance.
- *Loop Optimization:* Recursive algorithms may be
employed to analyze and optimize loops, identifying
opportunities for unrolling or other loop transformations.
6. *Error Handling:*
- *Error Recovery:* Recursive descent parsers often
incorporate recursive error recovery mechanisms. These
mechanisms involve backtracking or other recursive strategies to
handle syntax errors and resume parsing.

Recursion is a powerful and flexible technique in compiler construction, allowing for elegant and modular design. However, it's essential to manage recursion carefully to avoid issues like stack overflow, especially in languages or grammars that can lead to deep recursion. Recursive algorithms need to be designed efficiently to balance readability and performance.

• TYPE CHECKING
Type checking is the process of verifying and enforcing
constraints of types in values. A compiler must check that the
source program should follow the syntactic and semantic
conventions of the source language and it should also check the
type rules of the language. It allows the programmer to limit
what types may be used in certain circumstances and assigns
types to values. The type-checker determines whether these
values are used appropriately or not.
• It checks the type of objects and reports a type error in the
case of a violation, and incorrect types are corrected.
Whatever the compiler we use, while it is compiling the
program, it has to follow the type rules of the language.
Every language has its own set of type rules for the
language. We know that the information about data types is
maintained and computed by the compiler.
• The information about data types like INTEGER, FLOAT,
CHARACTER, and all the other data types is maintained
and computed by the compiler. The compiler contains
modules, where the type checker is a module of a compiler
and its task is type checking.

Types of Type Checking:

There are two kinds of type checking:
• Static Type Checking
• Dynamic Type Checking

Static Type Checking:
Static type checking is defined as type checking performed at
compile time. It checks the type variables at compile-time,
which means the type of the variable is known at the compile
time. It generally examines the program text during the
translation of the program. Using the type rules of a system, a
compiler can infer from the source text that a function (fun)
will be applied to an operand (a) of the right type each time the
expression fun(a) is evaluated. Examples of Static checks
include:

• Type-checks: A compiler should report an error if an operator is applied to an incompatible operand, for example, if an array variable and a function variable are added together.
• The flow of control checks: Statements that cause the flow
of control to leave a construct must have someplace to
which to transfer the flow of control. For example, a break
statement in C causes control to leave the smallest
enclosing while, for, or switch statement, an error occurs if
such an enclosing statement does not exist.
• Uniqueness checks: There are situations in which an object must be defined exactly once. For example, in Pascal an identifier must be declared uniquely, labels in a case statement must be distinct, and elements in a scalar type may not be repeated.
• Name-related checks: Sometimes the same name may
appear two or more times. For example in Ada, a loop may
have a name that appears at the beginning and end of the
construct. The compiler must check that the same name is
used at both places.

Dynamic Type Checking:

Dynamic type checking is defined as type checking performed at run time. In dynamic type checking, types are associated with values, not variables. In implementations of dynamically type-checked languages, each runtime object is generally associated with a type tag, i.e., a reference to a type containing its type information. Dynamic typing is more flexible; a static type system always restricts what can be conveniently expressed. Dynamic typing results in more compact programs, since it is more flexible and does not require types to be spelled out. Programming with a static type system often requires more design and implementation effort.

Languages like Pascal and C have static type checking. Type checking is used to check the correctness of the program before its execution. The main purpose of type checking is to verify the correctness of data type assignments and type casts before the program is executed.
Static Type-Checking is also used to determine the amount of
memory needed to store the variable.
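As an illustration of a static check performed before execution, the following minimal C sketch (an assumed example, not from the notes) walks a tiny expression and rejects it at "compile time" when the + operator is applied to an incompatible (array) operand, in the spirit of the type-checks listed above.

/* A toy static type check: '+' must have numeric operands. */
#include <stdio.h>

typedef enum { T_INT, T_FLOAT, T_ARRAY, T_ERROR } Type;

typedef struct Expr {
    char op;                      /* '+' for addition, 0 for a leaf */
    Type leaf_type;               /* type of a leaf operand         */
    struct Expr *left, *right;
} Expr;

/* Type rule for '+': both operands must be numeric; the result is
   float if either operand is float, otherwise int.                 */
static Type check(Expr *e) {
    if (e->op == 0) return e->leaf_type;
    Type l = check(e->left), r = check(e->right);
    if (l == T_ERROR || r == T_ERROR) return T_ERROR;
    if (l == T_ARRAY || r == T_ARRAY) {
        printf("type error: '+' applied to an array operand\n");
        return T_ERROR;
    }
    return (l == T_FLOAT || r == T_FLOAT) ? T_FLOAT : T_INT;
}

int main(void) {
    Expr num = {0, T_INT,   0, 0};
    Expr arr = {0, T_ARRAY, 0, 0};
    Expr bad = {'+', T_INT, &num, &arr};     /* int + array : rejected */
    if (check(&bad) == T_ERROR) printf("program rejected at compile time\n");
    return 0;
}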

The design of the type-checker depends on:
• the syntactic structure of language constructs,
• the expressions of the language, and
• the rules for assigning types to constructs (semantic rules).

• TYPE SYSTEM
A type system in compiler construction is a set of rules that
govern the usage of types in a programming language. It
enforces constraints on how different types of data can be
manipulated, combined, and assigned within a program. The
primary goals of a type system include catching errors at
compile-time, enhancing program reliability, and facilitating
optimizations. Here are key components and aspects of a type
system:

1. *Type Checking:*
- *Static Type Checking:* Performed at compile-time, ensuring
type correctness before the program runs. Common in
languages like Java or C++.
- *Dynamic Type Checking:* Performed at runtime, allowing more flexibility but may lead to runtime errors. Common in languages like Python or JavaScript.

2. *Data Types:*
- Define basic data types (integers, floating-point numbers,
characters, etc.) and user-defined types (structures, classes,
enums).
- Specify the size, representation, and behavior of each type.

3. *Type Inference:*
- Automatically deducing types without explicit declarations.
Helps reduce redundancy and enhance code readability.
- Common in languages like ML, Haskell, or Rust.

4. *Type Compatibility:*
- Define rules for how different types can be used together in
expressions or assignments.
- Implicit conversions, coercion, or casting may be allowed
based on the language.

5. *Polymorphism:*
- *Parametric Polymorphism:* Enables writing generic code that works with different types.
- *Ad-hoc Polymorphism (Overloading):* Allows using the
same function or operator with different types.

6. *Type Hierarchies:*
- Define relationships between types, such as inheritance or
interfaces in object-oriented languages.
- Hierarchies may include base types and derived types.

7. *Type Safety:*
- Prevents operations that could result in runtime errors or
unexpected behaviors.
- Enforces rules to catch type-related errors during compilation.

8. *Type Annotations:*
- Allow programmers to provide explicit type information for
variables, parameters, and functions.
- Facilitates understanding and improves tool support.

9. *Structural vs. Nominal Typing:*
- *Structural Typing:* Types are defined by their structure (e.g., the shape of data).
- *Nominal Typing:* Types are identified by their names or
explicit declarations.

10. *Type Declarations and Definitions:*
- Specify how to declare and define types in the source code.
- Define type aliases, structs, classes, or other constructs based on language features.


The type system is a critical component of a programming language's design and influences both how developers express their intentions in code and how compilers analyze and optimize that code. A well-designed type system enhances code quality, readability, and maintainability while contributing to the overall robustness of software.
• EQUIVALENCE OF TYPE EXPRESSION
In compiler construction, equivalence of type expressions refers
to determining whether two type expressions are equivalent or
identical. This is a fundamental concept in type systems, and it
involves comparing the structures and properties of types to
ensure compatibility in a programming language. The
equivalence of types is crucial for type checking, especially
when dealing with assignments, function calls, or other
operations that involve interacting types.

There are generally two types of equivalence for type expressions:

1. *Structural Equivalence:*
- *Definition:* Two types are structurally equivalent if their
internal structures match.
- *Example:* Consider two record types:
type Person1 = { name: String, age: Int }
type Person2 = { name: String, age: Int }

Person1 and Person2 are structurally equivalent because their structures match.
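A structural-equivalence test is naturally implemented as a recursive comparison of the two type expressions. The following minimal C sketch (an assumed representation, not from the notes) models basic types and two-field record types and reports the Person1/Person2 pair above as equivalent because their structures match.

/* Recursive structural equivalence of simple type expressions. */
#include <stdio.h>
#include <string.h>

typedef enum { BASIC, POINTER, RECORD2 } Kind;  /* RECORD2: record with two fields */

typedef struct Type {
    Kind kind;
    const char *basic_name;            /* "String", "Int", ... for BASIC  */
    struct Type *a, *b;                /* components for POINTER/RECORD2  */
} Type;

static int equiv(const Type *s, const Type *t) {
    if (s->kind != t->kind) return 0;
    switch (s->kind) {
    case BASIC:   return strcmp(s->basic_name, t->basic_name) == 0;
    case POINTER: return equiv(s->a, t->a);
    case RECORD2: return equiv(s->a, t->a) && equiv(s->b, t->b);
    }
    return 0;
}

int main(void) {
    Type str1 = {BASIC, "String", 0, 0}, int1 = {BASIC, "Int", 0, 0};
    Type str2 = {BASIC, "String", 0, 0}, int2 = {BASIC, "Int", 0, 0};
    Type person1 = {RECORD2, 0, &str1, &int1};   /* { name: String, age: Int } */
    Type person2 = {RECORD2, 0, &str2, &int2};   /* same structure, other name */
    printf("structurally equivalent: %s\n", equiv(&person1, &person2) ? "yes" : "no");
    return 0;
}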
2. *Name Equivalence:*
- *Definition:* Two types are name-equivalent if they have
the same name or label, regardless of their internal structures.
- *Example:* Consider two type aliases:
type Name1 = String
type Name2 = String

Name1 and Name2 are name-equivalent because they share the same name.

The choice between structural and name equivalence depends on the design and requirements of the programming language. Some languages, like C and C++, use name equivalence, while others, like Haskell or ML, use structural equivalence.

In addition, there is a concept of *Type Equivalence Classes*:
- *Definition:* Types that are equivalent belong to the same equivalence class.
- *Example:* In a language with structural equivalence, the
classes might include all equivalent structures, even if they have
different names.
When implementing type equivalence in a compiler, it's crucial
to consider the rules specified by the language's type system.
This involves recursively comparing the components of complex
types, handling generic types, and accounting for various
language-specific features.

Ensuring correct type equivalence is essential for the proper functioning of a compiler's type checker, allowing it to catch potential type-related errors and ensure that the program adheres to the language's type rules.

• TYPE CONVERSION
Type conversion: In type conversion, a data type is automatically converted into another data type by the compiler at compile time. In type conversion, the destination data type cannot be smaller than the source data type, which is why it is also called widening conversion. One more important point is that it can only be applied to compatible data types.

Type conversion example –

int x = 30;
float y;
y = x;   // y == 30.000000
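A small runnable variant of this example is shown below (an illustration, not from the notes); the last assignment also shows that the opposite, narrowing direction is not done automatically and needs an explicit cast.

/* Implicit widening vs. explicit narrowing. */
#include <stdio.h>

int main(void) {
    int   x = 30;
    float y = x;            /* implicit widening: int -> float  */
    float f = 3.75f;
    int   n = (int)f;       /* narrowing needs an explicit cast */
    printf("y = %f, n = %d\n", y, n);   /* y = 30.000000, n = 3 */
    return 0;
}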
• INTERMEDIATE CODE GENERATION AND OPTIMIZATION
In the analysis-synthesis model of a compiler, the front end of a
compiler translates a source program into an independent
intermediate code, then the back end of the compiler uses this
intermediate code to generate the target code (which can be
understood by the machine). The benefits of using machine-independent intermediate code are:

• Because of the machine-independent intermediate code, portability is enhanced. For example, if a compiler translates the source language directly to its target machine language without the option of generating intermediate code, then a full native compiler is required for each new machine, because the compiler itself has to be modified according to the machine specifications.
• Retargeting is facilitated.
• It is easier to apply source code modification to improve
the performance of source code by optimizing the
intermediate code.
If we generate machine code directly from source code, then for n target machines we will need n optimizers and n code generators, but if we have a machine-independent intermediate code, we will need only one optimizer. Intermediate code can be either language-specific (e.g., bytecode for Java) or language-independent (three-address code). The following are commonly used intermediate code representations:

Postfix Notation: Also known as reverse Polish notation or suffix notation. The ordinary (infix) way of writing the sum of a and b is with the operator in the middle: a + b. The postfix notation for the same expression places the operator at the right end, as ab+. In general, if e1 and e2 are any postfix expressions and + is any binary operator, the result of applying + to the values denoted by e1 and e2 is written in postfix notation as e1 e2 +. No parentheses are needed in postfix notation because the position and arity (number of arguments) of the operators permit only one way to decode a postfix expression. In postfix notation, the operator follows the operands.
Example 1: The postfix representation of the expression (a + b) * c is: ab + c *
Example 2: The postfix representation of the expression (a – b) * (c + d) + (a – b) is: ab – cd + * ab – +
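Because the operator always follows its operands, postfix expressions can be evaluated with a single stack and no parentheses. The following minimal C sketch (an assumed example, not from the notes) evaluates the postfix string 752-*4+, i.e. the infix expression 7 * (5 - 2) + 4.

/* Stack-based evaluation of a single-digit postfix expression. */
#include <stdio.h>
#include <ctype.h>

int main(void) {
    const char *postfix = "752-*4+";   /* infix: 7 * (5 - 2) + 4 = 25 */
    int stack[64], top = -1;

    for (const char *p = postfix; *p; p++) {
        if (isdigit((unsigned char)*p)) {
            stack[++top] = *p - '0';          /* push single-digit operand */
        } else {
            int b = stack[top--], a = stack[top--];   /* pop two operands  */
            switch (*p) {
            case '+': stack[++top] = a + b; break;
            case '-': stack[++top] = a - b; break;
            case '*': stack[++top] = a * b; break;
            case '/': stack[++top] = a / b; break;
            }
        }
    }
    printf("result = %d\n", stack[top]);      /* prints 25 */
    return 0;
}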

Three-Address Code: A statement involving no more than three references (two for operands and one for the result) is known as a three-address statement. A sequence of three-address statements is known as three-address code. A three-address statement is of the form x = y op z, where x, y, and z have addresses (memory locations). Sometimes a statement might contain fewer than three references, but it is still called a three-address statement.
Example: The three-address code for the expression a + b * c + d is:
T1 = b * c
T2 = a + T1
T3 = T2 + d
where T1, T2, and T3 are temporary variables.
There are three ways to represent three-address code in compiler design:
i) Quadruples
ii) Triples
iii) Indirect Triples
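Three-address code can be emitted directly while scanning a postfix form of the expression, using a stack of operand names and generating a new temporary for each operator. The following minimal C sketch (an assumed example, not from the notes) prints exactly the sequence T1 = b * c, T2 = a + T1, T3 = T2 + d for the running example a + b * c + d (postfix: abc*+d+).

/* Emit three-address code from a postfix expression. */
#include <stdio.h>

int main(void) {
    const char *postfix = "abc*+d+";
    char stack[64][8];              /* operand names: variables or temporaries */
    int top = -1, temps = 0;

    for (const char *p = postfix; *p; p++) {
        if (*p >= 'a' && *p <= 'z') {
            sprintf(stack[++top], "%c", *p);          /* push operand name */
        } else {
            char rhs2[8], rhs1[8];
            sprintf(rhs2, "%s", stack[top--]);        /* right operand */
            sprintf(rhs1, "%s", stack[top--]);        /* left operand  */
            temps++;
            printf("T%d = %s %c %s\n", temps, rhs1, *p, rhs2);
            sprintf(stack[++top], "T%d", temps);      /* push the new temporary */
        }
    }
    return 0;
}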
Syntax Tree: A syntax tree is nothing more than a condensed form of a parse tree. The operator and keyword nodes of the parse tree are moved to their parents, and a chain of single productions is replaced by a single link. In the syntax tree, the internal nodes are operators and the child nodes are operands. To form a syntax tree, put parentheses in the expression; this way it is easy to recognize which operand should come first.

Example: x = (a + b * c) / (a – b * c)

OPTIMIZATION
Code optimization in the synthesis phase is a program transformation technique which tries to improve the intermediate code by making it consume fewer resources (CPU time and memory) so that faster-running machine code will result. The compiler optimization process should meet the following objectives:

• The optimization must be correct; it must not, in any way, change the meaning of the program.
• Optimization should increase the speed and performance of the program.
• The compilation time must be kept reasonable.
• The optimization process should not delay the overall compiling process.
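As a small illustration of one such transformation (an assumed example, not from the notes), the sketch below performs constant folding: a three-address instruction whose operands are both known constants is evaluated at compile time, so the emitted code no longer performs the operation at run time.

/* Constant folding of a single three-address instruction. */
#include <stdio.h>

typedef struct {
    char result[4];             /* name of the result temporary        */
    int  lhs, rhs;              /* operand values                      */
    int  lhs_const, rhs_const;  /* are the operands known constants?   */
    char op;
} Instr;

int main(void) {
    Instr t1 = { "T1", 4, 5, 1, 1, '*' };   /* T1 = 4 * 5 */

    if (t1.lhs_const && t1.rhs_const) {     /* fold at compile time */
        int folded = (t1.op == '*') ? t1.lhs * t1.rhs : t1.lhs + t1.rhs;
        printf("%s = %d\n", t1.result, folded);          /* emits: T1 = 20 */
    } else {
        printf("%s = %d %c %d\n", t1.result, t1.lhs, t1.op, t1.rhs);
    }
    return 0;
}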

END
