
University of Abuja

Faculty of Science
Department of Computer Science

STUDENT COURSE MANUAL


COURSE TITLE: COMPILER CONSTRUCTION 1
COURSE CODE: CSC311
CREDIT UNITS: 3
TARGET AUDIENCE: 300-level Students
Lecturer: Dr. Isaac Abiodun
Phone: 07065589574
Email: [email protected]
Google Scholar: https://scholar.google.com/citations?hl=en&user=9H6daroAAAAJ&view_op=list_works&sortby=pubdate
Method of Student Assessment
(i) Attendance, Assignments, Classwork (e.g., quizzes, exercises) and Tests = 30%
(ii) Examination = 70%

Lecture Materials
RECOMMENDED TEXTBOOKS:

(1) Introduction to Compilers and Language Design, 2nd edition, 2020 (revision date: January 15, 2021), by Douglas Thain. http://compilerbook.org. Paperback ISBN: 979-8-655-18026-0 (paperback available via Amazon).

(2) Introduction to Compilers and Language Design, 1st edition, 2019, by Douglas Thain. http://compilerbook.org. Hardcover ISBN: 978-0-359-13804-3; Paperback ISBN: 978-0-359-14283-5 (hardcover and paperback available via Lulu; paperback also via Amazon).

(3) Compiler Design and Construction, 2nd edition, by Arthur Pyster.

(4) Compiler Design and Construction (Electrical/Computer Science and Engineering Series), hardcover, January 1, 1980, by Arthur B. Pyster.

(5) Compiler Construction, by W. M. Waite.

FREE TEXTBOOKS ON THE INTERNET
(1) A Compact Guide to Lex & Yacc
Post date: 24 Oct 2004
Explains how to construct a compiler using lex and yacc, the tools used to generate
lexical analyzers and parsers.
(2) Basics of Compiler Design
Post date: 19 Apr 2007
Conveys the general picture of compiler design without going into extreme detail.
Gives the students an understanding of how compilers work and the ability to make
simple (but not simplistic) compilers for simple languages.
Publisher: University of Copenhagen
Publication date: 01 Jan 2010

(3) Compiler Construction


Post date: 07 Jan 2016
This text demonstrates how a compiler is built and describes the necessary tools
and how to create and use them.
Publication date: 22 Feb 1996
(4) Compiler Design: Theory, Tools, and Examples (C/C++ Edition)
Post date: 01 Dec 2016
This textbook is a revision of an earlier edition that was written for a Pascal based
curriculum. It is not intended to be strictly an object-oriented approach to compiler
design.
Publication date: 31 Dec 2010
Document Type: Textbook
(5) Compilers and Compiler Generators: an introduction with C++
Post date: 24 Oct 2004
Combines theory, practical applications and the use of compiler writing tools to
give students a solid introduction to the subject of programming language
translation.
Publisher: International Thomson Computer Press
Publication date: 01 Mar 1997

(6) Let's Build a Compiler
Post date: 24 Oct 2004
A fifteen-part tutorial series, written from 1988 to 1995, on the theory and practice
of developing language parsers and compilers from scratch.
Publication date: 31 Dec 1988
(7) Languages And Machines
Post date: 19 Oct 2006
Provides a view of the concept of a language as a system of strings of characters
obeying certain rules. Topics covered include logic, meta languages, proofs, finite
state machines, Turing machines, encryption and coding.
Publication date: 01 Jul 2005
(8) Compiler Construction
Post date: 17 Sep 2006
A concise, practical guide to modern compiler design and construction by the
author of Pascal and Oberon. Readers are taken step-by-step through each stage of
compiler design, using the simple yet powerful method of recursive descent to
create a compiler.
Publisher: Addison-Wesley Pub Co
Publication date: 01 May 2017
Document Type: Book

Reference
https://www.amazon.com/Compiler-Design-Construction-Arthur-Pyster/dp/0442275366
https://www.amazon.com/Compiler-construction-Electrical-computer-engineering/dp/0442243944
https://www.amazon.com/Introduction-Compilers-Language-Design-Second/dp/B08BFWKRJH
https://www3.nd.edu/~dthain/compilerbook/
https://www.oreilly.com/library/view/compiler-construction/9789332524590/

COURSE CONTENT

Introduction to Compiling: Overview of the compilation process, Source code and Target code (Analysis of the source programme), Translators (language processors), Advantages and Disadvantages, Cousins of the compiler and Types of compiler.

The Phases of a Compiler: Compiler structure, the grouping of phases, Compiler-construction tools.

A Simple One-Pass Compiler: Anatomy of a compiler, Syntax definition, Syntax-directed translation, Parsing, A translator for simple expressions, Lexical analysis, incorporating a symbol table, Abstract stack machines, putting the techniques together.

Lexical Analysis: The role of the lexical analyzer, Input buffering, Specification of tokens, Recognition of tokens, A language for specifying lexical analyzers, Finite automata, From a regular expression to an NFA, Design of a lexical analyzer generator, Optimization of DFA-based pattern matchers.

Syntax Analysis: The role of the parser, Context-free grammars, Writing a grammar, Top-down parsing, Bottom-up parsing, Operator-precedence parsing, LR parsers, Using ambiguous grammars, Parser generators.

Syntax-Directed Translation: Syntax-directed definitions, Construction of syntax trees, Bottom-up evaluation of S-attributed definitions, L-attributed definitions, Top-down translation, Bottom-up evaluation of inherited attributes, Recursive evaluators, Space for attribute values at compile time, Assigning space at compile time, Analysis of syntax-directed definitions.

Type Checking: Type systems, Specification of a simple type checker, Equivalence of type expressions, Type conversions, Overloading of functions and operators, Polymorphic functions, An algorithm for unification.

Run-Time Environments: Source language issues, Storage organization, Storage-allocation strategies, Access to nonlocal names, Parameter passing, Symbol tables, Language facilities for dynamic storage allocation, Dynamic storage allocation techniques, Storage allocation in Fortran.

Intermediate Code Generation: Intermediate languages, Declarations, Assignment statements, Boolean expressions, Case statements, Back patching, Procedure calls.

Code Generation: Issues in the design of a code generator, The target machine, Run-time storage management, Basic blocks and flow graphs, Next-use information, A simple code generator, Register allocation and assignment, The DAG representation of basic blocks, Peephole optimization, Generating code from DAGs, Dynamic programming code-generation algorithm, Code-generator generators.

Code Optimization: Introduction, The principal sources of optimization, Optimization of basic blocks, Loops in flow graphs, Introduction to global data-flow analysis, Iterative solution of data-flow equations, Code-improving transformations, Dealing with aliases, Data-flow analysis of structured flow graphs, Efficient data-flow algorithms, A tool for data-flow analysis, Estimation of types, Symbolic debugging of optimized code.

Advanced topics include garbage collection; dynamic data structures, pointer analysis, aliasing; code scheduling, pipelining; dependence testing; loop-level optimisation; superscalar optimisation; profile-driven optimisation; debugging support; incremental parsing; type inference; advanced parsing algorithms; practical attribute evaluation; function in-lining and partial evaluation.

CSC311: Compiler Construction 1 (3 Credit Units)

Lecture Note:
Weekly Synopsis
Week 1: Introduction to Compiling: Overview of the compilation process, Source
code and Target code (Analysis of the source programme), Translators
(language processors), Advantages and Disadvantages, Cousins of the
compiler and Types of compilers. The Phases of a Compiler: Compiler
structure, the grouping of phases, Compiler-construction tools. A Simple
One-Pass Compiler: Anatomy of a compiler.

Week 2: Syntax definition, Syntax-directed translation, Parsing, A translator for
simple expressions, Lexical analysis, incorporating a symbol table,
Abstract stack machines,

Weeks 3 & 4: putting the techniques together. Lexical Analysis: The role of the lexical
analyzer, Input buffering, Specification of tokens, Recognition of tokens,
A language for specifying lexical analyzers, Finite automata, From a
regular expression to an NFA,

Weeks 5 & 6: Design of a lexical analyzer generator, Optimization of DFA-based pattern


matchers.Syntax Analysis: The role of the parser, Context-free grammars,
Writing a grammar, Top-down parsing, Bottom up parsing, Operator-
precedence parsing, LR parsers, Using ambiguous grammars, Parser
generators.Syntax-Directed Translation: Syntax-directed definitions

Week 7: Construction of syntax trees, Bottom-up evaluation of S-attributed
definitions, L-attributed definitions, Top-down translation, Bottom-up
evaluation of inherited attributes, Recursive evaluators, Space for attribute
values at compile time, Assigning space at compile time,

Week 8: Test

Weeks 9 & 10: Analysis of syntax-directed definitions. Type Checking:Type systems,


Specification of a simple type checker, Equivalence of type expressions,
Type conversions, Overloading of functions and operators, Polymorphic
functions, An algorithm for unification. Run-Time Environments:Source
language issues, Storage organization, Storage-allocation strategies.

Week 11: Access to nonlocal names, parameter passing, Symbol tables, Language
facilities for dynamic storage allocation, Dynamic storage allocation
techniques, Storage allocation in Fortran. Intermediate Code Generation:
Intermediate languages, Declarations, Assignment statements, Boolean
expressions, Case statements, Back Patching, Procedure calls. Code
generation: Issues in the design of a code generator, The target machine,
Run-time storage management, Basic blocks and flow graphs, Next-use
information.

Week 12: A Simple code generator, Register allocation and assignment, The dag
representation of basic blocks, Peephole optimization, Generating code
from dags, Dynamic programming code-generation algorithm, Code-
generator generators.Code Optimization: Introduction, The Principal
sources of optimization, Optimization of basic blocks, Loops in flow
graphs, Introduction to global data-flow analysis, Iterative solution of
data-flow equations, Code improving transformations, Dealing with
aliases, Data-flow analysis of structured flow graphs, Efficient data-flow
algorithms, A tool for data-flow analysis, Estimation of types,

Week 13: Symbolic debugging of optimized code. Advanced topics include garbage
collection; dynamic data structures, pointer analysis, aliasing; code
scheduling, pipelining; dependence testing; loop level optimisation;
superscalar optimisation; profile-driven optimisation; debugging support;
incremental parsing; type inference; advanced parsing algorithms;
practical attribute evaluation; function in-lining and partial evaluation.

Week 14: Review of Weeks 1 – 13

Week 15: Test

Week 16: Examination

STUDENT COURSE MANUAL

COMPILER CONSTRUCTION 1 (3 UNITS)

Description: This course deals with the theory and practice of compiler design. Topics
emphasized are scanning, parsing and semantic analysis.
A compiler translates source code into object code without tampering with the meaning of the
source code. The translation involves six phases, namely: lexical analysis, syntax analysis,
semantic analysis, intermediate code generation, code optimization and code generation. Each of
these phases performs a single task.
In computing, a compiler is a computer program that transforms source code written in a
programming language or computer language into another computer language. The most
common reason for transforming source code is to create an executable program.
Compiler construction is a complex task. A good compiler combines ideas from formal language
theory, from the study of algorithms, from artificial intelligence, from systems design, from
computer architecture, and from the theory of programming languages and applies them to the
problem of translating a program.

Weekly Synopsis
Compiler Design and Construction (CDC)

Week 1: Introduction to Compiler Design and Construction

What is a compiler?

A compiler is software that converts a program written in a high-level language (the source
language) into a low-level language (the object/target/machine language, i.e., 0s and 1s).

A translator or language processor is a program that translates an input program written
in a programming language into an equivalent program in another language. The
compiler is a type of translator, which takes a program written in a high-level
programming language as input and translates it into an equivalent program in low-level
languages such as machine language or assembly language.
The program written in a high-level language is known as a source program, and the
program converted into a low-level language is known as an object (or target) program.
Without compilation, no program written in a high-level language can be executed. For
every programming language, we have a different compiler; however, the basic tasks
performed by every compiler are the same. The process of translating the source code
into machine code involves several stages, including lexical analysis, syntax analysis,
semantic analysis, code generation, and optimization.

What is Compiler Design and Construction?


Compiler design principles provide an in-depth view of the translation and optimization process.
Compiler design covers the basic translation mechanism and error detection and recovery. It includes
lexical, syntax, and semantic analysis as the front end, and code generation and optimization as the
back end.

Compiler Design
Computers are a balanced mix of software and hardware. Hardware is just a piece of machinery
whose functions are controlled by compatible software. Hardware understands instructions in the
form of electronic charge, which is the counterpart of binary language in software programming.
Binary language has only two symbols, 0 and 1. To instruct the hardware, code must be written in
binary format, which is simply a series of 1s and 0s. It would be a difficult and cumbersome task
for computer programmers to write such code, which is why we have compilers to generate it.

Phases of a Compiler
There are two major phases of compilation, which in turn have many parts. Each of them takes
input from the output of the previous level and works in a coordinated way.

Phases of a compiler are the processes that source code goes through before being converted to
object code by a compiler. Each phase fulfills a specific, particular function. Throughout the phases,
the compiler maintains a data structure called the symbol table, and an error handler is provided to
keep track of errors. A compiler has six phases, which can be divided into the following two (2)
groups: analysis and synthesis.

That is, a compiler can broadly be divided into two phases based on the way it compiles.
1. Analysis Phase
Known as the front end of the compiler, the analysis phase reads the source program, divides it
into core parts and then checks for lexical, grammar and syntax errors. The analysis phase
generates an intermediate representation of the source program and the symbol table, which are
fed to the synthesis phase as input.

Page 9 of 139
 Analysis: The source code is divided into meaningful units and an intermediate
representation is created; the output of this analysis is later used to produce the desired
machine-oriented code. This part is broken into three (3) further sections as follows: (a) Lexical
Analysis, (b) Syntax Analysis, and (c) Semantic Analysis.
An intermediate representation is created from the given source code:
 Lexical Analyzer
 Syntax Analyzer
 Semantic Analyzer

The lexical analyzer divides the program into “tokens”, the Syntax analyzer recognizes
“sentences” in the program using the syntax of the language and the Semantic analyzer checks
the static semantics of each construct. Intermediate Code Generator generates “abstract” code.
2. Synthesis Phase
Known as the back-end of the compiler, the synthesis phase generates the target program with
the help of intermediate source code representation and symbol table.
A compiler can have many phases and passes.
 Pass: A pass refers to the traversal of a compiler through the entire program.
 Phase: A phase of a compiler is a distinguishable stage, which takes input from
the previous stage, processes and yields output that can be used as input for the
next stage. A pass can have more than one phase.

Synthesis: This part is subdivided into three (3) sections: (a) intermediate code generation, (b) code
optimization, and (c) code generation.

A compiler translates source code into object code without tampering with the meaning of the source
code. The steps involved in translating a language are six, namely: lexical analysis, syntax analysis,
semantic analysis, intermediate code generation, code optimization and code generation. Each of
these phases performs a single task. The compilation process contains a sequence of phases. Each
phase takes the source program in one representation and produces output in another representation,
and each phase takes its input from the previous stage.

An equivalent target program is created from the intermediate representation. It has two parts:
 Code Optimizer
 Code Generator
Code Optimizer optimizes the abstract code, and the final Code Generator translates abstract
intermediate code into specific machine instructions.

The six phases of a compiler can be described as follows:

Figure: Phases of Compiler

1. Analysis Phase

(i) Lexical Analysis:

Lexical analysis, also known as scanning, is the first phase of the compilation process. It takes
source code as input, reads the source program one character at a time, and groups the characters
into meaningful lexemes. The lexical analyzer represents these lexemes in the form of tokens, such
as keywords, identifiers, and operators. These tokens are then passed on to the next stage of the
compilation process.
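As an illustration, the following is a minimal scanner sketch in C (a hypothetical example: the token names and the next_token helper are inventions for this sketch, not part of any particular compiler). It groups the characters of a simple statement into identifier, number and operator tokens.

#include <ctype.h>
#include <stdio.h>

/* Hypothetical token kinds for a tiny expression language. */
typedef enum { TOK_NUMBER, TOK_IDENT, TOK_OP, TOK_END } TokenKind;

typedef struct {
    TokenKind kind;
    char text[32];                                /* the lexeme itself */
} Token;

/* Read one token starting at *p and advance *p past it. */
static Token next_token(const char **p) {
    Token t = { TOK_END, "" };
    while (isspace((unsigned char)**p)) (*p)++;   /* skip whitespace */
    if (**p == '\0') return t;

    size_t n = 0;
    if (isdigit((unsigned char)**p)) {            /* number: one or more digits */
        t.kind = TOK_NUMBER;
        while (isdigit((unsigned char)**p) && n < sizeof t.text - 1)
            t.text[n++] = *(*p)++;
    } else if (isalpha((unsigned char)**p)) {     /* identifier: letters then letters/digits */
        t.kind = TOK_IDENT;
        while (isalnum((unsigned char)**p) && n < sizeof t.text - 1)
            t.text[n++] = *(*p)++;
    } else {                                      /* any other single character is an operator */
        t.kind = TOK_OP;
        t.text[n++] = *(*p)++;
    }
    t.text[n] = '\0';
    return t;
}

int main(void) {
    const char *src = "position = initial + rate * 60";
    const char *names[] = { "NUMBER", "IDENT", "OP", "END" };
    for (Token t = next_token(&src); t.kind != TOK_END; t = next_token(&src))
        printf("<%s, \"%s\">\n", names[t.kind], t.text);
    return 0;
}

Run on the sample line, the sketch prints one <kind, lexeme> pair per token, which is exactly the stream handed to the parser in the next phase.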

(ii) Syntax Analysis

Syntax analysis, also known as parsing, is the second phase of the compilation process. It takes
tokens as input and generates a parse tree as output. In the syntax analysis phase, the parser checks
whether the expression formed by the tokens is syntactically correct. That is, in this stage, the
compiler checks the syntax of the source code to ensure that it conforms to the rules of the
programming language. The compiler builds a parse tree, which is a hierarchical representation of
the program's structure, and uses it to check for syntax errors.
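To make this concrete, below is a minimal recursive-descent parser sketch in C for a small, hypothetical expression grammar (the grammar and the function names are chosen only for this illustration). It accepts syntactically correct arithmetic expressions over numbers, +, -, *, / and parentheses, and reports a syntax error otherwise.

#include <ctype.h>
#include <stdio.h>
#include <stdlib.h>

/* Grammar checked by this sketch:
 *   expr   -> term   { ('+' | '-') term   }
 *   term   -> factor { ('*' | '/') factor }
 *   factor -> NUMBER | '(' expr ')'
 */
static const char *p;                  /* current position in the input */

static void expr(void);

static void error(const char *msg) {
    fprintf(stderr, "syntax error: %s at '%s'\n", msg, p);
    exit(1);
}

static void skip_spaces(void) { while (isspace((unsigned char)*p)) p++; }

static void factor(void) {
    skip_spaces();
    if (isdigit((unsigned char)*p)) {
        while (isdigit((unsigned char)*p)) p++;   /* NUMBER */
    } else if (*p == '(') {
        p++; expr(); skip_spaces();
        if (*p != ')') error("expected ')'");
        p++;
    } else {
        error("expected a number or '('");
    }
}

static void term(void) {
    factor(); skip_spaces();
    while (*p == '*' || *p == '/') { p++; factor(); skip_spaces(); }
}

static void expr(void) {
    term(); skip_spaces();
    while (*p == '+' || *p == '-') { p++; term(); skip_spaces(); }
}

int main(void) {
    p = "2 * (3 + 41)";
    expr();
    skip_spaces();
    if (*p != '\0') error("unexpected trailing input");
    puts("input is syntactically correct");
    return 0;
}

Each grammar rule becomes one function; a full parser would build and return parse-tree nodes from these functions instead of merely accepting or rejecting the input.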

(iii) Semantic Analysis

Semantic analysis is the third phase of the compilation process. It checks whether the parse tree
follows the rules of the language. Likewise, in this phase, the compiler checks the meaning of the
source code to ensure that it makes sense. The semantic analyzer keeps track of identifiers, their
types and expressions. The compiler performs type checking, which ensures that variables are
used correctly and that operations are performed on compatible data types. The compiler also
checks for other semantic errors, such as undeclared variables and incorrect function calls. The
output of the semantic analysis phase is the annotated syntax tree. Examples of semantic errors
are data-type incompatibility, use of undeclared variables, and many more.
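The following is a small type-checking sketch in C (a toy type system with only two types, invented for this illustration; it is not the type system of any particular language). It shows the kind of compatibility check a semantic analyzer performs on a binary operation.

#include <stdio.h>

/* A toy type system with just two types. */
typedef enum { TY_INT, TY_STRING, TY_ERROR } Type;

/* Type of "left op right"; this sketch assumes the operator is only
 * defined when both operands have the same type. */
static Type check_binary(char op, Type left, Type right) {
    if (left == TY_ERROR || right == TY_ERROR) return TY_ERROR;
    if (left != right) {
        fprintf(stderr, "type error: operands of '%c' have incompatible types\n", op);
        return TY_ERROR;
    }
    return left;
}

int main(void) {
    /* rate * 60  -> both INT, well typed */
    Type ok  = check_binary('*', TY_INT, TY_INT);
    /* name + 60  -> STRING and INT, rejected by the checker */
    Type bad = check_binary('+', TY_STRING, TY_INT);
    printf("first expression:  %s\n", ok  == TY_ERROR ? "rejected" : "accepted");
    printf("second expression: %s\n", bad == TY_ERROR ? "rejected" : "accepted");
    return 0;
}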

(iv) Intermediate Code Generation

The fourth phase of compiler design is intermediate code generation. In this phase, the compiler
translates the parse tree into an intermediate representation of the program. Intermediate code is
generated between the high-level language and the machine language: it is lower level than the
source program but not yet tied to a particular machine. The intermediate code should be generated
in such a way that one can easily translate it into the target machine code, and the code ultimately
produced from it must be efficient and optimized for the target platform.
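As a sketch of what this phase produces, the short C program below (a hypothetical illustration; the Node structure and the emit function are not taken from any particular compiler) walks a hard-coded expression tree for initial + rate * 60 and emits three-address intermediate code, one operator per instruction.

#include <stdio.h>

/* A minimal expression-tree node: either a leaf holding a name,
 * or an interior node holding an operator and two children. */
typedef struct Node {
    char op;                       /* 0 for a leaf */
    const char *name;              /* leaf name, e.g. "rate" */
    struct Node *left, *right;
} Node;

static int temp_count = 0;

/* Emit three-address code for the subtree and return the name of the
 * temporary (or variable) that holds its value. */
static const char *emit(const Node *n) {
    static char temps[16][8];
    if (n->op == 0)
        return n->name;                        /* leaf: just its own name */
    const char *l = emit(n->left);
    const char *r = emit(n->right);
    char *t = temps[temp_count];
    sprintf(t, "t%d", temp_count + 1);         /* fresh temporary t1, t2, ... */
    temp_count++;
    printf("%s = %s %c %s\n", t, l, n->op, r);
    return t;
}

int main(void) {
    /* Hard-coded tree for:  initial + rate * 60 */
    Node sixty   = { 0, "60",      NULL, NULL };
    Node rate    = { 0, "rate",    NULL, NULL };
    Node initial = { 0, "initial", NULL, NULL };
    Node mul     = { '*', NULL, &rate,    &sixty };
    Node add     = { '+', NULL, &initial, &mul   };

    printf("position = %s\n", emit(&add));
    return 0;
}

Running the sketch prints t1 = rate * 60, t2 = initial + t1 and position = t2, a typical three-address form of the assignment.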

2. Synthesis Phase

(v) Code Optimization

The intermediate code generated in the previous stage is optimized in this phase. The structure of
the tree that is generated by the parser can be rearranged to suit the needs of the machine
architecture and to produce object code that runs faster. The optimization is achieved by removing
unnecessary lines of code. Code optimization is an optional phase. It is used to improve the
intermediate code so that the output program runs faster and takes less space. It removes
unnecessary lines of code and arranges the sequence of statements in order to speed up program
execution. In this stage, the compiler analyzes the generated code and makes optimizations to
improve its performance. The compiler may perform optimizations such as constant folding, loop
unrolling, and function inlining.
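As a small illustration of constant folding and dead-code elimination (the function names below are invented for this sketch), compare the two equivalent routines; an optimizing compiler would typically transform code like the first into something like the second.

#include <stdio.h>

/* Before optimization: 60 * 60 is a constant expression, and the
 * variable `unused` is dead code because its value is never read. */
static int seconds_per_hour_unoptimized(void) {
    int unused = 123;              /* dead code */
    int factor = 60 * 60;          /* constant expression */
    return factor;
}

/* After optimization: the constant has been folded to 3600 and the
 * dead assignment has been removed. */
static int seconds_per_hour_optimized(void) {
    return 3600;
}

int main(void) {
    printf("%d %d\n", seconds_per_hour_unoptimized(), seconds_per_hour_optimized());
    return 0;
}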

(vi) Code Generation

Code generation is the final phase of the compilation process. It takes the optimized intermediate
code as input and maps it to the target machine language. The code generator translates the
intermediate code into the machine code of the specified computer.
Example:
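A typical walk-through (using the classic statement position = initial + rate * 60 as a hypothetical example; the exact details depend on the language and the target machine) is as follows:
 Lexical analysis: the statement becomes the token stream <id, position> <=> <id, initial> <+> <id, rate> <*> <60>, and position, initial and rate are entered into the symbol table.
 Syntax analysis: the tokens are arranged into a parse tree in which * is a child of + and = is the root.
 Semantic analysis: the types of position, initial and rate are checked; if they are floating point, the integer 60 may be converted (for example to 60.0).
 Intermediate code generation: three-address code such as t1 = rate * 60, t2 = initial + t1, position = t2.
 Code optimization: redundant temporaries and constant sub-expressions are removed where possible.
 Code generation: the remaining intermediate instructions are mapped to target machine instructions (loads, a multiply, an add and a store).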

Operations of Compiler
These are some operations that are done by the compiler.
 It breaks source programs into smaller parts.
 It enables the creation of symbol tables and intermediate representations.
 It helps in code compilation and error detection.
 It saves all codes and variables.
 It analyses the full program and translates it.
 It converts source code into machine code.

Advantages of Compiler Design


1. Efficiency: Compiled programs are generally more efficient than interpreted
programs because the machine code produced by the compiler is optimized for the
specific hardware platform on which it will run.
2. Portability: Once a program is compiled, the resulting machine code can be run on
any computer or device that has the appropriate hardware and operating system,
making it highly portable.
3. Error Checking: Compilers perform comprehensive error checking during the
compilation process, which can help catch syntax, semantic, and logical errors in
the code before it is run.
4. Optimizations: Compilers can make various optimizations to the generated
machine code, such as eliminating redundant instructions or rearranging code for
better performance.

Disadvantages of Compiler Design


1. Longer Development Time: Developing a compiler is a complex and time-
consuming process that requires a deep understanding of both the programming
language and the target hardware platform.
2. Debugging Difficulties: Debugging compiled code can be more difficult than
debugging interpreted code because the generated machine code may not be easy to
read or understand.
3. Lack of Interactivity: Compiled programs are typically less interactive than
interpreted programs because they must be compiled before they can be run, which
can slow down the development and testing process.
4. Platform-Specific Code: If the compiler is designed to generate machine code for a
specific hardware platform, the resulting code may not be portable to other
platforms.

REASONS/IMPORTANCE OF A COMPILER
(i) A compiler is a translator that converts the high-level language into the machine language.
(ii) High-level language is written by a developer and machine language can be understood by
the processor.
(iii) Compiler is used to show errors to the programmer.
(iv) The main purpose of a compiler is to change the code written in one language into another
language without changing the meaning of the program.
(v) When a program written in a high-level language is executed, the execution happens in two parts.
(vi) In the first part, the source program is compiled and translated into the object program (low-
level language).
(vii) In the second part, the object program is translated into the target program by the assembler.

Figure: Execution process of source program in Compiler

Summary of Compiler Design

Overall, compiler design is a complex process that involves multiple stages and requires a deep
understanding of both the programming language and the target platform. A well-designed
compiler can greatly improve the efficiency and performance of software programs, making
them more useful and valuable for users.

Compiler

 Cross Compiler: a compiler that runs on a machine 'A' and produces code for another machine
'B'. It is capable of creating code for a platform other than the one on which the
compiler is running.
 Source-to-source Compiler (also called a transcompiler or transpiler): a compiler that
translates source code written in one programming language into the source code of
another programming language.

TYPES OF COMPILERS

There are mainly three types of compilers.


 Single Pass Compilers
 Two Pass Compilers
 Multipass Compilers

Single Pass Compiler

When all the phases of the compiler are present inside a single module, it is simply called
a single-pass compiler. It performs the work of converting source code to machine code.

Two Pass Compiler

A two-pass compiler is a compiler in which the program is translated twice: the first pass (the front
end) analyses the source program, and the second pass (the back end) generates the target code.

Multipass Compiler

When several intermediate representations are created in a program and the syntax tree is processed
many times, the compiler is called a multi-pass compiler. It breaks the code into smaller pieces.

Features of a compiler
a. Correctness
b. Speed of compilation
c. Preserve the correct meaning of the code
d. Compilation time proportional to program size
e. Good diagnostics for syntax errors
f. Good error reporting and handling
g. Works well with the debugger

QUIZ QUESTIONS

Question 1: What are the 3 levels of programming languages?


Answer
Types of Languages:

There are three main kinds of programming language: machine language, assembly
language, and high-level language.

Question 2: Is C++ a high- or low-level language?


Answer

C++ can perform both low-level and high-level programming, and that's why it is essentially
considered a mid-level language. However, as its programming syntax also includes
comprehensible English, many also view C++ as another high-level language.

Question 3: Is Python a low-level language?


Answer

Python and C# are examples of high-level languages that are widely used in education and in the
workplace. A high-level language is one that is user-oriented in that it has been designed to make
it straightforward for a programmer to convert an algorithm into program code. A low-level
language is machine oriented.

Question 4: Is HTML a programming language?


Answer

HTML is not a programming language. It's a markup language. In fact, that is the technology's
name: HyperText Markup Language. That self-identified fact alone should settle the debate.

Question 5: What is the No. 1 programming language?

Answer

JavaScript. JavaScript is one of the world's most popular programming languages on the web.
Using JavaScript, you can build some of the most interactive websites.

Question 6: Is JSON (JavaScript Object Notation) a programming language?

Answer
JSON is a lightweight, text-based, language-independent data interchange format. It was derived
from the JavaScript/ECMAScript programming language, but is programming language
independent.
Question 7: What is the hardest programming language?
Answer

Malbolge is considered the hardest programming language to learn. It is so hard that it has to be
set aside in a class of its own. It took two whole years for the first working Malbolge program to
be written.
LANGUAGE PROCESSING SYSTEM
We have learnt that any computer system is made of hardware and software. The hardware
understands a language, which humans cannot understand. So we write programs in high-level
language, which is easier for us to understand and remember. These programs are then fed into a
series of tools and OS components to get the desired code that can be used by the machine. This
is known as Language Processing System.

The high-level language is converted into binary language in various phases. A compiler is a
program that converts high-level language to assembly language. Similarly, an assembler is a
program that converts the assembly language to machine-level language.
Let us first understand how a program, using C compiler, is executed on a host machine.
 User writes a program in C language (high-level language).
 The C compiler compiles the program and translates it into an assembly program (low-
level language).
 An assembler then translates the assembly program into machine code (object).
 A linker tool is used to link all the parts of the program together for execution
(executable machine code).
 A loader loads all of them into memory and then the program is executed.
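With the GNU toolchain, these stages can be observed one at a time. The following is a typical command sequence (the file name main.c is a hypothetical example used only for illustration):

gcc -E main.c -o main.i   (run the preprocessor only)
gcc -S main.i -o main.s   (compile to an assembly program)
gcc -c main.s -o main.o   (assemble into an object file)
gcc main.o -o main        (link into an executable)
./main                    (the loader places the executable in memory and it runs)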
Before diving straight into the concepts of compilers, we should understand a few other tools that
work closely with compilers.
Preprocessor
A preprocessor, generally considered as a part of compiler, is a tool that produces input for
compilers. It deals with macro-processing, augmentation, file inclusion, language extension, etc.
Interpreter
An interpreter, like a compiler, translates high-level language into low-level machine language.
The difference lies in the way they read the source code or input. A compiler reads the whole
source code at once, creates tokens, checks semantics, generates intermediate code, translates the
whole program and may involve many passes. In contrast, an interpreter reads a statement from
the input, converts it to an intermediate code, executes it, then takes the next statement in
sequence. If an error occurs, an interpreter stops execution and reports it, whereas a compiler
reads the whole program even if it encounters several errors.

Assembler
An assembler translates assembly language programs into machine code. The output of an
assembler is called an object file, which contains a combination of machine instructions as well
as the data required to place these instructions in memory.
Linker
A linker is a computer program that links and merges various object files together in order to make
an executable file. All these files might have been compiled by separate assemblers. The major
task of a linker is to search for and locate referenced modules/routines in a program and to
determine the memory locations where this code will be loaded, making the program instructions
have absolute references.
Loader
A loader is a part of the operating system and is responsible for loading executable files into
memory and executing them. It calculates the size of a program (instructions and data) and creates
memory space for it. It initializes various registers to initiate execution.
Cross-compiler
A compiler that runs on platform (A) and is capable of generating executable code for platform
(B) is called a cross-compiler.
Source-to-source Compiler
A compiler that takes the source code of one programming language and translates it into the
source code of another programming language is called a source-to-source compiler.

Computer languages
Computer languages have progressed from Low-Level Languages to High-Level Languages over
the years. Programs could only be written in Binary Language in the early days of computing.
The following categories apply to computer languages:

Machine language

Machine language, often known as a low-level language, refers to a programming language that
is directly understood and executed by a computer's hardware.

The machine-level language is a language that consists of a set of instructions that are in the
binary form 0 or 1. As we know that computers can understand only machine instructions, which
are in binary digits, i.e., 0 and 1, so the instructions given to the computer can be only in binary
codes. Creating a program in a machine-level language is a very difficult task as it is not easy for
the programmers to write the program in machine instructions. It is error-prone as it is not easy
to understand, and its maintenance cost is also very high. A machine-level language is not portable:
each computer has its own machine instructions, so a program written for one computer will no
longer be valid on another computer.

Different processor architectures use different machine code; for example, a PowerPC processor
uses a RISC architecture, which requires different code than an Intel x86 processor, which has a
CISC architecture.

Low-level language

Low-level language is the sole form of programming language that can be comprehended
by a computer. Low-level language, alternatively referred to as Machine Language, is a
programming language that is closely associated with the hardware architecture of a
computer system. The machine language is composed exclusively of two symbols,
namely 1 and 0. The instructions of machine language are exclusively expressed in binary
notation, consisting solely of the digits 1 and 0. Computers have the inherent ability to
comprehend machine language directly.

The low-level language is a programming language that provides no abstraction from the
hardware, and it is represented in 0 or 1 forms, which are the machine instructions. The
languages that come under this category are the Machine level language and Assembly language.

Assembly language

The assembly language contains some human-readable commands such as mov, add, sub, etc.
The problems that we face in machine-level language are reduced to some extent by using an
extended form of machine-level language known as assembly language. Since assembly language
instructions are written in English-like words such as mov, add and sub, they are easier to write and
understand.

As we know that computers can only understand the machine-level instructions, so we require a
translator that converts the assembly code into machine code. The translator used for translating
the code is known as an assembler.

The assembly language code is not portable because the data is stored in computer registers, and
the computer has to know the different sets of registers.

The assembly code is not faster than machine code because assembly language comes above
machine language in the hierarchy, which means that assembly language has some abstraction
from the hardware while machine language has zero abstraction.

Assembly language, also referred to as a middle-level language, is a low-level programming
language that is closely related to machine code.

A middle-level language refers to a type of computer language wherein instructions are
formulated utilizing symbols such as letters, numerals, and special characters. Assembly
language can be classified as a middle-level programming language. In the context of assembly
language programming, mnemonics are utilized as pre-established terms to represent specific
instructions or operations. In the process of programming, binary code instructions, which are
written in low-level language, are substituted with mnemonics and operands in middle-level
language. However, due to the computer's inability to comprehend mnemonics, the utilization of
a translator known as Assembler is necessary in order to convert mnemonics into machine
language.

The assembler is a software tool that functions as a translator, accepting assembly code as its
input and generating machine code as its output. This implies that the computer lacks the ability
to comprehend middle-level language, necessitating its translation into a low-level language in
order to render it comprehensible to the computer. The assembler is employed to convert an
intermediate-level language into a low-level language.

The command "g++ -S main.cpp -o main.s" is used to compile the source code file "main.cpp"
using the g++ compiler and generate the assembly code file "main.s" as output.

High-Level Programming Language

Definition 1: A high-level programming language is a language that has an abstraction of the
attributes of the computer.

Or
Definition 2: A high-level language (HLL) is a programming language such as C, FORTRAN,
or Pascal that enables a programmer to write programs that are more or less independent of a
particular type of computer. Such languages are considered high-level because they are closer to
human languages and further from machine languages.

High-level programming is more convenient to the user in writing a program.

Definition 3: A high-level programming language is a type of computer programming language
that is designed to be easily understood and used by humans. It provides a level of abstraction
from the underlying hardware.

Definition 4: A high-level language refers to a type of computer programming language that is
designed to be easily comprehensible by users.

Description of a high-level language:

The high-level programming language bears resemblance to natural human languages,
employing a prescribed set of grammatical rules to facilitate the formulation of instructions in a
more accessible manner. Keywords and syntax are integral components of high-level
programming languages. Keywords are predetermined terms that hold certain meanings inside
the language, while syntax refers to the collection of rules that dictate how these keywords are
combined to form valid instructions. The high-level programming language is more
comprehensible for humans; however, it remains incomprehensible to computers. The process
of converting high-level language into low-level language is necessary in order to render it
comprehensible to the computer. Compilers or interpreters are employed to facilitate the
conversion of high-level programming languages into low-level machine code.
Languages such as FORTRAN, C, C++, JAVA, and Python are illustrative of high-level
programming languages. All of these programming languages employ a form of human-readable
language, such as English, to compose computer instructions. The instructions undergo a process
of conversion into low-level language by the compiler or interpreter, facilitating comprehension
by the computer.

What is a programming language?

A programming language defines a set of instructions that are compiled together to perform a
specific task by the CPU (Central Processing Unit). The programming language mainly refers to
high-level languages such as C, C++, Pascal, Ada, COBOL, etc.

Each programming language contains a unique set of keywords and syntax, which are used to
create a set of instructions. Thousands of programming languages have been developed till now,
but each language has its specific purpose. These languages vary in the level of abstraction they
provide from the hardware. Some programming languages provide less or no abstraction while
some provide higher abstraction. Based on the levels of abstraction, they can be classified into
two categories:

o Low-level language
o High-level language

The levels of abstraction from the hardware form a hierarchy: machine language provides no
abstraction, assembly language provides less abstraction, whereas a high-level language provides a
higher level of abstraction.

The first high-level programming languages were designed in the 1950s. Now there are dozens
of different languages, including Ada, Algol, BASIC, COBOL, C, C++, FORTRAN, LISP,
Pascal, and Prolog.

Differences between Machine-Level language and Assembly language

The following are differences between machine-level language and assembly language:

S/n | Machine-level language | Assembly language
1 | The machine-level language comes at the lowest level in the hierarchy, so it has zero abstraction from the hardware. | The assembly language comes above the machine language, meaning it has less abstraction from the hardware.
2 | It cannot be easily understood by humans. | It is easy to read, write, and maintain.
3 | The machine-level language is written in binary digits, i.e., 0 and 1. | The assembly language is written in simple English-like words, so it is easily understandable by users.
4 | It does not require any translator, as the machine code is directly executed by the computer. | In assembly language, the assembler is used to convert the assembly code into machine code.
5 | It is a first-generation programming language. | It is a second-generation programming language.

Advantages Of High-Level Languages


The main advantage of high-level languages over low-level languages is that they are easier to
read, write, and maintain. Ultimately, programs written in a high-level language must be
translated into machine language by a compiler or interpreter.

Low-Level Programming Language


A low-level programming language is a language that provides little or no abstraction from the
computer's hardware. In contrast to high-level languages, assembly languages are considered
low-level because they are very close to machine language.

The terms high-level and low-level are inherently relative.

What is the main difference between high level and low-level language?
The main difference between a high-level language and a low-level language is that programmers
can easily understand, interpret or compile a high-level language, while a machine cannot. On the
other hand, a machine can easily understand a low-level language, while human beings find it
difficult.

High-Level Versus Low-Level Languages


Low-level languages require little interpretation by the computer. This makes machine code fast
compared to other programming languages. Low-level languages give programmers more
control over data storage, memory, and computer hardware. They are typically used to write kernel
or driver software; they would not be used to write web applications or games.
In contrast, high-level languages are easier to grasp. They allow a programmer to write code more
efficiently. High-level languages have more safeguards to keep coders from issuing commands
that could potentially damage a computer. These languages don't give programmers as much
control as low-level ones do.
Here are some main differences between high- and low-level languages:
S/n | High-Level Language | Low-Level Language
1 | Programmer friendly | Machine friendly
2 | Less memory efficient | Highly memory efficient
3 | Easy to understand for programmers | Tough to understand for programmers
4 | Simple to debug | Comparatively complex to debug
5 | Simple to maintain | Comparatively complex to maintain
6 | Portable | Non-portable
7 | Can run on any platform | Machine-dependent
8 | Needs a compiler or interpreter for translation | Needs an assembler for translation
9 | Widely used for programming | Not commonly used in programming

Assignments
1. Itemize ten differences between high-level language and low-level language
2. What are the differences between an interpreter and a compiler in language processing?
3. Explain the processes involved in the language processing system, such as pre-processor,
compiler, assembler, linker, loader, and memory
4. Explain the 3 levels of programming languages

Compiler Structure
Any large piece of software is easier to understand and implement if it is divided into well-defined
modules.

Figure: Structure of a compiler

 In a compiler,
o linear analysis
 is called LEXICAL ANALYSIS or SCANNING and
 is performed by the LEXICAL ANALYZER or LEXER,
o hierarchical analysis
 is called SYNTAX ANALYSIS or PARSING and
 is performed by the SYNTAX ANALYZER or PARSER.

 During the analysis, the compiler manages a SYMBOL TABLE by


o recording the identifiers of the source program
o collecting information (called ATTRIBUTES) about them: storage allocation,
type, scope, and (for functions) signature.

 When the identifier x is found, the lexical analyzer
o generates the token id
o enters the lexeme x in the symbol-table (if it is not already there)
o associates to the generated token a pointer to the symbol-table entry x. This
pointer is called the LEXICAL VALUE of the token.

 During the analysis or synthesis, the compiler may DETECT ERRORS and report on
them.
o However, after detecting an error, the compilation should proceed allowing
further errors to be detected.
o The syntax and semantic phases usually handle a large fraction of the errors
detectable by the compiler.
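A minimal sketch of such a symbol table in C is shown below (the structure and function names are invented for this illustration and are not taken from any particular compiler). The lookup routine enters an identifier only once, and the index it returns is the lexical value attached to the id token.

#include <stdio.h>
#include <string.h>

/* One symbol-table entry: the lexeme plus the attributes collected
 * about it (only a type field is shown here). */
struct Symbol {
    char name[32];
    char type[16];
};

static struct Symbol table[100];
static int table_size = 0;

/* Return the index of `name`, inserting it if it is not already there.
 * This index is the "lexical value" attached to the id token. */
static int lookup_or_insert(const char *name) {
    for (int i = 0; i < table_size; i++)
        if (strcmp(table[i].name, name) == 0)
            return i;
    strncpy(table[table_size].name, name, sizeof table[table_size].name - 1);
    strcpy(table[table_size].type, "unknown");   /* filled in by later phases */
    return table_size++;
}

int main(void) {
    /* The lexical analyzer would call lookup_or_insert() for every identifier it scans. */
    printf("x -> entry %d\n", lookup_or_insert("x"));
    printf("y -> entry %d\n", lookup_or_insert("y"));
    printf("x -> entry %d (same entry reused)\n", lookup_or_insert("x"));
    return 0;
}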


HOME EXERCISE / PRACTICE QUESTIONS

Practicing the following questions will help you test your knowledge on the course. It is highly
recommended that you practice them.
1. GATE CS 2011, Question 1
Link https://www.geeksforgeeks.org/gate-gate-cs-2011-question-1/

2. GATE CS 2011, Question 19


Link https://www.geeksforgeeks.org/gate-gate-cs-2011-question-19/

3. GATE CS 2009, Question 17


Link https://www.geeksforgeeks.org/gate-gate-cs-2009-question-17/

4. GATE CS 1998, Question 27


Link https://www.geeksforgeeks.org/aptitude-gate-cs-1998-question-27/

5. GATE CS 2008, Question 85

Link https://www.geeksforgeeks.org/gate-gate-cs-2008-question-11/

6. GATE CS 1997, Question 8


Link https://www.geeksforgeeks.org/gate-gate-cs-1997-question-8/

7. GATE CS 2014 (Set 3), Question 65


Link https://www.geeksforgeeks.org/gate-gate-cs-2014-set-3-question-27/

8. GATE CS 2015 (Set 2), Question 29


Link https://www.geeksforgeeks.org/gate-gate-cs-2015-set-2-question-29/

Quiz
1. What are the main principles of compiled code?
Answer
Lexical analysis, Syntax analysis, Intermediate code generation, Code optimisation, Code
generation. Like an assembler, a compiler usually performs the above tasks by making multiple
passes over the input or some intermediate representation of the same.

2. What is a compiler and what are its types?


Answer
There are various types of compilers, which are as follows: traditional compilers (C, C++, and
Pascal), which transform a source program in a high-level language into its equivalent native
machine program or object program; and interpreters (LISP, SNOBOL, and Java).

3. What is the difference between a compiler and an interpreter?


Answer
The compiler generates an output in the form of (.exe). The interpreter does not generate any
output. Any change in the source program after the compilation requires recompiling the entire
code. Any change in the source program during the translation does not require retranslation of
the entire code.

4. What is an example of a compiler?

Answer
The language processor that reads the complete source program written in high-level language as
a whole in one go and translates it into an equivalent program in machine language is called a
Compiler. Example: C, C++, C#.

5. What is the rule for first in compiler design?


Answer
Rules to find First():
If X is a terminal, then First(X) is {X}. If X is a non-terminal and X → aα is a production, then add
'a' to First(X). If X → ε is a production, then add ε to First(X). If X → YZ is a production, then add
First(Y) − {ε} to First(X), and if ε is in First(Y), also add First(Z) to First(X).
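As a small worked example (a hypothetical grammar chosen only for illustration), consider:
E → T E'
E' → + T E' | ε
T → id | ( E )
Then First(T) = { id, ( }; since T does not derive ε, First(E) = First(T) = { id, ( }; and First(E') = { +, ε }.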

6. Why is it called a compiler?


Answer
A compiler is computer software that translates (compiles) source code written in a high-level
language (e.g., C++) into a set of machine-language instructions that can be understood by a
digital computer's CPU.

7. Why is compiler better than interpreter?

Answer
Programs that use compilers to translate their code can sometimes run faster than interpreted
code. A compiler keeps source code contained and private from end-users, which can be
especially beneficial for programs that use commercial code.

8. Which is faster compiler or interpreter?


Answer
A compiled program is faster to run than an interpreted program, but it takes more time to
compile and run a program than to just interpret it. A compiler indeed produces faster programs.
It happens fundamentally because it must analyze each statement just once, while an interpreter
must analyze it each time.

9. Which language is known as the machine code?


Answer
Machine code (also called machine language) is software that is executed directly by the CPU.
Machine code is CPU-dependent; it is a series of ones and zeroes that translate to instructions
that the CPU understands.

10. What is the main advantage of compiler?


Answer
The main advantage of a compiler is that it can translate the code in a single run. It consumes less
time, CPU utilization is better, and both syntactic and semantic errors can be checked
concurrently.

11. Is Python an interpreter or compiler?


Answer
Python is both a compiled and an interpreted language, which means that when we run Python
code, it is first compiled and then interpreted line by line. The compiled part is discarded as soon
as the code has been executed, so that the programmer is not exposed to unnecessary
complexity.

TAKE HOME ASSIGNMENT
Question 1. What are the 6 phases of a compiler?

ANSWER
The 6 phases of a compiler are:
 Lexical Analysis.
 Syntactic Analysis or Parsing.
 Semantic Analysis.
 Intermediate Code Generation.
 Code Optimization.
 Code Generation

Question 2. Describe the 6 phases of a compiler

ANSWER
Question 3. What are the two parts of compilation?

ANSWER
The two components of compilation are analysis and synthesis. The analysis stage separates the
source code into its constituent elements and produces an intermediate representation of the
source program. The target program is created from the intermediate term by the synthesis
component.

Question 4. What are the four stages of the compilation process?

ANSWER
The compilation process can be divided into four steps, i.e., Pre-processing, Compiling,
Assembling, and Linking. The preprocessor takes the source code as an input, and it removes all
the comments from the source code. The preprocessor takes the preprocessor directive and
interprets it.

An example is the compilation process in C (javatpoint), which is represented as follows:

Figure: Compilation Process in C – javatpoint

COMPILER CONSTRUCTION TOOLS

The compiler writer can use some specialized tools that help in implementing various phases of
a compiler. These tools assist in the creation of an entire compiler or its parts. Some commonly
used compiler construction tools include:
1. Parser Generator – It produces syntax analyzers (parsers) from the input that is
based on a grammatical description of programming language or on a context-free
grammar. It is useful as the syntax analysis phase is highly complex and consumes
more manual and compilation time. Example: PIC, EQM

2. Scanner Generator – It generates lexical analyzers from the input that consists of
regular expression description based on tokens of a language. It generates a finite
automaton to recognize the regular expression.
Example: Lex, Flex.

3. Syntax directed translation engines – It generates intermediate code with three


address format from the input that consists of a parse tree. These engines have
routines to traverse the parse tree and then produces the intermediate code. In this,
each node of the parse tree is associated with one or more translations.

4. Automatic code generators – It generates the machine language for a target


machine. Each operation of the intermediate language is translated using a
collection of rules and then is taken as an input by the code generator. A template
matching process is used. An intermediate language statement is replaced by its
equivalent machine language statement using templates.

5. Data-flow analysis engines – These are used in code optimization. Data-flow analysis is a
key part of code optimization; it gathers information about the values that flow from one
part of a program to another.
6. Compiler construction toolkits – It provides an integrated set of routines that aids
in building compiler components or in the construction of various phases of
compiler.

Features of compiler construction tools:

Lexical Analyzer Generator: This tool helps in generating the lexical analyzer or scanner of
the compiler. It takes as input a set of regular expressions that define the syntax of the
language being compiled and produces a program that reads the input source code and
tokenizes it based on these regular expressions.

Parser Generator: This tool helps in generating the parser of the compiler. It takes as input a
context-free grammar that defines the syntax of the language being compiled and produces a
program that parses the input tokens and builds an abstract syntax tree.

Code Generation Tools: These tools help in generating the target code for the compiler. They
take as input the abstract syntax tree produced by the parser and produce code that can be
executed on the target machine.

Optimization Tools: These tools help in optimizing the generated code for efficiency and
performance. They can perform various optimizations such as dead code elimination, loop
optimization, and register allocation.

Debugging Tools: These tools help in debugging the compiler itself or the programs that are
being compiled. They can provide debugging information such as symbol tables, call stacks,
and runtime errors.

Profiling Tools: These tools help in profiling the compiler or the compiled code to identify
performance bottlenecks and optimize the code accordingly.

Documentation Tools: These tools help in generating documentation for the compiler and the
programming language being compiled. They can generate documentation for the syntax,
semantics, and usage of the language.

Language Support: Compiler construction tools are designed to support a wide range of
programming languages, including high-level languages such as C++, Java, and Python, as
well as low-level languages such as assembly language.

Cross-Platform Support: Compiler construction tools may be designed to work on multiple
platforms, such as Windows, Mac, and Linux.

User Interface: Some compiler construction tools come with a user interface that makes it
easier for developers to work with the compiler and its associated tools.

Anatomy of a Compiler

A compiler is a tool that translates a program from one language to another language.
An interpreter is a tool that takes a program and executes it. In the first case the
program often comes from a file on disk and in the second the program is sometimes
stored in a RAM buffer, so that changes can be made quickly and easily through an
integrated editor. This is often the case in BASIC interpreters and calculator
programs. We will refer to the source of the program, whether it is on disk or in RAM,
as the input stream.

Regardless of where the program comes from it must first pass through a Tokenizer,
or as it is sometimes called, a Lexer. The tokenizer is responsible for dividing the
input stream into individual tokens, identifying the token type, and passing tokens one
at a time to the next stage of the compiler.

The next stage of the compiler is called the Parser. This part of the compiler has an
understanding of the language's grammar. It is responsible for identifying syntax
errors and for translating an error free program into internal data structures that can be
interpreted or written out in another language.

The data structure is called a Parse Tree, or sometimes an Intermediate Code
Representation. The parse tree is a language independent structure, which gives a
great deal of flexibility to the code generator. The lexer and parser together are often
referred to as the compiler's front end. The rest of the compiler is called the back end.
Due to the language independent nature of the parse tree, it is easy, once the front end
is in place, to replace the back end with a code generator for a different high level
language, or a different machine language, or to replace the code generator altogether
with an interpreter. This approach allows a compiler to be easily ported to another
type of computer, or for a single compiler to produce code for a number of different
computers (cross compilation).

Sometimes, especially on smaller systems, the intermediate representation is written
to disk. This allows the front end to be unloaded from RAM, and RAM is not needed
for the intermediate representation. This has two disadvantages: it is slower, and it
requires that the parse tree be translated to a form that can be stored on disk.

The next step in the process is to send the parse tree to either an interpreter, where it is
executed, or to a code generator preprocessor. Not all compilers have a code generator
preprocessor. The preprocessor has two jobs. The first is to break any expressions into
their simplest components. For example, the assignment a := 1 + 2 * 3 would be
broken into temp := 2 * 3; a := 1 + temp; Such expressions are called Binary
Expressions. Such expressions are necessary for generating assembler language code.
Compilers that translate from one high level language to another often do not contain
this step. Another task of the code generator preprocessor is to perform certain
machine independent optimizations.
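
To make this step concrete, here is a minimal Python sketch of the idea. The tuple-based parse tree and the helper names flatten and assign are assumptions made purely for the example, not the lecture's actual data structures.

    code = []          # the emitted simple statements
    temp_count = 0

    def flatten(node):
        """Reduce a sub-expression to a name, emitting a temporary for it."""
        global temp_count
        if not isinstance(node, tuple):                  # leaf: constant or variable
            return str(node)
        op, left, right = node                           # interior node: (op, left, right)
        temp_count += 1
        name = f"t{temp_count}"
        code.append(f"{name} := {flatten(left)} {op} {flatten(right)}")
        return name

    def assign(target, expr):
        """Emit code for  target := expr  with the root operation done last."""
        if isinstance(expr, tuple):
            op, left, right = expr
            code.append(f"{target} := {flatten(left)} {op} {flatten(right)}")
        else:
            code.append(f"{target} := {expr}")

    assign("a", ("+", 1, ("*", 2, 3)))                   # a := 1 + 2 * 3
    print("\n".join(code))
    # t1 := 2 * 3
    # a := 1 + t1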

After preprocessing, the parse tree is sent to the code generator, which creates a new
file in the target language. Sometimes the newly created file is then post processed to
add machine dependent optimizations.

Graphically, the different parts of a compiler are shown in the figure below:

Figure: Anatomy of a Compiler

Interpreters are sometimes called virtual machines. This stems from the idea that a
CPU is actually a low level interpreter - it interprets machine code. An interpreter is a
high level simulation of a CPU.

The Tokenizer

The job of the tokenizer is to read tokens one at a time from the input stream and pass the tokens
to the parser. The heart of the tokenizer is the following type:

token_type_enum = (glob_res_word,
con_res_word,
reserved_sym,
identifier,
string_type,
int_type,
real_type);

record Token_Type is
begin
infile : text;
cur_str : array [1..80] of char;
cur_str_len : integer;

cur_line : integer;
cur_pos : integer;

type_of_token : token_type_enum;
read_string : array [1..30] of char;
cap_string : array [1..30] of char;
int_val : integer;
float_val : real;
glob_res_word : glob_res_word_type;
con_res_word : con_res_word_type;
res_sym : res_sym_type;
end; (* Token *)

A variable of this type is used to hold the current token. The field infile is the input stream the
program being parsed is held in (for those that do not know Pascal, text files have the type text).
The next field is the current line being parsed. It is more efficient to read files a chunk at a time
rather than a character at a time, so it is standard practice to add a field to hold an entire string to
the token. Cur_str_len gives the length of the current string;

If the stream is from a RAM buffer then these two fields can be replaced with a pointer to the
correct position in the buffer.

The cur_line and cur_pos fields hold the current line number and current position in that line.
This data is used by the parser to indicate where errors occur.

Glob_res_word_type, con_res_word_type, and res_sym are enumerations. The enumerations are


not given here because they are language specific (we should at least pay lip service to being
language independent here) and they can be quite large. The tokenizer handles context sensitive
reserved words like a separate group of globally reserved words. It is up to the parser to decide
what context is currently being parsed and whether a context sensitive reserved word should be
treated as a reserved word or an identifier.

There is an alternate way to handle context sensitive reserved words. The tokenizer can handle
all identifiers simply as identifiers, but provide additional procedures to determine if an identifier
is a globally reserved word or a context sensitive reserved word. Then when the parser reads an
identifier it queries the tokenizer as to whether the identifier is one or the other. Which ever
method is used, context sensitive reserved words mean more work for the parser. This is why it
is preferred to make all reserved words global.
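
A rough Python sketch of this second scheme, with hypothetical word lists and procedure names, might look like this:

    # Hypothetical word lists; the real sets are language specific.
    GLOBAL_RESERVED  = {"BEGIN", "END", "IF", "THEN", "WHILE"}   # reserved everywhere
    CONTEXT_RESERVED = {"FORWARD", "AT", "NAME"}                 # reserved only in some contexts

    def is_global_reserved(word):
        return word.upper() in GLOBAL_RESERVED

    def is_context_reserved(word):
        return word.upper() in CONTEXT_RESERVED

    # The tokenizer returns every word simply as an identifier; the parser then
    # queries these procedures.  On reading "forward" in a context where the word
    # matters, the parser would treat it as reserved, but as an ordinary
    # identifier anywhere else.
    print(is_global_reserved("begin"), is_context_reserved("forward"))   # True True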

Read_string contains the token string as it was read from the input stream, and cap_string contains
the token string after it has been capitalized. That is, these strings contain only the token. When
the token type is reserved word, identifier, or string the correct value will be in one of these
fields. When the token type is integer or real a string representation of value will be found here.
Since Pascal is not case sensitive all strings will be capitalized as they are read. This will
facilitate locating variables and procedures in a case independent way. Sometimes, however, the
uncapitalized string is required, such as when a string constant is encountered in the input
stream.

Int_val and float_val will contain the correct value when either an integer or real are read.
Glob_res_word, con_res_word and res_sym are enumerations that contain all possible globally
reserved words, context sensitive reserved words and reserved symbols, respectively.

The tokenizer next must provide several procedures to manipulate tokens. An initialization
procedure is usually needed to open the input stream and find the first token. The parser will
need a procedure to read the next token on command. This procedure is shown below. The
procedure looks long and scary, but it is very straightforward. Most of the space is taken up with
comments, and there is nothing tricky in the code itself.

procedure Advance_Token (var Token : Token_Type);

var
read_str_idx : integer;
i : integer;

begin
with token do
begin
(* Clear strings *)
(* You may have to provide the following *)
(* procedure. Check your compiler's manuals *)
(* for how to do this *)
clear_string (cur_str);
clear_string (read_string);
clear_string (cap_string);

(* Find start of next token: skip blanks, reading *)
(* new lines from the input stream as needed *)
while (cur_pos > cur_str_len) or (cur_str[cur_pos] = ' ') do
begin
(* if end of line, get next line *)
if (cur_pos > cur_str_len) then
begin
(* if end of file, return the end-of-file reserved symbol *)
if (eof(infile)) then
begin
type_of_token := RESERVED_SYMBOL;
res_sym := END_OF_FILE;
return;
end; { if (eof(infile)) }
readln (infile, cur_str);
(* You may have to provide the following *)
(* procedure. Check your compiler's manuals *)
(* for how to do this *)
find_string_length (cur_str_len, cur_str);
cur_pos := 1;
end {if (cur_pos > cur_str_len)}
else
cur_pos := cur_pos + 1; (* skip the blank *)
end; { while }

(* copy token to read_string and cap_string *)

read_str_idx := 1;
(* you have to provide the function not_delimiter *)
(* it simply tests the character and returns true *)
(* if it is not in the set of delimiters *)
while (not_delimiter(cur_str[cur_pos])) do
begin
read_string[read_str_idx] := cur_str[cur_pos];
cap_string[read_str_idx] := upcase (cur_str[cur_pos]);
read_str_idx := read_str_idx + 1;
cur_pos := cur_pos + 1; (* advance past the copied character *)
end; { while (not_delimiter(cur_str[cur_pos])) }

(* determine token type *)


(* is token an identifier? *)

if (cap_string[1] >= 'A') and
(cap_string[1] <= 'Z') then
begin
(* is token a global reserved word? *)

(* glob_res_word_table is a table (possibly a *)


(* binary search tree) of reserved words. *)
(* Find_in_table returns the enumeration value *)
(* associated with the reserved word if it is a *)
(* globally reserved word. Otherwise it returns *)
(* UNDEFINED. *)
find_in_table(glob_res_word_table,
cap_string,
glob_res_word);
if NOT (glob_res_word = UNDEFINED) then
begin
type_of_token := GLOBAL_RES_WORD;
return;
end; { if NOT (glob_res_word = UNDEFINED) }

(* is token a context sensitive reserved word? *)


find_in_table(con_res_word_table,
cap_string,
con_res_word);
if NOT (con_res_word = UNDEFINED) then
begin
type_of_token := CONTEXT_RES_WORD;
return;
end; { if NOT (con_res_word = UNDEFINED) }

(* if its not a global reserved word or a context *)


(* sensitive reserved word, it must be an *)
(* identifier *)

type_of_token := IDENTIFIER;
return;
end; { if (cap_string[1] >= 'A') and
(cap_string[1] <= 'Z') }

(* is token a number? *)
if ((cap_string[1] >= '0') and
(cap_string[1] <= '9')) or
(cap_string[1] = '-') then
begin
(* is token a real or an integer? Look for '.' or 'E' *)
for i := 2 to read_str_idx do
if (cap_string[i] = '.') or
(cap_string[i] = 'E') then
begin
(* once again, you may have to provide *)
(* the following function to translate *)
(* a string to a real *)
float_val := string_to_real(cap_string);
type_of_token := real_type;
return;
end; {if (cap_string[i] = '.') or
(cap_string[i] = 'E') }

(* no '.' or 'E' was found, so it is an integer *)
int_val := string_to_int(cap_string);
type_of_token := int_type;
return;
end; { if token is a number }

(* is token a string? *)
if (cap_string[1] = '''') then (* this syntax seems
strange, but it seems to
work! *)
begin
type_of_token := string_type;
return;
end;

(* is token a reserved symbol? *)


find_in_table(res_sym_table,
cap_string,
res_sym);
if NOT (res_sym = UNDEFINED) then
begin
type_of_token := reserved_sym;
return;
end;

(* if the type of token has not been found yet *)


(* it must be an unknown type *)
(* This is a lexical error *)
type_of_token := UNKNOWN_TOKEN_TYPE;

end; { with token do }


end; { procedure advance_token }

This procedure is actually only about two and a half pages long, and without comments it would
probably be less than two. Some software engineers stress that a procedure should not be more
than a page long. Such "engineers" are generally college professors that have never ventured
beyond the walls of their ivory towers. In real life a two and a half page procedure is considered
not overly long. As long as the entire procedure is on the same logical level, it will be readable
and easy to understand.

The general logic of the procedure should be easy to see by reading it (or its comments). First we
find the next token. This might involve reading the next line from the input stream. Next we
copy the token into read_string and cap_string. Then we set about determining the type of the
token. If the token starts with a letter, it is an identifier, global reserved word or context sensitive
reserved word. To determine if the identifier is a context sensitive or global reserved word, tables
are queried that contain each type of word. If the identifier is found in one of the tables, the
associated enumeration is returned.

Note that a very flexible tokenizer could be created by using strings instead of enumerations and
keeping the reserved words and symbols in a file. When the tokenizer is initialized the reserved
words and symbols can be read into the tables. This way the language the tokenizer works on can
be changed by simply changing the files. No source code would need to be changed. The drawback
is that the parser still needs to perform comparisons: if strings are used instead of enumerations,
less efficient string compares must be used in place of more efficient comparisons on
enumeration values.
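
A minimal Python sketch of this table-driven idea is shown below; the file names are hypothetical.

    def load_table(path):
        """Read one reserved word or symbol per line, capitalized for lookup."""
        with open(path, encoding="utf-8") as f:
            return {line.strip().upper() for line in f if line.strip()}

    # Hypothetical file names; one entry per line in each file.
    global_res_words  = load_table("global_reserved.txt")
    context_res_words = load_table("context_reserved.txt")
    reserved_symbols  = load_table("reserved_symbols.txt")

    # Retargeting the tokenizer to another language now only means editing the
    # three files; the cost is that every lookup is a string comparison rather
    # than a comparison on enumeration values.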

If the first character of the token is a digit the token is a number, or if the first character is a
minus sign the token is a negative number. If the token is a number it might be a real or an
integer. If it contains a decimal point or the letter E (which indicates scientific notation) then it is
a real, otherwise it is an integer. Note that this could be masking a lexical error. If the file
contains a token "9abc" the lexer will turn it into an integer 9. It is likely that any such error will
cause a syntax error which the parser can catch, however the lexer should probably be beefed up
to look for such things. It will make for more readable error messages to the user.

If the token is not a number, it could be a string. Strings in Pascal are identified by single quote
marks. Finally, if the token is not a string it must be a reserved symbol. For convenience, the
reserved symbols are stored in the same type of table as the global and context sensitive reserved
words. If the token is not found in this table, it is a lexical error. The tokenizer does not handle
errors itself, so it simply notifies the parser that an unidentified token type has been found. The
parser will handle the error.

QUESTIONS
1. Draw anatomy of a compiler
2. Explain the work of tokenizer
3. Discuss the work of parser
4. Enumerate the work of intermediate code
5. Highlights the work of code generators

References
https://www.geeksforgeeks.org/introduction-of-compiler-design/
https://www.geeksforgeeks.org/introduction-of-lexical-analysis/
https://www.loc.gov/preservation/digital/formats/fdd/fdd000381.shtml
https://www.javatpoint.com/compiler-phases

Lecture note:

Weekly Synopsis

Weeks 3 & 4: Automata theory – finite state automaton, state diagrams, state tables

What is Automata theory?

Automata theory is the study of abstract machines and automata, as well as the computational problems that
can be solved using them.

What is automata theory used for?

The major objective of automata theory is to develop methods by which computer scientists can describe
and analyze the dynamic behavior of discrete systems, in which signals are sampled periodically.

It is a theory in theoretical computer science. The word automaton comes from the Greek word αὐτόματος,
which means "self-acting, self-willed, self-moving".

What do you mean by automata?

An automaton (plural: automatons or automata) is a mechanism that is relatively self-operating,
especially a robot: a machine or control mechanism designed to follow automatically a predetermined
sequence of operations or to respond to encoded instructions.

What is automata theory in computer?

Automata theory is a theoretical branch of computer science. It studies abstract mathematical machines
called automatons. When given a finite set of inputs, these automatons automatically imitate humans
performing tasks by going through a finite sequence of states.

Types of Automata:

What are the two types of automata?

 deterministic finite automata (DFA)
 non-deterministic finite automata (NFA)

Classes of finite automata

Finite Automata
Finite means something that has a limited number of possibilities or possible outcomes

Finite state Automata or Finite State Machine is the simplest model used in Automata.

Note: Model in this context means machine or learning algorithm

Finite state automata accept regular languages. Here the term finite means that the automaton has a limited
number of possible states and that the alphabet contains a finite number of symbols. A finite state automaton is
represented by 5 tuples or elements (Q, Σ, q0, F, δ):

Q: It is a collection of a finite set of states.

Σ: It is a finite set of input symbols called the alphabet of the automaton.
δ: It is used for representing the transition function.

The states of an NDFA move from one state to another state in response to some inputs by using the transition
function.

The transition function used in an NDFA is given as follows:

δ: Q × Σ → 2^Q

Where 2^Q is the power set of Q. In the graph representation of the automaton, the transition function is
represented by arcs between states and the labels on the arcs.

q0: It is used for representing the initial state of the NDFA from where any input is processed.

F: It is a collection of a finite set of final states.

Formal Notation used in the representation of Finite Automata


We can represent Finite automata in two ways, which are given as follow:

Classes of finite automata representation

1. Transition diagram

The transition diagram is also called a transition graph; it is represented by a digraph. A transition graph
consists of three things:

 Arrow (->): The initial state in the transition diagram is marked with an arrow.

 Circle : Each circle represents the state.


 Double circle : Double circle indicates the final state or accepting state.

 Example of transition diagram representation are illustrated as follows:

Note: A finite state automaton is represented by 5 tuples or elements (Q, Σ, q0, F, δ).

2. Transition table

It is the tabular representation of the behavior of the transition function: it takes two arguments, the first is a
state and the other is an input symbol, and it returns a value, which is the new state of the automaton. It represents all the
moves of the finite-state machine based on the current state and input.

In the transition table, the initial state is represented with an arrow, and the final state is represented by a single
circle.

Formally a transition table is a 2-dimensional array, which consists of rows and columns where:
 The rows in the transition table represent the states.
 The columns contain the state in which the machine will move on the input alphabet.

Example of transition table representation are illustrated as follows:

Self-assessments:

1. Compose a finite state diagram to represent the 5 tuples or elements (Q, Σ, q0, F, δ):


2. Write a transition table to represent the 5 tuples or elements (Q, Σ, q0, F, δ):

Quiz:

(i) State finite state automata representation by 5 tuples or elements

(ii) Compose a finite diagram to represent 5 tuples or elements

(iii) State two composition of a formally transition table

(iv) Formally, a transition table is a 2-dimensional array, which consists of rows and columns. What do
the rows represent? What do the columns contain?
(v) What is the other name for the transition diagram called?

(vi) Write three compositions of a transition graph

Quiz:

(i) Answer:
The 5 tuples or elements representation is (Q, Σ, q0, F, δ).

(ii) Answer:

(iii) Answer:
The two components of a formal transition table are rows and columns.

(iv) Answer:
 The rows in the transition table represent the states.
 The columns contain the state in which the machine will move on the input alphabet.

(v) Answer:
The transition diagram is also called a transition graph

(vi) Answer:
A transition graph consists of three things:

 Arrow (->): The initial state in the transition diagram is marked with an arrow.

 Circle : Each circle represents the state.


 Double circle : Double circle indicates the final state or accepting state.

END OF THE QUIZ ANSWER

Types of Finite Automata

Deterministic Finite Automata (DFA)

DFA is a short form of Deterministic Finite Automata. In a DFA, there is one and only one move from a given
state to the next state for any input symbol. In a DFA, there is a finite set of states, a finite set of input symbols,
and a finite set of transitions from one state to another that occur on input symbols chosen from the
alphabet; for each state and input symbol there is exactly one transition out of the state.
Formal Notation of Deterministic Finite Automata (DFA):
A DFA contains 5 tuples or elements (Q, Σ, δ, q0, F):

Where,
Q: It is a collection of a finite set of states.
Σ: It is a finite set of input symbols called the alphabet of the automaton.
δ: It is used for representing the transition function.

The states of a DFA move from one state to another state in response to some inputs by using the transition
function. The transition function of a DFA is given as follows:

δ: Q × Σ → Q

q0: It is used for representing the initial state of the DFA from where any input is processed.

F: It is a collection of a finite set of final states.

Example of DFA
Design a DFA with Σ = {0, 1} that accepts those strings ending with '01'.

Solution:
L = {01, 001, 101, 1101, …} is the language generated.
Q = {q0, q1, q2} is the set of states
Σ = {0, 1}
q0 is the initial state
q2 is the final state

Transition diagram

Transition table
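
Because the transition diagram and transition table are given as figures, the same DFA is also written out below as its 5 tuples (Q, Σ, δ, q0, F) in a small Python sketch and run on a few strings; the dictionary encoding of δ is just one possible representation chosen for the illustration.

    # The same DFA written out as its 5 tuples; δ (delta) is stored as a
    # dictionary keyed by (state, input symbol).
    Q     = {"q0", "q1", "q2"}
    SIGMA = {"0", "1"}
    q0    = "q0"
    F     = {"q2"}
    DELTA = {                                  # accepts strings over {0, 1} ending in "01"
        ("q0", "0"): "q1", ("q0", "1"): "q0",
        ("q1", "0"): "q1", ("q1", "1"): "q2",
        ("q2", "0"): "q1", ("q2", "1"): "q0",
    }

    def accepts(w):
        state = q0
        for symbol in w:
            state = DELTA[(state, symbol)]     # exactly one move per (state, symbol)
        return state in F

    print([w for w in ("01", "001", "1101", "010", "11") if accepts(w)])
    # ['01', '001', '1101']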

Non-Deterministic Finite Automata

NDFA is a short form of Non-Deterministic Finite Automata. In an NDFA, there may be more than one move
or no move from a given state to the next state for any input symbol. NDFA is a simple machine that is used to
recognize the pattern by consuming the string of symbols and alphabets for each input symbol. NDFA differs
from DFA in the sense that NDFA can have any number of transitions to the next state from a given state on a
given input symbol.

Formal Notation of Non-Deterministic Finite Automata (NDFA):

An NDFA contains 5 tuples or elements (Q, Σ, δ, q0, F):

Q: It is a collection of a finite set of states.

Σ: It is a finite set of input symbols called the alphabet of the automaton.
δ: It is used for representing the transition function. The states of an NDFA move from one state to another state
in response to some inputs by using the transition function. The transition function used in an NDFA is given as
follows:

δ: Q × Σ → 2^Q

Where 2^Q is the power set of Q. In the graph representation of the automaton, the transition function is represented
by arcs between states and the labels on the arcs.
q0: It is used for representing the initial state of the NDFA from where any input is processed.
F: It is a collection of a finite set of final states.

Example of NDFA

Design an NDFA with Σ = {0, 1} that accepts those strings starting with '01'.

Solution:
L = {01, 010, 011, …} is the language generated.
Q = {q0, q1, q2}
It represents the set of states
Σ = {0, 1}
q0 is the initial state
q2 is the final state

Transition diagram

Transition table
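
The same NDFA can be sketched in Python with δ mapping each (state, symbol) pair to a set of next states (possibly empty, which models "no move"); again the encoding is only illustrative.

    Q     = {"q0", "q1", "q2"}
    SIGMA = {"0", "1"}
    q0    = "q0"
    F     = {"q2"}
    DELTA = {                                  # accepts strings over {0, 1} starting with "01"
        ("q0", "0"): {"q1"},                   # no entry for ("q0", "1"): no move
        ("q1", "1"): {"q2"},
        ("q2", "0"): {"q2"}, ("q2", "1"): {"q2"},
    }

    def accepts(w):
        current = {q0}                         # the NDFA can be in a set of states
        for symbol in w:
            current = {t for s in current for t in DELTA.get((s, symbol), set())}
        return bool(current & F)               # accept if any reachable state is final

    print([w for w in ("01", "010", "011", "10", "0") if accepts(w)])
    # ['01', '010', '011']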

QUIZ

1. What is automata theory in computer?

2. Explain two types of automata with the aid of a diagram

3. Describe finite state Automata

4. Define the terms in an NDFA that contains 5 tuples or elements (Q, Σ, δ, q0, F):

Answer:
The definition of the terms in an NDFA that contains 5 tuples or elements (Q, Σ, δ, q0, F):

Q: It is a collection of a finite set of states.

Σ: It is a finite set of input symbols called the alphabet of the automaton.
δ: It is used for representing the transition function. The states of an NDFA move from one state to another state
in response to some inputs by using the transition function.

q0: It is used for representing the initial state of the NDFA from where any input is processed.
F: It is a collection of a finite set of final states.

Note:

The transition function used in an NDFA is given as follows:

δ: Q × Σ → 2^Q

Where 2^Q is the power set of Q. In the graph representation of the automaton, the transition function is represented
by arcs between states and the labels on the arcs.

Minimization of Finite Automata


The term minimization refers to the construction of a finite automaton with a minimum number of states, which
is equivalent to the given finite automaton. The number of states used directly determines
the size of the automaton, so it is important to reduce the number of states. We minimize a
finite automaton by detecting those states whose presence or absence does not affect the language
accepted by the finite automaton.

Some important concepts used in the minimization of finite automata are:


(1) Unreachable state
(2) Dead State

Unreachable state: An unreachable state is a state that the finite automaton never reaches during the transitions
from one state to another state.

In the above DFA, we have the unreachable state E, because on any input from the initial state we are unable
to reach that state. This state is useless in the finite automaton. So, the best solution is to eliminate these types of
states to minimize the finite automaton.

Dead State: It is a non-accepting state which goes to itself on every possible input symbol.

In the above DFA, q5 and q6 are dead states because every possible input symbol takes each of them back to itself.

Minimization of Deterministic Finite Automata

The following steps are used to minimize a Deterministic Finite Automata.


Step1: First of all, we detect the unreachable state.
Step2: After detecting the unreachable state, in the second step, we eliminate the unreachable state (if found).
Step3: Identify the equivalent states and merge them.

a. In this, we divide all the states into two groups:


b. Group A: This group contains all accepting states of automata.
c. Group B: This group contains all non-accepting states of automata.

This step is repeated for every group: for each group, find the group that each state goes to on every input; if
there are differences, partition the group into sets containing states which go to the same groups under the inputs.

 The resulting final partition contains equivalent states; now merge them into a single state.

Step4: In this step, we detect dead states.


Step5: After detecting the dead states, the last step is to eliminate dead states.

Example:
Minimize the following DFA.

Solution

Step 1: Detect unreachable states.

 Start from the initial state. Add q0 to the temporary state set (T).

T = {q0}

 Now, for all states in the temporary state set T, find the transition from each state on each input symbol in Σ.
If the resulting state is not in T, add that state to T.

δ(q0, a) = q1

δ(q0, b) = q2

Now, T = {q0, q1, q2}

Again

δ(q1, a) = q3

δ(q1, b) = q4

Now, T = {q0, q1, q2, q3, q4}

Again

δ(q2, a) = q3

δ(q2, b) = q5

Now, T = {q0, q1, q2, q3, q4, q5}

Again

δ(q3, a) = q3

δ(q3, b) = q1

Now, there is no change in T because q1 and q3 are already in set T.

T = {q0, q1, q2, q3, q4, q5}

Again

δ(q4, a) = q4

δ(q4, b) = q5

Now, there is no change in T because q4 and q5 are already in set T.

T = {q0, q1, q2, q3, q4, q5}

Again

δ(q5, a) = q5

δ(q5, b) = q4

T = {q0, q1, q2, q3, q4, q5}

 Repeat previous step until T does not change

Finally we get T as:


T = {q0, q1, q2, q3, q4, q5}

 Now we will find unreachable states

U=Q–T

Q = {q0, q1, q2, q3, q4, q5, q6}

U = {q0, q1, q2, q3, q4, q5, q6} – {q0, q1, q2, q3, q4, q5}

U = {q6}, so q6 is the unreachable state.

Step 2: In this step, we eliminate the unreachable state found in first step.

Step 3: Identify the equivalent states and merge them.

 First of all, divide the states into two groups

Group A – q3, q4,q5 (contains accepting state)


Group B – q0, q1,q2 (contains non-accepting state)

Check Group A for input a


δ(q3, a) = q3
δ(q4, a) = q4
δ(q5, a) = q5
In a similar way, check group A for input b
δ(q3, b) = q1
δ(q4, b) = q5
δ(q5, b) = q4
q1 belongs to group B for input b, while q4 and q5 belong to group A for input b.

So, we partition group A as:

Group A1 - q3
Group A2 – q4, q5
Group B – q0, q1, q2

Now, we check Group B – q0, q1, q2 for both input symbols

But for input a, we have:


δ(q0, a) = q1
δ(q1, a) = q3
δ(q2, a) = q3
For input b, we have
δ(q0, b) = q2
δ(q1, b) = q4
δ(q2, b) = q5
q2 belongs to group B, while q4 and q5 belong to group A2 for input b.

So, we partition group B as:

Group B1 – q0
Group B2 – q1, q2

Check Group A2 for input a


δ(q4, a) = q4

δ(q5, a) = q5

Check Group A2 for input b

δ(q4, b) = q5

δ(q5, b) = q4

As both belong to the same group, the further division is not possible.

Now, we check Group B2 for input a and b

δ(q1, a) = q3
δ(q2, a) = q3
δ(q1, b) = q4
δ(q2, b) = q5
q4 and q5 belong to group A2 for input b. So no further partitioning is possible.

Finally, the following groups are formed:


Group A1 – q3
Group A2 – q4, q5
Group B1 – q0
Group B2 – q1, q2

The resulting automaton is given as follows:

Step 4: In this step, we detect dead states. There are no dead states in the above DFA; hence it is minimized.
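
The worked example can be checked mechanically. The Python sketch below implements Steps 1–3 on the same DFA; since the transitions of the unreachable state q6 are not given in the example, self-loops are assumed for it purely to complete the table.

    # Transition table of the example DFA over the alphabet {a, b}.
    # Accepting states are q3, q4 and q5.
    DELTA = {
        ("q0", "a"): "q1", ("q0", "b"): "q2",
        ("q1", "a"): "q3", ("q1", "b"): "q4",
        ("q2", "a"): "q3", ("q2", "b"): "q5",
        ("q3", "a"): "q3", ("q3", "b"): "q1",
        ("q4", "a"): "q4", ("q4", "b"): "q5",
        ("q5", "a"): "q5", ("q5", "b"): "q4",
        ("q6", "a"): "q6", ("q6", "b"): "q6",   # assumed: q6 is unreachable anyway
    }
    SIGMA, START, FINAL = ("a", "b"), "q0", {"q3", "q4", "q5"}
    STATES = {s for s, _ in DELTA}

    # Steps 1 and 2: keep only the states reachable from the initial state.
    reachable, frontier = {START}, [START]
    while frontier:
        s = frontier.pop()
        for a in SIGMA:
            t = DELTA[(s, a)]
            if t not in reachable:
                reachable.add(t)
                frontier.append(t)
    print("unreachable:", STATES - reachable)            # {'q6'}

    # Step 3: start from {accepting, non-accepting} and refine until stable.
    groups = [reachable & FINAL, reachable - FINAL]

    def group_of(state):
        return next(i for i, g in enumerate(groups) if state in g)

    changed = True
    while changed:
        changed = False
        for g in list(groups):
            # states behave alike if every input takes them to the same groups
            parts = {}
            for s in g:
                signature = tuple(group_of(DELTA[(s, a)]) for a in SIGMA)
                parts.setdefault(signature, set()).add(s)
            if len(parts) > 1:
                groups.remove(g)
                groups.extend(parts.values())
                changed = True
    print("equivalent groups:", groups)
    # [{'q3'}, {'q4', 'q5'}, {'q0'}, {'q1', 'q2'}]
    # Steps 4 and 5 (dead-state detection and removal) are not needed here,
    # because no dead states remain after the refinement.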

Assignment:

1. What is automata theory in computer?

2. Explain two types of automata with the aid of a diagram

3. Describe finite state Automata

4. What do you understand by minimization of finite automata


5. Explain some important concepts used in the minimization of finite automata
Such as:
(a) Unreachable state
(b) Dead State

The Design of State Machines


State tables and state diagrams

Introduction
The most difficult task in designing sequential circuits occurs at the very start of the design; in determining
what characteristics of a given problem require sequential operations, and more particularly, what behaviors
must be represented by a unique state. A poor choice of states coupled with a poor understanding of the
problem can make a design lengthy, difficult, and error-prone. With better understanding and a better choice of
states, the same problem might well be trivial. Whereas it is relatively straight-forward to describe sequential
circuit structure and define applicable engineering design methods, it is relatively challenging to find analytical
methods capable of matching design problem requirements to eventual machine states.
Note:
A sequential circuit refers to a special type of circuit. It consists of a series of various inputs and outputs.
Here, the outputs depend on a combination of both the present inputs as well as the previous outputs. This
previous output gets treated in the form of the present state.
What is sequential circuit with example?

In other words, their output depends on a SEQUENCE of the events occurring at the circuit inputs. Examples
of such circuits include clocks, flip-flops, bi-stables, counters, memories, and registers. The actions of the
sequential circuits depend on the range of basic sub-circuits.

What is the function of sequential circuit?

Sequential circuits are the other important digital type, used in counting and for memory actions. The
simplest type is the S-R flip-flop (or latch) whose output(s) can be set by one pair of inputs and reset by
reversing each input.

General form of a sequential circuit.

Sequential circuits are essentially combinational circuits with feedback. A block diagram of a generalized
sequential circuit is shown in Figure 1.

Figure 1. A block diagram of a generalized sequential circuit


The generalised circuit contains a block of combinational logic which has two sets of inputs and two sets of
outputs: external inputs and internal inputs fed back from the memory elements, together with external outputs
and internal outputs that drive the memory elements.
Restated, we can effectively present how to design, but we will present what to design through examples and
guided design problems. Therefore, this initial and most important design task, identifying behaviors in the
solution-space to a problem that require unique states, will be presented over time through examples, and you
must learn this skill through experience (some general guidelines will also be presented later).
In general, the first step in designing a new state machine is to identify all behaviors that might need states, and
all branching dependencies between states. Then, as an understanding of the problem and solution evolve,
original choices can be rethought, challenged, and improved.
Self-assessment

1: Sequential circuits have ‘memory’ because their outputs depend, in part, upon past outputs.
2: Combinational logic plus ‘memory’.
3: For n outputs from ‘memory’ and m external inputs there are 2^n internal states and 2^(m+n) possible total states.
4: Memory elements in synchronous circuits are flip-flops, which are clocked. Asynchronous circuits are unclocked.
5: The internal inputs and outputs must match (as they are connected).
6: Only one input can change at a time (fundamental mode operation).
7: ‘Cutting’ the connection between internal inputs and outputs.
8: (a) Horizontal; (b) vertical.
9: Oscillation.
10: Non-critical races do not affect the final output; critical races do.

What is a state table?

A state table is nothing more than a truth table that specifies the requirements for the next-state logic,
with inputs coming from the state register and from outside the circuit. The state table lists all required
states, and all possible next states that might follow a given present state.

What are state diagram and state tables?

State Tables and State Diagrams. The relationship that exists among the inputs, outputs, present states
and next states can be specified by either the state table or the state diagram. The state table
representation of a sequential circuit consists of three sections labelled present state, next state and output.

What is state diagram and example?

A state diagram is used to represent the condition of the system or part of the system at finite instances of
time. It's a behavioral diagram and it represents the behavior using finite state transitions. State diagrams are
also referred to as State machines and State-chart Diagrams.

Unified Modeling Language (UML) | State Diagrams


As noted above, these terms are often used interchangeably. Simply put, a state diagram is used to model the
dynamic behavior of a class in response to time and changing external stimuli. We can say that each and every
class has a state, but we don’t model every class using state diagrams; we prefer to model classes with three or
more states.

Uses of statechart diagram –

 We use it to state the events responsible for change in state (we do not show what processes cause
those events).
 We use it to model the dynamic behavior of the system.
 To understand the reaction of objects/classes to internal or external stimuli.

Firstly, let us understand what are Behavior diagrams? There are two types of diagrams in UML:
1. Structure Diagrams – Used to model the static structure of a system, for example- class diagram,
package diagram, object diagram, deployment diagram etc.
2. Behavior diagram – Used to model the dynamic change in the system over time. They are used to
model and construct the functionality of a system. So, a behavior diagram simply guides us through
the functionality of the system using Use case diagrams, Interaction diagrams, Activity diagrams
and State diagrams.
Difference between state diagram and flowchart –

The basic purpose of a state diagram is to portray various changes in state of the class and not the
processes or commands causing the changes. However, a flowchart on the other hand portrays the processes
or commands that on execution change the state of class or an object of the class.

Figure – a state diagram for user verification

The state diagram above shows the different states in which the verification sub-system or class exist for a
particular system.

Basic components of a state-chart diagram –

1. Initial state – We use a black filled circle represent the initial state of a System or a class.

Figure – initial state notation

2. Transition – We use a solid arrow to represent the transition or change of control from one
state to another. The arrow is labelled with the event which causes the change in state.

Figure – transition

3. State – We use a rounded rectangle to represent a state. A state represents the conditions or
circumstances of an object of a class at an instant of time.

Figure – state notation

4. Fork – We use a rounded solid rectangular bar to represent a Fork notation with incoming
arrow from the parent state and outgoing arrows towards the newly created states. We use the
fork notation to represent a state splitting into two or more concurrent states.

Figure – a diagram using the fork notation

5. Join – We use a rounded solid rectangular bar to represent a Join notation with incoming
arrows from the joining states and outgoing arrow towards the common goal state. We use the
join notation when two or more states concurrently converge into one on the occurrence of an
event or events.

Figure – a diagram using join notation

6. Self transition – We use a solid arrow pointing back to the state itself to represent a self
transition. There might be scenarios when the state of the object does not change upon the
occurrence of an event. We use self transitions to represent such cases.

Figure – self transition notation

7. Composite state – We use a rounded rectangle to represent a composite state also. We represent
a state with internal activities using a composite state.

Figure – a state with internal activities

8. Final state – We use a filled circle within a circle notation to represent the final state in a state
machine diagram.

Figure – final state notation

Steps to draw a state diagram –

1. Identify the initial state and the final terminating states.


2. Identify the possible states in which the object can exist (boundary values corresponding to
different attributes guide us in identifying different states).
3. Label the events which trigger these transitions.

Example – state diagram for an online order –

Figure – state diagram for an online order

The UML diagrams we draw depend on the system we aim to represent. Here is just an example of what an
online ordering system might look like:
1. On the event of an order being received, we transit from our initial state to Unprocessed order
state.
2. The unprocessed order is then checked.
3. If the order is rejected, we transit to the Rejected Order state.
4. If the order is accepted and we have the items available, we transit to the fulfilled order state.
5. However, if the items are not available we transit to the Pending Order state.
6. After the order is fulfilled, we transit to the final state. In this example, we merge the two states
i.e. Fulfilled order and Rejected order into one final state.

Note – Here we could have also treated fulfilled order and rejected order as final states separately.
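
For completeness, the same online-order behaviour can be written as a plain transition table in Python; the event names used here are invented for the illustration.

    # The state machine as a transition table; each entry maps
    # (current state, event) to the next state.
    TRANSITIONS = {
        ("Initial",     "order received"):  "Unprocessed",
        ("Unprocessed", "order rejected"):  "Final",       # rejected orders end here
        ("Unprocessed", "items available"): "Fulfilled",
        ("Unprocessed", "items missing"):   "Pending",
        ("Pending",     "items available"): "Fulfilled",
        ("Fulfilled",   "order shipped"):   "Final",       # fulfilled orders end here too
    }

    state = "Initial"
    for event in ("order received", "items missing", "items available", "order shipped"):
        state = TRANSITIONS[(state, event)]
        print(f"{event:16} -> {state}")
    # order received   -> Unprocessed
    # items missing    -> Pending
    # items available  -> Fulfilled
    # order shipped    -> Final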

QUIZ:

(i) What are state diagram and state tables?


(ii) What is the difference between state diagram and flowchart?

Assignment:

Question 1.
1. What is the difference between state diagram and flowchart?

Answer 1:
The basic purpose of a state diagram is to portray various changes in state of the class and not the
processes or commands causing the changes. However, a flowchart on the other hand portrays the processes
or commands that on execution change the state of class or an object of the class.

Question 2.
2. Illustrate a state diagram for user verification

Answer 2:

Figure – a state diagram for user verification

Question 3.
3. Illustrate a state diagram that shows the different states in which the verification sub-system or class exists
for a particular system.

Answer 3:
Also, the state diagram below shows the different states in which the verification sub-system or class exist
for a particular system.

Question 4.

4. State the basic components of a state-chart diagram –


Question 5.
5. If the basic components of a state-chart diagram are given as:

(i) Initial state, (ii) Transition, (iii) State, (iv) Fork, (v) Join, (vi) Self transition, (vii) Composite state,
(viii) final state.
Explain each basic component

Question 6.
6. Explain each basic component of a state-chart diagram given as:

(i) Initial state, (ii) Transition, (iii) State, (iv) Fork, (v) Join, (vi) Self transition, (vii) Composite state,
(viii) final state
Question 7.
(i) What is sequential circuit with example?

(ii) Illustrate a block diagram of a generalized sequential circuit


Reference

1. https://www.geeksforgeeks.org/unified-modeling-language-uml-state-diagrams/

2. https://www.sciencedirect.com/topics/engineering/sequential-circuits

3. Watch video on:


Introduction to State Table, State Diagram & State Equation
https://www.youtube.com/watch?v=NNOSWnTHakY

4. Watch video on:


State Tables and Diagrams

https://www.youtube.com/watch?v=WQbSFe_aab8

5. Watch video on:


Digital Logic - State Tables and State Diagrams

https://www.youtube.com/watch?v=2TGfiaCrL2s

Lecture Note 5 & 6


Week 5 & 6: Finite automata, From a regular expression to an NFA, Design of a lexical analyzer
generator, Optimization of DFA-based pattern matchers. Syntax Analysis: The role of the parser,
Context-free grammars.

Optimization of DFA-Based Pattern Matchers


What is the difference between optimisation and optimization?
Optimization is the American usage while optimisation is the way the British like to spell it and
both mean making the best of conditions, situations, environments or any given ingredients to
make the best possible (greatest, smallest, largest, tiniest etc.)
What is another word for optimization?
Synonyms of optimization (noun): addition, growth, boost, development, escalation, expansion.

What is optimization of DFA based pattern matchers in compiler design?

To optimize the DFA one has to follow the various steps. These are as follows:
Step 1: Remove all the states that are unreachable from the initial state via any set of the
transition of DFA.
Step 2: Draw the transition table for all pair of states.
Step 3: Now split the transition table into two tables T1 and T2.

What is the need of DFA optimization?


Minimisation/optimisation of a deterministic finite automaton refers to the detection of those
states of a DFA whose presence or absence in the DFA does not affect the language accepted by
the automaton. Such states are: unreachable or inaccessible states, and dead states.

What are the four optimization techniques used in the compiler?


Answer
(i) Compile Time Evaluation.
(ii) Common Subexpression Elimination.
(iii) Variable Propagation.
(iv) Dead Code Elimination

Optimization techniques used in the compiler are shown in the figure below.
What is the purpose of optimization in compiler?
Optimization is a program transformation technique which tries to improve the code by making
it consume fewer resources (i.e. CPU, memory) and deliver higher speed.

What are optimization techniques?


Optimization techniques are a powerful set of tools that are important in efficiently managing an
enterprise's resources and thereby maximizing shareholder wealth.

What are the two types of compiler optimization?


Source program: Optimizing the source program involves making changes to the algorithm or
changing the loop structures. The user is the actor here. Intermediate Code: Optimizing the
intermediate code involves changing the address calculations and transforming the procedure
calls involved. The techniques of code optimization are highlighted in the figure below.

What are the three categories of optimization?

There are three main elements to solve an optimization problem: an objective, variables, and
constraints. Each variable can have different values, and the aim is to find the optimal value for
each one. The objective is the desired result or goal of the problem.

What is classification of optimization?


Optimization problems can be classified based on the type of constraints, nature of design
variables, physical structure of the problem, nature of the equations involved, permissible value
of the design variables, deterministic/stochastic nature of the variables, separability of the
functions, and the number of objective functions.

What is optimization and its advantages?


Process Optimization is the field of adapting processes to perfect their features, while staying
within their limits. Generally, the objective is to minimize costs and maximize performance,
productivity, and efficiency.

What are the advantages of optimization in compiler design?


Here are six benefits that your business can get when optimizing code:
 It can make your code run faster.
 It can make your code use less memory.
 It can make your code more maintainable.
 It can make your code more reliable.
 It can make your code more understandable.
 It can make your code more reusable.

What is the best method of optimization?

Top Optimisation Methods in Machine Learning


 Gradient Descent. The gradient descent method is the most popular optimisation method.
 Stochastic Gradient Descent.
 Adaptive Learning Rate Method.
 Conjugate Gradient Method.
 Derivative-Free Optimisation.
 Zeroth Order Optimisation.
 For Meta Learning.

What are code optimization techniques?


The key areas of code optimization in compiler design are instruction scheduling, register
allocation, loop unrolling, dead code elimination, constant propagation, and function inlining.
These techniques aim to make code faster, more efficient, and smaller while preserving its
functionality.

What are optimization models used for?


Optimization models have been widely applied to information system design problems. Linear
programming models have been used to improve the efficiency of file allocation in distributed
information systems. The objective function of this type of model is to minimize the differences
between response times of servers.

What are the rules of optimization?


The most important rules you need to know when optimizing a program are:
 Don't.
 Don't yet.
 Don't optimize more than you need to.
Optimization of DFA-Based Pattern Matchers
1 Important States of an NFA
2 Functions Computed From the Syntax Tree
3 Computing nullable, firstpos, and lastpos
4 Computing followpos
5 Converting a Regular Expression Directly to a DFA
6 Minimizing the Number of States of a DFA
7 State Minimization in Lexical Analyzers
8 Trading Time for Space in DFA Simulation

In this section we present three algorithms that have been used to implement and optimize
pattern matchers constructed from regular expressions.
The first algorithm is useful in a Lex compiler, because it constructs a DFA directly from a
regular expression, without constructing an intermediate NFA. The resulting DFA also may have
fewer states than the DFA constructed via an NFA.

The second algorithm minimizes the number of states of any DFA, by combining states that have
the same future behavior. The algorithm itself is quite efficient, running in time O(n log n), where
n is the number of states of the DFA.
The third algorithm produces more compact representations of transition tables than the standard,
two-dimensional table.
1. Important States of an NFA
To begin our discussion of how to go directly from a regular expression to a DFA, we must first
dissect the NFA construction of Algorithm 3.23 and consider the roles played by various states.
We call a state of an NFA important if it has a non-ε out-transition. Notice that the subset
construction (Algorithm 3.20) uses only the important states in a set T when it computes
ε-closure(move(T, a)), the set of states reachable from T on input a. That is, the set of
states move(s, a) is nonempty only if state s is important. During the subset construction, two sets
of NFA states can be identified (treated as if they were the same set) if they:
Have the same important states, and either both have accepting states or neither does.
When the NFA is constructed from a regular expression by Algorithm 3.23, we can say more
about the important states. The only important states are those introduced as initial states in the
basis part for a particular symbol position in the regular expression. That is, each important state
corresponds to a particular operand in the regular expression.
The constructed NFA has only one accepting state, but this state, having no out-transitions, is not
an important state. By concatenating a unique right endmarker # to a regular expression r, we
give the accepting state for r a transition on #, making it an important state of the NFA for ( r )
# . In other words, by using the augmented regular expression ( r ) # , we can forget about
accepting states as the subset construction proceeds; when the construction is complete, any state
with a transition on # must be an accepting state.
The important states of the NFA correspond directly to the positions in the regular expression
that hold symbols of the alphabet. It is useful, as we shall see, to present the regular expression
by its syntax tree, where the leaves correspond to operands and the interior nodes correspond to
operators. An interior node is called a cat-node, or-node, or star-node if it is labeled by the
concatenation operator (dot), union operator |, or star operator *, respectively. We can construct a
syntax tree for a regular expression just as we did for arithmetic expressions in Section 2.5.1.
Example 3.31 : Figure 3.56 shows the syntax tree for the regular expression of our running
example. Cat-nodes are represented by circles. •

Leaves in a syntax tree are labeled by ε or by an alphabet symbol. To each leaf not labeled ε, we
attach a unique integer. We refer to this integer as the position of the leaf and also as a position
of its symbol. Note that a symbol can have several positions; for instance, a has positions 1 and 3
in Fig. 3.56. The positions in the syntax tree correspond to the important states of the constructed
NFA.
Example 3.32 : Figure 3.57 shows the NFA for the same regular expression as Fig. 3.56, with the
important states numbered and other states represented by letters. The numbered states in the
NFA and the positions in the syntax tree correspond in a way we shall soon see. •

2. Functions Computed from the Syntax Tree


To construct a DFA directly from a regular expression, we construct its syntax tree and then
compute four functions: nullable, firstpos, lastpos, and followpos, defined as follows. Each
definition refers to the syntax tree for a particular augmented regular expression ( r ) # .

1. nullable(n) is true for a syntax-tree node n if and only if the subexpression represented
by n has ε in its language. That is, the subexpression can be "made null" or the empty
string, even though there may be other strings it can represent as well.

2. firstpos(n) is the set of positions in the subtree rooted at n that corre-spond to the first symbol
of at least one string in the language of the subexpression rooted at n.

3. lastpos(n) is the set of positions in the subtree rooted at n that corre-spond to the last symbol
of at least one string in the language of the subexpression rooted at n.

4. followpos(p), for a position p, is the set of positions q in the entire syntax tree such that there
is some string x = a1a2···an in L((r)#) such that for some i, there is a way to explain the
membership of x in L((r)#) by matching ai to position p of the syntax tree and ai+1 to position q.
Example 3.33: Consider the cat-node n in Fig. 3.56 that corresponds to the expression (a|b)*a.
We claim nullable(n) is false, since this node generates all strings of a's and b's ending in an a; it
does not generate ε. On the other hand, the star-node below it is nullable; it generates ε along
with all other strings of a's and b's.
firstpos(n) = {1,2,3}. In a typical generated string like aa, the first position of the string
corresponds to position 1 of the tree, and in a string like ba, the first position of the string comes
from position 2 of the tree. However, when the string generated by the expression of node n is
just a, then this a comes from position 3.

lastpos(n) = {3}. That is, no matter what string is generated from the expression of node n, the
last position is the a from position 3 of the tree.
followpos is trickier to compute, but we shall see the rules for doing so shortly. Here is an
example of the reasoning: followpos(1) = {1,2,3}. Consider a string ···ac···, where the c is
either a or b, and the a comes from position 1. That is, this a is one of those generated by the a in
expression (a|b)*. This a could be followed by another a or b coming from the same
subexpression, in which case c comes from position 1 or 2. It is also possible that this a is the last
in the string generated by (a|b)*, in which case the symbol c must be the a that comes from
position 3. Thus, 1, 2, and 3 are exactly the positions that can follow position 1.

3 Computing nullable, firstpos, and lastpos

We can compute nullable, firstpos, and lastpos by a straightforward recursion on the height of
the tree. The basis and inductive rules for nullable and firstpos are summarized in Fig. 3.58. The
rules for lastpos are essentially the same as for firstpos, but the roles of children c1 and c2 must
be swapped in the rule for a cat-node.

Example 3.34: Of all the nodes in Fig. 3.56 only the star-node is nullable. We note from the
table of Fig. 3.58 that none of the leaves are nullable, because they each correspond to non-ε
operands. The or-node is not nullable, because neither of its children is. The star-node is
nullable, because every star-node is nullable. Finally, each of the cat-nodes, having at least one
nonnullable child, is not nullable.
The computation of firstpos and lastpos for each of the nodes is shown in
Fig. 3.59, with firstpos(n) to the left of node n, and lastpos(n) to its right. Each of the leaves has
only itself for firstpos and lastpos, as required by the rule for non-ε leaves in Fig. 3.58. For the
or-node, we take the union of firstpos at the

children and do the same for lastpos. The rule for the star-node says that we take the value of
firstpos or lastpos at the one child of that node.
Now, consider the lowest cat-node, which we shall call n. To compute firstpos(n), we first
consider whether the left operand is nullable, which it is in this case. Therefore, firstpos for n is
the union of firstpos for each of its children, that is {1,2} ∪ {3} = {1,2,3}. The rule for lastpos
does not appear explicitly in Fig. 3.58, but as we mentioned, the rules are the same as for
firstpos, with the children interchanged. That is, to compute lastpos(n) we must ask whether its
right child (the leaf with position 3) is nullable, which it is not. Therefore, lastpos(n) is the same
as lastpos of the right child, or {3}.
4 Computing followpos

Finally, we need to see how to compute followpos. There are only two ways that a position of a
regular expression can be made to follow another.
1. If n is a cat-node with left child c1 and right child c2, then for every position i
in lastpos(c1), all positions in firstpos(c2) are in followpos(i).
2. If n is a star-node, and i is a position in lastpos(n), then all positions in firstpos(n) are
in followpos(i).
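These four functions translate almost line for line into code. The sketch below is a minimal illustration in Python (not from the text; the Node class and its field names are hypothetical): nullable, firstpos, and lastpos follow the recursion of Fig. 3.58, and followpos applies the two rules above during one walk over the tree. The usage lines rebuild the tree of Fig. 3.56 for (a|b)*abb# and reproduce firstpos(root) = {1,2,3} and the followpos sets of Fig. 3.60.

from dataclasses import dataclass, field

@dataclass
class Node:
    kind: str                       # 'leaf', 'cat', 'or', or 'star'
    symbol: str = None              # for leaves: the input symbol, or 'eps'
    pos: int = None                 # for non-epsilon leaves: their position
    children: list = field(default_factory=list)

def nullable(n):
    if n.kind == 'leaf':
        return n.symbol == 'eps'    # only an epsilon-leaf is nullable
    if n.kind == 'or':
        return nullable(n.children[0]) or nullable(n.children[1])
    if n.kind == 'cat':
        return nullable(n.children[0]) and nullable(n.children[1])
    return True                     # a star-node is always nullable

def firstpos(n):
    if n.kind == 'leaf':
        return set() if n.symbol == 'eps' else {n.pos}
    if n.kind == 'or':
        return firstpos(n.children[0]) | firstpos(n.children[1])
    if n.kind == 'cat':
        c1, c2 = n.children
        return firstpos(c1) | firstpos(c2) if nullable(c1) else firstpos(c1)
    return firstpos(n.children[0])  # star-node

def lastpos(n):
    # same as firstpos, with the roles of c1 and c2 swapped at a cat-node
    if n.kind == 'leaf':
        return set() if n.symbol == 'eps' else {n.pos}
    if n.kind == 'or':
        return lastpos(n.children[0]) | lastpos(n.children[1])
    if n.kind == 'cat':
        c1, c2 = n.children
        return lastpos(c1) | lastpos(c2) if nullable(c2) else lastpos(c2)
    return lastpos(n.children[0])   # star-node

def followpos(root):
    fp = {}                         # position -> set of positions that can follow it
    def walk(n):
        for c in n.children:
            walk(c)
        if n.kind == 'cat':         # rule 1
            c1, c2 = n.children
            for i in lastpos(c1):
                fp.setdefault(i, set()).update(firstpos(c2))
        elif n.kind == 'star':      # rule 2
            for i in lastpos(n):
                fp.setdefault(i, set()).update(firstpos(n))
    walk(root)
    return fp

# Rebuild the tree of Fig. 3.56 for (a|b)*abb#, with positions 1..6.
leaf = lambda sym, p: Node('leaf', symbol=sym, pos=p)
tree = Node('star', children=[Node('or', children=[leaf('a', 1), leaf('b', 2)])])
for sym, p in [('a', 3), ('b', 4), ('b', 5), ('#', 6)]:
    tree = Node('cat', children=[tree, leaf(sym, p)])
print(firstpos(tree))    # {1, 2, 3}
print(followpos(tree))   # followpos(1) = followpos(2) = {1,2,3}, followpos(5) = {6}, ...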
Example 3.35:
Let us continue with our running example; recall that firstpos and lastpos were computed in Fig.
3.59. Rule 1 for followpos requires that we look at each cat-node, and put each position
in firstpos of its right child in followpos for each position in lastpos of its left child. For the
lowest cat-node in Fig. 3.59, that rule says position 3 is in followpos(1) and followpos(2). The
next cat-node above says that 4 is in followpos(3), and the remaining two cat-nodes give us 5
in followpos(4) and 6 in followpos(5).

Figure 3.59: firstpos and lastpos for nodes in the syntax tree for (a|b)*abb#
We must also apply rule 2 to the star-node. That rule tells us positions 1 and 2 are in both
followpos(1) and followpos(2), since both firstpos and lastpos for this node are {1,2}. The
complete followpos sets are summarized in Fig. 3.60.

We can represent the function followpos by creating a directed graph with a node for each
position and an arc from position i to position j if and only if j is in followpos(i). Figure 3.61
shows this graph for the function of Fig. 3.60.
It should come as no surprise that the graph for followpos is almost an NFA without ε-transitions
for the underlying regular expression, and would become one if we:
1. Make all positions in firstpos of the root be initial states,
2. Label each arc from i to j by the symbol at position i, and

3. Make the position associated with endmarker # be the only accepting state.
5. Converting a Regular Expression Directly to a DFA
Algorithm 3.36 : Construction of a DFA from a regular expression r.
INPUT : A regular expression r.
OUTPUT : A DFA D that recognizes L(r).
METHOD :

1. Construct a syntax tree T from the augmented regular expression (r)#.

2. Compute nullable, firstpos, lastpos, and followpos for T, using the methods of Sections 3.9.3
and 3.9.4.
3. Construct Dstates, the set of states of DFA D, and Dtran, the transition function for D, by the
procedure of Fig. 3.62. The states of D are sets of positions in T. Initially, each state is
"unmarked," and a state becomes "marked" just before we consider its out-transitions. The start
state of D is firstpos(n0), where node n0 is the root of T. The accepting states are those
containing the position for the endmarker symbol #. •

Example 3.37: We can now put together the steps of our running example to construct a DFA
for the regular expression r = (a|b)*abb. The syntax tree for (r)# appeared in Fig. 3.56. We
observed that for this tree, nullable is true only for the star-node, and we exhibited firstpos and
lastpos in Fig. 3.59. The values of followpos appear in Fig. 3.60.
The value of firstpos for the root of the tree is {1,2,3}, so this set is the start state of D. Call this
set of states A. We must compute Dtran[A, a] and Dtran[A, b]. Among the positions of A, 1 and
3 correspond to a, while 2 corresponds to b. Thus, Dtran[A, a] = followpos(1) ∪ followpos(3) =
{1,2,3,4},
initialize Dstates to contain only the unmarked state firstpos(n0), where n0 is the root of syntax
tree T for (r)#;
while ( there is an unmarked state S in Dstates ) {
    mark S;
    for ( each input symbol a ) {
        let U be the union of followpos(p) for all p
            in S that correspond to a;
        if ( U is not in Dstates )
            add U as an unmarked state to Dstates;
        Dtran[S, a] = U;
    }
}
Figure 3.62: Construction of a DFA directly from a regular expression
and Dtran[A, b] = followpos(2) = {1,2,3}. The latter is state A, and so does not have to be
added to Dstates, but the former, B = {1,2,3,4}, is new, so we add it to Dstates and proceed to
compute its transitions. The complete DFA is shown in Fig. 3.63.
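As a check on Example 3.37, the loop of Fig. 3.62 is short enough to run in code. The sketch below is an illustration only (not the book's program): it hard-codes the followpos sets of Fig. 3.60 and the symbol at each position of (a|b)*abb#, then performs the marking loop. It produces the four states 123, 1234, 1235, and 1236 of Fig. 3.63, with 1236 accepting because it contains position 6, the position of the endmarker #.

# followpos (Fig. 3.60) and the symbol at each position of (a|b)*abb#.
followpos = {1: {1, 2, 3}, 2: {1, 2, 3}, 3: {4}, 4: {5}, 5: {6}, 6: set()}
symbol_at = {1: 'a', 2: 'b', 3: 'a', 4: 'b', 5: 'b', 6: '#'}
start = frozenset({1, 2, 3})          # firstpos of the root

Dstates, Dtran, unmarked = {start}, {}, [start]
while unmarked:                       # the while-loop of Fig. 3.62
    S = unmarked.pop()                # "mark" S
    for a in ('a', 'b'):
        U = frozenset().union(*[followpos[p] for p in S if symbol_at[p] == a])
        Dtran[(S, a)] = U
        if U and U not in Dstates:
            Dstates.add(U)
            unmarked.append(U)

for S in sorted(Dstates, key=sorted):
    name = ''.join(str(p) for p in sorted(S))
    moves = {a: ''.join(str(p) for p in sorted(Dtran[(S, a)])) for a in ('a', 'b')}
    print(name, moves, 'accepting' if 6 in S else '')
# Prints states 123, 1234, 1235, 1236; only 1236 (containing position 6) is accepting.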

6. Minimizing the Number of States of a DFA


There can be many DFA's that recognize the same language. For instance, note that the DFA's of
Figs. 3.36 and 3.63 both recognize language L((a|b)*abb). Not only do these automata
have states with different names, but they don't even have the same number of states. If we
implement a lexical analyzer as a DFA, we would generally prefer a DFA with as few states as
possible, since each state requires entries in the table that describes the lexical analyzer.
The matter of the names of states is minor. We shall say that two automata are the same up to
state names if one can be transformed into the other by doing nothing more than changing the
names of states. Figures 3.36 and 3.63 are not the same up to state names. However, there is a
close relationship between the states of each. States A and C of Fig. 3.36 are actually equivalent,
in the sense that neither is an accepting state, and on any input they transfer to the same state —
to B on input a and to C on input b. Moreover, both states A and C behave like state 123 of Fig.
3.63. Likewise, state B of Fig. 3.36 behaves like state 1234 of Fig. 3.63, state D behaves like
state 1235, and state E behaves like state 1236.

It turns out that there is always a unique (up to state names) minimum state DFA for any regular
language. Moreover, this minimum-state DFA can be constructed from any DFA for the same
language by grouping sets of equivalent states. In the case of L((a|b)*abb), Fig. 3.63 is
the minimum-state DFA, and it can be constructed by partitioning the states of Fig. 3.36 as {A,
C}{B}{D}{E}.
In order to understand the algorithm for creating the partition of states that converts any DFA
into its minimum-state equivalent DFA, we need to see how input strings distinguish states from
one another. We say that string x distinguishes state s from state t if exactly one of the states
reached from s and t by following the path with label x is an accepting state. State s is
distinguishable from state t if there is some string that distinguishes them.
Example 3.38: The empty string distinguishes any accepting state from any nonaccepting state.
In Fig. 3.36, the string bb distinguishes state A from state B, since bb takes A to a nonaccepting
state C, but takes B to the accepting state E. •
The state-minimization algorithm works by partitioning the states of a DFA into groups of states
that cannot be distinguished. Each group of states is then merged into a single state of the
minimum-state DFA. The algorithm works by maintaining a partition, whose groups are sets of
states that have not yet been distinguished, while any two states from different groups are known
to be distinguishable. When the partition cannot be refined further by breaking any group into
smaller groups, we have the minimum-state DFA.

Initially, the partition consists of two groups: the accepting states and the nonaccepting states.
The fundamental step is to take some group of the current partition, say A = {s1, s2, ..., sk}, and
some input symbol a, and see whether a can be used to distinguish between any states in
group A. We examine the transitions from each of s1, s2, ..., sk on input a, and if the states reached
fall into two or more groups of the current partition, we split A into a collection of groups, so
that si and sj are in the same group if and only if they go to the same group on input a. We
repeat this process of splitting groups, until for no group, and for no input symbol, can the group
be split further. The idea is formalized in the next algorithm.
Algorithm 3.39: Minimizing the number of states of a DFA.
INPUT: A DFA D with set of states S, input alphabet Σ, start state s0, and set of accepting
states F.
OUTPUT: A DFA D' accepting the same language as D and having as few states as possible.
Why the State-Minimization Algorithm Works
We need to prove two things: that states remaining in the same group in
Πfinal are indistinguishable by any string, and that states winding up in different groups are
distinguishable. The first is an induction on i: if after the ith iteration of step (2) of
Algorithm 3.39, s and t are in the same group, then there is no string of length i or less that
distinguishes them. We shall leave the details of the induction to you.
The second is an induction on i: if states s and t are placed in different groups at
the ith iteration of step (2), then there is a string that distinguishes them. The basis,
when s and t are placed in different groups of the initial partition, is easy: one must be accepting
and the other not, so ε distinguishes them. For the induction, there must be an input a and
states p and q such that s and t go to states p and q, respectively, on
input a. Moreover, p and q must already have been placed in different groups. Then by the
inductive hypothesis, there is some string x that distinguishes
p from q. Therefore, ax distinguishes s from t.

Method:
1. Start with an initial partition Π with two groups, F and S − F, the accepting and nonaccepting
states of D.
2. Apply the procedure of Fig. 3.64 to construct a new partition Πnew.
initially, let Πnew = Π;
for ( each group G of Π ) {
    partition G into subgroups such that two states s and t are in the same subgroup if and only if for
    all input symbols a, states s and t have transitions on a to states in the same group of Π;
    /* at worst, a state will be in a subgroup by itself */
    replace G in Πnew by the set of all subgroups formed;
}
Figure 3.64: Construction of Πnew
3. If Πnew = Π, let Πfinal = Π and continue with step (4). Otherwise, repeat step (2) with
Πnew in place of Π.
4. Choose one state in each group of Πfinal as the representative for that group. The
representatives will be the states of the minimum-state DFA D'. The other components of D' are
constructed as follows:
Eliminating the Dead State
The minimization algorithm sometimes produces a DFA with one dead state — one that is not
accepting and transfers to itself on each input symbol. This state is technically needed, because a
DFA must have a transition from every state on every symbol. However, as discussed in
Section 3.8.3, we often want to know when there is no longer any possibility of acceptance, so
we can establish that the proper lexeme has already been seen. Thus, we may wish to eliminate
the dead state and use an automaton that is missing some transitions. This automaton has one
fewer state than the minimum-state DFA, but is strictly speaking not a DFA, because of the
missing transitions to the dead state.

(a) The start state of D' is the representative of the group containing the start state of D.

(b) The accepting states of D' are the representatives of those groups that contain an accepting state
of D. Note that each group contains either only accepting states, or only nonaccepting states,
because we started by separating those two classes of states, and the procedure of Fig. 3.64
always forms new groups that are subgroups of previously constructed groups.
(c) Let s be the representative of some group G of Πfinal, and let the transition of D from s on input
a be to state t. Let r be the representative of t's group H. Then in D', there is a transition from s to
r on input a. Note that in D, every state in group G must go to some state of group H on input a,
or else group G would have been split according to Fig. 3.64.

Example 3.40: Let us reconsider the DFA of Fig. 3.36. The initial partition consists of the two
groups {A, B, C, D}{E}, which are respectively the nonaccepting states and the accepting states.
To construct Πnew, the procedure of Fig. 3.64 considers both groups and inputs a and b. The
group {E} cannot be split, because it has only one state, so {E} will remain intact in Πnew.

The other group {A, B, C, D} can be split, so we must consider the effect of each input symbol.
On input a, each of these states goes to state B, so there is no way to distinguish these states
using strings that begin with a. On input b, states A, B, and C go to members of group {A, B, C,
D}, while state D goes to E, a member of another group. Thus, in Πnew, group {A, B, C, D} is
split into {A, B, C}{D}, and Πnew for this round is {A, B, C}{D}{E}.

In the next round, we can split {A, B, C} into {A, C}{B}, since A and C each go to a member of
{A, B, C} on input b, while B goes to a member of another group, {D}. Thus, after the second
round, Πnew = {A, C}{B}{D}{E}. For the third round, we cannot split the one remaining group
with more than one state, since A and C each go to the same state (and therefore to the same
group) on each input. We conclude that Πfinal = {A, C}{B}{D}{E}.
Now, we shall construct the minimum-state DFA. It has four states, corresponding to the four
groups of Πfinal, and let us pick A, B, D, and E as the representatives of these groups. The initial
state is A, and the only accepting state is E. Figure 3.65 shows the transition function for the
DFA. For instance, the transition from state E on input b is to A, since in the original DFA, E
goes to C on input b, and A is the representative of C's group. For the same reason, the transition
on b from state A is to A itself, while all other transitions are as in Fig. 3.36. •
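The partition-refinement idea of Algorithm 3.39 can also be sketched directly in code. The sketch below is an illustration, not the book's algorithm text: it encodes the DFA of Fig. 3.36 as a transition table (the a-transition of E to B is an assumption here, since the surrounding text states the other transitions but not that one) and repeats the splitting step of Fig. 3.64 until the partition stops changing, arriving at the groups {A,C}{B}{D}{E} of Example 3.40.

# DFA of Fig. 3.36 for (a|b)*abb; E's a-transition to B is assumed.
delta = {
    'A': {'a': 'B', 'b': 'C'},
    'B': {'a': 'B', 'b': 'D'},
    'C': {'a': 'B', 'b': 'C'},
    'D': {'a': 'B', 'b': 'E'},
    'E': {'a': 'B', 'b': 'C'},
}
accepting = {'E'}
symbols = ('a', 'b')

# Step 1: initial partition = accepting states and nonaccepting states.
partition = [accepting, set(delta) - accepting]

while True:
    # Step 2: split each group according to where its states go (Fig. 3.64).
    group_of = {s: i for i, g in enumerate(partition) for s in g}
    new_partition = []
    for g in partition:
        buckets = {}
        for s in g:
            key = tuple(group_of[delta[s][a]] for a in symbols)
            buckets.setdefault(key, set()).add(s)
        new_partition.extend(buckets.values())
    # Step 3: stop when no group could be split any further.
    if len(new_partition) == len(partition):
        break
    partition = new_partition

print(sorted(sorted(g) for g in partition))
# [['A', 'C'], ['B'], ['D'], ['E']]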

7. State Minimization in Lexical Analyzers


To apply the state minimization procedure to the DFA's generated in Section 3.8.3, we must
begin Algorithm 3.39 with the partition that groups together all states that recognize a particular
token, and also places in one group all those states that do not indicate any token. An example
should make the extension clear.
Example 3.41: For the DFA of Fig. 3.54, the initial partition is
{0137, 7}{247}{8, 58}{68}{∅}
That is, states 0137 and 7 belong together because neither announces any token. States 8 and 58
belong together because they both announce token a*b+. Note that we have added a dead
state ∅, which we suppose has transitions to itself on inputs a and b. The dead state is also the
target of missing transitions on a from states 8, 58, and 68.
We must split 0137 from 7, because they go to different groups on input a. We also split 8 from
58, because they go to different groups on b. Thus, all states are in groups by themselves, and
Fig. 3.54 is the minimum-state DFA recognizing its three tokens. Recall that a DFA serving as a
lexical analyzer will normally drop the dead state, while we treat missing transitions as a signal
to end token recognition. •
8. Trading Time for Space in DFA Simulation
The simplest and fastest way to represent the transition function of a DFA is a two-dimensional
table indexed by states and characters. Given a state and next input character, we access the array

to find the next state and any special action we must take, e.g., returning a token to the parser.
Since a typical lexical analyzer has several hundred states in its DFA and involves the ASCII
alphabet of 128 input characters, the array consumes less than a megabyte.
However, compilers are also appearing in very small devices, where even a megabyte of storage
may be too much. For such situations, there are many methods that can be used to compact the
transition table. For instance, we can represent each state by a list of transitions — that is,
character-state pairs — ended by a default state that is to be chosen for any input character not on
the list. If we choose as the default the most frequently occurring next state, we can often reduce
the amount of storage needed by a large factor.
There is a more subtle data structure that allows us to combine the speed of array access with the
compression of lists with defaults. We may think of this structure as four arrays, as suggested in
Fig. 3.66. (In practice, there would be another array indexed by states to give the action
associated with that state, if any.) The base array is used to determine the base location of the
entries for state s, which are located in the next and check arrays. The default array is used to
determine an alternative base location if the check array tells us the one given by base[s] is invalid.

To compute nextState(s, a), the transition for state s on input a, we examine the next and
check entries in location l = base[s] + a, where character a is treated as an integer, presumably in
the range 0 to 127. If check[l] = s, then this entry is valid, and the next state for state s on input a
is next[l]. If check[l] ≠ s, then we determine another state t = default[s] and repeat the process as
if t were the current state. More formally, the
function nextState is defined as follows:
int nextState(s, a) {
    if ( check[base[s] + a] = s ) return next[base[s] + a];
    else return nextState(default[s], a);
}

The intended use of the structure of Fig. 3.66 is to make the next-check arrays short by taking
advantage of the similarities among states. For instance, state t, the default for state s, might be
the state that says "we are working on an identifier," like state 10 in Fig. 3.14. Perhaps state s is
entered after seeing the letters th, which are a prefix of keyword then as well as potentially
being the prefix of some lexeme for an identifier. On input character e, we must go from state s
to a special state that remembers we have seen the, but otherwise, state s behaves as t does.
Thus, we set check[base[s] + e] to s (to confirm that this entry is valid for s) and we set
next[base[s] + e] to the state that remembers the. Also, default[s] is set to t.
While we may not be able to choose base values so that no next-check entries remain unused,
experience has shown that the simple strategy of assigning base values to states in turn, and
assigning each base[s] value the lowest integer so that the special entries for state s are not
previously occupied utilizes little more space than the minimum possible.
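A toy instance may make the four arrays concrete. Everything in the sketch below is made up for illustration (two states and a three-symbol alphabet, not the book's figure): state 0 plays the role of the shared default whose entries fill next/check from base[0] = 0, while state 1 behaves like state 0 except on symbol 2, whose single special entry lives at base[1] + 2; a lookup that misses in check falls back through default.

# Hypothetical tables for two states over symbols 0..2.
# State 0: goes to state 0 on every symbol (the shared "default" behaviour).
# State 1: behaves like state 0, except that on symbol 2 it goes to state 2.
base    = {0: 0, 1: 3}
default = {0: None, 1: 0}             # state 0 needs no fallback: all its entries are valid
next_   = [0, 0, 0, None, None, 2]    # next_[base[1] + 2] == 2 is state 1's special entry
check   = [0, 0, 0, None, None, 1]    # check[l] names the state an entry belongs to

def next_state(s, a):
    l = base[s] + a
    if l < len(check) and check[l] == s:
        return next_[l]               # the entry at l really belongs to s
    return next_state(default[s], a)  # otherwise defer to the default state

print(next_state(1, 2))   # 2  -- state 1's own special entry
print(next_state(1, 0))   # 0  -- falls back to state 0's entry
print(next_state(0, 1))   # 0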

Exercises for Section 3.9


Exercise 3.9.1: Extend the table of Fig. 3.58 to include the operators (a) ? and (b) +.
Exercise 3.9.2: Use Algorithm 3.36 to convert the regular expressions of Exercise 3.7.3
directly to deterministic finite automata.
Exercise 3.9.3: We can prove that two regular expressions are equivalent by showing that
their minimum-state DFA's are the same up to renaming of states. Show in this way that the
following regular expressions: (a|b)*, (a*|b*)*, and ((ε|a)b*)* are all equivalent. Note: You may
have constructed the DFA's for these expressions in response to Exercise 3.7.3.
! Exercise 3.9.4: Construct the minimum-state DFA's for the following regular expressions:
(a) (a|b)*a(a|b)
(b) (a|b)*a(a|b)(a|b)
(c) (a|b)*a(a|b)(a|b)(a|b)
Do you see a pattern?
!! Exercise 3.9.5: To make formal the informal claim of Example 3.25, show that any
deterministic finite automaton for the regular expression
(a|b)*a(a|b)(a|b)...(a|b)
where (a|b) appears n − 1 times at the end, must have at least 2^n states. Hint: Observe the
pattern in Exercise 3.9.4. What condition regarding the history of inputs does each state
represent?

Practice questions
What are the six phases of a compiler?

The 6 phases of a compiler are:
 Lexical Analysis.
 Syntactic Analysis or Parsing.
 Semantic Analysis.
 Intermediate Code Generation.
 Code Optimization.
 Code Generation.
What are the three types of compiler design?
Types of Compiler
 Single Pass Compilers.
 Two Pass Compilers.
 Multipass Compilers.

Reference
https://www.brainkart.com/article/Optimization-of-DFA-Based-Pattern-Matchers_8143/

Week 7: Construction of syntax trees, Bottom-up evaluation of S-attributed
definitions, L-attributed definitions, Top-down translation, Bottom-up
evaluation of inherited attributes, Recursive evaluators, Space for attribute
values at compile time, Assigning space at compile time.
Note
In this section, we will discuss inherited attributes in compiler design. Along with that, we also
learn some of the basic terms which we will use while explaining the inherited attribute in
compiler design.
Construction of syntax trees
Compiler Design – Variants of Syntax Tree
A syntax tree is a tree in which each leaf node represents an operand, while each interior node
represents an operator. The syntax tree is a condensed (abbreviated) form of the parse tree. A
syntax tree is usually used when representing a program in a tree structure.
Rules of Constructing a Syntax Tree
A syntax tree's nodes can all be implemented as records with several fields. In the node for an
operator, one field identifies the operator, while the remaining fields contain pointers to the
operand nodes. The operator is also known as the node's label. The nodes of the syntax tree for
expressions with binary operators are created using the following functions. Each function
returns a reference to the node that was most recently created.
1. mknode(op, left, right): It creates an operator node with the label op and two fields,
containing the left and right pointers.
2. mkleaf(id, entry): It creates an identifier node with the label id and the entry field, which is a
reference to the identifier's symbol table entry.
3. mkleaf(num, val): It creates a number node with the label num and a field containing the
number's value, val. For example, to build a syntax tree for the expression a − 4 + c, a sequence
of calls is made whose results p1, p2, ..., p5 are pointers to the nodes created; the leaves for the
identifiers 'a' and 'c' hold pointers to their symbol-table entries.
Example 1: Syntax Tree for the string a – b ∗ c + d is:

Syntax tree for example 1

Example 2: Syntax Tree for the string a * (b + c) – d /2 is:

Syntax tree for example 2
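The mknode/mkleaf interface can be mimicked in a few lines of code. The sketch below is hypothetical helper code (not from the text), with plain Python dictionaries standing in for node records and strings standing in for symbol-table entries; it builds the tree of Example 1 for a − b ∗ c + d.

# Minimal stand-ins for the node constructors described above.
def mknode(op, left, right):
    return {'label': op, 'left': left, 'right': right}

def mkleaf_id(entry):
    return {'label': 'id', 'entry': entry}    # entry: a symbol-table reference

def mkleaf_num(val):
    return {'label': 'num', 'value': val}

# Syntax tree for a - b * c + d (Example 1), grouped as (a - (b * c)) + d.
p1 = mkleaf_id('a')
p2 = mkleaf_id('b')
p3 = mkleaf_id('c')
p4 = mknode('*', p2, p3)
p5 = mknode('-', p1, p4)
p6 = mkleaf_id('d')
root = mknode('+', p5, p6)
print(root['label'], root['left']['label'], root['right']['label'])   # + - id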


Variants of syntax tree:
A syntax tree basically has two variants which are described below:
Directed Acyclic Graphs for Expressions (DAG)
The Value-Number Method for Constructing DAGs
Directed Acyclic Graphs for Expressions (DAG)
A DAG, like an expression's syntax tree, includes leaves that correspond to atomic operands and
interior nodes that correspond to operators. A node N in a DAG has more than one parent if N
denotes a common subexpression; in a syntax tree, the tree for the common subexpression would be
duplicated as many times as the subexpression appears in the original expression. As a result, a
DAG not only encodes expressions more concisely but also provides essential information to the
compiler about how to generate efficient code to evaluate the expressions.
The Directed Acyclic Graph (DAG) is a tool that shows the structure of basic blocks,
allows you to examine the flow of values between them, and also allows you to optimize them.
A DAG permits simple transformations of basic blocks.
Properties of DAG are:
Leaf nodes represent identifiers, names, or constants.
Interior nodes represent operators.
Interior nodes also represent the results of expressions or the identifiers/name where the values
are to be stored or assigned.

Examples:
T0 = a+b --- Expression 1
T1 = T0 +c --- Expression 2
Expression 1: T0 = a+b

Syntax tree for expression 1


Expression 2: T1 = T0 +c

Syntax tree for expression 2


The Value-Number Method for Constructing DAGs:
An array of records is used to hold the nodes of a syntax tree or DAG. Each row of the array
corresponds to a single record, and hence a single node. The first field in each record is an
operation code, which indicates the node’s label. In the given figure below;

Figure: Nodes of a DAG for i = i + 10 allocated in an array

In the figure above (Nodes of a DAG for i = i + 10 allocated in an array), the interior nodes
contain two more fields denoting the left and right children, while leaves have one additional
field that stores the lexical value (either a symbol-table pointer or a constant in this instance).
The integer index of the record for a node within the array is used to refer to that node. This
integer is referred to as the value number of the node, or of the expression represented by the
node. For example, the value number of the node labeled + is 3, while the value numbers of its
left and right children are 1 and 2, respectively. In practice, we may use pointers to records or
references to objects instead of integer indexes, but the reference to a node would still be referred
to as its "value number." Value numbers can assist us in constructing expressions if they are
stored in the right data format.
Algorithm: The value-number method for constructing the nodes of a Directed Acyclic Graph.
INPUT: Label op, node l, and node r.
OUTPUT: The value number of a node in the array with signature (op, l, r).
METHOD: Search the array for a node M with label op, left child l, and right child r. If there is
such a node, return the value number of M. If not, create in the array a new node N with label op,
left child l, and right child r, and return its value number.
While the algorithm produces the intended result, examining the full array every time one node is
requested is time-consuming, especially if the array contains expressions from an entire program.
A hash table, in which the nodes are divided into "buckets," each of which generally contains
only a few nodes, is a more efficient method. The hash table is one of numerous data structures
that can effectively support dictionaries. A dictionary is a data type that allows us to add and
remove elements from a set, as well as to detect if a particular element is present in the set. A
good dictionary data structure, such as a hash table, executes each of these operations in a
constant or near-constant amount of time, regardless of the size of the set.
To build a hash table for the nodes of a DAG, we require a hash function h that computes the
bucket index for a signature (op, l, r) in such a manner that the signatures are distributed across
buckets and no one bucket gets more than a fair portion of the nodes. The bucket index h(op, l, r)
is deterministically computed from op, l, and r, allowing us to repeat the calculation and
always arrive at the same bucket index for node (op, l, r).
The buckets can be implemented as linked lists, as in the figure below. The bucket headers are
stored in an array indexed by the hash value, each of which corresponds to the first cell of a list.
Each cell in a bucket's linked list contains the value number of one of the nodes that hash to
that bucket. That is, node (op, l, r) may be located on the list whose header is at index
h(op, l, r) of the array.

Figure: Data structure for searching buckets


Given the input nodes op, l, and r, we calculate the bucket index h(op, l, r) and search the list of
cells in this bucket for the specified input node. There are usually enough buckets that no list has
more than a few cells. However, we may need to examine all of the cells in a bucket, and for
each value number v discovered in a cell, we must check whether the input node's signature
(op, l, r) matches the node with value number v in the list of cells (as in the figure above, Data
structure for searching buckets). If a match is found, we return v. If we find no match, we build a
new cell, add it to the list of cells for bucket index h(op, l, r), and return the value number in that
new cell.
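The value-number method is only a few lines once a dictionary plays the role of the hash table. The sketch below is an illustration (the function and variable names are my own, and a string stands in for the symbol-table pointer): signatures (op, l, r) are used as keys, nodes live in an array whose index is the value number, and asking for the same signature twice returns the same value number, so the two uses of i in i = i + 10 share a single leaf. (The figure's numbering starts at 1; here indexes start at 0.)

nodes = []      # the array of records; the index of a record is its value number
table = {}      # hash table mapping a signature (op, l, r) to a value number

def get_node(op, l=None, r=None):
    # Return the value number for (op, l, r), creating a node only if needed.
    signature = (op, l, r)
    if signature in table:              # found: reuse the existing node
        return table[signature]
    nodes.append(signature)             # not found: allocate a new record
    table[signature] = len(nodes) - 1
    return table[signature]

# DAG for i = i + 10: the leaf for i is created once and then reused.
i_leaf = get_node('id', 'to-entry-for-i')
ten    = get_node('num', 10)
plus   = get_node('+', i_leaf, ten)
assign = get_node('=', i_leaf, plus)
print(i_leaf, ten, plus, assign)         # 0 1 2 3
print(get_node('id', 'to-entry-for-i'))  # 0 again: no duplicate node is made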

Bottom-up evaluation of S-attributed definitions


Bottom-Up Evaluation of S-Attributed Definitions
An attribute grammar is a formal way to define attributes for the productions of a formal
grammar, associating these attributes with values. Every S-(synthesized-)attributed definition is
also L-attributed, and S-attributed definitions can be evaluated during bottom-up parsing.
Techniques for implementing inherited attributes during bottom-up parsing extend to some, but
not all, LR grammars.
Consider, for example, a production L → E in which the nonterminal L inherits the count of the
number of 1's generated by the start symbol S. Since L → E is the first production that a
bottom-up parser would reduce by, the translator at that time cannot know the number of 1's in
the input. So, in a bottom-up evaluation of a syntax-directed definition, inherited attributes
cannot always be evaluated, even when the definition is L-attributed; they can be evaluated only
when the required information is already available as synthesized attributes of symbols on the
stack.

Syntax Directed Definition (SDD)


SDD is a set of rules that are used to associate attributes with the grammar productions. The
attributes can be of two types: inherited and synthesized.
Inherited Attributes
Inherited attributes are the attributes that are passed from the parent node to the child node in the
parse tree. These attributes are used to store information that is required by the child node to
evaluate its own attributes. Inherited attributes are evaluated during the parsing phase of the
compiler.
Synthesized Attributes
Synthesized attributes are the attributes that are computed by the child node and passed up to the
parent node in the parse tree. Synthesized attributes are used to store information that is required
by the parent node to evaluate its own attributes. Synthesized attributes are evaluated during the
code generation phase of the compiler.
Bottom-Up Evaluation
In a bottom-up evaluation of an SDD, the attributes are evaluated in the order in which the parse
tree is constructed. This means that the attributes of the leaf nodes are evaluated first, followed
by the attributes of the internal nodes in a bottom-up fashion.
Evaluation of Inherited Attributes
In a bottom-up evaluation of an SDD, inherited attributes can be evaluated only if the definition
has synthesized attributes. This is because the inherited attributes are dependent on the
synthesized attributes of the child node. If the child node does not have any synthesized
attributes, then there is no information to pass up to the parent node, and hence the inherited
attributes cannot be evaluated.
Evaluation of Synthesized Attributes
Synthesized attributes can always be evaluated in a bottom-up evaluation of an SDD. This is
because the synthesized attributes are computed by the child node and passed up to the parent
node, and hence there is always information available to evaluate the synthesized attributes.
Conclusion
In summary, in a bottom-up evaluation of an SDD, inherited attributes can be evaluated only if
the definition has synthesized attributes, because inherited attributes depend on the synthesized
attributes of the child node; if the child node does not have any synthesized attributes, then there
is no information to pass up to the parent node, and the inherited attributes cannot be evaluated.
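As a concrete, purely illustrative instance of an S-attributed definition, the sketch below uses a tiny grammar of sums in which every attribute is synthesized: each node's val is computed from its children's val after the children have been evaluated, which is exactly the order in which a bottom-up parser performs its reductions. The grammar and the Python encoding are my own example, not taken from the text.

# S-attributed SDD for a tiny grammar of sums:
#   E -> E '+' T   { E.val = E1.val + T.val }
#   E -> T         { E.val = T.val }
#   T -> digit     { T.val = digit.lexval }
# A parse-tree node is a tuple; leaves carry their lexical value.

def eval_val(node):
    kind = node[0]
    if kind == 'digit':                    # leaf: T -> digit
        return node[1]
    if kind == 'plus':                     # interior node: E -> E + T
        return eval_val(node[1]) + eval_val(node[2])
    raise ValueError('unknown node kind')

# Parse tree for 1 + 2 + 3, grouped as ((1 + 2) + 3) by the left-recursive rule.
tree = ('plus', ('plus', ('digit', 1), ('digit', 2)), ('digit', 3))
print(eval_val(tree))   # 6: every child is evaluated before its parent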

L-attributed definitions
L-attributed grammars are a special type of attribute grammars. They allow the attributes to be
evaluated in one depth-first left-to-right traversal of the abstract syntax tree. As a result, attribute
evaluation in L-attributed grammars can be incorporated conveniently in top-down parsing.

THE DEPTH-FIRST EVALUATION ORDER

Let us consider a syntax-directed definition S and a parse tree T for S showing the attributes of
the grammar symbols of T. Figure 1 shows an example of such a tree. Algorithm 1 gives an
order, called the depth-first evaluation order, for evaluating the attributes shown by T.

Algorithm 1
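Since the algorithm figure is not reproduced above, the following is a sketch of what a depth-first, left-to-right evaluation order typically looks like (my own illustration; the node fields and the particular attribute rules are made up): a child's inherited attribute is computed just before the child is visited, and a node's synthesized attribute just after all of its children have been visited, which is exactly the constraint an L-attributed definition satisfies.

from dataclasses import dataclass, field

@dataclass
class Node:
    name: str
    children: list = field(default_factory=list)
    inh: int = None        # inherited attribute
    val: int = None        # synthesized attribute

def dfvisit(n):
    # Depth-first, left-to-right evaluation for an L-attributed definition.
    left_sibling = None
    for child in n.children:
        # Inherited: in this made-up rule, a child copies its parent's inh,
        # or takes its left sibling's inh plus one.
        child.inh = n.inh if left_sibling is None else left_sibling.inh + 1
        dfvisit(child)
        left_sibling = child
    # Synthesized: in this made-up rule, the number of leaves below n.
    n.val = 1 if not n.children else sum(c.val for c in n.children)

root = Node('A', [Node('B'), Node('C', [Node('D')])], inh=0)
dfvisit(root)
print(root.val, [c.inh for c in root.children])   # 2 [0, 1]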

Top-down translation
In compiler design, top-down parsing is a parsing technique that involves starting with the
highest-level nonterminal symbol of the grammar and working downward to derive the input
string. An example of top-down parsing is recursive descent parsing.
Top-down parsing in computer science is a parsing strategy where one first looks at the highest
level of the parse tree and works down the parse tree by using the rewriting rules of a formal
grammar. LL parsers are a type of parser that uses a top-down parsing strategy.
Top-down parsing is a strategy of analyzing unknown data relationships by hypothesizing
general parse tree structures and then considering whether the known fundamental structures are

compatible with the hypothesis. It occurs in the analysis of both natural languages and computer
languages.
Top-down parsing can be viewed as an attempt to find left-most derivations of an input-stream
by searching for parse-trees using a top-down expansion of the given formal grammar rules.
Inclusive choice is used to accommodate ambiguity by expanding all alternative right-hand-sides
of grammar rules.
Simple implementations of top-down parsing do not terminate for left-recursive grammars, and
top-down parsing with backtracking may have exponential time complexity with respect to the
length of the input for ambiguous CFGs. However, more sophisticated top-down parsers have
been created by Frost, Hafiz, and Callaghan, which do accommodate ambiguity and left
recursion in polynomial time and which generate polynomial-sized representations of the
potentially exponential number of parse trees.
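Recursive descent, mentioned above as the typical example of top-down parsing, dedicates one procedure to each nonterminal and consumes the input from left to right. The sketch below is an illustration only; the toy grammar E -> num ('+' num)* and all names in it are my own, not from the text.

# Recursive-descent (top-down) parser for the toy grammar E -> num ('+' num)*.
# Parsing starts from the start symbol and works down toward the tokens.
class Parser:
    def __init__(self, tokens):
        self.tokens = tokens
        self.pos = 0

    def peek(self):
        return self.tokens[self.pos] if self.pos < len(self.tokens) else None

    def expect(self, kind):
        tok = self.peek()
        if tok is None or tok[0] != kind:
            raise SyntaxError('expected ' + kind + ' at position ' + str(self.pos))
        self.pos += 1
        return tok

    def parse_E(self):                 # the procedure for nonterminal E
        value = self.expect('num')[1]
        while self.peek() and self.peek()[0] == '+':
            self.expect('+')
            value += self.expect('num')[1]
        return value

print(Parser([('num', 1), ('+', None), ('num', 2)]).parse_E())   # 3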

Bottom-up evaluation of inherited attributes


Inherited Attribute in Compiler Design
Introduction
In compiler design, Syntax Directed Translation calculates the value of the attribute of the node
present in the parsing tree. The Syntax Directed Definition (SDD) uses two types of attributes,
which are used to determine the value of the attribute at the node.
These two attributes are Synthesized Attributes and Inherited Attributes.
Let us discuss the inherited attribute in compiler design used in Syntax Directed Definitions for
defining the node's value.
Some Common Terms Used in the Article
Siblings
Nodes with the same parent in a tree are known as siblings. For example, in Fig 1, B is the
sibling of C as the parent of both nodes is A.
Attribute Grammar
It is a type of CGF that provides more information about the non-terminals in grammar.
SDT. SDT stands for Syntax Directed Translation which is used for semantic analysis and to
create a parse tree or syntax tree.
Parse Tree
The parse tree is a syntactic structure in which the nodes are stored hierarchically according to
the grammar. It is condensed to make a syntax tree.
Synthesized Attribute
It is an attribute in which the value of a node is taken only from the child node, and this can be S-
attributed or L-attributed SDT.
What is an inherited attribute in Compiler Design?
If the attribute value at a node of the parse tree is obtained from the attribute value of the node's
parent or of a sibling, then it is known as an inherited attribute. The defining production must
therefore have a non-terminal symbol in its body.
Since an inherited attribute is defined in terms of the node itself, its parent, or its siblings, it is
used only with L-attributed SDTs, which restrict inheritance to the parent and left siblings. We
evaluate inherited attributes by a top-down, left-to-right (sideways) traversal of the parse tree.
Examples
Check whether an attribute is inherited or not.
A → BC { B.val = A.val, C.val = B.val}
Here, in the case of B.val = A.val, A is the parent of B. So, B is inheriting the attribute from its
parent.
In C.val = B.val, C is assigned the value of its sibling, i.e., B. So, this is also an inherited
attribute.
The curved lines of the above parse tree show the values assigned and are inherited attributes as
the value of B is taken from its parent, and C’s value is taken from its sibling.
A → BCD {B.val = A.val, D.val = C.val, A.val = C.val}
Here, in the case of B.val = A.val, since A is the parent of B, it is an inherited attribute.
In D.val = C.val, since D and C are siblings, according to the rule of inherited attributes this is
also an inherited attribute.
Now, for A.val = C.val, since C is the child of A, it is not an inherited attribute (it is synthesized).

The above parse tree also shows the values assigned to B and C from parent and sibling,
respectively, and the value of A is taken from node C, which is the child of A.
Difference between inherited and synthesized attributes

Inherited Attributes
 If the value of an attribute at a parse-tree node is determined by the attribute values at the node's parent, siblings, or other nodes, the attribute is said to be inherited.
 An inherited attribute at node n can be specified only in terms of the attribute values at n's parent and siblings (and at n itself).
 A single top-down and sideways (left-to-right) traversal of the parse tree is used to evaluate this attribute.
 L-attributed SDTs use this attribute.
 The defining production must have a non-terminal in its body.

Synthesized Attributes
 If the value of an attribute at a parse-tree node is determined by the attribute values at its child nodes, the attribute is said to be synthesized.
 Only the attribute values at n's children are used to define a synthesized attribute at node n.
 A single bottom-up traversal of the parse tree is used to evaluate this attribute.
 Both S-attributed and L-attributed SDTs use this attribute.
 The defining production must have a non-terminal as its head.
Which attribute is used in L-attributed SDT?

An L-attributed SDT (Syntax Directed Translation) can use both synthesized and inherited
attributes. These attributes are evaluated by traversing the parse tree depth-first, from left to right.
How is a Synthesized attribute different from an Inherited attribute in Compiler Design?
A synthesized attribute takes a node's value from its children, while an inherited attribute gives a
node its value from its parent or siblings. S-attributed SDTs use only synthesized attributes, while
L-attributed SDTs may use both kinds.
What is a compiler?
A compiler is a program (or set of programs) that converts a high-level language into a language
understood by the machine; C, C++, and Java compilers are some examples.
What is top-down parsing?
Top-down parsing creates the parse tree by starting from the root and working toward the leaves
of the parse tree as the input is parsed. It is useful for LL(1) grammars.

Parsers Comparison:

LR(0) ⊂ SLR ⊂ LALR ⊂ CLR


LL(1) ⊂ LALR ⊂ CLR
If number of states LR(0) = n1, number of states SLR = n2, number of states LALR = n3, number
of states CLR = n4 then, n1 = n2 = n3 <= n4
Syntax Directed Translation: Syntax Directed Translation are augmented rules to the grammar
that facilitate semantic analysis.
Eg – S -> AB {print (*)}
A -> a {print (1)}
B -> b {print (2)}
Synthesized Attribute : attribute whose value is evaluated in terms of attribute values of its
children.
Inherited Attribute : attribute whose value is evaluated in terms of attribute values of siblings or
parents.
S-attributed SDT: If an SDT uses only synthesized attributes, it is called as S-attributed SDT. S-
attributed SDTs are evaluated in bottom-up parsing, as the values of the parent nodes depend upon
the values of the child nodes.

L-attributed SDT : If an SDT uses either synthesized attributes or inherited attributes with a
restriction that it can inherit values from left siblings only, it is called as L-attributed SDT.
Attributes in L-attributed SDTs are evaluated by depth-first and left-to-right parsing manner.
Activation Record : Information needed by a single execution of a procedure is managed using a
contiguous block of storage called activation record. An activation record is allocated when a
procedure is entered and it is deallocated when that procedure is exited.
Intermediate Code : They are machine independent codes. Syntax trees, postfix notation, 3-
address codes can be used to represent intermediate code.
Three address code:
1. Quadruples (4 fields : operator, operand1, operand2, result)
2. Triplets (3 fields : operator, operand1, operand2)
3. Indirect triples
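As a quick illustration (my own example, not from the note), the statement x = a + b * c becomes the three-address sequence t1 = b * c, t2 = a + t1, x = t2; the quadruple and triple encodings of that sequence differ only in how a result is named:

# x = a + b * c rewritten as three-address code: t1 = b * c ; t2 = a + t1 ; x = t2
# Quadruples: four fields (operator, operand1, operand2, result).
quads = [
    ('*', 'b', 'c', 't1'),
    ('+', 'a', 't1', 't2'),
    ('=', 't2', None, 'x'),
]
# Triples: three fields; a result is referred to by the index of the triple
# that computes it, so explicit temporary names are not needed.
triples = [
    ('*', 'b', 'c'),        # (0)
    ('+', 'a', '(0)'),      # (1)
    ('=', 'x', '(1)'),      # (2)
]
print(quads[1], triples[1])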

Code Optimization :
Types of machine independent optimizations –
1. Loop optimizations:
 Code motion: moving loop-invariant computations out of the loop to reduce the evaluation frequency of an expression.
 Loop unrolling: replicating the loop body so that fewer iterations (and fewer loop tests and jumps) are executed.
 Loop jamming: combining the bodies of two loops whenever they share the same index.
2. Constant folding: evaluating constant expressions at compile time and replacing them with their values.
3. Constant propagation: substituting the known constant value of a variable into expressions at compile time.
4. Strength reduction: replacing costly operations by cheaper ones.

Recursive evaluators
The evaluator is the core of the interpreter--it's what does all of the interesting work to evaluate
complicated expressions. The reader translates textual expressions into a convenient data
structure, and the evaluator actually interprets it, i.e., figures out the "meaning" of the expression.
Evaluation is done recursively. We write code to evaluate simple expressions, and use recursion
to break down complicated expressions into simple parts.
I'll show a simple evaluator for simple arithmetic expressions, like a four-function calculator,
which you can use like this, given the read-eval-print-loop above:
Scheme>(repl math-eval)   ; start up read-eval-print loop w/arithmetic eval
repl>1
1
repl>(plus 1 2)
3
repl>(times (plus 1 3) (minus 4 2))
8
As before, the read-eval-print-loop reads what you type at the repl> prompt as an s-expression,
and calls math-eval.
Here's the main dispatch routine of the interpreter, which figures out what kind of expression it's
given, and either evaluates it trivially or calls math-eval-combo to help:
(define (math-eval expr)
  (cond
    ;; self-evaluating object? (we only handle numbers)
    ((number? expr)
     expr)
    ;; compound expression? (we only handle two-arg combinations)
    (else
     (math-eval-combo expr))))
First math-eval checks the expression to see if it's something simple that it can evaluate
straightforwardly, without recursion.
The only simple expressions in our language are numeric literals, so math-eval just uses the
predicate number? to test whether the expression is a number. If so, it just returns that value.
(Voila! We've implemented self-evaluating literals.)
If the expression is not simple, it's supposed to be an arithmetic expression with an operator and
two operands, represented as a three element list. (This is the subset of Scheme's combinations
that this interpreter can handle.) In this case, math-eval calls math-eval-combo.
(define (math-eval-combo expr)
  (let ((operator-name (car expr))
        (arg1 (math-eval (cadr expr)))
        (arg2 (math-eval (caddr expr))))
    (cond ((eq? operator-name 'plus)
           (+ arg1 arg2))
          ((eq? operator-name 'minus)
           (- arg1 arg2))
          ((eq? operator-name 'times)
           (* arg1 arg2))
          ((eq? operator-name 'quotient)
           (/ arg1 arg2))
          (else
           (error "Invalid operation in expr:" expr)))))
math-eval-combo handles a combination (math operation) by calling math-eval recursively to
evaluate the arguments, checking which operator is used in the expression, and calling the
appropriate Scheme procedure to perform the actual operation.
Comments on the Arithmetic Evaluator
The 4-function arithmetic evaluator is very simple, but it demonstrates several important
principles of Scheme programming and programming language implementation.
Recursive style and Nested Lists. Note that an arithmetic expression is represented as an s-
expression that may be a 3-element list. If it's a three-element list, that list is made up of three
objects (pairs), but we essentially treat it as a single conceptual object--a node in a parse tree of
arithmetic expressions. The overall recursive structure of the evaluator is based on this
conceptual tree, not on the details of the lists' internal structure. We don't need recursion to
traverse the lists, because the lists are of fixed length and we can extract the relevant fields
using car, cadr, and caddr. We are essentially treating the lists as three-element structures. This
kind of recursion is extremely common in Scheme--nested lists are far more common than "pair
trees." As in the earlier examples of recursion over lists and pair trees, the main recursive
procedure can accept pointers to either interior nodes (lists representing compound
expressions), or leaves of the tree. Either counts as an expression.

Dynamic typing lets us implement this straightforwardly, so that our recursion doesn't have to
"bottom out" until we actually hit a leaf. Things would be more complicated in C or Pascal,
which don't allow a procedure to accept an argument that may be either a list or a number. (In C
or Pascal, we could represent all of the nodes in the expression tree as variant records (in C,
"unions") containing an integer or a list. We don't need to do that in Scheme, because in Scheme
every variable's type is really a kind of variant record: it can hold a (pointer to a) number, a
(pointer to a) pair, or a (pointer to) anything else. C is particularly problematic for this style of
programming, because even if we bite the bullet and always define a variant record type, the
variant records are untagged. C doesn't automatically keep track of which variant a particular
record represents, e.g., a leaf or nonleaf; you must code this yourself by adding a tag field, and
setting and checking it appropriately. In effect, you must implement dynamic typing yourself,
every time.) It is possible to do Scheme-style recursion straightforwardly in some
statically-typed languages, notably ML and Haskell. These polymorphic languages allow you to
declare disjoint union types. A disjoint union is an "any of these" type--you can say that an
argument will be of some type or some other type. In Scheme, the language only supports one
very general kind of disjoint union type: pointer to anything. However, we usually think of data
structure definitions as disjoint unions. As usual, we can characterize what an arithmetic
expression is, recursively. It is either a numeric literal (the base case) or a three-element "node"
whose first "field" is an operator symbol and whose second and third "fields" are arithmetic
expressions.
Also as usual, this recursive characterization is what dictates the recursive structure of the
solution, not the details of how nodes are implemented. (The overall structure of recursion over
trees would be the same if the interior nodes were arrays or records, rather than linear lists.) The
conceptual "disjoint union" of leaves and interior nodes is what tells us we need a two-branch
conditional in math-eval. It is important to realize that in Scheme, we usually discriminate
between cases at edges in the graph, i.e., the pointers, rather than focusing on the nodes.
Conceptually, the type of the expr argument is an edge in the expression graph, which may point
to either a leaf node or an interior node. We apply math-eval to each edge, uniformly, and it
discriminates between the cases. We do not examine the object it points to and decide whether to
make the recursive call; we always make the recursive call, and sort out the cases in the callee.
Primitive expressions and operations. In looking at any interpreter, it's important to notice which
operations are primitive, and which are compound. Primitive operations are "built into" the
interpreter, but the interpreter allows you to construct more complicated operations in terms of
those. In math-eval, the primitive operations are addition, subtraction, multiplication, and
division. We "snarf" these operations from the underlying Scheme system, in which we're
implementing our little four-function calculator. We don't implement addition, but we do
dispatch to this built-in addition operation. On the other hand, compound expressions are not
built-in.

The interpreter doesn't have a special case for each particular kind of expression, e.g., there's no
code to add 4 to 5. We allow users to combine expressions by arbitrarily nesting them, and
support an effectively infinite number of possible expressions. Later, I'll show more advanced
interpreters that support more kinds of primitive expressions (not just numeric literals) and more
kinds of primitive operations (not just four arithmetic functions). I'll also show how a more
advanced interpreter can support more different ways of combining the primitive expressions.
Flexibility. One reason for implementing your own interpreter is flexibility. You can change the
features of the language by making minor changes to the interpreter. For example, it is trivial to
modify math-eval to evaluate infix expressions rather than prefix expressions. (That is, with the
operator in the middle, e.g., (10 plus (3 times 2)).) All we have to do is change the two lines
where the operator and the first operand are extracted from a compound expression. We just
swap the car and cadr, so that we treat the second element of the list as the operator and the first
element as the first operand.

Exercise
Read about:
1. Space for attribute values at compile time
2. Assigning space at compile time

Reference
https://www.csd.uwo.ca/~mmorenom/CS447/Lectures/Translation.html/node4.html
https://edurev.in/question/1755392/In-a-bottom-up-evaluation-of-a-syntax-directed-definition--
inherited-attributes-cana-Always-be-evalu

Reference
https://www.geeksforgeeks.org/compiler-design-variants-of-syntax-tree/

Lecture 14: Review of Weeks 1 - 13

Compiler Design
Phases of Compiler:

Symbol Table: It is a data structure used and maintained by the compiler; it contains all the
identifiers' names along with their types. It helps the compiler to function smoothly by finding the
identifiers quickly.
Lexical Analysis : Lexical analyzer reads a source program character by character to produce
tokens. Tokens can be identifiers, keywords, operators, separators etc.
Syntax Analysis : Syntax analyzer is also known as parser. It constructs the parse tree. It takes all
the tokens one by one and uses Context Free Grammar to construct the parse tree.
Semantic Analyzer : It verifies the parse tree, whether it’s meaningful or not. It furthermore
produces a verified parse tree.
Intermediate Code Generator: It generates intermediate code, that is, a form which can be readily
executed by a machine. We have many popular intermediate codes.
Code Optimizer : It transforms the code so that it consumes fewer resources and produces more
speed.
Target Code Generator : The main purpose of Target Code generator is to write a code that the
machine can understand. The output is dependent on the type of assembler.
Error handling :
The tasks of the Error Handling process are to detect each error, report it to the user, and then
make some recover strategy and implement them to handle error. An Error is the blank entries in
the symbol table. There are two types of error :
Run-Time Error : A run-time error is an error which takes place during the execution of a
program, and usually happens because of adverse system parameters or invalid input data.
Compile-Time Error: Compile-time errors arise at compile time, before execution of the
program.
1. Lexical: This includes misspellings of identifiers, keywords or operators.
2. Syntactical: a missing semicolon or unbalanced parentheses.
3. Semantical: incompatible value assignment or type mismatches between operator and
operand.
4. Logical: code not reachable, infinite loop.
Left Recursion: The grammar A -> Aa | a is left recursive. Top-down parsing techniques cannot
handle left-recursive grammars, so we convert left recursion into right recursion.

Left recursion elimination: A -> Aa | a  ⇒  A -> aA', A' -> aA' | ∈
Left Factoring: If a grammar has common prefixes in the right-hand sides of a nonterminal, then
such a grammar needs to be left factored by eliminating the common prefixes, as follows:

A -> ab1 | ac2  ⇒  A -> aA', A' -> b1 | c2
FIRST(A) is the set of terminal symbols which occur as first symbols in strings derived from A.
FOLLOW(A) is the set of terminals which occur immediately after the nonterminal A in the
strings derived from the starting symbol.

Top-down parser

Recursive descent parser; Non-recursive / predictive / LL(1) parser

LL(1) Parser: An LL(1) grammar is unambiguous, left factored and non-left-recursive.
To check whether a grammar is LL(1) or not:

1. If A -> B1 | C2  ⇒  { FIRST(B1) ∩ FIRST(C2) = φ }
2. If A -> B | ∈  ⇒  { FIRST(B) ∩ FOLLOW(A) = φ }
Bottom up parser

LR(0) parser    SLR parser    LALR parser    CLR parser
LR(0) Parser : Closure() and goto() functions are used to create canonical collection of LR items.
Conflicts in LR(0) parser :
1. Shift-Reduce (SR) conflict: when the same state in the DFA contains both shift and reduce items.
   A -> B . xC (shifting)    B -> a. (reduced)
2. Reduce-Reduce (RR) conflict: two reductions in the same state of the DFA.
   A -> a. (reduced)    B -> b. (reduced)
SLR Parser: It is more powerful than LR(0).
Every LR(0) grammar is SLR, but every SLR grammar need not be LR(0).
Conflicts in SLR
1. SR conflict: A -> B . xC (shifting)    B -> a. (reduced)    if FOLLOW(B) ∩ {x} ≠ φ
2. RR conflict: A -> a. (reduced)    B -> b. (reduced)    if FOLLOW(A) ∩ FOLLOW(B) ≠ φ

Assignments
Read about: Analysis of syntax-directed definitions. Type Checking:Type systems,
Specification of a simple type checker, Equivalence of type expressions,
Type conversions, Overloading of functions and operators, Polymorphic
functions, An algorithm for unification. Run-Time Environments:Source
language issues, Storage organization, Storage-allocation strategies.

Read about: Access to nonlocal names, parameter passing, Symbol tables, Language
facilities for dynamic storage allocation, Dynamic storage allocation
techniques, Storage allocation in Fortran. Intermediate Code
Generation:Intermediate languages, Declarations, Assignment statements,
Boolean expressions, Case statements, Back Patching, Procedure
calls.Code generation:Issues in the design of a code generator, The target
machine, Run-time storage management, Basic blocks and flow graphs,
Next-use information.

Read about: A Simple code generator, Register allocation and assignment, The dag
representation of basic blocks, Peephole optimization, Generating code
from dags, Dynamic programming code-generation algorithm, Code-
generator generators. Code Optimization: Introduction, The Principal
sources of optimization, Optimization of basic blocks, Loops in flow
graphs, Introduction to global data-flow analysis, Iterative solution of
data-flow equations, Code improving transformations, Dealing with
aliases, Data-flow analysis of structured flow graphs, Efficient data-flow
algorithms, A tool for data-flow analysis, and Estimation of types.

Read about: Symbolic debugging of optimized code. Advanced topics include garbage
collection; dynamic data structures, pointer analysis, aliasing; code
scheduling, pipelining; dependence testing; loop level optimisation;
superscalar optimisation; profile-driven optimisation; debugging support;

incremental parsing; type inference; advanced parsing algorithms;
practical attribute evaluation; function in-lining and partial evaluation.

Weeks 9 & 10: Analysis of syntax-directed definitions. Type Checking: Type systems,
Specification of a simple type checker, Equivalence of type expressions,
Type conversions, Overloading of functions and operators, Polymorphic
functions, An algorithm for unification. Run-Time Environments: Source
language issues, Storage organization, Storage-allocation strategies.

Polymorphic functions
What are polymorphic functions?
Answer
Those functions that can evaluate to or be applied to values of different types are known as
polymorphic functions. A data type that can appear to be of a generalized type (e.g. a list with
elements of arbitrary type) is designated polymorphic data type like the generalized type from
which such specializations are made.
In programming language theory and type theory, polymorphism is the provision of a
single interface to entities of different types or the use of a single symbol to represent multiple
different types. The concept is borrowed from a principle in biology where an organism or
species can have many different forms or stages.

History about polymorphic functions


Interest in polymorphic type systems developed significantly in the 1990s, with practical
implementations beginning to appear by the end of the decade. Ad hoc
polymorphism and parametric polymorphism were originally described in Christopher
Strachey's Fundamental Concepts in Programming Languages, where they are listed as "the two
main classes" of polymorphism. Ad hoc polymorphism was a feature of Algol 68, while
parametric polymorphism was the core feature of ML's type system.
In a 1985 paper, Peter Wegner and Luca Cardelli introduced the term inclusion polymorphism to
model subtypes and inheritance, citing Simula as the first programming language to implement
it.

Type of polymorphic functions


The most commonly recognized major classes of polymorphism are:
(i) Ad hoc polymorphism: defines a common interface for an arbitrary set of individually
specified types.

(ii) Parametric polymorphism: not specifying concrete types and instead use abstract
symbols that can substitute for any type.
(iii) Subtyping (also called subtype polymorphism or inclusion polymorphism): when a
name denotes instances of many different classes related by some common
superclass.
Ad hoc polymorphism
Christopher Strachey chose the term ad hoc polymorphism to refer to polymorphic functions that
can be applied to arguments of different types, but that behave differently depending on the type
of the argument to which they are applied (also known as function overloading or operator
overloading). The term "ad hoc" in this context is not intended to be pejorative; it refers simply
to the fact that this type of polymorphism is not a fundamental feature of the type system. In
the Pascal / Delphi example below, the Add functions seem to work generically over two types
(integer and string) when looking at the invocations, but are considered to be two entirely distinct
functions by the compiler for all intents and purposes:

Example of Pascal / Delphi

program Adhoc;

function Add(x, y : Integer) : Integer;


begin
Add := x + y
end;

function Add(s, t : String) : String;


begin
Add := Concat(s, t)
end;

begin
Writeln(Add(1, 2)); (* Prints "3" *)
Writeln(Add('Hello, ', 'Mammals!')); (* Prints "Hello, Mammals!" *)
end.

In dynamically typed languages the situation can be more complex as the correct function that needs
to be invoked might only be determinable at run time.
Implicit type conversion has also been defined as a form of polymorphism, referred to as "coercion
polymorphism".

Parametric polymorphism
Parametric polymorphism allows a function or a data type to be written generically, so that it can
handle values uniformly without depending on their type. Parametric polymorphism is a way to
make a language more expressive while still maintaining full static type-safety.
The concept of parametric polymorphism applies to both data types and functions. A function
that can evaluate to or be applied to values of different types is known as a polymorphic
function. A data type that can appear to be of a generalized type (e.g. a list with elements of

arbitrary type) is designated polymorphic data type like the generalized type from which such
specializations are made.
Parametric polymorphism is ubiquitous in functional programming, where it is often simply
referred to as "polymorphism". The following example in Haskell shows a parameterized list
data type and two parametrically polymorphic functions on them:

data List a = Nil | Cons a (List a)

length :: List a -> Integer


length Nil = 0
length (Cons x xs) = 1 + length xs

map :: (a -> b) -> List a -> List b


map f Nil = Nil
map f (Cons x xs) = Cons (f x) (map f xs)

Parametric polymorphism is also available in several object-oriented languages. For


instance, templates in C++ and D, or under the name generics in C#, Delphi, Java and Go:

class List<T> {
class Node<T> {
T elem;
Node<T> next;
}
Node<T> head;
int length() { ... }
}

List<B> map(Func<A, B> f, List<A> xs) {


...
}

John C. Reynolds (and later Jean-Yves Girard) formally developed this notion of polymorphism
as an extension to lambda calculus (called the polymorphic lambda calculus or System F). Any
parametrically polymorphic function is necessarily restricted in what it can do, working on the
shape of the data instead of its value, leading to the concept of parametricity.

Subtyping
Some languages employ the idea of subtyping (also called subtype polymorphism or inclusion
polymorphism) to restrict the range of types that can be used in a particular case of
polymorphism. In these languages, subtyping allows a function to be written to take an object of
a certain type T, but also work correctly, if passed an object that belongs to a type S that is a
subtype of T (according to the Liskov substitution principle). This type relation is sometimes
written S <: T. Conversely, T is said to be a supertype of S—written T :> S. Subtype
polymorphism is usually resolved dynamically (see below).
In the following Java example we make cats and dogs subtypes of pets. The
procedure letsHear() accepts a pet, but will also work correctly if a subtype is passed to it:

abstract class Pet {
abstract String speak();
}

class Cat extends Pet {


String speak() {
return "Meow!";
}
}

class Dog extends Pet {


String speak() {
return "Woof!";
}
}

static void letsHear(final Pet pet) {
    System.out.println(pet.speak());
}

static void main(String[] args) {


letsHear(new Cat());
letsHear(new Dog());
}

In another example, if Number, Rational, and Integer are types such


that Number :> Rational and Number :> Integer (Rational and Integer as subtypes of a
type Number that is a supertype of them), a function written to take a Number will work equally
well when passed an Integer or Rational as when passed a Number. The actual type of the object
can be hidden from clients into a black box, and accessed via object identity. In fact, if
the Number type is abstract, it may not even be possible to get your hands on an object
whose most-derived type is Number (see abstract data type, abstract class). This particular kind
of type hierarchy is known especially in the context of the Scheme programming language as
a numerical tower, and usually contains many more types.
Object-oriented programming languages offer subtype polymorphism using subclassing (also
known as inheritance). In typical implementations, each class contains what is called a virtual
table (shortly called vtable) a table of functions that implement the polymorphic part of the class
interface—and each object contains a pointer to the vtable of its class, which is then consulted
whenever a polymorphic method is called. This mechanism is an example of:
 late binding, because virtual function calls are not bound until the time of invocation;
 single dispatch (i.e., single-argument polymorphism), because virtual function calls
are bound simply by looking through the vtable provided by the first argument
(the this object), so the runtime types of the other arguments are completely
irrelevant.
The same goes for most other popular object systems. Some, however, such as Common Lisp
Object System, provide multiple dispatch, under which method calls are polymorphic
in all arguments.
The interaction between parametric polymorphism and subtyping leads to the concepts
of variance and bounded quantification.

Polymorphic Functions in Compiler Design

In today’s world, the life of a developer would be difficult without Polymorphism. It allows us to
treat objects of different classes as though they belong to a shared superclass. To
implement Polymorphism, we use Polymorphic functions.
What is Polymorphism?
Polymorphism is a Greek word. It comprises two words, where Poly means "many" and
morphism means "forms". It is the ability of an object to take on many forms. Polymorphism
allows you to “program in general” rather than “program in specific." It is the capability of a
method to do different things based on the object.
There are two types of Polymorphism:
1. Compile Time Polymorphism.

2. Run Time Polymorphism.

What is Compile Time Polymorphism?


Compile time Polymorphism refers to a behavior that is resolved when your class is compiled.
Example: Method overloading.
Method Overloading: This is a feature that we can use to allow a class to have more than one
method with the same name, but their argument lists should be different from each other.
Compile time is also known as "static binding" or "early binding."

What is Run Time Polymorphism?


It is a feature that allows a child class or subclass to provide a certain implementation of a
method that is already provided by one of its parent classes or superclasses.
Example: Method Overriding. Run time is also known as Dynamic binding or Late binding.
What is a Polymorphic Function?
A function is said to be a Polymorphic function if it can work with multiple types of data or
objects. We can also describe it as a function that can be used to perform the same type of
operation on different types of input.
Types of Polymorphic Functions in Compiler Design
There are two types of polymorphic functions that are used in compiler design.



1. Ad-hoc Polymorphism:
It is also known as “Overloading Ad-hoc Polymorphism”, which allows functions that have the
same name to act differently for different data types. For example, The plus operator will add
two numbers but perform a concatenation operation on two strings.
#include <iostream>
#include <string>   // needed for std::string used by the second overload
using namespace std;

int add(int x, int y)
{
    int z = x + y;
    return z;
}

string add(const char* x, const char* y)
{
    string addition(x);
    addition += y;
    return addition;
}

int main()
{
    cout << add(71, 72)
         << " is Integer addition Output\n";
    cout << add("Coding", " Ninjas")
         << " is String Concatenation Output\n";
}
Output:
143 is Integer addition Output
Coding Ninjas is String Concatenation Output

So, we are calling two different functions (which differ in the type of arguments), and both of
them have the same name to execute multiple operations. We have successfully achieved Ad-hoc
Polymorphism.

Advantages of Ad-hoc Polymorphism:


(i) It allows the functions to be more versatile and handle different data types and different ways.
(ii) It provides a simpler Interface to the programmer.
(iii) It will group all the related functions under a single name. And this will improve the readability.

2. Parametric Polymorphism:
It is also known as "Early Binding Parametric Polymorphism.” It opens a way to use the same
code for different data types. It is implemented by using Templates. For example: To develop an
understanding of this sort of Polymorphism, let us execute a program to find the greater of two
Integers or two Strings.
#include <iostream>
#include <string>
using namespace std;

// A function template: the same code works for any type that supports '>'.
template <class temp>
temp greater(temp a, temp b)
{
    if (a > b)
        return a;
    else
        return b;
}

int main()
{
    // '::greater' selects our template rather than std::greater.
    cout << ::greater(96, 69) << endl;
    string str1("Coding"), str2("Ninja");
    cout << ::greater(str1, str2) << endl;
}

An algorithm for unification


In logic and computer science, unification is an algorithmic process of solving equations between
symbolic expressions. For example, using x, y, z as variables, the singleton equation set
{ cons(x,cons(x,nil)) = cons(2,y) } is a syntactic first-order unification problem that has the
substitution { x ↦ 2, y ↦ cons(2,nil) } as its only solution.

A unification algorithm was first discovered by Jacques Herbrand, while a first formal
investigation can be attributed to John Alan Robinson, who used first-order syntactical
unification as a basic building block of his resolution procedure for first-order logic, a great step
forward in automated reasoning technology, as it eliminated one source of combinatorial
explosion: searching for instantiation of terms. Today, automated reasoning is still the main
application area of unification. Syntactical first-order unification is used in logic
programming and programming language type system implementation, especially in Hindley–
Milner based type inference algorithms. Semantic unification is used in SMT solvers, term
rewriting algorithms and cryptographic protocol analysis. Higher-order unification is used in
proof assistants, for example Isabelle and Twelf, and restricted forms of higher-order unification
(higher-order pattern unification) are used in some programming language implementations,
such as lambdaProlog, as higher-order patterns are expressive, yet their associated unification
procedure retains theoretical properties closer to first-order unification.
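For illustration, the following is a minimal Python sketch of Robinson-style syntactic first-order unification; the term representation (tuples such as ("cons", "X", "nil") for compound terms, and strings starting with an uppercase letter for variables) is an assumption made only for this example.

# A minimal sketch of syntactic first-order unification (Robinson-style).
# Representation (illustrative assumption): a variable is a string starting
# with an uppercase letter; a compound term is a tuple (functor, arg1, ...).

def is_var(t):
    return isinstance(t, str) and t[:1].isupper()

def walk(t, subst):
    """Follow the substitution chain for a variable."""
    while is_var(t) and t in subst:
        t = subst[t]
    return t

def occurs(v, t, subst):
    """Occurs check: does variable v appear inside term t?"""
    t = walk(t, subst)
    if t == v:
        return True
    if isinstance(t, tuple):
        return any(occurs(v, a, subst) for a in t[1:])
    return False

def unify(a, b, subst=None):
    """Return a most general unifier (a dict) or None if none exists."""
    if subst is None:
        subst = {}
    a, b = walk(a, subst), walk(b, subst)
    if a == b:
        return subst
    if is_var(a):
        return None if occurs(a, b, subst) else {**subst, a: b}
    if is_var(b):
        return None if occurs(b, a, subst) else {**subst, b: a}
    if isinstance(a, tuple) and isinstance(b, tuple) and a[0] == b[0] and len(a) == len(b):
        for x, y in zip(a[1:], b[1:]):
            subst = unify(x, y, subst)
            if subst is None:
                return None
        return subst
    return None   # clash of distinct function symbols or constants

# The example from the text: cons(x, cons(x, nil)) = cons(2, y)
print(unify(("cons", "X", ("cons", "X", "nil")), ("cons", 2, "Y")))

Running the sketch prints {'X': 2, 'Y': ('cons', 'X', 'nil')}; applying the substitution to the binding of Y gives cons(2, nil), which matches { x ↦ 2, y ↦ cons(2,nil) } above.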

Weeks 11 & 12: Access to nonlocal names, parameter passing, Symbol tables, Language
facilities for dynamic storage allocation, Dynamic storage allocation
techniques, Storage allocation in Fortran. Intermediate Code Generation:
Intermediate languages, Declarations, Assignment statements, Boolean
expressions, Case statements, Back Patching, Procedure calls. Code
generation: Issues in the design of a code generator, The target machine,
Run-time storage management, Basic blocks and flow graphs, Next-use
information.

Access to nonlocal names


What does non local mean?
Meaning of non-local in English

not found within, coming from, or relating to a small area, especially of a country: Non-local
attendees to the arts festival spend significantly more than local attendees, according to the
report. Take for example one can say that “A large number of items made of nonlocal materials
were found at the archaeological site”.


How do I access non-local names in compiler design?


In summary, access links are used in nested procedures to allow access to non-local variables.
The access link is a special pointer in an activation record that points to the stack frame of the
procedure that lexically (textually) encloses the current one (not necessarily the caller), and the
chain of access links is followed to locate a non-local variable's memory location.

Access to Non-local Names:


 When a procedure refers to variables that are not local to it, such
variables are called non-local variables
 There are two types of scope rules, for the non-local names. They are
Static scope
Dynamic scope

Static Scope or Lexical Scope


 Lexical scope is also called static scope. In this type of scope, the scope is verified by
examining the text of the program.
 Examples: PASCAL, C and ADA are the languages that use the static scope rule.
 These languages are also called block structured languages
Block
 A block defines a new scope with a sequence of statements that contains the local data
declarations. It is enclosed within the delimiters.
Example:
{
Declaration statements
……….
}
 The beginning and end of the block are specified by the delimiter. The blocks can be in
nesting fashion that means block B2 completely can be inside the block B1
 In a block-structured language, the scope of a declaration is given by the static (most closely
nested) rule
 At a program point, declarations are visible
The declarations that are made inside the procedure
The names of all enclosing procedures
The declarations of names made immediately within such procedures
 The storage for the names declared in a particular block can be pictured as a contiguous region
that is allocated when the block is entered and released when it is left
 Thus, block-structured storage allocation can be done with a stack

Lexical Scope for Nested Procedure


 If a procedure is declared inside another procedure then that procedure is known as
nested procedure
 A procedure pi, can call any procedure, i.e., its direct ancestor or older siblings of its
direct ancestor
Procedure main
Procedure P1
Procedure P2
Procedure P3
Procedure P4



Nesting Depth:
 Lexical scope can be implemented by using nesting depth of a procedure. The procedure
of calculating nesting depth is as follows:
The main program’s nesting depth is ‘1’
When a new procedure begins, add ‘1’ to nesting depth each time
When you exit from a nested procedure, subtract ‘1’ from depth each time
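For illustration, the following minimal Python sketch models activation records that carry an access link; the Frame class and its field names are invented purely for this example. A name declared at nesting depth d and used in a procedure at nesting depth n is located by following n - d access links.

# Illustrative sketch: activation records with access links.
# A name declared at nesting depth d, used in a procedure at depth n,
# is located by following (n - d) access links.

class Frame:
    def __init__(self, proc, depth, access_link, locals_):
        self.proc = proc                # procedure name
        self.depth = depth              # nesting depth of the procedure
        self.access_link = access_link  # frame of the lexically enclosing procedure
        self.locals = locals_          # local variables of this activation

def lookup(frame, name, decl_depth):
    """Resolve a (possibly non-local) name by following access links."""
    for _ in range(frame.depth - decl_depth):
        frame = frame.access_link
    return frame.locals[name]

# main (depth 1) declares x; P1 (depth 2) is nested in main; P2 (depth 3) in P1.
main_frame = Frame("main", 1, None, {"x": 42})
p1_frame = Frame("P1", 2, main_frame, {"y": 7})
p2_frame = Frame("P2", 3, p1_frame, {"z": 0})

print(lookup(p2_frame, "x", 1))   # follows 3 - 1 = 2 access links -> 42
print(lookup(p2_frame, "y", 2))   # follows 1 access link -> 7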

What is a Nonlocal name in Python?


The nonlocal keyword does not apply to local or global variables; it is used to reference a
variable in an enclosing scope, i.e., a scope other than the global and the local one. The
nonlocal keyword is used in nested functions to reference a variable in the parent (enclosing)
function.
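A short illustrative example of the nonlocal keyword:

def counter():
    count = 0               # variable in the enclosing (outer) function

    def increment():
        nonlocal count      # rebinds the outer 'count', not a new local
        count += 1
        return count

    return increment

tick = counter()
print(tick(), tick(), tick())   # 1 2 3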

What is the difference between local and nonlocal in Python?


Answer

In Python, nonlocal variables are variables that are declared in an enclosing (outer) function and
referenced from within a nested (inner) function. Such a variable belongs neither to the inner
function's local scope nor to the global scope.

Reference
https://estudies4you.blogspot.com/2017/09/access-to-nonlocal-names.html

Storage allocation in Fortran


What is storage allocation?
Storage allocation is the process of associating an area of storage with a variable so that the data
item(s) represented by the variable can be recorded internally.
Does Fortran have memory allocation?
After definition of pointers one can allocate memory for it using the allocate command. The
memory pointed to by a pointer is given free again by the deallocate command.



What is allocate in Fortran?
The allocatable attribute provides a safe way for memory handling. In comparison to variables
with pointer attribute the memory is managed automatically and will be deallocated
automatically once the variable goes out-of-scope.

For FORTRAN and other languages which allow static storage allocation, the amount of storage
required to hold each variable is fixed at translation time. Such languages have no nested
procedures or recursion and thus only one instance of each name (the same identifier may be
used in different context, however).
What is array in Fortran?
An array is a named collection of elements of the same type. It is a nonempty sequence of data
and occupies a group of contiguous storage locations. An array has a name, a set of elements,
and a type. An array name is a symbolic name for the whole sequence of data.

Best practice of allocating memory in Fortran?


I came across the SciVision website when searching for other things. It contains a nice list of
posts on Fortran.
In particular, I note the post entitled “Fortran allocate large variable memory”. It motivates
me to ask the following two questions (note that there is a Question 2).
Question 1. What is the best practice of allocating memory in Fortran?
Personally, I wrap up a generic procedure named safealloc to do the job. Below is the
implementation of safealloc to allocate the memory for a rank-1 REAL(SP) array with a size
given by a variable n of kind INTEGER(IK). We can imagine, for example, the
module consts_mod defines SP=kind(0.0) and IK=kind(0). In addition, validate is a subroutine
that stops the program when an assertion fails, akin to the assert function in C or Python.
subroutine alloc_rvector_sp(x, n)
!--------------------------------------------------------------------------------------------------!
! Allocate space for an allocatable REAL(SP) vector X, whose size is N after allocation.
!--------------------------------------------------------------------------------------------------!
use, non_intrinsic :: consts_mod, only : SP, IK ! Kinds of real and integer variables
use, non_intrinsic :: debug_mod, only : validate ! An `assert`-like subroutine
implicit none

! Inputs
integer(IK), intent(in) :: n
! Outputs
real(SP), allocatable, intent(out) :: x(:)
! Local variables
integer :: alloc_status
character(len=*), parameter :: srname = 'ALLOC_RVECTOR_SP'
! Preconditions
call validate(n >= 0, 'N >= 0', srname)
! According to the Fortran 2003 standard, when a procedure is invoked, any allocated
ALLOCATABLE
! object that is an actual argument associated with an INTENT(OUT) ALLOCATABLE dummy
argument is
! deallocated. So it is unnecessary to write the following line since F2003 as X is
INTENT(OUT):
!!if (allocated(x)) deallocate (x)
! Allocate memory for X
allocate (x(n), stat=alloc_status)
call validate(alloc_status == 0, 'Memory allocation succeeds (ALLOC_STATUS == 0)', srname)
call validate(allocated(x), 'X is allocated', srname)
! Initialize X to a strange value independent of the compiler; it can be costly for a large N.
x = -huge(x)
! Postconditions
call validate(size(x) == n, 'SIZE(X) == N', srname)
end subroutine alloc_rvector_sp
[Update (2022-01-25): I shuffled the lines a bit, moving validate(allocated(x), 'X is allocated',
srname) to the above of x = -huge(x).]

Practice questions:
1. What do you think about this implementation? Any comments, suggestions, and criticism will
be appreciated.
A related and more particular question is the following.
2. What is the best practice of allocating large memory in Fortran?
The question can be further detailed as follows.
3. What does “large” mean under a modern and practical setting?
To be precise, let us consider a PC/computing node with >= 4GB of RAM. In addition, the
hardware (RAM, CPU, hard storage, etc), the compiler, and the system are reasonably
mainstream and modern, e.g., not more than 10 years old.
4. What special caution should be taken when the memory to allocate is large by the answer
to 2.1?
Boolean expressions
What is a Boolean expression?
Answer

In computer science, a Boolean expression is an expression used in programming languages that


produces a Boolean value when evaluated. A Boolean value is either true or false. That is,
Boolean expressions are the expressions that evaluate a condition and result in a Boolean value
such as true or false.

Example 1: (a>b && a>c) is a Boolean expression. It evaluates the condition by comparing if 'a'
is greater than 'b' and also if 'a' is greater than 'c'.

Example 2: (2>1 && 10>9) It evaluates the condition by comparing if '2' is greater than '1' and
also if '10' is greater than '9'.

A logical statement that results in a Boolean value, either be True or False, is a Boolean
expression. Sometimes, synonyms are used to express the statement such as ‘Yes’ for ‘True’ and
‘No’ for ‘False’.

Also, 1 and 0 are used for digital circuits for True and False, respectively.

Boolean expressions are the statements that use logical operators, i.e., AND, OR, XOR and
NOT. Thus, if we write X AND Y = True, then it is a Boolean expression.

Boolean operators
Most programming languages have the Boolean operators OR, AND and NOT; in C and
some languages inspired by it, these are represented by "||" (double pipe character), "&&"
(double ampersand) and "!" (exclamation point) respectively, while the corresponding bitwise
operations are represented by "|", "&" and "~" (tilde). In the mathematical literature the symbols
used are often "+" (plus), "·" (dot) and overbar, or "∨" (vel), "∧" (et) and "¬" (not) or "′"
(prime).



Some languages, e.g., Perl and Ruby, have two sets of Boolean operators, with identical
functions but different precedence. Typically, these languages use and, or and not for the lower
precedence operators.
Some programming languages derived from PL/I have a bit string type and use BIT(1) rather
than a separate Boolean type. In those languages the same operators serve for boolean operations
and bitwise operations. The languages represent OR, AND, NOT and EXCLUSIVE OR by "|",
"&", "¬" (infix) and "¬" (prefix).
Short-circuit operators
Some programming languages, e.g., Ada, have short-circuit Boolean operators. These operators
use a lazy evaluation, that is, if the value of the expression can be determined from the left hand
Boolean expression then they do not evaluate the right hand Boolean expression. Therefore,
there may be side effects that only occur for one value of the left hand operand.
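For illustration, the following short Python example shows short-circuit (lazy) evaluation; the probe helper is invented for this example so that its side effect reveals which operands are actually evaluated.

def probe(label, value):
    # Side effect so we can see whether the operand was evaluated.
    print("evaluating", label)
    return value

# 'and' stops at the first falsy operand; 'or' stops at the first truthy one.
print(probe("left", False) and probe("right", True))   # only "left" is evaluated
print(probe("left", True) or probe("right", False))    # only "left" is evaluated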

Boolean Algebra

Boolean algebra is the category of algebra in which the variable’s values are the truth
values, true and false, ordinarily denoted 1 and 0 respectively. It is used to analyze and simplify
digital circuits or digital gates. It is also called Binary Algebra or logical Algebra. It has been
fundamental in the development of digital electronics and is provided for in all modern
programming languages. It is also used in set theory and statistics.

The important operations performed in Boolean algebra are conjunction (∧), disjunction (∨)
and negation (¬). Hence, this algebra is quite different from elementary algebra, where the
values of variables are numerical and the arithmetic operations of addition, subtraction,
multiplication, and division are performed on them.

Boolean Algebra Operations


The basic operations of Boolean algebra are as follows:

 Conjunction or AND operation


 Disjunction or OR operation
 Negation or Not operation



Below is the table defining the symbols for all three basic operations.

Operator Symbol Precedence

NOT ‘ (or) ¬ Highest

AND . (or) ∧ Middle

OR + (or) ∨ Lowest

Suppose A and B are two Boolean variables, then we can define the three operations as;

 A conjunction B or A AND B, satisfies A ∧ B = True, if A = B = True or else A ∧ B =


False.
 A disjunction B or A OR B, satisfies A ∨ B = False, if A = B = False, else A ∨ B = True.
 Negation A or ¬A satisfies ¬A = False, if A = True and ¬A = True if A = False

Boolean Algebra Terminologies

Now, let us discuss the important terminologies covered in Boolean algebra.

Boolean Algebra: Boolean algebra is the branch of algebra that deals with logical operations and
binary variables.

Boolean Variables: A Boolean variable is defined as a variable or a symbol, generally an
alphabet, that represents the logical quantities such as 0 or 1.

Boolean Function: A Boolean function consists of binary variables, logical operators, constants
such as 0 and 1, equal to the operator, and the parenthesis symbols.

Literal: A literal may be a variable or a complement of a variable.

Complement: The complement is defined as the inverse of a variable, which is represented by a


bar over the variable.

Truth Table: The truth table is a table that gives all the possible values of the logical variables and
the combinations of the variables. It is possible to convert a Boolean equation into a truth table.
The number of rows in the truth table should be equal to 2^n, where “n” is the number of variables
in the equation. For example, if a Boolean equation consists of 3 variables, then the number of
rows in the truth table is 8 (i.e., 2^3 = 8).

Boolean Algebra Truth Table

Now, if we express the above operations in a truth table, we get;

A B A∧B A∨B

True True True True

True False False True

False True False True

False False False False

A ¬A

True False

False True

Boolean Algebra Rules

Following are the important rules used in Boolean algebra.

(a) A variable used can have only two values: Binary 1 for HIGH and Binary 0 for LOW.
(b) The complement of a variable is represented by an overbar.

Thus, the complement of variable B is represented as B̄. That is, if B = 0 then B̄ = 1, and if B = 1 then B̄ = 0.


(c) OR-ing of the variables is represented by a plus (+) sign between them. For example, the
OR-ing of A, B, and C is represented as A + B + C.
(d) Logical AND-ing of the two or more variables is represented by writing a dot between
them, such as A.B.C. Sometimes, the dot may be omitted like ABC.


Laws of Boolean Algebra

There are six types of Boolean algebra laws. They are: Commutative law, Associative law,
Distributive law, AND law, OR law and Inversion law.

Those six laws are explained in detail here.

(1) Commutative Law

Any binary operation which satisfies the following expression is referred to as a commutative
operation. Commutative law states that changing the sequence of the variables does not have any
effect on the output of a logic circuit.

 A. B = B. A
 A+B=B+A

(2) Associative Law

It states that the order in which the logic operations are performed is irrelevant as their effect is
the same.

 ( A. B ). C = A . ( B . C )
 ( A + B ) + C = A + ( B + C)

(3) Distributive Law

Distributive law states the following conditions:

 A. ( B + C) = (A. B) + (A. C)
 A + (B. C) = (A + B) . ( A + C)

(4) AND Law

These laws use the AND operation. Therefore they are called AND laws.

 A . 0 = 0
 A . 1 = A
 A . A = A
 A . Ā = 0

(5) OR Law

These laws use the OR operation. Therefore, they are called OR laws.

 A + 0 = A
 A + 1 = 1
 A + A = A
 A + Ā = 1

(6) Inversion Law

In Boolean algebra, the inversion law states that double inversion of variable results in the
original variable itself.

(A')' = A

Boolean Algebra Theorems

The two important theorems which are extensively used in Boolean algebra are De Morgan’s First
Law and De Morgan’s Second Law. These two theorems are used to transform a Boolean expression
from one form to another and thereby help reduce a given Boolean expression to a simplified
form. Now, let us discuss these two theorems in detail.

De Morgan’s First Law:

De Morgan’s First Law states that (A.B)’ = A’+B’.

The first law states that the complement of the product of two variables is equal to the sum of
their individual complements.

The truth table that shows the verification of De Morgan’s First law is given as follows:
A B A’ B’ (A.B)’ A’+B’

0 0 1 1 1 1

0 1 1 0 1 1

1 0 0 1 1 1

1 1 0 0 0 0

The last two columns show that (A.B)’ = A’+B’.

Hence, De Morgan’s First Law is proved.

De Morgan’s Second Law:

De Morgan’s Second law states that (A+B)’ = A’. B’.

The second law states that the complement of the sum of two variables is equal to the product of
their individual complements.

The following truth table shows the proof for De Morgan’s second law.

A B A’ B’ (A+B)’ A’. B’

0 0 1 1 1 1

0 1 1 0 0 0

1 0 0 1 0 0

1 1 0 0 0 0

The last two columns show that (A+B)’ = A’. B’.

Hence, De Morgan’s second law is proved.
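As a quick illustrative check, both laws can also be verified exhaustively in a few lines of Python:

# Exhaustive check of De Morgan's laws over all truth assignments.
for A in (False, True):
    for B in (False, True):
        assert (not (A and B)) == ((not A) or (not B))    # first law:  (A.B)' = A' + B'
        assert (not (A or B)) == ((not A) and (not B))    # second law: (A+B)' = A'. B'
print("De Morgan's laws hold for all combinations of A and B")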


The other theorems in Boolean algebra are complementary theorem, duality theorem,
transposition theorem, redundancy theorem and so on. All these theorems are used to simplify
the given Boolean expression. The reduced Boolean expression should be equivalent to the given
Boolean expression.

Solved Examples

Question: Simplify the following expression:

C + (B.C)'

Solution:

Given:

C + (B.C)'

According to De Morgan’s law, we can write the above expression as

C + (B' + C')

From the Commutative and Associative laws:

(C + C') + B'

From the Complement law, C + C' = 1, so the expression becomes

1 + B' = 1

Therefore,

C + (B.C)' = 1

Question 2: Draw a truth table for A(B+D).

Solution: Given expression A(B+D).

A B D B+D A(B+D)

0 0 0 0 0

0 0 1 1 0

0 1 0 1 0



0 1 1 1 0

1 0 0 0 0

1 0 1 1 1

1 1 0 1 1

1 1 1 1 1


Frequently Asked Questions on Boolean Algebra are as follows:


(Q1) What is meant by Boolean algebra?

In Mathematics, Boolean algebra is called logical algebra consisting of binary variables that hold
the values 0 or 1, and logical operations.
(Q2) What are some applications of Boolean algebra?

In electrical and electronic circuits, Boolean algebra is used to simplify and analyze the logical or
digital circuits.
(Q3) What are the three main Boolean operators?

The three important Boolean operators are:


AND (Conjunction)
OR (Disjunction)
NOT (Negation)

(Q4) Does the value 0 represent true or false?

In Boolean logic, zero (0) represents false and one (1) represents true. In many applications, zero
is interpreted as false and a non-zero value is interpreted as true.

(Q5) Mention the six important laws of Boolean algebra.



The six important laws of Boolean algebra are:
(i) Commutative law
(ii) Associative law
(iii) Distributive law
(iv) Inversion law
(v) AND law
(vi) OR law

Reference
https://byjus.com/maths/boolean-algebra/

Weeks 13 & 14:

Symbolic debugging of optimized code


A symbolic debugger can be invoked either when a run-time error occurs in the program being
debugged or when program execution reaches a breakpoint. For simplicity, we assume that
breakpoints are inserted only between source statements. Exactly what constitutes a statement
depends on the programming language.
Symbolic debugging does not allow you to view the program source, but does allow you to
reference paragraphs and variables by their COBOL identifiers. The advantage to using symbolic
debugging rather than source debugging is that the compiled object module is much smaller.

Question:
Why symbolic debugging of optimized code?

Symbolic debuggers
Definition: Symbolic debuggers are program development tools that allow a user to interact with
an executing process at the source level.
Or
Definition: Symbolic debuggers are system development tools that can accelerate the validation
speed of behavioral specifications by allowing a user to interact with an executing code at the
source level.

Explanation on symbolic debuggers


In response to a user query, the debugger must be able to retrieve and display the value of a
source variable in a manner consistent with what the user expects with respect to the source
statement where execution has halted. However, when a program has been compiled with
optimizations, values of variables may either be inaccessible in the run-time state or inconsistent
with what the user expects. Such problems that pertain to the retrieval of source values are called
data value problems.

Debugging Optimized Code

Although it is possible to do a reasonable amount of debugging at nonzero optimization levels,


the higher the level the more likely that source-level constructs will have been eliminated by
optimization. Take for example, if a loop is strength-reduced, the loop control variable may be
completely eliminated and thus cannot be displayed in the debugger. This can only happen at
-O2 or -O3. Explicit temporary variables that you code might be eliminated at level -O1 or higher.

The use of the -g switch, which is needed for source-level debugging, affects the size of the
program executable on disk, and indeed the debugging information can be quite large. However,
it has no effect on the generated code (and thus does not degrade performance)

Since the compiler generates debugging tables for a compilation unit before it performs
optimizations, the optimizing transformations may invalidate some of the debugging data.
Therefore, one need to anticipate certain anomalous situations that may arise while debugging
optimized code. These are the most common cases:

1. ‘The ‘hopping Program Counter’:’ Repeated step or next commands show the PC bouncing
back and forth in the code. This may result from any of the following optimizations:

(i) ‘Common subexpression elimination:’ using a single instance of code for a quantity that the
source computes several times. As a result, one may not be able to stop on what looks like a
statement.

(ii) ‘Invariant code motion:’ moving an expression that does not change within a loop, to the
beginning of the loop.

(iii) ‘Instruction scheduling:’ moving instructions so as to overlap loads and stores (typically)
with other code, or in general to move computations of values closer to their uses. Often this
causes you to pass an assignment statement without the assignment happening and then later
bounce back to the statement when the value is actually needed. Placing a breakpoint on a line of
code and then stepping over it may, therefore, not always cause all the expected side-effects.
2. ‘The ‘big leap’:’ More commonly known as ‘cross-jumping’, in which two identical pieces of
code are merged and the program counter suddenly jumps to a statement that is not supposed to
be executed, simply because it (and the code following) translates to the same thing as the code
that ‘was’ supposed to be executed. This effect is typically seen in sequences that end in a jump,
such as a goto, a return, or a break in a C switch statement.

3. ‘The ‘roving variable’:’ The symptom is an unexpected value in a variable. There are various
reasons for this effect:
(a) In a subprogram prologue, a parameter may not yet have been moved to its
‘home’.
(b) A variable may be dead, and its register re-used. This is probably the most
common cause.

(c ) As mentioned above, the assignment of a value to a variable may have been


moved.

(d) A variable may be eliminated entirely by value propagation or other means. In this
case, GCC may incorrectly generate debugging information for the variable

In general, when an unexpected value appears for a local variable or parameter you
should first ascertain if that value was actually computed by your program, as opposed to
being incorrectly reported by the debugger. Record fields or array elements in an object
designated by an access value are generally less of a problem, once you have ascertained
that the access value is sensible. Typically, this means checking variables in the
preceding code and in the calling subprogram to verify that the value observed is
explainable from other values (one must apply the procedure recursively to those other
values); or re-running the code and stopping a little earlier (perhaps before the call) and
stepping to better see how the variable obtained the value in question; or continuing to
step ‘from’ the point of the strange value to see if code motion had simply moved the
variable’s assignments later.

In light of such anomalies, a recommended technique is to use -O0 early in the software
development cycle, when extensive debugging capabilities are most needed, and then move to -
O1 and later -O2 as the debugger becomes less critical. Whether to use the -g switch in the
release version is a release management issue. Note that if you use -g you can then use
the strip program on the resulting executable, which removes both debugging information and
global symbols.

Quiz
What is the main purpose of debugger?
Debugging tools (called debuggers) are used to identify coding errors at various development
stages. They are used to reproduce the conditions in which error has occurred, then examine the
program state at that time and locate the cause.



What is debugging of optimized code in compiler design?
Debugging optimized programs presents special usability problems. Optimization can change the
sequence of operations, add or remove code, change variable data locations, and perform other
transformations that make it difficult to associate the generated code with the original source
statements.

What are three ways you can optimize a code?


If you're looking to optimize your code for performance, there are a few things you can do.
First, make sure your code is well written and clean. Second, use a profiler to identify areas of
your code that could be improved. And finally, don't forget to optimize your algorithms and data
structures.

What is symbolic debugger with example?


A symbolic debugger knows the addresses of the symbols and is able to display them in the
disassembly. Take for example, here the debugger shows in the disassembly the code label at the
beginning of the procedure, and a second label a couple of lines below. Data references in the
disassembly are shown by data label.

What are the two types of debugging?


There are two types of debugging techniques: reactive debugging and preemptive debugging.
Most debugging is reactive — a defect is reported in the application, or an error occurs, and the
developer tries to find the root cause of the error to fix it.

What is it called when you optimize code?


In computer science, program optimization, code optimization, or software optimization is the
process of modifying a software system to make some aspect of it work more efficiently or use
fewer resources.

What are symbol files for debugging?


Program database (. pdb) files, also called symbol files, map identifiers and statements in your
project's source code to corresponding identifiers and instructions in compiled apps. These
mapping files link the debugger to your source code, which enables debugging.

Code scheduling

Code scheduling is an important part of compiler design. It is necessary to understand the
code-scheduling process because it helps ensure that the compiler is producing optimal machine
code for programs.

There are three steps in code scheduling:



1. Identify all the instructions that need to be scheduled before each jump, and their
positions relative to each other (e.g., there are two instructions after an LBR
instruction).
2. Determine when these instructions should actually execute, based on program
counter values and other information such as branch-prediction tables and register-usage
information; this allows decisions about how much time each instruction type
should spend executing before jumping forward again.
3. Generate machine code from the schedule so that all instructions are executed
exactly when they were intended, without stalls or collisions, which arise mainly
when there are not enough registers available within the microprocessor architecture.

Code- Scheduling Constraints

Data-dependence analysis is used to determine the order in which instructions may execute. It
identifies, for each instruction, which earlier instructions produce the values or locations it
needs, and therefore how long it must wait before it can be issued. A data-dependent instruction
can only be executed after all the instructions it depends on have been completed.

A true (flow) data dependence exists when an instruction uses a value that is produced by an
earlier instruction; the later instruction cannot be executed before that value has been computed
without affecting program behavior (i.e., introducing a bug). An anti-dependence exists when an
instruction writes a location that an earlier instruction reads; the write must not be moved ahead
of the read. An output dependence exists when two instructions write the same location; their
order must be preserved so that later uses observe the value written last.
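For illustration, the following small Python sketch classifies the dependence between two instructions, each represented here simply as a destination and a list of source operands (the representation and the register names are invented for this example):

# Illustrative sketch: classifying data dependences between two instructions,
# each written as (destination, source_operands). Register names are arbitrary.

def classify(first, second):
    d1, uses1 = first
    d2, uses2 = second
    kinds = []
    if d1 in uses2:
        kinds.append("true (flow / RAW)")    # second reads what first wrote
    if d2 in uses1:
        kinds.append("anti (WAR)")           # second overwrites what first read
    if d1 == d2:
        kinds.append("output (WAW)")         # both write the same location
    return kinds or ["independent"]

i1 = ("r1", ["r2", "r3"])   # r1 = r2 + r3
i2 = ("r4", ["r1", "r5"])   # r4 = r1 + r5   -> true dependence on i1
i3 = ("r2", ["r6"])         # r2 = r6        -> anti dependence on i1
i4 = ("r1", ["r7"])         # r1 = r7        -> output dependence on i1

print(classify(i1, i2))   # ['true (flow / RAW)']
print(classify(i1, i3))   # ['anti (WAR)']
print(classify(i1, i4))   # ['output (WAW)']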

Finding Dependences Among Memory Accesses:

Array Data-Dependence Analysis:


To find the array data dependence, we need to carry out an analysis of arrays. We can use DSA
(data-structure analysis) or MRA (memory reference analysis). The general idea behind this
technique is to find all the different ways that array elements may be accessed in your program
and their dependencies. The most popular way of doing this is by using a graph representation
called a “tree” that shows how each variable affects other variables.



Pointer-Alias Analysis:
Pointer aliasing occurs when two pointers point to the same location in memory; they refer to
different objects but hold references that point directly at each other’s addresses. This type of
problem arises when you create multiple objects with similar names or structures but with
different structure layouts.
For example, two lists are created with one list having more elements than another list does
(the shorter list has been trimmed down) leading us into trouble when we attempt to access
those elements later on down our code path!
The problem is that the compiler has no way of knowing which list we want to access; it could
be either one of them. As a result, the compiler will generate code that is not thread-safe and
unsafe.
The solution to this problem is to use a technique called pointer-alias analysis. This technique
will go through your code and figure out which pointers are pointing at the same memory
locations and make sure that no two pointers point at the same location. It does this by
performing a simple operation called “conflict resolution” where it finds all possible conflicts
between instances of objects that may be shared across threads.

Tradeoff Between Register Usage and Parallelism:

The compiler is responsible for making tradeoffs between register usage and parallelism.
Register usage refers to the number of registers that a particular instruction uses at runtime,
while parallelism refers to how many instructions can be executed in parallel by a single
processor (or multiple processors).
The recommended strategy for producing the best performance is often called “ register-level
profiling,” which means measuring how much time your program spends using each register
during execution and then modifying your code so it runs more efficiently on those scarce
resources. A good example would be if you have an instruction like ADD_EQ which needs
two operands: A and B; this means you need both these registers available at once when
calculating addition or subtraction instead of keeping only one free for other purposes such as
storing data into memory or passing arguments from function callbacks down through loops
where needed. This can be improved by putting the two operands into separate registers, which
allows you to use one temporarily while performing other instructions on it and then later
retrieve the result of your calculation. If a processor has only one register available at any
given time, then this is known as “register pressure” or “register contention“.

The registers in a processor are not infinite, and if you try to use more than the available
number of them at any given time then performance will suffer. This means that if you have a
loop that is iterating over an array, for example, then it may be beneficial to move some of the
data into memory before going through each iteration rather than keeping all of it in registers
(which would require a lot of copying back and forth).

Phase Ordering Between Register Allocation and Code Scheduling

Register allocation and code scheduling are two phases of the compiler. Register allocation is
the first phase that allocates registers for a particular instruction, whereas code scheduling
refers to the second phase where instructions are placed into machine language. The two
phases are independent of each other; however, they can be performed in any order or
interleaved with each other.

The register allocation process consists of three steps:


1. Add registers (addressing mode)
2. Allocate new register (allocator)
3. Assign value to those registers based on their usage by different instructions within
the program text (assignment).

A specific number of registers must be allocated to each instruction in order for it to execute
properly at run time; this need not be the same along all possible execution paths through the
program. The ordering of the two phases involves a tradeoff. If register allocation is performed
first, reusing the same register for different values introduces extra anti- and output
dependences that restrict the scheduler's freedom to reorder instructions. If scheduling is
performed first, keeping many values live at the same time increases register pressure and may
force the allocator to insert spill code, which in turn lengthens the schedule. Practical
compilers therefore either interleave the two phases or repeat one phase after the other.

Control Dependence

Control-dependence constraints are among the most common constraints that a code scheduler must
respect. An instruction is control dependent on a branch (for example, the condition of an IF
statement) when the outcome of the branch determines whether the instruction is executed at all.
For instance, the statements in the 'then' part of an IF statement are control dependent on its
condition, so their effects must not become visible before the condition has been evaluated.

Control dependence therefore restricts code motion. An instruction that is guarded by a branch
cannot simply be moved above that branch, because it would then execute on paths where the source
program would not have executed it; conversely, an instruction that executes on every path cannot
be pushed below the branch into only one of its arms. Moving guarded code above its branch is only
safe when it is executed speculatively, with support for discarding or suppressing its effects,
which is the topic of the next subsection.

Speculative Execution Support:

Speculative execution is when a processor makes assumptions about the future state of its
environment (i.e., instructions that are executed on other processors), then executes those
instructions based on these assumptions. The result of this process is usually more efficient
than all assumptions made at once; however, there are costs associated with making these
kinds of decisions at runtime rather than waiting until they occur during program execution
(like in normal code). Speculative executions are often seen as an alternative approach because
they allow programmers to make runtime decisions before they actually need them instead of
waiting until after everything has been executed already!
The term speculative execution is used to describe a processor’s ability to make assumptions
about future instructions. This can be done by using thread-level parallelism, which allows
multiple instructions from different threads of execution to run at the same time.

A Basic Machine Model:

The basic machine model consists of a register file, an instruction fetch unit, and an execution
unit. The register file is used to store the contents of memory locations while they are being
modified; it holds only one value at a time (a value can be either in or out). The instruction
fetch unit fetches instructions from memory that are needed by the program being executed (or
if there is no more space left in the register file, then it provides additional storage). After
being fetched, these instructions are decoded into machine code before being executed by the
execution unit which determines how they need to be executed within their context as well as
some other details such as whether or not there are any exceptions that have been detected
during compilation so far.

What are the different levels of Code Scheduling in computer architecture?

Code scheduling covers dependency detection and resolution as well as parallel optimization. Code
scheduling is generally performed in conjunction with traditional compilation. A code scheduler
gets as input a set, or a sequence, of executable instructions, and a set of precedence constraints
enforced on them, frequently in the form of a DAG. As output, it undertakes to deliver, in each
scheduling step, an instruction that is dependency-free and that is the best choice for filling
the currently available execution slot.

Traditional non-optimizing compilers can be treated as including two major parts. The front-
end part of the compiler implements scanning, parsing, and semantic analysis of the source string
and makes an intermediate representation. This intermediate form is generally described by an
attributed abstract tree and a symbol table. The back-end part, in turn, creates the object code.
Traditional optimizing compilers speed up sequential execution and reduce the needed
memory space generally by removing redundant operations. Sequential optimization needs a
program analysis, which includes control flow, data flow, and dependency analysis in the front-
end part.

There are two different approaches to merging traditional compilation and code scheduling. In
the first, code scheduling is integrated into the compilation procedure. In this method, the code
scheduler facilitates the results of the program analysis make by the front-end part of the
compiler.

The code scheduler generally follows the traditional sequential optimizer in the back-end part,
before register allocation and subsequent code generation. This type of code scheduling is known
as pre-pass scheduling.
The other approach is to help a traditional (sequentially) optimizing compiler and carry out code
scheduling afterward called post-pass scheduling.

Code scheduling can be implemented at three different levels, namely basic block, loop, and
global level.

Code scheduling techniques

The associated scheduling category or techniques are known as basic block (or local), loop, and
global techniques. These techniques increase performance in the order listed.

(i) Basic Block Scheduling: In this case, scheduling and code optimization are accomplished
independently for each basic block, one after another.
(ii) Loop-Level Scheduling: The next level of scheduling is loop-level scheduling. Here,
instructions belonging to successive iterations of a loop can generally be overlapped, resulting in
considerable speed-up.

It is generally accepted that loops are an important source of parallelism, particularly in


mathematical programs. Hence, it is possible that for highly parallel ILP processors, including
VLIW architectures, compilers must implement scheduling at least at the loop level. Therefore, a
huge number of techniques have been developed for scheduling at this level.



(iii) Global Scheduling − The most efficient method to schedule is to do it at the largest possible
level, using global scheduling techniques. Therefore, parallelism is desired and derived beyond
basic blocks and simple loops, in such constructs as compound program methods containing
loops and conditional control constructs.

Global Code Scheduling in Compiler Design

Global Code Scheduling in compiler design is the process of rearranging the order of execution of
code so as to improve performance.
In the fifth phase of compiler design, code optimization is performed. There are various code
optimization techniques, but the order of execution of code in a computer program also matters in
code optimization. Global code scheduling comprises the analysis of different code segments and
finding out the dependencies among them.

The goals of Global Code Scheduling are:


 Optimize the execution order
 Improving the performance
 Reducing the idle time
 Maximize the utilization of resources

There are various techniques to perform Global Code Scheduling:


(a) Primitive Code Motion
(b) Upward Code Motion
(c) Downward Code Motion
(a) Primitive Code Motion
Primitive Code Motion is one of the techniques used to improve performance in Global Code
Scheduling. As the name suggests, Code motion is performed in this. Code segments are
moved outside of the basic blocks or loops which helps in reducing memory accesses, thus
improving the performance.

Goals of Primitive code motion:


 Eliminates repeated calculations
 Reduces redundant computations
 Improving performance by reducing number of operations

Primitive Code Motion can be done in 3 ways as follows

Code Hoisting: In this technique, the code segment is moved from inside a loop to outside the
loop. It is done when the output of the code segment does not change with loop’s iteration. It
reduces loop overhead and redundant computation.

 C++
//before code hoisting
#include <iostream>
using namespace std;

int main() {
    int x, y, b, a;
    x = 1, y = 2, a = 0;
    while (a < 10) {
        b = x + y;    // loop-invariant computation repeated on every iteration
        cout << a;
        a++;
    }
}

//after code hoisting
#include <iostream>
using namespace std;

int main() {
    int x, y, b, a;
    x = 1, y = 2, a = 0;
    b = x + y;        // hoisted out of the loop: computed only once
    while (a < 10) {
        cout << a;
        a++;
    }
}

Code Sinking: In this technique, the code segment is moved from outside to inside the loop. It is
performed when the code’s output changes with each iteration of the loop. It reduces the number of
computations.
 C++
//before code sinking
#include <iostream>
using namespace std;

int main() {
    int a, b;
    a = 0, b = 1;
    for (int i = 0; i < 5; i++) {
        cout << a++;
    }
    for (int i = 0; i < 5; i++) {   // a second loop over the same range
        cout << b++;
    }
}

//after code sinking
#include <iostream>
using namespace std;

int main() {
    int a, b;
    a = 0, b = 1;
    for (int i = 0; i < 5; i++) {
        cout << a++;
        cout << b++;   // sunk into the existing loop, removing the second loop
    }
}

Memory Access Optimization: In this technique, the memory’s read or write operation is
moved out from the loops or blocks. This method eliminates redundant memory accesses and
enhances cache utilization.

(b) Upward Code Motion

Upward Code Motion is another technique used to improve performance in Global Code
Scheduling. Here, a code segment is moved out of a block or loop to a position above the
block.
Goals of Upward Code Motion:

• Reduces computational overheads
• Eliminates repeated calculations
• Improves performance

Steps involved in Upward Code Motion:

• Identification of loop-invariant code
• Moving the invariant code upward
• Updating variable dependencies accordingly

C++
// before upward code motion
#include <iostream>
using namespace std;

void sum(int a, int b) {
    int ans = 0;
    for (int i = 0; i < 4; i++) {
        ans += a + b;       // a+b is recomputed on every iteration
        cout << ans;
    }
}

// after upward code motion
void sum(int a, int b) {
    int z = a + b;          // a+b is assigned to a variable z to avoid
                            // repeated computation of a+b inside the for loop
    int ans = 0;
    for (int i = 0; i < 4; i++) {
        ans += z;
        cout << ans;
    }
}
(c) Downward Code Motion
This is another technique used to improve performance in Global Code Scheduling. Downward
Code Motion is much the same as Upward Code Motion, except that here the code segment is
moved downward, past a block or loop, towards the point where its result is actually needed. It
should be applied in such a way that it does not add new dependencies to the code.
Goals of Downward Code Motion:

(i) Reduces computational overheads
(ii) Eliminates repeated calculations
(iii) Improves performance

Steps involved in Downward Code Motion:

1. Identification of loop-invariant code
2. Moving the invariant code downward
3. Updating variable dependencies accordingly

C++
// before downward code motion
#include <iostream>
using namespace std;

void add(int a, int b) {
    int ans = a + b;
    for (int i = 0; i < 100; i++) {
        ans += i;
        cout << ans;        // printed on every iteration
    }
}

// after downward code motion
void add(int a, int b) {
    int ans = a + b;
    cout << ans;            // moved out of the loop; it now executes only once
    for (int i = 0; i < 100; i++) {
        ans += i;
    }
}
Updating Data Dependencies

Code motion techniques help to improve the performance of the code, but at the same time they
can introduce errors. To prevent this, it is important to update data dependencies; this helps to
ensure that the moved code behaves correctly.
Steps involved:

Step 1: Analyse the code segment that is being moved and note all the variables that it depends
on.

Step 2: If a code segment is moved from outside a block to inside it, some of the variables or
functions it uses may become unavailable because of the change in scope. We need to introduce
new declarations inside the block.

Step 3: References to the variables used by the moved code are updated: replace references to
variables that were defined outside the block with references to the new declarations inside the
block.

Step 4: If the code segment includes assignments to variables, make sure those assignments are
updated accordingly as well.

Step 5: Finally, verification is done to ensure that the moved code gives the correct result and
does not produce any errors.
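A minimal, hypothetical sketch (not taken from the lecture source) of Steps 2 and 3: when the computation of scaled is moved into the if block, a new declaration is introduced inside the block and the reference is updated to use it. The function names are illustrative only.

C++
#include <iostream>
using namespace std;

// Before motion: 'scaled' is declared and computed outside the block that uses it.
void report_before(int value, bool verbose) {
    int scaled = value * 10;
    if (verbose) {
        cout << scaled << '\n';
    }
}

// After motion: the computation is moved into the block.  A new declaration
// of 'scaled' is introduced inside the block (Step 2) and the reference is
// updated to point to that new declaration (Step 3).
void report_after(int value, bool verbose) {
    if (verbose) {
        int scaled = value * 10;   // new declaration inside the block
        cout << scaled << '\n';    // reference updated to the new declaration
    }
}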

Global Scheduling Algorithms

Global scheduling algorithms are used to improve performance by reducing execution time
and maximizing resource utilization.

1. Trace Scheduling:
In the trace scheduling algorithm, we rearrange the instructions along traces (frequently
executed paths through the program). This helps in improving the performance of the code.

Goals of the Trace Scheduling Algorithm:

• Minimize branch mispredictions
• Maximize instruction-level parallelism
Steps in Trace Scheduling
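As a rough source-level illustration (a sketch, not from the lecture source; process and fixup are hypothetical names): along the hot trace where rare_flag is false, the computation of b can be scheduled before the branch, with compensation code on the rarely taken path to keep the result correct.

C++
// Hypothetical helper, used only for this illustration.
int fixup(int a) { return a - 1; }

// Original: the branch is almost never taken, so the hot trace is
//   a = x * 2;  ->  b = a + 10;  ->  return b;
int process(int x, bool rare_flag) {
    int a = x * 2;
    if (rare_flag)
        a = fixup(a);        // off-trace work
    int b = a + 10;
    return b;
}

// After trace scheduling (conceptually): instructions on the hot trace are
// scheduled together as if the branch were not there; compensation code on
// the rarely taken path restores correctness.
int process_traced(int x, bool rare_flag) {
    int a = x * 2;
    int b = a + 10;          // moved up along the hot trace
    if (rare_flag) {         // off-trace: compensation code
        a = fixup(a);
        b = a + 10;          // recomputed so the final result is unchanged
    }
    return b;
}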

2. List Scheduling:
In the list scheduling algorithm, the overall execution time of a program can be reduced by
reordering instructions. Instructions are rescheduled based on their readiness (whether their
operands are available) and on resource constraints.

Advantages of the List Scheduling Algorithm:

• It is a flexible algorithm
• It can handle multiple constraints, such as resource constraints and instruction latencies
• It is used in modern compilers
Steps in List scheduling
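To make the idea concrete, here is a minimal sketch (not from the lecture source) of a list scheduler for an imaginary single-issue machine: instructions whose predecessors have completed form the ready list, and each cycle the ready instruction with the highest priority (here, simply the longest latency) is issued. The instruction set, latencies and priority function are all assumptions made for the example.

C++
#include <cstdio>
#include <vector>

// Each instruction has a latency and the list of instructions that must
// finish before it can start (its predecessors in the dependence DAG).
struct Instr {
    const char* name;
    int latency;
    std::vector<int> preds;   // indices of predecessor instructions
};

int main() {
    std::vector<Instr> prog = {
        {"load  a",        3, {}},        // 0
        {"load  b",        3, {}},        // 1
        {"mul   t = a*b",  2, {0, 1}},    // 2 depends on 0 and 1
        {"add   s = t+1",  1, {2}},       // 3 depends on 2
    };

    std::vector<int>  finish(prog.size(), -1);     // cycle when each result is ready
    std::vector<bool> issued(prog.size(), false);
    int cycle = 0;
    int remaining = (int)prog.size();

    while (remaining > 0) {
        // Build the ready list: not yet issued and all predecessors completed;
        // pick the ready instruction with the longest latency as the priority.
        int best = -1;
        for (int i = 0; i < (int)prog.size(); i++) {
            if (issued[i]) continue;
            bool ready = true;
            for (int p : prog[i].preds)
                if (finish[p] < 0 || finish[p] > cycle) ready = false;
            if (!ready) continue;
            if (best < 0 || prog[i].latency > prog[best].latency) best = i;
        }
        if (best >= 0) {                           // issue the chosen instruction
            issued[best] = true;
            finish[best] = cycle + prog[best].latency;
            remaining--;
            std::printf("cycle %d: issue %s\n", cycle, prog[best].name);
        }
        cycle++;                                   // otherwise the machine stalls
    }
    return 0;
}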

3. Modulo Scheduling:

In the modulo scheduling algorithm, the iteration count must be known in advance. It works by
dividing the iterations of a loop into groups and scheduling instructions from different iterations
in parallel. It aims at exploiting parallelism within loops.
Steps in Modulo scheduling

4. Software Pipelining:

In this algorithm, loop iterations are overlapped to improve performance and reduce execution
time. Several iterations of the same loop are in progress at the same time. The main aim of this
algorithm is to turn loop-level parallelism into instruction-level parallelism.
Steps in software pipelining
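The sketch below (not taken from the lecture source; the function name and signature are assumptions) shows, at the source level, what a software-pipelined version of the loop c[i] = a[i] * b[i] looks like: a prologue starts the first iteration, the kernel finishes iteration i while starting iteration i+1, and an epilogue drains the last iteration.

C++
#include <cstddef>

// Conceptual software pipelining of  c[i] = a[i] * b[i].
void multiply(const int* a, const int* b, int* c, std::size_t n) {
    if (n == 0) return;

    // Prologue: start the first iteration's work (the loads).
    int x = a[0];
    int y = b[0];

    // Kernel: in each pass, finish iteration i (multiply and store) while
    // beginning iteration i+1 (the loads).  On a VLIW machine these two
    // halves could be issued in the same long instruction word.
    for (std::size_t i = 0; i + 1 < n; i++) {
        int prod = x * y;       // belongs to iteration i
        x = a[i + 1];           // belongs to iteration i+1
        y = b[i + 1];
        c[i] = prod;
    }

    // Epilogue: finish the last iteration.
    c[n - 1] = x * y;
}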

Advanced Code Motion Techniques:

In the code optimization phase of the compiler, the main aim is to increase the overall
performance of the code. Code motion techniques are used to improve the performance of the
program.

Code motion techniques
Loop Invariant Code Motion: computations whose result does not change between loop
iterations are moved out of the loop (as in code hoisting above).

C++
// before moving loop-invariant code out of the loop
#include <iostream>
using namespace std;

int main() {
    int sum = 0;
    for (int i = 0; i < 5; i++) {
        sum = 5;            // loop-invariant: assigns the same value on every iteration
        cout << i;
    }
    return 0;
}

// after moving loop-invariant code out of the loop
int main() {
    int sum = 0;
    sum = 5;                // hoisted out of the loop
    for (int i = 0; i < 5; i++)
        cout << i;
    return 0;
}
Partial Redundancy Elimination

C++
// before
#include <iostream>
using namespace std;

void prod(int a, int b) {
    int ans = 0;
    for (int i = 0; i < 4; i++) {
        ans += a * b;       // a*b is recomputed on every iteration
        cout << ans;
    }
}

// after
void prod(int a, int b) {
    int z = a * b;          // computed once, before the loop
    int ans = 0;
    for (int i = 0; i < 4; i++) {
        ans += z;
        cout << ans;
    }
}
In this example, we remove the partial redundancy by computing a*b only once instead of
computing it repeatedly inside the loop.
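Strictly speaking, partial redundancy also covers expressions that are redundant only on some paths. The following sketch (not from the lecture source; the function names are illustrative) shows that case: a*b is computed on one branch and again after the join, so inserting the computation on the other branch lets the later occurrence reuse a temporary.

C++
// Before PRE: a*b is computed twice whenever the condition is true.
int f_before(int a, int b, bool cond) {
    int x = 0;
    if (cond)
        x = a * b;      // a*b computed on this path only
    int y = a * b;      // partially redundant: recomputed when cond is true
    return x + y;
}

// After PRE: the computation is inserted on the other path as well,
// so the later occurrence can reuse the temporary.
int f_after(int a, int b, bool cond) {
    int t;
    int x = 0;
    if (cond) {
        t = a * b;
        x = t;
    } else {
        t = a * b;      // inserted computation
    }
    int y = t;          // no longer needs to recompute a*b
    return x + y;
}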
Dead Code Elimination: This removes code that is unreachable or that has no effect on the
program.

C++
// before dead code elimination
int sum(int a, int b) {
    int sum = 0;
    if (a > 0)
        sum = a + b;
    return sum;
    // anything placed here, after the return (for example an alternative
    // "else"-style computation), can never be executed:
    // ...
}

// after dead code elimination
int sum(int a, int b) {
    int sum = 0;
    if (a > 0)
        sum = a + b;
    return sum;
}
In this example, the unreachable code after the return statement is removed.

Loop Carried Code Motion: This is similar to loop invariant code motion. In this, the loop-invariant
part of the loop is moved out of the loop to reduce the number of computations.
Interaction with Dynamic Schedulers

Dynamic schedulers are used inside processors. They reorder instructions at run time and help
in maximizing the utilization of resources.

Goals of Dynamic Scheduler

• Improves efficiency of the processor
• Exploits instruction-level parallelism

When interacting with dynamic schedulers, the analysis of instruction dependencies must be
done correctly. A dynamic scheduler can determine the order of execution of instructions
correctly only when the information it works with is accurate. Thus, dependency analysis is
important when interacting with dynamic schedulers; it involves analysing data dependencies,
control dependencies and resource dependencies. Schedulers work by assigning priorities to
instructions so that they can be executed in a suitable order, and they make decisions based on
the availability of resources. Dynamic schedulers also help in handling conflicts when several
instructions compete for the same resource. Code optimization, in turn, helps the dynamic
scheduler to improve instruction scheduling and resource utilization.
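As a small illustration (a sketch, not from the lecture source; the function names are illustrative only), one way a compiler can cooperate with a dynamic scheduler is to break long data-dependence chains so that independent instructions are available to issue in the same cycle.

C++
// A long dependence chain: each addition waits for the previous one, so an
// out-of-order core cannot overlap them even if it has spare functional units.
int chain(int a, int b, int c, int d) {
    int t = a + b;      // (1)
    t = t + c;          // (2) depends on (1)
    t = t + d;          // (3) depends on (2)
    return t;
}

// After reassociation: (a + b) and (c + d) are independent, so the dynamic
// scheduler can issue them in the same cycle; only the final add must wait.
int tree(int a, int b, int c, int d) {
    int x = a + b;
    int y = c + d;
    return x + y;
}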

Conclusion

As is clear from the above, the compiler is a very important part of a programming language:
it helps programs run quickly and smoothly. But one must also be aware of the constraints that
the hardware imposes on code scheduling. These constraints do not affect all compilers equally;
some are powerful enough to handle them well, while others cannot handle them at all. So, if
you want to write programs that perform well on the machines available today or on future
generations, there are certain things that need to be taken care of before writing even a single
line of code.

Practice questions
What is code scheduling in compiler design?

Reference
https://www.geeksforgeeks.org/code-scheduling-constraints/

END OF THE CLASS LECTURE
