Computer Organization and Architecture
About the Authors
Tarun Kumar Ghosh is currently Assistant Professor, Department of CSE, Haldia Institute of
Technology, West Bengal. Previously, he had served as a Lecturer in the Department of CSE, Asansol
Engineering College, West Bengal. He received his ME degree in CST from B E College (deemed
University), Howrah (currently known as Bengal Engineering and Science University) and his B.Tech.
in CSE from the University of Calcutta.
Prof. Ghosh is a member of ACM (Association for Computing Machinery), New York, USA. His
areas of interest include Computer Architecture, Interconnection Networks, VLSI, and Operating
Systems. He has published several research papers in various international and national conferences
and conducted three training programs. A coordinator at IGNOU, Haldia Study Centre, Prof. Ghosh is
also the recipient of a UGC Scholarship at the post-graduate level.
Anindya Jyoti Pal is currently Assistant Professor and Head of Information Technology, Heritage
Institute of Technology, West Bengal. He had served as a Lecturer at JIS College of Engineering. He
has done a B.Sc. (Physics Hon.), B.Tech. (Computer Science and Engineering) and M.Tech (Computer
Science and Engineering). He has authored numerous books and has published and presented several
papers in different national and international journals and conferences.
A member of the Computer Society of India, Prof. Pal has varied research interests in areas such as soft computing and evolutionary algorithms for NP-hard and NP-complete problems. He has also conducted a faculty development programme at the Heritage Institute of Technology.
Information contained in this work has been obtained by Tata McGraw-Hill, from sources believed to be reliable.
However, neither Tata McGraw-Hill nor its authors guarantee the accuracy or completeness of any information
published herein, and neither Tata McGraw-Hill nor its authors shall be responsible for any errors, omissions, or
damages arising out of use of this information. This work is published with the understanding that Tata McGraw-
Hill and its authors are supplying information but are not attempting to render engineering or other professional
services. If such services are required, the assistance of an appropriate professional should be sought.
Typeset at The Composers, 260, C.A. Apt., Paschim Vihar, New Delhi 110 063 and printed at
Krishna Offset, 10/122, Vishnu Gali, Vishwas Nagar, Delhi 110 032
Contents
Preface
Roadmap to the Syllabus (CS-303)
Roadmap to the Syllabus (CS-404)
1. Fundamentals of Computers
1.1 Introduction
1.2 Digital Computers
1.3 Layers in a Computer System
1.4 Types of Computers
1.5 History of Computers
Review Questions
Solved Problems
2. Data Representation and Computer Arithmetic
2.1 Data Types
2.2 Number Systems
2.3 Complements of Numbers
2.4 Binary Data Representation
2.5 Guard Bits and Truncation
2.6 Multiplication of Unsigned and Signed Integers
2.7 Division of Unsigned Integers
Review Questions
Solved Problems
3. Microoperation and Design of Arithmetic Logic Unit
3.1 Register Transfer Microoperation
3.2 Bus Transfer
3.3 Memory Transfer
3.4 Arithmetic Microoperation
3.5 Design of Some Arithmetic Units
3.6 Logic Unit
3.7 Shifter Unit
3.8 Arithmetic Logic Unit (ALU)
PREFACE
The emergence of computers has revolutionised the design of electronic systems. Nowadays, computers are used in almost all walks of life: education, business, entertainment, games, sports, security, conferences, and many more. Designing a computer is a highly sophisticated and complex task, and over the years different methods have been applied to it.
A brief explanation of the title Computer Organization and Architecture is necessary here.
Computer Organization is a lower level, concrete description of the system that involves how the
constituent parts of the system are interconnected and how they inter-operate in order to realise the
architectural specifications. The organisational issues include the hardware details transparent to the
programmer, like the memory technology, interfacing between the CPU and peripherals, and system-
interconnect components like computer buses and switches. Computer Architecture deals with the
conceptual design and basic overview of a computer system. It refers to the parameters of a computer
system that are visible to the user. It includes the instruction-addressing modes, instruction formats,
instruction sets, I/O mechanism, and related concepts.
This book is the outgrowth of a series of lectures based on the courses Computer Organization and Architecture and Advanced Computer Architecture, delivered over the last several years at different colleges under the West Bengal University of Technology, West Bengal. Our aim is to simplify the subject and make it accessible even to an average student.
This book is intended to serve as a first-level course on Computer Organization and Architecture in
Computer Science, Information Technology, Electronics and Communication Engineering and the
Instrumentation Engineering curricula of West Bengal University of Technology (WBUT), West Ben-
gal. Also, this book will be useful for students of MCA, BCA and diploma courses in Computer
Science. This book is mostly self-contained, assuming that the reader has a basic knowledge of com-
puter programming, number systems and digital logic. For the benefit of those who have no prior
knowledge of digital logic, the book includes an overview of the essential topics of digital electronics in
an Appendix.
The book is divided into eight chapters:
l Chapter 1 presents the basics of digital computers, different layers in a computer system, over-
view of the operating systems, types of computers, brief history of computers and details of Von
Neumann computers.
l Chapter 2 discusses different data-representation methods in computer registers and their relative
advantages and disadvantages, and different arithmetic operations.
l Chapter 3 introduces different types of registers, a register transfer language (RTL) and shows
how RTL is used to express microoperations in symbolic form. This chapter also explains how a
common bus can be constructed. Some important arithmetic units are also designed. Lastly, a
hypothetical arithmetic logic unit is developed to show the hardware design of the most common
microoperations.
l Chapter 4 covers the parameters of a memory, the concept of memory hierarchy, and basic
concepts associated with main memory, secondary memory and cache memory. The concept and
operation of associative memory is explained in detail. The overview of virtual memory is also
elucidated.
l Chapter 5 contains basics of instruction sets, and the instruction formats. Different CPU organi-
zations are discussed with the help of examples. The basics of an instruction cycle and variety of
addressing modes are explained. The last section introduces CISC and RISC concepts and dis-
cusses their characteristics and relative advantages and disadvantages.
l Chapter 6 presents control unit design using both hardwired and microprogramming approaches.
Both approaches are illustrated through examples. The concept of nanoprogramming is introduced
at the end of the chapter.
l Chapter 7 explains the techniques that computers use to communicate with peripheral devices.
Three modes of data transfer between computer and peripherals are discussed in detail: pro-
grammed I/O, interrupt I/O and direct memory access (DMA). Bus arbitration and input–output
processor (IOP) are introduced.
l Chapter 8 is devoted to parallel processing techniques. This chapter introduces computer perfor-
mance measurement parameters. Flynn’s classification of computers is discussed. Pipeline, vector
processing, superscalar, super-pipeline and VLIW are presented in sufficient detail. This chapter
also includes descriptions of array processors and multiprocessors, along with issues like inter-
connection networks and cache coherence problem.
The book is designed for the 3rd semester paper (CS-303) of the CSE branch, 4th semester paper
(CS-404) of IT/ECE (old)/AEIE (old) branches, 4th semester paper (CS-404(EI)) of AEIE (new)
branch and 5th semester paper (EC-503) of ECE (new) branch of WBUT. Chapter 8 is not necessary
for paper CS-303 of the CSE branch, because 4th CSE semester students will study this chapter in
detail in the form of a paper (CS-403) on Advanced Computer Architecture. Apart from the CSE
branch, all the above-mentioned branches must study all chapters of the book to cover their respective
papers on the subject.
In order to make the book more user-friendly, we have incorporated the following important features:
l The book is presented in lucid and concise manner, covering all the topics in the syllabus of the
following branches: CSE/IT/ECE/AEIE.
l The book contains an exhaustive set of about 150 solved problems and 200 review questions, in the form of multiple-choice and short-answer-type questions.
l 10 solved question papers (2004–2008) are given for reference.
l One appendix on Digital Devices and Logic Design is included.
We are grateful to all our colleagues, especially Mr Susmit Maity, Mr S K Sahnawaj, Mr Apratim
Mitra, Mr Soumen Saha, Mr Sourav Mandal, Mr Syed Mamud Hossain, Mr Palash Roy and Mr
Tushar Kanti Das for their encouragement and helpful comments while writing this book. We are
thankful to the following reviewers who have gone through the book and provided valuable suggestions.
Preface xi
T K GHOSH
A J PAL
ROADMAP TO THE SYLLABUS
Computer Organization
Code: CS-303
Contact: 3L
Credit: 3
Concepts and Terminology: Digital computer components. Hardware and software and their dual
nature, role of operating systems (OS). Von-Neumann concept.
The ALU: ALU organization, integer representation, serial and parallel adders, 1’s and 2’s comple-
ment arithmetic, multiplication of signed binary numbers, floating point number arithmetic, overflow
detection, status flags, registers.
Timing diagrams, T-states, controlling arithmetic and logic instructions. Instruction sequencing with examples. Introduction to micro-programming, variations in micro-programming configuration.
General Organization: Interconnecting system components. Instruction word formats, addressing modes,
interfacing buses, timing diagrams, examples from popular machines.
Computer Organization
and Architecture
Code: CS-404
Contact: 3L
Credit: 3
Concepts and Terminology: Digital computer concepts; Von-Neumann concept; Hardware and Soft-
ware and their nature; structure and functions of a computer system, role of operating system.
Memory Unit: Memory classification, characteristics; organization of RAM, address decoding ROM/
PROM/EEPROM; magnetic memories, recording formats and methods, disk and tape units; concept of
memory map, memory hierarchy, associative memory organization; cache introduction, techniques to
reduce cache misses, concept of virtual memory and paging.
CPU Design: The ALU – ALU organization, integer representation, 1’s and 2’s complement
arithmetic; serial and parallel Adders; implementation of high speed adders—carry look ahead and
carry save adders; multiplication of signed binary numbers—Booth’s algorithm; division algorithms—
restoring and non-restoring; floating point number arithmetic; overflow detection, status flags.
Instruction Set Architecture: Choice of instruction set; instruction word formats; addressing modes.
Control Design: Timing diagrams; T-states, controlling arithmetic and logic instruction, control struc-
tures; hardwired and micro programmed, CISC and RISC characteristics.
Pipelining—general concept, speed up, instruction and arithmetic pipeline; examples of some pipeline
in modern processors, pipeline hazards; Flynn’s classification—SISD, SIMD, MISD, MIMD architec-
tures, vector and array processors and their comparison, concept of multiprocessor: centralized and
distributed architectures.
Input/Output Organization: Introduction to bus architecture, effect of bus widths, programmed and
interrupt I/O, DMA.
CHAPTER
1
Fundamentals of Computers
1.1 INTRODUCTION
Nowadays computers are used in almost all walks of life: education, business, entertainment, games, sports, security, conferences, etc. The design of a computer is a highly sophisticated and complex task, and over the years different methods have been applied to it. Parameters such as performance, cost, storage capacity and type of use determine the choice of the concepts and methods used in designing a computer. The study of these concepts and techniques is the goal of this book.
A computer is an automatic machine made up of electronic and electro-mechanical devices, which processes data.
Characteristics of Computers
(a) Speed—It operates at a high speed of several million operations per second.
(b) Accuracy—It is capable of performing calculations to an accuracy of up to 64 decimal digits.
(c) Storage—It is capable of storing large volumes of information.
(d) Decision making—It is capable of making decisions according to the supplied information.
1.2 DIGITAL COMPUTERS
Most modern computers are digital computers, though some are analog computers. An analog computer senses input signals whose values keep changing continuously. It allows physical quantities such as pressure, acceleration, power and force to be represented by electrical current or voltage signals. Digital computers (or simply computers) perform calculations on numerical or digital values; most of today's computers fall into this class. Digital computers use the binary number system, which has two digits: 0 and 1. A binary digit is called a bit. Information is represented in digital computers in groups of bits. By using different coding techniques, groups of bits can be made to represent discrete symbols, such as decimal digits or letters of the alphabet.
The Basic Input-Output System (BIOS) is a program consisting of I/O drivers, which are programs that perform various I/O operations on behalf of the peripheral devices attached to the computer. When a program needs an I/O operation, it calls the appropriate driver. BIOS programs are embedded on a chip and are therefore called firmware.
When the power is first switched on, this program runs. The main task of the BIOS is to identify and initialize component hardware such as the hard drive, keyboard and mouse, and then to prepare the computer for work by loading the operating system into main memory from the hard disk. This process is known as booting (short for bootstrapping). In summary, the BIOS can be said to be a coded program embedded on a chip (firmware) that recognizes and controls the various peripheral devices that make up a computer.
1.4 TYPES OF COMPUTERS
Digital computers can be categorized into four different types based on performance, size and cost: mainframe computers, minicomputers, microcomputers and supercomputers.
Mainframe computer This is a large computer system consisting of thousands of ICs, which may be physically distributed over more than one location. It is designed for intensive computational tasks and is used by large organizations like banks, railways and hospitals. A mainframe is often shared by multiple users connected to it through several terminals, and it is very expensive. Examples include the IBM System/360, the Burroughs B 5000 and the UNIVAC 1100/2200 series.
Minicomputer This class of computers is a smaller and slower version of the mainframe, so its cost is much lower. A minicomputer is designed to serve multiple users simultaneously and is used by smaller organizations and research centres. Computers like DEC's PDP series, the HP 3000 series and the CDC 1700 are minicomputers.
Microcomputer The invention of the microprocessor (a CPU on a chip) gave rise to the microcomputer: a small, low-cost, single-user machine, also called a personal computer (PC). This inexpensive computer is designed for use on a small desk, or even to be carried, and the class is very popular owing to its size and high performance-to-cost ratio. A more powerful microcomputer designed for scientific applications is called a workstation. The IBM PC series based on Intel's 80x86 family, Apple's Macintosh and machines based on Motorola's 680x0 family are examples of microcomputers.
Supercomputer This class comprises the most powerful and expensive computers available today. A supercomputer is designed for speed, using multiprocessing and parallel processing techniques, and is used for complex scientific applications like weather forecasting, satellite launching, climate research and nuclear research. Popular supercomputers are the Cray-1, the Power-PC and Fujitsu's VP 200. An important point to note is that today's supercomputer tends to become tomorrow's normal computer.
1.5 HISTORY OF COMPUTERS
Modern computers are the result of the combined efforts of several scientists over the last century. The history of computers is divided into two eras: the mechanical era and the electronic era.
The abacus is nearly 2000 years old. It is very useful for teaching simple mathematics to children.
The abacus has a wooden frame with vertical rods on which wooden beads slide. Arithmetic problems
are solved when these beads are moved around. Beads are considered counted, when moved towards
the beam that separates the two decks: upper deck and lower deck. Each bead in the upper deck has a
value of five and each bead in the lower deck has a value of one.
Mechanical Computer/Calculator The mechanical calculator was invented by the French philosopher Pascal in 1643. It could add and subtract numbers of up to 8 columns. Decimal numbers were engraved on counter wheels much like those in a car's odometer.
Babbage's Difference Engine This was the first computer to perform multi-step operations automatically, i.e. without human intervention at every step, and was designed by the mathematician Charles Babbage in the 19th century. A difference engine is a special-purpose mechanical calculator-cum-computer designed to tabulate polynomial functions. Since logarithmic and trigonometric functions can be approximated by polynomials, such a machine is more general than it appears at first.
Babbage's Analytical Engine The Analytical Engine was the improved version of the Difference Engine, and is considered to be the first general-purpose programmable computer ever designed. It was a decimal computer with a word length of 50 decimal digits and a storage capacity of 1000 numbers. An interesting feature of this machine was its handling of conditional branch instructions. Its two major components were an ALU called the mill and a main memory called the store. A program for the Analytical Engine (Fig. 1.6) was composed of two sequences of punched cards: operation cards and variable cards.
A punched card is a stiff card in which holes are punched; the holes are sensed by a machine. Operation cards were used to select the operation to be performed by the mill, and variable cards were used to specify the locations in the store from which inputs were to be taken or to which results were to be sent.
Fundamentals of Computers 1.7
IAS Computer/Von-Neumann Computer In 1946, Von Neumann and his colleagues began the design of a new stored-program computer, now referred to as the IAS computer, at the Institute for Advanced Study, Princeton. Nearly all modern computers still use this stored-program concept. This concept is based on three main principles.
Instruction IAS instructions are 20 bits long, so that two instructions can be stored in each 40-bit memory location. An instruction consists of two parts, as shown in Fig. 1.8: an 8-bit op-code (operation code), which defines the operation to be performed (add, subtract, etc.), and a 12-bit address part, which can identify any of 2^12 = 4096 memory locations that may be used to store an operand of the instruction.
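To make the field layout concrete, here is a minimal Python sketch (our own illustration; the helper names are hypothetical) that splits a 40-bit IAS-style memory word into its two 20-bit instructions and unpacks each into the 8-bit op-code and 12-bit address described above.

```python
# Illustrative sketch only: field widths follow the text (8-bit op-code,
# 12-bit address, two 20-bit instructions per 40-bit word).

def split_word(word40: int) -> tuple[int, int]:
    """Split a 40-bit memory word into its left and right 20-bit instructions."""
    left = (word40 >> 20) & 0xFFFFF
    right = word40 & 0xFFFFF
    return left, right

def decode_instruction(word20: int) -> tuple[int, int]:
    """Unpack a 20-bit instruction into (op-code, address)."""
    opcode = (word20 >> 12) & 0xFF    # upper 8 bits
    address = word20 & 0xFFF          # lower 12 bits: 0 .. 2**12 - 1 = 4095
    return opcode, address
```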
Reduced Word Length An IAS instruction allows only one memory address. This results in a substantial reduction in word length. Two aspects of the IAS organization make this possible.
1. Fixed registers in the CPU are used to store operands and results. The IAS instructions
automatically make use of these registers as required. In other words, CPU register addresses
are implicitly specified by the op-code.
2. The instructions of a program are stored in the main memory in approximately the sequence in
which they are to be executed. Hence the address of the next instruction pair is usually the
address of the current instruction pair plus one. The need for a next instruction address in the
instruction format is eliminated. Special branch instructions are included to permit the instruc-
tion execution sequence to be varied.
Structure of an IAS Computer The CPU of the IAS computer, as shown in Fig. 1.9, consists of a
data-processing unit and a program control unit. It contains various processing and control circuits
along with a set of high-speed registers (AC, MQ, DR, IBR, PC, IR, and MAR) intended for
temporary storage of instructions or data or memory addresses. The arithmetic-logic circuits of the
data-processing unit (Arithmetic Logic Unit (ALU)) perform the main actions specified by instruc-
tions. The control circuits in the program-control unit (simply control unit) are responsible for fetch-
ing instructions, decoding op-codes, fetching data (operand) from the memory and providing proper
control signals for all CPU actions. An electronic clock circuit is used to generate the basic timing
signals needed to synchronize the operation of the different parts of the system.
The main memory M is used for storing programs and data. A word transfer can take place
between the 40-bit data register (DR) of the CPU and any location M(X) with address X in M. The
address X to be used is stored in a 12-bit address register (MAR). The DR may be used to store an
operand during the execution of an instruction. Two additional registers for the temporary storage of
operands and results are included: the accumulator (AC) and the multiplier quotient register (MQ).
Two instructions are fetched simultaneously from M and transferred to the program control unit. The
instruction that is not to be executed immediately is placed in an instruction buffer register (IBR). The
op-code of the other instruction is placed in the instruction register (IR) where it is decoded. The
address field of the current instruction is transferred to the memory address register (MAR). Another
address register called the instruction address register or the program counter (PC) is used to store the
address of the next instruction to be executed.
Von-Neumann Bottleneck One of the major factors contributing to a computer's performance is the time required to move instructions and data between the CPU and main memory. The CPU has to wait longer to obtain a data word from the memory than from its registers, because the registers are very fast and are logically placed inside the processor (CPU). This CPU–memory speed disparity is referred to as the Von-Neumann bottleneck. The performance problem is reduced by placing a special type of memory called cache memory between the CPU and main memory. The speed of cache memory is almost the same as that of the CPU, so the CPU has almost no waiting time for the required data word to arrive. Another way to reduce the problem is to use a special type of computer known as a Reduced Instruction Set Computer (RISC). This class of computers generally uses a large number of registers, through which most of the instructions are executed, and usually limits access to main memory to a few load and store instructions. The architecture is designed to reduce the impact of the bottleneck by reducing the total number of memory accesses made by the CPU and by increasing the number of register accesses.
REVIEW QUESTIONS
Group-A
1. Choose the most appropriate option for the following questions:
(i) CPU consists of
(a) main memory and ALU
(b) main memory, ALU and control unit
(c) cache memory, ALU and control unit
(d) ALU, control unit and registers.
(ii) Control unit is
(a) a logic unit to provide control signals for different units
(b) used to control input-output from a processor
(c) used to control the CPU
(d) used to fetch the instruction from memory.
(iii) A compiler is
(a) an application program
(b) a system program used to convert a high-level program to low-level program
(c) a part of operating system
(d) a system program used to convert an assembly language program to low-level program.
(iv) An example of popular batch processing operating system is
(a) Windows 98
(b) MS-DOS
(c) UNIX
(d) Windows 2000.
(v) Which of the operating systems supports multiple CPUs through shared main memory?
(a) multiprogramming OS
(b) real-time OS
(c) distributed OS
(d) multiprocessing OS.
(vi) BIOS is
(a) a collection of I/O driver programs
(b) part of OS to perform I/O operations
(c) firmware consisting of I/O driver programs
(d) a program to control one of the I/O peripherals.
(vii) Babbage’s difference engine is a computer
(a) for subtraction
(b) for both addition and subtraction
(c) for performing multi-step operations automatically
(d) for arithmetic and logical operations.
(viii) Stored program computers
(a) store the program and data in the same memory
(b) store the program and data in two separate memories
(c) store programs in memory
(d) store the program and data in the same address of memory.
(ix) Stored program computers
(a) require a permanent memory
(b) need small memory
(c) allow self modifying programs
(d) do not treat themselves as data.
(x) IAS computer introduced the concept of
(a) ALU, main memory, PC, AC and multiplier quotient (MQ) registers for executing instruc-
tions as well as stored program
(b) ALU, AC and MQ for executing an instruction as well as stored program
(c) decimal, fixed-point and floating point ALUs with stored program
(d) ALU, main memory, PC and subroutine calls.
(xi) The Von-Neumann bottleneck is a problem, which occurs due to
(a) small size main memory
(b) high-speed CPU
(c) speed disparity between CPU and main memory
(d) malfunctioning of any unit in CPU.
Group-B
2. What do you mean by a computer system? Describe the functions of different hardware components.
3. Classify the software in general. Discuss each of them in brief.
4. Define: operating system and compiler. Briefly discuss different operating systems with examples.
5. Discuss the different types of computers: mainframe, mini, micro and supercomputer.
6. What is a stored program computer? Discuss its instruction and data formats.
7. Describe the structure of the IAS computer.
8. Write a short note on the Von Neumann (IAS) computer and its bottleneck.
SOLVED PROBLEMS
1. What are the differences between low-level language and high-level language?
Answer
(a) Low-level languages are closer to the computer; that is, low-level programs are generally written using binary codes. High-level languages are closer to humans; that is, their programs are written using English-like instructions.
(b) Low-level language programs are machine dependent; that is, a program written for a particular machine in a low-level language cannot run on another machine. High-level language programs are machine independent.
(c) As far as debugging is concerned, high-level programs can be debugged more easily than low-level programs.
(d) It is more convenient to develop application programs in high-level languages than in low-level languages.
2. What are the differences between machine language and assembly language?
Answer
(a) Machine language instructions are composed of bits (0s and 1s). This is the only language the computer understands: a program may be written in a different language, but ultimately it is converted into machine language. Assembly language instructions are composed of text-type mnemonic codes.
(b) Machine language instructions are difficult to understand and debug, since each instruction is only a combination of 0s and 1s. Assembly language instructions, being closer to human language (i.e. English), are easier to debug.
(c) In terms of execution, machine language is faster than assembly language, because an assembly language program needs a converter called an assembler to translate it into the equivalent machine language program, whereas no converter is needed for a machine language program.
3. Differentiate between compilers and interpreters.
Answer
(a) A compiler is a system program that converts a source program written in a high-level language into the corresponding target code in a low-level language. This conversion is done by the compiler for all instructions at a time. An interpreter is a system program that translates each high-level program instruction into the corresponding machine code; instead of the whole program, one instruction at a time is translated and executed immediately. Popular compiled languages are C, C++, FORTRAN and PASCAL; commonly interpreted languages are BASIC and PERL.
(b) Compiled programs execute more efficiently and faster than interpreted ones, though interpreters are easier to design.
(c) Compilers use more memory space than interpreters.
4. Discuss briefly about Princeton architecture and Harvard architecture.
Answer
Princeton computers are computers with a single memory for both program and data storage. The Von Neumann architecture is also known as the Princeton architecture.
Harvard computers are computers with separate program and data memories. The data memory and the program memory can differ in width, type, etc. A program word and a data word can be fetched in one cycle by using separate control signals: 'program memory read' and 'data memory read'. An example is the Harvard Mark I computer.
5. What is an Operating System (OS)? Briefly describe the major functions of an OS.
Answer
An operating system is a collection of programs and utilities that acts as the interface between the user and the computer. The operating system is a system program that tells the computer how to perform tasks under a variety of conditions. The main objective of an operating system is to create a user-friendly environment.
The following are the main functions of operating systems:
1. Managing the user’s programs.
2. Managing the memories of computer.
3. Managing the I/O operations.
4. Controlling the security of computer.
6. Show the addressing for program and data, assuming von Neumann architecture for storing
the following program:
(a) Assume that a program has a length of 2048 bytes and the program starts from an address
0.
(b) The input data size is 512 bytes and stores from 3000.
(c) The results of 30 bytes generated after program execution are stored at address 4000.
Answer
The memory map places the program at addresses 0 to 2047, the input data at addresses 3000 to 3511, and the results at addresses 4000 to 4029.
12. Suppose a 600 MHz machine performs context switching 60 times per second. How many cycles are there in each time-slice?
Answer
600 MHz = 600 × 10^6 cycles per second.
Therefore, the number of cycles per time-slice = (600 × 10^6)/60 = 10^7.
13. To achieve a speed-up of 4 on a program that originally took 80 ns to execute, what must the execution time of the program be reduced to?
Answer
Speed-up = (execution time before) / (execution time after)
Now, we have speed-up = 4 and execution time before = 80 ns.
Therefore, from the speed-up formula, we get execution time after = 20 ns.
So, to achieve the speed-up of 4, the execution time must be reduced to 20 ns.
CHAPTER
2
Data Representation and
Computer Arithmetic
Computers store in memory or in processor registers the binary information needed to perform operations. The binary information can be instructions or data. An instruction is a group of bits used to instruct the computer to execute a step of the program. Data (sometimes called operands) are numbers and other binary-coded information that are operated on by an instruction to achieve the required result. In this chapter, we show how different data types are represented in binary-coded form in processor registers, and the arithmetic operations that are performed on stored data.
2.1 DATA TYPES
Different user (application) programs use different types of data, based on the problem. A program can operate either on numeric data or on non-numeric data. The different types of non-numeric data are as follows:
l Characters
l Addresses
l Logical data
All non-binary data are represented in computer’s memory or registers in the binary coded form.
Character Data A character may be a digit, a letter or a special symbol. A character is represented by a group of bits. The character set includes upper-case and lower-case letters (26 each), decimal digits (10) and special characters such as +, @ and *. A set of multiple characters usually forms a meaningful unit of data. The standard code used to represent characters is the American Standard Code for Information Interchange (ASCII). This standard uses an 8-bit pattern in which 7 bits specify the character; the 8th bit is generally used as a parity bit for error detection, or is sometimes permanently 1 or 0. Another popular code is the Extended Binary Coded Decimal Interchange Code (EBCDIC), used on large computers such as the IBM System/360. It is an 8-bit code without parity.
Addresses The data or operand field of an instruction is often used as an address. An address may identify a processor register or a memory location from which the required operand value is retrieved.
2.2 NUMBER SYSTEMS
In our day-to-day life, we use decimal numbers built from the digits 0, 1, 2, 3, 4, 5, 6, 7, 8 and 9; in other words, humans are most accustomed to the decimal system. But a computer can only understand information composed of 0s and 1s, which means that the binary number system is the most natural one for computers. However, it is sometimes necessary to use other number systems, such as the hexadecimal, octal or decimal systems.
A number in the number system of base or radix (r) is represented by a set of symbols from r
distinct symbols. The decimal number system uses 10 digits from 0 to 9, thus its base is 10. The
binary system uses two distinct digits 0 and 1, thus its base is 2. For octal system (base r = 8), a
number is represented by 8 distinct digits 0 to 7. The 16 symbols used in hexadecimal number system
(base 16) are 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, A, B, C, D, E and F. Here, the symbols A, B, C, D, E, and F
correspond to the decimal numbers 10, 11, 12, 13, 14, and 15 respectively.
The value of a number is calculated by summing the value of each digit multiplied by the corresponding integer power of r. For example, the decimal number 8752.4
= 8 × 10^3 + 7 × 10^2 + 5 × 10^1 + 2 × 10^0 + 4 × 10^–1
Similarly, the binary number 101101
= 1 × 2^5 + 0 × 2^4 + 1 × 2^3 + 1 × 2^2 + 0 × 2^1 + 1 × 2^0 = 45 in decimal.
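This positional rule translates directly into code. The following Python sketch (our own illustration; the function name is hypothetical) evaluates a digit string in any base r by summing digit × r^position:

```python
# Sketch of the positional-value rule: sum of digit * r**position.

def value_in_base(number: str, r: int) -> float:
    int_part, _, frac_part = number.partition('.')
    value = 0.0
    for i, d in enumerate(reversed(int_part)):
        value += int(d, 16) * r**i            # positions 0, 1, 2, ...
    for i, d in enumerate(frac_part, start=1):
        value += int(d, 16) * r**(-i)         # positions -1, -2, ...
    return value

print(value_in_base('8752.4', 10))   # 8752.4
print(value_in_base('101101', 2))    # 45.0
```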
Decimal Codes As we know, humans understand the decimal system easily and computers process everything in binary; therefore there must be a decimal-to-binary conversion in a computer's input process. Similarly, a binary-to-decimal conversion must be part of the computer's output process. These conversions should be performed very rapidly. To facilitate rapid conversion, several number codes exist that encode each decimal digit separately by a group of bits; such codes are called decimal codes. Two widely used decimal codes are BCD (binary coded decimal) and the Excess-3 code.
BCD (Binary Coded Decimal) BCD is the simplest binary code for representing a decimal number. In BCD code, four bits are used to represent one decimal digit. For example, decimal 5 is represented by 0101. When a decimal number consists of more than one decimal digit, each digit is independently represented by its 4-bit binary equivalent; for example, 39 is represented by 0011 1001.
BCD is a weighted (positional) code: the weights of the four bits representing an individual digit are 8, 4, 2 and 1. For this reason, the BCD code is sometimes called the 8-4-2-1 code. In a weighted code, the bits are multiplied by their weights, and the sum of the weighted bits gives the decimal digit. For example, the code 1001 in BCD (8-4-2-1 code) gives the decimal equivalent 8 × 1 + 4 × 0 + 2 × 0 + 1 × 1 = 9.
The code has the advantage of rapid conversion, but it has a disadvantage in forming complements. For example, the 1's complement of 0011 (decimal 3) is 1100 (decimal 12), which is an invalid BCD code. To solve this problem, another decimal code, called the excess-3 code, is used.
Excess-3 Code In this code, 0011 (decimal 3) is added to the BCD code of each decimal digit of a number. For example, 0110 (decimal 6) becomes 1001 in excess-3 code. The excess-3 code for 435 is 0111 0110 1000. This code has been used in some older computers. Its disadvantage is that it is not a weighted code; that is, the sum of the weights of the bits is not equal to the corresponding decimal digit.
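Both decimal codes amount to encoding each decimal digit separately in 4 bits; excess-3 simply adds 3 to the digit first. A small Python sketch of our own, for illustration:

```python
# Sketch: encode a decimal string digit by digit in BCD and in excess-3
# (excess-3 adds 3 to each digit before the 4-bit encoding).

def to_bcd(decimal: str) -> str:
    return ' '.join(format(int(d), '04b') for d in decimal)

def to_excess3(decimal: str) -> str:
    return ' '.join(format(int(d) + 3, '04b') for d in decimal)

print(to_bcd('39'))        # 0011 1001
print(to_excess3('435'))   # 0111 0110 1000
```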
2.3 COMPLEMENTS OF NUMBERS
Before discussing numerical data representation, we have to know about the complements of a number, since complements are used in digital computers to simplify the subtraction operation and for logical manipulation. Complements of numbers in a base (radix) r system are of two types: the r's complement and the (r–1)'s complement. Thus for the binary system there are the 2's and 1's complements, and for the decimal system the 10's and 9's complements.
(r–1)'s Complement For a number N having m digits in base r, the (r–1)'s complement of N is defined as (r^m – 1) – N. In the case of decimal numbers, r = 10 and r – 1 = 9, so the 9's complement of N is (10^m – 1) – N. Now, we know that 10^m = 1000…0 (m 0s) in decimal; therefore, 10^m – 1 is equivalent to 99…9 (m 9s) in decimal. For example, for m = 5, we have 10^5 = 100000 and 10^5 – 1 = 99999. It follows that the 9's complement of a decimal number is obtained by subtracting each digit from 9. For example, the 9's complement of 35367 is 99999 – 35367 = 64632.
In the case of the binary number system, r = 2 and r – 1 = 1, so the 1's complement of N is (2^m – 1) – N. Again, we have 2^m = 1000…0 (m 0s) in binary; thus 2^m – 1 equals 111…1 (m 1s) in binary. For example, for m = 5, we have 2^5 = (100000)2 and 2^5 – 1 = (11111)2. Thus the 1's complement of a binary number is obtained by subtracting each bit from 1. However, subtracting a bit from 1 changes the bit from 1 to 0 or from 0 to 1. Hence, the 1's complement of a binary number is obtained by changing 1s into 0s and 0s into 1s. For example, the 1's complement of the binary number 101101 is 010010.
Similarly, the (r–1)'s complements of numbers in other systems, like octal and hexadecimal, can be formulated.
r's Complement The r's complement of a number N having m digits in base r is defined as r^m – N if N ≠ 0, and as 0 if N = 0. Note that r^m – N = [(r^m – 1) – N] + 1. Thus, the r's complement of a number is obtained by adding 1 to the (r–1)'s complement of the number. For example, the 10's complement of the decimal number 43678 is 56321 + 1 = 56322, and the 2's complement of the binary number 101101 is 010010 + 1 = 010011. An m-bit 2's-complement number system can represent every integer in the range –2^(m–1) to +2^(m–1) – 1. Also note that the complement of the complement of a number is the original number: the r's complement of N is r^m – N, and the complement of this complement is r^m – (r^m – N) = N. Now we will see the application of the r's complement in the subtraction of unsigned numbers.
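The two definitions translate directly into code. A Python sketch of our own, reproducing the worked values above:

```python
# Sketch of the two definitions: (r-1)'s complement = (r**m - 1) - N,
# and r's complement = r**m - N (for N != 0).

def r_minus_1_complement(n: int, m: int, r: int) -> int:
    return (r**m - 1) - n

def r_complement(n: int, m: int, r: int) -> int:
    return r**m - n if n != 0 else 0

print(r_minus_1_complement(35367, 5, 10))                   # 64632 (9's complement)
print(r_complement(43678, 5, 10))                           # 56322 (10's complement)
print(format(r_minus_1_complement(0b101101, 6, 2), '06b'))  # 010010 (1's complement)
print(format(r_complement(0b101101, 6, 2), '06b'))          # 010011 (2's complement)
```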
The direct (borrow) method of subtraction is more complex than the method that uses complements when subtraction is to be performed in digital computers, so computers usually use the method of complements to implement subtraction.
The subtraction of two m-digit unsigned numbers A – B (B ≠ 0) in base r can be performed using the rule: the minuend A is added to the r's complement of the subtrahend B, which gives A + (r^m – B) = (A – B) + r^m.
Now, the following two cases may arise.
Case 1: When A ≥ B.
The addition gives an end carry r^m from the leading digit (most significant) position, which is discarded, and what remains is the result A – B.
Case 2: When A < B.
The addition does not give an end carry, and the sum equals r^m – (B – A), which is the r's complement of (B – A). To get the result in familiar form, the r's complement of the sum is taken and a negative sign is placed in front of it.
Let us consider the example of the subtraction 45328 – 20342 = 24986. The 10's complement of 20342 is 79658.
A = 45328
10's complement of B = 79658
Sum of these two = 124986
Discard end carry 10^5 = –100000
Thus, the answer is 24986.
Now consider an example where A < B: the subtraction 20342 – 45328 is to be performed, which will give –24986 as the answer.
We have,
A = 20342
10’s complement of B = 54672
Sum = 75014
There is no end carry.
The result is – 24986, after taking 10’s complement of 75014.
The same way, the subtraction of two unsigned binary numbers can be done.
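The rule and both cases can be checked with a short Python sketch (our own illustration; the function name is hypothetical):

```python
# Sketch of unsigned subtraction A - B via the r's complement:
# A + (r**m - B) = (A - B) + r**m; an end carry of r**m is discarded.

def subtract_by_complement(a: int, b: int, m: int, r: int = 10) -> int:
    total = a + (r**m - b)
    if total >= r**m:                # Case 1: A >= B, discard end carry
        return total - r**m
    return -(r**m - total)           # Case 2: A < B, re-complement, negate

print(subtract_by_complement(45328, 20342, 5))   # 24986
print(subtract_by_complement(20342, 45328, 5))   # -24986
```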
2.4 BINARY DATA REPRESENTATION
A number can be either unsigned or signed. Unsigned numbers are positive numbers, including zero, so an unsigned number can be represented by its magnitude alone; there is no need to represent its sign. Positive and negative numbers, including zero, are treated as signed numbers. In order to differentiate between positive and negative numbers, we need a notation for the sign. On paper, a negative number is indicated by a minus sign in the leftmost position followed by its magnitude, and a positive number by a plus sign in the leftmost position followed by its magnitude. But everything in computers must be represented with 1s and 0s, including the sign of numbers. As a result, it is common practice to represent the sign of a number with a bit placed in the leftmost (most significant) position of the number. The convention is that the sign bit is 0 for positive numbers and 1 for negative numbers.
Moreover, a number may have a radix (binary) point: a number may be a fraction, an integer or a mixed integer–fraction number, depending on the position of the radix point. The natural question then is: where is this radix (binary) point stored in the registers? There are two ways of specifying the position of the binary point in a register: (1) by giving it a fixed position, the method referred to as fixed-point representation; and (2) by using a floating-point representation. Fixed-point numbers are known as integers, whereas floating-point numbers are known as real numbers.
In signed-2's complement representation, the range of numbers using n bits is –2^(n–1) to +(2^(n–1) – 1). This range follows from the fact that there is only one representation for 0, which leaves an odd number of patterns (2^n – 1) for the non-zero values, so the range is not symmetric about zero. Negative numbers are represented in the signed-2's complement form.
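A small Python sketch of our own, showing the n-bit encoding and the asymmetric range:

```python
# Sketch: n-bit signed-2's complement encoding and decoding.

def encode(x: int, n: int) -> str:
    assert -2**(n - 1) <= x <= 2**(n - 1) - 1, "out of n-bit range"
    return format(x & (2**n - 1), f'0{n}b')   # masking keeps the low n bits

def decode(bits: str) -> int:
    n, v = len(bits), int(bits, 2)
    return v - 2**n if bits[0] == '1' else v  # msb 1 means negative

print(encode(-9, 8))        # 11110111
print(decode('11111010'))   # -6
```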
Arithmetic Operations on Fixed-Point Numbers
Arithmetic Addition The addition of two numbers in the sign-magnitude system is performed by the rules of ordinary arithmetic; this method of addition is quite complex when implemented in computers. However, addition using the signed-2's complement method is very simple and can be stated as follows:
The two numbers, including their sign bits, are added, and the carry out of the sign (leftmost) position, if any, is discarded. Numerical examples are illustrated below, where the numbers are stored in 8-bit registers.
  +9    00001001          –9    11110111
 +15    00001111         +15    00001111
 ---    --------         ---    --------
 +24    00011000          +6    00000110

  +9    00001001
 –15    11110001
 ---    --------
  –6    11111010
Negative results are automatically in 2's complement form. For example, in the last case the signed binary number 11111010 is a negative number because its leftmost bit is 1. The 2's complement of its 7-bit magnitude part (1111010) is 0000110, which is the binary equivalent of 6; therefore, the result is –6.
Arithmetic Subtraction Like arithmetic addition, the subtraction of two numbers in the signed-magnitude system is performed by the rules of ordinary arithmetic, and it is quite complex to implement in computers.
But subtraction of two signed binary numbers, when negative numbers are in 2's complement form, is very simple and can be stated as follows:
The 2's complement of the subtrahend, including its sign bit, is added to the minuend, including its sign bit. The carry out of the sign bit position, if any, is discarded. The subtraction can be summarized by the relation
A – B = A + 1's complement of B + 1.
This method of subtraction follows from the fact that a subtraction can be converted into an addition if the sign of the subtrahend is reversed. In other words,
(±A) – (+B) = (±A) + (–B)
(±A) – (–B) = (±A) + (+B)
Consider the subtraction (–9) – (–15) = +6. For 8-bit registers, this can be written as 11110111 + 00001111 = 100000110. The correct answer is 00000110 (i.e., +6), after discarding the end carry.
Note that the addition and subtraction of binary numbers in the signed-2’s complement system are
performed by the same basic addition and subtraction rules as unsigned numbers. Moreover, the
addition and subtraction of binary numbers are performed using only the addition operation. Thus, computers need only one common adder circuit to handle both arithmetic operations.
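The "add and discard the end carry" rule is easy to mimic with bit masking. A Python sketch (our own illustration) reproducing the +6 results above:

```python
# Sketch: 8-bit signed-2's complement addition and subtraction; the
# carry out of the sign position is discarded by masking to 8 bits.
N = 8
MASK = 2**N - 1

def add(a: int, b: int) -> int:
    return (a + b) & MASK                     # discard end carry

def sub(a: int, b: int) -> int:
    return (a + ((~b + 1) & MASK)) & MASK     # add 2's complement of b

minus9, plus15, minus15 = 0b11110111, 0b00001111, 0b11110001
print(format(add(minus9, plus15), '08b'))     # 00000110 = +6
print(format(sub(minus9, minus15), '08b'))    # 00000110 = +6
```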
Of the three components of a floating-point number (mantissa M, radix r and exponent E), only the mantissa M and the exponent E are physically present in the register, including their sign bits. The radix r is not stored, but its presence and the position of the radix point in the mantissa are always assumed. Most computers use a fractional representation for the mantissa, but some computers use an integer representation.
If the integer system of representation for mantissa is used, the decimal number +786.231 is
represented in floating-point with a mantissa and an exponent as follows:
Mantissa Exponent
+ 786231 –03
The floating-point binary number is also represented in the same fashion. For example, the binary
number + 10011.1101 can be represented with 12-bit mantissa and 6-bit exponent as below.
Mantissa Exponent
0 10011110100 0 00101
The mantissa has a leading 0 to indicate a positive sign, and the exponent field represents +5.
When the most significant (leftmost) digit of the mantissa is nonzero, the floating-point number is said to be normalized. For example, 0.00386 × 10^4 is not a normalized number; the corresponding normalized number is 0.386 × 10^2. Similarly, the 8-bit binary mantissa 00011100 is not normalized because of the three leading 0s. The number can be normalized by shifting it three positions to the left and dropping the leading 0s to obtain 11100000. But left-shifting by three positions means multiplying the original number by 2^3 = 8, so, to keep the value of the number the same, the exponent must be decreased by 3. Normalization achieves the maximum possible precision for the floating-point number. Another advantage of using normalized floating-point numbers is increased coverage of numbers: if a computer keeps all numbers normalized, one bit position can be saved by omitting the most significant bit, which is always 1. This 1 is called the hidden 1. It should be noted that zero cannot be normalized, because it has no nonzero digit.
Floating-point representation is more useful than fixed-point representation in dealing with very large or very small numbers, since it covers a far wider range of values in arithmetic operations.
IEEE Standard for Floating-Point Numbers Initially, different computer manufacturers were
using different formats for the floating-point representation. The most common ways of representing
floating-point numbers in computers are the formats standardized as the IEEE (Institute of Electrical
and Electronics Engineers) 754 standard, commonly called “IEEE floating-point”. These formats can
be manipulated efficiently by nearly all modern floating-point computer hardware. It has two similar
formats as follows:
1. Single Precision Format This is a 32-bit format, with 8 bits for the exponent, 23 bits for the mantissa and 1 bit for the sign of the number, as shown in Fig. 2.1. Here, the implied base 2 and the original signed exponent E are not stored in the register. The value actually stored in the exponent field is an unsigned integer E′ called the biased exponent, which is calculated by the relation E′ = E + 127; this is referred to as the excess-127 format. Thus, E′ is in the range 0 ≤ E′ ≤ 255. The end values of this range, 0 and 255, are used to represent special values; therefore, the range of E′ for normal values is 1 ≤ E′ ≤ 254. This means that the actual exponent E is in the range –126 ≤ E ≤ 127.
The low-order 23 bits represent the mantissa. Since binary normalization is used, the most significant bit of the mantissa is always 1. This bit is not explicitly represented; it is assumed to be to the immediate left of the binary point and is called the hidden 1. Thus, the 23 bits stored in the mantissa field M actually represent the fractional part of the mantissa, that is, the bits to the right of the binary point. The 1-bit sign field S contains 0 for a positive and 1 for a negative number.
2. Double Precision Format This is a 64-bit format, with 11 bits for the biased exponent E′, 52 bits for the mantissa M and 1 bit for the sign of the number, as shown in Fig. 2.2. The representation is the same as in the single precision format, except for the field sizes and the related parameters (the exponent bias is 1023).
Example 2.1 Represent the binary positive number 1101011 in the IEEE single-precision format.
The binary positive number 1101011 = +1.101011 × 2^6
The 23-bit mantissa M = 0.101011 000000 000000 00000
The biased exponent E′ = E + 127 = 6 + 127 = 133 = 1000 0101
The sign bit S = 0, since the number is positive.
Therefore, the IEEE single-precision (32-bit) representation is
0 10000101 10101100000000000000000
Example 2.2 Represent the decimal number –0.75 in the IEEE single-precision format.
The decimal number –0.75 = –0.11 in binary = –1.1 × 2^–1
The 23-bit mantissa M = 0.100000 000000 000000 00000
The biased exponent E′ = E + 127 = –1 + 127 = 126 = 0111 1110
The sign bit S = 1, since the number is negative.
Therefore, the IEEE single-precision (32-bit) representation is
1 01111110 10000000000000000000000
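These encodings can be cross-checked in Python using the standard struct module, which packs a float into its IEEE 754 single-precision bit pattern (a small sketch of our own):

```python
# Sketch: extract the IEEE 754 single-precision fields of a Python float.
import struct

def ieee754_fields(x: float) -> tuple[int, int, int]:
    bits = struct.unpack('>I', struct.pack('>f', x))[0]  # 32-bit pattern
    sign = bits >> 31
    biased_exp = (bits >> 23) & 0xFF      # E' = E + 127
    fraction = bits & 0x7FFFFF            # 23 mantissa bits (hidden 1 omitted)
    return sign, biased_exp, fraction

s, e, f = ieee754_fields(-0.75)
print(s, format(e, '08b'), format(f, '023b'))
# prints: 1 01111110 10000000000000000000000, matching Example 2.2
```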
2.5 GUARD BITS AND TRUNCATION
When the mantissa is shifted right, some bits at the rightmost (least significant) positions are lost. In order to obtain maximum accuracy of the final result, one or more extra bits, known as guard bits, are included in the intermediate steps. These bits temporarily hold the bits most recently shifted out from the right-hand side of the mantissa. When the number is finally stored in a register or in memory as the result, the guard bits are not stored. However, based on the guard bits, the value of the mantissa can be made more precise by a rounding (truncation) technique.
Truncation of a number involves disposing of the guard bits. Suppose n = 3 bits are used in the final representation of a number and n = 3 extra guard bits are kept during the operation. At the end of the operation, the resulting 2n = 6 bits need to be truncated to n = 3 bits by one of the following three methods. In all cases a truncation error exists, which is E = actual value – truncated value.
Chopping In this method, all n = 3 guard bits are simply dropped. All fractions in the range 0.b–1b–2b–3 000 to 0.b–1b–2b–3 111 are truncated to 0.b–1b–2b–3. The truncation error of chopping is 0 ≤ E ≤ 0.000111 < 0.001 = 2^–n. Since E is never negative, we say this truncation error is biased.
Von Neumann Rounding If at least one of the guard bits is 1, the least significant bit of the retained bits is set to 1, no matter whether it was originally 0 or 1; otherwise nothing is changed in the retained bits and the guard bits are simply dropped. Two worst cases may arise (for b–3 = 1 and b–3 = 0) when at least one of the guard bits is 1.
Case 1: The number 0.b–1b–2 1 111 is truncated to 0.b–1b–2 1. The truncation error is E = 0.b–1b–2 1 111 – 0.b–1b–2 1 = 0.000111 < 0.001 = 2^–n.
Case 2: The number 0.b–1b–2 0 001 is truncated to 0.b–1b–2 1. The truncation error is then E = 0.b–1b–2 0 001 – 0.b–1b–2 1 = –0.000111 > –0.001 = –2^–n.
The two cases can be summarized as |E| < 2^–n. Thus the Von Neumann rounding error is unbiased, because the range of the error is symmetrical about 0.
Rounding Here, the truncation of the number is done according to the following rules.
Rule 1: If the highest guard bit is 1 and the remaining guard bits are not all 0, a 1 is added to the lsb position of the retained bits. Thus, 0.b–1b–2b–3 1xx is rounded to 0.b–1b–2b–3 + 0.001. The error in this case is E = 0.b–1b–2b–3 1xx – (0.b–1b–2b–3 + 0.001) = 0.0001xx – 0.001 = –(0.001 – 0.0001xx) > –0.0001 = –2^–(n+1).
Rule 2: If the highest guard bit b–(n+1) is 0, all guard bits are dropped. Thus, the number 0.b–1b–2b–3 0xx is rounded to 0.b–1b–2b–3. The truncation error is E = 0.b–1b–2b–3 0xx – 0.b–1b–2b–3 = 0.0000xx < 0.0001 = 2^–(n+1).
Rule 3: If the highest guard bit is 1 and the remaining guard bits are all 0s, the rounding depends on the lsb of the retained bits, b–n = b–3.
If the lsb is 0, the number 0.b–1b–2 0 100 is truncated to 0.b–1b–2 0. The error in this case is E = 0.b–1b–2 0 100 – 0.b–1b–2 0 = 0.0001 = 2^–(n+1).
If the lsb is 1, the number 0.b–1b–2 1 100 is truncated to 0.b–1b–2 1 + 0.001. The truncation error is E = 0.b–1b–2 1 100 – (0.b–1b–2 1 + 0.001) = –0.0001 = –2^–(n+1).
In Rule 3 the value represented by the guard bits is exactly 0.5 × 2^–n, and the number is rounded either up or down (depending on the lsb) with equal probability (50%). The rounding error of these cases can be summarized as |E| ≤ 2^–(n+1); thus the rounding error is unbiased.
Comparing the three methods, we see that the rounding technique has the smallest unbiased error, but it requires the most complex and costly hardware.
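The three schemes are easy to compare in code. The following Python sketch (our own illustration) treats a value as a 2n-bit integer whose low n bits are the guard bits:

```python
# Sketch of the three truncation schemes above. A value is held as 2n
# integer bits: the high n bits are retained, the low n are guard bits.

def chop(bits2n: int, n: int) -> int:
    return bits2n >> n                        # drop all guard bits

def von_neumann(bits2n: int, n: int) -> int:
    kept = bits2n >> n
    if bits2n & ((1 << n) - 1):               # at least one guard bit is 1
        kept |= 1                             # force lsb of retained bits to 1
    return kept

def round_guard(bits2n: int, n: int) -> int:
    kept = bits2n >> n
    guard = bits2n & ((1 << n) - 1)
    half = 1 << (n - 1)                       # highest guard bit alone
    if guard > half:                          # Rule 1: round up
        return kept + 1
    if guard < half:                          # Rule 2: drop guard bits
        return kept
    return kept + (kept & 1)                  # Rule 3: tie, decided by the lsb

x = 0b101101                                  # retained bits 101, guard bits 101
print(chop(x, 3), von_neumann(x, 3), round_guard(x, 3))   # 5 5 6
```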
2.6 MULTIPLICATION OF UNSIGNED AND SIGNED INTEGERS
Two unsigned integers can be multiplied in the same way as two decimal numbers by the manual method. Consider the multiplication of two unsigned integers, where the multiplier Q = 12 = (1100)2 and the multiplicand M = 5 = (0101)2, as illustrated in Fig. 2.3 (the paper and pencil, or manual, method):

        0101     (M)
      x 1100     (Q)
      ------
        0000
       0000      partial products
      0101
     0101
    --------
     0111100     final product

In the paper and pencil (manual) method, shifted versions of the multiplicand are added. This method can be implemented using AND gates and full adders (for the full adder, see the Appendix). The hardware realization of this method for a 4-bit multiplier Q and a 4-bit multiplicand M is shown in Fig. 2.4. The basic combinational cell used in the array as the building block handles one bit of the partial product, as shown in Fig. 2.4(a). If the multiplier bit Qi is 1, this cell adds an incoming partial-product bit to the corresponding multiplicand bit Mi. Each row of the array adds the multiplicand, appropriately left-shifted, to the incoming partial product PPi to generate the outgoing partial product PPi+1. If the multiplier bit Qi is 0, PPi is passed down vertically as PPi+1 with a physical right shift. The initial partial product PP0 is all 0s and PP4 is the desired 8-bit product M × Q = Pr7 Pr6 Pr5 Pr4 Pr3 Pr2 Pr1 Pr0.
Example 2.3 To illustrate this method, consider the multiplier Q = 14 = (1110)2 and the multiplicand M = 6 = (0110)2.

                            M      F    A      Q     Size
Initial configuration      0110    0   0000   1110    4
Step 1
As Q[0] = 0:
R.S.(FAQ), Size – –        0110    0   0000   0111    3
Step 2
As Q[0] = 1:
A = A + M                  0110    0   0110   0111    —
R.S.(FAQ), Size – –        0110    0   0011   0011    2
Step 3
As Q[0] = 1:
A = A + M                  0110    0   1001   0011    —
R.S.(FAQ), Size – –        0110    0   0100   1001    1
Step 4
As Q[0] = 1:
A = A + M                  0110    0   1010   1001    —
R.S.(FAQ), Size – –        0110    0   0101   0100    0

Since the size register is now 0, the algorithm terminates and the final product is AQ = 0101 0100 in binary = 84 in decimal.
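The register-level trace above corresponds to the following Python sketch (our own illustration; a carry variable c stands in for the F flip-flop):

```python
# Sketch of the add-and-shift procedure traced above; a carry variable c
# stands in for the F flip-flop of the register diagram.

def multiply_unsigned(m: int, q: int, n: int = 4) -> int:
    a, c = 0, 0
    for _ in range(n):                        # size counter
        if q & 1:                             # Q[0] = 1: A = A + M
            total = a + m
            c, a = total >> n, total & (2**n - 1)
        # right-shift the combined F-A-Q register: F -> A -> Q
        q = ((a & 1) << (n - 1)) | (q >> 1)
        a = (c << (n - 1)) | (a >> 1)
        c = 0
    return (a << n) | q                       # 2n-bit product sits in A-Q

print(multiply_unsigned(0b0110, 0b1110))      # 84, i.e. 0101 0100
```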
This method of multiplication is good for unsigned numbers. In the case of signed-number multiplication, the signs of the operands can be treated separately, and the multiplication of the magnitudes of the numbers can be processed using the above method. The sign of the product is then determined as Mn ⊕ Qn, where Mn and Qn are the sign bits of the multiplicand (M) and the multiplier (Q) respectively.
Based on this recoding rule, Booth's algorithm for multiplication can be developed easily. The algorithm inspects two low-order multiplier bits at a time to decide the next step of action, and is described by the flowchart in Fig. 2.6. A flip-flop (a fictitious bit position Q[–1]) is used to the right of the lsb of the multiplier and is initialized to 0; subsequently, it receives the lsb of the multiplier each time the multiplier is shifted right.
Once all bits of the multiplier have been inspected, the accumulator and the multiplier register together contain the product. The right-end flip-flop, which holds the initial 0 and the subsequent lsbs of the multiplier, is ignored.
Example 2.5 To see how this procedure works, the following example is considered: M = -6 = 1010
and Q = 7 = 0111.

                            M      A      Q      Q[-1]  Size
Initial configuration       1010   0000   0111   0      4
Step-1
  As Q[0] = 1, Q[-1] = 0:
  A = A - M                 1010   0110   0111   0      --
  ARS(AQ)                   1010   0011   0011   1      3
Step-2
  As Q[0] = 1, Q[-1] = 1:
  ARS(AQ)                   1010   0001   1001   1      2
Step-3
  As Q[0] = 1, Q[-1] = 1:
  ARS(AQ)                   1010   0000   1100   1      1
Step-4
  As Q[0] = 0, Q[-1] = 1:
  A = A + M                 1010   1010   1100   1      --
  ARS(AQ)                   1010   1101   0110   0      0
Since the size register becomes 0, the algorithm terminates and the product is AQ = 1101 0110, whose
sign bit shows that it is a negative number. To get the magnitude, take the 2's complement: 0010 1010 = 42.
The result is therefore -42.
Advantages of the Booth's multiplication method:
(i) Pre-processing steps are unnecessary, so the Booth's algorithm treats signed numbers in a
uniform way with unsigned numbers.
(ii) Fewer additions and subtractions are required, compared to the sequential multiplication
method.
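A behavioural Python sketch of the algorithm (illustrative; the register handling is ours) reproduces Example 2.5:

def booth_multiply(m, q, n=4):
    """Booth's multiplication of two n-bit two's-complement numbers.
    Q[-1] is the fictitious flip-flop, initialized to 0; AQ is
    arithmetically shifted right once per iteration."""
    mask = (1 << n) - 1
    a, q_reg, q_1, m_reg = 0, q & mask, 0, m & mask
    for _ in range(n):
        if (q_reg & 1, q_1) == (1, 0):
            a = (a - m_reg) & mask              # A = A - M
        elif (q_reg & 1, q_1) == (0, 1):
            a = (a + m_reg) & mask              # A = A + M
        q_1 = q_reg & 1                         # ARS(AQ):
        q_reg = (q_reg >> 1) | ((a & 1) << (n - 1))
        a = (a >> 1) | (a & (1 << (n - 1)))     # sign bit is replicated
    product = (a << n) | q_reg
    if product & (1 << (2 * n - 1)):            # interpret as 2's complement
        product -= 1 << (2 * n)
    return product

assert booth_multiply(-6, 7) == -42             # Example 2.5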
2.7 DIVISION OF UNSIGNED INTEGERS

Division is a more complex operation than multiplication. Given a dividend (D) and a divisor (V),
the quotient (Q) and the remainder (R) are related according to the following expression:
D = QV + R, where 0 ≤ R < V.
Subtractive division algorithms are derived from the paper and pencil (manual) method. This
process is illustrated in Fig. 2.7, using the dividend D = 42 (in decimal) = 101010 and the divisor
V = 11 (in decimal) = 1011. In this method, every iteration generates one quotient bit. First, align
the divisor below the dividend from the msb and try to subtract it from the dividend. If the result of
the subtraction is positive, put a 1 in the quotient and shift the divisor one position to the right;
the process is then repeated. However, if the divisor cannot be subtracted from the dividend to give
a positive result, put a 0 in the quotient, shift the divisor to the right, and again try to subtract.
This process continues until all bits of the dividend are covered.

             11         Quotient = decimal 3
       ---------
 1011 ) 101010
         1011
        ------
         10100
          1011
        ------
          1001      Remainder = decimal 9

Figure 2.7 Paper and pencil (manual) division method
2.7.1 Restoring Division Method
Now, the manual division method can be modified in the following way to obtain the restoring division
method. Instead of shifting the divisor, shift the dividend to the left. The restoring division method uses
three n-bit registers A, M and Q for dividing two n-bit numbers. The register M holds the divisor.
Initially, A contains 0 and Q holds the n-bit dividend. In each iteration, the contents of the register pair
AQ are first shifted to the left. The content of M is then subtracted from A. If the result of the subtraction
is positive, a 1 is placed in the vacant lsb position of Q created by the left shift; otherwise a 0 is placed
in this position and, before beginning the next iteration, the content of A is restored by adding M back
to it. It is for this step that the algorithm is called a restoring division algorithm. When the algorithm
terminates, the A register contains the remainder and the Q register contains the quotient.
The restoring division algorithm to divide two n-bit numbers is described using the flowchart
shown in Fig. 2.8.
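A register-level sketch in Python (illustrative; A and M are modelled n+1 bits wide so that the sign of A - M is visible):

def restoring_divide(dividend, divisor, n=4):
    """Restoring division of two n-bit unsigned numbers with the
    A, M, Q registers of Fig. 2.8.  Returns (quotient, remainder)."""
    qmask = (1 << n) - 1          # Q is n bits wide
    mask = (1 << (n + 1)) - 1     # A and M are n+1 bits wide
    sign = 1 << n                 # sign bit of A
    a, q, m = 0, dividend, divisor
    for _ in range(n):
        a = ((a << 1) | ((q >> (n - 1)) & 1)) & mask   # LS(AQ)
        q = (q << 1) & qmask
        a = (a - m) & mask                             # A = A - M
        if a & sign:                                   # sign of A is -ve
            a = (a + m) & mask                         # restore A; Q[0] = 0
        else:
            q |= 1                                     # set Q[0] = 1
    return q, a

assert restoring_divide(7, 3) == (2, 1)                # Example 2.6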
Example 2.6 To illustrate restoring division algorithm, let us consider an example where dividend
Q = 7 = 0111 and divisor M = 3 = 0011.
                            M       A       Q      Size
Initial configuration       00011   00000   0111   4
Step-1
  LS(AQ)                    00011   00000   111-   --
  A = A - M                 00011   11101   111-   --
  As sign of A = -ve,
  set Q[0] = 0
  and restore A             00011   00000   1110   3
Step-2
  LS(AQ)                    00011   00001   110-   --
  A = A - M                 00011   11110   110-   --
  As sign of A = -ve,
  set Q[0] = 0
  and restore A             00011   00001   1100   2
Step-3
  LS(AQ)                    00011   00011   100-   --
  A = A - M                 00011   00000   100-   --
  As sign of A = +ve,
  set Q[0] = 1              00011   00000   1001   1
Step-4
  LS(AQ)                    00011   00001   001-   --
  A = A - M                 00011   11110   001-   --
  As sign of A = -ve,
  set Q[0] = 0
  and restore A             00011   00001   0010   0
From the above result, we see that the quotient = Q = 0010 = 2 and remainder = A = 00001 = 1.
Example 2.7 To illustrate the non-restoring division method, let us take an example where the
dividend Q = 0111 and the divisor M = 0011.
                            M       A       Q      Size
Initial configuration       00011   00000   0111   4
Step-1
  As sign of A = +ve:
  LS(AQ)                    00011   00000   111-   --
  A = A - M                 00011   11101   111-   --
  As sign of A = -ve,
  set Q[0] = 0              00011   11101   1110   3
Step-2
  As sign of A = -ve:
  LS(AQ)                    00011   11011   110-   --
  A = A + M                 00011   11110   110-   --
  As sign of A = -ve,
  set Q[0] = 0              00011   11110   1100   2
Step-3
  As sign of A = -ve:
  LS(AQ)                    00011   11101   100-   --
  A = A + M                 00011   00000   100-   --
  As sign of A = +ve,
  set Q[0] = 1              00011   00000   1001   1
Step-4
  As sign of A = +ve:
  LS(AQ)                    00011   00001   001-   --
  A = A - M                 00011   11110   001-   --
  As sign of A = -ve,
  set Q[0] = 0              00011   11110   0010   --
  As this is the last step
  and A is -ve, restore A   00011   00001   0010   0
From the last step, we conclude that the quotient = 0010 = 2 and the remainder = 00001 = 1.
These two algorithms can be extended to handle signed numbers as well. The sign of the result must be
treated separately, as in the case of multiplication, and the division of the positive magnitudes of the
dividend and divisor is performed using either of the last two techniques to obtain the quotient and
remainder.
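The non-restoring method differs from the restoring one only in that a negative A is corrected by the next iteration's addition rather than immediately, with at most one final restore. An illustrative Python sketch:

def nonrestoring_divide(dividend, divisor, n=4):
    """Non-restoring division of two n-bit unsigned numbers: the sign
    of A chooses between subtracting and adding M after each left
    shift; A is restored at most once, at the very end."""
    qmask = (1 << n) - 1
    mask = (1 << (n + 1)) - 1
    sign = 1 << n
    a, q, m = 0, dividend, divisor
    for _ in range(n):
        neg = bool(a & sign)
        a = ((a << 1) | ((q >> (n - 1)) & 1)) & mask   # LS(AQ)
        q = (q << 1) & qmask
        a = (a + m if neg else a - m) & mask           # A = A +/- M
        if not a & sign:
            q |= 1                                     # set Q[0] = 1
    if a & sign:                                       # final correction
        a = (a + m) & mask                             # restore A once
    return q, a

assert nonrestoring_divide(7, 3) == (2, 1)             # Example 2.7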
REVIEW QUESTIONS
Group A
1. Choose the most appropriate option for the following questions:
(i) The maximum unsigned binary number in 8-bit is
(a) 255 (b) 256 (c) 128 (d) 127.
(ii) The minimum and maximum 8-bit numbers in signed magnitude representation are
(a) 0 and 255 (b) – 127 and 127 (c) – 128 and 127 (d) none.
(iii) The minimum and maximum 8-bit numbers in signed 2’s complement representation are
(a) –127 and 127 (b) –128 and 127 (c) 0 and 255 (d) none.
(iv) The complement operations are useful for
(a) logical operations
(b) addition and subtraction operations
(c) arithmetic operations
(d) subtraction and logical operations.
(v) The number – 0.125 is represented in IEEE single-precision format as:
(a) 1 0111 1100 1000…… 0
(xii) The number of AND gates and number of full adders required in a 4 × 4 array multiplier are
(a) 4 and 4 (b) 16 and 4
(c) 4 and 16 (d) 16 and 16, respectively.
(xiii) The sequential multiplication method is generally used to multiply two unsigned numbers, but
can be used for multiplication of signed numbers, where the sign of the product is processed
separately using one
(a) OR gate (b) NOT gate (c) XOR gate (d) AND gate.
(xiv) For which of the following multiplier numbers does Booth's algorithm require the maximum
number of additions and subtractions?
(a) 0100 1111 (b) 0111 1000 (c) 0000 1111 (d) 0101 0101.
Group B
2. Why do digital computers use binary numbers for their operations?
3. Describe two complement methods. Prove that complement of a complement is the original number.
4. Discuss the fixed-point representation method with example.
5. What are representations of integer number – 19 in 8-bit format using
(a) signed magnitude method?
(b) signed 1’s complement method?
(c) signed 2’s complement method?
6. Compare different integer representation methods.
7. What are the minimum and maximum integers representable in n-bit value using
(a) signed magnitude method?
(b) signed 1’s complement method?
(c) signed 2’s complement method?
Give the argument for each.
8. Discuss the overflow problem in fixed-point representation and its detection method using example.
9. Describe floating-point representation using examples.
10. When is a floating-point number called a normalized floating-point number? Give the reason(s) for
converting a non-normalized floating-point number into a normalized one.
11. What is biased exponent and why?
12. Give the IEEE single-precision representation for floating-point numbers. Represent the number –
7.75 in this representation.
13. Discuss the overflow and underflow problems in floating-point representation.
14. What is a guard bit? Describe different truncation methods with their errors.
15. Discuss the 4 × 4 array multiplier method with block diagram.
16. Discuss the sequential multiplication method and use this to multiply decimal numbers 27 and 56.
Can you apply this method to multiply two signed numbers?
17. Give the recoded Booth’s multiplier representations for the following:
(a) 1001 0011
(b) 1110 1010
18. Describe the Booth’s multiplication method and use this to multiply decimal numbers –23 and 9.
What are the advantages of this method?
19. Discuss the restoring division algorithm and use this to divide decimal number 23 by 6. Can you
apply this method to divide two signed numbers?
20. Deduce the non-restoring division algorithm from restoring division algorithm.
21. Describe the non-restoring division algorithm and use this to divide decimal number 29 by 7.
SOLVED PROBLEMS
1. Describe Booth’s modified algorithm and show that just N/2 partial products are required to
multiply two N-bit binary numbers. Describe the method using the two numbers A=10101010
and B=11001110.
Answer
A faster version of Booth's multiplication algorithm for signed numbers, known as the modified
Booth's algorithm, examines three adjacent bits Q[i+1] Q[i] Q[i-1] of the multiplier Q at a time,
instead of two. Apart from the three basic actions performed by the original Booth's algorithm, which
can be expressed as: add 0, +1 × M (multiplicand) or -1 × M to Ac (the accumulated partial product),
this modified algorithm performs two more actions: add +2 × M or -2 × M to Ac. These have the effect
of increasing the radix from 2 to 4 and allow an N × N multiplication to require only N/2 partial
products.
Observe that the recoded digit pair (+1, -1) is equivalent to the pair (0, +1). That is, instead of
adding -1 × M at shift position i and +1 × M at position i + 1, the same result is obtained by adding
+1 × M at position i. Other cases are: (+1, 0) is equivalent to (0, +2), (-1, +1) is equivalent to
(0, -1), and so on. The following table shows the multiplicand selection decisions for all possibilities.

Q[i+1]  Q[i]  Q[i-1]    Action
0       0     0         add 0
0       0     1         add +1 × M
0       1     0         add +1 × M
0       1     1         add +2 × M
1       0     0         add -2 × M
1       0     1         add -1 × M
1       1     0         add -1 × M
1       1     1         add 0
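An illustrative recoding routine in Python (ours, not from the text) produces one radix-4 digit per bit pair. Note that B = 11001110 from the problem statement recodes to the digits -2, 0, +1, -1 (lsb digit first), i.e. -2 + 0·4 + 1·16 - 1·64 = -50:

def booth_radix4_recode(q, n=8):
    """Recode an n-bit two's-complement multiplier into n/2 radix-4
    digits in {-2, -1, 0, +1, +2} by scanning the overlapping triples
    Q[i+1] Q[i] Q[i-1] (with Q[-1] = 0), lsb first."""
    q &= (1 << n) - 1
    bits = [0] + [(q >> i) & 1 for i in range(n)]   # bits[0] is Q[-1]
    digits = []
    for i in range(0, n, 2):
        b_lo, b_mid, b_hi = bits[i], bits[i + 1], bits[i + 2]
        digits.append(-2 * b_hi + b_mid + b_lo)     # digit for this pair
    return digits

assert booth_radix4_recode(0b11001110) == [-2, 0, 1, -1]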
(c) -22 - 7 = (-22) - 7. This is a subtraction, so the 2's complement of 7 is to be added to (-22).
The 2's complement of 7 is 1111 1001 and the representation of (-22) is 1110 1010. After
adding these two we get 1 1110 0011, a 9-bit result. So, by rule, discard the 9th bit to get
the result 1110 0011, which shows that it is a negative number. The result in equivalent
decimal is -29.
(d) 18 - (-5) = 18 + 5. The representation of 18 is 0001 0010 and the representation of 5 is
0000 0101. After addition we get 0001 0111. The result is equivalent to decimal 23.
7. Can you add 8-bit signed numbers 0110 0010 and 0100 0101? If not, why? Suggest a solution.
Answer
Carries:     0 1
             0110 0010
           + 0100 0101
             ---------
             1010 0111

This result suggests that the number is negative, which is wrong. However, if the carry out from
the sign bit position is treated as the sign of the result, the 9-bit answer thus obtained is correct.
So, there is an overflow.
To detect an overflow condition, the carry into the sign bit position and the carry out from the sign
bit position are examined. If these two carries are both 0 or both 1, there is no overflow. If the two
carries are not equal (i.e., one is 0 and the other is 1), an overflow condition exists. Considering the
carry out from the sign bit position (0, here) together with the 8-bit result gives the correct 9-bit
answer.
8. Write down the Boolean expression for overflow condition when adding or subtracting two
binary numbers expressed in two’s complement.
Answer
If one number is positive and the other is negative, overflow cannot occur after an addition, since
adding a positive number to a negative number produces a result whose magnitude is no larger than
the larger of the two original numbers. However, an overflow may occur if the two numbers added
have the same sign, i.e., both are positive or both are negative. Consider the following examples.
Carries:   0 1                       Carries:   1 0
  +69      0 1000101                   -69      1 0111011
  +78      0 1001110                   -78      1 0110010
 +147      1 0010011                  -147      0 1101101
Observe that the 8-bit result that should have been positive (first example) has a negative sign bit,
and the 8-bit result that should have been negative (second example) has a positive sign bit. However,
if the carry out from the sign bit position is treated as the sign of the result, the 9-bit answer thus
obtained is the correct answer. Since the 9-bit answer cannot be accommodated in an 8-bit register,
we say that an overflow results.
To detect an overflow condition, the carry into the sign bit position (i.e. Cn-1) and the carry out
from the sign bit position (i.e. Cn) are examined. If these two carries are both 0 or both 1, there is
no overflow. If these two carries are different, an overflow condition exists. The overflow occurs if
the Boolean expression Cn ⊕ Cn-1 is true.
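A bit-level check of this rule in Python (illustrative):

def add_with_overflow(a, b, n=8):
    """n-bit two's-complement addition; overflow = C_n XOR C_(n-1),
    the carries out of and into the sign bit position."""
    mask = (1 << n) - 1
    low = mask >> 1                                   # the low n-1 bits
    c_in = ((a & low) + (b & low)) >> (n - 1) & 1     # carry into sign bit
    c_out = ((a & mask) + (b & mask)) >> n & 1        # carry out of sign bit
    return (a + b) & mask, bool(c_in ^ c_out)

assert add_with_overflow(0b01000101, 0b01001110) == (0b10010011, True)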
9. Give the merits and demerits of the floating-point and fixed-point representations for storing
real numbers.
Answer
Merits of fixed-point representation:
(a) This method of representation is suitable for representing integers in registers.
(b) Very easy to represent, because it uses only one field: magnitude field.
Demerits of fixed-point representation:
(a) Range of representable numbers is restricted.
(b) It is very difficult to represent complex fractional numbers.
(c) Since there is no standard representation method for it, it is sometimes confusing to represent a
number in this method.
Merits of floating-point representation:
(a) By this method, any type and any size of numbers can be represented easily.
(b) There are several standardized representation methods for this.
Demerits of floating-point representation:
(a) Relatively complex representation, because it uses basically two fields: mantissa and exponent
fields.
(b) Length of register for storing floating-point numbers is large.
10. Represent following decimal numbers in IEEE 754 floating point format:
(a) –1.75
(b) 21
Answer
(a) The decimal number -1.75 = -1.11 in binary = -1.11 × 2^0
The 23-bit mantissa M = 1100 0000 0000 0000 0000 000
The biased exponent E' = E + 127 = 0 + 127 = 127 = 0111 1111
Since the number is negative, the sign bit S = 1
Therefore, the IEEE single-precision (32-bit) representation is:
1 01111111 11000000000000000000000
11. What value is represented by the IEEE single precision floating point number:
0101 0101 0110 0000 0000 0000 0000 0000?
Answer
The sign of the number = 0, biased exponent value = 10101010 = 170. So the exponent value = 170 –
127 = 43. The mantissa field = 110 0000 0000 0000 0000 0000.
Therefore, the value of the number = + (1.11)2 × 2^43 = 1.75 × 2^43 = 1.539 × 10^13 (approx.).
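Python's struct module can confirm such conversions (an illustrative sketch; the helper name is ours):

import struct

def decode_single(bits):
    """Split a 32-bit IEEE 754 pattern into its fields and its value."""
    s, e, m = bits >> 31, (bits >> 23) & 0xFF, bits & 0x7FFFFF
    (value,) = struct.unpack(">f", struct.pack(">I", bits))
    return s, e, m, value

s, e, m, value = decode_single(0b01010101011000000000000000000000)
assert (s, e - 127) == (0, 43) and value == 1.75 * 2**43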
12. How NaN (Not a Number) and Infinity are represented in IEEE 754 standard?
Answer
Not a Number (NaN) is represented when the biased exponent E' = 255 and the mantissa M ≠ 0. NaN is
the result of performing an invalid operation such as 0/0 or the square root of -1.
Infinity is represented when E' = 255 and M = 0. Infinity is the result of dividing a normal
number by 0.
13. A floating point number system uses 16 bits for representing a number. The most significant
bit is the sign bit. The least significant nine bits represent the mantissa and remaining 6 bits
represent the exponent. Assume that the numbers are stored in the normalized format with one
hidden bit
(a) Give the representation of –1.6 ¥ 103 in this number system.
(b) What is the value represented by 0 000100 110000000?
Answer
The format of the 16-bit floating-point representation is: sign (1 bit), exponent (6 bits), mantissa
(9 bits, with one hidden bit).
(a) -1.6 × 10^3 = -1600 = -(11001000000)2 = -1.1001 × 2^10. Dropping the hidden bit, the stored
mantissa is 100100000 and the exponent field is 10 = 001010. The representation is therefore:
1 001010 100100000
(b) For 0 000100 110000000: S = 0, exponent = 000100 = 4, significand = 1.110000000 with the
hidden bit. The value is +1.11 × 2^4 = (11100)2 = 28.
Answer
First we must shift the smaller number to the right to align the exponents, so 2.56 becomes 0.0256
¥ 102. The sum of mantissas is
2.3400
+ 0.0256
2.3656
Thus, the sum is 2.3656 × 10^2. Since we have two digits to round, we want values 0 to 49 to round
down and 51 to 99 to round up, with 50 as the tiebreaker. Rounding the sum up to three significant
digits gives 2.37 × 10^2.
15. Compute the product of the following pair of unsigned integers. Generate the full 8-bit result.
(a) 1001 ¥ 0110
(b) 1111 ¥ 1111
Answer
(a) 1001 × 0110.
This can be written as
(1001 × 100) + (1001 × 10) = 100100 + 10010 = 0011 0110.
(b) 1111 × 1111.
This can be written as
(1111 × 1000) + (1111 × 100) + (1111 × 10) + (1111 × 1) = 1111000 + 111100 + 11110 +
1111 = 1110 0001.
16. Multiply –13 and +14 by the method of partial products.
Answer
-13 = 10011 (multiplicand)
+14 = 01110 (multiplier)

      00000
     10011
    10011
   10011
  ---------
  100001010      (sum of partial products)

Because the negative multiplicand 10011 is treated as an unsigned number during the partial-product
additions (its unsigned value is M + 2^5), the sum of the partial products exceeds the true product by
2^5 × multiplier. The correction is therefore to add the 2's complement of the multiplier, shifted left
by n bits.
The 2's complement of the multiplier, left shifted by n = 5 bits, is 1001000000.
Adding this shifted number to the sum of the partial products gives the correct result of the signed
multiplication.
Thus, 100001010 + 1001000000 = 1101001010 = -182.
17. Give the recoded Booth’s multiplier representations for the following:
(a) 1100 1010
(b) 1110 1101
Answer
(a) Original multiplier: 1100 1010 (append Q[-1] = 0 at the right).
Recoded pattern: 0 -1 0 +1 -1 +1 -1 0
(b) Original multiplier: 1110 1101 (append Q[-1] = 0 at the right).
Recoded pattern: 0 0 -1 +1 0 -1 +1 -1
CHAPTER
3
Microoperation and Design of Arithmetic Logic Unit
The Arithmetic Logic Unit (ALU) is the main part of the Central Processing Unit (CPU) for processing
instructions. The other parts of the CPU are the Control Unit (CU) and the register unit. The ALU simply
executes the instructions in the order dictated by the CU. The ALU performs the instruction execution
on the operand data stored in registers. A CPU with many registers reduces the number of references to
the main memory, thereby simplifying the programming task and shortening the execution time. An
operation performed on the data stored in registers is called a micro-operation. The result of the
micro-operation may replace the previous binary information of a register or may be transferred to
another register. Examples of micro-operations are add, subtract, shift, load, clear, etc.
The internal hardware structure of a computer is characterized by the following attributes:
1. The types of micro-operations performed on the binary information stored in registers.
2. The control signals that initiate the sequence of micro-operations.
3. The set of registers it contains and their functions.
The frequently used micro-operations in digital computers are classified into four categories:
1. Register Transfer Micro-operations: Transfer of binary information from one register to an-
other.
2. Arithmetic Micro-operations: Arithmetic operations performed on the data stored in registers.
3. Logical Micro-operations: Bit manipulation operations on non-numeric data stored in registers.
4. Shift Micro-operations: Shift operations on data stored in registers.
In this chapter, we will design one hypothetical Arithmetic Logic Unit (ALU). Before designing an
ALU, we need to discuss the various micro-operations, especially the register transfer micro-operation.
Computers contain some registers within the CPU for faster execution. The number of registers differs
from processor to processor. A register is nothing but a collection of flip-flops (see Appendix for
details), each capable of storing one bit of information. Registers are available in the following forms:
Register                              Function
PC (Program Counter)                  Holds the address of the next instruction to be executed.
IR (Instruction Register)             Holds the instruction code (operation code) currently being executed.
SP (Stack Pointer)                    Holds the address of the top element of the memory stack.
BR (Base Register)                    Holds the starting address of the memory stack.
MAR (Memory Address Register)         Holds the address of the data item to be retrieved from the main memory.
MBR or DR (Memory Buffer Register     Holds the data item retrieved from the main memory.
or Data Register)
SR or PSW (Status Register or         Holds the condition code flags and other information that describe
Program Status Word)                  the status of the currently executing program.
Register Transfer Language (RTL) Register Transfer Language (RTL) is the symbolic notation
used to describe micro-operation transfers between registers. Information transfer from one register
to another is designated in symbolic form by means of a replacement operator. The symbolic code
R1 ← R2 indicates a transfer of the content of register R2 into R1. The transfer micro-operation
means the content of the source register R2 is copied into the destination register R1, while the content
of R2 remains the same. This transfer micro-operation overwrites the content of R1 with the content of R2.
As far as internal hardware connectivity is concerned, a register transfer implies that circuits are
available from the outputs of the source register to the inputs of the destination register, and that the
destination register has a parallel load capability. Normally, the register transfer occurs under a
predetermined control condition. This can be illustrated by means of an if-then symbolic code:

If (C = 1) then (R1 ← R2)

where C is a control signal generated in the control circuit. A control function is sometimes specified
to separate the control variables from the register transfer operation. A control function is nothing but
a Boolean variable that is equal to 1 or 0. Thus the above symbolic code is equivalent to:

C: R1 ← R2

A colon is used to terminate the control condition. This code means that the transfer micro-operation
is performed only if C = 1. Fig. 3.2(a) shows the block diagram that illustrates the transfer from R2 to
R1. Register R1 has a load control input C that is controlled by the control circuit. A common clock is
used to synchronize all the activities in the total circuit for the transfer operation.

Figure 3.2 Transfer from R2 to R1 when C = 1 (in symbolic code C: R1 ← R2)

As shown in the timing diagram (Fig. 3.2(b)), on the rising edge of a clock pulse at time t, the
control section activates C. The next positive transition of the clock at time t + 1 finds the load input
enabled, and the data inputs of R1 are then loaded with the data outputs of register R2 in parallel. At
time t + 1, C should be disabled; otherwise, if C remains active, the transfer will occur on every
clock pulse transition.
Many registers are provided in the CPU of a computer for fast execution. Therefore several paths
must be provided to transfer information from one register to another. If a separate communication
line is used between each register pair in the system, the number of lines will be excessive and thus
cost of communication will be huge. Thus it is economical to have a common bus system for
transferring information between registers in a multiple-register configuration. A bus system consists
of a group of common communication lines, where each line is used to transfer one bit of a register at
a time. Thus, a shared communication path consisting of one or more connection lines is known as a
bus, and the transfer of data through this bus is known as bus transfer. A bus is often called an n-bit
bus or n-line bus, meaning that it consists of n parallel lines that transfer n bits of data at a time;
n is called the width of the bus. The width of the bus has an impact on a computer's performance:
the wider the bus, the greater the number of bits transferred at a time.
We will present two ways to construct common bus system. One way is using multiplexers (simply
known as MUXs, see appendix for MUX) and another way is using tri-state buffers.
Construction of a Common Bus Using MUXs Fig. 3.3 shows an n-line common bus system
using multiplexers for register transfer, where four registers, each of n bits, are used. This common
bus transfers the content of one register to another register or to memory at any one time. A
multiplexer selects one source register, whose n bits of information are then placed on the bus. Two
of the multiplexers are shown in the figure: one for the low-order bit and another for the high-order
bit. The bus consists of n 4 × 1 multiplexers, each having four data inputs (0 through 3), with two
selection lines common to all the multiplexers.
Each MUX has four input lines, one connected to the corresponding bit of each of the four registers.
The two selection lines S0 and S1 are connected to the selection inputs of all n MUXs. The selection
lines choose the n bits of one register and transfer them onto the n-line common bus. For example,
when S1 S0 = 00, the 0th data inputs of all n MUXs are selected and placed on the outputs that form
the bus. The function table for the bus shown in Fig. 3.3 is given below.
The function table for bus shown in Fig. 3.3 is given below.
S1 S0 Register selected
0 0 A
0 1 B
1 0 C
1 1 D
General case Suppose an n-line bus system is to be constructed for k registers of n bits each. The
number of MUXs needed to construct the bus is equal to n, the number of bits in each register. The
size of each multiplexer must be k ¥ 1, since it multiplexes k data lines.
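A behavioural model of this bus in Python (illustrative; the function names are ours):

def mux4(d0, d1, d2, d3, s1, s0):
    """One 4 x 1 multiplexer: routes the selected data bit to the output."""
    return (d0, d1, d2, d3)[(s1 << 1) | s0]

def common_bus(a, b, c, d, s1, s0, n=4):
    """n-line common bus of Fig. 3.3: MUX i receives bit i of every
    register; the shared lines S1 S0 pick whose bits drive the bus."""
    bus = 0
    for i in range(n):
        bit = mux4((a >> i) & 1, (b >> i) & 1,
                   (c >> i) & 1, (d >> i) & 1, s1, s0)
        bus |= bit << i
    return bus

assert common_bus(0b0001, 0b0010, 0b0100, 0b1000, 1, 0) == 0b0100  # register C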
Construction of a Common Bus Using Tri-state Buffers Another way to construct a common
bus system is to use tri-state buffers. A tri-state gate is a digital circuit that exhibits three states, of
which two are normal signals equivalent to logic 1 and logic 0, as in a conventional gate. The third
state is a high-impedance state. The high-impedance state behaves like an open circuit, which means
that no output is produced even though there is an input signal; it has no logic significance. The gate
is controlled by a separate control input C. If C is high, the gate behaves like a normal logic gate with
output 1 or 0. When C is low, the gate does not produce any output, irrespective of the input values.
The graphic symbol of a tri-state buffer gate is shown in Fig. 3.4.
General case Suppose an n-line common bus for k registers of n bits each is to be constructed using
tri-state buffers. We need n circuits with k buffers in each, as shown in Fig. 3.5. Therefore, the
total number of buffers needed is k * n. Only one decoder is required to select among the k registers.
The size of the decoder should be log2k-to-k.
The basic arithmetic microoperations are addition, subtraction, increment, decrement and shift. The
arithmetic addition microoperation is defined by the statement

R1 ← R2 + R3

This microoperation states that the content of register R2 is added to the content of register R3 and
the result is stored in register R1. In order to implement this operation in hardware we need three
registers and a digital circuit to perform addition. The basic arithmetic microoperations are listed in
Table 3.1.
Table 3.1 Basic Arithmetic Microoperations

Symbolic notation       Description
R1 ← R2 + R3            Sum of the contents of R2 and R3 transferred to R1
R1 ← R2 - R3            Contents of R2 minus R3 transferred to R1
R1 ← R1 + 1             Incrementing the content of R1 by 1
R1 ← R1 - 1             Decrementing the content of R1 by 1
R1 ← R1'                Complementing the content of R1 (1's complement)
R1 ← R1' + 1            2's complementing the content of R1 (negate)
R1 ← R2 + R3' + 1       Content of R2 added to the 2's complement of R3 (subtraction) and transferred to R1
In this table, the subtraction microoperation is performed using the 2's complement method.
Multiplication and division are not included in this table, though they are valid micro-operations.
The multiplication operation can be implemented by a sequence of add and shift micro-operations,
and division by a sequence of subtract and shift microoperations.
Advantage The serial adder circuit is small and hence very inexpensive, irrespective of the number
of bits to be added.
Disadvantage The serial adder is very slow, since it takes n clock cycles to add two n-bit numbers.
Parallel Adder A parallel adder is an adder which adds all bits of two numbers in one clock cycle.
It has a separate adder circuit for each bit; therefore, to add two n-bit numbers, a parallel adder needs
n separate adder circuits. There are basically two types of parallel adders, depending on the way the
carry is generated:
(a) Carry-Propagate Adder (CPA) or Ripple Carry Adder (RCA)
(b) Carry Look-ahead Adder (CLA).
Carry-Propagate Adder (CPA) For the addition of two n-bit numbers, n full adders (FAs) are
required. Each full adder's carry output is the carry input of the next higher-bit full adder. Each full
adder performs the addition for one bit position of the two numbers. An n-bit CPA circuit is shown in
Fig. 3.8.
The addition time is decided by the delay introduced by the carry. In the worst case, the carry from
the first full adder stage has to propagate through all the full adder stages. Therefore, the maximum
propagation delay for an n-bit CPA is D × n, where D is the time delay of each full adder stage and
n is the number of bits in each operand.
Advantage This adder, being a combinational circuit, is faster than the serial adder: all bits of the
two numbers are added in one clock period.
Disadvantages
1. The addition delay becomes large if the size of the numbers to be added is increased.
2. The hardware cost is more than that of the serial adder, because the number of full adders needed
equals the number of bits in the operands.
Building Long Adders Since the carry is propagated serially through each full adder, smaller
CPAs can be cascaded to obtain a larger CPA. As an example, the construction of a 16-bit CPA using
four 4-bit CPAs is shown in Fig. 3.9.
Carry Look-ahead Adder (CLA) A Carry Look-ahead Adder (CLA) is a high-speed adder,
which adds two numbers without waiting for the carries from the previous stages. In the CLA, carry-
inputs of all stages are generated simultaneously, without using carries from the previous stages.
In the full adder, the carry output Ci+1 is related to its carry input Ci as follows:
Ci+1=AiBi + (Ai+Bi)Ci
This result can be rewritten as:
Ci+1 = Gi + PiCi (1)
where Gi = AiBi and Pi = Ai + Bi
The function Gi is called the carry-generate function, since a carry Ci+1 is generated when both Ai
and Bi are 1. The function Pi is called the carry-propagate function, since if Ai or Bi is 1, the input
carry Ci is propagated to the next stage. The basic adder (BA) for generating the sum Si, carry-propagate
Pi and carry-generate Gi bits is shown in Fig. 3.10. The sum bit Si = Ai ⊕ Bi ⊕ Ci. For the
implementation of one basic adder, two XOR gates, one AND gate and one OR gate are required.

Figure 3.10 Basic adder

Now, we want to design a 4-bit CLA, for which four carries C1, C2, C3 and C4 are to be generated.
Using equation (1), C1, C2, C3 and C4 can be expressed as follows:
C1= G0 + P0C0
C2= G1 + P1C1
C3= G2 + P2C2
C4= G3 + P3C3
These equations are recursive and the recursion can be removed as below.
C1 = G0 + P0C0 (2)
C2 = G1 + P1C1
= G1 + P1(G0 + P0C0)
= G1 + P1G0 + P1P0C0 (3)
C3 = G2 + P2C2
= G2 + P2(G1 + P1G0 + P1P0C0)
= G2 + P2G1 + P2P1G0 + P2P1P0C0 (4)
C4 = G3 + P3C3
= G3 + P3(G2 + P2G1 + P2P1G0 + P2P1P0C0)
= G3 + P3G2 + P3P2G1 + P3P2P1G0 + P3P2P1P0C0 (5)
The equations (2), (3), (4) and (5) suggest that C1, C2, C3 and C4 can be generated directly from
C0. In other words, these four carries depend only on the initial carry C0. For this reason, these
equations are called carry look-ahead equations. A 4-bit carry look-ahead adder (CLA) is shown in
Fig. 3.11.
The maximum delay of the CLA is 6 × D (for Gi and Pi generation, delay = D; for Ci generation,
delay = 2D; and lastly another 3D for the sum bit Si), where D is the average gate delay. The same
holds good for any number of bits, because the adder delay does not depend on the size of the numbers
(n); it depends on the number of levels of gates used to generate the sum and the carry bits.
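Equations (2)-(5) translate directly into code. An illustrative Python sketch of the 4-bit CLA (ours, not from the text):

def cla_4bit(a, b, c0=0):
    """4-bit carry look-ahead adder: every carry is produced directly
    from the Gi, Pi functions and C0, as in equations (2)-(5)."""
    g = [(a >> i) & (b >> i) & 1 for i in range(4)]     # Gi = Ai.Bi
    p = [((a >> i) | (b >> i)) & 1 for i in range(4)]   # Pi = Ai + Bi
    c1 = g[0] | (p[0] & c0)
    c2 = g[1] | (p[1] & g[0]) | (p[1] & p[0] & c0)
    c3 = (g[2] | (p[2] & g[1]) | (p[2] & p[1] & g[0])
          | (p[2] & p[1] & p[0] & c0))
    c4 = (g[3] | (p[3] & g[2]) | (p[3] & p[2] & g[1])
          | (p[3] & p[2] & p[1] & g[0]) | (p[3] & p[2] & p[1] & p[0] & c0))
    carries = [c0, c1, c2, c3]
    s = [((a >> i) ^ (b >> i) ^ carries[i]) & 1 for i in range(4)]  # Si
    return sum(bit << i for i, bit in enumerate(s)), c4

assert cla_4bit(0b1011, 0b0110) == (0b0001, 1)          # 11 + 6 = 17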
Carry-Save Adder (CSA) The various types of adders we have discussed so far can add only two
numbers. In parallel processing, and in multiplication and division, multi-operand addition is often
encountered. More powerful adders are required which can add many numbers together instead of just
two. Such a high-speed multi-operand adder is called a carry-save adder (CSA). To see its
effectiveness, consider the following example:
   34
   62
   58
 + 76
 ----
   10      Sum vector
  22       Carry vector
 ----
  230      Final result
In this example, four decimal numbers are added. First, the units-place digits are added, producing
a sum digit of 0 and a carry digit of 2. Similarly, the tens-place digits are added, producing a sum
digit of 1 and a carry digit of 2. These summations can be performed in parallel to produce a sum
vector of 10 and a carry vector of 22, because there is no carry propagation from the units place to
the tens place. When all digits of the operands have been added, the sum vector and the shifted carry
vector are added in the conventional manner, i.e. using either a CPA or a CLA, which produces the
final answer.
The CSA takes three numbers as inputs, say X, Y and Z, and produces two outputs, the sum vector
S and the carry vector C, obtained by the following relations:
S = X ⊕ Y ⊕ Z
C = XY + YZ + XZ; here all logical operations are performed bit-wise.
The final arithmetic sum of the three inputs, i.e. Sum = X + Y + Z, is obtained by adding the two
outputs, i.e. Sum = S + C with C shifted left by one bit position, using a CPA or CLA.
Let us take one example to illustrate the CSA technique.
X   = 0 1 0 1 1 0
Y   = 1 1 0 0 1 1
Z   = 0 0 1 1 0 1
S   = 1 0 1 0 0 0
C   = 0 1 0 1 1 1
Sum = S + C (shifted left one position) = 1 0 1 0 1 1 0
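The same example in Python, with the bit-wise relations written out (illustrative):

def carry_save_add(x, y, z):
    """Carry-save adder: reduces three operands to a sum vector and a
    carry vector with no carry propagation (all operations bit-wise)."""
    s = x ^ y ^ z                       # S = X xor Y xor Z
    c = (x & y) | (y & z) | (x & z)     # C = XY + YZ + XZ
    return s, c

s, c = carry_save_add(0b010110, 0b110011, 0b001101)
assert (s, c) == (0b101000, 0b010111)
assert s + (c << 1) == 0b1010110        # final CPA/CLA step: S + shifted C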
The carry-save addition process can be implemented in fully parallel mode or in series-parallel
mode. Let us consider the 4-operand summation:
Sum = X + Y + Z + W
where X, Y, Z and W are 4-bit operands. The block diagram representation of this summation process
is shown in Fig. 3.12(a).
In this system, the first carry-save adder (CSA) adds X, Y and Z and produces a sum vector (S1)
and a carry vector (C1). The sum vector, the shifted carry vector and the fourth operand W are applied
as the inputs to the second CSA. The results produced by the second CSA are then added by a CPA to
generate the final summation, Sum.
The carry is propagated only in the last step. So, the total time required to add 4 operands is:
Time (4) = 2 * [CSA add time] + [CPA add time]
In general, time to add n operands by this method is:
Time (n) = (n–2) * [CSA add time] + [CPA add time]
This result can again be improved by using a CLA in the last stage, instead of CPA.
The circuit in Fig. 3.13 can easily be extended to design an n-bit incrementer by using more half
adders. Instead of half adders, full adders can be used in the incrementer circuit, where one of the
inputs of each full adder is connected to logic '0' and the first-stage full adder's carry input is fixed
at logic '1'.
For the decrementer, a bit 1 is instead applied as one of the inputs to every full adder. This means
that the binary number (1111) is added to the operand number A. The binary number (1111) represents
-1 in decimal, since negative numbers are represented in computers using the signed 2's complement
method. That is, we are adding -1 to the operand value stored in register A.
The selection input S determines the operation. When S = 0, this circuit performs the addition
operation, and when S = 1, it performs the subtraction operation. The inputs to each XOR gate are the
S input and a B input. When S = 0, we have 0 ⊕ B = B (as can be verified from the truth table of the
XOR gate). This means that the B value is fed directly into each full adder (FA), and the carry input to
the first full adder is 0; thus, the circuit performs addition. When S = 1, we have 1 ⊕ B = B' (the
complement of B) and the carry input is 1. This means that the circuit performs the addition of A to
the 2's complement of B. For unsigned numbers, this gives A - B if A ≥ B, or the 2's complement of
(B - A) if A < B. For signed numbers, the result is A - B, provided that there is no overflow.
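A behavioural sketch of this composite adder/subtractor in Python (illustrative; the function name is ours):

def add_sub(a, b, s, n=4):
    """Composite adder/subtractor: each Bi goes through an XOR gate
    with select S, and S also feeds the initial carry.  S = 0 adds;
    S = 1 adds A + 2's complement of B, i.e. subtracts."""
    mask = (1 << n) - 1
    y = b ^ (mask if s else 0)          # Bi xor S on every bit
    return (a + y + s) & mask           # carry-in = S

assert add_sub(0b0111, 0b0011, 0) == 0b1010   # 7 + 3 = 10
assert add_sub(0b0111, 0b0011, 1) == 0b0100   # 7 - 3 = 4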
The output of the circuit is calculated from the following arithmetic sum:
K = A + Y + Cin
where A is a 4-bit number, Y is the 4-bit output of multiplexers and Cin is the carry input bit to
the parallel adder. With this circuit it is possible to obtain 8 arithmetic micro-operations, as listed in
Table 3.2.
Case 1: When S1 S0 = 00.
In this case, the values of B are selected onto the Y inputs of the adder. If Cin = 0, the output is
K = A + B; the microoperation is addition. If Cin = 1, the output is K = A + B + 1; the microoperation
is addition with carry.
A logic unit is needed to perform logical microoperations such as OR, AND, XOR (exclusive-OR),
complement, etc. on individual pairs of bits stored in registers. For example, the OR microoperation
between the contents of two registers R2 and R3 can be stated as
C: R1 ← R2 ∨ R3.
This symbolic instruction specifies the OR microoperation to be performed bitwise on the contents of
registers R2 and R3, provided that the control variable C = 1.
Special symbols are used for the logic microoperations XOR, AND and complement (sometimes
called NOT). The symbol ⊕ is used to denote the XOR microoperation. The symbol ∧ designates the
AND microoperation, and a bar (or prime) on top of a register name indicates the (1's) complement or
NOT microoperation.
Now, we will design a logic unit that can perform the four basic logic microoperations OR, AND,
XOR and complement, because from these four microoperations all other logic microoperations can
be derived. A one-stage logic unit for these four basic microoperations is shown in Fig. 3.17.
The shifter unit is used to perform shift microoperations, which transfer stored data serially. The
shifting of the bits of a register can be in either direction, left or right. Shift microoperations can be
classified into three categories:
(a) Logical (b) Circular (c) Arithmetic.
In a logical shift, all bits, including the sign bit, take part in the shift operation. A 0 enters the
vacant extreme bit position (leftmost or rightmost). As a result, the leftmost bit is lost in a left shift
operation; similarly, the rightmost bit is lost in a right shift operation. We use the symbols lsl and lsr
for the logical left shift and right shift microoperations respectively.
In a circular shift (also known as a rotation operation), the bit shifted out from one extreme bit
position enters the vacant bit position at the other extreme, as shown in Fig. 3.18. No bit is lost or
added.

Figure 3.18 Circular shift microoperation

In an arithmetic shift, the sign bit remains unaffected and the other bits (the magnitude bits) take
part in the shift microoperation, as shown in Fig. 3.19. As a result of an arithmetic left shift operation,
the leftmost bit of the magnitude part is lost and the vacant rightmost bit is filled with 0. Similarly, in
an arithmetic right shift, the rightmost bit is lost and the vacant position is filled with a copy of the
sign bit.
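These microoperations are one-liners in Python (illustrative, for 8-bit registers; the arithmetic right shift follows the 2's complement convention used in the solved problems later in this chapter):

N = 8
MASK = (1 << N) - 1

def lsl(x): return (x << 1) & MASK                      # logical shift left
def lsr(x): return x >> 1                               # logical shift right
def rol(x): return ((x << 1) | (x >> (N - 1))) & MASK   # circular shift left
def ror(x): return (x >> 1) | ((x & 1) << (N - 1))      # circular shift right
def asr(x): return (x >> 1) | (x & (1 << (N - 1)))      # sign bit replicated

a = 0b11011001
assert asr(a) == 0b11101100
assert ror(asr(a)) == 0b01110110
assert lsl(ror(asr(a))) == 0b11101100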
The arithmetic, logic and shifter units introduced earlier can be combined into one ALU with common
selection lines. The shift microoperations are often performed in a separate unit, but sometimes the
shifter unit is made part of the overall ALU. The ALU is thus composed of three units: the arithmetic,
logic and shifter units. For a 4-bit ALU, four multiplexers of size 4 × 1 are needed for the arithmetic
unit, four multiplexers of size 4 × 1 for the logic unit, and four multiplexers of size 2 × 1 for the
shifter unit. A complete block diagram schematic of a 4-bit ALU is shown in Fig. 3.21.
The set of four 4:1 multiplexers at the output end chooses among the arithmetic output K, the logic
output L and the shift output G. A particular arithmetic or logic microoperation is selected with the
selection inputs S1 and S0. The final output of the ALU is determined by the set of output multiplexers
with selection lines S3 and S2. The function table for the ALU is shown in Table 3.5. The table lists
14 microoperations: 8 for the arithmetic unit, 4 for the logic unit and 2 for the shifter unit. For the
shifter unit, the selection line S1 is used to select either the left or the right shift microoperation.
REVIEW QUESTIONS
Group A
1. Choose the most appropriate option for the following questions:
(i) Microoperation in computers is an operation
(a) in ALU (b) on stored data in register
(c) in control unit (d) performed by the operating system.
(ii) A register is
(a) a part of main memory (b) a part of CPU
(c) collection of flip-flops (d) Both (b) and (c).
(iii) The address of the next instruction to be executed is held by
(a) AC (accumulator) register (b) IR (instruction register)
(c) PC (program counter) register (d) SP (stack pointer).
(iv) A bus in the computer system is
(a) collection of some individual lines each is used to send a bit randomly
(b) collection of parallel lines each is used to send one bit synchronously
(c) collection of lines through which control signals are sent
(d) collection of lines through which data are sent.
(v) To construct an n-line common bus using MUXs for k registers of n bits each, the number of
MUXs and the size of each MUX are
(a) k and n × 1 (b) n and 2k
(c) n and k × 1 (d) k and 2n, respectively.
(vi) To construct an n-line common bus using tri-state buffers for k registers of n bits each, the
number of buffers and the size of the common decoder are
(a) n and log2k-to-k (b) n*k and log2n-to-n
(c) k and log2n-to-n (d) k*n and log2k-to-k, respectively.
(vii) To add two n-bit numbers in parallel adder, the number of clock periods required is
(a) n (b) 2*n (c) 1 (d) n/2.
(viii) The number of clock periods needed in n-bit serial adder to add two numbers is
(a) n (b) 2*n (c) 1 (d) n/2.
(ix) The maximum propagation delay for n-bit CPA is
(a) D * n (b) 6 * D (c) n
(d) D, where D is the time delay for each full adder stage.
(x) All carries in the CLA depend on
(a) final stage carry only (b) initial input carry only
(c) previous stage carry (d) Both (a) and (b).
(xi) The maximum propagation delay for n-bit CLA is
(a) D * n (b) 6 * D (c) n
(d) D, where D is the average gate delay.
(xii) The CSA is
(a) 2-to-1 converter
(b) 3-to-1 converter
(c) 3-to-2 converter
(d) n-to-2 converter; where n is any positive integer.
(xiii) The minimum number of CSA-levels and minimum number of CSAs required in CSA tree to
add seven n-bit operands are
(a) 3 and 5 (b) 4 and 4 (c) 4 and 5 (d) 5 and 4, respectively.
(xiv) In arithmetic shift operation, 0s are padded in vacant positions for
(a) left/right shift of sign magnitude numbers
(b) right shift of 2’s complement negative numbers
(c) left shift of 2’s complement negative numbers
(d) right/left shift of 1’s complement numbers.
Group B
2. Show the circuit diagram for implementing the following register transfer operation:
if (ab = 1) then R1 ← R2 else R1 ← R3; where a and b are control variables.
3. What is a bus and what is bus transfer? Why do most computers have a common bus system?
4. Construct a common bus system using MUXs for three registers, each of 4-bits.
5. A digital computer has a common bus system for k-registers of n-bits each. The bus is constructed
with MUXs.
(i) What sizes of MUXs are needed?
(ii) How many MUXs are there in the bus?
6. What is a tri-state buffer? Construct a common bus system using tri-state buffers for two registers of
4 bits each.
7. A digital computer has a common bus system for k registers of n bits each. The bus is constructed
with tri-state buffers.
(i) What size of common decoder is needed?
(ii) How many tri-state buffers are there in the bus?
8. Explain the significance of timing signals in a computer system.
9. What is memory transfer? What are the different registers associated with memory transfer? Discuss.
10. What is a binary adder? What are the different types of binary adders?
11. What is a serial adder? Discuss it using a diagram. What are its merits and demerits?
12. What is a parallel adder? What are the different types of parallel adders? What are their merits and
demerits?
13. Compare and contrast the serial adder and the parallel adder.
14. What is a carry-propagate adder (CPA)? Design a 4-bit CPA. What are its merits and demerits?
Estimate the maximum propagation delay of an n-bit CPA.
15. What is a carry look-ahead adder (CLA)? Design a 4-bit CLA. What are its merits and demerits?
Estimate the maximum propagation delay of an n-bit CLA.
16. Why is the CLA called a fast parallel adder? Explain.
17. How do you design a 32-bit CPA using 8-bit CPAs? Give the block diagram.
18. What is carry save adder (CSA)? Use one example to illustrate its operation.
19. How many CSA levels are needed to reduce 16 summands to 2 using CSA-tree pattern? Draw the
diagram. What is the addition time for n summands?
20. Design a 4-bit incrementer circuit using full adders.
21. Design a 4-bit combinational decrementer circuit.
22. Design an n-bit adder/subtractor composite unit.
23. Design a 3-bit arithmetic unit, which will perform addition, subtraction, increment, decrement and
transfer operations.
24. Design a logic circuit that performs four logic operations of XOR, XNOR, NOR and NAND.
25. Suppose register A holds the 8-bit number 11011001. Determine the B operand and the logic micro-
operation to be performed in order to change the value in A to:
(i) 01101101
(ii) 11111101
26. Suppose register A holds the 8-bit number 11011001. Determine the sequence of binary values in A
after an arithmetic shift-right, followed by a circular shift-right and followed by a logical shift-left.
27. Design a 4-bit combinational shifter circuit.
28. Design a 2-bit ALU that performs addition, subtraction, logical AND, logical OR and logical shift
operations.
SOLVED PROBLEMS
1. Why is every computer system provided with a set of general-purpose registers?
Answer
Some general-purpose registers are used inside the processor to enhance the effective execution speed.
Registers are the fastest storage devices in a computer system, with speeds almost matching the
processor's, and are used to hold instructions and data temporarily. The processor does not have to
wait for required instructions and data if they are available in registers.
2. A digital computer has a common bus system for k-registers of n bits each. The bus is
constructed with multiplexers.
(a) What size of multiplexers is needed?
(b) How many multiplexers are there in the bus?
Answer
For an n-line bus system for k registers of n bits each:
(a) The size of each multiplexer must be k ¥ 1, since it multiplexes k data lines each from a
register.
(b) Each multiplexer transfers one bit of the selected register. The number of multiplexers needed
to construct the bus is equal to n, the number of bits in each register.
3. A digital computer has a common bus system for k-registers of n bits each. The bus is
constructed with tri-state buffers.
(a) How many decoders are needed and what size of decoder is needed?
(b) How many tri-state buffers are there in the bus?
Answer
For an n-line bus system for k registers of n bits each:
(a) Only one decoder is required to select among the k registers. The size of the decoder should be
log2k-to-k.
(b) The total number of buffers needed is k * n.
4. Show the circuit diagram for implementing the following register transfer operation:
if (ab = 1) then R1 ← R2 else R1 ← R3; where a and b are control variables.
Answer
The control function C is ab. The register transfer operation can be written as:
C: R1 ← R2
C': R1 ← R3
The circuit diagram for the register transfer operations is shown below.
The R2 register is selected by the MUX if control condition C = 1; otherwise register R3 is
selected as source register.
Answer
In the serial adder, the propagation delay of the flip-flop, tf, also contributes to the total addition
delay. If ts is the delay of a single-stage adder, the minimum period of the clock must be (ts + tf).
Hence, the minimum time needed by the serial adder is 2 × (ts + tf).
In the ripple carry adder, the addition time = 2 × ts.
Thus, the ripple carry adder is faster.
6. Suppose a 16-bit ripple carry adder (RCA) is constructed using 4-bit RCAs as building blocks.
What is the maximum addition delay of the adder?
Answer
To generate S15, C15 must be available. The generation of C15 depends on the availability of C14,
which in turn must wait for C13 to become available. The maximum delay for such an adder can be
computed as:
15 × 2D (for the carry to propagate through 15 full adders) + 3D (for S15 generation from C15) = 33D.
7. Suppose a 16-bit carry look-ahead adder (CLA) is constructed using 4-bit CLA as building
block. What is the maximum addition delay of the adder?
Answer
The maximum delay for such adder can be computed as:
D(for Gi, Pi generation) + 2D (for C4 generation from C0) + 2D (for C8 generation from C4) +
2D (for C12 generation from C8) + 2D (for C15 generation from C12) + 3D (for S15 generation from
C15) = 12D.
8. Why is the CLA called a fast parallel adder?
Answer
In the CLA, carry-inputs of all stages are generated simultaneously, without using carries from the
previous stages. These input carries depend only on the initial carry C0. For this reason, CLA is fast
parallel adder.
9. If the average gate delay is 4 ns, what is the delay of an 8-bit carry look-ahead adder?
Answer
For a carry look-ahead adder, the addition delay is 6 × gate delay = 6 × 4 ns = 24 ns.
10. Two 4-bit unsigned numbers are to be multiplied using the principle of carry save adders.
Assume the numbers to be A3 A2 A1 A0 and B3 B2 B1 B0. Show the arrangement and intercon-
nection of the adders and the input signals so as to generate an 8-bit product as P7 P6 P5 P4 P3
P2 P1 P0.
Answer
The multiplication of two unsigned numbers is done by repeated add-shift operations. Add-shift
multiplication of two 4-bit numbers is illustrated in the figure below.
                     A3   A2   A1   A0      = A
                     B3   B2   B1   B0      = B
                   A3B0 A2B0 A1B0 A0B0      = W1
              A3B1 A2B1 A1B1 A0B1           = W2
         A3B2 A2B2 A1B2 A0B2                = W3
    A3B3 A2B3 A1B3 A0B3                     = W4
 P7   P6   P5   P4   P3   P2   P1   P0      = A × B = Product

Figure Add-shift multiplication of two 4-bit numbers (A × B = Product)
The additions of the partial products W1, W2, W3 and W4, which are generated using bit-wise AND
operations, can be done using a CSA tree, as shown in the figure below, to realize the multiplier for
4-bit numbers.
The first carry-save adder (CSA-1) adds W1, W2 and W3 and produces a sum vector (S1) and a
carry vector (C1). The sum vector, the shifted carry vector and the fourth partial product W4 are
applied as the inputs to the second CSA. The results produced by the second CSA are then added by
a CPA to generate the final summation, Sum.
12. Suppose register A holds the 8-bit number 11011001. Determine the B operand and the logic
micro-operation to be performed in order to change the value in A to:
(a) 01101101
(b) 11111101
Answer
(a) If B = 1011 0100 and the XOR logic micro-operation is performed with A, then the content of
A will be 0110 1101. That is, A ← A ⊕ B.
(b) If B = 0010 0100 and the OR logic micro-operation is performed between A and B, then the
content of A will be 1111 1101. That is, A ← A ∨ B.
13. Suppose register A holds the 8-bit number 11011001. Determine the sequence of binary values
in A after an arithmetic shift-right, followed by a circular shift-right and followed by a logical
shift-left.
Answer
The value in A = 1101 1001
After arithmetic shift right, the value in A = 1110 1100
After circular shift right, the value in A = 0111 0110
After logical shift left, the value in A = 1110 1100
14. What is a bit-slice processor? Give some examples.
Answer
A bit-slice processor is constructed from processor modules of smaller bit width, each of which
processes one bit field of an operand. Each processor module chip typically includes an ALU and a
few registers. Using smaller-size processor slices, one can design a processor of any word length. For
example, a 4-bit bit-slice processor includes a register file and an ALU that operate on 4-bit data, so
that four such chips can be combined to build a 16-bit processor unit.
Some examples of bit-slice processors are Intel's 3000 family and AMD's Am2900 family.
CHAPTER
4
Memory Organization
4.1 INTRODUCTION
A computer's memory system is just as important as the CPU in determining its performance,
because programs and the data they operate on are stored in the computer's memory. The execution
speed of programs is highly dependent on the speed with which instructions and data can be
transferred between the processor and memory. It is also very important to have a large memory, to
facilitate the execution of programs that are large and deal with huge amounts of data.
Ideally, we would like to have a memory that is fast, large and inexpensive. Unfortunately, it is
impossible to meet all three requirements simultaneously: if we increase speed and capacity, the cost
increases. We can achieve these goals at an optimum level by using several types of memories
together.
The maximum capacity of a memory is determined by the addressing scheme. For example, a
16-bit computer that generates 16-bit addresses is capable of addressing up to 2^16 = 64K memory
locations.
Speed A useful parameter of a memory is its speed of operation: the time that elapses between the
initiation of an operation and its completion. This is measured in terms of two parameters: the access
time tA and the cycle time tC. Sometimes speed is quoted in terms of access time and sometimes in
terms of cycle time. For example, to perform a read operation, first the address of the location is sent
to the memory, followed by the 'read' control signal. The memory decodes the address to select the
location and reads out its contents. The access time is the time taken by the memory to complete a
read operation from the moment of receiving the 'read' control signal. Generally, the access times for
read and write are equal. The memory cycle time is the minimum time delay required between the
initiations of two successive memory operations; for example, the time delay between two successive
memory read operations is the memory cycle time. During the first read operation, the information
read from memory is available after the access time and can be used immediately by the CPU.
However, the memory remains busy with some internal operation for some additional time, called the
recovery time tR. During this time, another memory access, read or write, cannot be initiated; only
after the recovery time can the next operation be started. The cycle time is the total of the access time
and the recovery time: tC = tA + tR. The recovery time varies with memory technology.
Bandwidth or Data Transfer Rate The maximum amount of information that can be transferred
to or from the memory per unit time is called the bandwidth, expressed as a number of bytes or words
per second. It depends on the speed of access and the width of the data bus.
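A quick numerical illustration in Python (the figures below are ours, not from the text):

# Suppose tA = 60 ns, tR = 20 ns and the data bus is 4 bytes wide.
t_a, t_r = 60e-9, 20e-9
t_c = t_a + t_r                    # cycle time: tC = tA + tR = 80 ns
width_bytes = 4
bandwidth = width_bytes / t_c      # one 4-byte word per cycle
print(round(bandwidth / 1e6))      # 50 -> 50 MB per second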
The total memory capacity of a computer can be viewed as a hierarchy of components. The memory
hierarchy system consists of all the storage devices used in a computer system, and these are broadly
divided into the following four groups, shown in Fig. 4.2:
• Secondary (auxiliary) memory
• Main (primary) memory
• Cache memory
• Internal memory
Secondary Memory The slow-speed, low-cost devices that provide backup storage are called
secondary memory. The most commonly used secondary memories are magnetic disks, such as the
hard disk and floppy disk, and magnetic tapes. This type of memory stores all programs and data,
since it provides storage in bulk. When a program not residing in main memory needs to execute, it is
transferred from secondary memory to main memory. Programs not currently needed in main memory
(in other words, not currently being executed by the processor) are moved out to secondary memory
to provide space for the programs and data currently in use.
Main Memory This is the memory that communicates directly with CPU. Only programs and data
currently needed by the CPU for execution reside in the main memory. Main memory occupies
a central position in the hierarchy, being able to communicate directly with the CPU and with
secondary memory devices through an I/O processor, as depicted in Fig. 4.3.
Cache Memory This is a special high-speed memory, sometimes used to increase the speed of
processing by making current programs and data available to the CPU at a rapid rate. Generally,
the CPU is faster than main memory, with the result that processing speed is limited mainly by the
speed of main memory. A technique used to compensate for the speed mismatch between the CPU and
main memory is to place an extremely fast, small cache between them, whose access time is close to
the CPU cycle time. The cache is used for storing portions of programs currently being executed in
the CPU and temporary data frequently needed in the present computations. Thus, the cache memory
acts as a buffer between the CPU and main memory. By making programs and data available at a rapid
rate, it is possible to increase the performance of the computer.
Cache memory is built from high-speed SRAM. The cache can be placed at more than one level.
Most recent microprocessors, starting with the Intel 80486, have on-chip cache memory (the memory
is placed inside the CPU chip), also known as internal cache. High-performance microprocessors
such as the Pentium Pro and later have two levels of cache memory on-chip, known as the level 1
(L1) and level 2 (L2) caches. An on-chip cache is slightly faster than an off-chip cache of the
same technology.
Internal Memory This memory refers to the high-speed registers used inside the CPU. These
registers hold temporary results while a computation is in progress. There is no speed disparity
between these registers and the CPU, because they are fabricated with the same technology. However,
since registers are very expensive, only a few of them are used as internal memory.
A basic characteristic of a memory is the order or sequence in which information can be accessed.
The methods of accessing include the following:
• Sequential or serial access
• Random access
• Direct or semi-random access
• Associative access
Sequential Access In this method, the memory is accessed in a linear sequential manner. For
example, if the fourth record (collection of data) stored in a sequential-access memory needs to
be accessed, the first three records must be skipped over. Thus, the access time in this type of
memory depends on the location of the data. Magnetic tapes are the most common example of this
method.
Random Access In this mode, any location of the memory can be accessed at random. In other
words, access to a location is not related to its physical position and is independent of other
locations. For random access, each location has its own separate addressing mechanism.
Semiconductor memories (RAM, ROM) are of this type.
Direct Access This method is essentially a combination of the previous two. Memory devices such
as magnetic hard disks contain many rotating storage tracks. If each track has its own read/write
head, the tracks can be accessed randomly, but access within each track is sequential. In this
case the access is semi-random, or direct. The access time depends both on the memory organization
and on the characteristics of the storage technology.
Associative Access This is a special type of random access that enables one to compare desired
bit locations within a word for a specified match, and to do this for all words simultaneously.
Thus, a word is retrieved based on a portion of its content rather than on its address. Cache
memory uses this type of access mode.
The central storage unit in a computer system is the main memory, which is directly accessible by
the CPU. It is a relatively large and fairly fast memory used to store programs and data during
computer operation. Most of the main memory in a general-purpose computer is made up of RAM
(Random Access Memory) integrated-circuit chips, which are volatile in nature (i.e. if power goes
off, the stored information is lost). A small part of the main memory is also constructed with
ROM (Read Only Memory) chips, which are non-volatile. Originally, RAM was used to refer to a
random-access memory, but it is now used to mean a read-write memory (RWM), to distinguish it
from a read-only memory, although a ROM's access mechanism is also random.
RAM is used to store most of the programs and data that are modifiable. Integrated RAM chips are
available in two forms: static RAM (SRAM) and dynamic RAM (DRAM). SRAM consists of circuits
capable of retaining the stored information as long as power is applied; that is, this type of
memory requires constant power. SRAM memories are used to build cache memory. DRAM, on the other
hand, stores binary information in the form of electric charge applied to capacitors. The charge
stored on the capacitors tends to leak away over a period of time, and thus the capacitors must
be periodically recharged (refreshed) to retain their state. The main memory is generally made up
of DRAM chips.
Comparison of SRAM and DRAM
1. SRAM has a lower access time, which means it is faster than DRAM.
2. SRAM requires a constant power supply, which means this type of memory consumes more power,
whereas DRAM offers reduced power consumption because the information is stored as charge on
a capacitor.
3. Because the one-bit memory cell of a DRAM needs relatively little internal circuitry, a single
DRAM chip offers a larger storage capacity than an SRAM chip of the same physical size. In
other words, DRAM has a higher packing density than SRAM.
4. SRAM is costlier than DRAM.
Another part of the main memory consists of ROM (read-only memory), whose contents are not
generally altered by users/programmers. In other words, ROM is generally used for storing the
programs and data that are permanently resident in the computer. In this connection, it is worth
noting that the ROM portion of main memory is needed for storing an initial start-up program
called a bootstrap loader. The bootstrap loader is a system program whose task is to load a
portion of the operating system from secondary memory (hard disk) into main memory (RAM).
The chip has 512 locations, each capable of storing 8 bits. This requires a 9-bit address bus and
an 8-bit bidirectional data bus. Note that the data bus is bidirectional, since it allows the
transfer of data from memory to CPU during a read operation and from CPU to memory during a write
operation. The R/W control line specifies either a read or a write operation: when this line is
high, a read is performed, and when it is low, a write is performed. Some chip-select lines are
required to enable the desired chip among the multiple chips of a large memory system before a
read or write operation on it.
A ROM is only read from, so it does not have any R/W line: once a ROM chip is selected, it can
only be used to read binary information. Also, the data bus is unidirectional. A ROM chip is
organized externally in a manner similar to a RAM. A block diagram for a ROM chip of size 512 × 8
is shown in Fig. 4.5.
For a chip of the same physical size, it is possible to have more bits of ROM than of RAM, because
the internal binary cells in ROM occupy less space than those in RAM, as can be easily understood
from their internal circuits discussed in Section 4.5.2. In other words, ROM has a higher packing
density than RAM.
Construction of Large Memory Using Small Chips A large memory can be constructed by
expanding small chips either horizontally or vertically. In horizontal expansion, the word length
is increased, whereas in vertical expansion the number of locations is increased. For example, two
RAM chips, each of size 512 × 4, can be horizontally expanded to obtain a memory of size 512 × 8,
and the same two 512 × 4 RAM chips can be connected vertically to construct a memory of size
1K × 4. A large memory is sometimes constructed using only one of these techniques and sometimes
using both.
The large memory to be constructed can be heterogeneous (i.e. a mixture of both RAM and ROM) or
homogeneous (i.e. all chips are RAM, or all are ROM, but not both). We will discuss this using
some examples.
RAM and ROM chips are available in a variety of sizes. If the memory needed for a computer is
larger than the size of a single chip, it is necessary to combine a number of smaller chips to
form the required memory size.
To illustrate this construction, we will take two examples: the first of heterogeneous and the
second of homogeneous connection.
Heterogeneous Case
Example 4.1 Suppose we have two sets of memories, RAM of total size 512 × 8 (built from smaller
RAM chips) and ROM of size 512 × 8, with which to design a memory of capacity 1024 × 8.
It has already been mentioned that for the same physical size of chip, it is possible
to have more bits of ROM than of RAM, because the internal binary cells in ROM occupy
less space than in RAM. Therefore, we take one ROM chip of 512 bytes and four RAM
chips each of 128 bytes. Fig. 4.6 shows the interconnection diagram of these
memories with a CPU having 16 address lines.
Address lines 1 to 7 are connected to every chip, while address lines 8 and 9 serve
a dual purpose. For selecting one RAM out of the four, these two lines drive a
2-to-4 decoder; they are also connected to the ROM as address lines, along with
lines 1 to 7, giving the ROM a total of 9 address lines, since the ROM has 512
locations. CPU address line 10 is used to distinguish between RAM and ROM.
Memory Address Map The interconnection between memory and CPU is estab-
lished from the size of memory needed and the types of RAM and ROM chips
available. The addressing of memory can be specified by means of a table, known as
a memory address map, which gives the memory address space assigned to each
chip. The address map for the memory connection of Fig. 4.6 is constructed in
Table 4.1. The CPU generates a 16-bit address for a memory read or write
operation. For any RAM operation, only a 7-bit address is required within the
chip. Since there are 4 RAM chips, a 2-to-4 decoder, driven by lines 8 and 9, is
required for selecting one RAM at a time. Line 10 distinguishes RAM from ROM. The
ROM chip has 512 bytes and so needs 9 address lines: the lower-order 7 address
lines together with lines 8 and 9. The remaining CPU lines 11 to 16 are unused;
for simplicity we assume that they carry 0s as address signals.
Table 4.1 Memory address map for Fig. 4.6
Homogeneous Case Suppose the required large RAM size is K × L and the capacity of the small
RAM chip is m × n. Then the number of small chips required is

    s = ⌈(K × L)/(m × n)⌉
Example 4.2 Suppose we have to construct a large RAM-type memory of size 1K × 8 using smaller
RAM chips, each of size 256 × 2.
The block diagram for a 256 × 2 RAM chip is shown in Fig. 4.7.
The larger 1K × 8 RAM memory requires 10 address lines and 8 data lines. Its
construction using smaller 256 × 2 RAMs needs 1K/256 = 1024/256 = 4 rows and
8/2 = 4 columns of smaller chips, as shown in Fig. 4.8. The total number of
smaller chips required is s = 4 × 4 = 16. In Fig. 4.8, all chips are
of equal size, i.e. 256 × 2. The address lines A8 and A9 are connected to a
2-to-4 decoder, which activates (selects) one of the four rows of chips at a
time. For example, if A8A9 = 00, the decoder selects the first row of four chips
for a read or write operation. The same 8-bit address is sent to all chips, so
when one row of four chips is selected, 8 bits of data from a common location in
all four chips are accessed simultaneously.
Example 4.3 Suppose a large memory of 1K × 4 is to be constructed using 512 × 2 RAM chips.
For the small 512 × 2 RAM chip, the number of address lines required is 9 and
the number of data lines is 2. For the large 1K × 4 memory, the number of
address lines required is 10 and the number of data lines is 4.
Therefore, in the interconnection diagram:
The number of rows = 1K/512 = 1024/512 = 2
The number of columns = 4/2 = 2
Hence, the number of small RAM chips required = 2 × 2 = 4.
The interconnection diagram is shown in Fig. 4.9. Here there are only two rows,
so the first row is selected (activated) directly by line A9 of the address bus
and the second row by its complement. In other words, if line A9 carries logic 1,
the first row of chips is selected; otherwise the second row is selected.
Write Operation Similarly, for the write operation, the address provided to the decoder activates
the word line to close both switches. The bit value to be written into the cell is then provided
through the sense/write circuit, and the signals on the bit lines are stored in the cell.
CMOS (Complementary Metal Oxide Semiconductor) Realization of SRAM An SRAM cell
using CMOS is shown in Fig. 4.12. Four transistors (T3, T4, T5 and T6) are cross-connected in
such a way that they produce a stable state. In state 1, the voltage at point A is maintained high
and the voltage at point B low by keeping transistors T3 and T6 on (i.e. closed) while T4 and T5
are off (i.e. open). Similarly, in state 0, the voltage at point A is low and at point B high by
keeping T3 and T6 off while T4 and T5 are on. Both these states are stable as long as power is
applied. Thus, for state 1, if T1 and T2 are turned on (closed), bit lines b and b′ will have high
and low signals, respectively. The state of the cell is read or written as described above.

Figure 4.12 A CMOS SRAM cell
The main advantage of CMOS SRAM is its low power consumption: current flows in the cell only when
the cell is being accessed. Otherwise, T1, T2 and one transistor in each inverter are turned off,
ensuring that there is no active path between Vcc and ground.
DRAM Memory Cell Though SRAM is very fast, it is expensive because each of its cells requires
several transistors. A relatively less expensive RAM is DRAM, owing to the use of one transistor
and one capacitor in each cell, as shown in Fig. 4.13, where C is the capacitor and T is the
transistor. Information is stored in a DRAM cell in the form of a charge on the capacitor, and
this charge must be periodically refreshed.
To store information in this cell, transistor T is turned on and an appropriate voltage is applied
to the bit line. This causes a known amount of charge to be stored in the capacitor. After the
transistor is turned off, the capacitor, by its nature, begins to discharge. Hence, the
information stored in the cell can be read correctly only if it is read before the charge on the
capacitor drops below some threshold value.

Figure 4.13 A DRAM cell
Types of RAM
Asynchronous DRAM (ADRAM) The DRAM described above is the asynchronous type. The timing of
the memory device is controlled asynchronously: a specialized memory-controller circuit generates
the necessary control signals, and the CPU must take into account the delay in the memory's
response.
Synchronous DRAM (SDRAM) The access speed of these RAM chips is directly synchronized with
the CPU's clock, so the memory chips are ready for operation when the CPU expects them to be.
These memories operate on the CPU-memory bus without imposing wait states. SDRAM is
commercially available as modules incorporating multiple SDRAM chips that together form the
required capacity.
Double-Data-Rate SDRAM (DDR SDRAM) This faster version of SDRAM performs its operations
on both edges of the clock signal, whereas standard SDRAM operates only on the rising edge.
Since data are transferred on both edges of the clock, the data transfer rate is doubled. To
supply data at this high rate, the memory cells are organized into two groups, each accessed
separately.
Rambus DRAM (RDRAM) The RDRAM provides a very high data transfer rate over a narrow
CPU-memory bus. It uses various speedup mechanisms, such as a synchronous memory interface,
caching inside the DRAM chips and very fast signal timing. The Rambus data bus width is 8 or 9 bits.
Cache DRAM (CDRAM) This is a special type of DRAM with an on-chip cache memory (SRAM)
that acts as a high-speed buffer for the main DRAM.
ROM Memory Cell ROM is another part of main memory; it is used to store permanent system
programs and system data. A ROM cell structure is shown in Fig. 4.14. A logic value 1 is stored
in the cell if the transistor is not connected to ground at point P; otherwise, a binary 0 is
stored. The bit line is connected through a resistor to the power supply.
To read the state of the cell, the word line is activated to close the transistor, which acts as
a switch. The voltage on the bit line drops to near zero if point P is connected to ground; if
point P is not connected, the bit line stays high and the cell is read as 1. Data are written
into ROM cells when the chip is manufactured.

Figure 4.14 A ROM memory cell
A variety of ROM chips are available; they are discussed briefly next.
Types of ROM
PROM Memory Cell Some ROM designs allow the data to be loaded into the cell by the user;
such a ROM is called a PROM (Programmable ROM). Programmability is achieved by inserting a fuse
at point P in Fig. 4.14. Before it is programmed, the memory contains all 0s. The user can insert
1s at the required locations by burning out the fuses of the cells at those locations using high
voltage. PROM cells are one-time programmable: the user can store the desired bits in the cells
only once, and these bits cannot subsequently be altered.
EPROM An erasable PROM (EPROM) uses a transistor in each cell that acts as a programmable
switch. The contents of an EPROM can be erased (set to all 1s) by exposing the device to
ultraviolet light for some (20 to 30) minutes. Since ROMs and PROMs are simpler, and thus cheaper,
than EPROMs, EPROMs are used mainly during system development and debugging.
EEPROM (Electrically Erasable PROM) In many applications, permanent data generated by an
application program have to be stored. For example, in a mobile phone the telephone numbers are
to be kept permanently until the user wants to erase them. Similarly, the user may wish to erase
previously entered information. EEPROMs have the advantage that the information in
them can be selectively erased by writing 1s, and each bit can then be stored again by writing
the desired value. An EEPROM therefore needs two write operations at an address: one for erasing
and one for writing. A RAM writes information directly, without first erasing the information at
that address; but in the EEPROM the stored information is non-volatile.
Flash Memory A currently popular type of EEPROM, in which erasing is performed in large blocks
rather than bit by bit, is known as flash EPROM or flash memory. Erasing in large blocks reduces
the overhead circuitry, leading to greater density and lower cost. A current trend is the 'memory
stick' made of flash memory, which plugs into a Universal Serial Bus (USB) port of a personal
computer for data exchange between computers.
The largest-capacity and least expensive memory in the system is secondary memory. The hierarchy
diagram in Fig. 4.15 illustrates some of the various devices available for secondary (auxiliary)
storage of data. The most common secondary memory devices are magnetic tapes and magnetic disks.
referred to as inter-record gaps. The length of a magnetic tape is typically 2400 feet, and it is
stored on a reel. The major difficulty with this device is particle contamination caused by
improper manual handling.
Problem 4.1 A disk pack has 19 surfaces. Storage area on each surface has an inner diameter of
22 cm and outer diameter of 33 cm. Maximum storage density on any track is
2000 bits/cm and minimum spacing between tracks is 0.25 mm.
Problem 4.2 A hard disk with one platter rotates at 15,000 rpm and has 1024 tracks, each with
2048 sectors. The disk read-write head starts at track 0 (tracks are numbered from
0 to 1023). The disk receives a request to access a random sector on a random
track. The seek time of the disk head is 1 ms for every 100 tracks it crosses.
(a) What is the average seek time?
(b) What is the average rotational latency?
(c) What is the transfer time for a sector?
Solution (a) The disk receives a request to access a track at random and the head starts
at track 0. On average, the head will have to cross (0 + 1023)/2 ≈ 511.5 tracks.
Since the seek time of the head is 1 ms per 100 tracks, the average seek time
= 511.5/100 = 5.115 ms.
(b) Since the platter rotates at 15,000 rpm, each rotation takes 1/15000 min =
(60 × 10^3)/15000 ms = 4 ms.
The average rotational latency is half the rotation time, which is 2 ms.
(c) Each rotation takes 4 ms, and the number of sectors per track = 2048.
Therefore the read-write head is over each sector for 4/2048 ms ≈ 0.002 ms
(approx.), so the transfer time for a sector ≈ 0.002 ms.
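The arithmetic of this solution is easily scripted; a minimal Python sketch using the figures of Problem 4.2:

    tracks, sectors, rpm = 1024, 2048, 15000

    avg_seek = ((tracks - 1) / 2) / 100    # average of 0..1023 tracks crossed, at 1 ms per 100 tracks
    rotation_ms = 60e3 / rpm               # one full rotation: 4 ms
    avg_latency = rotation_ms / 2          # half a rotation on average: 2 ms
    transfer_ms = rotation_ms / sectors    # one sector passes under the head in ~0.002 ms

    print(avg_seek, avg_latency, transfer_ms)   # 5.115 2.0 0.001953125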
CD CD stands for Compact Disc (note: when referring to optical media, disc ends in "c", not "k"
as with magnetic disks). CD-ROM means Compact Disc—Read Only Memory, and it is the same medium
as that used in a home or car CD player. Because the same formats are used for music and data,
audio CDs can be played on a computer sound system that has the proper hardware and software.
Although data CDs can be read by a computer, they cannot be played in a stereo system (and if
they could, what would they sound like?). Just a few years ago, software applications were
installed from several floppy disks, but now they are usually installed from a single CD-ROM.
CDs were the optical discs referred to earlier, with a storage capacity of 650-680 MB, the
equivalent of about 470 standard 3.5" floppy disks. Yet some applications are so large that they
may require more than one disc; for example, it takes four discs to install all the components
of the Premium version of Office 2000. A photographic top view of a CD is shown in Fig. 4.19.

Figure 4.19 Top view of a CD
Unlike magnetic disks, which place data in concentric circles (tracks), optical discs mimic the
layout of a phonograph record and place data in a single spiral (track). However, the spiral
track on a CD is on one side only and spirals from the inside out. Digital data—binary 0s and
1s—are represented by pits, which scatter a low-power laser beam, and lands (flat spots), which
reflect it. This is illustrated in Fig. 4.20, which depicts a side view of a CD. The individual
pits cannot be seen by holding a CD up to the light, because they are less than a millionth of a
meter in diameter.

Figure 4.20 Side view of a very small portion of a CD
CD-R CD-R means Compact Disc—Recordable. With the proper hardware (a burner), blank CD-R
discs and appropriate software, we can create our own data or audio CDs. Unlike CD-ROM discs,
where the pits are pressed into the surface in a process similar to that used to make phonograph
records, CD-R discs contain a dye layer composed of photosensitive organic compounds. Using a
higher energy setting, the writing laser heats the dye, which becomes opaque through a chemical
reaction. The sections of the disc that have not been burned act as lands, while the opaque
sections act as the non-reflective pits. In addition to creating new CDs, CD-R drives can also
read CD-ROMs and play audio CDs. Furthermore, newly created audio CDs can be played in a home or
car sound system. Unfortunately, CD-R discs can be recorded only once; the same tracks cannot be
erased and then rerecorded. Some newer burners are multi-session, which allows users to keep
adding data to a CD-ROM over time. This is important if we want to use the CD-R drive to create
backup CD-ROMs instead of the traditional tape backups. Are CDs reusable? In other words, can we
erase and rerecord them? The answer is yes: that is exactly what CD-RW provides.
CD-RW The final type of CD is CD-RW, which means Compact Disc—ReWriteable. With the proper
hardware (again called a burner), blank CD-RW discs and appropriate software, we can create our
own data or audio CDs. CD-RW technology uses a different kind of data-bearing layer from that
used in an ordinary CD-R: a phase-change process alters the layer from a reflective state to a
light-absorbing state, and the change can be reversed, making the area erasable and
reusable. When attempting to copy an audio CD to a blank CD-RW disc, the warning message shown
in Fig. 4.21 appears.
CD-RW discs will not play in a home or car sound system and, furthermore, older CD-ROM players
cannot read them.
DVD The newest type of optical storage, and the one with the greatest storage capacity, is DVD,
which means Digital Versatile Disc (originally digital video disc). Whereas CD players use an
infrared laser to read the information stored on the disc, DVD players use a red laser, whose
shorter wavelength allows the bits to be packed closer together. DVDs also use a more efficient
encoding, which further increases the capacity of the discs. Unlike CDs, DVDs can be dual-layered
and double-sided. DVDs have a storage capacity of 4.7-17 GB, the equivalent of roughly 7 to 25
CDs. Furthermore, DVD players can read CDs, so it is not surprising that they are the standard,
or at least an optional, disc reader in many newer computer systems.
With such an enormous storage capacity, what are some practical uses for DVD-ROMs? As indicated
by their early name, the most popular one is the playback of movies. Many videotape rental stores
now stock an increasing number of movie titles on DVD, which can be played in a computer DVD
player or in one attached to a home entertainment system. Because of the high storage capacity,
not only can a full-length movie be recorded on a single disc, but so can multiple language
tracks and subtitles in different languages. Using an audio DVD, a boxed set of CDs could be
released on a single disc instead of multiple CDs. Recall that the Premium version of Office 2000
requires four CDs; a single DVD could be used in their place. Phonebooks of the entire population
of our country are available on approximately 4-6 regional CDs containing all the listed phone
numbers. Can computer users create their own DVDs? As we might expect, the answer is yes.
Comparable to the record-once CD-R burners are the DVD-R burners, and a combination DVD-R/CD-RW
burner can be found in modern-day computers. However, there are two competing and incompatible
standards for rewriteable DVDs: DVD+RW and DVD-RAM.
Several data-processing applications require searching for data in a block or record stored in
memory. The normal procedure is to store all items where they can be addressed in
sequence. The normal search method is then to select addresses in sequence, read the content of
memory at each address and compare the information read with the item being searched for, until
a match occurs.
The time to find data stored in memory can be reduced considerably if stored data can be
identified by the data value itself rather than by an address. A memory accessed by data content
is known as an associative memory or content-addressable memory (CAM). When data are stored in
this memory, no address is given: the data are stored at the first empty location. When a data
word is to be read from the memory, only the data word, or a part of it called the key, is
provided. All words in the memory are compared with the specified key simultaneously, and the
matching words are marked for subsequent reading.
The advantage of this memory is that its organization makes it suitable for parallel searches;
searching in this memory is fast. Since each cell must have storage capability as well as logic
circuits for matching, associative memory is more expensive than RAM. For this reason, it is
used in applications where the search time is critical and must be very short. An associative
memory organization is shown in Fig. 4.22.
Hardware Structure The associative memory consists of a memory array and logic for m words
with n bits per word. Several registers are used in this organization; their functions are
described next.

Figure 4.22 Block diagram of associative memory

1. Input register (I) The input register I holds the data to be written into the associative
memory or the data to be searched for. At any time it holds one word of memory data, i.e. its
length is n bits.
2. Mask register (M) The mask register M provides a mask for choosing a particular field or key
in the input register's word. Its maximum length is n bits, because the M register may select a
portion of the word or all of its bits for searching. Suppose a student database file containing
several fields, such as name, class roll number and address, is stored in the memory, and only
the 'name' field is required for searching. Then the M register masks in this 'name' field only,
and the search in memory is carried out with respect to this field alone, ignoring the other
fields. Thus, only those bits of the input register that have 1s in the corresponding positions
of the mask register are compared. If the mask register contains all 1s, the entire argument is
compared with each memory word. Each word in memory is matched against the input data in the
I register.
To illustrate the matching technique, let the input register I and mask register M contain the
following information:
I = 1001 1011
M = 1111 0000
Word1 = 0011 1100 no match
Word2 = 1001 0011 match
Only four leftmost bits of I are compared with stored memory words because M has 1s in these
positions. There is a match for word2, but not with word1.
3. Select register (S) The select register S has m bits, one for each memory word. If matches
are found when the input data in the I register are compared using the key field chosen by the
M register, the corresponding bits in the select register S are set.
4. Output register (Y) This register holds the matched data word retrieved from the associative
memory.
The relation between the memory array and the four external registers in an associative memory
system is shown in Fig. 4.23.
The internal organization of a cell Cij is shown in Fig. 4.24. It consists of a flip-flop storage
element Aij and the circuits for matching, selecting, reading and writing the cell. In a write
operation, the input bit Ij is stored into the cell; in a read operation, the stored bit is read
from the cell. The match-and-select logic compares the content of the storage cell with the
corresponding unmasked bit of the input register and provides an output to the decision logic
that sets the bit in Si.
Cache memory is a small, fast memory used to increase the instruction-processing rate. Its
operation is based on a property of programs called locality of reference. Analysis of a large
number of typical programs shows that CPU references to main memory during any period of time
tend to be confined to a few localized areas of memory. In other words, a few instructions in
localized areas of memory are executed repeatedly for some duration, while other instructions
are accessed infrequently. This phenomenon is known as the property of locality of reference. It
may be understood by considering that when a program loop is executed, the CPU repeatedly refers
to the set of instructions in memory that constitute the loop; thus a loop tends to localize the
references to memory for fetching instructions.
There are two dimensions of the locality of reference property: temporal and spatial.
Temporal Locality Recently referenced instructions are likely to be referenced again in the near
future. This is often caused by program constructs such as iterative loops and subroutines: once
a loop is entered or a subroutine is called, a small code segment is referenced repeatedly many
times. Thus temporal locality tends to cluster accesses in recently used areas.
Spatial Locality This refers to the tendency for a program to access instructions whose addresses
are near one another. For example, operations on tables or arrays involve accesses of a certain
clustered area in the address space. Program segments, such as routines and macros, tend to be stored
in the same neighbourhood of the memory space.
If the active segments of a program are placed in a fast, small cache memory, the average
memory-access time can be reduced, thus reducing the total execution time of the program. This
memory is logically placed between the CPU and main memory, as shown in Fig. 4.3, because the
cache memory's speed is almost the same as that of the CPU. The main idea of cache organization
is that, by keeping the most frequently accessed instructions and data in the fast cache memory,
the average memory-access time approaches the access time of the cache.
The operation of the cache is conceptually simple. When the CPU needs a word from memory, the
cache is accessed first. If the word is found there, the CPU reads it from the fast cache memory.
If the word addressed by the CPU is not found in the cache, main memory is accessed to find it
and, exploiting the property of locality of reference, a block of words containing the one just
accessed is then brought into the cache from main memory. The block size may vary from machine
to machine. Another term often used for a cache block is cache line.
When the CPU finds a referenced word in the cache, a hit is said to occur; if the word is not in
the cache, a miss occurs. Cache performance is measured by the hit ratio, h:

    h = number of hits / (number of hits + number of misses)

Thus, the hit ratio is simply the probability that a memory reference made by the CPU results in
a hit, and so its range is 0 ≤ h ≤ 1.
Now consider the average access time of a memory system consisting of two levels of memory:
main memory and cache memory.
Let tc, h and tm denote the cache access time, the hit ratio in the cache and the main-memory
access time, respectively. Then the average access time can be formulated as

    tav = h*tc + (1 – h)*(tc + tm) = tc + (1 – h)*tm

This equation follows from the fact that on a cache hit main memory is not accessed, while on a
cache miss both main memory and the cache are accessed.
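As a quick check of the formula, the sketch below evaluates it with illustrative figures (which happen to match Problem 4.3 later in this chapter): tc = 50 nsec, tm = 500 nsec and h = 0.9:

    t_c, t_m, h = 50, 500, 0.9                  # times in nanoseconds
    t_av = h * t_c + (1 - h) * (t_c + t_m)      # equals t_c + (1 - h) * t_m
    print(t_av)                                 # 100.0 nsec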
In all three mapping methods below, we use hexadecimal (HEX) numbers for both addresses and
data, for simplicity. A 16-bit address is thus shown as a four-digit HEX number and, similarly,
8-bit data as a two-digit HEX number.
Associative Mapping The associative cache memory uses the fastest and most flexible mapping
method, in which both the address and the data of the memory word are stored. This organization
is sometimes referred to as a fully associative cache and is illustrated in Fig. 4.26. The cache
memory can store 256 words out of the 64K words of main memory, and this method allows any
location in the cache to store any word from main memory. The CPU first sends the 16-bit address
of the desired word to the input register, and the associative cache is then searched for a
matching address, all stored addresses being compared simultaneously. If the address of the
desired word is found, the corresponding 8-bit data word is read and sent to the CPU. If no
match occurs, main memory is accessed for the word, and the address-data pair is then brought
into the cache from main memory, replacing an old address-data pair if the cache is full. In
this situation a replacement algorithm is needed to select an old address-data pair to make room
for the newly arriving one. Different replacement algorithms, such as first-in first-out (FIFO)
and least recently used (LRU), are used; they are discussed later in this section.

Figure 4.26 Associative mapping cache (all numbers in HEX system)
Merits of Associative Mapping This mapping is the most flexible of the three and is also very fast.
Demerits of Associative Mapping This memory is expensive compared with random-access memories,
because of the additional storage of addresses along with data in the cache. In our example,
16 + 8 = 24 bits are stored for a single 8-bit word.
Direct Mapping Instead of storing the complete address information with data in the cache,
only part of the address bits is stored along with the data in direct mapping. Let us assume
that the cache can hold 2^m words and main memory can hold 2^n words, so the CPU generates an
n-bit memory address. This n-bit address is divided into two fields: the lower-order m bits form
the index field and the remaining higher-order (n – m) bits form the tag field. The direct-mapping
cache organization uses the n-bit address to access main memory and the m-bit index to access
the cache. For our example, the index and tag fields are as follows:
8 bits 8 bits
Tag Index
The internal organization of the words in the cache memory is shown in Fig. 4.27.
The cache memory stores up to 256 data words and their associated tags. On a miss, a new word is
brought into the cache and its tag bits are stored alongside the data bits. When the CPU
generates a word's address, the index field is used as the address to access the cache, and the
tag field of the CPU address is compared with the tag stored in the indexed cache location. If
the two tags match, there is a hit and the desired data word is read by the CPU from the cache.
If they do not match, there is a miss: the required word is read from main memory and then
stored in the cache together with its new tag, replacing the previous tag-data pair. Thus the
new tag-data pair is placed at the cache location indexed by the current CPU address, the one
for which the miss occurred. Note, however, that the hit ratio can drop considerably if two or
more words whose addresses have the same index but different tags are accessed repeatedly.
The cache is divided into cache blocks, also called cache lines. A block contains a set of words
with contiguous addresses; a typical block is 32 bytes. Data are transferred from main memory to
cache memory in blocks. The direct-mapping example just described uses a block size of one word.
A direct-mapping cache organization using a block size of 8 words is shown in Fig. 4.28.
The index field is now divided into two parts: the block field and the word field. Since each
block holds 8 words, the 256-word cache holds 32 blocks. The block number is specified with a
5-bit field (there are 32 blocks), and a word within a block is specified by 3 bits (each block
contains 8 words). The tag field is the same for all words within a block, since a block
contains 8 consecutive words of data. When a miss occurs, an entire block must be brought into
the cache from main memory.
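Extracting the tag, block and word fields is just a matter of shifting and masking. A Python sketch for the 16-bit address format of Fig. 4.28 (the example address is arbitrary):

    # 8-bit tag | 5-bit block | 3-bit word
    def split_address(addr):
        word = addr & 0b111                # lowest 3 bits: word within the block
        block = (addr >> 3) & 0b11111      # next 5 bits: block number (0-31)
        tag = addr >> 8                    # remaining 8 bits: tag
        return tag, block, word

    tag, block, word = split_address(0x1A45)
    print(hex(tag), block, word)           # 0x1a 8 5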
Merits of Set-Associative Cache This cache has the highest hit ratio of the three mapping
methods.
Demerits of Set-Associative Cache This is the most expensive organization, and the cost
increases as the set size increases.
Replacement Methods When a miss occurs in the cache, new data from main memory must be
placed over old data in a selected location of the cache. With direct-mapping cache we have no
choice, and thus no replacement algorithm is required: the new data must be stored in the one
cache location specified by the mapping rule. For associative mapping and set-associative
mapping, a replacement algorithm is needed, since there are multiple candidate locations. The
most commonly used replacement algorithms are outlined below.
First-in first-out (FIFO) algorithm This algorithm chooses the word that has been in the cache
the longest: the word that entered the cache first is pushed out first.
Least recently used (LRU) algorithm This algorithm chooses for replacement the item that has
not been used by the CPU for the longest time in the recent past.
Cache Writing Methods Generally, only two operations, read and write, are performed on a
memory. The read operation does not change the contents of memory, since a copy of the data is
simply retrieved in the course of the read. The write operation, however, changes the contents
of memory, so it must be performed carefully. There are two methods of writing into cache
memory: the write-through and write-back policies.
Write-through policy This is the simplest and most commonly used procedure for updating the
cache. In this technique, whenever the cache memory is updated, main memory is updated at the
same time, so main memory always contains the same data as the cache. It is, however, a slow
process, since main memory must be accessed on every write.
Write-back policy In this method, only the cache location is updated during a write operation.
The updated location is marked with a flag called the modified or dirty bit. When the word is
eventually replaced in the cache, it is written back into main memory if its flag bit is set.
The philosophy of this method is that, during write operations, a word residing in the cache may
be accessed several times (temporal locality of reference), so the method reduces the number of
references to main memory. However, it may encounter the problem of inconsistency, since there
are two different copies of the same data, one in the cache and the other in main memory.
Problem 4.3 A hierarchical cache-main memory subsystem has the following specifications:
(i) Cache access time of 50 nsec
(ii) main memory access time of 500 nsec
(iii) 80% of memory requests are for read
(iv) hit ratio of 0.9 for read access; the write-through scheme is used.
Calculate the following:
(a) the average access time of the memory system, considering only memory read cycles;
(b) the average access time of the system for both read and write requests.
Solution Given,
Cache access time tc = 50 nsec.
Main memory access time tm = 500 nsec.
Probability of read pr = 0.8
Hit ratio for read access hr = 0.9
Writing scheme: write-through.
(a) Considering only memory read cycle,
The average access time tav-r = hr * tc + (1–hr )* (tc + tm)
= 0.9*50 + (1–0.9)*550
= 100 nsec.
Problem 4.4 For a cache memory of size 32 KB, how many cache lines (blocks) does the cache
hold for block sizes of 32 or 64 bytes?
Solution The number of blocks in cache = size of cache/block size.
Thus, for block size of 32 bytes, the number of blocks = (32 * 1024)/32 = 1024.
Similarly, for block size of 64 bytes, the number of blocks = (32 * 1024)/64 = 512.
Problem 4.5 A computer has a main memory of 64K × 16 and a cache memory of 1K words. The
cache uses direct mapping with a block size of four words.
(a) How many bits are there in the tag, index, block and word fields of the address
format?
(b) How many bits are there in each word of cache?
(c) How many blocks can the cache accommodate?
Solution (a) The main memory size = 64K × 16.
Therefore, the CPU must generate a 16-bit address (since 64K = 2^16).
The cache memory size = 1K.
Therefore, the size of the index field of the cache = 10 bits (1K = 2^10).
The tag field uses 16 – 10 = 6 bits.
The size of each cache block = 4 words.
Thus, the number of blocks in the cache = 1024/4 = 256.
Therefore, the number of bits required to select each block = 8 (since 256 = 2^8).
The number of bits required to select a word within a block = 2, because there are
4 words in each block.
Thus, the address format is as follows:
6 bits    8 bits    2 bits
Tag       Block     Word
(the Block and Word fields together form the 10-bit Index)
(b) The main memory size = 64K × 16.
Therefore, the number of bits in each word of the cache = 16.
(c) From part (a); the number of blocks in cache = 256.
Problem 4.6 A cache has 64 KB capacity, 128-byte lines and is 4-way set-associative. The CPU
generates 32-bit addresses for accessing data in the memory.
(a) How many lines and sets does the cache have?
(b) How many entries are required in the tag array?
(c) How many bits of tag are required in each entry in the tag array?
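The solution to this problem is not worked at this point in the text, but the standard calculation is easy to sketch (the variable names are ours):

    import math

    capacity, line, ways, addr_bits = 64 * 1024, 128, 4, 32

    lines = capacity // line                        # 512 lines (one tag entry per line)
    sets = lines // ways                            # 128 sets
    offset_bits = int(math.log2(line))              # 7 bits: byte within a line
    set_bits = int(math.log2(sets))                 # 7 bits: set selection
    tag_bits = addr_bits - set_bits - offset_bits   # 18 tag bits per entry

    print(lines, sets, tag_bits)                    # 512 128 18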
Parts of programs and data are brought into main memory from secondary memory as the CPU needs
them. Virtual memory is a technique used in some large computer systems to give the programmer
the illusion of having a large main memory, even though the physical main memory may be much
smaller. The size of virtual memory corresponds to the size of secondary memory. Each address
referenced by the CPU, called the virtual (logical) address, is mapped to a physical address in
main memory. This mapping is done at run time by a hardware device called the memory-management
unit (MMU), with the help of a memory map table maintained by the operating system.
Virtual memory makes the task of programming much easier, because the programmer no longer needs
to worry about the amount of physical memory available. For example, suppose a program is 18 MB
in size and the available user part of main memory is 15 MB (the other part of main memory is
occupied by the operating system). First, 15 MB of the program is loaded into main memory, the
remaining 3 MB staying in secondary memory. When the remaining 3 MB of code is needed for
execution, 3 MB of code is swapped out of main memory to secondary memory, and the needed 3 MB
is swapped in from secondary memory to main memory.
The advantage of virtual memory is efficient utilization of main memory: a large program is
divided into blocks, and each block is loaded into main memory only when it is required. Thus
multiple programs can be executed concurrently. The virtual-memory technique has the further
advantages of efficient CPU utilization and improved throughput.
Logical (Virtual) Address Space and Physical Address Space When a program is executed,
the CPU generates addresses called logical addresses. The corresponding addresses in physical
memory, as occupied by the executing program, are called physical addresses. The set of all
logical addresses generated by the CPU or program is called the logical-address space, and the
set of all physical addresses corresponding to these logical addresses is called the
physical-address space. The memory-management unit (MMU) maps each logical address to a physical
address during program execution. Fig. 4.30 illustrates this mapping, which uses a special
register called the base register or relocation register. The content of the relocation register
is added to every logical address generated by the user program. For example, if the relocation
register holds the value 2000, a reference to location 0 by the user is dynamically relocated to
address 2000, and a reference to address 150 is mapped to address 2150.
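This dynamic relocation is a one-line computation; a Python sketch using the register value from the example:

    relocation_register = 2000

    def to_physical(logical):
        return relocation_register + logical   # the MMU adds the base at run time

    print(to_physical(0))     # 2000
    print(to_physical(150))   # 2150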
Disadvantages
1. Paging may encounter a problem called page break. For example, suppose the virtual address
space of a program is 18 KB and the page size is 4 KB. The number of frames required by this
program is then 5, but only 2 KB of the last (fifth) frame is used and the remaining 2 KB of
that frame is wasted. This is referred to as a page break.
2. When the number of pages in a virtual memory is large, it is quite difficult to maintain
the page tables.
Page Replacement When a program starts execution, one or more pages are brought into main
memory, and the page table records their positions. When the CPU needs a page for execution and
that page is not in main (physical) memory (it is still in secondary memory), the situation is
called a page fault. When a page fault occurs, the execution of the current program is suspended
until the required page is brought into main memory from secondary memory. The incoming page
replaces an existing page in main memory; thus, when a page fault occurs, a page-replacement
algorithm is needed to select one of the existing pages to make room for the required page.
Several replacement algorithms are available, such as FIFO (first-in first-out), LRU (least
recently used) and the optimal page-replacement algorithm.
The FIFO algorithm is the simplest; its criterion is "select for replacement the page that has
been in main memory for the longest period of time".
The LRU algorithm states "select for replacement the page that has not been used for the longest
time in the recent past". The LRU algorithm is difficult to implement, because it requires, for
each page, a record of its usage.
The optimal algorithm generally gives the fewest page faults of all; its criterion is "replace
the page that will not be used for the longest period of time". This algorithm is also difficult
to implement, because it requires future knowledge of page references.
Example of FIFO:
Reference 0 1 2 3 0 1 2 3 0 1 3 2 4 1
string
Frame
0 0 0 0 3 3 3 2 2 2 1 1 1
1 1 1 1 0 0 0 3 3 3 2 2
2 2 2 2 1 1 1 0 0 0 4
Page fault y y y y y y y y y y n y y n
If a required page is already in main memory, no page replacement is required; this is indicated
in the table by 'n', and in that situation no page fault occurs.
From this table, we observe that 12 page faults occur out of 14 references. Thus, the page-fault
rate = number of page faults/number of page references in the string = 12/14 ≈ 86%.
Example of LRU:
Reference 0 1 2 3 0 1 2 3 0 1 3 2 4 1
string
Frame
0 0 0 0 3 3 3 2 2 2 1 1 4 4
1 1 1 1 0 0 0 3 3 3 3 3 1
2 2 2 2 1 1 1 0 0 2 2 2
Page fault y y y y y y y y y y n y y y
From this table, we observe that 13 page faults occur out of 14 references. Thus, the page-fault
rate = 13/14 ≈ 93%.
Example of Optimal replacement algorithm:
Reference 0 1 2 3 0 1 2 3 0 1 3 2 4 1
string
Frame
0 0 0 0 0 0 1 1
1 1 1 1 2 2 2
2 2 3 3 3 4
Page fault y y y y n n y n n y n n y n
From this table, we observe that 7 page faults occur out of 14 references. Thus, the page-fault
rate = 7/14 = 50%.
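All three policies are easy to simulate, and a simulation reproduces the fault counts of the tables above (FIFO = 12, LRU = 13, optimal = 7). A minimal Python sketch:

    def count_faults(refs, frames, policy):
        memory, faults = [], 0
        for i, page in enumerate(refs):
            if page in memory:
                if policy == "LRU":            # refresh recency on a hit
                    memory.remove(page)
                    memory.append(page)
                continue
            faults += 1
            if len(memory) == frames:          # memory full: evict a victim
                if policy == "OPT":            # evict the page used farthest in the future
                    future = refs[i + 1:]
                    victim = max(memory, key=lambda p:
                                 future.index(p) if p in future else len(future) + 1)
                    memory.remove(victim)
                else:                          # FIFO and LRU both evict the front of the list
                    memory.pop(0)              # FIFO: oldest arrival; LRU: least recently used
            memory.append(page)
        return faults

    refs = [0, 1, 2, 3, 0, 1, 2, 3, 0, 1, 3, 2, 4, 1]
    for policy in ("FIFO", "LRU", "OPT"):
        print(policy, count_faults(refs, 3, policy))   # 12, 13, 7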
Segmentation Segmentation is a memory-management scheme that supports the user's view of
memory. The logical-address space of a program is a collection of segments, where a segment is
defined as a logical grouping of instructions or data, such as a subroutine, an array or a data
area. Each segment has a name and a length. An address within a segment specifies both the
segment name and the offset within the segment. For simplicity of implementation, segments are
referred to by a segment number rather than by a name. Thus, a logical address consists of the
two-tuple (segment number s, offset d).
The mapping of a logical address to the corresponding physical address is done using a segment
table. Each entry of the segment table has a segment base and a segment limit. The segment base
gives the starting physical address where the segment resides in main memory, and the segment limit
specifies the length of the segment. The hardware implementation of segmentation is shown in
Fig. 4.33.
A logical address consists of two fields: (segment number s, offset d). The segment number s is
used as an index into the segment table, and the offset d is the word number within segment s.
The offset d must lie between 0 and the segment limit; if it is beyond that range, the operating
system generates an error signal (trap), meaning that the offset is not valid. If the offset is
valid, it is added to the segment base to produce the physical-memory address of the desired word.
To illustrate the segmentation technique, consider the example in Fig. 4.34. The logical-address
space is divided into 3 segments, each with an entry in the segment table. The base specifies
the starting address of a segment and the limit specifies its size. For example, the first (i.e.
0th) segment is loaded into main memory from address 1000 to 1500; thus its base is 1000 and its
limit is 1500 – 1000 = 500. A reference to word (byte) 20 of segment 0 is mapped onto address
1000 + 20 = 1020, so the logical address (0 (segment no.), 20 (offset)) corresponds to the
physical address 1020. Similarly, a reference to word 50 of segment 2 is mapped onto address
2500 + 50 = 2550: the logical address (2, 50) corresponds to the physical address 2550.
The advantages of segmentation are:
1. It supports efficient utilization of physical memory. Unlike paging, there is no space wastage
within a segment.
2. It supports user view of memory more efficiently. It can handle dynamically growing seg-
ments.
3. Protection and sharing can be done easily.
Processors such as the IBM 360/67 and the VAX 11/780 use paging. Intel's 80286 microprocessor
supports only segmentation, whereas the 80386 uses both segmentation and paging schemes.
Problem 4.7 Consider a logical address space of 8 pages of 1024 words each, mapped onto a
physical memory of 32 frames.
(a) How many bits are there in the logical address?
(b) How many bits are there in the physical address?
Solution (a) The logical address space consists of 8 pages, each of 1024 words.
To select each page, 3 bits are required (because 8 = 2^3).
To select each word within a page, 10 bits are required (because 1024 = 2^10).
Therefore, the logical address consists of 3 + 10 = 13 bits.
(b) The physical memory consists of 32 frames.
So, to select each frame, 5 bits are required (because 32 = 2^5).
In paging, the size of a page and the size of a frame are equal.
Therefore, the physical address consists of 5 + 10 = 15 bits.
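The same bit counts can be computed directly; a small Python sketch:

    import math

    pages, words_per_page, frames = 8, 1024, 32

    page_bits = int(math.log2(pages))             # 3
    word_bits = int(math.log2(words_per_page))    # 10
    frame_bits = int(math.log2(frames))           # 5

    print(page_bits + word_bits)    # logical address: 13 bits
    print(frame_bits + word_bits)   # physical address: 15 bits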
What are the physical addresses for the following logical addresses?
(a) 0, 430
(b) 1, 234
(c) 2, 13
Solution (a) The physical address corresponding to (0, 430) is 215 (base) + 430 (offset) = 645.
(b) In the logical address (1, 234), the offset 234 is greater than the length of
segment 1 (i.e. 160), so a trap (error) is generated.
(c) The physical address corresponding to (2, 13) is 1200 + 13 = 1213.
REVIEW QUESTIONS
Group A
1. Choose the most appropriate option for the following questions:
(i) How many memory locations can be addressed by a 32-bit computer?
(a) 64 KB (b) 32 KB (c) 4 GB (d) 4 MB
(ii) The access time of memory is the time duration
(a) from receiving read/ write signal to the completion of read/write operation
(b) from receiving an address value to the completion of read/write operation
(c) from receiving a chip enable signal to the completion of read/write operation
(d) for all memory operations starting from enable signal.
(iii) The bandwidth of memory accesses is in ascending order for
(a) cache, DRAM, SDRAM and RDRAM
(b) SRAM, DRAM, SDRAM and cache
(c) DRAM, SDRAM, RDRAM and cache
(d) RDRAM, DRAM, SDRAM and cache.
(iv) The memory hierarchy system in respect of increasing speed consists of
(a) secondary, main, cache and internal
(b) internal, main, cache and secondary
(c) internal, secondary, main and cache
(d) cache, main, secondary and internal.
(v) The memory hierarchy system in respect of increasing cost consists of
(a) secondary, main, cache and internal
(b) internal, main, cache and secondary
(c) internal, secondary, main and cache
(d) cache, main, secondary and internal.
(vi) The semi-random access mechanism is followed in
(a) RAM (b) ROM (c) magnetic (d) register.
(vii) The associative access mechanism is followed in
(a) main (b) cache (c) magnetic (d) both (a) and (b).
(viii) The initial bootstrap program is generally stored in
(a) RAM (b) ROM (c) magnetic (d) cache.
(ix) If the size of a RAM is 1 GB (assuming it is byte-addressable), the numbers of its address
lines and data lines are, respectively,
(a) 20 and 8 (b) 20 and 16 (c) 30 and 8 (d) 10 and 32.
(x) To construct a RAM memory of capacity 512 words, each of size 12 bits, using RAM chips each
of size 128 × 4, the number of smaller RAM chips required is
(a) 4 (b) 8 (c) 12 (d) 16.
(xi) The numbers of transistors required per cell in SRAM, DRAM and ROM are, respectively,
(a) 4, 2, 2 (b) 1, 1, 4 (c) 1, 6, 1 (d) 6, 1, 1.
(xii) Flash memory is
(a) SRAM (b) DDR SDRAM (c) PROM (d) EEPROM.
(xiii) The cylinder in a disk pack is
(a) collection of all tracks in a surface
(b) logical view of same radius tracks on different surfaces of disks
7. What will be the maximum capacity of a memory, which uses an address bus of size 8-bit?
8. Show the bus connections with a CPU to connect 4 RAM chips of size 256 × 8 bits each and a ROM chip of size 512 × 8 bits. Assume the CPU has an 8-bit data bus and a 16-bit address bus. Clearly specify the generation of the chip-select signals. Give the memory address map for the system.
9. Suppose we are given RAM chips each of size 256 × 4. Design a 2K × 8 RAM system using this chip as the building block. Draw a neat logic diagram of your implementation.
10. Briefly describe the storage structure of a CMOS SRAM storage cell and explain the read and write
operations on the cell. Give the suitable diagram.
11. Draw the cell structure of DRAM and explain the read and write operations on it.
12. Briefly describe different types of RAMs.
13. Draw and describe the storage structure of a ROM storage cell and explain the read and write
operations on the cell.
14. Briefly describe different types of ROMs.
15. Explain the need of auxiliary (secondary) memory devices. How are they different from main
memory?
16. What are the different types of secondary memories?
17. Give at least two differences between magnetic tape and magnetic disk.
18. Discuss in brief the internal structure of magnetic tape using diagram. Explain how read and write
operations are performed on it.
19. Draw and describe the internal structure of a disk-pack.
20. Define:
(a) Sector.
(b) Track
(c) Cylinder
(d) Rotational latency
(e) Seek time
(f) Access time in magnetic disk.
21. Why is formatting of a disk necessary before working on that disk?
22. What is the transfer rate of an 8-track magnetic tape whose speed is 120 inches per second and
whose density is 1600 bits per inch?
23. A disk pack has 20 recording surfaces and has a total of 4000 cylinders. There is an average of
300 sectors per track. Each sector contains 512 bytes of data.
(a) What is the maximum number of bytes that can be stored in this pack?
(b) What is the data transfer rate in bytes per second at a rotational speed of 3600 rpm?
(c) Using a 32-bit word, suggest a suitable scheme for specifying the disk address, assuming that
there are 256 bytes per sector.
24. A disk pack has the following parameters:
(a) average time to position the magnetic head over a track is 20 ms.
(b) rotational speed of 2400 rpm.
(c) number of bits per track is 20000
(d) number of bits per sector is 1500
Calculate the average time to read one sector.
25. What is CAM? What is its main characteristic? What is its main advantage? Draw and describe
hardware structure of associative memory or CAM. Also draw its one bit cell structure.
26. The cache memory is designed based on a property of the programs stored in memory. What is that property? Explain. What are the different varieties of this property?
27. What is cache memory? How does cache memory increase the performance of a computer? What is
hit ratio?
28. Describe the operation of cache memory. What is meant by “the cache memory has hit ratio of 0.8”?
29. Derive an expression for the effective (average) access time of an n-level memory system having hit ratios h1, h2, …, hn and access times t1, t2, …, tn, where t1 < t2 < … < tn.
30. A three level memory system having cache access time of 10 nsec and disk access time of 40 nsec
has a cache hit ratio of 0.96 and main memory hit ratio of 0.9. What should be the main memory
access time to achieve effective access time of 10 nsec?
31. What is meant by cache mapping? What are different types of mapping? Describe different mapping
techniques with suitable example.
32. Compare different cache mapping techniques.
33. What are different replacement methods in cache memories? Explain.
34. Name different cache memory writing methods. Explain them with advantages and disadvantages.
35. Name the categories of cache misses and explain the reasons behind them.
36. Write three methods for reducing cache miss rates.
37. What is virtual memory? What are the reasons for using it?
38. What do you mean by logical address space and physical address space?
39. Explain the paging and segmentation techniques used for implementing virtual memory.
40. A digital computer has a memory unit of 64K × 16 and a cache memory of 1K words. The cache
uses direct mapping with a block size of 4 words. How many bits are there in the tag, index, block
and word fields of the address format?
41. What is the limitation of direct-mapped cache? Explain with an example how it can be improved in
set-associative cache.
42. Compare paging and segmentation methods.
43. What is page fault? What are different page replacement algorithms? Describe briefly.
44. Consider a cache (M1) and main memory (M2) hierarchy with the following characteristics:
M1: 16K words, 50 ns access time;
M2: 1M words, 400 ns access time.
Assume 8-word cache blocks and a set size of 256 words with set-associative mapping.
(a) Show the mapping between M2 and M1.
(b) Calculate the effective memory-access time with a cache-hit ratio of h = 0.95.
45. Consider the design of 3-level memory hierarchy with following specification for memory hierarchy.
Memory level     Access time    Capacity    Cost/KB
Cache            25 ns          512 KB      $1.25
Main memory      unknown        32 MB       $0.2
Disk unit        4 ms           unknown     $0.0002
The design goal is to achieve an effective memory access time t = 10.04 microseconds with cache hit ratio h1 = 0.98 and main memory hit ratio h2 = 0.9. The total cost of the memory hierarchy is upper-bounded by $1500. Find the unknown entries in the above table.
46. A system has a 48-bit virtual address, a 36-bit physical address and 128 MB of main memory. The system uses pages of 4096 bytes. How many virtual and physical pages can the addresses support? How many page frames of main memory are there?
47. What do you mean by a page reference string? Suppose a process accesses the following addresses in a particular time interval: 0100, 0432, 0101, 0612, 0102, 0103, 0104, 0101, 0611, 0102, 0103, 0104, 0101, 0601, 0101, 0102, 0609, 0102, 0105. Assume a page size of 100 bytes.
What will be the reference string? Using this reference string, calculate the page fault rate for the
following algorithms:
(a) FIFO replacement
(b) LRU replacement.
(c) Optimal replacement.
Assume that number of frames=3.
48. If the size of cache block is increased, discuss possible merits and demerits.
49. Which locality of reference is exploited if the size of cache block increases?
50. In order to exploit temporal locality of reference, what should be the policy of writing program?
51. How does the size of cache block affect the hit ratio?
52. If the sizes of the MAR and MDR are 32 bits and 16 bits respectively, what is the size of main memory?
SOLVED PROBLEMS
1. Describe the storage structure of a bipolar storage cell and explain the read and write operations on the cell. Give a suitable diagram.
Answer
Two junction transistors T0 and T1 are cross-coupled so that they form a flip-flop. Assume point D is at ground voltage (i.e. the word line is disabled). When T0 is on, the flip-flop stores logic '0': current flows to ground through T0, holding point A at the '0' level, which in turn holds T1 off. As a result, point B is held at a voltage equal to the base-emitter voltage of T0 (i.e. B at logic '1'), whereby T0 is held on. Similarly, when T1 is on, point B is at the '0' level while point A is held at the base-emitter voltage of T1 (i.e. A at logic '1'), whereby T1 is held on. Depending on whether 0 (or 1) is stored, T0 (or T1) is on and both diodes D0 and D1 are reverse biased, so both bit lines b and b′ are isolated from the flip-flop cell. [For details of junction transistors, see the Appendix.]
Read operation: On selecting the word line, the voltage at D drops to ground level, which selects the cell. Depending on whether T0 (or T1) is conducting, point A (or B) is pulled close to ground level, so that D0 (or D1) becomes forward biased. Depending on which diode is forward biased, the appropriate logic value is read through bit line b (or b′).
Write operation: On selecting the word line, the voltage at D drops to ground level to select the cell. Depending on whether logic '1' (or '0') is to be written, bit line b (or b′) is held high, whereby D0 (or D1) becomes forward biased; consequently T1 (or T0) switches on, turning T0 (or T1) off.
2. How do the following influence the performance of a virtual memory system?
(i) Size of a page (ii) Replacement policy
Answer
(i) Page size: If the page size is large, the page-fault rate will be lower, but the transfer time of each page will increase. If the page size is small, memory is utilized better, but the number of pages, and hence the size of the page table, will be large.
(ii) Replacement policy: When a page fault occurs, page replacement selects one of the existing pages to make room for the required page. Several replacement policies are available, such as FIFO (First-In First-Out), LRU (Least Recently Used) and the optimal page replacement algorithm. The performance of virtual memory degrades if too many page faults occur, because each fault requires bringing the needed page into physical memory. That is why the algorithm which gives the lowest number of page faults is considered the best.
3. A computer has direct mapped cache with 16 one-word blocks. The cache is initially empty.
What is the observed hit ratio when the CPU generates the following word address sequence:
1, 4, 8, 5, 20, 17, 19, 56, 9, 11, 4, 43, 5, 6, 9, 17?
Answer
The direct mapping is expressed as
I = J mod K
where
I = cache block number,
J = main memory block number,
K = number of blocks in the cache.
The processor generates the addresses of words in main memory. In our problem, K = 16 with one word per block, and the main memory block sequence is: 1, 4, 8, 5, 20, 17, 19, 56, 9, 11, 4, 43, 5, 6, 9, 17.
Thus, the corresponding cache blocks and words are, as (block number, word address) pairs: (1,1), (4,4), (8,8), (5,5), (4,20), (1,17), (3,19), (8,56), (9,9), (11,11), (4,4), (11,43), (5,5), (6,6), (9,9), (1,17).
Initially, the cache is empty.

(Cache block, word address)  (1,1)  (4,4)  (8,8)  (5,5)  (4,20)         (1,17)         (3,19)  (8,56)         (9,9)
Hit (H) / Miss (M)            M      M      M      M      M and replace  M and replace  M       M and replace  M

(Cache block, word address)  (11,11)  (4,4)          (11,43)        (5,5)  (6,6)  (9,9)  (1,17)
Hit (H) / Miss (M)            M        M and replace  M and replace  H      M      H      H

Therefore, hit ratio = 3/16.
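The trace is easy to check mechanically; a small Python simulation (an illustration, not from the text) reproduces the 3/16 figure:

# Direct-mapped cache of 16 one-word blocks: block index = address mod 16.
def simulate(addresses, num_blocks=16):
    cache = [None] * num_blocks        # each entry holds the word it caches
    hits = 0
    for addr in addresses:
        block = addr % num_blocks
        if cache[block] == addr:
            hits += 1
        else:
            cache[block] = addr        # miss: fetch and possibly replace
    return hits, len(addresses)

hits, refs = simulate([1, 4, 8, 5, 20, 17, 19, 56, 9, 11, 4, 43, 5, 6, 9, 17])
print(f"hit ratio = {hits}/{refs}")    # hit ratio = 3/16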
4. What is the bandwidth of a memory system that transfers 128 bits of data per reference and has a speed of 20 ns per operation?
Answer
Given the speed of 20 ns, one memory reference can be initiated every 20 ns, and each memory reference fetches 128 bits (i.e. 16 bytes) of data. Therefore, the bandwidth of the memory system is
16 bytes / 20 ns = (16 × 10⁹) / 20 bytes per second = 8 × 10⁸ bytes per second.
5. What will be the maximum capacity of a memory, which uses an address bus of size 12-bit?
Answer
The maximum capacity of the memory will be 2¹² words, i.e. 4096 words.
6. Why is the memory system of a computer organized as a hierarchy?
Answer
Ideally, we would like memory that is fast, large and inexpensive. Unfortunately, it is impossible to meet all three requirements simultaneously: if we increase speed and capacity, cost increases. We can achieve these goals at an optimum level by using several types of memory, which collectively form a memory hierarchy.
The lower levels of the memory hierarchy, which are implemented using slow and cheap memory technologies, contain most of the programs and data. The higher levels, which are implemented using fast and expensive memory technologies, contain a smaller amount of programs and data. The processor, being a very high-speed device, references data in the fast higher levels of the hierarchy; if the referenced data is not available there, it is moved up from the lower levels so that the higher levels handle most references. If most references are handled by the higher levels, the memory system delivers an average access time almost the same as that of the fastest level of the hierarchy, at a cost per bit close to that of the lowest level.
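The "average close to the fastest level" idea can be made concrete with one common simplified model of effective access time, t_eff = h1·t1 + (1 – h1)·h2·t2 + …, where a reference is served by the first level that hits. The sketch below and its numbers are illustrative, not from the text.

def effective_access_time(levels):
    # levels: list of (hit_ratio, access_time_ns), fastest level first;
    # the last level should have hit_ratio 1.0
    t, miss = 0.0, 1.0
    for h, access in levels:
        t += miss * h * access         # reach this level, then hit
        miss *= (1 - h)                # probability of missing again
    return t

# cache 10 ns (h = 0.95), main 100 ns (h = 0.9), backing level 1000 ns
print(effective_access_time([(0.95, 10), (0.9, 100), (1.0, 1000)]))  # 19.0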
7. What are destructive read out memory and non-destructive read out memory? Give examples.
Answer
In some memories, reading a memory word destroys the stored word; this behaviour is known as destructive readout and such a memory is called a destructive readout memory. In these memories, each read operation must be followed by a write operation that restores the memory's original state. An example is dynamic RAM.
In other memories, reading a memory word does not destroy the stored word; this behaviour is known as non-destructive readout and such a memory is called a non-destructive readout memory. Examples include static RAM and magnetic memory.
8. Why do DRAMs generally have larger capacities than SRAMs constructed with the same fabrication technology?
Answer
Each DRAM cell contains two devices – one capacitor and one transistor – while each SRAM cell consists of six transistors. This means a DRAM cell is much smaller than an SRAM cell, allowing the DRAM to store more data in the same chip area.
9. Why is refreshing required in Dynamic RAM?
Answer
Information is stored in a dynamic RAM memory cell in the form of a charge on a capacitor, and by the nature of a capacitor this charge starts to leak away. Hence, the information stored in the cell can be read correctly only if it is read before the charge on the capacitor drops below some threshold value. Thus, the charge on the capacitor needs to be periodically restored, i.e. refreshed.
10. Suppose a DRAM memory has 4K rows in its array of bit cells, its refresh period is 64 ms and 4 clock cycles are needed to access each row. What is the time needed to refresh the memory if the clock rate is 133 MHz? What fraction of the memory's time is spent performing refreshes?
Answer
In the DRAM memory, the number of rows of cells is 4K = 4096, and 4 clock cycles are needed to access each row.
Therefore, the number of cycles needed to refresh all rows = 4096 × 4 = 16384 cycles.
Since the clock rate is 133 MHz,
the time needed to refresh all rows = 16384 / (133 × 10⁶) seconds
= 123 × 10⁻⁶ seconds
= 0.123 ms. [1 ms = 10⁻³ sec]
Thus, the refreshing process occupies 0.123 ms in each 64 ms time interval.
Therefore, the refresh overhead is 0.123/64 ≈ 0.002.
Hence, only 0.2% of the memory's time is spent performing refreshes.
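The same arithmetic as a few lines of Python (all figures taken from the problem statement):

rows, cycles_per_row = 4096, 4
clock_hz, refresh_period_s = 133e6, 64e-3

refresh_time = rows * cycles_per_row / clock_hz     # seconds to refresh all rows
overhead = refresh_time / refresh_period_s
print(f"{refresh_time*1e3:.3f} ms, overhead = {overhead:.1%}")  # 0.123 ms, 0.2%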
11. How many 256 × 4 RAM chips are needed to provide a memory capacity of 2048 bytes? Also show the corresponding interconnection diagram.
Answer
The given RAM chip size is 256 × 4. Each chip requires 8 address lines (because 256 = 2⁸) and has 4 data lines.
The memory to be constructed is 2048 bytes, which is equivalent to 2048 × 8. Thus, it requires 11 address lines (because 2048 = 2¹¹) and 8 data lines.
In the interconnection diagram:
The number of rows required = 2048/256 = 8.
The number of columns required = 8/4 = 2.
Thus, the total number of 256 × 4 RAM chips required = 8 × 2 = 16.
The interconnection diagram is shown in the following figure.
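The row/column arithmetic generalizes directly; a small Python sketch (illustrative, not from the text):

# How many (chip_words x chip_bits) chips build a (total_words x total_bits)
# memory: rows are word groups picked by a decoder, columns widen the word.
def chips_needed(total_words, total_bits, chip_words, chip_bits):
    rows = total_words // chip_words
    cols = total_bits // chip_bits
    return rows, cols, rows * cols

print(chips_needed(2048, 8, 256, 4))   # (8, 2, 16)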
12. Explain how a RAM of capacity 2K bytes can be mapped into the address space (1000)H to (17FF)H of a CPU having 16 address lines. Show how the address lines are decoded to generate the chip-select condition for the RAM.
Answer
Since the capacity of the RAM is 2048 bytes = 2 KB, the memory uses 11 address lines (2 KB = 2¹¹ bytes), namely A10 – A0, to select one word; thus the memory's internal address decoder uses the 11 lines A10 – A0.
To select this memory module, the remaining 5 (i.e. 16 – 11) address lines, A15 – A11, are used. Thus, an external decoding scheme is employed on these five higher-order address bits of the processor's address.
The address space of the memory runs from 1000H to 17FFH.
Therefore, the starting address (1000)H in memory is:
A15 A14 A13 A12 A11 A10 A9 A8 A7 A6 A5 A4 A3 A2 A1 A0
 0   0   0   1   0   0   0  0  0  0  0  0  0  0  0  0
Based on the five higher-order bits (00010), the external decoding scheme performs a logical AND on the complements of A15, A14, A13 and A11 together with A12 (uncomplemented). The output of the AND gate acts as the chip-select (CS) line. The address decoding scheme is shown in the following figure.
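The decode can be checked in Python: the chip is selected exactly when the five high-order address bits equal 00010 (a quick illustration, not from the text).

def chip_select(addr):
    # True for the 2 KB window 1000H-17FFH (high 5 bits of a 16-bit address)
    return (addr >> 11) == 0b00010

print(chip_select(0x1000), chip_select(0x17FF), chip_select(0x1800))
# True True False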
13. A high-speed tape system accommodates a 1200-ft reel of standard 9-track tape. The tape is moved past the recording head at a rate of 140 inches per second. What must the linear tape recording density be in order to achieve a data transfer rate of 10⁵ bits per second?
Answer
Given:
Tape length = 1200 ft.
Tape speed = 140 inches/sec.
Data transfer rate = 10⁵ bits/sec.
20. Why do the virtual page and the physical frame have the same size in paging?
Answer
The virtual address generated by the processor is divided into two fields: a page number, which identifies the page containing the address, and an offset, which identifies the location of the address within the page. Similarly, the physical address is divided into two fields: a frame number and an offset. If the page size and frame size are the same, address translation can be done easily: the offset from the virtual address is simply concatenated with the frame number corresponding to the virtual page containing the address, producing the physical address. If the two sizes were different, a more complicated means of address translation would be required.
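A minimal Python sketch of this split-and-concatenate translation, using an 8 KB page size (13 offset bits, as in the next problem); the page-table contents are illustrative assumptions:

PAGE_BITS = 13
page_table = {3: 1, 6: 7}        # virtual page number -> physical frame

def translate(vaddr):
    vpn = vaddr >> PAGE_BITS                  # virtual page number
    offset = vaddr & ((1 << PAGE_BITS) - 1)   # unchanged by translation
    return (page_table[vpn] << PAGE_BITS) | offset

print(translate((3 << PAGE_BITS) | 100))      # frame 1, offset 100 -> 8292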
21. In a system with 64-bit virtual addresses and 43-bit physical addresses, how many bits are
required for the virtual page number and physical frame number if the pages are 8 KB in size?
How big is each page table entry? How many page table entries are required for this system?
Answer
Since the page size is 8 KB, 13 bits are required for the offset field of both the virtual and physical
address.
Therefore, bits required for the virtual page number = 64 – 13 = 51 and
Bits required for the physical frame number = 43 – 13 = 30.
Each page table entry contains frame number and a valid/invalid bit.
So, a total of (30 + 1) i.e. 31 bits is used to store each page table entry.
Answer
Since the page size is 512 bytes, the lower-order 9 bits are used for the offset within a page.
The size of the virtual address space is 64 KB, so each virtual address consists of 16 bits.
Thus, the higher-order 16 – 9, i.e. 7, bits specify the virtual page number.
(a) For virtual address 3494, the lower part (494) is the offset and the higher part (3) is the page number. Looking in the table, we find frame number 1 corresponding to page number 3. Concatenating the offset within the page with the physical frame number, we get physical address 1494 corresponding to virtual address 3494.
(b) For virtual address 12350, the physical address is 5350.
(c) For virtual address 30123, the physical address is 7123.
23. What is a translation look-aside buffer (TLB)?
Answer
In the paging scheme, main memory is accessed twice to retrieve data: once to access the page table and once to access the data itself. Since the access time of main memory is large, a technique is adopted to speed up data retrieval: a fast associative memory called a translation look-aside buffer (TLB) is used to hold the most recently used page-table entries. Whenever the CPU needs to access a particular page, the TLB is consulted first. If the desired page-table entry is present in the TLB (a TLB hit), the frame number is retrieved from it to form the physical address in main memory. If the desired entry is not present in the TLB (a TLB miss), the CPU searches the original page table in main memory for the desired entry. The organization of an address translation scheme that includes a TLB is shown in the following figure.
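The lookup logic can be sketched in Python (structures and contents illustrative, not from the text):

tlb = {}                               # small, fast: virtual page -> frame
page_table = {0: 5, 1: 9, 2: 3}        # full table, held in main memory

def frame_for(page):
    if page in tlb:                    # TLB hit: no extra memory access
        return tlb[page]
    frame = page_table[page]           # TLB miss: extra memory access
    tlb[page] = frame                  # remember the entry for next time
    return frame

print(frame_for(1), frame_for(1))      # miss then hit: 9 9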
24. Suppose a processor’s TLB has hit ratio 80% and it takes 20 ns to search the TLB and 100 ns
to access main memory. What will be the effective access time?
Answer
When the referenced page number is found in the TLB, a mapped memory access takes (20 + 100), i.e. 120 ns. If it is not found in the TLB, we must first access memory for the page-table entry and frame number, and then access the desired word in memory, for a total of (20 + 100 + 100), i.e. 220 ns.
Therefore, the effective access time = (TLBhit × TIMEhit) + (TLBmiss × TIMEmiss)
= 0.8 × 120 + 0.2 × 220 ns
= 140 ns
25. Why does the virtual memory prevent programs from accessing each other’s data?
Answer
Each program has its own virtual address space. The virtual memory system translates different
programs’ virtual addresses to different physical addresses so that no two programs’ virtual addresses
map onto the same physical addresses, thus preventing the programs from accessing each other’s data.
26. What is memory interleaving? What are the varieties of it? Discuss.
Answer
The main memory is partitioned into several independent memory modules (chips) and the addresses are distributed across these modules. This scheme, called interleaving, allows concurrent access to more than one module. The interleaving of addresses among M modules is called M-way interleaving.
There are two basic methods of distributing the addresses among the memory modules: higher-order and lower-order interleaving. Assume that there are a total of N = 2ⁿ words in main memory, so the physical address of a word consists of n bits, aₙ₋₁aₙ₋₂ … a₁a₀. The first method, high-order interleaving, distributes the addresses among M = 2ᵐ modules so that each module i, for 0 ≤ i ≤ M – 1, contains the consecutive addresses i·2ⁿ⁻ᵐ to (i + 1)·2ⁿ⁻ᵐ – 1, inclusive. The high-order m bits select the module, while the remaining n – m bits select the address within the module, as shown in the diagram next.
The second method, called low-order interleaving, distributes the addresses so that consecutive addresses are located in consecutive modules. The low-order m bits of the address select the module, while the remaining n – m bits select the address within the module. Hence, an address A is located in module A mod M.
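The two selection rules differ only in which address bits pick the module; a Python sketch (for illustration):

def high_order(addr, n, m):
    # (module, address within module) under high-order interleaving
    return addr >> (n - m), addr & ((1 << (n - m)) - 1)

def low_order(addr, n, m):
    # (module, address within module) under low-order interleaving
    return addr % (1 << m), addr >> m

# 16 words, 4 modules: consecutive addresses fall in consecutive modules
print([low_order(a, 4, 2)[0] for a in range(8)])   # [0, 1, 2, 3, 0, 1, 2, 3]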
CHAPTER
5
Computer Instruction Set
5.1 INTRODUCTION
Computer architecture is defined as the study of the components and their interconnections that form
a computer system. An instruction set design for a machine is the primary architectural consideration.
The complete collection of instructions that are understood by a machine is called the instruction set
for the machine. Much of a computer system's architecture is hidden from the high-level language programmer; in the abstract sense, the programmer should not really care about the underlying architecture. The instruction set is the boundary at which the computer designer and the computer programmer view the same machine. Any weakness in the instruction set design will drastically affect the machine-language programmer and the compiler. Hence, a well-designed, pre-planned instruction set enables compilers to create compact machine-language programs. This chapter discusses the various factors that influence instruction set design.
Before finalizing the instruction set for a computer, computer architects have to consider the
following aspects.
Flexibility to the programmer A programmer wishes to have as many instructions as possible, so that each operation is carried out by an appropriate instruction. While designing the instruction set, however, it should be noted that too many instructions result in a complex control unit design: instruction decoding then requires large circuitry and more time.
Number of addressing modes If all possible addressing modes are present in the architecture, the programmer has many options when coding a particular program. However, this again requires a complex control unit.
Number of general purpose registers (GPRs) If the CPU has a large number of GPRs, execution will be faster. But the use of a large number of registers increases the cost of the CPU.
System performance System performance can be enhanced if fewer instructions are used in a program. For short programs, instructions should be powerful: a single instruction must be able to perform several microoperations (i.e. large-length instructions are used), so that a reduced program size is obtained. But this increases the complexity of the control unit design and the instruction execution time.
Applications An instruction set for a particular computer is designed with a certain application area in mind. For example, a scientific computer must have strong floating-point arithmetic, without which precision would be heavily degraded, whereas an entertainment computer must have multimedia operations.
In a broad sense, the superiority of a computer is decided on the basis of its instruction set. Since the total number of instructions and their power contribute to the efficiency of the computer, these two factors are given the highest priority. An efficient program is one which is short – hence fast in execution – and occupies less memory space. The size of a program depends largely on the formats of the instructions used.
A computer usually has a variety of instruction formats. It is the task of the control unit within the CPU to interpret each instruction code and provide the control functions needed to process the instruction. The most common instruction format is depicted in Fig. 5.1. In such a format, a register operand may be specified by a binary number such as 00111. The internal organization of processor registers determines the number of address fields in the instruction.
The design of an instruction set for a computer depends on the way in which the CPU is organized.
Generally, there are three different CPU organizations with certain specific instructions:
1. Single accumulator organization.
2. General register organization.
3. Stack organization.
In a general register organization, a three-address instruction such as MULT R1, R2, R3 means R1 ← R2 * R3. The instruction can also be written with only two address fields, where the destination register is the same as one of the source registers:
MULT R1, R2
This means R1 ← R1 * R2. The use of a large number of registers results in short programs with limited instructions. The IBM 360 and PDP-11 are typical examples.
The advantages of this organization:
1. Since a large number of registers is used, the efficiency of the CPU increases.
2. Less memory space is used to store the program, since the instructions are encoded in a more compact way.
The disadvantages of this organization:
1. Care should be taken to avoid unnecessary usage of registers; compilers thus need to be more intelligent in this respect.
2. This organization involves more cost, since a large number of registers is used.
Fig. 5.3 shows the effects of these two operations on the stack in Fig. 5.2.
2. Execution of instructions is fast, because the operand data are stored in consecutive memory locations.
3. Since the instructions do not have an address field, the instructions are short.
Disadvantage
1. Program size lengthens.
Most contemporary computers implement one of the three types of organization just described, though some computers have been built with features from more than one camp. For example, the Intel 8080 microprocessor has seven general-purpose registers, one of which is an accumulator register; thus the 8080 follows some of the characteristics of a general register organization and some of an accumulator organization.
Length of an instruction basically depends on the number of address fields used in it.
The advantages and disadvantages of the number of addresses used in an instruction are summarized below:
• The fewer the addresses, the shorter the instruction; long instructions with multiple addresses usually require more complex decoding and processing circuits.
• Limiting the number of addresses also limits the range of functions each instruction can perform.
• Fewer addresses mean more primitive instructions, so longer programs are needed.
• The storage requirements of shorter instructions and longer programs tend to balance, but larger programs require longer execution time.
The length of an instruction affects, and is affected by:
• memory size
• memory organization
• bus structure
• CPU complexity
• CPU speed
To show how the number of addresses affects a computer program, we will evaluate the arithmetic
statement
X = (A + B) – (C + D)
using zero, one, two or three address instructions. For this, LOAD symbolic op-code is used for
transferring data to register from memory. STORE symbolic op-code is used for transferring data to
memory from register. The symbolic op-codes ADD and SUB are used for the arithmetic operations
addition and subtraction respectively. Assume that the respective operands are in memory addresses
A, B, C and D and the result must be stored in the memory at address X.
Three-address Instructions General-register-organized computers use three-address instructions. Each address field may specify either a processor register or a memory operand. The program to evaluate X = (A + B) – (C + D) in assembly language is shown below, together with comments explaining each instruction.
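A typical three-address sequence for this evaluation (the register names R1 and R2 are chosen for illustration) is:
ADD R1, A, B    ; R1 ← M[A] + M[B]
ADD R2, C, D    ; R2 ← M[C] + M[D]
SUB X, R1, R2   ; M[X] ← R1 – R2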
One-address Instructions The single accumulator based computers use one-address instructions.
Here, all instructions use an implied accumulator (AC) register. The program to evaluate X = (A + B)
– (C + D) using one-address instructions is as follows:
LOAD C   ; AC ← M[C]
ADD D    ; AC ← AC + M[D]
STORE T  ; T ← AC
LOAD A   ; AC ← M[A]
ADD B    ; AC ← AC + M[B]
SUB T    ; AC ← AC – M[T]
STORE X  ; X ← AC
T is the temporary memory location required for storing the intermediate result.
Zero-address Instructions Zero-address instructions are used by stack-organized computers, which
do not use any address field for the operation-type instructions. The name “zero-address” is given to
this type of computer because of the absence of an address field in the computational instructions.
However, two basic instructions in stack PUSH and POP require an address field to specify the
destination or source of operand. The assembly language program using zero-address instructions is
written next. In the comment field, the symbol TOS is used, which means the top of stack.
PUSH A   ; TOS ← A
PUSH B   ; TOS ← B
ADD      ; TOS ← (A + B)
PUSH C   ; TOS ← C
PUSH D   ; TOS ← D
ADD      ; TOS ← (C + D)
SUB      ; TOS ← (A + B) – (C + D)
POP X    ; X ← TOS
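How a stack machine executes this program can be sketched in Python (an illustrative simulation, not from the text):

def run(program, memory):
    stack = []
    for op, *args in program:
        if op == "PUSH":
            stack.append(memory[args[0]])
        elif op == "POP":
            memory[args[0]] = stack.pop()
        elif op == "ADD":
            b, a = stack.pop(), stack.pop(); stack.append(a + b)
        elif op == "SUB":
            b, a = stack.pop(), stack.pop(); stack.append(a - b)
    return memory

mem = {"A": 7, "B": 3, "C": 2, "D": 4}
prog = [("PUSH", "A"), ("PUSH", "B"), ("ADD",), ("PUSH", "C"),
        ("PUSH", "D"), ("ADD",), ("SUB",), ("POP", "X")]
print(run(prog, mem)["X"])   # (7 + 3) - (2 + 4) = 4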
The ALU of the CPU executes instructions as dictated by the op-code field. The instructions operate on data stored in registers or memory. The different ways in which the location of an operand is specified in an instruction are referred to as addressing modes, and a computer uses a variety of them, for the following purposes:
1. To give the programmer versatility or flexibility with respect to the number of instructions and the execution time.
2. To reduce the length of instructions and hence the size of programs, since both parameters are tied to memory capacity.
Instruction Cycle Before discussing the different addressing modes, it is important to understand the instruction cycle of a computer. The processing required for a single instruction is called the instruction cycle. The control unit's task is to step through the instruction cycle (see Fig. 5.5), which can be divided into five major phases:
1. Fetch the instruction from memory.
2. Decode the instruction.
3. Fetch the operand(s) from memory or register.
4. Execute the whole instruction.
5. Store the output result to the memory or register.
Step 1 is performed using a special register in the CPU called the program counter (PC), which holds the address of the next instruction to be executed. If the current instruction is a simple arithmetic/logic or load/store type, the PC is automatically incremented; otherwise the PC is loaded with the address dictated by the currently executing instruction. The decoding done in step 2 determines the operation to be performed and the addressing mode of the instruction, for calculating the addresses of the operands. Having obtained the operand addresses, the CPU fetches the operands in step 3 from memory or registers and stores them in registers. In step 4, the ALU of the processor executes the instruction on the operands held in registers. After execution, in phase 5 the result is stored back in memory or a register, and control returns to step 1 to fetch the next instruction in sequence. All these sub-operations are controlled and synchronized by the control unit.
A computer generally has variety of addressing modes. Sometimes, two or more addressing modes
are combined in one mode. The popular addressing modes are discussed next.
Class I Here, no address field is used in instruction.
1. Implied (or Inherent) mode In this mode the operands are indicated implicitly by the instruction. The accumulator register is generally used to hold the operand, and after the instruction executes the result is stored in the same register. For example:
(a) RAL; rotates the content of the accumulator left through carry.
(b) CMA; takes the complement of the content of the accumulator.
This mode is very popular with 8-bit microprocessors such as Intel's 8085.
2. Immediate mode In this mode the operand is mentioned explicitly in the instruction. In other
words, an immediate-mode instruction contains an operand value rather than an address of it in the
address field. To initialize register to a constant value, this mode of instructions is useful. For
example:
(a) MVI A, 06; Loads equivalent binary value of 06 to the accumulator.
(b) ADI 05; Adds the equivalent binary value of 05 to the content of AC.
3. Stack addressing mode Stack-organized computers use stack addressed instructions. In this ad-
dressing mode, all the operands for an instruction are taken from the top of the stack. The instruction
does not have any operand field. For example, the instruction
SUB
uses only one op-code (SUB) field, no address field. Both the operands are in the topmost two
positions in the stack, in consecutive locations. When the SUB instruction is executed, two operands
are popped out automatically from the stack one-by-one. After subtraction, the result is pushed onto
the stack. Since no address field is used, the instruction is short.
Class II Here, address field is register address.
4. Register (direct) mode In this mode processor registers hold the operands; in other words, the address field is now a register field, which names the registers containing the operands required by the instruction. A particular register is selected by a k-bit register field in the instruction, choosing one of the CPU's 2ᵏ registers. This mode is useful in long programs for keeping intermediate results in registers rather than in memory, which results in fast execution, since register access is much faster than memory access.
For example:
ADD R1, R2; Adds contents of registers R1 and R2 and stores the result in R1.
5. Register indirect mode In this mode the instruction specifies a CPU register that holds the address of the operand in memory; in other words, the address field names a register which contains the memory address of the operand (see Fig. 5.6, register indirect mode). This mode is very useful for rapid access to main memory locations such as the elements of an array. Its advantage is that a memory location can be accessed using only the small number of bits needed to name a register in the instruction's address field, rather than the full memory address.
6. Auto-increment or auto-decrement mode This is similar to the register indirect mode, except that the register's content is incremented or decremented after (or before) it is used to access memory. When the address stored in the register refers to an array of data in memory, the register must be incremented or decremented after every access, and this mode achieves that automatically.
Class III Here, address field is a memory address.
Before discussing different addressing modes in this category, we need to know about the effective
address of the operand.
Sometimes the instruction gives the address of the operand directly in its format. Sometimes it gives neither the operand nor its address explicitly; instead, it specifies information from which the memory address of the operand can be determined. This address is referred to as the effective address.
7. Direct (or Absolute) address mode In this mode the instruction contains the memory address of the operand explicitly; thus, the address part of the instruction is the effective address (Fig. 5.7, direct addressing mode). Since the operand address is directly available in the instruction, there is no effective-address calculation step, and the instruction cycle time is reduced. Examples of direct addressing are:
(a) STA 2500H ; stores the content of the accumulator in memory location 2500H.
(b) LDA 2500H ; loads the accumulator with the content of memory location 2500H.
All branch-type instructions use direct addressing, because the address field of these instructions specifies the actual branch address.
8. Indirect address mode In this mode the instruction gives a memory address in its address field, and that memory location holds the address of the operand. Thus, the address field of the instruction gives the address of the operand's address.
The address field value 2500 may be an operand value or memory address or register address of
operand, which depends on the mode of instruction.
Table 5.1 shows the effective address of operand and content of AC register after the instruction
execution for different addressing modes, if applicable.
In case of relative addressing, the effective address of the operand is calculated as:
Effective address = Content of PC + Address part of instruction
= 2002 + 2500
= 4502
The content of PC which holds the address of the next instruction to be executed, is 2002 since the
current instruction being executed is stored in the locations 2000 and 2001. The operand address for
other addressing modes can easily be determined.
An instruction set of a processor is the collection that defines all its instructions; a complete instruction set is often referred to as the instruction set architecture (ISA) of the processor. Only instructions from this set can be used in a program that runs on the processor. An instruction set of a processor must have the following characteristics:
1. Completeness We should be able to construct a machine language program to evaluate any
function that is computable using a reasonable amount of memory space.
2. Efficiency Frequently required functions can be performed rapidly using relatively few instruc-
tions.
3. Regularity The instruction set should contain expected op-codes and addressing modes. For
example, if there is left shift op-code, there should be right shift op-code.
4. Compatibility To reduce hardware and software design costs, the instructions may be required to
be compatible with those of existing machines.
Table 5.2 lists some sample instructions of each type. Since different computer manufacturers use different symbolic names for instructions in assembly language notation, even for the same instruction, simple mnemonics are adopted in this table for better comprehension.
NOP No operation is specified, but program execution continues.
EXECUTE Fetch operand from specified location and execute as instruction; note
that PC is not modified.
Input-output IN (READ) Copy of data from specified IO port to specified or implied destination.
OUT (WRITE) Copy of data from specified or implied source to IO port.
START IO Transfer instructions to I/O processor (IOP) to initiate an IO operation.
HALT IO Transfer instructions to IOP to terminate an IO operation.
TEST IO Transfer status information from IO system to specified destination.
REVIEW QUESTIONS
Group A
1. Choose the most appropriate option for the following questions:
(i) An instruction set
(xii) A two-byte-long assembly language instruction BR 09 (a branch instruction) is stored at location 1000 (all numbers are in hex). After the instruction fetch, the program counter (PC) holds the address
(a) 1002 (b) 1000 (c) 1009 (d) 100B.
(xiii) An indexed addressed instruction having an address field with all 0 bits is equivalent to:
(a) register direct mode (b) register indirect mode
(c) memory indirect mode (d) memory indirect mode.
(xiv) The CPI value for a RISC processor is
(a) 1 (b) 2 (c) 3 (d) none.
Group B
2. What is meant by “instruction set of a machine”? What are the different parameters that determine
the design of an instruction set of a machine? Explain.
3. What are the different fields of an instruction generally used in a computer?
4. Classify the CPU organizations. Give one example for each.
5. What are 0, 1, 2 and 3-address machines? Give examples.
6. Write the assembly language procedures using 0-, 1-, 2- and 3-address instructions to implement the statement X = (A + B × C)/(D – E), using suitable assumptions.
7. What is the importance of stack in some CPU organization?
8. What is the difference between big-endian and little-endian?
9. Why do we need various addressing modes?
10. What is instruction cycle? Describe with the help of flowchart.
11. Briefly describe following addressing modes with example:
(a) implied
(b) immediate
(c) stack
(d) register
(e) register indirect
(f ) auto increment
(g) direct
(h) indirect
(i) relative
(j) base
(k) indexed.
12. What is the main difference between base and index register addressing modes?
13. What are the characteristics of a good instruction set?
14. Compare and contrast between CISC and RISC. Give two examples of each.
15. Why every computer is associated with a set of general purpose registers?
16. What is the maximum number of 0-address, 1-address and 2-address instructions if the instruction size is 32 bits with a 10-bit address field?
17. There are 54 processor registers, 5 addressing modes and an 8K × 32 main memory. State the instruction format and the size of each field if each instruction supports one register operand and one address operand.
18. Why would it not be a good idea to implement a stack using a processor’s register file?
19. What value remains on the stack after the following sequence of instructions?
PUSH #6 (symbol # indicates direct value of the number)
PUSH #8
PUSH # 4
5.22 Computer Organization and Architecture
ADD
PUSH # 12
SUB
MULT
20. What are the merits and demerits of fixed-length and variable-length instruction formats?
21. Is there any possible justification for an instruction with two op-codes?
22. A relative mode branch instruction is stored in memory location 530 (decimal). The branch is made
to the location 30 (decimal). What is the effective address?
SOLVED PROBLEMS
1. If each register is specified by 3 bits and the instruction ADD R1, R2, R3 is two bytes long, what is the length of the op-code field?
Answer
The op-code field can have 16 – 3 – 3 – 3 = 7 bits.
Op-code (7-bit) | R1 (3-bit) | R2 (3-bit) | R3 (3-bit)
2. What is the maximum number of 0-address, 1-address and 2-address instructions if the in-
struction size is of 32-bit and 10-bit address field?
Answer
In 0-address instructions, no address field is used, so all 32 bits of the instruction can serve as the op-code field. Therefore, the maximum number of 0-address instructions is 2³².
In 1-address instructions, one 10-bit address field is used, so the remaining 32 – 10, i.e. 22, bits can serve as the op-code field. Therefore, the maximum number of 1-address instructions is 2²².
In 2-address instructions, two address fields collectively occupy 20 bits, so the remaining 32 – 20, i.e. 12, bits can serve as the op-code field. Therefore, the maximum number of 2-address instructions is 2¹².
3. There are 58 processor registers, 7 addressing modes and a 16K × 32 main memory. State the instruction format and the size of each field if each instruction supports one register operand and one address operand.
Answer
The processor has 58 registers, so 6 bits are needed to specify a register uniquely (because 32 < 58 ≤ 64).
The number of addressing modes is 7, so 3 bits are used for the mode field (because 7 < 8).
The main memory size is 16K × 32, so a 14-bit address is generated to access each 32-bit word (16K = 2¹⁴).
Therefore, the op-code field requires 32 – 3 – 6 – 14 = 9 bits.
The instruction format will be:
Op-code (9-bit) | Mode (3-bit) | Register addr. (6-bit) | Memory addr. (14-bit)
4. The 32-bit value 40A75429 is stored at location 1000. What is the value of the byte at address 1002 if the system is big-endian? If little-endian?
Answer
In big-endian assignment, the most significant byte is stored at the lowest address and the least significant byte at the highest address; therefore, byte 54 is stored at location 1002.
In little-endian assignment, the least significant byte is stored at the lowest address and the most significant byte at the highest address; therefore, byte A7 is stored at location 1002.
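This can be checked with Python's struct module (a quick illustration):

import struct

value = 0x40A75429
big = struct.pack(">I", value)       # b'\x40\xa7\x54\x29'
little = struct.pack("<I", value)    # b'\x29\x54\xa7\x40'
print(f"big-endian byte at 1002: {big[2]:02X}")        # 54
print(f"little-endian byte at 1002: {little[2]:02X}")  # A7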
5. What are advantages of general-register based CPU organizations over stack based CPU
organizations?
Answer
There are three main advantages of general-register based CPU organizations over stack based CPU
organizations.
(a) In general-register based CPU organizations, reading a register does not affect its content,
whereas, in stack based CPU organizations, reading value from the top of the stack removes
the value from the stack.
(b) In general register-based CPU organizations, any register from register file can be chosen to
keep values while writing a program; whereas, in stack based CPU organizations, accessing
values is limited by the LIFO (last-in first-out) nature of the stack.
(c) Since programs written for general-register-based CPU organizations make fewer memory references, their effective execution is faster than in stack-based CPU organizations, where the stack is generally implemented with memory locations accessed in LIFO (last-in first-out) order.
6. Assuming that all registers initially contain 0, what is the value of R1 after the following
instruction sequence is executed?
MOV R1, #6
MOV R2, #5
ADD R3, R1, R1
SUB R1, R3, R2
MULT R3, R1, R1
Answer
The instruction MOV R1, #6 places value 6 into register R1 and instruction MOV R2, #5 places value
5 into register R2. The instruction ADD R3, R1, R1 performs addition of values 6 in R1 and 6 in R1
and result 12 is stored in register R3. The instruction SUB R1, R3, R2 subtracts value 5 in R2 from 12
in R3 and result 7 is stored in register R1. Last instruction MULT R3, R1, R1 multiplies values 7 with
7 (content of R1) and result 49 is put into register R3. Therefore, the value of R1, after the execution
of all instructions in the sequence, is 7.
7. What value remains on the stack after the following sequence of instructions?
PUSH #3 (symbol # indicates direct value of the number)
PUSH #5
PUSH #4
ADD
PUSH #7
SUB
MULT
Answer
After first three push operations, the stack contains 4, 5, 3 starting from the top of the stack. Then the
ADD operation automatically pops two top most values from the stack, which are here 4 and 5, and
then added and sum value 9 is push on the top of the stack. Thus, now stack has 9 and 3. After PUSH
#7, the stack contains 7, 9, 3 starting from the top. Next instruction SUB subtracts the top value of the
stack from the next value down in the stack. So after SUB instruction execution, the value 9 – 7 i.e. 2
is push on the top of the stack. Finally, MULT instruction pops 2 and 3 from the stack and pushes
2 ¥ 3 i.e. 6 on the stack.
8. Why is a stack not generally implemented using a processor's register file?
Answer
The stack gives the programmer the illusion of a large storage space. Owing to the limited number of registers in a CPU, a system that used only the register file to implement its stack would only be able to push a small amount of data onto the stack. Programmers would then have to account for the finite size of the register-implemented stack while writing programs, which would make programming harder.
9. Explain why a given (Infix) arithmetic expression needs to be converted to Reverse Polish
Notation (RPN) for effective use of a stack organization.
Answer
Stack-organized computers are better suited to postfix (RPN) notation than traditional infix notation.
In infix notation, operator is placed between operands. In postfix notation, operator is placed after
operands. For example, infix notation A * B becomes AB * in postfix notation.
Once an expression is recoded in postfix notation, converting it into a stack-based program is very
easy. Starting from the left, each operand is replaced with a PUSH operation to place the operand on
the stack and operator is replaced with the appropriate instruction to perform the operation.
10. Convert the given arithmetic expression to RPN Notation:
(A + B ^ D) / (E – F) + G
Answer
Conventionally, there are three levels of precedence for the usual five binary operators:
Highest: exponentiation (^)
Next highest: multiplication (*) and division (/)
Lowest: addition (+) and subtraction (–)
Based on these priorities, the given expression can be fully parenthesized as
((A + (B ^ D))/(E – F)) + G
Now the equivalent postfix (RPN) expression can be derived step by step:
((A + (BD^))/(EF–)) + G
⇒ ((ABD^+)/(EF–)) + G
⇒ (ABD^+EF–/) + G
⇒ ABD^+EF–/G+
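This conversion can be mechanized with the classic shunting-yard algorithm; the Python sketch below (an illustration, not from the text) handles single-letter operands and treats ^ as right-associative:

PREC = {"^": 3, "*": 2, "/": 2, "+": 1, "-": 1}

def to_rpn(tokens):
    out, ops = [], []
    for t in tokens:
        if t.isalnum():
            out.append(t)                       # operand goes straight out
        elif t == "(":
            ops.append(t)
        elif t == ")":
            while ops[-1] != "(":
                out.append(ops.pop())
            ops.pop()                           # discard the "("
        else:  # operator: pop higher (or equal, if left-associative) precedence
            while (ops and ops[-1] != "(" and
                   (PREC[ops[-1]] > PREC[t] or
                    (PREC[ops[-1]] == PREC[t] and t != "^"))):
                out.append(ops.pop())
            ops.append(t)
    return out + ops[::-1]

print("".join(to_rpn(list("(A+B^D)/(E-F)+G"))))   # ABD^+EF-/G+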
13. What do you mean by instruction cycle, machine cycle and T states?
Answer
Instruction cycle: The processing required for a single instruction is called instruction cycle. The
control unit’s task is to go through an instruction cycle that can be divided into five major phases:
1. Fetch the instruction from memory.
2. Decode the instruction.
3. Fetch the operand(s) from memory or register.
4. Execute the whole instruction.
5. Store the output result to the memory or register.
Machine cycle: A machine cycle consists of necessary steps carried out to perform the memory access
operation. Each of the basic operations such as fetch or read or write operation constitutes a machine
cycle. An instruction cycle consists of several machine cycles.
T-states: One clock cycle of the system clock is referred to as T-state. A machine cycle consists of
several T-states.
14. A two-byte relative mode branch instruction is stored in memory location 1000. The branch is
made to the location 87. What is the effective address?
Answer
Since the instruction is two bytes long, the content of the PC is 1000 + 2 = 1002 after the instruction fetch. The effective branch address is calculated by adding the content of the PC to the address part of the instruction (the offset), which is 87. Thus the effective address is 1002 + 87 = 1089, and after the instruction executes, the program branches to address 1089 in memory.
15. What is load-store architecture? What are advantages and disadvantages of this architecture
over other general register based architectures?
Answer
In a load-store architecture, only two instructions – load and store – can access the memory system. In other general-register-based architectures, instructions other than load and store can also access operands in the memory system.
Advantages:
(a) A program can be implemented with a small, simple set of instructions.
(b) Minimizing the set of instructions that can access the memory system makes the design of the control unit simpler.
(c) Limiting the accesses to the memory system increases the overall performance of the machine.
Disadvantages:
(a) As only two instructions can access memory, programs become longer, and storing them requires more memory.
(b) The resulting long sequences of simple instructions make programming more laborious.
CHAPTER
6
Design of Control Unit
6.1 INTRODUCTION
The preliminary concepts forming the basis of control unit design are the register-transfer micro-operations and their analytical descriptions. In Section 3.1 we discussed register transfer in detail; there we saw that a register transfer occurs under some predetermined control condition(s), which is (are) generated by the control unit. Here we take another example for further illustration.
Comparison between the Two Methods The microprogramming approach is more expensive than the hardwired approach, since a control ROM is needed.
The main advantage of microprogramming is that it provides a well-structured control organization: control signals are systematically encoded into formatted words (microinstructions). With microprogramming, many additions and changes can be made simply by changing the microprogram in the control memory, because the control signals are embedded in a kind of low-level software called firmware. A small change in the hardwired approach, by contrast, may lead to redesigning the entire system.
Nowadays microprogramming is accepted as a standard tool for designing the control unit of a computer. For example, processors such as the IBM 370, the PDP-11 and the Intel 80x86 family have microprogrammed control units, whereas some earlier processors, such as Zilog's 16-bit Z8000, used a hardwired control unit.
Example 6.2 Multiplier Control Unit Let us take one example to illustrate the design proce-
dure.
Step 1 (Statement of task) Implement a Booth’s multiplier to multiply two signed
4-bit numbers.
We know that Booth’s procedure inspects a pair of multiplier bits (refer Section
2.6.2) and performs one of the following actions:
Multiplier bits inspected Action
Q[i] Q[i-1] (in ith position)
0 0 None
0 1 Add M
1 0 Subtract M
1 1 None
Step 2 To design Booth’s multiplication method, the processing section is pro-
posed in the Fig. 6.3. As mentioned in the Section 2.6.2, the 4-bit register M will
hold the multiplicand. The mul-
tiplier Q register is 5-bit wide.
Initially, the high-order 4-bit of
this register will hold the 4-bit
multiplier. The least-significant
bit of this register is initialized
with the fictitious 0. The 4-bit
adder/subtractor unit is used to
perform the operations A + M
or A – M. The result produced
by this hardware unit is always
stored to the 4-bit accumulator Figure 6.3 Processing section for 4 ¥ 4 Booth’s
A. Here, the accumulated partial multiplication
product stored in (AQ) register
pair is shifted right. The L register is used to keep track of the iteration count. In the
example case, this register is initialized with decimal 4 and thus L is 3-bit in length
and decremented by 1 after the completion of each iteration. Thus, the algorithm
terminates when L reaches decimal 0. When L equals to decimal 0 (termination of
algorithm), the high- and low-order 4 bits of the final product are found in the
registers A and Q, respectively. The 4-bit data buses - Inbus and Outbus, are used to
transfer data into and out of the processing section respectively.
Step 3 For 4 ¥ 4 Booth’s multiplication algorithm, a register transfer description is
devised next. Q [0: –1] is used to indicate the low-order 2 bits of the Q register
(Initially Q[0] indicates the lsb of Q register and Q[–1] indicates a fictitious 0).
Similarly, Q[3:0] indicates the high-order 4 bits of the Q register. The last step, Go
to HALT, introduces an infinite loop after the algorithm is completed.
Registers: M[4], A[4], Q[5], L[3];
Buses: Inbus[4], Outbus[4];
START A ¨ 0, M ¨ Inbus, L¨ 4
Q[3:0] ¨ Inbus, Q[–1] ¨ 0;
LOOP If Q[0:–1] = 01 then go to ADD
If Q[0:–1] = 10 then go to SUB
Design of Control Unit 6.5
Go to RSHIFT;
ADD A ¨ A + M;
Go to RSHIFT;
SUB A ¨ A – M;
RSHIFT ASR (AQ), L ¨ L–1;
If L π 0 then go to LOOP
Output = A;
Output = Q[3:0];
HALT Go to HALT;
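As a cross-check of this register-transfer description, the following Python sketch (an illustration, not part of the hardware design) mirrors the A, Q, Q[–1] and L behaviour for 4-bit signed operands:

def booth_multiply(m, q, bits=4):
    mask = (1 << bits) - 1
    A, Q, q_1, L = 0, q & mask, 0, bits
    M = m & mask
    while L != 0:
        pair = ((Q & 1) << 1) | q_1            # inspect Q[0] Q[-1]
        if pair == 0b01:                       # ADD:  A <- A + M
            A = (A + M) & mask
        elif pair == 0b10:                     # SUB:  A <- A - M
            A = (A - M) & mask
        # RSHIFT: arithmetic shift right of (A, Q, Q[-1]) as one unit
        sign = A >> (bits - 1)
        q_1 = Q & 1
        Q = ((A & 1) << (bits - 1)) | (Q >> 1)
        A = (sign << (bits - 1)) | (A >> 1)
        L -= 1
    product = (A << bits) | Q                  # high half in A, low half in Q
    if product >= 1 << (2 * bits - 1):         # interpret as signed 8-bit
        product -= 1 << (2 * bits)
    return product

print(booth_multiply(-3, 5))   # -15
print(booth_multiply(7, -2))   # -14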
Step 4 The processing section contains three main elements:
• a 4-bit adder/subtractor,
• general-purpose registers, and
• tri-state buffers.
The operational characteristics of these three elements are provided in Fig. 6.4. By applying the proper values to the control inputs C, L, R and D, four operations (clear, parallel load, right shift and decrement) can be performed; a clock circuit synchronizes all these operations. The 4-bit adder/subtractor can be implemented using a 4-bit parallel adder chip and four XOR gates (as implemented in Section 3.5.4). A general-purpose register can be built from standard flip-flops and gates. The tri-state buffers are used to control the data transfer to the Outbus.
Step 5 There are 10 control signals required: C0, C1, C2, C3, C4, C5, C6, C7, C8, C9;
their tasks are listed next. The micro-operations A ← 0, M ← Inbus, L ← 4 are executed
when C0, C1 and C2 are held high. Similarly, other micro-operations are performed by
activating the proper signals; the signals' tasks are self-explanatory. A detailed
logic diagram of the processing section along with the various control-signal points is
shown in Fig. 6.5. In the diagram, a total of 8 tri-state buffers are needed, of which
one set of 4 buffers is controlled by C8 and another set of 4 buffers by C9. For
simplicity only two buffers are shown in Fig. 6.5: one represents the set controlled by
C8 and the other the set controlled by C9.
C0: A ← 0            C5: A ← F
C1: M ← Inbus        C6: ASR (AQ)
C2: L ← 4            C7: L ← L – 1
There are 10 states in the controller state diagram. Ten non-overlapping timing
signals (T0 to T9) must be generated for the controller to perform the Booth's
algorithm, and only one of them is high during any clock pulse.
Step 8 Since a minimum of 10 clock periods are needed for the 10 states of the
controller, a mod-16 counter and a 4-to-16 decoder are used to generate the clock
periods and to select one of the control signals at the appropriate state,
respectively. The characteristics of the mod-16 counter are shown in Fig. 6.8.
Step 9 The controller and its logic diagram are shown in Fig. 6.9. The main
component of this design is the sequence controller (SC) hardware, which sequences
the controller as indicated in the state diagram in Fig. 6.7. The truth table of the SC
is derived from the controller's state diagram and is shown in Table 6.1.
For example, consider the logic involved in deriving the first entry of the SC truth
table. Observe that the mod-16 counter is loaded with the specified external data if
the counter control inputs C and L are 0 and 1 respectively. From the controller’s
state diagram, it can be observed that if the present control state is T2 (counter output
= 0010) and if the bit pair inspected is 00 (i.e., Q[0] Q[–1] = 00) then the next state
will be T5. When these input conditions occur, the counter must be loaded with the
external data value 0101 (when the counter output = 0101, T5 = 1). Therefore, the
SC generates load (L) = 1 and d3 d2 d1 d0 = 0101.
Using the same reasoning, the last entry of the SC truth table is obtained. From the
controller's state diagram, it can be observed that if the present state is T9, the
next control state is again T9 (the controller stays in an infinite loop). The SC must
generate the outputs load (L) = 1 and d3 d2 d1 d0 = 1001 to obtain the desired state
sequence.
Similarly, other entries of the SC truth table are derived.
When the counter load control input L = 0, the counter automatically counts up in
response to the clock pulse (because the enable input E is fixed at 1). In other words,
normal sequential flow occurs when the load input (L) = 0. Such normal sequencing is
desirable in the following situations:
• Present state is T0, T1, T4, T5, T7, or T8.
• Present state is T2 and Q[0] Q[–1] = 01.
• Present state is T6 and Z = 1.
In these cases the SC should not affect the counter load control input L; hence these
possibilities are excluded from the SC truth table. The SC must exercise control only
when the counter needs to deviate from its normal counting sequence.
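The SC's behaviour can be summarized in a few lines of code. The following Python sketch is
our own illustration of the load/data outputs discussed above; the entries for T2 with bit
pair 00 and for T9 come directly from the text, while the target states for T2 with bit pair
10, for T3, and for the loop-back from T6 are assumptions consistent with the state diagram:

# Sketch of the sequence controller (SC) logic for the mod-16 counter.
def sequence_controller(state, bit_pair=None, z=None):
    """Return (load, d): load = 1 loads the counter with d, else it counts up."""
    if state == 2 and bit_pair == 0b00:
        return 1, 0b0101              # T2, pair 00 -> T5 (RSHIFT), from the text
    if state == 2 and bit_pair == 0b10:
        return 1, 0b0100              # T2, pair 10 -> subtract state (assumed T4)
    if state == 3:
        return 1, 0b0101              # after ADD, skip the SUB state (assumed)
    if state == 6 and z == 0:
        return 1, 0b0010              # L != 0: loop back to the test state (assumed T2)
    if state == 9:
        return 1, 0b1001              # T9 -> T9: HALT infinite loop, from the text
    return 0, None                    # normal counting (load = 0)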
CPU Hardwired Control Unit The general CPU hardwired control unit is depicted in Fig. 6.10.
The inputs to the control unit are the contents of the instruction register (IR), the clock
signal, and the status (flag) and control signals. Consider the instruction register (IR).
The control unit performs different functions for different instructions, so there must be a
unique logic input for each op-code to simplify the control unit logic. For this purpose a
decoder is used, which takes the encoded op-code as input and activates a single output line
for it. To synchronize all micro-operations, a clock unit issues a repetitive sequence of
pulses. The clock cycle time must be long enough to allow the propagation of signals along
data paths and through processing units. The control unit generates different control signals
at different time units within a single instruction cycle. For this reason, a counter is used
as input to the control unit to generate the timing states T1, T2, T3 and so on for different
control signals. At the end of the instruction cycle, the control unit must reinitialize the
counter to T1 using a special Reset signal.
As with a conventional program, the control words are retrieved and interpreted. The
instructions of a CPU are stored in the main memory; they are fetched and executed in
sequence, and the CPU can perform different functions simply by changing the instructions
stored in the main memory. Similarly, the control unit can execute a different control
operation by changing the contents of the control memory (CM). Hence, the microprogrammed
approach offers greater flexibility than its hardwired counterpart, since it is based on the
programming concept and gives an easy way of altering the contents of the CM.
Usually, all microinstructions have three important fields:
• Control field
• Next-address field
• Condition for branching
A control memory, in addition to the conventional main memory, is used in the
microprogramming approach. Thus it is desirable to give more emphasis to minimizing the
length of the microinstruction, since the length of the microinstruction decides the size of
the control memory as well as the cost involved in this approach. The following factors
directly affect the length of a microinstruction:
• How many micro-operations can be activated simultaneously (the degree of parallelism).
• The control field organization.
• The method by which the address of the next microinstruction is specified.
Several micro-operations can be executed simultaneously: a single microinstruction with a
common op-code can specify all the micro-operations executed in parallel, which allows short
microprograms to be written. The length of the microinstruction increases whenever greater
parallelism is needed. Conversely, short microinstructions have limited capability for
expressing parallelism; since massive parallelism is not possible using short
microinstructions, the overall length of a microprogram written with them increases.
There are various ways to organize control information. A simple way to organize the control
field would be to have 1 bit for each control line that controls the processing unit. This
allows full parallelism, and no decoding of the control field is needed. However, this method
cannot use the control memory efficiently when it is impossible to invoke all control
operations simultaneously.
Consider the example shown in Fig. 6.11. Assume there are four registers A, B, C and D whose
contents are transferred to the destination register R when the appropriate control line is
activated:
C0: R ← A
C1: R ← B
C2: R ← C
C3: R ← D
Since there is only one destination register R, it is not possible to allow more than one
transfer at any given time. If one bit is allocated for each control signal, the control
field takes the form C3 C2 C1 C0, in which at most one bit may be set.
In this format there are only five valid binary patterns (0000, 0001, 0010, 0100 and 1000).
However, five distinct binary patterns can be represented using only 3 bits, according to
basic switching theory. Such an arrangement is illustrated in Fig. 6.12: the control
information is encoded into a 3-bit field, and a decoder is needed to obtain the actual
control information. The relationship between the encoded and the actual control information
is specified as follows:

E2 E1 E0    Action
0  0  0     No operation (NOP)
0  0  1     R ← A
0  1  0     R ← B
0  1  1     R ← C
1  0  0     R ← D

Figure 6.12 Encoded control arrangement
This way of organizing the control field is known as the encoded format method, which results
in a short control field and short microinstructions. One extra hardware element, namely a
decoder, is needed for this reduction; therefore, a compromise must be made.
Fifteen control lines can be specified in a fully unencoded form, one bit per control line,
or in a fully encoded form. In the first and second cases, the sizes of the control field are
16 and 4 bits, respectively. However, the second approach needs a 4-to-16 decoder to generate
the actual control signals. As a compromise, the control information can be given by a
partial encoding: the control signals are partitioned into disjoint groups of control fields,
so that two signals from different groups can be enabled in parallel. For the above example,
the control signals are partitioned into two groups:
Group 1: C1 C2 C3 C4 C5 C6 C7.
Group 2: C8 C9 C10 C11 C12 C13 C14 C15.
(Group 1, with seven signals plus NOP, needs a 3-bit subfield; Group 2, with eight signals
plus NOP, needs a 4-bit subfield.) With this grouping, C8 can be activated simultaneously
with C1 or C2, but C1 and C2 cannot be activated together. Using one 3-to-8 and one 4-to-16
decoder, the actual control signals are generated. In this case the control field requires
7 bits, E0 to E6. This technique is a mixed one, lying between the unencoded and fully
encoded approaches.
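A short sketch makes the partial-encoding idea concrete. The Python code below is our own
illustration (the field layout, with bits 6..4 for group 1 and bits 3..0 for group 2, is an
assumption consistent with the 7-bit example above); it decodes the two sub-fields
independently, so one signal from each group can be active in the same cycle:

# Decode a 7-bit partially encoded control field into the signals C1..C15.
def decode_control_field(field):
    g1 = (field >> 4) & 0b111     # 3-bit sub-field: 0 = NOP, 1..7 -> C1..C7
    g2 = field & 0b1111           # 4-bit sub-field: 0 = NOP, 1..8 -> C8..C15
    signals = set()
    if g1:
        signals.add(f"C{g1}")
    if g2:
        signals.add(f"C{g2 + 7}")
    return signals

# C2 and C8 together are legal (different groups) ...
assert decode_control_field((0b010 << 4) | 0b0001) == {"C2", "C8"}
# ... but a single field can never activate both C1 and C2.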
Horizontal and Vertical Microprogramming Based on the length of microinstructions, the
microprogramming method is either horizontal or vertical.
Having an individual bit for each control signal in the microinstruction format is known as a
horizontal microinstruction, as shown in Fig. 6.13. The unencoded method discussed above is
usually followed in a horizontal organization. Each bit in the microinstruction activates one
control signal, so several control signals can be generated simultaneously by a single
microinstruction, and the microinstruction is long. Horizontal microinstructions have the
following general attributes:
• Long formats.
• Ability to express a high degree of parallelism.
• Very little encoding of the control information.
In the IBM 360/model 50, the microinstructions used in the control unit follow the horizontal
format.
In a vertical microinstruction (Fig. 6.14), a single field produces an encoded control
sequence; the encoded technique is followed in this organization. The vertical
microprogramming technique takes more time to generate the control signals because of the
decoding time, and more microinstructions are needed; but the overall cost is less, since the
microinstructions are small in size. The horizontal microprogram releases control signals
faster, but the control memory size is large because of the increased word length. Thus,
vertical microinstructions are characterized by:
• Short formats.
• Limited ability to express parallel micro-operations.
• Considerable encoding of the control information.
In the IBM 370/model 145, the microinstructions used in the control unit follow the vertical
format.

Figure 6.14 Vertical microinstruction
Structure of Microprogrammed Control Unit We now describe the design of a typical
microprogrammed control unit. The architecture of a typical modern microprogrammed control
unit is shown in Fig. 6.15. The various components used in Fig. 6.15 are summarized next.
Control Memory Buffer Register (CMBR) The function of the CMBR is the same as that of the MBR
(memory buffer register) of the main memory. It is basically a latch and acts as a buffer for
the microinstructions retrieved from the CM. Typically, each microinstruction has three
fields:

Condition select | Branch address | Control function

The condition select field selects the external condition to be tested; the output of the MUX
will be 1 if the selected condition is true. Because the output of the MUX is connected to
the load input of the microprogram counter (MPC), the MPC is then loaded with the address
specified in the branch address field of the microinstruction. If the selected external
condition is false, the MPC simply points to the next microinstruction to be executed. Thus
this arrangement allows conditional branching. The control function field of the
microinstruction may hold the control information in encoded form, which thus may require
decoders.
Microprogram Counter (MPC) The task of the MPC is the same as that of the PC (program
counter) in the CPU: it holds the address of the next microinstruction to be executed.
Initially, it is loaded from an external source to point to the starting address of the
microprogram to be executed. From then on, the MPC is incremented after each microinstruction
fetch, and the fetched instruction is transferred to the CMBR. However, when a branch
instruction is encountered, the MPC is loaded with the contents of the branch address field
of the microinstruction held in the CMBR.
External Condition Select MUX Based on the contents of the condition select field of the
microinstruction, this MUX selects one of the external conditions. The condition to be
selected must therefore be specified in encoded form. Encoding leads to a short
microinstruction, which implies a small control memory; hence the cost is reduced. Suppose
two external conditions X1 and X2 are to be tested; then the condition-select values and the
actions taken are summarized next:

Condition select    Action taken
00                  No branching
01                  Branch if X1 = 1
10                  Branch if X2 = 1
11                  Always branch (unconditional branch)

The multiplexer has four inputs V0, V1, V2, V3, where Vi is routed to the multiplexer's
output when the condition select field has decimal equivalent i. Hence we require V0 = 0,
V1 = X1, V2 = X2, V3 = 1 to control the loading of microinstruction branch addresses into
the MPC.
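The fetch/branch behaviour of this arrangement can be sketched in a few lines. The following
Python fragment is our own illustration of the MPC/MUX interplay described above; the
microinstruction layout follows the three-field format, and the condition encoding follows
the table:

# One step of a microprogrammed control unit: fetch the microinstruction at MPC,
# evaluate the selected condition through the MUX, then branch or fall through.
def micro_step(cm, mpc, x1, x2):
    cond_sel, branch_addr, control = cm[mpc]   # CMBR <- CM[MPC]
    mux_inputs = [0, x1, x2, 1]                # V0 = 0, V1 = X1, V2 = X2, V3 = 1
    load = mux_inputs[cond_sel]                # MUX output drives MPC's load input
    next_mpc = branch_addr if load else mpc + 1
    return next_mpc, control                   # control field goes to the decoders

# Tiny control memory: (condition select, branch address, control field)
cm = [(0, 0, "C0,C1,C2"),   # 0: no branch, fall through
      (2, 0, "C6")]         # 1: branch to 0 if X2 = 1
assert micro_step(cm, 1, x1=0, x2=1) == (0, "C6")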
Example 6.3 The design of a typical microprogrammed control unit is discussed now. Consider
the implementation of a microprogrammed control unit for the 4 × 4 Booth's
multiplication. The first step is to write the microprogram in symbolic form;
the next task is to generate the control signals and the architecture of the
control unit; lastly, the microprogram is given in binary for the 4 × 4 Booth's
multiplication. The symbolic microprogram for 4 × 4 Booth's multiplication is as
follows:
Control memory address    Control word
0     START     A ← 0, M ← Inbus, L ← 4;
1               Q[3:0] ← Inbus, Q[–1] ← 0;
2     LOOP      If Q[0:–1] = 01 then go to ADD;
3               If Q[0:–1] = 10 then go to SUB;
4               Go to RSHIFT;
5     ADD       A ← A + M;
6               Go to RSHIFT;
7     SUB       A ← A – M;
8     RSHIFT    ASR (AQ), L ← L – 1;
9               If Z = 0 then go to LOOP;
10              Outbus = A;
11              Outbus = Q[3:0];
12    HALT      Go to HALT;
In this task, three conditions, Q[0] Q[–1] = 01, Q[0] Q[–1] = 10 and Z = 0, are
tested. Here, Z corresponds to the L register: when L ≠ 0, Z is reset to 0; otherwise
Z is set to 1. These three conditions are applied as inputs to the condition select
MUX. Additionally, to take care of the no-branch and unconditional-branch situations,
logic 0 and logic 1 are applied as data inputs to this MUX, respectively. Therefore,
the MUX must handle five data inputs and thus must be at least an 8-to-1 multiplexer,
and the condition select field must be 3 bits long.
With this design, the condition select field may be interpreted as below:

Condition select    Action taken
000                 No branching
001                 Branch if Q[0] = 0 and Q[–1] = 1
010                 Branch if Q[0] = 1 and Q[–1] = 0
011                 Branch if Z = 0
100                 Unconditional branching
With these details, the size of the control word is calculated as follows:

Size of a control word = size of the condition select field
                         + size of the branch address field
                         + number of control functions
                       = 3 + 4 + 10
                       = 17 bits.
Hence, the size of the CMBR is 17 bits and the size of the CM is 13 × 17 bits. The
complete hardware organization of the control unit and the control signals is shown
in Fig. 6.16.

C0 : A ← 0
C1 : M ← Inbus
C2 : L ← 4
C3 : Q[3:0] ← Inbus, Q[–1] ← 0
C4 : F = l + r
C′4 : F = l – r
C5 : A ← F
C6 : ASR (AQ)
C7 : L ← L – 1
C8 : Outbus = A
C9 : Outbus = Q[3:0]
Table 6.2
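The binary microprogram of Table 6.2 can be generated mechanically from the symbolic listing.
The sketch below is our own illustration: the 17-bit layout of 3 + 4 + 10 bits follows the
calculation above, while the bit ordering within the word is an assumption.

# Assemble one 17-bit control word: 3-bit condition select, 4-bit branch
# address, 10-bit control function field (one bit per signal C0..C9).
def control_word(cond_sel, branch_addr, signals):
    control = 0
    for s in signals:                 # e.g. signals = (0, 1, 2) for C0, C1, C2
        control |= 1 << s
    return (cond_sel << 14) | (branch_addr << 10) | control

# Address 0 of the microprogram: no branch, activate C0, C1, C2
word0 = control_word(0b000, 0, (0, 1, 2))
assert format(word0, "017b") == "00000000000000111"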
CPU Microprogrammed Control Unit Here we design a microprogrammed control unit for a basic
accumulator-based CPU, as shown in Fig. 6.17. This CPU consists of a data processing unit
(DPU) designed to execute the set of 10 basic single-address instructions listed in
Table 6.3. The instructions are assumed to be of fixed length and to operate on data words of
the same fixed length, say 32 bits. The function of the control unit (CU) is to manage the
control signals linking the CU to the DPU, as well as the control signals between the CPU and
the external memory M.
In order to design the CU, we first have to identify the relevant control actions
(micro-operations) needed to process the given instruction set using the hardware of
Fig. 6.17. The instruction execution behavior of the CPU is shown in Fig. 6.18 using a
flowchart. All instructions require a common instruction-fetch phase, followed by an
execution phase that varies with the instruction type. In the fetch phase the content of the
program counter (PC) is copied to the memory address register (AR). A memory read operation
is then executed, which transfers the instruction word I to the memory data register (DR);
this is expressed by DR ← M(AR). The op-code of I is transferred to the instruction register
(IR), where it is decoded; at the same time the PC is incremented to point to the next
consecutive instruction in M.
The op-code of the current instruction determines the subsequent operations to be performed.
For example, the store instruction ST X is executed in three steps: the address field X of
ST X is transferred to AR, the content of the accumulator (AC) is transferred to DR, and
finally the memory write operation M(AR) ← DR is performed. The branch-on-zero instruction
BZ adr is executed by first checking AC. If AC ≠ 0, no action is taken; if AC = 0, the
address field adr, which is in DR(ADR), is transferred to PC, thus performing the branch.
From Fig. 6.18 it can be seen that instruction fetching takes three cycles, while instruction
execution takes from one to three cycles.
The control signals and control points needed by the CPU are determined implicitly by the
micro-operations appearing in the flowchart. A suitable set of control signals for the CPU
and their functions are listed in Table 6.4. Figure 6.19 shows the approximate positions of
the corresponding control points in both the CU and the DPU. The control signals generated by
the CU are used as control inputs to the different units of the DPU and to memory. These
control lines can be divided into three basic groups:
— Operation select: C2, C9, C10, C11, C12.
— Memory control: C1, C8.
— Data transfer: C0, C3, C4, C5, C6, C7.
Here memory control refers to the external memory M. Many of the control signals route
information between the CPU's internal data and control registers.

FETCH:  AR ← PC;
        DR ← M(AR);
        PC ← PC + 1, IR ← DR(OP);
        go to IR;
LD:     AR ← DR(ADR);
        DR ← M(AR);
        AC ← DR, go to FETCH;
ST:     AR ← DR(ADR);
        DR ← AC;
        M(AR) ← DR, go to FETCH;
MOV1:   DR ← AC, go to FETCH;
MOV2:   AC ← DR, go to FETCH;
AND:    AC ← AC ∧ DR, go to FETCH;
NOT:    AC ← ¬AC, go to FETCH;
ADD:    AC ← AC + DR, go to FETCH;
SUB:    AC ← AC – DR, go to FETCH;
BRA:    PC ← DR(ADR), go to FETCH;
BZ:     if AC = 0 then PC ← DR(ADR), go to FETCH;
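To see how these routines cooperate, here is a small behavioural model. The Python sketch
below is our own illustration of the register transfers listed above; the memory layout and
the tuple instruction encoding are arbitrary assumptions, only a subset of Table 6.3's
instructions is modelled, and the HLT pseudo-instruction is added solely to stop the model.

# Behavioural model of the accumulator CPU's fetch/execute routines.
# Instructions are modelled as (opcode, address) pairs; data words are ints.
def run(memory, steps=100):
    PC = AC = 0
    for _ in range(steps):
        AR = PC                         # FETCH: AR <- PC
        DR = memory[AR]                 #        DR <- M(AR)
        PC, (op, adr) = PC + 1, DR      #        PC <- PC + 1, IR <- DR(OP)
        if op == "LD":
            AC = memory[adr]            # AC <- M(adr)
        elif op == "ST":
            memory[adr] = AC            # M(adr) <- AC
        elif op == "ADD":
            AC = AC + memory[adr]
        elif op == "SUB":
            AC = AC - memory[adr]
        elif op == "BRA":
            PC = adr
        elif op == "BZ":
            if AC == 0:
                PC = adr
        elif op == "HLT":               # not in Table 6.3; added to stop the model
            return AC
    return AC

memory = [("LD", 4), ("ADD", 5), ("ST", 6), ("HLT", 0), 7, 3, 0]
assert run(memory) == 10 and memory[6] == 10    # computes 7 + 3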
One may at first feel that a two-level structure will increase the overall cost, but it
actually improves the economy of the system by reducing the total space required.
The horizontal and vertical microprogramming concepts together give rise to the concept of
nanoprogramming. In effect, this method provides a useful trade-off between the two
techniques. The nanoprogramming technique offers significant savings in space when a group of
micro-operations occurs many times in a microprogram.
Let us consider a nanoprogrammed computer in which the control memory has dimension A × B, so
the size of this memory is A·B bits for storing the microprogram. Assume this microprogram
has K (K < A) unique microinstructions. These K microinstructions are held in a nanocontrol
memory (NCM) of size K × B; each occurs exactly once in the NCM. Each microinstruction in the
original microprogram is replaced with an address specifying the NCM location in which the
original B-bit wide microinstruction is stored. Since the NCM has K addresses, only
⌈log2 K⌉ bits are needed to specify one NCM address. Therefore, the size of each
microinstruction in a two-level control memory is only ⌈log2 K⌉ bits. This is shown in
Fig. 6.22.
The operation of a control unit using the two-level memory of Fig. 6.21 can be explained as
follows. The first microinstruction is read from the microprogram in the MCM; this
microinstruction is actually an address in the NCM. The content of that NCM location is the
control word, which is transferred to the nanoinstruction register (NIR). The bits in this
register are then used to control the gates for one cycle. After completion of the cycle, the
second microinstruction is read from the MCM and the process continues. Therefore, two memory
fetches are required (one for the MCM and the other for the NCM), whereas a conventional
single-level control memory needs only one fetch. This speed penalty of the two-level scheme
is compensated by the saving in memory cost when the same microinstructions occur many times
in the microprogram. Thus the main disadvantage of nanoprogramming is that a control unit
using an NCM is slower than one using a conventional control memory, since the
nanoprogramming concept uses a two-level memory.
The nanoprogramming concept was first used in the QM-1 computer designed around 1970 by
Nanodata Corporation. It is also employed in the Motorola 680X0 microprocessor series.
Example 6.4 Example of nanoprogramming Let us take a practical example of the
nanoprogramming concept: the nanomemory structure of the Motorola MC68000 16-bit
microprocessor in Fig. 6.23. It has 640 microinstructions, of which 280 are
unique, as shown in Fig. 6.23. The contents of the MCM are pointers to the NCM.
Each microinstruction of the MCM is therefore ⌈log2 280⌉ = 9 bits in length.
It can be seen that the MC68000 offers control memory savings. In the MC68000,
the MCM is 640 × 9 bits and the NCM is 280 × 70 bits, since there are 280 unique
microinstructions. If the MC68000 were implemented using a single CM, this memory
would be 640 × 70 bits. Therefore, the use of nanoprogramming saves a total of
640 × 70 – (640 × 9 + 280 × 70) bits = 19,440 bits.
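The same arithmetic applies to any two-level design, so it is worth capturing as a small
function. The sketch below is our own illustration; it evaluates the formula
A·B – (A·⌈log2 K⌉ + K·B) and reproduces both the MC68000 figure above and the numbers of
solved problem 7 later in this chapter.

from math import ceil, log2

# Bits saved by a two-level (nano) control memory: A microinstructions of
# width B, of which K are unique.
def nano_savings(A, B, K):
    pointer_bits = ceil(log2(K))                 # width of each MCM entry
    return A * B - (A * pointer_bits + K * B)

assert nano_savings(640, 70, 280) == 19440       # MC68000 example
assert nano_savings(2048, 117, 512) == 161280    # solved problem 7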
REVIEW QUESTIONS
Group A
1. Choose the most appropriate option for the following questions:
(i) Microprogrammed control unit is
(a) sequential logic controller
(b) low-cost control unit
(c) easily modifiable, because it follows programming technique
(d) none.
(ii) A hardwired control unit needs
(a) sequence controller and decoder, but does not use control memory
(b) state machine and control memory
(c) multiplexer and control memory
(d) encoder and control memory.
SOLVED PROBLEMS
Carry (C): It indicates whether there is an end-carry out of the most significant bit
position.
Zero (Z): It indicates whether the result is zero or non-zero.
Sign (S): It indicates whether the result is positive or negative.
Overflow (V): It indicates whether the operation produces an overflow.
There may be other flags, such as parity and auxiliary carry.
2. What are the advantages and disadvantages of a microprogrammed control unit over a
hardwired control unit?
Answer
Advantages of a microprogrammed control unit:
(a) It provides a well-structured control organization: control signals are systematically
transformed into formatted words (microinstructions). In contrast, a hardwired control
organization is implemented with logic gates, flip-flops, decoders and other digital
circuits.
(b) With microprogramming, many additions and changes can be made by simply changing the
microprogram in the control memory. A small change in the hardwired approach may lead to
redesigning the entire system.
Disadvantage of a microprogrammed control unit:
The microprogramming approach is more expensive than the hardwired approach, since a control
ROM is needed.
3. Suppose there are 15 micro-instructions in control memory that generate the control
signals for an instruction, and it takes 7 micro-instructions to read an instruction from
memory into the instruction register IR and then decode it. Assuming a read of a control
memory address takes 1 ns, what is the time taken by the processor for the instruction?
Answer
For reading an instruction from memory and decoding it, the number of micro-instructions
required is 7, and for execution, 15. Therefore, the time required to process the
instruction = (7 + 15) × 1 ns = 22 ns.
4. An encoded microinstruction format is to be used in a control unit. Show how a 9-bit
micro-operation field can be divided into subfields to specify 46 different control signals.
Answer
The 9-bit micro-operation field can be divided into two subfields to specify the 46 control
signals by partial encoding, as shown in the figure below. The control signals are
partitioned into disjoint groups of control fields, so two signals from different groups can
be enabled in parallel:
Group 1: C1 C2 … C31 (a 5-bit subfield).
Group 2: C32 C33 … C46 (a 4-bit subfield).
With this grouping, C32 can be activated simultaneously with C1 or C2, but C1 and C2 cannot
be activated together. Using one 5-to-32 and one 4-to-16 decoder, the actual control signals
are generated.
5. A processor has 28 distinct instructions, with 13 instructions having 12
micro-instructions each and 15 having 18 micro-instructions each.
(a) How many addresses are used in control memory?
(b) If three instructions jump to another set of micro-instructions, each having four
micro-instructions, then how many addresses are now used in control memory? Assume that
each micro-instruction also stores a branch address.
Answer
(a) Number of addresses in control memory for the 28 instructions = 13 × 12 + 15 × 18 = 426.
(b) Number of addresses in control memory = 13 × 12 + 15 × 18 + 3 × 4 = 438.
6. Why do most modern processors follow the microprogrammed control organization?
Answer
Microprogramming allows relatively complex instructions to be implemented with a small amount
of hardware, because the control memory stores the necessary control signals directly instead
of requiring the large number of logic gates used in a hardwired control organization.
Moreover, many additions and changes can be made by simply changing the microprogram in the
control memory.
7. A conventional microprogrammed control unit contains 2048 words of 117 bits each, of which
512 microinstructions are unique. Calculate the savings achieved by the nanoprogramming
technique, and calculate the sizes of the microcontrol memory and the nanocontrol memory.
Answer
The control unit has 2048 microinstructions, of which 512 are unique, as shown in the figure
below. The contents of the microcontrol memory (MCM) are pointers to the nanocontrol memory
(NCM). Each microinstruction of the MCM is therefore ⌈log2 512⌉ = 9 bits in length.
Thus the MCM is 2048 × 9 bits and the NCM is 512 × 117 bits, since there are 512 unique
microinstructions. In conventional microprogramming using a single CM, the memory size would
be 2048 × 117 bits. Therefore, the use of nanoprogramming saves a total of
2048 × 117 – (2048 × 9 + 512 × 117) bits = 1,61,280 bits.
7
Input-Output Organization
7.1 INTRODUCTION
Data transfer between the computer and external device takes place through I/O mechanism. One
communicates with a computer system via the I/O devices interfaced to it. The user can enter
programs and data using the keyboard on a terminal, executes the programs to obtain results and
finally the results may be displayed on monitor of the computer. Therefore, the I/O devices connected
to a computer system provide an efficient means of communication between the computer and the
outside world. These I/O devices are commonly known as peripherals and popular I/O devices used
are keyboard, monitor, printer, disk and mouse.
In this chapter, we will consider in detail interconnection system and various ways in which I/O
operations are performed.
Every computer supports a variety of peripheral devices. To use a peripheral device, two
modules are required:
(a) I/O interface or I/O controller.
(b) I/O driver.
• Peripherals are mainly electromechanical and electromagnetic devices, and their manner of
operation differs from that of the CPU and main memory, which are electronic devices.
Hence, a conversion of signal values may be required.
• Data codes and formats in peripheral devices are different from the code formats in the
CPU and memory.
• The data transfer rate of peripheral devices is usually slower than the transfer rate of
the CPU, and therefore a synchronization mechanism may be required.
• The various peripheral devices attached to a computer have different modes of operation,
and each must be controlled so as not to disturb the operation of the other peripherals
connected to the CPU.
In order to resolve these differences, computer systems must include a special hardware
device, called an I/O interface, with each peripheral to supervise and synchronize all I/O
transfers.
The I/O interface is linked to the central computer via the bus. A typical communication
linkage between the central computer (i.e. CPU and main memory) and several peripheral
devices is shown in Fig. 7.1. The I/O bus consists of data lines, address lines and control
lines. The magnetic disk, printer, monitor and keyboard are used in practically every
general-purpose computer. Each peripheral device has an I/O interface unit associated with
it. A device's interface circuit consists of an address decoder, the data and status
registers, and the control circuitry. Each I/O interface unit decodes the address and control
signals received from the I/O bus, interprets them for the peripheral and provides signals
for the peripheral. It synchronizes the data flow and supervises the transfer between
peripheral and processor. For example, the printer controller controls the paper motion, the
printing timing and the selection of printing characters. An I/O controller may be housed
separately or may be physically integrated with the peripheral.
The I/O bus from the central computer is attached to all peripheral controllers. To
communicate with a particular I/O device, the processor places a device address on the
address lines. The address decoder of each I/O interface monitors the address lines. When a
particular I/O interface detects its own address, it activates the path between the bus lines
and the peripheral device that it controls. All peripherals whose addresses do not correspond
to the address on the bus are disabled by their interfaces.
While the address is available on the address lines, the processor provides an operation code
on the control lines. The selected interface responds to the operation code and proceeds to
execute it. The operation code is referred to as an I/O command, and its meaning depends on
the type of peripheral that the processor is addressing. There are four types of I/O commands
that an I/O interface may receive when it is addressed by a processor:
1. Control It is used to enable an I/O device and to give directives specifying what to do.
For example, a magnetic tape unit may be instructed to rewind or to move forward one record.
These commands are tailored to the particular type of peripheral device.
2. Test It is used to test various status conditions associated with an I/O interface and its
peripheral. For example, the processor may want to know whether the peripheral of interest is
powered on and ready for use. It also may want to know whether the most recent I/O operation
has completed and whether any error occurred.
3. Read This causes the I/O interface to obtain a data item from the peripheral and place it
in an internal buffer (data register). The processor can then obtain the data item by
requesting that the I/O interface place it on the data bus.
4. Write This causes the I/O interface to take a data item from the data bus and subsequently
transfer it to the peripheral.
Some well-known I/O interface standards are: SCSI (Small Computer System Interface), USB
(Universal Serial Bus), IDE (Integrated Drive Electronics), RS-232C, FireWire and the
Centronics interface.
There are two types of I/O interface: serial and parallel. In a serial interface, there is
only one data line, and data bits are transmitted serially one after another. Examples
include USB, RS-232C and FireWire. In a parallel interface, there are multiple data lines in
parallel, so multiple bits can be transmitted simultaneously. Examples include SCSI, the
Centronics interface and IDE.
operation. After completing the I/O operation, the I/O driver returns control to the calling
program and passes back status information about the completion of the operation. Figure 7.2
illustrates the communication between the I/O controller and the application program. The
collection of I/O driver programs is called the BIOS (Basic Input/Output System).
The I/O drivers for the basic peripheral devices supported by a general PC are part of the
BIOS, which is physically stored in a ROM portion of main memory. The I/O drivers for other
peripherals are provided on a floppy diskette or CD; such a driver is installed on the hard
disk and brought into RAM by the bootstrap program during booting.
Like the I/O bus, the memory bus contains data, address and read/write control lines. In
addition to communicating with I/O, the processor must communicate with the memory unit.
There are three ways in which a processor can use computer buses to communicate with memory
and I/O:
1. Use two separate buses (address, data and control), one for memory and the other for I/O.
2. Use one common bus (address and data) for both memory and I/O, but separate control lines
for each.
3. Use one common bus (address, data and control) for memory and I/O.
In the first case, the computer has separate sets of data, address and control buses, one for
accessing memory and the other for I/O. This procedure is used in computers that provide a
separate I/O processor (IOP), also called an I/O channel, in addition to the CPU. The memory
communicates with both the CPU and the IOP through a memory bus. The purpose of the IOP is to
provide a separate path for the transfer of information between external I/O devices and
internal memory.
In the second case, computers use one common bus to transfer information between memory or
I/O and the CPU. The distinction between a memory transfer and an I/O transfer is made
through separate read and write lines. The CPU specifies whether the address on the address
lines is for a memory word or for an interface register by enabling one of the two possible
pairs of read and write lines: the I/O-read and I/O-write control lines are enabled during an
I/O transfer, and the memory-read and memory-write control lines are enabled during a memory
transfer. This configuration isolates all I/O interface addresses from the addresses assigned
to memory and is referred to as the isolated I/O or I/O-mapped I/O method of assigning
addresses on a common bus. This method isolates memory and I/O addresses so that each has its
own address space.
In order to execute a task in a computer, transfer of information among different devices is
necessary. When two units have to communicate with each other for data transfer, usually one
of them is the master and the other is the slave. A word of data may be transferred between
two units in one clock period, or sometimes in more than one clock period. There are two
types of data transfer, depending on the mechanism of timing the data: synchronous and
asynchronous.
Information transferred from the central computer (i.e. CPU and main memory) to a peripheral
device originates in the memory unit, and information received from a peripheral device is
usually stored in memory for later processing. The CPU merely executes the I/O instructions
and may accept the data temporarily; the ultimate source or destination is the memory unit.
Data transfer between the central computer and I/O devices may be handled in a variety of
modes. Three possible modes are:
1. Programmed I/O.
2. Interrupt-initiated I/O.
3. Direct memory access (DMA).
In programmed I/O, the CPU executes a program to communicate with an I/O device via the I/O
interface registers. This is a software method.
In interrupt-initiated I/O, an external I/O device requests the processor to transfer data by
activating a signal on the computer's interrupt line. In response, the computer executes a
program called the interrupt-service routine (ISR) to carry out the function desired by the
external I/O device. This is also a software method.
In direct memory access (DMA), data transfer between the computer's main memory and an
external I/O device occurs without CPU involvement. This is a hardware method.
Example 7.1 Example of Programmed I/O Transferring data from an I/O device to memory.
In the programmed I/O method, the I/O device does not have direct access to memory.
A transfer from an I/O device to memory requires the execution of several
instructions by the CPU, including an input instruction to transfer the data from
the device to the CPU and a store instruction to transfer the data from the CPU to
memory. Figure 7.10 shows an example of data transfer from an I/O device through
its interface into memory via the CPU; the handshaking procedure is followed here.
The device transfers bytes of data one at a time, as they become available. When a
byte is available, the device places it on the I/O bus and enables its data valid
line. The interface accepts the byte into its data register and enables its data
accepted line; a flag bit is then set in its status register by the interface. The
device now disables the data valid line, but it will not transfer another byte until
the data accepted line is disabled and the flag bit is reset by the interface.
Figure 7.10 Data transfer from I/O device to memory through CPU
An I/O routine is written for the computer to check the flag bit in the status
register to determine whether a byte has been placed in the data register by the
I/O device. This is done by reading the status register into a CPU register and
checking the value of the flag bit. When the flag is set to 1, the CPU reads the
data from the data register and then transfers it to memory with a store
instruction. The flag bit is then reset to 0 by either the CPU or the interface,
depending on the design of the interface circuits. When the flag bit is reset, the
interface disables the data accepted line and the device can then transfer the next
data byte. Thus the following four steps are executed by the CPU to transfer each
byte:
1. Read the status register of the interface unit.
2. Check the flag bit of the status register and go to step (3) if it is set;
otherwise loop back to step (1).
3. Read the data register of the interface unit for the data.
4. Send the data to memory by executing a store instruction.
The programmed I/O method is particularly useful in small low-speed computers or in
systems dedicated to monitoring a device continuously. However, the CPU is normally
far faster than an I/O device, and this mismatch in data transfer rates makes this
type of transfer inefficient.
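The four busy-wait steps map directly onto code. The following Python sketch is our own
illustration of the polling loop; the read_status and read_data helpers, and the flag-bit
position, are hypothetical stand-ins for the interface's status and data registers:

# Busy-wait (programmed I/O) transfer of one block, mirroring steps 1-4 above.
FLAG_BIT = 0x01                        # assumed position of the flag in the status register

def programmed_io_read(read_status, read_data, memory, base, count):
    for i in range(count):
        while not (read_status() & FLAG_BIT):   # steps 1-2: poll until flag = 1
            pass
        memory[base + i] = read_data()          # step 3: read the data register
                                                # step 4: store the byte in memory
    return memory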
The CPU responds to the interrupt signal by storing the return address from the program
counter (PC) into a memory stack or into a processor register, and control then branches to
an ISR that processes the required I/O transfer. The way the CPU obtains the branch address
of the ISR varies from one unit to another; in general, there are two methods. One is called
vectored interrupt and the other non-vectored. In a vectored interrupt, the interrupting
source supplies the branch information (the starting address of the ISR) to the CPU. This
information is called the interrupt vector, and it is not taken from a fixed memory location.
In a non-vectored interrupt, the branch address (the starting address of the ISR) is assigned
to a fixed location in memory.
In interrupt-initiated I/O, the device controller needs some additional intelligence for
checking the device status and raising an interrupt whenever a data transfer is required.
This results in extra hardware circuitry in the device controller.
Interrupt Hardware Interrupt-handling hardware implements interrupts. An interrupt request
(INTR) line connects the interrupt handler or controller hardware to the CPU, and each I/O
device that can raise an interrupt is connected to the controller. The interrupt controller
thus liaises with the CPU on behalf of the I/O devices. Typically, the controller is also
assigned an interrupt acknowledge (INTA) line, which the CPU uses to signal the controller
that it has received the interrupt request and has begun to process it by invoking an ISR.
Figure 7.12 shows the hardware lines used for implementing interrupts.
The interrupt controller uses a register called the interrupt-request mask register (IMR) to
detect interrupts from the I/O devices. Suppose there are n I/O devices in the system; then
the IMR is an n-bit register, each bit indicating the status of one I/O device. Let the IMR's
contents be denoted E0 E1 E2 … En–1. When E0 = 1, an interrupt from device 0 is recognized;
when E1 = 1, an interrupt from device 1 is recognized, and so on. The processor uses a flag
bit known as interrupt enable (IE) in its status register (SR) to process interrupts: when
this flag bit is 1, the CPU responds to the presence of an interrupt; otherwise it does not.
Enabling and Disabling Interrupts Sometimes a program does not want any interruption, and it
informs the CPU not to accept any interrupts. The CPU then ignores interrupts, and any
interrupts that arrive in the meantime remain pending until the program issues another
directive asking the CPU to allow interrupts again. Therefore, there has to be some feature
to enable or disable an interrupt. Disabling an interrupt is called masking. There are two
types of interrupts in an I/O organization:
• Non-maskable interrupts: The user program cannot disable them by any instruction. Common
examples are hardware errors and power-fail interrupts.
• Maskable interrupts: The user program can disable all or a few device interrupts by an
instruction.
A flag bit known as interrupt enable (IE) in the processor status register (SR) is used to
process interrupts. When this flag bit is 1, the CPU responds to interrupts; when it is 0,
the CPU ignores them. The program issues the enable interrupt (EI) instruction to set the IE
flag, and the CPU sets the flag while executing this instruction; similarly, the program
issues the disable interrupt (DI) instruction to reset the IE flag. Thus, with respect to
interrupt servicing, the CPU's behavior is controlled by the program currently being
executed. There are two special situations in which the CPU resets the IE flag on its own:
1. During the handling of non-maskable interrupts.
2. During interrupt servicing: the CPU resets the IE flag immediately after saving the return
address into memory or into a special register, and before branching to the ISR. Thus, when
the CPU starts executing the ISR, interrupts are disabled, and it is up to the ISR to allow
interrupts (by an EI instruction) if it wishes.
Interrupt Nesting If, during the execution of one ISR, another interrupt is allowed, this is
known as interrupt nesting. Suppose the CPU is initially executing program 'A' when the first
interrupt occurs. After storing the return address within program 'A', the CPU starts
executing ISR1. If a second interrupt occurs in the meantime, the CPU again stores the return
address, this time within ISR1, and starts executing ISR2. If a third interrupt occurs while
ISR2 is executing, the CPU stores the return address for ISR2 and starts executing ISR3.
After completing ISR3, the CPU resumes the remaining portion of ISR2; similarly, after
completing ISR2, it resumes the remaining portion of ISR1. After completing ISR1, the CPU
returns to program 'A' and continues from the location at which it branched earlier.
Priority Interrupt In a typical application, a number of I/O devices are attached to the
computer, each able to originate an interrupt request. The first task of the interrupt
controller is to identify the source of the interrupt. There is also the possibility that
several sources request interrupt service simultaneously; in this case the controller must
also decide which to service first. A priority interrupt is a system that establishes a
priority over the various sources to determine which condition is to be serviced first when
two or more requests arrive simultaneously. Devices with high-speed transfers, such as
magnetic disks, are usually given high priority, and slow devices such as keyboards receive
low priority. When two devices interrupt the CPU at the same time, the CPU services the
device with the higher priority first.
The interrupt requests from the various sources are connected as inputs to the interrupt
controller. As soon as the interrupt controller senses (using the IMR) the presence of one or
more interrupt requests, it immediately issues an interrupt signal to the CPU through the
INTR line. The interrupt controller assigns a fixed priority to the various
interrupt-requesting devices. For example, IRQ0 is assigned the highest priority among the
eight interrupt requestors; with priority decreasing from IRQ0 to IRQ7, IRQ7 has the lowest
priority and is serviced only when no other interrupt request is present.
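A fixed-priority scheme like the IRQ0–IRQ7 arrangement is easy to model. The sketch below is
our own illustration; it picks the highest-priority pending request from an 8-bit request
mask, with bit i standing for IRQi:

# Fixed-priority interrupt resolution: lower IRQ number = higher priority.
def highest_priority_irq(pending):
    """pending: 8-bit mask, bit i set when IRQi is requesting. Returns the
    IRQ number to service, or None when no request is pending."""
    for irq in range(8):              # scan from IRQ0 (highest) to IRQ7 (lowest)
        if pending & (1 << irq):
            return irq
    return None

assert highest_priority_irq(0b10010100) == 2   # IRQ2 wins over IRQ4 and IRQ7
assert highest_priority_irq(0) is None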
DMA Controller To communicate with the CPU and the I/O device, the DMA controller needs the
usual circuits of an interface. In addition, it needs an address register, a word-count
register, a status register and a set of address lines. The registers are selected by the
controller's register select (RS) lines. The address register and address lines are used for
direct communication with the memory: the address register stores the starting address of the
data block to be transferred. The word-count register contains the number of words to be
transferred; it is decremented by one after each word transfer and internally tested for zero
after each transfer. Under the control of the DMA controller, data can be transferred
directly between the device and memory. The status register contains information such as the
completion of the DMA transfer. All registers in the DMA controller appear to the CPU as I/O
interface registers; thus, the CPU can read from or write into the DMA registers under
program control via the data bus.
While executing the program for the I/O transfer, the CPU first initializes the DMA
controller. After that, the DMA controller starts and continues to transfer data between
memory and the peripheral unit until an entire block is transferred. The DMA controller is
initialized by the CPU by sending the following information through the data bus:
1. The starting address of the memory block where data are available (for read) or where
data are to be stored (for write).
2. The number of words in the memory block (word count) to be read or written.
3. A read or write control bit to specify the mode of transfer.
4. A control bit to start the DMA transfer.
DMA Transfer In a DMA transfer, I/O devices can directly access the main memory without
intervention by the processor. Figure 7.14 shows a typical DMA system. The sequence of events
involved in a DMA transfer between an I/O device and the main memory is discussed next.
When the I/O device is ready, it activates the DMA request line; the DMA controller then
asserts the HOLD signal to the CPU and, on receiving the CPU's hold acknowledgement, sends a
DMA ACK (acknowledgement) signal to the I/O device. The DMA controller takes control of the
memory buses from the CPU. Before releasing control of the buses to the controller, the CPU
initializes the address register with the starting memory address of the block of data, the
word-count register with the number of words to be transferred, and the operation type (read
or write). The I/O device can then communicate with memory through the data bus for direct
data transfer. For each word transferred, the DMA controller increments its address register
and decrements its word-count register. After each word transfer, the controller checks the
DMA request line: if this line is high, the next word of the block transfer is initiated, and
the process continues until the word-count register reaches zero (i.e., the entire block is
transferred). When the word-count register reaches zero, the DMA controller stops any further
transfer and removes its HOLD signal. It also informs the CPU of the termination by means of
an interrupt through the INT line. The CPU then regains control of the memory buses and
resumes the program which initiated the I/O operations.
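The controller's inner loop, incrementing the address, decrementing the count, and stopping
at zero, can be sketched as follows. This is our own illustration; the device is modelled as
a simple iterator of words:

# Sketch of a DMA block transfer into memory (the device-to-memory direction).
def dma_transfer(memory, start_address, words, device_words):
    address_register = start_address
    word_count = words
    for word in device_words:              # device supplies one word per DMA cycle
        memory[address_register] = word    # direct device-to-memory transfer
        address_register += 1              # controller increments its address register
        word_count -= 1                    # ... and decrements its word-count register
        if word_count == 0:                # block complete: drop HOLD, raise INT
            break
    return word_count                      # 0 on a completed block

ram = [0] * 16
assert dma_transfer(ram, 4, 3, iter([10, 20, 30])) == 0 and ram[4:7] == [10, 20, 30]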
Advantages of DMA DMA is a hardware method, whereas programmed I/O and interrupt-initiated
I/O are software methods of data transfer. The DMA mode has the following advantages:
1. High-speed data transfer is possible, since the CPU is not involved during the actual
transfer, which occurs between the I/O device and the main memory.
2. Parallelism can be achieved between CPU processing and the DMA controller's I/O
operation.
DMA Transfer Modes DMA transfers can be of two types: cycle stealing and block (burst)
transfer.
In cycle stealing, memory accesses by the CPU and the DMA controllers are interleaved.
Requests by DMA devices for the use of the memory buses are always given higher priority than
processor requests, and among different DMA devices, top priority is given to high-speed
peripherals such as a disk, a high-speed network interface or a graphics display device.
Since the CPU originates most memory access cycles, the DMA controller can be said to "steal"
memory cycles from the CPU; hence this interleaving technique is usually called cycle
stealing.
When the DMA controller is master of the memory buses, a block of memory words can be
transferred continuously without interruption. This mode of DMA transfer is known as block
(burst) transfer. It is needed for fast devices such as magnetic disks, where data
transmission cannot be stopped or slowed down until an entire block is transferred.
A conflict may arise if a number of DMA controllers, other controllers or processors try to
access the common bus at the same time, since access can be granted to only one of them: only
one processor or controller can be bus master at a time. The bus master is the controller
that has access to the bus at a given instant. To resolve these conflicts, a bus arbitration
procedure is implemented to coordinate the activities of all devices requesting memory
transfers. Bus arbitration refers to the process by which the current bus master relinquishes
control of the bus and passes it to another requesting processor unit. The selection of the
bus master must take into account the needs of the various devices by establishing a priority
system for gaining access to the bus. The bus arbiter decides who becomes the current bus
master. There are two approaches to bus arbitration:
1. Centralized bus arbitration: a single bus arbiter performs the required arbitration.
2. Distributed bus arbitration: all devices participate in the selection of the next bus
master.
Daisy Chaining Method All devices are effectively assigned static priorities according to
their locations along a bus grant control line (BGT): the device closest to the central bus
arbiter is assigned the highest priority. Requests for bus access are made on a common
request line, BRQ, and a common acknowledge line, SACK, is used to indicate that the bus is
in use; when no device is using the bus, SACK is inactive. The central bus arbiter propagates
a bus grant signal (BGT) if the BRQ line is high and the acknowledge line (SACK) indicates
that the bus is idle. The first device along the chain that has issued a bus request receives
the BGT signal and stops its further propagation. That device sets the bus-busy flag in the
bus arbiter by activating SACK and assumes bus control; on completion, it resets the bus-busy
flag in the arbiter, and a new BGT signal is generated if other requests are outstanding
(i.e., BRQ is still active). A device that has not requested the bus simply passes the BGT
signal to the next device in the line.
The main advantage of the daisy chaining method is its simplicity. Another advantage is
scalability: the user can add more devices anywhere along the chain, up to a certain maximum.
Polling or Rotating Priority Method In this method, the devices are assigned unique
priorities and compete for access to the bus, but the priorities are changed dynamically to
give every device an opportunity to access the bus. This dynamic priority algorithm
generalizes the daisy chain implementation of static priorities discussed above. Recall that
in the daisy chain scheme all devices are given static and unique priorities according to
their positions on a bus-grant line (BGT) emanating from a central bus arbiter. In the
polling scheme, however, no central bus arbiter exists, and the bus-grant line (BGT) is
connected from the last device back to the first in a closed loop (Fig. 7.16). Whichever
device is granted access to the bus serves as bus arbiter for the following arbitration (an
arbitrary device is selected to have initial access to the bus). Each device's priority for a
given arbitration is determined by that device's distance along the bus-grant line from the
device currently serving as bus arbiter; that device itself has the lowest priority. Hence,
the priorities change dynamically with each bus cycle.
The main advantage of this method is that it does not favor any particular device or
processor. The method is also quite simple.
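The rotating-priority rule can be captured in a few lines. The following Python sketch is our
own illustration (device ids and the set-based request model are assumptions): the winner is
the requesting device closest round the loop after the current bus master, so the master
itself ends up with the lowest priority for the next arbitration.

# Rotating-priority (polling) arbitration.
def next_bus_master(requests, current_master, n_devices):
    """requests: set of device ids wanting the bus. Returns the winner,
    or None when no requests are outstanding."""
    for distance in range(1, n_devices + 1):
        candidate = (current_master + distance) % n_devices
        if candidate in requests:
            return candidate
    return None

# Device 2 is master; devices 0 and 3 request. Device 3 is closer round the
# loop than device 0, so it wins the next arbitration.
assert next_bus_master({0, 3}, current_master=2, n_devices=4) == 3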
Fixed Priority or Independent Request Method In the independent request method, bus control
passes from one device to another only through the centralized bus arbiter. Figure 7.17 shows
the independent request method. Each device has a dedicated BRQ output line and a dedicated
BGT input line; if there are m devices, the bus arbiter has m BRQ inputs and m BGT outputs.
The arbiter follows a priority order, assigning a different priority level to each device. At
any given time, the arbiter issues a bus grant (BGT) to the highest-priority device among
those that have issued bus requests. This scheme needs more hardware but generates a fast
response.

Figure 7.17 Fixed priority bus arbitration method
The DMA mode of data transfer reduces the CPU's overhead in handling I/O operations. It also
allows parallelism between CPU and I/O operations; such parallelism is necessary to avoid
wasting valuable CPU time while handling I/O devices, which are much slower than the CPU. The
concept of DMA operation can be extended to relieve the CPU further from involvement in the
execution of I/O operations. This gives rise to a special-purpose processor called the I/O
processor, or IOP (also known as an I/O channel).
The IOP is like a CPU that handles the details of I/O operations. It is equipped with more
facilities than are available in a typical DMA controller. The IOP can fetch and execute its
own instructions, which are specifically designed for I/O transfers. In addition to
I/O-related tasks, it can perform other processing tasks such as arithmetic, logic, branching
and code translation. The block diagram of an IOP is shown in Fig. 7.18. The main memory unit
plays the pivotal role: it communicates with the processors by means of DMA.
REVIEW QUESTIONS
Group A
1. Choose the most appropriate option for the following questions:
(i) An I/O interface is
(a) a hardware unit used for transferring data with the central computer (CPU and memory) on behalf of a peripheral device
(b) software used for transferring data with the central computer (CPU and memory) on behalf of a peripheral device
(c) firmware used for transferring data with the central computer (CPU and memory) on behalf of a peripheral device
(d) none.
(ii) An I/O command
(a) is generated by CPU to communicate with a particular peripheral
(b) is generated by a peripheral which wants to communicate with CPU or memory
(c) initiates an I/O transfer
(d) is provided through data bus.
(iii) An I/O driver is
(a) a hardware unit used for transferring data with the central computer (CPU and memory) on behalf of a peripheral device
(b) software module that issues different commands to the I/O interface
(c) common program for all peripherals attached with the computer
(d) modifiable by the general users.
(iv) The objective of the IOP (I/O processor) is to use
(a) common buses (address, data and control) for the transfer of information between memory
and I/O devices
(b) common address and data buses, but separate control lines
(c) a separate bus
(d) none.
(v) The main advantage of memory mapped I/O technique is
(a) no extra instructions are required
(b) no extra control signals are required
(c) the operations are done directly at the I/O register address
(d) DMA operations are fast.
(vi) Synchronous mode of data transfer
(a) is suitable for high speed devices
(b) occurs between two communicating devices, where one device is master and another is
slave
(c) is suitable for slow speed devices
(d) both (b) and (c).
(vii) Asynchronous data transfer
(a) can be initiated by source or destination device
(b) is initiated by source device
(c) is initiated by destination device
(d) is controlled by clock and can be initiated by source or destination device.
(viii) Asynchronous mode of data transfer
(a) is suitable for high speed devices
(b) occurs between two communicating devices, where one device is master and another is
slave
(c) is suitable for slow speed devices
(d) both (b) and (c).
(ix) When bulk data transfer is needed
(a) programmed I/O technique is used
(b) interrupt-initiated I/O technique is used
(c) DMA mode is used
(d) IOP is used.
(x) During data transfer in programmed I/O method
(a) the CPU can be busy on its own task of instruction execution
(b) the CPU is totally idle
(c) the CPU monitors the interface status
(d) none.
(xi) When a device interrupts, the CPU finds the service routine address for processing from the
(a) interrupt vector start address
(b) interrupt vector location defined as per the device address
(c) program already under execution
(d) device control register.
(xii) DMA operations need
(a) switching logic between the I/O and system bus
(b) I/O bus
(c) special control signals to CPU such as hold and hold acknowledge
(d) no CPU control signals.
21. When a DMA controller takes control of the bus, what does the CPU do while the DMA controller retains bus control?
22. What is bus arbitration? What are different methods of bus arbitration? Explain them.
23. What is cycle stealing and how does it enhance the performance of the system?
24. In virtually all computers having DMA modules, DMA access to main memory is given higher
priority than CPU access to main memory. Why?
25. Write short note on: input-output processor (IOP).
SOLVED PROBLEMS
interrupt. The highest priority device is polled first and if it is found that its interrupt signal is on, the
CPU branches to the device’s own ISR for I/O transfer between the device and CPU. Otherwise it
moves to poll the next highest priority device.
4. Differentiate between polled I/O and interrupt driven I/O.
Answer
(a) In the polled I/O or programmed I/O method, the CPU stays in the program until the I/O device indicates that it is ready for data transfer, so the CPU is kept busy needlessly. But in the interrupt-driven I/O method, the CPU can perform its own task of instruction execution and is informed by an interrupt signal when data transfer is needed.
(b) Polled I/O is a low-cost and simple technique, whereas interrupt-driven I/O is a relatively costly and complex technique, because in the second method a device controller is used to continuously monitor the device status and raise an interrupt to the CPU as soon as the device is ready for data transfer.
(c) The polled I/O method is particularly useful in small low-speed computers or in systems that are dedicated to monitoring a device continuously. However, the interrupt-driven I/O method is widely used in modern high-speed computers.
5. Discuss the advantage of interrupt-initiated I/O over programmed I/O.
Answer
In the programmed I/O method, the program constantly monitors the device status. Thus, the CPU
stays in the program until the I/O device indicates that it is ready for data transfer. This is a time-consuming process since it keeps the CPU busy needlessly. It can be avoided by letting the device
controller continuously monitor the device status and raise an interrupt to the CPU as soon as the
device is ready for data transfer. Upon detecting the external interrupt signal, the CPU momentarily
stops the task it is processing, branches to an interrupt-service-routine (ISR) or I/O routine or
interrupt handler to process the I/O transfer, and then after completion of I/O transfer, returns to the
task it was originally performing. Thus, in the interrupt-initiated mode, the ISR software (i.e. the CPU) performs the data transfer but is not involved in checking whether the device is ready for data transfer or not. Therefore, the CPU's execution time can be used efficiently by employing it to execute normal programs when no data transfer is required.
6. What are the different types of interrupt? Give examples.
Answer
There are basically three types of interrupts: external, internal or trap and software interrupts.
External interrupt: These are initiated through the processors’ interrupt pins by external devices.
Examples include interrupts by input-output devices and console switches. External interrupts can be
divided into two types: maskable and non-maskable.
Maskable interrupts: The user program can enable or disable all or a few device interrupts by
executing instructions EI or DI.
Non-maskable interrupts: The user program cannot disable it by any instruction. Some common
examples are: hardware error and power fail interrupt. This type of interrupt has higher priority than
maskable interrupts.
Internal interrupt: This type of interrupt is activated internally by exceptional conditions. The interrupts caused by overflow, division by zero and execution of an illegal op-code are common examples of this category.
Software interrupts: A software interrupt is initiated by executing an instruction like INT n in a program, where n refers to the starting address of a procedure in the program. This type of interrupt is used to call the operating system. The software interrupt instructions allow switching from user mode to supervisor mode.
7. What are the differences between vectored and non-vectored interrupt?
Answer
In a vectored interrupt, the interrupting source supplies the branch information (starting address of the ISR) to the CPU. This information is called the interrupt vector, which is not stored at any fixed memory location. The processor can identify individual devices even if they share a single interrupt-request line, so the set-up time is very small.
In a non-vectored interrupt, the branch address (starting address of the ISR) is assigned to a fixed location in memory. Since the identities of the requesting devices are not known initially, the set-up time is quite large.
8. “Interrupt request is serviced at the end of current instruction cycle while DMA request is
serviced almost as soon as it is received, even before completion of current instruction execu-
tion.” Explain.
Answer
In interrupt-initiated I/O, an interrupt request is serviced at the end of the current instruction cycle, because the processor itself takes part in the I/O transfer for which it was interrupted; the processor will be busy with the data transfer after this instruction.
But in DMA transfer, the processor is not involved during the data transfer; it only initiates the transfer. The whole data transfer is supervised by the DMA controller, and during that time the processor is free to do its own task of instruction execution.
9. Give the main reason why DMA based I/O is better in some circumstances than interrupt
driven I/O?
Answer
To transfer large blocks of data at high speed, the DMA method is used. A special DMA controller is provided to allow transfer of a block of data directly between a high-speed external device, like a magnetic disk, and the main memory, without continuous intervention by the CPU. The data transmission cannot be stopped or slowed down until an entire block is transferred. This mode of DMA transfer is known as burst transfer.
10. What are the different types of DMA controllers and how do they differ in their functioning?
Answer
DMA controllers are of two types:
– Independent DMA controller
– DMA controller having multiple DMA-channels
8
Parallel Processing
8.1 INTRODUCTION
Parallel processing has emerged as a key enabling technology in modern computers, driven by the
ever-increasing demand for high performance, lower costs and sustained productivity in real-life
applications. The term parallel processing means improving the performance of a computer system
by carrying out several tasks simultaneously. Advanced computer architectures are centered on the
concept of parallel processing. In general, parallel computers can be characterized into three structural
classes: pipelined computers, array processors and multiprocessor systems.
In this chapter, we place more emphasis on these three architectures, along with some related architectures like vector processors.
The ideal performance of a computer system demands a perfect match between hardware capability
and program behavior. Hardware capability can be enhanced with better hardware technology, inno-
vative architectural features and efficient resources management. However, program behavior is
difficult to predict due to its heavy dependence on application and run-time conditions. There are also many other factors affecting program behavior, like algorithm design, data structures, language efficiency, programmer skill and compiler technology. It is impossible to achieve a perfect match between hardware and software by improving only a few of these factors while ignoring the others.
Performance is defined as units of things done per second. A computer designer's goal is always to design a reasonably high-performance computer within the given constraints. If we are primarily concerned
with execution time of a program, then the performance of a computer is defined by:
Performance (X) = 1/(execution time(X))
In other words, "X is n times faster than Y" means
n = Performance(X)/Performance(Y) = Execution time(Y)/Execution time(X).
The clock cycle time (CCT) is the amount of time for one clock period to elapse (e.g. 5 ns). The clock rate is the inverse of the clock cycle time (f = 1/CCT). For example, if a computer has a clock cycle time of 5 ns, the clock rate is
f = 1/(5 × 10^-9) Hz = 200 MHz.
The size of a program is determined by its instruction count (IC), in terms of number of machine
instructions to be executed in the program. Different machine instructions may require different
numbers of clock cycles (periods) to execute. Therefore, the cycles per instruction (CPI) becomes an
important parameter for measuring the time needed to execute each instruction.
For a given instruction set, we can calculate an average CPI over all instruction types, provided we
know their frequencies of appearance in the program. An accurate estimate of the average CPI
requires a large amount of program code to be traced over a long period of time. Unless specifically
focusing on a single instruction type, we simply use the term CPI to mean the average value with
respect to a given instruction set and a given program mix.
CPU Performance The two main measures of performance are:
• Execution time (CPU time): the time needed to execute a program.
• Throughput: the number of programs completed per unit time.
CPU time The time to execute a given program can be computed as:
CPU time = Instruction count (IC) × cycles per instruction (CPI) × clock cycle time (CCT).
Thus, CPU time = Instruction count × CPI/clock rate.
The basic goal of architecture is to reduce CPU time for a program. These factors (IC, CPI, and
CCT) are affected by compiler technology, the instruction set architecture, the machine organization,
and the underlying technology. When trying to improve performance, look at what occurs frequently
and thus make the common case fast. The above CPU time formula can be used as a basis for estimating the execution rate of a processor.
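As a quick check of this relation, the short C sketch below (the function name and sample values are illustrative, not taken from the text) computes CPU time from IC, CPI and CCT:

#include <stdio.h>

/* CPU time = IC x CPI x CCT; names and sample values are illustrative. */
double cpu_time(double ic, double cpi, double cct_seconds)
{
    return ic * cpi * cct_seconds;
}

int main(void)
{
    /* Assumed values: 3 million instructions, CPI = 2, CCT = 5 ns (200 MHz). */
    printf("CPU time = %g s\n", cpu_time(3e6, 2.0, 5e-9));   /* 0.03 s */
    return 0;
}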
Other Parameters The raw power of a processor can be indicated in terms of the number of
operations it can perform in one second. Popular marketing metrics for computer performance are MIPS (Millions of Instructions Per Second) and MFLOPS (Millions of Floating-point Operations Per Second). Generally, either of these two is used as a performance index for a computer system, since the two are closely related measurements. It should be emphasized that the MIPS and MFLOPS rates vary with respect to a number of factors, including the clock rate, the instruction count (IC) and the CPI of a given machine.
MIPS rating is defined as:
MIPS = instructions per second (IPS) × 10^-6
     = instruction count (IC)/(CPU time × 10^6)
     = clock rate/(CPI × 10^6).
For example, a program that executes 3 million instructions in 2 seconds has a MIPS rating of 1.5.
Advantage of MIPS rating: Easy to understand and measure.
Disadvantage of MIPS rating: May not reflect actual performance, since code made up of simple instructions scores better.
Since a floating-point operation is considered one of the most complex tasks a processor performs, the MFLOPS rating is sometimes used as a performance index. It is defined as:
MFLOPS = No. of floating-point operations/(CPU time × 10^6).
For example, a program that executes 4 million floating-point operations in 5 seconds has an MFLOPS rating of 0.8.
Advantage of MFLOPS: Easy to understand and measure.
Disadvantage of MFLOPS: Same as MIPS. Moreover, it only measures floating point operations.
When a manufacturer gives the MIPS and MFLOPS rating numbers for a given processor, these
numbers indicate the processor’s maximum capability. This maximum is not always achievable in
practical applications.
Numerical Examples
Problem 8.1 In a simple machine with load-store architecture having clock rate 50 MHz, let the
instruction frequency be as follows for a program:
Problem 8.2 Computers M1 and M2 are two implementations of the same instruction set. M1 has
a clock rate of 50 MHz and M2 has a clock rate of 75 MHz. M1 has a CPI of 2.8 and
M2 has a CPI of 3.2 for a given program.
How many times faster is M2 than M1 for this program?
Solution
Ex Time_M1 / Ex Time_M2 = (IC_M1 × CPI_M1/Clock Rate_M1) / (IC_M2 × CPI_M2/Clock Rate_M2)
Since both machines execute the same program, IC_M1 = IC_M2 and the instruction counts cancel:
Ex Time_M1 / Ex Time_M2 = (2.8/50) / (3.2/75) = 1.31.
Therefore, M2 is 1.31 times faster than M1 for the same program.
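The same result can be verified numerically; in the C sketch below (illustrative only), the instruction count cancels because both machines run the same program, so only CPI/clock-rate matters:

#include <stdio.h>

int main(void)
{
    double cpi_m1 = 2.8, rate_m1 = 50e6;   /* M1: CPI 2.8 at 50 MHz */
    double cpi_m2 = 3.2, rate_m2 = 75e6;   /* M2: CPI 3.2 at 75 MHz */
    /* Time per instruction = CPI / clock rate; the ratio gives the speedup. */
    double speedup = (cpi_m1 / rate_m1) / (cpi_m2 / rate_m2);
    printf("M2 is %.2f times faster than M1\n", speedup);   /* prints 1.31 */
    return 0;
}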
Parallel processing is an efficient form of information processing which emphasizes the exploitation
of concurrent events in the computing process. Concurrency implies parallelism, simultaneity and
pipelining.
• Parallel events may occur in multiple resources during the same time interval.
• Simultaneous events may occur at the same time instant.
• Pipelined events may occur in overlapped time spans.
Parallel computers are those systems that emphasize parallel processing. Parallel computers are
divided into three architectural configurations:
• Pipeline computers
• Array processors
• Multiprocessor systems
Pipeline computers perform overlapped computations to exploit temporal parallelism. The pipelining
concept is discussed in Section 8.5. Array processors use multiple synchronized ALUs to achieve
spatial parallelism. Array processor system is discussed in Section 8.7. Multiprocessor systems achieve
asynchronous parallelism through a set of interactive processors with shared resources (memories,
database, etc.) and this system is described in Section 8.8. These three parallel approaches are not
mutually exclusive.
Apart from these systems, a new type of computer called the data flow computer is emerging. Data flow
computers are based on the concept of data-driven computation, which is drastically different from
the operation of a conventional von Neumann machine. The fundamental difference is that instruction
execution in a conventional computer is under program-flow control, whereas that in a data flow
computer is driven by the data (operand) availability.
Over the last two decades, parallel processing has drawn the attention of many researchers and several
high-speed architectures have been proposed. To present these results in a concise way, the different architectures must be classified in well-defined schemes.
Three computer architectural classification schemes are available:
1. Flynn’s classification (1966) based on the multiplicity of instruction streams and data streams in
a computer system.
2. Feng's classification (1972) based on the number of bits processed in unit time (serial versus
parallel processing).
3. Handler’s classification (1977) based on the degree of parallelism found in CPU, ALU and bit
levels.
Flynn's classification of computers is presented in the next section. Discussion of the other two classification schemes is beyond the scope of this book.
SISD computer Most serial computers available today fall in this organization as shown in
Fig. 8.1. Instructions are executed sequentially but may be overlapped in their execution stages (In
other words the technique of pipelining can be used in the CPU). Modern day SISD uniprocessor
systems are mostly pipelined. An SISD computer may have more than one functional unit in it, but all
are under the supervision of one control unit. This type of machine can process only scalar-type instructions.
SIMD computer Array processors fall into this class. As illustrated in Fig. 8.2, there are multiple
processing elements supervised by the common control unit. All PEs (processing elements, which are
essentially ALUs) receive the same instruction broadcast from the control unit but operate on differ-
ent data sets from distinct data streams. A shared memory subsystem containing multiple modules is essential. This machine is generally used to process vector-type data.
MISD computer Very few or no parallel computers fit in this organization, which is conceptually
illustrated in Fig. 8.3. There are n processor elements, each receiving distinct instructions to execute
on the same data stream and its derivatives. The results (outputs) of one processor element become
the inputs (operands) of the next processor element in the series. This architecture is also known as
systolic arrays. This structure has received much less attention, though some fault-tolerant machines may be placed in this class. Thus, in general, no practical machine of this class exists.
MIMD Computer This category covers multiprocessor systems and multiple computer systems
(Fig. 8.4). Details of these machines are discussed in Section 8.8. An MIMD computer is called
tightly coupled (or Uniform Memory Access (UMA)) if the degree of interactions among the proces-
sors is high. Otherwise, we consider them loosely coupled (or Non-Uniform Memory Access (NUMA)).
Most commercial MIMD computers are loosely coupled.
Listed below are several system models under each of the three existing computer organizations
under Flynn’s classification.
8.5 PIPELINING
The term "pipelining" is borrowed from the traditional production floor, where a product is assembled in a pipeline of multiple stages at different stations. To get the basic idea of pipelining, let us consider
a real-life example: “Say 100 students appear for an examination on a subject. There are 5 questions,
all are to be attempted. After the examination, all 100 scripts are to be evaluated by examiner(s) for grading. Further, we assume that each question takes 5 minutes to evaluate, for simplicity."
For evaluation of all 100 scripts, we may employ two approaches:
Approach-1: Employing one examiner for all scripts.
In this case, the total evaluation time = 100 × 5 × 5 = 2500 minutes.
Approach-2: Employing five examiners, assuming each examiner is specialized for a single question
evaluation.
The assumption is that the ith examiner (Ei) is specialized in the ith question (Qi) evaluation for all scripts.
All examiners are sitting adjacently and all 100 scripts are stacked in front of 1st examiner who is
specialized for 1st question evaluation for all scripts. After 1st question evaluation, he/she passes the
script to the adjacent 2nd examiner for 2nd question evaluation of that script. During that time, 1st
examiner is free to take the 2nd script from the stack for its 1st question evaluation. Therefore, after the first 10 minutes, the 1st examiner takes script number 3 for its 1st question evaluation, the 2nd examiner takes script number 2 for its 2nd question and the 3rd examiner takes script number 1 for its 3rd question evaluation, during the next 5 minutes. This process continues. Thus, after the first 5 × 5 = 25 minutes, the 1st script gets evaluated completely, and after another 5 minutes (i.e. after the first 30 minutes) the 2nd script gets evaluated completely, and so on. Thus, after the first 25 minutes, one script gets completely evaluated every 5 minutes. This is shown in Fig. 8.5. All 100 scripts are evaluated in this manner. Total evaluation time = 25 + 99 × 5 = 520 minutes. This suggests that employing specialized examiners takes much less time compared to employing a single examiner (i.e. approach 1).
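The example follows the general pipeline timing relation: with k stages of delay t each and n tasks, the pipelined time is (k + n – 1)t, against nkt for the purely sequential case. The following C sketch (values taken from the example above; names are illustrative) reproduces the figures:

#include <stdio.h>

int main(void)
{
    int k = 5, n = 100, t = 5;            /* 5 stages, 100 scripts, 5 minutes per stage */
    int sequential = n * k * t;           /* one examiner: 2500 minutes */
    int pipelined  = (k + n - 1) * t;     /* five examiners: 520 minutes */
    printf("sequential = %d min, pipelined = %d min, speedup = %.2f\n",
           sequential, pipelined, (double)sequential / pipelined);
    return 0;
}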
stage is transferred to the next stage in the pipeline. The final result is obtained after the instruction
has passed through all the stages. All stages are synchronized by a common clock. Stages are pure
combinational circuits performing arithmetic or logic operations over the data stream flowing through
the pipe. The stages are separated by high-speed interface latches (i.e. collection of registers). Figure
8.6 shows the pipeline concept with k stages.
Figure 8.6 actually shows a linear pipeline that can process a sequence of subtasks with a linear
precedence relation. A linear precedence relation of a set of subtasks {T1, T2, … , Tk} means that a
task Tj cannot start until all earlier subtasks Ti (for all i<j) finish. Similarly, in a linear pipeline with
set of stages {S1, S2, … , Sk} a stage Sj cannot start until all earlier stages Si (for all i<j) finish for a
single task.
Overlapped operations of pipeline processors are represented by a two-dimensional chart called
space-time diagram as shown in Fig. 8.7. Assume that the pipeline uses 4 stages.
Clock-period Ideally, we expect the execution delay in all stages to be the same. But, due to differences in the hardware circuitry of the different stages of a pipeline, the execution delays cannot be the same. Shorter-latency stages finish early while longer-latency stages are still continuing with their specified tasks. Since a single clock circuit is used for synchronization in the pipeline, we have to define the clock period uniformly, so that there is no overwriting or partial execution of the results produced in any stage.
The logic circuitry in each stage Si has a time delay denoted by ti. Let tl be the time delay of each interface latch. The clock period of a linear pipeline is defined by
t = max{ti, 1 ≤ i ≤ k} + tl = tm + tl.
The reciprocal of the clock period is the frequency, f = 1/t, of a pipeline processor.
Problem 8.3 Suppose the time delays of the four stages of a pipeline are t1 = 60 ns, t2 = 70 ns, t3
= 90 ns and t4 = 80 ns respectively and the interface latch has a delay tl = 10 ns,
then
(i) What would be the maximum clock frequency of the above pipeline?
(ii) What is the maximum speed-up of this pipeline over its equivalent non-pipeline
counterpart?
Solution The clock period should be at least t = 90 + 10 = 100 ns.
So, the clock frequency f = 1/t = 1/(100 ns) = 10 MHz.
In case of non-pipeline, the time delay = t1 + t2 + t3 + t4 = 60 + 70 + 90 + 80 = 300 ns.
So, the speed-up = 300/100 = 3. This means that the pipeline processor is 3 times
faster than its equivalent non-pipeline processor.
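The same computation can be written out in C; the sketch below (variable names are illustrative) derives the clock period, frequency and speed-up of Problem 8.3 from the stage and latch delays:

#include <stdio.h>

int main(void)
{
    double stage[] = {60, 70, 90, 80};    /* stage delays t1..t4 in ns */
    double latch = 10, sum = 0, max = 0;  /* latch delay tl in ns */
    for (int i = 0; i < 4; i++) {
        sum += stage[i];
        if (stage[i] > max) max = stage[i];
    }
    double clock = max + latch;           /* t = max{ti} + tl = 100 ns */
    printf("clock period = %g ns, frequency = %g MHz, speedup = %g\n",
           clock, 1000.0 / clock, sum / clock);   /* 100 ns, 10 MHz, 3 */
    return 0;
}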
Arithmetic pipeline An arithmetic pipeline divides an arithmetic operation, such as a multiply, into
multiple arithmetic steps each of which is executed one-by-one in different arithmetic stages in the
ALU. Examples include the 4-stage pipeline used in the Star-100 and the 8-stage pipeline used in the TI-ASC.
Instruction pipeline The execution of a stream of instructions can be pipelined by overlapping the
execution of the current instruction with the fetch, decode and operand fetch of subsequent instruc-
tions. All high-performance computers are now equipped with this pipeline.
Processor pipeline Pipeline processing of the same data stream
by a cascade of processors, each of which processes a specific
task. No practical example of this type is found.
Figure 8.9 Reservation tables for two functions of the pipeline in Fig. 8.8
Reservation Table This is a 2-dimensional chart to show how successive pipeline stages are
utilized for a specific function evaluation in successive cycles. The two reservation tables shown in Fig. 8.9 correspond to the two functions (A and B) of the sample pipeline in Fig. 8.8. A reservation table represents the flow of data through the pipeline for one complete evaluation of a given function.
A marked entry (A/B) in the (i,j)th square of the table indicates that stage Si will be used j time units
after the initiation of the function evaluation.
Evaluation Time The total number of clock units in the reservation table is the evaluation time for the given function.
(b) Reservation table for the floating-point adder pipeline shown in Fig. 8.10(a): each stage S1–S4 is marked with an X in one of the successive time units t1–t5
Figure 8.10
Multiplier Pipeline for two 6-bit integers The multiplication of two integers is done by repeated add-shift operations, using an ALU which has built-in add and shift operations. As we have seen in Section 3.5.1, multiple-number addition can be realized with a multilevel CSA-adder tree. A CSA (Carry Save Adder) is a 3-to-2 converter and a CPA (Carry Propagate Adder) is a 2-to-1 converter. Instead of a CPA, the high-speed 2-to-1 CLA (Carry Look-ahead Adder) can be used. Both kinds of adders (CSA and CPA/CLA) are used here at different stages. The procedure for multiplying two 6-bit numbers is shown in Fig. 8.11.
The additions of the partial products W1, W2, …, W6, which are generated using bit-wise AND logic
operations, can be done using CSA-tree as shown in Fig. 8.12(a) to realize the multiplier pipeline for
6-bit numbers.
The reservation table for this multiplier pipeline is shown in Fig. 8.12(b).
Multiplier Pipeline for higher-bit Integers The CSA-tree of the pipeline can be reused over multiple passes by adding feedback connections. Figure 8.13(a) shows the concept, as an example, for multiplying two 32-bit numbers, where two input ports of the CSA-tree are now connected to the feedback carry vector and sum vector. This pipeline is used to add four shifted partial products per iteration. Only 8 iterations would be needed in the CSA-tree with feedback to multiply two 32-bit numbers.
As seen in its reservation table in Fig. 8.13(b), the total evaluation time of this function equals 26 clock periods, out of which 24 cycles are needed in the recursive CSA-tree. This recursive approach saves significant hardware (CSAs) in comparison to the single-pass approach: 30 CSAs in 8 pipeline stages would be required in a one-pass 32-input CSA-tree pipeline, an increase of 26 additional CSAs, each 32 bits wide. However, the recursive approach requires more time: 16 (= 26 – 10) more clock periods are needed in the recursive case, where 10 = 1 + 8 + 1 corresponds to one cycle for the input stage, eight cycles for the one-pass CSA-tree and one cycle for the CPA/CLA stage.
(b) Reservation table for the 6-bit multiplier pipeline shown in Fig. 8.12(a): each stage S1–S5 is marked with an X in one of the successive time units t1–t5
Figure 8.12
Figure 8.13 A pipelined multiplier for 32-bit numbers using a recursive CSA tree
may require one or more clock periods to execute, depending on the instruction type, processor and
memory architectures used. A typical instruction pipeline is shown in Fig. 8.14. The instruction fetch
stage (IF) fetches instructions from memory, presumably one per cycle. The instruction-decode stage (ID) resolves the function to be performed (add, subtract, etc.) and identifies the
operands needed. The operand fetch stage (OF) fetches the operand values needed for execution into
processor registers. The execute stage (EX) executes the instructions on the stored operand values.
The last write-back stage (WB) is used to write results into registers or memory.
Solution of structural hazards Certain resources are duplicated in order to avoid structural hazards.
Functional units (ALU, FP unit) can be pipelined themselves in order to support several instructions
at a time. A classical way to avoid hazards at memory access is by providing separate data and
instruction caches.
Data Hazards When successive instructions overlap their fetch, decode and execution through a pipeline processor, inter-instruction dependencies may arise that prevent the sequential (in-order) data flow in the pipeline. This situation due to inter-instruction dependencies is called a data hazard.
Example: We have two instructions, I1 and I2. In a pipeline, the execution of I2 can start before I1 has terminated. If, in a certain stage of the pipeline, I2 needs the result produced by I1, but this result has not yet been generated, we have a data hazard.
According to the various data update patterns in an instruction pipeline, there exist three classes of data hazards:
• Write After Read (WAR) hazards
• Read After Write (RAW) hazards
• Write After Write (WAW) hazards
Note that read-after-read is not a hazard, because nothing is changed by a read operation. To discuss the above three hazards, we define the following:
Resource object It refers to working registers, memory locations and special flags.
Data object The contents of resource objects are called data objects. Each instruction can be consid-
ered as a mapping from a set of data objects to a set of data objects.
Domain of instruction The domain D(I) of an instruction I is the set of resource objects whose data
objects may affect the execution of instruction I.
Range of instruction The range R(I) of an instruction I is the set of resource objects whose data
objects may be modified by the execution of instruction I. Obviously, the domain of an instruction
holds operands to be read (retrieved) in the instruction execution and its range set will hold the
produced results.
Now, consider two instructions I and J in a program, where I occurs before J. Between instructions
I and J, there may be none or other instructions. Meanings of three hazards are as follows (see
Fig. 8.18):
• RAW (read after write)—J tries to read some data object before I writes it, so J gets the old value of the data object.
• WAW (write after write)—J tries to write (modify) some data object before it is written by I.
• WAR (write after read)—J tries to write some data object before it is read by I, so I incorrectly gets the new value.
Formally, the necessary conditions (but not sufficient conditions) for these hazards are:
— R(I) ∩ D(J) ≠ ∅ for RAW
— R(I) ∩ R(J) ≠ ∅ for WAW
— D(I) ∩ R(J) ≠ ∅ for WAR
Detection of data hazards Hazard detection method can be implemented in the instruction-fetch
(IF) stage of a pipeline processor by comparing the domain and the range of the incoming instruction
with those of the instructions being processed in the pipeline. If any of the earlier necessary condi-
tions is detected, a warning signal must be generated to prevent the hazard from taking place.
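These intersection tests are easy to mechanize. In the following C sketch (a toy model with assumed names, not an actual pipeline controller), each domain and range is a bitmask of register numbers and each hazard condition is a non-empty intersection:

#include <stdio.h>

/* Resource objects (registers) encoded as bits of a mask; D = domain
   (objects read), R = range (objects written). Names are illustrative. */
typedef unsigned int objset;

int raw(objset r_i, objset d_j) { return (r_i & d_j) != 0; }  /* R(I) meets D(J) */
int waw(objset r_i, objset r_j) { return (r_i & r_j) != 0; }  /* R(I) meets R(J) */
int war(objset d_i, objset r_j) { return (d_i & r_j) != 0; }  /* D(I) meets R(J) */

int main(void)
{
    /* I: R3 <- R1 + R2   (domain {R1,R2}, range {R3})
       J: R4 <- R3 + R1   (domain {R3,R1}, range {R4}) */
    objset D_I = 1u<<1 | 1u<<2, R_I = 1u<<3;
    objset D_J = 1u<<3 | 1u<<1, R_J = 1u<<4;
    printf("RAW=%d WAW=%d WAR=%d\n",
           raw(R_I, D_J), waw(R_I, R_J), war(D_I, R_J));   /* prints 1 0 0 */
    return 0;
}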
Solution of data hazards The system must resolve the interlock situation when a hazard is detected.
Consider the sequence of instructions {…, I, I+1, …, J, J+1, …} in which a hazard has been detected
between the current instruction J and a previous instruction I. This hazardous situation can be re-
solved in one of the two following ways:
• One simple solution is to stall the pipeline: the execution of instructions J, J+1, …, down the pipeline is held up until instruction I has passed the point of resource conflict.
• A more advanced approach is to suspend only instruction J and continue the flow of instructions J+1, J+2, …, down the pipeline. However, the potential hazards due to the suspension of J must be continuously tested as instructions J+1, J+2, … execute prior to J. Thus, multiple levels of hazard detection may be encountered, which requires much more complex control policies to resolve.
Bus Structures Since there are different combinational circuits in different stages, the processing speeds of pipeline stages are usually unequal. Consider the pipeline in Fig. 8.19, with three stages having delays T1, T2 and T3, respectively. If T1 = T3 = T and T2 = 2T, then stage S2 reduces the throughput rate, because we get one output in each 2T time after the first 4T time. The stage S2 causes unnecessary congestion.
Figure 8.19 An example pipeline with S2 taking double the time of S1 or S3
One simple method is to subdivide the slow stage S2 serially. Figure 8.20 shows the subdivision of
stage S2. In this case, after 4T time, we get one output in each T time. Thus, the throughput is
increased.
Another method is to use duplicates of the slow stage S2 in parallel to smooth the congestion, as shown in Fig. 8.21.
In this case also, the throughput is increased. However, the control and synchronization of tasks in parallel stages are much more complex than those for serial stages.
Superpipelining Superpipelining is based on dividing the stages of a pipeline into sub-stages and
thus increasing the number of instructions which are supported by the pipeline at a given moment. By
dividing each stage into two, the clock cycle period t will be reduced to half, t/2; hence, at maximum capacity, the pipeline produces a result every t/2 time units. For a given architecture and the corresponding instruction set there is an optimal number of pipeline stages; increasing the number of stages beyond this limit reduces the overall performance.
Superscalar Architecture A superscalar architecture is one in which several instructions can be
initiated simultaneously and executed independently. Pipelining allows several instructions to be
executed at the same time, but they have to be in different pipeline stages at a given moment.
Superscalar architectures include all features of pipelining. Superscalar architectures allow several
instructions to be issued and completed per clock cycle. A superscalar architecture consists of a
number of pipelines that are working in parallel. Depending on the number and kind of parallel units
available, a certain number of instructions can be executed in parallel. Figure 8.22 gives the
comparison among simple pipelined execution, superpipelined execution and superscalar execution
using four instructions.
— Load/Store unit.
— 3 Integer units.
— Floating-point unit.
• Supports in-order issue.
Power PC 620
• Provides, in addition to the 604's features, out-of-order issue.
Pentium
• Three independent execution units:
— 2 Integer units.
— Floating-point unit.
• Supports in-order issue; two instructions issued per clock cycle.
Pentium II to IV
• Provide, in addition to the Pentium's features, out-of-order execution.
• Five instructions can be issued in one cycle.
VLIW Processor The hardware cost and complexity of the superscalar technique are major con-
siderations in processor design. VLIW architectures rely on compile-time detection of parallelism.
The compiler analyzes the program and detects operations to be executed in parallel; such operations
are packed into one large instruction. Detection of parallelism and packaging of operations into
instructions is done by the compiler off-line. After one instruction has been fetched all the corre-
sponding operations are issued in parallel. No hardware is needed for run-time detection of parallel-
ism. This concept was used with some commercial success in the Multiflow Trace machine. Variants
of this concept are employed in the Intel IA-64 (Itanium) processors.
A vector processor consists of a scalar processor and a vector unit which could be thought of as an
independent functional unit capable of efficient vector operations. Scalar instructions are executed on
the scalar processor, whereas vector instructions are executed on the vector unit which is generally
pipelined. When designing the instruction set for a vector unit, one has to choose an ISA (Instruction
Set Architecture). Most of today’s vector units have an instruction set that generalizes the Load/Store
ISA (RISC architecture) of scalar processors. Vector processors are processors which have special
hardware for performing operations on vectors.
Most of the available supercomputers are vector supercomputers. A supercomputer is characterized by very high execution speed, large main and secondary memories, and extensive use of parallel-structured software. Supercomputers are specifically designed to perform huge vector or matrix computations in scientific areas such as petroleum exploration, VLSI circuit design, meteorology, nuclear research and artificial intelligence.
To examine the efficiency of vector processing over scalar processing, we compare the following
two programs, one written for vector processing and the other written for scalar processing.
For I = 1 to N do
A(I) = B(I)+C(I);
End
is implemented by the following sequence of scalar operations:
INITIALIZE I=1;
LABEL: READ B(I);
READ C(I);
ADD B(I)+C(I);
STORE A(I) = B(I)+C(I);
INCREMENT I = I+1;
IF I <= N GO TO LABEL;
STOP;
In a vector processor, the above ‘for’ loop operation can be vectorized into one vector instruction
as:
A(1:N) = B(1:N) + C(1:N);
where A(1:N) refers to the N-element vector consisting of scalar components A(1), A(2), …, A(N). To execute this vector instruction, the vector processor uses an adder pipeline.
The execution of the scalar loop repeats the loop-control overhead in each iteration. In vector
processing using pipelines this loop-control overhead is eliminated by using hardware or firmware
controls. A vector-length register is used to control the vector operations. The overhead of pipeline
processing is mainly the setup time, which is needed to route the operands among functional units.
In register-to-register architecture, operands and results are retrieved indirectly from the main
memory through the use of a large number of vector registers or scalar registers. The processors like
Cray-1 and the Fujitsu VP-200 use vector instructions in register-to-register formats.
A block diagram of a modern multiple-pipeline vector computer is shown in Fig. 8.24.
A(0) = x
for i = 1 to n do
A(i) = A(i – 1) * B(i) + C(i + 1)
may be vectorized as:
A(0) = x
A[1 : N] = A[0 : N – 1] * B[1 : N] + C[2 : N + 1]
3. “If ” statements can be coded as “where” statements; for example
for i = 1 to N do
If L[i] ≠ 0 then A[i] = A[i] + 1
8.6.5 Vector Length and Stride (Issues in Vector Fitting and Storage)
In a number of applications the length of the vectors used in the user program could be much larger
than the maximum vector length (MVL) available on a vector register machine. In addition, the length
of a vector may be unknown at compile time; for example a loop of the form “for i = 1 to N” where N
is determined at run-time. This problem of fitting user vector lengths to machine vector lengths is
called vector fitting.
One approach is to use the concept of “Strip mining”. Strip mining is the generation of code such
that each vector operation is done for a size less than or equal to the MVL. Thus the compiler would
generate a loop where at each iteration operations are performed on vectors of length at most MVL
and are then stored back into memory. For example, let MVL be 32 components and consider
operations on a vector A[1 : 100] of length 100. The compiler must generate a loop (of four iterations)
where the first three iterations operate on 32 components, namely A[1 : 32];A[33 : 64];A[65 : 96] and
the last iteration operates on a vector of length 4, namely A[97 : 100].
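A strip-mined loop of this kind can be sketched in C as follows (MVL and the function name are assumptions for illustration); each pass of the outer loop corresponds to one vector operation of length at most MVL:

#include <stdio.h>

#define MVL 32   /* assumed maximum vector length of the machine */

/* Strip-mined vector add A = B + C over n elements: each pass handles
   at most MVL components, as the compiler-generated loop would. */
void strip_mined_add(double *a, const double *b, const double *c, int n)
{
    for (int i = 0; i < n; i += MVL) {
        int len = (n - i < MVL) ? n - i : MVL;   /* last strip may be shorter */
        for (int j = 0; j < len; j++)            /* one vector operation */
            a[i + j] = b[i + j] + c[i + j];
        printf("strip at %d, length %d\n", i, len);
    }
}

int main(void)
{
    double a[100], b[100] = {0}, c[100] = {0};
    strip_mined_add(a, b, c, 100);   /* strips of 32, 32, 32 and 4 */
    return 0;
}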
Vector stride This is important when we deal with higher-dimensional vectors and the program wants to access rows, columns, or elements in any other pattern. Consider a two-dimensional vector: a program may access the rows or the columns of the vector. To accomplish this, a stride value is provided.
The stride for a vector expresses the address increment used to move from one element to the next in
the vector access. For example, suppose a 2-D vector of size 10 × 10 is stored in row-major order.
Then if the stride is 1 then we can fetch a row of 10 elements. If the stride is set to 10 then we can
fetch a column of 10 elements.
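The address arithmetic behind strides can be illustrated with a short C sketch (the base address and element size are made-up values); with 8-byte elements, stride 1 advances by 8 bytes and stride 10 by 80 bytes:

#include <stdio.h>

int main(void)
{
    /* A 10 x 10 matrix stored in row-major order from address base;
       element (i, j) lives at base + (i*10 + j) * size. Values assumed. */
    unsigned base = 0x1000, size = 8;          /* 8-byte elements */
    for (int k = 0; k < 3; k++)                /* stride 1: along a row */
        printf("row element %d at 0x%x\n", k, base + k * 1 * size);
    for (int k = 0; k < 3; k++)                /* stride 10: down a column */
        printf("col element %d at 0x%x\n", k, base + k * 10 * size);
    return 0;
}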
An array processor is a synchronous array of parallel processors and consists of multiple processing
elements (PEs) (PEs are essentially ALUs) under the supervision of one control unit (CU). An array
processor can handle single instruction and multiple data streams. Hence, array processors are also
known as SIMD computers. SIMD machines are especially designed to perform vector computations
over large matrices or arrays of data. Since both vector processors and array processors operate on vector-type data, people often fail to differentiate between an array processor and a vector processor. A
vector processor does not operate on all the elements of a vector at the same time; it only operates on
the multiple elements concurrently. A vector processor generally uses pipelining, whereas an array processor uses multiple PEs, which may or may not be pipelined.
SIMD computers are available in two basic architectural organizations: array processors, using
random-access memory; and associative processors, using content-addressable (or associative) memory.
Generally, the first organization is used extensively in this category of computers and thus our focus will be on the first case.
Such a network is capable of connecting an arbitrary input terminal to an arbitrary output terminal.
Multistage networks can be one-sided or two-sided. If input-output ports are on the same side, then
the networks are called one-sided networks, sometimes called full switches. The two-sided multistage
networks usually have an input side and an output side. Again a two-sided multi-stage network can be
blocking or non-blocking. If the simultaneous connections of some multiple input-output pairs may
result in conflicts in the switches or links, then the multi-stage network is called blocking network.
Examples include Omega network. If the network can perform all possible connections between
sources (inputs) and destinations (outputs) by rearranging its connections, the multistage network is
called non-blocking network. Such networks always allow any possible connection path between
source-destination pair. Examples include Benes network.
Some Examples of SIMD Interconnection Networks
Mesh-Connected Illiac Network The Illiac-IV array processor with N = 64 PEs uses a single-stage
recirculating network called the mesh-connected Illiac network. Each PEi sends data to any one of PEi+1, PEi–1, PEi+r and PEi–r, where r = √N (for the Illiac-IV, r = √64 = 8), in one circulation step
through the network. Formally, the Illiac network is implemented by the following four routing
functions:
R+1(i) = (i + 1) mod N;    R–1(i) = (i – 1) mod N;
R+r(i) = (i + r) mod N;    R–r(i) = (i – r) mod N;
where 0 ≤ i ≤ N – 1.
Usually, N is a perfect square, such as N = 64 and r = 8 in the Illiac-IV network. Due to the
complexity of the original Illiac network, a reduced Illiac network is presented in Fig. 8.28 for N = 16
and r = 4. The original Illiac network has a similar structure except for the number of PEs. Each PEi is
directly connected to its four nearest neighbors in the mesh network. The two topologies shown in Fig. 8.28 are identical.
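The four routing functions are simple modular additions on the PE index. A C sketch (illustrative only) for the reduced network with N = 16 and r = 4:

#include <stdio.h>

/* R+d(i) = (i + d) mod N, written to handle negative d correctly in C. */
int route(int i, int d, int n) { return ((i + d) % n + n) % n; }

int main(void)
{
    int N = 16, r = 4, i = 5;   /* reduced Illiac network of Fig. 8.28 */
    printf("R+1(%d)=%d R-1(%d)=%d R+r(%d)=%d R-r(%d)=%d\n",
           i, route(i, 1, N), i, route(i, -1, N),
           i, route(i, r, N), i, route(i, -r, N));   /* 6 4 9 1 */
    return 0;
}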
Shuffle-Exchange and Omega Networks The two routing functions shuffle (S) and exchange
(E) are used in the class of shuffle-exchange networks. Let a PE address be A = an–1 an–2…a1a0. Then
the shuffle function (S) is defined as:
S(an–1 an–2 … a1 a0) = an–2 an–3 … a1 a0 an–1,
where 0 ≤ A ≤ N – 1 and n = log2 N.
The S function performs the cyclic shifting of the bits in
A to the left for one bit position. This action corresponds to
perfectly shuffling a deck of N cards, as illustrated in
Fig. 8.29 for N=8.
The exchange-routing function E is defined by:
E(an–1 an–2 … a1 a0) = an–1 an–2 … a1 a0′,
where a0′ is the complement of a0. This function complements the least significant bit, which means an exchange of data between two PEs with adjacent addresses.
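Both routing functions are simple bit manipulations on the PE address. The C sketch below (function names are illustrative) implements S and E for N = 8 and prints the resulting mappings:

#include <stdio.h>

/* Shuffle S: cyclic left rotation of the n-bit PE address.
   Exchange E: complement of the least significant address bit. */
unsigned shuffle(unsigned a, unsigned n)
{
    unsigned msb = (a >> (n - 1)) & 1u;
    return ((a << 1) | msb) & ((1u << n) - 1u);
}

unsigned exchange(unsigned a) { return a ^ 1u; }

int main(void)
{
    unsigned n = 3;                          /* N = 8 PEs, n = log2 N = 3 */
    for (unsigned a = 0; a < 8; a++)
        printf("S(%u) = %u, E(%u) = %u\n", a, shuffle(a, n), a, exchange(a));
    return 0;
}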
These shuffle and exchange functions can be used to implement either a recirculating (single-stage) network or a multistage network. Figure 8.30 shows a single-stage recirculating shuffle-exchange network for N = 8, where the exchange connections are indicated by solid lines and the shuffle connections by dashed lines.
Figure 8.29 Perfect shuffle connection for N = 8
The shuffle-exchange networks
have been used to implement a number of parallel algorithms for applications like the fast Fourier
transform (FFT), polynomial evaluation, sorting, and matrix transposition, etc.
When the shuffle and exchange functions are implemented in multiple stages, the network is called an Omega network. An N × N Omega network consists of n (n = log2 N) identical stages. The connec-
tion pattern between two adjacent stages is a perfect-shuffle interconnection. Each stage has N/2
switch boxes under independent box control. The stages are marked from 0 to n–1 from the input side
to the output side. Each box has four functions (straight, exchange, upper broadcast, lower broadcast).
The Omega network for N = 8 is illustrated in Fig. 8.31.
It is possible to connect every PE to every other PE using at least one configuration of the network. This feature is called the full-access property. The data routing is controlled by the destination address in binary: if the ith high-order bit of the destination address is 0, the 2 × 2 switch at stage i connects the input to the upper output; otherwise, the input is connected to the lower output.
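This destination-tag rule can be sketched in C as follows (the function name is illustrative); for each stage the routine reports whether the switch connects its input to the upper or the lower output:

#include <stdio.h>

/* Destination-tag routing in an N x N Omega network: at stage i (from
   the input side), the switch examines the i-th high-order bit of the
   destination address; 0 selects the upper output, 1 the lower. */
void omega_route(unsigned dest, unsigned n)   /* n = log2 N stages */
{
    for (unsigned i = 0; i < n; i++) {
        unsigned bit = (dest >> (n - 1 - i)) & 1u;
        printf("stage %u: %s output\n", i, bit ? "lower" : "upper");
    }
}

int main(void)
{
    omega_route(5, 3);   /* route to PE 5 (binary 101) in an 8 x 8 network */
    return 0;
}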
Butterfly Network and Hypercube Network The kth butterfly function bk is defined as follows, for k = 1, 2, …, n–1:
bk(an–1 … ak+1 ak ak–1 … a1 a0) = an–1 … ak+1 a0 ak–1 … a1 ak.
Thus bk interchanges bits 0 and k of the source address to obtain the destination address. For
example, when k = 1 and N = 4, we obtain
b1 (00) = 00
b1 (01) = 10
b1 (10) = 01
b1 (11) = 11
This action corresponds to the interconnection
pattern shown in Fig. 8.32.
Butterfly connections can also be used to build a multistage network called the indirect hypercube network. The N × N hypercube network consists of n (= log2 N) stages, each containing N/2 switch boxes. The wiring patterns between the stages are defined by b1, b2, …, bn–1, with the inverse shuffle function S–1 in the last stage. The stages are marked from 0 to n–1 from the input side to the output side.
Figure 8.32 Butterfly connection and single-stage butterfly network for N = 4
The inverse shuffle function S–1 is defined by
S–1(an–1 an–2 … a1 a0) = a0 an–1 an–2 … a1.
This function is the same as the shuffle function (S) defined earlier, with the direction of the address-bit rotation reversed. Figure 8.33 shows an 8 × 8 indirect hypercube network.
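The butterfly and inverse-shuffle functions are likewise simple bit manipulations. The C sketch below (function names are illustrative) implements bk and S–1 and reproduces the b1 mapping for N = 4 given earlier:

#include <stdio.h>

/* bk: interchange bit 0 and bit k of the n-bit address. */
unsigned butterfly(unsigned a, unsigned k)
{
    unsigned b0 = a & 1u, bk = (a >> k) & 1u;
    if (b0 != bk)
        a ^= (1u << k) | 1u;      /* flip both bits to swap them */
    return a;
}

/* Inverse shuffle: cyclic right rotation of the n-bit address. */
unsigned inv_shuffle(unsigned a, unsigned n)
{
    unsigned lsb = a & 1u;
    return (a >> 1) | (lsb << (n - 1));
}

int main(void)
{
    for (unsigned a = 0; a < 4; a++)   /* b1 for N = 4: 00->00, 01->10, 10->01, 11->11 */
        printf("b1(%u) = %u\n", a, butterfly(a, 1));
    for (unsigned a = 0; a < 8; a++)   /* S^-1 for N = 8 */
        printf("S-1(%u) = %u\n", a, inv_shuffle(a, 3));
    return 0;
}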
Like the Omega network, this network supports the full-access property. However, it differs from the Omega network in two aspects:
1. The hypercube network uses two-function (straight and exchange) switch boxes, whereas the Omega network uses four-function switch boxes.
2. The data-flow directions in the two networks are opposite to each other. In other words, the
roles of sources (inputs) and destinations (outputs) are exchanged in two networks.
Multiprocessor systems belong to the MIMD class of computers in M.J. Flynn's classification. A multiprocessor is a single computer consisting of multiple processors, which may communicate and cooperate
at different levels in solving a given problem. The communication may occur by sending messages
from one processor to the other or by sharing a common memory.
As mentioned earlier, the MIMD category of Flynn's classification covers multiprocessor systems and multiple computer systems. There are some similarities between multiprocessors and multiple computer systems, since both support concurrent operations in the system. However, there exists an important difference between them, based on the extent of resource sharing and cooperation in the solution of a problem. A collection of several autonomous computers (CPUs) connected through a network forms a multiple computer system. These autonomous CPUs may or may not communicate with each other. Examples of multiple computer systems include the IBM Attached Support Processor System and Intel's Paragon XP/S. A multiprocessor system is controlled by one operating system, which provides interaction between processors and their programs at various levels. Examples include the well-known C.mmp and Cm* multiprocessors designed at Carnegie Mellon University.
Three different architectural models for a multiprocessor are available: the Uniform-Memory-
Access (UMA) model, Nonuniform-Memory-Access (NUMA) model and Cache-Only Memory Ar-
chitecture (COMA) model.
UMA Model In the UMA multiprocessor model (Fig. 8.34), all the processors access the physical main
memory uniformly. All processors have equal access time to all memory words. This is why the
model is called uniform memory access. Each processor may use a private cache. Peripherals are also
shared in some fashion. This multiprocessor is also termed a tightly coupled system (TCS) or shared-memory system due to the high degree of resource sharing. The system interconnection network can be a common bus, a crossbar switch, or a multistage network; these are discussed later.
Figure 8.34 UMA model of multiprocessor
Figure 8.35 NUMA model of multiprocessor
processor system using the common communication path is shown in Fig. 8.37. This organization is
the least complex and the easiest to reconfigure. Such an interconnection network is often a totally
passive unit having no active components such as switches. The bus interface circuits of the sending
and receiving units control the data transfer operations. Since the bus is a shared resource that can transfer only one word at a time for a pair of communicating devices, there are possibilities of bus conflict when two or more senders or receivers attempt to use the bus at the same time. This problem is called bus contention, and a mechanism (bus arbitration) must be provided to resolve it. There exist several bus-arbitration algorithms, some of which are discussed in Section 7.6.1. An example of a time-shared bus is the PDP-11 Unibus, which has 56 signal lines to provide the control, address and data buses necessary to transfer 16-bit words.
crossbar switch network. We assume that there are 8 destination devices (memory modules) and
8 requestors (processors). The cross-point switch consists of an arbitration module, which has a
priority encoder circuit and multiplexer modules. Each processor generates a memory module request
signal (REQ) to the arbitration module, whose priority encoder selects the processor with the highest
priority. The arbitration module returns an acknowledge signal (ACK) to the selected processor. The
processor initiates its memory operation after it receives the ACK. Multiplexer modules are used to
select one of 8 processors for the memory module access, based on 3-bit select combination generated
by the arbitration module.
Such a network was used to implement the processor-memory switch for the C.mmp, which has
16 processors and 16 memory modules.
Multiport memory networks In a multiport memory system, all the cross-point priority arbitration logic and switching functions that are distributed throughout the crossbar switch matrix are moved to the interfaces of the memory modules. Thus the memory modules become more expensive due to the added access ports and associated logic, as shown in Fig. 8.40. In the diagram, n processors are tied to n input ports of a memory module and, similarly, m I/O peripherals are tied to m input ports of a memory module. Only one of these n + m processor and I/O requests will be serviced at a time. This organization becomes very expensive when n, m and k become very large. Another drawback is that a large number of interconnection buses and connectors (ports) are required when the configuration becomes very large.
Figure 8.40 Multiport memory organization for multiprocessor systems
processors, need not be non-cacheable. This approach is simple and less expensive, as no hardware circuitry is required and the solution is achieved at compile time.
Hardware Protocol There exist two different hardware protocols:
1. Snoopy protocol.
2. Directory protocol.
Snoopy Protocol In this protocol, a snoopy cache controller attached to every processor constantly monitors the operations on the bus by the other processors. Every processor thus keeps track of the other processors' memory writes. Two common approaches are adopted based on snooping: the write update protocol and the write invalidate protocol.
In the write update protocol, when a processor updates a shared item in its cache, it sends a message to all other processors and supplies the new updated value of the item, so that the other processors can update their local caches immediately. If the shared items can be identified, then the extent of broadcasting can be reduced.
In write invalidate protocol, multiple copies of an item are allowed. However, whenever a proces-
sor modifies a location x in its cache, it must check the other caches to invalidate possible copies.
This operation is referred to as a cross-interrogate (XI). The idea of this method is that the caches are
tied on a high-speed bus. When a local processor writes into a shared data item in its cache, the
processor sends a signal to all the remote caches to indicate that the “data at memory address x has
been modified." At the same time, it writes through to memory. Note that, to ensure correctness of
execution, a processor which requests an XI must wait for an acknowledge signal from all other
processors before it can complete the write operation. The XI invalidates the remote cache location
corresponding to x if it exists in that cache. When the remote processor references this invalid cache
location, it results in a cache miss, which is serviced to retrieve the block containing the updated item.
For each write operation, (n – 1) XIs result, where n is the number of processors. When n increases,
the traffic on the high-speed bus becomes a bottleneck.
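The invalidation step itself can be modelled in a few lines of C (a toy model with assumed names, not a real coherence controller): a write to a shared block marks every remote copy invalid, which corresponds to the (n – 1) cross-interrogates mentioned above:

#include <stdio.h>

#define NPROC 4
enum state { INVALID, VALID };

/* One write-invalidate step for a shared block x: the writer's copy
   stays valid, every remote copy is invalidated (the XI operation). */
void write_invalidate(enum state copy[NPROC], int writer)
{
    for (int p = 0; p < NPROC; p++)
        if (p != writer)
            copy[p] = INVALID;    /* (n - 1) cross-interrogates */
    copy[writer] = VALID;
}

int main(void)
{
    enum state copy[NPROC] = {VALID, VALID, VALID, VALID};
    write_invalidate(copy, 2);    /* processor 2 writes block x */
    for (int p = 0; p < NPROC; p++)
        printf("P%d: %s\n", p, copy[p] == VALID ? "valid" : "invalid");
    return 0;
}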
Directory Protocol In this protocol, a centralized controller is maintained that is part of the main
memory controller and a directory is kept in main memory. The directory contains global state
information about the contents of the various local caches. When a processor modifies the informa-
tion in its cache, the central memory controller checks the directory and finds which processors are
affected. Only the affected processors are informed about the change by the central controller.
MESI Protocol To provide cache consistency, the data cache often supports a protocol known as
Modified Exclusive Shared Invalid (MESI). For MESI, the data cache includes two status bits per tag, so that each line can be in one of four states: modified, exclusive, shared or invalid (Fig. 8.42).
REVIEW QUESTIONS
Group A
1. Choose the most appropriate option for the following questions:
(i) According to Flynn's classification, the digital computers have been divided into
(a) two categories (b) three categories (c) four categories (d) none.
(ii) In which category of Flynn’s classification of computers do the pipelined computers fall?
(a) SISD (b) SIMD (c) MISD (d) MIMD.
(iii) In which category of Flynn’s classification of computers do the multiprocessor computers fall?
(a) SISD (b) SIMD (c) MISD (d) MIMD.
11. What is a reservation table? What is the basic difference between it and a space-time diagram? What is evaluation time?
12. Design a floating-point adder pipeline. Show its reservation table.
13. Design a 6-bit multiplier pipeline using CSA. Show its reservation table.
14. Design a 32-bit multiplier non-linear pipeline using CSA. Show its reservation table.
15. What are instruction pipeline and arithmetic pipeline?
16. What do you mean by pipeline hazards? What are the different types of pipeline hazards? Explain each of them.
17. Describe how branch-type instructions can degrade the performance of an instruction pipeline, and what is the solution for this?
18. Show how the bus re-structuring can improve the throughput of an instruction pipeline.
19. Compare super-pipelining, super-scalar, VLIW processors. Give example for each.
20. What do you mean by vector processing? What are the basic types of vector instructions? Give an example of each.
21. With an example show how the vector processors perform better than scalar processors.
22. Describe the vector instruction format.
23. Why do most vector processors have pipeline structures?
24. What are the different architectural configurations of vector processors? Discuss them with ex-
amples.
25. Write a short note on vectorizing compilers.
26. What are strip mining and vector stride, with respect to vector processors?
27. Both vector processors and array processors are specialized to operate on vectors. What are the
main differences between them?
28. What do you mean by array processors? What are the different architectural organizations of array
processors?
29. Describe the SIMD array processor organization that is supported by ILLIAC-IV computer.
30. Discuss how masking and data-routing mechanisms work in an array processor.
31. Why do we need masking mechanism in SIMD array processors?
32. In an SIMD array processor of 8 PEs, the sum S(k) of the first k components in a vector A is desired for each k from 0 to 7. Let A = (A0, A1, …, A7).
We need to compute the following 8 summations:
S(k) = Σ (from i = 0 to k) Ai, for k = 0, 1, …, 7.
Discuss how data-routing and masking are performed in the processor.
33. What is the necessity for using an interconnection network in a parallel processing computer? What
are different types of SIMD interconnection networks according to the network topologies? Explain
them.
34. Describe the following networks:
(a) mesh connected ILLIAC network
(b) omega network
(c) hypercube network
35. How many 2 × 2 switches and how many stages are required to design an n × n omega network?
36. Explain the terms with relevance to interconnection networks:
(a) full access property
(b) non-blocking property.
37. What do you mean by multiprocessor system? What are the similarities and dissimilarities between
the multiprocessor system and multiple computer system?
38. What are different architectural models for multiprocessors? Explain each of them with example.
39. Distinguish between loosely coupled and tightly coupled multiprocessor architectures. Which archi-
tecture is better and why?
40. Briefly discuss the time shared or common bus interconnection network. What are the advantages and disadvantages of this network?
41. Describe the following networks:
(a) cross-bar network
(b) multi-port network.
42. Explain how an unauthorized access to a memory module can be prevented in multi-port networks.
43. What do you mean by cache coherence problem? What are different protocols to solve the problem?
Briefly discuss each of them.
44. Write short notes on:
(a) snoopy protocol
(b) directory protocol
(c) MESI protocol.
45. Consider a four-stage normalized floating point adder with 10 ns delay per stage, which equals the pipeline clock period.
(a) Name the appropriate functions to be performed by four stages.
(b) Find the minimum number of periods required to add 100 floating point numbers A1 + A2 +…+
A100 using this pipeline adder, assuming that the output Z of 4th stage can be routed back to any
of the two inputs X or Y of the pipeline with delays equal to any multiples of the period.
SOLVED PROBLEMS
1. Why are several parameters used to measure the performance of computers, instead of one?
Answer
Several parameters exist because computers are used for a variety of applications, the performance of
which can depend on different aspects of the computer system. For example, scientific computations
depend mainly on the performance of processors and memory system, while database and transaction
processing applications depend largely on the performance of computers’ I/O system. Thus, depend-
ing on the applications to be performed, parameters vary.
2. When executing a particular program, machine A gives 90 MIPS and machine B gives
70 MIPS. However, machine A executes the program in 50 seconds and machine B executes in
40 seconds. Give the reason.
Answer
The MIPS rating gives a measure of the rate at which a processor executes instructions, but different machines need different numbers of instructions to perform a given computation. If machine A had to execute more instructions than machine B to complete the program, it would be possible for machine A to take longer to run the program than machine B despite the fact that machine A executes more instructions per second.
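The point can be checked numerically: instruction count = MIPS rating × 10^6 × execution time. A quick Python sketch (variable names are ours):

# Instruction count = MIPS rating * 10**6 * execution time (in seconds).
ic_a = 90 * 10**6 * 50    # machine A: 4.5 * 10**9 instructions
ic_b = 70 * 10**6 * 40    # machine B: 2.8 * 10**9 instructions
print(ic_a > ic_b)        # True: A executes far more instructions, so it can take longer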
3. A 50 MHz processor was used to execute a program with the following instruction mix and clock cycle counts:
Instruction count      Clock cycles per instruction
50000                  1
35000                  2
20000                  2
6000                   3
Calculate the effective CPI, MIPS rate and execution time for this program.
Answer
CPIav = (50000 × 1 + 35000 × 2 + 20000 × 2 + 6000 × 3)/(50000 + 35000 + 20000 + 6000)
= 1.6 cycles/instruction
MIPS = clock rate/(CPI × 10^6)
= (50 × 10^6)/(1.6 × 10^6)
= 31.25
Execution time = IC (instruction count) × CPIav / clock rate
= ((50000 + 35000 + 20000 + 6000) × 1.6)/(50 × 10^6) sec
= (111000 × 1.6)/(50 × 10^6) sec
= 3.552 ms
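The same computation can be written out in a few lines of Python. Note that the text rounds CPIav to 1.6 before computing the MIPS rate, which gives 31.25; carrying full precision gives about 31.2:

# Effective CPI, MIPS rate and execution time for the given instruction mix.
mix = [(50000, 1), (35000, 2), (20000, 2), (6000, 3)]  # (count, cycles per instruction)
clock_rate = 50 * 10**6                                # 50 MHz

ic = sum(count for count, _ in mix)                    # 111000 instructions
cycles = sum(count * cpi for count, cpi in mix)        # 178000 cycles
cpi_av = cycles / ic                                   # ~1.6 cycles/instruction
mips = clock_rate / (cpi_av * 10**6)                   # ~31.2 (31.25 with CPI rounded to 1.6)
exec_time = ic * cpi_av / clock_rate                   # ~3.56 ms (3.552 ms with CPI = 1.6)
print(round(cpi_av, 2), round(mips, 2), round(exec_time * 1000, 3))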
4. What is meant by pipeline architecture? How does it improve the speed of execution of a
processor?
Answer
Pipelining is a technique of decomposing a sequential task into subtasks, with each subtask being
executed in a special dedicated stage (segment) that operates concurrently with all other stages. Each
stage performs partial processing dictated by the way the task is partitioned. Result obtained from a
stage is transferred to the next stage in the pipeline. The final result is obtained after the instruction
has passed through all the stages. All stages are synchronized by a common clock.
Several instructions are executed in different stages of the pipeline simultaneously. Thus, the executions of several instructions are overlapped in the pipeline, increasing the rate at which instructions execute.
5. Consider the execution of a program of 20000 instructions by a linear pipeline processor with a clock rate of 40 MHz. Assume that the instruction pipeline has five stages and that one instruction is issued per clock cycle. The penalties due to branch instructions and out-of-order executions are ignored. Calculate the speed-up of the pipeline over its equivalent non-pipeline processor, the efficiency and the throughput.
Answer
Given,
No. of instructions (tasks) n = 20000
No. of stages k = 5
Clock rate f = 40 MHz
Clock period t = 1/f = 1/(40 × 10^6) sec
Speed up Sp = (n × k)/(k + (n − 1))
= (20000 × 5)/(5 + (20000 − 1))
= 4.99
Efficiency η = n/(k + (n − 1))
= 20000/(5 + (20000 − 1))
= 0.99
Throughput w = η/t
= 0.99 × (40 × 10^6) instructions per second
= 39.6 MIPS
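A direct transcription of these formulas into Python (variable names are ours):

# Speed-up, efficiency and throughput of a k-stage linear pipeline.
n, k, f = 20000, 5, 40 * 10**6          # tasks, stages, clock rate (Hz)

speedup = (n * k) / (k + (n - 1))       # ~4.99
efficiency = n / (k + (n - 1))          # ~0.99
throughput = efficiency * f             # instructions per second
print(round(speedup, 2), round(efficiency, 4), throughput / 10**6)  # ~39.6 MIPS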
6. Deduce that the maximum speed-up in a k-stage pipeline processor can be k. Is this maximum
speed-up always achievable? Explain.
Answer
Speed-up is defined as
Sp = (Time to execute n tasks in a k-stage non-pipeline processor)/(Time to execute n tasks in a k-stage pipeline processor)
The time to execute n tasks in a k-stage pipeline processor is t[k + (n − 1)] units, where k clock periods (cycles) are needed to complete the execution of the first task and the remaining (n − 1) tasks require (n − 1) cycles. The time to execute n tasks in a k-stage non-pipeline processor is n.k.t, where each task requires k cycles because no new task can enter the processor until the previous task finishes. The clock period of the pipeline processor is t.
Thus Sp = n.k.t / (t[k + (n − 1)]) = nk / (k + (n − 1))
The maximum speed-up is obtained for n >> k, that is, n → ∞:
Max Sp = lim(n→∞) nk / (k + (n − 1)) = lim(n→∞) k / ((k/n) + 1 − (1/n)) = k
But this maximum speed-up k is never fully achievable because of data dependencies between
instructions, interrupts, program branches, etc.
7. Given a non-pipelined processor with a 15 ns clock period, how many stages of a pipelined version of the processor are required to achieve a clock period of 4 ns? Assume that the interface latch has a delay of 0.5 ns.
Answer
We can write the clock period of a pipeline processor as:
Clock period of pipeline = (Clock period of non-pipeline)/(No. of pipeline stages) + Interface latch delay
This can be written as:
No. of pipeline stages = (Clock period of non-pipeline)/(Clock period of pipeline − Interface latch delay)
Therefore, for our problem,
No. of pipeline stages = 15/(4 − 0.5) = 4.3
Since the number of stages in a pipeline cannot be a fraction, rounding up gives the number of stages as 5.
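The division and the rounding-up step as a two-line Python check (all values in ns):

import math

t_nonpipe, t_pipe, latch = 15.0, 4.0, 0.5
stages = t_nonpipe / (t_pipe - latch)     # 4.29 stages
print(stages, math.ceil(stages))          # rounded up: 5 stages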
8. Consider the following pipeline processor with four stages. All successor stages after each stage must be used in successive clock periods.
Write the reservation table for this pipeline with an evaluation time of seven cycles.
Answer
The reservation table for this pipeline with an evaluation time of seven cycles (i.e. seven columns) is shown in the accompanying figure.
9. Describe how branch-type instructions can degrade the performance of an instruction pipeline, and what is the solution for this?
Answer
A stream of instructions is executed continuously in an overlapped fashion in the instruction pipeline as long as no branch-type instruction is encountered in the stream. The performance in this case would be one instruction per pipeline cycle after the first k cycles in a k-stage pipeline. If a branch instruction enters the pipeline, the performance is hampered. A conditional branch instruction is resolved only after it has executed partway down the pipeline, at which point the program counter (PC) needs to be loaded with the new address to which the program should be directed, making all pre-fetched instructions in the pipeline useless. The next instruction cannot be initiated until the
completion of the current branch instruction. This causes an extra time penalty (delay) in executing the next instruction in the pipeline.
In order to cope with the adverse effects of branch instructions, an important technique called prefetching is used. In the prefetching technique, instruction words ahead of the one currently being decoded in the instruction-decoding (ID) stage are fetched from the memory system before the ID stage requests them.
10. Consider a four stage normalized floating point adder with 10 ns delay per stage, which
equals the pipeline clock period.
(a) Name the appropriate functions to be performed by four stages.
(b) Find the minimum number of periods required to add 100 floating point numbers A1 + A2
+……..+ A100 using this pipeline adder, assuming that the output Z of 4th stage can be
routed back to any of the two inputs X or Y of the pipeline with delays equal to any
multiples of the period.
Answer
(a) The functions of four stage pipeline are summarized as:
1. Exponent subtraction: Compare e1 and e2, and pick the fraction of the number with the smaller exponent. Compute k = |e1 − e2| and retain the larger exponent.
2. Alignment of mantissas: Shift right the mantissa of the number with the smaller exponent by k positions.
3. Addition of mantissas: Add the fractions.
4. Normalization of the result: Compute M, the number of leading 0s in the mantissa of the result. Subtract M from the exponent and shift left the mantissa of the result by M positions.
(b) The addition proceeds in two distinct phases: (1) add pairs of numbers to get 4 partial sums
(since we have four stage pipeline and in general, for a k stage pipeline we have k partial
sums), and (2) add the partial sums to get the final sum.
In first phase, in the first four cycles we can send the four pairs of additions: (A1 + A5), (A2
+ A6), (A3 + A7), (A4 + A8). (In general, for a k stage pipeline we send in k pairs). After 4
cycles, the pipeline is full (computing the four partial sums) and there are n-8 (here, n = 100)
remaining numbers to be sent into the pipeline which represent n -8 more additions. Note that
each number will be sent in to be added with the partial sum. For example, number A9 is sent
in to be added with the partial sum (A1 + A5). In order to accomplish this, the pipeline must be
set to take one of its inputs from its output stage and the other input from the stream of input
numbers.
The time to complete n – 8 additions is n – 8 pipeline cycles, and therefore in n – 8 + 4 = n
– 4 cycles all partial sums have been sent into the pipeline and all four stages of the pipeline
are occupied with these partial sums.
In the second phase, the partial sums are added to form the final result. Let the four partial
sums be denoted sum1; sum2; sum3; sum4, with sum1 being the first partial sum sent in (and
thus is at the last stage of the pipeline at the end of phase 1). The time to add these four partial
sums is a function of the length of the pipeline and the time it takes for these partial sums to
exit the pipeline. Thus,
• After 1 cycle, sum1 exits the pipeline but cannot be added to anything; therefore it is held in an interface latch.
• After another 1 cycle, sum2 exits the pipeline and can be added to sum1. Therefore, after a total of 2 cycles (from the end of phase 1) sum1 + sum2 can be sent into the pipeline.
• After another 2 cycles, both sum3 and sum4 can be sent into the pipeline to be added. Note that the sum of these two will be available 4 cycles later.
• Finally, after a total of 8 cycles from the end of phase 1 the final sum (which is the sum of (sum1 + sum2) and (sum3 + sum4)) can be sent into the pipeline.
• The final sum will thus be available a total of 8 + 4 = 12 cycles after the end of phase 1.
The total minimum time for the addition of n numbers is therefore T = (n − 4) + 12 = n + 8 cycles = 108 cycles (putting n = 100).
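The cycle count of this two-phase schedule can be checked with a short Python function. This is a sketch of the counting argument only, not a pipeline simulator; the phase-2 constant of 12 cycles is specific to the four-stage adder argued above:

# Minimum cycles to add n numbers on a 4-stage pipelined adder, following
# the two-phase argument above (valid for n comfortably larger than 8).
def pipelined_add_cycles(n):
    k = 4                       # stages in the pipelined adder
    phase1 = k + (n - 2 * k)    # fill with k pairs, then n - 2k feedback additions: n - 4 total
    phase2 = 12                 # collapsing the four partial sums, as argued above
    return phase1 + phase2

print(pipelined_add_cycles(100))   # 108 cycles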
11. Identify all of the RAW, WAR and WAW hazards in the instruction sequence:
DIV R1, R2, R3
SUB R4, R1, R5
ASH R2, R6, R7
MULT R8, R4, R2
BEQ R9, #0, R10
OR R3, R11, R1
Also identify all of the control hazards in the sequence.
Answer
RAW hazard exists between instructions:
DIV and SUB
ASH and MULT
SUB and MULT
DIV and OR
WAR hazard exists between instructions:
DIV and ASH
DIV and OR
There is no WAW hazard.
There is only one control hazard between the BEQ and OR instructions.
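The register-use analysis above can be mechanized. The following Python sketch (the instruction encoding is ours and purely illustrative) scans every ordered pair of instructions and reports the data hazards; the control hazard caused by BEQ has to be noted separately, as in the answer:

# Detect RAW, WAR and WAW hazards in a three-address instruction sequence.
# Each entry is (opcode, destination, source1, source2); None = no register.
program = [
    ("DIV",  "R1", "R2",  "R3"),
    ("SUB",  "R4", "R1",  "R5"),
    ("ASH",  "R2", "R6",  "R7"),
    ("MULT", "R8", "R4",  "R2"),
    ("BEQ",  None, "R9",  "R10"),   # BEQ writes no register; #0 is an immediate
    ("OR",   "R3", "R11", "R1"),
]

for i, (op_i, d_i, *src_i) in enumerate(program):
    for op_j, d_j, *src_j in program[i + 1:]:
        if d_i and d_i in src_j:
            print(f"RAW between {op_i} and {op_j} on {d_i}")
        if d_j and d_j in src_i:
            print(f"WAR between {op_i} and {op_j} on {d_j}")
        if d_i and d_i == d_j:
            print(f"WAW between {op_i} and {op_j} on {d_i}")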
12. What is instruction-level parallelism? How do processors exploit it to improve performance?
Answer
Many of the instructions in a sequential program are independent, meaning that it is not necessary to
execute them in the order that they appear in the program to produce the correct result. This fact is
referred to as instruction-level parallelism. Processors exploit this by executing these instructions in
parallel rather than sequentially, reducing the amount of time it takes to execute programs.
13. With an example show how vector processors perform better than scalar processors.
Answer
To examine the efficiency of vector processing over scalar processing, we compare the following two sets of code for the same program, one written for vector processing and the other written for scalar processing.
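The two code versions themselves are not reproduced here. As a rough stand-in, the contrast can be seen in Python, with NumPy playing the role of the vector hardware: the scalar version pays loop-control overhead (index update, compare, branch) on every element, while the vector version issues one bulk operation:

import numpy as np

n = 1000
a, b = np.arange(n), np.arange(n)

# Scalar processing: one add per loop iteration, plus loop overhead on every element.
c = np.empty(n)
for i in range(n):
    c[i] = a[i] + b[i]

# Vector processing: a single vector operation; the loop control disappears
# and the element-wise adds can be pipelined back to back.
c_vec = a + b
assert (c == c_vec).all()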
16. What is the necessity for using an interconnection network in a parallel processing computer?
Answer
Parallel computers have several processing units with small or no local memories attached to them. These processing units take part in the execution of a single large task (program) stored in a large shared main memory; that is, the units execute different modules of the large task concurrently. Thus, the units must cooperate and share their instructions through the shared main memory. For this sharing of instructions among different processing units, an interconnection network is required.
17. How many 2 × 2 switches and how many stages are required to design an n × n omega network?
Answer
An n × n omega network consists of log2 n identical stages. Each stage has n/2 switch boxes, each of size 2 × 2. Therefore, the total number of switch boxes required is (n/2) × log2 n.
18. What do you mean by multiprocessor system? What are the similarities and dissimilarities
between the multiprocessor system and multiple computer system?
Answer
A multiprocessor system is a single computer system consisting of multiple processors, which may
communicate and cooperate at different levels in solving a given problem. The communication may
occur by sending messages from one processor to the other or by sharing a common memory.
Similarities:
(a) Both systems consist of several processors inter-connected through networks.
(b) Both execute programs concurrently.
(c) Throughput, in both cases, is very high.
(d) Both are very expensive systems.
Dissimilarities:
(a) In multiprocessor systems, the processors usually communicate with each other. However, in
multiple computer systems, the autonomous processors may or may not communicate with
each other.
(b) A multiprocessor system is controlled by one operating system, which provides interaction
between processors; but in multiple computer system, different processors are controlled by
different operating systems.
(c) The degree of interaction among processors in multiprocessor systems is very high, whereas in multiple computer systems it is much less.
19. Compare centralized and distributed architectures. Which architecture is better and why?
Answer
In centralized architecture, all the processors access the physical main memory uniformly; all processors have equal access time to all memory words. The degree of interaction among tasks is high, so the probability of bus conflicts is high because of frequent sharing of code between processors. The architecture is shown in the following figure.
In a distributed system, a local memory is attached to each processor. All the local memories distributed throughout the system form a global shared memory accessible by all processors. A memory word's access time varies with the location of the word in the shared memory. The degree of interaction among tasks is less, so the probability of bus conflicts is also less. The distributed system is depicted in the figure.
It is faster for a processor to access its own local memory. Access to a remote memory attached to another processor takes longer owing to the added delay through the interconnection network. Therefore, the distributed system is faster and, in this regard, it is better.
20. Explain the terms with relevance to interconnection networks: (a) full access property (b) non-
blocking property.
Answer
(a) Full access property: A network is said to have the full-access property if it is possible to connect every PE (processing element) to every other PE using at least one configuration of the network.
(b) Non-blocking property: If the network can perform all possible connections between source (input) PEs and destination (output) PEs by rearranging its connections, the multistage network is said to have the non-blocking property.
APPENDIX
Digital Devices and
Logic Design
Here, we want to discuss some important digital devices, such as logic gates, flip-flops, registers, full adders and multiplexers, which are used throughout the book.
Digital systems consist of some primitive logic circuits known as logic gates. These circuits operate on binary variables that assume one of two distinct logic states, usually called 0 and 1. Logic gates are electronic circuits because they are made up of a number of electronic devices and components. Each gate has a distinct graphic symbol and its operation can be described by means of an algebraic expression. The input-output relationship of the binary variables for each gate can be represented in tabular form by a truth table. There are three basic types of gates: AND, OR and NOT (or Inverter). The other gates, like XOR, NAND and NOR, can be derived from these basic gates.
A.1.2 OR Gate
The OR gate, like the AND gate, has two or more inputs and only one output. The output assumes the logic state 1 if at least one of the inputs is at logic state 1, and the output assumes logic state 0 only when all inputs are at logic 0. The graphic symbols and truth tables of two-input and three-input OR gates are shown in Fig. A.2. The symbol for the OR operation is '+' (plus) or '∨', and thus A OR B can be written as A + B or A ∨ B.
Truth Table (two-input AND gate, Fig. A.1)
A B    X
0 0    0
0 1    0
1 0    0
1 1    1
Truth Table (three-input AND gate, Fig. A.1)
A B C    X
0 0 0    0
0 0 1    0
0 1 0    0
0 1 1    0
1 0 0    0
1 0 1    0
1 1 0    0
1 1 1    1
Truth Table (two-input OR gate, Fig. A.2)
A B    X
0 0    0
0 1    1
1 0    1
1 1    1
Truth Table (three-input OR gate, Fig. A.2)
A B C    X
0 0 0    0
0 0 1    1
0 1 0    1
0 1 1    1
1 0 0    1
1 0 1    1
1 1 0    1
1 1 1    1
Figure A.4 The XOR gate
A.1.5 NAND and NOR Gates
The two most widely used gates in real circuits are the NAND and NOR gates. These two gates are
called universal gates because other logic operations can be realized using either NAND gates or
NOR gates. Both NAND and NOR gates can perform all three basic logic operations (AND, OR,
NOT).
NAND Gate The NAND gate is a combination of AND and NOT gates, as shown in Fig. A.5(a). The output assumes logic state 0 only when each of the inputs assumes logic state 1; the output assumes logic state 1 for all other combinations of inputs. The graphic symbols and truth tables of two-input and three-input NAND gates are shown in Figs A.5(b) and A.5(c) respectively. The expression for the output of the NAND gate can be written as X = (A.B.C…)′, the complement of the AND of its inputs.
Truth Table
Inputs Outputs
A B X
0 0 1
0 1 1
1 0 1
1 1 0
Truth Table
Inputs Outputs
A B C X
0 0 0 1
0 0 1 1
0 1 0 1
0 1 1 1
1 0 0 1
1 0 1 1
1 1 0 1
1 1 1 0
NOR Gate The NOR gate is the combination of an OR gate and a NOT gate, as shown in Fig. A.6(a). The output assumes logic state 1 only when each of the inputs assumes logic state 0; the output assumes logic state 0 for all other combinations of inputs. The graphic symbols and truth tables of two-input and three-input NOR gates are shown in Figs A.6(b) and A.6(c) respectively. The expression for the output of the NOR gate can be written as X = (A + B + C + …)′.
Truth Table
Inputs Outputs
A B X
0 0 1
0 1 0
1 0 0
1 1 0
Truth Table
Inputs Outputs
A B C X
0 0 0 1
0 0 1 0
0 1 0 0
0 1 1 0
1 0 0 0
1 0 1 0
1 1 0 0
1 1 1 0
Realization of Other Logic Functions Using NAND/NOR Gates The NAND and NOR gates are said to be universal gates because, using only NAND or only NOR gates, the other logic operations can be realized. It is easier to fabricate NAND or NOR gates using IC (integrated circuit) technology than AND or OR gates. In addition, NAND and NOR gates consume less power. As a consequence, they are used as fundamental building blocks in fabricating digital devices. Figure A.7 shows how other gates can be realized using NAND gates.
Similarly, different gates can be realized using NOR gates.
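This universality is easy to demonstrate in software. A small Python sketch mirroring the NAND realizations of Fig. A.7 (function names are ours):

# Realizing NOT, AND and OR using only the NAND operation.
def nand(a, b):
    return 1 - (a & b)          # 0/1 inputs only

def not_(a):                    # NOT A = A NAND A
    return nand(a, a)

def and_(a, b):                 # A AND B = NOT(A NAND B)
    return nand(nand(a, b), nand(a, b))

def or_(a, b):                  # A OR B = (NOT A) NAND (NOT B)
    return nand(nand(a, a), nand(b, b))

for a in (0, 1):
    for b in (0, 1):
        print(a, b, and_(a, b), or_(a, b), not_(a))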
There are two types of logic circuits: (i) combinational and (ii) sequential.
The combinational circuit is one whose output state at any instant depends only on the states of the inputs at that time. Thus, a combinational circuit has no memory. Combinational circuits are realized by logic gates. Examples of combinational circuits are the adder, subtractor, multiplexer, decoder, demultiplexer, etc.
The sequential circuit is one whose output state at any instant depends not only on the present states of the inputs at that instant, but also on the prior input states. Thus, a sequential circuit has memory. Sequential circuits are realized by logic gates together with flip-flops. Examples of sequential circuits are registers, counters, etc.
Truth Table (half adder)
Inputs    Outputs
A B       S C
0 0 0 0
0 1 1 0
1 0 1 0
1 1 0 1
From this table, we conclude that the sum (S) bit is obtained by the XOR operation of A and B, and the carry (C) bit is obtained by the AND operation of A and B. Therefore,
S = A′.B + A.B′ = A ⊕ B
C = A.B
Hence, a half adder (HA) can be realized by using one XOR gate and one AND gate, as shown in Fig. A.8(a). Figure A.8(b) shows the block diagram of the half adder.
Truth Table (full adder)
Inputs    Outputs
A B Cin   S Cout
0 0 0 0 0
0 0 1 1 0
0 1 0 1 0
0 1 1 0 1
1 0 0 1 0
1 0 1 0 1
1 1 0 0 1
1 1 1 1 1
From the truth table, the expressions for sum S and carry-out Cout can be written as:
S = A′.B′.Cin + A′.B.Cin′ + A.B′.Cin′ + A.B.Cin = A ⊕ B ⊕ Cin   (1)
Cout = A′.B.Cin + A.B′.Cin + A.B.Cin′ + A.B.Cin = A.B + (A + B).Cin = A.B + (A ⊕ B).Cin   (2)
Now, expressions (1) and (2) can be realized using XOR, AND and OR gates to get the sum S and carry-out Cout, as shown in Fig. A.9. It is worth noting that a full adder is a combination of two half adders, as shown in Fig. A.9. The block diagram of a full adder (FA) is shown in Fig. A.10.
Figure A.9 Logic diagram of full adder
Figure A.10 Block diagram of full adder
BCD Adder The BCD number is introduced in Chapter 2. When two BCD digits are added together with a possible carry from the previous stage, the maximum possible sum is 19: since the maximum BCD digit is 9, we have 9 + 9 + 1 (carry from the previous stage) = 19. When two BCD digits are added using a 4-bit binary adder, the sum will be in binary form and will range from 0 to 19 in equivalent decimal. The binary sum and BCD sum in this range are shown in Table A.1. By observing the table, it is clear that the two sums are identical up to decimal 9, and thus no correction is required. When the sum is greater than 9, we get an invalid BCD sum representation; the binary 0110 (i.e. 6) is added to the binary sum to get the valid BCD sum representation. This will generate an output carry as required.
The BCD adder must be able to perform the following:
(a) Add two BCD digits using one 4-bit parallel adder.
(b) If the binary sum of this adder is greater than or equal to binary 1010 (i.e. decimal 10), add binary 0110 (decimal 6) to the sum and generate the required output carry.
From the table, the logic circuit for the necessary correction can be derived. It is clear that a correction is needed when the binary sum from the 4-bit parallel adder is greater than 01001 (i.e. decimal 9). In other words, the BCD carry X will be in the high state when either of the following conditions occurs:
1. When C = 1 (sums greater than or equal to 16)
2. When S3 = 1 and either S2 or S1 is 1 (sums in the range 10 to 15). Since the decimal numbers 8 and 9 also have S3 = 1, the S2 and S1 bit positions are used to distinguish the decimal numbers 10 to 15 from 8 and 9.
Table A.1 (contd.)
Binary sum (C S3 S2 S1 S0)    BCD sum (X S3 S2 S1 S0)    Decimal
0 0 1 1 0 0 0 1 1 0 6
0 0 1 1 1 0 0 1 1 1 7
0 1 0 0 0 0 1 0 0 0 8
0 1 0 0 1 0 1 0 0 1 9
0 1 0 1 0 1 0 0 0 0 10
0 1 0 1 1 1 0 0 0 1 11
0 1 1 0 0 1 0 0 1 0 12
0 1 1 0 1 1 0 0 1 1 13
0 1 1 1 0 1 0 1 0 0 14
0 1 1 1 1 1 0 1 0 1 15
1 0 0 0 0 1 0 1 1 0 16
1 0 0 0 1 1 0 1 1 1 17
1 0 0 1 0 1 1 0 0 0 18
1 0 0 1 1 1 1 0 0 1 19
Thus, the condition for correction and the output carry can be expressed as:
X = C + S3.S2 + S3.S1
The BCD adder can be implemented as follows (see Fig. A.11). First, the two BCD digits are added using a simple 4-bit parallel adder, which produces the sum C S3 S2 S1 S0. A correction is needed when this sum is equal to or greater than 10, which is detected by the above expression for X. Lastly, another 4-bit parallel adder is used to add the binary 0110, produced by the correction logic, to the sum S3 S2 S1 S0. This produces the required BCD sum output, and the carry output X is used as the carry to the next BCD adder.
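The correction rule X = C + S3.S2 + S3.S1 can be verified with a short Python model of a one-digit BCD adder (an illustrative behavioural sketch, not the gate-level circuit of Fig. A.11):

# One-digit BCD adder: binary add, then correct by 6 when X = C + S3.S2 + S3.S1.
def bcd_add_digit(a, b, carry_in=0):
    total = a + b + carry_in                 # 4-bit parallel adder result (0..19)
    c = (total >> 4) & 1                     # carry bit C
    s3, s2, s1 = (total >> 3) & 1, (total >> 2) & 1, (total >> 1) & 1
    x = c | (s3 & s2) | (s3 & s1)            # correction / BCD output carry
    if x:
        total += 6                           # add binary 0110
    return total & 0xF, x                    # (BCD sum digit, carry to next digit)

print(bcd_add_digit(9, 9, 1))                # (9, 1): 19 in BCD is '1 1001'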
combinations of its n-bit inputs. The logic diagram of a 2-to-4 line decoder is shown in Fig. A.12. It has two input lines, A and B, and four output lines, D0, D1, D2 and D3. The truth table of the decoder is shown in Fig. A.12(b).
Some decoders have one or more enable inputs to control the operation of the circuit. For example, the 2-to-4 line decoder in Fig. A.12 has one enable line E. When E is high, the decoder is activated, and when E is low, it is disabled.
The logic circuit and function table for a 1-to-4 line DEMUX are shown in Fig. A.14. The two
select lines S0 and S1 enable only one AND gate at a time and the data (I) on the input line passes
through the selected gate to the associated output line.
flip-flop using the clock signal is called a clocked flip-flop. It has two outputs, one labeled Q for the normal value and another labeled Q′ for the complement value of the stored bit.
Even though a logic gate has no storage capability, several logic gates can be connected together to
form a storage cell. Depending on the interconnection pattern of logic gates and number of inputs
used, the flip-flops can be of different types. The common types of flip-flops are described next.
SR Flip-Flop The SR flip-flop has three input lines: one labeled S for set, one labeled R for reset and another labeled C for the clock signal. It has two outputs, one labeled Q for the normal value of the output and another labeled Q′ for its complement. A realization of an SR flip-flop using the universal NAND gate, together with its truth table, is shown in Fig. A.15.
Inputs Output
S R Q(t+1) Remarks
0 0 Q(t) No change
0 1 0 Reset
1 0 1 Set
1 1 ? Indeterminate
When no clock pulse is applied, the output of the flip-flop cannot change irrespective of the input
values at S and R. When clock pulse is applied, the flip-flop will be affected according to the input
values at S and R. Thus, the flip-flop is enabled when clock is applied to the circuit. The truth table is
given for enabled flip-flop. Q(t) and Q(t+1) indicate the binary state of the Q output at given time t
and next time t+1 respectively. The SR flip-flop has invalid (indeterminate) state when both S and R
are logic ‘1’ simultaneously.
D Flip-Flop The D (data) flip-flop has only one input line and is obtained from the SR flip-flop by putting an inverter between the S and R input lines. Thus a D flip-flop can be realized using an SR flip-flop, as shown in Fig. A.16. If the input D has logic state 1, the output of the flip-flop assumes state 1; if the input D has logic state 0, the output assumes state 0. This is shown in the truth table.
Input Output
D Q(t + 1) Remarks
0 0 Reset
1 1 Set
JK Flip-Flop The JK flip-flop is a versatile and the most widely used flip-flop. The operation of the JK flip-flop is the same as that of the SR flip-flop, except that it has no indeterminate state when both inputs are at logic '1'. In this situation, the state of the output is changed to the complement of the previous state. A realization of a JK flip-flop using the universal NAND gate, together with its truth table, is shown in Fig. A.17.
Inputs Output
J K Q(t+1) Remarks
0 0 Q(t) No change
0 1 0 Reset
1 0 1 Set
1 1 Q′(t) Toggle
T Flip-Flop A T flip-flop has one input line, labeled T for toggle, and acts as a toggle switch; toggling means switching over to the opposite state. It can be realized using a JK flip-flop with inputs T = J = K, as shown in Fig. A.18.
Input Output
T Q(t+1) Remarks
0 Q(t) No change
1 Q′(t) Toggle
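The truth tables of the four flip-flops amount to simple next-state functions, which can be written down and exercised in Python (a behavioural sketch; clocking and propagation delays are ignored):

# Next-state functions of the four flip-flops, per the truth tables above.
def sr_next(q, s, r):
    assert not (s and r), "S = R = 1 is the indeterminate state"
    return 1 if s else (0 if r else q)

def d_next(q, d):
    return d                        # output follows the D input

def jk_next(q, j, k):
    if j and k:
        return 1 - q                # toggle: complement of the previous state
    return 1 if j else (0 if k else q)

def t_next(q, t):
    return jk_next(q, t, t)         # a T flip-flop is a JK with J = K = T

q = 0
for t in (1, 1, 0, 1):              # clock a T flip-flop with an input sequence
    q = t_next(q, t)
    print(q)                        # prints 1 0 0 1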
Shift Registers A register is a collection of flip-flops, each capable of storing one bit of information. An 8-bit register has 8 flip-flops and is capable of storing 8-bit data. In a shift register, the flip-flops are connected in series: the output of each flip-flop is connected to the input of the adjacent flip-flop in the register. The content of a shift register can be shifted within the register without changing the order of the bits. Data can be shifted one position left or right at a time when one clock pulse is applied to the register. Based on the ways of loading and reading out the data, shift registers can be classified into four categories:
• Serial-in, serial-out
• Serial-in, parallel-out
• Parallel-in, serial-out
• Parallel-in, parallel-out
Counters A digital counter (simply counter) is a set of flip-flops whose states change in response to clock pulses applied at the input to the counter. The flip-flops are interconnected in such a way that their combined state at any time is the binary equivalent of the total number of pulses applied up to that time. Thus, counters are used to count clock pulses or time intervals.
Counters can be of two types: asynchronous and synchronous.
In asynchronous counters (also called ripple counters), the flip-flops (FFs) within the counter do not
change the states at exactly the same time. This is because FFs are not clocked simultaneously.
Synchronous counters are counters in which all the FFs are clocked simultaneously. Synchronous
counters are faster than asynchronous counters, because the propagation delay is less.
A counter can be an up-counter or a down-counter. An up-counter counts in the upward direction, i.e. 0, 1, 2, …, N−1; a down-counter counts in the downward direction, i.e. N−1, N−2, …, 1, 0. Each count of the counter is called a state of the counter. The total number of states through which the counter can progress is called the modulus of the counter. A 3-bit counter is often referred to as a modulus-8 (or mod-8) counter since it has eight states. Similarly, a 4-bit counter is a mod-16 counter.
Asynchronous Counters An asynchronous counter can be realized by the use of JK flip-flops, as shown in Fig. A.20. The flip-flops are connected in series: the Q-output of each FF is connected to the clock input of the next FF. Thus, the output of each FF drives the next FF; in this respect, the counter is called a ripple counter or serial counter. All the J and K inputs are tied to Vcc, which means that each FF will change state (toggle) on a negative transition at its clock input.
Clock States
pulses Q2 Q1 Q0
0 0 0 0
1 0 0 1
2 0 1 0
3 0 1 1
4 1 0 0
5 1 0 1
6 1 1 0
7 1 1 1
8 0 0 0
The counter shown in Fig. A.20 is a ripple up-counter. Similarly, a down-counter can be realized,
as shown in Fig. A.21. The clock is still used at the clock input of the first FF, but the complement of
Q0 is used to drive second FF in the series. Similarly, the complement of Q1 is used to drive next FF.
Clock States
pulses Q2 Q1 Q0
0 1 1 1
1 1 1 0
2 1 0 1
3 1 0 0
4 0 1 1
5 0 1 0
6 0 0 1
7 0 0 0
8 1 1 1
Synchronous Counters The ripple counter is the simplest to build, but each FF has a propagation delay. This problem can be eliminated in synchronous counters.
In synchronous counters, all FFs are triggered simultaneously by the same clock pulse, so that all FFs change their state at the same time. For this reason, these counters are also called parallel or carry look-ahead counters. Figure A.22 shows the schematic diagram of a 3-bit synchronous up-counter.
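The ripple behaviour of the asynchronous up-counter of Fig. A.20 can be imitated in Python: each toggle FF flips on the 1 → 0 transition of the stage before it (a behavioural sketch that ignores propagation delay):

# Behavioural sketch of a 3-bit ripple up-counter made of toggle flip-flops.
def clock_pulse(q):                 # q = [Q0, Q1, Q2], least significant first
    for i in range(len(q)):
        q[i] ^= 1                   # this stage toggles
        if q[i] == 1:               # no 1 -> 0 transition: the ripple stops here
            break
    return q

q = [0, 0, 0]
for pulse in range(1, 9):
    clock_pulse(q)
    print(pulse, q[2], q[1], q[0])  # Q2 Q1 Q0: 001, 010, ..., 111, 000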
The MOS (Metal Oxide Semiconductor) logic uses metal oxide semiconductor field effect transistors
(MOSFETs). Compared to the bipolar logic families, the MOS families are simpler, inexpensive to
fabricate and require much less power. There are three logic families constructed using MOS transis-
tors:
l PMOS: Using p-channel transistors
l NMOS: Using n-channel transistors
l CMOS: Using both p-channel and n-channel transistors
PMOS is the slowest and oldest type and is nearly obsolete today. NMOS is widely used in memories and microprocessors. CMOS is suitable for individual logic circuit design and consumes very little power.
NMOS Transistors An n-channel transistor is said to be of NMOS-type and it behaves as a closed
(on) switch when its gate input is equivalent to positive voltage Vcc (i.e. Vin = Vcc), as shown in
Fig. A.23(a) and it acts as an open (off) switch when its gate input is 0 (i.e. Vin = 0), as shown in
Fig. A.23(b).
A NOT gate can be realized using NMOS transistor, as shown in Fig. A.24. The transistor T acts
as a switch.
When gate input voltage, Vin = 0, the transistor T acts as an open switch and output voltage, Vout =
Vcc. When Vin = Vcc, the transistor T acts as a closed switch and output voltage Vout is near to 0.
Thus, the circuit performs the function of a NOT gate.
PMOS Transistors A p-channel transistor is said to be of PMOS-type and it behaves as an open
switch when the gate input, Vin = Vcc, as indicated by Fig. A.25(a). It acts as a closed switch when Vin
= 0 V, as indicated by the Fig. A.25(b).
Note that the PMOS transistor has a bubble on the gate input in its graphical symbol, which indicates that its functionality is complementary to that of an NMOS transistor. Also note that in PMOS transistors, the positions of the source and drain terminals are reversed in comparison to NMOS transistors.
CMOS Transistors A CMOS (complementary metal oxide semiconductor) transistor uses both n-
channel and p-channel transistors to take several advantages over PMOS and NMOS transistors. The
CMOS transistor is faster and consumes less power than other two types. The basic idea of CMOS
circuit is illustrated by the inverter circuit in Fig. A.26.
When Vin = Vcc, transistor T1 is turned off and transistor T2 is turned on; thus the output voltage Vout drops to 0. When Vin is supplied with 0 V, transistor T1 is turned on and transistor T2 is turned off; thus the output voltage Vout rises to Vcc. Hence, the logic values at the input Vin and the output Vout are complements of each other, and the circuit acts as a NOT gate.
In the same way, other logic gates can be realized using CMOS transistors.
2004
Computer Organization
Paper Code: CS-303 Semester: 3rd
Time Allotted: 3 hours Full Marks: 70
The questions are of equal value. The figures in the margin indicate full marks
Candidates are required to give their answers in their own words as far as practicable.
Answer any five Questions
1. (a) Explain the reading and writing operations of a basic static MOS cell. 6
(b) Why does a DRAM cell need refreshing? 2
(c) Why is a hierarchical memory organization needed? 2
(d) Explain the memory hierarchy pyramid, showing both primary and secondary memory in the diagram, and also explain the relationship of cost, speed and capacity. 4
Answer
(a) See answer of question no. 7 (b) 2007 (CS-303).
(b) See answer of question no. 8 (b) (i) 2006 (CS-303).
(c) See answer of question no. 10 (a) 2007 (CS-404).
(d) The total memory capacity of a computer can be considered as being a hierarchy of components.
The memory hierarchy system consists of all storage devices used in a computer system and are
broadly divided into following four groups, shown in figure on next page.
• Secondary (auxiliary) memory
• Main (primary) memory
• Cache memory
• Internal memory
Secondary Memory: The slow-speed, low-cost devices that provide backup storage are called secondary memory. The most commonly used secondary memories are magnetic disks, such as hard disks and floppy disks, and magnetic tapes. This type of memory is used for storing all programs and data, as it is used in bulk. When a program not residing in main memory is needed for execution, it is transferred from secondary memory to main memory. Programs not currently needed in main memory (in other words, programs not currently being executed by
the processor) are transferred into secondary memory to provide space for currently used pro-
grams and data.
Main Memory: This is the memory that communicates directly with CPU. Only programs and
data currently needed by the CPU for execution reside in the main memory. Main memory
occupies central position in hierarchy by being able to communicate directly with CPU and with
secondary memory devices through an I/O processor.
Cache Memory: This is a special high-speed memory, sometimes used to increase the speed of processing by making the current programs and data available to the CPU at a rapid rate. Generally, the CPU is faster than main memory, with the result that processing speed is limited mainly by the speed of main memory. A technique used to compensate for the speed mismatch between the CPU and main memory is to place an extremely fast, small cache between them, whose access time is close to the CPU cycle time. The cache is used for storing portions of programs currently being executed in the CPU and temporary data frequently needed in the present computations. Thus, the cache memory acts as a buffer between the CPU and main memory. By making programs and data available at a rapid rate, it is possible to increase the performance of the computer.
Internal memory: This memory refers to the high-speed registers used inside the CPU. These
registers hold temporary results when a computation is in progress. There is no speed disparity
between these registers and the CPU because they are fabricated with the same technology.
However, since registers are very expensive, only a few registers are used as internal memory.
2. (a) Briefly explain the two “write” policies; write through and write- back for cache design.
What are the advantages and disadvantages of both the methods? 5
(b) Explain the difference between full associative and direct mapped cache mapping approaches.
3
(c) Given the following, determine size of the sub-fields (in bits) in the address for Direct
Mapping, associative and set associative mapping cache schemes: We have 256 MB main
memory and 1MB cache memory. The address space of this processor is 256 MB. The
block size is 128 bytes. There are 8 blocks in a cache set. 6
Answer
(a) See answer of question no. 7 (d) of 2006 (CS-303).
(b) See answer of question no. 3 (a) of 2007 (CS-303).
(c) Given, main memory = 256 MB, so the address is 28 bits; the block size is 128 bytes, so the word (byte-offset) field is 7 bits; the cache holds 1 MB / 128 B = 8192 blocks.
Direct mapping: block field = 13 bits, tag = 28 − 13 − 7 = 8 bits.
Associative mapping: tag = 28 − 7 = 21 bits.
Set-associative mapping: 8192/8 = 1024 sets, so the set field = 10 bits and the tag = 28 − 10 − 7 = 11 bits.
3. (a) Give the main reason why DMA based I/O is better in some circumstances than interrupt
driven I/O. 3
(b) Explain the basic Direct Memory Access (DMA) operation for transfer of data bytes be-
tween memory and peripherals. 5
(c) What is programmed I/O technique? Why is it not very useful? 3
(d) What are the different types of interrupt? Give examples. 3
Answer
(a) See answer of question no. 9 (b) of 2007 (CS-303).
(b) See answer of question no. 9 (a) of 2007 (CS-303).
(c) See answer of question no. 9 (c) of 2007 (CS-303).
Step 4: As Q[0] = 1 and Q[−1] = 1,
ARS(AQ): 1001 0001 01011 0
Since the size register becomes 0, the algorithm terminates and the product is AQ = 0001 0101, which shows that the product is a positive number. The result is 21 in decimal.
(c) See answer of question no. 11 (b) of 2007 (CS-404).
5. (a) Give the merits and demerits of the floating point and fixed point representations for storing
real numbers. 4
(b) What are guard bits? 2
(c) A floating point number system uses 16 bits for representing a number. The most significant bit is the sign bit. The least significant nine bits represent the mantissa and the remaining 6 bits represent the exponent. Assume that the numbers are stored in the normalized format with one hidden bit. 2+2
(i) Give the representation of −1.6 × 10^3 in this number system.
(ii) What is the value represented by 0 000100 110000000?
(d) Give the timing diagrams of basic memory read and writing operations. 3
Answer
(a) Merits of fixed-point representation:
(i) This method of representation is suitable for representing integers in registers.
(ii) Very easy to represent, because it uses only one field: magnitude field.
Demerits of fixed-point representation:
(i) Range of representable numbers is restricted.
(ii) It is very difficult to represent complex fractional numbers.
(iii) Since there is no standard representation method for it, it is sometimes confusing to represent a number in this method.
Merits of floating-point representation:
(i) By this method, any type and any size of numbers can be represented easily.
(ii) There are several standardized representation methods for this.
Demerits of floating-point representation:
(i) Relatively complex representation, because it uses basically two fields: mantissa and expo-
nent fields.
(ii) Length of register for storing floating-point numbers is large.
(b) When the mantissa is shifted right, some bits at the rightmost (least significant) positions are lost. In order to obtain maximum accuracy in the final result, one or more extra bits, known as guard bits, are included in the intermediate steps. These bits temporarily hold the bits recently shifted out from the rightmost side of the mantissa. When the number has to be finally stored in a register or in memory as the result, the guard bits are not stored. However, based on the guard bits, the value of the mantissa can be made more precise by one of three rounding techniques: chopping, von Neumann rounding and rounding.
(c) (i) The representation of −1.6 × 10^3 is: 1 001010 100100000
(d) Similar to the memory read, to write a memory word, the address is specified on the address bus to select the word among the many available words during the first clock period T1. During the second clock period T2, the memory write signal is activated, and on seeing the write signal activated, the memory stores the data from the data bus. The total operation needs three clock periods.
After CALL SUB1() has been fetched, the program counter (PC) contains the address of the "next instruction" immediately following the CALL. This address in the PC is saved on the stack; it is the return address to the main program. The PC is then loaded with the address of the first executable instruction of subroutine SUB1(), and the processor continues executing its code. Control is returned to the main program from the subroutine by executing RETURN, which pulls the return address (i.e. the address of the next instruction) off the stack and puts it in the PC.
Since the last item stored on the stack is the first item to be removed from it, the stack is well
suited to nested subroutines. That is, a subroutine is able to call another subroutine and this
process is repeated many times. Eventually, the last subroutine called completes its computa-
tions and returns to the subroutine that called it. The return address needed for this first return is
the last one generated in the nested call sequence. That is, return addresses are generated and
used in last-in-first-out order. This suggests that the return addresses associated with subroutine
calls should be pushed onto a stack.
(d) To obtain the addresses of operands, addressing modes are needed. A processor can support various addressing modes in order to give flexibility to the users. The addressing mode specifies the way the addresses of operands are determined; a small sketch after the list below illustrates a few of them.
The various addressing modes are discussed next.
1. Implied (or Inherent) Mode: In this mode the operands are indicated implicitly by the
instruction. The accumulator register is generally used to hold the operand and after the
instruction execution the result is stored in the same register. For example,
(a) RAL; Rotates the content of the accumulator left through carry.
(b) CMA; Takes complement of the content of the accumulator.
2. Immediate Mode: In this mode the operand is mentioned explicitly in the instruction. In other words, an immediate-mode instruction contains an operand value rather than its address in the address field. This mode of instruction is useful for initializing registers to constant values. For example:
(a) MVI A, 06; Loads equivalent binary value of 06 to the accumulator.
(b) ADI 05; Adds the equivalent binary value of 05 to the content of AC.
3. Stack addressing Mode: Stack-organized computers use stack addressed instructions. In this
addressing mode, all the operands for an instruction are taken from the top of the stack. The
instruction does not have any operand field. For example, the instruction
SUB
uses only one op-code (SUB) field, no address field. Both the operands are in the topmost
two positions in the stack, in consecutive locations. When the SUB instruction is executed,
two operands are popped out automatically from the stack one-by-one. After subtraction,
the result is pushed onto the stack. Since no address field is used, the instruction is short.
4. Register (Direct) Mode: In this mode the processor registers hold the operands. In other
words, the address field is now register field, which contains the operands required for the
instruction. For example:
ADD R1, R2; Adds contents of registers R1 and R2 and stores the result in R1.
5. Register Indirect Mode: In this mode the instruction specifies an address of CPU register
that holds the address of the operand in memory. In other words, address field is a register
which contains the memory address of operand.
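As noted above, the first few of these modes can be mimicked with a small operand-resolution function in Python (the register names, memory contents and mode names are ours, purely illustrative):

# Resolving an operand under a few of the addressing modes described above.
regs = {"R1": 7, "R2": 0x20}        # register file
mem = {0x20: 42}                    # main memory

def operand(mode, field):
    if mode == "immediate":         # the field is the operand itself
        return field
    if mode == "register":          # the field names a register holding the operand
        return regs[field]
    if mode == "register_indirect": # the register holds the memory address of the operand
        return mem[regs[field]]
    raise ValueError(mode)

print(operand("immediate", 5))              # 5
print(operand("register", "R1"))            # 7
print(operand("register_indirect", "R2"))   # 42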
Answer
(a) Parts of programs and data are brought into main memory from secondary memory as the CPU needs them. Virtual memory is a technique used in some large computer systems which gives the programmer the illusion of having a large main memory, although this may not be the case. The size of the virtual memory is equivalent to the size of the secondary memory. Each address referenced by the CPU, called the virtual (logical) address, is mapped to a physical address in main memory. This mapping is done during run-time and is performed by a hardware device called the memory-management unit (MMU) with the help of a memory map table, which is maintained by the operating system.
The virtual memory makes the task of programming much easier, because the programmer no
longer needs to bother about the amount of physical memory available. For example, a program
size is 18 MB and the available user part of the main memory is 15 MB (Other part of the main
memory is occupied by the operating system). First, 15 MB of the program is loaded into main
memory and then remaining 3 MB is still in the secondary memory. When the remaining 3 MB
code is needed for execution, swap out the 3 MB code from main memory to secondary memory
and swap in new 3 MB code from secondary memory to main memory.
The advantage of virtual memory is efficient utilization of main memory, because the larger
size program is divided into blocks and partially each block is loaded in the main memory
whenever it is required. Thus, multiple programs can be executed simultaneously. The technique
of virtual memory has other advantages of efficient CPU utilization and improved throughput.
(b) When a program starts execution, one or more pages are brought to the main memory and the
page table is responsible to indicate their positions. When the CPU needs a particular page for
execution and that page is not in main (physical) memory (still in the secondary memory), this
situation is called page fault. When the page fault occurs, the execution of the present program
is suspended until the required page is brought into main memory from secondary memory.
(c) TLB: In the paging scheme, the main memory is accessed twice to retrieve a data item: once for accessing the page table and once for accessing the data itself. Since the access time of main memory is large, a technique is adopted to speed up data retrieval. A fast associative memory called the translation look-aside buffer (TLB) is used to hold the most recently used page table entries. Whenever the CPU needs to access a particular page, the TLB is accessed first. If the desired page table entry is present in the TLB, it is called a TLB hit, and the frame number is retrieved from the table to form the physical address in main memory. If the desired page table entry is not present in the TLB, it is called a TLB miss, and the CPU then searches the original page table in main memory for the desired entry. The organization of an address translation scheme that includes a TLB is shown in the figure on the next page.
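A minimal Python model of this lookup order (the page size, table contents and names are ours, purely illustrative):

# Address translation with a TLB in front of the page table.
PAGE_SIZE = 4096
tlb = {}                                  # page number -> frame number (recent entries)
page_table = {0: 5, 1: 9, 2: 3}           # full mapping kept in main memory

def translate(virtual_addr):
    page, offset = divmod(virtual_addr, PAGE_SIZE)
    if page in tlb:                       # TLB hit: no page-table access needed
        frame = tlb[page]
    else:                                 # TLB miss: walk the page table in memory
        frame = page_table[page]
        tlb[page] = frame                 # cache the entry for next time
    return frame * PAGE_SIZE + offset

print(translate(8192 + 100))              # page 2 -> frame 3: 12388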
(d) See answer of question no. 8 (C) of 2007 (CS-303).
8. (a) What is a control memory and control word? 2
(b) Explain the organization of control memory. How is control unit implemented using control
memory? 3
(c) Write a symbolic microprogram of a basic FETCH routine. 5
(d) Briefly explain a vertical microinstruction format and a horizontal microinstruction format.
4
Answer
(a) In the microprogrammed control unit, all control functions that can be simultaneously activated
are grouped to form control words stored in a separate ROM memory called the control memory.
Each control word contains signals to activate one or more micro-operations.
(b) Like conventional program, retrieval and interpretation of the control words are done. The
instructions of a CPU are stored in the main memory. They are fetched and executed in a
sequence. The CPU can perform different functions simply by changing the instructions stored
in the main memory. Similarly, when control words are retrieved in a sequence, a set of micro-
operations are activated that will complete the desired task. The control unit can execute a
different control operation by changing the control words (also called microinstructions) of the CM. Usually, all microinstructions have three important fields:
• Control field
• Next-address field
• Condition for branching.
To reduce the hardware cost of CM, it is desired to give more emphasis on minimizing the
length of the microinstruction. The length of the microinstruction decides the size of the control
memory, as well as the cost involved with this approach. The following factors are directly
involved with the length of a microinstruction:
• How many micro-operations can be activated simultaneously (the degree of parallelism).
• The control field organization.
• The method by which the address of the next microinstruction is specified.
(c) The FETCH microprogram controls the instruction-fetch cycle. Its symbolic microprogram is as follows:
AR ← PC; Memory address register is loaded with the content of the PC (program counter)
In a vertical microinstruction (shown in the figure below), a single field can produce an encoded sequence; the encoded technique is followed in this organization. The vertical microprogram technique takes more time to generate the control signals because of the decoding time, and more microinstructions are needed; but the overall cost is less, since the microinstructions are small in size. The horizontal microprogram releases faster control signals, but the control memory size is huge owing to the increased word length. Thus, vertical microinstructions are characterized by:
• Short formats.
• Limited ability to express parallel microoperations.
• Considerable encoding of the control information.
In the IBM 370/model 145, the microinstructions used in control unit follow vertical format.
9. (a) What are the various modes of data transfer between computer and peripherals? Explain. 5
(b) Differentiate isolated I/O and memory mapped I/O. 4
(c) Show how computer bus is organized using tri-state buffer. 5
Answer
(a) The modes of data transfer between computer and peripherals are:
1. Programmed I/O.
2. Interrupt-initiated I/O.
3. Direct memory access (DMA).
Programmed I/O:
This is the software method where CPU is needed all the times during data transfer between any
two devices. Programmed I/O operations are the result of I/O instructions written in the com-
puter program or I/O routine. Each data item transfer is initiated by an instruction in the
program or I/O routine. Generally, the transfer is to and from a CPU register and peripheral.
Transferring data under program control requires constant monitoring of the peripheral by the
CPU. Once a data transfer is initiated, the CPU is required to monitor the interface to see when
a transfer can again be made.
Interrupt-initiated I/O
In the programmed I/O method, the program constantly monitors the device status. Thus, the CPU stays in the program until the I/O device indicates that it is ready for data transfer. This is a time-consuming process since it keeps the CPU busy needlessly. It can be avoided by letting the
device controller continuously monitor the device status and raise an interrupt to the CPU as
soon as the device is ready for data transfer. Upon detecting the external interrupt signal, the
CPU momentarily stops the task it is processing, branches to an interrupt-service-routine (ISR)
or I/O routine or interrupt handler to process the I/O transfer, and then returns to the task it was
originally performing. Thus, in the interrupt-initiated mode, the ISR software (i.e. CPU) per-
forms data transfer but is not involved in checking whether the device is ready for data transfer
or not. Therefore, the execution time of CPU can be optimized by employing it to execute
normal program, while no data transfer is required.
Direct Memory Access (DMA)
To transfer large blocks of data at high speed, this third method is used. A special controlling
unit may be provided to allow a block of data to be transferred directly between a high-speed
external device, such as a magnetic disk, and the main memory, without continuous intervention
by the CPU. This method is called direct memory access (DMA).
DMA transfers are performed by a control circuit that is part of the I/O device interface. We
refer to this circuit as a DMA controller. The DMA controller performs the functions that would
normally be carried out by the CPU when accessing the main memory. During a DMA transfer,
the CPU is idle or can be utilized to execute another program, and it has no control of the
memory buses. A DMA controller takes over the buses to manage the transfer directly between
the I/O device and the main memory.
The CPU can be placed in an idle state using two special control signals, HOLD and HLDA
(hold acknowledge). The following figure shows two control signals in the CPU that character-
ize the DMA transfer. The HOLD input is used by the DMA controller to request the CPU to
release control of buses. When this input is active, the CPU suspends the execution of the
current instruction and places the address bus, the data bus and the read/write line into a high-
impedance state. The high-impedance state behaves like an open circuit, which means that the
output line is disconnected from the input line and does not have any logic significance. The
CPU activates the HLDA output to inform the external DMA controller that the buses are in the
high-impedance state. Control of the buses is then taken by the DMA controller that
generated the bus request, to conduct memory transfers without processor intervention. After the
transfer of data, the DMA controller disables the HOLD line. The CPU then disables the HLDA
line and regains the control of the buses and returns to its normal operation.
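The HOLD/HLDA exchange can be pictured as a short event sequence. The following Python
sketch is only a toy software model of that sequence; the names (bus_owner, dma_block_transfer,
the memory list) are illustrative and not part of any real hardware interface.

    # Toy model of the HOLD/HLDA exchange (all names are illustrative).
    memory = [0] * 16
    bus_owner = "CPU"

    def dma_block_transfer(data, base):
        global bus_owner
        # 1. DMA controller asserts HOLD; the CPU floats the address bus,
        #    data bus and read/write line (high-impedance state).
        # 2. CPU asserts HLDA; the DMA controller becomes bus master.
        bus_owner = "DMA"
        for i, word in enumerate(data):
            memory[base + i] = word      # direct device-to-memory transfer
        # 3. DMA controller drops HOLD; the CPU drops HLDA and reclaims the buses.
        bus_owner = "CPU"

    dma_block_transfer([7, 8, 9], base=4)
    print(bus_owner, memory[4:7])        # -> CPU [7, 8, 9]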
(b) 1. In isolated (I/O mapped) I/O, computers use one common address bus and data bus to
transfer information between memory or I/O and the CPU, but use separate read-write
control lines, one set for memory and another for I/O. In memory mapped I/O, com-
puters use only one set of read and write lines along with the same set of address and data
buses for both memory and I/O devices.
2. The isolated I/O technique isolates all I/O interface addresses from the addresses assigned
to memory. Whereas, the memory mapped I/O does not distinguish between memory and
I/O addresses.
3. Processors use different instructions for accessing memory and I/O devices in isolated I/O.
In memory mapped I/O, processors use the same set of instructions for accessing memory
and I/O.
4. Thus, the hardware cost is more in isolated I/O relative to memory mapped I/O, because
two separate sets of read-write lines are required in the first technique.
(c) See answer of question no. 10 (e) of 2007 (CS-303).
10. Write short notes on any four of the following: 2.5 × 4
(a) Carry Save Adder.
(b) Magnetic Recording.
(c) Asynchronous data transfer.
(d) ROM architecture.
(e) Daisy chaining.
(f) Cache Replacement Policies.
Answer
(a) In parallel processing and in multiplication and division, multi-operand addition is often en-
countered. More powerful adders are required, which can add many numbers together instead
of just two. Such a high-speed multi-operand adder is called a carry-save adder (CSA). To see
its effectiveness, consider the following example:
34
62
58
76
10 ← Sum vector
22 ← Carry vector
230 ← Final result.
In this example, four decimal numbers are added. First, the units place digits are added,
producing a sum of 0 and a carry digit of 2. Similarly, the tens place digits are added, producing
a sum of 1 and a carry digit of 2. These summations can be performed in parallel to produce a
sum vector of 10 and a carry vector of 22, because there is no carry propagation from the units
place digit to the tens place digit. When all digits of the operands are added, the sum and the
shifted carry vector are added in the conventional manner, i.e. using either a CPA or a CLA,
which produces the final answer.
The CSA takes three numbers as inputs, say, X, Y and Z, and produces two outputs, the sum
vector S and carry vector C. The sum vector S and carry vector C are obtained by the following
relations:
S = X ⊕ Y ⊕ Z
C = XY + YZ + XZ; here all logical operations are performed bit-wise.
The final arithmetic sum of the three inputs, i.e. Sum = X + Y + Z, is obtained by adding the two
outputs, i.e. Sum = S + C (with the carry vector C shifted left one position), using a CPA or CLA.
Let us take one example to illustrate the CSA technique.
X   = 0 1 0 1 1 0
Y   = 1 1 0 0 1 1
Z   = 0 0 1 1 0 1
S   = 1 0 1 0 0 0
C   = 0 1 0 1 1 1
Sum = S + C (C shifted left one position) = 1 0 1 0 1 1 0
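The bit-wise relations for S and C are easy to check in a few lines of Python. This is a minimal
sketch of a single CSA stage applied to the example operands above (the function name is an
illustrative choice):

    def carry_save_add(x, y, z):
        # S = X xor Y xor Z and C = XY + YZ + XZ, both computed bit-wise.
        s = x ^ y ^ z
        c = (x & y) | (y & z) | (x & z)
        return s, c

    x, y, z = 0b010110, 0b110011, 0b001101
    s, c = carry_save_add(x, y, z)
    print(bin(s), bin(c))                 # -> 0b101000 0b10111
    assert s + (c << 1) == x + y + z      # the final CPA/CLA adds S to the shifted C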
(b) See answer of question no. 10 (a) of 2007 (CS-303).
(c) There is no common clock between the master and slave in asynchronous transfer. Each has its
own private clock for internal operations. This approach is widely used in most computers.
Asynchronous data transfer between two independent units requires that control signals be
transmitted between the communicating units to indicate the time at which data is being trans-
mitted. One simple way is to use a strobe signal supplied by one of the units to indicate to the
other unit when the transfer has to occur.
Strobe Control Technique
A single control line is used by the strobe control method of asynchronous data transfer to time
each transfer. The strobe may be activated by either the source or the destination unit.
The disadvantage of the strobe method is that the source unit that initiates the transfer cannot
know whether the destination unit has actually received the data item that was placed in the bus.
Similarly, a destination unit that initiates the transfer cannot know whether the source unit has
actually placed the data on the bus.
Handshaking Technique
To overcome this problem of strobe technique, another method commonly used is to accompany
each data item being transferred with a control signal that indicates the presence of data in the
bus. The unit receiving the data item responds with another control signal to acknowledge
receipt of the data. This type of agreement between two independent units is referred to as
handshaking mode of transfer.
The handshaking scheme provides a high degree of flexibility and reliability because the
successful completion of a data transfer relies on active participation by both units. However, a
slow speed destination unit can hold up the bus whenever it gets a chance to communicate.
Another drawback is that if one of the two communicating devices is faulty, the initiated data
transfer cannot be completed.
Examples of asynchronous transfer:
1. The Centronics printer interface follows the handshaking scheme.
2. Most microprocessors such as Motorola 88010 and Intel 80286 follow this bus transfer
mechanism.
(d) ROM is another part of main memory, which is used to store some permanent system programs
and system data. A ROM cell structure is shown in the figure below. A logic value 1 is stored in
the cell if the transistor is not connected to the ground at point P; otherwise, a binary 0 is stored.
The bit line is connected through a resistor to the power supply.
In order to read the state of the cell, the word line is activated to close the transistor, which
acts as a switch. If point P is connected to ground, the voltage on the bit line drops to near zero
and a 0 is read; if point P is not connected, the bit line stays high and the cell retains the state 1.
Data are written into ROM cells when the chip is manufactured.
(e) Daisy chaining is a centralized bus arbitration technique. A conflict may arise if a number of
DMA controllers, other controllers or processors try to access the common bus at the same
time, but access can be granted to only one of them. Only one processor or controller can be the
bus master; the bus master is the controller that has access to the bus at a given instant. To
resolve these conflicts, a bus arbitration procedure is implemented to coordinate the activities of
all devices requesting memory transfers.
During any bus cycle, the bus master may be any device—the processor or any DMA control-
ler unit, connected to the bus. The following figure illustrates the daisy chaining method.
All devices are effectively assigned static priorities according to their locations along a bus
grant control line (BGT). The device closest to the central bus arbiter is assigned the highest
priority. Requests for bus access are made on a common request line, BRQ. Similarly, the
common acknowledge signal line (SACK) is used to indicate the use of bus. When no device is
using the bus, the SACK is inactive. The central bus arbiter propagates a bus grant signal (BGT)
if the BRQ line is high and acknowledge signal (SACK) indicates that the bus is idle. The first
device that has issued a bus request receives the BGT signal and stops its further propaga-
tion. This sets the bus-busy flag in the bus arbiter by activating SACK, and the device assumes
bus control. On completion, it resets the bus-busy flag in the arbiter, and a new BGT signal is
generated if other requests are outstanding (i.e., BRQ is still active). A device that has not
issued a request simply passes the BGT signal to the next device in the line.
The main advantage of the daisy chaining method is its simplicity. Another advantage is
scalability. The user can add more devices anywhere along the chain, up to a certain maximum
value.
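The grant-propagation rule, namely that the first requesting device along the chain captures BGT
and blocks it from travelling further, can be sketched as follows (the function name and the list
representation of the BRQ lines are illustrative):

    def daisy_chain_grant(brq):
        # brq[i] is True if device i (device 0 is closest to the arbiter) raised BRQ.
        for position, requesting in enumerate(brq):
            if requesting:
                return position     # this device keeps BGT; propagation stops here
        return None                 # no request pending, so no BGT is issued

    print(daisy_chain_grant([False, True, True]))  # -> 1: device 1 wins over device 2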
(f) When a miss occurs in the cache memory, new data from main memory must replace old data
in a selected location of the cache. In a direct mapped cache, we have no choice and thus no
replacement algorithm is required: the new data has to be stored in the one cache location fixed
by the mapping rule. For associative mapping and set-associative mapping, we need a replace-
ment algorithm, since we have multiple choices of location. We outline below the most widely
used replacement algorithms.
First-In First-Out (FIFO) Algorithm This algorithm chooses the word that has been in the
cache for the longest time. In other words, the word that entered the cache first gets pushed out
first.
Least Recently Used (LRU) Algorithm This algorithm chooses for replacement the item that
has not been used by the CPU for the longest time in the recent past.
2004
Computer Organization
and Architecture
Paper Code: CS-404 Semester: 4th
Time Allotted: 3 hours Full Marks: 70
The questions are of equal value. The figures in the margin indicate full marks
Candidates are required to give their answers in their own words as far as practicable.
Answer any seven questions
1. (a) What is virtual memory? Why is it called virtual? 1+1
(b) What do you mean by logical address space and physical address space? 2
(c) Explain with an example how logical address is converted into physical address and also
explain how page replacements take place. 4
(d) Write the advantages of virtual memory system. 2
Answer
(a) See answer of question no. 2 of 2007 (CS-404).
(b) See answer of question no. 2 (b) of 2005 (CS-404).
(c) The memory-management unit (MMU) maps each logical address to a physical address during
program execution. The figure below illustrates this mapping method, which uses a special
register called base register or relocation register.
The content of the relocation register is added to every logical address generated by the user
program at the beginning of execution. For example, if the relocation register holds the address
value 2000, then a reference to location 0 by the user is dynamically relocated to address 2000.
A reference to address 150 is mapped to address 2150.
Page Replacement:
When a program starts execution, one or more pages are brought into the main memory and the
page table indicates their positions. When the CPU needs a particular page for execution and
that page is not in main (physical) memory (i.e., it is still in secondary memory), the situation is
called a page fault. When a page fault occurs, the execution of the present program is suspended
until the required page is brought into main memory from secondary memory. The required
page then replaces an existing page in the main memory. Thus, when a page fault occurs, a page
replacement algorithm is needed to select one of the existing pages to make room for the
required page. Several replacement algorithms are available, such as FIFO (First-In First-Out),
LRU (Least Recently Used) and the optimal page replacement algorithm.
The FIFO algorithm is the simplest and its criterion is: "select for replacement the page that
has been in the main memory for the longest period of time".
The LRU algorithm states: "select for replacement the page that has not been used for the
longest time in the recent past". The LRU algorithm is difficult to implement, because it requires
a counter for each page to keep track of the page's usage.
The optimal algorithm generally gives the lowest page-fault count of all the algorithms and its
criterion is: "replace the page that will not be used for the longest period of time". This algorithm
is also difficult to implement, because it requires future knowledge of page references.
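A small simulation makes the three criteria concrete. The sketch below counts page faults for
FIFO, LRU and the optimal (OPT) policy on a short reference string; the function name and the
reference string are illustrative assumptions, not taken from the question.

    def count_faults(refs, frames, policy):
        mem, faults = [], 0
        for i, page in enumerate(refs):
            if page in mem:
                if policy == "LRU":        # a hit refreshes the page's recency
                    mem.remove(page)
                    mem.append(page)
                continue
            faults += 1                    # page fault: page is not resident
            if len(mem) == frames:
                if policy == "OPT":        # evict the page used farthest in the future
                    victim = max(mem, key=lambda p:
                                 refs.index(p, i) if p in refs[i:] else len(refs))
                else:                      # FIFO and LRU both evict mem[0]
                    victim = mem[0]
                mem.remove(victim)
            mem.append(page)
        return faults

    refs = [7, 0, 1, 2, 0, 3, 0, 4, 2, 3]
    for policy in ("FIFO", "LRU", "OPT"):
        print(policy, count_faults(refs, 3, policy))   # -> FIFO 9, LRU 8, OPT 7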
(d) See answer of question no. 2 of 2007 (CS-404).
2. (a) Discuss about the different hazards in pipelining. 5
(b) Assuming no result forwarding and the five stage sample pipeline; draw a pipelined execu-
tion diagram for the following code fragment: 5
ADD r1, r2, r3
SUB r4, r5, r6
MUL r8, r9, r10
DIV r12, r13, r14
Answer
(a) See answer of question no. 9 (b) of 2006 (CS-404).
(b) There are no instruction hazards in this code fragment, so instructions proceed through the
pipeline at one stage per cycle. The five-stage instruction pipeline has the stages: instruction
fetch (IF), instruction decode (ID), operand fetch (OF), execution (EX) and write-back (WB).
The following diagram illustrates the pipeline execution flow.
is also assigned an interrupt acknowledge (INTA) line that the CPU uses to signal the controller
that it has received and begun to process the interrupt request by employing an ISR. Figure
below shows the hardware lines for implementing interrupts.
The interrupt controller uses a register called the interrupt-request mask register (IMR) to detect
any interrupt from the I/O devices. Suppose there are n I/O devices in the system. The IMR is
then an n-bit register in which each bit indicates the status of one I/O device. Let the IMR's
content be denoted E0 E1 E2 … En–1. When E0 = 1, the device 0 interrupt is recognized; when
E1 = 1, the device 1 interrupt is recognized, and so on. The processor uses a flag bit known as
interrupt enable (IE) in its status register (PS) to process the interrupt. When this flag bit is '1',
the CPU responds to the presence of an interrupt; otherwise it does not.
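As a small illustration, the IMR test is an ordinary bit-wise check. In the sketch below, the
names and the example bit pattern are assumptions made for illustration only:

    def pending_devices(imr, n=8):
        # Device i has a recognized interrupt when bit Ei of the IMR is 1.
        return [i for i in range(n) if (imr >> i) & 1]

    IE = 1                            # interrupt-enable flag in the status register
    imr = 0b00100101                  # hypothetical: devices 0, 2 and 5 interrupted
    if IE:
        print(pending_devices(imr))   # -> [0, 2, 5]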
(c) See answer of question no. 10 (e) of 2004 (CS-303).
4. Explain Booth’s Algorithm. Apply Booth’s algorithm to multiply the two numbers (+14)10, and
(–12)10. Assume the multiplier and multiplicand to be of 5 bits each. 10
Answer
For Booth’s algorithm: see answer of question no. 8(a) of 2007 (CS-303).
For multiplication example of Booth’s algorithm: see answer of question no. 11(a) of 2007
(CS-404).
5. Two 4 bit unsigned numbers are to be multiplied using the principle of carry save adders.
Assume the numbers to be A3 A2 A1 A0 and B3 B2 B1 B0. Show the arrangement and intercon-
nection of the adders and the input signals so as to generate an 8-bit product as P7 P6 P5 P4 P3 P2
P1 P0. 10
Answer
The multiplication of two unsigned numbers is done by repeated add-shift operations. Add-shift
multiplication of two 4-bit numbers is illustrated in the following figure.
The main memory communicates directly with CPU. Only programs and data currently needed
by the CPU for execution reside in the main memory. The auxiliary memory cannot directly
communicate with CPU. Generally main memory is volatile, i.e. when power is off the content
is lost. But auxiliary memory is non-volatile. Due to its low cost, secondary memory is used in
bulk. Main memory, however, is not used in large sizes because of its relatively high cost.
(b) (i) Magnetic tape is divided vertically into frames and horizontally into nine parallel tracks
where each frame is capable of storing 8-bit data and 1-bit parity information. Whereas,
each surface of a magnetic disk is divided into a number of concentric tracks, each consisting
of sectors. Generally, the minimum quantity of information that can be transferred is a sector.
(ii) The access mechanism in tape drive is sequential, but that in disk drive is sequential or
direct.
(c) Limitation of direct mapped cache: The hit ratio is not good, and frequent replacement of the
data-tag value is needed, because there is a fixed cache location for any given block. For
example, if a program happens to reference words repeatedly from two different blocks that map
into the same cache block, then the blocks need to be swapped in continuously, making the hit
ratio drop. A direct mapped cache of capacity m words can hold at most m words of main
memory at a time.
In a set-associative cache memory, two or more words can be stored under the same index
address. Each data word is stored together with its tag. The number of tag-data words under an
index is said to form a set. If k words with their associated tags (i.e. set size = k) are stored
under an index of the cache, then the cache memory is called k-way set-associative. Thus a
2-way set-associative memory stores two data words together with two tag values under a single
index address. That means a 2-way set-associative cache with the same number of index
addresses can hold 2 × m words of the main memory, and consequently the hit ratio improves.
7. (a) Explain the difference between direct and indirect address instructions. How many refer-
ences to memory are needed for each type of these instructions to bring an operand into
processor register? 4
(b) Show the bus connections with a CPU to connect four RAM chips of size 256 × 8 bit each
and a ROM chip of 512 × 8 bit size. Assume the CPU has 8 bit data bus and 16 bit address
bus. Clearly specify generation of chip select signals. 6
Answer
(a) In direct mode, the instruction contains the memory address of the operand explicitly. Thus, the
address part of the instruction is the effective address. Since the operand address is directly
available in the instruction, there is no need for the effective address calculation step. Hence the
instruction cycle time is reduced. Example of direct addressing is:
STA 2500H; Stores the content of the accumulator in the memory location 2500H.
In indirect mode, the instruction gives a memory address in its address field which holds the
address of the operand. Thus, the address field of the instruction gives the address where the
effective address is stored in memory. The following example illustrates the indirect addressing
mode:
MOV R1, (X); Content of the location whose address is given in X is loaded into register R1.
In this addressing mode, the operand address can be changed during the run-time of the
program without changing the instruction itself, which gives flexibility in programming. It is
very useful for implementing pointers in the C language. However, the instruction cycle time
increases, as there are two memory accesses.
In direct mode, only one memory reference is needed, but in indirect mode, two memory
references are needed.
(b) See answer of question no. 7 (c) of 2006 (CS-303).
8. (a) Using 8-bit two's complement integers, perform the following computations: 6
(i) –34 + (–12)
(ii) 17 – 35
(iii) –22 – 7
(iv) 18 – (–5)
(b) Compute the product of the following pair of unsigned integers. 4
Generate the full 8-bit result.
(i) 0b1001 × 0b0110
(ii) 0b1111 × 0b1111
Answer
(a) (i) In 2's complement representation, –34 = 1101 1110
–12 = 1111 0100
Adding these two numbers, we get 1 1101 0010, which is a 9-bit result. By the addition rule,
discard the 9th bit to get the result 1101 0010, whose sign bit shows that the number is
negative. To get the result in its familiar form, take the 2's complement: the answer is –46.
(ii) 17 – 35: This is a subtraction, so the 2's complement of 35 is added to 17. The
representation of 17 is 0001 0001 and the 2's complement of 35 is 1101 1101. After
addition, we get 1110 1110. This is a negative number and its value is –18.
(iii) –22 – 7: This is a subtraction, so the 2's complement of 7 is added to (–22). The 2's
complement of 7 is 1111 1001 and the representation of (–22) is 1110 1010. After adding
these two, we get 1 1110 0011, a 9-bit result. By the rule, discard the 9th bit to get the
result 1110 0011, whose sign bit shows it is a negative number. The result in decimal
is –29.
(iv) 18 – (–5) = 18 + 5. The representation of 18 is 0001 0010 and the representation of 5 is
0000 0101. After addition, we get 0001 0111, which is equivalent to decimal 23.
(b) (i) 0b1001 × 0b0110, i.e. 1001 × 0110. This can be written as
(1001 × 100) + (1001 × 10) = 100100 + 10010 = 0011 0110.
(ii) 0b1111 × 0b1111, i.e. 1111 × 1111. This can be written as
(1111 × 1000) + (1111 × 100) + (1111 × 10) + (1111 × 1) = 1111000 + 111100 + 11110 +
1111 = 1110 0001.
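All four additions in part (a) can be checked mechanically: encode each operand in 8 bits, add,
and discard any ninth (carry) bit. A minimal Python sketch, with illustrative helper names to8
and from8:

    def to8(x):        # encode a signed value in 8-bit 2's complement
        return x & 0xFF

    def from8(b):      # decode an 8-bit pattern as a signed value
        return b - 256 if b & 0x80 else b

    for a, b in [(-34, -12), (17, -35), (-22, -7), (18, 5)]:
        r = (to8(a) + to8(b)) & 0xFF   # any 9th (carry) bit is discarded
        print(a, b, from8(r), format(r, '08b'))
    # -> -46 = 11010010, -18 = 11101110, -29 = 11100011, 23 = 00010111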
9. (a) Where does the DMA mode of data transfer find its use? 2
(b) What are the different types of DMA controllers and how do they differ in their functioning?
3
(c) Draw the schematic diagram for daisy chain polling arrangement in case of non vectored
interrupt for three devices. 5
Answer
(a) To transfer large blocks of data at high speed, the DMA method is used. A special controlling
unit is provided to allow a block of data to be transferred directly between a high-speed external
device, such as a magnetic disk, and the main memory, without continuous intervention by the CPU.
The devices are placed in a chain fashion, with the highest priority device in the first place
(device 1), followed by lower priority devices. The priorities are assigned by the interrupt
controller. When one or more devices send an interrupt signal through the interrupt controller to
the CPU, the CPU then sets interrupt acknowledge (INTA) to the controller and branches to the
fixed starting location of the ISR. The controller, in turn, sends the INTA to the highest priority
device. If this device has generated the interrupt INTR, it will accept the INTA; otherwise it will
pass the INTA signal to the next device, until the INTA is accepted by one requesting device.
When the INTA is accepted by a device, that device is ready for data transfer.
10. Write short notes on any four from the following: 2.5 × 4 = 10
(i) Direct Memory Access (DMA).
(ii) I/O Processor.
(iii) Von Neumann Architecture.
(iv) Cache memory.
(v) Non restoring division method.
(vi) Carry look-ahead adder.
Answer
(i) DMA: See answer of question no. 9 (c) of 2006 (CS-404).
(ii) IO Processor: The IOP is just like a CPU that handles the details of I/O operations. It is
equipped with more facilities than are available in a typical DMA controller. The IOP can fetch
and execute its own instructions, which are specifically designed to characterize I/O transfers. In
addition to the I/O-related tasks, it can perform other processing tasks like arithmetic, logic,
branching and code translation. The block diagram of an IOP is shown in the figure below. The
main memory unit takes the pivotal role; it communicates with each processor by means of DMA.
(iii) Von Neumann Architecture: See answer of question no. 8 (c) of 2007 (CS-303).
(iv) Cache memory: This is a special high-speed main memory, sometimes used to increase the
speed of processing by making the current programs and data available to the CPU at a rapid
rate. Generally, the CPU is faster than main memory, with the result that processing speed is
limited mainly by the speed of main memory. So, a technique used to compensate for the speed
mismatch between CPU and main memory is to use an extremely fast, small cache between the
CPU and main memory, whose access time is close to the CPU cycle time. The cache is used for
storing portions of programs currently being executed in the CPU and temporary data frequently
needed in the present computations. Thus, the cache memory acts as a buffer between the CPU and
main memory. By making programs and data available at a rapid rate, it is possible to increase
the performance of computer.
Cache memory is a high-speed memory built from SRAM. The cache memory can be placed
at more than one level. Most recent microprocessors, starting from the Intel 80486, have on-
chip cache memory (the memory is placed inside the CPU chip), also known as internal cache.
High-performance microprocessors such as the Pentium Pro and later have two levels of cache
memory on-chip. These are known as level 1 (L1) and level 2 (L2) caches. An on-chip cache is
slightly faster than an off-chip cache of the same technology.
(v) Non restoring division method: It is described in the following flowchart.
This algorithm removes the restoration steps of the restoring algorithm, though it may require a
single restoration step at the end of the algorithm for the remainder A, if A is negative.
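The flowchart itself is not reproduced here, but the standard non-restoring scheme for unsigned
operands that it describes can be sketched as follows; the register names A, Q and M follow the
usual convention, while the Python encoding is an assumption for illustration:

    def nonrestoring_divide(dividend, divisor, n=8):
        A, Q, M = 0, dividend, divisor
        for _ in range(n):
            # Shift A and Q left one bit; the MSB of Q moves into the LSB of A.
            A = (A << 1) | ((Q >> (n - 1)) & 1)
            Q = (Q << 1) & ((1 << n) - 1)
            # Subtract M if A is non-negative, otherwise add M (no restoring step).
            A = A - M if A >= 0 else A + M
            # The new quotient bit is 1 exactly when A is non-negative.
            if A >= 0:
                Q |= 1
        if A < 0:          # single restoration step at the end, if needed
            A += M
        return Q, A        # quotient, remainder

    print(nonrestoring_divide(17, 5))   # -> (3, 2)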
(vi) Carry look-ahead adder: A Carry Look-ahead Adder (CLA) is a high-speed adder, which adds
two numbers without waiting for the carries from the previous stages. In the CLA, carry-inputs
of all stages are generated simultaneously, without using carries from the previous stages.
In the full adder, the carry output Ci+1 is related to its carry input Ci as follows:
Ci+1 = AiBi + (Ai+Bi)Ci
This result can be rewritten as:
Ci+1 = Gi + PiCi (1)
where Gi = AiBi and Pi = Ai + Bi
The function Gi is called the carry-generate function, since a carry Ci+1 is generated when
both Ai and Bi are 1. The function Pi is called the carry-propagate function, since if Ai or Bi is
1, then the input carry Ci is propagated to the next stage. The basic adder (BA) for generating
the sum Si, carry propagate Pi and carry generate Gi bits is shown in the figure below. The sum
bit is Si = Ai ⊕ Bi ⊕ Ci. The implementation of one basic adder requires two XOR gates, one
AND gate and one OR gate.
Now, we want to design a 4-bit CLA, for which four carries C1, C2, C3 and C4 are to be
generated. Using Eq. (1); C1, C2, C3 and C4 can be expressed as follows:
C1 = G0 + P0C0
C2 = G1 + P1C1
C3 = G2 + P2C2
C4 = G3 + P3C3
These equations are recursive and the recursion can be removed as below.
C1 = G0 + P0C0 (2)
C2 = G1 + P1C1
= G1 + P1(G0 + P0C0)
= G1 + P1G0 + P1P0C0 (3)
C3 = G2 + P2C2
= G2 + P2(G1 + P1G0 + P1P0C0)
= G2 + P2G1 + P2P1G0 + P2P1P0C0 (4)
C4 = G3 + P3C3
= G3 + P3(G2 + P2G1 + P2P1G0 + P2P1P0C0)
= G3 + P3G2 + P3P2G1 + P3P2P1G0 + P3P2P1P0C0 (5)
The Eqs (2), (3), (4) and (5) suggest that C1, C2, C3 and C4 can be generated directly from C0.
In other words, these four carries depend only on the initial carry C0. For this reason, these
equations are called carry look-ahead equations. A 4-bit carry look-ahead adder (CLA) is
shown in figure below.
The maximum delay of the CLA is 6 × D (for Gi and Pi generation, delay = D; for Ci
generation, delay = 2D; and lastly another 3D for the sum bit Si), where D is the average gate delay.
The same holds for any number of bits, because the adder delay does not depend on the size of
the numbers (n). It depends on the number of levels of gates used to generate the sum and the carry
bits.
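Eqs (2) to (5) can be evaluated directly in software, which makes the point that every carry is a
two-level function of C0 alone. A minimal sketch, with Python's bit operators standing in for
the gates (the function name is illustrative):

    def cla_4bit(A, B, C0=0):
        a = [(A >> i) & 1 for i in range(4)]
        b = [(B >> i) & 1 for i in range(4)]
        G = [a[i] & b[i] for i in range(4)]   # Gi = Ai.Bi (carry generate)
        P = [a[i] | b[i] for i in range(4)]   # Pi = Ai + Bi (carry propagate)
        C1 = G[0] | (P[0] & C0)
        C2 = G[1] | (P[1] & G[0]) | (P[1] & P[0] & C0)
        C3 = G[2] | (P[2] & G[1]) | (P[2] & P[1] & G[0]) | (P[2] & P[1] & P[0] & C0)
        C4 = (G[3] | (P[3] & G[2]) | (P[3] & P[2] & G[1])
              | (P[3] & P[2] & P[1] & G[0]) | (P[3] & P[2] & P[1] & P[0] & C0))
        C = [C0, C1, C2, C3, C4]
        S = [a[i] ^ b[i] ^ C[i] for i in range(4)]   # Si = Ai xor Bi xor Ci
        return sum(S[i] << i for i in range(4)) | (C4 << 4)

    assert cla_4bit(9, 7) == 16   # 1001 + 0111 = 1 0000, C4 is the fifth bit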
2005
Computer Organization
Paper Code: CS-303 Semester: 3rd
Time Allotted: 3 hours Full Marks: 70
The questions are of equal value. The figures in the margin indicate full marks.
Candidates are required to give their answers in their own words as far as practicable.
Answer any seven questions
1. (a) What are the widths of data bus and address bus for 4096 × 8 memory?
(b) What do you mean by program status word?
(c) Define: content addressable memory.
Answer
(a) The size of memory is 4096 × 8.
So, the number of words is 4096, where each word consists of 8 bits.
Hence, the number of address bits required to access each word is 12, because 4096 = 2^12.
Therefore, the address bus width is 12-bit and data bus width is 8-bit.
(b) The processor uses one special register called program status word (or status register) to hold
the condition code flags and other information that describe the status of the currently executing
program. It holds 1-bit flags to indicate certain conditions produced during arithmetic and
logic operations. The bits are set or reset depending on the outcome of the most recent arithmetic
or logic operation. The register generally contains the following four flags:
Carry (C): it indicates whether there is any end-carry from the most significant bit position.
Zero (Z): it indicates whether the result is zero or non-zero.
Sign (S): it indicates whether the result is positive or negative.
Overflow (V): it indicates whether the operation produces any overflow or not.
There may be other flags such as parity and auxiliary carry.
(c) The searching time for desired data stored in memory can be reduced greatly if the stored data
can be searched by the data value itself rather than by an address. Memory accessed by the
data content is known as content addressable memory (CAM) or associative memory. When
data is stored in this memory, no address is given; the data is stored at the first empty location.
When a data word is to be read from the memory, only the data word or a part of it, called the
key, is provided. The memory is searched simultaneously throughout for a match with the
specified key, and the matching words are set for reading next.
2. Explain Booth’s algorithm. Apply Booth’s algorithm to multiply the two numbers (+14)10 and
(–11)10. Assume the multiplier and multiplicand to be 5 bits each.
Answer
For Booth’s algorithm: See answer of question no. 8 (a) of 2007 (CS-303).
Multiplication of numbers (+14)10 and (–11)10:
Multiplicand, M = +14 = 01110 and multiplier, Q = –11 = 10101.
M A Q Q[–1] Size
Initial
Configuration 01110 00000 10101 0 5
Step-1
As Q[0]=1 and
Q[–1]=0
A=A–M 01110 10010 10101 0 —
ARS(AQ) 01110 11001 01010 1 4
Step-2
As Q[0]=0 and
Q[–1]=1
A=A+M 01110 00111 01010 1 —
ARS(AQ) 01110 00011 10101 0 3
Step-3
As Q[0]=1 and
Q[–1]=0
A=A–M 01110 10101 10101 0 —
ARS(AQ) 01110 11010 11010 1 2
Step-4
As Q[0]=0 and
Q[–1]=1
A=A+M 01110 01000 11010 1 —
ARS(AQ) 01110 00100 01101 0 1
Step -5
As Q[0] = 1 and
Q[–1] = 0
A=A–M 01110 10110 01101 0 —
ARS (AQ) 01110 11011 00110 1 0
Since the size register becomes 0, the algorithm terminates and the product is AQ = 1101100110,
which shows that the product is a negative number. To get the result in familiar form, take the 2's
complement of the number to obtain its magnitude; the result is –154 in decimal.
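The whole table can be reproduced in a few lines. The sketch below implements Booth's
algorithm for n-bit two's complement operands; the function name and the bit-level encoding
are assumptions for illustration rather than the book's notation:

    def booth_multiply(m, q, n=5):
        mask = (1 << n) - 1
        M, A, Q, q_1 = m & mask, 0, q & mask, 0
        for _ in range(n):
            pair = (Q & 1, q_1)
            if pair == (1, 0):
                A = (A - M) & mask      # A = A - M
            elif pair == (0, 1):
                A = (A + M) & mask      # A = A + M
            # Arithmetic right shift of the combined register A, Q, Q[-1].
            q_1 = Q & 1
            Q = ((Q >> 1) | ((A & 1) << (n - 1))) & mask
            A = ((A >> 1) | (A & (1 << (n - 1)))) & mask   # replicate the sign bit
        product = (A << n) | Q
        return product - (1 << 2 * n) if product >> (2 * n - 1) else product

    print(booth_multiply(14, -11))   # -> -154, matching the table above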
Secondary Memory: The slow-speed and low-cost devices that provide backup storage are
called secondary memory. The most commonly used secondary memories are magnetic disks,
such as hard disk, floppy disk and magnetic tapes. This type of memory is used for storing all
programs and data, as it is used in bulk. When a program not residing in main memory is
needed for execution, it is transferred from secondary memory to main memory. Programs not
currently needed in main memory (in other words, the programs are not currently executed by
the processor) are transferred into secondary memory to provide space for currently used pro-
grams and data.
Main Memory: This is the memory that communicates directly with CPU. Only programs and
data currently needed by the CPU for execution reside in the main memory. Main memory
occupies central position in hierarchy by being able to communicate directly with CPU and with
secondary memory devices through an I/O processor.
Cache Memory: This is a special high-speed main memory, sometimes used to increase the
speed of processing by making the current programs and data available to the CPU at a rapid
rate. Generally, the CPU is faster than main memory, with the result that processing speed is
limited mainly by the speed of main memory. So, a technique used to compensate for the speed
mismatch between CPU and main memory is to use an extremely fast, small cache between the
CPU and main memory, whose access time is close to the CPU cycle time. The cache is used for
storing portions of programs currently being executed in the CPU and temporary data frequently
needed in the present computations. Thus, the cache memory acts as a buffer between the CPU and
main memory. By making programs and data available at a rapid rate, it is possible to increase
the performance of computer.
Internal memory: This memory refers to the high-speed registers used inside the CPU. These
registers hold temporary results when a computation is in progress. There is no speed disparity
between these registers and the CPU because they are fabricated with the same technology.
However, since registers are very expensive, only a few registers are used as internal memory.
(b) Yes. The ROM is also a random access memory. In the random mode of access, any location of
the memory can be accessed randomly. This type of memory is very fast.
(c) The memory accessed by the data content is known as associative memory or content address-
able memory (CAM). When data is stored in this memory, no address is given; the data is
stored at the first empty location. When a data word is to be read from the memory, only the
data word or a part of it, called the key, is provided. The memory is searched simultaneously
throughout for a match with the specified key, and the matching words are set for reading next.
The associative memory consists of a memory array and logic for m words with n bits per
word. In this organization, several registers are used, functions of which are described next.
I. Input register (I): The input register I is used to hold the data to be written into the
associative memory or the data to be searched for. At a time it holds one word of data of the
memory, i.e. its length is n-bit.
II. Mask register (M): The mask register M provides a mask for choosing a particular field or
key in the input register’s word. The maximum length of this register is n-bit, because the M
register can hold a portion of the word or all bits of the word to be searched. Suppose, a
student’s database file containing several fields such as name, class roll number, address,
etc is stored in the memory. From this file, say only ‘name’ field is required for searching.
Then the M register will hold this ‘name’ field only and the searching in the memory will be
only with respect to this field without bothering about other fields. Thus, only those bits in
the input register that have 1s in their corresponding position of the mask register are
compared. The entire argument is compared with each memory word if the mask register
contains all 1s. Each word in memory is matched with the input data in the I-register.
To illustrate the matching technique, let the input register I and mask register M have the
following information.
I = 1001 1011
M = 1111 0000
Word1 = 0011 1100 no match
Word2 = 1001 0011 match
Only the four leftmost bits of I are compared with the stored memory words because M has 1s
in these positions. There is a match with Word2, but not with Word1 (a code sketch after this
list illustrates the masked comparison).
III. Select register (S): The select register S has m bits, one for each memory word. If matches
are found after comparing the input data in the I register (masked by the M register) with the
memory words, the corresponding bits in the select register (S) are set.
IV. Output register (Y): This register contains the match data word retrieved from the associa-
tive memory.
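In register terms, a stored word matches when it agrees with I in every bit position where M
holds a 1. A minimal sketch using the example above; the function name is illustrative, and a
real CAM performs all the comparisons in parallel in hardware:

    def cam_search(words, key, mask):
        # A word matches when it differs from the key only where the mask is 0.
        return [i for i, w in enumerate(words) if (w ^ key) & mask == 0]

    words = [0b00111100, 0b10010011]   # Word1 and Word2 from the example
    I = 0b10011011                     # input register
    M = 0b11110000                     # mask: only the four leftmost bits compared
    print(cam_search(words, I, M))     # -> [1]: Word2 matches, Word1 does not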
4. (a) Differentiate between hardwired control unit and micro-programmed control unit.
(b) Draw the block diagram and explain the functionality of micro-programmed control unit.
Answer
(a) (i) Microprogrammed control unit provides a well-structured control organization. Control
signals are systematically transformed into formatted words (microinstructions). Logic gates,
flip flops, decoders and other digital circuits are used to implement hardwired control
organization.
(ii) With microprogramming, many additions and changes are made by simply changing the
microprogram in the control memory. A small change in the hardwired approach may lead
to redesigning the entire system.
(iii) The microprogrammed control unit is more expensive than hardwired control, since a
control ROM memory is needed in the microprogramming approach.
(b) See answer of question no. 9 (d) of 2006 (CS-404).
5. (a) Define “latency time” of a memory.
(b) Explain the working principle of a static MOS cell.
(c) Among Dynamic MOS Cell and Static MOS Cell which one is used for the construction of
cache memory and which one for main memory? Why?
Answer
(a) The latency time is measured in terms of two parameters: access time, tA, and cycle time, tC.
Sometimes latency is measured in terms of access time and sometimes in terms of cycle time.
To perform a read operation, first the address of the location is sent to memory followed by the
‘read’ control signal. The memory decodes the address, selects the location and reads out the
contents of the location. The access time is the time taken by the memory to complete a read
operation from the moment of receiving the ‘read’ control signal. Generally, access time for
read and write is equal. Suppose two successive memory read operations have to be performed.
During the first read operation, the information read from memory is available after the access
time. This data can be immediately used by CPU. However, the memory is still busy with some
SQP.6 Computer Organization and Architecture
internal operation for some more time called recovery time, tR. During this time, another memory
access, read or write, cannot be initiated. Only after the recovery time can the next operation be
started. The cycle time is the total time including access time and recovery time: tC = tA + tR. This
recovery time varies with memory technology.
(b) See answer of question no. 6(d) of 2004 (CS-303).
(c) The dynamic MOS cell is used to construct main memory and the static MOS cell is used to
construct cache memory, because static memory is faster than dynamic memory.
6. Compare restoring and non-restoring division algorithms. Give the non-restoring division algo-
rithm for division of two binary numbers.
Answer
For first part: See answer of question number 5(b) of 2007 (CS-303).
For second part: See answer of question number 10(v) of 2004 (CS-404).
7. (a) Write down the functions of an operating system.
(b) Describe the function of major components of a digital computer with diagram.
Answer
(a) See answer of question number 7(b) of 2006 (CS303).
(b) See answer of question number 7(a) of 2007 (CS-303).
8. (a) Define: instruction set and instruction set architecture.
(b) Explain different types of addressing modes used in instruction set design.
Answer
(a) Instruction set: The complete collection of instructions that are understood by a machine (i.e.
CPU) is called the instruction set for the machine.
Instruction set architecture: A complete instruction set is also referred to as the instruction set
architecture (ISA) of the machine.
(b) See answer of question no. 6(d) of 2004 (CS-303).
9. (a) Explain the advantages and disadvantages of parallel adder over serial adder.
(b) What are the advantages of microprogramming control over hardwired control?
(c) Why does adding different types of addressing modes to an ISA tend to improve perfor-
mance of a computer?
Answer
(a) See answer of question no. 9(b) of 2007 (CS-404).
(b) (i) Microprogrammed control unit provides a well-structured control organization by using
the programming concept. Control signals are systematically transformed into formatted words
(microinstructions).
(ii) So, many additions and changes are made by simply changing the microprogram in the
control memory.
(c) By providing various addressing modes in an ISA, programming versatility and flexibility can
be achieved with respect to the number of instructions and execution time. So, a computer's
performance can be improved by choosing appropriate instructions from the ISA for implement-
ing a program.
10. (a) What are the different types of ROMs? Explain their principles.
(b) Give examples of non-destructive read out memory and destructive read out memory.
Answer
(a) Types of ROMs are as follows:
PROM: Some ROM designs allow the data to be loaded into the cells by the user; such a
ROM is called a PROM (Programmable ROM). Before it is programmed, the memory contains all
0s. The user can insert 1s at the required locations by burning out the fuses of the cells at these
locations using high-voltage currents. The PROM's cells are one-time programmable, i.e. the user
can store the desired bits in the cells only once and these bits cannot be altered.
EPROM: An erasable PROM (EPROM) uses a transistor in each cell that acts as a program-
mable switch. The contents of an EPROM can be erased (set to all 1s) by exposing the device
to ultraviolet light for a few (20 to 30) minutes. ROMs and PROMs are simpler and thus
cheaper than EPROMs; EPROMs are mainly used during system development and debugging.
EEPROM (Electrically Erasable PROM): In many applications, permanent data have to be
generated in a program application and need to be stored. For example, in a mobile phone the
telephone numbers are to be kept permanently till the user wants to erase those data. Similarly,
the user may wish to erase previously entered information. EEPROMs have the advantage that
the information in them can be selectively erased by writing 1s, and each bit can then be stored
again by writing the desired value. An EEPROM needs two write operations at an
address, one for erasing and one for writing. A RAM writes the information directly, without first
erasing the old information at that address; but in the EEPROM, unlike RAM, the stored
information is non-volatile.
(b) In some memories, reading a memory word destroys the stored word; this is known as
destructive readout, and such a memory is called a destructive readout memory. In these
memories, each read operation must be followed by a write operation that restores the memory's
original state. An example is dynamic RAM.
In other memories, reading a memory word does not destroy the stored word; this is known as
non-destructive readout, and such a memory is called a non-destructive readout memory.
Examples include static RAM and magnetic memory.
11. Write short notes on the following:
(a) Cache memory
(b) Von Neumann architecture
(c) Parallel adder
Answer
(a) Cache memory: See answer of question no. 10 (iv) of 2004 (CS-404).
(b) Von Neumann architecture: See answer of question no. 8 (c) of 2007 (CS-303).
(c) Parallel adder: A parallel adder is an adder which adds all bits of two numbers in one clock
cycle. It has a separate adder circuit for each bit. Therefore, to add two n-bit numbers, a parallel
adder needs n separate adder circuits. There are basically two types of parallel adders, depend-
ing on the way of carry generation.
(i) Carry-Propagate Adder (CPA) or Ripple Carry Adder (RCA)
(ii) Carry Look-ahead Adder (CLA).
Carry-Propagate Adder (CPA): For addition of two n-bit numbers, n full adders (FAs) are
required. Each full adder’s carry output will be the input of the next higher bit full adder. Each
full adder performs addition for same position bits of two numbers. An n-bit CPA circuit is
shown in the figure below.
Carry Look-ahead Adder (CLA): A Carry Look-ahead Adder (CLA) is a high-speed adder,
which adds two numbers without waiting for the carries from the previous stages. In the CLA,
carry-inputs of all stages are generated simultaneously, without using carries from the previous
stages.
In the full adder, the carry output Ci+1 is related to its carry input Ci as follows:
Ci+1=AiBi + (Ai+Bi)Ci
This result can be rewritten as:
Ci+1 = Gi + PiCi (1)
where Gi = AiBi and Pi = Ai + Bi
The function Gi is called the carry-generate function, since a carry Ci+1 is generated when
both Ai and Bi are 1. The function Pi is called the carry-propagate function, since if Ai or Bi is
1, then the input carry Ci is propagated to the next stage. The basic adder (BA) for generating
the sum Si, carry propagate Pi and carry generate Gi bits is shown in the figure below. The sum
bit is Si = Ai ⊕ Bi ⊕ Ci. The implementation of one basic adder requires two XOR gates, one
AND gate and one OR gate.
Now, to design a 4-bit CLA, four carries C1, C2, C3 and C4 are to be generated. Using Eq. (1);
C1, C2, C3 and C4 can be expressed as follows:
C1= G0 + P0C0
C2= G1 + P1C1
C3= G2 + P2C2
C4= G3 + P3C3
These equations are recursive and the recursion can be removed as below.
C1= G0 + P0C0 (2)
C2= G1 + P1C1
= G1 + P1(G0 + P0C0)
= G1 + P1G0 + P1P0C0 (3)
C3= G2 + P2C2
= G2 + P2(G1 + P1G0 + P1P0C0)
= G2 + P2G1 + P2P1G0 + P2P1P0C0 (4)
C4= G3 + P3C3
= G3 + P3(G2 + P2G1 + P2P1G0 + P2P1P0C0)
= G3 + P3G2 + P3P2G1 + P3P2P1G0 + P3P2P1P0C0 (5)
The Eqs (2), (3), (4) and (5) suggest that C1, C2, C3 and C4 can be generated directly from C0.
In other words, these four carries depend only on the initial carry C0. For this reason, these
equations are called carry look-ahead equations. A 4-bit carry look-ahead adder (CLA) is
shown in the figure below.
2005
Computer Organization
and Architecture
Paper Code: CS-404 Semester: 4th
Time Allotted: 3 hours Full Marks: 70
The questions are of equal value. The figures in the margin indicate full marks.
Candidates are required to give their answers in their own words as far as practicable.
Answer any seven questions
1. Explain how a RAM of capacity 2 Kbytes can be mapped into the address space (1000)H to
(17FF)H of a CPU having 16 address lines. Show how the address lines are decoded to
generate the chip select condition for the RAM. 10
Answer
Since the capacity of the RAM is 2048 bytes, i.e. 2 KB, the memory uses 11 (2 KB = 2^11)
address lines, namely A10 – A0, to select one word. Thus, the memory's internal address decoder
uses the 11 lines A10 – A0 to select one word.
To select this memory module, the remaining 5 (i.e. 16 – 11) address lines A15 – A11 are used.
Thus, an external decoding scheme is employed on these higher-order five address bits of the
processor's address. The address space of the memory is 1000H to 17FFH.
Therefore, the starting address (1000H) in memory is as:
A15 A14 A13 A12 A11 A10 A9 A8 A7 A6 A5 A4 A3 A2 A1 A0
0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0
Based on the higher-order five bits (00010), the external decoding scheme performs a logical AND
on the complements of A15, A14, A13 and A11 together with A12 uncomplemented. The output
of the AND gate acts as the chip select
(CS) line. The address decoding scheme is shown in the following figure.
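The decoding condition can also be verified numerically: the chip is selected exactly when the
higher-order five address bits equal 00010. A short sketch (the function name is illustrative):

    def chip_select(addr):
        # CS is active when A15-A11 == 00010, i.e. 1000H <= addr <= 17FFH.
        return ((addr >> 11) & 0b11111) == 0b00010

    assert chip_select(0x1000) and chip_select(0x17FF)
    assert not chip_select(0x0FFF) and not chip_select(0x1800)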
2. (a) What are the reasons for using the virtual memory? 3
(b) What do you mean by logical address space and physical address space? 3
(c) Explain the paging technique applied for memory mapping. 4
Answer
(a) Reasons for using virtual memory:
(i) To give the programmers an illusion of having a large main memory, although that may
not be the case.
(ii) Efficient utilization of main memory, because a large program is divided into blocks
and each block is loaded into the main memory only when it is required.
(iii) Efficient CPU utilization and thus improved throughput.
(b) When a program needs to be executed, the CPU would generate addresses called logical ad-
dresses. The corresponding addresses in the physical memory, as occupied by the executing
program, are called physical addresses. The set of all logical addresses generated by the CPU or
program is called logical-address space and the set of all physical addresses corresponding to
these logical addresses is called physical-address space. The memory-management unit (MMU)
maps each logical address to a physical address during program execution.
(c) Paging is a non-contiguous memory allocation method. In other words, the program is divided
into small blocks in paging, and these blocks can be loaded anywhere in main memory. In
paging, the virtual address space is divided into equal-size blocks called pages and the physical
(main) memory is divided into equal-size blocks called frames. The size of a page and the size
of a frame are equal. The size of a page or a frame depends on the operating system and is
generally 4 KB.
In paging, the operating system maintains a data structure called the page table, which is used
for mapping from logical address to physical address. The page table generally contains two
fields, one the page number and the other the frame number. The table specifies which page is
mapped to which frame. Each operating system has its own way of maintaining page tables;
most allocate a page table for each program.
Each address generated by the CPU (i.e. a virtual address) is divided into two parts: a page
number (p) and an offset or displacement (d). The page number p is used as an index into the
page table and the offset d is the word number within page p. The structure of the paging
method is shown in the figure below.
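The split-and-look-up step can be sketched in a few lines; the page size, the page-table contents
and the names below are assumptions made for illustration only:

    PAGE_SIZE = 4096                     # assumed 4 KB pages
    page_table = {0: 5, 1: 2, 2: 7}      # hypothetical page -> frame mapping

    def translate(logical_addr):
        p, d = divmod(logical_addr, PAGE_SIZE)   # page number and offset
        if p not in page_table:
            raise RuntimeError("page fault: page %d is not resident" % p)
        return page_table[p] * PAGE_SIZE + d     # frame base + offset

    print(hex(translate(0x1234)))   # page 1, offset 0x234 -> frame 2 -> 0x2234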
3. Explain Booth’s Algorithm. Apply Booth’s algorithm to multiply the two numbers (+15)10 and
(–11)10 . Assume the multiplier and multiplicand to be of 5 bits each. 10
Answer
For Booth’s algorithm: see answer of question no. 8 (a) of 2007 (CS-303).
Multiplication of numbers (+ 15)10 and (–11)10:
Multiplicand, M = +15 = 01111 and multiplier, Q = –11 = 10101.
M A Q Q[–1] Size
Initial
Configuration 01111 00000 10101 0 5
Step-1
As Q[0]=1and
Q[–1]=0
A=A–M 01111 10001 10101 0 —
ARS(AQ) 01111 11000 11010 1 4
Step-2
As Q[0]=0 and
Q[–1]=1
A=A+M 01111 00111 11010 1 —
ARS(AQ) 01111 00011 11101 0 3
Step-3
As Q[0]=1 and
Q[–1]=0
A=A–M 01111 10100 11101 0 —
ARS(AQ) 01111 11010 01110 1 2
Step-4
As Q[0]=0 and
Q[–1]=1
A=A+M 01111 01001 01110 1 —
ARS(AQ) 01111 00100 10111 0 1
Step-5
As Q[0] = 1 and
Q[–1] = 0
A=A–M 01111 10101 10111 0 —
ARS (AQ) 01111 11010 11011 1 0
Since the size register becomes 0, the algorithm terminates and the product is AQ = 1101011011,
which shows that the product is a negative number. To get the result in familiar form, take the 2's
complement of the number to obtain its magnitude; the result is –165 in decimal.
4. (a) What are the different types of instruction formats? Explain them briefly. 4
(b) Write down the different addressing modes with examples. 6
Answer
(a) There are four instruction formats, depending on the number of address fields used in instruc-
tions and they are:
• Three address
• Two address
• One address
• Zero address
Three-address format:
The general instruction format of three-address instructions is:
The op-code field specifies the operation to be performed and the mode field gives the way of
calculating the operand source addresses. The address field consists of three parts: source 1 and
source 2 give the operand addresses, and destination gives the address of the result after the
instruction execution.
Two-address format:
Here, the address field consists of two parts: the first gives the address of one of the source
operands, and the same address is used to store the result of the operation after the instruction
executes; the second part gives the address of the other source operand.
One-address format:
In one-address instructions, the address field holds only one address, which is the address of
one of the operands. In this case, the accumulator register (AC) is always used implicitly to
hold the other operand, and the result of the operation is stored in the accumulator as well.
Zero-address format:
In the zero-address format, instructions consist only of an op-code; no address of an operand or
result is specified. Stack-organized computers use this type of instruction format. When an
instruction is executed, the two topmost operands are retrieved automatically from the stack and
the result is stored on the top of the stack.
(b) See answer of question no. 6 (d) of 2004 (CS-303).
5. (a) What is terminal quotient correction in non-restoring division? 4
(b) Write down the Boolean expression for overflow condition when adding or subtracting two
binary numbers expressed in two’s complement. 6
Answer
(a) Question is ambiguous.
(b) If one number is positive and the other is negative, an overflow cannot occur after an addition,
since the addition of a positive number to a negative number produces a result whose magnitude
is smaller than the larger of the two original numbers. However, an overflow may occur if
the two numbers added are of the same sign, i.e., both are positive or both are negative. Let us
consider the following examples.
Carries:  0 1                 Carries:  1 0
   +69    0 1000101              –69    1 0111011
   +78    0 1001110              –78    1 0110010
  +147    1 0010011             –147    0 1101101
Observe that the 8-bit result that should have been positive (first example) has a negative sign
bit and the 8-bit result that should have been negative (second example) has a positive sign bit.
However, if the carry out of the sign bit position is treated as the sign of the result, the 9-bit
answer thus obtained will be the correct answer. Since the 9-bit answer cannot be accommodated
in an 8-bit register, we say that an overflow results.
To detect an overflow condition, the carry into the sign bit position (i.e. Cn–1) and the carry out
of the sign bit position (i.e. Cn) are examined. If these two carries are both 0 or both 1, there
is no overflow. If the two carries are different, an overflow condition exists. The overflow
occurs if the Boolean expression Cn ⊕ Cn–1 is true.
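The condition Cn ⊕ Cn–1 is easy to demonstrate numerically. In the sketch below (the names
are illustrative), Cn–1 is obtained by adding only the low n–1 bits of the operands and Cn by
adding all n bits:

    def add_with_overflow(a, b, n=8):
        mask = (1 << n) - 1
        ua, ub = a & mask, b & mask
        c_n = ((ua + ub) >> n) & 1                    # carry out of the sign bit
        low = mask >> 1                               # the low n-1 bit positions
        c_n_1 = (((ua & low) + (ub & low)) >> (n - 1)) & 1   # carry into the sign bit
        return (ua + ub) & mask, c_n ^ c_n_1          # n-bit sum, overflow flag

    print(add_with_overflow(69, 78))     # -> (147, 1): overflow, as in the example
    print(add_with_overflow(-69, -78))   # -> overflow flag 1 again
    print(add_with_overflow(17, -35))    # -> overflow flag 0: result -18 is valid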
6. (a) Explain how an I/O operation based on an interrupt is more efficient than an I/O operation
based on a programmed mode of operation. 7
(b) Distinguish between vectored and non-vectored interrupt. 3
Answer
(a) In the programmed I/O method, the program constantly monitors the device status. Thus, the
CPU stays in the program until the I/O device indicates that it is ready for data transfer. This is
a time-consuming process since it keeps the CPU busy needlessly. It can be avoided by letting the
device controller continuously monitor the device status and raise an interrupt to the CPU as
soon as the device is ready for data transfer. Upon detecting the external interrupt signal, the
CPU responds to the interrupt by storing the return address from the program counter
(PC) register into a memory stack or into a processor register and then control branches to an
ISR program that processes the required I/O transfer.
(b) See answer of question number 8(a) of 2006 (CS-303).
7. (a) Why do we require memory hierarchy? 3
(b) What is locality of reference? What is virtual memory? 3
(c) How many 256 × 4 RAM chips are needed to provide a memory capacity of 2048 bytes?
Show the corresponding interconnection diagram. 4
Answer
(a) See answer of question no. 10 (a) of 2007 (CS-404).
(b) For locality of reference: See answer of question no. 3 (a) of 2006 (CS-404).
For virtual memory: See answer of question no. 2 of 2007 (CS-404).
(c) See answer of question no. 10 (c) of 2007 (CS-404).
8. (a) Design a 4-bit Arithmetic unit using multiplexers and full-adders. 6
(b) A digital computer has a common bus system for K-registers of n bits each. The bus is
constructed with multiplexers.
(i) What size of multiplexers are needed?
(ii) How many multiplexers are there in the bus? 4
Answer
(a) The circuit has a 4-bit parallel adder and four multiplexers for 4-bit arithmetic unit. There are
two 4-bit inputs A and B, and the 5-bit output is S. The size of each multiplexer is 4:1. The two
common selection lines for all four multiplexers are S0 and S1. Cin is the carry input of the
parallel adder and the carry out is Cout. The four inputs to each multiplexer are B- value, B -
value, logic-1 and logic-0.
The output of the circuit is calculated from the following arithmetic sum:
S = A + Y + Cin
where A is a 4-bit number, Y is the 4-bit output of multiplexers and Cin is the carry input bit to
the parallel adder. With this circuit it is possible to get 8 arithmetic micro-operations, as listed in
the following table.
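Assuming the usual assignment of the selection lines (with S1S0 choosing Y = B, B′, 0, 1 in that order, and S = A + Y + Cin), the eight micro-operations are:

S1 S0   Y    Output with Cin = 0                    Output with Cin = 1
0  0    B    S = A + B (add)                        S = A + B + 1 (add with carry)
0  1    B′   S = A – B – 1 (subtract with borrow)   S = A – B (subtract)
1  0    0    S = A (transfer)                       S = A + 1 (increment)
1  1    1    S = A – 1 (decrement)                  S = A (transfer)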
Group-A
(Multiple Choice Questions)
1. Choose the correct answers for the following: 10 × 1 = 10
(i) Subtractor can be implemented using
(a) adder (b) complementer
(c) both (a) and (b) (d) none of these.
Answer
(c) both (a) and (b)
(ii) Overflow occurs when
(a) data is out of range (b) data is within range
(c) none of these.
Answer
(a) data is out of range
(iii) Instruction cycle is
(a) fetch-decode-execution (b) fetch-execution-decode
(c) decode-fetch-execution (d) none of these.
Answer
(a) fetch-decode-execution
(iv) Cache memory
(a) increases performance (b) machine cycle increases
(c) reduces performance (d) none of these.
Answer
(a) increases performance
(v) In microprogramming
(a) horizontal microinstruction is faster (b) vertical microinstruction is faster
(c) hardware control unit is faster (d) none of these.
Answer
(a) horizontal microinstruction is faster
(vi) Associative memory is a
(a) very cheap memory
(b) pointer addressable memory
(c) content addressable memory
(d) slow memory.
Answer
(c) content addressable memory
(vii) The speed of a microprocessor is usually measured by the
(a) microprocessor’s throughput
(b) speed with which it performs I/P and O/P operations
(c) time required to execute a basic instruction
(d) time required to process a small operation.
Answer
(c) time required to execute a basic instruction
(viii) For BIOS (Basic Input / Output System) and IOCS (Input / Output Control System), which
one of the following is true?
(a) BIOS and IOCS are same
(b) BIOS controls all devices and IOCS controls only certain devices
(c) BIOS is not a part of Operating System and IOCS is a part of Operating System
(d) BIOS is stored in ROM and IOCS is stored in RAM.
Answer
(c) BIOS is not a part of Operating System and IOCS is a part of Operating System
(ix) How many RAM chips of size (256 K × 1 bit) are required to build 1M byte memory?
(a) 24 (b) 10 (c) 32. (d) 8
Answer
(c) 32.
(x) Which logic gate has the highest speed?
(a) DTL (b) ECL (c) RTL (d) TTL.
Answer
(b) ECL.
Group-B
(Short-Answer Questions)
Write short notes on any three: 3 × 5 = 15
2. Microinstruction
3. Stack memory
4. Pipeline processor
5. Virtual memory
6. IEEE format for floating point representation.
Answer
2. Microinstruction: In a microprogrammed control unit, the set of microinstructions forms a
microprogram stored in a separate ROM memory called the control memory (CM). Each microin-
struction contains signals to activate one or more micro-operations. When these instructions are
retrieved in sequence, a set of micro-operations is activated that completes the desired
task.
Usually, all microinstructions have three important fields:
• Control field
• Next-address field
• Condition for branching.
A control memory in addition to the conventional main memory is used in the microprogram-
ming approach. Thus, emphasis is given to minimizing the length of the
microinstruction. The length of the microinstruction decides the size of the control memory, as
well as the cost involved with this approach. Based on the length of microinstructions, the
microprogramming method is either horizontal or vertical.
Having an individual bit for each control signal in the microinstruction format is known as a
horizontal microinstruction.
Horizontal microinstructions have the following general attributes:
• Long formats.
• Ability to express a high degree of parallelism.
• Very little encoding of the control information.
In a vertical microinstruction, the control information is encoded: a single short field is passed
through a decoder to produce the control signals. The vertical microprogramming technique takes
more time to generate the control signals, due to the decoding time, and also needs more
microinstructions.
Thus, the vertical microinstructions are characterized by:
• Short formats.
• Limited ability to express parallel microoperations.
• Considerable encoding of the control information.
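The contrast can be modelled in a few lines of Python (an illustrative sketch, assuming a machine with eight control signals):

# Horizontal: one bit per control signal, so any combination can be active.
def horizontal_signals(microinstruction, n_signals=8):
    return [(microinstruction >> i) & 1 for i in range(n_signals)]

# Vertical: a 3-bit encoded field is decoded to activate exactly one of the
# eight signals, which costs decoding time and rules out parallel signals.
def vertical_signals(field, n_signals=8):
    return [1 if i == field else 0 for i in range(n_signals)]

print(horizontal_signals(0b10100001))   # three signals active at once
print(vertical_signals(0b101))          # only signal 5 can be active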
3. Stack memory: See answer of question 10 (d) of 2007 (CS-303).
4. Pipeline processor: Most modern-day computers use the technique of pipelining.
Pipelining is a technique of decomposing a sequential task into subtasks, with each subtask
being executed in a special dedicated stage (segment) that operates concurrently with all other
stages. Each stage performs partial processing dictated by the way the task is partitioned. Result
obtained from a stage is transferred to the next stage in the pipeline. The final result is obtained
after the instruction has passed through all the stages. All stages are synchronized by a common
clock. Stages are pure combinational circuits performing arithmetic or logic operations over the
data stream flowing through the pipe. The stages are separated by high-speed interface latches
(i.e. collection of registers). The following figure shows the pipeline concept with k stages.
Speed-up S = n·k·t / (t·[k + (n – 1)]) = n·k / [k + (n – 1)], where t = clock period of the pipeline processor.
It can be noted that the maximum speed-up is k, for n >> k. But this maximum speed-up is
never fully achievable because of data dependencies between instructions, interrupts, program
branches, etc.
Efficiency: To define it, we need to define another term “time-space span”. It is the product
(area) of a time interval and a stage space in the space-time diagram. A given time-space span
can be in either a busy state or an idle state, but not both.
The efficiency of a linear pipeline is measured by the percentage of busy time-space spans
over the total time-space spans, which equal the sum of all busy and idle time-space spans. Let
n, k, t be the number of tasks (instructions), the number of stages and the clock period of a
linear pipeline, respectively. Then the efficiency is defined by
η = n·k·t / (k·[k·t + (n – 1)·t]) = n / [k + (n – 1)]
Note that η → 1 (i.e., 100%) as n → ∞. This means that the larger the number of tasks flowing
through the pipeline, the better is its efficiency. For the same reason as speed-up, this ideal
efficiency is not achievable.
Throughput: The number of tasks that can be completed by a pipeline per unit time is called its
throughput. Mathematically, it is defined as
w = n / [k·t + (n – 1)·t] = η / t
Note that in the ideal case, w = 1/t = f, the clock frequency, when η → 1. This means that the maximum
throughput of a linear pipeline is equal to its frequency, which corresponds to one output result
per clock period.
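The three measures can be checked together with a short Python sketch (function and variable names are illustrative):

def pipeline_metrics(n, k, t):
    # Speed-up, efficiency and throughput of a k-stage linear pipeline
    # processing n tasks with clock period t.
    speedup = (n * k * t) / (t * (k + (n - 1)))     # = n*k / (k + n - 1)
    efficiency = n / (k + (n - 1))                  # = speedup / k
    throughput = n / ((k + (n - 1)) * t)            # = efficiency / t
    return speedup, efficiency, throughput

# 1000 tasks on a 4-stage pipeline with a 10 ns clock:
# the speed-up approaches k = 4 and the efficiency approaches 1.
print(pipeline_metrics(1000, 4, 10e-9))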
5. Virtual memory: Parts of programs and data are brought into main memory from secondary
memory, as the CPU needs them. Virtual memory is a technique used in some large computer
systems that gives the programmer the illusion of having a large main memory, even though
this may not be the case. The size of the virtual memory is equivalent to the size of secondary
memory. Each address referenced by the CPU, called a virtual (logical) address, is mapped to a
physical address in main memory. This mapping is done during run-time and is performed by a
hardware device called memory-management unit (MMU) with the help of a memory map table,
which is maintained by the operating system.
The virtual memory makes the task of programming much easier, because the programmer no
longer needs to bother about the amount of physical memory available. For example, suppose a
program is 18 MB in size and the available user part of main memory is 15 MB (the other part of
main memory is occupied by the operating system). First, 15 MB of the program is loaded into
main memory, while the remaining 3 MB stays in secondary memory. When the remaining 3 MB
of code is needed for execution, 3 MB of code is swapped out of main memory to secondary
memory and the required 3 MB of code is swapped in from secondary memory to main memory.
The advantage of virtual memory is efficient utilization of main memory: a large program is
divided into blocks, and a block is loaded into main memory only when it is required. Thus
multiple programs can be executed simultaneously. The technique of virtual memory has the
further advantages of efficient CPU utilization and improved throughput.
A virtual memory system may be configured in one of the following ways:
1. Paging technique
2. Segmentation technique.
6. IEEE format for floating point representation: See answer of question 4 (a) of 2007 (CS-303).
Group-C
(Long-Answer Questions)
Answer any three questions. 3 × 15 = 45
7. (a) What are the bottlenecks of the von Neumann concept? 2
(b) Discuss the role of the operating system. 3
(c) Show the bus connection with a CPU to connect four RAM chips of size 256 × 8 bits each
and a ROM chip of size 512 × 8 bits. Assume the CPU has an 8-bit data bus and a 16-bit address
bus. Clearly specify generation of chip select signals. 5
(d) Briefly explain the two write policies write through and write back for cache design. What
are the advantages and disadvantages of both the methods? 5
Answer
(a) See answer of question 8 (c) of 2007 (CS-303).
(b) An operating system (OS) is a set of programs and utilities, which acts as the interface between
user programs and computer hardware. The following are the main roles of operating system:
1. Managing the user’s programs.
2. Managing the memories of computer.
3. Managing the I/O operations.
4. Controlling the security of computer.
(c) The figure below shows the interconnection diagram of these memories with the CPU having
16 address lines.
Address lines 1 to 8 are connected to each memory chip, and address line 9 serves a dual
purpose. To select one RAM out of the four, lines 9 and 10 are fed through a 2-to-4 decoder.
Line 9 is also connected to the ROM as an address line along with lines 1 to 8, giving the ROM
a total of 9 address lines, since the ROM has 512 locations. CPU address line 11 separates RAM
from ROM. The remaining CPU lines 12 to 16 are unused, and for simplicity we assume that
they carry 0s as address signals. For the ROM, line 10 is also unused, and it can likewise be
assumed to carry 0.
The address map table for the memory connection to the CPU shown above is constructed in
the following table.
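Assuming that line 11 = 0 selects the RAM chips and line 11 = 1 selects the ROM (the reverse polarity is equally possible), and that the unused lines carry 0s, the address map is:

Component    Line 11    Lines 10, 9    Lines 8–1    Address range (hex)
RAM 1        0          0 0            x … x        0000 – 00FF
RAM 2        0          0 1            x … x        0100 – 01FF
RAM 3        0          1 0            x … x        0200 – 02FF
RAM 4        0          1 1            x … x        0300 – 03FF
ROM          1          0 x            x … x        0400 – 05FF

For the ROM row, line 9 acts as the ninth address input (x) and line 10 carries 0.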
(d) Write-Through Policy: This is the simplest and most commonly used procedure to update the
cache. In this technique, when the cache memory is updated, at the same time the main memory
is also updated.
Advantage: The main memory always contains the same data as the cache.
Disadvantage: It is a slow process, since main memory must be accessed on every write.
Write-Back Policy: In this method, during a write operation only the cache location is updated.
When the update occurs, the location is marked by a flag called modified or dirty bit. When the
word is replaced from cache, it is written into main memory if its flag bit is set. The philosophy
of this method is based on the fact that during a write operation, the word residing in cache may
be accessed several times (temporal locality of reference).
Advantage: The method is faster than write-through, because this method reduces the number of
references to main memory.
Disadvantage: This method may encounter the problem of inconsistency, since two different
copies of the same data exist, one in the cache and the other in main memory.
8. (a) What is interrupt? What are the differences between vectored and non-vectored interrupt?
1+4
(b) (i) Why is refreshing required in Dynamic MOS? 2
(ii) Define volatile and non-volatile memory. 3
(c) How do ALU and CU work? Explain. 3+2
Answer
(a) Interrupt is a special signal to the CPU generated by an external device that causes the CPU to
suspend the execution of one program and start the execution of another.
In a vectored interrupt, the interrupting source supplies the branch information (the starting
address of the ISR) to the CPU. This information is called the interrupt vector, and it is not a
fixed memory location. The processor can identify individual devices even if they share a single
interrupt-request line, so the set-up time is very small.
In a non-vectored interrupt, the branch address (starting address of the ISR) is assigned to a fixed
location in memory. Since the identities of the requesting devices are not known initially, the
set-up time is quite large.
(b) (i) Information is stored in a dynamic memory cell in the form of a charge on a capacitor. Due
to the property of the capacitor, it starts to discharge. Hence, the information stored in the
cell can be read correctly only if it is read before the charge on the capacitor drops below
some threshold value. Thus the charge on the capacitor needs to be periodically restored, i.e.,
refreshed.
(ii) If power goes off, the stored information is lost in a volatile memory, whereas in a non-volatile
memory the stored information is retained irrespective of the power supply.
Magnetic memories and ROM are examples of non-volatile memories. The example of
volatile memory is RAM.
(c) Both the ALU and the CU are major parts of the CPU. The ALU does the arithmetic, transfer and
logic operations as dictated by the CU. An ALU consists of various arithmetic and logic units
which actually perform the operations.
The function of the CU is to control all operations by routing the selected data items to the
selected processing hardware units of ALU at the right time. A control unit’s responsibility is to
activate the associated processing hardware units by generating a set of signals that are synchro-
nized with a master clock. A control unit has the following responsibilities:
• Instruction interpretation
• Instruction sequencing
During the interpretation phase, the control unit reads instructions from the memory (using
the PC register as a pointer). It then resolves the instruction type and addressing mode, gets the
necessary operands and routes them to the appropriate functional units of the execution unit.
Required signals are then issued to the different units of ALU to perform the desired operation
and the results are routed to the specific destination. This phase is thus done in the “instruction
decoding” step of the instruction cycle.
During the sequencing phase, the control unit finds the address of the next instruction to be
executed and loads it into the PC. This phase is thus done in the “instruction fetch” step of the
instruction cycle.
9. (a) What is Bus? How many buses are present in computer? 1+2
(b) What is “Dumb” memory? 1
(c) What is dirty bit? 1
(d) Draw a block diagram to illustrate the basic organization of computer system and explain
the function of various units. 8
(e) What is input device? How does it differ from output device? 2
Answer
(a) Bus: A bus consists of a group of common communication lines, where each line is used to
transfer one bit of a register at a time. A bus is often described as an n-bit or n-line bus, meaning
that it consists of n parallel lines to transfer n bits of data all at a time. The value n is called the
width of the bus.
Three types of buses are present in computers: address bus, data bus and control bus.
Address bus carries the address information of data or instruction, data bus carries data or
instruction and control bus is used to send signals to various devices.
(b) In a multiprocessor system, the portion of main memory that is used by the host and managed
by the host operating system is called dumb memory.
(c) In write-back cache update policy, only cache location is updated and it is marked as updated
with an associated tag bit, known as dirty bit. The main memory location of the word is updated
later, when the block containing this marked word is to be removed from the cache to make
room for the new block.
(d) See answer of question no. 7(a) of 2007 (CS-303).
(e) The external devices through which information is transferred into the computer are known as
input devices. Popular input devices are the keyboard, mouse and scanner.
Output devices are also external devices, but they are used to receive the information stored in
memory. Examples of output devices are the monitor and printer. Thus the information flows in
opposite directions in input and output devices: for input devices, information flows from the
devices towards the central computer, whereas for output devices information flows from the
central computer to the devices.
10. (a) Draw the logic diagram and discuss the advantages of a carry look ahead adder over
conventional parallel adder. 5
(b) Discuss with suitable logic diagram the operation of an SRAM cell. 5
(c) What are the different status flags in a processor? Discuss overflow detection. 5
Answer
(a) A 4-bit Carry Look-ahead Adder (CLA) is shown below.
A Carry Look-ahead Adder (CLA) is a high-speed parallel adder, which adds two numbers
without waiting for the carries from the previous stages, unlike a conventional parallel adder such
as CPA. The carry-inputs of all stages are generated simultaneously, without using carries from
the previous stages. All carries depend only on initial carry C0.
Thus, this adder is very fast compared to the conventional parallel adders.
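The simultaneous carry generation rests on the standard generate/propagate relations; for a 4-bit CLA with operand bits Ai, Bi and initial carry C0:
Gi = Ai·Bi (carry generate), Pi = Ai ⊕ Bi (carry propagate)
C1 = G0 + P0·C0
C2 = G1 + P1·G0 + P1·P0·C0
C3 = G2 + P2·G1 + P2·P1·G0 + P2·P1·P0·C0
C4 = G3 + P3·G2 + P3·P2·G1 + P3·P2·P1·G0 + P3·P2·P1·P0·C0
Each carry thus depends only on the operand bits and C0, so all carries settle after a fixed two-level gate delay instead of rippling through the stages.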
(b) One SRAM cell using CMOS is shown below.
Four transistors (T3, T4, T5 and T6) are cross connected in such a way that they produce a
stable state. In state 1, the voltage at point A is maintained high and the voltage at point B is low
by keeping transistors T3 and T6 on (i.e. closed), while T4 and T5 are off (i.e. open). Similarly, in
state 0, the voltage at A is low and at point B is high, keeping transistors T3 and T6 off and
T4 and T5 on. Both these states are stable as long as power is applied. Thus, for
state 1, if T1 and T2 are turned on (closed), bit lines b and b′ will have high and low signals,
respectively.
Read Operation For the read operation, the word line is activated by the address input to the
address decoder. The activated word line closes both the transistors (switches) T1 and T2. Then
the bit values at points A and B are transmitted to their respective bit lines. The sense/write circuit
at the end of the bit lines sends the output to the processor.
Write Operation Similarly, for the write operation, the address provided to the decoder acti-
vates the word line to close both the switches. Then the bit value that is to be written into the cell
is provided through the sense/write circuit, and the signals on the bit lines are then stored into the
cell.
(c) The processor uses one special register, called the status register, to hold the latest program status. It
holds 1-bit flags to indicate certain conditions produced during arithmetic and logic opera-
tions. The bits are set or reset depending on the outcome of the most recent arithmetic or logic
operation. The register generally contains the following four flags:
Carry (C): indicates whether there is an end-carry from the most significant bit position.
Zero (Z): indicates whether the result is zero or non-zero.
Sign (S): indicates whether the result is positive or negative (the sign bit of the result).
Overflow (V): indicates whether an arithmetic overflow has occurred.
(b) A processor is generally equipped with multiple interrupt lines that are connected between the
processor and the I/O modules. Several I/O devices share these interrupt lines. There are two ways
to service multiple interrupts: the polling and daisy-chaining techniques.
In polling, interrupts are handled by software. When the processor detects an interrupt, it
branches to a common interrupt service routine (ISR) whose function is to poll each I/O device
to determine which device caused the interrupt. The order in which they are polled determines
the priority of each interrupt. The highest-priority device is polled first; if its interrupt signal is
found to be on, the CPU branches to the device's own ISR for the I/O transfer between the device
and the CPU. Otherwise it moves on to poll the device with the next highest priority.
(c) The processing required for a single instruction is called the instruction cycle. The control unit’s
task is to go through an instruction cycle that can be divided into five major phases:
1. Fetch the instruction from memory.
2. Decode the instruction.
3. Fetch the operand(s) from memory or register.
4. Execute the whole instruction.
5. Store the output result to the memory or register.
To read a memory word, the address is placed on the address bus to select the word among the
many available words in the first clock period T1. In the second clock period T2, the memory read
signal is activated, and on seeing the read signal the memory places the data on the data bus. The
total operation needs three clock periods.
2006
Computer Organization
and Architecture
Paper Code: CS-404 Semester: 4th
Time Allotted: 3 hours Full Marks: 70
(ii) With a neat sketch show the various fields in a typical instruction format. Also, write
the meaning of each field.
Answer
The most common format followed by instructions is depicted in the figure below.
When a new program is submitted, it tries to get started by taking frames from running
programs, causing page faults. The CPU then stays busy replacing the pages for which it
faults, only to bring them back right away. Thus, the CPU spends more time in page replacement
than in executing.
(v) How in von Neumann scheme is it decided whether a word fetched from memory is
data or instruction?
Answer
The control unit is responsible for fetching words, decoding them, fetching data (operands) from
the memory and providing proper control signals for all CPU operations. In the decoding
step, it is resolved whether a fetched word is data or an instruction.
(vi) In parallel processing what is sequential fraction? What is its effect on speed-up?
Answer
On a parallel computer, the execution time of a program is composed as X(n) + Y(n) = 1, where X
is the sequential fraction, Y is the parallel fraction and n is the program size.
On a sequential computer, the corresponding execution time is X(n) + p·Y(n), where p is the
number of processors in the parallel computer.
Therefore, the speed-up S is (X(n) + p·Y(n))/1, 1 being the time on the parallel computer.
So, S = X(n) + p·Y(n)
= X(n) + p·(1 – X(n))
= p – X(n)·(p – 1), which is known as Gustafson’s law.
If the sequential fraction decreases with program size, the speed-up S → p as n → ∞, which
is desired.
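A quick numeric check of Gustafson's law in Python (the values are illustrative):

def gustafson_speedup(x, p):
    # Scaled speed-up for sequential fraction x on p processors:
    # S = p - x * (p - 1).
    return p - x * (p - 1)

# A 5% sequential fraction on 64 processors still yields S = 60.85.
print(gustafson_speedup(0.05, 64))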
(vii) “Interrupt request is serviced at the end of current instruction cycle while DMA request
is serviced almost as soon as it is received, even before completion current instruction
execution.” - Explain.
Answer
In interrupt-initiated I/O, an interrupt request is serviced at the end of the current instruction
cycle, because the processor itself takes part in the I/O transfer for which it was inter-
rupted; the processor will thus be busy with the data transfer after this instruction.
But in a DMA transfer, the processor does not get involved during the data transfer; it only
initiates it. The whole data transfer is supervised by the DMA controller, and during that
time the processor is free to carry on its own task of instruction execution.
(viii) Write down the main differences between tightly coupled and loosely coupled parallel
processing systems.
Answer
1. In a tightly coupled system, a large main memory is shared among all processors, whereas
in a loosely coupled system each processor has a local main memory and there is no
such shared main memory.
2. In a tightly coupled system, the degree of interaction among tasks is high; it is less in a
loosely coupled system.
3. Bus conflicts are high in a tightly coupled system because of frequent sharing of code and
data between processors, whereas this is not a problem in a loosely coupled system.
2. (a) What do you mean by instruction cycle, machine cycle and T states ? 3
(b) What do you mean by implicit and immediate addressing mode? Illustrate with suitable
example. 1+2
(c) What is two address instruction? Evaluate the following arithmetic statement using two
address instructions. X = (A + B) * (C + D) 1+2
Answer
(a) See answer of question no. 4 of 2007 (CS-404).
(b) Implicit Mode: In this mode the operands are indicated implicitly by the instruction. The accu-
mulator register is generally used to hold the operand and after the instruction execution the
result is stored in the same register. For example,
(a) RAL; Rotates the content of the accumulator left through carry.
(b) CMA; Takes complement of the content of the accumulator.
Immediate Mode: In this mode the operand is mentioned explicitly in the instruction. In other
words, an immediate-mode instruction contains an operand value rather than an address of it in
the address field. This mode of instruction is useful for initializing registers to a constant value.
For example:
(a) MVI A, 06; Loads equivalent binary value of 06 to the accumulator.
(b) ADI 05; Adds the equivalent binary value of 05 to the content of AC.
(c) Two-address instructions:
The most popular instructions in commercial computers are two-address instructions. The gen-
eral register organized computers use two-address instructions. Each address field may specify
either a processor register or a memory operand.
The assembly program using two-address instructions to evaluate X = (A + B) * (C + D) is as
follows:
LOAD R1, A ; R1 ← M[A]
ADD R1, B ; R1 ← R1 + M[B]
LOAD R2, C ; R2 ← M[C]
ADD R2, D ; R2 ← R2 + M[D]
MULT R1, R2 ; R1 ← R1 * R2
STORE X, R1 ; X ← R1
Assume that LOAD symbolic op-code is used for transferring data to register from memory.
STORE symbolic op-code is used for transferring data to memory from register. The symbolic
op-codes ADD and MULT are used for the arithmetic operations addition and multiplication
respectively. Assume that the respective operands are in memory addresses A, B, C and D and
the result must be stored in the memory at address X.
3. (a) What is Locality of Reference? Explain the concept of cache memory with it. 1+2
(b) What is cache mapping? Explain direct mapping for 256 × 8 RAM and 64 × 8 cache. 4
(c) State L1 and L2 cache policy with suitable figure. 2
Answer
(a) Analysis of a large number of typical programs shows that the CPU references to main memory
during some time period tend to be confined within a few localized areas in memory. In other
words, few instructions in the localized areas in memory are executed repeatedly for some time
duration and other instructions are accessed infrequently. This phenomenon is known as the
property of locality of reference.
Consider that when a program loop is executed, the CPU repeatedly refers to the set of
instructions in memory that constitute the loop. Thus, loop tends to localize the references to
memory for fetching the instructions. If this program loop is placed in fast cache memory, the
average memory access time can be reduced, thus reducing the total execution time of the
program. Because the cache memory’s speed is almost same as that of CPU.
(b) The transfer of data as a block from main memory to cache memory is referred to as cache
mapping. Three types of cache mapping have been used.
1. Associative mapping.
2. Direct mapping.
3. Set-associative mapping.
Direct Mapping
Instead of storing the total address information with the data in the cache, as in the associative
cache, only part of the address bits is stored along with the data in direct cache mapping. Since the
size of the cache memory is 64 × 8 and that of main memory is 256 × 8, the CPU generates an
8-bit memory address. This 8-bit address is divided into two fields: the lower-order 6 bits form the
index field and the remaining higher-order 2 (= 8 – 6) bits form the tag field. The direct-mapping
cache organization uses the full 8-bit address to access the main memory and the 6-bit index to
access the cache. So, for our example, the index and tag fields are as shown below:
The internal organization of the words in the cache memory is shown in the figure below. We know
that data is transferred from main memory to cache memory in blocks. We assume a block
size of one word, since no block size is given.
The cache memory stores a maximum of 64 data words, each of 8 bits, along with their associated
2-bit tags. On a miss in the cache, a new block containing the desired word is brought into the
cache, and the tag bits are stored alongside the data bits. When the CPU generates a word's
address, the index field is used as the address to access the cache. The tag field of the CPU address
is then compared with the single tag stored at that indexed location in the cache. If the two tags
match, there is a hit, and the desired data word is read by the CPU from the cache. If there is no
match, there is a miss, and the required word is read from main memory; it is then stored in the
cache together with its new tag, replacing the previous tag-data pair at the indexed location given
by the address for which the miss occurred. Note, however, that the hit ratio can drop considerably
if two or more words whose addresses have the same index but different tags are accessed
repeatedly.
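The 6-bit-index/2-bit-tag scheme described above can be modelled in a few lines of Python (an illustrative sketch, keeping the one-word block size assumed in this answer):

CACHE_WORDS, INDEX_BITS = 64, 6
cache = [None] * CACHE_WORDS            # each entry holds (tag, data) or None

def cache_access(address, main_memory):
    # Direct-mapped lookup for an 8-bit address: the low 6 bits form the
    # index, the high 2 bits form the tag. Returns (data, hit_flag).
    index = address & (CACHE_WORDS - 1)
    tag = address >> INDEX_BITS
    entry = cache[index]
    if entry is not None and entry[0] == tag:
        return entry[1], True            # hit: one tag comparison suffices
    data = main_memory[address]          # miss: read the word from memory
    cache[index] = (tag, data)           # replace the previous tag-data pair
    return data, False

main_memory = list(range(256))           # stand-in for the 256 x 8 memory
print(cache_access(0b10000101, main_memory))   # miss: cached at index 5, tag 2
print(cache_access(0b10000101, main_memory))   # same index and tag: hit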
(c) Modern-day computers use cache memory at two or even three levels. In a two-level
cache organization, level 1 (L1) is used as the on-chip cache and level 2 (L2) is used as the
off-chip cache. This is shown in the following figure. Obviously, L1 cache access is faster than
L2, but the capacity of L2 is larger than that of L1. Since L1 is embedded in the processor, the
processor first accesses L1, and then L2 in case of a miss in L1.
4. (a) Show how non-restoring division method is deduced from restoring division method. 3
(b) Apply restoring division procedure to divide decimal number 20 by decimal 3. 3
(c) Design 2’s complement decremented circuit for a 4-bit number. 3
Answer
(a) In the restoring method, some extra additions are required to restore the number, when A is
negative. Proper restructuring of the restoring division algorithm can eliminate that restoration
step. This is known as the non-restoring division algorithm.
The three main steps in restoring division method are:
1. Shift AQ register pair to the left one position.
2. A = A – M.
3. If the sign of A is positive after the step 2, set Q[0] = 1; otherwise, set Q[0] = 0 and restore A.
Now, assume that the step 3 is performed first and then step 1 followed by step 2. Under this
condition, the following two cases may arise.
Case 1: When A is positive:
Note that shifting the A register to the left one position is equivalent to computing 2A, and the
subtraction that follows gives the net effect on A of 2A – M.
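For reference, the three restoring-division steps listed above translate directly into code. The following Python sketch for unsigned integers (register widths are abstracted away) also reproduces part (b), dividing 20 by 3:

def restoring_divide(q, m, n):
    # Divide the n-bit dividend q by the divisor m using restoring division.
    # a plays the role of the accumulator A; q collects the quotient bits.
    a = 0
    for _ in range(n):
        a = (a << 1) | (q >> (n - 1))    # step 1: shift AQ left one position
        q = (q << 1) & ((1 << n) - 1)
        a -= m                           # step 2: A = A - M
        if a >= 0:
            q |= 1                       # step 3: A positive, so Q[0] = 1
        else:
            a += m                       # otherwise Q[0] = 0 and restore A
    return q, a                          # quotient and remainder

print(restoring_divide(20, 3, 5))        # (6, 2), since 20 = 3 * 6 + 2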
6. (a) What are the advantages and disadvantages of microprogram control unit over hardwired
control unit? 3
(b) What are the general attributes of vertical and horizontal micro-instructions? 3
(c) What is priority interrupt ? Draw a diagram to show the priority interrupt scheme 1+ 2
Answer
(a) Advantages of microprogram control unit:
(i) It provides a well-structured control organization: control signals are systematically organ-
ized into formatted words (microinstructions). In contrast, logic gates, flip-flops, decoders and
other digital circuits must be wired together to implement a hardwired control organization.
(ii) With microprogramming, many additions and changes are made by simply changing the
microprogram in the control memory. A small change in the hardwired approach may lead
to redesigning the entire system.
Disadvantage of microprogram control unit:
The microprogramming approach is more expensive than the hardwired approach, since a control
memory (ROM) is needed in the microprogramming approach.
(b) The attributes of vertical micro-instructions:
• Short formats.
• Limited ability to express parallel microoperations.
• Considerable encoding of the control information.
The attributes of horizontal micro-instructions:
• Long formats.
• Ability to express a high degree of parallelism.
• Very little encoding of the control information.
(c) Priority interrupt: A priority interrupt is a system that establishes a priority over the various
interrupt sources to determine which condition is to be serviced first when two or more requests
arrive simultaneously. This is done by an interrupt controller. Devices with high-speed transfers,
such as magnetic disks, are usually given high priority, and slow devices such as keyboards
receive low priority. When two devices raise interrupt signals to the interrupt controller at the
same time, the CPU services the device with the higher priority first.
For priority interrupt scheme: see answer of question no. 8(b) of 2007 (CS-404).
7. (a) Define the speed-up of a parallel processing system. 2
(b) Show that when K jobs are processed over an N-stage pipeline, the speed-up obtained is
Sp = NK / (N + K – 1) 4
(c) With the help of a neat diagram, show the structure of a typical arithmetic pipeline
performing subtraction. 3
Answer
(a) The speed-up S(n) of a parallel processing system is defined as the ratio of the total execution
time T(1) on a serial processing system to the corresponding execution time T(n) on a processing
system whose degree of parallelism is n.
Thus, S(n) = T(1) / T(n)
8. Show that if a single powerful processor P is replaced by n small processors PZ, each having a
computational speed 1/n-th that of P, then no advantage in speed-up is obtained. 9
Answer
Let the execution time of processor P on a given job be x.
Each small processor PZ runs at 1/n-th the speed of P, so executing its 1/n-th share of the job
takes (x/n)·n = x time units. All n small processors work on their shares in parallel, so the total
execution time on the n processors is x.
The speed-up S obtained by replacing P with the n small processors is:
S = (Execution time of processor P) / (Total execution time of the n processors)
= x / x
= 1.
This result suggests that no speed-up is gained in this process.
9. Write short notes on any three of the following: 3 × 3 = 9
(a) Flynn’s classification
(b) Pipeline hazards
(c) DMA processing
(d) Wilke’s microprogramming control scheme
(e) Demand paging and page replacement methods.
Answer
(a) Flynn’s classification: Based on the number of simultaneous instruction and data streams used
by a CPU during program execution, digital computers can be classified into four categories as:
• Single instruction stream-single data stream (SISD) machine.
• Single instruction stream-multiple data stream (SIMD) machine.
• Multiple instruction stream-single data stream (MISD) machine.
• Multiple instruction stream-multiple data stream (MIMD) machine.
SISD Computer
Most serial computers available today fall into this organization, as shown in the figure. Instruc-
tions are executed sequentially but may be overlapped in their execution stages (in other words,
the technique of pipelining can be used in the CPU). Modern-day SISD uniprocessor systems
are mostly pipelined. Examples of SISD computers are the IBM 360/91, CDC Star-100 and TI-
ASC.
SIMD Computer
Array processors fall into this class. As illustrated in figure next, there are multiple processing
elements supervised by the common control unit. All PEs (processing elements, which are
essentially ALUs) receive the same instruction broadcast from the control unit but operate on
different data sets from distinct data streams. A shared memory subsystem containing multiple
modules is essential. This machine is generally used to process vector-type data. Examples of
SIMD computers include the Illiac-IV and BSP.
MISD Computer
Very few or no parallel computers fit in this organization, which is conceptually illustrated in
figure next. There are n processor elements, each receiving distinct instructions to execute on
the same data stream and its derivatives. The results (outputs) of one processor element become
the inputs (operands) of the next processor element in the series. This architecture is also known
as systolic arrays.
Captions: CU: control unit PE: processing element IS: instruction stream DS: data stream.
MISD computer (Systolic array)
MIMD Computer
This category covers multiprocessor systems and multiple computer systems. The structure of
MIMD computer is shown in figure next. An MIMD computer is called tightly coupled (or
Uniform Memory Access (UMA)) if the degree of interaction among the processors is high;
otherwise, it is considered loosely coupled (or Non-Uniform Memory Access (NUMA)). Most
commercial MIMD computers are loosely coupled. Examples of MIMD multiprocessors are
C.m*, C.mmp, Cray-3 and S-1.
(b) Pipeline hazards: Pipeline hazards are situations that prevent the next instruction in the instruc-
tion stream from executing during its designated clock cycle. The instruction is said to be
stalled. When an instruction is stalled, all instructions later in the pipeline than the stalled
instruction are also stalled. Instructions earlier than the stalled one can continue. No new
instructions are fetched during the stall.
Types of pipeline hazards are:
1. Control hazards
2. Structural hazards
3. Data hazards
Control hazards
They arise from the pipelining of branches and other instructions that change the content of
program counter (PC) register.
Solution of control hazards:
In order to cope with the adverse effects of branch instructions, an important technique called
prefetching is used. Prefetching technique states that: Instruction words ahead of the one cur-
rently being decoded in the instruction-decoding (ID) stage are fetched from the memory system
before the ID stage requests them.
Structural Hazards
Structural hazards occur when a certain resource (memory, functional unit) is requested by more
than one instruction at the same time.
Example: Instruction ADD R4, X fetches operand X from memory in the OF stage in the 3rd
clock period. The memory cannot accept another access during that period, so the (i+1)th
instruction cannot be initiated in the 3rd clock period to fetch its instruction from memory. Thus,
the pipeline stalls one clock cycle for all subsequent instructions. This is illustrated next.
(c) DMA processing: DMA transfers are performed by a control circuit that is part of the I/O device interface. We
refer to this circuit as a DMA controller. The DMA controller performs the functions that would
normally be carried out by the CPU when accessing the main memory. During DMA transfer,
the CPU is idle or can be utilized to execute another program and CPU has no control of the
memory buses. A DMA controller takes over the buses to manage the transfer directly between
the I/O device and the main memory.
The CPU can be placed in an idle state using two special control signals, HOLD and HLDA
(hold acknowledge). Figure below shows two control signals in the CPU that characterize the
DMA transfer. The HOLD input is used by the DMA controller to request the CPU to release
control of buses. When this input is active, the CPU suspends the execution of the current
instruction and places the address bus, the data bus and the read/write line into a high-imped-
ance state. The high-impedance state behaves like an open circuit, which means that the output
line is disconnected from the input line and does not have any logic significance. The CPU
activates the HLDA output to inform the external DMA controller that the buses are in the high-
impedance state. The control of the buses has been taken by the DMA controller that generated
the bus request to conduct memory transfers without processor intervention. After the transfer of
data, the DMA controller disables the HOLD line. The CPU then disables the HLDA line and
regains the control of the buses and returns to its normal operation.
To communicate with the CPU and I/O device the DMA controller needs the usual circuits of
an interface. In addition to that, it needs an address register, a word count register, a status
register and a set of address lines. Three registers are selected by the controller’s register select
(RS) line. The address register and address lines are used for direct communication with the
memory. The address register is used to store the starting address of the data block to be
transferred. The word count register contains the number of words that must be transferred. This
register is decremented by one after each word transfer and internally tested for zero after each
transfer. The data transfer can be done directly between the device and memory under the control
of the DMA. The status register contains information such as the completion of the DMA transfer. All
registers in the DMA controller appear to the CPU as I/O interface registers. Thus, the CPU can
read from or write into the DMA registers under program control via the data bus.
While executing the program for I/O transfer, the CPU first initializes the DMA controller.
After that, the DMA controller starts and continues to transfer data between memory and the
peripheral unit until an entire block is transferred. The DMA controller is initialized by the CPU by
sending the following information through the data bus:
1. The starting address of the memory block where data are available (for a read) or where data
are to be stored (for a write).
2. The number of words in the memory block (word count) to be read or written.
The condition select field selects the external condition to be tested. The output of the MUX
will be 1, if the selected condition is true. The MPC will be loaded with the address specified in
the branch address field of the microinstruction, because the output of the MUX is connected to
the load input of the microprogram counter (MPC). However, the MPC will point to the next
microinstruction to be executed, if the selected external condition is false. Thus, this arrange-
ment allows conditional branching. The control function field of the microinstruction may hold
the control information in an encoded form which thus may require decoders.
Microprogram counter (MPC): The task of the MPC is the same as that of the PC (program counter) used in
the CPU. The address of the next microinstruction to be executed is held by the MPC. Initially,
it is loaded from an external source to point to the starting address of the microprogram to be
executed. From then on, the MPC is incremented after each microinstruction fetch and the
instruction fetched is transferred to the CMBR. However, the MPC will be loaded with the
contents of the branch address field of the microinstruction that is held in the CMBR, when a
branch instruction is encountered.
External condition select MUX: Based on the contents of the condition select field of the
microinstruction, this MUX selects one of the external conditions. Therefore, the condition to be
selected must be specified in an encoded form. Any encoding leads to a short microinstruction,
which implies a small control memory; hence the cost is reduced. Suppose two external condi-
tions X1, X2 are to be tested; then the condition-select and actions taken are summarized next:
Condition select        Action taken
00                      No branching
01                      Branch if X1 = 1
10                      Branch if X2 = 1
11                      Always branch (unconditional branching)
The multiplexer has four inputs V0, V1, V2, V3 where Vi is routed to the multiplexer’s output
when the condition select field has decimal equivalent i. Hence we require V0 = 0, V1 = X1, V2
= X2, V3 = 1 to control the loading of microinstruction branch addresses into MPC.
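The branching logic thus reduces to a four-input multiplexer. A small Python sketch of this selection (names are illustrative):

def mpc_load(select, x1, x2):
    # Returns 1 when the MPC should be loaded with the branch address.
    # V0 = 0 (never branch), V1 = X1, V2 = X2, V3 = 1 (always branch).
    v = [0, x1, x2, 1]
    return v[select]

for sel in range(4):
    print(sel, mpc_load(sel, x1=1, x2=0))
# 0 -> 0 (no branch), 1 -> 1 (X1 set), 2 -> 0 (X2 clear), 3 -> 1 (always)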
(e) Demand paging and page replacement methods:
Demand paging: It is a technique to implement virtual memory. It is the combination of paging
and swapping. As its name suggests, a page is loaded into main memory on demand. Thus, the
basic idea of this technique is that a page is not loaded into the main memory from secondary
memory until it is needed. For example, suppose the logical address space available is 64 KB,
main memory available is 32 KB, the page and frame size is 4 KB. Therefore, the number of
pages exists in logical address space is 16 and frames exists in main memory is 8. Thus, only
8 pages of logical memory can be stored in main memory. Remaining 8 pages are brought into
main memory whenever those pages are required. Thus, some existing pages in main memory
are replaced to make room for the required pages. Thus, existing pages are swapped out of main
memory to secondary memory and new required pages are swapped into the frames of swapped
out pages in main memory.
To implement this scheme, the page table is modified to include a third field, known as the
valid/invalid bit, along with the page number and frame number fields. This bit field is set (valid)
when the associated page is in the main memory and it is reset (invalid) when the page is not in
main memory, but in secondary memory.
Page replacement: When a program starts execution, one or more pages are brought into main
memory, and the page table indicates their positions. When the CPU needs a
particular page for execution and that page is not in main (physical) memory (it is still in
secondary memory), the situation is called a page fault. Thus, access to a page marked invalid
raises a page-fault signal. When a page fault occurs, the execution of the present program is
suspended until the required page is brought into main memory from secondary memory. The
required page, when brought in, replaces an existing page in main memory.
Thus, when a page fault occurs, a page-replacement algorithm is needed to select one of the
existing pages to make room for the required page. Several replacement algorithms are available,
such as FIFO (First-In First-Out), LRU (Least Recently Used) and the optimal page-replacement
algorithm.
The FIFO algorithm is simplest and its criterion is “select a page for replacement that has
been in the main memory for longest period of time”.
The LRU algorithm states: “select for replacement the page that has not been used for the
longest time in the past”. The LRU algorithm is difficult to implement, because it requires a
counter for each page to keep information about the page’s usage.
The optimal algorithm generally gives the lowest page faults of all algorithms and its crite-
rion is “replace a page that will not be used for the longest period of time”. This algorithm is
also difficult to implement, because it requires future knowledge about page references.
An algorithm is evaluated by running it on a particular string of memory references and
computing the number of page faults. The string of memory references is called a reference
string.
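A replacement algorithm is conveniently evaluated by such a simulation. A Python sketch of the FIFO case (the reference string is illustrative):

from collections import deque

def fifo_faults(reference_string, n_frames):
    # Count page faults under FIFO replacement.
    frames, order, faults = set(), deque(), 0
    for page in reference_string:
        if page not in frames:
            faults += 1
            if len(frames) == n_frames:      # memory full: evict oldest page
                frames.remove(order.popleft())
            frames.add(page)
            order.append(page)
    return faults

print(fifo_faults([7, 0, 1, 2, 0, 3, 0, 4, 2, 3, 0, 3, 2], 3))   # 10 faults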
2007
Computer Organization
Paper Code: CS-303 Semester: 3rd
Time Allotted: 3 hours Full Marks: 70
Group-A
(Multiple-Choice Questions)
1. Choose the correct alternatives for the following: 10 × 1 = 10
(i) When signed numbers are used in binary arithmetic, then which one of the following
notations would have unique representation for zero?
(a) magnitude (b) 1’s complement (c) 2’s complement (d) none.
Answer
(c) 2’s complement
(ii) How many address bits are required for a 1024 × 8 memory?
(a) 1024 (b) 5 (c) 10 (d) none.
Answer
(c) 10
(iii) The principle of locality justifies the use of
(a) Interrupt (b) Polling (c) DMA (d) Cache memory.
Answer
(d) cache memory
(iv) A major advantage of direct mapping of a cache is its simplicity. The main disadvantage of
this organization is that
(a) it does not allow simultaneous access to the intended data and its tag
(b) it is more expensive than other types of cache organizations
(c) the cache hit ratio is degraded if two or more blocks used alternatively map onto the
same block frame in the cache
(d) its access time is greater than that of other cache organizations.
Answer
(c) the cache hit ratio is degraded if two or more blocks used alternatively map onto the
same block frame in the cache
(v) A digital computer has a memory unit with 24 bits per word. The instruction set consists of
150 different operations. All instructions have an operation code part (op-code) and an address
part (allowing for only one address). Each instruction is stored in one word of memory. The
number of bits needed for the op-code is
(a) 6 (b) 7 (c) 8 (d) 9
Answer
(c) 8
(vi) The maximum n-bit 2’s complement number is
(a) 2^n (b) 2^n – 1 (c) 2^(n–1) – 1 (d) cannot be said.
Answer
(c) 2^(n–1) – 1
(vii) State true or false:
Adding (0110 1101)₂ to (1010 0010)₂ in 8-bit 2’s complement binary will cause an overflow:
(a) true (b) false.
Answer
(b) false (the two operands have opposite signs, and adding two numbers of opposite signs can never cause an overflow)
(viii) Micro instructions are kept in
(a) main memory (b) control memory (c) cache memory (d) none.
Answer
(b) control memory
(ix) Instruction cycle is
(a) fetch-decode-execution (b) decode-fetch-execute
(c) fetch-execution-decode (d) none.
Answer
(a) fetch-decode-execution
(x) The basic principle of the von Neumann computer is
(a) storing program and data in separate memory
(b) using pipe line concept
(c) storing both program and data in the same memory
(d) using a large number of registers.
Answer
(c) storing both program and data in the same memory
Group-B
(Short-Answer Questions)
Answer any three of the following. 3 × 5 = 15
2. (a) Explain the difference between three-address, two-address, one-address instructions & zero-
address instruction with suitable examples.
(b) Give an example and explain Base-Index Addressing. 3+2
Answer:
(a) The size of programs consisting of all three-address instructions is small, whereas that of
programs using zero-address instructions is large. Three- and two-address instructions are gen-
erally used in general-register organized processors, one-address instructions are used in single-
accumulator based processors, and zero-address instructions are used in stack-based CPU organi-
zations.
Suppose we have to evaluate the arithmetic statement
X = (A + B) * C
using zero, one, two or three address instructions. For this, LOAD symbolic op-code is used for
transferring data to register from memory. STORE symbolic op-code is used for transferring
data to memory from register. The symbolic op-codes ADD and MULT are used for the arith-
metic operations addition and multiplication respectively. Assume that the respective operands
are in memory addresses A, B and C and the result must be stored in the memory at address X.
Using three-address instructions, the program code in assembly language is as:
ADD R1, A, B ; R1 ← M[A] + M[B]
MULT X, C, R1 ; X ← M[C] * R1
Using two-address instructions, the program code in assembly language is as:
LOAD R1, A ; R1 ← M[A]
ADD R1, B ; R1 ← R1 + M[B]
LOAD R2, C ; R2 ← M[C]
MULT R1, R2 ; R1 ← R1 * R2
STORE X, R1 ; X ← R1
Using one-address instructions, the program code in assembly language is as:
LOAD A ; AC ← M[A]
ADD B ; AC ← AC + M[B]
STORE T ; T ← AC
LOAD C ; AC ← M[C]
MULT T ; AC ← AC * M[T]
STORE X ; X ← AC
Using zero-address instructions, the program code in assembly language is as:
PUSH A ; TOS ← A [TOS means top of the stack]
PUSH B ; TOS ← B
ADD ; TOS ← (A + B)
PUSH C ; TOS ← C
MULT ; TOS ← ((A + B) * C)
POP X ; X ← TOS
(b) In this mode the content of the base register (BR) is added to the address part of the instruction
to obtain the effective address. This mode is similar to the indexed addressing mode; the
difference lies in the way they are used. A base register holds the starting address of a memory
array of operands, and the address part of the instruction gives a displacement or offset relative
to this starting address. This mode is used for relocation of programs in memory.
For example, an operand array starts at memory address 1000 and thus the base register BR
contains the value 1000. Now consider load instruction
LDA 0002
The effective address of the operand is calculated as:
Effective address = 0002 + content of BR
= 1002.
3. (a) Explain the difference between fully associative and direct-mapped cache mapping approaches.
(b) What are “write through” and “write back” policies in cache? 3+2
Answer
(a) The fully associative cache memory uses the fastest and most flexible mapping method, in
which both address and data of the memory word are stored. This memory is expensive because
of additional storage of addresses with data in the cache memory.
In direct cache mapping, instead of storing the total address information with the data in the cache,
only part of the address bits is stored along with the data. Suppose the cache memory can hold 2^m
words and main memory can hold 2^n words. The n-bit address generated by the CPU is divided
into two fields: the lower-order m bits form the index field and the remaining higher-order (n – m)
bits form the tag field. The direct-mapping cache organization uses the m-bit index to access the
cache, and the higher-order (n – m) tag bits are stored alongside the data in the cache. This is the
simplest type of cache mapping, since only the tag field needs to be matched; that is why it is one
of the fastest caches. It is also less expensive than the associative cache.
(b) There are two policies in writing into cache memory: (i) write-through (ii) write-back.
Write-Through Policy: This is the simplest and most commonly used procedure to update the
cache. In this technique, when the cache memory is updated, at the same time the main memory
is also updated. Thus, the main memory always contains the same data as the cache.
Write-Back Policy: In this method, during a write operation only the cache location is up-
dated. When the update occurs, the location is marked by a flag called modified or dirty bit.
When the word is replaced from cache, it is written into main memory if its flag bit is set.
4. (a) Briefly explain the IEEE 754 standard format for floating point representation.
(b) How are NaN (Not a Number) and Infinity represented in this standard? 3+2
Answer
(a) The IEEE 754 has two similar formats as follows:
1. Single precision format: This is a 32-bit format, in which 8 bits are for the exponent, 23 bits
for the mantissa and 1 bit for the sign of the number, as shown in the following figure.
Here the base 2 is implied, and the original signed exponent E is not stored in the register. The
value actually stored in the exponent field is an unsigned integer E′ called the biased exponent,
which is calculated by the relation E′ = E + 127. This is referred to as the excess-127 format.
Thus, E′ is in the range 0 ≤ E′ ≤ 255. The end values of this range, 0 and 255, are used to
represent special values. Therefore, for normal values, the range of E′ is 1 ≤ E′ ≤ 254. This
means that the actual exponent (E) is in the range –126 ≤ E ≤ 127.
2. Double precision format: This is a 64-bit format in which 11 bits are for the biased exponent E′,
52 bits for the mantissa M and 1 bit for the sign of the number, as shown in the next figure. The
representation is the same as in the single precision format, except for the sizes and the related parameters.
(b) Not a Number (NaN) is represented when E′ = 255 and M ≠ 0. NaN is the result of performing an
invalid operation such as 0/0 or √–1.
Infinity is represented when E′ = 255 and M = 0. Infinity is the result of dividing a
normal number by 0.
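The field layout can be inspected directly with Python's struct module; the short sketch below prints the sign, biased exponent E′ and mantissa of a few single-precision values (the function name is illustrative):

import struct

def ieee754_fields(x):
    # Re-interpret the 32-bit single-precision pattern as an integer.
    bits = struct.unpack('>I', struct.pack('>f', x))[0]
    sign = bits >> 31                 # 1 sign bit
    e_biased = (bits >> 23) & 0xFF    # 8-bit biased exponent E' = E + 127
    mantissa = bits & 0x7FFFFF        # 23-bit mantissa
    return sign, e_biased, mantissa

print(ieee754_fields(1.0))           # (0, 127, 0): E = 127 - 127 = 0
print(ieee754_fields(float('inf')))  # (0, 255, 0): E' = 255, M = 0
print(ieee754_fields(float('nan')))  # E' = 255, M != 0 (typically (0, 255, 4194304))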
5. (a) Draw and explain the basic structure of a Hard-Disk and explain seek-time and latency time
associated with it.
(b) Compare restoring and non-restoring division algorithms. 3+2
Answer
(a) Disks that are permanently attached to the unit assembly and cannot be removed by the general
user are called hard disks. A hard disk consists of some number of circular magnetic disks. The
magnetic disk is made of either aluminum or plastic coated with a magnetic material so that
information can be stored on it. The recording surface is divided into a number of concentric
circles called tracks. The tracks are commonly divided into sections called sectors. To distin-
guish between two consecutive sectors, there is a small inter-sector gap. In most systems, the
minimum quantity of information transfer is a sector. The information is accessed onto the
tracks using movable read-write heads that move from the innermost to the outermost tracks and
vice-versa. There is one read-write head per surface. During normal operation, the disks rotate
continuously at a constant angular velocity. Tracks of the same radius on the different surfaces
form a logical cylinder.
To access data, the read-write head must be placed on the proper track based on the given
cylinder address. The time required to position the read-write head over the desired track is
known as the seek time, ts. After positioning the read-write head on the desired track, the disk
controller has to wait until the desired sector is under the read-write head. This waiting time is
known as rotational latency, tl. The access time of the disk is the sum of ts and tl.
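As a numeric illustration (the disk parameters below are assumed, not taken from the question), the average rotational latency is half a revolution, so the average access time works out as:

#include <stdio.h>

int main(void) {
    double rpm = 7200.0;                    /* assumed rotational speed */
    double seek_ms = 9.0;                   /* assumed average seek time ts */
    double rev_ms = 60000.0 / rpm;          /* one revolution: about 8.33 ms */
    double latency_ms = rev_ms / 2.0;       /* average tl = half a revolution */
    printf("access time = ts + tl = %.2f ms\n", seek_ms + latency_ms);
    return 0;
}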
(b) Both algorithms are directly used for division of two unsigned numbers. The restoring algorithm requires more additions than the non-restoring algorithm, because in the restoring algorithm, whenever the accumulator register becomes negative after the divisor is subtracted from it, the accumulator content is restored to its previous value by adding the divisor back. This restoration step is not required in the non-restoring algorithm. However, the non-restoring algorithm may require one restoration step at the end, if the content of the accumulator is negative.
6. (a) Explain the difference between Instruction Pipeline and Arithmetic Pipeline.
(b) What are the different hazards in Pipeline? 3+2
Answer
(a) 1. Instruction pipeline is used to process all instructions, whereas arithmetic pipeline is used to
process arithmetic type instructions such as addition, subtraction, multiplication, etc.
2. In instruction pipeline, the execution of a stream of instructions can be pipelined by over-
lapping the execution of the current instruction with the fetch, decode and operand fetch of
subsequent instructions. An arithmetic pipeline divides an arithmetic operation, such as a
multiply, into multiple arithmetic steps each of which is executed one-by-one in different
arithmetic stages in the ALU.
3. All high-performance computers are now equipped with instruction pipelines. The number of arithmetic pipelines varies from processor to processor.
(b) Pipeline hazards are situations that prevent the next instruction in the instruction stream from
executing during its designated clock cycle. There are three types of pipeline hazards:
1. Control hazards
2. Structural hazards
3. Data hazards
Control hazards:
They arise from the pipelining of branches and other instructions that change the content of
program counter (PC) register.
Structural Hazards:
Structural hazards occur when a certain resource (memory, functional unit) is requested by more
than one instruction at the same time.
Data Hazards:
Inter-instruction dependencies may prevent the sequential (in-order) data flow in the pipeline when successive instructions overlap their fetch, decode and execution in a pipelined processor. Such a situation due to inter-instruction dependencies is called a data hazard.
Group-C
(Long-Answer Questions)
Answer any three of the following questions. 3 × 15 = 45
7. (a) Describe the function of Major Components of a digital computer with neat sketch.
(b) Explain the reading and writing operations of a basic static MOS cell.
(c) How many 128 × 16 RAM chips are needed to construct a memory capacity of 4096 words (16 bits is one word)? How many lines of the address bus must be used to access a memory of 4096 words? For chip select, how many lines must be decoded? 5+6+4
Answer
(a) A digital computer consists of five major components: the arithmetic logic unit (ALU), the control unit, the memory, the input unit and the output unit, as shown in the sketch.
(i) Arithmetic Logic Unit (ALU): This unit performs the arithmetic and logic operations (such as addition, subtraction and comparison) on the data supplied to it, as directed by the control unit.
(ii) Control Unit: This is the unit that supervises the flow of information between various units.
The control unit retrieves the instructions using registers one by one from the program,
which is stored in the memory. The instructions are interpreted (or decoded) by the control
unit itself and then the decoded instructions are sent to the ALU for processing.
(iii) Memory: The memory unit stores programs as well as data. Generally three types of memo-
ries are used: secondary, main and cache memories.
(iv) Input Unit: This unit transfers the information as provided by the users into memory.
Examples include keyboard, mouse, scanner, etc.
(v) Output Unit: The output units receive the results of the computation and display them on the monitor, or the user obtains printed results by means of a printer.
(b) One SRAM cell using CMOS is shown in the figure next. Four transistors (T3, T4, T5 and T6) are cross-connected in such a way that they produce a stable state. In state 1, the voltage at point A is maintained high and the voltage at point B is low by keeping transistors T3 and T6 on (i.e. closed), while T4 and T5 are off (i.e. open). Thus, for state 1, if T1 and T2 are turned on (closed), bit lines b and b′ will have high and low signals, respectively.
Read Operation: For the read operation, the word line is activated by the address input to the
address decoder. The activated word line closes both the transistors (switches) T1 and T2. Then
the bit values at points A and B can transmit to their respective bit lines. The sense/write circuit
at the end of the bit lines sends the output to the processor.
Write Operation: Similarly, for the write operation, the address provided to the decoder activates
the word line to close both the switches. Then the bit value to be written into the cell is provided through the sense/write circuit, and the signals on the bit lines are then stored into the cell.
(c) The number of 128 × 16 RAM chips needed to construct a memory capacity of 4096 × 16 is
= (4096/128) * (16/16) = 32.
Since the large memory size is 4096 words, the number of lines in the address bus used to access the memory is 12 (4096 = 2^12).
Since 32 rows (= 4096/128) of 128 × 16 RAMs are used to construct the large memory of size 4096 × 16, the number of lines to be decoded to select a RAM chip is 5 (32 = 2^5). Therefore the address decoder size is 5-to-32.
8. (a) Give the Booth’s algorithm for multiplication of signed 2’s complement numbers in flow-
chart and explain.
(b) Multiply –5 by –3 using Booth’s algorithm.
(c) What is von Neumann architecture? What is von Neumann bottleneck?
(d) What is virtual memory? 5+4+4+2
Answer
(a)
The algorithm inspects the two lower-order multiplier bits at a time to decide the next action. The algorithm is described by the flowchart shown next. A flip-flop (a fictitious bit position) is used to the right of the lsb of the multiplier and is initialized to 0. Subsequently, it receives the lsb of the multiplier each time the multiplier is shifted right.
Once all bits of the multiplier have been inspected, the accumulator and multiplier register together contain the product. The right-end flip-flop, which held the initial 0 and the subsequent lsbs of the multiplier, is ignored, as it is a fictitious bit.
(b) M = –5 = 1011 and Q = –3 = 1101.
M A Q Q[–1] Size
Initial
configuration 1011 0000 1101 0 4
Step-1
As Q[0]=1 and Q[–1]=0:
A=A–M 1011 0101 1101 0 –
ARS(AQ) 1011 0010 1110 1 3
Step-2
As Q[0]=0 and Q[–1]=1:
A=A+M 1011 1101 1110 1 –
ARS(AQ) 1011 1110 1111 0 2
Step-3
As Q[0]=1 and Q[–1]=0:
A=A–M 1011 0011 1111 0 –
ARS(AQ) 1011 0001 1111 1 1
Step-4
As Q[0]=1 and Q[–1]=1:
ARS(AQ) 1011 0000 1111 1 0
Since the size register becomes 0, the algorithm terminates and the product is AQ = 0000 1111, which shows that the product is a positive number. The result is +15 in decimal.
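The same trace can be reproduced by a short C sketch (our own rendering of the flowchart; the 4-bit registers are kept in ints and q_1 plays the role of the flip-flop Q[–1]):

#include <stdio.h>

int main(void) {
    int n = 4;
    int M = 0xB;                  /* 1011: -5 in 4-bit two's complement */
    int Q = 0xD;                  /* 1101: -3 */
    int A = 0, q_1 = 0;           /* accumulator and the extra flip-flop */

    for (int i = 0; i < n; i++) {
        int q0 = Q & 1;
        if (q0 == 1 && q_1 == 0) A = (A - M) & 0xF;   /* A = A - M */
        if (q0 == 0 && q_1 == 1) A = (A + M) & 0xF;   /* A = A + M */
        q_1 = q0;                                     /* Q[-1] gets old Q[0] */
        /* ARS(AQ): arithmetic right shift of the register pair */
        Q = ((Q >> 1) | ((A & 1) << (n - 1))) & 0xF;
        A = ((A >> 1) | (A & 0x8)) & 0xF;             /* replicate sign bit */
    }
    printf("product = %d\n", (A << n) | Q);           /* 0000 1111 = 15 */
    return 0;
}

Running it prints product = 15, matching the table above.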
(c) Von Neumann architecture:
This architecture uses a concept known as the stored-program concept, which has three main principles:
1. Program and data can be stored in the same memory.
2. The computer executes the program in sequence as directed by the instructions in the program.
3. A program can modify itself while the computer executes it.
Each instruction contains only one memory address and has the format:
OPCODE ADDRESS
The 8-bit op-code specifies the operation to be performed by the CPU and the 12-bit address specifies the operand's memory address. Thus, the length of each instruction is 20 bits.
Von-Neumann bottleneck:
One of the major factors contributing to a computer's performance is the time required to move instructions and data between the CPU and main memory. The CPU has to wait longer to obtain a data word from memory than from its registers, because the registers are very fast and are logically placed inside the processor (CPU). This CPU-memory speed disparity is referred to as the von Neumann bottleneck.
(d) Virtual memory is a technique used in some large computer systems that gives the programmer the illusion of a main memory larger than the one physically present. The size of virtual memory is equivalent to the size of secondary memory. Each address referenced by the CPU, called the virtual (logical) address, is mapped to a physical address in main memory. This mapping is done at run-time by a hardware device called the memory-management unit (MMU) with the help of a memory map table, which is maintained by the operating system.
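A toy C model of the translation performed by the MMU (the page size and the contents of the memory map table are assumed purely for illustration):

#include <stdio.h>

#define PAGE_BITS 10                           /* assumed 1 KB pages */
#define PAGE_SIZE (1u << PAGE_BITS)

/* memory map table: virtual page number -> physical frame number */
static const unsigned page_table[8] = {3, 7, 0, 2, 5, 1, 6, 4};

static unsigned translate(unsigned vaddr) {
    unsigned vpn = vaddr >> PAGE_BITS;         /* virtual page number */
    unsigned offset = vaddr & (PAGE_SIZE - 1); /* offset within the page */
    return (page_table[vpn] << PAGE_BITS) | offset;
}

int main(void) {
    unsigned v = 0x0C45;                       /* page 3, offset 0x045 */
    printf("virtual 0x%04X -> physical 0x%04X\n", v, translate(v));
    return 0;
}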
9. (a) Explain the basic Direct Memory Access (DMA) operation for transfer of data bytes be-
tween memory and peripherals.
(b) Give the main reason why DMA based I/O is better in some circumstances than interrupt
driven I/O?
(c) What is programmed I/O technique? Why is it not very useful?
(d) According to the following information, determine size of the subfields (in bits) in the
address for Direct Mapping and Set Associative Mapping cache schemes:
l We have 256 MB main memory and 1 MB cache memory
l The address space of the processor is 256 MB
l The block size is 128 bytes
l There are 8 blocks in a cache set. 5+3+3+4
Answer
(a) DMA transfers are performed by a control circuit that is part of the I/O device interface. We
refer to this circuit as a DMA controller. The DMA controller performs the functions that would
normally be carried out by the CPU when accessing the main memory. During DMA transfer,
the CPU is idle or can be utilized to execute another program and CPU has no control of the
memory buses. A DMA controller takes over the buses to manage the transfer directly between
the I/O device and the main memory.
The CPU can be placed in an idle state using two special control signals, HOLD and HLDA
(hold acknowledge). The following figure shows two control signals in the CPU that character-
ize the DMA transfer. The HOLD input is used by the DMA controller to request the CPU to
release control of buses. When this input is active, the CPU suspends the execution of the
current instruction and places the address bus, the data bus and the read/write line into a high-
impedance state. The high-impedance state behaves like an open circuit, which means that the
output line is disconnected from the input line and does not have any logic significance. The
CPU activates the HLDA output to inform the external DMA controller that the buses are in the
high-impedance state. Control of the buses is then taken by the DMA controller that generated the bus request, which conducts the memory transfers without processor intervention. After the transfer of data, the DMA controller disables the HOLD line. The CPU then disables the HLDA line, regains control of the buses, and returns to its normal operation.
(b) To transfer large blocks of data at high speed, the DMA method is used. A special DMA controller is provided to allow transfer of a block of data directly between a high-speed external device, such as a magnetic disk, and the main memory, without continuous intervention by the CPU. The data transmission is not stopped or slowed down until an entire block has been transferred; this mode of DMA transfer is known as burst transfer.
(c) This is a software method in which the CPU is needed throughout the data transfer between the two devices. Programmed I/O operations are the result of I/O instructions written in the computer program or I/O routine. Each data-item transfer is initiated by an instruction in the program or I/O routine. Generally, the transfer is to and from a CPU register and a peripheral, and transferring data under program control requires constant monitoring of the peripheral by the CPU.
The programmed I/O method is particularly useful in small low-speed computers or in systems dedicated to monitoring a device continuously. However, the CPU is generally far faster than an I/O device, and this large difference in data transfer rate between the CPU and the I/O device makes this type of transfer inefficient.
(d) The address space of the processor is 256 MB, so the processor generates a 28-bit address (256 MB = 2^28).
The cache memory size = 1 MB.
Therefore, the size of the index field of the cache = 20 bits (1 MB = 2^20).
The tag field uses 28 – 20 = 8 bits.
The number of blocks in the cache = size of cache / size of a block = 2^20/2^7 = 8192.
Therefore, the number of bits required to select a block = 13 (since 8192 = 2^13).
The size of each block is 128 bytes, so the number of bits required to select a word (byte) in a block = 7.
Thus, the address format for the direct-mapped cache is: Tag (8 bits) | Block (13 bits) | Word (7 bits).
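The split can be checked with a few shifts and masks in C (the field widths come from the working above; the sample address is arbitrary):

#include <stdio.h>

int main(void) {
    unsigned addr = 0x0ABCDEF;                /* any 28-bit address */
    unsigned word  = addr & 0x7F;             /* low 7 bits: byte in block */
    unsigned block = (addr >> 7) & 0x1FFF;    /* next 13 bits: block number */
    unsigned tag   = addr >> 20;              /* high 8 bits: tag */
    printf("tag=%u block=%u word=%u\n", tag, block, word);
    return 0;
}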
The read and write signals pass through coils around a ring of soft magnetic material. A very
narrow gap separates the ring from a storage cell on a track so that their respective magnetic
field can induct. This induction permits information transfer between the head and the storage
medium.
(b) The subtraction A – B is equivalent to A + (2's complement of B), i.e. A + (1's complement of B) + 1. Addition and subtraction can therefore be combined into a single circuit by placing an exclusive-OR (XOR) gate at the B input of each full adder. The 4-bit adder-subtractor circuit is shown in the figure next. The selection input S determines the operation: when S = 0, the circuit performs addition, and when S = 1, it performs subtraction.
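A gate-level C sketch of the circuit (our own illustration): S drives one input of the XOR gate on each B bit and also serves as the initial carry-in, so S = 1 effectively adds the 2's complement of B.

#include <stdio.h>

/* 4-bit ripple adder-subtractor: A + B when s = 0, A - B when s = 1 */
static unsigned add_sub(unsigned a, unsigned b, unsigned s) {
    unsigned carry = s;                          /* S is the initial carry-in */
    unsigned result = 0;
    for (int i = 0; i < 4; i++) {
        unsigned ai = (a >> i) & 1;
        unsigned bi = ((b >> i) & 1) ^ s;        /* XOR gate on each B input */
        unsigned sum = ai ^ bi ^ carry;          /* full-adder sum */
        carry = (ai & bi) | (carry & (ai ^ bi)); /* full-adder carry */
        result |= sum << i;
    }
    return result;                               /* carry-out discarded */
}

int main(void) {
    printf("6 + 3 = %u\n", add_sub(6, 3, 0));    /* 9 */
    printf("6 - 3 = %u\n", add_sub(6, 3, 1));    /* 3 */
    return 0;
}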
(c) The ALU of the CPU executes instructions as dictated by the op-code field of each instruction. The instructions operate on data stored in registers or memory. The different ways in which the location of an operand can be specified in an instruction are referred to as addressing modes. A computer uses a variety of addressing modes; some of them are described below:
1. Implied (or Inherent) Mode: In this mode the operands are indicated implicitly by the
instruction. The accumulator register is generally used to hold the operand and after the
instruction execution the result is stored in the same register. For example,
RAL; Rotates the content of the accumulator left through carry.
2. Immediate Mode: In this mode the operand is mentioned explicitly in the instruction. In other words, an immediate-mode instruction contains an operand value, rather than its address, in the address field. This mode is useful for initializing registers to constant values. For example:
MVI A, 06; Loads equivalent binary value of 06 to the accumulator
3. Register (Direct) Mode: In this mode the processor registers hold the operands. In other
words, the address field is now register field, which contains the operands required for the
instruction.
For example:
ADD R1, R2; Adds contents of registers R1 and R2 and stores the result in R1.
4. Register Indirect Mode: In this mode the instruction specifies an address of CPU register
that holds the address of the operand in memory.
5. Direct (or Absolute) Address Mode: In this mode the instruction contains the memory
address of the operand explicitly. Example of direct addressing is:
STA 2500H ; Stores the content of the accumulator in the memory location 2500H.
6. Indirect Address Mode: In this mode the instruction gives a memory address in its address
field which holds the address of the operand.
For example:
MOV R1, (X) ; Content of the location whose address is given in X is loaded
into register R1.
7. Relative Address Mode or PC-relative Address Mode: In this mode the effective address is
obtained by adding the content of program counter (PC) register with address part of the
instruction.
(d) A stack-based computer executes its instructions on a data structure called a stack, which is included in the CPU of most computers. A stack is a list of data words with a Last-In, First-Out (LIFO) access method. A portion of the memory unit used to store operands in successive locations can be considered a stack. The register that holds the address of the topmost operand in the stack is called the stack pointer (SP). The two operations performed on the operands stored in a stack are PUSH and POP, and operands are pushed or popped from one end only. The PUSH operation inserts one operand at the top of the stack and decrements the stack pointer register; the POP operation deletes one operand from the top of the stack and increments the stack pointer register.
For example, the figure below shows a stack of four data words in memory. The PUSH and POP instructions each require an address field. The PUSH instruction has the format:
PUSH <memory address>
The PUSH instruction inserts the data word at the specified address onto the top of the stack. The POP instruction has the format:
POP <memory address>
The POP instruction transfers the data word at the top of the stack to the specified address. The stack pointer is updated automatically in either case. The PUSH operation can be implemented as
SP ← SP – 1 ; decrement the SP by 1
M[SP] ← M[<memory address>] ; store the word from the specified memory
address at the top of the stack
The POP operation can be implemented as
M[<memory address>] ← M[SP] ; transfer the topmost data word
into the specified memory location
SP ← SP + 1 ; increment the SP by 1
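The two operations map directly onto C (a sketch of our own, with memory as an array; as in the text, the stack grows toward lower addresses):

#include <stdio.h>

static int memory[100];
static int sp = 100;               /* stack pointer: stack grows downward */

static void push(int addr) {       /* PUSH <memory address> */
    sp = sp - 1;                   /* SP <- SP - 1 */
    memory[sp] = memory[addr];     /* M[SP] <- M[addr] */
}

static void pop(int addr) {        /* POP <memory address> */
    memory[addr] = memory[sp];     /* M[addr] <- M[SP] */
    sp = sp + 1;                   /* SP <- SP + 1 */
}

int main(void) {
    memory[5] = 42;
    push(5);                       /* 42 is now at the top of the stack */
    pop(7);                        /* memory[7] receives 42 */
    printf("%d\n", memory[7]);
    return 0;
}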
The figure next shows the effects of these two operations on the stack in figure above.
(e) A tri-state gate is a digital circuit that exhibits three states, of which two are the normal states of logic 1 and logic 0, as in a conventional gate. The third state is a high-impedance state. The gate is controlled by a separate control input C. If C is high, the gate behaves like a normal logic gate with output 1 or 0. When C is low, the gate does not produce any output, irrespective of the input values. The graphic symbol of a tri-state buffer gate is shown below.
The common bus is used to transfer a register's content to another register or to memory, one transfer at a time. A common bus system with tri-state buffers is described in the figure next, where one line
of the common bus is shown.
Assume that there are four registers A, B, C and D. The outputs of four buffers are connected
together to form a single line of the bus. The control inputs to the buffers, which are generated
by a common decoder, determine which of the four normal inputs will communicate with the
common line of the bus. Note that only one buffer may be in the active state at any given time, because the selection lines S0, S1 of the decoder activate only one of its output lines at a time, and those output lines act as the control lines to the buffers. For example, if the select combination S1S0 equals 00, then the 0th output of the decoder is activated, which enables the top-most tri-state buffer, and the bus line then carries A0.
11. (a) Classify memory system in a digital computer according to their use.
(b) A random access memory module of capacity 2048 bytes is to be used in a computer and
mapped between 2000H and 27FFH. Explain, with the help of a block diagram, the address decoding scheme, assuming a 16-bit address bus.
(c) How do the following influence the performance of a virtual memory system?
(i) Size of page
(ii) Replacement policy. 3+7+5
Answer
(a) The memory system consists of all the storage devices used in a computer system and is broadly divided into the following four groups:
l Secondary (auxiliary) memory
l Main (primary) memory
l Cache memory
l Internal memory
Secondary Memory: The slow-speed and low-cost devices that provide backup storage are
called secondary memory. The most commonly used secondary memories are magnetic disks, such as hard disks and floppy disks, and magnetic tapes. This type of memory stores all programs and data, since it is available in bulk. When a program not residing in main memory needs to be executed, it is transferred from secondary memory to main memory. Programs not currently needed in main memory (in other words, programs not currently being executed by the processor) are moved to secondary memory to make space for the programs and data currently in use.
Main Memory: This is the memory that communicates directly with CPU. Only programs and
data currently needed by the CPU for execution reside in the main memory. Main memory
occupies central position in hierarchy by being able to communicate directly with CPU and with
secondary memory devices through an I/O processor.
Cache Memory: This is a special high-speed memory, sometimes used to increase the speed of processing by making the current programs and data available to the CPU at a rapid rate. Generally, the CPU is faster than main memory, with the result that processing speed is limited mainly by the speed of main memory. So, a technique used to compensate for the speed mismatch between the CPU and main memory is to place an extremely fast, small cache between them, whose access time is close to the CPU cycle time.
Internal memory: This memory refers to the high-speed registers used inside the CPU. These
registers hold temporary results when a computation is in progress. There is no speed disparity
between these registers and the CPU because they are fabricated with the same technology.
However, since registers are very expensive, only a few registers are used as internal memory.
(b) Since the capacity of the RAM memory is 2048 bytes, i.e. 2 KB, the memory uses 11 address lines (2 KB = 2^11), say A10–A0, to select one word. Thus, the memory's internal address decoder uses the 11 lines A10–A0 to select one word.
To select this memory module, the remaining 5 (= 16 – 11) address lines A15–A11 are used. Thus, an external decoding scheme is employed on these higher-order five address bits of the processor's address: the chip select is asserted when A15–A11 = 00100, since every address from 2000H to 27FFH shares these high-order bits.
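In C, the whole decoding condition reduces to one comparison (an illustrative sketch; the function name is ours):

#include <stdio.h>

static int chip_select(unsigned addr) {
    return (addr >> 11) == 0x04;   /* A15..A11 = 00100 selects the 2 KB RAM */
}

int main(void) {
    printf("%d %d %d\n", chip_select(0x2000),
           chip_select(0x27FF), chip_select(0x2800));   /* prints 1 1 0 */
    return 0;
}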
(c) (i) Page size: This is a very important issue in designing a virtual memory. For a given virtual-memory space, if the page size decreases, the number of pages increases and thus the size of the page table increases. For example, for a virtual memory of 2 MB, there will be 2048 pages of 1 KB, but only 1024 pages of 2 KB. Since each executing program must have its own copy of the page table, a large page size is advantageous.
On the contrary, memory is better utilized with smaller pages: little or no space is wasted inside a page if it is smaller. With a smaller page size, total I/O time should also be reduced, since locality of reference is improved.
(ii) Replacement policy: When a page fault occurs, page replacement is needed to select one of the existing pages in main memory to make room for the required page. Designing appropriate replacement algorithms is an important task; we want the one with the lowest page-fault rate. Several replacement algorithms are available, such as FIFO (First-In First-Out), LRU (Least Recently Used) and the optimal page replacement algorithm. Among these, the optimal algorithm generally has the lowest page-fault rate.
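To see how a policy is evaluated, here is a small C sketch that counts page faults under FIFO replacement (the frame count and reference string are assumed for illustration):

#include <stdio.h>

#define FRAMES 3

int main(void) {
    int refs[] = {7, 0, 1, 2, 0, 3, 0, 4};     /* assumed reference string */
    int n = sizeof refs / sizeof refs[0];
    int frames[FRAMES] = {-1, -1, -1};
    int next = 0, faults = 0;                   /* next: FIFO victim pointer */

    for (int i = 0; i < n; i++) {
        int hit = 0;
        for (int j = 0; j < FRAMES; j++)
            if (frames[j] == refs[i]) { hit = 1; break; }
        if (!hit) {                             /* page fault: evict oldest */
            frames[next] = refs[i];
            next = (next + 1) % FRAMES;
            faults++;
        }
    }
    printf("page faults = %d\n", faults);       /* 7 for this string */
    return 0;
}

Replacing the victim-selection rule with LRU or with the optimal policy would typically lower the fault count for the same reference string.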
2007
Computer Architecture
and Organization
Paper Code: CS-404 Semester: 4th
Time Alloted: 3 hours Full Marks: 70
Group-A
(Multiple-Choice Questions)
1. Choose the correct alternatives for the following: 10 × 1 = 10
(i) With 2’s complement representation, the range of values that can be represented on the data
bus of an 8 bit microprocessor is given by
(a) –128 to + 127 (b) –128 to + 128 (c) –127 to + 128 (d) –256 to + 128.
Answer
(a) –128 to + 127
(ii) Booth’s algorithm for computer arithmetic is used for
(a) multiplication of number in sign magnitude form
(b) multiplication of number in 2’s complement form
(c) division of number in sign magnitude form
(d) division of number in 2’s complement form.
Answer
(b) multiplication of number in 2’s complement form
(iii) Micro instructions are kept in
(a) Main memory (b) Control store (c) Cache (d) none of these.
Answer
(b) Control store
(iv) What is the 2’s complement representation of –24 in a 16-bit microcomputer?
(a) 0000 0000 0001 1000 (b) 1111 1111 1110 1000
(c) 1111 1111 1110 0111 (d) 0001 0001 1111 0011.
Answer
(b) 1111 1111 1110 1000
Virtual memory is not a physical memory; it is actually a technique, which is why it is called virtual memory.
The advantage of virtual memory is efficient utilization of main memory: a large program is divided into blocks, and each block is loaded into main memory only when it is required. Thus multiple programs can be executed simultaneously. The virtual memory technique has the further advantages of efficient CPU utilization and improved throughput.
3. What is meant by parallel processing? What is the basic objective of parallel processing?
Answer
Parallel processing is an efficient form of information processing which emphasizes the exploitation
of concurrent events in the computing process. Parallel processing is the basis of parallel computers.
The basic objective of the parallel processing is to improve the performance of a computer system by
carrying out several tasks simultaneously.
4. What do you mean by instruction cycle, machine cycle and T states?
Answer
Instruction cycle: The processing required for a single instruction is called instruction cycle. The
control unit’s task is to go through an instruction cycle that can be divided into five major phases:
1. Fetch the instruction from memory.
2. Decode the instruction.
3. Fetch the operand(s) from memory or register.
4. Execute the whole instruction.
5. Store the output result to the memory or register.
Machine cycle: A machine cycle consists of necessary steps carried out to perform the memory access
operation. Each of the basic operations such as fetch or read or write operation constitutes a machine
cycle. An instruction cycle consists of several machine cycles.
T-states: One clock cycle of the system clock is referred to as T-state.
5. Distinguish between vectored interrupt and non-vectored interrupt.
Answer:
see answer of question number 8(a) of 2006 (CS-303).
6. Compare RISC with CISC.
Answer
CISC:
1. A large number of instruction types used: typically from 100 to 250 instructions.
2. A large number of addressing modes used: typically from 5 to 15 different modes.
3. Variable-length instruction formats.
4. Small number of general-purpose registers (GPRs): typically 8-24 GPRs.
5. Clocks per instruction (CPI) lies between 2 and 15.
6. Mostly micro-programmed control units.
7. Most instructions manipulate operands in memory.
RISC:
1. Relatively few instruction types: typically fewer than 100 instructions.
2. Relatively few addressing modes: typically 5 or fewer.
3. Fixed-length, easily decoded instruction formats.
4. Large number of general-purpose registers (GPRs): typically 32-192 GPRs.
5. Clocks per instruction (CPI) lies between 1 and 2.
6. Mostly hardwired control units.
7. All operations are executed within the registers of the CPU.
Group-B
(Long-Answer Questions)
Answer any three questions. 3 × 15 = 45
7. (a) What is pipelining ?
(b) What are speedup, throughput and efficiency of a pipelined architecture?
(c) Describe pipeline hazards.
(d) Compare between centralized and distributed architecture. Which is the best architecture
among them and why? 2+3+5+3+2
Answer
(a) Pipelining is a technique of decomposing a sequential task into subtasks, with each subtask
being executed in a special dedicated stage (segment) that operates concurrently with all other
stages. Each stage performs partial processing dictated by the way the task is partitioned. Result
obtained from a stage is transferred to the next stage in the pipeline. The final result is obtained
after the instruction has passed through all the stages. All stages are synchronized by a common
clock. Stages are pure combinational circuits performing arithmetic or logic operations over the
data stream flowing through the pipe. The stages are separated by high-speed interface latches
(i.e. collection of registers). Figure below shows the pipeline concept with k stages.
(b) Speed-up: The speed-up of a k-stage pipelined processor over an equivalent non-pipelined processor executing n tasks is
S = n.k.t / (t.[k + (n – 1)]) = n.k / (k + n – 1)
where t = clock period of the pipeline processor.
Time to execute n tasks in a k-stage pipelined processor is t.[k + (n – 1)] units, where k clock periods (cycles) are needed to complete the execution of the first task and the remaining (n – 1) tasks require one cycle each. Time to execute n tasks in a k-stage non-pipelined processor is n.k.t, where each task requires k cycles, because no new task can enter the pipeline until the previous task finishes.
It can be noted that the maximum speed-up is k, for n >> k. But this maximum speed-up is
never fully achievable because of data dependencies between instructions, interrupts, program
branches, etc.
Efficiency: To define it, we need to define another term “time-space span”. It is the product
(area) of a time interval and a stage space in the space-time diagram. A given time-space span
can be in either a busy state or an idle state, but not both.
The efficiency of a linear pipeline is measured by the percentage of busy time-space spans over the total time-space spans, which equals the sum of all busy and idle time-space spans. Let
n, k, t be the number of tasks (instructions), the number of stages and the clock period of a
linear pipeline, respectively. Then the efficiency is defined by
η = n.k.t / (k.[k.t + (n – 1).t]) = n / (k + (n – 1))
Note that η → 1 (i.e., 100%) as n → ∞. This means that the larger the number of tasks flowing through the pipeline, the better is its efficiency. For the same reason as for speed-up, this ideal efficiency is not achievable.
Throughput: The number of tasks that can be completed by a pipeline per unit time is called its
throughput. Mathematically, it is defined as
w = n / (k.t + (n – 1).t) = η / t
Note that in the ideal case, w = 1/t = f, the clock frequency, when η → 1. This means that the maximum throughput of a linear pipeline is equal to its frequency, which corresponds to one output result per clock period.
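The three measures can be evaluated together in a few lines of C (the values of k, n and t below are assumed for illustration):

#include <stdio.h>

int main(void) {
    double k = 5.0, n = 100.0, t = 10e-9;  /* 5 stages, 100 tasks, 10 ns clock */
    double speedup    = (n * k * t) / (t * (k + (n - 1.0)));
    double efficiency = n / (k + (n - 1.0));              /* = speedup / k */
    double throughput = n / (k * t + (n - 1.0) * t);      /* tasks per second */
    printf("S = %.3f, efficiency = %.3f, w = %.3e tasks/s\n",
           speedup, efficiency, throughput);
    return 0;
}

With these numbers S is about 4.81 (close to the limit k = 5), efficiency about 96%, and w just under the clock frequency of 10^8 tasks per second.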
(c) Pipeline hazards are situations that prevent the next instruction in the instruction stream from
executing during its designated clock cycle. There are three types of pipeline hazards:
1. Control hazards
2. Structural hazards
3. Data hazards
Control hazards: They arise from the pipelining of branches and other instructions that change
the content of program counter (PC) register.
Structural Hazards: Structural hazards occur when a certain resource (memory, functional unit)
is requested by more than one instruction at the same time.
Data Hazards: Inter-instruction dependencies may prevent the sequential (in-order) data flow in the pipeline when successive instructions overlap their fetch, decode and execution in a pipelined processor. Such a situation due to inter-instruction dependencies is called a data hazard.
(d) In centralized architecture, all the processors access the physical main memory uniformly. All
processors have equal access time to all memory words. The architecture is shown in the
following figure.
In a distributed system, a local memory is attached to each processor. All local memories
distributed throughout the system form a global shared memory accessible by all processors. A
memory word access time varies with the location of the memory word in the shared memory.
The distributed system is depicted in figure.
It is faster for a processor to access its own local memory. Access to a remote memory attached to another processor takes longer due to the added delay through the interconnection network. Therefore, the distributed system is faster and, in this regard, it is better.
8. (a) What is meant by DMA? Why is it useful? Briefly explain with suitable diagram, the DMA
operation in association with CPU.
(b) Draw the schematic diagram for daisy chain polling arrangement in case of vectored inter-
rupt for three devices. 2+2+6+5
Answer
(a) A special controlling unit called the DMA controller is provided to allow transfer of a block of data directly between a high-speed external device, such as a magnetic disk, and the main memory, without continuous intervention by the CPU. This method is called direct memory access (DMA).
DMA is useful, because it has following advantages:
1. High speed data transfer is possible, since CPU is not involved during actual transfer, which
occurs between I/O device and the main memory.
2. Parallel processing can be achieved between CPU processing and DMA controller’s I/O
operation.
In DMA transfer, I/O devices can directly access the main memory without intervention by
the processor. The following figure shows a typical DMA system.
The sequences of events involved in a DMA transfer between an I/O device and the main
memory are as follows:
A DMA request signal from an I/O device starts the DMA sequence. DMA controller acti-
vates the HOLD line. It then waits for the HLDA signal from the CPU. On receipt of HLDA,
the controller sends a DMA ACK (acknowledgement) signal to the I/O device. Before releasing control of the buses to the DMA controller, the CPU initializes the address register with the starting memory address of the block of data, the word-count register with the number of words to be transferred, and the operation type (read or write); the controller then takes control of the memory buses. The I/O device can then communicate with memory through the data bus
for direct data transfer. For each word transferred, the DMA controller increments its address-
register and decrements its word count register. After each word transfer, the controller checks
the DMA request line. If this line is high, next word of the block transfer is initiated and the
process continues until word count register reaches zero (i.e., the entire block is transferred). If
the word count register reaches zero, the DMA controller stops any further transfer and removes
its HOLD signal. It also informs the CPU of the termination by means of an interrupt through
INT line. The CPU then gains the control of the memory buses and resumes the operations on
the program which initiated the I/O operations.
(b) To implement interrupts, the CPU uses a signal, known as an interrupt request (INTR) signal to
the interrupt controller hardware, which is connected to each I/O device that can issue an
interrupt to it. Here, the interrupt controller liaises with the CPU on behalf of the I/O devices.
Typically, interrupt controller is also assigned an interrupt acknowledge (INTA) line that the
CPU uses to signal the controller that it has received and begun to process the interrupt request
by employing an ISR (interrupt service routine). Devices are connected in daisy chain fashion,
as shown in figure below, to set up priority interrupt system.
The devices are placed in a chain-fashion with highest priority device in the first place (device
1), followed by lower priority devices. The priorities are assigned by the interrupt controller.
When one or more devices send an interrupt signal through the interrupt controller to the CPU, the CPU sends an interrupt acknowledge (INTA) to the controller, which in turn sends it to the highest-priority device. If this device has generated the interrupt INTR, it will accept the INTA;
otherwise it will pass the INTA signal to the next device, until the INTA is accepted by one requesting device. When the INTA is accepted by a device, that device puts its own interrupt vector address (VAD) on the data bus through the interrupt controller.
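The priority resolution itself reduces to "the first requester along the chain absorbs INTA", as the following C sketch shows (illustrative only; device 1 is the highest-priority device):

#include <stdio.h>

#define DEVICES 3

/* returns the 0-based index of the device that accepts INTA, or -1 */
static int daisy_chain(const int requesting[DEVICES]) {
    for (int d = 0; d < DEVICES; d++)   /* INTA travels down the chain */
        if (requesting[d])
            return d;                   /* first requester absorbs INTA */
    return -1;                          /* no pending request */
}

int main(void) {
    int req[DEVICES] = {0, 1, 1};       /* devices 2 and 3 both interrupt */
    printf("device %d puts its vector address on the bus\n",
           daisy_chain(req) + 1);       /* prints: device 2 */
    return 0;
}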
9. (a) Discuss the principle of carry look ahead adder and design a 4-bit CLA adder and estimate
the speed enhancement with respect to ripple carry adder.
(b) Briefly state the relative advantages and disadvantages of parallel adder over serial adder.
(c) X = (A + B) × C
Write down the zero address, one address and three address instructions for the expression.
(4 + 3) + 2 + 6
Answer
(a) Principle of CLA: A Carry Look-ahead Adder (CLA) is a high-speed adder, which adds two
numbers without waiting for the carries from the previous stages. In the CLA, carry-inputs of all
stages are generated simultaneously, without using carries from the previous stages.
Design: In the full adder, the carry output Ci+1 is related to its carry input Ci as follows:
Ci+1 = AiBi + (Ai + Bi)Ci
This result can be rewritten as:
Ci+1 = Gi + PiCi (1)
where Gi = AiBi and Pi = Ai + Bi.
The function Gi is called the carry-generate function, since a carry Ci+1 is generated when both Ai and Bi are 1. The function Pi is called the carry-propagate function, since if Ai or Bi is 1, then the input carry Ci is propagated to the next stage. The basic adder (BA) for generating the sum Si, carry-propagate Pi and carry-generate Gi bits is shown in the following figure. The sum
bit Si = Ai ⊕ Bi ⊕ Ci. For the implementation of one basic adder, two XOR gates, one AND gate and one OR gate are required.
Now, to design a 4-bit CLA, four carries C1, C2, C3 and C4 are to be generated. Using Eq. (1);
C1, C2, C3 and C4 can be expressed as follows:
C1 = G0 + P0C0
C2 = G1 + P1C1
C3 = G2 + P2C2
C4 = G3 + P3C3
These equations are recursive and the recursion can be removed as below.
C1 = G0 + P0C0 (2)
C2 = G1 + P1C1
= G1 + P1(G0 + P0C0)
= G1 + P1G0 + P1P0C0 (3)
C3 = G2 + P2C2
= G2 + P2(G1 + P1G0 + P1P0C0)
= G2 + P2G1 + P2P1G0 + P2P1P0C0 (4)
C4 = G3 + P3C3
= G3 + P3(G2 + P2G1 + P2P1G0 + P2P1P0C0)
= G3 + P3G2 + P3P2G1 + P3P2P1G0 + P3P2P1P0C0 (5)
The Eqs (2), (3), (4) and (5) suggest that C1, C2, C3 and C4 can be generated directly from C0. In
other words, these four carries depend only on the initial carry C0. For this reason, these
equations are called carry look-ahead equations. A 4-bit carry look-ahead adder (CLA) is
shown in the figure next.
The maximum delay of the CLA is 6 × D (delay D for Gi and Pi generation, 2D for Ci generation, and another 3D for the sum bit Si), where D is the average gate delay. The same holds good for any number of bits, because the adder delay does not depend on the size n of the numbers; it depends only on the number of levels of gates used to generate the sum and carry bits. In contrast, the maximum propagation delay of a carry-propagate adder (CPA) depends on the size of the inputs: for an n-bit CPA it is D × n, where D is the time delay of each full-adder stage and n is the number of bits in each operand.
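Equations (2) to (5) translate almost literally into C (a bit-level sketch; the helper name cla_add is our own):

#include <stdio.h>

/* 4-bit carry look-ahead addition: every carry is computed from C0 alone */
static unsigned cla_add(unsigned a, unsigned b, unsigned c0) {
    unsigned G[4], P[4], C[5];
    C[0] = c0;
    for (int i = 0; i < 4; i++) {
        unsigned ai = (a >> i) & 1, bi = (b >> i) & 1;
        G[i] = ai & bi;                 /* carry-generate Gi */
        P[i] = ai | bi;                 /* carry-propagate Pi */
    }
    C[1] = G[0] | (P[0] & C[0]);
    C[2] = G[1] | (P[1] & G[0]) | (P[1] & P[0] & C[0]);
    C[3] = G[2] | (P[2] & G[1]) | (P[2] & P[1] & G[0])
                | (P[2] & P[1] & P[0] & C[0]);
    C[4] = G[3] | (P[3] & G[2]) | (P[3] & P[2] & G[1])
                | (P[3] & P[2] & P[1] & G[0])
                | (P[3] & P[2] & P[1] & P[0] & C[0]);
    unsigned sum = 0;
    for (int i = 0; i < 4; i++)         /* Si = Ai xor Bi xor Ci */
        sum |= (((a >> i) ^ (b >> i) ^ C[i]) & 1u) << i;
    return sum | (C[4] << 4);           /* 5-bit result with carry-out */
}

int main(void) {
    printf("9 + 7 = %u\n", cla_add(9, 7, 0));   /* prints 16 */
    return 0;
}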
(b) Advantage of parallel adders over serial adders: The parallel adder, being a combinational circuit, is faster than the serial adder. All bits of the two numbers are added in one clock period, whereas a serial adder requires n clock periods to add two n-bit numbers.
Disadvantages of parallel adders over serial adders:
1. The addition delay grows as the size of the numbers to be added increases, whereas the per-cycle delay of a serial adder remains the same.
2. The hardware cost is higher than that of a serial adder, because the number of full adders needed equals the number of bits in the operands.
(c) To evaluate this arithmetic expression, we use the following symbolic op-codes: LOAD transfers data from memory to a register; STORE transfers data from a register to memory; ADD and MULT perform addition and multiplication, respectively. Assume that the operands are in memory addresses A, B and C and that the result must be stored in memory at address X.
Using three-address instructions, the program code in assembly language is:
ADD R1, A, B ; R1 ← M[A] + M[B]
MULT X, C, R1 ; X ← M[C] * R1
Using two-address instructions, the program code in assembly language is:
LOAD R1, A ; R1 ← M[A]
ADD R1, B ; R1 ← R1 + M[B]
LOAD R2, C ; R2 ← M[C]
MULT R1, R2 ; R1 ← R1 * R2
STORE X, R1 ; X ← R1
Using one-address instructions, the program code in assembly language is:
LOAD A ; AC ← M[A]
ADD B ; AC ← AC + M[B]
STORE T ; T ← AC
LOAD C ; AC ← M[C]
MULT T ; AC ← AC * M[T]
STORE X ; X ← AC
3. Due to the relatively small internal circuitry of the one-bit memory cell of a DRAM, a single DRAM chip provides a larger storage capacity than an SRAM chip of the same physical size. In other words, DRAM has a higher packing density than SRAM.
4. SRAM is costlier than DRAM.
(c) The given RAM memory size is 256 × 4. This memory chip requires 8 address lines (256 = 2^8) and 4 data lines.
The size of the memory to be constructed is 2048 bytes, which is equivalent to 2048 × 8. Thus, it requires 11 address lines (2048 = 2^11) and 8 data lines.
In the interconnection diagram:
The number of rows required = 2048/256 = 8.
The number of columns required = 8/4 = 2.
Thus, the total number of 256 × 4 RAM chips required = 8 * 2 = 16.
(d) Given
No. of sectors per track = 20
No. of bytes in each sector = 4000
No. of surfaces in disk pack = 8.
Outer diameter of disk = 12 cm.
Inner diameter = 4 cm.
Inter-track gap = 0.1 mm.
So, the total width of the recording band = (12 – 4)/2 = 4 cm.
No. of tracks per surface = (4 * 10)/0.1 = 400.
Thus, the total storage capacity of the disk drive = no. of surfaces * no. of tracks per surface * no. of sectors per track * capacity of each sector = 8 * 400 * 20 * 4000 = 256,000,000 bytes = 244.14 MB (approx.).
then subtracted from A. If the result of the subtraction is positive, a 1 is placed in the vacant position created at the lsb of Q by the left-shift operation; otherwise a 0 is put in this position and, before beginning the next iteration, the content of A is restored by adding the divisor M back to it. Because of this step, the algorithm is referred to as a restoring division algorithm. When the algorithm terminates, the A register contains the remainder and the Q register contains the quotient.
The restoring division algorithm for dividing two n-bit numbers is described by the flowchart shown in the figure below.
The algorithm discussed is used for division of two unsigned integers, but it can be extended to handle signed numbers as well. The sign of the result is treated separately, and the division itself is performed on the positive magnitudes of the dividend and divisor to obtain the quotient and remainder. The sign of the quotient is determined as Mn ⊕ Qn, where Mn and Qn are the signs of the divisor (M) and the dividend (Q), respectively.
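A compact C sketch of the restoring algorithm for unsigned 4-bit operands (our own rendering of the flowchart; it divides 9 by 2, giving quotient 4 and remainder 1):

#include <stdio.h>

int main(void) {
    int n = 4;
    int M = 2;                   /* divisor */
    unsigned Q = 9;              /* dividend (also collects the quotient) */
    int A = 0;                   /* accumulator: holds the partial remainder */

    for (int i = 0; i < n; i++) {
        /* left shift the pair (A,Q): msb of Q moves into lsb of A */
        A = (A << 1) | ((Q >> (n - 1)) & 1);
        Q = (Q << 1) & 0xF;
        A -= M;                  /* trial subtraction of the divisor */
        if (A < 0)
            A += M;              /* negative: restore A, quotient bit is 0 */
        else
            Q |= 1;              /* positive: quotient bit is 1 */
    }
    printf("quotient = %u, remainder = %d\n", Q, A);   /* 4 and 1 */
    return 0;
}

Dropping the restore step, and instead compensating on the next iteration when A is negative, turns this into the non-restoring algorithm.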
2008
Computer Organization
and Architecture
This question paper is for EIE, 4th Semester, and for the new syllabus
Paper Code: CS-404(EI) Semester: 4th
Time Alloted: 3 hours Full Marks: 70
Group-A
(Multiple-Choice Questions)
1. Choose the correct answer from the given alternatives in each of the following:
(i) Maximum number of directly addressable locations in the memory of a processor having
10-bits wide control bus, 20-bits address bus, and 8-bit data bus is
(a) 1K (b) 2K (c) 1M (d) none of these
Answer
(c) 1M
(ii) Booth’s algorithm for computer arithmetic is used for
(a) multiplication of numbers in sign magnitude form
(b) multiplication of numbers in two’s complement form
(c) division of numbers in sign magnitude form
(d) division of numbers in two’s complement form
Answer
(b) multiplication of numbers in two’s complement form
(iii) The last statement of any Symbolic Microprogram must contain
(a) NEXT (b) OVER (c) FETCH (d) INDRCT
Answer
(c) FETCH
(iv) Virtual memory system allows the employment of
(a) more than address space (b) the full address space
(c) more than hard disk capacity (d) none of these
Answer
(a) more than address space
(v) In fourth generation computers, the main technology used is
(a) Transistor (b) SSI (c) MSI (d) LSI & VLSI
Answer
(d) LSI & VLSI
(vi) The numbers in the range –23 to +31 are represented by the minimum number of bits
(a) 6 (b) 8 (c) 7 (d) 5
Answer
(a) 6
(vii) Bidirectional buses use
(a) tri-state buffers
(b) two tri-state buffers in cascade
(c) two back to back connected tri-state buffers in parallel
(d) two back to back connected buffers
Answer
(c) two back to back connected tri-state buffers in parallel
(viii) CPU gets the address of next instruction to be processed from
(a) Instruction register (b) Memory address register
(c) Index register (d) Program counter
Answer
(d) Program counter
(ix) The first computer used to store a program is
(a) EDSAC (b) ENIAC (c) EDVAC (d) ACE
Answer
(a) EDSAC
(x) A machine using the base register addressing method has n base registers and the displacement contains k bits; the programmer can access any
(a) n regions of k addresses each (b) 2^n regions of 2^k addresses each
(c) n regions of 2^k addresses each (d) none of these
Answer
(c) n regions of 2^k addresses each
Group-B
(Short-Answer Questions)
Answer any three of the following. 3 × 5 = 15
2. (a) Write the key features of von Neumann architecture of a computer and mention the bottle-
necks.
(b) How does Harvard architecture differ from von Neumann architecture? 2+2+1
Answer
(a) See answer of question number 8 (C) of 2007 (CS-303).
(b) A single memory is used for both program and data storage in Von Neumann computers.
But Harvard computers have separate program and data memories. The data memory and program memory can be of different widths, types, etc. Program and data can be fetched in one cycle by using separate control signals: 'program memory read' and 'data memory read'.
3. (a) Write + 710 in IEEE 64-bit format.
(b) Convert IEEE 32-bit format 4040000016 in decimal value.
(c) Convert IEEE 64-bit format ABCD00000000000016 in decimal value. 2 + 1.5 + 1.5
Answer
(a) The decimal number +7 = +111 in binary = +1.11 × 2^2
The 52-bit mantissa M = 0.11000 00000 00000 00000 00000 00000 00000 00000 00000 0000000
The biased exponent E′ = E + 1023 = 2 + 1023 = 1025 = 10000000001, using 11 bits.
Since the number is positive, the sign bit S = 0.
Therefore, the IEEE double-precision (64-bit) representation is:
0 10000000001 11000 00000 00000 00000 00000 00000 00000 00000 00000 0000000
PUSH D ; TOS ← D
ADD ; TOS ← (C + D)
MULT ; TOS ← (A + B) * (C + D)
POP X ; X ← TOS
For a one-address machine:
The assembly language program using one-address instructions is written below.
LOAD A ; AC ← M[A]
ADD B ; AC ← AC + M[B]
STORE T ; T ← AC
LOAD C ; AC ← M[C]
ADD D ; AC ← AC + M[D]
MULT T ; AC ← AC * M[T]
STORE X ; X ← AC
5. What are vectored interrupts? How are they used in implementing hardware interrupts? 5
Answer
In the vectored interrupt I/O method, the interrupting source device supplies the branch information (i.e. the starting address of the interrupt service routine (ISR)) to the CPU. This information is called the interrupt vector; it is not tied to any fixed memory location.
To implement interrupts, the CPU uses a signal, known as an interrupt request (INTR) signal to the
interrupt handler or controller hardware, which is connected to each I/O device that can issue an
interrupt to it. Here, interrupt controller makes liaison with the CPU on behalf of I/O devices.
Typically, the interrupt controller is also assigned an interrupt acknowledge (INTA) line that the CPU uses to signal the controller that it has received and begun to process the interrupt request by employing an ISR (interrupt service routine). The figure below shows the hardware lines for implementing interrupts.
Memory hierarchy
Main Memory: This is the memory that communicates directly with CPU. Only programs and
data currently needed by the CPU for execution reside in the main memory. Main memory
occupies central position in hierarchy by being able to communicate directly with CPU and with
secondary memory devices through an I/O processor.
Cache Memory: This is a special high-speed main memory, sometimes used to increase the
speed of processing by making the current programs and data available to the CPU at a rapid
rate. Generally, the CPU is faster than a main memory, thus the processing speed is limited
mainly by the speed of main memory. So, a technique used to compensate the speed mismatch
between CPU and main memory is to use an extremely fast, small cache between CPU and main
memory, whose access time is close to CPU cycle time. The cache is used for storing portions of
programs currently being executed in the CPU and temporary data frequently needed in the
present computations. Thus, the cache memory acts as a buffer between the CPU and the main memory. By making programs and data available at a rapid rate, it is possible to increase the performance of the computer.
Internal memory: This memory refers to the high-speed registers used inside the CPU. These
registers hold temporary results when a computation is in progress. There is no speed disparity
between these registers and the CPU because they are fabricated with the same technology.
However, since registers are very expensive, only a few registers are used as internal memory.
(b) Given,
The capacity of main memory = 256 MB
The capacity of cache memory = 1MB
Block size = 128 bytes.
A set contains 8 blocks.
Since the address space of the processor is 256 MB, the processor generates a 28-bit address to access a byte (word), because 256 MB = 2^28.
The number of blocks the main memory contains = 256 MB / 128 bytes = 2^21.
Therefore, the no. of bits required to specify one block in main memory = 21.
Since the block size is 128 bytes, the no. of bits required to access each word (byte) = 7.
For the associative-mapped cache, the address format is:
Tag-address (21 bits) | Word (7 bits)
For the direct-mapped cache, the address format is:
Tag (8 bits) | Block (13 bits) | Word (7 bits)
where the block and word fields together (20 bits) form the index.
In the case of the set-associative cache:
A set contains 8 blocks.
Therefore, the number of sets in the cache = 2^13/8 = 2^10.
Thus, the number of bits required to specify each set = 10.
The tag field of the address = 28 – (10 + 7) = 11 bits.
For the set-associative cache, the address format is:
Tag (11 bits) | Set (10 bits) | Word (7 bits)
Answer
(a) When a program needs to be executed, the CPU would generate addresses, called logical or
virtual addresses. The corresponding addresses in the physical memory, as occupied by the
executing program, are called physical addresses. The set of all logical addresses generated by
the CPU or program is called logical-address space and the set of all physical addresses
corresponding to these logical addresses is called physical-address space. The memory-manage-
ment unit (MMU) maps each logical address to a physical address during program execution.
The figure below illustrates this mapping method, which uses a special register called base
register or relocation register.
The content of the relocation register is added to every logical address generated by the user program at the beginning of execution. For example, if the relocation register holds the address value 2000, then a reference to location 0 by the user is dynamically relocated to address 2000, and a reference to address 150 is mapped to address 2150.
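In C the relocation is a single addition (the base value follows the example above):

#include <stdio.h>

int main(void) {
    unsigned relocation_register = 2000;   /* base address from the example */
    unsigned logical[] = {0, 150};
    for (int i = 0; i < 2; i++)
        printf("logical %u -> physical %u\n",
               logical[i], logical[i] + relocation_register);
    return 0;
}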
(b) Static memories (SRAMs) consist of circuits capable of retaining their state as long as power is applied; since the contents are lost when power is removed, this type of memory is volatile. The figure below shows the cell diagram of an SRAM memory. A latch is formed by two inverters
connected as shown in the figure. Two transistors T1 and T2 are used for connecting the latch
with two bit lines. The purpose of these transistors is to act as switches that can be opened or
closed under the control of the word line, which is controlled by the address decoder. When the
word line is at 0-level, the transistors are turned off and the latch retains its information. For
example, the cell is at state 1 if the logic value at point A is 1 and at point B is 0. This state is
retained as long as the word line is not activated.
Read Operation: For the read operation, the word line is activated by the address input to the
address decoder. The activated word line closes both the transistors (switches) T1 and T2. Then
the bit values at points A and B can transmit to their respective bit lines. The sense/write circuit
at the end of the bit lines sends the output to the processor.
Write Operation: Similarly, for the write operation, the address provided to the decoder acti-
vates the word line to close both the switches. Then, the bit value to be written into the cell is
provided through the sense/write circuit and the signals in bit lines are then stored into the cell.
(c) See answer of question no. 8 (b) (i) 2006 (CS-303).
10. (a) What are the various modes of data transfer between computer and peripherals? Explain.
(b) Differentiate between isolated I/O and memory mapped I/O.
(c) Show how computer bus is organized using tri-state buffer. 5+5+5
Answer
(a) See answer of question number 9 (a) of 2004 (CS-303).
(b) See answer of question number 9 (b) of 2004 (CS-303).
(c) See answer of question number 10 (e) of 2007 (CS-303).
11. (a) What is meant by DMA? Why is it useful? Briefly explain, with suitable diagram, the DMA
operation in association with CPU.
(b) Draw the schematic diagram for daisy chain polling arrangement in case of vectored inter-
rupt for three devices. 2+2+4+7
Answer
(a) See answer of question number 8 (a) of 2007 (CS-404).
(b) See answer of question number 8 (b) of 2007 (CS-404).
2008
Computer Organization
and Architecture
This Paper is for IT/ECE (Old)/AEIE (Old)
Paper Code: CS-404 Semester: 4th
Time Allotted: 3 hours Full Marks: 70
Group-A
(Multiple-Choice Questions)
1. Choose the correct alternatives for the following: 10 × 1 = 10
(i) Thrashing
(a) reduces page I/O
(b) decreases the degree of multiprogramming
(c) implies excessive page I/O
(d) none of these.
Answer
(c) implies excessive page I/O
(ii) When signed numbers are used in binary arithmetic, then which one of the following
notations would have unique representation for zero?
(a) Sign Magnitude (b) Sign 1’s complement
(c) Sign 2’s complement (d) None of these.
Answer
(c) Sign 2’s complement
(iii) If the memory chip size is 256 × 1 bits, then the number of chips required to make up
1 kbytes of memory is
(a) 32 (b) 24 (c) 12 (d) 8
Answer
(a) 32
(iv) How many address bits are required for a 512 × 4 memory?
(a) 512 (b) 4 (c) 9 (d) A0 – A6
Answer
(c) 9
(v) What is the 2’s complement representation of –20 in a 16-bit micro-computer?
(a) 0000 0000 0001 1000 (b) 1111 1111 1110 0111
(c) 1111 1111 1110 1000. (d) none of these.
Answer
(d) none of these
(vi) The technique of placing software in a ROM semiconductor chip is called
(a) PROM (b) EPROM (c) FIRMWARE (d) Micro-processor.
Answer
(c) FIRMWARE
(vii) Which one is the advantage of Virtual memory?
(a) Faster access to memory on an average
(b) Process can be given protected address spaces
(c) Program larger than the physical memory size can be run
(d) None of these.
Answer
(c) Program larger than the physical memory size can be run
(viii) A page fault
(a) occurs when a program accesses a page of memory
(b) is an error in a specific page
(c) is an access to a page not currently in memory
(d) none of these.
Answer
(c) is an access to a page not currently in memory
(ix) Convert( FAFAFA )16 into Octal form
(a) 76767676 (b) 76575372 (c) 76737672 (d) 76727672.
Answer
(b) 76575372
(x) The logic circuitry in ALU is
(a) entirely combinational (b) entirely sequential
(c) combinational cum sequential (d) none of these.
Answer
(a) entirely combinational
Group-B
(Short-Answer questions)
Answer any three of the following. 3 × 5 = 15
2. Explain the memory hierarchy pyramid, showing both primary and secondary memory in the
diagram and also explain the relationship of cost, speed and capacity.
Answer
See answer of question number 8(a) of 2008(CS-404(EI)).
3. Discuss about the different hazards in pipelining.
Answer
See answer of question number 9(b) of 2006(CS-404).
4. Explain how a RAM of capacity 2 kbytes can be mapped into the address space (1000)H to
(17FF)H of a CPU having a 16-bit address lines. Show how the address lines are decoded to
generate the chip select condition for the RAM.
Answer
See answer of question number 1 of 2005(CS-404).
5. Evaluate the following arithmetic statement using three addresses, two addresses and one ad-
dress instructions:
X = (A+B) * (C+D)
Answer
See answer of question number 4 of 2008(CS-404(EI)).
6. Using 8-bit two's complement integers, perform the following computations: 2+2+1
(i) –34 + (–12) (ii) 17 – 35 (iii) 18 – ( –5).
Answer
See answer of question number 8(a) of 2004(CS-404).
Group–C
(Long-Answer Questions)
Answer any three questions. 3 × 15 = 45
7. (a) Multiply +14 and –13 using Booth’s sequential method of multiplication. Draw the corre-
sponding circuit block diagram.
(b) Multiply +12 and –11 using modified Booth’s sequential method of multiplication. Draw
the corresponding circuit block diagram. (4 + 4 ) + (4 + 3)
Answer
(a) Multiplication of numbers + 14 and –13:
Multiplicand, M = + 14 = 01110 and multiplier, Q = –13 = 10011.
Step                            M       A       Q       Q[-1]   Size
Initial configuration           01110   00000   10011   0       5
Step 1: Q[0] = 1, Q[-1] = 0
   A = A - M                    01110   10010   10011   0       --
   ARS(AQ)                      01110   11001   01001   1       4
Step 2: Q[0] = 1, Q[-1] = 1
   ARS(AQ)                      01110   11100   10100   1       3
Step 3: Q[0] = 0, Q[-1] = 1
   A = A + M                    01110   01010   10100   1       --
   ARS(AQ)                      01110   00101   01010   0       2
Step 4: Q[0] = 0, Q[-1] = 0
   ARS(AQ)                      01110   00010   10101   0       1
Step 5: Q[0] = 1, Q[-1] = 0
   A = A - M                    01110   10100   10101   0       --
   ARS(AQ)                      01110   11010   01010   1       0
Since the size register has become 0, the algorithm terminates and the product is AQ =
11010 01010. The sign bit shows that the product is negative; taking the 2's complement gives the
magnitude, so the result is -182 in decimal.
The circuit block diagram of the Booth’s sequential multiplication algorithm is shown below.
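The register-level behaviour can also be checked in software. The following Python sketch is illustrative only (the function name and word-size parameter are choices of this sketch, not the book's notation); it reproduces the trace above:

def booth_multiply(m, q, n=5):
    """Multiply two n-bit two's complement integers by Booth's algorithm."""
    mask = (1 << n) - 1
    M = m & mask                   # multiplicand register
    A = 0                          # accumulator register
    Q = q & mask                   # multiplier register
    q_1 = 0                        # the extra Q[-1] bit
    for _ in range(n):             # the Size register counts n steps
        pair = ((Q & 1) << 1) | q_1
        if pair == 0b10:           # Q[0] = 1, Q[-1] = 0: A = A - M
            A = (A - M) & mask
        elif pair == 0b01:         # Q[0] = 0, Q[-1] = 1: A = A + M
            A = (A + M) & mask
        # ARS(AQ): arithmetic right shift of the combined A,Q register pair
        q_1 = Q & 1
        Q = ((Q >> 1) | ((A & 1) << (n - 1))) & mask
        A = (A >> 1) | (A & (1 << (n - 1)))      # replicate the sign bit
    product = (A << n) | Q                        # 2n-bit product in AQ
    if product & (1 << (2 * n - 1)):              # interpret as signed
        product -= 1 << (2 * n)
    return product

print(booth_multiply(14, -13))                    # -182, matching the trace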
(b) A faster version of Booth's multiplication algorithm for signed numbers, known as the modified
Booth's algorithm, examines three adjacent bits Q[i + 1] Q[i] Q[i - 1] of the multiplier Q at a
time, instead of two. Apart from the three basic actions performed by the original Booth's algorithm,
which can be expressed as: add 0, +1 × M (the multiplicand) or -1 × M to A (the accumulated
partial products), this modified algorithm performs two more actions: add +2 × M or -2 × M to
A. These have the effect of increasing the radix from 2 to 4 and allow an N × N multiplication
to require only N/2 partial products.
The following table lists the multiplicand selection decisions for all possibilities (the standard radix-4 recoding):

Q[i+1]  Q[i]  Q[i-1]   Action
0       0     0        add 0
0       0     1        add +1 × M
0       1     0        add +1 × M
0       1     1        add +2 × M
1       0     0        add -2 × M
1       0     1        add -1 × M
1       1     0        add -1 × M
1       1     1        add 0
   As sign of A = -ve
   Set Q[0] = 0, restore A      00010   00000   1010    1
Step 4
   LS(AQ)                       00010   00001   010-    --
   A = A - M                    00010   11111   010-    --
   As sign of A = -ve
   Set Q[0] = 0, restore A      00010   00001   0100    0
From the above result, we see that the quotient = Q = 0100 = 4 and remainder = A = 00001 = 1.
Division of 9 by 2 using non-restoring division method:
Dividend Q = 9 = 1001 and divisor M = 2 = 0010.
Step                            M       A       Q       Size
Initial configuration           00010   00000   1001    4
Step 1: sign of A = +ve
   LS(AQ)                       00010   00001   001-    --
   A = A - M                    00010   11111   001-    --
   Sign of A = -ve: set Q[0]=0  00010   11111   0010    3
Step 2: sign of A = -ve
   LS(AQ)                       00010   11110   010-    --
   A = A + M                    00010   00000   010-    --
   Sign of A = +ve: set Q[0]=1  00010   00000   0101    2
Step 3: sign of A = +ve
   LS(AQ)                       00010   00000   101-    --
   A = A - M                    00010   11110   101-    --
   Sign of A = -ve: set Q[0]=0  00010   11110   1010    1
Step 4: sign of A = -ve
   LS(AQ)                       00010   11101   010-    --
   A = A + M                    00010   11111   010-    --
   Sign of A = -ve: set Q[0]=0
   Restore A (A = A + M)        00010   00001   0100    0
From the last step above, we conclude that the quotient = 0100 = 4 and the remainder = 00001 = 1.
The block diagram of the restoring division method is shown below.
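The division can likewise be checked in software. The Python sketch below is illustrative (register names follow the trace; the (n+1)-bit width of A is carried over from the answer) and reproduces the non-restoring division of 9 by 2:

def nonrestoring_divide(dividend, divisor, n=4):
    """Divide two n-bit unsigned integers by the non-restoring method."""
    mask_a = (1 << (n + 1)) - 1            # A is n+1 bits wide
    A, Q, M = 0, dividend, divisor
    for _ in range(n):
        negative = bool(A & (1 << n))      # test the sign of A
        # LS(AQ): shift A and Q left as one register pair
        A = ((A << 1) | ((Q >> (n - 1)) & 1)) & mask_a
        Q = (Q << 1) & ((1 << n) - 1)
        A = (A + M) & mask_a if negative else (A - M) & mask_a
        Q |= 0 if A & (1 << n) else 1      # set Q[0] from the new sign of A
    if A & (1 << n):                       # final restore if A is negative
        A = (A + M) & mask_a
    return Q, A                            # quotient, remainder

print(nonrestoring_divide(9, 2))           # (4, 1), as in the trace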
With these details, the size of the control word is calculated as follows:
Size of a control word = size of the condition select field + size of the branch address field
                         + number of control functions
                       = 2 + 4 + 11
                       = 17 bits.
Hence, the sizes of the CMDB and CM are 17 bits and 12 × 17 bits, respectively. The complete
hardware organization of the control unit and control signals is shown in the next figure.
C0 : A ← 0
C1 : M ← Inbus
C2 : L ← 4
C3 : Q ← Inbus
C4 : LS(AQ)
C5 : F = l - r
C5' : F = l + r
C6 : A ← F
C7 : Q[0] ← 1
C7' : Q[0] ← 0
C8 : L ← L - 1
C9 : Outbus ← A
C10 : Outbus ← Q
the contents of the branch address field are irrelevant. However, without any loss of generality,
the contents of this field can be reset to 0000. For this instruction, three micro-operations C0, C1
and C2 are activated. Therefore, only the corresponding bit positions in the control function
fields are set to 1. This results in the following binary microinstruction:
Condition select   Branch address   Control function
00                 0000             11100000000
Continuing in this way, the complete binary microprogram for the n × n restoring divider can be
produced, as in the following table.
programs currently being executed in the CPU and temporary data frequently needed in the
present computations. Thus, the cache memory acts as a buffer between the CPU and main memory.
By making programs and data available at a rapid rate, it is possible to increase the performance
of the computer.
There are two methods of writing into cache memory:
Write-Through Policy This is the simplest and most commonly used procedure to update the
cache. In this technique, when the cache memory is updated, the main memory is updated at the
same time. Thus, the main memory always contains the same data as the cache. But it is a slow
process, since main memory must be accessed on every write.
Write-Back Policy In this method, during a write operation only the cache location is updated.
When the update occurs, the location is marked by a flag called the modified or dirty bit. When the
word is replaced from the cache, it is written into main memory if its flag bit is set. The philosophy
of this method is based on the fact that during a write operation, the word residing in cache may
be accessed several times (temporal locality of reference). This method reduces the number of
references to main memory. However, it may encounter the problem of inconsistency, since there
are two different copies of the same data, one in the cache and the other in main memory.
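As an illustration only (a toy model, not the book's hardware), the following Python sketch shows the observable difference between the two policies: under write-back, main memory stays stale until the dirty line is replaced.

class Cache:
    def __init__(self, memory, write_back=True):
        self.memory = memory               # dict standing in for main memory
        self.write_back = write_back
        self.lines = {}                    # address -> [data, dirty bit]

    def write(self, addr, data):
        if self.write_back:
            self.lines[addr] = [data, True]    # mark dirty, defer memory
        else:                                  # write-through: update both
            self.lines[addr] = [data, False]
            self.memory[addr] = data

    def replace(self, addr):
        data, dirty = self.lines.pop(addr)
        if dirty:                              # write back only if modified
            self.memory[addr] = data

mem = {0x10: 0}
c = Cache(mem)                                 # write-back policy
c.write(0x10, 42)
print(mem[0x10])                               # 0: main memory is stale
c.replace(0x10)
print(mem[0x10])                               # 42 after the write-back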
The set-associative mapping is a combination of the direct- and associative-mapping techniques.
Blocks of the cache are grouped into sets and the mapping allows a block of main memory to
reside in any block of a specific set. Hence, the conflict problem of the direct method is eased by
having a few choices for block placement. At the same time, the hardware cost is reduced by
decreasing the size of the associative search. That is why the set-associative mapping technique is
more advantageous than the direct- or associative-mapping techniques.
Given,
The size of cache = 512 KB
The size of main memory = 2 MB
The block size = 64 bytes.
The main memory capacity = 2 MB = 2^21 bytes.
So, the processor generates a 21-bit address to access each word in main memory.
The cache memory size = 512 KB = 2^19 bytes and each block size = 64 bytes = 2^6 bytes.
Therefore, the number of blocks in cache memory = 512 KB/64 bytes = 2^13.
So, the block field size = 13 bits.
Since each block size = 64 bytes, the sector (or word) field size = 6 bits.
Thus, the tag field size = 21 - (13 + 6) = 2 bits.
Therefore, the address format in the direct-mapped cache is as follows:

Tag   Block   Sector (or Word)
2     13      6
The number of blocks main memory contains = 2 MB/64 bytes = 2^15.
Therefore, the no. of bits required to specify one block in main memory = 15.
Since the block size is 64 bytes,
the no. of bits required to access each word (byte) = 6.
For associative cache, the address format is:
Tag-address Word
15 6
In the case of the 8-way set-associative cache:
Each set contains 8 blocks.
Therefore, the number of sets in cache = 2^13/8 = 2^10.
Thus, the number of bits required to specify each set = 10.
The tag field of the address = 21 - (10 + 6) = 5 bits.
For 8-way set-associative cache, the address format is:
Tag Set Word
5 10 6
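The field computations can be captured in a small helper. The Python function below is illustrative (its name and parameters are assumptions of this sketch) and reproduces the three address formats derived above:

import math

def cache_fields(main_bytes, cache_bytes, block_bytes, ways=None):
    """Return (tag, index, word) field widths in bits."""
    addr = int(math.log2(main_bytes))          # processor address width
    word = int(math.log2(block_bytes))         # word (sector) field
    blocks_in_cache = cache_bytes // block_bytes
    if ways is None:                           # direct mapping
        index = int(math.log2(blocks_in_cache))
    elif ways == 'assoc':                      # fully associative
        index = 0
    else:                                      # set-associative, 'ways' blocks/set
        index = int(math.log2(blocks_in_cache // ways))
    return addr - index - word, index, word

MB = 2 ** 20
print(cache_fields(2 * MB, 512 * 1024, 64))           # (2, 13, 6)
print(cache_fields(2 * MB, 512 * 1024, 64, 'assoc'))  # (15, 0, 6)
print(cache_fields(2 * MB, 512 * 1024, 64, 8))        # (5, 10, 6)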
(b) Ideally, we would like to have a memory that is fast, large and inexpensive. Unfortunately, it is
impossible to meet all three requirements simultaneously: if we increase the speed and capacity,
the cost increases. We can achieve these goals at an optimum level by using several types of
memories, which collectively form a memory hierarchy.
A memory hierarchy system is broadly divided into the following four groups, shown in the figure
below.
• Secondary (auxiliary) memory
• Main (primary) memory
• Cache memory
• Internal memory
10. (a) If two n-bit numbers are added, how many bits long can the result be, at most?
(b) If two n-bit numbers are multiplied, how many bits long can the result be, at most?
(c) Design a 4-bit Arithmetic unit using multiplexers and full adders.
(d) Two 4-bit unsigned numbers are to be multiplied using the principle of carry save adders.
Assume the numbers to be A3 A2 A1 A0 and B3 B2 B1 B0. Show the arrangement and
interconnection of the adders and the input signals so as to generate an eight bit product as
P7 P6 P5 P4 P3 P2 P1 P0. 1+1+6+7
Answer
(a) If two n-bit numbers are added, the result will be at most (n+1) bits long.
(b) If two n-bit numbers are multiplied, the result will be at most (n + n)-bit, i.e. 2n bits, long.
(c) See answer of question number 8(a) of 2005(CS-404).
(d) See answer of question number 5 of 2004(CS-404).
11. (a) Explain the basic Direct Memory Access (DMA) operation for transfer of data bytes be-
tween memory and peripheral.
(b) What is programmed I/O technique? Why is it not very useful?
(c) What are the different types of interrupt? Give example.
(d) Give the main reason why DMA based I/O is better in some circumstances than interrupt
driven I/O. 6+3+3+3
Answer
(a) See answer of question no. 9 (a) of 2007 (CS-303).
(b) See answer of question no. 9 (c) of 2007 (CS-303).
(c) See answer of question no. 3 (d) of 2004 (CS-303).
(d) See answer of question no. 9 (b) of 2007 (CS-303).
Solved Problems
CHAPTER 1
1. What are the differences between low-level languages and high-level languages?
Answer
(a) Low-level languages are closer to the machine: they are generally written using binary codes.
High-level languages are closer to humans: they are written using English-like instructions.
(b) Low-level language programs are machine dependent; a program written for a particular
machine in a low-level language cannot run on another machine. High-level language programs
are machine independent.
(c) As far as debugging is concerned, high-level programs can be debugged more easily than
low-level programs.
(d) It is more convenient to develop applications in high-level languages than in low-level
languages.
2. What are the differences between machine language and assembly language?
Answer
(a) Machine language instructions are composed of bits (0s and 1s). This is the only language the
computer understands: a program may be written in any language, but it is ultimately converted
into machine language. Assembly language instructions are composed of text-type mnemonic
codes.
(b) Machine language instructions are difficult to understand and debug, since each instruction is
only a combination of 0s and 1s. Assembly language instructions are closer to human language
(i.e. English), so they are easier to debug.
(c) In terms of execution, machine language is faster than assembly language, because an assembly
language program needs a converter, called an assembler, to translate it into an equivalent
machine language program, whereas a machine language program needs no converter.
Answer
The figure below shows the addressing of program and data.
10. Assume that one LSI semiconductor memory IC stores 1024 bits and that a main memory unit
is built from 1024 such ICs. How many bytes do these ICs store?
Answer
Since one LSI semiconductor memory IC stores 1024 bits, the main memory unit of 1024 ICs stores
1024 × 1024 bits = (1024 × 1024)/8 bytes = 128 KB.
11. How does a multiprogramming system give the illusion that multiple programs are running on
the machine simultaneously? What factor can cause this illusion to void?
Answer
In a multiprogramming system, the processor switches among the programs executing on it
very frequently—50 or more times per second. If the number of programs executing in the system is
relatively small, each program gets a chance to execute often enough that it looks as if the
processor is executing all of the programs at the same time. However, if the number of programs
executing on the system gets too large, the processor spends most of its time context switching,
the execution of the programs slows visibly, and the illusion is broken.
12. Suppose a 600 MHz machine does the number of context switching 60 times per second. How
many cycles are there in each time-slice?
Answer
600 MHz = 600 × 10^6 cycles per second.
Therefore, the number of cycles per time-slice = (600 × 10^6)/60 = 10^7.
13. To achieve a speed-up of 4 on a program that originally took 80 ns to execute, what must the
execution time of the program be reduced to?
Answer
Speed-up = (execution time before)/(execution time after)
Now, we have speed-up = 4 and execution time before = 80 ns.
Therefore, from the speed-up formula, we get execution time after = 80/4 = 20 ns.
So, to achieve the speed-up of 4, the execution time must be reduced to 20 ns.
CHAPTER 2
(c) -22 - 7 = (-22) - 7. This is a subtraction, so the 2's complement of 7 is added to (-22).
The 2's complement of 7 = 1111 1001 and the representation of (-22) = 1110 1010. Adding
these two gives 1 1110 0011, a 9-bit result. By rule, discard the 9th bit to get 1110 0011,
which is a negative number. The result in equivalent decimal is -29.
(d) 18 - (-5) = 18 + 5. The representation of 18 is 0001 0010 and the representation of 5 is
0000 0101. After addition we get 0001 0111, which is equivalent to decimal 23.
6. Can you add 8-bit signed numbers 0110 0010 and 0100 0101? If not, why? Suggest a solution.
Answer
Carries: 01
0110 0010
0100 0101
(add) 1010 0111
This result suggests that the sum is negative, which is wrong; an overflow has occurred. However,
if the carry out of the sign bit position is treated as the sign of the result, the 9-bit answer thus
obtained is correct.
To detect an overflow condition, the carry into the sign bit position and the carry out of the sign
bit position are examined. If these two carries are both 0 or both 1, there is no overflow. If these
two carries are not equal (i.e. one is 0 and the other is 1), an overflow condition exists. Treating the
carry out of the sign bit position (0 here) as the sign, together with the 8-bit result, gives the correct answer.
7. Write down the Boolean expression for overflow condition when adding or subtracting two
binary numbers expressed in two’s complement.
Answer
If one number is positive and the other is negative, after an addition overflow cannot occur, since
addition of a positive number to a negative number produces a number that is always smaller than the
larger of the two original numbers. However, an overflow may occur if the two numbers added are of
same sign i.e., both are positive or both are negative. Let’s consider following examples.
Carries:  0 1                     Carries:  1 0
  +69     0 1000101                 -69     1 0111011
  +78     0 1001110                 -78     1 0110010
 +147     1 0010011                -147     0 1101101
Observe that the 8-bit result that should have been positive (first example) has a negative sign bit
and the 8-bit result that should have been negative (second example) has a positive sign bit. However,
if the carry out from the sign bit position is treated as the sign of the result, the 9-bit answer thus
obtained will be correct answer. Since the 9-bit answer cannot be accommodated with 8-bit register,
we say that an overflow results.
To detect an overflow condition, the carry into the sign bit position (i.e. Cn-1) and the carry out of
the sign bit position (i.e. Cn) are examined. If these two carries are both 0 or both 1, there is no
overflow. If these two carries are different, an overflow condition exists. That is, an overflow occurs
if the Boolean expression Cn ⊕ Cn-1 is true.
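The rule can be checked with a short Python sketch (the function name is illustrative):

def add_overflows(a, b, n=8):
    """Return (n-bit sum, overflow flag) for two n-bit signed operands."""
    mask = (1 << n) - 1
    a &= mask
    b &= mask
    low = (1 << (n - 1)) - 1                   # bits below the sign position
    c_in = ((a & low) + (b & low)) >> (n - 1)  # carry into the sign bit
    c_out = ((a + b) >> n) & 1                 # carry out of the sign bit
    return (a + b) & mask, bool(c_in ^ c_out)  # overflow = Cn XOR Cn-1

print(add_overflows(0b01100010, 0b01000101))   # (167, True): 167 = 1010 0111
print(add_overflows(69, 78))                   # (147, True): +147 overflows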
8. Give the merits and demerits of the floating-point and fixed-point representations for storing
real numbers.
Answer
Merits of fixed-point representation:
(i) This method of representation is suitable for representing integers in registers.
(ii) Very easy to represent, because it uses only one field: magnitude field.
Demerits of fixed-point representation:
(i) Range of representable numbers is restricted.
(ii) It is very difficult to represent complex fractional numbers.
(iii) Since there is no standard representation method for it, it is sometimes confusing to represent a
number in this method.
Merits of floating-point representation:
(i) By this method, any type and any size of numbers can be represented easily.
(ii) There are several standardized representation methods for this.
Demerits of floating-point representation:
(i) Relatively complex representation, because it uses basically two fields: mantissa and exponent
fields.
(ii) Length of register for storing floating-point numbers is large.
9. Represent following decimal numbers in IEEE 754 floating point format:
(a) –1.75
(b) 21
Answer
(a) The decimal number -1.75 = -1.11 in binary = -1.11 × 2^0
The 23-bit mantissa M = 1100 0000 0000 0000 0000 000
The biased exponent E' = E + 127 = 0 + 127 = 127 = 0111 1111
Since the number is negative, the sign bit S = 1
Therefore, the IEEE single-precision (32-bit) representation is:
1 01111111 11000000000000000000000
(b) The decimal number 21 = 10101 in binary = 1.0101 × 2^4
The 23-bit mantissa M = 0101 0000 0000 0000 0000 000
The biased exponent E' = 4 + 127 = 131 = 1000 0011
Since the number is positive, the sign bit S = 0
Therefore, the representation is:
0 10000011 01010000000000000000000
10. What value is represented by the IEEE single precision floating point number:
0101 0101 0110 0000 0000 0000 0000 0000?
Answer
The sign of the number = 0, biased exponent value = 10101010 = 170. So the exponent value = 170 –
127 = 43. The mantissa field = 110 0000 0000 0000 0000 0000.
Therefore, the value of the number = +(1.11)2 × 2^43 = 1.75 × 2^43 = 1.539 × 10^13 (approx.).
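The same decoding can be verified in Python. The snippet below is illustrative and assumes the standard IEEE 754 single-precision field boundaries:

import struct

bits = 0b01010101011000000000000000000000
sign = bits >> 31
exponent = (bits >> 23) & 0xFF                 # 1010 1010 = 170
mantissa = bits & ((1 << 23) - 1)
value = (-1) ** sign * (1 + mantissa / 2 ** 23) * 2 ** (exponent - 127)
print(exponent, value)                         # 170, 1.75 * 2**43 ~ 1.539e13

# struct interprets the very same bit pattern directly:
print(struct.unpack('>f', bits.to_bytes(4, 'big'))[0])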
11. How are NaN (Not a Number) and Infinity represented in the IEEE 754 standard?
Answer
Not a Number (NaN) is represented when the biased exponent E' = 255 and the mantissa M ≠ 0.
NaN is the result of performing an invalid operation such as 0/0 or √-1.
Infinity is represented when E' = 255 and M = 0. Infinity is the result of dividing a normal
number by 0.
12. A floating point number system uses 16 bits for representing a number. The most significant bit
is the sign bit. The least significant nine bits represent the mantissa and remaining 6 bits
represent the exponent. Assume that the numbers are stored in the normalized format with one
hidden bit.
(i) Give the representation of -1.6 × 10^3 in this number system.
(ii) What is the value represented by 0 000100 110000000?
Answer
The format of the 16-bit floating-point representation is: sign (1 bit), exponent (6 bits), mantissa (9 bits).
(i) -1.6 × 10^3 = -1600 = -11001000000 in binary = -1.1001 × 2^10. With one hidden bit, the
stored mantissa is 100100000 and the exponent field holds 10 = 001010. Since the number is
negative, the sign bit is 1:
1 001010 100100000
(ii) Here the sign bit is 0, the exponent field is 000100 = 4 and the mantissa field is 110000000,
so the value is +1.11 × 2^4 in binary = 11100 = 28.
Answer
First we must shift the smaller number to the right to align the exponents, so 2.56 becomes
0.0256 × 10^2. The sum is
  2.3400
+ 0.0256
--------
  2.3656
Thus, the sum is 2.3656 × 10^2. Since we have two digits to round, values 00 to 49 round
down and 51 to 99 round up, with 50 as the tiebreaker. Rounding the sum to three significant
digits gives 2.37 × 10^2.
14. Compute the product of the following pair of unsigned integers. Generate the full 8-bit result.
(i) 1001 × 0110
(ii) 1111 × 1111
Answer
(i) 1001 × 0110.
This can be written as
(1001 × 100) + (1001 × 10) = 100100 + 10010 = 0011 0110.
(ii) 1111 × 1111.
This can be written as
(1111 × 1000) + (1111 × 100) + (1111 × 10) + (1111 × 1) = 1111000 + 111100 + 11110 + 1111
= 1110 0001.
15. Multiply -13 and +14 by the method of partial products.
Answer
-13 = 10011 (multiplicand)
+14 = 01110 (multiplier)

      00000
     10011
    10011
   10011
  ---------
  100001010

When the multiplicand is negative, a correction term must be added, because the sign of the
multiplicand is not extended in the partial products: the 2's complement of the multiplier, shifted
left by n bits, is added to the sum of the partial products.
The 2's complement of the multiplier (10010), shifted left by n, i.e. 5, bits is 1001000000.
Adding this shifted number to the sum of the partial products gives the correct result of the
signed multiplication:
100001010 + 1001000000 = 1101001010 = -182.
16. Give the recoded Booth’s multiplier representations for the following:
(a) 1100 1010
(b) 1110 1101
Answer
(a) Original multiplier: 11001010 = 011001010 (a 0 is prefixed as the sign bit)
Recoded pattern: +1 0 -1 0 +1 -1 +1 -1 0
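The recoding rule (each recoded digit is the difference of adjacent multiplier bits, with an implicit 0 appended on the right) can be expressed compactly. The Python sketch below is illustrative; -1 stands for the digit printed with an overbar:

def booth_recode(multiplier):
    """multiplier: bit string, MSB first. Returns recoded digits, MSB first."""
    bits = '0' + multiplier + '0'      # prefix the sign 0, append Q[-1] = 0
    return [int(bits[i + 1]) - int(bits[i]) for i in range(len(bits) - 1)]

print(booth_recode('11001010'))
# [1, 0, -1, 0, 1, -1, 1, -1, 0]  i.e.  +1 0 -1 0 +1 -1 +1 -1 0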
CHAPTER 3
1. Why is every computer system associated with a set of general purpose registers?
Answer
General purpose registers are used inside processors to enhance the effective execution speed.
Registers are the fastest storage devices in a computer system—their speed is almost the same as
the processor's—and they hold instructions and data temporarily. The processor does not have to
wait for required instructions and data if they are available in registers.
2. A digital computer has a common bus system for k-registers of n bits each. The bus is con-
structed with multiplexers.
(i) What size of multiplexers is needed?
(ii) How many multiplexers are there in the bus?
Answer
For an n-line bus system for k registers of n bits each:
(i) The size of each multiplexer must be k × 1, since it multiplexes k data lines, one from each register.
(ii) Each multiplexer transfers one bit of the selected register. The number of multiplexers needed to
construct the bus is equal to n, the number of bits in each register.
3. A digital computer has a common bus system for k-registers of n bits each. The bus is con-
structed with tri-state buffers.
(i) How many decoders are needed and what size of decoder is needed?
(ii) How many tri-state buffers are there in the bus?
Answer
For an n-line bus system for k registers of n bits each:
(i) Only one decoder is required to select among the k registers. The size of the decoder should be
log2(k)-to-k.
(ii) The total number of buffers needed is k × n.
4. Show the circuit diagram for implementing the following register transfer operation:
if (ab = 1) then R1 ← R2 else R1 ← R3; where a and b are control variables.
Answer
The control function C is ab. The register transfer operation can be written as:
C: R1 ← R2
C': R1 ← R3
The circuit diagram for the register transfer operations is as shown below.
The R2 register is selected by the MUX if control condition C = 1; otherwise register R3 is
selected as source register.
9. If the average gate delay is 4 ns, what is the delay for an 8-bit carry look-ahead adder?
Answer
For a carry look-ahead adder, the addition delay is 4 × gate delay = 4 × 4 ns = 16 ns.
10. Two 4 bit unsigned numbers are to be multiplied using the principle of carry save adders.
Assume the numbers to be A3 A2 A1 A0 and B3 B2 B1 B0. Show the arrangement and interconnec-
tion of the adders and the input signals so as to generate an eight-bit product as P7 P6P5 P4 P3
P2 P1 P0.
Answer
The multiplication of two unsigned numbers is done by repeated add-shift operations. Add-shift
multiplication of two 4-bit numbers is illustrated in the figure below.
                        A3    A2    A1    A0     = A
                        B3    B2    B1    B0     = B
                      A3B0  A2B0  A1B0  A0B0     = W1
                A3B1  A2B1  A1B1  A0B1           = W2
          A3B2  A2B2  A1B2  A0B2                 = W3
    A3B3  A2B3  A1B3  A0B3                       = W4
P7    P6    P5    P4    P3    P2    P1    P0     = A × B = Product

Add-shift multiplication of two 4-bit numbers (A × B = Product)
The additions of the partial products W1, W2, W3 and W4, which are generated using bit-wise AND
operations, can be done using a CSA tree, as shown in the figure below, to realize the multiplier
for 4-bit numbers.
The first carry-save adder (CSA-1) adds W1, W2 and W3 and produces a sum vector (S1) and a
carry vector (C1). The sum vector, the shifted carry vector and the fourth partial product W4 are
applied as the inputs to the second CSA. The results produced by the second CSA are then added by
a CPA to generate the final summation Sum.
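The same CSA-tree idea can be sketched at the bit level in Python (illustrative, not the book's circuit): a carry-save adder reduces three operands to a sum vector and a carry vector with no carry propagation, and a final conventional add plays the role of the CPA.

def carry_save_add(x, y, z):
    s = x ^ y ^ z                               # bitwise sum vector
    c = ((x & y) | (y & z) | (x & z)) << 1      # carry vector, shifted left
    return s, c

def csa_multiply(a, b, n=4):
    # partial products W1..W4 from bitwise ANDs, each shifted into place
    w = [(a if (b >> i) & 1 else 0) << i for i in range(n)]
    s1, c1 = carry_save_add(w[0], w[1], w[2])   # CSA-1 adds W1, W2, W3
    s2, c2 = carry_save_add(s1, c1, w[3])       # CSA-2 adds S1, C1, W4
    return s2 + c2                              # the CPA forms the product

print(csa_multiply(0b1011, 0b1101))             # 11 * 13 = 143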
12. Suppose register A holds the 8-bit number 11011001. Determine the B operand and the logic
micro-operation to be performed in order to change the value in A to:
(i) 01101101
(ii) 11111101
Answer
(i) If B = 1011 0100 and the XOR logic micro-operation is performed with A, then the content of
A will be 0110 1101. That is, A ← A ⊕ B.
(ii) If the OR logic micro-operation is performed between A and the same B = 1011 0100, then the
content of A will be 1111 1101. That is, A ← A ∨ B.
13. Suppose register A holds the 8-bit number 11011001. Determine the sequence of binary values
in A after an arithmetic shift-right, followed by a circular shift-right and followed by a logical
shift-left.
Answer
The value in A = 1101 1001
After arithmetic shift right, the value in A = 1110 1100
After circular shift right, the value in A = 0111 0110
After logical shift left, the value in A = 1110 1100
14. What is a bit-slice processor? Give some examples.
Answer
A bit-slice processor is constructed from processor modules of smaller bit width. Each of these
processes one bit field of an operand. Each processor module chip typically includes an ALU and a
few registers. Using smaller size processors, one can design a processor of any word length. For
example, a 4-bit bit slice processor includes a register file and an ALU to perform operations on 4-bit
data, so that four such chips can be combined to build a 16-bit processor unit.
Examples of bit-slice processors are Intel's 3000 family and AMD's Am2900 family.
CHAPTER 4
5. Why do DRAMs generally have larger capacities than SRAMs constructed with the same
fabrication technology?
Answer
Each DRAM cell contains two devices—one capacitor and one transistor—while each SRAM cell
consists of six transistors. This means a DRAM cell is much smaller than an SRAM cell, allowing a
DRAM to store more data in the same chip space.
6. Why is refreshing required in Dynamic RAM?
Answer
Information is stored in a dynamic RAM memory cell in the form of a charge on a capacitor. Due to
the property of the capacitor, it starts to discharge. Hence, the information stored in the cell can be
read correctly only if it is read before the charge on the capacitor drops below some threshold value.
Thus, this charge in capacitor needs to be periodically recharged or refreshed.
7. Suppose a DRAM memory has 4K rows in its array of bit cells, its refreshing period is 64 ms
and 4 clock cycles are needed to access each row. What is the time needed to refresh the
memory if clock rate is 133 MHz? What fraction of the memory’s time is spent performing
refreshes?
Answer
In DRAM memory, no. of rows of cells in memory is 4K = 4096 and 4 clock cycles are needed to
access each row.
Therefore, the no. of cycles needed to refresh all rows = 4096 × 4 = 16384 cycles.
Since the clock rate is 133 MHz,
the time needed to refresh all rows = 16384/(133 × 10^6) seconds
= 123 × 10^-6 seconds
= 0.123 ms. [1 ms = 10^-3 sec]
Thus, the refreshing process occupies 0.123 ms in each 64 ms time interval.
Therefore, the refresh overhead is 0.123/64 = 0.002.
Hence, only 0.2 % of the memory’s time is spent performing refreshes.
8. How many 256 × 4 RAM chips are needed to provide a memory capacity of 2048 bytes? Show
also the corresponding interconnection diagram.
Answer
The given RAM memory size is 256 × 4. This memory chip requires 8 (because 256 = 2^8) address
lines and 4 data lines.
The size of the memory to be constructed is 2048 bytes, which is equivalent to 2048 × 8. Thus, it
requires 11 (because 2048 = 2^11) address lines and 8 data lines.
In the interconnection diagram:
The number of rows required = 2048/256 = 8.
The number of columns required = 8/4 = 2.
Thus, the total number of RAM chips of size 256 × 4 required = 8 × 2 = 16.
The interconnection diagram is shown in the following figure.
9. Explain how a RAM of capacity 2 kbytes can be mapped into the address space (1000)H to
(17FF)H of a CPU having 16 address lines. Show how the address lines are decoded to
generate the chip select condition for the RAM.
Answer
Since the capacity of the RAM is 2048 bytes = 2 KB, the memory uses 11 (2 KB = 2^11) address
lines, namely A10-A0, to select one word. Thus, the memory's internal address decoder uses the
11 lines A10-A0 to select one word.
To select this memory module, the remaining 5 (i.e. 16 - 11) address lines A15-A11 are used. Thus,
an external decoding scheme is employed on these higher-order five address bits of the processor's
address.
The address space of the memory is (1000)H to (17FF)H.
Therefore, the starting address (1000)H is:

A15 A14 A13 A12 A11 A10 A9 A8 A7 A6 A5 A4 A3 A2 A1 A0
 0   0   0   1   0   0   0  0  0  0  0  0  0  0  0  0

Based on the higher-order five bits (00010), the external decoding scheme performs a logical AND
on the address lines A15, A14, A13, A12 and A11, with A15, A14, A13 and A11 complemented
(these lines must be 0). The output of the AND gate acts as the chip select (CS) line. The address
decoding scheme is shown in the following figure.
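The decoding condition can be checked with a few lines of Python (illustrative):

def chip_select(addr):
    """CS = 1 when the high-order bits A15..A11 equal 00010."""
    return ((addr >> 11) & 0b11111) == 0b00010

for a in (0x0FFF, 0x1000, 0x17FF, 0x1800):
    print(hex(a), chip_select(a))
# 0xfff False / 0x1000 True / 0x17ff True / 0x1800 False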
10. A high-speed tape system accommodates a 1200 ft reel of standard 9-track tape. The tape is
moved past the recording head at a rate of 140 inches per second. What must be the linear tape
recording density in order to achieve a data transfer rate of 10^5 bits per second?
Answer
Given,
Tape length = 1200 ft.
Tape speed = 140 inches/sec.
Data transfer rate = 10^5 bits/sec.
We have, data transfer rate = tape density × tape speed.
Therefore, the tape density = 10^5/140 bits/inch
= 714 bits/inch (approx.)
11. Suppose a 30 GB hard-disk is to be manufactured. If the technology used to manufacture the
disks allows 1024-byte sectors, 2048-sector tracks and 4096-track platters. How many platters
are required?
Answer
The total capacity of each platter = size of each sector × no. of sectors per track × no. of tracks per
platter
= 1024 × 2048 × 4096 bytes
= 8 × 2^30 bytes
= 8 GB
Therefore, the no. of platters required = ⌈capacity of disk / capacity of each platter⌉
= ⌈30/8⌉
= 4.
12. A hierarchical cache-main memory subsystem has the following specifications: (i) cache access
time of 160 ns (ii) main memory access time of 960 ns (iii) hit ratio of cache memory is 0.9.
Calculate the following:
(a) Average access time of the memory system.
(b) Efficiency of the memory system.
Answer
Given,
Cache access time, tc = 160 ns
Main memory access time, tm = 960 ns
Hit ratio, h = 0.9
(a) The average access time of the system, tav = h × tc + (1 - h) × (tc + tm)
= 0.9 × 160 + 0.1 × (160 + 960)
= 256 ns
(b) The efficiency of the memory system = tc / tav
= 160/256
= 0.625
13. A three level memory system having cache access time of 15 ns and disk access time of 80 ns
has a cache hit ratio of 0.96 and main memory hit ratio of 0.9. What should be the main
memory access time to achieve effective access time of 25 ns?
Answer
Given,
Cache access time, tc = 15 ns
Disk (secondary) memory access time, ts = 80 ns
Hit ratio for cache, hc = 0.96
Hit ratio for main memory, hm = 0.9
The average access time, tav = 25 ns
Let the main memory access time be tm ns.
Now, we know the average access time of the memory system,
tav = hc × tc + hm × (1 - hc) × (tc + tm) + (1 - hc) × (1 - hm) × (tc + tm + ts)
That is, 25 = 0.96 × 15 + 0.9 × 0.04 × (15 + tm) + 0.04 × 0.1 × (15 + tm + 80) = 15.32 + 0.04 tm
Simplifying, tm = 9.68/0.04 = 242.
Hence, the main memory access time must be 242 ns to achieve the effective access time of 25 ns.
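The algebra can be double-checked numerically. The short Python sketch below (variable names are illustrative) solves the same equation:

# tav = hc*tc + hm*(1-hc)*(tc+tm) + (1-hc)*(1-hm)*(tc+tm+ts)
hc, hm = 0.96, 0.9
tc, ts, tav = 15, 80, 25

coeff = hm * (1 - hc) + (1 - hc) * (1 - hm)    # multiplies tm: 0.04
const = hc * tc + hm * (1 - hc) * tc + (1 - hc) * (1 - hm) * (tc + ts)
tm = (tav - const) / coeff
print(round(tm, 1))                             # 242.0 ns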
14. Explain how cache memory increases the performance of a computer system.
Answer
Due to the locality of reference property of programs, some blocks, such as program loops,
subroutines and data arrays, are referenced frequently. The cache memory's speed is almost the same
as that of the CPU, so when these program blocks are placed in the fast cache memory, the average
memory access time is reduced, thus reducing the total execution time of the program.
15. How does the size of cache block (i.e. line) affect the hit ratio?
Answer
Generally, increasing the block size of a cache increases the hit ratio because of the property of
locality of reference: the addresses close to an address that has just been referenced are likely
to be referenced soon. Increasing the block size increases the amount of data near the missed
address that is brought into the cache on a cache miss. Since most of this data is likely to be
referenced soon, bringing it into the cache eliminates the misses that would otherwise have occurred
when that data was referenced.
However, sometimes large block can reduce the performance of the memory system. Since the
larger blocks reduce the number of blocks in the cache, conflict misses can arise giving the reduced
performance of the system.
16. Given the following, determine size of the sub-fields (in bits) in the address for direct mapping,
associative and set associative mapping cache schemes: We have 256 MB main memory and
1MB cache memory. The address space of this processor is 256 MB. The block size is 128 bytes.
There are 8 blocks in a cache set.
Answer
Given,
The capacity of main memory = 256 MB
The capacity of cache memory = 1MB
Block size = 128 bytes.
A set contains 8 blocks.
Since the address space of the processor is 256 MB,
the processor generates a 28-bit address to access a byte (word).
The number of blocks main memory contains = 256 MB/128 bytes = 2^21.
Therefore, the no. of bits required to specify one block in main memory = 21.
Since the block size is 128 bytes,
the no. of bits required to access each word (byte) = 7.
For the associative cache, the address format is:

Tag-address   Word
21            7

The number of blocks in the cache = 1 MB/128 bytes = 2^13, so for direct mapping the block
field is 13 bits and the tag field is 28 - (13 + 7) = 8 bits:

Tag   Block   Word
8     13      7

Since each set contains 8 blocks, the number of sets = 2^13/8 = 2^10, so for the set-associative
cache the set field is 10 bits and the tag field is 28 - (10 + 7) = 11 bits:

Tag   Set   Word
11    10    7
17. Why do the virtual page and the physical frame have the same size in paging?
Answer
The virtual address generated by the processor is divided into two fields: a page number, which
identifies the page containing the address, and an offset, which identifies the location of the address
within the page. Similarly, a physical address is divided into two fields: a frame number and an
offset. If the page size and frame size are the same, address translation can be done easily: the
offset from the virtual address is concatenated with the frame number corresponding to the virtual
page containing the address to produce the physical address. If the two sizes were different, a more
complicated means of address translation would be required.
18. In a system with 64-bit virtual addresses and 43-bit physical addresses, how many bits are
required for the virtual page number and physical frame number if the pages are 8 KB in size?
How big is each page table entry? How many page table entries are required for this system?
Answer
Since the page size is 8 KB, 13 bits are required for the offset field of both the virtual and physical
address.
Therefore, bits required for the virtual page number = 64 – 13 = 51 and
Bits required for the physical frame number = 43 – 13 = 30.
Each page table entry contains frame number and a valid/invalid bit.
So, a total of (30 + 1) i.e. 31 bits is used to store each page table entry.
Since each virtual page number contains 51 bits,
The virtual address space can hold 2^51 pages, which requires 2^51 page table entries.
19. A virtual memory system has the following specifications:
• Size of the virtual address space is 64 KB
• Size of the physical address space is 4 KB
• Page size is 512 bytes
From the following page table, what are the physical addresses corresponding to the virtual
addresses:
(a) 3494 (b) 12350 (c) 30123
Answer
Since the page size is 512 bytes,
the lower-order 9 bits are used as the offset within a page.
The size of the virtual address space is 64 KB.
21. Suppose a processor’s TLB has hit ratio 80% and it takes 20 ns to search the TLB and 100 ns to
access main memory. What will be the effective access time?
Answer
When the referred page number is found in the TLB, a mapped memory access takes (20 + 100)
i.e. 120 ns. If it is not found in the TLB, we must first access memory for the page table
and frame number, and then access the desired word in memory, for a total of (20 + 100 + 100) i.e.
220 ns. Therefore, the effective access time = 0.8 × 120 + 0.2 × 220 = 96 + 44 = 140 ns.
CHAPTER 5
1. If each register is specified by 3 bits and the instruction ADD R1, R2, R3 is two bytes long,
then what is the length of the op-code field?
Answer
The op-code field can have 16 – 3 – 3 – 3 = 7 bits.
Op-code (7-bit)   R1 (3-bit)   R2 (3-bit)   R3 (3-bit)
2. What is the maximum number of 0-address, 1-address and 2-address instructions if the
instruction size is 32 bits and each address field is 10 bits?
Answer
In 0-address instructions, no address field is used, so all 32 bits of the instruction can be used as
the size of the op-code field. Therefore, the maximum number of 0-address instructions is 2^32.
In 1-address instructions, one address field is used, whose size is given as 10 bits. So the remaining
32 - 10 i.e. 22 bits can be used as the op-code field. Therefore, the maximum number of 1-address
instructions is 2^22.
In 2-address instructions, two address fields are used, which collectively occupy 20 bits. So the
remaining 32 - 20 i.e. 12 bits can be used as the op-code field. Therefore, the maximum number of
2-address instructions is 2^12.
3. There are 58 processor registers, 7 addressing modes and a 16K × 32 main memory. State the
instruction format and the size of each field if each instruction supports one register operand and
one address operand.
Answer
The processor has 58 registers, so 6 bits are used to specify each register uniquely (because 32 < 58
< 64).
The no. of addressing modes is 7, so 3 bits are used in the mode field of the instruction format
(because 7 < 8).
The main memory size is 16K × 32, so a 14-bit address is generated to access each 32-bit word in
memory.
Therefore, the op-code field requires 32 - 3 - 6 - 14 = 9 bits.
The instruction format will be:
The instruction format will be as:
Op-code (9-bit) Mode (3-bit) Register addr. (6-bit) Memory addr. (14-bit)
4. The 32-bit value 40A75429 is stored at location 1000. What is the value of the byte at
address 1002 if the system is big-endian? If little-endian?
Answer
In big-endian assignment, the most significant byte is stored in lower address and least significant
byte is stored in higher address. Therefore, the byte 54 will be stored in location 1002.
In little-endian assignment, the least significant byte is stored in lower address and the most signifi-
cant byte is stored in higher address. Therefore, the byte A7 will be stored in location 1002.
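Python's struct module can demonstrate the byte placement (an illustration, not part of the original answer):

import struct

value = 0x40A75429
for order, label in (('>', 'big-endian'), ('<', 'little-endian')):
    layout = struct.pack(order + 'I', value)    # byte at 1000+i is layout[i]
    print(label, ['%02X' % b for b in layout],
          '-> address 1002 holds', '%02X' % layout[2])
# big-endian ['40', 'A7', '54', '29'] -> address 1002 holds 54
# little-endian ['29', '54', 'A7', '40'] -> address 1002 holds A7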
5. What are advantages of general-register based CPU organizations over stack based CPU
organizations?
Answer
There are three main advantages of general-register based CPU organizations over stack based CPU
organizations.
(a) In general-register based CPU organizations, reading a register does not affect its content,
whereas, in stack based CPU organizations, reading value from the top of the stack removes the
value from the stack.
(b) In general register-based CPU organizations, any register from register file can be chosen to
keep values while writing a program; whereas, in stack based CPU organizations, accessing
values is limited by the LIFO (last-in first-out) nature of the stack.
(c) Since programs written for general register-based CPU organizations make fewer memory
references, execution is effectively faster than in stack based CPU organizations, where the
stack is generally implemented with memory locations accessed in LIFO (last-in first-out)
order.
6. Assuming that all registers initially contain 0, what is the value of R1 after the following
instruction sequence is executed?
MOV R1, #6
MOV R2, #5
ADD R3, R1, R1
SUB R1, R3, R2
MULT R3, R1, R1
Answer
The instruction MOV R1, #6 places value 6 into register R1 and instruction MOV R2, #5 places value
5 into register R2. The instruction ADD R3, R1, R1 performs addition of values 6 in R1 and 6 in R1
and result 12 is stored in register R3. The instruction SUB R1, R3, R2 subtracts value 5 in R2 from 12
in R3 and result 7 is stored in register R1. Last instruction MULT R3, R1, R1 multiplies values 7 with
7 (content of R1) and result 49 is put into register R3. Therefore, the value of R1, after the execution
of all instructions in the sequence, is 7.
7. What value remains on the stack after the following sequence of instructions?
PUSH #3 (symbol # indicates direct value of the number)
PUSH #5
PUSH # 4
ADD
PUSH # 7
SUB
MULT
Answer
After the first three push operations, the stack contains 4, 5, 3, starting from the top of the stack.
The ADD operation automatically pops the two topmost values from the stack, here 4 and 5, adds
them, and pushes the sum 9 onto the top of the stack. Thus, the stack now has 9 and 3. After PUSH
#7, the stack contains 7, 9, 3, starting from the top. The next instruction, SUB, subtracts the top
value of the stack from the next value down in the stack; after SUB executes, the value 9 - 7 i.e. 2
is pushed onto the top of the stack. Finally, the MULT instruction pops 2 and 3 from the stack and
pushes 2 × 3 i.e. 6 onto the stack.
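A toy stack-machine interpreter (illustrative; the instruction encoding is an assumption of this sketch) reproduces the trace:

def run(program):
    stack = []
    for op, *arg in program:
        if op == 'PUSH':
            stack.append(arg[0])
        else:                        # a binary operation pops two operands
            top, nxt = stack.pop(), stack.pop()
            if op == 'ADD':
                stack.append(nxt + top)
            elif op == 'SUB':        # next-down minus top, as described
                stack.append(nxt - top)
            elif op == 'MULT':
                stack.append(nxt * top)
    return stack

prog = [('PUSH', 3), ('PUSH', 5), ('PUSH', 4), ('ADD',),
        ('PUSH', 7), ('SUB',), ('MULT',)]
print(run(prog))                     # [6]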
8. Why is the stack generally not implemented using a processor's register file?
Answer
The stack gives the programmer the illusion of a large storage space. Owing to the limited
number of registers in a CPU, a system that used only the register file to implement its stack would
be able to push only a small amount of data onto the stack. The programmers would then have to
take care of the finite size of the register-implemented stack while writing programs, which would
make programming harder.
9. Explain why a given (Infix) arithmetic expression needs to be converted to Reverse Polish
Notation (RPN) for effective use of a stack organization.
Answer
Stack-organized computers are better suited to postfix (RPN) notation than traditional infix notation.
In infix notation, the operator is placed between the operands; in postfix notation, the operator is
placed after the operands. For example, the infix expression A * B becomes AB * in postfix notation.
Once an expression is recoded in postfix notation, converting it into a stack-based program is very
easy. Starting from the left, each operand is replaced with a PUSH operation to place the operand on
the stack and operator is replaced with the appropriate instruction to perform the operation.
10. Convert the given arithmetic expression to RPN Notation:
(A + B ^ D) / (E – F) + G
Answer
Conventionally, there are three levels of precedence for the usual five binary operators:
Highest: exponentiation (^)
Next highest: multiplication (*) and division (/)
Lowest: addition (+) and subtraction (-)
Based on the priorities of the operators, the given expression can be written as
((A + (B ^ D))/(E - F)) + G
Now, the equivalent postfix (RPN) expression can be derived as follows:
((A + (BD ^))/(EF -)) + G
⇒ ((ABD ^ +)/(EF -)) + G
⇒ (ABD ^ + EF - /) + G
⇒ ABD ^ + EF - / G +
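The conversion can be automated with the classical shunting-yard method. The Python sketch below is illustrative and handles only the five operators and parentheses (with ^ treated as right-associative):

PREC = {'^': 3, '*': 2, '/': 2, '+': 1, '-': 1}

def to_rpn(tokens):
    out, ops = [], []
    for t in tokens:
        if t == '(':
            ops.append(t)
        elif t == ')':
            while ops[-1] != '(':    # pop operators back to the '('
                out.append(ops.pop())
            ops.pop()                # discard the '('
        elif t in PREC:
            while (ops and ops[-1] != '(' and
                   (PREC[ops[-1]] > PREC[t] or
                    (PREC[ops[-1]] == PREC[t] and t != '^'))):
                out.append(ops.pop())
            ops.append(t)
        else:                        # an operand goes straight to output
            out.append(t)
    return out + ops[::-1]

expr = '( A + B ^ D ) / ( E - F ) + G'.split()
print(' '.join(to_rpn(expr)))        # A B D ^ + E F - / G +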
11. How is the stack useful in a subroutine call?
Answer
A subroutine is a self-contained sequence of instructions that can be called or invoked from any point in
a program. When a subroutine is called, a branch is made to the first executable instruction of the
subroutine. After the subroutine has been executed, a return is made to the instruction following the
point at which it was called. Consider the following code segment:
MAIN ()
{
------------
------------
CALL SUB1 ()
Next instruction
------------
}
SUB1 ()
{
------------
------------
RETURN
}
After CALL SUB1 () has been fetched, the program counter (PC) contains the address of the "Next
instruction" immediately following CALL. This PC value is saved on the stack; it is the return
address to the main program. The PC is then loaded with the address of the first executable
instruction of subroutine SUB1 (), and the processor continues by executing its code. Control is
returned to the main program from the subroutine by executing RETURN, which pulls the return
address (i.e. the address of Next instruction) off the stack and puts it in the PC.
Since the last item stored on the stack is the first item to be removed from it, the stack is well
suited to nested subroutines. That is, a subroutine is able to call another subroutine and this process
can be repeated many times. Eventually, the last subroutine called completes its computations and
returns to the subroutine that called it. The return address needed for this first return is the last one
generated in the nested call sequence. That is, return addresses are generated and used in last-in-first-
out order. This suggests that the return addresses associated with subroutine calls should be pushed
onto a stack.
12. Suppose it takes 7 ns to read an instruction from memory, 3 ns to decode the instruction, 5 ns to
read the operands from register file, 2 ns to perform the computation of the instruction and 4 ns
to write the result into the register. What is the maximum clock rate of the processor?
Answer
The total time to execute an instruction = (7 + 3 + 5 + 2 + 4) ns = 21 ns. That is, an instruction cycle
takes 21 ns.
The processor's clock period must be at least as long as the instruction cycle time. Therefore, the
maximum clock rate = 1/cycle time
= 1/(21 × 10^-9) Hz
= 1000/21 MHz (1 MHz = 10^6 Hz)
= 47.62 MHz
13. What do you mean by instruction cycle, machine cycle and T states?
Answer
Instruction cycle: The processing required for a single instruction is called an instruction cycle. The
control unit’s task is to go through an instruction cycle that can be divided into five major phases:
1. Fetch the instruction from memory.
CHAPTER 6
Carry (C): It indicates whether there is an end-carry from the most significant bit position.
Zero (Z): It indicates whether the result is zero or non-zero.
Sign (S): It indicates whether the result is positive or negative.
Overflow (V): It indicates whether the operation produces an overflow or not.
There may be other flags such as parity and auxiliary carry.
2. What are the advantages and disadvantages of microprogram control unit over hardwired
control unit?
Answer
Advantages of microprogram control unit:
(a) It provides a well-structured control organization: control signals are systematically transformed
into formatted words (microinstructions). In the hardwired control organization, by contrast, the
control signals are generated by logic gates, flip-flops, decoders and other digital circuits.
(b) With microprogramming, many additions and changes can be made by simply changing the
microprogram in the control memory. A small change in the hardwired approach may lead to
redesigning the entire system.
Disadvantage of microprogram control unit:
The microprogramming approach is more expensive than the hardwired approach, since a control
ROM memory is needed.
3. Suppose there are 15 micro-instructions in control memory that generate the control signals for
an instruction and it takes 7 micro-instructions to read an instruction from memory into instruc-
tion register IR and then to decode the instruction. Assuming a read of control memory address
occurs in 1 ns, what will be the time taken by the processor for the instruction?
Answer
For reading an instruction from memory and decoding it, the number of micro-instructions required
= 7, and for execution = 15. Therefore, since each control memory read takes 1 ns, the time required
to process the instruction = (7 + 15) × 1 ns = 22 ns.
4. An encoded microinstruction format is to be used in control unit. Show how a 9-bit micro-
operation field can be divided into subfields to specify 46 different control signals.
Answer
The 9-bit micro-operation can be divided into two subfields to specify 46 control signals by a partial
encoding, as shown in figure below.
The control signals are partitioned into disjoint groups of control fields, so two signals of different
groups can be enabled in parallel. The control signals are partitioned into two groups as:
Group 1: C1 C2 ….C31.
Group 2: C32 C33 ….C46.
With the above grouping, C32 can be activated simultaneously with C1 or C2, but C1 and C2
cannot both be active at once. Using one 5-to-32 decoder and one 4-to-16 decoder, the actual
control signals are generated.
5. A processor has 28 distinct instructions with 13 instructions having 12 micro-instructions and
15 having 18 micro-instructions.
(a) How many addresses are used in control memory?
(b) If three instructions jump to another set of micro-instructions, each having four micro-
instructions, then how many addresses are now used in control memory? Assume that each
micro-instruction also stores a branch address.
Answer
(a) Number of addresses in control memory for 28 instructions = 13 × 12 + 15 × 18 = 426.
(b) Number of addresses in control memory = 13 × 12 + 15 × 18 + 3 × 4 = 438.
6. Why do most modern processors follow microprogramming control organizations?
Answer
Microprogramming allows relatively complex instructions to be implemented using small number of
hardware circuitry, because control memory stores the necessary control signals directly instead of
large number of logic gates as used in hardwired control organization. Moreover, many additions and
changes can be made by simply changing the microprogram in the control memory.
7. A conventional microprogrammed control unit includes 2048 words of 117 bits each, of which
only 512 microinstructions are unique. Calculate the savings achieved by using a nanoprogramming
technique. Calculate the sizes of the microcontrol memory and nanocontrol memory.
Answer
It has 2048 microinstructions, out of which 512 are unique, as shown in the figure below. The contents of
the microcontrol memory (MCM) are pointers into the nanocontrol memory (NCM). Each microinstruc-
tion of the MCM is of size = ⌈log2(512)⌉ = 9 bits in length.
Now, the control unit has: an MCM of 2048 × 9 bits and an NCM of 512 × 117 bits, since there are
512 unique microinstructions. In conventional microprogramming using a single CM, this memory
size is 2048 × 117 bits. Therefore, the use of nanoprogramming saves a total of
= 2048 × 117 - (2048 × 9 + 512 × 117) bits
= 1,61,280 bits.
CHAPTER 7
In polling, interrupts are handled by software. When the processor detects an interrupt, it branches
to a common interrupt service routine (ISR) whose function is to poll each I/O device to determine
which device caused the interrupt. The order in which they are polled determines the priority of each
interrupt. The highest priority device is polled first, and if it is found that its interrupt signal is on,
the CPU branches to that device's own ISR for I/O transfer between the device and the CPU.
Otherwise it moves on to poll the next highest priority device.
4. Differentiate between polled I/O and interrupt driven I/O.
Answer
(a) In the polled I/O or programmed I/O method, the CPU stays in the program until the I/O device
indicates that it is ready for data transfer, so the CPU is kept busy needlessly. But in the interrupt-
driven I/O method, the CPU can perform its own task of instruction execution and is informed by
an interrupt signal when data transfer is needed.
(b) Polled I/O is a simple, low-cost technique, whereas interrupt-driven I/O is relatively costly and
complex, because in the second method a device controller is used to continuously monitor the
device status and raise an interrupt to the CPU as soon as the device is ready for data transfer.
(c) The polled I/O method is particularly useful in small low-speed computers or in systems that are
dedicated to monitoring a device continuously. However, the interrupt-driven I/O method is very
useful in modern high-speed computers.
5. Discuss the advantage of interrupt-initiated I/O over programmed I/O.
Answer
In the programmed I/O method, the program constantly monitors the device status. Thus, the CPU
stays in the program until the I/O device indicates that it is ready for data transfer. This is a time-
consuming process, since it keeps the CPU busy needlessly. It can be avoided by letting the device
controller continuously monitor the device status and raise an interrupt to the CPU as soon as the
device is ready for data transfer. Upon detecting the external interrupt signal, the CPU momentarily
stops the task it is processing, branches to an interrupt-service-routine (ISR) or I/O routine or
interrupt handler to process the I/O transfer, and then after completion of I/O transfer returns to the
task it was originally performing. Thus, in the interrupt-initiated mode, the ISR software (i.e. CPU)
performs data transfer but is not involved in checking whether the device is ready for data transfer or
not. Therefore, the execution time of CPU can be optimized by employing it to execute normal
programs, while no data transfer is required.
6. What are the different types of interrupt? Give examples.
Answer
There are basically three types of interrupts: external, internal or trap and software interrupts.
External interrupt: These are initiated through the processors’ interrupt pins by external devices.
Examples include interrupts by input-output devices and console switches. External interrupts can be
divided into two types: maskable and non-maskable.
Maskable interrupts: The user program can enable or disable all or a few device interrupts by
executing instructions EI or DI.
Non-maskable interrupts: The user program cannot disable it by any instruction. Some common
examples are: hardware error and power fail interrupt. This type of interrupt has higher priority than
maskable interrupts.
Internal interrupt: This type of interrupt is activated internally by exceptional conditions. Interrupts
caused by overflow, division by zero and execution of an illegal op-code are common examples of
this category.
Software interrupts: A software interrupt is initiated by executing an instruction like INT n in a
program, where n refers to the starting address of a procedure in the program. This type of interrupt
is used to call the operating system. The software interrupt instructions allow switching from user
mode to supervisor mode.
7. What are the differences between vectored and non-vectored interrupt?
Answer
In a vectored interrupt, the interrupting source supplies the branch information (the starting address
of the ISR) to the CPU. This information is called the interrupt vector, and it is not held at any fixed
memory location. The processor can identify individual devices even if they share a single
interrupt-request line, so the set-up time is very small.
In a non-vectored interrupt, the branch address (the starting address of the ISR) is assigned to a
fixed location in memory. Since the identities of the requesting devices are not known initially, the
set-up time is quite large.
8. “Interrupt request is serviced at the end of current instruction cycle while DMA request is
serviced almost as soon as it is received, even before completion of current instruction execu-
tion.” Explain.
Answer
In interrupt-initiated I/O, an interrupt request is serviced at the end of the current instruction cycle,
because the processor itself takes part in the I/O transfer for which it was interrupted; the processor
is busy with the data transfer after that instruction.
But in a DMA transfer, the processor is not involved during the data transfer; it only initiates the
transfer. The whole data transfer is supervised by the DMA controller, and during that time the
processor is free to do its own task of instruction execution.
9. Give the main reason why DMA based I/O is better in some circumstances than interrupt driven
I/O?
Answer
To transfer large blocks of data at high speed, the DMA method is used. A special DMA controller
is provided to allow a block of data to be transferred directly between a high-speed external device,
such as a magnetic disk, and the main memory, without continuous intervention by the CPU. The data transmission
cannot be stopped or slowed down until an entire block is transferred. This mode of DMA transfer is
known as burst transfer.
10. What are the different types of DMA controllers and how do they differ in their functioning?
Answer
DMA controllers are of two types:
– Independent DMA controller
– DMA controller having multiple DMA-channels
CHAPTER 8
1. Why are several parameters used to measure the performance of computers, instead of one?
Answer
Several parameters exist because computers are used for a variety of applications, the performance of
which can depend on different aspects of the computer system. For example, scientific computations
depend mainly on the performance of processors and memory system, while database and transaction
processing applications depend largely on the performance of computers’ I/O system. Thus, depend-
ing on the applications to be performed, parameters vary.
2. When executing a particular program, machine A gives 90 MIPS and machine B gives
70 MIPS. However, machine A executes the program in 50 seconds and machine B executes in
40 seconds. Give the reason.
Answer
MIPS rating gives measure of the rate at which a processor executes instructions, but different
machines need different number of instructions to perform a given computation. If machine A had to
execute more instructions than machine B to complete the program, it would be possible for machine
A to take longer to run the program than machine B despite the fact that machine A executes more
instructions per second.
3. A 50 MHz processor was used to execute a program with the following instruction mix and
clock cycle counts:

Instruction count   Clock cycles
50000               1
35000               2
20000               2
6000                3
Calculate the effective CPI, MIPS rate and execution time for this program.
Answer
CPIav = (50000 × 1 + 35000 × 2 + 20000 × 2 + 6000 × 3)/(50000 + 35000 + 20000 + 6000)
      = 1.6 cycles/instruction
MIPS rate = clock rate/(CPIav × 10^6)
          = (50 × 10^6)/(1.6 × 10^6)
          = 31.25
Execution time = IC (instruction count) × CPIav/clock rate
               = ((50000 + 35000 + 20000 + 6000) × 1.6)/(50 × 10^6) sec
               = (111000 × 1.6)/(50 × 10^6) sec
               = 3.552 ms
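The arithmetic can be cross-checked with a minimal Python sketch (illustrative only; the variable names are our own):

    # Effective CPI, MIPS rate and execution time from an instruction mix
    counts = [50000, 35000, 20000, 6000]    # instruction count of each type
    cycles = [1, 2, 2, 3]                   # clock cycles per instruction of each type
    clock = 50e6                            # 50 MHz clock rate
    ic = sum(counts)                        # total instruction count = 111000
    cpi = sum(n * c for n, c in zip(counts, cycles)) / ic  # ≈ 1.60 cycles/instruction
    mips = clock / (cpi * 1e6)              # ≈ 31.2 (31.25 above, with CPI rounded to 1.6)
    t = ic * cpi / clock                    # ≈ 3.56 ms (3.552 ms above, with CPI rounded)
    print(cpi, mips, t)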
4. What is meant by pipeline architecture? How does it improve the speed of execution of a
processor?
Answer
Pipelining is a technique of decomposing a sequential task into subtasks, with each subtask being
executed in a special dedicated stage (segment) that operates concurrently with all other stages. Each
stage performs partial processing dictated by the way the task is partitioned. Result obtained from a
stage is transferred to the next stage in the pipeline. The final result is obtained after the instruction
has passed through all the stages. All stages are synchronized by a common clock.
Several instructions are executed in different stages of the pipeline simultaneously. Thus, the executions
of several instructions are overlapped in the pipeline, increasing the rate at which instructions
are executed.
5. Consider the execution of a program of 20000 instructions by a linear pipeline processor with a
clock rate of 40 MHz. Assume that the instruction pipeline has five stages and that one instruction
is issued per clock cycle. The penalties due to branch instructions and out-of-order executions
are ignored. Calculate the speed-up of the pipeline over its equivalent non-pipeline processor,
the efficiency and throughput.
Answer
Given,
No. of instructions (tasks) n = 20000
No. of stages k = 5
Clock rate f = 40 MHz
Clock period t = 1/f = 1/(40 × 10^6) sec = 25 ns
Speed-up Sp = n × k/[k + (n − 1)]
            = (20000 × 5)/(5 + (20000 − 1))
            = 4.99
Efficiency η = n/[k + (n − 1)]
             = 20000/(5 + (20000 − 1))
             = 0.99
Throughput ω = η/t = η × f
             = 0.99 × (40 × 10^6) instructions per second
             = 39.6 MIPS
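As a quick cross-check, a minimal Python sketch of the same formulas:

    # Speed-up, efficiency and throughput of a k-stage linear pipeline
    n, k, f = 20000, 5, 40e6          # tasks, stages, clock rate in Hz
    Sp = (n * k) / (k + (n - 1))      # ≈ 4.99
    eta = n / (k + (n - 1))           # ≈ 0.9998, rounded to 0.99 above
    w = eta * f                       # ≈ 40.0 MIPS; 39.6 MIPS above after rounding eta
    print(Sp, eta, w / 1e6)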
6. Deduce that the maximum speed-up in a k-stage pipeline processor can be k. Is this maximum
speed-up always achievable? Explain.
Answer
Speed-up is defined as
Sp = (Time to execute n tasks in k-stage non-pipeline processor)/(Time to execute n tasks in k-stage pipeline processor)
The time to execute n tasks in a k-stage pipeline processor is t[k + (n − 1)] units, where k clock periods
(cycles) are needed to complete the execution of the first task and the remaining (n − 1) tasks require
(n − 1) cycles. The time to execute n tasks in a k-stage non-pipeline processor is n·k·t, where each task
requires k cycles because no new task can enter the pipeline until the previous task finishes. The clock
period of the pipeline processor is t.
Thus Sp = n·k·t/t[k + (n − 1)] = nk/[k + (n − 1)]
The maximum speed-up is obtained for n >> k, that is, n → ∞:
Max Sp = lim(n→∞) nk/[k + (n − 1)] = lim(n→∞) k/[(k/n) + 1 − (1/n)] = k
But this maximum speed-up k is never fully achievable because of data dependencies between
instructions, interrupts, program branches, etc.
7. Given a non-pipelined processor with a 15 ns clock period, how many stages of a pipelined version
of the processor are required to achieve a clock period of 4 ns? Assume that the interface latch
has a delay of 0.5 ns.
Answer
We can write the clock period of a pipeline processor as:
Clock period of pipeline = (Clock period of non-pipeline/No. of pipeline stages) + Interface latch delay
This can be rewritten as:
No. of pipeline stages = Clock period of non-pipeline/(Clock period of pipeline − Interface latch delay)
Therefore, for our problem,
No. of pipeline stages = 15/(4 − 0.5) = 4.3
Since the number of pipeline stages cannot be a fraction, rounding up gives 5 stages.
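The same rounding can be expressed as a minimal Python sketch:

    import math
    t_seq, t_pipe, latch = 15.0, 4.0, 0.5    # clock periods and latch delay in ns
    k = t_seq / (t_pipe - latch)             # ≈ 4.3 stages (fractional)
    print(math.ceil(k))                      # -> 5 stages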
8. Consider the following pipeline processor with four stages, in which all successor stages after
each stage must be used in successive clock periods. Write the reservation table for this pipeline
with seven evaluation times.
Answer
The reservation table for this pipeline with seven evaluation times (i.e. seven columns) is as follows:
9. Describe how branch-type instructions can damage the performance of an instruction pipeline,
and what is the solution?
Answer
A stream of instructions is executed continuously in an overlapped fashion in the instruction pipeline
as long as no branch-type instruction is encountered in the stream. The performance in this case is
one instruction per pipeline cycle after the first k cycles in a k-stage pipeline. If a branch instruction
enters the pipeline, the performance is hampered. A conditional branch instruction is resolved only
after it has executed partway down the pipeline, at which point the program counter (PC) must be
loaded with the new address to which the program is directed, making all pre-fetched instructions in
the pipeline useless. The next instruction cannot be initiated until the completion of the current branch
instruction, and this causes an extra time penalty (delay) in executing the next instruction in the
pipeline.
In order to cope with the adverse effects of branch instructions, an important technique called
prefetching is used: instruction words ahead of the one currently being decoded in the instruction-
decoding (ID) stage are fetched from the memory system before the ID stage requests them.
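As a rough quantitative illustration (the figures here are assumed, not from the problem): if 20 per cent of instructions are branches and each resolved branch wastes 3 cycles of pre-fetched work, the effective CPI rises from 1 to 1 + 0.2 × 3 = 1.6, so the pipeline delivers only about 62 per cent of its ideal throughput; prefetching aims to recover this loss.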
10. Consider a four-stage normalized floating-point adder with 10 ns delay per stage, which equals
the pipeline clock period.
(a) Name the appropriate functions to be performed by four stages.
(b) Find the minimum number of periods required to add 100 floating point numbers A1 + A2
+……..+ A100 using this pipeline adder, assuming that the output Z of 4th stage can be routed
back to any of the two inputs X or Y of the pipeline with delays equal to any multiples of the
period.
Answer
(a) The functions of the four pipeline stages are summarized as:
1. Exponent subtraction: Compare the exponents e1 and e2, compute k = |e1 − e2|, and pick out
the larger exponent along with the fraction of the number having the smaller exponent.
2. Alignment of mantissas: Shift the mantissa of the number with the smaller exponent right
by k positions.
3. Addition of mantissas: Add the aligned fractions.
4. Normalization of the result: Compute M, the number of leading 0s in the mantissa of the
result, shift the mantissa left by M positions and subtract M from the exponent.
(b) The addition proceeds in two distinct phases: (1) add pairs of numbers to get 4 partial sums
(since we have a four-stage pipeline; in general, a k-stage pipeline yields k partial sums), and
(2) add the partial sums to get the final sum.
In the first phase, in the first four cycles we send in the four pairs to be added: (A1 + A5), (A2 +
A6), (A3 + A7), (A4 + A8). (In general, for a k-stage pipeline we send in k pairs.) After 4 cycles,
the pipeline is full (computing the four partial sums) and there are n − 8 (here, n = 100) remaining
numbers to be sent into the pipeline, representing n − 8 more additions. Each remaining number
is sent in to be added to a partial sum; for example, A9 is sent in to be added to the partial sum
(A1 + A5). To accomplish this, the pipeline must be set to take one of its inputs from its output
stage and the other input from the stream of input numbers.
The time to complete these n − 8 additions is n − 8 pipeline cycles, and therefore in n − 8 + 4 =
n − 4 cycles all partial sums have been sent into the pipeline and all four stages of the pipeline
are occupied with these partial sums.
In the second phase, the partial sums are added to form the final result. Let the four partial
sums be denoted sum1, sum2, sum3, sum4, with sum1 being the first partial sum sent in (and thus
at the last stage of the pipeline at the end of phase 1). The time to add these four partial sums
is a function of the length of the pipeline and the time it takes for these partial sums to exit the
pipeline. Thus,
• After 1 cycle, sum1 exits the pipeline but cannot be added to anything; therefore it is held in
an interface latch.
• After another cycle, sum2 exits the pipeline and can be added to sum1. Therefore, after a
total of 2 cycles (from the end of phase 1) sum1 + sum2 can be sent into the pipeline.
• After another 2 cycles, both sum3 and sum4 can be sent into the pipeline to be added. Note
that the sum of these two will be available 4 cycles later.
• Finally, after a total of 8 cycles from the end of phase 1, the final sum (which is the sum of
(sum1 + sum2) and (sum3 + sum4)) can be sent into the pipeline.
• The final sum will thus be available a total of 8 + 4 = 12 cycles after the end of phase 1.
The total minimum time for the addition of n numbers is therefore T = (n − 4) + 12 = n + 8 cycles
= 108 cycles (putting n = 100).
11. Identify all of the RAW, WAR and WAW hazards in the instruction sequence:
DIV R1, R2, R3
SUB R4, R1, R5
ASH R2, R6, R7
MULT R8, R4, R2
BEQ R9, #0, R10
OR R3, R11, R1
Also identify all of the control hazards in the sequence.
Answer
RAW hazard exists between instructions:
DIV and SUB
ASH and MULT
SUB and MULT
DIV and OR
WAR hazard exists between instructions:
DIV and ASH
DIV and OR
There is no WAW hazard.
There is only one control hazard between the BEQ and OR instructions.
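The same hazards can be found mechanically. The following Python sketch assumes each instruction is encoded as (opcode, destination, sources) and omits the immediate #0 of BEQ:

    # Report RAW/WAR/WAW hazards for every ordered pair of instructions.
    # A simple pairwise scan suffices here because no register is written twice.
    instrs = [
        ("DIV",  "R1", ["R2", "R3"]),
        ("SUB",  "R4", ["R1", "R5"]),
        ("ASH",  "R2", ["R6", "R7"]),
        ("MULT", "R8", ["R4", "R2"]),
        ("BEQ",  None, ["R9", "R10"]),   # branch: reads its operands, writes no register
        ("OR",   "R3", ["R11", "R1"]),
    ]
    for i, (op_i, dst_i, src_i) in enumerate(instrs):
        for op_j, dst_j, src_j in instrs[i + 1:]:
            if dst_i and dst_i in src_j:
                print("RAW between", op_i, "and", op_j, "on", dst_i)
            if dst_j and dst_j in src_i:
                print("WAR between", op_i, "and", op_j, "on", dst_j)
            if dst_i and dst_i == dst_j:
                print("WAW between", op_i, "and", op_j, "on", dst_i)

Running it reproduces the four RAW and two WAR hazards listed above; the control hazard follows from BEQ being a branch, which a register scan alone cannot reveal.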
12. What is instruction-level parallelism? How do processors exploit it to improve performance?
Answer
Many of the instructions in a sequential program are independent, meaning that it is not necessary to
execute them in the order that they appear in the program to produce the correct result. This fact is
referred to as instruction-level parallelism. Processors exploit this by executing these instructions in
parallel rather than sequentially, reducing the amount of time it takes to execute programs.
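For example, the statements a = b + c and d = e + f reference disjoint operands, so a processor with two adders can issue and execute both in the same clock cycle without changing the program's result.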
13. With an example show how the vector processors perform better than scalar processors.
Answer
To examine the efficiency of vector processing over scalar processing, we compare two versions of
the same computation, one written in vector style and the other in scalar style, as sketched below.
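As an illustrative sketch (Python with NumPy; an assumed example of adding two 1000-element vectors):

    import numpy as np
    a = np.arange(1000.0)                # first operand vector
    b = np.arange(1000.0)                # second operand vector

    # Scalar processing: one element per iteration; every addition pays
    # loop-control and indexing overhead in addition to the add itself
    c = np.empty(1000)
    for i in range(1000):
        c[i] = a[i] + b[i]

    # Vector processing: a single vector operation adds all 1000 element
    # pairs; instruction fetch/decode overhead is paid once per vector
    d = a + b

The scalar loop issues roughly 1000 adds plus as many index and branch operations, whereas the vector form expresses the whole computation as one vector add; this is why vector processors sustain much higher throughput on such code.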
16. What is the necessity for using an interconnection network in a parallel processing computer?
Answer
Parallel computers have several processing units, each with a small local memory or none at all.
These processing units take part in the execution of a single large task (program) stored in a large
shared main memory; that is, they execute different modules of the large task concurrently. The units
must therefore cooperate and share their instructions through the shared main memory, and an
interconnection network is required to provide this sharing among the different processing units.
17. How many 2 × 2 switches and how many stages are required to design an n × n omega network?
Answer
An n × n omega network consists of log2 n identical stages. Each stage has n/2 switch boxes, each of
size 2 × 2. Therefore, the total number of switch boxes required is (n/2) × log2 n.
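For example, an 8 × 8 omega network has log2 8 = 3 stages of 8/2 = 4 switch boxes each, i.e. 12 switches in total.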
18. What do you mean by multiprocessor system? What are the similarities and dissimilarities
between the multiprocessor system and multiple computer system?
Answer
A multiprocessor system is a single computer system consisting of multiple processors, which may
communicate and cooperate at different levels in solving a given problem. The communication may
occur by sending messages from one processor to the other or by sharing a common memory.
Similarities:
(a) Both systems consist of several processors inter-connected through networks.
(b) Both execute programs concurrently.
(c) Throughput, in both cases, is very high.
(d) Both are very expensive systems.
Dissimilarities:
(a) In multiprocessor systems, the processors usually communicate with each other; in multiple
computer systems, the autonomous processors may or may not communicate with each other.
(b) A multiprocessor system is controlled by one operating system, which provides interaction
between processors; in a multiple computer system, the different processors are controlled by
different operating systems.
(c) The degree of interaction among processors in a multiprocessor system is very high, whereas in
a multiple computer system it is much lower.
19. Compare centralized and distributed architectures. Which architecture is better and why?
Answer
In a centralized architecture, all the processors access the physical main memory uniformly: every
processor has equal access time to every memory word. The degree of interaction among tasks is high,
so the probability of bus conflicts is high because of frequent sharing of code between processors.
The architecture is shown in the following figure.
In a distributed system, a local memory is attached to each processor. All the local memories distributed
throughout the system form a global shared memory accessible by all processors. The access
time of a memory word varies with its location in the shared memory. The degree of interaction
among tasks is lower, so the probability of bus conflicts is also lower. The distributed system is
depicted in the figure.
It is faster for a processor to access its own local memory; access to a remote memory attached to
another processor takes longer because of the added delay through the interconnection network.
Therefore, the distributed system is faster and, in this regard, the better architecture.
20. Explain the terms with relevance to interconnection networks: (a) full access property (b) non-
blocking property.
Answer
(a) Full access property: If every PE (processing element) can be connected to every other PE using
at least one configuration of the network, the network is said to have the full-access property.
(b) Non-blocking property: If the network can perform all possible connections between source
(input) PEs and destination (output) PEs by rearranging its connections, the multistage network
is said to have the non-blocking property.