Mit (CS) 404
Author
Prof. Jatindra Kumar Deka
Assistant Professor
Department of Computer Science and Engineering, IIT Guwahati
Acknowledgement
The University gratefully acknowledges NPTEL and the author for providing the study material on the NPTEL portal under the Creative Commons Attribution-NonCommercial-ShareAlike (CC BY-NC-SA) license.
Table of Contents
1.0 Learning Objectives
1.7 Summary
2.3 Signed Integer
2.7 Summary
3.2.4 Fourth Generation (1974-Present): Very Large-Scale Integration (VLSI) / Ultra Large Scale Integration (ULSI)
3.6 Summary
4.4 Binary Multiplier, Hardware Implementation
4.5 Summary
5.4 Summary
Answers to Check your progress I
Check your progress I
9.1.6.1 Relative Addressing
9.8.9 Skip Instruction
10.10 Branching
11.8 Model Questions
12.6 Direct Memory Access
13.3.4 RAID levels
Check your progress I
16.2 Interconnection Networks
INTRODUCTION TO COMPUTER SYSTEM
1.1 Introduction
The basic functional units of a computer are made of electronic circuits, and they work with electrical signals. We provide input to the computer in the form of electrical signals and get the output in the form of electrical signals. There are two basic types of electrical signals, namely, analog and digital. Analog signals are continuous in nature, and digital signals are discrete in nature. An electronic device that works with continuous signals is known as an analog device, and an electronic device that works with discrete signals is known as a digital device. At present, most computers are digital in nature, and we will deal with digital computers in this course.
A computer is a digital device, which works on two levels of signal. We call these two levels of signal High and Low. The High-level signal corresponds to a higher voltage (say 5 V or 12 V), and the Low-level signal corresponds to a lower voltage (say 0 V). This is one convention, known as positive logic; there are other conventions as well, such as negative logic.
Since a computer is a digital electronic device, we have to deal with these two kinds of electrical signals. But while designing a new computer system or studying the working principle of a computer, it is inconvenient to write or work with 0 V or 5 V.
To make things convenient for understanding, we use symbolic logical values instead, say HIGH and LOW. A computer is used mainly to solve numerical problems, and it is not convenient to work with a symbolic representation either. For that purpose, we move to a numeric representation, in which we use 0 to represent LOW and 1 to represent HIGH.
0 means LOW
1 means HIGH
To understand the working principle of a computer, we thus use only two numeric symbols, namely 0 and 1. All the functionality of a computer can be captured with 0 and 1, and its theoretical background corresponds to two-valued Boolean algebra.
With the symbols 0 and 1, we have a mathematical system, which is known as the binary number system. The binary number system is used to represent information and to manipulate information in the computer. This information is basically strings of 0s and 1s.
The smallest unit of information that is represented in a computer is known as a Bit (Binary Digit), which is either 0 or 1. Four bits together are known as a Nibble, and eight bits together are known as a Byte.
Today, a personal computer has more computational power, more main memory, and more disk storage than its predecessors; it is smaller in size and available at an affordable cost.
This rapid rate of improvement has come both from advances in the technology used to
build computers and from innovation in computer design. In this course we will mainly
deal with the innovation in computer design.
The task that the computer designer handles is a complex one: Determine what attributes
are important for a new machine, then design a machine to maximize performance while
staying within cost constraints.
This task has many aspects, including instruction set design, functional organization, logic
design, and implementation.
While looking at the task of computer design, both the terms computer organization and computer architecture come into the picture. It is difficult to give precise definitions for these terms, but while describing a computer system we come across them, and in the literature computer scientists try to make a distinction between the two.
Computer architecture refers to those parameters of a computer system that are visible to a
programmer or those parameters that have a direct impact on the logical execution of a
program. Examples of architectural attributes include the instruction set, the number of bits
used to represent different data types, I/O mechanisms, and techniques for addressing
memory.
Computer organization refers to the operational units and their interconnections that realize
the architectural specifications. Examples of organizational attributes include those
hardware details transparent to the programmer, such as control signals, interfaces between
the computer and peripherals, and the memory technology used.
In this course we will touch upon all these factors and finally see how these attributes contribute to building a complete computer system.
• The program control unit has a set of registers and a control circuit to generate control signals.
• The execution unit or data processing unit contains a set of registers for storing data
and an Arithmetic and Logic Unit (ALU) for execution of arithmetic and logical
operations.
In addition, the CPU may have some additional registers for the temporary storage of data.
B. Input Unit:
With the help of the input unit, data from outside can be supplied to the computer. A program or data is read into main storage from an input device or secondary storage under the control of a CPU input instruction.
Examples of input devices: Keyboard, Mouse, Hard disk, Floppy disk, CD-ROM drive, etc.
C. Output Unit:
With the help of the output unit, computer results can be provided to the user, or they can be stored in a storage device permanently for future use. Output data from main storage go to the output device under the control of CPU output instructions.
Examples of output devices: Printer, Monitor, Plotter, Hard Disk, Floppy Disk, etc.
D. Memory Unit:
The memory unit is used to store data and programs. The CPU can work with the information stored in the memory unit. This memory unit is termed primary memory or the main memory module. These are basically semiconductor memories.
Hard disks, floppy disks and magnetic tapes are magnetic devices; they serve as secondary memory, used for the permanent storage of data and programs.
Check your progress I
➢ The electronic device that works with continuous signals is known as _______ device.
➢ _______________is used to represent the information and manipulation of information in
computer.
➢ Computer ____________ refers to those parameters of a computer system that are visible
to a programmer or those parameters that have a direct impact on the logical execution of a
program.
➢ Computer _______________ refers to the operational units and their interconnections that
realize the architectural specifications.
➢ The execution unit or data processing unit contains a set of registers for storing data
and an _______________for execution of arithmetic and logical operations.
➢ Secondary memories are _________ memory and it is used for permanent storage of data
and program.
In this small computer, we do not consider the Input and Output units; we will consider only the CPU and the memory module. Assume that somehow we have stored the program and data in main memory. We will see how the CPU can perform a job depending on the program stored in main memory.
P.S. - Our assumption is that students understand common terms like program, CPU, memory, etc., without knowing the exact details.
Consider the Arithmetic and Logic Unit (ALU) of the Central Processing Unit - an ALU which can perform four arithmetic operations and four logical operations. To distinguish between an arithmetic and a logical operation, we may use one signal line; in a similar manner, we need another two signal lines to distinguish among the four operations within each group. The different operations and their binary codes are given in a table of arithmetic and logical operations (code 000, for example, selects addition).
Consider the part of the control unit whose task is to generate the appropriate signal at the right moment. There is an instruction decoder in the CPU, which decodes this information in such a way that the computer can perform the desired task.
A simple model of the decoder is as follows: there are three input lines to the decoder, and correspondingly it generates eight output lines. Depending on the input combination, only one of the output signals is activated, and it is used to indicate the corresponding operation of the ALU.
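The behaviour of such a 3-to-8 decoder can be sketched in a few lines of Python. This is an illustration only; the function name and the choice of which output line selects which operation are ours:

def decode_3_to_8(c2, c1, c0):
    # Interpret the three control bits as a number between 0 and 7.
    index = c2 * 4 + c1 * 2 + c0
    outputs = [0] * 8
    outputs[index] = 1        # exactly one output line is activated
    return outputs

# Control lines 0, 0, 0 activate output line 0 (say, the ADD circuit):
print(decode_3_to_8(0, 0, 0))   # [1, 0, 0, 0, 0, 0, 0, 0]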
While processing, the CPU needs some storage units to hold operands and results; these storage units are known as registers. But in a computer, we need more storage space for the proper functioning of the computer.
Some of these storage units are inside the CPU and are known as registers; the other, bigger chunk of storage space is known as primary memory or main memory. The CPU can work with the information available in main memory only.
To access data from memory, we need two special registers: one is known as the Memory Data Register (MDR), and the second one is the Memory Address Register (MAR).
Data and programs are stored in main memory. While executing a program, the CPU brings instructions and data from main memory and performs the tasks as per the instructions fetched from the memory. After the completion of an operation, the CPU stores the result back into the memory.
In the next section, we discuss the memory organization of our small machine.
1.5 Main Memory Organization
The main memory unit is the storage unit; there are several locations for storing information in the main memory module. The capacity of a memory module is specified by the number of memory locations and the amount of information stored in each location. A memory module of capacity 16 x 4 indicates that there are 16 locations in the memory module, and in each location we can store 4 bits of information.
We have to know how to indicate or point to a specific memory location. This is done by the address of the memory location.
READ Operation: This operation retrieves data from memory and brings it to a CPU register. (The complementary WRITE operation stores the contents of a CPU register into memory.)
We need some mechanism to distinguish these two operations, READ and WRITE. With the help of one signal line, we can differentiate the two operations, depending on the content of this signal line.
To transfer data between the CPU and the memory module, and vice versa, we need some connection. This is termed the DATA BUS. The size of the data bus indicates how many bits we can transfer at a time; it is mainly determined by the data storage capacity of each location of the memory module.
We also have to resolve the issue of how to specify the particular memory location where we want to store our data, or from where we want to retrieve data. This can be done by the memory address: each location can be specified with the help of a binary address.
If we use 4 signal lines, we have 16 different combinations on these four lines, provided we use two signal values only (say 0 and 1). So, to distinguish 16 locations, we need four signal lines. The signal lines used to identify a memory location are termed the ADDRESS BUS. The size of the address bus depends on the memory size: for a memory module with a capacity of 2^n locations, we need n address lines, that is, an address bus of size n. We use an address decoder to decode the address present on the address bus.
As an example, consider a memory module of 16 locations where each location can store 4 bits of information. The size of the address bus is 4 bits, and the size of the data bus is 4 bits.
If R/W = 0, we perform a READ operation, and if R/W = 1, we perform a WRITE operation.
If the contents of the address bus are 0101, the contents of the data bus are 1100, and R/W = 1, then 1100 will be written into location 5. If the contents of the address bus are 1011 and R/W = 0, then the contents of location 1011 will be placed on the data bus.
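This read/write behaviour of the 16 x 4 memory module can be mimicked with a small Python sketch. The class and method names are ours, chosen for illustration; the R/W convention (0 = READ, 1 = WRITE) is the one given above:

class Memory16x4:
    def __init__(self):
        self.cells = [0] * 16               # 16 locations, each holding 4 bits

    def access(self, address, rw, data=0):
        # address and data are 4-bit values; rw = 0 for READ, 1 for WRITE
        if rw == 1:
            self.cells[address] = data & 0b1111   # write data bus into the location
            return None
        return self.cells[address]                # place location contents on the data bus

m = Memory16x4()
m.access(0b0101, rw=1, data=0b1100)   # write 1100 into location 5
print(m.access(0b0101, rw=0))         # read it back: 12, i.e., 1100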
In the next section, we will explain how to perform memory access operations in our small hypothetical computer.
We need some more instructions to work with the computer. Apart from the instructions needed to perform tasks inside the CPU, we need some instructions for data transfer from main memory to the CPU and vice versa.
In our hypothetical machine, we use three signal lines to identify a particular instruction. If we want to include more instructions, we need additional signal lines.
With one additional signal line, we can go up to 16 instructions. When the signal on this new line is 0, it indicates an ALU operation; a signal value equal to 1 indicates one of 8 new instructions. So, we can design 8 new memory access instructions.
We have added 6 new instructions; still, two codes are unused, which can be used for other purposes. We show them as NOP, meaning No Operation. We have seen that for an ALU operation, the instruction decoder generates the signal for the appropriate ALU operation.
Apart from that, we need many more signals for the proper functioning of the computer. Therefore, we need a module known as the control unit, which is a part of the CPU. The control unit is responsible for generating the appropriate signals. For example, for the LDAI instruction, the control unit must generate a signal which enables register A to store the incoming data.
One major task is to design the control unit to generate the appropriate signal at the appropriate time for the proper functioning of the computer. Consider a simple problem: add two numbers and store the result in memory - say we want to add 7 to 5. To solve this problem on a computer, we have to write a computer program. The program is machine specific and is related to the instruction set of the machine. For our hypothetical machine, the program is as follows:
Instruction   Binary      HEX   Memory Location
LDAI 5        1000 0101   85    (0, 1)
LDBI 7        1010 0111   A7    (2, 3)
ADD           0000        0     (4)
STC 15        1100 1111   CF    (5, 6)
HALT          1101        D     (7)
Consider another example: say the first number is stored in memory location 13 and the second number is stored in memory location 14. Write a program to add the contents of memory locations 13 and 14 and store the result in memory location 15.

Instruction   Binary      HEX   Memory Location
LDAA 13       1001 1101   9D    (0, 1)
LDBA 14       1011 1110   BE    (2, 3)
ADD           0000        0     (4)
STC 15        1100 1111   CF    (5, 6)
HALT          1101        D     (7)
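To see the stored-program idea in action, this second program can be traced with a small Python simulator of our hypothetical machine. This is only a sketch: the source gives the codes for LDAI (1000), LDBI (1010), ADD (0000), STC (1100) and HALT (1101), while the codes used here for LDAA (1001) and LDBA (1011) are our own reconstruction from the otherwise unused codes:

def run(memory):
    a = b = c = 0                                # registers A, B, C
    pc = 0                                       # address of the next instruction
    while True:
        opcode = memory[pc]
        if opcode == 0b0000:                     # ADD: C = A + B
            c = a + b; pc += 1
        elif opcode == 0b1000:                   # LDAI n: A = n
            a = memory[pc + 1]; pc += 2
        elif opcode == 0b1001:                   # LDAA addr: A = M[addr] (assumed code)
            a = memory[memory[pc + 1]]; pc += 2
        elif opcode == 0b1010:                   # LDBI n: B = n
            b = memory[pc + 1]; pc += 2
        elif opcode == 0b1011:                   # LDBA addr: B = M[addr] (assumed code)
            b = memory[memory[pc + 1]]; pc += 2
        elif opcode == 0b1100:                   # STC addr: M[addr] = C
            memory[memory[pc + 1]] = c; pc += 2
        elif opcode == 0b1101:                   # HALT
            return memory

# Program in locations 0-7; operands 7 and 5 in locations 13 and 14.
mem = [0b1001, 13, 0b1011, 14, 0b0000, 0b1100, 15, 0b1101, 0, 0, 0, 0, 0, 7, 5, 0]
print(run(mem)[15])   # 12, stored in location 15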
One question still remains unanswered: how do we store the program or data in main memory? Only after we put the program and data in main memory can the CPU execute the program. For that we need some more instructions.
We need some instructions to perform the input tasks. These instructions are responsible for taking the input data from input devices and storing them in main memory. For example, instructions are needed to take input from the keyboard.
We need some other instructions to perform the output tasks. These instructions are responsible for delivering results to output devices. For example, instructions are needed to send results to the printer.
We have seen that the number of instructions that can be provided in a computer depends on the number of signal lines used to specify the instruction, which is basically tied to the size of the storage devices of the computer. For uniformity, we use the same size for all storage spaces, which are known as registers. If we work with a 16-bit machine, the total number of instructions that can be implemented is 2^16.
The model that we have described here is known as the Von Neumann Stored Program Concept: first, we store all the instructions of a program in main memory, and the CPU works with the contents stored in main memory. Instructions are executed one after another.
➢ NOP means _______________.
1.6 Main Memory Organization: Stored Program
Present-day digital computers are based on the stored-program concept introduced by Von Neumann. In this stored-program concept, programs and data are stored in a storage unit, separate from the CPU, called the memory.
The Central Processing Unit, the main component of the computer, can work only with the information stored in this storage unit.
In 1946, Von Neumann and his colleagues began the design of a stored-program computer at the Institute for Advanced Study in Princeton. This computer is referred to as the IAS computer.
The IAS computer has three basic units: the Central Processing Unit (CPU), the Main Memory Unit, and the Input/Output Device.
Central Processing Unit (CPU):
This is the main unit of the computer, which is responsible for performing all the operations. The CPU of the IAS computer consists of a data processing unit and a program control unit.
The data processing unit contains high-speed registers intended for the temporary storage of instructions, memory addresses and data. The main actions specified by instructions are performed by the arithmetic-logic circuits of the data processing unit.
The control circuits in the program control unit are responsible for fetching instructions,
decoding opcodes, controlling the information movements correctly through the system,
and providing proper control signals for all CPU actions.
Main Memory Unit:
It is used for storing programs and data. The memory locations of the memory unit are uniquely specified by the memory address of the location. M(X) is used to indicate the location of the memory unit M with address X.
The data transfer between the memory unit and the CPU takes place with the help of the data register DR. When the CPU wants to read some information from the memory unit, the information is first brought to DR, and after that it goes to the appropriate position. Similarly, data to be stored in memory must be put into DR first, and then it is stored in the appropriate location in the memory unit.
The address of the memory location that is used during memory read and memory write operations is stored in the address register AR.
If the information fetched from the memory is an operand of an instruction, it is moved from DR to the data processing unit (either to AC or MQ). If it is an instruction, it is moved to the program control unit (either to IR or IBR).
Two additional registers for the temporary storage of operands and results are included in
data processing units: the accumulator AC and the multiplier-quotient register MQ.
Two instructions are fetched simultaneously from M and transferred to the program control
unit. The instruction that is not to be executed immediately is placed in the instruction
buffer register IBR. The opcode of the other instruction is placed in the instruction register
IR where it is decoded.
In the decoding phase, the control circuits generate the required control signals to perform
the specified operation in the instruction. The program counter (PC) is used to store the
address of the next instruction to be fetched from memory.
Input/Output Device:
Input devices are used to put information into the computer. With the help of input devices we can store information in memory so that the CPU can use it. A program or data is read into main memory from an input device or secondary storage under the control of a CPU input instruction.
Output devices are used to get information out of the computer. If some results are computed by the computer and stored in it, then with the help of output devices we can present them to the user. Output data from the main memory go to the output device under the control of CPU output instructions.
1.7 Summary
1. The smallest unit of information that is represented in computer is known as Bit
(Binary Digit)
2. Four bits together is known as Nibble.
3. Eight bits together is known as Byte.
4. There are two basic types of electrical signals, namely, analog and digital.
5. The analog signals are continuous in nature and digital signals are discrete in
nature.
6. Memory unit is used to store the data and program.
7. The CPU can work with the information available in main memory only.
8. The capacity of a memory module is specified by the number of memory location
and the information stored in each location.
9. To transfer the data from CPU to memory module and vice-versa, we need some
connection. This is termed as DATA BUS.
10. The control unit is responsible to generate the appropriate signal.
11. The control circuits in the program control unit are responsible for fetching
instructions, decoding opcodes, controlling the information movements correctly
through the system, and providing proper control signals for all CPU actions.
12. Input devices are used to put the information into computer.
13. Output devices are used to output the information from computer.
Answers to Check your progress I
➢ Analog
➢ binary number system
➢ architecture
➢ organization
➢ Arithmetic and Logic Unit (ALU)
➢ non-volatile
➢ (4096 x 16), as the memory has 4096 unique address locations and each location can store 16 bits of data.
➢ No Operation
➢ Data Register
➢ Program Counter
NUMBER SYSTEM AND REPRESENTATION
Learning Objectives
After the completion of this unit, the learner shall be able to:
• Convert a given decimal number to a binary number;
• Convert a given binary number to a decimal number;
• Define Octal and hexadecimal numbers;
• Explain the representation of unsigned integers;
• Explain the representation of signed integers;
• Represent a negative number in signed-magnitude form;
• Represent a negative number in 1's complement form;
• Represent a negative number in 2's complement form;
• Explain the representation of Real Numbers;
• Explain IEEE standard floating-point format;
• Define the Representation of Character in ASCII, EBCDIC and UNICODE format;
In our day-to-day arithmetic activities, we use the decimal number system. The decimal number system is said to be of base, or radix, 10, because it uses ten digits and the coefficients are multiplied by powers of 10.
A decimal number such as 5273 represents a quantity equal to 5 thousands plus 2 hundreds plus 7 tens plus 3 units. The thousands, hundreds, etc. are powers of 10 implied by the position of the coefficients. To be more precise, 5273 should be written as:
5 x 10^3 + 2 x 10^2 + 7 x 10^1 + 3 x 10^0
However, the convention is to write only the coefficients and to deduce the necessary powers of 10 from their positions.
For computer arithmetic we use the binary number system. The binary number system uses two symbols, 0 and 1, to represent numbers. It is said to be of base 2, or radix 2, because it uses two digits and the coefficients are multiplied by powers of 2.
For example, (11010)2 = 1 x 2^4 + 1 x 2^3 + 0 x 2^2 + 1 x 2^1 + 0 x 2^0 = 26 (in decimal).
In case of 8-bit numbers, the minimum number that can be stored in computer is 00000000
(0) and maximum number is 11111111 (255) (if we are working with natural numbers).
So, the domain of numbers is restricted by the storage capacity of the computer. It is also related to the number system; the above range is for natural numbers. In general, for an n-bit number, the range for natural numbers is from 0 to 2^n - 1.
Consider, for example, the addition of the 8-bit numbers 00000111 (7) and 00000101 (5):

00000111            7
00000101            5
-----------------   ------
00001100            12

Here the result is an 8-bit number, so it can be stored in the 8-bit computer, and we get the correct result. Now consider the following addition:

10000001            129
10101010            170
-----------------   ------
100101011           299

In this example, the result is a 9-bit number, but we can store only 8 bits, so the most significant bit (MSB) cannot be stored. The result of this addition will be stored as (00101011), which is 43, and it is not the desired result. Since we cannot store the complete result of the operation, this is known as the overflow case.
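The overflow case is easy to reproduce in Python by masking the sum to 8 bits; this sketch simply mimics a register that has nowhere to put the ninth bit:

a, b = 0b10000001, 0b10101010   # 129 and 170
full = a + b                    # 299, which needs 9 bits
stored = full & 0xFF            # only the low 8 bits fit in an 8-bit register
print(bin(full))                # 0b100101011
print(stored, bin(stored))      # 43 0b101011  -- the overflowed result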
The coefficient a(n-1) is multiplied by 2^(n-1), and it is known as the most significant bit (MSB). The coefficient a0 is multiplied by 2^0, and it is known as the least significant bit (LSB).
For our convenience, while writing on paper, we may take the help of other number systems like octal and hexadecimal. This reduces the burden of writing long strings of 0s and 1s.
Octal numbers: The octal number system is said to be of base, or radix, 8, because it uses 8 digits and the coefficients are multiplied by powers of 8. The eight digits used in the octal system are: 0, 1, 2, 3, 4, 5, 6 and 7.
Hexadecimal numbers: The hexadecimal number system is said to be of base, or radix, 16, because it uses 16 symbols and the coefficients are multiplied by powers of 16. The sixteen digits used in the hexadecimal system are: 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, A, B, C, D, E and F.
Check your progress I
➢ The smallest unit of information a computer can understand and process is known as a ___.
➢ A string of eight 0s and 1s is called a _____________.
➢ A computer works on a ________ number system.
➢ Hexadecimal number system has _______ base.
➢ Convert (89)10 =(…………)2
2.3 Signed Integer
We know that for an n-bit number, the range for natural numbers is from 0 to 2^n - 1. For n bits, we have altogether 2^n different combinations, and we use these different combinations to represent 2^n numbers, which range from 0 to 2^n - 1.
If we want to include negative numbers, naturally the range of magnitudes will decrease: half of the combinations are used for positive numbers and the other half for negative numbers. For example, if we consider 8-bit numbers, then the range for natural numbers is from 0 to 255, but for signed integers the range is from -127 to +127 (and in 2's complement form, -128 to +127).
A signed integer can be represented in one of three forms:
• Signed-Magnitude form.
• 1’s complement form.
• 2’s complement form.
In signed-magnitude form, one particular bit is used to indicate the sign of the number,
whether it is a positive number or a negative number. Other bits are used to represent the
magnitude of the number.
For an n-bit number, one bit is used to indicate the sign information and the remaining (n - 1) bits are used to represent the magnitude. Therefore, the range is from -(2^(n-1) - 1) to +(2^(n-1) - 1).
Generally, the Most Significant Bit (MSB) is used to indicate the sign, and it is termed the sign bit: 0 in the sign bit indicates a positive number and 1 in the sign bit indicates a negative number.
Given a number N in base r having n digits, the (r - 1)'s complement of N is defined as (r^n - 1) - N. For decimal numbers, r = 10 and r - 1 = 9, so the 9's complement of N is (10^n - 1) - N. The r's complement is obtained by adding 1 to the (r - 1)'s complement.
e.g., the 10's complement of 5642 is the 9's complement of 5642 plus 1, i.e., 4357 + 1 = 4358;
e.g., the 2's complement of 1010 is the 1's complement of 1010 plus 1, i.e., 0101 + 1 = 0110.
Consider the eight-bit number 01011100; its 1's complement is 10100011. If we perform the addition 01011100 + 10100011, the result is 11111111, which represents -0 in 1's complement. Since the sum of the two numbers is (negative) zero, one number can be treated as the negative of the other. So, 1's complement can be used to represent negative numbers.
Consider again the eight-bit number 01011100; the 2's complement of this number is 10100100. If we perform the following addition:
0 1 0 1 1 1 0 0
1 0 1 0 0 1 0 0
--------------------------------
1 0 0 0 0 0 0 0 0
Since we are considering an eight-bit number, the 9th bit (MSB) of the result cannot be stored. Therefore, the final result is 00000000. Since the addition of the two numbers is 0, one can be treated as the negative of the other. So, 2's complement can be used to represent negative numbers.
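Both complements are easy to compute for an n-bit number. The following Python sketch (the function names are ours) reproduces the 01011100 example:

def ones_complement(x, n=8):
    return x ^ ((1 << n) - 1)            # flip every one of the n bits

def twos_complement(x, n=8):
    return (ones_complement(x, n) + 1) & ((1 << n) - 1)   # 1's complement plus 1

x = 0b01011100
print(bin(ones_complement(x)))           # 0b10100011
print(bin(twos_complement(x)))           # 0b10100100
print((x + twos_complement(x)) & 0xFF)   # 0 -- the ninth bit is discarded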
The three representations for 4-bit numbers are compared below:

Decimal   2's Complement   1's Complement   Signed Magnitude
+7        0111             0111             0111
+6        0110             0110             0110
+5        0101             0101             0101
+4        0100             0100             0100
+3        0011             0011             0011
+2        0010             0010             0010
+1        0001             0001             0001
+0        0000             0000             0000
-0        -----            1111             1000
-1        1111             1110             1001
-2        1110             1101             1010
-3        1101             1100             1011
-4        1100             1011             1100
-5        1011             1010             1101
-6        1010             1001             1110
-7        1001             1000             1111
-8        1000             -----            -----
Therefore, any real number can be converted to the binary number system. There are two schemes to represent real numbers:
• Fixed-point representation
• Floating-point representation
2.5.1 Fixed-point representation
This is known as fixed-point representation, where the position of the point is fixed, and the numbers of bits before and after the point are also predefined. If we use 16 bits before the point and 7 bits after the point, then, in signed-magnitude form, the range is from -(2^16 - 2^-7) to +(2^16 - 2^-7), i.e., -65535.9921875 to +65535.9921875. One bit is required for the sign information, so the total size of the number is 24 bits (1 sign bit + 16 integer bits + 7 fraction bits).
2.5.2 Floating-point representation
In floating-point representation, a number is represented as
mantissa x R^exponent
where R is the radix. Numbers are often normalized, such that the point is placed to the right of the first non-zero digit.
For example, the decimal number 5236 can be written as 0.5236 x 10^4. To store this number in floating-point representation, we store 5236 in the mantissa part and 4 in the exponent part.
The IEEE standard floating-point format comes in two variants:
• Single precision
• Double precision
Single precision (32 bits): the word is divided into the fields S, E and M, where
S: sign bit; 0 denotes + and 1 denotes -
E: 8-bit exponent
M: 23-bit mantissa
Double precision (64 bits): the word is likewise divided into S, E and M, where
S: sign bit; 0 denotes + and 1 denotes -
E: 11-bit exponent
M: 52-bit mantissa
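The S, E and M fields of a single-precision number can be inspected in Python with the standard struct module. The sketch below relies on the standard IEEE 754 detail that the 8-bit exponent is stored in biased (excess-127) form:

import struct

def ieee754_single_fields(x):
    # Reinterpret the 32-bit single-precision pattern of x as an integer.
    bits = struct.unpack('>I', struct.pack('>f', x))[0]
    s = bits >> 31             # 1-bit sign
    e = (bits >> 23) & 0xFF    # 8-bit exponent (biased by 127)
    m = bits & 0x7FFFFF        # 23-bit mantissa (fraction)
    return s, e, m

# -5.75 = -1.4375 x 2^2, so S = 1 and E = 2 + 127 = 129.
print(ieee754_single_fields(-5.75))   # (1, 129, 3670016)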
To represent characters, we use some coding scheme, which is nothing but a mapping function. Some standard coding schemes are ASCII, EBCDIC and UNICODE.
ASCII: American Standard Code for Information Interchange. It uses a 7-bit code. Altogether we have 128 combinations of 7 bits, so we can represent 128 characters. For example, 65 = 1000001 represents the character 'A'.
EBCDIC: Extended Binary Coded Decimal Interchange Code. It uses an 8-bit code, so we can represent 256 characters.
UNICODE: It is used to capture most of the languages of the world. It uses a 16-bit code. Unicode provides a unique number for every character, no matter what the platform, no matter what the program, no matter what the language. The Unicode Standard has been adopted by such industry leaders as Apple, HP, IBM, JustSystem, Microsoft, Oracle, SAP, Sun, Sybase, Unisys and many others.
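These coding schemes are directly visible in Python, whose chr and ord functions map characters to their Unicode code points (which agree with ASCII for the first 128 values):

print(ord('A'), bin(ord('A')))   # 65 0b1000001 -- the 7-bit ASCII code for 'A'
print(chr(65))                   # A
print(ord('\u0905'))             # 2309 -- code point of the Devanagari letter A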
2.7 Summary
1. Decimal Numbers: 10 Symbols {0,1,2,3,4,5,6,7,8,9}, Base or Radix is 10.
2. Binary Numbers: 2 Symbols {0,1}, Base or Radix is 2.
3. The binary number system is positional where each binary digit has a weight based
upon its position relative to the least significant bit (LSB).
4. Octal Numbers: Symbols {0,1,2,3,4,5,6,7}, Base or Radix is 8.
5. Hexadecimal Numbers: 16 Symbols {0,1,2,3,4,5,6,7,8,9,A,B,C,D,E,F}, Base is 16.
6. Many applications have to deal with non-numerical data.
a. Characters and strings
b. There must be a standard mechanism to represent alphanumeric and other
characters in memory.
7. Three standards are in use:
a. Extended Binary Coded Decimal Interchange Code (EBCDIC)
i. Used in older IBM machines
b. American Standard Code for Information Interchange (ASCII)
i. Most widely used today
c. UNICODE
i. Used to represent all international characters.
ii. Used by Java
Answers to Check your progress I
➢ Bit
➢ Byte
➢ Binary
➢ 16
➢ 1011001
BRIEF HISTORY OF COMPUTER EVOLUTION
Learning Objectives
After the completion of this unit, the learner shall be able to:
The instruction set architecture of a machine includes:
• Instruction set
• Data formats
• Principle of Operation (formal description of every operation)
• Features (organization of programmable storage, registers used, interrupts
mechanism, etc.)
In short, it is the combination of Instruction Set Architecture, Machine Organization and
the related hardware.
• ENIAC [1945]: the first general-purpose electronic computer. About 18,000 vacuum tubes and 1,500 relays were used to build ENIAC, and it was programmed by manually setting switches.
• UNIVAC [1950]: the first commercial computer.
• John Von Neumann architecture: Goldstine and Von Neumann took the idea of ENIAC and developed the concept of storing a program in memory. This is known as the Von Neumann architecture and has been the basis for virtually every machine designed since then.
Features:
3.3 Evolution of Instruction Sets
The Instruction Set Architecture (ISA) is the abstract interface between the hardware and the lowest-level software.
o CC-UMA multiprocessor
o CC-NUMA multiprocessor
o Non-CC-NUMA multiprocessor
o Message-passing multiprocessor
o 2000s: Special purpose architecture, functionally reconfigurable, special
considerations for low power/mobile processing, chip multiprocessors,
memory systems
▪ Massive SIMD
▪ Parallel processing multiprocessor
Under a rapidly changing set of forces, computer technology keeps changing dramatically.
Check your progress I
1. How many memory locations can be addressed with the help of an 8-bit address bus?
2. Give example of input and output devices.
Computer organization defines the ways in which these components are interconnected and controlled. It describes the capabilities and performance characteristics of the principal functional units. An architecture can have a number of organizational implementations, and the organization differs between different versions. Thus, all Intel x86 family processors share the same basic architecture, and the IBM System/370 family machines share their basic architecture.
A bus is a parallel circuit that connects the major components of a computer, allowing the transfer of electric impulses from one connected component to any other.
• Serial port - adheres to the RS-232C spec, uses a DB9 or DB25 connector, capable of speeds up to 115 kb/s.
• Parallel port - also known as the printer port; enhanced types are ECP (extended capabilities port) and EPP (enhanced parallel port).
• USB - universal serial bus; two types, 1.0 and 2.0; hot plug-and-play; runs at 12 Mb/s, with up to 127 devices in a chain. The 2.0 data rate is 480 Mb/s.
• Firewire - high-speed serial port, 400 Mb/s, hot plug-and-play, 30 times faster than USB 1.0.
Check your progress II
1. _______ is the first general-purpose microprocessor, with an 8-bit data path, used in the first personal computer.
2. VESA stands for _________________________________.
3. IDE stands for ________________________________.
4. ___________ is a high-speed serial port, 400 Mb/s, hot plug-and-play, 30 times faster than USB 1.0.
3.6 Summary
1. Computer Architecture is the field of study of selecting and interconnecting
hardware components to create computers that satisfy functional performance and
cost goals.
2. Computer Architecture refers to those attributes of the computer system that are
visible to a programmer and have a direct effect on the execution of a program.
3. Computer organization defines the ways in which these components are
interconnected and controlled.
4. A bus is a parallel circuit that connects the major components of a computer, allowing the transfer of electric impulses from one connected component to any other.
3.7 Model Questions
1. Give an example to distinguish computer architecture and computer organization.
2. Define computer architecture.
3. Explain the various units of a stored program computer?
4. What is data bus and address bus?
5. What are the technologies used in the first four generations of the computer?
6. Explain the features of PCI bus.
Answers to Check your progress I
1. The number of memory locations that can be addressed by an 8-bit address bus is 2^8 = 256.
2. Input devices: Keyboard, Mouse, Hard disk, Floppy, etc.
Output devices: Monitor, Printer, Hard disk, Floppy, etc.
Answers to Check your progress II
1. 8080
2. Video Electronics Standard Association
3. Integrated Drive Electronics
4. Firewire
ARITHMETIC LOGIC UNIT
4.1 Introduction
The ALU is responsible for performing operations in the computer. The basic operations are implemented at the hardware level. The ALU supports a collection of two types of operations:
• Arithmetic operations
• Logical operations
Consider an ALU having 4 arithmetic operations and 4 logical operations. To identify any one of the four logical or four arithmetic operations within a group, two control lines are needed; to identify either of the two groups - arithmetic or logical - another control line is needed. So, with the help of three control lines, any one of these eight operations can be identified. The input combinations of these control lines are shown below:
Control line C2 is used to identify the group, i.e., C2 = 0 selects an arithmetic operation and C2 = 1 selects a logical operation. Control lines C0 and C1 are used to identify any one of the four operations within a group. One possible combination is given here. A 3 x 8 decoder is used to decode the instruction. The block diagram of the ALU is shown in figure 2.1.
The ALU has two input registers, named A and B, and one output storage register, named C. It performs the operation
C = A op B
The input data are stored in A and B, and according to the operation specified on the control lines, the ALU performs the operation and puts the result in register C.
For example, if the contents of the control lines are 000, the decoder enables the addition operation: it activates the adder circuit, and the addition operation is performed on the data available in storage registers A and B. After the completion of the operation, the result is stored in register C. We should have hardware implementations for the basic operations. These basic operations can then be used to implement more complicated operations which are not feasible to implement directly in hardware.
Several logic gates exist in digital logic circuits, and these gates can be used to implement the logical operations. Some of the common logic gates are mentioned here.
AND gate: The output is high if both inputs are high. The AND gate and its truth table are shown in Figure 2.2.
OR gate: The output is high if at least one of the inputs is high. The OR gate and its truth table are shown in Figure 2.3.
Figure 2.3: OR gate and its truth table.
EX-OR gate: The output is high if exactly one of the inputs is high. The EX-OR gate and its truth table are given in Figure 2.4.
If we want to construct a circuit which will perform the AND operation on two 4-bit numbers, the implementation of this 4-bit AND operation is shown in Figure 2.5.
Figure 2.5: 4-bit AND operator
A half adder is a combinational circuit that adds two bits, x and y, and produces two output bits:
C: Carry bit
S: Sum bit
The simplified sum-of-products expressions are:
S = x'y + xy' (i.e., S = x XOR y)
C = xy
This circuit cannot handle a carry input, so it is termed a half adder. The circuit diagram and block diagram of the Half Adder are shown in Figure 2.6.
A full adder is a combinational circuit that forms the arithmetic sum of three bits. It
consists of three inputs and two outputs. Two of the input variables, denoted by x and y,
represent the two bits to be added. The third input Z, represents the carry from the previous
lower position. The two outputs are designated by the symbols S for sum and C for carry.
The simplified expressions for S and C are:
S = x XOR y XOR z
C = xy + xz + yz
The circuit diagram and block diagram of a Full Adder are shown in Figure 2.7. n such single-bit full adder blocks are used to make an n-bit adder. To demonstrate the binary addition of four-bit numbers, let us consider a specific example. Consider two binary numbers:
A = 1 0 0 1
B = 0 0 1 1
Their sum is 1 1 0 0, that is, 9 + 3 = 12.
To build the four-bit adder, we have to use 4 full adder blocks, where the carry output of each lower bit is used as the carry input to the next higher bit. The circuit of the 4-bit adder is shown in Figure 2.8.
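The full adder equations and the ripple of the carry through the 4-bit adder can be sketched in Python; the function names are ours, and bits are listed least significant first:

def full_adder(x, y, z):
    # One-bit full adder: sum and carry of the three input bits.
    s = x ^ y ^ z
    c = (x & y) | (x & z) | (y & z)
    return s, c

def ripple_adder_4bit(a_bits, b_bits, carry_in=0):
    result, carry = [], carry_in
    for x, y in zip(a_bits, b_bits):      # the carry out of each stage feeds the next
        s, carry = full_adder(x, y, carry)
        result.append(s)
    return result, carry

# A = 1001 (9) and B = 0011 (3), written LSB first:
print(ripple_adder_4bit([1, 0, 0, 1], [1, 1, 0, 0]))   # ([0, 0, 1, 1], 0) -> 1100 = 12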
The subtraction operation can be implemented with the help of the binary adder circuit, because A - B can be computed as A plus the 2's complement of B.
We know that the 2's complement representation of a number is treated as the negative of the given number. We can get the 2's complement of a given number by complementing each bit and adding 1 to it.
The circuit for subtracting A - B consists of an adder with inverters placed between each data input B and the corresponding input of the full adder. The input carry C0 must be equal to 1 when performing subtraction. The operation thus performed becomes A, plus the 1's complement of B, plus 1. This is equal to A plus the 2's complement of B.
With this principle, a single circuit can be used for both addition and subtraction. The 4-bit adder-subtractor circuit is shown in the figure. It has one mode (M) selection input line, which determines the operation:
If M = 0, the circuit performs A + B;
If M = 1, then A - B = A + (-B) = A + 1's complement of B + 1.
Each B input passes through an exclusive-OR gate whose other input is M. The operation of this gate is: if M = 0, the output equals B (addition); if M = 1, the output equals the complement of B (subtraction).
4.2.4 Multiplication
The paper-and-pencil process consists of looking at successive bits of the multiplier, least significant bit first. If the multiplier bit is a 1, the multiplicand is copied down; otherwise, zeros are copied down. The numbers copied down in successive lines are shifted one position to the left from the previous number. Finally, the numbers are added, and their sum forms the product.
When multiplication is implemented in a digital computer, the process is changed slightly. Instead of providing registers to store and add simultaneously as many binary numbers as there are bits in the multiplier, it is convenient to provide an adder for the summation of only two binary numbers and to successively accumulate the partial products in a register. This reduces the number of registers required. Instead of shifting the multiplicand to the left, the partial product is shifted to the right. When the corresponding bit of the multiplier is 0, there is no need to add all zeros to the partial product.
Next, consider an algorithm to multiply two binary numbers. Suppose that the ALU does not provide the multiplication operation, but it does have the addition operation and the shifting operation. Then we can write a microprogram for the multiplication operation and store the microprogram code in memory. When a multiplication operation is encountered, this microcode is executed to perform the multiplication.
1. The ALU gives the output of the operations and the output is stored in the ________
a. Memory Devices
b. Registers
c. Flags
d. Output Unit
2. ALU is____________?
a. Arithmetic Logic Unit
b. Array Logic Unit
c. Application Logic Unit
d. None of above
3. Total number of inputs in a half adder is __________
a. 2
b. 3
c. 4
d. 1
4. In which operation carry is obtained?
a. Subtraction
b. Addition
c. Multiplication
d. Both addition and subtraction
5. If A and B are the inputs of a half adder, the sum is given by __________
a. A AND B
b. A OR B
c. A XOR B
d. A EX-NOR B
Consider a situation where we do not have the multiplication operation in a primitive computer. Is it possible to perform multiplication? Of course, yes - provided the addition operation is available.
We can perform multiplication with the help of the repeated addition method; for example, if we want to multiply 4 by 5 (4 x 5), we simply add 4 five times to get the result.
If it is possible with the addition operation alone, then why do we need a multiplication operation? Consider a machine which can handle 8-bit numbers; we can then represent numbers from 0 to 255. If we want to multiply 175 x 225, there will be at least 175 addition operations. But if we use the multiplication algorithm that involves shifting and addition, it can be done in 8 steps, because we are using an 8-bit machine.
Again, microprogram execution is slightly slower, because we have to access the code from the microprogram memory, and memory is a slower device than the CPU. It is therefore also possible to implement the multiplication algorithm directly in hardware.
The multiplicand is stored in register B and the multiplier is stored in register Q. The partial product is formed in register A, and the final product is stored in A and Q.
The counter P is initially set to a number equal to the number of bits in the multiplier. The counter is decremented by 1 after forming each partial product. When the content of the counter reaches zero, the product is formed and the process stops.
Initially, the multiplicand is in register B and the multiplier in Q; register A is reset to 0. The sum of A and B forms a partial product, which is transferred to the EA register pair.
Both the partial product and the multiplier are shifted to the right. The least significant bit of A is shifted into the most significant position of Q, and 0 is shifted into E. After the shift, one bit of the partial product is shifted into Q, pushing the multiplier bits one position to the right.
The rightmost flip-flop in register Q, designated Q0, holds the bit of the multiplier which must be inspected next. If the content of this bit is 0, then it is not required to add the multiplicand; only shifting is needed. If the content of this bit is 1, then both addition and shifting are needed.
After each shift, the value of counter P is decremented, and the process continues till the counter value becomes 0. The final result is available in the (EAQ) register combination.
To control the operation, it is required to design the appropriate control logic, as shown in the block diagram. The flow chart of the multiplication operation is given in Figure 2.11.
The working of the multiplication algorithm is shown here with the help of an example, taking the multiplicand B = 11001 (decimal 25).
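The register-level algorithm can be followed in a Python sketch with A, E, Q and the counter P modelled explicitly. B = 11001 (25) is the multiplicand above; the multiplier Q = 00101 (5) is our own choice for illustration, since the original worked table is not reproduced here:

def shift_add_multiply(b, q, n=5):
    a, e = 0, 0                        # partial product register A and carry bit E
    for _ in range(n):                 # counter P: one iteration per multiplier bit
        if q & 1:                      # Q0 = 1: add the multiplicand to A
            a += b
            e = a >> n                 # carry out of the n-bit addition goes to E
            a &= (1 << n) - 1
        # Shift E, A, Q right as one unit; 0 enters E.
        q = (q >> 1) | ((a & 1) << (n - 1))
        a = (a >> 1) | (e << (n - 1))
        e = 0
    return (a << n) | q                # the 2n-bit product sits in A:Q

print(shift_add_multiply(0b11001, 0b00101))   # 125, i.e., 25 x 5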
➢ Multiplication of two binary numbers can be performed using _____ and ______ operations.
2. Which of the following operation is extremely useful in serial transfer of data:
a. Logical microoperation
b. Arithmetic microoperation
c. Shift Microoperation
d. None of the above
4.5 Summary
1. An ALU is a digital circuit used to perform arithmetic and logic operations.
2. Examples of arithmetic operations are addition, subtraction, multiplication, and
division. Examples of logic operations are comparisons of values such as NOT,
AND, and OR.
3. The simplest gate is the NOT gate, which can be built with a single transistor. It takes a single input and produces a single output, which is always the opposite of the input.
4. The OR gate results in a 1 if either the first or the second input is a 1. The OR gate
only results in a 0 if both inputs are 0.
5. The AND gate results in a 1 only if both the first and second input are 1s.
6. The XOR gate, also pronounced X-OR gate, results in a 0 if both the inputs are 0 or
if both are 1. Otherwise, the result is a 1.
4. Demonstrate multiplication of two-binary numbers with the help of an example.
Design an arithmetic circuit to perform this multiplication.
5. What is an overflow in arithmetic operation of signed magnitude data? How is it
detected?
6. Design an adder to add two 4-bit numbers.
7. How can a number be multiplied by 2 using shift operations? Give an example.
Answers to Check your progress I
1. b
2. a
3. a
4. b
5. c
Answers to Check your progress II
1. SHIFT, ADD
2. b
MEMORY
Memory in a computer system is broadly divided into two categories:
• Internal, and
• External
Internal memory is used by the CPU to perform its tasks, while external memory is used to store bulk information, which includes large software and data. Memory is used to store information in digital form. The memory hierarchy is given by:
• Register
• Cache Memory
• Main Memory
• Magnetic Disk
• Removable media (Magnetic tape)
5.1.1 Register
Registers are a part of the Central Processing Unit, so they reside inside the CPU. The information from main memory is brought to the CPU and kept in registers. Due to space and cost constraints, we have only a limited number of registers in a CPU. These are basically the fastest storage devices.
5.1.2 Cache Memory
Cache memory is a storage device placed between the CPU and main memory. These are semiconductor memories, faster than main memory. We cannot have a large volume of cache memory due to its higher cost and some constraints of the CPU, and due to the higher cost we cannot replace the whole main memory with faster memory either. Generally, the most recently used information is kept in the cache memory: it is brought from the main memory and placed in the cache memory. Nowadays, CPUs come with internal cache.
5.1.3 Main Memory
Like cache memory, main memory is also a semiconductor memory, but it is a relatively slower memory. We have to first bring the information (whether it is data or program) to main memory; the CPU can work with the information available in main memory only.
5.1.4 Magnetic Disk
This is a bulk storage device. We have to deal with huge amounts of data in many applications, but we do not have that much semiconductor memory in our computer. On the other hand, semiconductor memories are volatile in nature: they lose their contents once the computer is switched off. For permanent storage, we use the magnetic disk, whose storage capacity is very high.
5.1.5 Removable Media
For different applications, we use different data, and it may not be possible to keep all the information on the magnetic disk. So, whatever data we are not using currently can be kept on removable media. Magnetic tape is one kind of removable medium; a CD is another, and it is an optical device.
Registers, cache memory and main memory are internal memory; magnetic disks and removable media are external memory. Internal memories are semiconductor memories. Semiconductor memories are categorized as volatile memory and non-volatile memory.
RAM: Random Access Memories are volatile in nature. As soon as the computer is switched off, the contents of the memory are lost.
ROM: Read Only Memories are non-volatile in nature. The storage is permanent, but it is read-only memory; we cannot store new information in ROM. Two variations of ROM are:
• PROM: Programmable Read Only Memory; it can be programmed once as per user requirements.
• EPROM: Erasable Programmable Read Only Memory; the contents of the memory can be erased and new data stored into the memory. In this case, the whole of the stored information has to be erased.
The main memory of a computer is semiconductor memory. The main memory unit of a computer basically consists of two kinds of memory: RAM and ROM.
The permanent information is kept in ROM, and the user space is basically in RAM. The smallest unit of information is known as a bit (binary digit), and in one memory cell we can store one bit of information. 8 bits together are termed a byte. The maximum size of the main memory that can be used in any computer is determined by its addressing scheme.
A computer that generates 16-bit addresses is capable of addressing up to 2^16 = 64K memory locations. Similarly, for 32-bit addresses, the total capacity will be 2^32 = 4G memory locations.
In some computers, the smallest addressable unit of information is a memory word, and such machines are called word-addressable. In other computers, an individual address is assigned to each byte of information; such a machine is called a byte-addressable computer. In this kind of computer, one memory word contains one or more memory bytes which can be addressed individually.
In a byte-addressable 32-bit computer, each memory word contains 4 bytes. A possible way of address assignment is shown in figure 3.1; the address of a word is always an integer multiple of 4.
The main memory is usually designed to store and retrieve data in word-length quantities. The word length of a computer is generally defined by the number of bits actually stored or retrieved in one main memory access.
Consider a machine with a 32-bit address bus. If the word size is 32 bits, then the high-order 30 bits specify the address of a word, and any particular byte within that word can be specified by the lower two bits of the address bus.
The data transfer between main memory and the CPU takes place through two CPU registers, the Memory Address Register (MAR) and the Memory Data Register (MDR). If the MAR is k bits long, the memory may contain up to 2^k addressable locations; if the MDR is n bits long, then n bits of data are transferred in one memory cycle.
The transfer of data takes place through the memory bus, which consists of the address bus and the data bus. In the above example, the size of the data bus is n bits and the size of the address bus is k bits.
The bus also includes control lines like Read, Write and Memory Function Complete (MFC) for coordinating the data transfer. In the case of a byte-addressable computer, another control line is added to indicate a byte transfer instead of a whole-word transfer.
The CPU initiates a memory operation by loading the appropriate data, i.e., the address, into MAR. If it is a memory read operation, it sets the Read memory control line to 1. The contents of the memory location are then brought to MDR, and the memory control circuitry indicates this to the CPU by setting MFC to 1.
If the operation is a memory write, the CPU places the data into MDR and sets the Write memory control line to 1. Once the contents of MDR are stored in the specified memory location, the memory control circuitry indicates the end of the operation by setting MFC to 1.
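The two handshakes can be summarized in a toy Python model of the memory interface; the class and flag names mirror the registers described above, and the sketch ignores timing entirely:

class MainMemory:
    def __init__(self, words=16):
        self.cells = [0] * words
        self.MAR = 0      # Memory Address Register
        self.MDR = 0      # Memory Data Register
        self.MFC = 0      # Memory Function Complete

    def read(self, address):
        self.MAR, self.MFC = address, 0
        self.MDR = self.cells[self.MAR]   # memory places the data in MDR ...
        self.MFC = 1                      # ... and signals completion
        return self.MDR

    def write(self, address, data):
        self.MAR, self.MDR, self.MFC = address, data, 0
        self.cells[self.MAR] = self.MDR   # memory stores the MDR contents ...
        self.MFC = 1                      # ... and signals completion

mem = MainMemory()
mem.write(5, 0b1100)
print(mem.read(5), mem.MFC)   # 12 1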
A useful measure of the speed of a memory unit is the time that elapses between the initiation of an operation and its completion (for example, the time between Read and MFC). This is referred to as the Memory Access Time. Another measure is the Memory Cycle Time: the minimum time delay between the initiation of two independent memory operations (for example, two successive memory read operations). The memory cycle time is slightly larger than the memory access time.
1. _____________ memory is a storage device placed in between CPU and main memory.
2. _________________ memories are volatile in nature.
3. _____________ can be programmed once as per user requirements.
4. The smallest unit of information is known as_____.
Figure 3.2: Binary storage cell made up of an SR latch
The binary cell stores one bit of information in its internal latch. The control inputs to the binary cell work as follows:

Select   Read/Write   Operation
0        X            None
1        0            Write
1        1            Read
The storage part is modelled here with an SR latch, but in reality it is an electronic circuit made up of transistors. Memory constructed with the help of transistors is known as semiconductor memory. Semiconductor memories are termed Random Access Memory (RAM), because it is possible to access any memory location at random. Depending on the technology used to construct a RAM, there are two types of RAM:
• Dynamic RAM (DRAM)
• Static RAM (SRAM)
5.2.1.1 Dynamic Ram (DRAM)
A DRAM is made with cells that store data as charge on capacitors. The presence or
absence of charge in a capacitor is interpreted as binary 1 or 0. Because capacitors have a
natural tendency to discharge due to leakage current, dynamic RAMs require periodic charge refreshing to maintain data storage. The term dynamic refers to this tendency of the
stored charge to leak away, even with power continuously applied. A typical DRAM
structure for an individual cell that stores one-bit information is shown in the figure 3.3.
For the write operation, a voltage signal is applied to the bit line B, a high voltage
represents 1 and a low voltage represents 0. A signal is then applied to the address line,
which will turn on the transistor T, allowing a charge to be transferred to the capacitor.
For the read operation, when a signal is applied to the address line, the transistor T turns on
and the charge stored on the capacitor is fed out onto the bit line B and to a sense amplifier.
The sense amplifier compares the capacitor voltage to a reference value and determines if
the cell contains a logic 1 or a logic 0.
The read-out from the cell discharges the capacitor, which must be restored to complete the read operation.
Due to the discharge of the capacitor during the read operation, the read operation of a DRAM is termed a destructive read-out.
5.2.1.2 Static RAM (SRAM)
In an SRAM, binary values are stored using traditional flip-flops constructed with the help of transistors. A static RAM will hold its data as long as power is supplied to it. A typical SRAM cell constructed with transistors is shown in figure 3.4.
Four transistors (T1, T2, T3, T4) are cross connected in an arrangement that produces a
stable logic state. In logic state 1, point A1 is high and point A2 is low; in this state T1 and
T4 are off, and T2 and T3 are on.
In logic state 0, point A1 is low and point A2 is high; in this state T1 and T4 are on, and T2
and T3 are off.
Both states are stable as long as the dc supply voltage is applied. The address line is used
to open or close a switch which is nothing but another transistor. The address line controls
two transistors (T5 and T6).
When a signal is applied to this line, the two transistors are switched on, allowing a read or
write operation.
For a write operation, the desired bit value is applied to line B, and its complement is
applied to line 𝐵̅ . This forces the four transistors (T1, T2, T3, T4) into the proper state.
For a read operation, the bit value is read from line B: when a signal is applied to the
address line, the signal at point A1 becomes available on the bit line B.
5.3 Internal Organization of Memory Chips
Each row of cells constitutes a memory word, and all cells of a row are connected to a
common line referred to as the word line. An address decoder is used to drive the word
line. At a particular instant, one word line is enabled, depending on the address present on
the address bus. The cells in each column are connected by two lines, known as bit lines.
These bit lines are connected to the data input and data output lines through a
Sense/Write circuit. During a Read operation, the Sense/Write circuits sense, or read, the
information stored in the cells selected by a word line and transmit this information to the
output data lines. During a Write operation, the Sense/Write circuits receive information
and store it in the cells of the selected word.
Consider a slightly larger memory unit that has 1K (1024) memory cells.
If it is organized as a 128 x 8 memory chip, then it has 128 memory words of 8 bits each.
So, the size of the data bus is 8 bits and the size of the address bus is 7 bits (2⁷ = 128). The
storage organization of the 128 x 8 memory chip is shown in the figure 3.6.
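The bus sizes quoted above follow directly from the organization. A two-line helper, an illustrative sketch assuming one address line per address bit, confirms them:

import math

def bus_widths(words, bits_per_word):
    # returns (address bus width, data bus width)
    return math.ceil(math.log2(words)), bits_per_word

print(bus_widths(128, 8))     # 128 x 8  organization -> (7, 8)
print(bus_widths(1024, 1))    # 1024 x 1 organization -> (10, 1)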
Figure 3.7: 1024 x 1 Memory chip
If it is organized as a 1024 x 1 memory chip, then it has 1024 memory words of 1 bit each.
Therefore, the size of the data bus is 1 bit and the size of the address bus is 10 bits (2¹⁰ =
1024).
Figure 3.8: Organization of 1K x 1 Memory chip
The commercially available memory chips contain a much larger number of cells. For
example, a memory unit of 1 MB (megabyte) size, organized as 1M x 8, has 2²⁰ memory
locations, and each memory location contains 8 bits of information. The size of the address
bus is 20 bits and the size of the data bus is 8 bits.
Figure 3.9: 1 MB(Mega Byte) Memory Chip
The number of pins of a memory chip depends on the data bus and the address bus of the
memory module. To reduce the number of pins required for the chip, another scheme for
address decoding is used. The cells are organized in the form of a square array. The
address bus is divided into two groups, one for the column address and the other for the
row address. In this case, the high- and low-order 10 bits of the 20-bit address constitute
the row and column addresses of a given cell, respectively. In order to reduce the number
of pins needed for external connections, the row and column addresses are multiplexed on
ten pins.
During a Read or a Write operation, the row address is applied first. In response to a signal
pulse on the Row Address Strobe (RAS) input of the chip, this part of the address is loaded
into the row address latch.
All cells of this particular row are selected. Shortly after the row address is latched, the
column address is applied to the address pins. It is loaded into the column address latch
with the help of the Column Address Strobe (CAS) signal, similar to RAS. The information
in this latch is decoded and the appropriate Sense/Write circuit is selected.
For a Write operation, the information on the input lines is transferred to the selected
circuits.
The 1MB (megabyte) memory chip with 20 address lines is shown in the figure 3.9. The
same memory chip (1MB) with 10 address lines (where row and column addresses are
multiplexed) is shown in Figure 3.10.
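The two-step multiplexed addressing described above can be sketched as follows; the latch names and the helper function are illustrative, not taken from the text:

def split_address(addr20):
    row = (addr20 >> 10) & 0x3FF      # high-order 10 bits: row address
    col = addr20 & 0x3FF              # low-order 10 bits: column address
    return row, col

class MultiplexedChip:
    def __init__(self):
        self.row_latch = None         # loaded on the RAS pulse
        self.col_latch = None         # loaded on the CAS pulse

    def ras(self, bits):              # Row Address Strobe
        self.row_latch = bits         # selects all cells of one row

    def cas(self, bits):              # Column Address Strobe
        self.col_latch = bits         # selects the Sense/Write circuit

chip = MultiplexedChip()
row, col = split_address(0xABCDE)     # a 20-bit address on 10 pins, in two steps
chip.ras(row)                         # row address is applied first
chip.cas(col)
print(hex(chip.row_latch), hex(chip.col_latch))   # 0x2af 0xde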
Now we discuss the design of a memory subsystem using memory chips. Consider
memory chips of capacity 16K x 8. The requirement is to design a memory subsystem of
capacity 64K x 16. Each memory chip has eight lines for the data bus, but the data bus size
of the memory subsystem is 16 bits, so two chips are used side by side to supply 16 bits.
The total requirement is for 64K memory locations, so four such rows of chips are required
to provide the 64K locations. For 64K memory locations, the size of the address bus is 16
bits; on the other hand, for 16K memory locations, the size of the address bus is 14 bits.
Each chip has a control input line called Chip Select (CS). A chip can be enabled to accept
data input or to place the data on the output bus by setting its Chip Select input to 1. The
address bus for the 64K memory is 16 bits wide. The high order two bits of the address are
decoded to obtain the four chip-select control signals. The remaining 14 address bits are
connected to the address lines of all the chips. They are used to access a specific location
inside each chip of the selected row. The R/W̅ inputs of all chips are tied together to
provide a common READ/W̅R̅I̅T̅E̅ control.
Figure 3.12: 64k x 16 Memory chip
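The chip-select decoding for this subsystem can be sketched as follows (an illustrative model; the variable names are ours): the high-order 2 bits of the 16-bit address drive one of four chip-select lines, and the remaining 14 bits go to every chip.

def decode(addr16):
    chip_row = (addr16 >> 14) & 0b11      # selects one of 4 rows of chips
    within_chip = addr16 & 0x3FFF         # 14-bit address inside each chip
    return chip_row, within_chip

chip_select = [0, 0, 0, 0]
row, offset = decode(0xC123)              # high-order bits 11 -> row 3
chip_select[row] = 1                      # enable both chips of that row
print(chip_select, hex(offset))           # [0, 0, 0, 1] 0x123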
5.4 Summary
1. Internal memory is used by CPU to perform task and external memory is used to
store bulk information, which includes large software and data.
2. Memory is used to store the information in digital form.
3. Registers are a part of the Central Processing Unit, so they reside inside the CPU.
4. Cache memory is a storage device placed in between CPU and main memory.
5. Magnetic disk is a bulk storage device.
6. Semiconductor memories are volatile in nature.
7. The storage capacity of magnetic disk is very high.
8. Magnetic tape is one kind of removable medium.
9. Semiconductor memories are categorized as volatile memory and non-volatile
memory.
10. Programmable Read Only Memory can be programmed once as per user
requirements.
11. In Erasable Programmable Read Only Memory, the contents of the memory can be
erased and new data stored in the memory.
12. In Electrically Erasable Programmable Read Only Memory, the contents of a
particular location can be changed without affecting the contents of other locations.
13. The permanent information is kept in ROM and the user space is basically in RAM.
14. The smallest unit of information is known as bit (binary digit).
15. In one memory cell we can store one bit of information.
16. 8 bits together is termed as a byte.
17. The maximum size of main memory that can be used in any computer is
determined by the addressing scheme.
18. The address of a word is always an integer multiple of 4.
19. The word length of a computer is generally defined by the number of bits actually
stored or retrieved in one main memory access.
20. The binary storage cell is the basic building block of a memory unit.
21. The binary storage cell that stores one bit of information can be modelled by an SR
latch with associated gates.
22. A memory cell is capable of storing 1-bit of information.
23. A number of memory cells are organized in the form of a matrix to form the
memory chip.
24. Each row of cells constitutes a memory word, and all cells of a row are connected to
a common line referred to as the word line.
5.5 Model Questions
1. Define memory hierarchy.
2. Define cache memory. Why is it used?
3. Why are removable media used?
4. What are the various types of ROMs? Explain.
5. What is main memory? How can it be classified?
6. What are word-addressable machines?
7. Define the word length of a computer.
8. What is Random Access Memory? Define the different types of RAM.
9. What is a binary storage cell? Explain.
10. Explain the working of SRAM. Explain the working of DRAM.
11. Explain the difference between SRAM and DRAM.
12. Describe the internal organization of memory chips.
13. What is a word line?
14. Explain a 16 x 8 memory organization.
Answers to Check your progress I
1. Cache
2. Semiconductor
3. Programmable Read Only Memory
4. bit
Answers to Check your progress II
1. 4
2. 1-bit
3. binary storage cell
4. 8
CACHE MEMORY
Figure 3.13: Cache memory between CPU and the main memory
Now, if it can be arranged to have the active segments of a program in a fast memory, the
total execution time can be significantly reduced. The CPU is a fast device while memory
is relatively slow, and memory access is the main bottleneck for performance. If a faster
memory device can be inserted between the main memory and the CPU, the efficiency can
be increased. The faster memory that is inserted between the CPU and the main memory is
termed cache memory. To make this arrangement effective, the cache must be considerably
faster than the main memory; typically it is 5 to 10 times faster. This approach is more
economical than using fast memory devices to implement the entire main memory. It is
also feasible due to the locality of reference present in most programs, which reduces the
frequent data transfer between main memory and cache memory. The inclusion of cache
memory between the CPU and main memory is shown in Figure 3.13.
The memory control circuitry is designed to take advantage of the property of locality of
reference. Some assumptions are made while designing the memory control circuitry:
• The CPU does not need to know explicitly about the existence of the cache.
• The CPU simply makes Read and Write requests. The nature of these two
operations is the same whether the cache is present or not.
• The addresses generated by the CPU always refer to locations of main memory.
• The memory access control circuitry determines whether or not the requested
word currently exists in the cache.
When a Read request is received from the CPU, the contents of a block of memory words
containing the location specified are transferred into the cache. When any of the locations
in this block is referenced by the program, its contents are read directly from the cache.
Consider the case where the addressed word is not in the cache and the operation is a read.
First the block of words is brought into the cache, and then the requested word is forwarded
to the CPU. Alternatively, the word can be forwarded to the CPU as soon as it arrives at the
cache, instead of waiting for the whole block to be loaded. This is called load-through, and
it offers some scope to save time.
The cache memory can store a number of such blocks at any given time.
The correspondence between the Main Memory Blocks and those in the cache is specified
by means of a mapping function.
When the cache is full and a memory word is referenced that is not in the cache, a decision
must be made as to which block should be removed from the cache to create space to bring
the new block to the cache that contains the referenced word. Replacement algorithms are
used to make the proper selection of block that must be replaced by the new one.
When a write request is received from the CPU, there are two ways that the system can
proceed. In the first case, the cache location and the main memory location are updated
simultaneously. This is called the store through method or write through method.
The alternative is to update the cache location only. During replacement time, the cache
block will be written back to the main memory. This method is called write back method.
If there has been no write operation on the cache block, it is not required to write the block
back to main memory. This information can be kept with the help of an associated bit,
which is set whenever there is a write operation on the cache block. During replacement,
this bit is checked: if it is set, the cache block is written back to main memory, otherwise
not. This bit is known as the dirty bit. If the bit is dirty (set to one), writing to main
memory is required.
The write through method is simpler, but it results in unnecessary write operations in the
main memory when a given cache word is updated a number of times during its cache
residency period.
During a write operation, if the addressed word is not in the cache, the information is
written directly into the main memory. A write operation normally refers to a location in a
data area, and the property of locality of reference is not as pronounced in accessing data
when a write operation is involved. Therefore, it is not advantageous to bring the data
block into the cache when there is a write operation and the addressed word is not present
in the cache.
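The two write policies and the role of the dirty bit can be contrasted in a short sketch; the class structure is an assumption made for illustration, not a description from the text:

class CacheBlock:
    def __init__(self, data, policy="write_back"):
        self.data = data
        self.dirty = False                  # the dirty bit described above
        self.policy = policy

    def write(self, value, main_memory, addr):
        self.data = value
        if self.policy == "write_through":
            main_memory[addr] = value       # memory updated on every write
        else:
            self.dirty = True               # write-back: defer the update

    def evict(self, main_memory, addr):
        if self.dirty:                      # write back only if modified
            main_memory[addr] = self.data
            self.dirty = False

memory = {0x100: 7}
block = CacheBlock(memory[0x100])           # block cached from address 0x100
block.write(42, memory, 0x100)
print(memory[0x100])                        # 7: write-back has deferred it
block.evict(memory, 0x100)
print(memory[0x100])                        # 42, written back on replacement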
6.3 Mapping Functions
The mapping functions are used to map a particular block of main memory to a particular
block of cache. This mapping function is used to transfer the block from main memory to
cache memory. Three different mapping functions are available:
• Direct mapping
• Associative mapping
• Block-set-associative mapping
All three mapping methods are explained with the help of an example.
Consider a cache of 4096 (4K) words with a block size of 32 words. The cache is therefore
organized as 128 blocks. For 4K words, 12 address bits are required. To select one of the
128 blocks, we need 7 address bits, and to select one word out of 32 words, we need 5
address bits. So the 12 address bits are divided into two groups: the lower 5 bits select a
word within a block, and the higher 7 bits select a block of cache memory.
Let us consider a main memory system consisting of 64K words, so the size of the address
bus is 16 bits. Since the block size of the cache is 32 words, the main memory is also
organized with a block size of 32 words. Therefore, the total number of blocks in main
memory is 2048 (2K x 32 words = 64K words). To identify any one of the 2K blocks, we
need 11 address bits. Out of the 16 address bits of main memory, the lower 5 bits select a
word within a block and the higher 11 bits select a block out of the 2048 blocks.
Number of blocks in cache memory is 128 and number of blocks in main memory is 2048,
so at any instant of time only 128 blocks out of 2048 blocks can reside in cache memory.
Therefore, we need mapping function to put a particular block of main memory into
appropriate block of cache memory.
The simplest way of associating main memory blocks with cache blocks is the direct
mapping technique. In this technique, block k of main memory maps into block k modulo
m of the cache, where m is the total number of blocks in the cache. In this example, the
value of m is 128. In the direct mapping technique, a particular block of main memory can
be transferred only to the particular block of cache derived by the modulo function.
Since more than one main memory block is mapped onto a given cache block position,
contention may arise for that position. This situation may occur even when the cache is not
full. Contention is resolved by allowing the new block to overwrite the currently resident
block. So, the replacement algorithm is trivial.
The main memory address is divided into three fields. The field sizes depend on the
memory capacity and the block size of the cache. In this example, the lower 5 bits of the
address are used to identify a word within a block. The next 7 bits are used to select a block
out of 128 blocks (the capacity of the cache). The remaining 4 bits are used as a TAG to
identify the proper block of main memory that is mapped to the cache.
When a new block is first brought into the cache, the high order 4 bits of the main memory
address are stored in four TAG bits associated with its location in the cache. When the
CPU generates a memory request, the 7-bit block address determines the corresponding
cache block. The TAG field of that block is compared to the TAG field of the address. If
they match, the desired word specified by the low-order 5 bits of the address is in that
block of the cache.
If there is no match, the required word must be accessed from the main memory; that is,
the contents of that block of the cache are replaced by the new block specified by the new
address generated by the CPU, and correspondingly the TAG bits are changed to the
high-order 4 bits of the address. The whole arrangement for the direct mapping technique
is shown in the figure 3.14.
Figure 3.14: Direct-mapping cache
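The address division of this example (4 TAG bits, 7 block bits, 5 word bits) can be sketched as follows; the list of TAGs is an illustrative stand-in for the cache directory:

def direct_map(addr16):
    word = addr16 & 0x1F              # low-order 5 bits: word within block
    block = (addr16 >> 5) & 0x7F      # next 7 bits: cache block (k mod 128)
    tag = (addr16 >> 12) & 0xF        # high-order 4 bits: TAG
    return tag, block, word

cache_tags = [None] * 128             # one TAG entry per cache block

def is_hit(addr16):
    tag, block, _ = direct_map(addr16)
    if cache_tags[block] == tag:
        return True                   # desired word is in the cache
    cache_tags[block] = tag           # miss: new block overwrites resident one
    return False

print(is_hit(0x1234))                 # False: the first access is a miss
print(is_hit(0x1234))                 # True: the block is now resident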
In the associative mapping technique, a main memory block can potentially reside in any
cache block position. In this case, the main memory address is divided into two groups:
the low-order bits identify the location of a word within a block, and the high-order bits
identify the block. In the example here, 11 bits are required to identify a main memory
block when it is resident in the cache; the high-order 11 bits are used as TAG bits and the
low-order 5 bits identify a word within the block. The TAG bits of an address received
from the CPU must be compared to the TAG bits of each block of the cache to see if the
desired block is present.
In associative mapping, any block of main memory can go to any block of the cache, so it
provides complete flexibility, and a proper replacement policy must be used to replace a
block of the cache if the currently accessed block of main memory is not present in the
cache. It might not be practical to use this complete flexibility of the associative mapping
technique due to the searching overhead, because the TAG field of the main memory
address has to be compared with the TAG fields of all the cache blocks. In this example,
there are 128 blocks in the cache and the size of the TAG is 11 bits. The whole
arrangement of the associative mapping technique is shown in the figure 3.15.
This mapping technique is intermediate between the previous two. Blocks of the cache are
grouped into sets, and the mapping allows a block of main memory to reside in any block
of a specific set. Therefore, the flexibility of associative mapping is reduced from full
freedom to a set of specific blocks. This also reduces the searching overhead, because the
search is restricted to the number of sets instead of the number of blocks. Also, the
contention problem of direct mapping is eased by having a few choices for block
replacement. Consider the same cache memory and main memory organization of the
previous example, and organize the cache with 4 blocks in each set. The TAG field of the
associative mapping technique is divided into two groups: one is termed the SET field and
the second one is termed the TAG field. Each set contains 4 blocks, so the total number of
sets is 32. The main memory address is grouped into three parts: the low-order 5 bits
identify a word within a block; since there are 32 sets in total, the next 5 bits identify the
set; and the high-order 6 bits are used as TAG bits.
The 5-bit SET field of the address determines which set of the cache might contain the
desired block. This is similar to the direct mapping technique: direct mapping looks for a
block, whereas block-set-associative mapping looks for a set. The TAG field of the address
must then be compared with the TAGs of the four blocks of that set. If a match occurs,
the block is present in the cache; otherwise, the block containing the addressed word must
be brought into the cache, and it can come only to the corresponding set. Since there are
four blocks in the set, we have to choose appropriately which block is to be replaced if all
the blocks are occupied. Since the search is restricted to four blocks only, the searching
complexity is reduced. The whole arrangement of the block-set-associative mapping
technique is shown in the figure 3.15.
It is clear that if we increase the number of blocks per set, the number of bits in the SET
field is reduced. With the increase of blocks per set, the complexity of the search also
increases. The extreme condition of 128 blocks per set requires no SET bits and
corresponds to the fully associative mapping technique with 11 TAG bits. The other
extreme of one block per set is the direct mapping method.
Figure 3.15: Block-set Associated mapping Cache with 4 blocks/set
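The corresponding split for this block-set-associative example (6 TAG bits, 5 SET bits, 5 word bits) can be sketched in the same style; the nested lists are an illustrative stand-in for 32 sets of 4 TAG entries each:

def set_assoc_map(addr16):
    word = addr16 & 0x1F              # 5 bits: word within block
    set_no = (addr16 >> 5) & 0x1F     # 5 bits: one of 32 sets
    tag = (addr16 >> 10) & 0x3F       # 6 bits: TAG
    return tag, set_no, word

sets = [[None] * 4 for _ in range(32)]    # 4 TAG entries per set

def is_hit(addr16):
    tag, set_no, _ = set_assoc_map(addr16)
    return tag in sets[set_no]        # search is limited to four blocks

tag, set_no, _ = set_assoc_map(0xBEEF)
sets[set_no][0] = tag                 # pretend the block has been loaded
print(is_hit(0xBEEF))                 # True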
Check your progress I
1. The faster memory that is inserted between CPU and Main Memory is termed as ________
memory.
2. The correspondence between the Main Memory Blocks and those in the cache is specified
by means of a __________________.
3. In a ________ mapped cache, each memory address has only one possible location in the
cache where its data might be found; that is, each memory address maps directly to a cache
location.
4. In a _________ cache, an address can be cached in any line of the cache.
5. The __________________ cache is a compromise between a direct-mapped cache and a
fully associative cache.
6.4 Replacement Policy
Since programs usually stay in localized areas for reasonable periods of time, it can be
assumed that there is a high probability that blocks which have been referenced recently
will also be referenced in the near future. Therefore, when a block is to be overwritten, it is
a good decision to overwrite the one that has gone the longest time without being
referenced. This is defined as the least recently used (LRU) block. Keeping track of the
LRU block must be done as computation proceeds.
Consider a specific example of a four-block set. It is required to track the LRU block of
this four-block set, and a 2-bit counter may be used for each block.
When a hit occurs, that is, when a read request is received for a word that is in the cache,
the counter of the block that is referenced is set to 0. All counters whose values were
originally lower than the referenced one are incremented by 1, and all other counters
remain unchanged.
When a miss occurs, that is, when a read request is received for a word and the word is not
present in the cache, we have to bring the block to cache.
There are two possibilities in case of a miss: If the set is not full, the counter associated
with the new block loaded from the main memory is set to 0, and the values of all other
counters are incremented by 1.
If the set is full and a miss occurs, the block with the counter value 3 is removed, and the
new block is put in its place. The counter value is set to zero. The other three block
counters are incremented by 1.
It is easy to verify that the counter values of occupied blocks are always distinct. Also, the
highest counter value indicates the least recently used block.
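These counting rules can be captured in a short sketch. The class layout is ours; only the rules for hits, misses with a free block, and misses with a full set come from the description above:

class LruSet:
    def __init__(self):
        self.blocks = [None] * 4              # block tags; None = empty
        self.count = [0] * 4                  # one 2-bit counter per block

    def access(self, tag):
        if tag in self.blocks:                # hit: referenced counter -> 0,
            i = self.blocks.index(tag)        # originally lower counters bumped
            for j in range(4):
                if self.blocks[j] is not None and self.count[j] < self.count[i]:
                    self.count[j] += 1
            self.count[i] = 0
            return "hit"
        if None in self.blocks:               # miss, set not full
            i = self.blocks.index(None)
        else:                                 # miss, set full: evict the
            i = self.count.index(3)           # block whose counter is 3
        for j in range(4):
            if self.blocks[j] is not None and j != i:
                self.count[j] += 1
        self.blocks[i], self.count[i] = tag, 0
        return "miss"

s = LruSet()
for t in ["A", "B", "C", "D", "A", "E"]:
    s.access(t)
print(s.blocks)    # ['A', 'E', 'C', 'D']: "B", the LRU block, was replaced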
A reasonable alternative rule is to remove the oldest block from a full set when a new
block must be brought in, i.e., first-in first-out (FIFO). With this technique, no updating is
required when a hit occurs. When a miss occurs and the set is not full, the new block is put
into an empty block and the counter values of the occupied blocks are incremented by one.
When a miss occurs and the set is full, the block with the highest counter value is replaced
by the new block, whose counter is set to 0, and the counter values of all other blocks of
that set are incremented by 1. The overhead of this policy is low, since no updating is
required during a hit.
6.4.3 Random replacement policy
In the random replacement policy, the block to be replaced within a set is simply chosen at
random. This policy requires no bookkeeping on hits or misses.
6.5 Summary
1. Cache is very fast and small memory that is placed in between the CPU and the
main memory.
2. Cache memory is used to reduce the average memory access time.
3. There are three basic categories of caches: direct mapped, set associative, and fully
associative.
4. In a direct-mapped cache, each memory address maps to only one possible location
in the cache where that address's data might appear.
5. During replacement time, the cache block will be written back to the main memory.
This method is called write back method.
6. In Direct mapping, a particular block of main memory can be brought to a
particular block of cache memory. So, it is not flexible.
7. In Associative mapping function, any block of Main memory can potentially reside
in any cache block position. This is much more flexible mapping method.
8. In the block-set-associative mapping method, blocks of the cache are grouped into
sets, and the mapping allows a block of main memory to reside in any block of a
specific set. From the flexibility point of view, it is in between the other two methods.
5. What is a mapping function? Explain various types of mapping function for cache
memory.
6. Explain the functioning of Direct Mapping Technique.
7. Explain the functioning of Associative mapping Technique.
8. Explain the functioning of Block-set-associative mapping Technique.
9. Explain the Least Recently Used (LRU) replacement policy.
Answers to Check your progress I
1. Cache
2. mapping function
3. direct
4. fully associative
5. set-associative
MEMORY MANAGEMENT
The main memory of a computer is divided into two parts. One part is reserved for the
operating system; the other part is for user programs. The program currently being
executed by the CPU is loaded into the user part of the memory. The two parts of the main
memory are shown in the figure 3.17. In a uni-programming system, the program currently
being executed is loaded into the user part of the memory.
In a multiprogramming system, the user part of memory is subdivided to accommodate
multiple processes. The task of subdivision is carried out dynamically by the operating
system and is known as memory management.
When memory holds multiple processes, the processor can switch from one process to
another when one process is waiting. But the processor is so much faster than I/O that it
will be common for all the processes in memory to be waiting for I/O; thus, even with
multiprogramming, a processor could be idle most of the time.
To manage processes in the face of this speed mismatch between the processor and I/O
devices, the status of a process at any point in time is tracked and referred to as its state.
There are five defined states of a process, as shown in the figure 3.18. When a process
starts to execute, it is placed in the process queue, in the new state. As resources become
available, the process is placed in the ready queue. At any given time, a process may be in
one of the following five states:
Figure 3.18: Five State process model
1. New : A program is admitted to execute, but not yet ready to execute. The
operating system will initialize the process by moving it to the ready state.
2. Ready : The process is ready to execute and is waiting access to the processor.
3. Running: The process is being executed by the processor. At any given time, only
one process is in the running state.
4. Waiting: The process is suspended from execution, waiting for some system
resource, such as I/O.
5. Exit : The process has terminated and will be destroyed by the operating system.
The processor alternates between executing operating system instructions and executing
user processes. While the operating system is in control, it decides which process in the
queue should be executed next.
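The five-state model can be sketched as a small transition table. The exact set of transitions is our assumption, drawn from the description above (for example, a running process returning to ready when the operating system pre-empts it):

TRANSITIONS = {
    "new":     ["ready"],              # admitted and initialised by the OS
    "ready":   ["running"],            # dispatched to the processor
    "running": ["ready",               # pre-empted by the OS
                "waiting",             # blocked on a resource such as I/O
                "exit"],               # terminated
    "waiting": ["ready"],              # awaited resource became available
    "exit":    [],
}

def move(state, target):
    if target not in TRANSITIONS[state]:
        raise ValueError(f"illegal transition {state} -> {target}")
    return target

state = "new"
for nxt in ["ready", "running", "waiting", "ready", "running", "exit"]:
    state = move(state, nxt)
print(state)                           # exit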
We know that the information of all the processes that are in execution must be placed in
main memory. Since there is a fixed amount of memory, memory management is an
important issue.
The task of subdivision is carried out dynamically by the operating system and is known as
memory management. In uniprogramming system, only one program is in execution. After
completion of one program, another program may start.
In general, most programs involve I/O operations: they must take input from some input
device and place the result on some output device. The partition of main memory for
uni-programming and multiprogramming is shown in figure 3.19.
To utilize the idle time of CPU, we are shifting the paradigm from uniprogram
environment to multiprogram environment.
Since the size of main memory is fixed, it is possible to accommodate only a few processes
in the main memory. If all of them are waiting for I/O operations, then again the CPU
remains idle. To utilize the idle time of the CPU, some of the processes must be offloaded
from the memory and new processes must be brought into that memory space. This is
known as swapping. Swapping proceeds as follows:
1. A process waiting for some I/O to complete must be stored back on disk.
2. A new ready process is swapped into main memory as space becomes available.
3. As a process completes, it is moved out of main memory.
4. If none of the processes in memory are ready:
• Swap out a blocked process to the intermediate queue of blocked processes.
• Swap in a ready process from the ready queue.
But swapping is itself an I/O process, so it also takes time. Instead of leaving the CPU idle,
it is sometimes advantageous to swap in a ready process and start executing it.
The main question is where to place a new process in the main memory. It must be done
in such a way that the memory is utilized properly.
7.2.2 Partitioning
Partitioning is the splitting of memory into sections to allocate to processes, including the
operating system. There are two schemes for partitioning:
• Fixed-size partitions
• Variable-size partitions
In the fixed-size scheme, the memory is partitioned into fixed-size partitions. Although the
partitions are of fixed size, they need not be of equal size. There is a problem of memory
wastage with fixed-size partitions, even with unequal sizes. When a process is brought into
memory, it is placed in the smallest available partition that will hold it. Equal-size and
unequal-size fixed partitions of main memory are shown in Figure 3.20.
Even with the use of unequal-size partitions, there will be wastage of memory. In most
cases, a process will not require exactly as much memory as provided by the partition.
For example, a process that requires 5 MB of memory would be placed in a 6-MB partition
if that is the smallest available partition that fits. In this partition, only 5 MB is used; the
remaining 1 MB cannot be used by any other process, so it is wasted. Like this, in every
partition we may have some unused memory. The unused portion of memory in each
partition is termed a hole.
But this is not the only kind of hole; holes also arise with variable-size partitions. When all
processes are blocked, one process is swapped out and another is brought in. The newly
swapped-in process may be smaller than the swapped-out process, and most likely we will
not get two processes of the same size. So, this creates another hole. If swap-out and
swap-in occur many times, more and more holes will be created, which leads to more
wastage of memory.
There are two simple ways to reduce the problem of memory wastage:
Coalescing: Join adjacent holes into one large hole, so that some process can be
accommodated in the hole.
Compaction: From time to time, go through memory and move all holes into one free
block of memory.
During its execution, a process may be swapped in and out many times. It is obvious that a
process is not likely to be loaded into the same place in main memory each time it is
swapped in. Furthermore, if compaction is used, a process may be shifted while in main
memory.
A process in memory consists of instructions plus data. The instructions will contain
addresses for memory locations of two types:
• Addresses of instructions, used for branching.
• Addresses of data, used for memory references.
These addresses will change each time a process is swapped in. To solve this problem, a
distinction is made between logical addresses and physical addresses.
When the processor executes a process, it automatically converts from logical to physical
addresses by adding the current starting location of the process, called its base address, to
each logical address.
Every time the process is swapped in to main memory, the base address may be different
depending on the allocation of memory to the process.
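A minimal sketch of this logical-to-physical conversion, with an added limit check (the limit check is our assumption; the text mentions only the base address):

def to_physical(logical, base, limit):
    if not 0 <= logical < limit:           # address outside the process
        raise MemoryError("logical address out of range")
    return base + logical                  # relocation by the base address

# The same process loaded at two different base addresses after swapping:
print(hex(to_physical(0x0100, base=0x40000, limit=0x10000)))   # 0x40100
print(hex(to_physical(0x0100, base=0x7A000, limit=0x10000)))   # 0x7a100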
Consider a main memory of 2 MB, of which 512 KB is used by the operating system.
Consider three processes of sizes 425 KB, 368 KB and 470 KB, loaded into the memory.
This leaves a hole at the end of the memory that is too small for a fourth process. At some
point none of the processes in main memory is ready. The operating system swaps out
process-2, which leaves sufficient room for a new process-4 of size 320 KB. Since
process-4 is smaller than process-2, another hole is created. Later a point is reached at
which none of the processes in main memory is ready except the swapped-out process-2,
so process-1 is swapped out and process-2 is swapped in its place. This creates yet another
hole. In this way, many small holes are created in the memory system, which leads to more
memory wastage.
The effect of dynamic partitioning, which creates more holes during the execution of
processes, is shown in the Figure 3.21.
Check your progress I
7.3 Summary
1. Memory management is the functionality of an operating system which handles or
manages primary memory and moves processes back and forth between main
memory and disk during execution.
2. The main memory of a computer is divided into two parts. One part is reserved for
operating system. The other part is for user program.
3. In a multiprogramming system, the user part of memory is subdivided to
accommodate multiple processes.
4. The task of subdivision is carried out dynamically by operating system and is
known as memory management.
5. When memory holds multiple processes, the processor can switch from one process
to another when one process is waiting.
6. The processor alternates between executing operating system instructions and
executing user processes.
7. In a uniprogramming system, main memory is divided into two parts: one part for
the operating system and the other part for the program currently being executed.
7.4 Model Questions
1. What is memory management? Why is it required?
2. What are the five states of a process? Explain with the help of a diagram.
3. What are the various possible reasons for the suspension of a process?
4. What is swapping? Explain.
5. Define partitioning of memory space. How many types of partitioning are possible?
Explain.
6. What are the simple ways to reduce the problem of memory wastage?
Answers to Check your progress I
1. Data
2. Von-Neumann stored program
3. Swapping
4. Segmentation
VIRTUAL MEMORY
8.1 Paging
Both unequal fixed-size and variable-size partitions are inefficient in the use of memory;
both schemes lead to memory wastage. Therefore, we are not using memory efficiently.
In paging, main memory is divided into small fixed-size chunks known as page frames,
and each process is divided into chunks of the same size, known as pages. A page of a
program can be assigned to any available page frame. In this scheme, the wasted space in
memory for a process is only a fraction of a page frame, corresponding to the last page of
the program.
At a given point of time some of the frames in memory are in use and some are free. The
list of free frames is maintained by the operating system.
Process A, stored on disk, consists of six pages. At the time of execution of process A, the
operating system finds six free frames and loads the six pages of process A into those
frames.
These six frames need not be contiguous in main memory. The operating system maintains
a page table for each process. Within the program, each logical address consists of a page
number and a relative address within the page. In the case of simple partitioning, a logical
address is the location of a word relative to the beginning of the program, and the
processor translates it into a physical address.
With paging, a logical address is the location of a word relative to the beginning of a page
of the program, because the whole program is divided into pages of equal length and the
length of a page is the same as the length of a page frame.
Given a logical address consisting of a page number and a relative address within the page,
the processor uses the page table to produce the physical address, which consists of a
frame number and a relative address within the frame.
The Figure 3.22 shows the allocation of frames to a new process in the main memory. A
page table is maintained for each process. This page table helps us to find the physical
address in a frame which corresponds to a logical address within a process.
The conversion of a logical address to a physical address is shown in the figure for
process A.
This approach solves the problems mentioned earlier: main memory is divided into many
small equal-size frames, and each process is divided into frame-size pages. A smaller
process requires fewer pages; a larger process requires more. When a process is brought in,
its pages are loaded into available frames and a page table is set up. The translation of
logical addresses to physical addresses is shown in the Figure 3.23.
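For concreteness, this translation can be sketched as follows; the 10-bit offset (a 1 KB page) and the sample page table are assumed values, not taken from the figure:

PAGE_BITS = 10
PAGE_SIZE = 1 << PAGE_BITS                 # 1 KB pages (an assumption)

page_table = {0: 5, 1: 2, 2: 7}            # page number -> frame number

def translate(logical):
    page = logical >> PAGE_BITS            # high-order bits: page number
    offset = logical & (PAGE_SIZE - 1)     # low-order bits: offset in page
    frame = page_table[page]               # per-process page table lookup
    return (frame << PAGE_BITS) | offset

print(hex(translate(0x0423)))              # page 1, offset 0x23 -> 0x823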
It is not necessary to load the whole process into the main memory, because the execution
may be confined to a small section of the program (e.g., a subroutine).
It would clearly be wasteful to load many pages for a process when only a few pages will
be used before the program is suspended. Instead of loading all the pages of a process,
each page is brought in only when it is needed, i.e., on demand. This scheme is known as
demand paging.
Demand paging also allows us to accommodate more processes in main memory, since we
do not load whole processes; pages are brought into main memory as and when they are
required.
With demand paging, it is not necessary to load an entire process into main memory. This
concept leads us to an important consequence: it is possible for a process to be larger than
the main memory. So, while developing a new program, it is not necessary to worry about
how much main memory is available in the machine, because the process will be divided
into pages and pages will be brought into memory on demand.
Because a process executes only in main memory, the main memory is referred to as real
memory or physical memory.
A programmer or user perceives a much larger memory that is allocated on the disk; this
memory is referred to as virtual memory. The programmer enjoys a huge virtual memory
space in which to develop his or her program or software.
The execution of a program is the job of the operating system and the underlying
hardware. To improve performance, a special hardware unit, known as the Memory
Management Unit (MMU), is added to the system.
In a paging system, we make a page table for each process; the page table helps us to find
the physical address corresponding to a virtual address.
The virtual address space is used to develop a process, and the MMU translates virtual
addresses to physical addresses. When the desired data is in the main memory, the CPU
can work with it. If the data is not in the main memory, the MMU causes the operating
system to bring it into the memory from the disk. A typical virtual memory organization is
shown in the Figure 3.24.
8.3 Address Translation
The basic mechanism for reading a word from memory involves the translation of a virtual
or logical address, consisting of a page number and an offset, into a physical address,
consisting of a frame number and an offset, using a page table. There is one page table for
each process. Each process can occupy a huge amount of virtual memory, but the virtual
memory of a process cannot go beyond a certain limit, which is restricted by the
underlying hardware of the MMU; one such constraint is the size of the virtual address
register.
Pages are relatively small, so the size of the page table increases as the size of the process
increases; the size of the page table can therefore become unacceptably large. To overcome
this problem, most virtual memory schemes store page tables in virtual memory rather
than in real memory.
This means that the page table is subject to paging just as other pages are. When a process
is running, at least a part of its page table must be in main memory, including the page
table entry of the currently executing page. A virtual address translation scheme by using
page table is shown in the Figure 3.25.
Each virtual address generated by the processor is interpreted as a virtual page number
(high-order bits) followed by an offset (low-order bits) that specifies the location of a
particular word within a page. Information about the main memory location of each page
is kept in a page table.
Some processors make use of a two-level scheme to organize large page tables. In this
scheme, there is a page directory in which each entry points to a page table. Thus, if the
length of the page directory is X, and the maximum length of a page table is Y, then a
process can consist of up to X * Y pages. Typically, the maximum length of a page table is
restricted to the size of one page frame.
An alternative approach is the inverted page table, in which the page number portion of a
virtual address is hashed into a table. There is one entry in the hash table and the inverted
page table for each real memory page frame, rather than one per virtual page. Thus, a fixed
portion of real memory is required for the page table, regardless of the number of
processes or virtual pages supported. Because more than one virtual address may map into
the same hash table entry, a chaining technique is used for managing the overflow. The
hashing technique results in chains that are typically short, one or two entries. The inverted
page table structure for address translation is shown in the Figure 3.26.
8.3.2 Translation Lookaside Buffer (TLB)
Every virtual memory reference can cause two physical memory accesses: one to fetch the
appropriate page table entry, and one to fetch the desired data. Thus, a straightforward
virtual memory scheme would have the effect of doubling the memory access time.
To overcome this problem, most virtual memory schemes make use of a special cache for
page table entries, usually called Translation Lookaside Buffer (TLB).
This cache functions in the same way as a memory cache and contains those page table
entries that have been most recently used.
In addition to the information that constitutes a page table entry, the TLB must also include
the virtual address of the entry.
The Figure 3.27 shows a possible organization of a TLB where the associative mapping
technique is used.
Set-associative mapped TLBs are also found in commercial products. An essential
requirement is that the contents of the TLB be coherent with the contents of the page table
in the main memory.
When the operating system changes the contents of the page table it must simultaneously
invalidate the corresponding entries in the TLB. One of the control bits in the TLB is
provided for this purpose.
• Given a virtual address, the MMU looks in the TLB for the referenced page.
• If the page table entry for this page is found in the TLB, the physical address is
obtained immediately.
• If there is a miss in the TLB, then the required entry is obtained from the page table
in the main memory and the TLB is updated.
• When a program generates an access request to a page that is not in the main
memory, a page fault is said to have occurred.
• The whole page must be brought from the disk into the memory before access can
proceed.
• When it detects a page fault, the MMU asks the operating system to intervene by
raising an exception (interrupt).
• Processing of active task is interrupted, and control is transferred to the operating
system.
• The operating system then copies the requested page from the disk into the main
memory and returns control to the interrupted task. Because a long delay occurs
while the page transfer takes place, the operating system may suspend execution of
the task that caused the page fault and begin execution of another task whose pages
are in the main memory.
This lookup-and-fault sequence is sketched below.
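The sketch uses dictionaries standing in for the TLB and the page table (None marks a page that is not in main memory); all names are illustrative, not a description of any particular MMU:

class PageFault(Exception):
    pass

tlb = {}                                   # virtual page -> frame (small, fast)
page_table = {0: 9, 1: None, 2: 4}         # None: page is on disk only

def translate(page):
    if page in tlb:                        # TLB hit: frame number at once
        return tlb[page]
    frame = page_table.get(page)           # TLB miss: consult the page table
    if frame is None:
        raise PageFault(page)              # OS must bring the page from disk
    tlb[page] = frame                      # update the TLB for next time
    return frame

print(translate(2))                        # miss, then loaded into the TLB
print(translate(2))                        # hit
try:
    translate(1)
except PageFault as fault:
    print("page fault on page", fault.args[0])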
Check your progress II
8.4 Summary
1. The idea of virtual memory is to create a virtual address space that doesn't
correspond to actual addresses in RAM.
2. We break virtual memory into chunks called pages; a typical page size is four
kilobytes. We also break RAM into page frames, each the same size as a page,
ready to hold any page of virtual memory.
3. The system also maintains a page table, stored in RAM, which is an array of
entries, one for each page, storing information about the page.
4. Whenever a program requests access to a memory address, the CPU will always
work with this as a virtual memory address, and it will need somehow to find where
the data is actually loaded. The CPU goes through the following process.
a. The CPU breaks the address into the first three bits giving the page number
page, and the last twelve bits giving the offset offs within the page.
b. The CPU looks into the page table at index page to find which page frame f
contains the page.
c. If the page entry says that the page is not in RAM, it initiates a page fault.
This is an exception telling the operating system that it needs to bring a
page into memory. After the operating system's exception handler finishes,
it returns back to the same instruction so the CPU ends up trying the
instruction over again.
d. Otherwise, the CPU loads from the memory address offs within page frame
f.
5. The page table is the primary data structure for holding information about each
page in memory.
6. One of the most important issues in virtual memory is the paging algorithm. For the
success of a virtual memory system, we need an algorithm that minimizes the
number of page faults on typical request sequences while simultaneously requiring
very little computation.
7. With the FIFO algorithm, any page fault results in throwing out the oldest page in
memory to make room for the new page.
8. The LRU algorithm says that the system should always eject the page that was least
recently used. A page that has been used recently, after all, is likely to be used
again in the near future, so we should not eject such a page.
2. What is virtual memory? Explain the need for virtual memory.
3. Explain about LRU page replacement algorithm.
4. What is physical address and logical address? Explain.
5. Explain with the help of a diagram how virtual address can be mapped into physical
address using mapping.
6. Differentiate virtual memory with cache memory.
7. What is page fault? How it is handled?
8. Explain briefly about paging and segmentation concept in memory organization.
9. What is address translation page fault routine, page fault and demand paging?
10. What is TLB?
11. Discuss how paging helps in implementing virtual memory.
12. Explain the virtual memory translation and TLB with necessary diagram.
Answers to Check your progress I
1. A
2. Demand Paging
3. Virtual Memory
4. Physical Address
Answers to Check your progress II
1. A
2. C
3. Least Recently Used
4. A
INSTRUCTION SET AND ADDRESSING
The most common addressing modes are:
• Immediate
• Direct
• Indirect
• Register
• Register Indirect
• Displacement
• Stack
All computer architectures provide more than one of these addressing modes. The question
arises as to how the control unit can determine which addressing mode is being used in a
particular instruction. Several approaches are used. Often, different opcodes will use
different addressing modes. Also, one or more bits in the instruction format can be used as
a mode field. The value of the mode field determines which addressing mode is to be used.
What is the interpretation of effective address? In a system without virtual memory, the
effective address will be either a main memory address or a register. In a virtual memory
system, the effective address is a virtual address or a register. The actual mapping to a
physical address is a function of the paging mechanism and is invisible to the programmer.
The simplest form of addressing is immediate addressing, in which the operand is actually
present in the instruction:
OPERAND = A
This mode can be used to define and use constants or set initial values of variables. The
advantage of immediate addressing is that no memory reference other than the instruction
fetch is required to obtain the operand. The disadvantage is that the size of the number is
restricted to the size of the address field, which, in most instruction sets, is small compared
with the word length.
Figure 4.1: Immediate Addressing Mode
The instruction format for Immediate Addressing Mode is shown in the Figure 4.1.
A very simple form of addressing is direct addressing, in which the address field contains
the effective address of the operand:
EA = A
The fetching of data from the memory location in the case of direct addressing mode is
shown in the Figure 4.2. Here, 'A' indicates the memory address field for the operand.
With direct addressing, the length of the address field is usually less than the word length,
thus limiting the address range. One solution is to have the address field refer to the
address of a word in memory, which in turn contains a full-length address of the operand.
This is known as indirect addressing:
EA = (A)
The exact memory location of the operand in the case of indirect addressing mode is
shown in the Figure 4.3. Here, 'A' indicates the memory address field of the required
operand.
Register addressing is similar to direct addressing. The only difference is that the address
field refers to a register rather than a main memory address:
EA = R
The advantages of register addressing are that only a small address field is needed in the
instruction and no memory reference is required. The disadvantage of register addressing is
that the address space is very limited.
The exact register location of the operand in case of Register Addressing Mode is shown in
the Figure 4.4. Here, 'R' indicates a register where the operand is present.
Figure 4.4: Register Addressing Mode.
Register indirect addressing is similar to indirect addressing, except that the address field
refers to a register instead of a memory location.
It requires only one memory reference and no special calculation.
EA = (R)
Register indirect addressing uses one less memory reference than indirect addressing,
because the first reference is to a register, which holds a memory address; the data is then
obtained from that memory location. In general, register access is much faster than
memory access.
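The modes introduced so far can be summarized in one sketch, using the notation of the text (A is the instruction's address field, R a register number, and (X) the contents of X); the memory and register contents are made-up examples:

memory = {100: 555, 555: 777}
registers = {3: 100}

def operand(mode, A=None, R=None):
    if mode == "immediate":                  # OPERAND = A
        return A
    if mode == "direct":                     # EA = A
        return memory[A]
    if mode == "indirect":                   # EA = (A)
        return memory[memory[A]]
    if mode == "register":                   # EA = R
        return registers[R]
    if mode == "register_indirect":          # EA = (R)
        return memory[registers[R]]
    raise ValueError(mode)

print(operand("immediate", A=100))           # 100
print(operand("direct", A=100))              # 555
print(operand("indirect", A=100))            # 777
print(operand("register", R=3))              # 100
print(operand("register_indirect", R=3))     # 555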
9.1.6 Displacement Addressing
A very powerful mode of addressing combines the capabilities of direct addressing and
register indirect addressing, which is broadly categorized as displacement addressing:
EA = A + (R)
Displacement addressing requires that the instruction have two address fields, at least one
of which is explicit. The value contained in one address field (value = A) is used directly.
The other address field, or an implicit reference based on opcode, refers to a register whose
contents are added to A to produce the effective address. The general format of
Displacement Addressing is shown in the Figure 4.6.
Three of the most common uses of displacement addressing are:
• Relative addressing
• Base-register addressing
• Indexing
9.1.6.1 Relative Addressing
For relative addressing, the implicitly referenced register is the program counter (PC). That
is, the current instruction address is added to the address field to produce the EA. Thus, the
effective address is a displacement relative to the address of the instruction.
9.1.6.2 Base-Register Addressing
The reference register contains a memory address, and the address field contains a
displacement from that address. The register reference may be explicit or implicit. In some
implementations, a single segment/base register is employed and is used implicitly. In
others, the programmer may choose a register to hold the base address of a segment, and
the instruction must reference it explicitly.
9.1.6.3 Indexing
The address field references a main memory address, and the reference register contains a
positive displacement from that address. In this case also the register reference is
sometimes explicit and sometimes implicit.
Index registers are generally used for iterative tasks, so it is typical that there is a need to
increment or decrement the index register after each reference to it. Because this is such a
common operation, some systems will automatically do this as part of the same instruction
cycle.
If certain registers are devoted exclusively to indexing, then auto-indexing can be invoked
implicitly and automatically. If general-purpose registers are used, the auto-index operation
may need to be signaled by a bit in the instruction.
Auto-indexing using increment can be depicted as follows:
EA = A + (R)
R = (R) + 1
Auto-indexing using decrement can be depicted as follows:
EA = A + (R)
R = (R) - 1
In some machines, both indirect addressing and indexing are provided, and it is possible to
employ both in the same instruction. There are two possibilities: the indexing is performed
either before or after the indirection.
With postindexing, the indirection comes first:
EA = (A) + (R)
First, the contents of the address field are used to access a memory location containing an
address. This address is then indexed by the register value.
With preindexing, the indexing comes first:
EA = ( A + (R) )
An address is calculated; the calculated address contains not the operand, but the address
of the operand.
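A companion sketch for displacement addressing and its indexed variants; each function returns the effective address, and all memory and register contents are made-up examples:

memory = {110: 40, 150: 9}
registers = {1: 10}

def displacement(A, R):                  # EA = A + (R)
    return A + registers[R]

def preindex(A, R):                      # EA = (A + (R)): index, then indirect
    return memory[A + registers[R]]

def postindex(A, R):                     # EA = (A) + (R): indirect, then index
    return memory[A] + registers[R]

def auto_increment(A, R):                # EA = A + (R), then R = (R) + 1
    ea = A + registers[R]
    registers[R] += 1                    # step the index for the next access
    return ea

print(displacement(100, 1))              # EA = 100 + 10 = 110
print(preindex(100, 1))                  # EA = (110) = 40
print(postindex(110, 1))                 # EA = (110) + 10 = 50
print(auto_increment(100, 1), registers[1])   # EA = 110; R1 becomes 11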
Associated with the stack is a pointer whose value is the address of the top of the stack.
The stack pointer is maintained in a register. Thus, references to stack locations in memory
are in fact register indirect addresses.
The stack mode of addressing is a form of implied addressing. The machine instructions
need not include a memory reference but implicitly operate on the top of the stack.
Each instruction must contain the information required by the CPU for execution. The
elements of an instruction are as follows:
a. Operation code: Specifies the operation to be performed (e.g., add, move, etc.).
The operation is specified by a binary code, known as the operation code or opcode.
b. Source operand reference: The operation may involve one or more source
operands; that is, operands that are inputs for the operation.
c. Result operand reference: The operation may produce a result.
d. Next instruction reference: This tells the CPU where to fetch the next instruction
after the execution of this instruction is complete.
The next instruction to be fetched is located in main memory, or, in the case of a virtual
memory system, in either main memory or secondary memory (disk). In most cases, the
next instruction to be fetched immediately follows the current instruction, and there is then
no explicit reference to it. When an explicit reference is needed, the main memory or
virtual memory address must be given.
The instruction format is highly machine specific, and it mainly depends on the machine
architecture. A simple example of an instruction format is shown in the Figure 4.8. It is
assumed that it is a 16-bit CPU: 4 bits are used for the operation code, so we may have up
to 16 (2⁴ = 16) different instructions. With each instruction there are two operands, and 6
bits are used to specify each operand, so it is possible to designate 64 (2⁶ = 64) different
locations for each operand reference.
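The field layout of this assumed 16-bit format can be packed and unpacked with shifts and masks; the function names are illustrative:

def encode(opcode, op1, op2):
    assert 0 <= opcode < 16 and 0 <= op1 < 64 and 0 <= op2 < 64
    return (opcode << 12) | (op1 << 6) | op2    # 4 + 6 + 6 = 16 bits

def decode(word):
    return (word >> 12) & 0xF, (word >> 6) & 0x3F, word & 0x3F

word = encode(0b0011, 17, 42)
print(f"{word:016b}")         # 0011010001101010 (opcode | operand1 | operand2)
print(decode(word))           # (3, 17, 42)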
Opcodes are represented by abbreviations, called mnemonics, that indicate the operations.
Common examples include:
ADD     Add
SUB     Subtract
MULT    Multiply
DIV     Divide
LOAD    Load data from memory to CPU
STORE   Store data to memory from CPU
MULT R, X ; R ← R * X
may mean: multiply the value contained in data location X by the contents of register R
and put the result in register R. In this example, X refers to the address of a location in
memory and R refers to a particular register.
Thus, it is possible to write a machine language program in symbolic form. Each symbolic
opcode has a fixed binary representation, and the programmer specifies the location of
each symbolic operand.
Memory instructions are used for moving data between memory and CPU registers.
I/O instructions are needed to transfer program and data into memory from storage device
or input device and the results of computation back to the user.
9.4.4 Control:
Test and branch instructions: Test instructions are used to test the value of a data word
or the status of a computation. Branch instructions are then used to branch to a different set
of instructions depending on the decision made.
9.5 Number of Addresses
What is the maximum number of addresses one might need in an instruction? Most of the
arithmetic and logic operations are either unary (one operand) or binary (two operands).
Thus, we need a maximum of two addresses to reference operands. The result of an
operation must be stored, suggesting a third address. Finally, after completion of an
instruction, the next instruction must be fetched, and its address is needed.
This reasoning suggests that an instruction might need to contain four address references:
two operands, one result, and the address of the next instruction. In practice, four-address
instructions are rare. Most instructions have one, two, or three operand addresses, with the
address of the next instruction being implicit (obtained from the program counter).
The fundamental issues in instruction set design include:
Operation repertoire : How many and which operations to provide, and how
complex the operations should be.
Data types : The various types of data upon which operations are
performed.
Instruction format : Instruction length (in bits), number of addresses, size of
various fields, and so on.
Registers : Number of CPU registers that can be referenced by
instructions, and their use.
Addressing : The mode or modes by which the address of an operand
is specified.
9.7 Types of Operands
Machine instructions operate on data. Data can be categorized as follows:
Addresses: An address basically indicates a memory location. Addresses are nothing but
unsigned integers, but they are treated in a special way to indicate the address of a memory
location. Address arithmetic is somewhat different from normal arithmetic, and it is related
to the machine architecture.
Numbers: All machine languages include numeric data types. Numeric data are classified
into two broad categories: integer or fixed point and floating point.
Characters: A common form of data is text or character strings. Since computers work
with bits, characters are represented by sequences of bits. The most commonly used
coding scheme is the ASCII (American Standard Code for Information Interchange) code.
Logical Data: Normally each word or other addressable unit (byte, halfword, and so on) is
treated as a single unit of data. It is sometimes useful to consider an n-bit unit as consisting
of n 1-bit items of data, each item having the value 0 or 1. When data are viewed this way,
they are considered to be logical data. Generally, 1 is treated as true and 0 is treated as
false.
9.8 Types of Operations
The types of operations found in most machine instruction sets can be categorized as:
• Data Transfer
• Arithmetic
• Logical
• Conversion
• Input Output [ I/O ]
• System Control
• Transfer of Control
9.8.1 Data Transfer
The most fundamental type of machine instruction is the data transfer instruction. The data
transfer instruction must specify several things. First, the location of the source and
destination operands must be specified. Each location could be memory, a register, or the
top of the stack. Second, the length of data to be transferred must be indicated. Third, as
with all instructions with operands, the mode of addressing for each operand must be
specified. The CPU has to perform several tasks to accomplish a data transfer operation. If
both source and destination are registers, then the CPU simply causes data to be transferred
from one register to another; this is an operation internal to the CPU. If one or both
operands are in memory, then the CPU must perform some or all of the following actions:
• Calculate the memory address, based on the addressing mode.
• If the address refers to virtual memory, translate from the virtual to the real memory address.
• Determine whether the addressed item is in cache.
• If not, issue a command to the memory module.
9.8.2 Arithmetic
Most machines provide the basic arithmetic operations like add, subtract, multiply, divide
etc. These are invariably provided for signed integer (fixed-point) numbers. They are also
available for floating-point numbers. The execution of an arithmetic operation may involve
data transfer operations to provide the operands to the ALU inputs and to deliver the result
of the ALU operation.
9.8.3 Logical
Most machines also provide a variety of operations for manipulating individual bits of a
word or other addressable units. Most commonly available logical operations are:
Test : Test specified condition; set flag(s) based on outcome
Compare : Make logical or arithmetic comparison; set flag(s) based on outcome
Set Control Variables : Class of instructions to set controls for protection purposes,
interrupt handling, timer control, etc.
Shift : Left (right) shift operand, introducing constants at end
Rotate : Left (right) shift operand, with wraparound end
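The difference between shift and rotate can be seen in a few lines of Python; this is a
small sketch for an assumed 8-bit operand, not a description of any particular machine's
instructions.

    MASK = 0xFF  # work on 8-bit values

    def shift_left(x, n):
        # logical shift: bits fall off the top, zeros enter at the bottom
        return (x << n) & MASK

    def rotate_left(x, n, width=8):
        # rotate: bits shifted out of the top re-enter at the bottom
        n %= width
        return ((x << n) | (x >> (width - n))) & MASK

    x = 0b10110001
    print(f"{shift_left(x, 2):08b}")    # 11000100 -- top two bits are lost
    print(f"{rotate_left(x, 2):08b}")   # 11000110 -- top two bits wrap around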
9.8.4 Conversion
Conversion instructions are those that change the format or operate on the format of data.
An example is converting from decimal to binary.
9.8.5 Input/Output
Input/Output instructions are used to transfer data between input/output devices and
memory/CPU registers. Typical operations include starting an I/O transfer, testing the
status of an I/O device, and transferring a unit of data in or out.
9.8.6 System Control
System control instructions are those which are used for system setting and it can be used
only in privileged state. Typically, these instructions are reserved for the use of operating
systems. For example, a system control instruction may read or alter the content of a
control register. Another instruction may be to read or modify a storage protection key.
9.8.7 Transfer of Control
In most of the cases, the next instruction to be performed is the one that immediately
follows the current instruction in memory. Therefore, program counter helps us to get the
next instruction. But sometimes it is required to change the sequence of instruction
execution and for that instruction set should provide instructions to accomplish these tasks.
For these instructions, the operation performed by the CPU is to update the program
counter to contain the address of some instruction in memory. The most common transfer-
of-control operations found in instruction sets are: branch, skip and procedure call.
A branch instruction, also called a jump instruction, has one of its operands as the address
of the next instruction to be executed. Basically, there are two types of branch instructions:
conditional branch instructions and unconditional branch instructions. In the case of an
unconditional branch instruction, the branch is made by updating the program counter to
the address specified in the operand. In the case of a conditional branch instruction, the
branch is made only if a certain condition is met. Otherwise, the next instruction in
sequence is executed.
There are two common ways of generating the condition to be tested in a conditional
branch instruction. First, most machines provide a 1-bit or multiple-bit condition code that
is set as the result of some operations. As an example, an arithmetic operation could set a
2-bit condition code with one of the following four values: zero, positive, negative and
overflow. On such a machine, there could be four different conditional branch instructions:
BRP X : Branch to location X if result is positive
BRN X : Branch to location X if result is negative
BRZ X : Branch to location X if result is zero
BRO X : Branch to location X if overflow occurs
In all of these cases, the result referred to is the result of the most recent operation that set
the condition code.
Another approach that can be used with three address instruction formats is to perform a
comparison and specify a branch in the same instruction.
For example,
BRE R1, R2, X ; Branch to X if contents of R1 = Contents of R2.
9.8.9 Skip Instruction
Another common transfer-of-control operation is the skip instruction, which implies that
one instruction is to be skipped. A typical example is the increment-and-skip-if-zero (ISZ)
instruction:
ISZ R1
This instruction will increment the value of the register R1. If the result of the increment is
zero, then it will skip the next instruction.
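A minimal sketch of the effect of ISZ on the flow of control, assuming an 8-bit register
that wraps around on increment and one-word instructions:

    def isz(registers, r, pc):
        """Increment register r; if the result is zero, skip the next instruction."""
        registers[r] = (registers[r] + 1) & 0xFF   # 8-bit wraparound assumed
        return pc + 2 if registers[r] == 0 else pc + 1

    regs = {"R1": 0xFF}
    print(isz(regs, "R1", pc=100))   # 102: result wrapped to zero, next instruction skipped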
A procedure call can be made from a variety of points, so the CPU must somehow save the
return address so that the return can take place appropriately. There are three common
places for storing the return address:
• Register
• Start of procedure
• Top of stack
Consider a machine language instruction CALL X, which stands for call procedure at
location X. If the register approach is used, CALL X causes the following actions:
RN ← PC + IL
PC ← X
where RN is a register that is always used for this purpose, PC is the program counter and
IL is the instruction length. The called procedure can now save the contents of RN to be
used for the later return.
A second possibility is to store the return address at the start of the procedure. In this
case, CALL X causes
X ← PC + IL
PC ← X + 1
Both of these approaches have been used. The only limitation of these approaches is that
they prevent the use of reentrant procedures. A reentrant procedure is one in which it is
possible to have several calls open to it at the same time.
A more general approach is to use stack. When the CPU executes a call, it places the return
address on the stack. When it executes a return, it uses the address on the stack.
It may happen that the called procedure has to use the processor registers. This would
overwrite the contents of the registers and the calling environment would lose the
information. So, it is necessary to preserve the contents of the processor registers along
with the return address. The stack is used to store the contents of the processor registers.
On return from the procedure call, the contents of the stack are popped into the
appropriate registers.
In addition to providing a return address, it is often necessary to pass parameters with a
procedure call. The most general approach to parameter passing is the stack. When the
processor executes a call, it not only stacks the return address, it stacks the parameters to
be passed to the called procedure. The called procedure can access the parameters from
the stack. Upon return, return parameters can also be placed on the stack. The entire set of
parameters, including the return address, that is stored for a procedure invocation is
referred to as a stack frame.
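The following toy sketch illustrates the stack discipline described above. The procedure
addresses and parameter values are made up, and one-word instructions (IL = 1) are
assumed; in a real calling convention the callee would also pop its parameters.

    stack = []

    def call(pc, target, params):
        for p in params:          # stack the parameters for the callee
            stack.append(p)
        stack.append(pc + 1)      # return address = address of next instruction (IL = 1)
        return target             # new PC: first instruction of the procedure

    def ret():
        return stack.pop()        # resume at the saved return address

    pc = call(pc=40, target=300, params=[7, 9])   # enter the procedure at 300
    # ... inside the procedure, the parameters sit just below the return address ...
    pc = ret()
    print(pc)                                     # 41: back to the caller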
9.9 Instruction Formats
Figure 4.9: Four common Instruction formats
On some machines, all instructions have the same length; on others there may be many
different lengths. Instructions may be shorter than, the same length as, or longer than the
word length. Having all instructions be the same length is simpler and makes decoding
easier, but it often wastes space, since all instructions then have to be as long as the
longest one. Possible relationships between instruction length and word length are shown
in the Figure 4.10.
Figure 4.10: Some Possible relationship between instructions and word length
Generally, there is a correlation between memory transfer length and instruction length.
Either the instruction length should be equal to the memory transfer length or one should
be a multiple of the other. Also, in most cases there is a correlation between memory
transfer length and the word length of the machine.
For a given instruction length, there is clearly a trade-off between the number of opcodes
and the power of the addressing capabilities. More opcodes obviously mean more bits in
the opcode field. For an instruction format of a given length, this reduces the number of
bits available for addressing. The following interrelated factors go into determining the use
of the addressing bits:
Number of operands: Typical instructions on today's machines provide for two operands.
Each operand address in the instruction might require its own mode indicator, or the use of
a mode indicator could be limited to just one of the address fields.
Register versus memory: A machine must have registers so that data can be brought into
the CPU for processing. With a single user-visible register (usually called the
accumulator), one operand address is implicit and consumes no instruction bits. Even with
multiple registers, only a few bits are needed to specify the register. The more that registers
can be used for operand references, the fewer bits are needed.
Number of register sets: A number of machines have one set of general-purpose registers,
with typically 8 or 16 registers in the set. These registers can be used to store data and can
be used to store addresses for displacement addressing. The trend recently has been away
from one bank of general-purpose registers and toward a collection of two or more
specialized sets (such as data and displacement).
Address range: For addresses that reference memory, the range of addresses that can be
referenced is related to the number of address bits. With displacement addressing, the
range is opened up to the length of the address register.
Address granularity: In a system with 16- or 32-bit words, an address can reference a
word or a byte at the designer's choice. Byte addressing is convenient for character
manipulation but requires, for a fixed size memory, more address bits.
Most of the arithmetic and logic operations are either unary (one source operand, e.g.
NOT) or binary (two source operands, e.g. ADD). Thus, we need a maximum of two
addresses to reference source operands. The result of an operation must be stored,
suggesting a third reference. Three-address instruction formats are not common because
they require a relatively long instruction format to hold the three address references. With
two-address instructions, and for binary operations, one address must do double duty as
both an operand and a result. In a one-address instruction format, a second address must
be implicit for a binary operation. For the implicit reference, a processor register is used,
and it is termed the accumulator (AC). The accumulator contains one of the operands and
is used to store the result. Consider a simple arithmetic expression to evaluate:
Y = (A + B) / (C * D)
Figure 4.13: One address instruction
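As an illustration, the following Python sketch simulates how a one-address (accumulator)
machine might evaluate Y = (A + B) / (C * D). The instruction sequence and memory cell
values are assumed for the example; note that Y also serves as a temporary location, and
integer division is used for simplicity.

    mem = {"A": 12, "B": 8, "C": 5, "D": 2, "Y": 0}
    AC = 0   # the accumulator: implicit second operand and result location

    program = [("LOAD", "D"), ("MULT", "C"), ("STORE", "Y"),
               ("LOAD", "A"), ("ADD", "B"), ("DIV", "Y"), ("STORE", "Y")]

    for op, x in program:
        if op == "LOAD":    AC = mem[x]        # AC <- [x]
        elif op == "STORE": mem[x] = AC        # [x] <- AC
        elif op == "ADD":   AC += mem[x]
        elif op == "MULT":  AC *= mem[x]
        elif op == "DIV":   AC //= mem[x]

    print(mem["Y"])   # (12 + 8) // (5 * 2) = 2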
Check your progress II
1. __________________ instructions are responsible for moving data around inside the
processor as well as bringing in data or sending data out.
2. ________ control instructions are those which are used for system setting and it can be
used only in privileged state.
3. __________ instructions are those that change the format or operate on the format of data.
9.10 Summary
1. Addressing modes are an aspect of the instruction set architecture in most central
processing unit (CPU) designs.
2. An addressing mode specifies how to calculate the effective memory address of an
operand by using information held in registers and/or constants contained within a
machine instruction or elsewhere.
3. The term addressing modes refers to the way in which the operand of an instruction
is specified.
4. Arithmetic instructions perform several basic operations such as addition,
subtraction, division, multiplication etc.
5. There are two kinds of branch instructions. Unconditional jump instructions: upon
execution, a jump to a new location is made, and the program continues execution
from there. Conditional jump instructions: a jump to a new program location is
executed only if a specified condition is met; otherwise, the program proceeds
normally with the next instruction.
6. Data transfer instructions move the content of one register to another.
7. Logic instructions perform logic operations upon corresponding bits of two
registers.
8. Similar to logic instructions, bit-oriented instructions perform logic operations. The
difference is that these are performed upon single bits.
9.11 Model Questions
11. Data can be categorized into how many forms?
12. What are the typical elements of a machine instruction?
13. What are the different categories of instructions?
14. Why are transfer of control instructions needed?
15. If an instruction contains four addresses, what might be the purpose of each
address?
16. List and explain the important design issues for instruction set design.
17. What are the different types of operands that may be present in an instruction?
18. Briefly explain the following addressing modes: immediate addressing, direct
addressing, indirect addressing, displacement addressing and relative addressing.
19. What is indexed addressing and what is the advantage of auto indexing?
20. What are the advantages and disadvantages of using a variable-length instruction
format?
Answers to Check your progress I
1. Immediate
2. Register
3. Register Indirect
4. Direct
5. Indirect
6. Displacement
Answers to Check your progress II
1. Data transfer
2. System
3. Conversion
CPU DESIGN
To understand the working of the CPU, consider the tasks that it must perform:
Fetch instruction: The CPU reads an instruction from memory.
Interpret instruction: The instruction is decoded to determine what action is required.
Fetch data: The execution of an instruction may require reading data from memory or an
I/O module.
Process data: The execution of an instruction may require performing some arithmetic or
logical operation on data.
Write data: The result of an execution may require writing data to memory or an I/O
module.
To do these tasks, it should be clear that the CPU needs to store some data temporarily. It
must remember the location of the last instruction so that it can know where to get the next
instruction. It needs to store instructions and data temporarily while an instruction is being
executed. In other words, the CPU needs a small internal memory. These storage locations
are generally referred to as registers. The major components of the CPU are an arithmetic and
logic unit (ALU) and a control unit (CU). The ALU does the actual computation or
processing of data. The CU controls the movement of data and instruction into and out of
the CPU and controls the operation of the ALU.
The CPU is connected to the rest of the system through the system bus. Through the
system bus, data or information gets transferred between the CPU and the other
components of the system. The system bus may have three components:
Data Bus: Data bus is used to transfer the data between main memory and CPU.
Address Bus: Address bus is used to access a particular memory location by putting the
address of the memory location.
Control Bus: Control bus is used to carry the different control signals generated by the
CPU to different parts of the system. For example, memory read is a signal generated by
the CPU to indicate that a memory read operation has to be performed. Through the
control bus this signal is transferred to the memory module to indicate the required
operation.
There are three basic components of CPU: register bank, ALU and Control Unit. There are
several data movements between these units and for that an internal CPU bus is used.
Internal CPU bus is needed to transfer data between the various registers and the ALU. The
internal organization of CPU in more abstract level is shown in the Figure 5.1 and Figure
5.2.
Figure 5.2: Internal Structure of the CPU
Control and status registers: These are used by the control unit to control the operation
of the CPU. Operating system programs may also use these, in privileged mode, to control
the execution of programs.
User-visible registers enable the programmer to minimize main memory references by
optimizing the use of registers. They can be categorized as:
• General Purpose registers
• Data registers
• Address registers
• Condition Codes
General-purpose registers can be assigned to a variety of functions by the programmer. In
some cases, general- purpose registers can be used for addressing functions (e.g., register
indirect, displacement). In other cases, there is a partial or clean separation between data
registers and address registers.
Data registers may be used to hold only data and cannot be employed in the calculation of
an operand address.
Condition Codes (also referred to as flags) are bits set by the CPU hardware as the result of
operations. For example, an arithmetic operation may produce a positive, negative, zero
or overflow result. In addition to the result itself being stored in a register or memory, a
condition code is also set. The code may subsequently be tested as part of a conditional
branch operation. Condition code bits are collected into one or more registers.
Four registers are essential to instruction execution:
• Program Counter (PC): Contains the address of an instruction to be fetched.
Typically, the PC is updated by the CPU after each instruction fetched so that it
always points to the next instruction to be executed. A branch or skip instruction
will also modify the contents of the PC.
• Instruction Register (IR): Contains the instruction most recently fetched. The
fetched instruction is loaded into an IR, where the opcode and operand specifiers
are analyzed.
• Memory Address Register (MAR): Contains the address of a location in main
memory from where information has to be fetched or where information has to be
stored. The MAR is directly connected to the address bus.
• Memory Buffer Register (MBR): Contains a word of data to be written to memory
or the word most recently read. The MBR is directly connected to the data bus. It is
also known as the Memory Data Register (MDR).
Apart from these specific registers, we may have some temporary registers which are not
visible to the user. As such, there may be temporary buffering registers at the boundary to
the ALU; these registers serve as input and output registers for the ALU and exchange data
with the MBR and user visible registers.
All CPU designs include a register or set of registers, often known as the processor status
word (PSW), that contains status information. The PSW typically contains condition codes
plus other status information. Common fields or flags include the following:
• Sign: Contains the sign bit of the result of the last arithmetic operation.
• Zero: Set when the result is zero.
• Carry: Set if an operation resulted in a carry (addition) into or borrow (subtraction)
out of a high order bit.
• Equal: Set if a logical compare result is equal.
• Overflow: Used to indicate arithmetic overflow.
• Interrupt enable/disable: Used to enable or disable interrupts.
• Supervisor: Indicate whether the CPU is executing in supervisor or user mode.
Certain privileged instructions can be executed only in supervisor mode, and
certain areas of memory can be accessed only in supervisor mode.
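As a small illustration, the following Python sketch shows how some of these flags might
be set after an 8-bit addition; the encoding is assumed for the example.

    def add8(a, b):
        raw = a + b
        result = raw & 0xFF
        flags = {
            "Sign":     bool(result & 0x80),    # most significant bit of the result
            "Zero":     result == 0,
            "Carry":    raw > 0xFF,             # carry out of the high-order bit
            # signed overflow: both operands differ in sign from the result
            "Overflow": ((a ^ result) & (b ^ result) & 0x80) != 0,
        }
        return result, flags

    print(add8(0x7F, 0x01))   # result 0x80: Sign and Overflow set, Carry clear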
Apart from these, a number of other registers related to status and control might be found
in a particular CPU design. In addition to the PSW, there may be a pointer to a block of
memory containing additional status information (e.g. process control blocks).
The CPU keeps track of the address of the memory location where the next instruction is
located through the use of a dedicated CPU register, referred to as the program counter
(PC). After fetching an instruction, the contents of the PC are updated to point at the next
instruction in sequence. For simplicity, let us assume that each instruction occupies one
memory word. Therefore, execution of one instruction requires the following three steps to
be performed by the CPU:
1. Fetch the contents of the memory location pointed at by the PC. The contents of
this location are interpreted as an instruction to be executed. Hence, they are stored
in the instruction register (IR). Symbolically this can be written as:
IR ← [ [PC] ]
2. Increment the contents of the PC by 1:
PC ← [PC] + 1
3. Carry out the actions specified by the instruction stored in the IR.
The first two steps are usually referred to as the fetch phase, and step 3 is known as the
execution phase. The fetch phase basically involves reading the next instruction from
memory into the CPU and, along with that, updating the contents of the program counter.
In the execution phase, the CPU interprets the opcode and performs the indicated
operation. The instruction fetch and execution phases together are known as the
instruction cycle. The basic instruction cycle is shown in the Figure 5.3.
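A bare-bones sketch of this cycle in Python, assuming one instruction per memory word
and a tiny made-up instruction set:

    memory = {0: ("LOAD", 100), 1: ("ADD", 101), 2: ("HALT", None),
              100: 6, 101: 7}
    PC, AC = 0, 0

    while True:
        IR = memory[PC]          # fetch phase: IR <- [[PC]]
        PC = PC + 1              # ...and update the PC to the next instruction
        op, addr = IR            # execution phase: interpret the opcode and act
        if op == "LOAD":   AC = memory[addr]
        elif op == "ADD":  AC += memory[addr]
        elif op == "HALT": break

    print(AC)   # 13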
In cases where an instruction occupies more than one word, steps 1 and 2 can be repeated
as many times as necessary to fetch the complete instruction. In these cases, the execution
of an instruction may involve one or more operands in memory, each of which requires a
memory access. Further, if indirect addressing is used, then additional memory accesses
are required.
The fetched instruction is loaded into the instruction register. The instruction contains bits
that specify the action to be performed by the processor. The processor interprets the
instruction and performs the required action. In general, the actions fall into four
categories:
• CPU-Memory: Data may be transferred from the CPU to memory or from memory to the CPU.
• CPU-I/O: Data may be transferred between the CPU and an I/O module.
• Data processing: The CPU may perform some arithmetic or logic operation on data.
• Control: An instruction may specify that the sequence of execution be altered.
The execution cycle of a particular instruction may involve more than one reference to
memory. Also, instead of memory references, an instruction may specify an I/O operation.
With these additional considerations, the basic instruction cycle is expanded in more
detail in the Figure 5.4. The figure is in the form of a state diagram.
10.5 Processor Organization
There are several components inside a CPU, namely, the ALU, control unit, general
purpose registers, instruction register, etc. Now we will see how these components are
organized inside the CPU. There are several ways to place these components and
interconnect them. One such organization is shown in the Figure 5.6.
In this case, the arithmetic and logic unit (ALU), and all CPU registers are connected via a
single common bus. This bus is internal to CPU and this internal bus is used to transfer the
information between different components of the CPU. This organization is termed the
single-bus organization, since only one internal bus is used for transferring information
between different components of the CPU. We also have external buses to
connect the CPU with the memory module and I/O devices. The external memory bus is
also shown in the Figure 5.6 connected to the CPU via the memory data and address
register MDR and MAR.
The number and function of registers R0 to R(n-1) vary considerably from one machine to
another. They may be provided for general-purpose use by the programmer.
Alternatively, some of them may be dedicated as special-purpose registers, such as index
register or stack pointers. In this organization, two registers, namely Y and Z are used
which are transparent to the user. Programmer cannot directly access these two registers.
These are used as input and output buffer to the ALU which will be used in ALU
operations. They will be used by CPU as temporary storage for some instructions.
Figure 5.6: Single bus organization of the data path inside the CPU
Most of the operation of a CPU can be carried out by performing one or more of the
following functions in some prespecified sequence:
1. Fetch the contents of a given memory location and load them into a CPU register.
2. Store a word of data from a CPU register into a given memory location.
3. Transfer a word of data from one CPU register to another or to the ALU.
4. Perform an arithmetic or logic operation, and store the result in a CPU register.
Now we will examine the way in which each of the above functions is implemented in a
computer.
Fetching a Word from Memory:
Information is stored in memory locations identified by their addresses. To fetch a word
from memory, the CPU has to specify the address of the memory location where this
information is stored and request a Read operation. The information may be either data
for an operation or an instruction of a program available in main memory.
The CPU transfers the address of the required memory location to the Memory Address
Register (MAR).
The MAR is connected to the memory address line of the memory bus, hence the address
of the required word is transferred to the main memory.
Next, the CPU uses the control lines of the memory bus to indicate that a Read operation
is initiated. After issuing this request, the CPU waits until it receives an answer from the
memory, indicating that the requested operation has been completed (the Memory
Function Completed, or MFC, signal).
As an example, assume that the address of the memory location to be accessed is kept in
register R2 and that the memory contents are to be loaded into register R1. This is done by
the following sequence of operations:
1. MAR ← [R2]
2. Read
3. Wait for MFC
4. R1 ← [MDR]
The time required for step 3 depends on the speed of the memory unit. In general, the time
required to access a word from the memory is longer than the time required to perform any
operation within the CPU.
The scheme that is used here to transfer data from one device (memory) to another device
(CPU) is referred to as an asynchronous transfer.
This asynchronous transfer enables transfer of data between two independent devices that
have different speeds of operation. The data transfer is synchronized with the help of some
control signals. In this example, Read request and MFC signal are doing the
synchronization task.
An alternative scheme is synchronous transfer. In this case all the devices are controlled
by a common clock pulse (a continuously running clock of fixed frequency). These pulses
provide a common timing signal to the CPU and the main memory. A memory operation
is completed during every clock period. Though the synchronous data transfer scheme
leads to a simpler implementation, it is difficult to accommodate devices with widely
varying speeds: the duration of the clock pulse must be matched to the slowest device,
which reduces the speed of all the devices to that of the slowest one.
As soon as MFC signal is set to 1, the information available in the data bus is loaded into
the Memory Data Register (MDR) and this is available for use inside the CPU.
Storing a Word in Memory:
As an example, assume that the data word to be stored in memory is in register R1 and
that the memory address is in register R2. The memory write operation requires the
following sequence:
1. MAR ← [R2]
2. MDR ← [R1]
3. Write
4. Wait for MFC
In this case steps 1 and 2 are independent, so they can be carried out in any order. In fact,
steps 1 and 2 can be carried out simultaneously, if this is allowed by the architecture, that
is, if these two data transfers (memory address and data) do not use the same data path. In
the case of both memory read and memory write operations, the total time duration
depends on the wait for the MFC signal, which depends on the speed of the memory
module.
There is scope to improve the performance of the CPU if it is allowed to perform some
other operation while waiting for the MFC signal. During this period, the CPU can execute
other instructions which do not require the use of the MAR and MDR.
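The read and write sequences above can be mimicked schematically in software. In the
following sketch the MFC handshake is reduced to a simple flag set by the memory
model, so this is only an illustration of the protocol, not a hardware model.

    class Memory:
        def __init__(self):
            self.cells = {}
            self.MFC = False          # Memory Function Completed flag
        def read(self, MAR):
            self.MFC = False          # Read issued; the CPU must wait
            data = self.cells.get(MAR, 0)
            self.MFC = True           # memory raises MFC when the word is ready
            return data
        def write(self, MAR, MDR):
            self.MFC = False          # Write issued
            self.cells[MAR] = MDR
            self.MFC = True           # transfer complete

    mem = Memory()
    mem.write(MAR=0x2A, MDR=99)       # MAR <- [R2]; MDR <- [R1]; Write; wait for MFC
    print(mem.read(MAR=0x2A))         # 99 is placed in the MDR once MFC is set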
Since the input and output lines of all the registers are connected to the common internal
bus, we need appropriate input and output gating. The input and output gates for register
Ri are controlled by the signals Riin and Riout respectively. Thus, when Riin is set to 1,
the data available on the common bus is loaded into Ri. Similarly, when Riout is set to 1,
the contents of register Ri are placed on the bus. To transfer data from one register to
another register, we need to generate the appropriate register gating signals.
For example, to transfer the contents of register R1 to register R2, the following actions
are needed:
➢ Enable the output gate of register R1 by setting R1out to 1.
-- This places the contents of R1 on the CPU bus.
➢ Enable the input gate of register R2 by setting R2in to 1.
-- This loads data from the CPU bus into the register R2.
Generally, the ALU is used inside the CPU to perform arithmetic and logic operations.
The ALU is a combinational logic circuit which does not have any internal storage.
Therefore, to perform any arithmetic or logic operation (say a binary operation), both
inputs must be made available at the two inputs of the ALU simultaneously. Once both
inputs are available, the appropriate signal is generated to perform the required operation.
We may have to use temporary storage (registers) to carry out an operation in the ALU.
The sequence of operations that has to be carried out to perform one ALU operation depends on
the organization of the CPU. Consider an organization in which one of the operands of
ALU is stored in some temporary register Y and other operand is directly taken from CPU
internal bus. The result of the ALU operation is stored in another temporary register Z.
This organization is shown in the Figure 5.7.
Therefore, the sequence of operations to add the contents of register R1 to register R2 and
store the result in register R3 should be as follows:
1. R1out, Yin
2. R2out, Add, Zin
3. Zout, R3in
In step 2 of this sequence, the contents of register R2 are gated to the bus, and hence to
input B of the ALU, which is directly connected to the bus. The contents of register Y are
always available at input A of the ALU. The function performed by the ALU depends on
the signals applied to the ALU control lines. In this example, the Add control line of the
ALU is set to 1, which indicates the addition operation, and the output of the ALU is the
sum of the two numbers at inputs A and B. The sum is loaded into register Z, since its
input gate is enabled (Zin). In step 3, the contents of register Z are transferred to the
destination register R3.
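The three control steps can be traced with a small Python sketch of the single-bus data
path; the register contents are made up for the example.

    regs = {"R1": 4, "R2": 9, "R3": 0, "Y": 0, "Z": 0}
    bus = 0

    def step(out=None, into=None, add=False):
        global bus
        if out:  bus = regs[out]                # Xout: drive the register onto the bus
        if add:  regs["Z"] = regs["Y"] + bus    # ALU adds Y (input A) and the bus (input B)
        if into: regs[into] = bus               # Xin: latch the bus into the register

    step(out="R1", into="Y")    # 1. R1out, Yin
    step(out="R2", add=True)    # 2. R2out, Add, Zin
    step(out="Z",  into="R3")   # 3. Zout, R3in
    print(regs["R3"])           # 13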
Figure 5.7: Organization for Arithmetic & Logic Operation
An alternative structure is the two-bus structure, where two different internal buses are
used in the CPU. All register outputs are connected to bus A, and all register inputs are
connected to bus B.
There is a special arrangement to transfer data from one bus to the other. The buses are
connected through the bus tie G. When this tie is enabled, data on bus A is transferred to
bus B. When G is disabled, the two buses are electrically isolated.
Since two buses are used here, the temporary register Z that the single-bus organization
needs for storing the ALU result is not required. The result can be transferred directly to
bus B, since one of the inputs is on bus A. With the bus tie disabled, the result can be
transferred directly to the destination register. A simple two-bus structure is shown in the
Figure 5.8.
For example, the operation [R3] ← [R1] + [R2] can now be performed in this structure.
In this case, source register R2 and destination register R3 have to be different, because
the two operations R2in and R2out cannot be performed together. This restriction applies
only if the registers are made of simple latches.
We may have another CPU organization, where three internal CPU buses are used. In this
organization each bus is connected to only one output and a number of inputs. The
elimination of the need for connecting more than one output to the same bus leads to
faster bus transfer and simpler control. A simple three-bus organization is shown in the
figure 5.9.
A multiplexer is provided at the input to each of the two working registers A and B, which
allows them to be loaded from either the input data bus or the register data bus. The
diagram presents one possible interconnection for a three-bus organization; different
interconnections are possible.
In this three-bus organization, we keep two input data buses instead of the one that is used
in the two-bus organization. Two separate input data buses are present: one is for external
data transfer, i.e. retrieving data from memory, and the second one is for internal data
transfer, i.e. transferring data from the general purpose registers to other building blocks
inside the CPU.
Figure 5.8: Two bus structure
Like the two-bus organization, we can use bus ties to connect the input bus and output
bus. When a bus tie is enabled, the information present on the input bus is directly
transferred to the output bus. We may use one bus tie G1 between the input data bus and
the ALU output bus, and another bus tie G2 between the register data bus and the ALU
output data bus.
Figure 5.9: Three Bus structure
To execute a complete instruction, we take the help of these basic operations, executed in
some particular order. For example, consider the instruction: "Add contents of memory
location NUM to the contents of register R1 and store the result in register R1." For
simplicity, assume that the address NUM is given explicitly in the address field of the
instruction, that is, the direct addressing mode is used. Execution of this instruction
requires the following actions:
1. Fetch instruction
2. Fetch first operand (Contents of memory location pointed at by the address field of
the instruction)
3. Perform addition
4. Load the result into R1.
The following sequence of control steps is required to implement the above operation for
the single-bus architecture that we discussed in the earlier section.
Steps Actions
1. PCout, MARin, Read, Clear Y, Set carry-in to ALU, Add, Zin
2. Zout, PCin, Wait for MFC
3. MDRout, IRin
4. Address-field-of-IRout, MARin, Read
5. R1out, Yin, Wait for MFC
6. MDRout, Add, Zin
7. Zout, R1in
8. END
In Step 1: The instruction fetch operation is initiated by loading the contents of the PC
into the MAR and sending a Read request to memory. To perform this task, the contents
of the PC are first brought onto the internal bus and then loaded into the MAR, for which
the control circuit has to generate the PCout and MARin signals. After issuing the Read
signal, the CPU has to wait for some time to get the MFC signal. During that time the PC
is incremented by 1 through the use of the ALU. This is accomplished by setting one of
the inputs to the ALU (register Y) to 0, while the other input is the bus, which carries the
current value of the PC. At the same time, the carry-in to the ALU is set to 1 and an Add
operation is specified.
In Step 2: The updated value is moved from register Z back into the PC. Step 2 is initiated
immediately after issuing the memory Read request without waiting for completion of
memory function. This is possible, because step 2 does not use the memory bus and its
execution does not depend on the memory read operation.
In Step 3: Step 3 is delayed until the MFC signal is received. Once MFC is received, the
word fetched from the memory is transferred to the IR (Instruction Register), because it is
an instruction. Steps 1 through 3 constitute the instruction fetch phase of the control
sequence. The instruction fetch portion is the same for all instructions. From the next step
onwards, the instruction execution phase takes place.
As soon as the IR is loaded with an instruction, the instruction decoding circuits interpret
its contents. This enables the control circuitry to choose the appropriate signals for the
remainder of the control sequence, steps 4 to 8, which we refer to as the execution phase.
To design the control sequence of the execution phase, knowledge of the internal structure
and instruction format of the CPU is needed. Also, the length of the execution phase is
different for different instructions.
The instruction has an opcode field, a memory address field M and a register field R.
In Step 4: The address field of the IR, which contains the address NUM, is gated to the
bus and loaded into the MAR, and a memory Read operation is initiated.
In Step 5: The destination field of IR, which contains the address of the register R1, is
used to transfer the contents of register R1 to register Y and wait for Memory function
Complete. When the read operation is completed, the memory operand is available in
MDR.
In Step 6: The memory operand available in the MDR is gated to the bus and added to the
contents of register Y (which holds the contents of R1); the sum is stored in register Z.
In Step 7: The result of addition operation is transferred from temporary register Z to the
destination register R1 in this step.
In step 8: It indicates the end of the execution of the instruction by generating End signal.
This indicates completion of execution of the current instruction and causes a new fetch
cycle to be started by going back to step 1.
10.10 Branching
With the help of branch instructions, control of the execution of the program is transferred
from one particular position to some other position, breaking the sequential flow of
control. Branching is accomplished by replacing the current contents of the PC by the
branch address, that is, the address of the instruction to which branching is required.
Consider a branch instruction in which the branch address is obtained by adding an offset
X, which is given in the address field of the branch instruction, to the current value of the
PC. Consider the following unconditional branch instruction:
JUMP X
The control sequence that enables execution of an unconditional branch instruction using
the single - bus organization is as follows:
Execution starts as usual with the fetch phase, ending with the instruction being loaded into
the IR in step 3. To execute the branch instruction, the execution phase starts in step 4.
In Step 5: The offset X of the instruction is gated to the bus and the addition operation is
performed.
In Step 6: The result of the addition, which represents the branch address is loaded into the
PC.
In Step 7: It generates the End signal to indicate the end of execution of the current
instruction.
Steps Actions
1. PCout, MARin, Read, Clear Y, Set Carry-in to ALU, Add, Zin
2. Zout, PCin, Wait for MFC
3. MDRout, IRin
4. PCout, Yin
5. Address field-of IRout, Add, Zin
6. Zout, PCin
7. End
Consider now the conditional branch instruction instead of the unconditional branch. In
this case, we need to check the status of the condition codes between steps 3 and 4, i.e.,
before adding the offset value to the PC contents.
For example, if the instruction decoding circuitry interprets the contents of the IR as a
Branch on Negative (BRN) instruction, the control unit proceeds as follows. First, the
condition code register is checked. If bit N (negative) is equal to 1, the control unit
proceeds with steps 4 through 7 of the control sequence of the unconditional branch
instruction. If N is equal to 0, an End signal is generated instead. This, in effect,
terminates execution of the branch instruction and causes the instruction immediately
following the branch instruction to be fetched when a new fetch operation is performed.
Therefore, the control sequence for the conditional branch instruction BRN can be
obtained from the control sequence of an unconditional branch instruction by replacing
step 4 by:
4. If N = 0 then End
   If N = 1 then PCout, Yin
Check your progress II
1. The processing required for the execution of a single instruction is known as ____________.
2. ___________ is accomplished by replacing the current contents of the PC by the branch
address, that is, the address of the instruction to which branching is required.
3. The Bus which is connecting the major three components of a computer (CPU, Memory
and Input/Output devices) is known as _______________.
10.11 Summary
1. The major components of the CPU are an arithmetic and logic unit (ALU) and a
control unit (CU).
2. The ALU does the actual computation or processing of data. The CU controls the
movement of data and instruction into and out of the CPU and controls the
operation of the ALU.
3. There are three basic components of CPU: register bank, ALU and Control Unit.
4. A computer system employs a memory hierarchy. At the highest level of hierarchy,
memory is faster, smaller and more expensive.
5. The registers in the CPU can be categorized into two groups, User-visible registers
and Control and status registers.
[179]
6. General-purpose registers can be assigned to a variety of functions by the
programmer. In some cases, general- purpose registers can be used for addressing
functions.
7. Data registers may be used to hold only data and cannot be employed in the
calculation of an operand address.
8. Address registers may be somewhat general purpose, or they may be devoted to a
particular addressing mode.
9. Condition Codes (also referred to as flags) are bits set by the CPU hardware as the
result of the operations.
10. All CPU designs include a register or set of registers, often known as the processor
status word (PSW), that contains status information.
11. The PSW typically contains condition codes plus other status information.
12. The instructions constituting a program to be executed by a computer are loaded in
sequential locations in its main memory.
13. To execute this program, the CPU fetches one instruction at a time and performs
the functions specified. Instructions are fetched from successive memory locations
until the execution of a branch or a jump instruction.
14. The CPU keeps track of the address of the memory location where the next
instruction is located through the use of a dedicated CPU register, referred to as the
program counter (PC).
15. This bus is internal to CPU and this internal bus is used to transfer the information
between different components of the CPU.
16. Bus width means the number of lines available in the Bus.
17. An instruction cycle consists of two phases, the fetch cycle and the execution cycle.
18. Register transfer operations enable data transfer between various blocks connected
to the common bus of CPU.
Answers to Check your progress II
1. Instruction Cycle
2. Branching
3. System Bus
DESIGN OF CONTROL UNIT
To generate the control signals in proper sequence, a wide variety of techniques exist.
Most of these techniques, however, fall into one of two categories:
• Hardwired Control
• Microprogrammed Control.
For the moment, for simplicity, let us assume that all time slots are of equal duration. The
required controller may then be implemented based upon the use of a counter driven by a
clock. Each state, or count, of this counter corresponds to one of the steps of the control
sequence of the instructions of the CPU.
In the previous lecture, we presented control sequences for the execution of two
instructions only (one for add and the other for branch). In the same way, we need to
design the control sequences for all the instructions.
By looking into the design of the CPU, we may say that there are various instructions for
the add operation. For example,
ADD NUM R1 : Add the contents of the memory location specified by NUM to the
contents of register R1: R1 ← R1 + [NUM]
ADD R2 R1 : Add the contents of register R2 to the contents of register R1:
R1 ← R1 + R2
The control sequence for execution of these two ADD instructions are different. Of course,
the fetch phase of all the instructions remain same.
It is clear that the control signals depend on the instruction, i.e., the contents of the
instruction register. It is also observed that the execution of some instructions depends on
the contents of the condition code or status flag register, as in the conditional branch
instruction.
Hence, the required control signals are uniquely determined by the following information:
• the contents of the control step counter,
• the contents of the instruction register,
• the contents of the condition codes and other status flags, and
• the external input signals, such as MFC.
The external inputs represent the state of the CPU and the various control lines connected
to it, such as the MFC status signal. The condition codes/status flags indicate the state of
the CPU. These include status flags like carry, overflow, zero, etc.
The structure of control unit can be represented in a simplified view by putting it in block
diagram. The detailed hardware involved may be explored step by step. The simplified
view of the control unit is given in the Figure 5.10.
The decoder/encoder block is simply a combinational circuit that generates the required
control outputs depending on the state of all its inputs.
The decoder part of the decoder/encoder block provides a separate signal line for each
control step, or time slot, in the control sequence. Similarly, the output of the instruction
decoder consists of a separate line for each machine instruction; according to the
instruction loaded in the IR, one of the output lines INS1 to INSm is set to 1 and all other
lines are set to 0.
The detailed view of the control unit organization is shown in the Figure 5.11.
All input signals to the encoder block should be combined to generate the individual
control signals.
In the previous section, we presented the control sequences of these instructions. The
control unit is required to generate many control signals, which basically come out of the
encoder circuit of the control signal generator. The control signals are: PCin, PCout, Zin,
Zout, MARin, ADD, END, etc.
By looking into the above control sequences, we can write the logic function for Zin. For
all instructions, we need the control signal Zin in time step 1 to enable the input to register
Z; it is also needed in time cycle T6 of the ADD_MD instruction, in time cycle T5 of the
BR instruction, and so on:
Zin = T1 + T6 · ADD_MD + T5 · BR + ……
These logic functions can be implemented by a two-level combinational circuit of AND
and OR gates.
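The sum-of-products expression for Zin translates directly into a two-level AND-OR
function; the following sketch evaluates it for the two instructions discussed here.

    def Zin(T, instr):
        # T: the active time-step line from the step decoder (1, 2, 3, ...)
        # instr: the line asserted by the instruction decoder
        return (T == 1) or (T == 6 and instr == "ADD_MD") or (T == 5 and instr == "BR")

    print(Zin(1, "BR"))        # True  -- step 1 of every instruction
    print(Zin(6, "ADD_MD"))    # True
    print(Zin(6, "BR"))        # False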
This END signal indicates the end of the execution of an instruction, so this END signal
can be used to start a new instruction fetch cycle by resetting the control step counter to its
starting value.
The circuit diagram (Partial) for generating Zin and END signal is shown in the Figure
5.12 and Figure 5.13 respectively.
The signals T1, T2, T3, etc. come from the step decoder, driven by the control step
counter. The signal N (Negative) comes from the condition code register.
When the wait-for-MFC (WMFC) signal is generated, the CPU does not do any work; it
waits for an MFC signal from the memory unit. In this case, the desired effect is to delay
the initiation of the next control step until the MFC signal is received from the main
memory. This can be incorporated by inhibiting the advancement of the control step
counter for the required period.
Let us assume that the control step counter is controlled by a signal called RUN. By
looking at the control sequence of all the instructions, the WMFC signal is generated as:
WMFC = T2 + T5 · ADD_MD + ……
The RUN signal is generated with the help of WMFC signal and MFC signal. The
arrangement is shown in the Figure 5.14.
The MFC signal is generated by the main memory, whose operation is independent of the
CPU clock. Hence MFC is an asynchronous signal that may arrive at any time relative to
the CPU clock. It can be synchronized with the CPU clock with the help of a D flip-flop.
When the WMFC signal is high, the RUN signal is low. This RUN signal is gated with the
master clock pulse (MCLK) through an AND gate. When RUN is low, the CLK signal
remains low, and the control step counter does not progress.
When the MFC signal is received, the RUN signal becomes high and the CLK signal
becomes the same as the MCLK signal, due to which the control step counter progresses.
Therefore, in the next control step, the WMFC signal goes low and the control unit
operates normally until the next memory access signal is generated.
The timing diagram for an instruction fetch operation is shown in the Figure 5.15.
11.3 Programmable Logic Array
In this discussion, we have presented a simplified view of the way in which the sequence
of control signals needed to fetch and execute instructions may be generated.
It is observed from the discussion that as the number of instructions increases, the number
of required control signals also increases.
In VLSI technology, structures that involve regular interconnection patterns are much
easier to implement than random connections.
One such regular structure is the PLA (programmable logic array). PLAs are nothing but
arrays of AND gates followed by an array of OR gates. If the control signals are expressed
in sum-of-products form, they can be implemented with a PLA.
Check your progress I
1. In hardwired control, the control signals required inside the CPU can be generated using a
state counter and a _____ circuit.
2. The ___________ Control organization involves the control logic to be implemented with
gates, flip-flops, decoders, and other digital circuits.
3. In ________ control unit, if the design has to be modified or changed, all the
combinational circuits have to be modified which is a very difficult task.
11.4 Microprogrammed Control
A microprogrammed control unit is a relatively simple logic circuit that is capable of (1)
sequencing through microinstructions and (2) generating control signals to execute each
microinstruction.
Microprograms are stored in the microprogram memory, and their execution is controlled
by the microprogram counter (µPC).
A microprogram consists of microinstructions, which are nothing but strings of 0s and 1s.
In a particular instance, we read the contents of one location of the microprogram
memory, which is nothing but a microinstruction. Each output line (data line) of the
microprogram memory corresponds to one control signal. If the content of a memory cell
is 0, it indicates that the signal is not generated; if the content is 1, it indicates that the
control signal is generated at that instant of time. First, let us define the different
terminologies related to a microprogrammed control unit.
A control word (CW) is defined as a word whose individual bits represent the various
control signals. Therefore, each of the control steps in the control sequence of an
instruction defines a unique combination of 0s and 1s in the CW.
The control unit can generate the control signals for any instruction by sequentially
reading the CWs of the corresponding microprogram from the microprogram memory. To
read the control words sequentially from the microprogram memory, a microprogram
counter (µPC) is needed.
The basic organization of a microprogrammed control unit is shown in the Figure 5.17.
The "starting address generator" block is responsible for loading the starting address of the
microprogram into the µPC every time a new instruction is loaded in the IR. The µPC is
then automatically incremented by the clock, and it reads the successive microinstructions
from memory.
Each microinstruction basically provides the required control signals at that time step. The
microprogram counter ensures that the control signals will be delivered to the various
parts of the CPU in the correct sequence.
We have some instructions whose execution depends on the status of the condition codes
and status flags, for example, the branch instruction. During branch instruction execution,
it is required to take a decision between alternative actions.
To handle such instructions with microprogrammed control, the design of the control unit
is based on the concept of conditional branching in the microprogram. For that, it is
required to include some conditional branch microinstructions.
These microinstructions specify which of the external inputs, condition codes, and,
possibly, bits of the instruction register should be checked as a condition for branching to
take place. To support microprogram branching, the organization of the control unit
should be modified to accommodate the branching decision.
To generate the starting address, we need the instruction which is present in the IR. But to
generate a branch address, it is also required to check the contents of the condition codes
and status flags.
The control bits of the microinstruction word which specify the branch conditions and
address are fed to the "starting and branch address generator" block. This block performs
the function of loading a new address into the µPC when the condition of a branch
microinstruction is satisfied.
We have seen that the execution of every instruction consists of two parts: the fetch phase
and the execution phase. It is also observed that the fetch phase is the same for all
instructions.
At the end of the fetch microprogram, the starting address generator unit calculates the
appropriate starting address of the microprogram for the instruction currently present in
the IR. Thereafter, the µPC controls the execution of the microprogram, which generates
the appropriate control signals in the proper sequence.
During the execution of a microprogram, the µPC is incremented every time a new
microinstruction is fetched from the microprogram memory, except in the following
situations:
1. When a new instruction is loaded into the IR, the µPC is loaded with the starting
address of the microprogram for that instruction.
2. When a branch microinstruction is encountered and the branch condition is
satisfied, the µPC is loaded with the branch address.
3. When an End microinstruction is encountered, the µPC is loaded with the address
of the first CW in the microprogram for the instruction fetch cycle.
Let us examine the contents of the microprogram memory and how the microprogram of
each instruction is stored or organized in it. Consider the two examples that were used in
our previous lecture. The first example is the control sequence for execution of the
instruction "Add contents of memory location addressed in memory direct mode to
register R1":
Steps Actions
1. PCout, MARin, Read, Clear Y, Set carry-in to ALU, Add, Zin
2. Zout, PCin, Wait For MFC
3. MDRout, IRin
4. Address-field-of-IRout, MARin, Read
5. R1out, Yin, Wait for MFC
6. MDRout, Add, Zin
7. Zout, R1in
8. END
The second example is the control sequence for execution of an unconditional branch
instruction:
Steps Actions
1. PCout, MARin, Read, Clear Y, Set Carry-in to ALU, Add , Zin
2. Zout, PCin, Wait for MFC
3. MDRout, IRin
4. PCout, Yin
5. Address field-of IRout, Add, Zin
6. Zout, PCin
7. End
First, consider the control signals required for the fetch phase, which is the same for all
instructions. We list them in a particular order:
PCout MARin Read Clear Y Set Carry to ALU Add Zin Zout PCin WMFC MDRout IRin
The control words for the first three steps of the above two instructions (which form the
fetch cycle of each instruction) are as follows:
Step1 1 1 1 1 1 1 1 0 0 0 0 0 ---
Step2 0 0 0 0 0 0 0 1 1 1 0 0 ---
Step3 0 0 0 0 0 0 0 0 0 0 1 1 ---
We store these three CWs in memory locations 0, 1 and 2. Each instruction starts from
memory location 0. After executing up to the third step, i.e., the contents of microprogram
memory location 2, the instruction is in the IR. The starting address generator circuit now
calculates the starting address of the microprogram for the instruction available in the IR.
Assume that the microprogram for the add instruction is stored from memory location 50
of the microprogram memory. The partial contents from memory location 50 are as
follows:
Location50 0 1 1 0 0 0 0 0 0 0 0 0 -- -- --
51 0 0 0 0 0 0 0 0 0 1 0 0 -- -- --
and so on . . . .
Memory location | Bit positions 1 to 17
0  : 0 1 1 1 0 0 0 0 1 1 1 1 0 0 0 0 0 …
1  : 1 0 0 0 0 0 0 0 0 0 0 0 1 0 0 1 0 …
2  : 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 …
50 : 0 0 1 1 0 0 1 0 0 0 0 0 0 0 0 0 0
51 : 0 0 0 0 0 0 0 1 0 0 0 0 0 1 0 1 0
52 : 0 0 0 0 1 0 0 0 0 0 1 1 0 0 0 0 0
53 : 0 0 0 0 0 0 0 0 0 0 0 0 1 0 1 0 0
54 : 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1
From the discussion, it is clear that microprograms are similar to computer programs, but
one level lower; that is why they are called microprograms.
For each instruction of the instruction set of the CPU, we will have a microprogram.
When we fetch an instruction from main memory, to execute that instruction, we execute
the microprogram for that instruction. Microprograms are nothing but the collection of
microinstructions. These microinstructions will be fetched from microprogram memory
one after another and its sequence is maintained by µPC. Fetching of microinstruction
basically provides the required control signal at that time instant.
• For each instruction of the CPU, we write a microprogram to generate the control
signals. The microprograms are stored in the microprogram memory (control
store). The starting address of each microprogram is known to the designer.
• Each microprogram is a sequence of microinstructions, executed in sequence. The
execution sequence is maintained by the microprogram counter.
• Each microinstruction is nothing but a combination of 0s and 1s, known as a
control word. Each position of the control word specifies a particular control
signal. A 0 in the control word means that a low signal value is generated for that
control signal at that particular instant of time; similarly, a 1 indicates a high signal.
• Since each machine instruction is executed by a corresponding micro routine, it
follows that a starting address for the micro routine must be specified as a function
of the contents of the instruction register (IR).
• To incorporate the branching instruction, i.e., the branching within the
microprogram, a branch address generator unit must be included. Both
unconditional and conditional branching can be achieved with the help of
microprogram. To incorporate the conditional branching instruction, it is required
to check the contents of condition code and status flag.
A microprogrammed control unit is very similar to a CPU. In a CPU, the PC is used to
fetch instructions from the main memory; in the control unit, the microprogram counter is
used to fetch microinstructions from the control store.
But there are some differences between the two. When fetching instructions from main
memory, we use two signals, MFC and WMFC. These signals are required to synchronize
the speeds of the CPU and main memory, since main memory is in general a slower
device than the CPU.
In microprogrammed control, the need for such signals is less obvious. The control
store is much smaller than main memory, so it can be implemented with a
faster memory whose speed is almost the same as that of the CPU.
Since control stores are usually relatively small, it is feasible to speed them up
with costly circuits.
If we could implement main memory with a faster device, it would likewise be possible to
eliminate the MFC and WMFC signals. But, in general, main memory is very large
and it is not economically feasible to replace the whole main memory with a faster memory
just to eliminate MFC and WMFC.
It is possible to reserve one bit position for each control signal. If there are n control
signals in a CPU, then the length of each control word is n. Since we have one bit for
each control signal, a large number of resources can be controlled with a single
microinstruction. This organization of microinstructions is known as horizontal
organization.
If the machine structure allows a number of resources to be used in parallel, then horizontal
organization has an advantage: since a greater number of resources can be accessed in
parallel, the operating speed is higher as well.
If the machine architecture does not allow many resources to be accessed simultaneously,
then most of the contents of the control store are 0, because we cannot generate many
control signals at the same time. In such a situation, we can combine some control signals
and group them together. This reduces the size of the control word. If we use compact
codes to specify only a small number of control functions in each microinstruction, it is
known as vertical organization of microinstructions.
Horizontal organization, with its longer control words, lies at one extreme; vertical
organization, with its smaller control words, lies at the other.
We will explain the grouping of control signals with the help of an example. The grouping
of control signals depends on the internal organization of the CPU.
Assigning an individual bit to each control signal is certain to lead to long microinstructions,
since the number of required control signals is normally large.
However, only a few bits are set to 1, and therefore used for active gating, in any given
microinstruction. This obviously results in low utilization of the available bit space.
If we group the control signals into non-overlapping groups, the size of the control
word reduces.
This CPU contains four general purpose registers R0, R1, R2 and R3. In addition, there
are three other registers called SOURCE, DESTIN and TEMP. These are used for
temporary storage within the CPU and are completely transparent to the programmer: a
computer programmer cannot use these three registers.
Figure 5.19: Single bus architecture of CPU
For the proper functioning of this CPU, we need altogether 24 gating signals for the
transfer of information between the internal CPU bus and other resources like registers.
In addition to these register gating signals, we need some other control signals, which
include the Read, Write, Clear Y, Set carry-in, WMFC and End signals. (Here we are
restricting the number of control signals for ease of discussion; in reality, the number of
signals is larger.)
It is also necessary to specify the function to be performed by the ALU. Depending on the
power of the ALU, we need several control lines, one control signal for each function. Assume
that the ALU used in the design can perform 16 different operations, such as ADD,
SUBTRACT, AND, OR, etc. So, we need 16 different control lines.
It is observed that most signals are not needed simultaneously and many signals are
mutually exclusive.
For example, only one function of the ALU can be activated at a time. In our case we
are considering 16 ALU operations. Instead of using 16 different signals for the ALU operations,
we can group them together and reduce the number of control signals. From digital logic
circuits, it is obvious that instead of 16 different signals, we can use only 4 control signals for
the ALU operation and then use a 4 × 16 decoder to generate the 16 different ALU signals. Due
to the use of a decoder, there is a reduction in the size of the control word.
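A minimal sketch of that decoding step, in Python: the 4-bit encoded ALU field drives a 4 × 16 decoder whose 16 outputs are the individual ALU control lines. The mapping of field values to operations is assumed for illustration.

def decode_alu_field(field4):
    # 4 x 16 decoder: exactly one of the 16 ALU control lines goes high.
    assert 0 <= field4 < 16
    lines = [0] * 16
    lines[field4] = 1
    return lines

print(decode_alu_field(0))   # e.g. if code 0 means ADD, the ADD line is 1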
Another possibility for grouping control signals: the source of a data transfer must be
unique, which means that it is not possible to gate the contents of two different registers
onto the bus at the same time. Similarly, the Read and Write signals to the memory cannot be
activated simultaneously.
This observation suggests the possibility of grouping the signals so that all signals that
are mutually exclusive are placed in the same group. Thus a group can specify one micro-
operation at a time.
We then use a binary coding scheme to represent a given signal within a
group. For example, for the 16 ALU functions, four bits are enough to encode the
appropriate function.
A possible grouping of the 46 control signals required for the above-mentioned CPU
is given in Table 5.1.
Table 5.1: Grouping of the control signals

F1 (4 bits): 0000 No transfer, 0001 PCout, 0010 MDRout, 0011 Zout,
             1000 SOURCEout, 1001 DESTINout, 1010 TEMPout, 1011 ADDRESSout
F2 (3 bits): 000 No transfer, 001 PCin, 010 IRin, 011 Zin
F3 (2 bits): 00 No transfer, 01 MARin, 10 MDRin, 11 TEMPin
F4 (2 bits): 00 No transfer, 01 Yin, 10 SOURCEin, 11 DESTINin
F5 (4 bits): 0000 Add, 0001 Sub, 0010 MULT, 0011 Div, ..., 1111 XOR
F6 (2 bits): 00 No action, 01 Read, 10 Write
F7 (1 bit): 0 No action, 1 Clear Y
F8 (1 bit): 0 Carry-in = 0, 1 Carry-in = 1
F9 (1 bit): 0 No action, 1 WMFC
F10 (1 bit): 0 Continue, 1 End
A possible grouping of signals is shown here; other groupings are also possible. All
out-gating signals of the registers are grouped into one group, because the contents
of only one register are allowed to go onto the internal bus at a time; otherwise there
would be a conflict of data.
But the in-gating signals of the registers are grouped into three different groups. This
implies that the contents of the bus may be stored into up to three different registers
simultaneously, for example transferred to MAR and Z at the same time. Due to this
grouping, we use 7 bits (3 + 2 + 2) for the in-gating signals. If we had grouped them
into one group, then only 4 bits would have been enough, but it would take more time
during execution: two clock cycles would then be required to transfer the contents of
PC to MAR and Z.
Therefore, the grouping of signals is a critical design parameter. If speed of operation is also
a design parameter, then the compression of the control word will be less.
In this grouping, the 46 control signals are grouped into 10 different groups (F1, F2, ...,
F10) and the size of the control word is 21 bits. So, the size of the control word is reduced
from 46 to 21, a reduction of more than 50%.
To regenerate the individual control signals, each encoded field is passed through a
decoder: group F1 needs a 4 × 16 decoder, group F2 a 3 × 8 decoder, and so on for the
remaining groups.
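The effect of this vertical organization can be illustrated by slicing a 21-bit control word into the ten fields of Table 5.1; each field would then feed its own decoder. The bit ordering assumed below (F1 in the most significant bits) is for illustration only.

def unpack_control_word(cw):
    # Field widths from Table 5.1: F1..F10 use 4,3,2,2,4,2,1,1,1,1 bits.
    widths = [4, 3, 2, 2, 4, 2, 1, 1, 1, 1]
    fields, shift = {}, 21
    for i, w in enumerate(widths, start=1):
        shift -= w
        fields["F%d" % i] = (cw >> shift) & ((1 << w) - 1)
    return fields

# F1 = 0001 (PCout), with all other fields zero:
print(unpack_control_word(0b000100000000000000000))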
Writing a separate microprogram for each machine instruction is a simple solution, but it
increases the size of the control store.
We have already discussed that most machine instructions can operate in several
addressing modes. If we write a different microroutine for each addressing mode, then in most
cases we repeat some part of the microroutine.
The common parts of the microroutines can be shared by several microroutines, which
reduces the size of the control store. This results in a considerable number of branch
microinstructions being needed to transfer control among the various parts. So, it introduces
branching capabilities within the microprogram.
This indicates that the microprogrammed control unit has to perform two basic tasks:
• Microinstruction sequencing: getting the next microinstruction from the control memory.
• Microinstruction execution: generating the control signals needed to execute that microinstruction.
In designing a control unit, these tasks must be considered together, because both affect the
format of the microinstruction and the timing of the control unit.
Two concerns are involved in the design of a microinstruction sequencing technique: the
size of the microinstruction and the address generation time.
11.6.2 Sequencing Techniques
Based on the current microinstruction, condition flags and the contents of the instruction
register, a control memory address must be generated for the next microinstruction. A wide
variety of techniques have been used; they can be grouped into three general
categories:
• Two address fields
• Single address field
• Variable format
The branch control logic with two address fields is shown in the Figure 5.20.
A multiplexer is provided that serves as a destination for both address fields and the
instruction register. Based on an address selection input, the multiplexer passes either the
opcode or one of the two addresses to the control address register (CAR). The CAR is
subsequently decoded to produce the next microinstruction address. The address selection
signals are provided by a branch logic module whose input consists of control unit flags
plus bits from the control portion of the microinstruction.
The two-address approach is simple, but it requires more bits in the microinstruction. With
some additional logic, savings can be achieved. The approach is shown in the Figure 5.21.
In this single-address-field branch control logic, the options for the next address are as follows:
• Address field
• Instruction register code
• Next sequential address
The address selection signals determine which option is selected. This approach reduces
the number of address fields to one.
In variable format branch control logic, one bit designates which format is being used. In
one format, the remaining bits are used to activate control signals. In the other format, some
bits drive the branch logic module, and the remaining bits provide the address. With the
first format, the next address is either the next sequential address or an address derived
from the instruction register. With the second format, either a conditional or an unconditional
branch is being specified. The approach is shown in the Figure 5.22.
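The essence of all three schemes is a selection among candidate next addresses. The following Python sketch mirrors the single-address-field case: depending on the address selection inputs, the next microinstruction address is taken from the explicit address field, from the opcode mapping of IR, or from the incremented CAR. The argument names are illustrative.

def next_microaddress(car, address_field, mapped_opcode_address,
                      branch_taken, dispatch_on_ir):
    # Address selection logic of the single-address-field scheme (sketch).
    if dispatch_on_ir:
        return mapped_opcode_address   # start of the microroutine for IR
    if branch_taken:
        return address_field           # conditional/unconditional branch
    return car + 1                     # next sequential microinstruction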
11.6.3 Address Generation
We have looked at the sequencing problem from the point of view of format considerations
and general logic requirements. Another viewpoint is to consider the various ways in
which the next address can be derived or computed.
The address generation techniques can be divided into two categories: explicit and implicit.
With two address fields, a single address field or a variable format, the various branch
instructions can be implemented with the explicit approach, where the address is explicitly
available in the microinstruction.
In the implicit technique, a mapping is required to get the address of the next microinstruction:
the opcode portion of a machine instruction must be mapped into a microinstruction address.
Check your progress II
1. ____________ control unit is slower in speed because of the time it takes to fetch
microinstructions from the control memory.
2. In_________ technique, the address is explicitly available in the microinstruction.
3. In __________technique, additional logic circuit is used to generate the address.
4. A ________ is a word whose individual bits represent various control signals.
5. Microprograms are stored in the microprogram memory and their execution is controlled by the
___________ counter.
11.7 Summary
1. To generate the control signals in proper sequence, two categories of techniques
exist: hardwired control and microprogrammed control.
2. In the hardwired control technique, the control signals are generated by means of
hardwired circuits.
3. In VLSI technology, structures that involve regular interconnection patterns are
much easier to implement than random connections.
4. PLAs are arrays of AND gates followed by arrays of OR gates.
5. In microprogrammed control unit, the logic of the control unit is specified by a
microprogram.
6. A control word is defined as a word whose individual bits represent the various
control signals.
7. A sequence of control words (CWs) corresponding to the control sequence of a
machine instruction constitutes the microprogram for that instruction.
8. The individual control words in this microprogram are referred to as
microinstructions.
9. For each instruction of the instruction set of the CPU, we will have a
microprogram.
10. While executing a computer program, we fetch instructions one after another from main
memory, which is controlled by the program counter (PC).
11. In a microprogrammed control unit, each machine instruction is implemented
by a corresponding microroutine.
12. The address generation techniques can be divided into two categories: explicit and
implicit.
11.8 Model Questions
1. Discuss the hardwired implementation of the control unit.
2. Discuss the microprogrammed implementation of the control unit.
3. Write the difference between hardwired and microprogrammed control unit.
4. What is PLA?
5. What is Control Word?
6. What is conditional branching? Explain the organization of microprogrammed
control with conditional branching.
7. Explain Microprogram Sequencing.
8. What are the various concerns involved in the design of a microinstruction
sequencing technique?
9. Explain various sequencing techniques.
Answers to Check your progress I
1. PLA
2. hardwired
3. hardwired
Answers to Check your progress II
1. Micro-programmed
2. explicit
3. implicit
4. control word
5. microprogram
INPUT/OUTPUT DEVICE
12.0 Learning Objectives
After going through this unit, you will be able to:
• Explain the reasons why an I/O device or peripheral device is not directly
connected to the system bus;
• Know the major functions of an I/O module;
• Explain the various steps involved in Processor & Device Communication;
• Explain the functioning of an I/O module using a block diagram;
• Know the various addressing modes between CPU and I/O devices;
• Differentiate between Memory mapped and Isolated I/O;
• Explain basic forms of input and output systems;
• Know various types of I/O commands that an I/O module will receive when it is
addressed by a processor;
• Explain Interrupt processing;
• Know the design Issues for Interrupt;
• Understand possible arrangement to handle multiple interrupt;
• Explain Direct Memory Access;
The third key component of a computer system is a set of I/O modules. Each I/O module
interfaces to the system bus and controls one or more peripheral devices.
There are several reasons why an I/O device or peripheral device is not directly connected
to the system bus. Some of them are as follows:
• There is a wide variety of peripherals with various methods of operation. It would
be impractical to include the necessary logic within the processor to control several
devices.
• The data transfer rate of peripherals is often much slower than that of the memory
or processor. Thus, it is impractical to use the high-speed system bus to
communicate directly with a peripheral.
• Peripherals often use different data formats and word lengths than the computer to
which they are attached.
The I/O function includes a control and timing requirement to co-ordinate the flow of
traffic between internal resources and external devices. For example, the control of the
transfer of data from an external device to the processor might involve the following
sequence of steps –
• The processor interacts with the I/O module to check the status of the attached
device.
• The I/O module returns the device status.
• If the device is operational and ready to transmit, the processor requests the transfer
of data, by means of a command to the I/O module.
• The I/O module obtains a unit of data from external device.
• The data are transferred from the I/O module to the processor.
If the system employs a bus, then each of the interactions between the processor and the
I/O module involves one or more bus arbitrations.
During the I/O operation, the I/O module must communicate with the processor and with
the external device. Processor communication involves the following -
Command decoding: The I/O module accepts commands from the processor, typically sent
as signals on the control bus.
Data: Data are exchanged between the processor and the I/O module over the data bus.
Status Reporting: Because peripherals are so slow, it is important to know the status of
the I/O module. For example, if an I/O module is asked to send data to the processor (read),
it may not be ready to do so because it is still working on the previous I/O command. This
fact can be reported with a status signal. Common status signals are BUSY and READY.
Address Recognition: Just as each word of memory has an address, so does each
I/O device. Thus, an I/O module must recognize one unique address for each peripheral it
controls.
On the other hand, the I/O module must be able to perform device communication. This
communication involves commands, status information and data.
Data Buffering: An essential task of an I/O module is data buffering. The data buffering
is required due to the mismatch of the speed of CPU, memory and other peripheral devices.
In general, the speed of CPU is higher than the speed of the other peripheral devices. So,
the I/O modules store the data in a data buffer and regulate the transfer of data as per the
speed of the devices.
In the opposite direction, data are buffered so as not to tie up the memory in a slow transfer
operation. Thus, the I/O module must be able to operate at both device and memory speed.
Error Detection: Another task of the I/O module is error detection and subsequently
reporting errors to the processor. One class of errors includes mechanical and electrical
malfunctions reported by the device (e.g. paper jam). Another class consists of
unintentional changes to the bit pattern as it is transmitted from the device to the I/O module.
There will be many I/O devices connected through I/O modules to the system. Each device
will be identified by a unique address.
When the processor issues an I/O command, the command contains the address of the
device for which it is intended. The I/O module must interpret the address lines to
check whether the command is for itself. Generally, in most processors, the processor, main
memory and I/O share a common bus (data, address and control buses). In this situation,
two modes of addressing the I/O devices are possible:
• Memory-mapped I/O
• Isolated or I/O-mapped I/O
Memory-mapped I/O
There is a single address space for memory locations and I/O devices. The processor treats
the status and address registers of the I/O modules as memory locations. For example, if the
size of the address bus of a processor is 16, then there are 2^16 combinations and altogether
2^16 locations can be addressed with these 16 address lines.
Out of these 2^16 address locations, some address locations can be used to address I/O
devices and the other locations are used to address memory locations.
Since the I/O devices are included in the same memory address space, the status and
address registers of the I/O modules are treated as memory locations by the processor.
Therefore, the same machine instructions are used to access both memory and I/O devices.
Isolated or I/O-mapped I/O
In this scheme, the full range of addresses may be available for both. Whether an address
refers to a memory location or an I/O device is specified with the help of a command line,
commonly written IO/M̅:
If IO/M̅ = 1, the address present on the address bus is the address of an I/O device.
If IO/M̅ = 0, the address present on the address bus is the address of a memory location.
Since the full range of addresses is available for both memory and I/O devices, with 16
address lines the system may now support both 2^16 memory locations and 2^16 I/O
addresses.
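The distinction can be sketched in a few lines of Python: under isolated I/O the same 16-bit address selects one of two full address spaces, depending on the IO/M̅ line. The arrays and function below are a toy model, not a real bus interface.

MEMORY = [0] * (1 << 16)     # 2^16 memory locations
IO_PORTS = [0] * (1 << 16)   # 2^16 I/O addresses

def bus_read(address, io_m):
    # IO/M' = 1 selects the I/O space, IO/M' = 0 selects memory.
    space = IO_PORTS if io_m == 1 else MEMORY
    return space[address & 0xFFFF]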
12.2 Input / Output Subsystem
There are three basic forms of input and output systems:
• Programmed I/O
• Interrupt driven I/O
• Direct Memory Access (DMA)
With programmed I/O, the processor executes a program that gives it direct control of the
I/O operation, including sensing device status, sending a read or write command, and
transferring the data.
With interrupt driven I/O, the processor issues an I/O command, continues to execute other
instructions, and is interrupted by the I/O module when the I/O module completes its work.
In Direct Memory Access (DMA), the I/O module and main memory exchange data
directly without processor involvement.
With both programmed I/O and Interrupt driven I/O, the processor is responsible for
extracting data from main memory for output operation and storing data in main memory
for input operation.
To send data to an output device, the CPU simply moves that data to a special address
in the I/O address space if I/O-mapped input/output is used, or to an address in the
memory address space if memory-mapped I/O is used.
To read data from an input device, the CPU simply moves data from the address (I/O or
memory) of that device into the CPU.
Input/Output Operation: An input or output operation looks very similar to a memory
read or write operation, except that it usually takes more time, since peripheral devices are
slower than main memory modules.
The working principle of the three methods for input of a Block of Data is shown in the
Figure 6.2.
12.3 Input/Output Port
An I/O port is a device that looks like a memory cell to the computer but contains
connections to the outside world. An I/O port typically uses a latch. When the CPU writes
to the address associated with the latch, the latch captures the data and makes it
available on a set of wires external to the CPU and memory system. I/O ports can be
read-only, write-only, or read/write. A write-only port is shown in the Figure 6.3.
First, the CPU places the address of the device on the I/O address bus, and with the help
of the address decoder a signal is generated which enables the latch. Next, the CPU
indicates that the operation is a write by activating the CPU write
control line. Then the data to be transferred are placed on the CPU bus and
stored in the latch for onward transmission to the device. Both the address decode and
write control lines must be active for the latch to operate. The read/write (input/output)
port is shown in the Figure 6.4.
The device is identified by putting the appropriate address on the I/O address lines. The
address decoder generates the signal for the address decode lines. According to the
operation, read or write, it selects one of the two latches. If it is a write operation, the data
are placed in the latch from the CPU for onward transmission to the output device.
Figure 6.4: Read / Write port
If it is a read operation, the data already stored in the latch are transferred to
the CPU. A read-only (input) port is simply the lower half of the Figure 6.4. In case of I/O-
mapped I/O, a separate address space is used for I/O devices, distinct from the address
space for memory. In case of memory-mapped I/O, the same address space is used for both
memory and I/O devices, and some of the memory address space is kept reserved for I/O
devices. To the programmer, the difference between I/O-mapped and memory-mapped
input/output operation is the instructions to be used. With memory-mapped I/O, any
instruction that accesses memory can access a memory-mapped I/O port. I/O-mapped
input/output uses special instructions to access I/O ports.
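A small Python sketch of the write-only port of Figure 6.3 may help: the latch captures the data bus only when both the address-decode and write-control lines are active. The class and signal names are illustrative.

class WriteOnlyPort:
    def __init__(self):
        self.latch = 0   # value currently driving the external device

    def bus_cycle(self, address_decode, write_control, data_bus):
        # Both enabling conditions must hold for the latch to capture data.
        if address_decode and write_control:
            self.latch = data_bus

port = WriteOnlyPort()
port.bus_cycle(address_decode=True, write_control=True, data_bus=0x41)
print(hex(port.latch))   # 0x41 is now held for the device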
Generally, a given peripheral device will use more than a single I/O port. A typical PC
parallel printer interface, for example, uses three ports: a read/write port, an input port and
an output port. The read/write port is the data port (it is read/write to allow the CPU to read
back the last ASCII character it wrote to the printer port). The input port returns status
signals from the printer; these signals indicate whether the printer is ready to accept another
character, is off-line, is out of paper, etc. The output port transmits control information to
the printer, such as a signal to initialize the printer or to strobe a character to it.
Memory-mapped I/O subsystems and I/O-mapped subsystems both require the CPU to
move data between the peripheral device and main memory.
For example, to input a sequence of 20 bytes from an input port and store these bytes into
memory, the CPU must read each value from the port and store it into memory.
Check your progress I
1. _________ connected to a computer need special communication links for interfacing them
with the central processing unit.
2. In ______ we have common bus(data and address) for I/O and memory but separate read
and write control lines for I/O.
3. In _______________ every bus in common due to which the same set of instructions work
for memory and I/O.
12.4 Programmed I/O
In programmed I/O, the data transfer between CPU and I/O device is carried out with the
help of a software routine.
When the processor issues a command to an I/O module, the module performs the
requested action and then sets the appropriate bits in the I/O status register.
It is the responsibility of the processor to check the status of the I/O module periodically
until it finds that the operation is complete.
In programmed I/O, when the processor issues a command to an I/O module, it must wait
until the I/O operation is complete. Generally, the I/O devices are slower than the
processor, so in this scheme CPU time is wasted: the CPU checks the status of the I/O
module periodically without doing any other work.
The I/O commands that the processor issues to an I/O module may be classified as follows:
• Control: Used to activate a peripheral device and instruct it what to do. For
example, a magnetic tape unit may be instructed to rewind or to move forward one
record. These commands are specific to a particular type of peripheral device.
• Test: Used to test various status conditions associated with an I/O module and its
peripherals. The processor will want to know if the most recent I/O operation is
completed or any error has occurred.
• Read: Causes the I/O module to obtain an item of data from the peripheral and
place it in the internal buffer.
• Write: Causes the I/O module to take an item of data (byte or word) from the data
bus and subsequently transmit the data item to the peripheral.
This type of I/O operation, where the CPU constantly tests a port to see if data is available,
is called polling: the CPU polls (asks) the port whether it has data available or whether it is
capable of accepting data. Polled I/O is inherently inefficient.
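A minimal sketch of such a polling loop, assuming a status register with a READY bit and a data register (both represented here by callables, since the real accesses would be port or memory reads):

def polled_input(read_status, read_data, ready_mask=0x01):
    # Busy-wait until the device reports READY, then read the data.
    while not (read_status() & ready_mask):
        pass              # CPU cycles are wasted spinning here
    return read_data()

The loop makes the inefficiency visible: every pass through the while loop is processor time spent doing nothing useful.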
The solution to this problem is to provide an interrupt mechanism. In this approach the
processor issues an I/O command to a module and then goes on to do some other useful
work. The I/O module then interrupts the processor to request service when it is ready to
exchange data with the processor. The processor then executes the data transfer. Once the
data transfer is over, the processor resumes its former processing.
• The processor issues a READ command.
• It then does something else (e.g. the processor may be working on several
different programs at the same time)
• At the end of each instruction cycle, the processor checks for interrupts
• When the interrupt from an I/O module occurs, the processor saves the context
(e.g. program counter & processor registers) of the current program and
processes the interrupt.
• In this case, the processor reads the word of data from the I/O module and
stores it in memory.
• It then restores the context of the program it was working on and resumes
execution.
The occurrence of an interrupt triggers a number of events, both in the processor hardware
and in software. When an I/O device completes an I/O operation, the following sequence
of hardware events occurs:
• The device issues an interrupt signal to the processor.
• The processor finishes execution of the current instruction before responding to the interrupt.
• The processor tests for the interrupt and sends an acknowledgement signal to the device that issued it.
• The processor saves the program counter and PSW of the current program on the control stack.
• The processor loads the program counter with the entry address of the interrupt-handling routine.
The data changes in memory and registers during interrupt service are shown in the Figure 6.5.
Figure 6.5: Changes of memory and register for an interrupt
• The interrupt service routine starts at location X and the return instruction is at location
X + L.
• After fetching the return instruction, the value of the program counter becomes X + L + 1.
• While returning to the user's program, the processor must restore the earlier values.
• From the control stack, it restores the values of the program counter and the general
registers.
• Accordingly, it resets the top of the stack and the stack pointer is updated.
• Now the processor resumes execution of the user's (interrupted) program from memory
location N + 1.
The data changes in memory and registers during return from an interrupt are shown in the
Figure 6.6.
Once the program counter has been loaded, the processor proceeds to the next instruction
cycle, which begins with an instruction fetch. Control is thus transferred to the interrupt
handler routine for the current interrupt.
1. At this point, the program counter and PSW of the interrupted program have
been saved on the system stack. In addition, some more information relating to the
current processor state must be saved, in particular the contents of the processor
registers, because these registers may be used by the interrupt handler.
Typically, the interrupt handler begins by saving the contents of all registers on the
stack.
2. The interrupt handler next processes the interrupt. This includes an examination of
status information relating to the I/O operation or other event that caused the
interrupt.
3. When interrupt processing is complete, the saved register values are retrieved from
the stack and restored to the registers.
4. The final act is to restore the PSW and program counter values from the stack. As a
result, the next instruction to be executed will be from the previously interrupted
program.
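The save-and-restore discipline of steps 1-4 can be summarized in a Python sketch; the state dictionary and handler below are illustrative stand-ins for the real hardware and software actions.

def run_handler(state):
    # Stand-in for the device service routine; it may clobber registers.
    state["registers"][0] = 0xFF

def service_interrupt(state, handler_pc):
    state["stack"].append((state["pc"], state["psw"]))   # save PC and PSW
    state["stack"].append(list(state["registers"]))      # handler saves registers
    state["pc"] = handler_pc                             # enter the handler
    run_handler(state)
    state["registers"] = state["stack"].pop()            # restore registers
    state["pc"], state["psw"] = state["stack"].pop()     # restore PC and PSW

cpu = {"pc": 100, "psw": 0, "registers": [0, 0, 0, 0], "stack": []}
service_interrupt(cpu, handler_pc=500)
print(cpu["pc"], cpu["registers"])   # back at 100 with registers restored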
Two design issues arise in implementing interrupt I/O:
• There will almost invariably be multiple I/O modules; how does the processor
determine which device issued the interrupt?
• If multiple interrupts have occurred, how does the processor decide which one to
process?
Four general categories of techniques are in common use: multiple interrupt lines,
software poll, daisy chain and bus arbitration.
The most straightforward approach is to provide multiple interrupt lines between the
processor and the I/O modules. However, it is impractical to dedicate more than a few bus
lines or processor pins to interrupt lines. Thus, even though multiple interrupt lines are
used, it is most likely that each line will have multiple I/O modules attached to it, and one
of the other three techniques must then be used on each line.
With a software poll, when the processor detects an interrupt, it branches to an interrupt
service routine whose job is to poll each I/O module to determine which module caused
the interrupt.
The poll could be implemented with the help of a separate command line (e.g. TEST I/O).
In this case, the processor raises TEST I/O and places the address of a particular I/O module
on the address lines. The I/O module responds positively if it raised the interrupt.
Alternatively, each I/O module could contain an addressable status register. The processor
then reads the status register of each I/O module to identify the interrupting module.
Once the correct module is identified, the processor branches to a device service routine
specific to that device.
The main disadvantage of the software poll is that it is time consuming: the processor has
to check the status of each I/O module, and in the worst case the number of checks equals
the number of I/O modules.
In the daisy chain method, all I/O modules share a common interrupt request line, but
the interrupt acknowledge line is connected in a daisy chain fashion. When the
processor senses an interrupt, it sends out an interrupt acknowledgement.
The interrupt acknowledge signal propagates through a series of I/O modules until it gets to
a requesting module.
The requesting module typically responds by placing a word on the data lines. This word is
referred to as a vector and is either the address of the I/O module or some other unique
identification.
In either case, the processor uses the vector as a pointer to the appropriate device service
routine. This avoids the need to execute a general interrupt service routine first. This
technique is referred to as a vectored interrupt. The daisy chain arrangement is shown in
the Figure 6.7.
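The priority property of the daisy chain follows directly from the way the acknowledge signal propagates, as this Python sketch shows (the module structure and vector values are invented for illustration):

from collections import namedtuple

Module = namedtuple("Module", ["requesting", "vector"])

def propagate_acknowledge(chain):
    # The first requesting module absorbs the acknowledge signal and
    # places its vector on the data lines; modules nearer the processor
    # therefore have higher priority.
    for module in chain:
        if module.requesting:
            return module.vector
    return None   # no module was requesting (spurious interrupt)

chain = [Module(False, 0x10), Module(True, 0x20), Module(True, 0x30)]
print(hex(propagate_acknowledge(chain)))   # 0x20: nearest requester wins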
In the bus arbitration method, an I/O module must first gain control of the bus before it can
raise the interrupt request line. Thus, only one module can raise the interrupt line at a time.
When the processor detects the interrupt, it responds on the interrupt acknowledge line.
The requesting module then places its vector on the data lines.
The above techniques identify the requesting I/O module; they also
provide a way of assigning priorities when more than one device is requesting interrupt
service. With multiple lines, the processor just picks the interrupt line with the highest
priority; priorities may be assigned to the interrupt lines during the processor design phase
itself. With software polling, the order in which the modules are polled determines their priority.
In case of daisy chain configuration, the priority of a module is determined by the position
of the module in the daisy chain. The module nearer to the processor in the chain has got
higher priority, because this is the first module to receive the acknowledge signal that is
generated by the processor.
In case of the bus arbitration method, more than one module may need control of the bus.
Since only one module at a time can successfully transmit over the bus, some method of
arbitration is needed. The various methods can be classified into two groups: centralized
and distributed. In a centralized scheme, a single hardware device, referred to as a bus
controller or arbiter, allocates time on the bus.
In a distributed scheme, there is no central controller. Rather, each module contains access
control logic and the modules act together to share the bus.
On one interrupt line, more than one device can be connected in daisy chain fashion. The
high-priority devices should be connected to the interrupt lines that have higher
priority. A possible arrangement is shown in the Figure 6.8.
12.5.6 Interrupt Nesting
The arrival of an interrupt request from an external device causes the processor to suspend
the execution of one program and start the execution of another: the interrupt service
routine for that device.
Interrupts may arrive at any time, so during the execution of an interrupt service routine,
another interrupt may arrive. This situation is known as nesting of interrupts.
Whether interrupt nesting is allowed or not is a design issue. Generally, nesting of
interrupts is allowed, but with some restrictions. The common notion is that a high priority
device may interrupt a low priority device, but not vice-versa.
To accommodate such restrictions, all computers provide the programmer with the
ability to enable and disable interruptions at various times during program execution.
The processor provides instructions to enable and disable interrupts.
If interrupts are disabled, the CPU will not respond to any interrupt signal.
On the other hand, when multiple lines are used for interrupts and priorities are assigned to
these lines, an interrupt received on a low priority line will not be served while an
interrupt routine for a high priority device is in execution. After completion of the interrupt
service routine of the high priority device, the processor will respond to the interrupt
requests of the low priority devices.
12.6 Direct Memory Access
The I/O transfer rate is limited by the speed with which the processor can test and service a
device. The processor is tied up in managing the I/O transfer; a number of instructions must
be executed for each I/O transfer.
To transfer large blocks of data at high speed, a special control unit may be provided to
allow transfer of a block of data directly between an external device and the main memory,
without continuous intervention by the processor. This approach is called direct memory
access, or DMA.
DMA transfers are performed by a control circuit associated with the I/O device; this
circuit is referred to as the DMA controller. The DMA controller allows direct data transfer
between the device and the main memory without involving the processor.
To transfer data between memory and I/O devices, the DMA controller takes over control
of the system from the processor, and the transfer of data takes place over the system bus. For
this purpose, the DMA controller must use the bus only when the processor does not need
it, or it must force the processor to suspend operation temporarily. The latter technique is
more common and is referred to as cycle stealing, because the DMA module in effect
steals a bus cycle. A typical block diagram of a DMA controller is shown in the Figure
6.9.
When the processor wishes to read or write a block of data, it issues a command to the
DMA module, by sending to the DMA module the following information.
• Whether a read or write is requested, using the read or write control line between
the processor and the DMA module.
• The address of the I/O device involved, communicated on the data lines.
• The starting location in the memory to read from or write to, communicated on data
lines and stored by the DMA module in its address register.
• The number of words to be read or written, again communicated via the data lines
and stored in the data count register.
Figure 6.9: Typical DMA block diagram
The processor then continues with other works. It has delegated this I/O operation to the
DMA module.
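The hand-off can be sketched as the processor writing the four items above into the controller's registers and then moving on; the register and class names below are assumptions modelled on Figure 6.9, not a real device interface.

class DMAController:
    def __init__(self):
        self.read = None            # read or write request
        self.device_address = None  # which I/O device is involved
        self.address_register = 0   # starting main-memory location
        self.data_count = 0         # number of words to transfer

def start_block_transfer(dma, read, device, mem_start, word_count):
    # The processor programs the controller, then continues other work;
    # the controller interrupts when the whole block has been moved.
    dma.read = read
    dma.device_address = device
    dma.address_register = mem_start
    dma.data_count = word_count

dma = DMAController()
start_block_transfer(dma, read=True, device=0x07, mem_start=0x2000,
                     word_count=128)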
The DMA module checks the status of the I/O device whose address was communicated to
the DMA controller by the processor. If the specified I/O device is ready for data transfer, the
DMA module raises a DMA request to the processor. The processor then indicates
the release of the system bus through DMA acknowledge.
The DMA module transfers the entire block of data, one word at a time, directly to or from
memory, without going through the processor.
When the transfer is completed, the DMA module sends an interrupt signal to the
processor. After receiving the interrupt signal, the processor takes over the system bus
again. Thus, the processor is involved only at the beginning and end of the transfer; during
the transfer itself, the processor is suspended only for the bus cycles stolen by the DMA
module.
It is not necessary to complete the current instruction before suspending the processor; the
processor may be suspended just after the completion of the current bus cycle. On the other
hand, the processor can be suspended just before it needs the system bus, because the
DMA controller uses the system bus, not the processor. The points in the instruction cycle
where the processor may be suspended are shown in the Figure 6.10.
When the processor is suspended, the DMA module transfers one word and returns
control to the processor.
Note that this is not an interrupt; the processor does not save a context and do something
else. Rather, the processor pauses for one bus cycle.
During that time the processor may perform some other task that does not involve the
system bus. In the worst situation, the processor will wait for some time until the DMA
module releases the bus.
The net effect is that the processor runs more slowly. But the overall effect is an
enhancement of performance, because for a multiple-word I/O transfer, DMA is far more
efficient than interrupt-driven or programmed I/O.
The DMA mechanism can be configured in different ways. The most common amongst
them are:
• Single bus, detached DMA - I/O configuration.
• Single bus, Integrated DMA - I/O configuration.
• Using separate I/O bus.
In the single bus, detached DMA - I/O configuration, all modules share the same system
bus. The DMA module here acts as a surrogate processor. This method uses programmed
I/O to exchange data between memory and an I/O module through the DMA module.
For each transfer it uses the bus twice: first when transferring the data between
I/O and DMA, and second when transferring the data between DMA and
memory. Since the bus is used twice while transferring data, the processor is suspended
twice; the transfer consumes two bus cycles. The interconnection organization is shown in
the Figure 6.11.
By integrating the DMA and I/O functions, the number of required bus cycles can be reduced.
In this configuration, the DMA module and one or more I/O modules are integrated
in such a way that the system bus is not involved in the transfers between them. The DMA
logic may actually be a part of an I/O module, or it may be a separate module that controls
one or more I/O modules.
The DMA module, processor and memory module are connected through the system
bus. In this configuration, each transfer uses the system bus only once, and so the
processor is suspended only once.
The system bus is not involved when transferring data between the DMA module and the
I/O device, so the processor is not suspended then; the processor is suspended only when
data are transferred between the DMA module and memory. The configuration is shown in
the Figure 6.12.
In the third configuration, the I/O modules are connected to the DMA module through a
separate I/O bus. This reduces the number of I/O interfaces on the DMA module to one.
Transfer of data between an I/O module and the DMA module is carried out through this
I/O bus; the system bus is not in use, and so it is not needed to suspend the processor.
There is another transfer phase, between the DMA module and memory. At this time the
system bus is needed for the transfer and the processor will be suspended for one bus
cycle. The configuration is shown in the Figure 6.13.
Check your progress II
1. To transfer large blocks of data at high speed, a special control unit may be provided
between an external device and the main memory, without continuous intervention by the
processor. This approach is called _________________.
2. Pre-Emption of low priority Interrupt by another high priority interrupt is known
as______________.
3. ______________ is the process by which the next device to become the bus master is
selected and bus mastership is transferred to it.
4. DMA transfers are performed by a control circuit associated with the I/O device and this
circuit is referred to as ______________.
5. In the ______________ method, all I/O modules share a common interrupt request line.
12.7 Summary
1. The computer system's input/output (I/O) architecture is its interface to the outside
world.
2. The internal resources, such as main memory and the system bus, must be shared
among a number of activities, including data I/O.
3. The I/O function includes a control and timing requirement to co-ordinate the flow
of traffic between internal resources and external devices.
4. The data buffering is required due to the mismatch of the speed of CPU, memory
and other peripheral devices.
5. The I/O module also performs error detection and subsequently reports errors to the
processor.
6. With programmed I/O, the processor executes a program that gives it direct
control of the I/O operation, including sensing device status, sending a read or write
command, and transferring the data.
7. With interrupt driven I/O, the processor issues an I/O command, continues to
execute other instructions, and is interrupted by the I/O module when the I/O
module completes its work.
8. In Direct Memory Access (DMA), the I/O module and main memory exchange
data directly without processor involvement.
9. Memory-mapped I/O subsystems and I/O-mapped subsystems both require the
CPU to move data between the peripheral device and main memory.
10. The arrival of an interrupt request from an external device causes the processor to
suspend the execution of one program and start the execution of another.
11. The DMA controller allows direct data transfer between the device and the main
memory without involving the processor.
Answers to Check your progress I
1. Peripherals
2. Isolated I/O
3. Memory mapped I/O
INPUT/OUTPUT DEVICE CONNECTION
13.1 Buses
The processor, main memory, and I/O devices can be interconnected through common data
communication lines which are termed a common bus. The primary function of a common
bus is to provide a communication path between the devices for the transfer of data. The
bus includes the control lines needed to support interrupts and arbitration. The bus lines
used for transferring data may be grouped into three categories:
• data,
• address
• control lines.
A single R/W̅ line is used to indicate a Read or Write operation. When several sizes are
possible, like byte, word, or long word, control signals are required to indicate the size of
the data. The bus control signals also carry timing information to specify the times at which
the processor and the I/O devices may place data on the bus or receive data from the bus.
Several schemes exist for handling the timing of data transfers over a bus. These
can be broadly classified as:
• Synchronous bus
• Asynchronous bus
In a synchronous bus, all the devices are synchronized by a common clock, so all devices
derive timing information from a common clock line of the bus. Pulses on this
common clock line define equal time intervals. In the simplest form of a synchronous bus,
each of these clock pulses constitutes a bus cycle during which one data transfer can take
place. The timing of an input transfer on a synchronous bus is shown in the Figure 7.1.
In any data transfer operation, one device plays the role of a master: it initiates the transfer
by issuing read or write commands on the bus. Normally, the processor acts as the master,
but another device with DMA capability may also become the bus master. The device
addressed by the master is referred to as the slave or target device.
Let us consider the sequence of events during an input (read) operation. At time t0, the
master places the device address on the address lines and sends an appropriate command
(read, in the case of input) on the command lines.
The command also indicates the length of the operand to be read, if necessary. The clock
pulse width, t1 - t0, must be longer than the maximum propagation delay between two
devices connected to the bus. After decoding the information on the address and control
lines, the addressed slave device responds by placing the required input data on the data
lines at time t1.
At the end of the clock cycle, at time t2, the master strobes the data on the data lines into its
input buffer. The period t2 - t1 must be greater than the maximum propagation delay on the
bus plus the setup time of the input buffer register of the master. A similar procedure is
followed for an output operation: the master places the output data on the data lines when
it transmits the address and command information, and at time t2 the addressed device strobes
the data lines and loads the data into its data buffer.
The simple design of device interface by synchronous bus has some limitations.
• A transfer has to be completed within one clock cycle. The clock period must be
long enough to accommodate the slowest device interfaced. This forces all
devices to operate at the speed of the slowest device.
• The processor, or the master, has no way to determine whether the addressed device
has actually responded. It simply assumes that the output data have been received
by the device or that the input data are available on the data lines.
To solve these problems, most buses incorporate control signals that represent a response
from the device. These signals inform the master that the target device has recognized its
address and is ready to participate in the data transfer operation. They also make it possible
to adjust the duration of the data transfer period to suit the needs of the participating devices.
A high-frequency clock is used, so that a complete data transfer operation spans several
clock cycles. The number of clock cycles involved can vary from device to device. An
instance of this scheme is shown in the Figure 7.2.
In clock cycle 1, the master sends the address and command on the bus, requesting a read
operation. The target device responds in clock cycle 3 by indicating that it is ready to
participate in the data transfer, making the slave-ready signal high, and then places the data
on the data lines. The target device is a slower device and needs two clock cycles to transfer
the information; after two clock cycles, that is at clock cycle 5, it pulls the slave-ready signal
down. When the slave-ready signal goes down, the master strobes the data from the data bus
into its input buffer. If the addressed device does not respond at all, the master waits for some
predefined maximum number of clock cycles and then aborts the operation.
In the asynchronous mode of transfer, a handshake between master and slave is used.
• In an asynchronous bus, there is no common clock; the common clock signal is
replaced by two timing control signals: master-ready and slave-ready.
• The master-ready signal is asserted by the master to indicate that it is ready for a
transaction, and the slave-ready signal is the response from the slave.
• The master waits for slave-ready to become asserted before it removes its signals
from the bus.
• In case of a read operation, it also strobes the data into its input buffer.
The timing of an input data transfer using the handshake scheme is shown in the Figure
7.3.
The timing of an output operation using handshaking scheme is shown in the Figure 7.4.
Figure 7.4: Handshake control of data transfer during an output operation
Check your progress I
1. A_________ is a subsystem that is used to connect computer components and transfer data
between them.
2. In a ___________ bus, all the devices are synchronized by a common clock, so all devices
derive timing information from a common clock line of the bus.
3. In ______________ bus, there is no common clock, and the common clock signal is
replaced by two timing control signals: master-ready and slave-ready.
13.2 External Memory
The main memory is made up of semiconductor devices and by nature it is volatile. For
permanent storage of information, we need some non-volatile memory. The memory
devices used to store information permanently are termed external memory. While
working, the information is transferred from the external memory to the main memory. The
devices used to store information permanently are either magnetic or optical devices.
13.2.1 Magnetic Disk
The write mechanism is based on the fact that electricity flowing through a coil produces a
magnetic field. Pulses are sent to the head, and magnetic patterns are recorded on the
surface below; the pattern depends on whether the current is positive or negative. The
direction of the current depends on the information to be stored, i.e., positive current for
the information '1' and negative current for the information '0'.
The read mechanism is based on the fact that a magnetic field moving relative to a coil
produces an electric current in the coil. When the surface of the disk passes under the head,
it generates a current of the same polarity as the one already recorded. The read/write head
detail is shown in the Figure 7.5.
The head is a relatively small device capable of reading from or writing to a portion of the
platter rotating beneath it. The data on the disk are organized in a concentric set of rings,
called tracks. Each track has the same width as the head. Adjacent tracks are separated by
gaps, which prevents errors due to misalignment of the head or interference of magnetic
fields. To simplify the control circuitry, the same number of bits is stored on each
track; thus the density, in bits per linear inch, increases in moving from the outermost
track to the innermost track. Data are transferred to and from the disk in blocks. Usually,
the block is smaller than the capacity of the track. Accordingly, data are stored in block-
size regions known as sectors. A typical disk layout is shown in the Figure 7.6.
Since the data density is lower in the outermost track and higher in the inner tracks, there
is wastage of space on the outer tracks. To increase the capacity, the concept of zones is
used. Each track is divided into zones of equal length, and a fixed amount of
data is stored in each zone. So, the number of zones is smaller on the innermost track and
larger on the outermost track; therefore, a greater number of bits is stored on the
outermost track. The disk capacity increases due to the use of zones, but the complexity
of the control circuitry also increases. The concepts of sector and zone of a track are shown
in the Figure 7.7.
Figure 7.7: Sector and zone of a disk track
• The head may be either fixed or movable with respect to the radial direction of the
platter.
• In a fixed-head disk, there is one read-write head per track. All of the heads are
mounted on a rigid arm that extends across all tracks.
• In a movable-head disk, there is only one read-write head. Again the head is
mounted on an arm. Because the head must be able to be positioned above any
track, the arm can be extended or retracted for this purpose. The fixed-head and
movable-head disks are shown in the Figure 7.8.
• The disk itself is mounted in a disk drive, which consists of the arm, the shaft that
rotates the disk, and the electronic circuitry needed for the input and output of the
binary data and to control the mechanism.
• A non-removable disk is permanently mounted on the disk drive. A removable
disk can be removed and replaced with another disk.
• For most disks, the magnetizable coating is applied to both sides of the platters,
which is then referred to as double sided. If the magnetizable coating is applied to
one side only, then it is termed as single sided disk.
Some disk drives accommodate multiple platters stacked vertically above one another,
with multiple arms provided for the read/write heads. The platters come as a unit known as
a disk pack. The physical organization of a multiple-platter disk is shown in the Figure 7.9.
Figure 7.8: Fixed and Movable head disk
The organization of data on a disk and its addressing format is shown in the Figure 7.10.
Each surface is divided into concentric tracks and each track is divided into sectors. The set
of corresponding tracks on all surfaces of a stack of disks forms a logical cylinder. Data
bits are stored serially on each track. Data on disks are addressed by specifying the surface
number, the track number, and the sector number. In most disk systems, read and write
operations always start at sector boundaries. If the number of words to be written is smaller
than that required to fill a sector, the disk controller repeats the last bit of data for the
remainder of the sector. During a read or write operation, it is required to specify the
starting address of the sector from where the operation will start; that is, the read/write head
must be positioned at the correct track, sector and surface. Therefore, the address of the disk
contains the track number, sector number and surface number. If more than one drive is
present, then the drive number must also be specified.
The format of the disk address word is shown in the figure. It contains the drive number,
track number, surface number and sector number.
To bring the correct sector below the read/write head, the disk is rotated, using the sector
number from the address. Once the correct sector, track and surface are selected, the
read/write operation starts.
Suppose that the disk system has 8 data recording surfaces, with 4096 tracks per surface
and 256 sectors per track. Then the disk address word needs 3 bits for the surface number
(2^3 = 8), 12 bits for the track number (2^12 = 4096) and 8 bits for the sector number
(2^8 = 256), i.e., 23 bits in all, plus a drive number field if more than one drive is present.
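For this example, the address word can be packed mechanically, as the following Python sketch shows (the ordering of the fields within the word is assumed for illustration):

def disk_address_word(surface, track, sector):
    # 3 + 12 + 8 = 23 bits for 8 surfaces, 4096 tracks, 256 sectors.
    assert 0 <= surface < 8 and 0 <= track < 4096 and 0 <= sector < 256
    return (surface << 20) | (track << 8) | sector

print(bin(disk_address_word(surface=5, track=1000, sector=17)))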
For a moving-head system, there are two components involved in the time delay between
receiving an address and the beginning of the actual data transfer.
Seek Time: Seek time is the time required to move the read/write head to the proper track.
This depends on the initial position of the head relative to the track specified in the
address.
Rotational Delay: Rotational delay, also called the latency time is the amount of time that
elapses after the head is positioned over the correct track until the starting position of the
addressed sector comes under the Read/write head.
The read/write head is first positioned at the correct track. In a fixed-head
system, the correct head is selected using the track number from the address. In a
movable-head system, the head is moved so that it is positioned at the correct track.
Communication between a disk and the main memory is done through DMA. The
following information must be exchanged between the processor and the disk controller in
order to specify a transfer.
Main memory address: The address of the first main memory location of the block of
words involved in the transfer.
Disk address: The location of the sector containing the beginning of the desired block
of words.
Word count: The number of words in the block to be transferred.
The word count may correspond to fewer or more bytes than are contained in a
sector. When the data block is longer than a track:
The disk address register is incremented as successive sectors are read or written. When
one track is completed, the surface count is incremented by 1.
Thus, long data blocks are laid out on cylinder surfaces as opposed to being laid out on
successive tracks of a single disk surface.
This is efficient for moving-head systems, because successive sectors of data storage
on the disk can be accessed by electrically switching from one read/write head to the next
rather than by mechanically moving the arm from track to track. Track-to-track
movement is required only at cylinder-to-cylinder boundaries.
The sum of the seek time (for a movable head system) and the rotational delay is termed the access time of the disk: the time it takes to get into the appropriate position (track and sector) to read or write.
Once the head is in position, the read or write operation is then performed as the sector
moves under the head, and the data transfer takes place.
Seek Time: Seek time is the time required to move the disk arm to the required track. The seek time is commonly approximated as

Ts = m x n + s

where Ts is the estimated seek time, n is the number of tracks traversed, m is a constant that depends on the disk drive, and s is the startup time.
Rotational Delay: Disk drives generally rotate at 3600 rpm, i.e., one revolution takes around 16.7 ms. Thus, on the average, the rotational delay will be 8.3 ms (half a revolution).
Transfer Time: The transfer time to or from the disk depends on the rotational speed of the disk and is estimated as

T = b / (rN)

where
T = transfer time
b = number of bytes to be transferred
N = number of bytes on a track
r = rotational speed, in revolutions per second.
Thus, the total average access time can be expressed as

Ta = Ts + 1/(2r) + b/(rN)

where Ts is the average seek time, 1/(2r) is the average rotational delay, and b/(rN) is the transfer time.
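As a worked illustration (the drive parameters below are hypothetical, not from the text), the formula can be evaluated directly in Python:

def access_time(Ts, r, b, N):
    # Ts: average seek time in seconds, r: revolutions per second,
    # b: bytes to transfer, N: bytes per track.
    rotational_delay = 1.0 / (2.0 * r)   # half a revolution on average
    transfer_time = b / (r * N)
    return Ts + rotational_delay + transfer_time

# A 3600 rpm drive (r = 60 rev/s, as in the text), 10 ms average seek,
# transferring 4096 bytes from a track holding 32768 bytes:
t = access_time(Ts=0.010, r=60.0, b=4096, N=32768)
print(round(t * 1000, 2), "ms")   # about 20.42 ms

Note how the seek time and the 8.3 ms rotational delay dominate the small transfer time, which is why disk scheduling concentrates on reducing seeks.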
Disks are a potential bottleneck for system performance and storage-system reliability. The disk access time is much higher than the time required to access data from main memory or to perform a CPU operation. Also, the disk drive contains mechanical parts and involves mechanical movement, so its failure rate is relatively high. Although disk performance has been improving continuously, microprocessor performance has improved much more rapidly.
13.3.2 Data Striping
In data striping, the data is segmented into equal-size partitions distributed over multiple disks. The size of a partition is called the striping unit. The partitions are usually distributed using a round-robin algorithm: if the disk array consists of D disks, then partition i is written to disk (i mod D).
Consider a striping unit equal to a disk block. In this case, I/O requests of the size of a disk
block are processed by one disk in the array.
If many I/O requests of the size of a disk block are made, and the requested blocks reside
on different disks, we can process all requests in parallel and thus reduce the average
response time of an I/O request.
Since the striping units are distributed over the disks of the array in round-robin fashion, large I/O requests spanning many contiguous blocks involve all disks. We can process such a request with all disks in parallel and thus increase the transfer rate.
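A minimal Python sketch of this round-robin placement (the array size and block numbering are illustrative):

def place_partition(i, D):
    # Partition i goes to disk (i mod D); (i // D) is its stripe row on that disk.
    return i % D, i // D

D = 4   # a hypothetical 4-disk array
for i in range(8):
    disk, row = place_partition(i, D)
    print("partition", i, "-> disk", disk, ", stripe row", row)

Eight consecutive partitions land on disks 0, 1, 2, 3, 0, 1, 2, 3, which is why a large sequential request keeps every disk busy at once.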
Disk arrays that implement a combination of data striping and redundancy are called
Redundant Arrays of Independent Disks (RAID).
13.3.3 Redundancy
While having more disks increases storage-system performance, it also lowers overall storage-system reliability, because the probability that some disk in the array fails increases with the number of disks. The reliability of a disk array can be increased by storing redundant information: if a disk fails, the redundant information is used to reconstruct the data on the failed disk. One design issue here is where to store the redundant information. There are two choices: either store the redundant information on a small number of check disks, or distribute it uniformly over all disks. In a RAID system, the disk array is partitioned into reliability groups, where a reliability group consists of a set of data disks and a set of check disks. A common redundancy scheme is applied to each group.
13.3.4 RAID Levels
A RAID level 0 system is not a true member of the RAID family, because it does not include redundancy; that is, no redundant information is maintained. With RAID 0, the user and system data are distributed across all of the disks in the array, i.e., the data are striped across the available disks.
If two different I/O requests are pending for two different data blocks, there is a good probability that the requested blocks are on different disks. Thus, the two requests can be issued in parallel, reducing the I/O waiting time. RAID level 0 is a low-cost solution, but reliability is a problem, since there is no redundant information from which to recover in case of a disk failure.
RAID level 0 has the best write performance of all RAID levels, because no redundant information needs to be updated.
RAID level 1 is the most expensive solution for achieving redundancy. In this system, two identical copies of the data are maintained on two different disks. This type of redundancy is called mirroring. Every write of a disk block involves two writes, one for each copy of the block. These writes should not be performed simultaneously, since a global system failure might occur while both blocks are being written and leave both copies in an inconsistent state. Therefore, a block is written on one disk first, and then the other copy is written on the mirror disk. A read of a block can be scheduled to whichever disk has the smaller access time. Since full redundant information is maintained, a less costly disk may be used for the mirror copy to reduce the overall cost.
RAID levels 2 and 3 make use of a parallel access technique, where all member disks participate in the execution of every I/O request. Data striping is used in RAID levels 2 and 3, but the strips are very small, often as small as a single byte or word. With RAID 2, an error-correcting code is calculated across corresponding bits on each data disk, and the bits of the code are stored in the corresponding bit positions on multiple parity disks.
RAID 2 requires fewer disks than RAID 1. The number of redundant disks is proportional to the log of the number of data disks. For error correction, it uses a Hamming code.
On a single read, all disks are simultaneously accessed. The requested data and the
associated error correcting code are delivered to the array controller. If there is a single bit
error, the controller can recognize and correct the error instantly, so that read access time is
not slowed down.
On a single write, all data disks and parity disks must be accessed for the write operation.
RAID level 3 is organized in a similar fashion to RAID level 2. The difference is that
RAID 3 requires only a single redundant disk.
Instead of an error correcting code, a simple parity bit is computed for the set of individual
bits in the same position on all of the data disks.
In the event of a drive failure, the parity drive is accessed and the data is reconstructed from the remaining drives. Once the failed drive is replaced, the missing data can be restored on the new drive.
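The parity used here is simply the bitwise XOR of the corresponding bits on the data disks, which is exactly why one failed drive can be rebuilt from the survivors. A hedged Python sketch with toy byte strips (not a real controller):

from functools import reduce

def xor_blocks(blocks):
    # Bitwise XOR of equal-length byte blocks.
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

data = [b"\x0f\x12", b"\xa0\x03", b"\x55\xff"]   # strips on three data disks
parity = xor_blocks(data)                        # strip on the parity disk

# Suppose disk 1 fails: XOR the surviving data strips with the parity strip.
rebuilt = xor_blocks([data[0], data[2], parity])
assert rebuilt == data[1]                        # the lost strip is recovered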
RAID levels 4 through 6 make use of an independent access technique, where each member disk operates independently, so that separate I/O requests can be satisfied in parallel. Data striping is used in this scheme also, but the data strips are relatively large for RAID levels 4 through 6.
With RAID 4, a bit-by-bit parity strip is calculated across corresponding strips on each data disk, and the parity bits are stored in the corresponding strip on the parity disk. RAID 4 involves a write penalty when a small I/O write request occurs: each time a write occurs, both the user data and the corresponding parity strip must be updated.
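For such a small write, the controller need not read every disk: the new parity can be computed from the old data, the old parity, and the new data. A minimal sketch of this read-modify-write rule (names are illustrative):

def updated_parity(old_data, old_parity, new_data):
    # Small-write rule for RAID 4/5: P' = P xor D_old xor D_new.
    return bytes(p ^ od ^ nd for p, od, nd in zip(old_parity, old_data, new_data))

This is still four disk accesses (read old data and old parity, write new data and new parity), which is the write penalty mentioned above.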
RAID level 5 is similar to RAID 4; the only difference is that RAID 5 distributes the parity strips across all disks. The distribution of parity strips across all drives avoids the potential I/O bottleneck of a single parity disk.
In RAID level 6, two different parity calculations are carried out and stored in separate
blocks on different disks.
The advantage of RAID 6 is high data availability, because the data can be regenerated even if two disks containing user data fail. This is possible due to the use of a Reed-Solomon code for the parity calculations. In RAID 6, there is a substantial write penalty, because each write affects two parity blocks.
13.4 Summary
1. The primary function of a common bus is to provide a communication path
between the devices for the transfer of data.
2. The bus lines used for transferring data may be grouped into three categories: data,
address and control lines.
3. In the simplest form of a synchronous bus, each of these clock pulse constitutes a
bus cycle during which one data transfer can take place.
4. In asynchronous mode of transfer, a handshake signal is used between master and
slave.
5. The main memory is made up of semiconductor devices and is volatile by nature.
6. The memory devices needed to store information permanently are termed external memory.
7. A disk is a circular platter constructed of metal or of plastic coated with a
magnetizable material.
8. Head is a relatively small device capable of reading from or writing to a portion of
the platter rotating beneath it.
9. Each surface is divided into concentric tracks and each track is divided into sectors.
10. Data on disks are addressed by specifying the surface number, the track number,
and the sector number.
11. On a movable-head system, the time taken to position the head at the track is known as seek time.
12. The time it takes to reach the beginning of the desired sector is known as rotational
delay or rotational latency.
13. The sum of the seek time (for a movable head system) and the rotational delay is termed the access time of the disk: the time it takes to get into the appropriate position (track and sector) to read or write.
14. Disk arrays that implement a combination of data striping and redundancy are
called Redundant Arrays of Independent Disks (RAID).
15. Reliability of a disk array can be increased by storing redundant information.
16. In a RAID system, the disk array is partitioned into reliability groups, where a reliability group consists of a set of data disks and a set of check disks.
Answers to Check your progress I
1. bus
2. synchronous
3. asynchronous
Answers to Check your progress II
1. Von Neumann
2. non volatile
3. Seek time
4. Head
5. Rotational delay
6. striping
REDUCED INSTRUCTION SET PROGRAMMING
14.1 Introduction
Since the development of the stored program computer around 1950, there have been few fundamental innovations in the area of computer organization and architecture. Some of the major developments are:
• The Family Concept: Introduced by IBM with its system/360 in 1964 followed by
DEC, with its PDP-8. The family concept decouples the architecture of a machine
from its implementation. A set of computers are offered, with different
price/performance characteristics, that present the same architecture to the user.
• Microprogrammed Control Unit: Suggested by Wilkes in 1951, and introduced by IBM on the S/360 line in 1964. Microprogramming eases the task of designing and implementing the control unit and provides support for the family concept.
• Cache Memory: First introduced commercially on IBM S/360 Model 85 in 1968.
The insertion of this element into the memory hierarchy dramatically improves
performance.
• Pipelining: A means of introducing parallelism into the essentially sequential nature
of a machine instruction program. Examples are instruction pipelining and vector
processing.
• Multiple Processor: This category covers a number of different organizations and
objectives.
One of the most visible forms of evolution associated with computers is that of programming languages. Ever more powerful and complex high-level programming languages have been developed by researchers and industry.
Computer designers intended to reduce this gap and included larger instruction sets, more addressing modes, and various HLL statements implemented in hardware. As a result, the instruction set becomes complex. Such complex instruction sets are intended to ease the task of the compiler writer, to improve execution efficiency (since complex sequences of operations can be implemented in microcode), and to provide support for even more complex and sophisticated HLLs. With this attempt to reduce the gap between HLLs and the instruction set of the computer architecture, the system becomes more and more complex, and the resulting system is termed a Complex Instruction Set Computer (CISC). A number of studies have been done over the years to determine the characteristics and patterns of execution of machine instructions generated from HLL programs. The instruction execution characteristics involve the following aspects of computation: the operations performed, the operands used, and the execution sequencing.
14.1.1 Operations
A variety of studies have been made to analyze the behavior of HLL programs. It is
observed that
• There is also a significant presence of conditional statements (IF, loops, etc.). These statements are implemented in machine language with some sort of compare and branch instruction. This suggests that the sequence control mechanism of the instruction set is important.
A variety of studies have analyzed the behavior of high-level language programs. The Table 8.1 includes key results, measuring the appearance of various statement types during execution, as reported by different researchers.
These results are instructive to machine instruction set designers, indicating which types of statements occur most often and therefore should be supported in an "optimal" fashion. From these studies one can observe that even though a complex and sophisticated instruction set is available in a machine architecture, the common programmer may not use those instructions frequently.
14.1.2 Operands
Researchers have also studied the dynamic frequency of occurrence of classes of variables. The results showed that the majority of references are to simple scalar variables. In addition, references to arrays/structures require a previous reference to their index or pointer, which again is usually a local scalar. Thus, there is a predominance of references to scalars, and these are highly localized. It is also observed that operations on local variables are performed frequently, requiring fast access to these operands. This suggests that a prime candidate for optimization is the mechanism for storing and accessing local scalar variables.
14.1.3 Procedure Calls
Procedure calls and returns are an important aspect of HLL programs. Due to the concept of modular and functional programming, call/return statements have become a predominant factor in HLL programs. It is a known fact that call/return statements are among the most time-consuming and expensive operations: during a call, we have to save the current state of the program, which includes the contents of local variables that are present in general-purpose registers; during return, we have to restore the original state of the program from the point where the procedure call was made.
14.1.4 Implications
A number of groups have looked at these results and have concluded that the attempt to
make the instruction set architecture close to HLL is not the most effective design strategy.
Generalizing from the work of a number of researchers, three elements emerge in computer architecture design.
• First, use a large number of registers or use a compiler to optimize register usage.
This is intended to optimize operand referencing.
• Second, careful attention needs to be paid to the design of instruction pipelines. Because of the high proportion of conditional branch and procedure call instructions, a straightforward instruction pipeline will be inefficient. This manifests itself as a high proportion of instructions that are prefetched but never executed.
• Third, a simplified (reduced) instruction set is indicated; there is no point in designing a complex instruction set that leads to a complex architecture. From these observations, a most interesting and important processor architecture has evolved, termed the Reduced Instruction Set Computer (RISC) architecture.
Although RISC systems have been defined and designed in a variety of ways by different groups, the key elements shared by most designs are: one instruction completed per machine cycle, register-to-register operations, simple addressing modes, and simple instruction formats.
The following table compares the characteristics of some representative CISC, RISC, and superscalar processors:

Processor (category)        Year   Number of      Instruction    Addressing   GP          Control memory   Cache size
                                   instructions   size (bytes)   modes        registers   size (kbits)     (kbits)
IBM 370/168 (CISC)          1973   208            2-6            4            16          420              64
VAX 11/780 (CISC)           1978   303            2-57           22           16          480              64
Intel 80486 (CISC)          1989   235            1-11           11           8           246              8
SPARC (RISC)                1987   69             4              1            40-520      --               32
MIPS R4000 (RISC)           1991   94             4              1            32          --               128
PowerPC (superscalar)       1993   225            4              2            32          --               16-32
UltraSPARC (superscalar)    1996   --             4              1            40-520      --               32
MIPS R10000 (superscalar)   1996   --             4              1            32          --               64
14.2 Characteristics of RISC Architecture
14.2.1 One machine instruction per machine cycle
A machine cycle is defined to be the time it takes to fetch two operands from registers, perform an ALU operation, and store the result in a register. With simple, one-cycle instructions there is little or no need for microcode; the machine instructions can be hardwired. A hardwired control unit executes faster than a microprogrammed one, because it is not necessary to access a microprogram control store during instruction execution.
14.2.2 Register-to-register operations
Almost all RISC instructions use simple register addressing. Only for memory access are one or two other addressing modes, such as displacement and PC-relative, included. Once the data are fetched inside the CPU, all instructions can be performed with simple register addressing.
14.2.3 Simple instruction formats
Generally, in most RISC machines, only one or a few instruction formats are used. The instruction length is fixed and aligned on word boundaries. Field locations, especially the opcode, are fixed. With fixed fields, opcode decoding and register operand accessing can occur simultaneously. Simplified formats simplify the control unit.
Check your progress I
1. The insertion of ________ into the memory hierarchy dramatically improves performance.
2. ____________ is the first company who defined RISC architecture.
3. How is memory accessed in the RISC architecture? (Choose the correct one)
A. Load and Store Instruction
B. Opcode Instruction
C. Memory Instruction
D. Bus Instruction
14.3 Design Issues of RISC
For fast execution of instructions, quick access to operands is desirable. There is a large proportion of assignment statements in HLL programs, and many of these are of the simple form A ← B. Also, there is a significant number of operand accesses per HLL statement. Moreover, it is observed that most of the accesses are to local scalars. To get a fast response, we must have easy access to these local scalars, and so the use of register storage is suggested.
Since registers are the fastest available storage devices, faster than both main memory and cache, the use of registers is preferable. The register file is physically small, and on
the same chip as the ALU and Control Unit. A strategy is needed that will allow the most
frequently accessed operands to be kept in registers and to minimize register-memory
operations.
Two basic approaches are possible, one is based on software and the other on hardware.
• The software approach is to rely on the compiler to maximize register usage. The compiler will attempt to allocate registers to those variables that will be used the most in a given time period.
• The hardware approach is simply to use more registers so that more variables can be held in registers for longer periods of time. The hardware approach uses the concept of register windows.
The use of a large set of registers should decrease the need to access memory. The design
task is to organize the registers in such a way that this goal is realized. Due to the use of
the concept of modular programming, present-day programs are dominated by call/return statements. There are local variables present in each function or procedure.
1. On every call, local variables must be saved from the registers into memory, so that
the registers can be reused by the called program. Furthermore, the parameters must
be passed.
2. On return, the variables of the parent program must be restored (loaded back into
registers) and results must be passed back to the parent program.
3. There are also some global variables which are used by the module or procedure.
Thus, the variables that are used in a program can be categorized as follows: passed parameters, local variables, and global variables.
From the studies it is observed that a typical procedure employs only a few passed
parameters and local variables. Also, the depth of procedure activation remains within a
relatively narrow range. To exploit these properties, multiple small sets of registers are
used, each assigned to a different procedure.
A procedure call automatically switches the processor to use a different fixed-size window of registers, rather than saving registers in memory. Windows for adjacent procedures are overlapped to allow parameter passing. The concept of overlapping register windows is shown in the Figure 8.1.
At any time, only one window of registers is visible which corresponds to the currently
executing procedure.
Figure 8.1: Overlapping register window
• Parameter registers hold parameters passed down from the procedure that called the
current procedure and hold results to be passed back up.
• Local registers are used for local variables.
• Temporary registers are used to exchange parameters and results with the next
lower level (procedure called by current procedure).
The temporary registers at one level are physically the same as the parameter registers at the next lower level. This overlap permits parameters to be passed without the actual movement of data.
To handle any possible pattern of calls and returns, the number of register windows would have to be unbounded. Since only a limited number of registers is available, it is not possible to provide an unlimited number of windows; the register windows can hold only the few most recent procedure activations. Older activations must be saved in memory and later restored when the nesting depth decreases. It is observed that the nesting depth is small in general.
The circular buffer of overlapping windows is shown in the Figure 8.2. The procedure call pattern is: A called B, B called C, C called D and D called E, with procedure E as the active procedure. The current window pointer (CWP) points to the window of the currently active procedure. The saved window pointer identifies the window most recently saved in memory.
As the nesting depth of procedure calls increases, there may not be sufficient registers to accommodate the new procedure. In this case, the information of the oldest procedure activation is stored back into memory, and the saved window pointer keeps track of the most recently saved window.
It is clear that an N-window register file can hold only N-1 procedure activations. The value of N need not be very large, because in general the depth of procedure activation is small; only in the case of recursive calls may the depth grow large. From surveys, it is found that with 8 windows, a save or restore is needed on only 1% of the calls or returns.
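A hedged Python sketch of this circular-buffer behavior (the window count and the policy of spilling exactly one window at a time are illustrative simplifications):

class RegisterWindows:
    # Toy model of a circular buffer of N register windows.
    def __init__(self, n_windows=8):
        self.n = n_windows
        self.cwp = 0          # current window pointer
        self.depth = 0        # total procedure nesting depth
        self.in_memory = 0    # activations spilled to memory
        self.saves = 0
        self.restores = 0

    def in_registers(self):
        return self.depth - self.in_memory

    def call(self):
        self.cwp = (self.cwp + 1) % self.n
        self.depth += 1
        if self.in_registers() > self.n - 1:   # N windows hold only N-1 activations
            self.in_memory += 1                # spill the oldest window
            self.saves += 1

    def ret(self):
        self.cwp = (self.cwp - 1) % self.n
        self.depth -= 1
        if self.in_memory and self.in_registers() == 0:
            self.in_memory -= 1                # restore the caller's window
            self.restores += 1

w = RegisterWindows()
for _ in range(10):   # ten nested calls...
    w.call()
for _ in range(10):   # ...then ten returns
    w.ret()
print(w.saves, w.restores)   # 3 saves and 3 restores: only the deepest calls spill

With 8 windows, only nesting beyond depth 7 causes memory traffic, which matches the observation above that saves and restores are rare.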
14.3.3 Global Variables
The window scheme provides an efficient organization for storing local scalar variables in registers. However, global variables are accessed by more than one procedure. There are two solutions for accessing global variables: either the compiler assigns global variables to memory locations and accesses them with memory-reference instructions, or a set of global registers is incorporated in the processor for the most frequently used global variables.
On a target RISC machine where only a small number of registers (e.g., 16-32) is available, the concept of register windows cannot be used. In this case, optimized register usage is the responsibility of the compiler. A program written in a high-level language has no explicit references to registers. The objective of the compiler is to keep the operands of as many computations as possible in registers rather than main memory, and to minimize load and store operations. To optimize the use of registers, the approach taken is as follows: each program quantity that is a candidate for residing in a register is assigned to a symbolic (virtual) register, and the compiler then maps the unlimited number of symbolic registers onto the fixed number of real registers; symbolic registers whose usage does not overlap can share the same real register.
The task of optimization is to decide which quantities are to be assigned to registers at any given point in the program. The technique most commonly used in RISC compilers is known as graph coloring.
The graph coloring problem is as follows: given a graph consisting of nodes and edges, assign colors to nodes such that adjacent nodes have different colors, and do this in such a way as to minimize the number of different colors. This problem is mapped to the register optimization problem of the compiler in the following way: the nodes of the graph are the symbolic registers; if two symbolic registers are "live" during the same program fragment, an edge joins the corresponding nodes; the colors are the real registers; and a node that cannot be colored must be placed in memory and accessed with loads and stores.
Part 'a' of Figure 8.3 shows a program with seven symbolic registers to be compiled into three actual registers. Part 'b' of Figure 8.3 shows the register interference graph. A possible coloring with three colors is shown; only symbolic register E is left uncolored and must be dealt with using loads and stores.
Figure 8.3
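A hedged Python sketch of greedy interference-graph coloring (the graph below is illustrative, not the one in Figure 8.3):

def color_graph(nodes, edges, n_colors):
    # Greedy coloring: returns {node: color}; uncolored nodes get None (spilled).
    adjacency = {v: set() for v in nodes}
    for a, b in edges:
        adjacency[a].add(b)
        adjacency[b].add(a)
    coloring = {}
    for v in nodes:
        used = {coloring.get(u) for u in adjacency[v]}
        free = [c for c in range(n_colors) if c not in used]
        coloring[v] = free[0] if free else None   # None -> keep in memory
    return coloring

# Hypothetical interference graph, three real registers (colors 0..2).
nodes = list("ABCDEF")
edges = [("A","B"), ("A","C"), ("B","C"), ("B","D"), ("C","E"), ("D","E"), ("E","F")]
print(color_graph(nodes, edges, n_colors=3))

Real compilers use more sophisticated heuristics for choosing the coloring order and which node to spill, but the mapping of registers to colors is exactly the one described above.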
14.4 Large Register file versus cache
The Register file, organized into windows, acts as a small, fast buffer for holding a subset
of all variables that are likely to be used the most heavily. From this point of view, the
register file acts much like a cache memory.
The question therefore arises whether it would be simpler and better to use a cache together with a small traditional register file, instead of a large register file. The Table 8.3 compares the characteristics of the two approaches.
14.5 Summary
1. RISC is an abbreviation of Reduced Instruction Set Computer.
2. RISC processor has ‘instruction sets’ that are simple and have simple ‘addressing
modes’.
3. A RISC-style instruction occupies "one word" in memory.
4. Execution of RISC instructions is faster, taking one clock cycle per instruction.
5. Although the forerunners of RISC computers were seen in the 1960s, they were overshadowed by the popularity of CISC microprocessors, which manufacturers implemented in calculators, video games, stereos, etc.
6. RISC supports a few simple data types, from which complex data types are synthesized.
7. RISC utilizes simple addressing modes and fixed-length instructions for pipelining.
8. RISC permits any register to be used in any context.
9. The amount of work done per instruction is reduced by separating "LOAD" and "STORE" into independent instructions.
10. RISC contains a large number of registers in order to reduce the number of interactions with memory.
11. In RISC, pipelining is easy, as the execution of all instructions is done in a uniform interval of time, i.e., one clock cycle.
12. In RISC, more RAM is required to store assembly level instructions.
13. Reduced instructions need a smaller number of transistors in RISC.
14. RISC uses the Harvard memory model, i.e., it follows the Harvard architecture with separate instruction and data paths.
15. A compiler is used to perform the conversion operation, i.e., to convert a high-level language statement into machine code.
16. The CISC approach attempts to minimize the number of instructions per program,
sacrificing the number of cycles per instruction.
17. Computers based on the CISC architecture are designed to decrease the memory
cost.
14.6 Model Questions
1. What are the distinguishing characteristics of RISC organization?
2. Briefly explain the basic approaches used to minimize register-memory operations
on RISC machines.
3. Give some reasons for shifting the paradigm from CISC to RISC.
4. Explain the concept of register window to handle the procedure calls.
5. If a circular register buffer is used to handle local variables for nested procedures,
describe the approaches for handling global variables.
6. Explain the concept of graph coloring to optimize the register uses.
7. What are the differences between using a large register file and using a cache memory?
Answers to Check your progress I
1. Cache memory
2. IBM
3. A
Answers to Check your progress II
1. graph coloring
2. B
PIPELINING PROCESSOR
15.0 Learning Objectives
• Define pipelining;
• Explain the pipeline execution of fetch and execution cycle;
• Calculate the cycle time of an instruction pipeline;
• Calculate the speed up factor for the instruction pipeline compared to execution
without the pipeline;
• Explain various approaches used for dealing with conditional branches;
By laying the production process out in an assembly line, products at various stages can be worked on simultaneously. This process is also referred to as pipelining, because, as in a pipeline, new inputs are accepted at one end before previously accepted inputs appear as outputs at the other end.
Let Fi and Ei refer to the fetch and execute steps for instruction Ii. Execution of a program as a sequence of fetch and execute steps is shown in the Figure 9.1.
Now consider a CPU that has two separate hardware units, one for fetching instructions and another for executing them. The instruction fetched by the fetch unit is stored in an intermediate storage buffer B1. The results of execution are stored in the destination location specified by the instruction. For simplicity, it is assumed that the fetch and execute steps of any instruction can each be completed in one clock cycle.
• In the first clock cycle, the fetch unit fetches an instruction (instruction I1, step F1) and stores it in buffer B1 at the end of the clock cycle.
• In the second clock cycle, the instruction fetch unit proceeds with the fetch
operation for instruction I2 (step F2).
• Meanwhile, the execution unit performs the operation specified by instruction I1
which is already fetched and available in the buffer B1 (step E1).
• By the end of the second clock cycle, the execution of the instruction I1 is
completed and instruction I2 is available.
• Instruction I2 is stored in buffer B1 replacing I1, which is no longer needed.
• Step E2 is performed by the execution unit during the third clock cycle, while
instruction I3 is being fetched by the fetch unit.
• Both the fetch and execute units are kept busy all the time and one instruction is
completed after each clock cycle except the first clock cycle.
• If a long sequence of instructions is executed, the completion rate of instruction
execution will be twice that achievable by the sequential operation with only one
unit that performs both fetch and execute.
Basic idea of instruction pipelining with hardware organization is shown in the Figure 9.2.
The pipeline execution of fetch and execution cycle is shown in the Figure 9.3.
The processing of an instruction need not be divided into only two steps. To gain further speed up, the pipeline must have more stages. Let us consider the following decomposition of the instruction execution:
• Fetch Instruction (FI): Read the next expected instruction into a buffer.
• Decode Instruction (DI): Determine the opcode and the operand specifiers.
• Calculate Operands (CO): Calculate the effective address of each source operand.
• Fetch Operands (FO): Fetch each operand from memory.
• Execute Instruction (EI): Perform the indicated operation.
• Write Operand (WO): Store the result in memory.
There will be six different stages for these six subtasks. For the sake of simplicity, let us assume an equal duration for all the subtasks. If the six stages are not of equal duration, there will be some waiting involved at various pipeline stages. The timing diagram for the execution of instructions in pipeline fashion is shown in the Figure 9.4.
From this timing diagram it is clear that the total execution time of 8 instructions in this 6-stage pipeline is 13 time units. The first instruction gets completed after 6 time units, and thereafter one instruction completes in each time unit. Without the pipeline, the total time required to complete the 8 instructions would have been 48 (6 x 8) time units. Therefore, there is a speed up in pipeline processing, and the speed up is related to the number of stages.
The cycle time τ of the pipeline is the time needed to advance the instructions one stage:

τ = max(τi) + d = τm + d

where
τm = maximum stage delay (the delay through the stage that experiences the largest delay)
d = time delay of a latch, needed to advance signals and data from one stage to the next.
Now suppose that n instructions are processed through a k-stage pipeline, entering one after another. The total time Tk required to execute all n instructions is

Tk = [k + (n - 1)] τ

A total of k cycles are required to complete the execution of the first instruction, and the remaining (n - 1) instructions require (n - 1) more cycles.
The time required to execute the n instructions without the pipeline is

T1 = n k τ

The speed up factor for the instruction pipeline compared to execution without the pipeline is defined as

Sk = T1 / Tk = n k τ / [k + (n - 1)] τ = n k / [k + (n - 1)]

In general, the number of instructions executed is much higher than the number of stages in the pipeline. So, as n tends to infinity,

Sk → k

i.e., we get a k-fold speed up; the speed up factor is a function of the number of stages in the instruction pipeline.
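These formulas are easy to check against the 6-stage, 8-instruction example above; a small Python sketch:

def pipeline_speedup(k, n):
    # Sk = n*k / (k + (n - 1)) for a k-stage pipeline executing n instructions.
    return (n * k) / (k + (n - 1))

print(pipeline_speedup(k=6, n=8))        # 48/13, about 3.69 (13 vs 48 time units)
print(pipeline_speedup(k=6, n=10**6))    # about 6: approaches k for large n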
Though it has been seen that the speed up is proportional to the number of stages in the pipeline, in practice the speed up is less, for several practical reasons. The factors that affect pipeline performance are discussed next.
Consider a four-stage pipeline with the following processing steps:
F: Fetch, read the instruction from memory.
D: Decode, decode the instruction and fetch the source operand(s).
O: Operate, perform the specified operation.
W: Write, store the result in the destination location.
The hardware organization of this four-stage pipeline processor is shown in the Figure 9.5.
In the preceding section we have seen that the speed up of a pipeline processor is related to the number of stages in the pipeline, i.e., the greater the number of stages, the faster the execution rate. But the organization of the stages of a pipeline is a complex task, and it affects the performance of the pipeline.
The problems related to a greater number of stages are as follows: at each stage of the pipeline, there is some overhead involved in moving data from buffer to buffer and in performing various preparation and delivery functions. This overhead can appreciably lengthen the total execution time of a single instruction. Moreover, the amount of control logic required to handle memory and register dependencies and to optimize the use of the pipeline increases enormously with the number of stages.
Apart from hardware organization, there are some other reasons which may affect the
performance of the pipeline.
(A) Unequal time requirement to complete a subtask:
Consider the four-stage pipeline with processing steps Fetch, Decode, Operate and Write. Stage 3 of the pipeline is responsible for arithmetic and logic operations, and in general one clock cycle is assigned for this task. Although this may be sufficient for most operations, some operations, like divide, may require more time to complete. Figure 9.6 shows the effect of an operation that takes more than one clock cycle in the operate stage.
The operate stage for instruction I2 takes 3 clock cycles to perform the specified operation. Clock cycles 4 to 6 are required to perform this operation, so the write stage is doing nothing during clock cycles 5 and 6, because no data is available to write. Meanwhile, the information in buffer B2 must remain intact until the operate stage has completed its operation.
This means that stage 2 and stage 1 are blocked from accepting new instructions because the information in B1 cannot be overwritten by a new fetch. The contents of B1, B2 and B3 must always change at the same clock edge. For this reason, the pipeline operation is said to have been stalled for two clock cycles. Normal pipeline operation resumes in clock cycle 7. Whenever the pipeline stalls, some degradation in performance occurs.
(B) Cache miss: The use of cache memory solves the memory access problem. Occasionally, however, a memory request results in a cache miss. This causes the pipeline stage that issued the memory request to take a much longer time to complete its task, and in this case the pipeline stalls. The effect of a cache miss on pipeline processing is shown in the Figure 9.7.
Function performed by each stage as a function of time is shown in Figure 9.8.
In this example, instruction I1 is fetched from the cache in cycle 1 and its execution proceeds normally. The fetch operation for instruction I2, which starts in cycle 2, results in a cache miss. The instruction fetch unit must now suspend any further fetch requests and wait for I2 to arrive. We assume that instruction I2 is received and loaded into buffer B1 at the end of cycle 5; this suggests that the cache memory used here is four times faster than the main memory. The pipeline resumes its normal operation at that point and will remain in normal operation mode for some time, because a cache miss generally transfers a whole block from main memory to the cache. From the figure, it is clear that the Decode unit, Operate unit and Write unit remain idle for three clock cycles. Such idle periods are sometimes referred to as bubbles in the pipeline.
Once created as a result of a delay in one of the pipeline stages, a bubble moves downstream until it reaches the last unit. A pipeline will not stall as long as the instructions and data being accessed reside in the cache. This is facilitated by providing separate on-chip instruction and data caches.
(C) Dependency constraints: Consider the two instructions
I1 : A ← A + 5
I2 : B ← 3 * A
When this program is executed in a pipeline, the execution of I2 can begin before the execution of I1 completes. The pipeline execution is shown in Figure 9.9.
Figure 9.9: Pipeline execution of two instructions
In clock cycle 3, the specified operation of instruction I1, i.e. the addition, takes place, and only then is the new updated value of A available. But in clock cycle 3, instruction I2 is already fetching the operand it needs. Since the operation of instruction I1 is still taking place in clock cycle 3, instruction I2 will get the old value of A rather than the updated one, and will produce a wrong result. Consider that the initial value of A is 4. The proper execution would produce the result
B = 27
whereas using the old value of A would give B = 12.
Due to the data dependency, these two instructions cannot be performed in parallel.
Therefore, no two operations that depend on each other can be performed in parallel. For
correct execution, it is required to satisfy the following:
• The operation of the fetch stage must not depend on the operation performed during
the same clock cycle by the execution stage.
• The operation of fetching an instruction must be independent of the execution
results of the previous instruction.
• The dependency of data arises when the destination of one instruction is used as a
source in a subsequent instruction.
15.3 Branching
In general, when we are executing a program the next instruction to be executed is brought
from the next memory location. Therefore, in pipeline organization, we are fetching
instructions one after another.
But in the case of a conditional branch instruction, the address of the next instruction to be fetched depends on the result of executing that instruction.
Since the next instruction to execute depends on the outcome of the branch instruction, it may sometimes be necessary to invalidate several instruction fetches. Consider the instruction execution sequence of Figure 9.10.
The result of the branch instruction will be available at clock cycle 5, but by that time the fetch unit has already fetched instructions I4 and I5. If the branch condition is false, the branch won't take place, and the next instruction to be executed is I4, which is already fetched and available for execution.
Now consider the case when the condition is true and instruction I10 has to be executed. After clock cycle 5, it is known that the branch condition is true. But the processor has already fetched instructions I4 and I5. These two fetched instructions must be invalidated, and the pipeline must be loaded with the new destination instruction I10.
Due to this, the pipeline will stall for some time. The time lost due to a branch instruction is often referred to as the branch penalty. The effect of a taken branch is shown in the Figure 9.11: because the branch is taken, instructions I4 and I5, which have already been fetched, are not executed, and the new instruction I10 is fetched at clock cycle 6. There is no effective output in clock cycles 7 and 8, so the branch penalty is 2. The branch penalty depends on the number of stages in the pipeline; more stages result in a larger branch penalty.
Dealing with Branches:
One of the major problems in designing an instruction pipeline is assuring a steady flow of instructions to the initial stages of the pipeline. The primary problem is the conditional branch instruction: until the instruction is actually executed, it is impossible to determine whether the branch will be taken or not. A variety of approaches have been taken for dealing with conditional branches:
• Multiple streams
• Prefetch branch target
• Loop buffer
• Branch prediction
• Delayed branch
Multiple streams
A single pipeline suffers a penalty for a branch instruction because it must choose one of
two instructions to fetch next and sometimes it may make the wrong choice. A brute-force
approach is to replicate the initial portions of the pipeline and allow the pipeline to fetch
both instructions, making use of two streams. There are two problems with this approach: with multiple pipelines there are contention delays for access to the registers and to memory, and additional branch instructions may enter the pipeline (either stream) before the original branch decision is resolved; each such instruction needs an additional stream.
Prefetch branch target
When a conditional branch is recognized, the target of the branch is prefetched, in addition to the instruction following the branch. This target is then saved until the branch instruction is executed. If the branch is taken, the target has already been prefetched.
Loop Buffer
A loop buffer is a small, very high-speed memory maintained by the instruction fetch stage of the pipeline, containing the most recently fetched instructions, in sequence.
If a branch is to be taken, the hardware first checks whether the branch target is within the buffer. If so, the next instruction is fetched from the buffer.
1. With the use of prefetching, the loop buffer will contain some instructions sequentially ahead of the current instruction fetch address. Thus, instructions fetched in sequence will be available without the usual memory access time.
2. If a branch occurs to a target just a few locations ahead of the address of the branch
instruction, the target will already be in the buffer. This is usual for the common
occurrence of IF-THEN and IF-THEN-ELSE sequences.
3. This strategy is particularly well suited for dealing with loops, or iterations; hence
the name loop buffer. If the loop buffer is large enough to contain all the
instructions in a loop, then those instructions need to be fetched from memory only
once, for the first iteration. For subsequent iterations, all the needed instructions are
already in the buffer.
The loop buffer is similar in principle to a cache dedicated to instructions. The differences
are that the loop buffer only retains instructions in sequence and is much smaller in size
and hence lower in cost.
Branch Prediction
Various techniques can be used to predict whether a branch will be taken or not. The most common techniques are:
• Predict never taken
• Predict always taken
• Predict by opcode
• Taken/not taken switch
• Branch history table
The first three approaches are static; they do not depend on the execution history up to the time of the conditional branch instruction. The latter two approaches are dynamic: they depend on the execution history.
Predict never taken always assumes that the branch will not be taken and continues to fetch instructions in sequence. Predict always taken assumes that the branch will be taken and always fetches from the branch target. In these two approaches it is also possible to minimize the effect of a wrong decision.
If the fetch of an instruction after the branch will cause a page fault or protection violation,
the processor halts its prefetching until it is sure that the instruction should be fetched.
Studies analyzing program behavior have shown that conditional branches are taken more than 50% of the time [LILJ88], so if the cost of prefetching from either path is the same, then always prefetching from the branch target address should give better performance than always prefetching from the sequential path.
However, in a paged machine, prefetching the branch target is more likely to cause a page
fault than prefetching the next instruction in the sequence and so this performance penalty
should be taken into account.
The predict-by-opcode approach makes the decision based on the opcode of the branch instruction. The processor assumes that the branch will be taken for certain branch opcodes and not for others. Studies reported in [LILJ88] showed that the success rate is greater than 75% with this strategy.
[LILJ88] Lilja, D., "Reducing the Branch Penalty in Pipelined Processors," Computer, July 1988.
Dynamic branch strategies attempt to improve the accuracy of prediction by recording the history of conditional branch instructions in a program. Schemes to maintain the history information include the following:
• One or more bits can be associated with each conditional branch instruction that
reflect the recent history of the instruction.
• These bits are referred to as a taken/not taken switch that directs the processor to
make a particular decision the next time the instruction is encountered.
• Generally, these history bits are not associated with the instruction in main memory, as that would unnecessarily increase the size of the instruction. With a single bit we can record whether the last execution of the instruction resulted in a branch or not.
• With only one bit of history, an error in prediction will occur twice for each use of a loop: once on entering the loop, and once on exiting.
If two bits are used, they can record the result of the last two instances of execution of the associated instruction. Since the history information is not kept in main memory, it can be kept in a temporary high-speed memory. One possibility is to associate these bits with each conditional branch instruction that is in a cache; when the instruction is replaced in the cache, its history is lost.
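A common realization of the two-bit scheme is a saturating counter per branch, so the prediction only flips after two consecutive mispredictions. A hedged Python sketch (the state encoding and initial state are illustrative design choices):

class TwoBitPredictor:
    # States 0,1 predict "not taken"; states 2,3 predict "taken".
    def __init__(self):
        self.state = 1   # start weakly not-taken

    def predict(self):
        return self.state >= 2

    def update(self, taken):
        if taken:
            self.state = min(3, self.state + 1)
        else:
            self.state = max(0, self.state - 1)

p = TwoBitPredictor()
for outcome in [True, True, False, True, True]:
    print("predict:", p.predict(), " actual:", outcome)
    p.update(outcome)

Note how the single not-taken outcome in the middle does not flip the prediction back to not-taken; this is exactly the loop-exit case that defeats a one-bit scheme.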
Another possibility is to maintain a small table for recently executed branch instructions
with one or more bits in each entry.
The branch history table is a small cache memory associated with the instruction fetch stage of the pipeline. Each entry in the table consists of three elements: the address of the branch instruction, some number of history bits recording the recent behavior of that instruction, and information about the target instruction (typically its address).
Delayed Branch
The Figure 9.12 shows the execution of instructions in a pipeline where instruction Ij is a branch instruction. The processor begins fetching instruction Ij+1 before it determines whether the current instruction Ij is a branch instruction. When execution of Ij is completed and a branch must be made, the processor must discard the instruction that was fetched and fetch the instruction at the branch target instead.
The location following a branch instruction is called a branch delay slot. There may be
more than one branch delay slot, depending on the time it takes to execute a branch
instruction.
The instructions in the delay slots are always fetched and at least partially executed before
the branch decision is made and the branch target address is computed.
The instructions in the delay slots are always fetched, so we can arrange for the instructions in the delay slots to be fully executed whether or not the branch is taken. The objective is to place useful instructions in these slots. If no useful instructions can be placed in the delay slots, these slots must be filled with NOP (no operation) instructions. For example, consider the code segment given in the Figure 9.13.
Here register R2 is used as a counter to determine the number of times the contents of register R1 are shifted left. Consider a processor with a two-stage pipeline and one delay slot. During the execution phase of instruction I3, the fetch unit will fetch instruction I4. Only after evaluating the branch condition will it be clear whether instruction I1 or I4 is to be executed next. The nature of the code segment says that execution will remain in the loop for a number of iterations depending on the initial value of R2, and when R2 becomes zero, it will come out of the loop and execute instruction I4. During the loop execution, every iteration there is a wrong fetch of instruction I4. The code segment can be reordered without disturbing the original meaning of the program. The reordered code segment is shown in Figure 9.14.
In this case, the shift instruction is fetched while the branch instruction is being executed.
After evaluating the branch condition, the processor fetches the instruction at LOOP or at
NEXT, depending on whether the branch condition is true or false, respectively. In either
case, it completes execution of the shift instruction.
Logically, the program is executed as if the branch instruction were placed after the shift instruction. That is, branching takes place one instruction later than where the branch instruction appears in the instruction sequence in memory; hence the name "delayed branch".
Figure 9.15 shows the execution timing for the last two passes through the loop of
reordered instructions.
Figure 9.16 shows the execution timing for the last two passes through the loop of the
original program loop.
Figure 9.15: Execution timing for last two passes through the loop of reordered instruction
Figure 9.16: Execution timing for last two passes through the loop of the original program
loop
3. The ___________ depends on the number of stages in the pipeline.
4. A __________ is a small, very high-speed memory maintained by the instruction fetch
stage of the pipeline and containing the most recently fetched instructions, in sequence.
15.4 Summary
1. Pipelining is the process of feeding instructions to the processor through a pipeline, allowing instructions to be stored and executed in an orderly, overlapped process. It is also known as pipeline processing.
2. Pipelining increases the overall instruction throughput.
3. Pipelining is a technique where multiple instructions are overlapped during
execution.
4. There are some factors that cause the pipeline to deviate from its normal performance, such as timing variations, data hazards, branching, interrupts and data dependency.
Answers to Check your progress I
1. b
2. b
3. a
4. c
Answers to Check your progress II
1. c
2. a
3. branch penalty
4. loop buffer
PARALLEL PROCESSING
It is observed that, at the micro-operation level, multiple control signals are generated at the same time. Instruction pipelining, at least to the extent of overlapping fetch and execute operations, has been around for a long time.
Observing these phenomena, researchers have investigated whether some operations can be performed in parallel. As computer technology has evolved, and as the cost of computer hardware has dropped, computer designers have sought more and more opportunities for parallelism, usually to enhance performance and, in some cases, to increase availability. The design issues relating to SMPs and NUMA machines are complex, involving issues relating to physical organization, interconnection structures, interprocessor communication, operating system design, and application software techniques.
Symmetric Multiprocessors
1. There are two or more similar processors of comparable capability.
2. These processors share the same main memory and I/O facilities and are
interconnected by a bus or other internal connection scheme.
3. All processors share access to I/O devices, either through the same channels or
through different channels that provide paths to the same device.
4. All processors can perform the same functions.
5. The system is controlled by an integrated operating system that provides interaction
between processors and their programs at the job, task, file and data element levels.
6. The operating system of an SMP schedules processes or threads across all of the processors. SMP has several potential advantages over a uniprocessor architecture:
a. Performance: A system with multiple processors will perform better than one with a single processor of the same type if the task can be organized in such a manner that some portions of the work can be done in parallel.
b. Availability: Since all the processors can perform the same functions in a symmetric multiprocessor, the failure of a single processor does not stop the machine. Instead, the system can continue to function at a reduced performance level.
c. Incremental growth: A user can enhance the performance of a system by
adding an additional processor.
d. Scaling: Vendors can offer a range of products with different price and performance characteristics based on the number of processors configured in the system.
Organization
Figure 10.1: Block diagram of tightly coupled multiprocessors
A time-shared bus is the simplest mechanism for constructing a multiprocessor system. The bus consists of control, address and data lines. The block diagram is shown in Figure 10.2. The bus organization has several advantages compared with other approaches:
• Simplicity: The physical interface and the logic for addressing, arbitration and time-sharing are the same as in a single-processor system.
• Flexibility: It is generally easy to expand the system by attaching more processors to the bus.
• Reliability: The bus is essentially a passive medium, and the failure of any attached device should not cause failure of the whole system.
The main drawback of the bus organization is performance: all memory references pass through the common bus, so the speed of the system is limited by the bus cycle time. To improve performance, each processor can be equipped with a local cache memory. The use of caches leads to a new problem, known as the cache coherence problem. Each local cache contains an image of a portion of main memory, so if a word is altered in one cache, it may invalidate a word in another cache. To prevent this, the other processors must be notified so that they can update their local caches.
Multiport Memory
The multiport memory approach allows the direct, independent access of main memory
modules by each processor and I/O module. The multiport memory system is shown in
Figure 10.3.
The multiport memory approach is more complex than the bus approach, requiring a fair amount of logic to be added to the memory system. Logic associated with the memory is required for resolving conflicts. The method often used to resolve conflicts is to assign permanently designated priorities to each memory port.
Nonuniform Memory Access (NUMA)
In a NUMA architecture, all processors have access to all parts of main memory using loads and stores, but the memory access time of a processor differs depending on which region of main memory is accessed. This is true for all processors; however, which memory regions are slower and which are faster differ from processor to processor.
A NUMA system in which cache coherence is maintained among the caches of the various processors is known as cache-coherent NUMA (CC-NUMA). A typical CC-NUMA organization is shown in the Figure 10.4.
There are multiple independent nodes, each of which is, in effect, an SMP organization. Each node contains multiple processors, each with its own L1 and L2 caches, plus main memory. The node is the basic building block of the overall CC-NUMA organization. The nodes are interconnected by means of some communication facility, which could be a switching mechanism, a ring, or some other networking facility.
Each node in the CC-NUMA system includes some main memory. From the point of view of the processors, however, there is only a single addressable memory, with each location having a unique system-wide address. When a processor initiates a memory access, if the requested memory location is not in the processor's cache, then the L2 cache initiates a fetch operation. If the desired line is in the local portion of the main memory, the line is fetched across the local bus.
If the desired line is in a remote portion of the main memory, then an automatic request is
sent out to fetch that line across the interconnection network, deliver it to the local bus, and
then deliver it to the requesting cache on that bus.
Figure 10.4: CC- NUMA Organization
All of this activity is atomic and transparent to the processor and its cache. In this configuration, cache coherence is a central concern. For that, each node must maintain some sort of directory that indicates the location of various portions of memory and also holds cache status information.
16.2 Interconnection Networks
Single Bus
The simplest and most economical means of interconnecting a number of modules is to use
a single bus. Since several modules are connected to the bus and any module can request a
data transfer at any time, it is essential to have an efficient bus arbitration scheme. In a
simple mode of operation, the bus is dedicated to a particular source-destination pair for
the full duration of the requested transfer. For example, when a processor issues a read request on the bus, it holds the bus until it receives the desired data from the memory module. Since the memory module needs a certain amount of time to access the data, the bus will be idle until the memory is ready to respond with the data. Then the data is transferred to the processor. When this transfer is completed, the bus can be assigned to handle another request.
A scheme known as the split-transaction protocol makes it possible to use the bus during the idle period to serve another request.
Consider the following method of handling a series of read requests, possibly from different processors. After transferring the address involved in the first request, the bus may be reassigned to transfer the address of the second request, assuming that this request is to a different memory module. At this point, two modules are proceeding with read access cycles in parallel. If neither module has finished its access, the bus may be reassigned to a third request, and so on. Eventually, the first memory module completes its access cycle and uses the bus to transfer the data to the corresponding source.
As the other modules complete their cycles, the bus is used to transfer their data to the
corresponding sources. The split-transaction protocol thus allows the bus and the available
bandwidth to be used more efficiently. The performance improvement achieved with this
protocol depends on the relationship between the bus transfer time and the memory access
time.
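A rough back-of-the-envelope calculation illustrates this relationship. The cycle counts below are purely illustrative assumptions, not figures from the text: suppose an address or data transfer occupies the bus for one cycle and a memory access takes four cycles.

#include <stdio.h>

int main(void) {
    int transfer = 1;  /* assumed bus cycles per address/data transfer */
    int access   = 4;  /* assumed memory access time in bus cycles     */

    /* Dedicated bus: held for address + access + data cycles, but only
       the address and data cycles actually carry information. */
    double dedicated = 2.0 * transfer / (2 * transfer + access);

    printf("dedicated bus utilization: %.0f%%\n", dedicated * 100);
    /* With split transactions the bus is released during the access,
       so under load nearly every cycle can carry a transfer. */
    printf("split-transaction bus can approach 100%% utilization\n");
    return 0;
}

With these assumed numbers the dedicated bus is busy only a third of the time, which is exactly the idle period the split-transaction protocol reclaims.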
This efficiency comes at the price of increased complexity:
• Since a memory module needs to know which source initiated a given read request,
a source identification tag must be attached to the request.
• Complexity also increases because all modules, not just the processor, must be able
to act as bus masters.
The main limitation of a single bus is that the number of modules that can be connected to
the bus is not very large. Networks that allow multiple independent transfer operations to
proceed in parallel can provide a significantly increased data transfer rate.
Crossbar Network
In a fully connected network, many simultaneous transfers are possible. If n sources need
to send data to n distinct destinations, all of these transfers can take place concurrently.
Since no transfer is prevented by the lack of a communication path, the crossbar is called a
nonblocking switch. In the crossbar interconnection network of Figure 10.5, a single
switch is shown at each crosspoint. In an actual multiprocessor system, the paths through
the crossbar network are much wider.
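The nonblocking property can be sketched in a few lines of C: with one switch at each crosspoint, any one-to-one assignment of n sources to n distinct destinations closes exactly one switch per row and per column, so no transfer blocks another. The array and function names are assumptions made for illustration.

#include <stdbool.h>
#include <stdio.h>

#define N 4  /* assumed network size */

/* closed[i][j] == true means source i is connected to destination j. */
static bool closed[N][N];

/* Set up a full permutation: source i talks to destination perm[i].
   Every row and every column gets exactly one closed crosspoint, so
   all N transfers can proceed concurrently. */
static void set_permutation(const int perm[N]) {
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            closed[i][j] = (perm[i] == j);
}

int main(void) {
    int perm[N] = {2, 0, 3, 1};  /* distinct destinations */
    set_permutation(perm);
    for (int i = 0; i < N; i++)
        printf("source %d -> destination %d\n", i, perm[i]);
    return 0;
}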
Multistage Network
The bus and crossbar systems use a single stage of switching to provide a path from a
source to a destination. In a multistage network, multiple stages of switches are used to set
up a path between a source and a destination. Such networks are less costly than the
crossbar structure, yet they provide a reasonably large number of parallel paths between
sources and destinations.
Figure 10.6 shows a three-stage network, called a shuffle network, that interconnects eight
modules. The term "shuffle" describes the pattern of connections from the outputs of one
stage to the inputs of the next stage. Each switchbox in Figure 10.6 is a switch that can
route either input to either output. If the inputs request distinct outputs, they can both be
routed simultaneously in the straight-through or crossed pattern. If both inputs request the
same output, only one request can be satisfied; the other one is blocked until the first
request finishes using the switch.
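A path through such a network can be found by destination-tag routing. The sketch below assumes the common convention that the switchbox at each stage examines one bit of the destination address, 0 selecting the upper output and 1 the lower; this is an illustrative assumption, not a description of the exact wiring in Figure 10.6.

#include <stdio.h>

/* Route from src to dst through a log2(n)-stage shuffle network by
   examining destination bits from most to least significant. */
static void shuffle_route(unsigned src, unsigned dst, int stages) {
    printf("P%u -> P%u:\n", src, dst);
    for (int s = stages - 1; s >= 0; s--)
        printf("  stage %d: take the %s output\n",
               stages - s, ((dst >> s) & 1) ? "lower" : "upper");
}

int main(void) {
    shuffle_route(2u, 7u, 3);  /* 8 modules => 3 stages */
    return 0;
}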
In a crossbar switch, conflicts occur when two or more concurrent requests are made to the
same destination device. These conflicting requests are usually handled on a predetermined
priority basis.
The crossbar switch has the potential for the highest bandwidth and system efficiency.
However, because of its complexity and cost, it may not be cost-effective for a large
multiprocessor system.
A network consisting of s stages can be used to interconnect 2^s modules. In this case, there
is exactly one path through the network from any module Pi to any module Pj, so the
network provides full connectivity between sources and destinations. However, many
request patterns cannot be satisfied simultaneously. For example, the connection from P2
to P7 cannot be provided at the same time as the connection from P3 to P6.
Figure 10.6: Multistage Shuffle Network
A multistage network is less expensive to implement than a crossbar network. If n modules
are to be interconnected using this scheme, then we must use s = log2 n stages with n/2
switchboxes per stage. Since each switchbox contains four switches, the total number of
switches is
4 × (n/2) × log2 n = 2n log2 n
which, for a large network, is considerably less than the n^2 switches needed in a crossbar
network. On the other hand, multistage networks are less capable of providing concurrent
connections than a crossbar. The connection path between P2 and P4 is indicated by red
lines in Figure 10.6.
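For example, for n = 8 the multistage network needs 2 × 8 × log2 8 = 48 switches, against 8^2 = 64 for a crossbar, and the gap widens quickly with n. The short program below simply evaluates the two formulas from the text for a few sizes; the chosen sizes are arbitrary.

#include <stdio.h>
#include <math.h>

int main(void) {
    /* Compare switch counts for a few network sizes. */
    for (int n = 8; n <= 1024; n *= 4) {
        long multistage = 2L * n * (long)log2(n);  /* 2n log2 n */
        long crossbar   = (long)n * n;             /* n^2       */
        printf("n = %4d: multistage %7ld, crossbar %8ld\n",
               n, multistage, crossbar);
    }
    return 0;
}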
Hypercube Networks
Figure 10.7 shows a three-dimensional hypercube. The small circles represent the
communication circuits in the nodes, and the edges of the cube represent bi-directional
communication links between neighboring nodes.
Routing messages through the hypercube is easy. If the processor at node Ni wishes to send
a message to node Nj, it proceeds as follows:
• The binary addresses of the source, i, and the destination, j, are compared from
least to most significant bits.
• Suppose that they first differ in position P.
• Node Ni then sends the message to its neighbor whose address, k, differs from i in
bit position P.
• Node Nk forwards the message to the appropriate neighbor using the same address
comparison scheme.
• The message gets closer to the destination node Nj with each of these hops from one
node to another.
For example, consider a message sent from N0 to N5. It first traverses from N0 to N1, since
the addresses differ in the first bit position, and then from N1 to N5, since those addresses
differ in the third bit position. Therefore, it takes two hops. The maximum distance that
any message needs to travel in an n-dimensional hypercube is n hops.
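This routing rule amounts to repeatedly correcting the lowest-order bit in which the current address differs from the destination. A minimal C sketch of the rule follows; the function names and printout are assumptions made for illustration.

#include <stdio.h>

/* Hop-by-hop hypercube routing: at each step, flip the lowest-order
   bit in which the current node's address differs from the
   destination's. Returns the number of hops taken. */
static int route(unsigned src, unsigned dst) {
    unsigned cur = src;
    int hops = 0;
    while (cur != dst) {
        unsigned diff   = cur ^ dst;            /* bits still wrong      */
        unsigned lowest = diff & (~diff + 1u);  /* lowest differing bit  */
        printf("hop %d: N%u -> N%u\n", ++hops, cur, cur ^ lowest);
        cur ^= lowest;                          /* move to that neighbor */
    }
    return hops;
}

int main(void) {
    /* The example from the text: N0 (000) to N5 (101) takes two hops. */
    printf("total hops: %d\n", route(0u, 5u));
    return 0;
}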
Mesh Networks
In a mesh network, the nodes are arranged in a two-dimensional grid. The links between
the nodes are bi-directional, and a functional unit is attached to each node.
One of the simplest and most effective routing possibilities is to choose the path between a
source node Ni and a destination node Nj such that the transfer first takes place in the
horizontal direction from Ni towards Nj. When the column in which Nj resides is reached,
the transfer proceeds in the vertical direction along this column. If a wraparound
connection is made between the nodes at the opposite edges of a mesh network, the result
is a network that comprises a set of bi-directional rings in the X direction connected by a
set of rings in the Y direction. This network is called a torus. The average latency of
information transfer is reduced in a torus, but the complexity increases.
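The horizontal-then-vertical rule is easy to express in code. The following sketch is a minimal illustration of the rule as stated; the coordinate representation and step-printing are assumptions for this example.

#include <stdio.h>

/* Route in a mesh: move along the row toward the destination column,
   then along that column toward the destination row. */
static void mesh_route(int sr, int sc, int dr, int dc) {
    int r = sr, c = sc;
    printf("(%d,%d)", r, c);
    while (c != dc) {                 /* horizontal phase */
        c += (dc > c) ? 1 : -1;
        printf(" -> (%d,%d)", r, c);
    }
    while (r != dr) {                 /* vertical phase   */
        r += (dr > r) ? 1 : -1;
        printf(" -> (%d,%d)", r, c);
    }
    printf("\n");
}

int main(void) {
    mesh_route(0, 0, 2, 3);  /* three steps right, then two steps down */
    return 0;
}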
Tree Networks
In a tree network, each parent node allows communication between two of its children at a
time. An intermediate-level node, such as the one shown in the figure, can provide a
connection from one of its child nodes to its parent. This enables two leaf nodes that are
any distance apart to communicate. Only one path at a time can be established through a
given node in the tree.
Figure 10.9: A Four-way Tree Network
To reduce the possibility of a bottleneck, the number of links in the upper levels of a tree
hierarchy can be increased. This is done in a fat tree network, in which each node in the
tree (except at the top level) has more than one parent. Figure 10.10 shows a fat tree in
which each node has two parents.
Ring Networks
One of the simplest network topologies uses a ring to interconnect the nodes in the system.
A single ring is shown in Figure 10.11.
Figure 10.11: A Single Ring
The main advantage of this arrangement is that the ring is easy to implement. Links in the
ring can be wide, because each node is connected to only two neighbors. However, it is not
useful to construct a very long ring to connect many nodes, because the latency of
information transfer would be unacceptably large.
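The latency concern can be quantified: on an n-node bidirectional ring the shortest path between two nodes is at most n/2 hops and about n/4 hops on average, so latency grows linearly with ring size. A small sketch of the hop count (the node numbering is an assumption):

#include <stdio.h>

/* Shortest-path hop count between nodes i and j on an n-node
   bidirectional ring: go whichever way round is shorter. */
static int ring_hops(int i, int j, int n) {
    int cw = (j - i + n) % n;          /* clockwise distance */
    return (cw < n - cw) ? cw : n - cw;
}

int main(void) {
    printf("16-node ring, N1 to N10: %d hops\n", ring_hops(1, 10, 16));
    return 0;
}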
A simple possibility is to use rings in a tree structure, resulting in a hierarchy of rings, as
shown in Figure 10.12. Having short rings substantially reduces the latency of transfers
that involve nodes on the same ring, and the latency of transfers between two nodes on
different rings is shorter than if a single ring were used.
16.3 Cache Coherence
The use of caches creates a problem known as the cache coherence problem: multiple
copies of the same data can exist in different caches simultaneously, and if processors are
allowed to update their own copies freely, an inconsistent view of memory can result.
Two write policies are in common use:
• Write back: Write operations are usually made only to the cache. Main memory is
only updated when the corresponding cache line is flushed from the cache.
• Write through: All write operations are made to main memory as well as to the
cache, ensuring that main memory is always valid.
It is clear that a write-back policy can result in inconsistency. If two caches contain the
same line, and the line is updated in one cache, the other cache will unknowingly hold an
invalid value; subsequent reads of that invalid line produce invalid results. Even with the
write-through policy, inconsistency can occur unless the other caches monitor the memory
traffic or receive some direct notification of the update.
For any cache coherence protocol, the objective is to let recently used local variables get
into the appropriate cache and stay there through numerous reads and writes, while using
the protocol to maintain consistency of shared variables that might be in multiple caches at
the same time.
In the update protocol, when a processor writes a new value into its cache, the new value is
also written into the memory module that holds the cache block being changed. Since
copies of this block may exist in other caches, these copies must be updated to reflect the
change caused by the write operation. The simplest way of doing this is to broadcast the
written data to all processor modules in the system.
As each processor module receives the broadcast data, it updates the contents of the
affected cache block if this block is present in its cache.
In the invalidate protocol, when a processor writes a new value into its cache, this value is
written into the memory module, and all copies in the other caches are invalidated. Again,
broadcasting can be used to send the invalidation requests throughout the system.
In the write-back protocol, multiple copies of a cache block may exist if different
processors have loaded (read) the block into their caches. If some processor wants to
change this block, it must first become the exclusive owner of the block. When ownership
is granted to this processor by the memory module that is the home location of the block,
all other copies, including the one in the memory module, are invalidated. Now the owner
of the block may change its contents. When another processor wishes to read this block,
the data are sent to this processor by the current owner. The data are also sent to the home
memory module, which reacquires ownership and updates the block to contain the latest
value.
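A minimal sketch of this ownership discipline follows, with each cache's view of one block reduced to a single state value. The state names and the array model are assumptions made for illustration, not the text's notation.

#include <stdio.h>

#define NPROC 4

enum state { INVALID, SHARED, EXCLUSIVE };

/* state of one cache block in each processor's cache */
static enum state copy[NPROC];

/* A write requires exclusive ownership: every other copy, and by
   implication the stale memory copy, is invalidated first. */
static void write_block(int p) {
    for (int q = 0; q < NPROC; q++)
        if (q != p) copy[q] = INVALID;
    copy[p] = EXCLUSIVE;
}

/* A read by another processor makes the owner supply the data (and
   write it back home); the block becomes shared again. */
static void read_block(int p) {
    for (int q = 0; q < NPROC; q++)
        if (copy[q] == EXCLUSIVE) copy[q] = SHARED;
    if (copy[p] == INVALID) copy[p] = SHARED;
}

int main(void) {
    read_block(0); read_block(1);  /* two shared copies     */
    write_block(1);                /* P1 becomes sole owner */
    read_block(0);                 /* P0 re-reads; shared   */
    printf("P0=%d P1=%d (1 = SHARED)\n", copy[0], copy[1]);
    return 0;
}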
There are both software and hardware solutions to the cache coherence problem.
Software solution
Software approaches are attractive because they transfer the overhead of detecting potential
coherence problems from run time to compile time. On the other hand, compile-time
software approaches generally make conservative decisions, leading to inefficient cache
utilization.
Compiler-based cache coherence mechanisms perform an analysis of the code to determine
which data items may become unsafe for caching, and they mark those items accordingly.
The operating system or hardware then refrains from caching the items marked
non-cacheable.
The simplest approach is to prevent any shared data variables from being cached. This is
too conservative, because a shared data structure may be used exclusively by one process
during some periods and may be effectively read-only during other periods. It is only
during periods when at least one process may update the variable and at least one other
process may access it that cache coherence is an issue.
More efficient approaches analyze the code to determine safe periods for shared variables.
The compiler then inserts instructions into the generated code to enforce cache coherence
during the critical periods.
Hardware solution
Hardware-based solutions fall into two broad categories: directory protocols and snoopy
protocols. Directory protocols collect and maintain information about where copies of
lines reside. Typically, there is a centralized controller that is part of the main memory
controller, and a directory that is stored in main memory. The directory contains global
state information about the contents of the various local caches.
When an individual cache controller makes a request, the centralized controller checks the
directory and issues the necessary commands for data transfer between memory and caches
or between the caches themselves. It is also responsible for keeping the state information
up to date; therefore, every local action that can affect the global state of a line must be
reported to the central controller.
The controller maintains information about which processors have a copy of which lines.
Before a processor can write to a local copy of a line, it must request exclusive access to
the line from the controller.
Before granting this exclusive access, the controller sends a message to all processors with
a cached copy of this line, forcing each processor to invalidate its copy.
After receiving acknowledgement back from each such processor, the controller grants
exclusive access to the requesting processor.
When another processor tries to read a line that has been granted exclusively to some
processor, it sends a miss notification to the controller. The controller then issues a
command to the processor holding that line, requiring it to write the line back to main
memory.
Directory schemes suffer from the drawbacks of a central bottleneck and the overhead of
communication between the various cache controllers and the central controller.
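The directory information described above is often held as a presence-bit vector per line. The following sketch of such an entry, and of the exclusive-access grant, reflects an assumed data layout for illustration, not the text's specification.

#include <stdio.h>
#include <stdint.h>
#include <stdbool.h>

#define NPROC 8

/* One directory entry per memory line, held by the central controller. */
struct dir_entry {
    uint8_t presence;   /* bit p set => processor p holds a copy  */
    bool    exclusive;  /* true while one processor owns the line */
    int     owner;      /* valid only when exclusive is true      */
};

/* Grant exclusive access to processor p. Conceptually the controller
   first sends invalidations to every other holder and waits for their
   acknowledgements; here that is modeled by clearing the other
   presence bits. */
static void grant_exclusive(struct dir_entry *e, int p) {
    e->presence  = (uint8_t)(1u << p);
    e->exclusive = true;
    e->owner     = p;
}

int main(void) {
    struct dir_entry e = { 0x2D, false, -1 };  /* copies at P0,P2,P3,P5 */
    grant_exclusive(&e, 3);
    printf("owner is P%d, presence bits 0x%02X\n", e.owner, e.presence);
    return 0;
}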
Snoopy protocols distribute the responsibility for maintaining cache coherence among all
of the cache controllers in a multiprocessor system. A cache must recognize when a line
that it holds is shared with other caches.
When an update action is performed on a shared cache line, it must be announced to all
other caches by a broadcast mechanism.
Each cache controller is able to "snoop" on the network to observe these broadcast
notifications and react accordingly.
Snoopy protocols are ideally suited to a bus-based multiprocessor, because the shared bus
provides a simple means for broadcasting and snooping.
Two basic approaches to the snoopy protocol have been explored: write-invalidate and
write-update (write-broadcast).
Check your progress I
1. For which shared (virtual) memory systems is the snooping protocol suited?
a. Crossbar connected systems
b. Systems with hypercube network
c. Systems with butterfly network
d. Bus based systems
2. Which MIMD systems are best scalable with respect to the number of processors?
a. Distributed memory computers
b. ccNUMA systems
c. nccNUMA systems
d. Symmetric multiprocessors
3. The cost of parallel processing is primarily determined by
a. Time Complexity
b. Switching Complexity
c. Circuit Complexity
d. None of the above
4. In a shared-address-space platform, ensuring that concurrent operations on multiple
copies of the same memory word have well-defined semantics is called ____________.
5. The maximum number of edges mapped onto any edge in E' is called the _________ of the
mapping.
16.4 Summary
1. Parallel processing is a method in computing of running two or more processors
(CPUs) to handle separate parts of an overall task.
2. Time shared bus is the simplest mechanism for constructing a multiprocessor
system.
3. The multiport memory approach allows the direct, independent access of main
memory modules by each processor and I/O module.
4. In NUMA architecture, all processors have access to all parts of main memory
using loads and stores.
5. In a multiprocessor system, the interconnection network must allow information
transfer between any pair of modules in the system.
6. A scheme known as the split-transaction protocol makes it possible to use the bus
during the idle period to serve another request.
7. The main limitation of a single bus is that the number of modules that can be
connected to the bus is not very large. Networks that allow multiple independent
transfer operations to proceed in parallel can provide significantly increased data
transfer rates.
8. A bus-based network is perhaps the simplest network, consisting of a shared
medium that is common to all the nodes. A bus has the desirable property that the
cost of the network scales linearly with the number of nodes, p.
9. In a star-connected network, one processor acts as the central processor. Every
other processor has a communication link connecting it to this processor.
10. The cache coherence problem is: Multiple copies of the same data can exist in
different caches simultaneously, and if processors are allowed to update their own
copies freely, an inconsistent view of memory can result.
11. In a multistage network, multiple stages of switches are used to set up a path
between a source and a destination.
12. The protocols used in a cache coherence system are (a) the invalidate protocol and
(b) the update protocol.
13. Snoopy protocols distribute the responsibility for maintaining cache coherence
among all of the cache controllers in a multiprocessor system.
Answers to Check your progress I
1. b
2. d
3. b
1. d
2. a
3. c
4. cache coherence
5. congestion
References
Burch, C. (2020, June 08). The hierarchy of memory & caches. Retrieved from
www.toves.org: http://www.toves.org/books/cache/ available under a Creative
Commons Attribution-Share Alike 3.0 United States License.
Deka, J. K. (2009, Dec. 31). Computer Organization and Architecture. Retrieved from
https://nptel.ac.in/courses/106/103/106103068/ available under Creative Commons
Attribution-NonCommercial-ShareAlike license.