Advanced Computer
Architecture
(MCA-603)
Course Objectives: To understand and analyse the functionality, connectivity and performance of various
processors and memory types.
Section-A
Fundamentals of Processors: Instruction set architecture; single cycle processors, hardwired and
micro- coded FSM processors; pipelined processors, multi-core processors; resolving structural, data,
control and name hazards; analysing processor performance.
Section-B
Section-C
Section-D
Advanced Memories: Non-blocking cache memories; memory protection, translation and virtualization;
memory synchronization, consistency and coherence.
Recommended Books:
Reviewed by:
Er. Gurpreet Singh Bains
Assistant Professor,
ECE, Punjab University,
SSGR Campus,
Hoshiarpur
Contents
The father of the computer is Charles Babbage, who designed the first Analytical Engine in 1833. This machine was capable of performing all four basic mathematical operations, and it had a number of features that are seen in modern computers: his design included a central processing unit, memory, storage and input/output.
Computer architecture consists of three main parts:
a) Central Processing Unit (CPU): The CPU consists of electronic circuitry, built from active and passive electronic components, which carries out the instructions and control logic of the computer so as to perform basic arithmetic and logical data processing.
The CPU consists of three main components:
a) Registers: A register is a temporary storage device required to hold the temporary data and variables used in data transfer and data processing.
b) Control Unit (CU): All data transfers and their movement within the central processing unit are monitored by the control unit. It controls the operation of the other units of the CPU by providing timing and control signals, much like a traffic light controls traffic.
c) Arithmetic Logic Unit (ALU): The ALU handles all the data-processing work of the CPU, such as arithmetic and logical operations. It takes the data and the necessary opcode (the set of tasks to be performed) and produces the processed result.
b) Memory: The storage element of the computer, which holds all the data and instructions of the user. It stores items such as the operating system, applications, the kernel, the firewall, user data and much more.
a) Primary memory: It holds all the data the computer needs in its current state. It has limited capacity, and its contents are lost when power is lost; such memories are known as volatile memories. They are accessed directly by the central processing unit, are faster than secondary memory, are costlier, and the computer system cannot work without them. Primary memory consists of:
1. Random Access Memory (RAM)
2. Read Only Memory (ROM)
b) Secondary memory: These types of memory are used to hold user data and system software. They are the non-volatile part of memory, whose contents are not erased when the power is switched off. They are slower than primary memory, and the CPU cannot access them directly; they are accessed only through input/output operations and are cheaper than primary memory. E.g., hard disks, pen drives.
1.2.1 Software: The intangible parts of a computer system, which cannot be touched, such as the operating system, office suites, presentation software, etc.
1.2.2 Software Abstraction Level:
a) Operating System: The operating system is system software which acts as an interface between the user and the hardware. It manages the computer system's hardware and software resources and provides common programs and services. Operating systems are classified into two types:
Single-user operating system: accessed by only a single user.
Multi-user operating system: accessed by multiple users at a time.
e) Hardware: The physical components which can be touched, such as the monitor, keyboard, CD (Compact Disc), etc.
Hardware Abstraction:-
Hardware abstractions are implemented in software, between the computing hardware and the system software that runs on that computer. Their main function is to hide differences in hardware from most of the operating system kernel, so that most kernel-mode code does not need to be changed to run on systems with different hardware. They also allow programmers to write device-independent, highly efficient applications that call on the hardware.
The first-generation machines were huge, bulky devices covered with transparent glass material, and they consumed a large amount of power compared to modern computers. At the same time, data had to be given in machine language, which cannot easily be understood by a new user. Because of their very large power requirements, these devices needed separate cooling units to prevent overheating, since overheating could damage the internal circuitry; they were not reliable and required long downtimes.
E.g., Electronic Numerical Integrator and Computer (ENIAC), Electronic Discrete Variable Automatic Computer (EDVAC), Electronic Delay Storage Automatic Computer (EDSAC), Universal Automatic Computer (UNIVAC-I)
b) Second Generation (1949-1954) [Transistor]:-
Figure 1.3: Transistors
Around the same time another major development took place with the invention of magnetic core memory. The cores are very small (about 0.02 inch) ring-shaped structures that can be magnetised in either the clockwise or the anticlockwise direction by applying a magnetic field.
With the invention of the transistor and magnetic core memory, memory and processing capability increased at a lower power requirement than with vacuum tubes. The small size of the devices reduced the size of computers, and the lower circuit complexity reduced downtime, but this generation still needed separate cooling systems.
With the increase in memory size, high-level languages also developed, and we see FORTRAN, COBOL, ALGOL, SNOBOL, etc., which are understandable by the user and in which error correction can easily be done.
The invention of the Integrated Circuit (IC) in 1958 by Jack Kilby at Texas Instruments replaced the discrete transistors. These chips are made of silicon, with the circuitry printed directly onto them; using lithography techniques, the functions of many active devices can be integrated onto a single chip with a large number of pins on either side.
Due to their miniature size and low power requirement, ICs became extremely popular in the computer industry. Because of their low power requirement they emit little heat, which largely removed the need for separate cooling units.
These ICs require an extremely purified form of silicon, and with advances in fabrication technology integrated circuits became much cheaper, more reliable and much faster than the vacuum tubes and transistors of the previous generations.
At first the IC industry was only capable of making Small Scale Integration (SSI) chips, with about 10 transistors per chip; with further advances the technology reached Medium Scale Integration (MSI), with about 100 transistors per chip. As a result, the size of main memory increased up to 4 megabytes, CPUs became more powerful, and they became capable of executing a million instructions per second (MIPS).
E.g, IBM 360, Honeywell 6000 Series Computer, ICL 1900 series, ICL 2900
In 1995 the Pentium series came into existence, and RISC (Reduced Instruction Set Computer) microprocessors became preferred for numeric and file-handling services.
E.g., IBM 5100 (the first portable computer), TOSHIBA T1100
Fifth-generation computers depend heavily on artificial intelligence and are still in development, using image processing, voice recognition, face recognition, fingerprint recognition, etc. The use of parallel processing has also increased significantly.
The use of Ultra Large Scale Integration (ULSI) circuits, which contain 1,000,000 or more transistors, significantly increases the speed of the central processing unit, and because of their high packaging density these ICs are easily used in Personal Digital Assistant (PDA) devices.
Fifth-generation computers have huge storage capacities, in the terabyte range and beyond. Advances in magnetic disk storage brought in the era of portable storage devices, and their high processing capability carries computing beyond the fifth generation.
Exercise
Q1: What are the different generations of evolution of Computers?
Ans: Refer section 1.3
Q2: What do you mean by hardware and software abstractions? Explain briefly.
Ans: Refer section 1.2
Q3: Define computer architecture.
Ans: Refer section 1.1
Chapter 2 Instruction Set Architecture
Contents
2.1 CISC Architecture
2.1.1 CISC Approach
2.1.2 Addressing Modes In CISC
2.1.3 CISC Examples
2.2 RISC
2.2.1 RISC Performance
2.2.2 RISC Architecture Features
2.2.3 RISC Examples
2.3 Comparison
Earlier, programming was done either in assembly language or in machine code, which led designers to develop instructions that are easy to use. With the advent of high-level languages, computer architects created dedicated instructions that would do as much work as possible and could be used directly to perform a particular task. The next step was to implement the concept of orthogonality, that is, to provide every addressing mode for every instruction. This allowed results and operands to be stored directly in memory instead of only in a register or as an immediate.
At that time hardware design was given more importance than compiler design, which became the reason for implementing functionality in microcode or hardware rather than in the compiler alone. This design philosophy was termed Complex Instruction Set Computer (CISC). CISC chips were the first PC microprocessors, since the instructions were built into the chips.
Another factor which encouraged this complex architecture was the very limited main memory of the time. The architecture proved advantageous because it led to a high density of information held in computer programs, as well as other features such as variable-length instructions and direct data loading. These issues were given higher priority than ease of decoding the instructions.
A further reason was that main memories were slow. With dense information packing, the frequency with which the CPU accesses memory can be reduced. Fast caches can be employed to compensate for slow memories, but they are of limited size.
2.1.1 CISC Approach
The main motive in designing the CISC architecture is that a task can be compiled into very few lines of assembly. This is accomplished by building hardware capable of understanding and executing a series of operations. For example, if we want to execute a multiplication, a CISC processor comes with a specific instruction ('MUL').
When this instruction is executed, the two values are loaded into separate registers, the operands are multiplied in the execution unit, and the product is stored in the appropriate register. The whole multiplication task can be compiled into just one instruction:
MUL 3:2, 2:5
2.1.3 CISC Examples
1. PDP-11
Series of 16-bit minicomputer
Most popular minicomputer
Smallest system that could run UNIX
The C programming language was easily implemented using several low-level PDP-11 features
2. Motorola 68000
Also known as Motorola 68K
16/32 bit CISC Microprocessor core
Introduced in 1979 with HMOS technology
Software forward compatible
Advantages
The compiler has to do relatively little work to translate a high-level language into
assembly language.
Because of the short code length, little RAM is required to store a program.
Disadvantages
Optimisation is difficult
Complex control unit
Hard to exploit complex machine instructions
2.2 RISC
RISC stands for Reduced Instruction Set Computing. The term 'RISC' is often misunderstood to mean that the number of instructions is small, i.e. that the machine has a small instruction set, but this is not true. We can have any number of instructions as long as each is confined within a particular clock period. The RISC instruction set can even be larger than that of a CISC machine; it is the complexity of the instructions that is reduced, which is why the term 'reduced instruction set' is used.
In RISC, an operation is divided into sub-operations. For example, if we want to add two numbers X and Y and store the result in Z, the operation is performed as:
Load ‘X’
Load ‘Y’
Add X and Y
STORE Z
RISC architecture is based on concept of pipelining due to which execution time of each
instruction is short and number of cycles is also reduced.
For efficient execution of the RISC pipeline, the most frequently used instructions and addressing modes are selected.
The tradeoff between RISC and CISC can be expressed in terms of the total time required to execute a task. Although CISC needs fewer instructions for a particular task, each instruction requires more cycles because of its complex operation. In addition to needing fewer cycles per instruction, the simplicity of the RISC architecture leads to a shorter clock period, and hence a higher speed, compared with CISC.
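One common way to make this tradeoff concrete (an illustrative addition, not stated explicitly above) is the standard performance equation:

Total execution time = (instructions / program) × (cycles / instruction) × (time / cycle)

CISC tries to minimise the instructions per program at the cost of cycles per instruction and cycle time; RISC accepts more instructions per program but minimises the other two factors.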
Load/Store architecture:
RISC architecture is also known as a load/store architecture because load and store operations are executed separately from the other instructions, giving a high level of concurrency. Access to memory is accomplished only through load and store instructions. The other operations in this instruction set are called register-to-register operations, since all the operands on which an operation is performed reside in the general-purpose register file (GPR) and the result is also stored in the GPR. The RISC pipeline is designed so that it can accommodate both operations and memory accesses with equal efficiency.
Selected set of instructions:
Concept of locality is applied in RISC that is small set of instructions are frequently used
leading to efficient instruction level parallelism, hence efficient pipeline organization.
Such pipeline executes three main instruction classes efficiently.
Load /Store
Arithmetic logic
Branch
Fixed-size instructions:
Fixed-size instructions result in efficient execution of the pipelined architecture. The ability to decode an instruction in one cycle also helps in executing branch instructions, where the outcome can be determined in one cycle and, at the same time, the new target instruction address can be issued.
Simple addressing modes:
One of the essential requirements of pipeline is simple addressing mode as it leads to
address calculation in predetermined number of pipeline cycles. In real programmes,
address computation requires only three simple addressing modes.
I.Immediate
II.Base + displacement
III.Base + Index
These addressing modes cover 80% of all addressing modes implemented in as process.
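As an illustrative note (an editorial addition), the effective address (EA) of the operand in each of these modes can be written as:

Immediate:            the operand is the value held in the instruction itself
Base + displacement:  EA = (base register) + displacement
Base + index:         EA = (base register) + (index register)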
ATMEL AVR
a) 8-bit RISC single chip microcontroller.
b) Modified Harvard architecture.
c) First microcontroller to use on chip flash memory.
POWER PC
Figure 2.6: SPARC IC
Advantages of RISC
Disadvantages
2.3 Comparison
Table 2.1: Comparison between RISC and CISC
CISC                                     | RISC
Large number of instructions (120-350)   | Fewer instructions (<100)
Multiclock, complex instructions         | Single clock, reduced instructions
2.4 Addressing Modes
A microprocessor can do many things, but only those operations that the user/programmer specifies will be carried out. To do this, the programmer should have knowledge of programming the different instructions.
For example, a bank shows the balance in an account, and it needs the account number to do this; the account number is the identity of the user. Similarly, for data stored at some memory location, the programmer/user has to specify the memory location in order to access the data. It is also very necessary to tell the computer that the values being specified are not data values but a memory location. So there is a need for protocols/rules to achieve this.
Another example can be taken from a simple calculator, in which the user specifies two values and wants an addition to be performed on them. In this case the values are to be operated on directly rather than fetched from memory locations. So, while programming the microprocessor, one needs to be very careful, otherwise the results will be incorrect and the applications which make use of the microprocessor will not be of any use.
Therefore, there is a need to define the rules for the following considerations:
1. How to specify whether the values typed are immediate data or the memory
locations
2. How to distinguish whether the data to be called is present in the register
specified in the instruction or further at the location which is given in the
register.
3. There is another type of instruction in which there is no need to specify immediate data or any register; such instructions operate directly on pre-specified registers or locations.
These problems are easily handled with the help of addressing modes. But before going into the concept of addressing modes and their classification, we need to discuss the concepts of opcode and operand.
2.4.1 Op-code and Operand
A microprocessor program consists of multiple instructions which perform
different operations. Each program instruction consists of two main parts:
The first part tells the microprocessor about the kind of operation to be
performed.
Second part tells the microprocessor about the data on which any specified
operation is to be performed.
The part which contains the information about the function to be performed is called the opcode, and the other part, which gives the data or the way to access the data, is called the operand.

Instruction = [ Opcode | Operand ]
When the above stated instruction will be executed, the accumulator contents will be
stored in the specified memory location i.e. 2004H. Suppose the accumulator contents
at the given time are 18 H. So, 18 H data stored in the accumulator will be copied to
the memory location 2004 H.
[Figure: direct addressing, showing the accumulator contents (18H) copied into memory location 2004H among locations 2001H-2008H.]
Figure 2.8: Indirect addressing
The concept of register addressing mode can further be illustrated by considering the
following examples:
ANA C
When the above stated instruction will be executed the contents present in the
Register C will be ANDed logically with contents present in the accumulator (register
A).
SUB L
The execution of the instruction SUB L will lead to the subtraction of the contents of
the register L from the accumulator contents.
2.4.3.4 Immediate addressing mode:
In the case of immediate addressing, the data to be processed is itself specified in the operand part of the instruction, while the opcode part specifies the operation to be performed. Based on these two things the microprocessor accomplishes the given task. It is a simple way of getting things done, since we provide both the function to be performed and the data on which it is to be performed.
Consider the following example:
ADI 34H – This instruction adds the immediate data, 34H to the
accumulator.
Suppose, the contents of the accumulator register at present are 8H. When the
instruction ‘ADI 34H’ will be executed, 34H will be added to 8H and the final result
will be stored in accumulator.
In the above instruction the operand is specified within instruction itself.
2.4.3.5 Implicit mode:
In this case, there is no need to type/write any register, data or memory location. The
data is automatically fetched from the predefined location according to the instruction
used/typed. Generally, this type of the addressing mode is used when there is a need
to operate on the data available in the accumulator only. Example:
CMA
RAL
RAR
2.5 MIPS Instruction Set
There are three types of MIPS instruction formats: R-type, I-type, and J-type.
2.5.1 R-type
R-type instructions are so called because they are register-type instructions. They are the most complex format. The format of an R-type instruction, and the way it is encoded, is given below.
opcode | register s (rs) | register t (rt) | register d (rd) | shift amount | function
Notice that there are three registers in such an instruction: one destination register ($rd) and two source registers ($rs and $rt). In the format shown above, the two source register fields come first, followed by the destination register field; the operation reads the two source registers and then writes the result to the destination register. This is how the programmer works with these instructions.
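To make the field layout concrete, here is a small illustrative Python sketch (an editorial addition, using the standard MIPS field widths of 6, 5, 5, 5, 5 and 6 bits; the opcode 0 and function code 32 shown in the example are the conventional values for add):

# Pack the six R-type fields (opcode, rs, rt, rd, shamt, funct) into a 32-bit word.
def encode_r_type(opcode, rs, rt, rd, shamt, funct):
    return (opcode << 26) | (rs << 21) | (rt << 16) | (rd << 11) | (shamt << 6) | funct

# Example: add $8, $9, $10  (opcode 0, funct 32)
word = encode_r_type(0, 9, 10, 8, 0, 32)
print(hex(word))   # 0x12a4020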
2.5.2 I-type
I-type stands for "Immediate type" instruction. The format of an I-type instruction is as shown below (a 6-bit opcode, two 5-bit register fields and a 16-bit immediate):
opcode | register s (rs) | register t (rt) | immediate
In this format, $rt is the destination register and $rs is the only source register.
where IR is the "instruction register", which holds the op-code of the current instruction. (IR15)^16 means that bit B15 of the instruction register (the sign bit of the immediate value) is repeated 16 times, and IR15-0 is the 16-bit immediate value itself.
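The sign-extension rule just described can be sketched in a few lines of Python (an illustrative editorial aid): bit 15 of the immediate is replicated into the upper 16 bits of the 32-bit value.

# Sign-extend a 16-bit immediate to 32 bits by replicating its bit 15.
def sign_extend_16(imm16):
    imm16 &= 0xFFFF
    if imm16 & 0x8000:               # sign bit set, so fill the upper half with 1s
        return imm16 | 0xFFFF0000
    return imm16

print(hex(sign_extend_16(0x0004)))   # 0x4
print(hex(sign_extend_16(0xFFFC)))   # 0xfffffffc (i.e. -4)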
2.5.3 J-type
J-type is an acronym for "jump type". The format of a J-type instruction is shown below:
B31-26: opcode | B25-0: target
j target
where PC (the program counter) holds the address of the next instruction to be executed. The upper 4 bits of the PC, appended with the 26 bits of the target and followed by two 0s, create the 32-bit jump address.
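The address formation described above can be sketched as follows (illustrative Python; the PC and target values in the example are arbitrary):

# Form the 32-bit jump address: upper 4 bits of PC, then the 26-bit target, then two 0 bits.
def jump_address(pc, target26):
    return (pc & 0xF0000000) | ((target26 & 0x03FFFFFF) << 2)

print(hex(jump_address(0x0040001C, 0x0100000)))   # 0x400000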
2.5.4 MIPS Arithmetic and Logic Instructions
These instructions are used to perform mathematical and logical operations, such as addition, subtraction, multiplication, division, comparison, negation, increment, decrement, ANDing, ORing, XORing, NOT, shifting and rotating. The flags are affected after executing these instructions. The CPU performs these operations on data stored in the CPU registers.
Some arithmetic and logical instructions are shown in the following table:
2.5.5 MIPS Branch control Instruction
The branch control instructions are used to transfer control to instructions that do not come immediately after the instruction currently being executed. The control transfer is done by loading the address of the target instruction into the program counter (PC); the next instruction to be executed is then the target instruction, which is read from memory at the new location. Branching can be conditional or unconditional. There are two types of branch control instructions:
• For example: loops, if statements.
Exercise
Q1. What are the different features of CISC and RISC architectures?
Ans: Refer section 2.1 and 2.2.
Q2. Explain different addressing modes for instructions.
Ans: Refer section 2.4.3
Q3. What are the different instruction sets defined for MIPS?
Ans: Refer section 2.5
Q.4 Write short notes on following:
a. Pipelining
b. Arithmetic and logical instructions
c. Branch control instructions
Ans: Refer section 2.5.4, 2.5.5 and 2.5.6
Chapter 3
Processor Design
Structure
3.0 Objectives
3.1 Introduction
3.2 Control Unit
3.3 Control Signals
3.4 Design Process
3.5 MIPS Microarchitecture
3.6 Hardwired Control
3.7 Micro programmed Control
3.8 Single Cycle Processor
3.9 Multi Cycle Processor
3.10 Pipelining
3.11 Multi Core Processor
3.12 Test Yourself
3.0 Objectives:
Objectives of this chapter are to familiarize one with the following aspects of Processor
Design:
Processor control mechanism
Important components in the design process
Introduction to various type of microarchitecture and their definition
Detailed description of Single cycle, multi cycle and pipelined microarchitecture
Different types of controls: Hard wired and Microprogrammed Control
What is meant by a Multi Core Processor?
3.1 Introduction:
The main motive of this chapter is to learn the design of a MIPS microprocessor, and you will see three different designs. Designing a microprocessor may seem like a daunting job, but it is actually quite straightforward once you know combinational and sequential logic. We assume that you are also familiar with circuits such as the ALU and memory, and that you have learned the MIPS architecture, which, from a programmer's point of view, is defined in terms of registers, instructions, and memory. In this chapter you will learn about the microarchitecture, which is the interface between the logic and the physical architecture. You will learn how to arrange registers, ALUs, finite state machines (FSMs), memories and the other building blocks needed to implement an architecture. A given architecture can have many different microarchitectures, and each microarchitecture makes different trade-offs of performance, cost, and complexity; their internal designs can differ widely.
The function of the control unit is to decode the binary machine word in the IR (Instruction
Register) and issue appropriate control signals. These cause the computer to execute its
program.
The control signals are to be generated in the proper sequence so that the instructions can be
executed in a proper way. The control signals are generated with the help of the internal logic
circuitry of the control unit.
We already know that the basic parts used to build a processor are the ALU, registers, datapaths and the operations that get executed. For the proper working of the control unit, it needs inputs that let it know the state of the system and outputs that let it command the system to behave in a certain way. This is how the control unit looks from the outside; from the inside, it needs logic circuits so that it can perform these operations.
Figure 3.2: Control unit inputs and outputs
Figure 3.2 shows the inputs and outputs. The inputs are
Clock: clock signal is used to “keep the record of time”. Every clock cycle is
important as it is required to perform the execution of the instructions. In MIPS
microarchitecture, one clock cycle is used to execute one micro-operation. This is
called processor clock time.
Flags: It is one of the important parts of control unit input that determines the effect
of the instruction execution on the processor. It is also required to determine the
output of the already executed ALU operations.
Control signals from control bus: The control signals from the control bus provide
signals to the control unit.
Control signals within the processor: There are two types of control signals:
I. The signals which result in the operations that move the data from one register
to another
II. The others that are used to initiate special ALU operations.
Control signals to control bus: They are divided into two categories:
I. Memory control signals
II. I/O control signals.
The control signals that are mostly used are: The signals that initiate an ALU operation, the
signals that are used to initiate the data paths and the signals that are used to direct the
external system. These control signals are applied directly in the form of zeroes and ones to
the logic units/gates.
The status, where the control unit has reached in the clock cycle, should be known to the
control unit itself. This is necessary for the control unit to take decisions. This knowledge is
used by the control unit, while it reads the input ports, to generate the control signals that
initiate the execution of further operations. The clock cycle is used to time the control signals
and to time the occurrence of events. This allows the signals to get stable.
Our microarchitectures are divided into two parts: the datapath and the control. The datapath
processes the data-words. Different structures like memories, registers, ALUs, and
multiplexers are present in the datapath. We will take an example of MIPS which is a 32-bit
architecture; hence we are going to use a 32-bit datapath. The function of control unit is to
receive the current instruction (which is to be executed) from the datapath and to tell the
datapath how to execute that instruction. In other words, the control unit selects the multiplexer lines, enables the registers, and gives write signals to the memory to control the operation of the datapath.
The program counter is a 32-bit register. It points to the current instruction which is to be
executed. The input of the program counter shows the address of the next instruction.
The instruction memory has a single read port. The function of the instruction memory is to
read the 32-bit instruction address input, A, and from that address, to provide the 32-bit
instruction on the RD lines after reading from that address.
The 32-bit register file has two read ports and one write port. The read ports take 5-bit address inputs, A1 and A2, and the corresponding 32-bit register values are read onto the data outputs RD1 and RD2. The write port takes a clock, a 5-bit address input A3, a 32-bit write data input WD, and a write enable input WE3. The data is written into the specified register on the rising edge of the clock when the write enable is 1.
In the data memory, one read and one write port is provided. If the write enable, WE, is 1,
data is written from WD into address A on the rising edge of the clock. If the write enable is
0, it reads address A onto RD.
In the single-cycle microarchitecture the entire instruction is executed in one cycle. It has a
simple control unit and is easy to explain. Nonarchitectural state is not required in the single
cycle processor because the operation is completed in one cycle only. However, the slowest
instruction limits the cycle time.
In the pipelined microarchitecture, the concept of pipelining is applied. Hence it can execute
several instructions at a time which helps in improving the throughput.
Before learning the above three microarchitectures, we will first learn the hardwired and
micro-coded/micro-programmed control unit.
3.6 Hardwired Control:
Here the control signals are produced by a hardwired circuit. As we know, the purpose of the control unit is to generate the control signals in the proper sequence, and the time slot dedicated to each control signal must be wide enough that the operation it indicates finishes before the next one in the sequence begins. Because the hardwired control unit is built from fixed (hardwired) units, and these units have a certain propagation delay, a small extra time interval is allowed for the output signals to stabilise. For the sake of simplicity, we assume that the time slots are equal, so a counter driven by a separate clock signal can be used to design the control unit; every step is dedicated to a particular part of the instructions of the CPU.
We know that a large number of instruction forms exist for the add operation. For example,
ADD NUM R1: add the contents of the memory location specified by NUM to the
contents of register R1 and store the result in R1.
It is clear from this example that the fetch operation is similar for all of them, but the control signals are generated separately for each of these ADD instructions.
Hence it is concluded that the type of instruction defines the control signals.
Also some of the instructions use the status flags. So the execution of these instructions
depends upon the flag register values and the content of the instruction register. For example
the conditional branch instructions like JZ, JNZ, etc.
The external inputs are coming from the Central Processing Unit. They define the status of
the CPU and the other devices connected to it. The condition codes/ status flags indicates the
state of the CPU. For example the flags like carry, overflow, zero, etc.
[Block diagram: the clock and step counter, the external inputs, the IR and the condition codes feed an encoder/decoder that produces the control signals.]
A simple block diagram can depict the structure of the control unit. But the detailed view can
be understood by going step by step into the design.
The decoder/encoder block is simply a combinational circuit that generates the required
control outputs depending on the state of all its input.
Every control step is provided with a separate control line by the decoder part. Also in the
output, separate control line is provided for every instruction in the instruction register.
The detailed view of the control unit organization is shown in the Figure 3.5.
Figure 3.5: Detailed view of the hardwired control unit. The clock drives a step counter (with a Reset input) feeding a step decoder; the IR feeds an instruction decoder; the step decoder, instruction decoder, external inputs, condition codes and the End signal all feed the encoder that produces the control signals.
The encoder block combines all the inputs to produce control signals. The encoder/decoder
consists of a large number of logic units/gates that process the input signals to produce a
control signal. Every output control signal is the result of combination of many input signals
coming from various units.
Instruction decoder decodes all the instructions coming from instruction register. The encoder
takes inputs from the instruction decoder.
Finally we can say that the hardwired control implementation has fixed blocks and fixed
number of control outputs depending upon the different combination of inputs.
3.7 Microprogrammed Control:
The other approach to generating the control signals is called microprogrammed control. Here the control unit is driven by a microprogram: the control signals depend upon a microprogram that provides an instruction sequence. This is done using a microprogramming language, in which operations are defined by instructions. This type of control unit is simple in terms of its logic circuitry. A microprogram is like a computer program: in an ordinary program, the main memory stores the program and instructions are fetched in a sequence that depends upon the program counter value.
The storage for the microprograms is called the microprogram memory, and the sequence of execution depends upon the microprogram counter (µPC). A microinstruction is a combination of binary digits, i.e. 0s and 1s. The microinstruction is fetched from the microprogram memory, and the output of the memory forms the control signals: if a control line contains 0, the corresponding control signal is not generated; if it contains 1, the control signal is generated at that instant of time.
There are different terms related to the microprogrammed control. Let us discuss them.
A control word is a group of bits that indicates the different control signals, so a different combination of zeroes and ones defines a different set of control signals. When a number of control words are combined, the resulting sequence becomes the microprogram of an instruction; we therefore call these individual control words microinstructions.
As already discussed, these microprograms are stored in a special memory called the microprogram memory. The control unit reads the microprograms from this memory in sequence and produces the control signals corresponding to an instruction. The reading of the control word (CW) is done with the help of the microprogram counter (µPC).
The basic organization of a microprogrammed control unit is shown in the Figure 3.6.
The role of the "starting address generator" is to load microprogram counter with initial
address of the microprogram when the instruction register is loaded with a new instruction.
The reading of microinstructions is done by the microprogram counter by using the clock.
[Figure 3.6: The clock drives the µPC, which addresses the microprogram memory; the memory output is the control word (CW).]
The condition codes and status flag play a major role in the execution of few instructions. For
example the execution of branch instruction needs to skip the ongoing execution sequence
and to enter a new execution sequence. So the designer has to design a control unit that can
handle the microprogrammed branch instructions.
For this purpose we use conditional microinstructions. These microinstructions tell the
address where the microprogram counter has to point. The address is called branch address.
Also these microinstructions point out the flags, input bits etc. that has to be checked. This all
action is defined in a single microinstruction. Branch instructions require the knowledge of
the flag register and the condition codes.
The "starting and branch address generator" takes the microinstruction control word bits. These bits indicate the branch address and the condition that has to be fulfilled before the jump/branch actually takes place. The other role of this block is to load the µPC with the address that it generates.
In ordinary program code, an instruction is first fetched and then executed. The instruction fetch phase is the same in microprogrammed control, but a common microprogram is used to fetch the instruction. This microprogram is located at a different memory location, so executing the fetch microinstructions involves that location.
During the execution of the current instruction, the address of the next one is calculated by
the “starting address generator unit”.
The main function of the µPC is to point to the location of the next microinstruction in the sequence. It is incremented every time a microinstruction is fetched. There are a few situations in which the µPC is loaded with a new value instead of simply being incremented:
1. On execution of the END microinstruction, the µPC starts pointing to the address of the first CW.
2. When the IR is loaded with a new instruction, the µPC points to the starting address of the microprogram for that instruction.
3. On a branch microinstruction, if the condition is fulfilled, the µPC is loaded with the branch address.
During execution of the END microinstruction, the microprogram produces an End control signal. With the help of this signal the µPC is loaded with the starting address needed to fetch the next instruction, which is nothing but the address of the starting CW. Every instruction has a microprogram associated with it.
Therefore it can be concluded that microprograms are very similar to computer programs; the only difference is that a microprogram is associated with every instruction. That is why this approach is called microprogrammed control.
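As a rough illustrative sketch (an editorial addition with made-up control words and signal names, not the actual microprogram of any processor), a microprogrammed control unit can be modelled as a table of control words indexed by the µPC:

# Minimal model of a microprogrammed control unit: the microprogram memory is a list
# of control words; each bit position of a control word corresponds to one control signal.
SIGNALS = ["PC_out", "MAR_in", "MEM_read", "IR_in", "PC_inc"]   # hypothetical signal names

micro_memory = [
    0b10100,   # step 0: assert PC_out, MEM_read   (hypothetical fetch microprogram)
    0b01010,   # step 1: assert MAR_in, IR_in
    0b00001,   # step 2: assert PC_inc
]

upc = 0
while upc < len(micro_memory):
    cw = micro_memory[upc]
    asserted = [name for i, name in enumerate(SIGNALS) if cw & (1 << (len(SIGNALS) - 1 - i))]
    print(f"step {upc}: assert {asserted}")
    upc += 1        # the uPC steps through the microprogram on each clock cycle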
To construct the datapath, we have to connect the state elements with combinational circuit
which can execute various instructions. Based on the current instruction, the appropriate
control signals are generated by the controller which contains the combinational logic. At any
given time, the specific instruction which is being carried out by the datapath is determined
by the control signals.
Firstly the instruction from instruction memory is read. Figure 3.8 shows that the address
lines of the instruction memory are connected to the program counter PC. A 32-bit instruction
is given out by the instruction memory, which is labeled Instr. The specific instruction that
was fetched decides the functioning of the processor. We will show how the datapath
connections work for the lw instruction. For an lw instruction, we next fetch the base address by reading the source register. The register is specified in the rs field of the instruction, Instr25:21.
Figure 3.9: Address lines of the instruction memory connected to address input A1 of register
The 16-bit immediate data must be sign-extended to 32 bits because we know that the 16-bit
immediate data may be positive or negative, as shown in Figure 3.10. We denote the 32-bit
sign-extended value as SignImm.
The ALU arrangement is shown in Figure 3.11. SrcA and SrcB are the two operands; for the lw instruction they come from the register file and the sign-extension unit respectively. The ALU can perform many functions; ALUControl is a 3-bit signal which directs the ALU to perform a particular one. The ALU produces a 32-bit output, denoted ALUResult, along with a flag denoted Zero, which indicates whether the result of the ALU is zero or not. While executing the lw instruction, ALUControl is set to binary 010 (decimal 2) to compute the address by adding the base address and the offset. The ALUResult is fed into the data memory, where it serves as the address for the load instruction, as shown in Figure 3.11.
The rt field specifies the destination register for the lw instruction, and it is connected to the address input, A3, of the register file. The write data input of port 3 of the register file, WD3, must be connected to the ReadData signal of the data memory. RegWrite is a control signal; the write enable input of port 3 of the register file, WE3, is connected to the RegWrite control signal. While executing a lw instruction the RegWrite signal is asserted in order to write the data value to the register file. The write operation is done on the rising edge of the clock at the end of the cycle.
During the execution of one instruction, there is one more thing the processor must do: compute the memory address of the next instruction, PC'. As instructions are 32 bits, i.e. 4 bytes long, the address of the next instruction is at PC + 4. Hence an adder comes into play, and the PC value is incremented by 4. On the next rising edge of the clock, the address of the new instruction is loaded into the program counter so that the processor can fetch the next instruction. This completes the datapath for the lw instruction.
Like the lw datapath, a datapath for the sw instruction can be designed. As with lw, the sw instruction reads the base address from a port of the register file and sign-extends the immediate; to find the memory address, the ALU adds the base address to the sign-extended immediate. In addition, however, the sw instruction reads a second register from the register file and writes it to the data memory. Figure 3.14 shows the new connections for this function. The register is specified in the rt field, Instr20:16; these bits of the instruction are connected to the second register file read port, A2, and the register value is read onto the RD2 port.
The enhanced datapath handling R-type instructions is shown in Figure 3.15. The register file reads two registers and the ALU performs an operation on them. Until now, the ALU always received its second operand, SrcB, from the sign-extended immediate (SignImm); a multiplexer is therefore needed to select SrcB from either the register file RD2 port or SignImm. A new control signal, ALUSrc, controls this multiplexer: if ALUSrc is 0, SrcB comes from the register file; if it is 1, SignImm is chosen (for lw and sw). In R-type instructions, ALUResult is written to the register file, so another multiplexer is added to choose between ReadData and ALUResult; its output is denoted Result. This multiplexer is controlled by the MemtoReg signal: MemtoReg is 0 for R-type instructions, so that Result comes from ALUResult, and 1 for lw, so that ReadData is chosen. The value of MemtoReg does not matter for sw, because sw does not write to the register file. A further multiplexer is added to choose WriteReg; it is controlled by the RegDst control signal. If RegDst = 1, WriteReg comes from the rd field; if RegDst = 0, the rt field is chosen (for lw).
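The signal settings just described can be collected into a small summary. The sketch below is an editorial aid (a Python dictionary using the signal names from the text, with 'x' meaning don't-care), not a figure from the original material:

# Main-decoder outputs for the single-cycle datapath discussed above ('x' = don't care).
control = {
    "R-type": dict(RegWrite=1, RegDst=1,   ALUSrc=0, MemWrite=0, MemtoReg=0),
    "lw":     dict(RegWrite=1, RegDst=0,   ALUSrc=1, MemWrite=0, MemtoReg=1),
    "sw":     dict(RegWrite=0, RegDst="x", ALUSrc=1, MemWrite=1, MemtoReg="x"),
}
for instr, sig in control.items():
    print(instr, sig)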
Next the datapath is extended for the beq instruction. The beq instruction compares two registers; if they are equal, the branch offset is added to the program counter and the result is taken as the new program counter value. The offset is the number of instructions to branch past. Other elements are added to the datapath for this purpose, for example a fourth multiplexer, a shift-left-by-2 unit, additional control signals, and use of the ALU for the comparison.
Figure 3.16: Complete single cycle datapath
The function of the control unit is to generate the control signals, based on the opcode and on the funct field of the instruction. The opcode is the main source of the control signals; the funct field is used by the R-type instructions to tell the processor which ALU operation to perform. The control unit is therefore divided into two blocks of combinational logic, shown in the accompanying figure: the main decoder computes most of the outputs from the opcode, including a 2-bit ALUOp signal.
3.8.3 The complete single cycle processor:
The single-cycle processor has some weaknesses. First, it requires a clock cycle long enough for the slowest instruction (lw). Second, it requires several costly adders. Third, it has separate instruction and data memories. These limitations are removed in a multicycle processor, where each instruction is broken into shorter steps; different instructions need different numbers of steps, only one adder is required, and a combined instruction and data memory is used.
In an lw instruction, the base address is read from the source register, which is specified in the rs field of the instruction.
Figure 3.17 Program counter selecting instruction location
The register file has several address inputs. The address input A1 is connected to the rs field of the Instr output, as shown in Figure 3.18. The register file reads the register addressed by A1 and puts its value onto RD1; a nonarchitectural register A is used to store this value.
We know that the lw instruction also needs an offset. The offset is stored in the immediate field and is sign-extended to 32 bits, as the figure shows.
The load address is computed by adding the base address to the offset. This is done by using the ALU, as shown in Figure 3.20.
Figure 3.20: Computing load address
The addition is done when ALUControl is set to 010. A register called ALUOut is used to store the ALUResult. After the address calculation, the data is loaded from that address in the memory. The memory address is selected by a newly added multiplexer in front of the memory, producing Adr, as shown in Figure 3.21.
A signal called IorD is used to select between an instruction address and a data address: in the first step the instruction is fetched using the address in the PC, and later the calculated address is treated as the data memory address. Another register is used to store the memory read output.
Next we write the data back to the register file. This is illustrated in Figure 3.22.
During the write-back step of the processor, the program counter must also be updated by adding 4 to the PC. The multicycle processor differs from, and improves on, the single-cycle processor in the way it uses its ALU: a multiplexer is added so that the existing ALU can be reused for different operations, as shown in Figure 3.23.
The sw instruction differs from lw in that it reads another register from the register file and writes it into the memory, as shown in Figure 3.24.
The rt field is used to specify this register; the second read port, A2, is connected to the rt field of the instruction, Instr20:16. The register value is stored in register B, which is a nonarchitectural register, and register B drives the write data port (WD) of the memory. The MemWrite signal is used to control the memory write.
Figure 3.25 Datapath for R-type instructions
For the R-type instructions, two multiplexers are added. Two source registers are read from the register file. One input of the SrcB multiplexer selects register B, so that the second source register can be used by the ALU, as shown in Figure 3.25. The computation is done in the ALU and the result is stored in ALUOut. Next, ALUOut is written back to the register specified by the rd field. Another multiplexer, controlled by the MemtoReg control signal, is used to select whether WD3 comes from ALUOut (for R-type instructions) or from Data (for lw).
Additional components, such as the shift-left-by-2 unit, another multiplexer and the branch controls, are added to compute the address of the next instruction. As each instruction is 32 bits, i.e. 4 bytes wide, the address of the next instruction is computed by adding 4 to the PC; hence PC' = PC + 4. PCSrc is a control signal used to select the source of the next program counter value.
The control signals are computed by the processor based on the opcode, similarly to the single-cycle processor, and the funct field is also used. Figure 3.27 shows the multicycle control, and the complete multicycle processor is shown in figure 3.28.
Figure 3.27 Multicycle control
3.10 Pipelining
The CPU performance can be improved by modifying the CPU organization. We have studied the impact of using a number of registers in place of only one accumulator, and the use of cache memory is also very important for improving performance. In addition, another technique known as pipelining is used. Pipelining helps the designer improve the performance of the processor by exploiting the concept of parallelism.
For this purpose the instruction is broken down into small tasks, and the different tasks are executed in different elements. An instruction is executed in two phases, instruction fetch and instruction execution, and the CPU performs these one after the other. For every instruction there is a fetch step and an execute step associated with it; suppose Fi and Ei are the two steps associated with instruction Ii. These fetch and execute steps are shown in figure 3.29.
[Figure 3.29: Sequential execution; instructions I1, I2, I3, I4, ... are processed as F1 E1, F2 E2, F3 E3, F4 E4, ..., with each fetch Fi followed by its execute Ei before the next instruction begins.]
It is clear from the figure that the fetch and execute operations are performed one after the other for every instruction: instruction I1 can only be executed after its fetch has completed, and the fetch of the second instruction can only take place after the first instruction has finished.
Now suppose the processor has two hardware units, one dedicated to fetching and the other to execution. The fetch unit fetches an instruction and stores it in a storage buffer B1, and the execution unit executes that instruction. While the execution unit is executing, the fetch unit starts fetching the second instruction and stores it.
Besides these two operations there are other operations like decode, operand read and result
write back. Therefore the instruction execution can be divided into following parts:
In the first cycle, the instruction is fetched from the memory location by the fetch unit
and stored in an intermediate buffer
After this, the fetch unit starts fetching the second instruction from the program
memory
While the fetch unit is fetching second instruction, the decode unit starts decoding the
first instruction
Hence in the second cycle, the decoding of first instruction and the fetching operation
of second instruction are done
In the third cycle the fetch unit fetches the third instruction, the decoding unit decodes
the second instruction and the read unit reads the operands for the first instruction
from the data memory
In the fourth cycle the fetch unit fetches the fourth instruction, the decoding unit
decodes the third instruction, the read unit reads the operands for the second
instruction and the execute unit executes the first instruction
In the fifth cycle the fetch unit fetches the fifth instruction, the decoding unit decodes
the fourth instruction, the read unit reads the operands for the third instruction and the
execute unit executes the second instruction and the write back unit performs the
operation of writing the results of first instruction back into the memory
Hence the first instruction is completed in five cycles and besides this different
operations on other instructions have also been performed
This parallelism continues until the completion of the last instruction in the sequence
This approach is very helpful in improving the performance of the CPU, as shown in the timing diagram of the pipelined processor (figure 3.30). Looking at that diagram, there are five instructions and each instruction requires five clock cycles for its completion. If all the instructions were executed sequentially, without pipelining, they would require 5 × 5 = 25 clock cycles; with pipelining, all the instructions are executed in just 9 clock cycles.
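The 25-versus-9 comparison follows from a simple formula; the short Python sketch below (an editorial illustration) reproduces it for a 5-stage pipeline:

# Clock cycles needed to run n instructions on a k-stage pipeline versus sequentially.
def sequential_cycles(n, k):
    return n * k                 # every instruction uses all k stages one after another

def pipelined_cycles(n, k):
    return k + (n - 1)           # first instruction takes k cycles, then one finishes per cycle

n, k = 5, 5
print(sequential_cycles(n, k))   # 25
print(pipelined_cycles(n, k))    # 9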
Figure 3.31: Pipelined Processor
3.11 Multi Core Processor:
We know that a single-core processor has only one CPU core to perform all the computations. This CPU core consists of a register file, ALU, control unit and many other components. The single-core processor is shown in figure 3.32. Only one CPU core performs all the computations in the single-core architecture. In a multi-core processor the concept of multiprocessing is used: two or more cores are present on a single physical chip, and these cores can share caches and may also pass messages.
Consider, for example, dual-core processors. In a dual-core system, the chip contains two computer cores; usually a single die contains two identical processors. Each core has its own path connecting it to the front-side bus. Multi-core can therefore be thought of as an expanded version of the dual-core technology.
In a dual-core processor, there are two execution cores, each with its own front-side bus interface. The individual caches of the cores enable the operating system to utilise the parallelism in order to improve multitasking. The operating system and software are optimised to exploit thread-level parallelism, which refers to running multiple threads at one time. A thread is a small portion of the operating system or an application program that can be executed independently of any other part.
If the operating system supports thread-level programming, we can see the benefits of dual-core processors even if the application program does not. For example, in Microsoft Windows XP we can work on multiple applications simultaneously: we can browse the Internet while MS Office runs in the background and music plays on the Media Player, all handled simultaneously by the dual-core processor. Nowadays, most operating systems and application software support thread-level programming.
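As a small illustration of thread-level parallelism (an editorial sketch, not tied to any particular operating system), two independent threads can run concurrently and, on a dual-core processor, may be scheduled onto separate cores:

import threading

# Two independent tasks, analogous to a browser and a media player running together.
def task(name, count):
    for i in range(count):
        print(f"{name}: step {i}")

t1 = threading.Thread(target=task, args=("browser", 3))
t2 = threading.Thread(target=task, args=("player", 3))
t1.start(); t2.start()     # the OS scheduler may place these threads on different cores
t1.join(); t2.join()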
There are two types of multi-core processors: symmetric multi-core and asymmetric multi-
core. In a symmetric multicore processor, the single IC consists of identical CPU cores which
have similar design and similar features. On the other hand an asymmetric multi-core
processor is one that has multiple cores on a single IC chip, but these cores have different
designs.
3.12 Test Yourself:
Q1. What are the main parts of a CPU?
Answer: The ALU, control unit, register file, data memory, instruction memory and program counter are the main parts of the CPU.
Q2. What is the function of the control unit?
Answer: The function of the control unit is to decode the binary machine word in the IR
(Instruction Register) and issue appropriate control signals. These cause the computer to
execute its program.
The control signals are to be generated in the proper sequence so that the instructions can be
executed in a proper way. The control signals are generated with the help of the internal logic
circuitry of the control unit.
Q3. What are the two phases of instruction execution?
Answer: An instruction is executed in two phases, instruction fetch and instruction execution; the CPU performs these one after the other.
Chapter 4
Pipelined Processors
Structure
4.0 Objectives
4.1 Introduction
4.2 Structural Hazards
4.3 Data Hazards
4.4 Branch/Control Hazards
4.5 Test Yourself
4.0 Objectives
4.1 Introduction
Sometimes situations arise that prevent the next instruction from executing during the clock cycle designated for it. These situations are called hazards, and they can reduce the performance of a pipelined processor considerably. Three types of hazards occur:
1. Structural hazards occur when there is not enough hardware to support all possible combinations of instructions in overlapped execution.
2. Data hazards occur when the execution of one instruction depends on the result of a previous one that has not yet completed, while both instructions are in the pipeline simultaneously.
3. Control hazards occur because some instructions change the value of internal control registers such as the PC.
Pipeline stalling becomes necessary when hazards occur. To keep a hazard in check, some instructions are allowed to proceed while others are delayed; as a result, the fetching of new instructions is stopped until the earlier ones are cleared.
For structural hazards not to happen during pipelined execution, the functional units must themselves be pipelined and resources must be duplicated, so that instructions can overlap in any combination. Sometimes, because of a lack of resources, an instruction cannot be executed, and a hazard occurs. Structural hazards commonly arise when a functional unit is not properly pipelined: a sequence of instructions that all try to access the non-pipelined unit cannot execute at the rate of one instruction per clock cycle. Likewise, if resources are not properly duplicated, some combinations of instructions cannot be executed together. For example, if there is only one ALU available but the processor tries to use it for two additions in the same clock cycle, the processor is said to have a structural hazard.
When such a condition occurs, the pipeline stalls one of the instructions until the earlier instruction has executed and the ALU is released for use by the next instruction. This kind of hazard increases the cycle count per instruction. Sometimes a processor uses the same memory for both data and instructions. When one instruction has to access the data memory while a new instruction is being fetched from the instruction memory, pipelining causes the second instruction to conflict with the first, as shown in Figure 4.1. To resolve this conflict, the second instruction again has to wait until the first has been executed. This stall costs one clock cycle and so again raises the CPI.
For a processor with the structural hazard, the average instruction time is:
Average instruction time = CPI × Clock cycle time
= (1 + 0.4 × 1) × (Clock cycle time_ideal / 1.05)
= 1.3 × Clock cycle time_ideal
(The numbers here follow the classic textbook example: data references are assumed to make up 40% of the instruction mix, each conflict costs one stall cycle, and the hazard-free design is assumed to have a clock cycle 1.05 times longer than the design with the hazard.)
From this equation we find that the processor without the structural hazard is faster: the ratio of the two average instruction times shows that it is about 1.3 times faster than the processor with the structural hazard.
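As an illustration only, the calculation above can be reproduced with a short Python sketch; the 40% data-reference frequency, the one-cycle stall and the 1.05 clock-cycle factor are assumed inputs taken from the example, not measured values.

    # Sketch: average-instruction-time comparison for a structural hazard.
    # Assumed inputs: 40% of instructions reference data, each conflict costs
    # 1 stall cycle, and the hazard-free design needs a clock cycle that is
    # 1.05 times longer than the design with the hazard.

    def avg_instruction_time(cpi, clock_cycle):
        return cpi * clock_cycle

    clock_ideal = 1.0                  # hazard-free clock cycle (arbitrary units)
    clock_hazard = clock_ideal / 1.05  # the design with the hazard has a faster clock

    data_ref_freq = 0.4                # fraction of instructions that reference data
    stall_cycles = 1                   # stall cycles per conflict

    t_ideal = avg_instruction_time(1.0, clock_ideal)
    t_hazard = avg_instruction_time(1.0 + data_ref_freq * stall_cycles, clock_hazard)

    print(t_hazard / t_ideal)          # prints 1.33..., i.e. roughly 1.3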
To remove this kind of structural hazard, the cache memory can be split into two parts, a data cache and an instruction cache, each of which can be accessed separately. Another method is to use a set of buffers, called instruction buffers, whose main function is to hold instructions. All other factors being equal, a pipelined processor without structural hazards will always have a lower CPI.
If structural hazards are so undesirable, why would a designer build a processor that is not totally free of them? The reason is cost: a completely hazard-free pipelined design requires duplicated functional units and separate data and instruction memories, which increases the number of hardware units and therefore the cost.
4.3 Data Hazards
The timing of instructions changes when pipelining is introduced, because instruction execution is overlapped. Due to this overlap, a problem arises when overlapping instructions refer to the same data; this problem is called a data hazard. Viewed from outside the pipeline, the sequence of instructions appears to execute in its sequential order, but the actual order of reads and writes can change, and with it the order of data/operand accesses.
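The instruction sequence being discussed does not survive in the text; the classic MIPS-style sequence assumed in what follows is:

    DADD R1, R2, R3
    DSUB R4, R1, R5
    AND  R6, R1, R7
    OR   R8, R1, R9
    XOR  R10, R1, R11

Here DADD writes R1, and every following instruction reads R1 as a source operand.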
Here the instructions following DADD use R1, which is the output of that first instruction. In the first instruction, R1 is written after the operands are added. The value of R1 is written by the DADD instruction in the WB pipe stage, but it is read in the ID stage by the DSUB instruction. This reading of data before it has been written creates a problem called a read-after-write data hazard. If this happens, the DSUB instruction reads the wrong value from register R1 and produces a wrong result, so the designer must take steps to remove such problems.
It is also not guaranteed that the following instructions will always read the stale value of R1, i.e. the value assigned before the DADD instruction. For example, if an interrupt occurs before the DSUB instruction completes, the WB stage of the DADD will finish, and the correct value of R1 will then be available to DSUB. This unpredictable behaviour gives uncertain results, which is unacceptable. Other instructions are also affected by this data hazard: as the figure shows, R1 is not written until the end of the 5th clock cycle, so any instruction that reads R1 during the earlier cycles will read the wrong value.
Operations such as XOR and OR execute correctly. The XOR instruction works because it reads its registers in the 6th clock cycle, after the register has been written. The OR instruction is also free of the hazard because the register file is written in the first half of the clock cycle and read in the second half.
Suppose there are two instructions i1 and i2, then following things may happen:
A. Read After Write (RAW)
(i2 tries to read a source before i1 writes to it)
A read-after-write (RAW) data hazard occurs when an instruction needs a result that has not yet been calculated. This situation can occur even though the instruction is issued after the previous one, because the previous instruction has not yet completed in the pipeline.
Example
i1. R2 <- R1 + R3
i2. R4 <- R2 + R3
In the first instruction, the value to be saved in register R2 is calculated, and this value is used by the second instruction. In a pipeline, however, when the operands of the second instruction are fetched, the output of the first instruction has not yet been saved. This creates a data dependence: i2 is data dependent on i1, because it depends on the completion of instruction i1.
B. Write After Read (WAR)
(i2 tries to modify a destination before it is read by i1)
Example
i1. R4 <- R1 + R5
i2. R5 <- R1 + R2
If instruction i2 may be executed before i1 (e.g. in parallel execution), we must take care that the new value of register R5 is not stored before i1 has completed and read the old value.
C. Write After Write (WAW)
(i2 tries to write an operand before it is written by i1)
Example
i1. R2 <- R4 + R7
i2. R2 <- R1 + R3
We must take care that the write-back of i2 is not performed until the execution of i1 finishes.
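A compact way to see the three dependence types is to compare the source and destination registers of two instructions. The following sketch is purely illustrative; the tuple encoding of an instruction as (destination, [sources]) is an assumption made for the example, not a real ISA.

    # Illustrative sketch: classify RAW / WAR / WAW dependences between two
    # instructions, each modelled as (destination, [sources]).

    def classify_hazards(i1, i2):
        d1, s1 = i1
        d2, s2 = i2
        hazards = []
        if d1 in s2:                 # i2 reads what i1 writes
            hazards.append("RAW")
        if d2 in s1:                 # i2 writes what i1 still has to read
            hazards.append("WAR")
        if d1 == d2:                 # both write the same register
            hazards.append("WAW")
        return hazards

    # i1. R2 <- R1 + R3   and   i2. R4 <- R2 + R3  (the RAW example above)
    print(classify_hazards(("R2", ["R1", "R3"]), ("R4", ["R2", "R3"])))   # ['RAW']
    # i1. R4 <- R1 + R5   and   i2. R5 <- R1 + R2  (the WAR example above)
    print(classify_hazards(("R4", ["R1", "R5"]), ("R5", ["R1", "R2"])))   # ['WAR']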
4.3.1 Minimizing Data Hazards
The problem of data hazards can be reduced by a hardware technique known as forwarding, also called bypassing or short-circuiting. The principle behind forwarding is that the DSUB instruction does not actually need the result of the DADD instruction until DADD has produced it. The result of the DADD instruction is therefore moved from the pipeline register where DADD has stored it to the point where DSUB requires it, which avoids the need for a stall. Following this principle:
1. The ALU results held in the EX/MEM and MEM/WB pipeline registers are fed back to the inputs of the ALU.
2. If a previous ALU operation has written the register corresponding to a source of the current ALU operation, the forwarding control logic selects the forwarded result as the ALU input rather than the value read from the register file.
For forwarding to work as described, DSUB must not be stalled; similarly, if an interrupt occurs between the two instructions, forwarding does not take place, because the DADD instruction completes before DSUB begins execution and the value is simply read from the register file. As Figure 4.3 makes clear, the forwarding hardware must forward results not only from the immediately preceding instruction but also, possibly, from an instruction that started two cycles earlier.
Figure 4.2 shows how the bypass paths are placed and how the bypass is performed; it also indicates the different timings of register reads and writes. With these paths in place, the following code can be executed without stalls.
The designer can arrange forwarding so that a result is passed directly to the functional unit that requires it: the result is forwarded from the output of one unit to the input of another via a pipeline register, rather than being taken only from the register file. Consider the following example: a stall can be avoided if the outputs of the ALU pipeline registers are forwarded to the ALU inputs, and the outputs of the memory-unit pipeline registers are forwarded to the input of the data memory. All of this is illustrated in Figure 4.3.
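The control decision described above can be sketched as follows; the pipeline-register names and field names are assumptions made for illustration, not the exact hardware of any particular processor.

    # Illustrative sketch of forwarding (bypassing): choose the ALU input from
    # the register file, the EX/MEM pipeline register, or the MEM/WB pipeline
    # register. The dictionaries model pipeline registers and are assumptions
    # made for this example.

    def select_alu_input(src_reg, regfile, ex_mem, mem_wb):
        # Most recent producer wins: an instruction one stage ahead (EX/MEM)
        # takes priority over one that is two stages ahead (MEM/WB).
        if ex_mem.get("reg_write") and ex_mem.get("dest") == src_reg:
            return ex_mem["alu_result"]
        if mem_wb.get("reg_write") and mem_wb.get("dest") == src_reg:
            return mem_wb["result"]
        return regfile[src_reg]

    regfile = {"R1": 0, "R2": 7, "R3": 5}
    ex_mem  = {"reg_write": True, "dest": "R1", "alu_result": 12}   # DADD R1,R2,R3 just left EX
    mem_wb  = {"reg_write": False}

    # DSUB needs R1: it receives 12 from EX/MEM instead of the stale 0 in the register file.
    print(select_alu_input("R1", regfile, ex_mem, mem_wb))           # 12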
4.4 Branch/Control Hazards
The effect of control hazards is enormous; they cause greater performance losses than data hazards. Executing a branch may or may not change the value of the program counter. If the branch does not change the PC, the PC is simply incremented by 4 and the sequential instruction stream continues, so the instruction at the branch target is not executed; this is called a not-taken (untaken) branch. If the branch changes the PC to its target address, it is called a taken branch. For example, if a branch targets an instruction I, then after the target address has been computed the PC is normally not changed until the end of ID.
Figure 4.4 shows that once the branch is detected during ID, the simplest way to deal with it is to redo the instruction fetch of the instruction that follows the branch. Since no useful work is done in that first IF cycle, it acts as a stall in the pipeline. Note that for an untaken (not-taken) branch, repeating the IF stage is unnecessary, because the correct instruction has already been fetched.
A stall of even one cycle for every branch causes a performance loss of roughly ten to thirty percent in a pipelined processor, so designers take care to measure this loss and use techniques to reduce it.
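The size of this loss can be estimated with a simple CPI model; the branch frequencies used below are assumed example values chosen to span the ten-to-thirty-percent range quoted above.

    # Illustrative sketch: performance loss from a fixed one-cycle branch stall.
    # The branch frequencies below are assumed example values.

    def slowdown(branch_freq, stall_cycles=1, base_cpi=1.0):
        cpi = base_cpi + branch_freq * stall_cycles
        return cpi / base_cpi            # >1 means the pipeline is that many times slower

    for freq in (0.10, 0.20, 0.30):
        print(f"branch frequency {freq:.0%}: {slowdown(freq):.2f}x slower")
    # With 10%-30% branches and a single-cycle stall, performance drops by
    # roughly 10%-30%, matching the range quoted above.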
Many design techniques can be used to reduce the pipeline stalls caused by branch delays. Here we discuss four techniques for dealing with branch delays. Each has a fixed action that is taken for every branch, so the action does not change during execution. The compiler can reduce the branch penalty by exploiting its knowledge of the hardware scheme and of branch behaviour.
Freeze or flush is the simplest scheme for dealing with branches: it either holds or deletes any instructions fetched after the branch until the branch destination is known. Once the destination of the branch instruction is known, the held instruction is released or the instruction stream is re-fetched. This technique is very simple to implement in hardware and in software, and it is commonly used; its drawback is that the branch penalty is fixed and cannot be reduced by software.
The second scheme for overcoming branch hazards is to predict every branch as not taken. It performs somewhat better than the freeze-or-flush scheme but is also more complex. The hardware simply continues to execute instructions in sequential order as if the branch were not there. Care must be taken that the state of the processor is not changed until the outcome of the branch is definitely known. The drawback is that if any instruction does change the processor state, undoing that change can be tedious; otherwise the branch hazard becomes more complex and pipeline performance suffers badly.
The third technique is to treat every branch as taken. As soon as the branch instruction is decoded and the target address is computed, the branch is assumed to be taken and the instruction at the target address is fetched and executed. In a simple pipeline this is not very useful, because the target address is not known any earlier than the branch outcome. Some processors, however, have implicitly set condition codes or slower but more powerful branch conditions, so the target address is known before the branch outcome; for such processors a predicted-taken scheme makes sense. In both the predicted-taken and predicted-not-taken techniques, the compiler can improve performance by organizing the code so that the most frequent path matches the hardware's choice.
Processor performance can be increased considerably by a fourth technique called the delayed branch, used by several early RISC processors. In a delayed branch, the execution sequence with a branch delay of one is:
branch instruction
sequential successor
branch target (if taken)
The sequential successor sits in the branch delay slot and is executed whether or not the branch is taken. Longer branch delays are possible, but in practice most processors with delayed branches use a single-instruction delay.
As pipelines become deeper and the performance penalty of branches grows, techniques such as delayed branches are no longer enough. We therefore turn to more powerful and more accurate techniques for predicting branches. They fall into two classes: 1) static techniques, which are cheap and rely on information available at compile time, and 2) dynamic branch prediction techniques, which are based on the runtime behaviour of the program.
Branch prediction at compile time can be made more accurate by using profile information collected from earlier runs. The principle behind static prediction is that branches are biased: some branches are taken almost all the time and others are almost never taken. Such studies typically use the same input for the profiling run and the measured run, but other studies show that changing the input does not greatly affect the accuracy of profile-based prediction. The success of any branch prediction technique depends on two things: a) the accuracy of the technique and b) the frequency of conditional branches. The main drawback of profile-based prediction is that the number of mispredictions for integer programs is high, and the branch frequency of such programs is also high.
With a simple prediction buffer of this kind, the prediction is not guaranteed to be correct. Even when the predicted target is right, it may have been placed in the buffer by a different branch that happens to have the same low-order address bits. This does not matter: the prediction is treated as a hint that is assumed to be correct, and fetching begins at the predicted address. If the prediction later turns out to be wrong, the prediction bit in the buffer is inverted and stored back. The buffer can be thought of as a small cache, every access to which is a hit.
This one-bit scheme has a drawback: even for a branch that is almost always taken, a single not-taken occurrence causes two mispredictions, one when the branch is mispredicted and a second after the prediction bit has been flipped and stored back, wasting two clock cycles.
To overcome this problem, a 2-bit prediction scheme is often used instead. A 2-bit scheme requires two consecutive mispredictions before the prediction is changed; it is shown in Figure 4.6.
Once an instruction is decoded as a branch and predicted taken, fetching begins from the target as soon as the target PC is known. If the branch is predicted not taken, fetching and execution simply continue sequentially. As shown above, the prediction bits are changed only when the prediction turns out to be wrong.
Mispredictions are rarer with 2-bit prediction than with 1-bit prediction. The two bits encode four states: 00, 01, 10 and 11.
In general, the 2-bit scheme can be extended to an n-bit scheme. An n-bit counter can take values from 0 to 2^n − 1. The branch is predicted taken when the counter value is at least half of the maximum value (2^n − 1); otherwise it is predicted untaken. Since 2-bit predictors perform almost as well as n-bit predictors, most prediction schemes use 2 bits.
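A 2-bit saturating counter of the kind just described can be sketched in a few lines; the branch history used at the end is an assumed example sequence, not data from the text.

    # Illustrative sketch of a 2-bit saturating-counter branch predictor.
    # Counter values 0-3; the branch is predicted taken when the counter is in
    # the upper half (2 or 3), so two consecutive mispredictions are needed to
    # flip the prediction.

    class TwoBitPredictor:
        def __init__(self):
            self.counter = 0                     # start in "strongly not taken"

        def predict(self):
            return self.counter >= 2             # True means "predict taken"

        def update(self, taken):
            if taken:
                self.counter = min(3, self.counter + 1)
            else:
                self.counter = max(0, self.counter - 1)

    p = TwoBitPredictor()
    outcomes = [True, True, False, True, True]   # assumed branch history
    for actual in outcomes:
        print(p.predict(), actual)
        p.update(actual)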
Q3: What are structural hazards and how do they affect the performance of processor?
Answer: A structural hazard occurs when the hardware cannot support every combination of instructions in overlapped execution, typically because a functional unit is not fully pipelined or a resource (for example a single ALU, or a single memory port shared by instructions and data) is not duplicated. When two instructions need the same resource in the same cycle, one of them must be stalled, which increases the cycle count per instruction and so degrades performance. In the example of section 4.2, the average instruction time with the hazard is (1 + 0.4 × 1) × (Clock cycle time_ideal / 1.05) = 1.3 × Clock cycle time_ideal, i.e. the hazard-free pipeline is about 1.3 times faster. Refer to section 4.2 for the full discussion.
Q5: What do you mean by static and dynamic branch strategies for dealing with branches in a pipelined processor?
Answer: Refer section 4.3.3 and section 4.3.4.
Q: What is an anti-dependence? Give an example.
Answer: An anti-dependence (write-after-read) occurs when an instruction executes before a previous instruction has read one of its operands and overwrites the value that the previous instruction still needs.
Example
i1. R4 <- R1 + R5
i2. R5 <- R1 + R2
If instruction i2 may be executed before i1 (e.g. in parallel execution), we must take care that the new value of register R5 is not stored before i1 has completed and read the old value.
Chapter 5 Memory and Memory Hierarchy
Contents
Introduction
5.1 Storage technologies
5.1.1 RAM or Random Access Memory
5.1.2 Static RAM
5.1.3 Dynamic RAM
5.2 Memory modules
5.2.1 Improved DRAM
5.2.2 Non Volatile Memory
5.3 Access the main memory
5.4 Virtual Memory
5.4.1 Virtual memory models
5.5 Disk storage
5.5.1 Disk Geometry
5.5.2 Disk Capacity
5.5.3 Disk Operation
Introduction
The computer system consists of three major parts: the input devices, the central processing unit (CPU) and the output devices. The CPU is composed of three components: the arithmetic logic unit (ALU), the control unit and the memory unit.
Figure 5.1 Block structure of a computer
The main objective of this chapter is to introduce the important concept of the memory unit of the computer, shown in the diagram as main memory and secondary memory. The memory unit is interfaced with the other units of the computer as described in Figure 5.2.
In our study of computer systems so far, we have used a simple model: a central processing unit (CPU) that executes instructions and a memory system that holds instructions and data for the CPU. In this basic model the memory is a linear array of bytes, and the CPU is assumed to access any memory location in a fixed amount of time. In reality, modern systems do not work this way. A memory system is actually a collection of storage devices organized into levels with different capacities, costs and access times. The most frequently used data are held in CPU registers. Small, fast cache memories close to the CPU act as the first staging area for the data and programs the system accesses; the next levels are main memory and, further down, optical disks and magnetic tapes. An efficiently written program tends to access the storage at any given level of the memory hierarchy more frequently than it accesses the level below it. The storage at each lower level of the hierarchy is slower, larger and cheaper per bit: the storage near the bottom of the hierarchy is larger in size and cheaper than the storage near the top. It is therefore important to understand the overall structure of a memory system in order to build efficient and fast computer applications.
If the data a program needs is stored in a CPU register, it can be accessed in 0 clock cycles during execution. If it is in a cache, access takes roughly 1 to 30 clock cycles; from main memory, roughly 50 to 100 clock cycles; and if the data is on disk, access takes millions of clock cycles. This leads to a fundamental consideration in programming: if we understand how the system moves data up and down the levels of the memory hierarchy, we can write application programs so that the required data are placed higher in the hierarchy, where the CPU can access them more quickly. This idea is connected to a fundamental property of computer programs known as locality. Programs with good locality access data from the upper levels of the memory hierarchy more often than programs with poor locality, and therefore run faster. In this chapter we discuss storage devices such as DRAM (Dynamic Random Access Memory), SRAM (Static Random Access Memory), ROM (Read-Only Memory) and solid-state disk technology. We will also see how to analyse C programs for locality, techniques for improving the locality of programs, and how program execution moves data up and down the memory hierarchy, so that we can write application programs with good locality whose data are accessed from the higher levels of the hierarchy.
5.1 Storage technologies
5.1.1 RAM or Random Access Memory
There are two types of random-access memory (RAM): Static Random Access Memory (SRAM) and Dynamic Random Access Memory (DRAM). SRAM is faster and more expensive than DRAM. SRAM is used for cache memory, whereas DRAM is used for main memory and for the frame buffer of a graphics system. A desktop system typically has no more than a few megabytes of SRAM, but hundreds or thousands of megabytes of DRAM.
5.1.2 Static RAM
Static Random Access Memory (SRAM) holds data statically: as long as power is applied, the data remains, and unlike DRAM it needs no refresh circuitry. Each bit is stored using four transistors arranged as two cross-coupled inverters; the cell is bistable, with the two stable states representing 0 and 1. Two further transistors manage access to the cell for read and write operations, so storing one memory bit requires six MOSFETs (metal-oxide-semiconductor field-effect transistors). Two types of SRAM chips are available: MOSFET-based and bipolar-junction-transistor-based. The bipolar junction transistor (BJT) is faster than the MOSFET but consumes much more power, so MOSFET-based SRAM is the most widely used.
5.1.3 Dynamic RAM
A DRAM stores each bit as charge on a capacitor, with a typical capacitance of about 30 fF (femtofarads). DRAM storage can be very dense because a cell consists only of a capacitor and a single access transistor. A DRAM cell is much more sensitive to disturbance and noise than an SRAM cell, because the small charge on the capacitor is easily upset by external disturbances; even exposing the capacitor to light changes its voltage. Once the capacitor voltage is disturbed, the stored value can never be recovered.
Figure 5.5 Dynamic RAM cell
The sensors in digital cameras and camcorders are essentially arrays of DRAM cells. A DRAM cell must be refreshed within a period of roughly ten to a hundred milliseconds, because it gradually loses its charge through various leakage currents. This retention time is very long compared with a computer's clock period of a few nanoseconds, and the memory system refreshes the cells by periodically reading and then rewriting every bit. A computer can also detect and correct erroneous bits within a word by adding redundant bits (for example, an 8-bit word may be encoded using 10 bits). SRAM, unlike DRAM, requires no refreshing and retains its data as long as power is applied; it is also faster to access. Because SRAM cells use more transistors than DRAM cells, SRAM has lower density, higher cost and higher power consumption. The cells in a DRAM chip are grouped into supercells: a chip with d supercells of w cells each stores a total of d × w bits.
Figure 5.6 128-bit 16x8 DRAM chip
In Figure 5.6, the shaded box denotes the supercell at address (2, 1). Pins act as external connectors through which information flows into and out of the chip; each pin carries a one-bit signal. Two sets of pins are shown: a set of 8 data pins that transfer one byte into or out of the chip, and a set of 2 address pins that carry the two-bit row and column supercell addresses. A further set of pins carrying control information exists but is not shown. Figure 5.6 gives a high-level view of a 128-bit (16 × 8) DRAM chip.
Note that the storage community has not settled on a standard name for the elements of the DRAM array. Computer architects call each element a "cell", overloading the term used for a DRAM storage cell; circuit designers call it a "word", overloading the term for a word of main memory. The term "supercell" is adopted here to avoid confusion. A memory controller is connected to every DRAM chip and manages the transfer of w bits at a time to and from each chip. To read the contents of supercell (i, j), the memory controller first sends the row address i and then the column address j to the DRAM chip, and the chip responds by sending the contents of supercell (i, j) back to the controller.
The request carrying the row address i is called a Row Access Strobe (RAS) request, and the request carrying the column address j is called a Column Access Strobe (CAS) request. For example, as shown in Figure 5.7, to read supercell (2, 1) from the 16 × 8 DRAM the memory controller first sends a RAS request with row address 2. In response, the DRAM chip copies the entire contents of row 2 into an internal row buffer. When the controller then sends a CAS request with column address 1, the chip copies the 8 bits of supercell (2, 1) out of the row buffer and sends them back to the memory controller.
Note that circuit designers organize DRAMs as two-dimensional arrays rather than linear arrays in order to reduce the number of address pins on the chip. If our example 128-bit DRAM were organized as a linear array of 16 supercells with addresses 0 to 15, the chip would need 4 address pins instead of 2. The drawback of the two-dimensional organization is that the address must be sent in two distinct steps, which increases the access time.
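The saving in address pins can be checked with a couple of lines of Python; this is a sketch for the 16-supercell example, and the square-array assumption and ceil/log2 pin count are the usual simplifications, not a device specification.

    # Illustrative sketch: address pins needed for a DRAM with d supercells,
    # organized either as a linear array or as a square 2-D array whose row
    # and column addresses are multiplexed over the same pins.
    import math

    def pins_linear(d):
        return math.ceil(math.log2(d))           # whole address sent at once

    def pins_2d(d):
        side = math.ceil(math.sqrt(d))           # rows = columns (square array assumed)
        return math.ceil(math.log2(side))        # row and column sent in two steps

    d = 16                                        # the 16 x 8 example chip
    print(pins_linear(d), pins_2d(d))             # 4 versus 2, as stated above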
5.2 Memory modules
Memory modules are discrete units of various sizes, depending on the number of DRAM chips packaged into them; they plug into dedicated slots on the system's main board. The most common package is the 168-pin dual inline memory module (DIMM), which transfers data to and from the memory controller in 64-bit blocks, whereas the older 72-pin SIMM (single inline memory module) transfers 32-bit blocks. The basic idea of a memory module is shown in Figure 5.8. The module shown stores exactly 64 MB using eight 8M × 8 DRAM chips, numbered 0 to 7. Each supercell stores one byte of main memory, and the eight supercells at a given supercell address (i, j), one per chip, together represent a 64-bit word at some address A in main memory. In the example of Figure 5.8, DRAM 0 stores the lowest-order byte, DRAM 1 the next byte, and so on. To fetch the 64-bit word at memory address A, the memory controller first selects the module k that contains A, converts A into the supercell address (i, j) and sends it to module k; the module then broadcasts i and j to every DRAM chip. In response, each DRAM outputs the 8-bit contents of its (i, j) supercell, and circuitry in the module collects these outputs into a 64-bit word, which it returns to the memory controller.
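The role of the module described above can be sketched as follows; the chip contents used here are made-up stand-ins for the eight 8M × 8 DRAM chips, not a model of real devices.

    # Illustrative sketch: assembling a 64-bit word from eight DRAM chips.
    # Each chip is modelled as a dict mapping a supercell address (i, j) to one
    # byte; this is a stand-in for the real 8M x 8 chips.

    def read_word(chips, i, j):
        # Chip k supplies byte k of the word (chip 0 holds the low-order byte).
        word = 0
        for k, chip in enumerate(chips):
            byte = chip[(i, j)]
            word |= byte << (8 * k)
        return word

    chips = [{(2, 1): 0x10 + k} for k in range(8)]   # assumed contents of supercell (2, 1)
    print(hex(read_word(chips, 2, 1)))               # 0x1716151413121110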
Figure 5.8 Reading the contents of a memory module
5.2.1 Improved DRAMs
Fast Page Mode DRAM (FPM DRAM): This has been one of the most common types of DRAM in personal computers. A conventional DRAM copies a whole row of supercells into its internal row buffer, uses one supercell and then discards the rest. FPM DRAM improves on this by allowing consecutive accesses to the same row to be served directly from the row buffer. For example, to read four supercells from row i of a conventional DRAM chip, the memory controller must send four RAS/CAS request pairs, even though the row address i is the same in every request. To read supercells from the same row of an FPM DRAM, the controller sends an initial RAS/CAS request followed by three further CAS requests: the initial RAS/CAS copies row i into the row buffer and returns the supercell addressed by the CAS, while the remaining three supercells are served directly from the row buffer and are therefore returned much faster than the first one.
Extended Data Out DRAM (EDO DRAM): An enhanced form of FPM DRAM that allows the individual CAS signals to be spaced more closely together in time.
Synchronous DRAM (SDRAM): Conventional, FPM and EDO DRAMs are asynchronous and therefore communicate with the memory controller using explicit control signals. SDRAM replaces many of these control signals with the rising edge of an externally applied clock signal; the net effect is that an SDRAM can output the contents of its supercells faster than its asynchronous counterparts.
Double Data Rate Synchronous DRAM (DDR SDRAM): DDR SDRAM doubles the speed of the DRAM by using both edges (rising and falling) of the clock signal as control signals. Different types of DDR SDRAM are distinguished by the size of their prefetch buffer.
Video RAM (VRAM): Graphics systems use VRAM for their frame buffers. VRAM is similar in technology to FPM DRAM, with two important differences: (1) VRAM output is produced by shifting the entire contents of its internal buffer out in sequence, and (2) VRAM allows the memory to be read and written concurrently. Before 1995, FPM DRAM was the memory technology used in most PCs; EDO DRAMs then replaced FPM DRAMs between about 1996 and 1999. Up to around 2010, DDR3 SDRAM was the preferred memory technology in both server and desktop systems; note that the Core i7 processor of that period supports only DDR3 SDRAM.
5.2.2 Non-Volatile Memory
In non-volatile memories, the stored data remains even when power to the memory is switched off. There are various types of non-volatile memory, popularly called read-only memories (ROM). ROMs are classified by the number of times they can be written and by the mechanism used to reprogram them.
A programmable ROM (PROM) can be programmed only once. Each memory cell of a PROM has a fuse that is blown by a high current when data is written into it; since blown fuses cannot be restored, a PROM is writable exactly once.
An EPROM (erasable programmable ROM) is a re-writable type of ROM. Its data is erased by shining ultraviolet (UV) light onto the storage cells through a transparent quartz window, and it is programmed using a special EPROM writer device. A typical EPROM can be erased and reprogrammed on the order of 1,000 times. The main disadvantage of an EPROM is its erasing procedure: the chip must be removed from the circuit every time it is to be erased.
An EEPROM (electrically erasable PROM) is very similar to an EPROM, except that an electric field, rather than light, is used to erase the data, and the erasing circuitry is provided on the chip itself, so the ROM chip does not have to be removed for erasing.
5.3 Access the main memory
A bus is the communication system used to move data between the processor and the DRAM main memory. Each transfer of data between the CPU and memory is carried out as a series of steps called a bus transaction: a read transaction transfers data from main memory to the CPU, and a write transaction transfers data from the CPU to main memory. A bus is a collection of parallel wires that carry address, data and control signals; depending on the design, the data and address signals may share the same set of wires or use different sets, and two or more devices can share the same bus. The control wires synchronize the transaction and indicate, for example, whether the transfer concerns main memory or another I/O device (such as a disk controller), whether it is a read or a write, and whether the information on the bus is data or an address. The figure below shows the configuration of a computer system: the main components are the CPU, a chipset containing the I/O bridge, and the DRAM memory modules that make up main memory, all connected by a bus system. A typical computer system has the following buses:
System bus: connects the CPU chip to the I/O bridge.
I/O (memory) bus: connects the I/O bridge to the main memory. The I/O bridge has two functions: it translates the electrical signals of the system bus into memory-bus signals, and it connects these buses to the I/O devices.
As an example, consider what happens when the processor performs a load operation such as movl A, %eax, which loads register %eax with the contents of address A. The bus interface initiates a read transaction on the bus, which involves three steps:
1) The CPU places the address A on the system bus; the I/O bridge passes the signal on to the memory bus.
2) Main memory senses the address on the memory bus, reads the data word from the DRAM, and writes the word onto the memory bus; the I/O bridge translates the memory-bus signal into a system-bus signal and passes it along the system bus.
3) The CPU senses the data on the system bus, reads it from the bus, and copies it into register %eax.
A write operation is initiated when the CPU executes a store instruction such as movl %eax, A, which writes the contents of register %eax to address A. There are again three basic steps:
1) The CPU places the address on the system bus; main memory reads the address from the memory bus and waits for the data to arrive.
2) The CPU copies the data word in %eax onto the system bus.
3) Main memory reads the data word from the memory bus and stores the bits in the DRAM.
Figure 5.9 Memory read transaction: (a) the CPU places address A on the memory bus; (b) main memory places the word x stored at address A on the bus.
Figure 5.10 Memory write transaction: (a) the CPU places address A on the address (memory) bus for the data write; (b) the CPU places the data word y on the bus.
5.4 Virtual Memory
Only active programs, or portions of them, are resident in physical memory at any one time. Active portions of programs are loaded into and out of physical memory from disk dynamically, under the coordination of the operating system. To its users, virtual memory provides an almost unbounded memory space to work with; without virtual memory it would have been impossible to develop the multiprogrammed and time-sharing computer systems in use today.
Address spaces: Each word in the physical memory is identified by a unique physical address, and all the memory words in main memory form a physical address space. Virtual addresses are the addresses used by the machine instructions of an executable program. These virtual addresses must be translated into physical addresses at run time. A system of translation tables and mapping functions is used in this process; the address translation and memory management policies are affected by the virtual memory model used and by the organization of the disk and of the main memory.
Virtual memory allows the main memory to be shared dynamically by many software processes. It also facilitates software portability and allows users to execute programs that require much more memory than the available physical memory. Only the active portions of running programs are brought into main memory, which permits the relocation of code and data, makes it possible to implement protection in the OS kernel, and allows high-level optimization of memory allocation and management.
Address mapping: Let V be the set of virtual addresses generated by a program running on a processor, and let M be the set of physical addresses allocated to run this program. A virtual memory system requires an automatic mechanism to implement the following mapping:
f_t : V → M ∪ {∅}
This mapping is a function of time, because physical memory is dynamically allocated and deallocated. Consider a virtual address v ∈ V. The mapping f_t is formally defined as follows:
f_t(v) = m, if m ∈ M has been allocated to hold the item identified by virtual address v (a memory hit);
f_t(v) = ∅, otherwise (a memory miss).
In other words, the mapping f_t(v) uniquely translates the virtual address v into the physical address m when there is a memory hit in M. When there is a memory miss, the returned value f_t(v) = ∅ signals that the referenced item (instruction or data) has not been brought into main memory at the time of reference.
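The mapping f_t can be sketched as a simple page-table lookup; the page size and the table contents below are assumptions made for the example only, not part of any particular system.

    # Illustrative sketch of the mapping f_t : V -> M U {miss}. A virtual
    # address is split into a virtual page number and an offset; the page table
    # (assumed contents) either supplies a physical frame number or signals a
    # page fault (miss).

    PAGE_SIZE = 4096                               # assumed 4 KB pages
    page_table = {0: 7, 1: 3}                      # virtual page -> physical frame (example)

    def f_t(v):
        vpn, offset = divmod(v, PAGE_SIZE)
        if vpn in page_table:                      # memory hit
            return page_table[vpn] * PAGE_SIZE + offset
        return None                                # memory miss (page fault)

    print(f_t(4200))                               # hit: frame 3, offset 104 -> 12392
    print(f_t(9000))                               # miss: None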
The efficiency of the address translation process affects the performance of the virtual memory. Virtual memory is more difficult to implement in a multiprocessor, where additional problems such as coherence, protection and consistency become more challenging. Two virtual memory models are discussed below.
5.4.1 Virtual memory models
Private virtual memory: The first model uses a private virtual memory space associated with each processor, as was done in the VAX/11 and in most UNIX systems (Fig. 5.11a). Each private space is divided into pages, and virtual pages from different virtual spaces are mapped into the same physical memory shared by all processors.
The advantages of private virtual memory include the use of a small processor address space (32 bits), protection on each page or on a per-process basis, and the use of private memory maps, which require no locking. Its shortcoming is the synonym problem, in which different virtual addresses in different virtual spaces point to the same physical page.
Shared virtual memory: This model combines all the virtual address spaces into a single, globally shared virtual space (Fig. 5.11b). Each processor is given a portion of the shared virtual memory for its own addresses; different processors may use disjoint portions, and some areas of the virtual space can also be shared by multiple processors. Examples of machines using shared virtual memory include the IBM 801, RT, RP3, System/38, the HP Spectrum, the Stanford Dash, MIT Alewife and Tera.
Figure 5.11 Two virtual memory models: (a) private virtual memory spaces in different processors; (b) a globally shared virtual memory space, with private regions (P1 space, P2 space) and a shared space of pages
The advantage of shared virtual memory is that all addresses are unique. However, each processor must be allowed to generate addresses larger than 32 bits, such as 46 bits for a 64-Tbyte (2^46-byte) address space. Synonyms are not allowed in a globally shared virtual memory.
The page table must allow shared accesses, so mutual exclusion (locking) is needed to enforce protected access. Segmentation is built on top of the paging system to confine each process to its own address space (segments). Global virtual memory may also make the address translation process longer.
5.5 Disk storage
Disk read operations are slow, taking time on the order of milliseconds.
5.5.2 Disk Capacity
The maximum number of bits that can be stored on a disk is known as its capacity. Disk capacity is determined by the following parameters:
• Recording density (bits/inch): the number of bits that can be squeezed into a one-inch segment of a track.
• Track density (tracks/inch): the number of tracks that can be squeezed into a one-inch segment of the radius extending from the centre of the platter.
• Areal density (bits/inch²): the product of the recording density and the track density. Disk manufacturers work very hard to increase areal density (and hence capacity), and this figure has historically doubled roughly every year; disks are designed around it.
Every track is divided into a number of sectors. Originally, every track had the same fixed number of sectors, so the sectors were spaced farther apart on the outer tracks. This approach was acceptable when areal densities were low, but as areal density increased, the gaps between sectors on the outer tracks became too large. Modern high-capacity disks therefore use a technique called multiple recording zones: the set of cylinders is partitioned into disjoint subsets called zones, and within a zone every track has the same number of sectors, fixed by the number of sectors that can be packed into the innermost track of the zone. Floppy disks are an exception; they still use the outdated approach with a constant number of sectors per track. The capacity of a disk is calculated by the formula:
Disk capacity = (bytes/sector) × (average sectors/track) × (tracks/surface) × (surfaces/platter) × (platters/disk)
For example, suppose a disk has five platters, 512 bytes per sector, 30,000 tracks per surface and an average of 400 sectors per track. Then:
Disk capacity = 512 bytes/sector × 400 sectors/track × 30,000 tracks/surface × 2 surfaces/platter × 5 platters/disk
= 61,440,000,000 bytes = 61.44 GB
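The same calculation can be written as a short Python sketch, using the parameters of the example above.

    # Illustrative sketch: the disk-capacity formula applied to the example above.

    def disk_capacity(bytes_per_sector, sectors_per_track, tracks_per_surface,
                      surfaces_per_platter, platters_per_disk):
        return (bytes_per_sector * sectors_per_track * tracks_per_surface *
                surfaces_per_platter * platters_per_disk)

    cap = disk_capacity(512, 400, 30_000, 2, 5)
    print(cap, cap / 1e9)        # 61440000000 bytes = 61.44 GB (1 GB = 10^9 bytes here)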
5.5.3 Disk Operation
A disk reads and writes bits on the magnetic surface using a read/write head attached to the end of an actuator arm. As shown in Figure 5.12, sliding the arm back and forth along its radial axis positions the head over any track; this mechanical motion is called a seek. Once the head is positioned over the target track, every bit on that track passes directly beneath it, so the head can either read or write each bit. As shown in Figure 5.12(b), a disk with multiple platters has a separate read/write head for each surface. The heads are stacked vertically one above another and move in unison, so at any point in time all the heads are positioned over the same cylinder. The disk surface rotates about the spindle at a fixed rate. Figure 5.12 shows (a) a single-platter view and (b) a multiple-platter view of the arm and heads. The read/write head flies on a thin cushion of air over the disk surface, at a height of about 0.0001 mm (0.1 micron) and a relative speed of about 49.7 mph (80 km/h); disks are therefore always sealed in airtight packages.
Disks read and write data in sector-sized blocks. The access time for a sector has 3 main
components: seek time, rotational latency, and transfer time.
• Seek time: To read the contents of a target sector, the arm first positions the head over the track that contains it; the time required to move the arm is called the seek time. The seek time Tseek depends on two factors:
1) the previous position of the head, and
2) the speed with which the arm moves across the surface.
The average seek time in modern drives, measured by taking the mean of several thousand seeks to random sectors, is typically on the order of 3 to 9 ms; a single seek can sometimes take as long as 20 ms.
• Rotational latency: Once the head is positioned over the track, the drive waits for the first bit of the target sector to rotate under the head. This delay depends on two factors: the position of the surface when the head arrives over the target track, and the rotational speed of the disk.
In the worst case, the head just misses the target sector and must wait for the disk to complete a full rotation. The maximum rotational latency (in seconds) is
T_max rotation = (1 / RPM) × (60 s / 1 min)
and the average rotational latency is about half of T_max rotation.
• Transfer time: The drive begins to read or write the contents of the sector as soon as the first bit of the sector passes under the head. The transfer time depends on the rotational speed and on the number of sectors per track.
The average time to read or write a disk sector can therefore be estimated as the sum of three terms:
A) the average seek time,
B) the average rotational latency, and
C) the average transfer time.
For example, consider a disk with a rotational rate of 7,200 RPM, an average seek time of 9 ms and an average of 400 sectors per track (parameter values chosen to match the results quoted below). The average rotational latency for this disk (in milliseconds) is
T_avg rotation = 1/2 × (60 s / 7,200 RPM) × 1,000 ms/s ≈ 4 ms
and the average transfer time is
T_avg transfer = (60 s / 7,200 RPM) × (1 / 400 sectors) × 1,000 ms/s ≈ 0.02 ms
Putting everything together, the total estimated access time is
T_access = 9 ms + 4 ms + 0.02 ms = 13.02 ms
The time to access the 512 bytes in a disk sector is dominated by the seek time and the rotational latency: accessing the first byte of the sector takes a long time, but the remaining bytes are essentially free.
Since the rotational latency and the seek time are roughly the same, the disk access time can be estimated quickly as about twice the seek time.
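The access-time estimate above can be reproduced with a short sketch; the parameter values are the assumed ones from the example, and the intermediate results are rounded exactly as the text rounds them.

    # Illustrative sketch: estimated disk access time as the sum of the average
    # seek time, average rotational latency and average transfer time. The
    # parameter values reproduce the 13.02 ms example above.

    def access_time_ms(rpm, avg_seek_ms, sectors_per_track):
        t_rotation = round(0.5 * (60 / rpm) * 1000)                          # ~4 ms, rounded as in the text
        t_transfer = round((60 / rpm) * (1 / sectors_per_track) * 1000, 2)   # ~0.02 ms
        return avg_seek_ms + t_rotation + t_transfer

    print(access_time_ms(7200, 9, 400))        # 13.02, matching the example above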
Summary
Memory elements provided within the processor operate at processor speed, but they are small in size, limited by cost and power consumption. Farther away from the processor, the memory elements commonly provided are (one or more levels of) cache memory, main memory and secondary storage. The memory at each level is slower than the one at the previous level, but also much larger and less expensive per bit. The aim of a memory hierarchy is to achieve, as far as possible, the speed of the fast memory at the cost of the slower memory. The properties of inclusion, coherence and locality make it possible to achieve this objective in a computer system.
Virtual memory systems aim to free program size from the size limitations of main memory. Working sets, paging, segmentation, TLBs and memory replacement policies are the essential elements of a virtual memory.
Exercise
Q3: What are the main differences in DRAM and SRAM technologies?
Ans: Refer section 5.1.
Q4: What are the various steps to be followed by a processor while executing an instruction?
Ans: Refer section 5.3
Q6: Write a short note on
1) Disk Storage 2) Disk operations 3) Disk Performance parameters
Ans: Refer section 5.5
Chapter 6
Structure
6.0 Objectives
6.1 CACHES
6.2 Cache organization
6.2.1 Look Aside
6.2.2 Look Through
6.3 Cache operation
6.4 Cache Memory Mapping
6.5 Cache Writing Techniques
6.5.1 Write-through
6.5.2 Write-back
6.5.3 Single Cycle Cache
6.6 FSM Cache
6.7 Pipelined Cache
6.0 Objectives
6.1 CACHES
A cache is a type of memory that is comparatively small but can be accessed very quickly; it stores information that is likely to be reused. Caches first appeared around 1968 in the IBM System/360 (Model 85), and low-cost, high-density RAM and microprocessor ICs followed in the 1980s. Caches address the von Neumann bottleneck directly by giving the CPU fast, single-cycle access to its external memory. A small portion of memory located on the same chip as the microprocessor is known as the CPU cache; it stores the most recently used information so that it can be fetched more quickly. This information duplicates information stored elsewhere, but it is more readily available. In this section we focus on caches used as an intermediary between a CPU and its main memory, but caches also appear as buffer memories in many other contexts.
6.1.1 Main Features
The cache and main memory form a two-level sub-hierarchy (ME1, ME2) that differs from the main-secondary memory system (ME2, ME3). The (ME1, ME2) pair is much faster than (ME2, ME3): the typical access-time ratio of (ME1, ME2) is around 5/1, while that of (ME2, ME3) is about 1000/1. Because of this speed difference, (ME1, ME2) is managed mainly by high-speed hardware circuits, whereas (ME2, ME3) is controlled by the operating system. Communication between ME1 and ME2 is in blocks of around 8 bytes, which is much smaller than the page size of roughly 4 KB used between ME2 and ME3. Finally, the CPU has direct access to both ME1 and ME2, but not to ME3.
6.2 Cache organization
Two types of system organization are used for caches: look-aside and look-through.
6.2.1 Look Aside
In a look-aside organization, the cache and the main memory are connected in parallel to the system bus, so both see each bus cycle at the same time; hence the name look-aside. During a processor read cycle, the cache checks whether the address is a hit or a miss. If the cache contains the memory location (a cache hit), the cache responds to the read cycle and completes the bus cycle. Otherwise (a cache miss), main memory responds to the processor and terminates the bus cycle, while the cache copies the data so that the next time the processor requests it the access will be a hit.
Look-aside caches have a simpler structure, which makes them less expensive. Their main disadvantage is that the processor cannot access the cache while another bus master is accessing main memory.
6.2.2 Look Through
Fig. 1.3: Look-through cache
As the diagram of this cache architecture shows, the cache unit sits between the processor and main memory: the cache sees the processor's bus cycle first, before allowing it to pass on to the system bus. This is the faster and more costly organization.
On a cache hit, the cache responds to the processor's request without starting an access to main memory. On a cache miss, the cache passes the bus cycle on to the system bus; main memory responds to the processor's request, and the cache copies the data so that the next request for it will be a hit. The main disadvantages of the look-through organization are its higher complexity and cost.
6.3 Cache operation
Figure 1.4 shows a cache system that illustrates the relationship between the data stored in cache memory and in main memory. Here the cache block size is assumed to be 4 bytes. The memory address is 12 bits long: the 10 high-order bits form the tag (block address) and the 2 low-order bits form the displacement of a word within the block. The figure shows the contents of two blocks of the cache tag memory and the corresponding cache data memory. The address Aj = 101111000110 is sent to M1, which compares Aj's tag part with the stored tags and finds a match, i.e. a hit. The 2-bit displacement is then used to select the target word, which is output to the CPU.
Fig. 1.4: Cache execution of a read operation
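The address split used in this read example can be sketched as follows; the contents of the tag store are assumed for the example, not taken from the figure.

    # Illustrative sketch: splitting the 12-bit address of the example into a
    # 10-bit tag (block address) and a 2-bit displacement, then checking the
    # stored tags for a hit. The tag store contents are assumed.

    def split_address(addr, disp_bits=2):
        return addr >> disp_bits, addr & ((1 << disp_bits) - 1)   # (tag, displacement)

    tag_store = {0b1011110001: ["w0", "w1", "w2", "w3"]}          # one resident block (assumed)

    addr = 0b101111000110                                          # Aj = 101111000110
    tag, disp = split_address(addr)
    if tag in tag_store:
        print("hit:", tag_store[tag][disp])                        # hit: w2
    else:
        print("miss")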
A cache write operation uses the same addressing technique. As shown in Fig. 1.5, the address is presented to the cache tag memory, along with the data word to be stored. When a cache hit occurs, the new data 88 is stored at the location pointed to by Aj in the cache data memory, overwriting the old data FF. A problem then arises: the data in the cache differs from the data in main memory at the same address. This temporary inconsistency is minimized by a policy that systematically updates the data in main memory in response to changes made to the corresponding data in the cache.
There are two basic writing schemes: write-through and write-back. In the write-through approach, the write is done synchronously both to the cache and to the backing store, usually with no write allocation. (With write allocation, the data at a missed write location is first loaded into the cache and the write is then completed as a write hit, so write misses are handled like read misses.) In the write-back approach, the write is initially performed only to the cache; the write to the backing store is delayed until the cache block containing the data is about to be replaced or modified by new data.
Fig. 1.5: Cache execution of a write operation
6.4 Cache Memory Mapping
Fig. 1.6: Cache with direct mapping
The advantage of direct mapping is its simplicity: the replacement decision is trivial, since each block can go in only one place. Its disadvantage is that it is inflexible, and contention for a cache line can occur even when the cache is not full.
Associative Mapping Technique
In a fully associative cache, data can be stored in any cache block rather than each memory address being forced into one particular block; data can be placed in any unused block of the cache. The cache data is related to the main memory address by storing the memory address and the data together in the cache, which is known as fully associative mapping. The cache is built from associative memory that holds both the memory address and the data for each cache line. As shown in Fig. 1.7, the internal logic of the associative memory compares the incoming memory address simultaneously with all stored addresses, and the data is read out if a match is found. If the associative part of the cache can hold a full address, then single words from anywhere in main memory can be held in the cache.
Fig. 1.7: Fully associative mapping
The advantage of fully associative mapping is its flexibility: an incoming block can use any empty block of the cache. It is, however, expensive, since all tags must be checked to detect a hit; parallel comparison hardware is used to speed up this search.
Fig. 1.8: Several ways of organizing an 8-block cache
99
A fully associative cache can be described as N-way set associative, where N is the number of blocks in the cache, while a direct-mapped cache is 1-way set associative, i.e. one block per set. For better performance it is usually more effective to increase the number of entries than the associativity, and 2- to 16-way set-associative caches perform well. Set-associative mapping allows a limited number of blocks with the same index but different tags to be present in the cache. As shown in figure 1.8, a four-way set-associative cache is divided into "sets" of blocks, with four blocks in each set. The number of blocks in a set is known as the set size, or associativity. Each block in every set consists of a tag and data, selected by the index. First, the index field of the address issued by the processor is used to access the set. Then comparators compare the incoming tag with the tags of the blocks in the selected set. If a match is found, the corresponding block is accessed; otherwise an access to main memory is made.
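The lookup sequence just described can be sketched in C roughly as follows. The line size, number of sets and way count are illustrative assumptions, not parameters of any particular processor:

    #include <stdbool.h>
    #include <stddef.h>
    #include <stdint.h>

    #define WAYS   4    /* associativity (blocks per set)      */
    #define SETS   64   /* number of sets (power of 2)         */
    #define BLOCK  32   /* block (line) size in bytes          */

    struct line { bool valid; uint32_t tag; uint8_t data[BLOCK]; };
    static struct line cache[SETS][WAYS];

    /* Returns a pointer to the matching line, or NULL on a miss
     * (in which case main memory would be accessed).            */
    struct line *lookup(uint32_t addr) {
        uint32_t index = (addr / BLOCK) % SETS;   /* selects the set         */
        uint32_t tag   = addr / (BLOCK * SETS);   /* compared in all 4 ways  */
        for (int way = 0; way < WAYS; way++) {
            struct line *l = &cache[index][way];
            if (l->valid && l->tag == tag)
                return l;                         /* hit                     */
        }
        return NULL;                              /* miss                    */
    }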
Within the full address, the tag bits are always the most significant bits, the set (block) address bits are the next most significant, and the byte-address bits occupy the least significant positions, so consecutive main memory blocks map to consecutive sets in the cache. This addressing scheme, used by virtually all known systems, is called bit selection. For each associative search, the comparison between the incoming tag and the stored tags is done by comparators, while all the information, tags and data, can be kept in random access memory. In a set-associative cache the number of comparators required equals the number of blocks in a set. All blocks of the selected set are read out simultaneously, together with their tags, before the tag comparisons are made by the comparators; the particular block is selected once the matching tag has been identified.
Set-associative mapping gives better performance than the other two mappings but is more expensive. The number of comparators equals the number of ways, so it is less complex than a fully associative cache. Intel Pentium processors, for example, have used 8-way set-associative level-1 data caches and 4-, 8-, 16- or 24-way set-associative level-2 caches.
There are two basic writing techniques in popular use: write-through and write-back.
6.5.1 Write-through
In the write-through technique, while data is written to the cache it is also written to main memory, which requires the processor to wait until main memory completes its write operation. The technique is easy to implement, although it sends many unnecessary writes to main memory. Consider a program that writes a data block in the cache, then reads it, and then writes it again, so the block stays in the cache during all three operations. It is not necessary to update main memory after the first write, because the second write overwrites the data written by the first. Basically, in this technique writing is done synchronously both to main memory and to the cache.
Write-through directs a write to the cache and through to main memory before confirming completion to the host, which guarantees that the written data is safely stored. The write-through technique is mainly used for applications in which data is written and re-read frequently.
6.5.2 Write-back
In the write-back technique, writing is performed only on the cache. The main advantage of this technique is that it reduces the number of write operations to main memory. A data block is written to memory only when it is about to be replaced by a new block. An extra bit associated with each data block, known as the "dirty bit", supports this: when we write to a block in the cache we set its dirty bit, and we check this bit when the block is about to be replaced. This tells us whether or not the block must be copied back to main memory.
The write-back technique is more difficult to implement, because it requires continuous tracking of which locations have been written over; their dirty bits are set for later writing to main memory. The data in these locations is written back only when main memory does not already hold the same up-to-date block. Two approaches are used to handle write misses, where no data is returned by the write operation:
Write allocate: the block containing the missed write location is loaded into the cache, and the write is then performed as a write hit.
No-write allocate: the data is written directly to main memory and the cache is left unaffected.
Fig 1.13: 16 bit memory single cycle operation
The function of the memory is determined by the "wr" and "enable" inputs in every cycle. Since the cache is an integral part of the CPU, the execution of each instruction is performed in one cycle, so the CPI (cycles per instruction) should be 1. Each cycle takes the same amount of time, and the processor spends the same amount of time executing each instruction regardless of its complexity. The most complex instruction must complete execution in one cycle in order to ensure that the processor works correctly. The disadvantage of this kind of CPU is that it must run at the speed of its slowest instruction; the advantage is that it is quite easy to implement.
Fig 1.14: FSM cache
As shown in figure 1.14 there are four states. Initially the cache stays in the idle state and changes state only when it gets a CPU request. The cache then sends a "cache ready" signal back to the processor. On a cache hit, further operations are performed on the cache memory. On a cache miss, there are two possibilities: "block clean" and "block dirty". If the dirty bit of the block is set, the old block is first written back to memory; if the block is clean (no dirty bit set), a block is allocated for the requested data. After allocation, a "memory ready" signal is sent back to the CPU.
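The four-state controller can be sketched as a C transition function. The state names follow the figure; the input flags are assumptions standing in for the actual control signals:

    #include <stdbool.h>

    enum state { IDLE, COMPARE_TAG, WRITE_BACK, ALLOCATE };

    /* One step of the cache-controller FSM of Fig. 1.14.
     * The signal names (cpu_request, hit, dirty, mem_ready) are assumed here;
     * they stand in for the "cache ready" / "memory ready" handshakes in the text. */
    enum state step(enum state s, bool cpu_request, bool hit, bool dirty, bool mem_ready) {
        switch (s) {
        case IDLE:                                      /* wait for a CPU request         */
            return cpu_request ? COMPARE_TAG : IDLE;
        case COMPARE_TAG:
            if (hit)   return IDLE;                     /* hit: operation done            */
            if (dirty) return WRITE_BACK;               /* miss, dirty: write old block   */
            return ALLOCATE;                            /* miss, clean: allocate block    */
        case WRITE_BACK:
            return mem_ready ? ALLOCATE : WRITE_BACK;   /* wait for memory, then allocate */
        case ALLOCATE:
            return mem_ready ? COMPARE_TAG : ALLOCATE;  /* block fetched: retry as a hit  */
        }
        return IDLE;
    }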
In a pipelined cache, an access to the cache can proceed at the same time as an access to main memory (RAM). The transfer of instructions to or from the cache memory is divided into two different stages, and each stage is kept busy by one of these operations at all times. The same concept is used in assembly-line processing. Pipelining overcomes the drawback of traditional memory operation, which wastes a large amount of time and ultimately reduces processor speed.
Burst Mode
In the burst mode of the cache, data stored in memory is fetched before the request to access it has been fully processed. Consider a typical cache in which each line has a size of 32 bytes, so a complete line read or write transfers 32 bytes. If the data path used by the cache is 8 bytes wide, four separate transfers are required for each cache line. In burst mode there is no need to specify a different address for each transfer after the first one, unlike other modes, and this gives a large improvement in the speed of operation.
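A short illustration of this arithmetic, using the line and bus widths assumed in the text:

    #include <stdio.h>

    int main(void) {
        const unsigned line_bytes = 32;  /* cache line size         */
        const unsigned bus_bytes  = 8;   /* width of the data path  */
        unsigned beats = line_bytes / bus_bytes;

        /* In burst mode only the first beat carries an address;    */
        /* the remaining beats use consecutive addresses implicitly. */
        printf("transfers per line fill : %u\n", beats);
        printf("addresses sent          : 1 (burst) vs %u (non-burst)\n", beats);
        return 0;
    }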
Q.2 Explain cache read and write operation.
Q.6 What is the difference between look-through and look-aside? Which one gives better performance?
Q.7 Explain in brief direct, associative and set associative memory mapping.
Q.8 What do you understand by cache writing techniques? Explain write back and write
through techniques in detail.
Q.9 Differentiate between single cycle cache and FSM cache.
Chapter 7
Pipelined Processors
Structure
7.1 Objective
7.2 Linear Pipeline Processor
7.3 Non Linear Pipeline Processor
7.4 Instruction Pipeline Processor
7.5 Arithmetic Pipeline Processor
7.6 Super Pipeline Processor
7.1 Objective
The objective of this chapter is to enhance the performance of sequential processors through pipelining. Pipelining allows the processing of instructions by dividing a single task into multiple subtasks. There are four main phases in which an instruction is executed: Fetch, Decode, Execute, and Deliver. Section 7.2 covers the Linear Pipeline Processor, which consists of multiple processing stages connected sequentially to perform a desired function; linear pipeline processors are further categorized into the asynchronous pipeline model and the synchronous pipeline model. Section 7.3 covers the Non-Linear Pipeline model, which is used when the functions to be performed are variable in nature. Section 7.4 covers instruction-pipelined processors, in which there can be more than one "Execute" operation; prefetch buffers are needed for efficient execution of instructions in pipelined form. Section 7.5 covers arithmetic pipeline processors, where arithmetic pipelining techniques are used to speed up arithmetic operations. Section 7.6 introduces super pipeline processors, which exploit ILP (Instruction Level Parallelism).
In a linear pipeline processor, the stages operate like an assembly line: the very first stage accepts the input and the last stage produces the output. The basic pipeline structure works in a synchronized fashion: a new input is accepted at the start of each clock cycle and, once the pipeline is full, a result is delivered in every subsequent clock cycle. Figure 7.1 shows the basic pipeline structure discussed above.
Now take an example to illustrate how pipelining improves performance. The processing of a single instruction is divided into multiple phases: Fetch, Decode, Execute, and Deliver. In the first clock cycle instruction 1 is fetched; in the second clock cycle instruction 1 is decoded and instruction 2 is fetched; in the third clock cycle instruction 1 reaches the third stage, execution, while instruction 2 is decoded and instruction 3 is fetched. Figure 7.2 shows the processing of these instructions in a pipelined manner.
Figure 7.2 Processing of pipelined instructions
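As a back-of-the-envelope model of this behaviour (an illustrative sketch, not tied to any particular machine): with k stages and n instructions, a simple linear pipeline needs k + n - 1 cycles instead of n × k, giving a speedup of n × k / (k + n - 1). The fragment below prints the cycle in which each instruction completes for the four-stage example above:

    #include <stdio.h>

    int main(void) {
        const int k = 4;                  /* stages: Fetch, Decode, Execute, Deliver */
        const int n = 3;                  /* instructions                            */

        for (int i = 1; i <= n; i++)      /* instruction i finishes in cycle i+k-1   */
            printf("instruction %d completes at end of cycle %d\n", i, i + k - 1);

        int pipelined = k + n - 1;        /* total cycles with pipelining            */
        int serial    = n * k;            /* total cycles without pipelining         */
        printf("cycles: %d pipelined vs %d serial, speedup = %.2f\n",
               pipelined, serial, (double)serial / pipelined);
        return 0;
    }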
Various Pipelining designs for processor development are available such as:
Linear Pipeline Processor
Non Linear Pipeline Processor
Instruction Pipeline Processor
Arithmetic Pipeline Processor
Super Pipeline Processor
A handshaking protocol is used in Asynchronous model to control data flow along the
pipeline. When first stage is ready with output it sends a ready signal to next stage. In
response next stage sends acknowledge signal to the first stage. Delay may vary in different
stages in case of Asynchronous Pipelined Model. Figure 7.3 depicts working of
Asynchronous Pipelined Model.
Clocked latches built with master-slave flip-flops are used in the synchronous model to control data flow along the pipeline. When a clock pulse arrives, all latches transfer data to the next stage at the same time. The delay is approximately equal in all stages. Successive tasks enter the pipeline at a rate of one per cycle, and once the pipeline is full, one result is extracted per cycle. For efficient throughput, successive tasks must be independent of each other. Figure 7.4 shows the working of the synchronous pipelined model.
Linear pipelines are also called static pipelines because they perform fixed functions. When the functions are variable and must be performed at different times, a dynamic (non-linear) pipeline processor is used. A non-linear pipeline allows feed-forward and feedback connections in addition to the straight data flow. Consider a non-linear pipeline with three stages: besides the straight data flows from stage 1 to 2 and from stage 2 to 3, there is a feed-forward connection from stage 1 to 3 and two feedback connections, from stage 3 to 2 and from stage 3 to 1. The output therefore need not be taken from the last stage. Figure 7.5 shows a three-stage non-linear pipeline processor architecture.
Figure 7.5 Non linear pipeline processor model with three stages
Prefetch buffers are needed for efficient execution of instructions in pipelined form. Sequential buffers are used for sequential instructions, while target buffers are more effective for branch instructions. Both buffers work in FIFO (first-in-first-out) fashion. A third type of instruction, the conditional branch, needs both sequential and target buffers for smooth pipeline flow. Basically, the role of these buffers is to reduce the mismatch between the speed of instruction fetching and the speed of pipeline consumption. Figure 7.6 illustrates the use of sequential and target buffers to execute a conditional branch instruction.
Buffers are always used in pairs. When a conditional branch instruction is fetched from memory, the branch condition is checked first. After checking, the appropriate instructions are taken from one of the two buffers, and the instructions in the other buffer are discarded. In each pair, one buffer is used to load instructions from memory and the other is used to feed instructions into the pipeline.
A third type of prefetch buffer is the loop buffer, which stores sequential instructions enclosed in a loop. The loop buffer works in two steps. First, it holds instructions fetched sequentially ahead of the current instruction, saving instruction fetch time. Second, it recognizes when the branch target falls within the loop, which avoids unnecessary memory accesses since the target may lie in the loop itself.
With internal data forwarding, a load operation from memory to a register can be replaced by a move operation from one register to another. This technique reduces memory traffic and also reduces execution time.
Figure 7.7(a) Traditional store-load forwarding (b) same process with internal data
forwarding
Table 7.1 (a) Interrelated instructions prior to Static scheduling. (b) Statically scheduled
instructions
Dynamic scheduling is another technique for handling stage delays. It requires dedicated hardware to detect and resolve interlocks between instructions. Dynamic scheduling is generally preferred for traditional RISC and CISC pipeline processors. For scalar processors (instruction issue rate of 1), an optimizing ILP compiler can search for independent instructions and place them immediately after the load instructions. For superscalar processors (instruction issue rates of 2 and higher), it may not be possible to find enough independent instructions, so a hardware mechanism is required to prevent a dependent instruction from executing until the data it depends on is available. For example, the IBM 360/91 processor implemented Tomasulo's algorithm for dynamic instruction scheduling. This algorithm resolves conflicts and clears data dependencies using register tagging, a process for allocating and de-allocating the source and destination registers.
7.4.3.3 Scoreboarding
Early processors with dynamic instruction scheduling hardware allowed multiple parallel units to execute irrespective of the original sequence of instructions. The processor had an instruction buffer for each execution unit, and instructions were issued to functional units without checking for the availability of their register input data, so an instruction might have to wait for its data in a buffer. To overcome this problem, and to route data correctly between execution units and registers, a control unit known as the scoreboard was introduced. This unit keeps track of the data registers needed by instructions waiting in the buffers; only when all registers hold valid data does the scoreboard enable instruction execution. Similarly, when a functional unit finishes execution it signals the scoreboard to release the resources.
Here, a pipeline unit for floating-point addition is elaborated. The inputs to the floating-point adder are:
X = A × 2^a
Y = B × 2^b
Floating point additions and subtractions can be performed in four stages.
Compare the Exponents
Align the mantissa parts
Add or Subtract the Mantissa
Normalize the result
This example with figures explains the four stages clearly. Decimal numbers are used for simplicity, so the radix becomes 10 instead of the 2 stated above for binary. Consider two floating-point numbers:
A = 0.8403 × 10^4
B = 0.7100 × 10^3
According to the first stage, compare the exponents: 4 - 3 = 1. The larger exponent, 4, is chosen as the exponent of the result. The second stage aligns the mantissas by shifting the mantissa of B (the number with the smaller exponent) one position, the difference, to the right. The intermediate result is now
A = 0.8403 × 10^4
B = 0.0710 × 10^4
The exponents of the two numbers are now the same. The third stage adds the two mantissas to form the sum
C = 0.9113 × 10^4
The fourth stage normalizes the result, i.e. it should be a fraction with a nonzero first digit. As the result is already normalized, the above value of C is the final result. If the result were something like X.0XXX × 10^4, it would be normalized by shifting the mantissa once to the right and incrementing the exponent by one, giving 0.X0XXX × 10^5.
Figure 7.9 Arithmetic Pipeline Processor for Floating-point addition and subtraction
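The four stages can be traced in C on the decimal example above. This is a behavioural sketch (four-digit decimal mantissas held as integers), not the hardware datapath of Figure 7.9:

    #include <stdio.h>

    /* Decimal floating-point value: mantissa/10000 * 10^exp (4-digit mantissa). */
    struct dfp { int mant; int exp; };

    struct dfp fp_add(struct dfp x, struct dfp y) {
        /* Stage 1: compare exponents, keep the larger one.       */
        if (x.exp < y.exp) { struct dfp t = x; x = y; y = t; }
        int diff = x.exp - y.exp;

        /* Stage 2: align the smaller mantissa by shifting right. */
        for (int i = 0; i < diff; i++) y.mant /= 10;

        /* Stage 3: add the mantissas.                            */
        struct dfp z = { x.mant + y.mant, x.exp };

        /* Stage 4: normalise so the mantissa stays below 1.0.    */
        if (z.mant >= 10000) { z.mant /= 10; z.exp += 1; }
        return z;
    }

    int main(void) {
        struct dfp a = { 8403, 4 };   /* 0.8403 x 10^4 */
        struct dfp b = { 7100, 3 };   /* 0.7100 x 10^3 */
        struct dfp c = fp_add(a, b);
        printf("C = 0.%04d x 10^%d\n", c.mant, c.exp);  /* expect 0.9113 x 10^4 */
        return 0;
    }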
Consider as an example the multiplication of two 8-bit integers, C = A × B, where C is the 16-bit product. This multiplication can also be viewed as the addition of eight partial products:
C = A × B = C0 + C1 + C2 + C3 + C4 + C5 + C6 + C7
where Ci is A multiplied by bit i of B and shifted left by i positions.
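A sketch of this decomposition: each partial product Ci is A multiplied by bit i of B and shifted left by i places, and their sum gives the 16-bit result. The operand values are arbitrary:

    #include <stdint.h>
    #include <stdio.h>

    int main(void) {
        uint8_t a = 0xB7, b = 0x5C;        /* arbitrary 8-bit operands */
        uint16_t c = 0;

        for (int i = 0; i < 8; i++) {
            /* partial product Ci: A shifted by i if bit i of B is set */
            uint16_t ci = ((b >> i) & 1) ? (uint16_t)(a << i) : 0;
            c += ci;                       /* C = C0 + C1 + ... + C7   */
        }
        printf("%u * %u = %u (check: %u)\n",
               (unsigned)a, (unsigned)b, (unsigned)c, (unsigned)a * b);
        return 0;
    }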
For a superscalar machine of degree m, m instructions are issued per cycle and ILP
(Instruction Level Parallelism) should be m in order to fully utilize the pipeline. ILP is the
maximum number of instructions that can be simultaneously executed in the pipeline.
Accordingly, the instruction decoding and execution resources are enhanced to operate m
pipelines in parallel. At some stages functional units may be shared by multiple pipelines.
Figure 7.10 represents a dual-pipeline superscalar processor, which can issue two instructions per cycle. Dual pipeline refers to the fact that there are essentially two pipelines in the
design. Both pipelines have four processing stages. These two instruction streams are fetched
from a single source known as I-cache. Two store units are used dynamically by the two
pipelines depending on availability. A lookahead window is also present in the design for
instruction lookahead in case out-of-order issue is required to improve the throughput.
Figure 7.10 A dual-pipeline superscalar processor requiring out-of-order issue
When superscalar instructions are executed in parallel, they usually finish out of order. This does not depend on whether the instructions are issued in order or out of order; the reason is the difference in execution times, since shorter instructions may finish earlier than longer ones.
To handle this, we distinguish between the terms "to finish", "to complete", and "to retire". "To finish" indicates that the operation required by the instruction is accomplished, except for writing the result back to the specified register or memory location. "To complete" refers to the point at which the last action of instruction execution, writing back the result, is performed. "To retire" is connected with the ROB (reorder buffer): here two tasks are performed, writing back the result and deleting the completed instruction from its ROB entry.
Summary
A new input is accepted at the start of each clock cycle and, once the pipeline is full, a result is delivered in every subsequent clock cycle. The processing of a single instruction is divided into multiple phases: Fetch, Decode, Execute, and Deliver. In the first clock cycle instruction 1 is fetched; in the second clock cycle instruction 1 is decoded and instruction 2 is fetched; in the third clock cycle instruction 1 reaches the third stage, execution, while instruction 2 is decoded and instruction 3 is fetched.
When the functions are variable and must be performed at different times, a dynamic (non-linear) pipeline processor is used. A non-linear pipeline allows feed-forward and feedback connections along with the straight data flow. In a three-stage non-linear pipeline, besides the straight data flows from stage 1 to 2 and from stage 2 to 3, there can be a feed-forward connection from stage 1 to 3 and two feedback connections, from stage 3 to 2 and from stage 3 to 1, so the output need not be taken from the last stage.
For a superscalar machine of degree m, m instructions are issued per cycle and ILP
(Instruction Level Parallelism) should be m in order to fully utilize the pipeline. ILP is the
maximum number of instructions that can be executed simultaneously in the pipeline. When superscalar instructions are executed in parallel, they usually finish out of order.
Exercise
Problem 7.1 - For a seven-segment pipeline, draw a space-time diagram to represent time it
takes to process eight tasks.
Problem 7.2 – Find out the number of clock cycles required to process 200 tasks in a six-
segment pipeline.
Problem 7.3 – An Arithmetic operation (Ai + Bi) (Ci + Di) is to be performed with a stream of
numbers. Show the pipeline structure to execute this task. Elaborate the contents of all
registers in pipeline for i = 1 to 6.
Problem 7.4 – Modify the flowchart represented in Figure 7.9 to add 100 floating-point numbers X1 + X2 + X3 + ... + X100.
Problem 7.5 – A non-pipeline system takes 50 ns to execute a task. Same task can be
executed in six-segment pipeline with a clock cycle of 10 ns. Determine the speed up ratio of
pipeline for 100 tasks.
Problem 7.6 – Formulate a seven-segment instruction pipeline for a computer. Specify the
operations to be performed in each segment.
Problem 7.7 - Define out-of-order issue in a super-pipelined computer. How can it be resolved?
Problem 7.8 – Draw a pipeline unit for floating-point addition, A = 0.9273 × 10^4 and B = 0.6542 × 10^3. Result of addition, C = A + B.
Problem 7.9 - Draw a pipeline unit for floating-point subtraction, A = 0.9273 × 10^4 and B = 0.6542 × 10^3. Result of subtraction, C = A - B.
Problem 7.10 – Multiply two 16-bit binary numbers, C = A × B. How many bits will the result have? Show the pattern of intermediate (partial) products available for the extraction of the final result.
Problem 7.11 – What is the basic difference between asynchronous and synchronous linear pipeline structures?
Problem 7.12 - What is the basic difference between linear and non-linear pipeline structures? Which one is better in which situation?
Problem 7.13 – In the instruction queue of the dispatch unit of the PowerPC 601, instructions may be dispatched out of order to the branch-processing and floating-point units, but instructions meant for the integer unit can be dispatched only from the bottom of the queue. Why does this limitation occur?
Problem 7.14 – When out-of-order completion occurs in a super-pipelined processor, resuming execution after an interrupt is complicated: an exceptional condition may have produced its result out of order, and the program cannot simply be restarted at the exceptional instruction, because later instructions have already completed and restarting would force them to execute twice. What steps are necessary to handle this situation?
Problem 7.15 – Draw a binary integer multiply pipeline with maximum number of five
stages. The first stage is used only for partial product. The last stage consists of a 36-bit carry
look-ahead adder. All the middle stages consist of 16 carry-bit adders.
(a) Construct a 5-stage multiply pipeline.
(b) What would be the maximum throughput of this multiply pipeline in terms of 36-bit results produced per second?
Chapter 8 Multi-core Processors and Multithreading
Contents
8.1 Overview
8.2 Architectural Concepts
8.2.1 Multiple Cores
8.2.2 Interconnection Networks
8.2.3 Memory Controllers
8.2.4 Memory Consistency
8.2.5 Multi-threading hardware
8.2.6 Multi-processor Interconnect
8.3 Multitasking vs. Multithreading: Principles of Multithreading
8.3.1 Multitasking
8.3.2 Multithreading
8.4 Intel Xeon 5100
8.4.1 Thermal and power management capabilities
8.4.2 Electrical specification
8.5 Multiprocessor
8.5.1 Multiprocessor Hardware
8.5.1.1 UMA Bus-Based SMP Architectures
8.5.1.2 UMA Multiprocessors Using Crossbar Switches
8.5.1.3 NUMA Multiprocessors
The difference between multi-core processors and hardware threading within a processor is small but important. The most important features of processor architecture, such as the memory architecture and core organization, are discussed in this chapter. In a multiprocessor we have multiple processors working to complete a single task, while in a multi-core processor several cores on one chip are joined by on-chip interconnects.
8.1 Overview
The IBM POWER4, released in 2001, was the first general-purpose processor to implement two cores on a single CMOS chip. Since then, multi-core processors have become the standard, and currently the only practical, way to improve the performance of high-end processors. This is achieved by adding support for multiple cores, or for multiple threads that mask long-latency operations. The clock-speed gains of the past can no longer be relied on, for several reasons: chief among them is the unsustainable level of energy consumption that higher clock frequencies would involve. Wire delays are also an important factor, as they, rather than transistor switching time, increasingly dominate the clock cycle. The design space of multi-core processors is far more diverse than that of single-threaded processors. This chapter discusses the architectural principles of multi-core architectures, presents some recent examples, and reports on critical issues related to scalability.
8.2 Architectural Concepts
The concept of multiple cores is simple, but several trade-offs arise when it is scaled up. The first question is whether the processor should be homogeneous or heterogeneous. Most multi-core processors today are homogeneous: all cores can execute the same binaries, and from a functional point of view it does not matter which core a program runs on. To manage power or to increase single-threaded performance, modern multi-core architectures allow system software to control the clock frequency of each core individually. These homogeneous architectures also provide a global address space with full cache coherence, so one core cannot be distinguished from another even if a process migrates between cores during execution. In contrast, a heterogeneous architecture contains at least two types of core, which differ in functionality, performance and possibly instruction set architecture (ISA). The most famous example of a heterogeneous multi-core architecture is the Cell architecture, developed jointly by Sony, Toshiba and IBM and used in areas such as game consoles and high-performance computing.
A globally shared, cache-coherent memory architecture is best suited to general parallel programs in which any core may execute any part of the program, whereas a heterogeneous architecture, in which the cores do not use the same instruction set, suits applications that naturally partition into coarse-grained tasks with regular communication between partitions, where each partition can be assigned manually to a specialized core for its specific task. The internal organization of the cores can also differ in many ways. Every modern core is pipelined, with instruction fetch and decode overlapped with execution to improve overall throughput, even though the latency of an individual instruction stays the same or increases. High-performance designs additionally include speculative execution and dynamic instruction scheduling in hardware. These methods raise the average number of instructions per clock cycle (IPC), but the limited instruction-level parallelism (ILP) available in existing applications, together with the silicon area and power such techniques consume, makes them less attractive in modern multi-core architectures. Indeed, in some many-core designs the architects have gone back to simple in-order cores (in the case of Intel's Knights Corner complemented by powerful vector instructions), reducing the silicon area and energy consumption of each core. To compensate for the limited ILP of a single thread, most modern cores add simultaneous multithreading (SMT), known commercially under Intel's Hyper-Threading brand. This hardware technique makes better use of pipeline resources by issuing instructions from more than one thread at multiple points in the pipeline. The advantage is that applications with limited single-thread ILP can exploit thread-level parallelism instead, and simultaneous multithreading is comparatively inexpensive in terms of both silicon area and energy consumption per core.
Most popular general-purpose multi-core processors give each core its own level-1 and level-2 caches and use a crossbar interconnect between the core modules, the shared last-level cache and the memory interfaces to RAM. Other technologies, such as multi-ring buses and on-chip switched networks, are emerging and gaining ground because they offer higher bandwidth, lower power consumption, or both. As the number of cores increases, the on-chip communication network must become more scalable or it will limit performance.
8.2.3 Memory Controllers
The memory interface is an important part of any high-performance processor, and even more so for a multi-core processor, because it is an on-chip resource shared by all cores. In truly concurrent programs, which share data and parallelize the work, threads running on different cores generally execute the same instructions on data located close to the data of the other threads, so there is a good chance that they access the same DRAM pages. On the other hand, when a multi-core processor is used to run many independent tasks, the cores share the memory interface in much the same way that processor time is shared in current operating systems.
A fundamental issue that must be considered in every multi-core design is maintaining a consistent view of memory. Because the same physical memory location may be replicated several times, at different cache levels and in different cores, the processor must provide a consistent and easy-to-understand model of how competing loads and stores are coordinated to give a consistent view of the contents of memory. Ideally, when more than one copy of the same memory location exists, a store performed by one core should become visible to all cores in zero time, which cannot be achieved in practice. The effect of instantaneous propagation of stores can still be obtained if a global order of stores to the same memory location is enforced and a load returns the globally last value written, meaning the store of that value has been performed in the system with respect to all cores. The strictest memory consistency model in use today is sequential consistency. Formally, it requires that the result of any execution is the same as if the memory operations of all threads were executed in some sequential order, and the operations of each individual thread appear in this sequence in the order specified by its program. Intuitively, this means that the memory accesses of one core are kept in program order, while accesses from different cores may be interleaved in any order. If programmers were surveyed, this memory model would be their first choice, but it requires considerable effort to implement in hardware. Many weaker models with relaxed consistency requirements have therefore been introduced; examples are given below:
• Loads may bypass earlier stores issued by the same core if they address a different memory location.
• A value stored by a core is first written to main memory before it becomes visible to every other processor.
• Rely on atomic sections (mutual exclusion) when accessing shared memory areas; all other loads and stores are performed without any consistency enforcement, as they are considered local.
• Processors with speculative execution sometimes speculate past memory-ordering constraints: the core assumes that no other core reads or writes the values involved.
If this assumption turns out to be false, the operation is cancelled and re-executed, but a significant performance advantage is gained whenever no conflict actually occurs. For programmers, memory consistency must be considered one of the most complex and difficult issues in the programming model of multi-core systems. Knowing the basics of how memory accesses work is therefore essential, for instance when hunting concurrency bugs or when building basic synchronization primitives on top of core operations. Partitioning work into concurrent tasks and synchronizing between them are essential activities in designing software for many-core systems. Synchronization is very difficult to achieve purely in software, so hardware assistance is needed; purely hardware synchronization mechanisms, however, are generally difficult to scale and have limited flexibility. For this reason the most common solution is to build software synchronization on basic hardware support, and programmers concentrate on the primitives provided by the hardware. Current processors provide this support in the form of read-modify-write (RMW) or conditional-store instructions. The fundamental principle is to provide the smallest possible atomic section that guarantees conflict-free access to a specific memory location containing the data required for synchronization. The most commonly used read-modify-write (RMW) instructions are described below:
Test-and-set (T&S) atomically reads a memory location, sets it to one, and returns the old value in a core register (no other core can perform a store to this memory location while the T&S instruction is executing). Using this instruction, the code to acquire and release a lock looks like this:
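The original listing is not reproduced in the text; the following C11 sketch shows the usual pattern, with atomic_flag_test_and_set standing in for the hardware T&S instruction:

    #include <stdatomic.h>

    static atomic_flag lock = ATOMIC_FLAG_INIT;

    void acquire(void) {
        /* Spin: T&S returns the old value; 0 means we obtained the lock. */
        while (atomic_flag_test_and_set_explicit(&lock, memory_order_acquire))
            ;   /* busy-wait until the flag was previously clear */
    }

    void release(void) {
        atomic_flag_clear_explicit(&lock, memory_order_release);  /* set back to 0 */
    }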
Compare-and-swap (CAS) atomically compares the contents of a memory location with a supplied value and, if they are equal, replaces the memory contents with a new value held in a register. A lock implementation based on CAS is as follows (quite similar to the implementation based on T&S).
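Again the original listing is not reproduced here; a C11 sketch of a CAS-based lock, using atomic_compare_exchange as the CAS primitive, might look like this:

    #include <stdatomic.h>

    static atomic_int cas_lock = 0;          /* 0 = free, 1 = held */

    void cas_acquire(void) {
        int expected;
        do {
            expected = 0;                    /* CAS succeeds only if the lock is free */
        } while (!atomic_compare_exchange_weak_explicit(
                     &cas_lock, &expected, 1,
                     memory_order_acquire, memory_order_relaxed));
    }

    void cas_release(void) {
        atomic_store_explicit(&cas_lock, 0, memory_order_release);
    }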
Load-linked and store-conditional (LL/SC): this is an unbundled version of T&S that is more flexible to use. There are two instructions, which are linked: a load-linked reads a memory location, and the subsequent store-conditional to that location succeeds only if no other operation has touched that location since the load was executed; if it fails, the destination register is set to 0. Here is how a lock can be implemented with these mechanisms:
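Portable C has no direct way to write LL/SC (on ARM it is the LDREX/STREX pair, on POWER lwarx/stwcx.); compilers generate those instructions from the C11 atomics. The sketch below therefore uses a weak compare-exchange as a stand-in for the load-linked/store-conditional retry loop:

    #include <stdatomic.h>

    static atomic_int ll_lock = 0;           /* 0 = free, 1 = held */

    void llsc_acquire(void) {
        for (;;) {
            /* "load-linked": observe the current value of the lock word. */
            int observed = atomic_load_explicit(&ll_lock, memory_order_relaxed);
            if (observed != 0)
                continue;                    /* lock held: retry the linked load */
            /* "store-conditional": fails if the location changed since the load;
             * the compiler lowers this to LL/SC on ARM/POWER or CAS on x86.     */
            if (atomic_compare_exchange_weak_explicit(
                    &ll_lock, &observed, 1,
                    memory_order_acquire, memory_order_relaxed))
                return;                      /* store-conditional succeeded      */
        }
    }

    void llsc_release(void) {
        atomic_store_explicit(&ll_lock, 0, memory_order_release);
    }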
The presence of such constructs in the ISA greatly simplifies the implementation of synchronization mechanisms. Any one of these basic mechanisms is sufficient to build most types of synchronization primitives, as well as software implementations of lock-free data structures.
8.2.5 Multi-threading hardware
The observation that a core often sits idle while waiting for long-latency operations led to the implementation of hardware multithreading, a mechanism in which a core holds several thread contexts in hardware (including the program counter and register set, while sharing resources such as the caches) and switches quickly between hardware threads whenever one of them stalls on a very high latency operation. Many hardware implementations of this concept exist; the most common fine-grained multithreading technique in current cores is simultaneous multithreading. It has been adopted by most major companies, such as Intel (the Hyper-Threading Technology (HTT) concept), IBM (the thread-priority concept) and Oracle/Sun (with up to eight hardware threads on each core). Recently, the widening gap between memory access time and core speed, together with the levelling-off of core clock frequencies, has made such latency-hiding techniques attractive for large programs.
Interconnects are one of the most important architectural building blocks of multiprocessors. They tie the multiple processors together so that they can act as a single logical processing unit. Currently, two popular interconnect technologies are found in systems:
• Intel QuickPath Interconnect (QPI): this interconnect technology is used in most Intel chips; it connects the processors to each other and to the I/O subsystem.
8.3.1 Multitasking
Process-level concurrent execution is usually called multitasking. All currently available operating systems support multitasking. Multitasking refers to the concurrent execution of processes; it also allows parallel execution of two or more parts of a single program, so a multitasked job requires less execution time. Multitasking can be achieved by adding code to the original program to provide proper linkage and synchronization of the divided tasks. Multitasking was introduced in operating systems in the mid 1960s, including, among others, the IBM operating systems for the System/360 such as DOS/360, OS/MFT and OS/MVT. Almost all operating systems provide this feature.
Trade-offs do exist between multitasking and not multitasking: multitasking should be practiced only when the overhead is small, and sometimes not all parts of a program can be divided into parallel tasks. The multitasking trade-offs must therefore be analyzed before implementation.
Single Core
A core can be considered a processing unit. A processor with a single core can execute only a single task at a time. In single-core processing, the tasks other than the currently executing task have to wait for their turn, and this waiting time increases the overhead. In these processors the performance can be improved only by scheduling: through programming, an appropriate time slot is given to each task in which it has to be executed.
In figure 8.2 there are application requests such as word processing, e-mail, web browsing and virus scanning by an antivirus. The operating system handles these requests by building a task queue for the applications. The application tasks are then sent for execution one by one, because there is only a single execution core.
Multi-core
A multi-core system has one CPU package that contains more than one core. Each core works as an independent microprocessor. With multiple cores, a processor can perform multiple operations for a single process at the same time. Resources needed for processing, such as the cache and the front-side bus, are shared, so the processor cores in multi-core chips operate in a shared-memory mode. However, message passing, which works independently of the physical locations of processes or threads, also provides a natural software model to exploit the structural parallelism present in an application. A system with multiple cores provides performance similar to a multiprocessor system with the advantage of much lower cost, because in a multi-core system a single CPU package supports multiprocessing.
hardware multi-threading also supports the natural parallelism which is always present
between two or more independent programs running on a system. Even two or more
operating systems can share a common hardware platform, in effect providing multiple
virtual computing environments to users. Such virtualization makes it possible for system to
support more complex and composite workloads, resulting in better system utilization and
return on investment. The figure 8.3 shows a multi-core processor with 4 cores. Each core has
its individual local memory. A system memory is shared by all the cores. External devices
communicate with processor by system bus.
8.3.2 Multithreading
A thread can be considered a lightweight process. Through parallelism, threads improve the performance of program execution. Threads are implemented in two ways:
User-level threads: these are managed by the user; the kernel has no information about them. The user creates these threads for an application with the help of a thread library, which contains the code for thread creation, deletion and communication by message passing between threads. User-level threads can run on any operating system and are fast to create and manage.
Kernel-level threads: the operating system manages these threads. The scheduling of each thread is done by the operating system; the kernel performs thread creation, scheduling and thread management, and can schedule multiple threads simultaneously.
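A minimal POSIX-threads example in C (the worker function and thread count are illustrative): pthread_create asks the kernel to create and schedule a new thread, and pthread_join waits for it to finish.

    #include <pthread.h>
    #include <stdio.h>

    /* Illustrative worker: each thread prints its own id. */
    static void *worker(void *arg) {
        int id = *(int *)arg;
        printf("thread %d running\n", id);
        return NULL;
    }

    int main(void) {
        pthread_t threads[4];
        int ids[4];

        for (int i = 0; i < 4; i++) {       /* create four kernel-scheduled threads */
            ids[i] = i;
            pthread_create(&threads[i], NULL, worker, &ids[i]);
        }
        for (int i = 0; i < 4; i++)         /* wait for all of them to finish       */
            pthread_join(threads[i], NULL);
        return 0;
    }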
Some operating systems support a coordinated combination of both types of threads (user level and kernel level): the developer can create as many user threads as necessary, and the corresponding kernel-level threads execute in parallel. The following thread models map user-level threads onto kernel-level threads.
Many-to-many mapping: multiple user-level threads are multiplexed onto a smaller or equal number of kernel threads. The number of kernel threads is specific to each application. Figure 8.5 shows the many-to-many relationship.
Many-to-one mapping: in this type of mapping, multiple user-level threads are mapped onto a single kernel thread. The disadvantage is that when a single thread blocks, the whole execution is blocked, because there is only one kernel-level thread.
Figure 8.6 Many to one thread relationship
One-to-one mapping: the concurrency in this model is greater than in the many-to-one model. It eliminates the blocking problem of many-to-one mapping: another thread can run when one thread blocks. The main disadvantage is that each user-level thread requires a corresponding individual kernel-level thread.
Conventional von Neumann machines are built with processors that execute a single context at a time; in other words, each processor maintains a single thread of control with its hardware resources. In a multithreaded architecture, each processor can execute multiple contexts at the same time. The term multithreading implies that there are multiple threads of control in each processor. Multithreading offers an effective mechanism for hiding long latency when building large-scale multiprocessors and is today a mature technology. In a multithreaded processor, the operating system not only assigns a time slot to each application in which it has to execute, but also assigns time slots to the individual threads of an application, since each application can be considered a collection of multiple threads.
The multithreading idea was pioneered by Burton Smith (1978) in the HEP system, which extended the concept of scoreboarding of multiple functional units in the CDC 6600. Subsequent multithreaded microprocessor projects were the Tera computer (Alverson, Smith et al., 1990) and MIT Alewife (Agarwal et al., 1989).
One possible multithreaded MPP system is modeled by a network of processor (P) and memory (M) nodes, as depicted in the figure. The distributed memories form a global address space.
Four parameters are defined to analyze the performance of a multithreaded processor:
The latency (L): the communication latency on a remote memory access. The value of L includes the network delays, the cache-miss penalty, and delays caused by contention in split transactions.
The number of threads (N): the number of threads that can be interleaved in each processor. A thread is represented by a context consisting of a program counter, a register set, and the required context status words.
The context-switching overhead (C): the cycles lost in performing a context switch in a processor. This time depends on the switch mechanism and the amount of processor state devoted to maintaining active threads.
The interval between switches (R): the cycles between switches triggered by remote references. The inverse p = 1/R is called the rate of requests for remote access, and it reflects a combination of program behavior and memory system design.
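These parameters are often combined into a simple saturation model. The model below is an illustrative assumption, not derived in this text: a thread runs for R cycles, a switch costs C cycles, and the latency L is fully hidden once the other N - 1 threads can fill it, i.e. when (N - 1)(R + C) >= L. Below that point efficiency grows linearly with N; above it, it saturates at R/(R + C):

    #include <stdio.h>

    /* Illustrative saturation model for a multithreaded processor:
     * below saturation    E = N*R / (R + C + L),
     * at/above saturation E = R / (R + C),
     * with saturation point N_sat = 1 + L / (R + C).
     * The numbers below are made up for illustration only.          */
    int main(void) {
        double L = 80.0, C = 4.0, R = 16.0;
        double n_sat = 1.0 + L / (R + C);

        for (int N = 1; N <= 8; N++) {
            double e = (N < n_sat) ? N * R / (R + C + L) : R / (R + C);
            printf("N = %d  efficiency = %.2f\n", N, e);
        }
        printf("saturation at about N = %.1f threads\n", n_sat);
        return 0;
    }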
Figure 8.8 Block diagram of Dual-Core Intel Xeon processor of 5100 Series
Terminology
A '#' symbol indicates an active-low signal. The basic terms used here are explained below:
Dual-Core Intel® Xeon® Processor 5100 Series – an Intel 64-bit microprocessor used for dual-processor servers and workstations, based on Intel's 65-nanometer process and with advanced power capabilities.
FC-LGA6 (Flip Chip Land Grid Array) Package – the 5100 series processor package. The Dual-Core Intel Xeon uses a land grid array package comprising a processor die mounted on a substrate with 771 lands, and includes an integrated heat spreader (IHS).
LGA771 socket – the Dual-Core Intel Xeon 5100 processor interfaces to the baseboard through this 771-land surface-mount socket. See the LGA771 socket design guidelines for details.
Processor core – the execution engine of the processor, each core having its own L1 cache; the two cores on the die share the integrated L2 cache and the system bus interface. All AC signal timing requirements are specified at the processor core.
FSB (Front Side Bus) – the electrical interface that connects the processor to the chipset, also referred to as the system bus or processor system bus. Besides memory and I/O transactions, interrupt messages are also passed between the processor and the chipset over the FSB.
Dual Independent Bus (DIB) – a front-side-bus architecture in which each processor has its own bus, rather than two processor FSB agents sharing one bus. The dual independent bus architecture enhances performance through higher FSB speed and bandwidth.
Flexible Motherboard Guidelines (FMB) – estimates of the electrical and thermal specifications of the Dual-Core Intel Xeon 5100 series over a period of time. The actual values may differ from these estimates.
Functional operation – normal operation in which all processor specifications, including DC, AC, FSB, mechanical and thermal specifications, are satisfied.
Storage conditions – a non-operational state. The processor may be installed in a platform, stored in a tray, or in bulk, and may be sealed in packaging or exposed to free air. Under these conditions, the processor lands should not be connected to any supply voltages, have any I/Os biased, or receive any clocks. Upon exposure to "free air" (that is, unsealed packaging or a device removed from its packaging material) the processor must be handled in accordance with the moisture sensitivity labelling (MSL) as indicated on the packaging material.
• Priority Agent – the chipset, which acts as the bridge between the host processors and the rest of the system (I/O and memory).
• Symmetric Agent – a processor that shares the same I/O subsystem, memory array and operating system with another processor in the system. Systems using symmetric agents are called symmetric multiprocessing (SMP) systems.
• Integrated Heat Spreader (IHS) – a component of the processor package used to improve the thermal performance of the package. Thermal solutions interface with the processor at the IHS surface.
• Thermal Design Power (TDP) – the target that thermal solutions should be designed to meet. It is the highest sustained power dissipation expected while executing known power-intensive real applications. TDP is not the maximum power that the CPU can dissipate.
• Intel® Extended Memory 64 Technology (EM64T) – an extension of the IA-32 Intel architecture that permits the processor to run operating systems and applications designed to take advantage of the 64-bit extension technology. For further details see http://developer.intel.com/.
• Enhanced Intel SpeedStep Technology (EIST) – a technology used for servers and workstations that provides power-management capabilities.
• Platform Environment Control Interface (PECI) – a one-wire interface providing a communication channel between the processor and chipset components or external thermal monitoring devices (for example fan-speed controllers), which read the processor's digital temperature sensor outputs over PECI. PECI replaces the thermal diode available in previous processors.
• Intel® Virtualization Technology – processor virtualization that, when used with Virtual Machine Monitor software, allows multiple robust, independent software environments to run on a single platform.
• VRM (Voltage Regulator Module) – a DC-DC converter built on a module that interfaces with a card-edge socket and supplies the correct voltage and current to the processor based on the state of the processor's VID bits.
• EVRD (Enterprise Voltage Regulator Down) – a DC-DC converter built onto the system board that supplies the correct voltage and current to the processor based on the state of the processor's VID bits.
• VCC – processor core power supply.
• VSS – processor ground.
• VTT – FSB termination voltage.
8.4.2 Electrical specification
The electrical design must maintain noise tolerance as processor frequency increases, and the data and address buses must run at higher speeds while preserving signal integrity.
Power and ground lands
For clean on-chip power distribution, the processor has 223 VCC (power) and 273 VSS (ground) inputs. All VCC lands must be connected to the processor power plane, and all VSS lands to the system ground plane. The VCC lands must be supplied with the voltage determined by the processor's Voltage Identification (VID) signals; see Table 2-3 for the VID definitions. About twenty-two lands are provided as VTT, which supplies termination for the FSB and power to the I/O buffers; a separate power supply that meets the VTT specification must be implemented for these lands.
Decoupling guidelines
A processor with a large number of transistors and high internal clock rates, such as the Dual-Core Intel Xeon 5100 series, produces large average current swings between low-power and full-power states. Inadequate decoupling can cause the power planes to sag below their minimum values. Bulk capacitance (CBULK), such as electrolytic capacitors, supplies current during longer-lasting changes in current demand by the component, for example when it emerges from an idle state.
8.5 Multiprocessor
A multiprocessor is a computer system in which two or more CPUs have full access to a shared RAM. All CPUs may be equal, or some may be reserved for special-purpose tasks. A program running on any CPU sees a normal virtual address space. The only unusual property of such a system is that a CPU can write a value into a memory word, read the word back, and get a different value (because another CPU has changed it). When organized correctly, this property forms the basis of interprocessor communication: one CPU writes data into memory and another one reads it out.
Multiprocessor operating systems are similar to ordinary operating systems: they handle system calls, perform memory management, provide a file system and manage I/O devices. Nevertheless, some areas have unique features, among them process scheduling, resource management, and synchronization. Systems that treat all CPUs equally are called symmetric multiprocessing (SMP) systems. Besides that, there are several other ways to organize the resources, such as asymmetric, non-uniform memory access (NUMA) and clustered multiprocessing. Below we first take a brief look at multiprocessor hardware and then turn to operating-system issues.
All multiprocessors have the property that every CPU can address all of memory. Some multiprocessors have the additional property that every memory word can be read as fast as every other memory word. These machines are called Uniform Memory Access (UMA) multiprocessors; the others are called Non-Uniform Memory Access (NUMA) multiprocessors, which do not have this property.
The simplest multiprocessors are based on a single bus: one or more memory modules and two or more CPUs all share the same bus for communication. When a CPU wants to read a memory word, it first checks whether the bus is idle. If the bus is free, the CPU puts the address of the word it wants to read on the bus, asserts a few control signals, and waits until the memory puts the requested word on the bus. If the bus is busy with other communication, the CPU simply waits until the bus becomes idle, and then continues its operation.
Crossbar switches have been used for decades in telephone switching exchanges to connect a group of incoming lines to a set of outgoing lines. At each intersection of a horizontal (incoming) line and a vertical (outgoing) line there is a crosspoint. A crosspoint is a small switch that can be electrically opened or closed.
Single-bus UMA multiprocessors are limited to a few dozen CPUs, and crossbar-switched multiprocessors require expensive hardware and are bulky. To scale beyond this, the idea that all memory modules have the same access time must be given up, which leads to the concept of NUMA multiprocessors. Like UMA multiprocessors, they provide a single address space across all the CPUs, but unlike UMA machines, access to remote memory modules is slower than access to local ones. The performance of a NUMA machine is therefore worse than that of a UMA machine at the same clock frequency. Three basic key characteristics of NUMA machines are given below:
1. There is a single address space visible to all CPUs.
2. Access to remote memory is through LOAD and STORE instructions.
3. Access to remote memory is slower than access to local memory.
Q2) What is the difference between single core and multicore processors?
Q3) What are the different interconnection network topologies used in computer architecture?
Q5) Explain different multiprocessor hardware organizations.
Chapter 9
Superscalar Processors
Structure
9.0 Objectives
9.1 Introduction
9.1.1 Limitations of scalar pipelines
9.1.2 What is Superscalar?
9.2 Superscalar execution
9.3 Design issues
9.3.1 Parallel Decoding
9.3.2 Instruction Issue policies
9.3.2.1 Register renaming
Objectives
9.1 Introduction
Superscalar processors emerged in the late 1980s and received more attention with the rise of RISC processors, though the superscalar concept has also been applied to non-RISC processors such as the Pentium 4 and AMD processors. In today's market, desktop and server applications run on superscalar processors; a few examples are the Pentium, PowerPC, UltraSparc, AMD K5, HP PA7100 and DEC. The basic idea of superscalar architecture is to fetch several instructions at a time and execute them in parallel, taking advantage of the higher-bandwidth memories that advances in technology have made available. A CISC or a RISC scalar processor can be enhanced with a superscalar or vector architecture. Scalar processors execute one instruction per cycle: one instruction is issued per cycle, and only one instruction is expected to complete from the pipeline per cycle.
Pipelining is an implementation method for increasing the throughput of a processor. As it is a technique implemented below the dynamic/static interface, it does not need any special effort from the user, so speedup can be attained for existing sequential programs without any software modification. This way of enhancing performance while maintaining code compatibility is very attractive; indeed, this approach contributed to Intel's present dominance of the microprocessor market, since the pipelined i486 microprocessor was code compatible with the previous generations of non-pipelined Intel microprocessors. Although pipelining has been established as an extremely effective micro-architecture technique, scalar pipelines have a number of limitations. As there is a never-ending push for better performance, these limitations must be countered in order to keep providing further speedup for existing programs. The solution is superscalar pipelines, which can achieve performance levels higher than scalar pipelines.
9.1.1 Limitations of scalar pipelines
Scalar pipelines are characterized by a single instruction pipeline of k stages. All instructions, irrespective of type, travel through the same set of pipeline stages. At most one instruction can reside in each pipeline stage at any time, and instructions advance through the stages in lock-step fashion: apart from stages that are stalled, every instruction remains in each pipeline stage for one cycle and moves to the next stage in the following cycle. Such rigid scalar pipelines have three main limitations, listed below and elaborated further:
The maximum throughput of a scalar pipeline is bounded by one instruction per cycle.
Unifying all instruction types into a single pipeline can result in an inefficient design, because the pipeline must accommodate the requirements of every instruction type.
The stalling of a rigid, lock-step scalar pipeline leads to unnecessary pipeline bubbles.
9.1.2 What is Superscalar?
Superscalar machines are designed to improve on scalar processors by increasing the rate at which instructions are executed. In a superscalar processor, multiple mutually independent instructions are executed in parallel, unlike in a scalar processor, where one instruction is executed only after the previous instruction has finished its cycle. A superscalar processor uses more than one independent instruction pipeline, each containing multiple stages. Multiple streams of instructions can therefore be processed at a time, achieving a new level of parallelism. Because a superscalar architecture can execute independent instructions in different pipelines, it improves overall performance. The instructions that can commonly be executed independently and in parallel are load/store, arithmetic and conditional branch instructions.
Figure 9.1 shows the organization of a general superscalar processor in which 5 instructions are executed in parallel, where each instruction has 4 pipeline stages. At the same time all 5 instructions, i.e. 2 integer operations, 1 memory operation (load/store) and 2 floating-point operations, are executed.
Figure 9.1 General superscalar organization: a sequential stream of instructions is dispatched to multiple execution units (EUs) that share a register file
If the dispatcher is ineffective at keeping all of these units supplied with instructions, the system's performance will be no better than that of a much simpler and cheaper design. A superscalar processor usually sustains an execution rate in excess of one instruction per machine cycle. Merely processing multiple instructions simultaneously does not make a processor superscalar, since pipelined or multiprocessor systems achieve that too, but by different means.
In a superscalar CPU the dispatcher reads instructions from memory, determines which of them can be executed in parallel, and dispatches each one to one of the several EUs contained in a single central processing unit. A superscalar processor can therefore be described as having several parallel pipelines, each processing instructions simultaneously from a single instruction thread.
9.2 Superscalar execution
Figure 9.3 compares multiple-instruction execution in a superscalar processor with instruction execution in sequential, pipelined and super-pipelined processors. In a sequential processor, instructions are executed one after another, and each instruction passes through its steps in sequence: fetch, decode, execute in a functional unit, and finally write the result to memory. A sequential processor therefore needs at least 4 cycles for one instruction, i.e. Cycles Per Instruction (CPI) = 4. Modern processors behave differently; in a pipelined processor, for example, CPI is reduced towards 1 because the execution steps just described are overlapped like an assembly line.
In a pipelined processor, while one instruction is being executed by a functional unit, the next instruction is being decoded and the one after that is being fetched, so one instruction is completed in every clock cycle. The pipeline stages are essentially combinational logic circuits, possibly involving register or cache accesses, and each stage is separated from the next by a latch; a clock signal common to all latches keeps the data synchronized, as shown in Figure 9.4.
Figure 9.3 Instruction execution in a sequential processor, a pipelined processor, a super-pipelined processor and a superscalar processor
As shown in Figure 9.4, the stages may vary in length depending on the type of instruction, so the overall speed of the processor is limited by the longest stage. In a super-pipelined processor, deeper pipelines are used: the longer stages of an ordinary pipeline are subdivided into smaller ones, giving a larger number of shorter stages. This allows a higher clock speed, i.e. more instructions are executed in less time and the effective CPI is lower.
Figure 9.5 Concept of superscalar execution
Figure 9.5 illustrates the parallel execution method used in most superscalar processors. Instruction fetching, together with branch prediction, is used to form a stream of dynamic instructions. These dynamic instructions are checked for dependences, and artificial (false) dependences are removed. The instructions are then sent into the execution window. Within the execution window the instructions are no longer held in sequential order but are partially ordered by their true data dependences. Instructions are issued from the window in an order determined by those true data dependences and by hardware resource availability. Finally, after execution, the instructions are brought back into program order as they retire sequentially and their results update the architected processor state.
9.3 Design Issues
Since a superscalar processor executes multiple mutually independent pipelined instructions in parallel, it is necessary to understand how this is achieved; this is discussed under the design issues of superscalar processors. These issues include the policies used for issuing multiple instructions, how registers are used by multiple in-flight instructions, how machine-level parallelism is achieved, and how branch instructions are treated during multiple-instruction execution. The superscalar approach applies equally to CISC and RISC, although it is more straightforward for RISC machines. All the common instructions can be executed independently and in parallel, and the execution order is usually assisted by the compiler. The specific tasks of a superscalar processor are shown in Figure 9.6.
Figure 9.6 Superscalar processor design tasks
Multiple instructions to be executed in parallel must be independent, so the dependences between instructions have to be checked. There are three types of dependence: data dependence, control dependence and resource dependence. A data dependence occurs when data produced or modified by one instruction is used or modified by another instruction that would run in parallel with it. A control dependence arises when the control flow between program segments cannot be determined before run time, so that the data dependence between the segments is variable. A resource dependence occurs when there are not enough processing resources (e.g. functional units); even if several instructions are independent, they cannot then be executed in parallel.
Flow dependence (true dependence): I1 precedes I2, and an output of I1 is an input of I2.
Example:
I1: ADD r2, r1   (r2 ← r2 + r1)
I2: MUL r3, r2   (r3 ← r3 * r2)
Here I1 and I2 cannot be executed in parallel because the output of I1, namely r2, is used as an input of I2. Register r2 must therefore be read by I2 only after its value has been updated by I1; in other words, the read of r2 in I2 must follow the write of r2 in I1, so I1 and I2 cannot execute simultaneously. This is called a Read-After-Write (RAW) dependence. It is a true dependence that cannot be removed, because I2 is RAW dependent on I1.
Anti-dependence: I1 precedes I2, and the output of I2 overlaps an input of I1.
Example:
I1: ADD r2, r3   (r2 ← r2 + r3)
I2: MUL r3, r1   (r3 ← r3 * r1)
Here an input of I1 (r3) and the output of I2 use the same register for reading and writing respectively. This is a false dependence that can be eliminated through register renaming: the output register r3 of I2 is renamed by the compiler or the processor to some register other than r1, r2 and r3. This dependence is also called a Write-After-Read (WAR) dependence.
Output dependence: I1 and I2 write to the same output variable.
Example:
I1: ADD r2, r3, r4   (r2 ← r3 + r4)
I2: MUL r2, r1, r5   (r2 ← r1 * r5)
Here I1 and I2 both write to the same output register r2, making them output dependent and preventing parallel execution. This is also a false dependence, since it can be avoided by renaming the output register r2 in the compiler or the processor. It is also called a Write-After-Write (WAW) dependence.
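Taken together, the RAW, WAR and WAW checks above amount to comparing the source and destination registers of pairs of instructions. The following small C sketch (added for illustration; the structure, names and driver values are assumptions, not part of the text) classifies the hazard between two instructions in exactly this way, using the first example above as test input:

    #include <stdio.h>

    /* Illustrative sketch: classify the dependence of a later instruction I2
       on an earlier instruction I1, given each instruction's destination
       register and two source registers (register numbers as plain ints). */
    typedef struct { int dest, src1, src2; } Instr;

    static const char *classify(Instr i1, Instr i2) {
        if (i2.src1 == i1.dest || i2.src2 == i1.dest)
            return "RAW (true dependence)";      /* I2 reads what I1 writes  */
        if (i2.dest == i1.src1 || i2.dest == i1.src2)
            return "WAR (anti-dependence)";      /* I2 writes what I1 reads  */
        if (i2.dest == i1.dest)
            return "WAW (output dependence)";    /* both write same register */
        return "independent";
    }

    int main(void) {
        Instr add = {2, 2, 1};   /* I1: ADD r2, r1  (r2 <- r2 + r1) */
        Instr mul = {3, 3, 2};   /* I2: MUL r3, r2  (r3 <- r3 * r2) */
        printf("%s\n", classify(add, mul));  /* prints: RAW (true dependence) */
        return 0;
    }

A real issue stage performs such checks in hardware, across all instructions currently in the issue window rather than one pair at a time.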
I/O dependence: The same variable and/or the same file is referenced by two I/O statements (read/write).
Control dependence: When a program segment contains conditional instructions, branch instructions or loops, control dependence arises; it prevents the control-dependent instructions from being executed in parallel with other independent instructions.
Control-independent example:
for (j = 0; j < n; j++)
{
    b[j] = c[j];
    if (b[j] < 0)
        b[j] = 1;
}
Control-dependent example:
for (j = 1; j < n; j++)
{
    if (b[j-1] < 0)
        b[j] = 1;
}
Compiler techniques are needed to get around control dependence limitations.
Resource dependence: Data and control dependences are based on the independence of the work to be done, whereas resource dependence is concerned with conflicts in using shared resources, such as registers, integer and floating-point ALUs, etc. ALU conflicts are called ALU dependence, and memory (storage) conflicts are called storage dependence.
9.3.1 Parallel Decoding
Decoding in a scalar and a superscalar processor is shown in Figure 9.8. In a scalar processor, one instruction at a time is sent from the instruction buffer to the decode/issue unit, whereas in a superscalar processor more than one instruction is sent from the instruction buffer to the decode/issue unit. For example, if the superscalar processor is a 3-way issue processor, then 3 instructions are sent to the decode/issue unit. A scalar processor takes one pipeline cycle for decode/issue, whereas a superscalar processor needs more than one pipeline cycle; to speed up this step, superscalar processors introduce the principle of pre-decoding, shown in Figure 9.7.
Figure 9.7 Principle of pre-decoding
In pre-decoding, part of the decoding work is done at the loading phase, i.e. when the instructions are loaded from the second-level cache or from memory into the Instruction Cache (I-Cache) by a pre-decode unit. As a result, already at the I-cache level information such as the class of each instruction and the type of resources required for its execution is known. In some processors, for example the UltraSparc, the pre-decode unit also calculates branch target addresses. The pre-decoded instructions are stored in the I-Cache with some extra bits appended, as shown in Figure 9.7 (e.g. from 128 bits to 148 bits); these extra bits carry the pre-decoded information about the instructions. Pre-decoding thus reduces the cycle time of the superscalar decode/issue unit.
Figure 9.8 Decoding in (a) Scalar processor (b) 3-way Superscalar processor
9.3.2 Instruction Issue Policies
How a superscalar processor issues instructions is described by its issue policy and its issue rate. The issue policy specifies how dependencies are handled during the instruction-issue process for parallel execution. The issue rate specifies the maximum number of instructions the superscalar processor can issue in each cycle. Instruction issue policies cover the order in which instructions are fetched, the order in which they are executed, and the order in which they update memory and registers. Based on the issue order and the completion order, issue policies can be categorized as below.
In-Order Issue and In-Order Completion:
In this policy instructions are issued in the order of their occurrence in the program and also complete in that same order. It can be seen clearly from the example that instructions I1 to I6 complete in the same order in which they are issued.
In-Order Issue and Out-of-Order Completion:
In this policy the instructions are issued in the order of their occurrence, but they do not complete in the same order.
Example:
In the example above it can be observed that the instructions do not complete in the order in which they left the decode unit; the possible data dependences must be checked in this case.
Out-of-Order Issue and Out-of-Order Completion:
In this policy neither the issue of the instructions nor their completion follows the order of their occurrence in the program. The possible data dependences must again be checked in this case.
Example:
Issue policies for the different cases are shown in Figure 9.9. For a false data dependence or an unresolved control dependence, the design options are either not to issue the affected instructions until the dependence is resolved or to issue them using some additional technique. For a true data dependence the instruction is not issued until the dependence is resolved, whereas for a false data dependence the conflict can be avoided at issue time by register renaming. For conditional branch instructions the processor either waits for the condition to be resolved or uses speculative branch processing to issue instructions in parallel with other, independent instructions.
There are two ways of implementing the register renaming technique: static and dynamic. In the static implementation, register renaming is performed by the compiler; this is used in pipelined processors. In the dynamic implementation, renaming is done by the processor during execution of the instructions; this is used in superscalar processors and requires additional registers and logic. Renaming can be partial or full: with partial renaming only certain types of instructions are renamed for execution, whereas with full renaming all eligible instructions are renamed.
Figure 9.10 Implementation of renaming buffers (options for fetching operands and for updating the program state: merged FX and FP register files, separate rename and architectural register files for FX and FP, or a ROB together with the architectural register file)
Figure 9.10 illustrates the implementation of register renaming in three different ways, i.e.
Merged architectural and rename register files
Standalone rename register files
Holding the renamed values in the Reorder Buffer (ROB)
In the merged architectural and rename register file approach, the same physical register file is used for both architectural registers and rename buffers; the available physical registers are dynamically assigned either as rename registers or as architectural ones. Separate merged register files are used for fixed-point (FX: integer) and floating-point (FP) data. Examples of processors using this approach are the IBM ES/9000 mainframe family, the Sparc64, the Power line of processors and the R10000. In the standalone rename register file approach, dedicated rename register files are used; examples of this line are the PowerPC processors, i.e. PowerPC 603 to PowerPC 620. In the last approach, renaming with the ROB, sequential consistency of instruction execution is maintained in addition to the renaming itself.
Example of Register Renaming:
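(The worked example that originally appeared here seems to have been a figure; the sequence below is a reconstruction consistent with the discussion that follows, in which F6 and F8 are renamed to S and T. The exact instructions are an assumption.)
I1: DIV   F0, F2, F4    (F0 ← F2 / F4)
I2: ADD   F6, F0, F8    (F6 ← F0 + F8)
I3: STORE 0(R1), F6     (Mem[R1] ← F6)
I4: SUB   F8, F10, F14  (F8 ← F10 − F14)
I5: MUL   F6, F10, F8   (F6 ← F10 * F8)
Here I4 writes F8, which I2 reads (WAR), I5 writes F6, which I2 also writes (WAW), and I3 reads F6 before I5 overwrites it (WAR). Renaming the F6 written by I2 to S and the F8 written by I4 to T removes all of these false dependences:
I1: DIV   F0, F2, F4
I2: ADD   S,  F0, F8
I3: STORE 0(R1), S
I4: SUB   T,  F10, F14
I5: MUL   F6, F10, T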
From the above example it can be observed how the false dependences are eliminated by renaming registers F6 and F8 to S and T. After register renaming only the RAW dependences remain, and these must still be respected when scheduling instructions for parallel execution.
It is in this phase that an execution tuple is formed: an opcode together with register storage locations, where physical registers are written in upper case (R1, R2, R3 and R4). Once execution tuples are created and buffered, the next step is to decide which tuples can be sent for execution. Instruction issue can be described as the run-time checking of the availability of data and resources. It is the part of the processing pipeline that lies at the centre of many superscalar designs, and it is the part that holds the execution window. In the ideal case an instruction would be executed as soon as its input operands are available; in practice there are other constraints on instruction issue, most importantly the availability of physical resources such as EUs, register-file ports and interconnect. An example of a possible parallel execution schedule is given below.
This schedule assumes that the hardware resources consist of two integer units, one branch unit and one path to memory. The horizontal direction shows the operations executed in a time step and the vertical direction shows successive time steps. In this schedule the branch ble could not yet be resolved, so instructions from the predicted path are being executed speculatively. Only the renamed values for r3 are shown here; in a real implementation the other registers would be renamed too, and each value assigned to r3 is bound to a separate physical register. The following paragraphs describe a number of ways of organizing the instruction issue buffers, in order of increasing complexity; some basic organizations are illustrated.
Single queue method: With a single queue and no out-of-order issuing, register renaming is not needed, and operand availability can be managed with simple reservation bits assigned to each register, as shown in Figure 9.11. A register is reserved when an instruction that will update it issues, and the reservation is cleared when that instruction completes.
Figure 9.13 Reservation stations
9.4 Branch Prediction
Branch instructions modify the value of the Program Counter (PC) conditionally or unconditionally and thereby transfer the control flow of the program. The major types of branches are shown in Figure 9.14. Unconditional branches are always taken, whereas a conditional branch is taken or not taken depending on whether its condition is met (true) or not (false).
In simple unconditional branches no return address is saved, but for branches to subroutines the return address is saved by saving the PC, and the return from the subroutine transfers control back to that saved address. Loop-closing conditional branches, also called backward branches, are taken in every iteration except the last one. Processor performance depends on the branch prediction scheme used. Prediction is mainly of two types, i.e. fixed prediction and true prediction.
Figure 9.14 Major types of branches
Fixed prediction: The guess is fixed to one outcome, taken or not-taken. The scheme follows the steps given below:
Guess "not taken" when an unresolved conditional branch is detected.
Proceed along the sequential execution path, but prepare for a wrong guess by also starting the "taken" path in parallel, i.e. by computing the Branch Target Address (BTA).
Check the guess when the condition status becomes available.
If the guess was correct, continue along the sequential path and discard the pre-processed BTA information.
If the guess was wrong, discard the sequential execution results and continue with the pre-processing of the "taken" path.
The steps above describe the "always not taken" approach; if branches do turn out to be taken in this approach, the taken penalty (TP) is higher than the not-taken penalty (NTP). In the opposite approach the same steps apply with "taken" as the initial guess, and the roles of the sequential and guessed paths are interchanged; in this case TP is usually lower than NTP. The not-taken approach is easier to implement than the taken approach. It is used, for example, in pipelined microprocessors such as the SuperSPARC, Power1, Power2 and Alpha-series processors, while the always-taken approach is used in the MC 68040 processor.
True prediction: Here the guess may come out either taken or not-taken, and the schemes are further categorized as static or dynamic, depending on whether they are based on the code itself or on the execution of the code. A prediction based only on the code is a static prediction; static schemes are classified as op-code based, displacement based or compiler directed. A prediction based on the history of the code's execution is called a dynamic prediction. In op-code based prediction, branches with certain op-codes are always assumed taken and branches with other op-codes always not-taken. In displacement-based prediction a displacement parameter D is defined and the prediction is made from its sign; for example, if D ≥ 0 the prediction is not-taken and if D < 0 the guess is taken. In compiler-directed prediction the guess is based on a hint given by the compiler; the compiler derives the hint from the kind of construct being compiled and conveys it by setting or clearing a bit in the encoding of the conditional branch instruction. In dynamic prediction the guess is based on the history of the branch: the basic assumption is that branches taken in their recent occurrences will also be taken at their next occurrence. Dynamic prediction techniques perform better than static ones, but they are more complex to implement and hence more costly. The history of branch instructions can be expressed in two different ways, i.e. the explicit dynamic technique and the implicit dynamic technique. In the first case, history bits explicitly record the history of the branches; in the latter case, the target access path of the predicted branch instruction is stated implicitly by the presence of an entry. The explicit technique is explained in detail below.
As shown in Figure 9.15, a static prediction is made on the basis of a particular attribute of the object code, such as the op-code, the branch displacement or a compiler-supplied hint, as discussed above.
1-bit dynamic prediction: In this approach one bit is used to record whether the branch was taken or not taken at its last occurrence. The two-state history kept for each branch is described by the state diagram shown in Figure 9.16. From the state diagram it can be seen that the history of the branch is updated after the branch has been evaluated; the prediction is simply the outcome of the branch's most recent occurrence.
2-bit dynamic prediction: In this approach a two-bit counter records the recent history of each branch. Usually the initial state is "strongly taken", and the prediction is made according to the current state of the counter.
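A minimal sketch of such a 2-bit saturating counter in C follows (added for illustration; the function names, the single shared counter and the driver loop are assumptions, not part of the text):

    #include <stdbool.h>
    #include <stdio.h>

    /* 2-bit saturating counter (illustrative sketch):
       0 = strongly not taken, 1 = weakly not taken,
       2 = weakly taken,       3 = strongly taken        */
    static unsigned counter = 3;          /* initial state: strongly taken */

    static bool predict(void) { return counter >= 2; }

    static void update(bool taken) {      /* call once the branch resolves */
        if (taken) { if (counter < 3) counter++; }
        else       { if (counter > 0) counter--; }
    }

    int main(void) {
        bool outcomes[] = { true, false, false, true };
        for (int i = 0; i < 4; i++) {
            printf("predict %s, actual %s\n",
                   predict() ? "taken" : "not taken",
                   outcomes[i] ? "taken" : "not taken");
            update(outcomes[i]);
        }
        return 0;
    }

A real predictor keeps one such counter per entry of a branch history table rather than a single global counter.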
3-bit dynamic prediction: In this approach the outcomes of the last three occurrences of the branch are stored, as shown in Figure 9.18, and the prediction follows the majority of those outcomes. For example, if two of the last three occurrences of the branch were taken, the prediction is "taken"; the table entry is then updated with the actual outcome in FIFO fashion. The implementation of 3-bit prediction is simpler than that of 2-bit prediction. The 3-bit scheme is implemented in the PA 8000 processor.
Figure 9.19 Speculative branching
Figure 9.19 shows how speculative branching handles unresolved conditional branch instructions for parallel processing. Based on the branch prediction, execution continues along the speculated path while the branch address at which the execution sequence changed is saved. If, once the condition is resolved, the speculation turns out to be correct, execution simply continues; otherwise the speculatively executed path is discarded and execution resumes along the sequential path. The extent of speculativeness can be discussed at two levels, as given below.
Extent of speculativeness: instructions on a speculated path may be merely fetched; fetched and decoded; fetched, decoded and dispatched; or fetched, decoded, dispatched and executed but not yet completed.
9.5 Memory Disambiguation
Static or dynamic scheduling is used in superscalar processors to achieve instruction-level parallelism, but ambiguity about dependences between memory instructions severely restricts the reordering of code for parallel execution, as can be observed from the example given below.
Example:
Effective address computation is required for both memory references.
Other instructions and run-time data may affect the effective addresses of the memory references.
Wide comparators are required for the address comparison.
Example:
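(Illustrative sketch standing in for the missing example; the register names are assumptions.)
I1: STORE r1, 0(r2)    (Mem[r2] ← r1)
I2: LOAD  r3, 0(r4)    (r3 ← Mem[r4])
Whether I2 may be moved ahead of I1 depends on whether r2 and r4 happen to hold the same address at run time; until both effective addresses are known, the two references are ambiguous and the hardware must conservatively assume a possible dependence.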
Load bypassing: If there is no aliasing, load instructions may bypass earlier store instructions. Separate address-generation and reservation stations are used for loads and stores. Before a load instruction is issued, the addresses of the preceding store instructions must be computed so that the load can be checked for dependences; if a store address cannot yet be determined, all subsequent load instructions must be held back until the address is valid. Store instructions are kept in the Reorder Buffer (ROB) until the previous instructions have completed. With load bypassing, loads execute out of order with respect to stores, which improves performance.
Example:
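(Illustrative sketch with assumed register names, standing in for the original figure.)
I1: STORE r1, 0(r2)
I2: STORE r5, 0(r6)
I3: LOAD  r3, 0(r4)
If the computed addresses show that 0(r4) differs from 0(r2) and 0(r6), the load I3 can be issued and executed before the two stores, i.e. it bypasses them, even though it follows them in program order.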
Load forwarding: In this approach the data that is about to be written to memory by a store instruction is forwarded directly to a dependent load, so the load does not have to wait for the value to reach memory.
Example:
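(Illustrative sketch with assumed register names, standing in for the original figure.)
I1: STORE r1, 0(r2)
I2: LOAD  r3, 0(r2)
Since both references use the same address, the value in r1 is forwarded directly from the store to the load, so r3 receives r1 without an actual memory read.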
9.6 Dynamic Instruction Scheduling
Arranging, or scheduling, two or more instructions for parallel execution is called instruction scheduling; when this job is done by the processor it is called dynamic instruction scheduling, also known as Out-Of-Order (OOO) execution. In dynamically scheduled processors (superscalar processors) instructions are executed out of order, which is the main advantage over statically scheduled processors (VLIW processors), where instructions execute in program order. With in-order execution an instruction must wait until its operands are available and its dependences are resolved, thereby blocking all the following instructions. Dynamic scheduling overcomes this problem: eligible independent instructions can be scheduled for parallel execution regardless of program order. Since the only wait is for an instruction's input operands, dynamic scheduling achieves higher performance than static scheduling.
9.7 Multithreading
From the programmer's point of view a program is a set of ordered instructions, while from the operating system's point of view the executable being run is termed a process rather than a program. The smaller units of code within a process are called threads; a process may contain many threads, all sharing the resources of the process. The instructions of a program can be divided into such threads, and parallelism can be achieved by executing these fine-grained threads simultaneously. Simultaneous Multithreading (SMT) is a technique that allows several independent threads to issue instructions to a superscalar processor's multiple functional units in a single cycle; it issues multiple instructions from multiple threads in each cycle. The aim of SMT is to increase processor utilization substantially, both in the presence of long memory latencies and when the parallelism available within a single thread is limited. It exploits thread-level parallelism and ILP together and performs well both on parallelizable programs and on single-threaded programs. This type of multithreading is shown in Figure 9.20, which uses five utilized threads and one unutilized thread.
Fine-grained multithreading in a superscalar (FGMS): only one thread issues instructions in a given cycle, but that thread can use the entire issue width of the processor. This hides vertical waste but exposes horizontal waste; it is the only model here that does not feature simultaneous multithreading.
Single issue and dual issue: these models limit the number of instructions each thread can issue from the scheduling window in one cycle. The single-issue model issues one instruction per thread per cycle and the dual-issue model issues two.
The features of SMT are given below:
Instruction-level parallelism and thread-level parallelism are exploited to the full extent.
It gives better performance for
o a mix of independent programs,
o parallelizable programs,
o programs with a single thread.
Out-of-order superscalar processors follow the SMT style of architecture, for example the MIPS R10000.
9.8 Example of Superscalar Architecture
The Pentium 4, introduced by Intel in November 2000, is a superscalar processor with a CISC architecture, as shown in Figure 9.21. The P4 has clock speeds that now exceed 2 GHz, compared with the 1 GHz of the Pentium 3. Even though the superscalar design concept is usually associated with reduced instruction set computing, the superscalar principle can be applied to complex instruction set computing machines as well. The Pentium 4 implements a pipeline with 20 stages and has separate EUs for integer and floating-point operations. Its operation can be summarized as follows:
The processor fetches instructions from memory in the order of the static program.
Each instruction is translated into one or more fixed-length, reduced-instruction-set-computing-like micro-operations.
Figure 9.22 Alternate view of Pentium 4 architecture
Pipeline stages 1 & 2 (Generation of micro-ops): The Branch Target Buffer and the Instruction Translation Lookaside Buffer are used to fetch instructions, which are read from the L2 cache 64 bytes at a time, as shown in the architecture block diagram. The instruction boundaries are determined, the instructions are decoded into micro-ops, and the trace cache is used to store this µ-code, as shown in Figure 9.22.
Pipeline stage 3 (Trace cache next-instruction pointer): Dynamically gathered history information is kept in the Trace Cache Branch Target Buffer (BTB). If the BTB does not contain the target, the following static rules are applied:
• for branches that are not PC-relative: predict taken if the branch is a return, and not taken otherwise;
• for PC-relative backward conditional branches predict taken, otherwise predict not taken.
Pipeline stage 4 (Trace cache fetch): The micro-ops are arranged in program order into sequences called traces, and these traces are fetched in the order given by the branch prediction. Some instructions, such as complex CISC instructions, require many micro-ops; these are encoded in ROM and fetched from the ROM when needed.
Pipeline stage 5 (Drive): This stage delivers instructions from the Trace Cache to the Rename/Allocator module for reordering.
Pipeline stages 6, 7 & 8 (Allocate; register renaming): Resources are allocated for execution; up to 3 micro-ops arrive per clock cycle. If the resources are available, the micro-ops are dispatched out of order, and each micro-op receives an entry in one of the two scheduler queues, depending on whether or not it is a memory access. Micro-ops are retired from the ROB in program order.
Pipeline stage 9 (Micro-op queuing): The micro-ops are loaded into one of two FIFO queues, one for memory operations and the other for non-memory operations.
Pipeline stages 10, 11 & 12 (Micro-op scheduling): When all the operands of a micro-op are ready, the two schedulers retrieve it and dispatch it to an available unit, at a rate of up to 6 micro-ops per clock cycle.
Pipeline stages 13 & 14 (Dispatch): If two micro-ops need the same unit, they are dispatched in order.
Pipeline stages 15 & 16 (Register file): The register file is the source of operands for the pending integer and floating-point operations.
Pipeline stages 17 & 18 (Execute; flags): The operations are executed and the flag values are computed.
Pipeline stage 19 (Branch check): The branch prediction is compared with the result obtained by checking the flag values.
Pipeline stage 20 (Branch check results): If the branch prediction was wrong, all the incorrectly executed micro-ops are flushed, the branch predictor is given the correct branch destination, and the pipeline is restarted from the new target address.
Questions
processor.
Summary
This chapter discusses the superscalar approach to achieving parallelism and, in the introductory part, the limitations of scalar pipelines. The design issues and the superscalar execution of instructions are explained with examples. The instruction issue policies for parallel execution and the dependences among instructions are covered with illustrative examples, and it is shown how parallel instruction execution can be achieved with register renaming. The crucial topics of superscalar execution, namely branch prediction, memory disambiguation, dynamic instruction scheduling, speculative execution and multithreading, are also covered in a simple, understandable manner. Finally, the Pentium 4 processor is presented as an example of a superscalar architecture, together with its micro-architecture and pipeline implementation.
Glossary
Pre-Decoding: Part of the decoding work that is done at the loading phase, i.e. when instructions are loaded from the second-level cache or from memory into the Instruction Cache (I-Cache).
Instruction Issue Policy: Specifies how dependencies are handled during the instruction-issue process for parallel execution.
Instruction Issue Rate: Specifies the maximum number of instructions the superscalar processor can issue in each cycle.
Register renaming: The technique used to avoid conflicts when multiple instructions in multiple execution paths use the same registers.
Branch Prediction: Guessing the outcome of the condition of a conditional branch instruction.
Fixed Prediction: The guess of a conditional branch instruction's outcome is fixed to one of taken or not-taken.
True Prediction: The guess may come out either taken or not-taken, and the schemes are further categorized as static or dynamic, based on the code and on the execution of the code.
Speculative Execution: Speculation allows instructions along a predicted branch path to be issued without any consequences (including exceptions) if the branch is not actually taken.
Memory Disambiguation: Determining whether two memory references alias, i.e. whether two memory references are dependent or not.
Dynamic Instruction Scheduling: Arranging or scheduling two or more instructions for parallel execution is called instruction scheduling; when this job is done by the processor it is called dynamic instruction scheduling, also known as Out-Of-Order (OOO) execution.
Reservation station: Its entries hold pointers to where the data can be found rather than the actual data itself.
Reorder buffer: A buffer that ensures instructions complete only in program order, by allowing an instruction to complete (retire) only when it has finished executing and all earlier instructions have completed.
Process: From the programmer's point of view a program is a set of ordered instructions; from the OS point of view the executable being run is termed a process rather than a program.
Thread: The smaller chunks of code within a process are called threads; a process may contain many threads, which share the resources of the process.
Multithreading: Dividing the instructions of a program into smaller threads and achieving parallelism by executing these fine-grained threads simultaneously.
SMT (Simultaneous Multi Threading): A technique that allows several independent threads to issue instructions to a superscalar processor's multiple functional units in a single cycle.
Chapter 10
VLIW and SIMD Architectures
Structure
10.0 Objectives
10.1 Introduction
10.0 Objectives
After studying this chapter one will understand
What VLIW and SIMD architectures are
How parallel processing is done in VLIW and SIMD architectures
How the different parallel processing architectures are classified
How to implement a long instruction word with multiple instructions
How to achieve data parallelism
What the different networks for SIMD architectures are
A case study of VLIW and SIMD architectures
10.1 Introduction
VLIW architectures differ from the traditional RISC and CISC architectures. It is important to distinguish the instruction-set architecture from its implementation. VLIW processors and superscalar processors share some characteristics, such as having multiple execution units and the potential to perform multiple operations at the same time, but the method used by each to achieve high performance is quite different: the VLIW approach places much of the burden on the compiler rather than on the architecture, keeping the processor architecture simple, whereas the superscalar approach requires much more engineering at the processor-architecture level. RISC architectures offer simpler and better-performing execution than CISC architectures, and VLIW architectures are simpler still because of their hardware simplicity; in return, VLIW architectures need the help of compilers much more than the others do.
When Intel introduced the IA-64 architecture, it also introduced the term EPIC (Explicitly Parallel Instruction Computing) for this architectural style. A VLIW processor has an internally parallel architecture that characteristically contains several independent functional units, as shown in Figure 10.1. These processors are statically scheduled by the compiler. VLIW has the advantage of offering highly concurrent execution that is much simpler and cheaper to build than equally concurrent reduced instruction set computing or complex instruction set computing chips, and it can attain good performance by exploiting parallelism at both the instruction and the data level.
10.1.1 Instruction-level Parallelism (ILP)
On a VLIW processor with two load/store units, one multiply unit and one add unit, the same code can be executed in just 5 cycles:
Cycle 1: load a1; load p1
Cycle 2: load a2; load p2; multiply z1 ← a1 * p1
Cycle 3: load a3; load p3; multiply z2 ← a2 * p2
Cycle 4: multiply z3 ← a3 * p3; add y ← z1 + z2
Cycle 5: add y ← y + z3
The performance is thus almost twice that of a sequential reduced instruction set computing processor. If this loop has to be executed repeatedly, the free slots in cycles 3, 4 and 5 can be used to overlap the execution and the loads for the next output value, further improving performance.
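For reference, the schedule above computes a three-term sum of products; a plain scalar C version of the same computation (added here for illustration, with sample values that are assumptions) is:

    #include <stdio.h>

    /* Scalar C version of the computation scheduled above:
       y = a1*p1 + a2*p2 + a3*p3                               */
    int main(void) {
        double a[3] = {1.0, 2.0, 3.0};   /* sample values (assumptions) */
        double p[3] = {4.0, 5.0, 6.0};
        double y = 0.0;
        for (int i = 0; i < 3; i++)
            y += a[i] * p[i];   /* one load pair, one multiply, one add */
        printf("y = %f\n", y);  /* the VLIW schedule overlaps these steps */
        return 0;
    }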
r[i] =p[i] + q[i];
As shown in Figure 10.2, the VLIW architecture is derived from two well-known concepts: horizontal micro-coding and superscalar processing. Every long instruction word contains fields that control the routing of data to the various register files and execution units, which gives the compiler precise control over data operations. Whereas a superscalar processor's control unit must make instruction-issue decisions on the basis of limited local information, a VLIW machine makes these execution-order decisions at compile time, allowing optimizations that reduce the number of hazards and stalls. This is a major advantage for straight-line code but a disadvantage when dealing with branch instructions, because the stalls become longer and more frequent. A typical VLIW instruction word is hundreds of bits long. As Figure 10.2 shows, different FUs are used simultaneously in a VLIW processor; the instruction cache supplies multiple instructions per fetch, although the actual number of instructions issued to the functional units may vary from cycle to cycle, being constrained by the data dependences in the code. It is noted that the average ILP is only about 2 for code without loop unrolling.
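Loop unrolling, mentioned above as a way of exposing more ILP to a VLIW compiler, can be sketched in C as follows (an illustration added here; the function and array names are assumptions):

    /* Original loop: one independent addition per iteration. */
    void add_arrays(int n, int r[], const int p[], const int q[]) {
        for (int i = 0; i < n; i++)
            r[i] = p[i] + q[i];
    }

    /* Unrolled by 4 (n assumed to be a multiple of 4): four independent
       additions per iteration that the compiler can pack into wide
       instruction words.                                               */
    void add_arrays_unrolled(int n, int r[], const int p[], const int q[]) {
        for (int i = 0; i < n; i += 4) {
            r[i]     = p[i]     + q[i];
            r[i + 1] = p[i + 1] + q[i + 1];
            r[i + 2] = p[i + 2] + q[i + 2];
            r[i + 3] = p[i + 3] + q[i + 3];
        }
    }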
In a VLIW architecture all the functional units share one large register file, as shown in Figure 10.3. The operations to be carried out simultaneously by the FUs are synchronized within one VLIW instruction (256 or 1024 bits), as in the Multiflow computer models. The VLIW concept is basically taken from the horizontal micro-coding method: different fields of the long instruction word carry the opcodes to be dispatched to the different FUs. Programs written in short instruction words must be bundled together to form VLIW instructions, and this code compaction has to be performed by a compiler that can predict branch outcomes using detailed run-time statistics.
A single very long instruction word encodes multiple operations, as shown in Figure 10.4.
Figure 10.4 Very long instruction word format (each 32-bit sub-word carries fields such as Opcode, Dest, Dest_bank, Branch Test, Src_1, Src_2 and Imm; one word holds an immediate constant)
Figure 10.4 shows the very long instruction word used in the Trace 7/200 processor, where each word is subdivided into 8 sub-words with early and late beats for execution, so that a single instruction includes multiple operations such as addition and multiplication. Typical word lengths range from about 52 bits to 1 Kbit. All the operations contained in an instruction are executed in lock-step mode. One or more register files are required for the FX and FP data. The approach relies on a compiler to find the parallelism and to schedule dependency-free program code. These processors contain several FUs and fetch from the instruction cache a very long instruction word holding several instructions; the whole VLIW is then dispatched for parallel execution. These capabilities are exploited by compilers that generate code in which independent instructions are assembled together so that they can execute in parallel. The control logic of such processors is simple because they perform no dynamic scheduling (unlike contemporary superscalar processors). VLIW has also been called a natural successor to reduced instruction set computing, because it moves complexity from the hardware to the compiler, allowing simpler and faster processing.
The main aim of a very long instruction word is to eliminate the complicated, time-consuming instruction scheduling and parallel dispatch that occur in modern microprocessors. In theory, a VLIW processor should therefore be faster and cheaper than a reduced instruction set computing chip. The compiler must pack many operations into a single instruction word in such a way that all the FUs are kept busy, which requires enough instruction-level parallelism (ILP) in the code sequence to fill the available slots. The compiler uncovers this parallelism by scheduling code across basic blocks, performing software pipelining and reducing the number of operations executed.
4. The compiler has to keep track of worst-case delays and cache misses.
5. This kind of hardware dependence limits the use of the same compiler for a whole line of very long instruction word processors.
Although a very long instruction word architecture reduces the hardware complexity compared with a superscalar architecture, a far more complicated compiler is required. Extracting maximum performance from a superscalar reduced instruction set computing or CISC implementation already requires sophisticated compiler techniques, but the level of sophistication needed in a very long instruction word compiler is significantly higher.
A second difference is that the fixed very long instruction word format contains bits for non-executable operations, whereas a superscalar processor issues only executable instructions. Thirdly, a superscalar processor can be object-code compatible with a large family of non-parallel machines, whereas a very long instruction word machine that exploits different amounts of parallelism would require different instruction sets. Instruction-level parallelism and data movement in a very long instruction word architecture are specified completely at compile time, so run-time resource scheduling and synchronization are entirely eliminated. A VLIW processor can be viewed as an extreme form of superscalar processor in which all the independent operations have been compacted together in advance, and the CPI of a VLIW processor can be considerably lower than that of a superscalar processor. A comparison of RISC, CISC and VLIW is summarized in Table 10.1.
Table 10.1 Architectural feature comparison of RISC, CISC and VLIW
10.2.3 Implementation and advantages of VLIW
Third, by using a large number of registers it is possible to emulate the function of the reorder buffer found in superscalar processors. The purpose of the reorder buffer is to let a superscalar processor execute instructions speculatively and then discard the results immediately when necessary. With many registers, a very long instruction word machine can place the results of speculatively executed instructions in temporary registers. Because the compiler knows in advance how many instructions will be executed speculatively, it simply uses those temporary registers along the predicted path and disregards the values in those registers along the other path, which would only have been needed if the branch had been mispredicted.
Advantages of VLIW
2. The compiler assigns each operation to a functional unit corresponding to its position within the instruction packet.
4. Their simple instruction-issue logic also often allows very long instruction word processors to fit more execution units onto a given chip area than superscalar processors.
6. Tasks such as decoding, data-dependence detection and instruction issue are simplified.
8. As a result, instructions can be executed with a shorter clock cycle than in superscalar processors.
Disadvantages of VLIW
3. Very long instruction word programs only work well when executed on a processor with exactly the same number of EUs and exactly the same instruction latencies as the processor they were compiled for.
4. Increasing the number of execution units between processor generations causes the new processor to try to combine operations from different instructions in each cycle, which can place dependent instructions in the same cycle.
5. Unscheduled events such as cache misses can stall the entire processor.
7. Code expansion causes high power consumption.
8. If the compiler cannot find enough parallel operations to fill all the slots of an instruction, it must insert explicit NOP (no-operation) operations into the unused operation slots, so very long instruction word programs take much more memory than equivalent programs for superscalar processors.
9. With unfilled opcodes, memory space and instruction bandwidth are wasted in VLIW; hence slot utilization is low.
10.3 Example of VLIW processor
Itanium (IA-64)
Itanium is a family of 64-bit Intel microprocessors that implements the Intel Itanium architecture (formerly called IA-64), as shown in Figure 10.6. Intel markets these processors for enterprise servers and high-performance computing systems. IA-64 was the first architecture to bring ILP (instruction-level parallel execution) features to general-purpose microprocessors; it is based on EPIC (Explicitly Parallel Instruction Computing). Its computing speed is very high and its architecture is simple.
IA-64 is an explicitly parallel architecture with a rich register set, a base data word length of 64 bits, byte addressability and a logical address space of 2^64 bytes. The architecture also implements branch prediction and speculation. For parameter passing it uses a register-renaming mechanism, which also allows loops to be executed in parallel. The compiler controls the prediction, the speculation and the register renaming; to accommodate this control each instruction carries extra bits in its word, which is a distinguishing characteristic of the IA-64 architecture. The architecture provides the following register resources:
Integer registers: 128
Floating-point registers: 128
One-bit predicates: 64
Branch registers: 8
Length of the floating-point registers: 82 bits
Instruction execution in IA-64
One IA-64 very long instruction word is 128 bits long and contains 3 instructions, and the fetch mechanism can read up to 2 such instruction words from the L1 cache into the pipeline in a single clock cycle. The processor can therefore execute up to 6 instructions per clock cycle. IA-64 has a total of 30 functional execution units in 11 groups. Each sub-word of the long instruction is executed by an execution unit in one clock cycle, provided its data is available. The execution-unit groups are:
2 integer units, 6 general-purpose ALUs and 1 shift unit
4 data cache units
1 parallel multiply unit, 2 parallel shift units, 1 population-count unit and 6 multimedia units
2 SIMD FP MAC (floating-point multiply and accumulate) units
2 82-bit FP MAC units
3 branch units
At 800 MHz, IA-64 is rated at 3.2 GFLOPS (giga floating-point operations per second), and at 1.67 GHz it reaches 6.67 GFLOPS.
Figure 10.7 SISD architecture (a single instruction stream operating on a single data stream)
In this architecture instructions are executed in serial order; it is the conventional sequential machine model. For example, the two instructions c = a + b and a = b * 2 would be executed as follows:
1st cycle: load a
2nd cycle: load b
3rd cycle: c = a + b
4th cycle: store c
5th cycle: a = b * 2
6th cycle: store a
From this illustration it can be observed that only one instruction and one data stream is acted on during any clock cycle.
A SIMD machine operates on a vector of data with a single instruction; in modern SIMD architectures all the elements of the vector are processed simultaneously.
Figure 10.8 SIMD architecture (a single control unit issues one instruction stream to processing units P1, P2, ..., Pn, each paired with a memory module)
Example:
Cycle P1 P2 ... Pn
From Figure 10.8 and the example above it can be observed that in any given clock cycle all the processing units P1, P2, ..., Pn execute the same instruction, but each processing unit operates on a different data element. This concept is used in vector computers that combine scalar and vector hardware.
Figure 10.9 MIMD architecture (each processing unit has its own control unit, instruction stream and memory module)
MIMD achieves parallelism with an architecture, shown in Figure 10.9, consisting of a number of processors that operate asynchronously and independently. At any time, different processors may be executing different instructions on different pieces of data.
Example:
Cycle P1 P2 ... Pn
From the example it can be observed that an MIMD machine executes different instructions on different data elements in the same clock cycle. CAD is among the most common applications of MIMD parallel computers.
MISD is a type of parallel computing architecture, shown in Figure 10.10, in which many functional units perform different operations on the same data. Since the data is different after processing by each stage of a pipeline, pipeline architectures can also be placed in the MISD class.
Figure 10.10 MISD Architecture
MISD is used in very few practical applications; one important application that uses the MISD architecture is the systolic array, and a commonly cited example is cracking a single encoded message using multiple cryptography algorithms.
Two types of SIMD architecture exist: true SIMD and pipelined SIMD.
True SIMD architectures are distinguished by whether they use distributed memory or shared memory. The SIMD implementations for shared and distributed memory are shown in the figures; they differ in the placement of the processors and memory modules.
In a true SIMD architecture with distributed memory, the control unit interacts with every processing element in the architecture, and each processing element has its own local memory, as shown in the figure. The control unit provides the instructions to the processing elements, which act as arithmetic units. If a processing element needs to communicate with the memory of another element in the same architecture, the fetching and transfer of the information is done through the control unit. The main drawback of this architecture is its slow performance, because the control unit has to handle all the data-transfer activity.
In a true SIMD architecture with shared memory, the processing elements do not have local memories; instead they are connected to a network through which they can access any memory module. As shown in the figure, the same network allows every processing element to share its memory contents with the others. The control unit's role in this architecture is only to send instructions for computation; it plays no part in accessing memory. The advantage of this architecture is its shorter processing time, since the control unit is not involved in data transfer.
In pipelined SIMD, the control unit sends instructions to each processing element, and the processing elements then perform the computation in multiple stages using a shared memory. The architecture varies with the number of stages used for pipelining.
The interconnection network carries data between the memory and the processors, and the topology of a network describes the connection pattern formed between these communication nodes. Two kinds of topology are used in general: 1) direct topology, a point-to-point connection using a static network, and 2) indirect topology, a dynamic connection using switches. The choice of interconnection network is based on the demands of the application for which the SIMD architecture is designed. Some common topologies used in SIMD architectures are described below.
The most commonly used network topology for SIMD architectures, the 2-dimensional mesh, is shown in Figure 10.13. It is a direct topology in which the switches are arranged in a 2-D lattice structure, and only communication between neighbouring switches is allowed.
Figure 10.13 (a) 2-dimensional mesh (b) 2-dimensional mesh with wrap-around connections at the edges
The main feature of this network topology is that it supports close local connections, a feature exploited in several applications. Its main drawback is its relatively large maximum distance: even with the edges wrapped around, the maximum distance Dmax for N processing elements is on the order of √N.
Figure (a) Shuffle exchange (b) Direct binary 3-cube (c) Indirect binary 3-cube
Data-level parallelism can be exploited significantly in SIMD architectures, both for matrix-oriented scientific computing and for media-oriented image and sound processing. Because SIMD needs only one instruction to be fetched per data operation, it leads to energy-efficient architectures, which is attractive for mobile devices. SIMD operates on multiple data elements, which can be viewed in space or in time: in the space view, multiple data elements are processed by a single instruction at the same time using array processors, whereas in the time view, multiple data elements are processed by a single instruction in consecutive time steps using vector processors.
Example:
Instruction stream:
LD  B ← A[3:0]
ADD B ← B, 1
MUL B ← B, 2
In an array processor each of these instructions operates on all four data elements in parallel at the same time, whereas in a vector processor the same instructions are applied to the data elements one after another in consecutive time steps.
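For comparison, the instruction stream above corresponds to the following scalar C loop (added for illustration; treating B as an array rather than a vector register, and the sample input values, are assumptions):

    #include <stdio.h>

    /* Scalar equivalent of the SIMD example: each element of A[0..3]
       is loaded, incremented and doubled.  A SIMD machine applies each
       of these operations to all four elements with one instruction.  */
    int main(void) {
        int A[4] = {10, 20, 30, 40};   /* sample input (assumption) */
        int B[4];
        for (int i = 0; i < 4; i++) {
            B[i] = A[i];               /* LD  B <- A[3:0] */
            B[i] = B[i] + 1;           /* ADD B <- B, 1   */
            B[i] = B[i] * 2;           /* MUL B <- B, 2   */
        }
        printf("%d %d %d %d\n", B[0], B[1], B[2], B[3]);
        return 0;
    }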
MasPar stands for massively parallel machine, as it involves a huge number of processing elements working in parallel. In this architecture an essentially unlimited number of processing elements can be used, because the design incorporates a distributed memory scheme; the only practical limit is the cost of the processing elements.
MP-1 machine: In this architecture the Processing Elements (PEs) are connected in a 2-D lattice structure, and the PEs are custom designed for the MP-1. A front-end machine (a VAX) drives this machine, and very high-speed I/O devices can be attached to the architecture.
In the MasPar architecture an instruction operating on multiple data elements is executed in one go, as shown in Figure 10.15. The architecture has two main parts: a Front End and a Data Parallel Unit (DPU), which is further divided into an Array Control Unit (ACU) and a Processor Element Array (PEA).
Front End:
The MasPar computational engine does not have its own operating system; therefore a front-end workstation (e.g. a VAX) is required to interface with the MasPar and make it programmer friendly.
DPU:
A program's parallel portions are executed by the DPU, which consists of two parts, the ACU and the PEA. The ACU performs two tasks: it executes instructions that operate on singular data, and it feeds the instructions that operate on parallel data to the PEs.
PEA:
Simple processors are connected in a 2-D mesh whose end connections wrap around.
Each processor is connected to its eight neighbouring PEs, as shown in the figure. The PEs can read from and write to memory and can perform arithmetic operations, but they can only execute; they cannot fetch or decode instructions.
Questions
Q1. What does VLIW stand for? Explain it with a suitable example.
Q2. Explain the instruction format of the VLIW architecture.
Q3. Explain the architecture of VLIW and give an example.
Q4. Explain the detailed architecture and features of the IA-64 processor.
Q5. What is parallel processing? Give the classification of parallel architectures based on instructions and data.
Q6. What does SIMD stand for? Explain its architecture in detail.
Q7. Explain the network topologies used in SIMD architectures.
Q8. What is MASPAR? Explain it in detail as an example of SIMD architecture.
Summary
This chapter describes very long instruction word (VLIW) processors, their importance in a parallel processing environment, and how instruction-level and data-level parallelism is achieved at the architecture level. The VLIW instruction format has been discussed in detail, along with how instructions are scheduled for parallel processing. Pipelining and its implementation in VLIW architectures are discussed with their advantages and disadvantages. This chapter also covers the classification of parallel architectures based on data and instruction streams, with an architectural description of each class, and explains in detail the implementation of SIMD architectures with their interconnection networks. Examples of VLIW and SIMD architectures are also explained in a simple way.
Glossary
VLIW: Very Long Instruction Word, where the length of the instruction is around 256 bits to 1024 bits. The length depends on the number of execution units available in a processor.
RISC: Reduced Instruction Set Computing, in which the instruction set architecture is very simple, leading to an increase in processor performance.
CISC: Complex Instruction Set Computing, in which more than one low-level operation is put into one main instruction.
EPIC: Explicitly Parallel Instruction Computing, in which parallel execution of instructions is arranged at the compiler level.
Register file: A processor's set of registers arranged as an array.
Cache miss: When the data that a processor is trying to read or write is not found in the cache memory, the result is a cache miss.
SISD architecture: One instruction and one data element are executed by a processor in one clock cycle.
SIMD architecture: One instruction and more than one data element are executed by a processor in one clock cycle.
MISD architecture: More than one instruction and one data element are executed by a processor in one clock cycle.
MIMD architecture: More than one instruction and more than one data element are executed by a processor in one clock cycle.
Distributed memory: In a multiprocessor environment, each processor has its own memory, which acts as local memory to that processor.
Shared memory: In a multiprocessor environment, the same memory is used by many processors through a network.
Chapter 11
Advanced Memories
Structure
11.1 Objective
11.2 Cache Accessing
11.3 Latency during Cache access
11.4 Cache Miss Penalties
11.5 Non-Blocking Cache memory
11.6 Distributed Coherent Cache
11.1 Objective
The objective of this chapter is to define and discuss the working of advanced memories. There are a variety of techniques and processors that make use of these advanced memories in order to enhance overall performance and instruction execution. Section 11.2 defines cache accessing mechanisms such as write through and write back, along with the bottlenecks related to these techniques. Section 11.3 discusses latency issues related to cache access; latency depends on several factors: cache size, hit ratio, miss ratio, miss penalty and page size. Section 11.4 deals with cache miss penalties, which are categorized into three main categories: compulsory, capacity and conflict. Section 11.5 covers non-blocking cache memory: whenever a miss occurs, the gap between processor access time and memory latency grows and processor utilization decreases, and the non-blocking cache technique is used to deal with this miss penalty effectively. Section 11.6 covers the distributed coherent cache, in which distributed memory is used to solve the cache coherence problem.
Access Mechanism
Technologically, a variety of memory types with different access mechanisms are available; depending on its mechanism, a memory is used to store or retrieve a particular type of information. These mechanisms are:
Random access: This is the mechanism of a memory in which a location, usually known as a word, has a unique addressing mechanism. This address is physically wired-in and the word can be fetched in one memory cycle. It generally uses a constant time slot for retrieving any given location. Main memory and cache use this access mechanism.
Content addressing: This memory is also known as associative memory; to access a location, a field of the data word is used instead of an address. Here the concept of RAM is extended: logic for bit comparison is physically wired in along with the addressing system. This logic circuit compares the desired bit positions of all the words with a given input key. The comparison is done for all the words, and the words for which a match occurs are then accessed.
The general organization of the memory subsystem is detailed in Figure 11.1, which represents the main storage of a computer system. For economic reasons, designers have to build large-capacity main storage with a lower speed than that of the CPU. To resolve this problem and to reduce the speed mismatch between CPU and main memory, a high-speed, low-capacity cache memory is introduced in between. The typical memory hierarchy employed in a computer system is represented in Figure 11.2, and the access mechanism that defines the cache as a high-speed memory is elaborated in the following section.
When the processor needs to read from a location in main memory, it first checks whether a copy of that data is available in the cache. If it is, the processor reads from the cache immediately. The block diagram in Figure 11.3 shows the relation of the CPU with the cache and main memory via a common data and address bus.
Write through
Write through is a simple technique in which every write operation is made to main memory as well as to the cache. Since main memory always holds a valid copy, any CPU-cache module can monitor traffic to main memory to maintain consistency within its own cache. However, a considerable amount of memory traffic is generated, which may lead to a bottleneck.
Write back
Write back is a technique that reduces memory writes to a minimum. With this approach, updates are made only in the cache. When an update occurs, a flag bit F associated with the page is set. When a page is later replaced, it is written back to main memory if and only if its flag bit F is set. The key point to note with write back is that portions of main memory may be invalid, so any access by an input/output device is viable only through the cache. This technique requires more complex circuitry but is suitable where write-through traffic would create a bottleneck.
Figure 11.4 Flowchart representing the working principle of read/write operation with
cache memory
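As a rough illustration of the difference between the two write policies, the following minimal Python sketch models a single-line cache; the class names, the dirty flag and the backing-store dictionary are assumptions made purely for illustration and are not part of the text.

# Minimal sketch of the two write policies for a one-entry cache.
main_memory = {}            # address -> value

class WriteThroughCache:
    def __init__(self):
        self.addr, self.value = None, None
    def write(self, addr, value):
        self.addr, self.value = addr, value
        main_memory[addr] = value              # every write also goes to memory

class WriteBackCache:
    def __init__(self):
        self.addr, self.value, self.dirty = None, None, False
    def write(self, addr, value):
        if self.dirty and self.addr is not None and self.addr != addr:
            main_memory[self.addr] = self.value  # write back only on replacement
        self.addr, self.value, self.dirty = addr, value, True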
There are four basic modes of accessing:
Read
Write
Execute
Modify.
Read and Write modes are simply used for reading data from memory and writing/storing data to memory. Execute access mode is used when instructions are to be fetched from memory. Modify access mode is used to coordinate simultaneously executing programs on one or more processors. Accessing further comprises three phases. The first phase is translation, the compilation of a high-level language program into a lower-level representation. The second phase is linking, which combines several separately translated programs into a single larger program for execution. The third phase is execution, the running of the linked program. Most modern CPUs have at least three independent caches: an I-cache (instruction cache), a D-cache (data cache) and a TLB (Translation Lookaside Buffer). The I-cache provides speedy access to executable instructions, the D-cache is responsible for accessing and storing data, and the TLB is discussed later on.
An effective memory accessing design provides access to information when the processor needs it, and it should support object naming. For efficiency, it should not allow information to get too far from the processor unless the information is accessed infrequently. Thus, to do its job efficiently, the allocation mechanism may move information closer to or farther from the processor depending on how frequently that information is required. To implement these moves and to keep track of the closest location of each object, the accessing mechanism must be able to map names among name spaces as objects move among memory modules. This mapping is performed in multiple stages.
To access an object, the object must be selected either on the basis of its location (location addressing) or of its contents (content addressing). Location addressing selects objects based on their location within a memory address space. Content addressing deals with a specific portion of the object's representation, known as the key, which must match the selector value. The key is a component of the object in memory, and the selector is the input value that describes the entry to be accessed.
11.2.2 Object Name Mapping Strategies
Different types of object names may exist within a computer system; these are collected into separate name spaces. A name mapping transforms addresses between name spaces, and multiple name mapping options are available. The translation from a name n1 in name space N1 to the corresponding name n2 in name space N2 can be represented by a mapping function
n2 = f1,2(n1)
Mapping functions can be implemented using two representation schemes:
Algorithmic representation
Tabular representation
A tabular representation specifies, where possible, the output name corresponding to each input name. A tabular mapping can also be represented as a binary relation: if two items I1 and I2 satisfy the binary relation R, this is written as I1 R I2. The complete relation can be traversed by listing all pairs that belong to the relation, and the complete listing gives a view of the table. Two keywords are used here: domain and range. The domain is the set of input values for which the mapping is defined; the range is the set of output values produced by the mapping.
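A tabular name mapping can be sketched as a simple lookup table; the dictionary below, together with the example symbolic and physical names, is a hypothetical illustration rather than anything defined in the text.

# A tabular name mapping f: N1 -> N2 stored as a lookup table.
mapping = {"x": 0x1000, "y": 0x1004, "z": 0x2000}   # hypothetical entries

domain = set(mapping.keys())      # input names for which the mapping is defined
range_ = set(mapping.values())    # output names produced by the mapping

def translate(n1):
    # Returns n2 = f(n1), or raises KeyError if n1 is outside the domain.
    return mapping[n1]

print(hex(translate("y")))        # -> 0x1004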
An important property of a name mapping is its lifetime, i.e. the time interval during which the mapping remains valid. The mapping information can be discarded once it is certain that it will never be required again. A fixed name mapping is the same for every execution of a program, whereas a variable name mapping may change. If the same mapping function is used during the entire execution of a program, it is known as a static name mapping; one with a variable mapping function is known as a dynamic name mapping. Static name mappings can be performed during program compilation.
Memory is a device which responds to the CPU's read/write instructions by retrieving/storing information from/to the addressed location provided by the CPU. Memory speed is represented by the memory cycle time Tm, i.e. the time elapsed between the moment when the address is placed in the MAR (Memory Address Register) and the moment when reading/writing of information at that location is completed. When a read operation completes, the read information is available in the MDR (Memory Data Register); when a write operation completes, the information from the MDR has been written to the addressed location. For a single-module memory, latency can be defined as the interval between the instant when a read instruction is sent by the CPU to the memory and the moment when the data is available to the CPU. This is explained diagrammatically in Figure 11.5.
To achieve the minimum possible latency, the data or program targeted by the CPU should be available in the cache and directly accessible by the CPU. Latency depends on a number of factors:
Cache size
Hit Ratio
Miss Ratio
Miss Penalty
Page size
Cache Size
The first factor, cache size, significantly affects the memory access time. The cache should be small enough that the average cost per bit is close to that of main memory alone, yet large enough that the average access time is close to that of the cache alone. Achieving an optimum cache size is almost impossible: the larger the cache, the larger the number of gates involved in addressing it, which makes it slightly slower than a smaller cache even when built with the same IC technology.
Hit Ratio
Average memory access time can be reduced by pushing up the hit ratio of the faster level (the cache). The hit ratio (H) is defined as
H = Ni / (Ni + Ni+1)
where Ni is the number of times the word targeted by the CPU is available in the ith level (cache), and Ni+1 is the number of times the word is not available at that level and has to be accessed from the (i+1)th level (main memory).
Miss Ratio
The miss ratio (M) is defined as M = 1 − H, i.e. the fraction of references for which the targeted word is not found in the cache.
Miss Penalty
The miss penalty is defined as the combined time to replace a page from the (i+1)th level into the ith level and the time to deliver the page to the CPU from the ith level:
Miss penalty (at ith level) = time to replace a page from the (i+1)th level to the ith level + time to deliver the page to the CPU from the ith level.
The time taken to replace a page from the (i+1)th level to the ith level consists of the access time in the (i+1)th level memory followed by the page transfer time. So the memory access time at the ith level is
Memory access time (at ith level) = access time on a hit to level i memory × percentage of hits (H) + miss penalty × percentage of misses (M).
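The average-access-time relation above can be evaluated directly; the sketch below is a small Python helper, with the numerical values chosen purely as an assumed example.

def average_memory_access_time(hit_time, miss_penalty, hit_ratio):
    # Memory access time at level i = hit_time * H + miss_penalty * M.
    miss_ratio = 1.0 - hit_ratio
    return hit_time * hit_ratio + miss_penalty * miss_ratio

# Assumed example: 1-cycle cache hit, 50-cycle miss penalty, 95% hit ratio.
print(average_memory_access_time(hit_time=1, miss_penalty=50, hit_ratio=0.95))
# -> 3.45 cycles on average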
Page Size
Page size also affects the memory access time. The effect of page size on miss penalty and miss rate is elaborated in Figure 11.6 (a) and (b). Memory latency is fixed for a given memory type; therefore, as the page size increases, the miss penalty increases (Figure 11.6 (a)). With a larger page size the hit ratio increases only up to a limit; beyond that limit the hit ratio drops, because fewer of the larger pages fit in the memory (Figure 11.6 (b)). The miss rate varies inversely with the hit ratio. Figure 11.6 (c) shows the variation of average access time with page size.
Figure 11.6 Effect of Page Size (a) on Miss penalty (b) on Hit ratio (c) on Average access
time
To reduce the latency problem, a memory with multiple modules can be used instead of a single module. The CPU can issue requests to different modules in an interleaved manner, as shown in Figure 11.7. As a result, a set of memory words (data/instructions) is made available to the CPU in sequence. This technique reduces the wait time and improves the bandwidth of the data flow, because data/instructions are delivered to the CPU one after the other without waiting.
Figure 11.7 Latency reduced using multiple module memory
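Low-order interleaving is a common way to spread consecutive addresses across modules; the sketch below assumes four modules and a simple modulo mapping, which is an illustrative choice rather than something specified in the text.

# Low-order interleaving: consecutive word addresses map to consecutive modules.
NUM_MODULES = 4                     # assumed number of memory modules

def module_of(address):
    return address % NUM_MODULES    # which module holds this word

def offset_in_module(address):
    return address // NUM_MODULES   # word index inside that module

for addr in range(8):
    print(addr, "-> module", module_of(addr), "offset", offset_in_module(addr))
# Consecutive addresses 0,1,2,3 land in modules 0,1,2,3 and can be
# accessed in an overlapped (interleaved) fashion.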
Cache memory plays a very important role in making a system efficient. Whenever we wish to read or write any data/instruction, the cache is checked first. If the data is available, it is sent to the processor immediately; this is known as a hit. For an efficient system with good throughput there should be a hit every time, but this is not possible in practice. When the data/instruction is not available in the cache, it is fetched from main storage into the cache and then sent to the processor for further execution. This is known as a miss, and it obviously carries an overhead, technically termed the miss penalty. Therefore, the miss penalty is the combined time to replace a page from main memory into the cache and the time to deliver the page to the CPU from the cache.
Compulsory
This is a cold-start miss on the very first reference to a page that currently resides only in main storage; in other words, the page is either very infrequently used or is being used for the first time.
Capacity
This miss occurs when the cache is full and another page must be brought from main storage into the cache by replacing one of the existing pages. Two popular replacement schemes exist for this: the FIFO (First In First Out) replacement algorithm and the LRU (Least Recently Used) replacement scheme.
Conflict
In some page mapping techniques, such as direct mapping and set-associative mapping, a set of pages is mapped to the same set of page frames. This may result in a miss known as a conflict miss. Fully associative mapping, or a larger number of page frames per set, reduces conflict misses, since there is greater flexibility in placing a page within a set.
To evaluate the contribution of the cache to overall program execution time, we first evaluate the CPU execution time for a program. CPU time is calculated as the product of the instruction count (IC), i.e. the number of instructions in the program, the clocks per instruction (CPI), i.e. the number of clock cycles taken by an individual instruction, and the clock cycle time (CCT). The CPI value also contains the delay to access data from the memory subsystem, which includes the cache and main storage, so the cache access delay is included within the CPI value. However, a referenced item may not reside in the cache, causing a miss; in that case the CPI value increases because of memory stall clock cycles (MSC).
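Putting the definitions together, CPU time including memory stalls can be computed as in the sketch below; the formula follows the IC × CPI × CCT relation from the text, while the example numbers are assumptions.

def cpu_time(ic, cpi, cct, memory_stall_cycles_per_instr=0.0):
    # CPU time = IC x (CPI + memory stall cycles per instruction) x CCT.
    return ic * (cpi + memory_stall_cycles_per_instr) * cct

# Assumed example: 1e6 instructions, base CPI of 2, 1 ns clock cycle,
# and 0.5 extra stall cycles per instruction caused by cache misses.
ideal = cpu_time(ic=1_000_000, cpi=2, cct=1e-9)
real  = cpu_time(ic=1_000_000, cpi=2, cct=1e-9, memory_stall_cycles_per_instr=0.5)
print(ideal, real)   # 0.002 s vs 0.0025 s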
The effect of memory stall cycles due to cache misses is considerable in a system with a low CPI value.
A higher CPU clock rate results in a larger miss penalty (measured in cycles).
For systems with a low CPI value and a high clock rate, e.g. RISC architectures, cache performance therefore plays a very important role.
Normally the read operation is used much more frequently than the write operation, so to improve efficiency the read miss penalty must be reduced. Two main approaches are used:
While a page is being transferred from main memory, the CPU need not be stalled until the full page has arrived; execution can continue as soon as the desired word of the page is received.
A large page may be divided into a number of sub-blocks, each with its own valid bit. This reduces the miss penalty because only a sub-block needs to be transferred from main storage.
However, both of the above approaches need additional hardware for their implementation.
The miss penalty for write operations can be reduced by providing a write buffer of adequate capacity. A suitable buffer size can be estimated from simulations of benchmark programs. However, a drawback is associated with this approach: if a read from a page in main storage is required while updated data for that page is still in the write buffer, an erroneous result could be returned. This issue can be handled in two ways:
A read miss waits until the write buffer is empty; however, this increases the read miss penalty.
The contents of the write buffer are checked on a read miss; if the targeted word is not in the buffer, the read miss action can continue.
For a read operation, the data and tag can be read from the cache simultaneously. For a write operation, writing proceeds only after a tag match, so a write consumes more than one clock cycle. The speed of write operations can be increased by pipelining the two steps, tag search and cache write: while the first pipeline stage searches the tag for the current write, the second stage performs the previous write operation. With this approach, once a hit is identified in the first stage, the write is completed in the second stage in a single cycle. In a write-through cache, a single-cycle cache write can be achieved by skipping the tag match operation: a page is divided into sub-blocks, each the size of a word and each with its own valid bit, and the write and the setting of the valid bit in the cache are done in one clock cycle.
While designing a memory subsystem there must be an efficient trade-off between two conflicting goals: higher speed and larger capacity. A compromise between them can be achieved by providing two levels of cache memory, as shown in Figure 11.8. Here the first-level cache (C1) has low capacity but high speed, while the second-level cache (C2) has high capacity and lower speed. The speed of C1 is close to that of the CPU, and its capacity should be enough to achieve the desired hit ratio.
Average memory access time with C1 and C2 = hit time in C1 + miss rate in C1 × miss penalty of C1
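In a two-level hierarchy the miss penalty of C1 is itself determined by C2, so the relation can be expanded one level as in the sketch below; treating the C1 miss penalty as "C2 hit time + C2 miss rate × C2 miss penalty" is a standard expansion assumed here, and the numbers are illustrative only.

def two_level_amat(hit_c1, miss_rate_c1, hit_c2, miss_rate_c2, penalty_c2):
    # Miss penalty of C1 is the time to service the access from C2 (and beyond).
    miss_penalty_c1 = hit_c2 + miss_rate_c2 * penalty_c2
    return hit_c1 + miss_rate_c1 * miss_penalty_c1

# Assumed example: 1-cycle C1 hit, 5% C1 miss rate, 6-cycle C2 hit,
# 20% C2 local miss rate, 60-cycle penalty to main memory.
print(two_level_amat(1, 0.05, 6, 0.20, 60))   # -> 1.9 cycles on average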
The speed of C1 affects the CPU clock rate, while the speed of C2 affects the miss penalty of C1. From a cost standpoint the size of the high-speed C1 is limited, so the design effort is aimed at a cost-effective second-level cache in terms of speed and size; increasing its size beyond a certain limit brings no benefit in execution speed or miss rate. In practice, C1 is usually synchronized with the C2 cache and the CPU. Table 11.1 lists typical parameters of the C1 and C2 caches for a cost-conscious design.
Table 11.1 Typical Parameters of C1 and C2 cache
Whenever a miss occurs, the gap between processor access time and memory latency widens. This worsens the problem of cache miss penalties to a large extent and degrades the processor, i.e. processor utilization decreases. Many techniques have therefore evolved to reduce cache miss penalties, such as increasing the cache hit ratio by adding small buffers, adding a two-level cache design that reduces access time more cost-effectively, and improving the speed of write operations. Another family of techniques extends processor utilization by using write buffers, non-blocking caches or prefetching to access data within the same process.
Typically, a cache can handle only one request at a time. On a miss, i.e. when data is not found in the cache, the data has to be fetched from main memory; during this retrieval the cache remains idle, or 'blocked', and does not handle any further request until the fetch completes. A 'non-blocking' cache addresses this problem: rather than sitting idle waiting for the operation to complete, the cache accepts another request from the processor, provided that request is independent of the previous one. A non-blocking cache is also known as a lock-up free cache, since the failure or suspension of one request cannot cause the failure or suspension of another. The non-blocking cache is a popular latency-hiding technique.
It is used along with other latency-reducing techniques such as prefetching, relaxed consistency models and multithreading.
Non-blocking caches were first introduced by Kroft. Their design was based on three main features:
Load operations are lock-up free.
Write operations are lock-up free.
The cache can handle a number of cache miss requests simultaneously.
To handle multiple misses in a non-blocking manner, special registers known as Miss Status Holding Registers (MSHRs) were introduced. These registers store information about pending requests. An MSHR contains the following attributes (a sketch follows this list):
1. Address of the data block
2. Cache frame required for the block
3. The word in the block that caused the miss
4. Destination register
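To make the bookkeeping concrete, the following sketch models an MSHR entry and a small MSHR file; the field names and the four-entry limit are assumptions chosen for illustration.

from dataclasses import dataclass

@dataclass
class MSHR:
    block_address: int      # address of the missing data block
    cache_frame: int        # cache frame reserved for the block
    miss_word: int          # word within the block that caused the miss
    dest_register: str      # register waiting for the data

class MSHRFile:
    def __init__(self, entries=4):           # assumed: four outstanding misses
        self.entries, self.pending = entries, []
    def allocate(self, mshr):
        if len(self.pending) >= self.entries:
            return False                      # no free MSHR: the cache must stall
        self.pending.append(mshr)             # miss proceeds without blocking
        return True
    def complete(self, block_address):
        # Data returned from memory: retire all requests for that block.
        self.pending = [m for m in self.pending
                        if m.block_address != block_address]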
From the features listed above, non-blocking loads require support in the processor, buffered writes handle non-blocking writes, and to handle several cache miss requests simultaneously MSHRs alone are not enough; the available cache bandwidth must also be taken into account. Non-blocking loads require additional support in the execution unit of the processor along with the MSHRs. Some kind of register interlock is required to maintain data dependencies when a processor uses static instruction scheduling in its pipeline; for dynamic instruction scheduling with out-of-order execution, a scoreboarding technique is preferred. Non-blocking operations can generate interrupts, and to handle these successfully, interrupt handling routines are required for both scheduling approaches.
Write buffers play a significant role in removing stalls on write operations: they allow the processor to continue executing even while writes are pending. The write penalty can be reduced by using write buffers with write-through caches. Write buffers are also used with write-back caches to hold the written value temporarily until the data line is brought back. Multiple writes to one line can also be combined to reduce the total number of writes to the next level. This technique may pose a consistency problem, as a later read may be required before a previously buffered write has been performed; in this case an associative check of the buffer is performed so that the read returns the correct value.
Non-blocking functions are concerned with exploiting the overlap between post-miss execution and memory accesses. A processor stall can be delayed by using non-blocking loads until a data dependency actually occurs; such loads are especially beneficial for superscalar processors. For static scheduling, the non-blocking distance, i.e. the number of instructions and memory accesses that can proceed past the miss, tends to be small; this distance can be increased by compiler-generated code scheduling. For dynamic instruction scheduling, additional hardware is used to increase the non-blocking distance, but its efficiency depends on many factors such as branch prediction, the lookahead window, and data dependencies. In contrast, non-blocking writes are more effective in minimizing the write miss penalty, since the memory access time and the non-blocking distance are almost the same, and no extra hardware is required for pending writes. This alone does not improve data access throughput to a great extent, because the write miss penalty is not the only source of degradation. Other factors, such as the lookahead distance (the number of cycles by which a prefetch request precedes the execution of the referencing instruction), can be managed with the help of prefetching caches. Prefetching caches, however, require high implementation costs, extra on-chip support units and more complex hardware than non-blocking caches.
The basic uniprocessor architecture contains a single processor attached to a single memory module. MIMD architectures extend this model to multiple processors attached to multiple memory modules, and the extension can be handled by two different mechanisms. In the first, the processor/memory pair is duplicated and the pairs are connected through an interconnection network. Each processor/memory pair works as one element, independent of the other pairs; communication among pairs uses message passing, and one element cannot directly access the memory of another element. This class of extended architecture is known as the distributed memory MIMD architecture. Distributed memory architectures do not suffer from the cache coherency problem, because the message passing approach handles multiple copies of the same data in the form of messages. This architecture is depicted in Figure 11.9.
Figure 11.9 Structure of Distributed Memory MIMD Architecture.
The second mechanism creates a set of processors and a set of memory modules, where any processor can access any memory module directly. An interconnection network is present in this scheme too, as an interface between processors and memory. The memory modules combined form a global address space shared among the participating processors; hence the name shared memory MIMD architecture. This scheme is shown in Figure 11.10.
In both architectures the major design concern is to construct the interconnection network so as to minimize message traffic (for distributed memory MIMD) or memory latency (for shared memory MIMD). Distributed memory MIMD architectures use static interconnection networks, in which the connections between switching units are fixed and generally treated as point-to-point links. Shared memory MIMD architectures use dynamic interconnection networks, in which the links can be reconfigured according to the switching units that are active at the time. The different characteristics of the interconnection networks also create a difference in how they are used: in the distributed case the network is concerned with transferring a complete message in one go, no matter how long it is, so the focus is on message passing protocols; in the shared case, memory is accessed by short but very frequent requests, so the major concern is to avoid contention in the network.
To reduce memory contention, shared memory systems are augmented with small memories known as cache memories. Whenever a memory reference is issued by a processor, the cache is checked first for the required data; if the data is found, the memory reference can be satisfied without using the interconnection network. Memory contention is thus reduced, at least while hits occur, but as the number of cache misses increases, the contention problem grows proportionally. The logical shared memory architecture explained above can also be implemented physically as a collection of local memories. This new architecture is termed the distributed shared memory architecture. From a construction point of view it is similar to the distributed memory architecture; the main distinction lies in the organization of the memory address space. In a distributed shared memory system the local memories are part of a global address space and any local memory can be accessed by any processor, whereas in a distributed memory architecture one processor cannot directly access the local memory of another processor (as discussed above). Distributed shared memory architectures can be further categorized into three classes on the basis of how the local memories are accessed:
Non-uniform memory access (NUMA) machines
Cache-only memory access (COMA) machines
Cache-coherent non-uniform memory access (CC-NUMA) machines
NUMA
In NUMA machines the shared memory is physically distributed among the processors, so the access time is dependent on which memory module holds the accessed word. Hardware solutions for the cache consistency problem are not provided in NUMA machines: these machines can cache read-only data and local data, but not shared modifiable data. In this respect they are closer to distributed memory architectures than to shared memory ones.
COMA
Both classes of distributed shared memory architecture, COMA and CC-NUMA, use coherent caches to remove the drawbacks of NUMA machines. COMA machines use a single address space and coherent caches to perform data partitioning and dynamic load balancing, which makes the architecture well suited to multiprogramming and parallel compilers. In these machines every memory block behaves as a cache memory, and because of the cache coherence mechanism data migrates at run time to the local caches of the processors where it is actually needed. The general architecture of COMA machines is shown in Figure 11.12.
Figure 11.12 General Architecture of COMA machines
CC-NUMA
Figure 11.13 General Architecture of CC-NUMA machines
Summary
Technologically, a variety of memory types with different access mechanisms are available; depending on its mechanism, a memory is used to store or retrieve a particular type of information. The first is random access memory (RAM), in which a location, usually known as a word, has a unique addressing mechanism. The second is content addressable memory (CAM), also known as associative memory, in which a location is accessed using a field of the data word instead of an address.
When the processor needs to read from a location in main memory, it first checks whether a copy of that data is available in the cache. If it exists, this is termed a 'hit' and the processor immediately reads from the cache. If it does not exist, this is termed a 'miss': the data is fetched from main memory and sent to the cache, from where it is used by the processor. Two basic approaches are applied for effective utilization of the cache. Write through is a simple technique in which all write operations are made to main memory as well as to the cache. Write back is a technique that reduces memory writes to a minimum; with this approach, updates are made only in the cache. There are four basic modes of accessing: Read, Write, Execute and Modify. Read and Write modes are simply used for reading data from memory and writing/storing data to memory; Execute access mode is used when instructions are to be fetched from memory.
For a single-module memory, latency can be defined as the interval between the instant when a read instruction is sent by the CPU to the memory and the moment when the data is available to the CPU. Latency depends on a number of factors. Cache size significantly affects the memory access time: the cache should be small enough that the average cost per bit is close to that of main memory alone. Average memory access time can be reduced by pushing up the hit ratio; the miss ratio (M) is defined by M = 1 − H. The remaining factors are the miss penalty and the page size.
The cache miss penalty is defined as the combined time to replace a page from main memory into the cache and the time to deliver the page to the CPU from the cache. Cache misses can be categorized into three variants: compulsory, capacity and conflict. To evaluate the cache's contribution to overall program execution time, we first evaluate the CPU execution time for a program: CPU time is the product of the instruction count (IC), i.e. the number of instructions in the program, the clocks per instruction (CPI), i.e. the number of clock cycles taken by an individual instruction, and the clock cycle time (CCT). The miss penalty for write operations can be reduced by providing a write buffer of adequate capacity, and the miss penalty in general can also be reduced by using two-level caches.
Typically, a cache can handle only one request at a time. On a miss, i.e. when the data is not found in the cache, it has to be fetched from main memory; during this retrieval the cache remains idle, or 'blocked', and does not handle any further request until the fetch is complete. A 'non-blocking' cache addresses this problem: rather than sitting idle waiting for the operation to complete, it accepts another request from the processor.
Distributed memory architectures do not suffer from the cache coherency problem, because the message passing approach handles multiple copies of the same data in the form of messages. In shared memory architectures, memory is accessed by short but very frequent requests, so the major concern is to avoid contention in the network. To reduce memory contention, shared memory systems are augmented with small memories known as cache memories: whenever a memory reference is issued by a processor, the cache is checked first for the required data, and if the data is found, the reference is satisfied without using the interconnection network. Memory contention is thus reduced, at least while hits occur, but as the number of cache misses increases, the contention problem grows proportionally. The logical shared memory architecture can also be implemented physically as a collection of local memories, giving the distributed shared memory architecture, which is further categorized into three classes on the basis of how local memories are accessed: non-uniform memory access (NUMA) machines, cache-only memory access (COMA) machines, and cache-coherent non-uniform memory access (CC-NUMA) machines.
Exercise
Problem 11.1 – Specify the impact of the cache-main storage memory hierarchy on CPU execution time, where the miss rate is 12%, memory is referenced three times per instruction, and the miss penalty is 5 clock cycles. Assume the average CPI value is 5 if memory stalls due to the miss penalty are not taken into account.
Problem 11.2 – Based on the following data, determine the degree of associativity of the level 2 cache that would lead to the best performance:
Direct-mapped (one-way set associative) hit time on the level 2 cache = 4 clock cycles
Miss penalty for the level 2 cache = 40 clock cycles
Local miss rate for the level 2 cache with one-way associativity = 30%
Local miss rate for the level 2 cache with two-way associativity = 25%
Local miss rate for the level 2 cache with four-way associativity = 20%
Problem 11.3 – The time taken for read/write operations in a cache-main storage hierarchical memory is given in the following table, where:
TCA = cycle time for the cache
TMS = cycle time for main storage
PD = probability that a page is dirty
Problem 11.5 – Which class of distributed shared memory machines relies on coherent caches, and how?
Problem 11.6 – The CC-NUMA architecture is inspired by general NUMA machines. Through which technique is traffic made more manageable in CC-NUMA as compared to NUMA machines?
Problem 11.7 – How are multiple 'miss' situations handled using a non-blocking cache memory?
Chapter 12
Structure
12.1 Objective
12.2 Memory management
12.3 Memory Translation
12.4 Translation Look-aside Buffer
12.5 Paging
12.6 Segmentation
12.7 Memory Virtualization
12.8 Memory Synchronization
12.9 Memory Consistency
12.10 Memory Coherence Problem
12.1 Objective
The objective of this chapter is to discuss the various memory management techniques available for computer systems. Memory is divided into two parts, one for the operating system and one for the program currently in execution. Section 12.2 defines memory management and the techniques related to it, such as swapping, memory allocation and memory fragmentation. Section 12.3 discusses memory translation, in which memory is distributed by assigning the pages of virtual memory to the page frames of physical memory. Section 12.4 discusses the translation look-aside buffer (TLB), which supports a fast search mechanism while the number of entries remains small. Section 12.5 covers the concept of paging, one of the most significant techniques for managing memory; paging permits the physical address space of every process to be non-contiguous. Section 12.6 discusses segmentation, a memory management technique that supports the user's perception of memory. Section 12.7 covers memory virtualization; demand paging is a very popular technique for memory virtualization, in which pages are loaded from the backing unit into main memory only when they are required at that particular instant. Memory synchronization is explained in Section 12.8: synchronization problems occur due to the sharing of data objects between processes, and the protocols and policies used to solve them are explained in that section. Section 12.9 covers memory consistency issues; memory inconsistency occurs due to a mismatch between the ordering of memory accesses and program execution, and various memory consistency models exist, such as weak consistency and sequential consistency. Finally, Section 12.10 covers the memory coherence problem and protocols such as snoopy bus and directory-based protocols used to solve it.
Memory is a very important part of every computer system. Memory contains a large collection of words, and each word or byte has its own memory address. The CPU uses the program counter value to fetch instructions from memory. In single-programming systems, memory is divided into two parts: one reserved for the operating system and the other used for the program currently in execution. In multiprogramming systems, the user portion of memory is further divided to hold multiple processes. This division and subdivision of memory is done at run time by the operating system in order to manage memory. In this chapter several issues and techniques related to memory management are discussed.
12.2.1 Swapping
A process must be in memory before it can be executed. A process is swapped from memory to the backing store, or from the backing store back into memory, whenever this is required for execution. For example, in the round-robin CPU scheduling algorithm used in multiprogramming systems, whenever the time quantum of a process expires, the memory manager identifies that process, swaps it out of main memory and swaps in another process that needs to be executed in the memory space just vacated. Meanwhile, the time slice is given to some other process by the CPU scheduler. The time slice should therefore be large enough to accommodate the swapping of processes in and out of memory whenever required. Figure 12.1 shows processes being swapped in and out of memory.
Swapping is also used in priority-based scheduling algorithms. For example, if a process is executing and another process with a higher priority arrives, the memory manager swaps out the lower-priority process and swaps the higher-priority process into memory for execution. After the higher-priority process finishes, the memory manager can again swap the lower-priority process back into memory for execution.
Whenever a process is swapped out of memory and later swapped back in, it normally uses the same memory location it occupied previously. This is due to the address binding method: if address binding is done at compile time or load time, it is not easy to move the process to another memory location, but if the binding is done at run time, the process can be swapped into a different location because physical addresses are calculated at run time. Swapping requires a backing store, which must be large enough to hold all the memory images, and direct, fast access to these images must be provided. Each system maintains a ready queue containing the memory images of all processes that are ready for execution. Whenever a request comes to execute a process, the CPU scheduler invokes the dispatcher, which inspects the ready queue. If the next process to be executed is not in memory and there is no free memory space, the dispatcher swaps out one process from memory and swaps in the process that needs to be executed.
There are some constraints on swapping. A process to be swapped must be completely idle, particularly with respect to pending I/O. For example, if a process has requested an I/O operation and is waiting for that request to complete, swapping that process is not possible while the I/O is pending.
12.2.2 Memory Allocation
Memory is categorized into two parts: a fixed-size portion occupied by the operating system and another partition used to serve the user processes. The simplest technique for allocating memory to processes is to divide it into partitions, each of a fixed size, such that each partition can hold only one process. This becomes a limitation for multiprogramming systems: a process has to wait until a partition is free to serve it. In this scheme the operating system maintains a table that keeps track of available and occupied memory. Initially a large block of memory is available to the user and is recognized as a hole. When a process arrives for execution, memory is allocated to it and the remaining memory is left for other processes. A more efficient technique is the variable-sized partition approach, in which memory is allocated according to the requirement of each process; this reduces the wastage of memory. However, as more and more processes come and go, many small holes are left in memory, and as a result utilization declines.
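One common placement strategy for variable-sized partitions is first fit; the text does not name a specific strategy, so the sketch below should be read as an assumed illustration of how holes are searched and split.

# First-fit placement over a list of free holes (start, size) - assumed strategy.
holes = [(0, 100), (300, 50), (500, 200)]     # hypothetical free blocks

def first_fit(request_size):
    for i, (start, size) in enumerate(holes):
        if size >= request_size:
            # Allocate from the front of the hole; shrink or remove the hole.
            if size == request_size:
                holes.pop(i)
            else:
                holes[i] = (start + request_size, size - request_size)
            return start                      # base address of the allocation
    return None                               # external fragmentation: no hole fits

print(first_fit(60))    # -> 0   (first hole shrinks to (60, 40))
print(first_fit(60))    # -> 500 (second hole too small, third hole used)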
12.2.3 Fragmentation
The memory space becomes divided into small pieces as processes move in and out of memory before and after execution. For instance, a request to execute a process may arrive when the total available memory is more than enough for the process, but the memory is not available contiguously; this is the problem of external fragmentation, i.e. the free storage exists as a large number of small pieces, none of which is suitable for executing the process. This is a serious problem; in the worst case there can be a free block of memory between every two processes.
The way to handle the difficulty of external fragmentation is termed compaction. In this technique all the partly available memory blocks are shuffled in such a manner that they are combined into a single larger memory block which can be used to satisfy upcoming requests. If address binding is done statically, compaction cannot be performed; if it is done dynamically, the program or data can be moved and a new base address loaded into the register. In its simplest form the compaction algorithm moves all free space to one end of memory and all processes to the other end, which results in a single large memory block available to serve pending or upcoming requests.
Both virtual memory and physical memory are divided into fixed-length pages. The idea of memory distribution is to assign pages of virtual memory to the page frames of physical memory, and this allocation of pages from virtual to physical memory comes under address translation. Virtual addresses are mapped to physical addresses at run time with the help of a hardware device known as the Memory Management Unit (MMU). A very general approach in this category is the base-register scheme, in which the base register is termed the relocation register. As shown in Figure 12.2, the value in the relocation register is added to every address generated by a user process at the time the address is sent to memory. The user program never deals with real physical addresses, only with logical addresses.
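The base-register (relocation-register) scheme can be sketched in a few lines; the relocation value and the limit used below are assumptions for illustration, with the limit check added in the spirit of MMU protection.

RELOCATION_REGISTER = 14000     # assumed base of the process in physical memory
LIMIT_REGISTER = 3000           # assumed size of the process's logical space

def translate(logical_address):
    # MMU-style dynamic relocation: check the limit, then add the base.
    if logical_address >= LIMIT_REGISTER:
        raise MemoryError("addressing error: trap to operating system")
    return logical_address + RELOCATION_REGISTER

print(translate(346))    # -> 14346; the program only ever sees address 346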
Every operating system has its own techniques for storing page tables. Generally a page table is allocated for each process, and a pointer to the page table is saved with the other register values. When a process is started, it first reloads the user registers and establishes the correct hardware page table values. There are several ways to implement the page table in hardware. The first is to implement it as a set of dedicated registers built with very high speed logic, so that the paging-address translation is efficient; this technique is satisfactory provided the page table is reasonably small.
The standard approach to speeding up page table access is to use a small, special, fast-lookup hardware cache known as the translation look-aside buffer (TLB). Each entry in the TLB contains two fields: a key (tag) and a value. When the associative memory is searched for an item, the item is compared with all keys at the same time; if the item is found, the matching value field is returned. The TLB supports a fast search, but the number of entries must remain small. When page table entries are cached in the TLB, only a few entries can be held. When a logical address is generated by the CPU, its page number is presented to the TLB. If the page number is found, its frame number is immediately available and is used to access memory. If the page number is not present in the TLB, this is termed a TLB miss, and a memory reference to the page table must be made. Once the frame number is obtained, it is used to access memory, and both the page number and the frame number are added to the TLB so that they will be found quickly on the next reference. If there is no space in the TLB, a replacement is performed by the operating system using one of the many available replacement algorithms. Figure 12.3 describes the working of the TLB as stated above.
Some TLBs store address-space identifiers (ASIDs) with each TLB entry. An ASID identifies a process and is used to provide address-space protection for that process. When the TLB looks up a virtual page number for the currently executing process, it checks that the ASID stored with the entry matches the ASID of the current process; if it does not match, the lookup is treated as a TLB miss.
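The lookup-then-fall-back behaviour can be modelled compactly; the dictionary-based TLB, the ASID values and the page table used below are hypothetical and only illustrate the flow described above.

# Hypothetical model of a TLB lookup with address-space identifiers (ASIDs).
tlb = {("P1", 5): 42, ("P1", 6): 17}      # (ASID, page number) -> frame number
page_table = {("P1", 5): 42, ("P1", 6): 17, ("P1", 7): 93}

def lookup_frame(asid, page_number):
    frame = tlb.get((asid, page_number))
    if frame is not None:                  # TLB hit: frame immediately available
        return frame
    # TLB miss: reference the page table in memory, then refill the TLB.
    frame = page_table[(asid, page_number)]
    tlb[(asid, page_number)] = frame
    return frame

print(lookup_frame("P1", 5))   # hit  -> 42
print(lookup_frame("P1", 7))   # miss -> 93, entry added to the TLB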
12.5 Paging
Paging is one of the most significant techniques for managing memory. Paging permits the physical address space of every process to be non-contiguous, and it avoids the problem of fitting memory chunks of varying sizes onto the backing unit. This problem arises when data in memory is swapped out: to save this data on the hard disk or backing unit there must be space available on that unit, and backing units suffer from fragmentation just as main memory does. But since these backing units are much slower than main memory, compaction there is almost impossible.
In paging, physical memory is subdivided into fixed-size chunks or blocks known as frames, and logical memory is subdivided into blocks of the same size known as pages. The page size is determined by the hardware and is generally a power of 2. When a request comes to execute a process, its pages are loaded from the backing unit into the frames. The backing unit is itself divided into blocks of the same size as the memory frames, as shown in Figure 12.4.
A CPU-generated address contains two parts: the page number, which is used to index into the page table, and the page offset, which is combined with the base address contained in the page table. This combination gives the physical memory address, which is then forwarded to the memory unit. Paging is thus a form of dynamic relocation: every logical address is bound by the paging hardware to some physical address.
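The split of a logical address into page number and offset, and the recombination with the frame base from the page table, can be sketched as follows; the 4 KB page size and the page-table contents are assumed values.

PAGE_SIZE = 4096                       # assumed page size (a power of 2)
page_table = {0: 5, 1: 9, 2: 3}        # hypothetical page -> frame mapping

def paging_translate(logical_address):
    page_number = logical_address // PAGE_SIZE   # index into the page table
    offset = logical_address % PAGE_SIZE         # kept unchanged
    frame = page_table[page_number]
    return frame * PAGE_SIZE + offset            # physical address

print(hex(paging_translate(0x1234)))   # page 1, offset 0x234 -> frame 9 -> 0x9234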
12.6 Segmentation
To implement the concept of segmentation, a segment table is maintained. This segment table has two parameters for each entry: the segment base, which holds the starting physical address of the segment, i.e. where the segment is actually situated in memory, and the segment limit, which contains the size or length of the segment. Figure 12.5 illustrates the segmentation hardware.
A logical address consists of a segment number and an offset. The segment number is used as an index into the segment table, and the offset of the logical address must lie between 0 and the segment limit value.
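A segment-table lookup with the limit check can be sketched in the same style as the paging example; the segment bases and limits below are assumed values.

# Hypothetical segment table: segment number -> (base, limit).
segment_table = {0: (1400, 1000), 1: (6300, 400), 2: (4300, 1100)}

def segmentation_translate(segment, offset):
    base, limit = segment_table[segment]
    if offset < 0 or offset >= limit:
        raise MemoryError("offset beyond segment limit: trap to operating system")
    return base + offset

print(segmentation_translate(2, 53))   # -> 4353
print(segmentation_translate(1, 399))  # -> 6699; offset 400 would trap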
In multiprogramming systems, more than one program runs at the same time, and the running programs must reside in main memory for execution. Due to the size limitation of main memory, it is sometimes not possible to load all the programs into it; the concept of virtual memory was introduced to eliminate this problem. Virtual memory separates the user's view of logical memory from physical memory. This perception provides a very large memory space for programmers even though only a smaller physical memory is actually available. The logical view of storing a process in memory is referred to as its virtual address space. The translation from virtual addresses to physical addresses is done at run time with the help of mapping functions and translation tables. Virtual memory allows main memory to be shared by the many software processes active at run time, and it allows only the most active portions of code to stay in main memory, while the remaining code stays on the hard disk waiting for its turn to execute. Besides separating logical memory from physical memory, it allows files and memory locations to be shared between two or more processes with the help of page sharing. Virtual memory is implemented through demand paging.
When a program needs to be executed, it must be moved from the backing unit to main memory. There are several options for loading the process into main memory for its execution. The first is to move the whole program into main memory in one go, but often the complete program is not required initially. The other is to load pages from the backing unit into main memory only when they are required at that particular instant; this approach is termed demand paging. A demand paging system is essentially a paging system combined with swapping, as shown in Figure 12.6. Processes initially reside in secondary memory; with demand paging, only the pages that are required during execution are loaded into main memory. Unlike plain paging, virtual memory uses a lazy swapper (also known as a pager), which never swaps a page in until it is required for execution.
When a process is to be swapped in, the pager predicts which pages will be used before the process is swapped out again. Rather than swapping in the complete process, the pager brings only those required pages into memory. Thus it avoids reading memory pages that will never be used, and it reduces both the time spent on the swap operation and the amount of physical memory required. This scenario requires some hardware support to distinguish between pages that reside in memory and pages located on the disk. The valid-invalid bit can be used for this purpose: when the bit is set to 'valid', the corresponding page is in memory; when it is set to 'invalid', the page is on the disk. The page-table entry for a page located in memory is marked valid, whereas the entry for a page located on disk is either marked invalid or contains the address of the page's location on disk. When a page is marked invalid, two situations are possible. One is that the process never tries to access that page, which is the usual case. Sometimes, however, the process tries to access a page marked invalid, which causes a page-fault trap. A page fault is defined as the situation where the referenced page is not present in main memory; it generates an interrupt and forces the operating system to bring that page in. If the operating system is unable to bring in the desired page, the trap cannot be resolved.
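The valid/invalid check and the page-fault path can be summarized in a short sketch; the page table, the disk locations and the simple "load on fault" handler below are illustrative assumptions, not the text's own algorithm.

# Hypothetical demand-paging model: page -> (valid_bit, frame_or_disk_location).
page_table = {0: ("valid", 3), 1: ("invalid", "disk:block17")}
PAGE_SIZE = 4096

def load_from_disk(location):
    print("page fault: loading", location)
    return 7                                       # assumed free frame number

def access(page, offset):
    state, where = page_table[page]
    if state == "valid":
        return where * PAGE_SIZE + offset          # page already in memory
    # Page fault: trap to the OS, load the page from the backing store,
    # update the page table, then restart the access.
    frame = load_from_disk(where)
    page_table[page] = ("valid", frame)
    return frame * PAGE_SIZE + offset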
The atoms (shared data objects) are further classified into two categories: hard atoms, whose access conflicts are resolved by hardware, and soft atoms, whose access conflicts on sharable data objects are resolved with the help of software. Atomicity was originally implemented explicitly in software. A program may be reordered as long as the meaning of the code is maintained. Program dependencies are of three types:
Data dependencies, such as Write After Read (WAR), Write After Write (WAW) and Read After Write (RAW)
Control dependencies, which arise from goto and if-then-else statements
Side-effect dependencies, which occur due to traps, time-outs, input/output accesses, etc.
12.8.2 Wait Protocols
Wait protocols are of two types. The first is busy wait: the waiting process remains resident
in the processor's context registers and is permitted to retry continuously. While the process
stays in the processor's context registers it consumes processor cycles, but its response is
faster as soon as the shared object becomes available. The second is sleep wait: if the shared
object is not available, the process is removed from the registers and placed in a wait queue;
when the shared object becomes available, the waiting process is notified. Sleep wait is
considerably more complex to implement than busy wait in multiprocessor systems, so when
processes are synchronized using locks in a multiprocessor, the busy-wait approach is usually
preferred over sleep wait. Busy waiting can be implemented either with a self-service approach,
by polling across the network, or with a full-service approach, by sending a notification
across the network when the shared data object becomes available. A sketch contrasting the two
wait protocols follows.
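The sketch below contrasts the two protocols using POSIX threads; the flag, mutex and condition-variable names are invented for the illustration, and a real busy-wait implementation would normally bound or back off its spinning.

    /* Busy wait vs. sleep wait on a shared "ready" object (illustrative sketch). */
    #include <pthread.h>
    #include <stdatomic.h>
    #include <stdio.h>

    static atomic_int ready_flag = 0;                 /* used by the busy waiter  */
    static int ready = 0;                             /* used by the sleep waiter */
    static pthread_mutex_t m  = PTHREAD_MUTEX_INITIALIZER;
    static pthread_cond_t  cv = PTHREAD_COND_INITIALIZER;

    static void *busy_waiter(void *arg) {
        (void)arg;
        while (!atomic_load(&ready_flag))             /* burns processor cycles   */
            ;                                         /* spin: fast response      */
        puts("busy waiter: shared object acquired");
        return NULL;
    }

    static void *sleep_waiter(void *arg) {
        (void)arg;
        pthread_mutex_lock(&m);
        while (!ready)                                /* give up the processor    */
            pthread_cond_wait(&cv, &m);               /* placed in a wait queue   */
        pthread_mutex_unlock(&m);
        puts("sleep waiter: notified and resumed");
        return NULL;
    }

    int main(void) {
        pthread_t t1, t2;
        pthread_create(&t1, NULL, busy_waiter, NULL);
        pthread_create(&t2, NULL, sleep_waiter, NULL);

        pthread_mutex_lock(&m);
        ready = 1;                                    /* make the object available */
        pthread_cond_signal(&cv);                     /* full-service notification */
        pthread_mutex_unlock(&m);
        atomic_store(&ready_flag, 1);

        pthread_join(t1, NULL);
        pthread_join(t2, NULL);
        return 0;
    }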
Busy waiting reduces the synchronization delay whenever the atom or shared data object becomes
available. Its main drawback is that it continuously checks the state of the object and so
wastes many processor cycles, which can also cause hot-spot problems in memory access. Sleep
waiting, in contrast, uses hardware resources better; its main drawback is the longer
synchronization delay. For the processes waiting for a shared data object in a wait queue,
there must be a fairness policy that decides which of the waiting processes to serve next.
There are three kinds of fairness policies:
FIFO
Bounded
Livelock-free
Sole-access protocols are used to sequence conflicting shared operations. Three synchronization
methods can be distinguished, based on who updates the atom (shared data object) and whether
sole access is granted before or after the atomic operation completes. A sketch contrasting the
first two methods follows the list.
Lock synchronization - In this method sole access is granted before the atomic
operation. After the operation, the shared data object is updated by the process that
requested sole access. It is also known as pre-synchronization. This method can only be
used for shared read-only memory accesses.
Optimistic synchronization - In this method sole access is granted after the atomic
operation completes. After the operation, the shared data object is updated by the
process that requested sole access. It is also known as post-synchronization. It is
called optimistic because the approach assumes that there will be no concurrent access
to the data object while the single atomic operation is being processed.
Server synchronization - This approach uses a server process to update the atom. Compared
with lock synchronization and optimistic synchronization, server synchronization provides
full-service support. Each atom has a unique update server, and any process that requests
an atomic operation sends its request to the atom's update server. The update server can
be a dedicated server processor (SP) associated with the atom's memory module. Common
examples of server synchronization are remote procedure calls and object-oriented systems,
where shared objects are encapsulated by a process.
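A minimal sketch of the first two methods on a shared counter, using a POSIX mutex for lock (pre-) synchronization and a C11 compare-and-swap retry loop for optimistic (post-) synchronization; the counter and helper names are invented for the illustration.

    /* Lock (pre-) synchronization vs. optimistic (post-) synchronization
     * on a shared counter; an illustrative sketch only. */
    #include <pthread.h>
    #include <stdatomic.h>
    #include <stdio.h>

    static int shared_counter = 0;                    /* the "atom" */
    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
    static atomic_int atomic_counter = 0;

    /* Lock synchronization: sole access is obtained BEFORE the operation. */
    static void locked_increment(void) {
        pthread_mutex_lock(&lock);
        shared_counter++;
        pthread_mutex_unlock(&lock);
    }

    /* Optimistic synchronization: compute first, then try to commit; retry
     * if another process touched the atom in the meantime (compare-and-swap). */
    static void optimistic_increment(void) {
        int old, next;
        do {
            old  = atomic_load(&atomic_counter);
            next = old + 1;                           /* the atomic operation */
        } while (!atomic_compare_exchange_weak(&atomic_counter, &old, next));
    }

    int main(void) {
        for (int i = 0; i < 1000; i++) { locked_increment(); optimistic_increment(); }
        printf("%d %d\n", shared_counter, atomic_load(&atomic_counter));
        return 0;
    }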
Memory inconsistency occurs due to a mismatch between the order of memory accesses and the
order of process execution. When instructions run in parallel, their execution may finish out
of order even if the instructions are dispatched in order, because shorter instructions take
less time to execute than longer ones. In a single-processor system the SISD sequence is
followed, so instructions execute sequentially one after the other and memory accesses are
consistent with the order of instruction execution. This behaviour is characterized as
sequential consistency. In a shared-memory multiprocessor, by contrast, multiple instruction
sequences execute on multiple processors. These MIMD instruction streams may be executed
differently and lead to a memory-access order that differs from the actual execution order.
Sequential consistency and event ordering are analyzed diagrammatically in Figure 12.7.
12.9.1.1 Event orderings
12.9.1.2 Atomicity
Memory accesses in a shared-memory multiprocessor can be atomic or non-atomic. A shared memory
access is termed atomic if an update to memory becomes visible to all processors at the same
time. For an atomic memory system to be sequentially consistent, all memory accesses must be
performed so as to preserve all individual program orders. A shared memory access is termed
non-atomic if an invalidation signal is not seen by all processors at the same time. In a
non-atomic memory system, the processors cannot be strictly sequenced; weak ordering is
therefore preferred, which leads to the division between strong and weak consistency models.
The sequential consistency memory model is a very popular model for multiprocessor designs.
According to this model, the load, store and swap operations of all processors appear to
execute sequentially in a single global memory order that conforms to the program order of the
individual processors. The sequential consistency memory model is illustrated in Figure 12.8.
1. A load operation always returns the value written by the most recent store to the same
memory location.
2. If two operations appear in a particular program order, the same order is followed in the
memory order.
3. A swap operation is atomic with respect to other write operations; no other write can
intervene between the read and write parts of a swap.
4. All writes and swaps must eventually terminate.
Most multiprocessors have implemented the sequential consistency memory model because of its
simplicity. However, this model imposes a strong ordering of memory events and may therefore
lead to poor memory performance; this drawback becomes more significant as the system grows
larger. To reduce this effect, another class of consistency models has been established, known
as weak consistency models.
The weak consistency model was introduced by Dubois, Scheurich and Briggs; it is therefore also
known as the DSB model. The DSB model is specified by three conditions:
1. All earlier synchronization accesses must be performed before a read or write access is
permitted to execute with respect to any other processor.
2. All earlier read and write accesses must be performed before a synchronization access is
permitted to execute with respect to any other processor.
3. Synchronization accesses are sequentially consistent with respect to one another.
These conditions allow a weak ordering of memory-access events in a multiprocessor, because
they are bound only to hardware-recognized synchronizing variables.
Another weak consistency model, termed TSO (Total Store Order), is shown in Figure 12.9. This
model is specified by the following rules; a small litmus test contrasting TSO with sequential
consistency is sketched after the list.
1. A read access always returns the value of the latest write to the same memory location by
any processor.
2. If two writes occur in a specific program order, the same order is retained in the memory
order.
3. If a memory operation immediately follows a read in program order, it must be executed in
the same sequence in the memory order.
4. All writes and swaps must eventually terminate.
5. A swap operation is atomic with respect to other write operations; no write operation can
intervene between the read and write parts of a swap.
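The following store-buffering litmus test, a minimal sketch using C11 atomics (thread names and loop count chosen arbitrarily), illustrates the difference. Under sequential consistency the outcome r1 == 0 and r2 == 0 is impossible, but with the relaxed orderings below it can typically be observed on TSO hardware such as x86; replacing the orderings with memory_order_seq_cst restores the sequentially consistent result.

    /* Store-buffering litmus test (illustrative sketch). */
    #include <pthread.h>
    #include <stdatomic.h>
    #include <stdio.h>

    static atomic_int x, y;
    static int r1, r2;

    static void *t1(void *arg) {
        (void)arg;
        atomic_store_explicit(&x, 1, memory_order_relaxed);   /* write x */
        r1 = atomic_load_explicit(&y, memory_order_relaxed);  /* read  y */
        return NULL;
    }

    static void *t2(void *arg) {
        (void)arg;
        atomic_store_explicit(&y, 1, memory_order_relaxed);   /* write y */
        r2 = atomic_load_explicit(&x, memory_order_relaxed);  /* read  x */
        return NULL;
    }

    int main(void) {
        for (int i = 0; i < 100000; i++) {
            pthread_t a, b;
            atomic_store(&x, 0); atomic_store(&y, 0);
            pthread_create(&a, NULL, t1, NULL);
            pthread_create(&b, NULL, t2, NULL);
            pthread_join(a, NULL); pthread_join(b, NULL);
            if (r1 == 0 && r2 == 0) {                /* forbidden under SC */
                printf("weak ordering observed at iteration %d\n", i);
                break;
            }
        }
        return 0;
    }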
In a multiprocessor system, data inconsistency is a major issue both within a single level of
the memory hierarchy and between neighbouring levels. Within the same level, multiple cache
modules may hold different copies of the same memory block, because they are accessed by
multiple processors asynchronously and independently. Between neighbouring levels, a cache and
main memory may contain non-uniform copies of the same data object. Cache coherence schemes
overcome this problem by preserving a consistent state for each cached block of data.
Inconsistencies in caches can arise in several ways:
Sharing of writable data
The cache inconsistency problem arises when writable data are shared among processors. Figure
12.10 illustrates the process. Suppose a multiprocessor has two processors, each with a private
cache, sharing main memory, and D is a shared data element used by both processors. The figure
shows three situations: before update, write-through, and write-back. Before the update, the
three copies of D are consistent. If processor PA writes new data D’ into its cache, the same
copy is written immediately into shared memory under a write-through policy; the two cache
copies are now inconsistent, since PA's cache contains D’ while PB's cache still contains D.
Under a write-back policy the result is again inconsistent, because main memory is updated only
later, when the modified data in the cache is replaced or invalidated.
Process migration
The cache inconsistency problem can also occur during process migration from one processor to
another. This is illustrated in Figure 12.11 through three situations: before migration,
write-through, and write-back. The data object is consistent before migration. Inconsistency
occurs after the process containing the shared variable D migrates from processor PA to
processor PB under a write-back policy. The same inconsistency arises when the process migrates
from processor PB to PA under a write-through policy.
I/O operations
Another cause of inconsistency is I/O operations that bypass the caches. This is illustrated in
Figure 12.12. When the I/O processor loads new data D’ into main memory, bypassing the
write-through caches, the data in the first processor's cache and the shared memory become
inconsistent. Similarly, when data is output directly from shared memory (bypassing the caches)
under a write-back policy, the copies held in the caches become inconsistent.
Early multiprocessors used bus-based memory systems to maintain cache coherence, because every
processor in the system could observe the memory transactions in progress. If an inconsistency
arises, the cache controller can take the necessary action to invalidate its copy. Here each
cache snoops on the transactions of the other caches, so protocols using this technique are
known as snoopy bus protocols. Snoopy protocols work on two basic approaches:
Write-invalidate
Write-update (write-broadcast)
With the write-invalidate approach, there can be multiple readers but only one writer at a
time. A data object may be shared among multiple caches for reading. When one of the caches
wants to write to that data object, it first broadcasts a signal that invalidates the data
object in the remaining caches; the data object then becomes exclusive to the writing cache.
With the write-update protocol, there can be multiple writers as well as multiple readers. When
a processor needs to update a shared data object, the updated data word is broadcast to all
caches, and every cache containing that data object updates it. A simplified sketch of the
write-invalidate policy follows.
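The following is a purely illustrative model (not the protocol of any particular machine) of an invalidate-based, MSI-style policy for a single one-word block shared by three caches; all names are invented, and the bus is reduced to direct function calls.

    /* Minimal sketch of a write-invalidate (MSI-style) snoopy policy,
     * simulated for a few caches sharing one memory word. Illustrative only. */
    #include <stdio.h>

    #define NCACHES 3

    typedef enum { INVALID, SHARED, MODIFIED } State;

    typedef struct {
        State state;
        int   value;                  /* locally cached copy of the block */
    } CacheLine;

    static CacheLine cache[NCACHES];
    static int memory = 42;           /* the shared memory block */

    /* Write back a modified copy to memory, if any cache holds one. */
    static void flush_owner(void) {
        for (int i = 0; i < NCACHES; i++)
            if (cache[i].state == MODIFIED) {
                memory = cache[i].value;
                cache[i].state = SHARED;    /* owner keeps a shared copy */
            }
    }

    /* Processor p reads the block: a read miss snoops the bus. */
    static int pr_read(int p) {
        if (cache[p].state == INVALID) {
            flush_owner();                  /* obtain the latest value */
            cache[p].value = memory;
            cache[p].state = SHARED;
        }
        return cache[p].value;
    }

    /* Processor p writes the block: broadcast an invalidate first.
     * The block is a single word, so no fetch-on-write is needed here. */
    static void pr_write(int p, int v) {
        for (int i = 0; i < NCACHES; i++)
            if (i != p) cache[i].state = INVALID;   /* write-invalidate */
        cache[p].value = v;
        cache[p].state = MODIFIED;          /* memory updated lazily (write-back) */
    }

    int main(void) {
        printf("P0 reads %d, P1 reads %d\n", pr_read(0), pr_read(1));
        pr_write(0, 99);                    /* P0's write invalidates P1's copy */
        printf("P1 re-reads %d (copy refetched after invalidation)\n", pr_read(1));
        return 0;
    }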
Figure 12.13 shows the working of a snoopy bus protocol in three stages. Part (a) shows the
original consistent copies of a data object in shared memory and in three processor caches.
Part (b) shows the status of the data object after a write-invalidate operation by processor
P1. Part (c) shows the status after a write-update operation by P1.
Figure 12.13 (a) Consistent copies of block X in shared memory and processor caches, (b)
After a write-invalidate operation by P1, (c) After a write-update operation by P1.
Scalable multiprocessor systems interconnect processors using short, point-to-point links in
direct networks. The major advantage of this arrangement is that the bandwidth of the network
grows as more processors are added. In such systems the cache coherence problem can be resolved
using directory-based protocols. Directory-based protocols gather and maintain information
about where copies of each data object reside. A centralized controller, which is part of the
main memory controller, maintains a directory stored in main memory; this directory holds
global information about the contents of the various local cache modules. When a cache
controller makes a request, the centralized controller checks the directory and issues the
commands required to transfer data between main memory and a cache, or between caches.
The centralized controller also keeps the state information up to date; in general it records
which processors hold a copy of which data object. Before a processor can modify its local copy
of a data object, it must request exclusive access to that object. The centralized controller
then sends a message to every processor holding a copy of the object, telling it to invalidate
its copy, and grants exclusive access to the requesting processor. If another processor later
tries to read a data object that has been exclusively granted to some processor, a miss
notification is sent to the controller. Directory-based protocols suffer from the drawbacks of
a central bottleneck and the extra communication between the centralized controller and the
individual cache controllers.
Figure 12.14 explains the basic concept of a directory-based cache coherence protocol. Three
categories of directory-based protocols are available; a small full-map sketch follows the list.
Full-map directory protocol: each directory entry contains one bit per processor and a
dirty bit. Each processor bit indicates the status of the data object in that processor's
cache, that is, whether the object is present or not. If the dirty bit is set, then one and
only one processor may write into the data object.
Limited directories protocol: similar to the full-map directory, except that each directory
entry holds only a fixed number of pointers, i; the protocol must therefore handle the case
where more than i caches request read copies of a particular data object.
Chained directories: this technique keeps track of the shared copies of data by maintaining
a chain of directory pointers, implemented as a singly linked chain. Assume there are no
shared copies of location M. If processor P1 reads location M, memory sends a copy to cache
C1 along with a chain-termination pointer, and memory keeps a pointer to cache C1. In the
same way, when processor P2 reads location M, memory sends a copy to cache C2 along with a
pointer to cache C1. By repeating these steps, any number of caches can hold a copy of
location M.
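A minimal sketch of a full-map directory entry with one presence bit per processor plus a dirty bit; the bit-mask layout, the printed "messages" and the function names are assumptions made only for this illustration.

    /* Full-map directory entry: one presence bit per processor plus a dirty bit.
     * Purely illustrative; a real protocol also handles the network messages. */
    #include <stdio.h>
    #include <stdint.h>
    #include <stdbool.h>

    #define NPROC 8

    struct dir_entry {
        uint8_t present;          /* bit p set => processor p's cache has a copy    */
        bool    dirty;            /* set => exactly one cache holds a modified copy */
    };

    /* Processor p asks for exclusive (write) access to the block. */
    static void request_exclusive(struct dir_entry *e, int p) {
        for (int q = 0; q < NPROC; q++)
            if (q != p && (e->present & (1u << q)))
                printf("send invalidate to processor %d\n", q);  /* stand-in message */
        e->present = (uint8_t)(1u << p);   /* only the writer keeps a copy */
        e->dirty   = true;
    }

    int main(void) {
        struct dir_entry e = { .present = 0x07, .dirty = false }; /* P0-P2 share it */
        request_exclusive(&e, 1);          /* P1 wants to write */
        printf("present=0x%02x dirty=%d\n", (unsigned)e.present, e.dirty);
        return 0;
    }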
Summary
Memory is a very important part of every computer system. Memory consists of a large collection
of words or bytes, each with its own address. The CPU uses the value of the program counter to
fetch instructions from memory.
A process must be in memory before it can be executed. A process is swapped from memory to the
backing store, or from the backing store back into memory, whenever it is required for
execution. When the time quantum of a process expires, the memory manager identifies that
process, swaps it out of main memory, and swaps in another process to execute in the memory
space just vacated. Whether a process that is swapped out and later swapped back in must
occupy the same memory location it held previously depends on the address-binding method. If
address binding is done at compile time or load time, the process cannot easily be moved to a
different memory location; but if binding is done at run time, the process can be swapped into
a different location, because physical addresses are computed at run time.
Memory is divided into two parts: a fixed-size partition occupied by the operating system, and
the remaining partition used to serve multiple user processes. The simplest technique for
allocating memory to processes is to divide memory into partitions of fixed size, each holding
exactly one process. As processes are loaded and removed before and after execution, the memory
space becomes broken into small pieces. For instance, when a request to execute a process
arrives and the total free memory is more than enough to hold the process, but that memory is
not available contiguously, the problem of external fragmentation arises.
The standard approach to speeding up access to page-table entries is to use a small,
fast-lookup hardware cache known as the translation look-aside buffer (TLB). Each TLB entry
consists of two fields: a key (tag) and a value. When the associative memory is searched for an
item, the target is compared with all keys simultaneously. The TLB therefore supports a fast
search, provided the number of entries remains small. A small lookup sketch follows.
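As a rough software model of the key/value search (real hardware compares all tags in parallel, which the loop below only imitates), the following sketch is illustrative only; the structure and function names are invented.

    /* Illustrative TLB lookup: each entry holds a key (page number) and a
     * value (frame number). Names are invented for the sketch. */
    #include <stdio.h>
    #include <stdbool.h>

    #define TLB_ENTRIES 16

    struct tlb_entry {
        bool valid;
        int  page;     /* key / tag */
        int  frame;    /* value     */
    };

    static struct tlb_entry tlb[TLB_ENTRIES];

    /* Returns true on a TLB hit and writes the frame number to *frame. */
    static bool tlb_lookup(int page, int *frame) {
        for (int i = 0; i < TLB_ENTRIES; i++)
            if (tlb[i].valid && tlb[i].page == page) {
                *frame = tlb[i].frame;
                return true;                    /* hit */
            }
        return false;                           /* miss: walk the page table */
    }

    int main(void) {
        tlb[0] = (struct tlb_entry){ .valid = true, .page = 7, .frame = 3 };
        int frame;
        printf("page 7 -> %s\n", tlb_lookup(7, &frame) ? "hit" : "miss");
        printf("page 9 -> %s\n", tlb_lookup(9, &frame) ? "hit" : "miss");
        return 0;
    }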
Paging is one of the most important memory-management techniques. Paging permits the physical
address space of a process to be non-contiguous, and it avoids the problem of fitting memory
chunks of varying sizes onto the backing store.
Another technique used to subdivide the addressable memory is segmentation. Paging separates
the user's view of memory from the real physical memory: it provides a larger address space to
the programmer and is invisible to the programmer. Segmentation, on the other hand, is a
memory-management technique that supports the user's view of memory.
In multiprocessor systems, more than one program runs at the same time, so the running programs
must reside in main memory during execution. Because of the limited size of main memory, it is
not always possible to load all the programs into it. The concept of virtual memory was
introduced to eliminate this problem: virtual memory separates the user's view of logical
memory from physical memory.
Execution of a parallel program depends on efficient and effective synchronization, which is
the key to its performance and correctness. Parallel operations need both hardware and software
techniques for synchronization. Synchronization problems arise when data objects are shared
among processes.
Memory inconsistency occurs due to a mismatch between the order of memory accesses and the
order of process execution: when instructions run in parallel, their execution may finish out
of order even if the instructions are dispatched in order.
In a multiprocessor system, data inconsistency is a major issue both within a single level of
the memory hierarchy and between neighbouring levels. Within the same level, multiple cache
modules may hold different copies of the same memory block, because they are accessed by
multiple processors asynchronously and independently. Between neighbouring levels, a cache and
main memory may contain non-uniform copies of the same data object.
Exercise
Problem 12.2 – While a process is executing, the system permits it to allocate more memory than
its initial address space requires, for example data allocated on the heap. State the
requirements needed to support dynamic memory allocation in the following schemes:
a) Contiguous memory allocation
b) Pure segmentation
c) Pure paging
Problem 12.3 – Compare paging and segmentation with respect to the amount of memory required by
the address-translation mechanism to map virtual addresses to physical addresses.
Problem 12.4 – Define the need for paging the page tables.
Problem 12.5 – What is demand paging? Discuss the hardware support required to perform demand
paging.
Problem 12.6 – A virtual memory has a page size of 2K words, 8 pages and 4 blocks. The
associative-memory page table contains the following entries:
Page  Block
0     3
1     1
4     2
6     0
List all virtual addresses that will result in a page fault when referenced by the processor.
Problem 12.7 – What are the various issues concerned with memory consistency? How can they be
resolved? Which technique is preferred over the others?
Problem 12.8 – What are the various issues concerned with memory coherence? How can they be
resolved? Which technique is preferred over the others?