
Self Learning Material

Advanced Computer
Architecture
(MCA-603)

Course: Master of Computer Applications (MCA)


Semester-VI

Distance Education Programme


I.K. Gujral Punjab Technical University
Jalandhar
Syllabus
I.K. Gujral Punjab Technical University

MCA-603 Advanced Computer Architecture

Course Objectives: To understand and analyse the functionality, connectivity and performance of various
processors and memory types.

Section-A

Fundamentals of Processors: Instruction set architecture; single cycle processors, hardwired and
micro- coded FSM processors; pipelined processors, multi-core processors; resolving structural, data,
control and name hazards; analysing processor performance.

Section-B

Fundamentals of Memories: Memory technology; direct-mapped, associative cache; write-through and write-back caches; single-cycle, FSM, pipelined cache; analysing memory performance.

Section-C

Advanced Processors: Superscalar execution, out-of-order execution, register renaming, memory disambiguation, dynamic instruction scheduling, branch prediction, speculative execution; multi-threaded, VLIW and SIMD processors.

Section-D

Advanced Memories: Non-blocking cache memories; memory protection, translation and virtualization;
memory synchronization, consistency and coherence.

Recommended Books:

1. Computer Architecture: A Quantitative Approach, by J.L. Hennessy and D.A. Patterson.


2. Digital Design and Computer Architecture, by D.M. Harris and S.L. Harris.
Table of Contents

Chapter No. | Title | Written By | Page No.
1 | Introduction to Computer Architecture | Dr. Balwinder Singh, C-DAC Mohali | 1
2 | Instruction Set Architecture | Dr. Balwinder Singh, C-DAC Mohali | 10
3 | Processor Design | Mr. Satbir Singh, Project Engineer, C-DAC Mohali | 28
4 | Pipelined Processors | Mr. Satbir Singh, Project Engineer, C-DAC Mohali | 54
5 | Memory and Memory Hierarchy | Prashant Bhardwaj, Project Associate, C-DAC Mohali | 66
6 | Cache Memory and operations | Dr. Balwinder Singh, C-DAC Mohali | 92
7 | Pipelined Processors | Mr. Akhil Goyal, Project Engineer, C-DAC Mohali | 109
8 | Multi-core Processors and Multithreading | Prashant Bhardwaj, Project Associate, C-DAC Mohali | 124
9 | Superscalar Processors | Vemu Sulochana, Project Engineer II, C-DAC Mohali | 146
10 | VLIW and SIMD Processors | Vemu Sulochana, Project Engineer II, C-DAC Mohali | 173
11 | Advanced Memories | Mr. Akhil Goyal, Project Engineer, C-DAC Mohali | 199
12 | Memory Management and Protection | Mr. Akhil Goyal, Project Engineer, C-DAC Mohali | 222

Reviewed by:
Er. Gurpreet Singh Bains
Assistant Professor,
ECE, Punjab University,
SSGR Campus,
Hoshiarpur

© IK Gujral Punjab Technical University Jalandhar


All rights reserved with IK Gujral Punjab Technical University Jalandhar
Chapter 1 Introduction to Computer Architecture

Contents

1 Introduction to Computer Architecture

1.1 Computer Architecture


1.2 Software and Hardware Abstraction
1.2.1 Software:
1.2.2 Software Abstraction Level

1.3 Evolution of Computer

1 Introduction to Computer Architecture

Charles Babbage, who designed the first Analytical Engine in 1833, is regarded as the father of the computer. The machine was capable of performing all four basic arithmetic operations and had several features that are seen in modern computers: his design included a central processing unit, memory, storage and input/output.

1.1 Computer Architecture

Figure 1.1: Basic diagram of Computer Architecture

Computer architecture consists of three main parts:
a) Central Processing Unit (CPU): The CPU consists of electronic circuitry, built from active and passive electronic components, which carries out the instructions and control logic of the computer so as to perform basic arithmetic and logical data processing.
The CPU consists of three main components:
a) Registers: A register is a temporary storage device which holds the temporary data and variables used during data transfer and data processing.
b) Control Unit (CU): All data transfers and their movement within the central processing unit are monitored by the control unit. It controls the operation of the other units of the CPU by providing timing and control signals, much as a traffic light controls the flow of traffic.
c) Arithmetic Logic Unit (ALU): The ALU performs the data-processing part of the CPU, such as arithmetic and logical operations. The ALU takes the data and the necessary opcode (the operation to be performed) and produces the processed result.

b) Memory: The storage element of the computer, which holds all the data and instructions of the user. It stores items such as the operating system, applications, kernel, firewall, user data and so on.
a) Primary memory: It holds the data that the computer is using in its current state. It has limited capacity, and its data is lost when power is lost; such memory is known as volatile memory. Primary memory is accessed directly by the central processing unit, is faster than secondary memory, and is costlier; a computer system cannot work without it. Primary memory consists of:
1. Random Access Memory (RAM)
2. Read Only Memory (ROM)

b) Secondary Memory: These types of memory are used to hold user data and system software. They are the non-volatile part of memory, whose data is not erased when the power is switched off. They are slower than primary memory, and the CPU cannot access them directly; they are accessed only through input/output operations and are cheaper than primary memory. E.g., hard disks, pen drives.

1.2 Software and Hardware Abstraction

Abstraction is the act of showing the essential things without showing the background details. Abstraction is used to reduce system complexity and to make an efficient design of complex systems.

In a computer system, abstraction is shown through various levels:

1.2.1 Software: The things which cannot be touched, such as operating systems, office suites, PowerPoint, etc.
1.2.2 Software Abstraction Level:

Fig 1.2: Software Abstractions

a) Operating System: The operating system is system software which acts as an interface between the user and the hardware. It manages the hardware and software resources of the computer system and provides common programs and services. Operating systems are further classified into two types:
Single-user operating system: accessed by only a single user.
Multi-user operating system: accessed by multiple users at a time.

b) Kernel: The kernel is the part of a computer program which manages input/output requests made by software and translates them into machine-understandable instructions for the central processing unit. The kernel is a fundamental part of the operating system, as it connects the system software with the system hardware to perform the necessary tasks.
The kernel is responsible for memory management, process management and task management.

c) Assembler: An assembler is a program which converts human-readable assembly language into machine language. The assembler calculates constant expressions and resolves symbolic names for memory locations and other entities. The use of symbolic references saves tedious calculations and manual address updates after program modifications. Many assemblers also have macro facilities to perform textual substitution, e.g., to generate common short sequences of instructions inline instead of as called subroutines.

d) Firmware: Firmware is a type of software that provides control, monitoring and data manipulation. Firmware resides in non-volatile memory such as ROM, EPROM (Erasable Programmable Read Only Memory) and flash memory. These programs must be present in the computer system for its entire lifetime. They are pre-written by the manufacturer and cannot be modified by an ordinary user, though they are sometimes updated to correct bugs. Firmware of this kind is sometimes also called the BIOS (Basic Input Output System).

e) Hardware: The things which can be touched, such as the monitor, keyboard, CD (Compact Disc), etc.

Hardware Abstraction:-

A hardware abstraction layer is implemented in software between the computing hardware and the system software which runs on that computer. Its main function is to hide differences in hardware from most of the operating system kernel, so that most kernel-mode code does not need to be changed to run on systems with different hardware. It allows programmers to write device-independent and highly efficient applications that call the hardware.

In Microsoft Windows NT, the hardware abstraction layer (HAL) is implemented as a separate kernel module (hal.dll).

1.3 Evolution of Computer

a) First Generation (1939-1945) [Era of the Vacuum Tube]:

First-generation computers were huge, bulky devices built from vacuum tubes enclosed in transparent glass, and they consumed a very large amount of power compared to modern computers. Programs and data were given in machine language, which cannot easily be understood by a new user. Because of their very large power requirement these machines needed separate cooling units to prevent overheating, since overheating could damage the internal circuitry; they were not reliable and required long downtimes.
E.g. Electronic Numerical Integrator and Calculator (ENIAC), Electronic Discrete Variable Automatic Computer (EDVAC), Electronic Delay Storage Automatic Computer (EDSAC), Universal Automatic Computer (UNIVAC-I)
b) Second Generation (1949-1954) [Transistor]:

With the invention of the transistor by John Bardeen, Walter Brattain and William Shockley at AT&T Bell Labs in 1948, computing speed increased while cost decreased. A transistor consumes less power than a vacuum tube and takes less space: it requires about 1/10th of the power consumed by a tube and is roughly 10 times cheaper.

Figure 1.3: Transistors

Around the same time another major development took place with the invention of magnetic core memory. The cores are very small (about 0.02 inch) ring-shaped structures that can be magnetised by applying a magnetic field in either the clockwise or the anticlockwise direction.

Owing to the invention of the transistor and magnetic core memory, memory and processing capabilities increased at a lower power requirement than with vacuum tubes. Because the devices were small, the size of computers also decreased, and because the circuitry was less complex, downtime decreased as well; however, this generation still needed separate cooling systems.

With the increase in memory size, high-level languages also developed, and we see FORTRAN, COBOL, ALGOL, SNOBOL, etc., which are user-understandable and in which error correction can easily be done.

E.g., IBM 1401, IBM 7094, RCA 501, UNIVAC 1108

c) Third Generation (1959-1971) [IC - Integrated Circuit]:

The invention of the integrated circuit in 1958 by Jack Kilby at Texas Instruments replaced discrete transistors. These chips are made of silicon, with the circuitry printed directly on them; using lithography techniques, the functions of many active devices can be integrated onto a single chip with a large number of pins on either side.

Because of their miniature size and low power requirement, integrated circuits became extremely popular in the computer industry. Their low power requirement meant they emitted far less heat, which largely removed the need for separate cooling units.

These ICs require an extremely purified form of silicon, and with advances in fabrication technology they became much cheaper, more reliable and much faster than the vacuum tubes and transistors of the previous generations.

Initially the IC industry was only capable of making Small Scale Integration (SSI) circuits, with about 10 transistors per chip; with further advances the technology reached Medium Scale Integration (MSI), with about 100 transistors per chip. As a result, the size of main memory increased up to 4 megabytes, and CPUs became powerful enough to perform millions of instructions per second (MIPS).

E.g., IBM 360, Honeywell 6000 series, ICL 1900 series, ICL 2900

d) Fourth Generation (1971 to present) [Microprocessor]:

With advances in the semiconductor and IC industries, Medium Scale Integration was replaced by Very Large Scale Integration (VLSI), with a packaging capacity of around 50,000 transistors on a single chip. This led to extremely powerful computers, and because of the low cost of these ICs, fourth-generation computers became affordable to ordinary people. The faster processing speeds and larger memory capacity made much more powerful operating systems possible.

In 1993 the Pentium series came into existence, and RISC (Reduced Instruction Set Computer) microprocessors came to be preferred for numeric and file-handling services.
E.g., IBM 5100 (first portable computer), TOSHIBA T1100

e) Fifth Generation (Present and Beyond):

Fifth-generation computers depend heavily on artificial intelligence and are still under development, using image processing, voice recognition, face recognition, fingerprint recognition, etc. The use of parallel processing has also increased significantly.
The use of Ultra Large Scale Integration (ULSI) circuits, which contain 1,000,000 or more transistors, significantly increases the speed of the central processing unit, and because of their high packaging density these ICs are easily used in Personal Digital Assistant (PDA) devices.
Fifth-generation computers have huge storage capabilities, in the terabyte range and beyond. Advances in magnetic disk storage brought in the era of portable storage devices, and high processing capabilities push computing beyond the fifth generation.

Exercise
Q1: What are the different generations of evolution of Computers?
Ans: Refer section 1.3
Q2: What do you mean by hardware and software abstractions? Explain briefly.
Ans: Refer section 1.2
Q3: Define computer architecture.
Ans: Refer section 1.1

Chapter 2 Instruction Set Architecture

Contents
2.1 CISC Architecture
2.1.1 CISC Approach
2.1.2 Addressing Modes In CISC
2.1.3 CISC Examples

2.2 RISC
2.2.1 RISC Performance
2.2.2 RISC Architecture Features
2.2.3 RISC Examples

2.3 Comparison

2.4 Instruction Formats


2.4.1 Op-code and Operand
2.4.2 Addressing Modes
2.4.3 Classification of Addressing Modes
2.4.3.1 Direct addressing mode:
2.4.3.2 Indirect Mode:
2.4.3.3 Register Addressing Mode:
2.4.3.4 Immediate addressing mode:
2.4.3.5 Implicit mode:

2.5 MIPS Instructions and their Format


2.5.1 R-type
2.5.2 I-type instructions
2.5.3 J-type instructions
2.5.4 MIPS Arithmetic and Logic Instructions
2.5.5 MIPS Branch control Instruction
2.5.6 Pipelining Instructions

2.1 CISC Architecture

Earlier, programming was done either in assembly language or in machine code. This led designers to develop instructions that are easy to use. With the advent of high-level languages, computer architects created dedicated instructions that would do as much work as possible and could be used directly to perform a particular task. The next step was to implement the concept of orthogonality, that is, to provide every addressing mode for every instruction. This allowed results and operands to be stored directly in memory instead of only in registers or as immediates.
At that time hardware design was given more importance than compiler design, which became the reason for implementing functionality in microcode or hardware rather than in the compiler alone. This design philosophy was termed the Complex Instruction Set Computer (CISC). CISC chips were the first PC microprocessors, as the instructions were built into the chips.
Another factor which encouraged this complex architecture was the very limited main memory of the time. The architecture proved advantageous because it led to a high density of information held in computer programs, as well as other features such as variable-length instructions and rich data-loading operations. These issues were given higher priority than ease of decoding the instructions.
Another reason was that main memories were slow. With dense information packing, the frequency with which the CPU accesses memory can be reduced. To overcome slow memories, fast caches can be employed, but they are of limited size.
2.1.1 CISC Approach
The main motive of the CISC architecture is that a task can be compiled into very few lines of assembly. This is accomplished by building hardware which is capable of understanding and executing a series of operations. For example, if we want to execute a multiplication, a CISC processor comes with a specific instruction ('MUL').
When this instruction is executed, the two values are loaded into separate registers, the operands are multiplied in the execution unit, and the product is stored in the appropriate register. This whole multiplication task is compiled into just one instruction:
MUL 3:2, 2:5

MUL can be called a complex instruction, as it operates directly on the computer's memory and does not require the programmer to call any load or store operations explicitly.
2.1.2 Addressing Modes in CISC
The variety of addressing modes in CISC leads to variable-length instructions. For example, if an operand is in memory instead of in a register, the instruction length increases, because the memory address must be encoded in the instruction, which takes many bits. This complicates instruction decoding and scheduling.
The wide range of instruction types also leads to variation in the number of clock cycles required to execute instructions.

2.1.3 CISC Examples
1. PDP-11
 Series of 16-bit minicomputers
 Most popular minicomputer of its era
 Smallest system that could run UNIX
 The C programming language was developed on the PDP-11 and maps easily onto its low-level instructions

Figure 2.1: PDP-11

2. Motorola 68000
 Also known as Motorola 68K
 16/32 bit CISC Microprocessor core
 Introduced in 1979 with HMOS technology
 Software forward compatible

Figure 2.2: Motorola 68000 IC

Advantages

 The compiler has relatively little work to do to translate a high-level language into assembly language.
 Because of the short code length, little RAM is required to store a program.

Disadvantages

 Optimisation is difficult
 The control unit is complex
 Compilers find it hard to exploit the complex machine instructions

2.2 RISC
RISC stands for Reduced Instruction Set Computing. The term 'RISC' is often misunderstood to mean that the number of instructions is small, i.e. that the instruction set itself is small. This is not true: there can be any number of instructions as long as each is simple enough to complete within a single clock cycle. The instruction set can even be larger than that of a CISC machine; it is the complexity of the instructions that is reduced, which is why the term "reduced instruction set" is used.
In RISC, an operation is divided into sub-operations. For example, to add two numbers X and Y and store the result in Z, the operation is performed as:
Load 'X'
Load 'Y'
Add X and Y
STORE Z

2.2.1 RISC Performance


The thirst for higher performance has always been present in every computer architecture and model, and it leads to the introduction of new architectures and system organizations.

Many methods can be adopted to achieve higher performance such as


 Technology advances
 Better architecture
 Better machine organization
 Optimization and improvement in compiler technology

Technology improvements enhance performance proportionally, and they are equally available to everyone. It is mainly in the organization of the machine and its architecture that the experience and skill of the computer designer show. These goals are fulfilled by RISC, and as a result of the RISC architecture we get fewer addressing modes, fewer instructions and instruction formats, and a simpler control-unit circuit.

RISC architecture is based on the concept of pipelining, due to which the execution time of each instruction is short and the number of cycles is also reduced.

For efficient execution of the RISC pipeline, the most frequently used instructions and addressing modes are selected.

The trade-off between RISC and CISC can be expressed in terms of the total time required to execute a task:

Time(task) = I * C * P * To

where I = number of instructions per task
C = number of cycles per instruction
P = number of clock periods per cycle
To = duration of one clock period (ns)

Although CISC needs fewer instructions for a particular task, its execution requires more cycles because of its more complex operations compared with RISC. In addition to the smaller cycle count, the simplicity of the RISC architecture leads to a shorter clock period To, giving higher speed than CISC.
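
As a rough, hypothetical illustration of this trade-off, the formula can be evaluated directly; the instruction counts, CPI values and clock periods below are invented for the sake of the arithmetic, not measurements of any real machine.

# Hypothetical comparison of task execution time using Time = I * C * P * To.
def task_time(I, C, P, To_ns):
    """Total time for a task: instructions * cycles/instr * periods/cycle * period (ns)."""
    return I * C * P * To_ns

# CISC-style machine: fewer instructions, but more cycles per instruction
# and a longer clock period because of complex decoding.
cisc_time = task_time(I=100, C=6, P=1, To_ns=10)   # 6000 ns

# RISC-style machine: more instructions, but about one cycle per instruction
# and a shorter clock period thanks to simpler hardware.
risc_time = task_time(I=250, C=1, P=1, To_ns=5)    # 1250 ns

print(f"CISC: {cisc_time} ns, RISC: {risc_time} ns")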

2.2.2 RISC Architecture Features

 Load/Store architecture:
RISC architecture is also known as a load/store architecture because load and store operations are executed separately from the other instructions, thus obtaining a high level of concurrency. Access to memory is accomplished through load and store instructions only. The other operations in this instruction set are called register-to-register operations, as all operands reside in the general-purpose register file (GPR) and the result is also stored in the GPR. The RISC pipeline is designed so that it can accommodate both operations and memory accesses with equal efficiency.
 Selected set of instructions:
The concept of locality is applied in RISC: a small set of instructions is used most frequently, leading to efficient instruction-level parallelism and hence an efficient pipeline organization. Such a pipeline executes three main instruction classes efficiently:
 Load/Store
 Arithmetic/logic
 Branch

Because of its simple pipelined architecture, the control part of a RISC processor is implemented in hardware, while CISC depends heavily on microcoding.
 Fixed format of instruction:
One of the important features of RISC is its fixed, predetermined instruction format, which allows an instruction to be decoded in one cycle and simplifies the control hardware. The instruction size is fixed at 32 bits, although some machines use two sizes, 32-bit and 16-bit; 16-bit instructions are used in the IBM ROMP processor.
A fixed instruction size results in efficient execution in a pipelined architecture. Decoding an instruction in one cycle is also helpful in executing branch instructions, where the outcome can be determined in one cycle and, at the same time, the address of the new target instruction can be issued.
 Simple addressing modes:
One of the essential requirements of a pipeline is simple addressing modes, as they allow the address calculation to complete in a predetermined number of pipeline cycles. In real programs, address computation requires only three simple addressing modes:
I. Immediate
II. Base + displacement
III. Base + index

These three modes cover about 80% of the addressing-mode usage in real programs.

 Separate instruction and data caches:
Generally, operands are found in the first level of the memory hierarchy, i.e. the general-purpose register file (GPR); this is the register-to-register operation feature, and access to data in the GPR is fast. If an operand is not present in the GPR, it should still not take long to fetch it. This requires access to a fast memory next to the CPU, that is, a cache. A RISC machine requires a one-cycle cache access for efficient operation; if the cache takes two or more cycles, performance is degraded. To maintain one-cycle cache bandwidth, instruction and data accesses should not collide. This separation of instruction and data caches is found in the Harvard architecture and is an essential feature for RISC.
2.2.3 RISC Examples
 ARM (Advanced RISC Machine)
a) The most widely used 32-bit instruction set architecture.
b) Initially known as the Acorn RISC Machine, used in desktop computers.
c) Suitable for low-power applications.
d) Dominant in embedded and mobile electronics because of its low cost and small processors.

Figure 2.3: ARM IC

 ATMEL AVR

a) 8-bit RISC single chip microcontroller.
b) Modified Harvard architecture.
c) First microcontroller to use on chip flash memory.

Figure 2.4: AVR IC

 POWER PC

a) RISC architecture created by the Apple-IBM-Motorola alliance.
b) Used in high-performance processors.
c) High level of compatibility with IBM's earlier POWER architecture.

Figure 2.5: POWER PC IC

 Scalable processor architecture (SPARC)


a) RISC instruction set architecture developed by Sun Microsystems in 1986.
b) Initially a 32-bit architecture, used in Sun's Sun-4 workstation and server systems.

Figure 2.6: SPARC IC

Advantages of RISC

 As each instruction executes in only one cycle, the entire program takes roughly the same time as it would on a CISC architecture, despite containing more instructions.
 Less hardware area is needed because fewer transistors are used.
 Separate LOAD and STORE instructions reduce the amount of work the processor must do per instruction.
 In CISC, the operand in a register is overwritten as soon as the instruction is executed, but in RISC the operand remains in the register until a new value is loaded.

Disadvantages

 Longer programs require more memory space.
 Certain applications may see slower execution of individual operations.
 Assembly programs are more difficult to write.

2.3 Comparison
Table 2.1: Comparison between RISC and CISC

CISC | RISC
Large number of instructions (120-350) | Fewer instructions (<100)
Large number of addressing modes | Fewer addressing modes
Variable-length instruction format | Fixed-length instruction format
Cycles per instruction (CPI) is between 1 and 20 | CPI is one due to pipelining
General-purpose registers vary from 8 to 32 | Large number of general-purpose registers
Microprogrammed control unit | Hardwired control unit
Large instruction decode area (can exceed 50%) | Small instruction decode area (approx. 10%)
Memory-to-memory operations | Register-to-register operations
Emphasis on hardware | Emphasis on software
Multi-clock, complex instructions | Single-clock, reduced instructions

2.4 Instruction Formats

A microprocessor can do many things, but only those operations that the user/programmer specifies will be carried out. To do this, the programmer must know how to write the different instructions.

For example, a bank shows the balance of an account, and it needs the account number to do so; the account number identifies the user. Similarly, for data stored at some memory location, the programmer/user has to specify the memory location in order to access the data. It is also necessary to tell the computer that the values being specified are not data values but memory locations. So there must be some protocols/rules for achieving this.
Another example is a simple calculator, in which the user specifies two values and wants an addition to be performed on them. In this case the values are operated on directly rather than being fetched from memory locations. So, while programming the microprocessor, one needs to be very careful; otherwise the results will be incorrect and the applications that use the microprocessor will not work as intended.
Therefore, rules need to be defined for the following considerations:
1. How to specify whether the values typed are immediate data or memory locations.
2. How to distinguish whether the data to be used is present in the register specified in the instruction, or at the memory location whose address is held in that register.
3. There is another type of instruction in which there is no need to specify immediate data or any register; such instructions operate directly on pre-specified registers or locations.

These problems are handled with the help of addressing modes. But before going into the concept of addressing modes and their classification, we need to discuss the concepts of opcode and operand.
2.4.1 Op-code and Operand
A microprocessor program consists of multiple instructions which perform different operations. A program instruction consists of two main parts:
 The first part tells the microprocessor what kind of operation is to be performed.
 The second part tells the microprocessor about the data on which the specified operation is to be performed.

The part which contains the information about the function to be performed is called the opcode, and the other part, which gives the data or the way to access the data, is called the operand.

Instruction = Opcode + Operand

2.4.2 Addressing Modes


There must be a method of specifying the data to be operated on by a microprocessor instruction. The method by which an instruction addresses the data to be processed (operated on) is known as addressing, and the ways of identifying the operand for a particular task/instruction are known as addressing modes. In other words, the manner in which the target address or effective address is identified within the instruction is called the addressing mode.

2.4.3 Classification of Addressing Modes


The operands on which an instruction is to operate can be identified in different ways. Generally, the following addressing modes are used in the 8085 microprocessor:
 Direct addressing mode.
 Indirect addressing mode
 Register addressing mode
 Immediate addressing mode
 Implicit addressing mode.
These modes are discussed below in detail
2.4.3.1 Direct addressing mode:
In direct addressing mode, the memory location of the data (on which the specified function is to be performed) is given within the instruction. So the first part of the instruction contains the operation to be performed, and the second part contains the memory location at which the data is stored.
Example:
OUT 11H
LDA 4050H
STA 2004H

Consider the last instruction: - STA 2004H

When the above instruction is executed, the accumulator contents will be stored in the specified memory location, i.e. 2004H. Suppose the accumulator contents at that time are 18H; then the 18H stored in the accumulator will be copied to memory location 2004H.

Figure 2.7: Direct addressing (accumulator contents 18H copied to memory location 2004H)

2.4.3.2 Indirect Mode:


In register indirect addressing, the operand part of the instruction specifies a register (or register pair) which holds the address of the data to be processed.
Example:
SUB M
MOV A, M
DCR M

Consider the first example: MOV A, M.

This instruction moves the data available at the memory location whose address is held in the H-L register pair into the accumulator. Here, M represents the memory location pointed to by the H-L register pair. So when MOV A, M is executed, the data at the memory location specified by the H-L pair is moved to the accumulator.

Figure 2.8: Indirect addressing

2.4.3.3 Register Addressing Mode:
In this mode, the registers holding the data (on which the specified operation is to be performed) are given in the operand part of the instruction, and the operation to be performed is given in the opcode part.
Example:
MOV A, D
Here MOV is the opcode. When this instruction is executed, the contents of register D are moved to the accumulator, which is generally known as register A.

Figure 2.9: Register addressing (contents of register D copied to the accumulator A)

The concept of register addressing mode can be further illustrated by the following examples:
ANA C
When the above instruction is executed, the contents of register C are ANDed logically with the contents of the accumulator (register A).
SUB L
The execution of the instruction SUB L subtracts the contents of register L from the accumulator contents.

2.4.3.4 Immediate addressing mode:
In immediate addressing, the data to be processed is itself specified in the operand part of the instruction, and the operation to be performed is specified in the opcode part. Based on these two things, the microprocessor accomplishes the given task. It is the simplest way of getting things done, since we provide both the function to be performed and the data on which it is to be performed.
Consider the following example:
ADI 34H - This instruction adds the immediate data 34H to the accumulator.
Suppose the contents of the accumulator are currently 08H. When the instruction ADI 34H is executed, 34H is added to 08H and the final result is stored in the accumulator.
In this instruction the operand is specified within the instruction itself.
2.4.3.5 Implicit mode:
In this case, there is no need to write any register, data or memory location. The data is automatically taken from a predefined location according to the instruction used. Generally, this addressing mode is used when the operation is to be performed on the data available in the accumulator only. Example:
CMA
RAL
RAR

CMA complements the contents of the accumulator.
If RAL is executed, the contents of the accumulator are rotated left by one bit through the carry.
If RAR is executed, the contents of the accumulator are rotated right by one bit through the carry.
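
As a toy sketch of how the operand is located under a few of the modes above (the register and memory values are hypothetical, and the 8085 mnemonics appear only in the comments):

# Toy model of how an operand is located under different addressing modes.
memory = {0x2004: 0x00, 0x2050: 0x7A}
regs = {"A": 0x18, "D": 0x3C, "HL": 0x2050}

# Direct addressing   (STA 2004H): the address is given in the instruction itself.
memory[0x2004] = regs["A"]

# Register addressing (MOV A, D): the operand is in a named register.
regs["A"] = regs["D"]

# Register indirect   (MOV A, M): the H-L pair holds the address of the operand.
regs["A"] = memory[regs["HL"]]

# Immediate           (ADI 34H): the operand is part of the instruction.
regs["A"] = (regs["A"] + 0x34) & 0xFF

print(hex(memory[0x2004]), hex(regs["A"]))   # 0x18 0xae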

2.5 MIPS Instructions and their Format

There are three types of MIPS instruction formats, i.e. R-type, I-type, and J-type.

2.5.1 R-type

They are called R-type because they are register-type instructions; they are the most complex format. The format of an R-type instruction and its encoding are given below.

Table 2.2: R-Type instructions

B31-26 | B25-21 | B20-16 | B15-11 | B10-6 | B5-0
opcode | register s | register t | register d | shift amount | function

General form of MIPS R-type instruction is:

add $rd, $rs, $rt

where $rd denotes register d (d is written here as a variable, but in an actual instruction it must be a number between 0 and 31, inclusive). $rs and $rt are also registers.

The semantics of the instruction are:

R[d] = R[s] + R[t]


where the addition is signed addition.

Notice that there are three registers in the above instruction: one destination register ($rd) and two source registers ($rs and $rt).

In the format shown in the table above, the two source register fields come first, followed by the destination register field. This is how the programmer works with these instructions.
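
To make the field layout concrete, the following sketch (not taken from the textbook) assembles an R-type add instruction into its 32-bit encoding, using the standard MIPS opcode 0 and function code 32 for add.

# Encode "add $rd, $rs, $rt" as an R-type word:
# opcode(6) | rs(5) | rt(5) | rd(5) | shamt(5) | funct(6)
def encode_r_type(rs, rt, rd, shamt=0, opcode=0, funct=0b100000):  # funct 32 = add
    return (opcode << 26) | (rs << 21) | (rt << 16) | (rd << 11) | (shamt << 6) | funct

# add $8, $9, $10  (rd = 8, rs = 9, rt = 10)
word = encode_r_type(rs=9, rt=10, rd=8)
print(f"{word:032b}")   # 000000 01001 01010 01000 00000 100000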

2.5.2 I-type instructions

I-type stands for “Immediate type” instruction. The format of an I-type instruction is
as shown below:

Table 2.3: I-Type instructions

B31-26 | B25-21 | B20-16 | B15-0
opcode | register s | register t | immediate

The programmer uses this type of instruction as shown below:

addi $rt, $rs, immed

In this instruction $rt is the destination register, and the only source register is $rs.

The semantics of the addi instruction are:

R[t] = R[s] + (IR15)^16 || IR15-0

where IR is the "instruction register", which holds the current instruction. (IR15)^16 means that bit B15 of the instruction register (the sign bit of the immediate value) is repeated 16 times, and IR15-0 is the 16-bit immediate value itself. In other words, the immediate is sign-extended to 32 bits before being added.
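
The sign-extension step can be illustrated with a short sketch (the register and immediate values are assumed, not from the text): bit 15 of the immediate is replicated into the upper 16 bits before the addition.

# Sign-extend a 16-bit immediate to 32 bits, as in R[t] = R[s] + signext(imm)
def sign_extend_16(imm):
    imm &= 0xFFFF
    return imm - 0x10000 if imm & 0x8000 else imm

rs_value = 100
imm = 0xFFFC             # 16-bit two's-complement representation of -4
rt_value = rs_value + sign_extend_16(imm)
print(rt_value)          # 96, i.e. addi computed 100 + (-4)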

2.5.3 J-type instructions

J-type stands for "jump type". The format of a J-type instruction is shown below:

Table 2.4: J-Type instructions

B31-26 B25-0
opcode target

The J-type instruction is used as:

j target

The semantics of the j instruction (j means jump) are:

PC <- PC31-28 || IR25-0 || 00

where PC (the program counter) holds the address of the next instruction to be executed. The upper 4 bits of the PC are appended with the 26 bits of the target and followed by two 0 bits, which creates a 32-bit address.
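
A small sketch (the PC and target values below are assumed, not from the text) shows how the 32-bit jump target is formed from the upper 4 bits of the PC, the 26-bit target field, and the two trailing zero bits:

# PC <- PC[31:28] || target(26 bits) || 00
def jump_target(pc, target26):
    return (pc & 0xF0000000) | ((target26 & 0x03FFFFFF) << 2)

pc = 0x0040001C                # address held in the PC (example value)
target26 = 0x0100010           # 26-bit target field taken from the instruction
print(hex(jump_target(pc, target26)))   # 0x400040
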
2.5.4 MIPS Arithmetic and Logic Instructions
These instructions are used to perform mathematical and logical operations, such as addition, subtraction, multiplication, division, comparison, negation, increment, decrement, ANDing, ORing, XORing, NOT, shifting and rotating.
The flags are affected after executing these instructions. The CPU performs these operations on data stored in the CPU registers.
Some arithmetic and logical instructions are shown in the following table:

Table 2.5: MIPS ALU instructions

2.5.5 MIPS Branch Control Instructions
The branch control instructions are used to transfer control to an instruction which does not come immediately after the instruction currently being executed. The control transfer is done by loading the address of the target instruction into the program counter (PC); the next instruction to be executed is then the target instruction, which is fetched from the new location in memory. Branching can be conditional or unconditional.
There are two types of branch control instructions:

Conditional branch instructions:


• For branching to the new target address, the CPU first checks whether a condition is true. Normally these conditions are based on the values of flags.
• In these instructions the target address is normally close to the current PC location.

• For example: loops, if statements.

Unconditional branch instructions:


• Control is transferred to the target instruction unconditionally.
• In these instructions the target address is normally far away from the current PC location.
• For example: subroutine calls.

2.5.6 Pipelining Instructions


One of the most important features of RISC architecture is pipelining.
 Concept of pipelining
Conventionally, a computer executes only one instruction at a time, and the program counter points to the instruction currently being executed. In pipelining, parts of several instructions are executed simultaneously, leading to a fast and efficient process. If many pipeline stages are used in a RISC machine, it is called a superpipelined machine.
For the execution of an instruction, the stages are Instruction Fetch (IF), Instruction Decode (ID), Instruction Execution (EX), Memory Access (MM) and Write Back to register (WB). In the pipelining scheme, to improve performance, the next instruction is fetched while the current instruction is being decoded.
The 5-stage pipelining scheme is shown in the figure below:

Figure 2.10: 5-stage pipelining
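
The overlap of the five stages can also be visualised with a small sketch (illustrative only) that prints which stage each instruction occupies in every clock cycle:

# Print a 5-stage pipeline chart: instruction i enters IF in cycle i.
STAGES = ["IF", "ID", "EX", "MM", "WB"]

def pipeline_chart(num_instructions):
    total_cycles = num_instructions + len(STAGES) - 1
    for i in range(num_instructions):
        row = []
        for cycle in range(total_cycles):
            stage = cycle - i
            row.append(STAGES[stage] if 0 <= stage < len(STAGES) else "..")
        print(f"I{i+1}: " + " ".join(row))

pipeline_chart(4)
# I1: IF ID EX MM WB .. .. ..
# I2: .. IF ID EX MM WB .. ..
# and so on for I3 and I4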

Exercise
Q1. What are the different features of CISC and RISC architectures?
Ans: Refer section 2.1 and 2.2.
Q2. Explain different addressing modes for instructions.
Ans: Refer section 2.4.3

Q3. What are the different instruction sets defined for MIPS?
Ans: Refer section 2.5
Q.4 Write short notes on following:
a. Pipelining
b. Arithmetic and logical instructions
c. Branch control instructions
Ans: Refer section 2.5.4, 2.5.5 and 2.5.6

Chapter 3

Processor Design

Structure
3.0 Objectives
3.1 Introduction
3.2 Control Unit
3.3 Control Signals
3.4 Design Process
3.5 MIPS Microarchitecture
3.6 Hardwired Control
3.7 Micro programmed Control
3.8 Single Cycle Processor
3.9 Multi Cycle Processor
3.10 Pipelining
3.11 Multi Core Processor
3.12 Test Yourself

3.0 Objectives:

Objectives of this chapter are to familiarize one with the following aspects of Processor
Design:
 Processor control mechanism
 Important components in the design process
 Introduction to various type of microarchitecture and their definition
 Detailed description of Single cycle, multi cycle and pipelined microarchitecture
 Different types of controls: Hard wired and Microprogrammed Control
 What is meant by a Multi Core Processor?

3.1 Introduction:

The main motive of this chapter is to learn the design of a MIPS microprocessor; you will learn three different designs. Designing a microprocessor may seem like the toughest of jobs, but it is actually quite straightforward. Before designing the microprocessor, you just need a knowledge of combinational and sequential logic. We assume that you are also familiar with circuits such as the ALU and memory, and that you have learned the MIPS architecture from a programmer's point of view, in terms of its registers, instructions, and memory. In this chapter you will come to know the microarchitecture, which is the interface between logic and the physical architecture. You will learn how to arrange registers, ALUs, finite state machines (FSMs), memories and the other building blocks needed to implement an architecture. A given architecture can have many different microarchitectures, and each microarchitecture makes different trade-offs of performance, cost, and complexity; their internal designs can differ greatly.

3.2 Control Unit:

The function of the control unit is to decode the binary machine word in the IR (Instruction
Register) and issue appropriate control signals. These cause the computer to execute its
program.
The control signals are to be generated in the proper sequence so that the instructions can be
executed in a proper way. The control signals are generated with the help of the internal logic
circuitry of the control unit.

Figure 3.1: Control unit


3.3 Control Signals

We already know that the basic parts used to build a processor are the ALU, registers, datapaths and the operations executed on them. For the proper working of the control unit, it needs inputs that let it know the state of the system and outputs that let it command the system to behave in a certain way. This is how the control unit looks from the outside; internally, it needs logic circuits (gates) so that it can perform these operations.

Figure 3.2: Control unit inputs and outputs

Figure 3.2 shows the inputs and outputs. The inputs are

 Clock: The clock signal is used to keep track of time. Every clock cycle matters, as clock cycles are what instruction execution is measured in. In the MIPS microarchitecture described here, one clock cycle is used to execute one micro-operation; this is the processor clock time.

 Instruction register: The instruction register holds the current instruction. The opcode and the addressing mode determine what kind of operation is to be performed.

 Flags: Flags are an important control unit input; they capture the effect of instruction execution on the processor and record the outcome of previously executed ALU operations.

 Control signals from control bus: The control signals from the control bus provide
signals to the control unit.

The outputs are as follows:

 Control signals within the processor: There are two types of control signals:
I. The signals which result in the operations that move the data from one register
to another
II. The others that are used to initiate special ALU operations.

 Control signals to control bus: They are divided into two categories:
I. Memory control signals
II. I/O control signals.

The control signals that are most used are: the signals that initiate an ALU operation, the signals that steer the datapaths, and the signals that direct the external system. These control signals are applied directly, as zeroes and ones, to the logic gates.

The control unit must know which step of the instruction cycle it has reached; this is necessary for it to take decisions. The control unit uses this knowledge, together with its input ports, to generate the control signals that initiate the next operations. The clock is used to time the control signals and the occurrence of events, allowing the signals to stabilise.

3.4 Design Process

Our microarchitectures are divided into two parts: the datapath and the control. The datapath processes the data words; structures such as memories, registers, ALUs, and multiplexers sit in the datapath. We take MIPS as our example, which is a 32-bit architecture, hence we use a 32-bit datapath. The function of the control unit is to receive the current instruction from the datapath and tell the datapath how to execute it. In other words, the control unit selects the multiplexer select lines, the register enables, and the memory write signals to control the operation of the datapath.

Figure 3.3: Design elements of control unit

The program counter is a 32-bit register. It points to the current instruction to be executed, and its input holds the address of the next instruction.

The instruction memory has a single read port. It takes a 32-bit instruction address on its input, A, and provides the 32-bit instruction read from that address on its RD output.

The 32-bit register file has two read ports and one write port. The read ports take 5-bit address inputs, A1 and A2, and the corresponding 32-bit register values are read onto the data outputs RD1 and RD2. The write port has a 5-bit address input, A3, a 32-bit write data input, WD3, a write enable input, WE3, and a clock. If the write enable is 1, the data is written into the specified register on the rising edge of the clock.
The data memory has one read port and one write port. If its write enable, WE, is 1, data is written from WD into address A on the rising edge of the clock. If the write enable is 0, the data at address A is read onto RD.
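
A behavioural sketch of such a register file (a simplification for illustration, not the book's hardware description) with two read ports and one write port clocked on the rising edge:

# Behavioural model of a 32 x 32-bit register file: 2 read ports, 1 write port.
class RegisterFile:
    def __init__(self):
        self.regs = [0] * 32

    def read(self, a1, a2):
        """Combinational read: returns RD1, RD2 for 5-bit addresses A1, A2."""
        return self.regs[a1], self.regs[a2]

    def clock_edge(self, we3, a3, wd3):
        """On the rising clock edge, write WD3 to register A3 if WE3 is 1."""
        if we3 and a3 != 0:          # register $0 is hard-wired to zero in MIPS
            self.regs[a3] = wd3 & 0xFFFFFFFF

rf = RegisterFile()
rf.clock_edge(we3=1, a3=8, wd3=1234)
print(rf.read(8, 0))                 # (1234, 0)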

3.5 MIPS Microarchitectures


The microarchitectures discussed in this chapter are the single-cycle, multicycle, and pipelined microarchitectures. They differ in the way the state elements are connected and in the amount of nonarchitectural state.

In the single-cycle microarchitecture the entire instruction is executed in one cycle. It has a simple control unit and is easy to explain. Nonarchitectural state is not required, because every operation completes within the single cycle. However, the slowest instruction limits the cycle time.

In a multicycle microarchitecture, instructions execute in a series of shorter cycles; simpler instructions take fewer cycles than complicated ones. The hardware cost is reduced because the multicycle microarchitecture reuses hardware blocks such as adders and memories: for example, the same adder may be used on several different cycles for several purposes while carrying out a single instruction. This is done by adding several nonarchitectural registers to hold intermediate results.

In the pipelined microarchitecture, the concept of pipelining is applied. Hence it can execute
several instructions at a time which helps in improving the throughput.

Before learning the above three microarchitectures, we will first learn the hardwired and
micro-coded/micro-programmed control unit.

3.6 Hardwired Control

Here the control signals are produced by a hardwired circuit. As we know, the purpose of the control unit is to generate the control signals. These control signals must appear in the proper sequence, and the time slot dedicated to each control signal must be wide enough for the operation indicated by that signal to finish before the next one in the sequence occurs. Because in hardwired control the control unit is built from fixed (hardwired) units, and these units have a certain propagation delay, the hardwired control allocates a small time interval for the output signals to stabilise.

For simplicity, we assume that the time slots are equal, so a counter can be used to design the control unit. This counter is driven by a separate clock signal, and every step is dedicated to a particular part of the CPU's instructions.

A large number of instructions exist even for the add operation. For example,

ADD NUM R1: add the contents of the memory location specified by NUM to the contents of register R1 and store the result in R1;

ADD R2 R1: add the contents of register R2 to the contents of register R1 and store the result in R1.

It is clear from the above example that the fetch operation is the same, but the control signals are generated separately for the two ADD instructions. Hence we conclude that the type of instruction defines the control signals.

Some instructions also use the status flags, so their execution depends upon the flag register values and the contents of the instruction register; examples are the conditional branch instructions such as JZ, JNZ, etc.

Therefore, to determine the control signals, we need to know:

 the contents of the control step counter;
 the contents of the instruction register;
 the contents of the condition codes and other status flags.

The external inputs come from the central processing unit and define the status of the CPU and the other devices connected to it. The condition codes/status flags indicate the state of the CPU, for example carry, overflow, zero, etc.

Figure 3.4: Hardwired control unit (a clock-driven step counter and an encoder/decoder that combines the step count, the instruction register, the condition codes and the external inputs to produce the control signals)

A simple block diagram depicts the structure of the control unit, but a detailed view can be understood by going step by step through the design.

The decoder/encoder block is simply a combinational circuit that generates the required control outputs depending on the state of all its inputs.

The step decoder provides a separate line for every control step, and likewise a separate line is provided for every instruction in the instruction register.

The detailed view of the control unit organization is shown in Figure 3.5.

Figure 3.5: Detailed view of the hardwired control unit (a clock-driven step counter with reset feeding a step decoder, an instruction decoder fed by the IR, and an encoder that combines these with the condition codes and external inputs to produce the control signals and the End signal)

The encoder block combines all the inputs to produce the control signals. The encoder/decoder consists of a large number of logic gates that process the input signals to produce each control signal; every output control signal is the result of combining many input signals coming from the various units.

The instruction decoder decodes the instructions coming from the instruction register, and the encoder takes its inputs from the instruction decoder.

Finally, we can say that a hardwired control implementation has fixed blocks and a fixed number of control outputs that depend on the different combinations of inputs.
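
As a rough illustration of this idea (the signal names and the decode table below are invented, not a real processor), hardwired control can be thought of as a fixed combinational function from (step count, opcode, flags) to control signals:

# Hardwired control as a fixed lookup: (step, opcode) -> control signals.
CONTROL_TABLE = {
    (0, "ANY"):  {"MemRead": 1, "IRLoad": 1, "PCIncrement": 1},   # fetch step
    (1, "ADD"):  {"ALUOp": "add", "RegWrite": 1, "End": 1},
    (1, "JZ"):   {"BranchIfZero": 1, "End": 1},
}

def control_signals(step, opcode, zero_flag):
    key = (step, "ANY") if (step, "ANY") in CONTROL_TABLE else (step, opcode)
    signals = dict(CONTROL_TABLE[key])
    if signals.pop("BranchIfZero", 0):       # conditional branch checks the zero flag
        signals["PCLoad"] = 1 if zero_flag else 0
    return signals

print(control_signals(0, "ADD", zero_flag=False))
print(control_signals(1, "JZ",  zero_flag=True))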

3.7 Microprogrammed Control

The other approach to generating the control signals is called microprogrammed control.

Here the control unit is driven by a microprogram: the control signals are produced by a sequence of microinstructions written in a microprogramming language, and the operations are defined by these instructions.

This type of control unit is simple in terms of logic circuitry. A microprogram is like a computer program: in an ordinary program, main memory stores the program and instructions are fetched in a sequence determined by the program counter.

The storage for the microprograms is called the microprogram memory, and the sequence of execution is determined by the microprogram counter (µPC).

A microinstruction is a combination of binary digits, i.e. 0s and 1s. The microinstruction is fetched from the microprogram memory, and the output of the memory drives the control signals: if a control line contains 0, the corresponding control signal is not generated; if it contains 1, the control signal is generated at that instant of time.

There are different terms related to the microprogrammed control. Let us discuss them.

3.7.1 Control Word (CW):

A control word is a group of bits that represent the different control signals; a different combination of zeroes and ones defines a different set of control signals. When we combine a number of control words, the sequence so formed becomes the microprogram for an instruction, so we call these individual control words microinstructions.

As already discussed, these microprograms are stored in a special memory called the microprogram memory. The control unit reads the microprogram from this memory in sequence and produces the control signals corresponding to an instruction. The control words (CWs) are read with the help of the microprogram counter (µPC).

The basic organization of a microprogrammed control unit is shown in Figure 3.6.

The role of the "starting address generator" is to load the microprogram counter with the initial address of the microprogram when the instruction register is loaded with a new instruction. The microprogram counter then reads successive microinstructions on each clock.

Figure 3.6: Microprogrammed control unit (the IR, external inputs and condition codes feed the starting address generator, which loads the clock-driven µPC; the µPC addresses the microprogram memory, which outputs the control word CW)

The condition codes and status flags play a major role in the execution of a few instructions. For example, the execution of a branch instruction needs to skip the ongoing execution sequence and enter a new one, so the designer has to design a control unit that can handle microprogrammed branch instructions.

For this purpose we use conditional microinstructions. These microinstructions give the address to which the microprogram counter has to point, called the branch address, and they also indicate the flags, input bits, etc. that have to be checked. All of this is defined in a single microinstruction. Branch instructions require knowledge of the flag register and the condition codes.

The "starting and branch address generator" takes the control word bits of the microinstruction. These bits indicate the branch address and the condition that has to be fulfilled before the jump/branch actually takes place. The other role of this block is to load the µPC with the address that it generates.

In normal program code, an instruction is first fetched and then executed. The instruction fetch phase is the same in microprogrammed control, but a common microprogram is used to fetch the instruction. This microprogram is located at a different memory location, and executing its microinstructions involves that location.

During the execution of the current instruction, the address of the next one is calculated by the "starting address generator" unit.

The main function of the µPC is to point to the location of the next microinstruction in the sequence; it is incremented every time a microinstruction is fetched. There are a few situations in which the µPC contents are not simply incremented. These are:

1. During the execution of the END microinstruction, the µPC starts pointing to the address of the first CW (the fetch microroutine).
2. When the IR is loaded with a new instruction, the µPC points to the starting address of the microprogram for that instruction.
3. During a branch microinstruction, if the condition is fulfilled, the µPC is loaded with the branch address.

During the execution of the END microinstruction, the microprogram produces an End control signal. With the help of this signal the µPC is loaded with the starting address of the microroutine that fetches the next instruction, which is nothing but the address of the starting CW. Every instruction has a microprogram associated with it.

Therefore, we conclude that microprograms are very similar to computer programs; the only difference is that a microprogram is associated with every instruction. Hence this scheme is called microprogrammed control.
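
A minimal sketch of this scheme (the control-word layout, starting addresses and microprograms below are invented, purely to illustrate how a µPC steps through the microprogram memory):

# Each control word is a tuple of control-signal bits (invented layout):
# (PCWrite, IRLoad, RegWrite, ALUAdd, MemRead)
MICROPROGRAM_MEMORY = {
    0: (1, 1, 0, 0, 1),   # fetch: read memory into IR, bump the PC
    1: (0, 0, 0, 1, 0),   # ADD step 1: ALU adds the operands
    2: (0, 0, 1, 0, 0),   # ADD step 2: write the result to the register file, then End
}
START_ADDRESS = {"FETCH": 0, "ADD": 1}      # role of the "starting address generator"

def run(instruction):
    upc = START_ADDRESS["FETCH"]            # µPC starts at the fetch microroutine
    yield MICROPROGRAM_MEMORY[upc]
    upc = START_ADDRESS[instruction]        # IR loaded: jump to that instruction's microprogram
    while upc in MICROPROGRAM_MEMORY:
        yield MICROPROGRAM_MEMORY[upc]
        upc += 1                            # µPC increments on every clock

for control_word in run("ADD"):
    print(control_word)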

3.8 SINGLE-CYCLE PROCESSOR

To construct the datapath, we connect the state elements with combinational logic that can execute the various instructions. Based on the current instruction, the controller, which contains the combinational decode logic, generates the appropriate control signals. At any given time, the control signals determine which specific instruction the datapath is carrying out.

Figure 3.7: A single-cycle processor takes a full clock cycle to complete an instruction


3.8.1 Single-Cycle Datapath

First the instruction is read from the instruction memory. Figure 3.8 shows that the address lines of the instruction memory are connected to the program counter (PC). The instruction memory outputs a 32-bit instruction, labelled Instr. The behaviour of the processor depends on the specific instruction that was fetched. We will show how the datapath connections work for the lw instruction. For an lw instruction, the next step is to read the source register containing the base address; the register is specified in the rs field of the instruction, Instr25:21.

Figure 3.8: Address lines of the instruction memory


The address input A1 of the register file is connected to these bits of the instruction, as shown in Figure 3.9. The register file reads the register value onto RD1.

Figure 3.9: Address lines of the instruction memory connected to address input A1 of register

The 16-bit immediate data may be positive or negative, so it must be sign-extended to 32 bits, as shown in Figure 3.10. We denote the 32-bit sign-extended value as SignImm.

Figure 3.10: Sign-extended immediate data


To read from the data memory, the processor must first know the address. It computes this address by adding the base address and the offset, and an ALU is required to perform this addition, as shown in Figure 3.11. SrcA and SrcB are the two operands: SrcA comes from the register file and SrcB from the sign-extension unit. The ALU can perform several functions; ALUControl is a 3-bit signal that directs the ALU to perform a particular one. The ALU produces a 32-bit output, denoted ALUResult, along with a flag denoted Zero, which indicates whether the ALU result is zero. While executing the lw instruction, ALUControl is set to 2 (binary 010) so that the ALU adds the base address and the offset. The ALU output, ALUResult, is fed to the data memory, where it serves as the address for the load instruction, as shown in Figure 3.11.

Figure 3.11: Addition of ALU in the datapath


After the data memory is read, the data is placed on the ReadData bus, and at the end of the cycle it is written into the destination register in the register file, as shown in Figure 3.12.

Figure 3.12: ReadData is put into the register file

The rt field specifies the destination register for the lw instruction and is connected to the address input A3 of the register file. The write data input of port 3 of the register file, WD3, must be connected to the ReadData signal of the data memory. The write enable input of port 3, WE3, is connected to the RegWrite control signal. While executing an lw instruction, RegWrite is asserted so that the data value is written into the register file. The write occurs on the rising edge of the clock at the end of the cycle.

While executing an instruction, the processor must also compute the address of the next instruction, PC'. Because instructions are 32 bits (4 bytes) long, the next instruction is located at PC + 4, so an adder is used to increment the PC value by 4. On the next rising edge of the clock, this new address is loaded into the program counter so that the processor can fetch the next instruction. This completes the datapath for the lw instruction.
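To tie the walk-through together, the following Python sketch models the behaviour of the lw datapath. The field positions and the opcode value 0x23 follow the usual MIPS encoding and are assumptions here; the data memory is modelled as a simple dictionary indexed by byte address.

# Illustrative model of the single-cycle lw datapath (a sketch, not a hardware description)
def sign_extend16(imm):
    # Sign-extend a 16-bit immediate to a Python integer
    return imm - 0x10000 if imm & 0x8000 else imm

def execute_lw(instr, pc, reg_file, data_mem):
    rs  = (instr >> 21) & 0x1F        # Instr[25:21] : base register (rs field)
    rt  = (instr >> 16) & 0x1F        # Instr[20:16] : destination register (rt field)
    imm = instr & 0xFFFF              # Instr[15:0]  : offset
    src_a      = reg_file[rs]         # RD1: base address read from the register file
    sign_imm   = sign_extend16(imm)   # SignImm
    alu_result = src_a + sign_imm     # ALUControl = 010 (add) -> memory address
    read_data  = data_mem[alu_result] # data memory read -> ReadData
    reg_file[rt] = read_data          # write back on the clock edge (WE3 asserted)
    return pc + 4                     # PC' = PC + 4

regs = [0] * 32
regs[9] = 0x100
mem = {0x104: 42}
lw_instr = (0x23 << 26) | (9 << 21) | (8 << 16) | 4   # lw $8, 4($9), assumed encoding
pc = execute_lw(lw_instr, 0, regs, mem)
print(regs[8], pc)                   # 42 4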

Figure 3.13: ReadData is put into the register file

Like the lw instruction, the datapath for the sw instruction can be designed. As in lw, the sw instruction reads a base address from the first port of the register file and sign-extends the immediate; the ALU adds the base address to the immediate to find the memory address. In addition, the sw instruction reads a second register from the register file and writes its value to the data memory. Figure 3.14 shows the new connections for this function. The register is specified in the rt field, Instr20:16; these bits of the instruction are connected to the second read port of the register file, A2, and the register value is read onto the RD2 port.

Figure 3.14: Datapath for sw instruction

The enhanced datapath for handling R-type instructions is shown in Figure 3.15. The register file reads two registers, and the ALU performs an operation on them. Until now the ALU has always received its second operand, SrcB, from the sign-extended immediate (SignImm), so a multiplexer is added to select SrcB from either the register file RD2 port or SignImm. A new signal, ALUSrc, controls this multiplexer: if ALUSrc is 0, SrcB comes from the register file; it is 1 to choose SignImm for lw and sw. In R-type instructions, ALUResult is written to the register file, so another multiplexer is added to choose between ReadData and ALUResult; its output is denoted Result. This multiplexer is controlled by the MemtoReg signal: MemtoReg is 0 for R-type instructions, to choose Result from ALUResult, and 1 to choose ReadData for lw. We do not care about the value of MemtoReg for sw, because sw does not write to the register file. A further multiplexer is added to choose WriteReg, controlled by the RegDst signal: if RegDst is 1, WriteReg is chosen from the rd field; it is 0 to choose the rt field for lw.

Figure 3.15: Enhanced datapath for R-type instructions

Next, the datapath is extended for the beq instruction. The beq instruction compares two registers; if they are equal, the branch is taken by adding the branch offset to the program counter, and the result becomes the new PC. The offset is stored as a number of instructions, so it is shifted left by 2 to convert it into a byte offset before the addition. Other elements are added to the datapath for this purpose, for example a fourth multiplexer, a unit that shifts the offset left by 2, an adder for the branch target and additional control signals.

With this the single cycle datapath is complete.

Figure 3.16: Complete single cycle datapath

3.8.2 Single-Cycle Control

The function of the control unit is to generate the control signals. Its outputs are based on the opcode and also on the funct field of the instruction. The opcode is the main source of the control signals; the funct field is used by R-type instructions and tells the processor which ALU operation to perform. The control unit is therefore divided into two blocks of combinational logic, shown in Figure 3.16. The main decoder computes most of the outputs from the opcode and also produces a 2-bit ALUOp signal.
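As a rough illustration of what the main decoder computes, the table below is written out as a small Python dictionary. The signal values follow the common MIPS single-cycle design (for example in Harris and Harris) and should be read as an assumed example rather than a verbatim copy of Figure 3.16.

# Assumed main-decoder outputs per opcode (None marks a "don't care")
MAIN_DECODER = {
    'R-type': dict(RegWrite=1, RegDst=1,    ALUSrc=0, Branch=0, MemWrite=0, MemtoReg=0,    ALUOp='10'),
    'lw':     dict(RegWrite=1, RegDst=0,    ALUSrc=1, Branch=0, MemWrite=0, MemtoReg=1,    ALUOp='00'),
    'sw':     dict(RegWrite=0, RegDst=None, ALUSrc=1, Branch=0, MemWrite=1, MemtoReg=None, ALUOp='00'),
    'beq':    dict(RegWrite=0, RegDst=None, ALUSrc=0, Branch=1, MemWrite=0, MemtoReg=None, ALUOp='01'),
}
# The second block of logic then combines ALUOp with the funct field of
# R-type instructions to produce the 3-bit ALUControl signal, e.g. 010 for add.
print(MAIN_DECODER['lw'])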

Figure 3.16: Single cycle control

3.8.3 The complete single cycle processor:

Figure 3.17: Complete single cycle processor

3.9 MULTICYCLE PROCESSOR

The single-cycle processor has several weaknesses. First, the clock cycle must be long enough for the slowest instruction (lw). Second, it requires several costly adders. Third, it needs separate instruction and data memories. These limitations are removed in a multicycle processor: each instruction is broken into shorter steps, different instructions take different numbers of steps, only one adder is required, and a combined instruction and data memory is used.

3.9.1 Multicycle Datapath

Figure 3.17 Multicycle datapath design elements

In an lw instruction, the base address is read from the source register, which is specified by the rs field of the instruction.

Figure 3.17 Program counter selecting instruction location

The register file has several address inputs. The address input A1 is connected to the rs field of the Instr output, as shown in Figure 3.18. The register file reads the selected register onto RD1, and a nonarchitectural register A is used to store this value.

Figure 3.18 Flow of signal into register file

The lw instruction also needs an offset, which is stored in the immediate field. This offset is sign-extended to 32 bits, as Figure 3.19 shows.

Figure 3.19 Apply sign extend

SignImm is the 32-bit sign-extended immediate. Because SignImm is a combinational function of Instr, it will not change while the current instruction is being processed, so there is no need to dedicate a register to hold this value.

The load address is computed by adding the base address to the offset. This is done by using
an ALU, as shown in Figure 3.20.

Figure 3.20: Computing load address

The addition is performed when ALUControl is set to 010. A register called ALUOut is used to store ALUResult. After the address has been calculated, the data is loaded from that address in memory. The memory address is selected by a newly added multiplexer in front of the memory, whose output is Adr, as shown in Figure 3.21.

Figure 3.21: Selection of memory address

A signal called IorD selects either an instruction address or a data address. Another nonarchitectural register stores the data read from memory. In the first step the instruction is fetched, with the memory addressed by the PC; later, the calculated address is used to access the data.

Next we write the data back to the register file. This is illustrated in Figure 3.22.

Figure 3.22 Write back data into register file

During the execution of the instruction, the processor must also update the program counter by adding 4 to the PC. The multicycle processor differs from the single-cycle processor in the way it uses its ALU: multiplexers are added so that the existing ALU can be reused for different operations, as shown in Figure 3.23.

Figure 3.23 Reusing existing ALU

A 2-input multiplexer, controlled by ALUSrcA, selects either the PC or register A as SrcA. A four-input multiplexer, controlled by ALUSrcB, chooses either 4 or SignImm (among its inputs) as SrcB. SrcA (the PC) is added to SrcB (4), and the program counter register is updated with this value; the PCWrite signal enables the PC to be written.

The sw instruction differs from lw in that it reads a second register from the register file and writes it into the memory, as shown in Figure 3.24.

Figure 3.24 Datapath for sw instruction

The rt field, Instr20:16, specifies this register and is connected to the second read port of the register file. The value read is stored in register B, a nonarchitectural register, which drives the write data port (WD) of the memory. The MemWrite signal controls the memory write.

Figure 3.25 Datapath for R-type instructions

For R-type instructions, two more multiplexers are added. Two source registers are read from the register file; one input of the SrcB multiplexer selects register B so that the second source register can be used by the ALU, as shown in Figure 3.25. The ALU performs the computation and the result is stored in ALUOut. Next, ALUOut is written back to the register specified by the rd field. Another multiplexer, controlled by the MemtoReg signal, selects whether WD3 is driven by ALUOut (for R-type instructions) or by Data (for lw).

Similarly the beq instruction is carried out.

Figure 3.26 Datapath for beq instructions

Additional components, such as a shift-left-by-2 unit, a multiplexer and the branch control logic, are added. The address of the next instruction is also calculated: since each instruction is 32 bits (4 bytes) wide, the address of the next instruction is computed by adding 4 to the PC, so PC' = PC + 4. PCSrc is a control signal that selects the source of the next PC value.
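To make the idea of "different numbers of steps per instruction" concrete, the following sketch lists a typical step breakdown for this style of multicycle design. The exact split is an assumption and may differ slightly from the figures.

# Assumed multicycle step breakdown (one clock cycle per step)
STEPS = {
    'lw':     ['Fetch', 'Decode', 'MemAdr (ALU: base + SignImm)', 'MemRead', 'MemWriteBack'],
    'sw':     ['Fetch', 'Decode', 'MemAdr (ALU: base + SignImm)', 'MemWrite'],
    'R-type': ['Fetch', 'Decode', 'Execute (ALU)', 'ALUWriteBack'],
    'beq':    ['Fetch', 'Decode', 'Branch (compare and update PC)'],
}
for op, steps in STEPS.items():
    print(f"{op:7s}: {len(steps)} cycles -> {', '.join(steps)}")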

3.9.2 Multicycle Control

The processor computes the control signals based on the opcode, as in the single-cycle processor, and it also makes use of the funct field. Figure 3.27 shows the multicycle control, and the complete multicycle processor is shown in Figure 3.28.

Figure 3.27 Multicycle control

Figure 3.28: Complete multicycle processor

3.10 Pipelining

CPU performance can be improved by modifying the CPU organization. We studied the impact of using a number of registers in place of only one accumulator; the use of cache memory is also very important for improving performance.

In addition, another technique known as pipelining is used. Pipelining helps the designer improve processor performance by exploiting parallelism.

For this purpose an instruction is broken down into small tasks, and the different tasks are executed in different hardware elements. In the simplest view, an instruction is executed in two phases, instruction fetch and instruction execution, which the CPU performs one after the other.

For every instruction there are a fetch step and an execute step associated with it. Suppose Fi and Ei are the two steps associated with instruction Ii. These fetch and execute steps are shown in figure 3.29.

Time →   F1  E1  F2  E2  F3  E3  F4  E4  ........   (instructions I1, I2, I3, I4 executed one after another)

Figure 3.29: Fetch and execute for different instructions

It is clear from the figure that the two operations fetch and execute are performed one after
the other for every instruction. Hence for the execution of instruction I1, the fetch operation
must have been completed. And also second instruction fetch operation can only take place
after the completion of the first one.

Let us suppose the processor has two hardware units, one dedicated to fetching and the other to execution. The fetch unit fetches an instruction and stores it in a storage buffer B1, and the execution unit executes that instruction. While the execution unit is executing, the fetch unit starts fetching the second instruction and stores it.

Besides these two operations there are other operations like decode, operand read and result write-back. Therefore the execution of an instruction can be divided into the following parts:

 Fetch: Fetch the expected instruction into a buffer


 Decode: Decode the instruction to get the opcode
 Read: Fetch the operands from data memory
 Execute: Perform the operation
 Write Back: Write the result back into memory

The detailed operation of the computer proceeds as follows:

 In the first cycle, the instruction is fetched from the memory location by the fetch unit
and stored in an intermediate buffer
 After this, the fetch unit starts fetching the second instruction from the program
memory
 While the fetch unit is fetching second instruction, the decode unit starts decoding the
first instruction
 Hence in the second cycle, the decoding of first instruction and the fetching operation
of second instruction are done
 In the third cycle the fetch unit fetches the third instruction, the decoding unit decodes
the second instruction and the read unit reads the operands for the first instruction
from the data memory

 In the fourth cycle the fetch unit fetches the fourth instruction, the decoding unit
decodes the third instruction, the read unit reads the operands for the second
instruction and the execute unit executes the first instruction
 In the fifth cycle the fetch unit fetches the fifth instruction, the decoding unit decodes
the fourth instruction, the read unit reads the operands for the third instruction and the
execute unit executes the second instruction and the write back unit performs the
operation of writing the results of first instruction back into the memory
 Hence the first instruction is completed in five cycles and besides this different
operations on other instructions have also been performed
 This parallelism continues until the completion of the last instruction in the sequence

This approach is very helpful in improving the performance of the CPU, as shown in figure 3.30.
       Clock1    Clock2    Clock3    Clock4    Clock5     Clock6     Clock7     Clock8     Clock9

I1     Fetch     Decode    Read      Execute   WriteBack
I2               Fetch     Decode    Read      Execute    WriteBack
I3                         Fetch     Decode    Read       Execute    WriteBack
I4                                   Fetch     Decode     Read       Execute    WriteBack
I5                                             Fetch      Decode     Read       Execute    WriteBack

Figure 3.30: Pipelining

Looking at the timing diagram of the pipelined processor, we see five instructions, each requiring five clock cycles for its completion. If all the instructions were executed sequentially without pipelining, they would require 5 × 5 = 25 clock cycles. With pipelining, all the instructions complete in just 9 clock cycles, as shown in figure 3.30.
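The two cycle counts can be reproduced with a small calculation, assuming an ideal five-stage pipeline with no stalls:

# Clock cycles needed to execute n instructions on a 5-stage pipeline
def cycles_sequential(n, stages=5):
    return stages * n               # no overlap: each instruction takes all 5 stages

def cycles_pipelined(n, stages=5):
    return stages + (n - 1)         # fill the pipe once, then one completion per cycle

print(cycles_sequential(5))         # 25 clock cycles
print(cycles_pipelined(5))          # 9 clock cycles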

3.10.1 Block Diagram of Pipelined Processor:

The complete pipelined processor is shown in figure 3.31.

Figure 3.31: Pipelined Processor

3.11 Multicore Processors:

A single-core processor has only one CPU core to perform all the computations. This core consists of the register file, ALU, control unit and other components. The single-core processor is shown in figure 3.32.

Figure 3.32: Single Core Processor

Only one CPU core performs all the computations in the single-core architecture. A multi-core processor uses the concept of multiprocessing: two or more cores are present on a single physical chip. These cores can share caches and may pass messages to each other. A common example is the dual-core processor. In a dual-core system the chip consists of two computing cores, usually two identical processors on a single die, and each core has its own path to the front-side bus. Multi-core can therefore be thought of as an expanded version of dual-core technology.

In a dual-core processor there are two execution cores, each with its own front-side bus interface. The individual caches of the cores enable the operating system to exploit parallelism and thereby improve multitasking. The operating system and software are optimized to exploit thread-level parallelism, which refers to running multiple threads at the same time. A thread is a small portion of the operating system or an application program that can be executed independently of any other part.

Figure 3.33: Dual Core Processor

If the operating system supports thread-level programming, we can see the benefits of dual-core processors even if the application program does not. For example, in Microsoft Windows XP we can work on multiple applications simultaneously: we can browse the Internet while MS Office runs in the background and Media Player plays music, and a dual-core processor can handle all of this at the same time. Nowadays, most operating systems and application software support thread-level programming.
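As a small software-level illustration of thread-level parallelism (a Python sketch, independent of any particular processor), two independent tasks can be created as threads, and the operating system is then free to schedule them on different cores:

# Two independent threads that the operating system may schedule on separate cores
import threading

def task(name, n):
    total = sum(range(n))           # some work that does not depend on the other thread
    print(f"{name} finished, sum = {total}")

t1 = threading.Thread(target=task, args=("thread-1", 100_000))
t2 = threading.Thread(target=task, args=("thread-2", 200_000))
t1.start(); t2.start()              # both threads are now runnable in parallel
t1.join(); t2.join()                # wait for both to finish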

There are two types of multi-core processors: symmetric multi-core and asymmetric multi-
core. In a symmetric multicore processor, the single IC consists of identical CPU cores which
have similar design and similar features. On the other hand an asymmetric multi-core
processor is one that has multiple cores on a single IC chip, but these cores have different
designs.

Applications of multi-core processors include embedded systems, network devices, digital signal processing and graphics. The performance of a multi-core processor can be increased enormously by suitable software algorithms. The use of multi-core processing is growing at a rapid pace because the performance of single-core processors is not sufficient for complex and fast computations. During computation, cores are shared dynamically among the applications. To exploit the performance of a multicore processor, programmers should use multiple threads or processes.

3.12 Test Yourself:

Q 1: Explain the concept of instruction pipeline.


Answer: Refer section 3.10

Q 2: How many clock cycles are needed to execute d instructions on a pipelined processor compared with a processor without pipelining?

Answer: Assuming a five-stage pipeline, for a pipelined processor:

No. of clock cycles = d + 4

For a non-pipelined processor:

No. of clock cycles = d × 5

For example, for d = 5 instructions this gives 9 clock cycles with pipelining against 25 without it (see figure 3.30).

Q 3: What are the major components of CPU?

Answer: ALU, Control Unit, Register Memory, Data Memory, Instruction Memory and
program counter are the main parts of the CPU.

Q 4: What is the overall function of a processor's control unit?

Answer: The function of the control unit is to decode the binary machine word in the IR
(Instruction Register) and issue appropriate control signals. These cause the computer to
execute its program.
The control signals are to be generated in the proper sequence so that the instructions can be
executed in a proper way. The control signals are generated with the help of the internal logic
circuitry of the control unit.

Q 5: What are the main two phases of instruction execution?

Answer: An instruction is executed in two phases, instruction fetch and instruction execution, which the CPU performs one after the other.

Q 6: What is hardwired architecture and why it is called hardwired?

Answer: Refer section 3.5

Q 7: What is the advantage of using multicore processors?

Answer: Refer section 3.11

Chapter 4

Pipelined Processors

Structure
4.0 Objectives
4.1 Introduction
4.2 Structural Hazards
4.3 Data Hazards
4.4 Branch/Control Hazards
4.5 Test Yourself

4.0 Objectives

After studying this chapter one will understand


 What are the different types of Hazards in processor?
 What is a Structural Hazard?
 Data Hazards and methods to reduce their effect
 Detailed illustration of Branch/Control Hazards and the mechanisms to minimize their role in the processor

4.1 Introduction

Sometimes there occur instances that prevent the next instruction from executing during its designated clock cycle. These are called hazards. Hazards can greatly reduce the performance of a pipelined processor. The following three types of hazards occur:

1. Structural hazards occur when the hardware is not sufficient to support all possible combinations of instructions in overlapped execution.
2. Data hazards occur when one instruction depends on the result of a previous one, but the previous instruction has not yet produced that result while both are in the pipeline at the same time.
3. Control hazards occur when instructions change the value of internal control registers such as the PC, as branches do.

If a hazard occurs, the pipeline may have to stall: some instructions are allowed to proceed while others are delayed, and the fetching of new instructions is stopped until the earlier ones are cleared.

4.2 Structural Hazards

When a processor is pipelined, the functional units themselves need to be pipelined and some resources need to be duplicated so that structural hazards do not occur; this allows instructions to overlap in any combination. Sometimes, due to a lack of resources, an instruction cannot be executed in its designated cycle, and a hazard occurs. Structural hazards commonly arise because a functional unit is not fully pipelined: a sequence of instructions that all try to use the non-pipelined unit cannot proceed at the rate of one instruction per clock cycle. Likewise, when resources are not duplicated sufficiently, some combinations of instructions cannot be executed together. For example, if there is only one ALU but at a certain time the processor tries to use it for two additions in the same clock cycle, the processor is said to have a structural hazard.

Figure 4.1: Pipeline Hazards

When such a condition occurs, the pipeline stalls one of the instructions until the earlier one has been executed and the ALU is released for use by the next instruction. This kind of hazard increases the cycle count per instruction. Sometimes a processor uses the same memory for data and instructions. When one instruction has to access the data memory while a new instruction is being fetched from the instruction memory, the two accesses conflict because of pipelining, as shown in figure 4.1. To resolve this, the second instruction again has to wait until the first has completed its access. The stall lasts one clock cycle and therefore also increases the CPI.

For a processor having the structural hazard, the average instruction time is

Average instruction time = CPI × Clock cycle time
                         = (1 + 0.4 × 1) × (Clock cycle time_ideal / 1.05)
                         ≈ 1.3 × Clock cycle time_ideal

Here it is assumed, as in the standard example, that data references make up 40% of the instruction mix, each such reference causes a one-cycle stall, and the processor with the structural hazard has a clock rate 1.05 times higher than the ideal processor. From this equation we find that the ideal pipelined processor is faster; the ratio of average instruction times shows that the processor without the structural hazard is about 1.3 times faster than the processor with it.
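The same calculation can be written out explicitly. A minimal sketch, using the assumptions stated above (40% data references, a one-cycle stall per reference, and a 1.05 times faster clock for the machine with the hazard):

# Structural-hazard example: how much faster is the ideal machine?
ideal_cpi     = 1.0
data_ref_freq = 0.4        # assumed fraction of instructions that reference data
stall_cycles  = 1          # assumed stall per conflicting data reference
clock_ratio   = 1.05       # hazard machine's clock is assumed 1.05x faster

cpi_hazard  = ideal_cpi + data_ref_freq * stall_cycles          # 1.4
time_hazard = cpi_hazard / clock_ratio                          # in units of the ideal cycle time
time_ideal  = ideal_cpi * 1.0
print(round(time_hazard / time_ideal, 2))                       # ~1.33, i.e. about 1.3x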

To avoid this kind of structural hazard, the cache memory can be separated into two parts, a data cache and an instruction cache, each of which can be accessed separately. Another method is to use a set of buffers, called instruction buffers, whose main function is to hold instructions. All else being equal, a pipelined processor without structural hazards will always have a lower CPI.

If structural hazards are so undesirable, why would a designer build a processor that is not totally free of them? The reason is cost: a fully ideal pipelined processor would cost much more than one that tolerates some structural hazards, because avoiding them requires duplicating functional units and providing separate data and instruction memories, which increases the amount of hardware.

4.3 Data Hazards

We know that the timing of instructions changes when pipelining is introduced, because instruction execution is overlapped. Due to this overlapping, instructions that refer to the same data may conflict; this problem is called a data hazard. Seen from outside the pipeline, the instruction sequence appears to execute in order, but the relative order of reads and writes can actually change, and this changes the operand accesses.

Let us take the following example.

DADD R1, R2, R3


DSUB R4, R1, R5
AND R6, R1, R7
OR R8, R1, R9
XOR R10, R1, R11

Here we see that the instructions following DADD use R1, which is the output of the first instruction. In the first instruction, R1 is written after the operands are added. It is clear from the example that the value of R1 is written by the DADD instruction in the WB pipe stage, but it is read in the ID stage by the DSUB instruction. This reading of data before it has been written creates a problem called a data hazard. If this happens, the second instruction will read the wrong value from register R1 and the results will be wrong, so the designer should take steps to eliminate such problems.
It is also not always the case that the following instructions read the wrong (old) value of R1, i.e. the value assigned before the DADD instruction. For example, if an interrupt occurs before the execution of the DSUB instruction completes, then

the WB stage of the DADD will complete before DSUB reads R1. Since the DADD has then completed, the right value of R1 will be available for the DSUB instruction. This kind of behaviour gives unpredictable results, which is unacceptable. The other instructions are also affected by this data hazard: R1 is not written until the end of the 5th clock cycle, so instructions that read R1 in earlier cycles get the old value. The XOR and OR instructions, however, execute correctly. The XOR operates properly because it reads its registers in the 6th clock cycle, after R1 has been written. The OR instruction is also free of the hazard because the register file is written in the first half of the clock cycle and read in the second half.

Conditions which lead to data hazards are:

1. read after write (RAW), a true dependency


2. write after read (WAR), an anti-dependency
3. write after write (WAW), an output dependency

Suppose there are two instructions i1 and i2, then following things may happen:
A. Read After Write (RAW)
(i2 tries to read a source before i1 writes to it)
A read-after-write (RAW) data hazard refers to a situation where an instruction needs a result that has not yet been calculated. This can happen when the instruction comes after a previous instruction in program order, but the previous instruction has not yet completed in the pipeline.
Example
i1. R2 <- R1 + R3
i2. R4 <- R2 + R3
Here in the first instruction, the value which is to be saved in register R2 is calculated, and
this value is to be used by the second instruction. But in a pipeline, while fetching the
operands in the second instruction, the output of the first instruction has not been saved yet.
This results in a data dependency. Here i2 is data dependent on i1.
We say that there is a data dependency with instruction i2, as it is dependent on the
completion of instruction i1.
B. Write After Read (WAR)
(i2 tries to modify a destination before it is read by i1)
Example
i1. R4 <- R1 + R5
i2. R5 <- R1 + R2
If it happens that instruction i2 may execute before i1 (for example in parallel execution), we must take care that the new value of register R5 is not stored before i1 has read the old value, i.e. before i1 completes.
C. Write After Write (WAW)
(i2 tries to write an operand before it is written by i1)

Example
i1. R2 <- R4 + R7
i2. R2 <- R1 + R3
We must take care that the write-back of i2 is not performed until the write of i1 has finished; otherwise R2 would be left holding the older value.
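These three cases can be detected mechanically by comparing the destination and source registers of the two instructions. A minimal sketch (the register names are just those of the examples above):

# Classify hazards between an earlier instruction i1 and a later instruction i2.
# Each instruction is described by (destination_register, set_of_source_registers).
def classify_hazards(i1, i2):
    d1, s1 = i1
    d2, s2 = i2
    hazards = []
    if d1 in s2:
        hazards.append('RAW (true dependency)')
    if d2 in s1:
        hazards.append('WAR (anti-dependency)')
    if d1 == d2:
        hazards.append('WAW (output dependency)')
    return hazards

# i1: R2 <- R1 + R3 ; i2: R4 <- R2 + R3
print(classify_hazards(('R2', {'R1', 'R3'}), ('R4', {'R2', 'R3'})))  # RAW
# i1: R4 <- R1 + R5 ; i2: R5 <- R1 + R2
print(classify_hazards(('R4', {'R1', 'R5'}), ('R5', {'R1', 'R2'})))  # WAR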
4.3.1 Minimizing Data Hazard

The problem of data hazards can be reduced by using a hardware technique known as forwarding, also called bypassing or short-circuiting. The key insight behind forwarding is that the DSUB instruction does not really need the result of the DADD instruction until the DADD has actually produced it.

Figure 4.2: Minimizing hazards

Forwarding works by moving the result of the DADD instruction from the pipeline register where it is held to the point where the DSUB instruction requires it. By doing this, the designer can avoid the need for a stall. With this principle in mind, forwarding operates as follows:

1. The ALU results held in the pipeline registers EX/MEM and MEM/WB are fed back to the inputs of the ALU.
2. If the forwarding logic detects that the previous ALU operation has written the register corresponding to a source of the current ALU operation, the control logic selects the forwarded result as the ALU input rather than the value read from the register file.

Note that if the DSUB is stalled, forwarding is not needed, because the DADD will already have completed by the time DSUB executes. Similarly, if an interrupt occurs between the two instructions, forwarding is again unnecessary, because the DADD instruction completes before the DSUB resumes. As Figure 4.3 makes clear, the designer must forward results not only from the immediately previous instruction but also, possibly, from an instruction that started two cycles earlier.

Figure 4.2 shows how the bypass paths are placed and the relative timing of register reads and writes. With these paths in place, the following code can be executed without the need for stalls.

Forwarding can be generalized so that a result is passed directly to whichever functional unit requires it: the result is forwarded from the pipeline register at the output of one unit to the input of another unit, rather than only from the output of a unit back to its own input. Consider the following example:

DADD R1, R2, R3


LD R4, 0(R1)
SD R4, 12(R1)

Figure 4.3: Minimizing hazards

In this case, a stall can be avoided if we forward the results held in the ALU pipeline registers to the ALU inputs, and the results held in the memory-unit pipeline registers to the input of the data memory, as Figure 4.3 shows.
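The control logic behind forwarding can be sketched as a pair of comparisons against the destination registers held in the EX/MEM and MEM/WB pipeline registers. The following is a simplified model in the spirit of the classic MIPS forwarding unit; the signal names are assumptions, not the exact names used in the figures.

# Simplified forwarding-unit sketch (assumed signal names)
def forward_select(src_reg, exmem_regwrite, exmem_rd, memwb_regwrite, memwb_rd):
    # Decide which value should drive the ALU input for source register src_reg:
    #   'EX/MEM'  - forward the result of the instruction one cycle ahead
    #   'MEM/WB'  - forward the result of the instruction two cycles ahead
    #   'REGFILE' - no forwarding needed, use the value read from the register file
    if exmem_regwrite and exmem_rd != 0 and exmem_rd == src_reg:
        return 'EX/MEM'
    if memwb_regwrite and memwb_rd != 0 and memwb_rd == src_reg:
        return 'MEM/WB'
    return 'REGFILE'

# DSUB needs R1 while DADD's result is still sitting in the EX/MEM register:
print(forward_select(src_reg=1, exmem_regwrite=True, exmem_rd=1,
                     memwb_regwrite=False, memwb_rd=0))          # EX/MEM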

4.4 Branch/ Control Hazards

The effect of control hazards can be even greater than that of data hazards in terms of performance loss. The execution of a branch may or may not change the program counter to something other than its current value plus 4. If the branch changes the PC to its target address, it is called a taken branch; if it does not, execution continues with the sequential instructions and the branch is said to be not taken. If instruction i is a taken branch, the value of the PC is normally not changed until the end of ID, after the address calculation and the comparison have completed.

Figure 4.4 shows the simplest method of dealing with branches: once the branch is detected during ID, the instruction fetch following the branch is redone, this time at the address the branch points to. Because no useful work is done in the first IF cycle, it acts as a stall in the pipeline. Note that if the branch is untaken, the repetition of the IF stage is unnecessary, since the correct instruction has already been fetched.

A stall of just one cycle for every branch can cost a pipelined processor roughly ten to thirty percent in performance, so a designer always checks this loss and uses techniques to reduce it.

Figure 4.4: Branch hazards

4.4.1 Reducing Pipeline Branch Penalties

The designer can use many techniques while designing the processor to reduce pipeline stalls due to branch delays. Here we discuss four techniques that deal with branch delays. Each has a fixed action to be taken for a branch; because the actions are fixed, they do not change during execution for a given kind of branch. The compiler can reduce the penalty caused by a branch by exploiting knowledge of the hardware scheme and of branch behavior.

Freeze or flush is a simple scheme that deals with branches by holding or deleting any instructions fetched after the branch until the branch destination is known. Once the destination is known, the held instruction is released or the correct instruction is re-fetched from the instruction sequence. This technique is simple to implement in both hardware and software. Its drawback is that the branch penalty is fixed and cannot be reduced by software.

Another scheme for dealing with branch hazards is to predict the branch as not taken. It performs slightly better than the freeze-or-flush scheme, but it is also somewhat more complex. In this technique the hardware simply continues to execute instructions in the sequence in which they are written, as if the branch were not taken. Care must be taken that the processor state is not changed until the branch outcome is definitely known. The drawback is that if an instruction does change the processor state, it can be tedious to undo that change; otherwise the branch hazard becomes more complex and pipeline performance can be affected enormously.

This predicted-not-taken (or predicted-untaken) technique is simple and straightforward. In the five-stage pipeline it treats the branch like an ordinary instruction and does not fetch from the branch target, so the sequential instructions continue to execute as if nothing were out of order. But if the branch turns out to be taken, the instruction already fetched must be turned into a no-operation (NOP) and the instruction at the target address must be fetched instead. Figure 4.5 shows this kind of situation.

Figure 4.5: Reducing Branch Penalties

The third technique is to predict every branch as taken. As soon as the branch is decoded and the target address is computed, the branch is treated as taken and the instruction at the target address is fetched and executed. In our five-stage pipeline this scheme is not very useful, because the target address is not known any earlier than the outcome of the branch. A few processors, however, have condition codes that are set implicitly, or branch conditions that are slower but more powerful; in such processors the target address is known before the branch outcome, and a predicted-taken scheme can make sense. In both the predicted-taken and predicted-not-taken schemes, the compiler can improve performance by organizing the instructions so that the most frequent path matches the prediction made by the hardware.

Processor performance can be improved considerably by a fourth technique used while designing the pipeline, called the delayed branch. A few processors make use of this scheme; early RISC processors used it widely. In a delayed branch, the execution order with a branch delay of one is

branch instruction
sequential successor1
branch target if taken

The sequential successor sits in the branch delay slot and is executed whether or not the branch is taken. Longer branch delays are possible, but in practice most processors that have a delayed branch use a single-instruction delay.

4.4.2 Reducing the Cost of Branches through Prediction

As the complexity of the pipelines increases and the penalty of branches in terms of the
performance also increases, the use of the techniques like delayed branches, etc. is not
enough to handle this complexity. So we move on to new techniques that are more efficient
and more accurate for predicting branches. These techniques are divided into two classes: 1) static techniques, which are cheap and rely on compile-time information, and 2) dynamic branch prediction techniques, which are based on the run-time behavior of the program.

4.4.3 Static Branch Prediction

Branch prediction at compile time can be made more accurate by using profile information collected from earlier runs of the program. The principle behind this static prediction is that branches are often biased: some branches are taken almost all of the time, while others are almost never taken. In the basic approach the same input is used for profiling and for the later runs, but studies show that changing the input between runs has only a small effect on the accuracy of profile-based prediction. For any branch prediction technique to succeed, two things matter: a) the accuracy of the technique and b) the frequency of conditional branches. The main drawback of this prediction technique is that the number of mispredictions for integer programs is higher, and the branch frequency in such programs is also high.

4.4.4 Dynamic Branch Prediction and Branch-Prediction Buffers

A branch-prediction buffer, also called a branch history table, is the simplest dynamic branch-prediction technique. A small memory is provided as a buffer and is indexed by the lower portion of the address of the branch instruction. Each entry contains a bit indicating whether the branch was recently taken or not. The buffer has no tags associated with it. It helps improve processor performance by reducing the branch delay.

With such a buffer we do not actually know whether the prediction belongs to the current branch: the bit may have been put there by another branch that happens to have the same low-order address bits. This does not matter, however. The prediction is simply a hint that is assumed to be correct, and the instruction is fetched from the predicted direction. If the hint turns out to be wrong, the prediction bit is inverted and stored back.

Recalling from the cache, we can imagine that the buffer used is like a cache memory. We
can also imagine that the access done to this cache is always a hit.

The 1-bit prediction scheme has a performance shortcoming: even if a branch is almost always taken, we will likely mispredict twice, rather than once, when it is not taken, because the single misprediction flips the prediction bit and the following (taken) execution of the branch is then mispredicted as well before the bit is restored.

To remedy this problem with 1-bit prediction, a 2-bit prediction scheme is most often used. A 2-bit scheme requires two successive mispredictions before the prediction is changed. The 2-bit prediction technique is shown in Figure 4.6.

Figure 4.6: 2-bit branch prediction

Once an instruction is detected as a branch and is predicted as taken, fetching begins from the target as soon as the target PC value is known. If the branch is predicted as not taken, fetching and execution simply continue sequentially. As shown above, the prediction bits are changed only when the prediction turns out to be wrong.

4.4.5 The states in a 2-bit prediction scheme:

With 2-bit prediction, mispredictions are reduced compared with 1-bit prediction. The two bits encode four states: 00, 01, 10 and 11. In general, the 2-bit scheme can be extended to an n-bit saturating counter that can take the values 0 to 2^n − 1. The branch is predicted as taken when the counter value is greater than or equal to half of the maximum value (2^n − 1); otherwise it is predicted as untaken. Since 2-bit predictors perform almost as well as n-bit predictors, most prediction schemes use a 2-bit system.
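The 2-bit scheme is simply a saturating counter per buffer entry. A minimal sketch of the prediction and update rule described above:

# 2-bit saturating-counter branch predictor (one counter per prediction-buffer entry)
class TwoBitPredictor:
    def __init__(self):
        self.counter = 0                     # values 0..3; 2 or 3 means "predict taken"

    def predict(self):
        return self.counter >= 2

    def update(self, taken):
        if taken:
            self.counter = min(self.counter + 1, 3)
        else:
            self.counter = max(self.counter - 1, 0)

p = TwoBitPredictor()
for outcome in [True, True, False, True]:    # actual branch behaviour
    print('predicted taken:', p.predict(), ' actually taken:', outcome)
    p.update(outcome)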

4.5 Test yourself

Q 1: What do you mean by hazards?


Answer: The instances that prohibit the execution of next instruction during the clock cycle
designated for it are called hazards. There are three types of hazards: data hazards, structural
hazards and control hazards.

Q2: How can data hazards be minimized?

Answer: Refer section 4.3.1.

Q3: What are structural hazards and how do they affect the performance of processor?
Answer: When a processor is pipelined, the functional units themselves need to be pipelined and some resources need to be duplicated so that structural hazards do not occur; this allows instructions to overlap in any combination. Sometimes, due to a lack of resources, an instruction cannot be executed in its designated cycle, and a hazard occurs. Structural hazards commonly arise because a functional unit is not fully pipelined: a sequence of instructions that all try to use the non-pipelined unit cannot proceed at the rate of one instruction per clock cycle. Likewise, when resources are not duplicated sufficiently, some combinations of instructions cannot execute together. For example, if there is only one ALU but the processor tries to use it for two additions in the same clock cycle, the processor has a structural hazard.

For a processor having the structural hazard, the average instruction time is

Average instruction time = CPI × Clock cycle time
                         = (1 + 0.4 × 1) × (Clock cycle time_ideal / 1.05)
                         ≈ 1.3 × Clock cycle time_ideal

assuming that data references make up 40% of the instruction mix, each such reference stalls one cycle, and the processor with the hazard has a 1.05 times faster clock. From this equation we find that the ideal pipelined processor is faster: the ratio of average instruction times shows that the processor without the structural hazard is about 1.3 times faster than the processor with it.

Q 4: What is branch penalty?


Answer: Refer sections 4.4 and 4.4.1.

Q 5: What do you mean by static branch strategies and dynamic branch strategies to
deal with branches in pipeline processor?
Answer: Refer section 4.4.3 and section 4.4.4.

Q 6: Explain 2-bit prediction scheme?


Answer: Refer section 4.4.5.

Q 7: Explain true dependency and anti dependency.


Answer: A true dependency or data dependency occurs when an instruction needs a result that has not yet been calculated. This can happen when the instruction comes after a previous instruction in program order, but the previous instruction has not yet completed in the pipeline.
Example
i1. R2 <- R1 + R3
i2. R4 <- R2 + R3
Here in the first instruction, the value which is to be saved in register R2 is calculated, and
this value is to be used by the second instruction. But in a pipeline, while fetching the
operands in the second instruction, the output of the first instruction has not been saved yet.
This results in a data dependency. Here i2 is data dependent on i1.

An anti-dependency occurs when an instruction may execute before a previous instruction and modify a value that the previous instruction still has to read.
Example
i1. R4 <- R1 + R5
i2. R5 <- R1 + R2
When there occurs an instance that instruction i2 may get executed before i1 (i.e. in parallel
execution) we should take care that the result of register R5 is not stored before i1 is
completed.

Chapter 5 Memory and Memory Hierarchy

Contents

Introduction
5.1 Storage technologies
5.1.1 RAM or Random Access Memory
5.1.2 Static RAM
5.1.3 Dynamic RAM
5.2 Memory modules
5.2.1 Improved DRAM
5.2.2 Non Volatile Memory
5.3 Access the main memory
5.4 Virtual Memory
5.4.1 Virtual memory models
5.5 Disk storage
5.5.1 Disk Geometry
5.5.2 Disk Capacity
5.5.3 Disk Operation

Introduction
A computer system consists of three major parts: input devices, the central processing unit (CPU) and output devices. The CPU is composed of three components: the arithmetic logic unit (ALU), the control unit and the memory unit.

Figure 5.1 block structure of computer

The main objective of this chapter is to introduce the important concept of the memory unit of the computer, shown in the diagram as main memory and secondary memory. The memory unit is interfaced with the other units of the computer as shown in figure 5.2.

Figure 5.2 Block diagram of C.P.U

In our study of computer systems so far, a simple model has been used: a central processing unit (CPU) that executes the instructions given to it, and a storage system that holds the instructions and data for the CPU. In this basic model the memory system is a linear array of bytes, and the CPU can access any location in a fixed amount of time. In reality, modern systems do not work this way. A memory system is in fact a hierarchy of storage devices with different capacities, costs and access times. The most frequently used data are held in the CPU registers. Small, fast cache memories close to the CPU act as staging areas for the data and programs that the CPU will access first. The next levels are main memory, and then optical disks and magnetic tapes. A well-written program accesses the storage at the upper levels of the memory hierarchy far more frequently than the lower levels. Storage at the lower levels of the hierarchy is slower, larger and cheaper per bit than storage near the top. It is therefore necessary to understand the overall structure of the memory system in order to deliver efficient and fast computer applications.

Figure 5.3 Memory Hierarchy Pyramid

If the data a program needs are stored in CPU registers, they can be accessed in 0 clock cycles during the execution of the program. If they are in cache, the access takes 1 to 30 clock cycles; from main memory, 50 to 100 clock cycles; and if the data are on disk, the access takes millions of clock cycles. This leads to a fundamental consideration in programming: if we understand how the system moves data up and down the memory hierarchy, we can write application programs so that the required data are placed higher in the hierarchy, where the CPU can access them more quickly. This idea is captured by the term locality, which is a fundamental property of computer programs. A program with good locality accesses data from the higher levels of the memory hierarchy more often than a program with poor locality, and so it executes at high speed. In this chapter we discuss the relevant storage technologies, such as DRAM (Dynamic Random Access Memory), SRAM (Static Random Access Memory), ROM (Read Only Memory) and solid-state disks. We will also see how to analyse our C programs on the basis of locality, techniques to improve the locality of programs, and how the execution of a program moves data up and down the memory hierarchy, so that we can write application programs with good locality whose data are accessed at the higher levels of the hierarchy.
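The influence of locality on performance can be estimated from the cycle counts quoted above. The sketch below uses the upper end of each quoted range and made-up hit fractions, purely to illustrate the calculation:

# Rough average-access-time estimate using the cycle counts quoted in this section
ACCESS_CYCLES = {'register': 0, 'cache': 30, 'main_memory': 100, 'disk': 1_000_000}

def average_access_cycles(fractions):
    # fractions: share of accesses served by each level (they should sum to 1)
    return sum(share * ACCESS_CYCLES[level] for level, share in fractions.items())

good_locality = {'register': 0.20, 'cache': 0.75, 'main_memory': 0.05, 'disk': 0.00}
poor_locality = {'register': 0.20, 'cache': 0.40, 'main_memory': 0.39, 'disk': 0.01}
print(average_access_cycles(good_locality))   # small: most accesses hit the cache
print(average_access_cycles(poor_locality))   # dominated by main memory and disk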

5.1 Storage Technologies


Storage technology is an important factor in the success of a computer. Early computers had only a few kilobytes of random-access memory, and the first IBM PCs did not have a hard disk at all. In 1982, the company introduced the IBM PC-XT computer with a 10 MB (megabyte) hard disk. Today's machines have roughly 125,000 times more disk storage than those early machines, and storage capacity continues to double every two to three years.

5.1.1 RAM or Random Access Memory

There are two types of random-access memory (RAM): Static Random Access Memory (SRAM) and Dynamic Random Access Memory (DRAM). SRAM is faster and more expensive than DRAM. SRAM is used for cache memory, whereas DRAM is used for main memory and for the frame buffer of a graphics system. A desktop system contains no more than a few megabytes of SRAM, but hundreds or thousands of megabytes of DRAM.

5.1.2 Static RAM

Static Random Access Memory (Static RAM or SRAM) holds its data in a static form, which means the data remain available as long as power is applied. Unlike DRAM, it needs no refresh circuitry. It stores one bit of data using four transistors arranged as two cross-coupled inverters; the cell is bistable, and the two stable states represent 0 and 1. Two further transistors manage access to the cell during read and write operations, so six MOSFETs (metal-oxide-semiconductor field-effect transistors) are required to store one bit. Two types of SRAM chips are available: one is MOSFET based and the other is based on bipolar junction transistors.

The bipolar junction transistor (BJT) is faster than the MOSFET but consumes much more power, so MOSFET-based SRAM is the most widely used type.

Figure 5.4 Static RAM Cell

5.1.3 Dynamic RAM

A DRAM stores each bit as charge on a capacitor, with a typical capacitance of about 30 fF (femtofarads), i.e. 30 × 10^-15 F. DRAM storage can be very dense, because a DRAM cell consisting of one capacitor and a single access transistor is very small. A DRAM cell is far more sensitive to disturbance or noise than an SRAM cell, because the small charge stored on the capacitor is easily upset by external disturbances; even exposing the capacitor to light changes its voltage. Once the capacitor voltage is disturbed, it never recovers.

Figure 5.5 Dynamic RAM cell

Arrays of cells similar to DRAM cells are used in the sensors of digital cameras and camcorders. A DRAM cell loses its charge through various sources of leakage current and must be refreshed roughly every ten to one hundred milliseconds. This retention time is very long compared with a computer's clock period of a few nanoseconds. The memory system must refresh every bit by reading it out and then rewriting it. A computer can also detect and correct erroneous bits within a word by adding some redundant bits (e.g., an 8-bit word may be encoded using 10 bits). Unlike DRAM, SRAM does not require refreshing and retains its data as long as power is applied. SRAM access is faster than DRAM access, but SRAM cells use more transistors than DRAM cells, so they have lower density, higher cost and higher power consumption. The cells in a DRAM chip are grouped into supercells: consider a DRAM chip with d supercells, each consisting of w DRAM cells.

Figure 5.6 128-bit 16x8 DRAM chip

The chip stores d × w bits of information. The supercells are organized as a rectangular array with r rows and c columns, where d = r × c. Each supercell has an address of the form (i, j), where i denotes the row and j the column. For example, Figure 5.6 shows the organization of a 16 × 8 DRAM chip with d = 16 supercells of w = 8 bits each, arranged in r = 4 rows and c = 4 columns.

The shaded box in the figure denotes the supercell at address (2, 1). Pins serve as external connectors through which information flows into and out of the chip; each pin carries a one-bit signal. Two sets of pins are shown in Figure 5.6: a set of 8 data pins that transfer one byte into or out of the chip, and a set of 2 address pins that carry the row and column components of a supercell address. A further set of pins for control information exists but is not shown. Figure 5.6 thus gives a high-level view of a 128-bit (16 × 8) DRAM chip. It should be noted that the storage community has never settled on a proper name for the DRAM array elements: computer architects call them "cells", overloading the term for a DRAM storage cell, while circuit designers call them "words", overloading the term for a main-memory word. The term "supercell" is therefore adopted here to avoid confusion. A memory controller, which manages the transfer of w bits at a time to and from each DRAM chip, is connected to every DRAM chip. To read the contents of supercell (i, j), the memory controller first sends the row address i and then the column address j to the DRAM chip; the chip responds by sending the contents of supercell (i, j) back to the controller. The request carrying the row address i is called a Row Access Strobe (RAS) request, and the request carrying the column address j is called a Column Access Strobe (CAS) request. For example, as shown in figure 5.7, to read supercell (2, 1) from the 16 × 8 DRAM, the memory controller first sends a RAS request with row address 2. In response, the DRAM chip copies the entire contents of row 2 into an internal row buffer. When the controller then sends a CAS request with column address 1, the chip copies the 8 bits of supercell (2, 1) from the row buffer and sends them back to the memory controller.

It is to be noted that the circuit developers consider the DRAMs as a 2-D arrays structure to
minimize the address pin required on chip where as if the chip is considered as linear array
then the no of address pin may increased. Let us consider, if our example for 128-bit DRAM
which is arranged or considered as a linear array of 16 supercells with the addresses 0 to 15,
then the given chip will require 4 address pins instead of 2. Due to the 2-dimensional array
organization structure its address sent n two distinct steps and which increases access time.

(a) RAS request for row 2 (b) CAS request for column 1

Figure 5.7 Reading the contents of a DRAM supercell
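To make the two-step addressing concrete, the short C sketch below (an illustrative fragment written for this text, not taken from any DRAM datasheet; all names are invented) shows how a memory controller could split a linear supercell number into the row and column addresses that are sent as the RAS and CAS requests for the 16 × 8 chip of Figure 5.6.

#include <stdio.h>

#define ROWS 4                 /* r: rows of supercells            */
#define COLS 4                 /* c: columns, so d = r * c = 16    */

/* Split a linear supercell number (0..15) into the (i, j) address
 * that is sent to the chip in two steps: first the RAS (row)
 * request, then the CAS (column) request.                         */
static void ras_cas(int supercell, int *row, int *col)
{
    *row = supercell / COLS;   /* 2-bit row address (RAS)          */
    *col = supercell % COLS;   /* 2-bit column address (CAS)       */
}

int main(void)
{
    int i, j;
    ras_cas(9, &i, &j);        /* linear supercell 9               */
    printf("RAS = %d, CAS = %d\n", i, j);   /* prints RAS = 2, CAS = 1 */
    return 0;
}

With only two address pins the row and column must be time-multiplexed, which is exactly the extra step that lengthens the access time compared with a linear (four-pin) addressing scheme.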

5.2 Memory Modules

Memory modules are discrete units of varying size, depending on the number of DRAM chips packaged in them. They plug into dedicated slots on the main board of the system. The most common package is the 168-pin dual inline memory module (DIMM), which transfers data to and from the memory controller in 64-bit blocks, whereas the older 72-pin single inline memory module (SIMM) transfers 32-bit blocks. The basic idea of a memory module is shown in Figure 5.8. The module shown stores a total of 64 MB using eight 8M × 8 DRAM chips, numbered 0 to 7. Each supercell stores one byte of main memory, and the eight supercells at address (i, j) on the eight chips together represent one 64-bit word at address A in main memory. In the example of Figure 5.8, DRAM 0 stores the least-significant byte, DRAM 1 the next byte, and so on. To fetch the 64-bit word at address A, the memory controller converts A into a supercell address (i, j) and sends it to the memory module, which then broadcasts i and j to every DRAM chip. In response, each chip outputs the 8-bit contents of its (i, j) supercell; circuitry on the module collects these outputs and assembles them into a 64-bit word, which it returns to the memory controller. More generally, when the controller receives an address A, it selects the module k that contains A, converts A into the (i, j) form, and sends it to module k.

Figure 5.8 Reading the contents of a memory module
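As a rough illustration of how the module assembles a 64-bit word, the hypothetical C sketch below (array and function names are my own; in a real module the assembly is done by circuitry) gathers the 8-bit supercells read from the eight chips at the same (i, j) address into one word, with chip 0 supplying the least-significant byte.

#include <stdint.h>
#include <stdio.h>

#define CHIPS 8
#define ROWS  4096             /* illustrative: 4096 x 2048 = 8M supercells */
#define COLS  2048

/* chip[k][i][j] models the 8-bit supercell (i, j) of DRAM chip k.  */
static uint8_t chip[CHIPS][ROWS][COLS];

/* Assemble the 64-bit word stored at supercell address (i, j).     */
static uint64_t read_word(int i, int j)
{
    uint64_t word = 0;
    for (int k = 0; k < CHIPS; k++)
        word |= (uint64_t)chip[k][i][j] << (8 * k);
    return word;
}

int main(void)
{
    chip[0][2][1] = 0xEF;      /* low byte of the word at (2, 1)    */
    chip[1][2][1] = 0xBE;      /* next byte                         */
    printf("word(2,1) = 0x%016llx\n", (unsigned long long)read_word(2, 1));
    return 0;
}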

5.2.1 Improved DRAM


New types of DRAM appear on the market with some regularity as manufacturers compete to keep pace with ever-increasing processor speeds. Each of them is based on the conventional DRAM cell, with optimizations that improve the speed at which the cells can be accessed.

Fast Page Mode DRAM (FPM DRAM): These were for some time the most common type of DRAM in personal computers. A conventional DRAM copies an entire row of supercells into its internal row buffer, uses one supercell, and then discards the rest. FPM DRAM improves on this by allowing consecutive accesses to the same row to be served directly from the row buffer. For example, to read four supercells from row i of a conventional DRAM chip, the memory controller must send four RAS/CAS request pairs, even though the row address i is identical in every request. To read four supercells from the same row of an FPM DRAM, the controller sends an initial RAS/CAS request followed by three further CAS requests. The initial RAS/CAS request copies row i into the row buffer and returns the supercell addressed by the CAS; the remaining three supercells are then served directly from the row buffer, and are therefore returned more quickly than the first one.

Extended Data Out DRAM (EDO DRAM): An improved form of FPM DRAM that allows the individual CAS signals to be spaced closer together in time.

Synchronous DRAM (SDRAM): Conventional, FPM and EDO DRAMs are asynchronous and therefore communicate with the memory controller using explicit control signals. SDRAM replaces many of these control signals with the rising edge of an externally applied clock signal. The net effect is that an SDRAM can output the contents of its supercells faster than its asynchronous counterparts.

Double Data Rate Synchronous DRAM (DDR SDRAM): DDR SDRAM doubles the speed of the DRAM by using both edges (rising and falling) of the clock signal. The different types of DDR SDRAM are characterized by the size of their prefetch buffers.

Rambus DRAM (RDRAM): A proprietary technology with a higher maximum bandwidth than DDR SDRAM.

Video RAM (VRAM): VRAM is used in the frame buffers of graphics systems and is similar in technology to FPM DRAM. There are two important differences: (1) VRAM output is produced by shifting the entire contents of its internal buffer out in sequence, and (2) VRAM allows the memory to be read and written concurrently. Before about 1995, FPM DRAM was the memory technology used in most PCs. EDO DRAM then replaced FPM DRAM from 1996 to 1999. Up to about 2010, DDR3 SDRAM was the most widely used memory technology in both server and desktop systems; the Core i7 processor, for example, supports only DDR3 SDRAM.

5.2.3 Nonvolatile Memory


The disadvantage of DRAM and SRAM is that they are volatile: their stored contents are lost when the power is switched off. Non-volatile memories, by contrast, retain their stored data even when the power is removed. There are various types of non-volatile memory, which for historical reasons are collectively called read-only memories (ROMs), even though some of them can be written as well as read. ROMs are distinguished by the number of times they can be written and by the mechanism used to reprogram them.
A programmable ROM (PROM) can be programmed exactly once. Each memory cell of a PROM has a fuse that is blown by a high current when data is written into it. Since a blown fuse cannot be restored, PROMs are one-time writable.
An EPROM (erasable programmable ROM) is a re-writable type of ROM. Its contents are erased by shining ultraviolet (UV) light onto the storage cells through a transparent quartz window. EPROMs are programmed with a special writing device, and a typical EPROM can be erased and reprogrammed on the order of 1000 times. The main disadvantage of the EPROM is its erasing procedure: the chip must be removed from the circuit every time it is to be erased.

EEPROM (electrically erasable PROM) is very similar to EPROM, except that an electric field rather than UV light is used to erase the data, and the erasing circuitry is provided on chip. The chip therefore does not have to be removed from the circuit for erasing.

Flash memory, a non-volatile memory technology based on EEPROM, has become a leading storage technology. It provides fast and durable non-volatile storage in applications such as digital cameras, smartphones, audio/video players, PDAs, and laptop, desktop and server computer systems. The main advantage of flash memory over EEPROM is that data is erased in blocks rather than bit by bit, which makes flash memory faster to erase. Later in this chapter we will look at solid state disks (SSDs), which are built from flash memory, and see why SSDs are faster and more robust than conventional rotating disks. Programs stored in ROM devices are called firmware; firmware runs when a computer system is powered on. Some systems provide the PC BIOS (basic input/output system) routines in firmware, and firmware is also used to translate I/O requests from the CPU for devices such as disk controllers and graphics cards.

5.3 Accessing the Main Memory

A bus is a shared communication system used to transfer data between the processor and the DRAM main memory. Data moves between the CPU and memory through a series of steps called a bus transaction: a transfer of data from main memory to the CPU is a read transaction, and a transfer from the CPU to main memory is a write transaction. A bus is a collection of parallel wires that carry address, data and control signals. Depending on the design, the data and address signals may share the same set of wires or use different sets, and two or more devices may share the same bus. The control wires synchronize the transaction and indicate, for example, whether the transfer concerns main memory or another device (such as a disk controller), whether it is a read or a write operation, and whether the information on the bus is data or an address. The figure below shows a typical computer system configuration. The main components are the CPU, a chipset (I/O bridge) that provides the I/O connections, and the DRAM memory modules that make up the main memory. These components are connected by a system of buses. The following types of bus exist in a typical computer system:

System Bus: The system bus connects the CPU chip to the I/O bridge.
I/O Bus: The I/O bus connects input/output devices to the I/O bridge, which also sits between the system bus and the memory bus to main memory. The I/O bridge has two functions:

1. It translates the electrical signals of one bus into the signals of another.

2. It connects the system and memory buses to the I/O bus, to which I/O devices such as disks, sensors, printers and graphics cards are attached.

As an example, consider what happens when the processor performs a load operation such as

movl A, %eax

which loads the contents of address A into register %eax. The bus interface of the CPU initiates a read transaction on the bus. The read transaction consists of three steps:

1) The CPU places the address A on the system bus, and the I/O bridge passes the signal along to the memory bus.

2) The main memory senses the address on the memory bus, retrieves the data word from the DRAM, and writes it onto the memory bus. The I/O bridge translates the memory bus signal into a system bus signal and passes it along the system bus.

3) The CPU senses the data on the system bus, reads it from the bus, and copies it into register %eax.

A write transaction is initiated when the CPU executes a store instruction such as

movl %eax, A

which writes the contents of register %eax to address A. There are again three basic steps:

• The CPU places the address A on the system bus. The memory reads the address from the memory bus and waits for the data to arrive.

• The CPU copies the data word in %eax onto the system bus.

• The main memory reads the data word from the memory bus and stores its bits in the DRAM.

(a) The CPU places address A on the memory bus.

(b) Main memory places the word x stored at address A on the bus.

Figure 5.9: Memory read transaction for the load instruction movl A, %eax

(a) The CPU places address A on the memory bus to initiate the write.

(b) The CPU places the data word y on the bus.

(c) Main memory reads the data word y from the bus and stores it at address A.

Figure 5.10: Memory write transaction for the store instruction movl %eax, A

5.4 Virtual memory


5.4.1 Virtual memory models
The main memory is regarded as the physical memory, in which multiple running programs may reside. However, the limited-size physical memory cannot hold all programs fully and simultaneously. The virtual memory concept was introduced to alleviate this problem. The idea is to expand the use of the physical memory among many programs with the help of an auxiliary (backup) memory such as an array of disks.

Only the active programs, or portions of them, reside in physical memory at any one time. Active portions of programs can be loaded into and out of physical memory from disk dynamically under the coordination of the operating system. To the user, virtual memory provides an almost unbounded memory space to work with. Without virtual memory it would have been impossible to develop the multiprogrammed or time-sharing computer systems that are in use today.

Address spaces: Each word in the physical memory is identified by a unique physical address, and all memory words in the main memory together form the physical address space. Virtual addresses are the addresses used by the machine instructions making up an executable program.

The virtual addresses must be translated into physical addresses at run time. A system of translation tables and mapping functions is used in this process. The address translation and memory management policies are affected by the virtual memory model used and by the organization of the disk and of the main memory.

The use of virtual memory facilitates sharing of the main memory by many software
processes on a dynamic basis. It also facilitates software portability and allows users to
execute programs requiring much more memory than the available physical memory.

Only the active portions of running programs are brought into the main memory. This permits
the relocation of code and data, makes it possible to implement protections in the OS kernel,
and allows high level optimization of memory allocation and management.

Address mapping: Let V be the set of virtual addresses generated by a program running on a processor, and let M be the set of physical addresses allocated to run this program. A virtual memory system demands an automatic mechanism to implement the following mapping:

ft : V → M ∪ {∅}

This mapping is a function of time, because the physical memory is dynamically allocated and deallocated. Consider a virtual address v ∈ V. The mapping ft is formally defined as follows:

ft(v) = m, if m ∈ M has been allocated to store the data identified by virtual address v
ft(v) = ∅, if the data identified by v is missing from M

In other words, the mapping ft(v) uniquely translates the virtual address v into the physical address m if there is a memory hit in M. When there is a memory miss, the returned value ft(v) = ∅ signals that the referenced item (instruction or data) has not yet been brought into main memory at the time of reference.
The efficiency of the address translation process affects the performance of the virtual memory. Virtual memory is more difficult to implement in a multiprocessor, where additional problems such as coherence, protection and consistency become more challenging.
Two virtual memory models are discussed below.
Two virtual memory models are discussed below.
Private virtual memory: The first model uses a private virtual memory space associated with each processor, as in the VAX/11 and in most UNIX systems (Fig. 5.11a). Each private space is divided into pages, and virtual pages from different virtual spaces are mapped into the same physical memory shared by all processors.
The advantages of using private virtual memory include a small processor address space (32 bits), protection on each page or on a per-process basis, and the use of private memory maps, which require no locking.
The shortcoming lies in the synonym problem, in which different virtual addresses in different virtual spaces point to the same physical page.
Shared virtual memory: This model combines all the virtual address spaces into a single globally shared virtual space (Fig. 5.11b). Each processor is given a portion of the shared virtual memory in which to declare its addresses. Different processors may use disjoint spaces, and some areas of the virtual space can also be shared by multiple processors. Examples of machines using shared virtual memory include the IBM 801, RT, RP3, System/38, the HP Spectrum, the Stanford Dash, MIT Alewife and Tera.

(a) Private virtual memory spaces in different processors

(b) Globally shared virtual memory space

Figure 5.11 Virtual memory model

The advantage of using shared virtual memory is that all addresses are unique. However, each processor must be allowed to generate addresses longer than 32 bits, such as 46 bits for a 64-Tbyte (2^46-byte) address space. Synonyms are not allowed in a globally shared virtual memory.
The page table must allow shared accesses, so mutual exclusion (locking) is needed to enforce protected access. Segmentation is built on top of the paging system to confine each process to its own address space (segments). Global virtual memory may also make the address translation process longer.

5.5 Disk Storage


Disks are storage devices that hold large amounts of data: their capacity is of the order of gigabytes, compared with RAM-based memory whose capacity is of the order of megabytes. Disk read operations are much slower, however, taking times of the order of milliseconds.

5.5.1 Disk Geometry


Disks are constructed from platters, each of which is coated with magnetic recording material on both sides. A spindle at the center of the platter rotates it at a fixed rate, typically in the range of 5,400 to 15,000 revolutions per minute (RPM). A disk normally contains one or more such platters enclosed in a sealed container. The figure shows a typical disk surface. Each surface consists of a collection of concentric rings called tracks, and each track is divided into a collection of sectors. Each sector holds an equal number of data bits (typically 512 bytes) encoded in the magnetic material of that sector. Sectors are separated by gaps that store no data bits but hold formatting bits that identify the sectors. A multiple-platter view is also shown in the figure: the platters are stacked inside the sealed package, and the complete assembly is known as a disk drive. Such drives are also referred to as rotating disks, to distinguish them from flash-based solid state disks (SSDs), which have no moving parts. The structure of a multiple-platter drive gives rise to the notion of a cylinder, the collection of tracks on all surfaces that lie the same distance from the center of the spindle. For example, as shown in figure (b), if a drive has 3 platters it has 6 surfaces, and if the tracks on each surface are numbered consistently, then cylinder k is the collection of the six instances of track k.

(a) Single-platter view (b) Multiple-platter view

Figure 5.12: Disk geometry

5.5.2 Disk Capacity

The maximum number of bits that can be stored on a disk is known as its capacity. Disk capacity is determined by the following parameters:

• Recording density (bits/inch): the number of bits that can be squeezed into a one-inch segment of a track.

• Track density (tracks/inch): the number of tracks that can be squeezed into a one-inch segment of the radius extending from the center of the platter.

• Areal density (bits/inch²): the product of the recording density and the track density. Disk manufacturers work very hard to increase areal density (and hence capacity), and it has roughly doubled every few years. The design of the disk follows from the areal density. Every track is divided into a number of sectors. The original approach was to keep a fixed number of sectors per track and to space the sectors farther apart on the outer tracks; this was acceptable when areal densities were relatively low. As areal density increased, the gaps between sectors on the outer tracks became too large, so a technique called multiple zone recording was introduced, in which the collection of cylinders is partitioned into disjoint subsets called zones, each zone being a contiguous collection of cylinders. Every track in every cylinder of a zone has the same number of sectors, which is fixed by the number of sectors that can be packed into the innermost track of the zone. Floppy disks, by contrast, still use the older approach with a constant number of sectors per track. The capacity of a disk is given by the following formula:

Disk capacity = (bytes/sector) × (average sectors/track) × (tracks/surface) × (surfaces/platter) × (platters/disk)

For example, suppose we have a disk with five platters, 512 bytes per sector, 30,000 tracks per surface, and an average of 400 sectors per track. Then the capacity of the disk is:

Disk capacity = 512 bytes/sector × 400 sectors/track × 30,000 tracks/surface × 2 surfaces/platter × 5 platters/disk
= 61,440,000,000 bytes
= 61.44 GB
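The same calculation can be scripted. The small C helper below (an illustrative sketch; the parameter names are chosen here, not taken from the text) reproduces the 61.44 GB figure.

#include <stdio.h>

/* Disk capacity = bytes/sector * avg sectors/track * tracks/surface
 *               * surfaces/platter * platters/disk                  */
static double disk_capacity(double bytes_per_sector,
                            double sectors_per_track,
                            double tracks_per_surface,
                            double surfaces_per_platter,
                            double platters_per_disk)
{
    return bytes_per_sector * sectors_per_track * tracks_per_surface *
           surfaces_per_platter * platters_per_disk;
}

int main(void)
{
    double bytes = disk_capacity(512, 400, 30000, 2, 5);
    printf("Capacity = %.0f bytes = %.2f GB\n", bytes, bytes / 1e9);
    return 0;
}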

5.5.3 Disk Operation

A disk reads and writes bits on the magnetic surface using a read/write head attached to the end of an actuator arm. As shown in Figure 5.12, the drive positions the head over any track on the surface by sliding the arm back and forth along its radial axis; this mechanical motion is called a seek. Once the head is positioned over the target track, every bit on the track passes directly underneath it, so the head can either read or write that bit. As shown in Figure 5.12(b), a disk with multiple platters has a separate read/write head for each surface. The heads are stacked vertically one above another and move in unison, so at any point in time all the heads are positioned over the same cylinder. The disk surface rotates at a fixed rate beneath the head, which rides on a thin cushion of air about 0.0001 mm above the surface at a speed of about 49.7 mph. For this reason disks are always sealed in airtight packages.

(a) Single-platter view (b) Multiple-platter view

Figure 5.13: Dynamics of a disk

Disks read and write data in sector-sized blocks. The access time for a sector has 3 main
components: seek time, rotational latency, and transfer time.

• Seek time: To read the contents of a target sector, the arm must first position the head over the track that holds that sector. The time required to move the arm is called the seek time. The seek time, T.seek, depends on two factors:

1) the previous position of the head, and

2) the speed at which the arm moves across the surface.

The average seek time in modern drives, T.avg seek, measured by taking the mean of several thousand seeks to random sectors, is typically of the order of 3 to 9 ms; a single seek can sometimes take as long as 20 ms.

• Rotational latency: Once the head is positioned over the track, the drive waits for the first bit of the target sector to rotate under the head. The rotational latency depends on two factors:

1) the position of the surface relative to the head when it arrives over the track, and

2) the rotational speed of the disk.

In the worst case, the head just misses the target sector and must wait for the disk to make a full rotation. The maximum rotational latency (in seconds) is therefore

T.max rotation = (1 / RPM) × (60 secs / 1 min)

The average rotational latency, T.avg rotation, is half of T.max rotation.

• Transfer time: When the first bit of the target sector is under the head, the drive can begin to read or write the contents of the sector. The transfer time for one sector depends on two parameters:

1) the rotational speed, and

2) the number of sectors per track.

The average transfer time for one sector can therefore be estimated as

T.avg transfer = (1 / RPM) × (1 / (average sectors/track)) × (60 secs / 1 min)

The average time to read or write a disk sector can then be estimated as the sum of the following terms:

A) the average seek time,

B) the average rotational latency, and

C) the average transfer time.

Let us consider a disk with the following parameters:

• Rotational rate: 7,200 RPM (revolutions per minute)

• T.avg seek: 9 ms (milliseconds)

• Average sectors/track: 400

The average rotational latency of this disk (in milliseconds) is

T.avg rotation = 1/2 × T.max rotation
= 1/2 × (60 secs / 7,200 RPM) × 1000 ms/sec
≈ 4 ms

The average transfer time is

T.avg transfer = (60 secs / 7,200 RPM) × (1 / 400 sectors/track) × 1000 ms/sec
≈ 0.02 ms

Putting everything together, the total access time can be estimated as

T.access = T.avg seek + T.avg rotation + T.avg transfer
= 9 ms + 4 ms + 0.02 ms
= 13.02 ms
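The three components can be combined in a small C routine (an illustrative sketch with invented names). Note that it keeps full precision and therefore reports about 13.19 ms, while the text rounds the rotational term to 4 ms and the transfer term to 0.02 ms, giving 13.02 ms.

#include <stdio.h>

/* T.access = T.avg seek + T.avg rotation + T.avg transfer (all in ms). */
static double access_time_ms(double rpm, double avg_seek_ms,
                             double sectors_per_track)
{
    double ms_per_rev   = 60.0 / rpm * 1000.0;    /* time of one rotation */
    double avg_rotation = 0.5 * ms_per_rev;       /* half a rotation      */
    double avg_transfer = ms_per_rev / sectors_per_track;
    return avg_seek_ms + avg_rotation + avg_transfer;
}

int main(void)
{
    /* 7,200 RPM, 9 ms average seek, 400 sectors per track               */
    printf("T.access = %.2f ms\n", access_time_ms(7200.0, 9.0, 400.0));
    return 0;
}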

The following important points can be drawn from this example:

• The time to access the 512 bytes in a disk sector is dominated by the seek time and the rotational latency. Accessing the first byte of the sector takes a long time, but the remaining bytes are essentially free.

• Since the rotational latency and the seek time are roughly of the same order, twice the seek time is a reasonable rough estimate of the disk access time.

Summary

Memory elements provided within the processor operate at processor speed, but they are small in size, limited by cost and power consumption. Farther away from the processor, the memory elements commonly provided are (one or more levels of) cache memory, main memory, and secondary storage. The memory at each level is slower than the one at the previous level, but also much larger and less expensive per bit. The aim of providing a memory hierarchy is to achieve, as far as possible, the speed of the fast memory at the cost of the slower memory. The properties of inclusion, coherence and locality make it possible to achieve this objective in a computer system.

Virtual memory systems aim to free program size from the size limitations of main memory. Working sets, paging, segmentation, TLBs and memory replacement policies make up the essential elements of a virtual memory.

Disk storage provides large-capacity storage to the computer system.

Exercise

Q1: What are the different storage technologies used in computer architecture?


Ans: Refer section 5.1.

Q2: Explain memory hierarchy pyramid for a computer.


Ans: Refer section Introduction

Q3: What are the main differences in DRAM and SRAM technologies?
Ans: Refer section 5.1.

Q4: What are the various steps to be followed by a processor while executing an instruction?
Ans: Refer section 5.3

Q5: What is memory module? Explain with suitable examples.


Ans: Refer section 5.2

Q6: Write a short note on
1) Disk Storage 2) Disk operations 3) Disk Performance parameters
Ans: Refer section 5.5

Chapter 6

Cache Memory and operations

Structure
6.0 Objectives
6.1 CACHES
6.2 Cache organization
6.2.1 Look Aside
6.2.2 Look Through
6. 3 Cache operation
6.4 Cache Memory Mapping
6.5 Cache Writing Techniques
6.5.1 Write-through
6.5.2 Write-back
6.5.3 Single Cycle Cache
6.6 FSM Cache
6.7 Pipelined Cache

6.0 Objectives

After studying this chapter one will understand


• What cache memory is and how it works
• Reading from and writing to caches
• Cache writing techniques such as write-through, write-back and the single-cycle cache
• The detailed concept of FSM and pipelined caches in a processor

6.1 CACHES
A cache is a type of memory that is comparatively small but can be operated on very quickly. It stores information that is likely to be reused. Caches first appeared around 1968 in IBM System/360 machines, and became widespread after low-cost, high-density RAM and microprocessor ICs appeared in the 1980s. Caches directly address the von Neumann bottleneck by providing the CPU with fast, single-cycle access to its external memory. A small portion of memory located on the same chip as the microprocessor is known as the CPU cache. The CPU cache stores the most recently used information so that it can be fetched more quickly; this information is a copy of information stored elsewhere, but it is more easily accessible. In this section we focus on caches used as an intermediary between a CPU and its main memory, but caches also appear as buffer memories in various other contexts.
6.1.1 Main Features
The cache and main memory form a two-level sub-hierarchy (ME1, ME2) that differs from the main-secondary memory pair (ME2, ME3). The (ME1, ME2) pair is much faster than (ME2, ME3): the typical access-time ratio of (ME1, ME2) is around 5/1, while that of (ME2, ME3) is about 1000/1. Because of this speed difference, (ME1, ME2) is managed mainly by high-speed hardware circuits, whereas (ME2, ME3) is controlled by the operating system. Communication between ME1 and ME2 is in small blocks (here taken to be 8 bytes), much smaller than the page size of about 4 KB used between ME2 and ME3. Finally, the CPU has direct access to both ME1 and ME2, but no direct access to ME3.

6.2 Cache organization


Figure 1.1 shows the basic structure of a cache. The cache data memory stores copies of main-memory words, arranged into small groups known as cache blocks or cache lines. Every cache block has a block address, known as its tag. The set of tag addresses currently assigned to the cache, which may be arbitrary, is stored in a special memory called the cache tag memory.
For the cache to improve the performance of the computer, the time required to check tag addresses and to access the cache's data memory must be less than the main-memory access time. Main memory is implemented with DRAM technology with an access time of around 50 ns, while cache memory is implemented with SRAM technology with an access time of around 10 ns.

Fig.1.1: Basic structure of a cache

Two types of system organizations for caches are look-aside and look-through.

6.2.1 Look Aside
In the look-aside organization, the cache and the main memory are connected in parallel to the system bus, so both the cache and the main memory see a bus cycle at the same time; hence the name look-aside.

Fig.1.2: Look Aside Cache

During a processor read cycle, the cache checks whether the address is a cache hit or a cache miss. If the cache contains the memory location (a cache hit), the cache responds to the read cycle and completes the bus cycle. If the cache does not contain the memory location (a cache miss), main memory responds to the processor and terminates the bus cycle; the cache captures the data as it goes by, so that the next time the processor requests this data it will be a cache hit.
Look-aside caches are simpler in structure, which means they are less expensive. Their main disadvantage is that the processor cannot access the cache while another bus master is accessing main memory.

6.2.2 Look Through

Fig.1.3: Look Through Cache

As shown in the diagram of this cache architecture, the cache unit is connected between the processor and main memory, so the cache sees the processor's bus cycle first, before allowing it to pass on to the system bus. This is the faster, but more costly, organization.
On a cache hit, the cache responds to the processor's request without starting an access to main memory. On a cache miss, the cache passes the bus cycle on to the system bus; main memory responds to the processor's request and the cache captures the data, so that the next time the processor requests this data it will be a cache hit. The main disadvantages of the look-through organization are its higher complexity and cost.
6.3 Cache operation

Figure 1.4 shows a cache system that illustrates the relationship between the data stored in the cache and in main memory. Here we assume that the cache block size is 4 bytes and that a memory address is 12 bits long, so that the 10 high-order bits form the tag (block address) and the 2 low-order bits form the displacement address within the block. The figure shows the contents of two blocks of the cache tag memory together with the corresponding cache data memory. When the address Aj = 101111000110 is sent to M1, M1 compares Aj's tag part with its stored tags and finds a match, i.e. a "hit". The 2-bit displacement is then used to output the target word to the CPU.

Fig.1.4: Cache execution of read operation
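To make the addressing concrete, the illustrative C fragment below (written for this text, not taken from the figure) splits the 12-bit address Aj = 101111000110 into its 10-bit tag and 2-bit displacement, which is the comparison the cache tag memory performs on a read.

#include <stdio.h>

/* 12-bit address: 10-bit tag (block address) + 2-bit displacement. */
#define DISP_BITS 2
#define DISP_MASK ((1u << DISP_BITS) - 1u)

int main(void)
{
    unsigned aj   = 0xBC6;              /* 1011 1100 0110 in binary   */
    unsigned tag  = aj >> DISP_BITS;    /* 10 high-order bits         */
    unsigned disp = aj & DISP_MASK;     /* 2 low-order bits           */

    /* A hit occurs when 'tag' matches a tag stored in the cache tag
     * memory; 'disp' then selects one of the 4 bytes in the block.  */
    printf("tag = 0x%03X, displacement = %u\n", tag, disp);
    return 0;
}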

A cache write operation uses the same addressing technique. As shown in Figure 1.5, the address is presented to the cache tag memory along with the data word to be stored. When a cache hit occurs, the new data 88 is stored at the location pointed to by Aj in the cache data memory, thereby overwriting the old data FF. A problem now arises: the data in the cache differs from the data in main memory at the same address. This leads to a temporary inconsistency, which can be minimized by implementing a policy that systematically updates the data in main memory in response to changes made to the corresponding data in the cache. The two basic writing schemes are write-through and write-back. In the write-through approach, the write is performed synchronously both to the cache and to the backing store. In the write-back approach, the write is initially performed only to the cache; the write to the backing store is delayed until the cache block containing the data is about to be replaced or modified by new data. Write misses can be handled with write allocation, where the data at the missed write location is first loaded into the cache and the write then proceeds as a write hit (so write misses are handled like read misses), or without write allocation.

Fig.1.5: Cache execution of write operation

6.4 Cache Memory Mapping


The memory system has to determine whether a given tag address is present in the cache, which requires the incoming tag to be compared quickly against the stored tags to see whether a matching tag is currently assigned to the cache. The fastest technique for the tag comparison is to compare the input tag simultaneously with all tags in the cache tag memory. There are three popular mapping techniques: direct mapping, associative mapping and set-associative mapping.
Direct Mapping
The simplest mapping is direct mapping. Here the cache consists of high-speed RAM, and each location in the cache holds data at a cache address given by the lower significant bits of the main-memory address. The block is selected by the lower significant bits of the memory address, and the remaining higher significant bits of the address are stored in the cache along with the data in order to identify the cached data.
As shown in Figure 1.6, the memory address from the processor is divided into two fields: a tag and an index. The tag holds the higher significant bits of the address and is stored with the data; the index consists of the lower significant bits of the address and is used to address the cache. First, the index is used to access a word in the cache. Then the tag stored in the accessed word is read and compared with the tag of the address. If the two tags are the same, the access is made to the addressed cache word. If the tags are not identical, the required word is not in the cache, and a reference is made to main memory to find it. For a memory read operation, the word is accessed as it is transferred into the cache; on a miss the information can be transferred to the cache and the processor simultaneously, i.e. the cache is read-through. For a write operation the cache location is altered, and main memory may be altered at the same time (write-through) or later.

Fig.1.6: Cache with direct mapping

The advantage of direct mapping is that its replacement algorithm is trivial. Its disadvantage is that it is not flexible: there can be contention for a block position even when the cache is not full. A minimal lookup sketch is given below.
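The following C sketch of a direct-mapped lookup is illustrative only (the 8-bit index width, line structure and function names are assumptions made for this text, not values from the figure): the index selects exactly one line, and the stored tag decides between a hit and a miss.

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define INDEX_BITS 8                         /* assumed: 256 cache lines  */
#define NUM_LINES  (1u << INDEX_BITS)

struct line { bool valid; uint32_t tag; uint32_t data; };
static struct line cache[NUM_LINES];

/* Direct mapping: index = low bits of the block address,
 * tag = remaining high bits; only one line can hold a given block.  */
static bool lookup(uint32_t block_addr, uint32_t *data_out)
{
    uint32_t index = block_addr & (NUM_LINES - 1);
    uint32_t tag   = block_addr >> INDEX_BITS;

    if (cache[index].valid && cache[index].tag == tag) {
        *data_out = cache[index].data;       /* hit                       */
        return true;
    }
    return false;                            /* miss: go to main memory   */
}

int main(void)
{
    uint32_t d;
    cache[0x2A] = (struct line){ true, 0x1F, 1234 };     /* preload a line */
    printf("%s\n", lookup((0x1Fu << INDEX_BITS) | 0x2A, &d) ? "hit" : "miss");
    return 0;
}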
Associative Mapping Technique
In a fully associative cache, data can be stored in any cache block rather than being forced into one particular block; data may be placed in any unused block of the cache. The cache relates data to its main-memory address by storing both the address and the data together in the cache; this is known as fully associative mapping. The cache is composed of associative memory that holds both the memory address and the data for each cache line. As shown in Fig. 1.7, the internal logic of the associative memory compares the incoming memory address simultaneously with all stored addresses, and if a match is found the data is read out. If the associative part of the cache is capable of holding a full address, then single words from anywhere within main memory can be held in the cache.

Fig.1.7: Fully associative mapping
The advantage of fully associative mapping is its flexibility: any empty block of the cache can be used. It is expensive, however, because all tags must be checked to detect a hit; parallel comparison hardware is used to speed up this process.

Set associative mapping


Set-associative mapping is a combination of direct mapping and fully associative mapping. The cache is divided into groups of blocks called sets, and data may be placed in any block within a set. Any memory address maps to exactly one set in the cache. If each set has x blocks, the cache is an x-way set-associative cache. Figure 1.8 illustrates several ways of organizing a cache of 8 blocks: a 1-way associative organization consists of 8 sets of 1 block each, a 2-way associative organization consists of 4 sets of 2 blocks each, and similarly for the 4-way organization. Set-associative mapping has an advantage over direct mapping in that each cache set can store two or more memory words having the same index address.

Fig 1.8: Several ways of organizations of 8 blocks cache

A fully associative cache can be described as N-way set associative, where N is the number of blocks in the cache, while a direct-mapped cache is 1-way set associative, i.e. one block in each set. For better performance it is usually preferable to increase the number of sets rather than the associativity, and 2- to 16-way set-associative caches perform well in practice. Set-associative mapping allows a limited number of blocks with the same index and different tags to reside in the cache. Figure 1.9 shows a four-way set-associative cache, in which the cache is divided into sets of four blocks each. The number of blocks in a set is known as the set size, or the associativity. Each block in every set carries a tag and data and is selected by the index. First, the index part of the processor address is used to access the set. Then comparators compare the incoming tag with the tags of the selected blocks. If a match is found, the corresponding location is accessed; otherwise an access is made to main memory.
Within the full address, the tag bits are always chosen to be the most significant bits, the set (block) address bits are the next significant bits, and the byte address bits are the least significant bits, so that consecutive main-memory blocks map to consecutive sets in the cache. This addressing scheme, used by all known systems, is known as bit selection.
For each associative search, the comparison between the incoming tag and a stored tag is performed by a comparator. All of the information, tags and data, can be stored in random access memory. The number of comparators required in a set-associative cache is given by the number of blocks in a set. All blocks of the selected set, together with their tags, are read out simultaneously before the tag comparisons are made by the comparators; the particular block is selected once the matching tag has been identified.

Fig.1.9: Cache with set associative mapping

Set-associative mapping gives better performance than the other two mappings, but it is more expensive. The number of comparators is equal to the number of ways, so it is less complex than a fully associative cache. Intel Pentium processors, for example, use an 8-way set-associative level-1 data cache and 4-, 8-, 16- or 24-way set-associative organizations for level 2.
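As a companion to the direct-mapped sketch above, this illustrative C fragment (4 ways, with assumed field widths and names) shows why a set-associative lookup needs one comparator per way: the index picks a set, and every tag in that set is compared.

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define WAYS     4                       /* 4-way set associative     */
#define SET_BITS 6                       /* assumed: 64 sets          */
#define NUM_SETS (1u << SET_BITS)

struct line { bool valid; uint32_t tag; uint32_t data; };
static struct line cache[NUM_SETS][WAYS];

/* The index selects one set; the tag is compared against every way
 * of that set (one comparator per way in hardware).                 */
static bool lookup(uint32_t block_addr, uint32_t *data_out)
{
    uint32_t set = block_addr & (NUM_SETS - 1);
    uint32_t tag = block_addr >> SET_BITS;

    for (int way = 0; way < WAYS; way++) {
        if (cache[set][way].valid && cache[set][way].tag == tag) {
            *data_out = cache[set][way].data;   /* hit in this way    */
            return true;
        }
    }
    return false;                               /* miss               */
}

int main(void)
{
    uint32_t d;
    cache[5][2] = (struct line){ true, 0x3A, 42 };
    printf("%s\n", lookup((0x3Au << SET_BITS) | 5, &d) ? "hit" : "miss");
    return 0;
}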

6.5 Cache Writing Techniques


A system that writes data to the cache must, at some point, also write that data to the backing store. The timing of this write is controlled by the cache writing technique, and the overall approach is popularly known as caching.
Caching is the technique that speeds up data reading and writing operations: the desired data is read directly from the cache memory embedded in the computer system rather than from the original data source. The data that the computer needs in order to perform its operations is fetched through the cache, and the cache is continuously updated from main memory. Since the cache is closer to the processor both physically and logically, reading and writing the cache improves the speed of operation; accessing the cache is much faster than fetching the data from its original source. A typical diagram of the cache organization is shown in Figure 1.10.

Fig. 1.10: Cache organization

Two basic writing techniques are in popular use: write-through and write-back.
6.5.1 Write-through
In the write-through technique, main memory is written at the same time as the cache, which requires the processor to wait until main memory completes its write operation. This technique is easy to implement, although it sends many unnecessary writes to main memory. Consider, for example, a program that writes a data block in the cache, then reads it, and then writes it again, so that the block stays in the cache throughout. It is not necessary to update main memory after the first write, because the second write overwrites the data written by the first. Essentially, in this technique every write is performed synchronously both to the cache and to main memory.

The write-through technique directs every write I/O into the cache and through to main memory before confirming I/O completion to the host, which ensures that the written data is safely stored. Write-through is mainly used for applications in which written data is re-read frequently.

Fig. 1.11: Write through technique flow diagram

6.5.2 Write-back
In the write-back technique, writing is performed only on the cache. The main advantage of this technique is that it reduces the number of write operations to main memory. A block is written back to memory only when it is about to be replaced by another block. An extra bit, known as the dirty bit, is associated with each data block: when we write to a block in the cache we set its dirty bit, and we check this bit when the block is about to be replaced. This tells us whether or not the block must be copied back to main memory.
The write-back technique is more difficult to implement, because it requires continuous tracking of which locations have been written over; these are marked with the dirty bit so they can be written to main memory later, when the block is evicted. There are two approaches for handling write misses, i.e. writes for which no data is present in the cache:

Write allocate: the block containing the missed write location is first brought into the cache, and the write is then performed as a write hit.
No-write allocate: the data is written directly to main memory and the cache is not affected.

Fig. 1.12: Write back technique flow diagram
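The following illustrative C sketch (a simplification with invented names, not a full cache model) shows the dirty-bit bookkeeping that distinguishes write-back from write-through: a write-back store only marks the cached block dirty, and main memory is updated when the block is evicted.

#include <stdbool.h>
#include <stdint.h>

#define LINES 256
static uint32_t main_memory[LINES];

struct line { bool valid, dirty; uint32_t tag, data; };
static struct line cache[LINES];

/* Write-back: update only the cache and set the dirty bit.          */
static void write_back_store(uint32_t index, uint32_t tag, uint32_t value)
{
    cache[index] = (struct line){ true, true, tag, value };
}

/* On eviction, a dirty block must first be copied to main memory.   */
static void evict(uint32_t index)
{
    if (cache[index].valid && cache[index].dirty)
        main_memory[index] = cache[index].data;   /* delayed write    */
    cache[index].valid = false;
}

/* Write-through, for contrast: cache and memory updated together.   */
static void write_through_store(uint32_t index, uint32_t tag, uint32_t value)
{
    cache[index]       = (struct line){ true, false, tag, value };
    main_memory[index] = value;
}

int main(void)
{
    write_back_store(3, 0x1, 99);   /* main memory still stale here       */
    evict(3);                       /* dirty block flushed to main memory */
    write_through_store(4, 0x2, 7); /* main memory updated immediately    */
    return (main_memory[3] == 99 && main_memory[4] == 7) ? 0 : 1;
}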


6.5.3 Single Cycle Cache
In single-cycle operation the processor executes each instruction in a single cycle, and a single-cycle cache is used to support this. The memory is instantiated twice, once for instructions and once for data, because an instruction and its data must both be fetched within the same cycle.

Fig 1.13: 16 bit memory single cycle operation

The function of the memory in each cycle is determined by its "wr" and "enable" inputs. Since the cache is an integral part of the CPU, each instruction is executed in one cycle, so the CPI (cycles per instruction) is 1. Every cycle takes the same amount of time, and the processor spends the same amount of time on each instruction regardless of its complexity; even the most complex instruction must complete in one cycle for the processor to work correctly. The disadvantage of this kind of CPU is that it must run at the speed of its slowest instruction; the advantage is that it is quite easy to implement.

6.6 FSM Cache


An FSM (finite state machine) cache uses a state machine to sequence its operations. A finite state machine is in exactly one state at a time, called its current state, and it changes from one state to another in response to an event or condition; this change is known as a transition. An FSM is defined by its list of states and the triggering conditions of its transitions. A cache controller that executes memory operations on the basis of a state machine therefore has its own specific states and triggering conditions.

Fig 1.14: FSM cache
As shown in Figure 1.14, there are four states. Initially the cache controller is in the idle state and leaves it only when it receives a CPU request. The controller then compares the tag; on a cache hit it returns a "cache ready" signal to the processor and the operation completes in the cache. On a cache miss there are two possibilities: the block is clean or the block is dirty. If the dirty bit is set, the old block must first be written back to memory; if the block is clean (no dirty bit set), the block is allocated for the data to be written or fetched. After allocation, a "memory ready" signal is sent back to the CPU.
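A compact sketch of such a controller in C is shown below (the state names follow the common Idle / Compare-Tag / Write-Back / Allocate pattern, and the hit/dirty inputs are abstracted as booleans, so this is an illustration rather than the exact machine of Figure 1.14).

#include <stdbool.h>
#include <stdio.h>

enum state { IDLE, COMPARE_TAG, WRITE_BACK, ALLOCATE };

/* One transition of the cache-controller FSM.                      */
static enum state step(enum state s, bool cpu_request, bool hit, bool dirty)
{
    switch (s) {
    case IDLE:        return cpu_request ? COMPARE_TAG : IDLE;
    case COMPARE_TAG: if (hit)   return IDLE;        /* cache ready  */
                      return dirty ? WRITE_BACK : ALLOCATE;
    case WRITE_BACK:  return ALLOCATE;               /* old block out */
    case ALLOCATE:    return COMPARE_TAG;            /* retry access  */
    }
    return IDLE;
}

int main(void)
{
    /* A miss on a dirty block: Idle -> CompareTag -> WriteBack -> Allocate */
    enum state s = IDLE;
    s = step(s, true,  false, true);
    s = step(s, false, false, true);
    s = step(s, false, false, true);
    printf("state = %d\n", s);        /* prints 3 (ALLOCATE)         */
    return 0;
}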

6.7 Pipelined Cache


In the pipelining technique, instructions are processed in the processor in an overlapped, parallel manner: an instruction cycle is divided into a series of stages, and instead of processing a complete instruction sequentially, the stages of different instructions are processed in parallel. This means that a new instruction can be started before the previous one has completed.
A pipelined cache transfers the data that the processor reads from and writes to it in the same pipelined fashion: a later data segment or burst can start flowing before the previous one has reached its destination. Pipelined caches are typically built from static RAM.
The pipeline burst cache works on two principles of operation, pipelined mode and burst mode.
The Pipeline Burst Cache works on two principles of operations, Pipelined Mode and Burst
Mode.
Pipelining Mode

In a pipelined cache, the data at one memory address can be accessed in the cache at the same time as data is being accessed in RAM. The transfer of data to and from the cache memory is divided into stages, and each stage is kept busy with one of these transfers at all times; the same concept is used in assembly-line processing. Pipelining overcomes the drawback of traditional memory operation, which wastes a large amount of time and ultimately reduces the effective speed of the processor.

Fig 1.15: Pipelining Process

Burst Mode
In the burst mode of the cache, data stored in memory is fetched before the request to access it has been fully processed. Consider a typical cache in which each line is 32 bytes, so that 32 bytes of data are read or written in one complete cycle, while the data paths used by the cache are 8 bytes wide; four separate transfers are therefore needed for each cache line. In burst mode there is no need to specify a separate address for each transfer after the first one, which gives a large improvement in the speed of operation.

Self Learning Exercise 1.1

Q.1 What do you mean by cache memory? Explain in brief.

Q.2 Explain cache read and write operation.

Q.3 What is cache memory mapping? Explain in detail.

Q.4 Explain the set associative memory mapping.

Q.5 Explain the cache organization.

Q.6 What is the difference between look-through and look-aside organizations? Which one gives better performance?

Q.7 Explain in brief direct, associative and set associative memory mapping.

Q.8 What do you understand by cache writing techniques? Explain write back and write
through techniques in detail.

Q.9 Differentiate between single cycle cache and FSM cache.

Q.10 Explain the various modes of a pipelined cache.

Chapter 7

Pipelined Processors

Structure

7.1 Objective
7.2 Linear Pipeline Processor
7.3 Non Linear Pipeline Processor
7.4 Instruction Pipeline Processor
7.5 Arithmetic Pipeline Processor
7.6 Super Pipeline Processor

7.1 Objective

The objective of this chapter is to show how the performance of sequential processors can be enhanced through pipelining. Pipelining allows the processing of instructions by dividing a single task into multiple subtasks. An instruction is executed in four main phases: Fetch, Decode, Execute and Deliver. Section 7.2 covers the linear pipeline processor, which consists of multiple processing stages connected sequentially to perform a desired function; linear pipeline processors are further divided into the asynchronous pipeline model and the synchronous pipeline model. Section 7.3 covers the non-linear pipeline model, which is used when the functions to be performed are variable in nature. Section 7.4 describes instruction pipelined processors, in which more than one 'Execute' operation can be present and prefetch buffers are needed for efficient execution of instructions in pipelined form. Section 7.5 covers arithmetic pipeline processors; arithmetic pipelining techniques are used to speed up arithmetic operations. Section 7.6 introduces super-pipelined processors and the idea of instruction-level parallelism (ILP).

Pipelining refers to temporal overlapping of processing. It is a fundamental parallel approach to enhancing the performance of sequential processors, and a standard instruction-processing technique that improves performance by processing instructions like an assembly line: each stage receives its input from the previous stage and delivers its output to the next stage. In a basic pipeline structure, a single task is divided into multiple subtasks, and we assume that a pipeline stage is attached to each subtask to perform the desired operation. All of these stages operate like an assembly line, so the first stage accepts the input and the last stage produces the output. The basic pipeline structure works in a synchronized fashion: a new input is accepted at the start of a clock cycle, and the stage's result is delivered in the next clock cycle. Figure 7.1 shows the basic pipeline structure described above.

Figure 7.1 Basic Pipeline Structure

Now let us take an example to illustrate how pipelining improves performance. The processing of a single instruction is divided into multiple phases: Fetch, Decode, Execute and Deliver. In the first clock cycle instruction 1 is fetched; in the second clock cycle instruction 1 is decoded and instruction 2 is fetched; in the third clock cycle instruction 1 reaches the third stage, i.e. execution, instruction 2 is decoded and instruction 3 is fetched. Figure 7.2 shows the processing of these instructions in a pipelined manner.

Figure 7.2 Processing of pipelined instructions
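The overlap can be visualised with a tiny simulation; the C sketch below (an illustration written for this text, with a hard-coded 4-stage pipeline) prints which instruction occupies each stage in every cycle, matching the Fetch/Decode/Execute/Deliver timing just described.

#include <stdio.h>

#define STAGES 4   /* Fetch, Decode, Execute, Deliver */
#define INSTRS 5

int main(void)
{
    const char *stage[STAGES] = { "Fetch", "Decode", "Execute", "Deliver" };

    /* Instruction i enters stage s in cycle i + s (cycles start at 1). */
    for (int cycle = 1; cycle <= INSTRS + STAGES - 1; cycle++) {
        printf("Cycle %d:", cycle);
        for (int s = 0; s < STAGES; s++) {
            int instr = cycle - s;            /* instruction in stage s  */
            if (instr >= 1 && instr <= INSTRS)
                printf("  %s=I%d", stage[s], instr);
        }
        printf("\n");
    }
    return 0;
}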

Various Pipelining designs for processor development are available such as:
• Linear Pipeline Processor
• Non-Linear Pipeline Processor
• Instruction Pipeline Processor
• Arithmetic Pipeline Processor
• Super Pipeline Processor

7.2 LINEAR PIPELINE PROCESSOR

A linear pipeline processor consists of multiple processing stages connected sequentially to perform a desired function. Inputs are injected into the pipeline at the first stage, and partial results are passed from stage Si to Si+1 for all i, where i is the index of the processing stage. The output is extracted from the last stage. Linear pipeline processors come in two variants:

7.2.1 Asynchronous Pipelined Model

A handshaking protocol is used in Asynchronous model to control data flow along the
pipeline. When first stage is ready with output it sends a ready signal to next stage. In
response next stage sends acknowledge signal to the first stage. Delay may vary in different
stages in case of Asynchronous Pipelined Model. Figure 7.3 depicts working of
Asynchronous Pipelined Model.

Figure 7.3 Linear pipelined model working asynchronously

7.2.2 Synchronous Pipelined Model

A clocked latch built from master-slave flip-flops is used in the synchronous model to control data flow along the pipeline. When a clock pulse arrives, all latches transfer data to the next stage at the same time, and the delay is approximately equal in all stages. Successive tasks are initiated into the pipeline at a rate of one per cycle, and once the pipeline is full, one result is extracted in every subsequent cycle. For maximum throughput, successive tasks must be independent of each other. Figure 7.4 shows the working of the synchronous pipelined model.

Figure 7.4 Linear pipelined model working synchronously

7.2.3 Performance measures

Performance in linear pipeline processor is characterized by three different parameters:


• Cycle time of the processor: the time allotted to each stage to perform its operation.
• Latency of an instruction: the amount of time before the result of a particular instruction becomes available to the next dependent instruction, generally expressed as a multiple of the cycle time.
• Throughput of instructions: the shortest possible time interval between successive independent instructions in the pipeline. For a basic linear model the throughput is one instruction per cycle, but for complex instructions, where one instruction may depend on another and one or more stages must be recycled, the repetition rate becomes two or more cycles.

7.3 NON LINEAR PIPELINE PROCESSOR

Linear pipelines are also called static pipelines because they perform fixed functions. When the functions are variable and must be performed at different times, a dynamic (non-linear) pipeline processor is used. A non-linear pipeline allows feed-forward and feedback connections in addition to the straight-through data flow. Consider a non-linear pipeline with three stages: besides the straight connections from stage 1 to 2 and from stage 2 to 3, there is a feed-forward connection from stage 1 to 3 and two feedback connections, from stage 3 to 2 and from stage 3 to 1. The output therefore need not be extracted from the last stage. Figure 7.5 shows this three-stage non-linear pipeline architecture.

Figure 7.5 Non linear pipeline processor model with three stages

7.3.1 Performance measures

• Throughput: the average number of task initiations per clock cycle. The shorter the minimal average latency (MAL), the higher the throughput.
• Efficiency: the percentage of time that each pipeline stage is used over a sequence of task initiations is known as the stage utilization. The accumulated rate of all stage utilizations defines the efficiency; naturally, higher throughput results in better efficiency.

7.4 INSTRUCTION PIPELINE PROCESSOR

An instruction-pipelined processor executes a stream of instructions in an overlapped manner. An instruction execution generally consists of a number of related operations: 'Fetch' to bring the instruction from the cache, 'Decode' to recognize the resources required by the operation, and 'Issue' to reserve those resources for the duration of the operation. There can be more than one 'Execute' operation, the number varying with the type of instruction. The final operation is 'Writeback', which writes the resulting data into the registers. Instructions can further be placed in two broad categories:
• Sequential instructions
• Branch instructions

7.4.1 Prefetch buffers

Prefetch buffers are needed for efficient execution of instructions in pipelined form. For sequential instructions, sequential buffers are used; for branch instructions, target buffers are more effective. Both types of buffer work in FIFO (first-in-first-out) fashion. A third kind of instruction, the conditional branch, needs both sequential and target buffers for smooth pipeline flow. Basically, the role of the buffers is to reduce the mismatch between the speed of instruction fetching and the speed of pipeline consumption. Figure 7.6 illustrates the use of sequential and target buffers to execute a conditional branch instruction.

Figure 7.6 Use of Prefetch buffers to execute conditional branch instruction

Buffers are always used in pairs. When a conditional branch instruction is fetched from memory, the branch condition is first evaluated; the appropriate instructions are then taken from one of the two buffers, and the instructions in the other buffer are discarded. In each pair, one buffer is used to load instructions from memory and the other is used to feed instructions into the pipeline.
A third type of prefetch buffer is the loop buffer, which stores the sequential instructions enclosed in a loop. The loop buffer works in two ways: first, it holds instructions fetched sequentially ahead of the current instruction, saving instruction fetch time; second, it recognizes when the target of a branch falls within the loop, which avoids unnecessary memory accesses when the halting condition lies inside the loop itself.

7.4.2 Internal Data Forwarding

The throughput of a pipelined processor can be improved by using a technique called internal
data forwarding, in which some memory-access operations are replaced by register-transfer
operations, since register transfers are much faster than memory accesses. Figure 7.7 shows
store-load forwarding: data from one register is stored into memory and the same data is then
loaded from memory into a second register. The load operation from memory to register can
be replaced by a move operation from one register to the other. This technique reduces
memory traffic and also reduces execution time.

Figure 7.7(a) Traditional store-load forwarding (b) same process with internal data
forwarding
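As a small illustration (not from the text; the variable memory_word and the locals r1, r2 merely stand in for a memory location and two registers), the following C sketch shows the load from memory being replaced by a register-to-register transfer:

#include <stdio.h>

static int memory_word;               /* stands in for memory location M        */

int main(void) {
    int r1 = 42, r2;

    /* Without forwarding: store R1 to memory, then load it back into R2.       */
    memory_word = r1;                 /* store  R1 -> M                          */
    r2 = memory_word;                 /* load   M  -> R2 (extra memory access)   */

    /* With internal data forwarding: the load is replaced by a register-to-
       register move; the store is still performed so memory stays correct.     */
    memory_word = r1;                 /* store  R1 -> M                          */
    r2 = r1;                          /* move   R1 -> R2 (no memory access)      */

    printf("r2 = %d\n", r2);
    return 0;
}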

7.4.3 Scheduling Instructions through Instruction Pipeline

There are three methods for scheduling instructions:


 Static Scheduling
 Dynamic Scheduling
 Scoreboarding scheme

7.4.3.1 Static scheduling

In a sequence of instructions, several instructions may be interrelated due to data
dependences. These dependences can be handled using the static scheduling technique: the
compiler creates a distance between interrelated instructions by inserting as many
independent instructions as possible between a Load instruction and the instruction that uses
its result. For example, as shown in Table 7.1, the Multiply instruction depends on two Load
instructions, while the Add and Move instructions are independent of the two Loads. The two
Loads can therefore be moved forward to maintain a distance between the Loads and the
Multiply instruction, compensating for the load delay.

Table 7.1 (a) Interrelated instructions prior to Static scheduling. (b) Statically scheduled
instructions
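Since the entries of Table 7.1 are not reproduced here, the following C sketch is only an assumed illustration of the kind of reordering the table describes; the comments indicate the corresponding Load, Add, Move and Multiply operations.

/* Before scheduling: the multiply immediately follows the two loads,
   so the pipeline stalls waiting for the loaded operands.               */
int unscheduled(const int *a, const int *b, int x, int y, int *out) {
    int p = *a;          /* LOAD  */
    int q = *b;          /* LOAD  */
    int r = p * q;       /* MUL  - depends on both Loads                 */
    int s = x + y;       /* ADD  - independent of the Loads              */
    *out = s;            /* MOVE - independent of the Loads              */
    return r;
}

/* After static scheduling: the independent ADD and MOVE are placed
   between the Loads and the Multiply, hiding the load delay.            */
int scheduled(const int *a, const int *b, int x, int y, int *out) {
    int p = *a;          /* LOAD  */
    int q = *b;          /* LOAD  */
    int s = x + y;       /* ADD  - fills a load delay slot               */
    *out = s;            /* MOVE - fills a load delay slot               */
    return p * q;        /* MUL  - operands available, no stall          */
}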

7.4.3.2 Dynamic Scheduling

Dynamic scheduling is another technique for handling stage delays. It requires dedicated
hardware to detect and resolve interlocks between instructions, and it is generally preferred
for traditional RISC and CISC pipeline processors. For scalar processors (instruction issue
rate of 1), an optimizing ILP compiler can search for independent instructions and place them
immediately after the Load instructions. For superscalar processors (instruction issue rates of
2 and higher), it is often not possible to find enough independent instructions statically, so a
hardware mechanism is required to prevent a dependent instruction from executing until the
data it depends on is available. For example, the IBM 360/91 processor implements
Tomasulo's algorithm for dynamic instruction scheduling. This algorithm resolves conflicts
and clears data dependences using register tagging, a process of allocating and de-allocating
the source and destination registers.

7.4.3.3 Scoreboarding

Earlier processors used dynamic instruction scheduling hardware in which multiple parallel
units were allowed to execute irrespective of the original sequence of instructions. The
processor had instruction buffers for each execution unit, and instructions were issued to the
functional units without checking for the availability of register input data, so an instruction
might have to wait for its data in a buffer. To overcome this problem, and to route data
correctly between execution units and registers, a control unit known as the scoreboard was
introduced. This unit keeps track of the data registers needed by instructions waiting in the
buffers; only when all source registers hold valid data does the scoreboard enable the
instruction's execution. Similarly, when a functional unit finishes execution, it signals the
scoreboard to release the resources.

7.5 ARITHMETIC PIPELINE PROCESSOR

Arithmetic pipelining techniques are used to speed up arithmetic operations. Arithmetic
comes in two variants: fixed-point (integer) arithmetic, which works on a fixed range of
numbers, and floating-point arithmetic, which works on a dynamic range of numbers. There
are four main arithmetic operations: add, subtract, multiply and divide. For fixed-point
numbers, adding or subtracting two n-bit integers yields an n-bit result, multiplying two n-bit
numbers produces a result of up to 2n bits, and dividing one n-bit number by another may
produce a long quotient and a remainder. Arithmetic pipelining is implemented in particular
for floating-point additions and subtractions and for the multiplication of fixed-point
numbers. The algebraic value of a floating-point number is represented as
X = m × r^e
where X is the algebraic value, m is the mantissa, r is the radix or base of the floating-point
number, and e is the exponent.

7.5.1 Floating-point addition /subtraction

Here, a pipeline unit for floating-point addition is elaborated. The inputs to the floating-point
adder are:
X = A × 2^a
Y = B × 2^b
Floating-point additions and subtractions can be performed in four stages:
 Compare the exponents
 Align the mantissas
 Add or subtract the mantissas
 Normalize the result
The following example explains the four stages. Decimal numbers are used for simplicity, so
the radix becomes 10 instead of the 2 stated above for binary. Consider two floating-point
numbers:
A = 0.8403 × 10^4
B = 0.7100 × 10^3

According to the first stage, the exponents are compared: 4 − 3 = 1. The larger exponent, 4, is
chosen as the exponent of the result. The second stage aligns the mantissas by shifting the
mantissa of the second operand (B) one position (the difference) to the right. The
intermediate result is now
A = 0.8403 × 10^4
B = 0.0710 × 10^4
The two numbers now have the same exponent. The third stage adds the two mantissas to
give the sum
C = 0.9113 × 10^4
The fourth stage normalizes the result, i.e. it ensures a fraction with a nonzero first digit. As
the result is already normalized, the above value of C is the final result. Had the sum carried
a digit to the left of the point, it would have been normalized by shifting the mantissa one
position to the right and incrementing the exponent by one, giving a value of the form
0.XXXXX × 10^5.
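The four stages can be sketched in C; this is only an illustration (the decimal 4-digit representation, the struct and the function names are assumptions, not part of the text), where mantissa 8403 with exponent 4 stands for 0.8403 × 10^4:

#include <stdio.h>

typedef struct { int mant; int exp; } fpnum;   /* 4-digit decimal mantissa */

static fpnum fp_add(fpnum x, fpnum y) {
    /* Stage 1: compare the exponents; keep the larger one.               */
    int diff = x.exp - y.exp;
    if (diff < 0) { fpnum t = x; x = y; y = t; diff = -diff; }

    /* Stage 2: align the mantissa of the smaller operand.                */
    while (diff-- > 0) y.mant /= 10;

    /* Stage 3: add the mantissas.                                        */
    fpnum z = { x.mant + y.mant, x.exp };

    /* Stage 4: normalize (mantissa must stay a 4-digit fraction).        */
    if (z.mant >= 10000) { z.mant /= 10; z.exp += 1; }
    return z;
}

int main(void) {
    fpnum a = { 8403, 4 };   /* 0.8403 x 10^4 */
    fpnum b = { 7100, 3 };   /* 0.7100 x 10^3 */
    fpnum c = fp_add(a, b);
    printf("0.%04d x 10^%d\n", c.mant, c.exp);   /* prints 0.9113 x 10^4 */
    return 0;
}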

Figure 7.9 Arithmetic Pipeline Processor for Floating-point addition and subtraction

7.5.2 Fixed-point multiply pipeline design

Consider as an example the multiplication of two 8-bit integers, A × B = C, where C is the
16-bit product. This multiplication can be viewed as the addition of eight partial
(intermediate) products:
C = A × B = C0 + C1 + C2 + C3 + C4 + C5 + C6 + C7
where partial product Ci is A shifted left by i positions when bit i of B is 1, and zero otherwise.
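A brief C sketch of this view of multiplication (the operand values are arbitrary and only serve as an example) accumulates the eight partial products and checks the sum against the hardware product:

#include <stdio.h>
#include <stdint.h>

int main(void) {
    uint8_t  a = 0xB5, b = 0x6C;      /* arbitrary 8-bit operands       */
    uint16_t c = 0;                   /* 16-bit product                 */

    for (int i = 0; i < 8; i++) {
        /* Partial product Ci: A shifted left by i if bit i of B is 1.  */
        uint16_t partial = ((b >> i) & 1) ? (uint16_t)(a << i) : 0;
        c += partial;                 /* C = C0 + C1 + ... + C7         */
    }

    printf("%u * %u = %u (check: %u)\n",
           (unsigned)a, (unsigned)b, (unsigned)c, (unsigned)a * b);
    return 0;
}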

7.6 SUPER PIPELINE PROCESSOR

For a superscalar machine of degree m, m instructions are issued per cycle, and the ILP
(instruction-level parallelism) should be m in order to fully utilize the pipelines. ILP is the
maximum number of instructions that can be simultaneously executed in the pipeline.
Accordingly, the instruction decoding and execution resources are enhanced so that m
pipelines can operate in parallel; at some stages, functional units may be shared by multiple
pipelines. Figure 7.10 shows a dual-pipeline superscalar processor that can issue two
instructions per cycle. 'Dual pipeline' means that there are essentially two pipelines in the
design, each with four processing stages. The two instruction streams are fetched from a
single source, the I-cache. Two store units are used dynamically by the two pipelines
depending on availability. A lookahead window is also present in the design for instruction
lookahead, in case out-of-order issue is required to improve the throughput.

Figure 7.10 A dual pipeline super processor requiring out-of-order issues

7.6.1 Parallel execution

When superscalar instructions are executed in parallel, they usually finish out of order. This
does not depend on whether the instructions are issued in order or out of order; the reason is
the difference in execution times, since shorter instructions may finish earlier than longer
ones.
To handle this, a distinction is made between the terms 'to finish', 'to complete' and 'to
retire'. 'To finish' indicates that the operation required by the instruction is accomplished,
except for writing the result back into the specified location. 'To complete' refers to the point
at which the last action of instruction execution, writing the result back into the specified
location, is performed. The final term, 'to retire', is connected with the reorder buffer (ROB),
since in this case two tasks are performed: writing back the result and deleting the completed
instruction from its ROB entry.

Summary

Pipelining is a fundamental parallel approach to enhance the performance of sequential


processors. It is a standard instruction-processing technique that enhances performance by
processing instructions like an assembly line. In a basic pipeline structure, a single task is
divided into multiple subtasks. The basic pipeline structure works in a synchronized form: a
new input is accepted at the start of a clock cycle and the result is delivered in the next clock
cycle. Instruction processing is divided into multiple stages: Fetch, Decode, Execute and
Deliver. In the first clock cycle instruction 1 is fetched; in the second clock cycle instruction
1 is decoded and instruction 2 is fetched; in the third clock cycle instruction 1 reaches the
third stage (execution), instruction 2 is decoded and instruction 3 is fetched.

A Linear Pipeline Processor consists of multiple processing stages connected sequentially to


perform a desired function. Inputs are injected into the pipeline at the first stage, and partial
results are passed from stage Si to stage Si+1 for all i. Linear pipeline processors come in two
variants: the asynchronous pipelined model and the synchronous pipelined model. In the
asynchronous model, a handshaking protocol is used to control the data flow along the
pipeline; in the synchronous model, clocked latches built with master-slave flip-flops control
the data flow along the pipeline.

When the functions are variable and have to be performed at different times, a dynamic
(non-linear) pipeline processor is used. A non-linear pipeline allows feed-forward and
feedback channels along with the straight data flow. In the three-stage example discussed
earlier, besides the straight data flows from stage 1 to stage 2 and from stage 2 to stage 3,
there is a feed-forward channel from stage 1 to stage 3 and two feedback channels, from
stage 3 to stage 2 and from stage 3 to stage 1. The output therefore need not be taken from
the last stage.

An instruction-pipelined processor executes a stream of instructions in an overlapped manner.
Generally, the execution of an instruction consists of a number of related operations: a
'Fetch' operation from the cache, 'Decode' to recognize the required resources, 'Issue' to
hold the resources for as long as the operation lasts, and one or more 'Execute' operations.
Prefetch buffers are needed for efficient execution of instructions in pipelined form:
sequential buffers are used for sequential instructions, while target buffers are more effective
for branch instructions. There are three methods for scheduling instructions: static
scheduling, dynamic scheduling, and the scoreboarding scheme.

Arithmetic pipelining techniques are used to speed up arithmetic operations. Arithmetic
comes in two variants: fixed-point (integer) arithmetic, which works on a fixed range of
numbers, and floating-point arithmetic, which works on a dynamic range of numbers.

For a superscalar machine of degree m, m instructions are issued per cycle, and the ILP
(instruction-level parallelism) should be m in order to fully utilize the pipeline. ILP is the
maximum number of instructions that can be simultaneously executed in the pipeline. When
superscalar instructions are executed in parallel, they usually finish out of order.

Exercise

Problem 7.1 - For a seven-segment pipeline, draw a space-time diagram to represent the time
it takes to process eight tasks.

Problem 7.2 – Find out the number of clock cycles required to process 200 tasks in a six-
segment pipeline.

Problem 7.3 – An arithmetic operation (Ai + Bi) × (Ci + Di) is to be performed on a stream of
numbers. Show the pipeline structure needed to execute this task, and list the contents of all
pipeline registers for i = 1 to 6.

Problem 7.4 – Modify the flowchart represented in Figure 7.9 to add 100 floating-point
numbers X1 + X2 + X3 +……..+ X100

Problem 7.5 – A non-pipelined system takes 50 ns to execute a task. The same task can be
executed in a six-segment pipeline with a clock cycle of 10 ns. Determine the speed-up ratio
of the pipeline for 100 tasks.

Problem 7.6 – Formulate a seven-segment instruction pipeline for a computer. Specify the
operations to be performed in each segment.

Problem 7.7 - Define out-of-order issue in a super pipelined computer. How can it be
resolved?

Problem 7.8 – Draw a pipeline unit for floating-point addition, A = 0.9273 × 10^4 and B =
0.6542 × 10^3. Result of addition, C = A + B.

Problem 7.9 - Draw a pipeline unit for floating-point subtraction, A = 0.9273 × 10^4 and B =
0.6542 × 10^3. Result of subtraction, C = A - B.

Problem 7.10 – Multiply two 16-bit binary numbers, C = A × B. How many bits does the
result require? Show the pattern of intermediate (partial) products used to obtain the final
result.

Problem 7.11 – What is the basic difference between asynchronous and synchronous linear
pipeline structures?

Problem 7.12 - What is the basic difference between linear and non-linear pipeline structures?
Which one is better in which situation?

Problem 7.13 – In the instruction queue of the dispatch unit of the PowerPC 601, instructions
may be dispatched out of order to the branch processing and floating-point units, but
instructions meant for the integer unit may be dispatched only from the bottom of the queue.
Why does this limitation exist?

Problem 7.14 – When out-of-order completion is used in a super pipelined processor,
resuming execution after an interrupt is processed becomes complicated, because an
exceptional condition may have occurred and produced its result out of order. The program
cannot simply be restarted at the exceptional instruction, since other, later instructions have
already completed and restarting would cause those instructions to execute twice. What steps
are necessary to handle this situation?

Problem 7.15 – Draw a binary integer multiply pipeline with a maximum of five stages. The
first stage is used only for partial-product generation. The last stage consists of a 36-bit carry
look-ahead adder, and the middle stages consist of carry-save adders of appropriate widths.
(a) Construct the 5-stage multiply pipeline.
(b) What is the maximum throughput of this multiply pipeline in terms of 36-bit results
produced per second?

Chapter 8 Multi-core Processors and Multithreading

Contents

8.1 Overview
8.2 Architectural Concepts
8.2.1 Multiple Cores
8.2.2 Interconnection Networks
8.2.3 Memory Controllers
8.2.4 Memory Consistency
8.2.5 Multi-threading hardware
8.2.6 Multi-processor Interconnect
8.3 Multitasking vs. Multithreading: Principles of Multithreading
8.3.1 Multitasking
8.3.2 Multithreading
8.4 Intel Xeon 5100
8.4.1 Thermal and power management capabilities
8.4.2 Electrical specification
8.5 Multiprocessor
8.5.1 Multiprocessor Hardware
8.5.1.1 UMA Bus-Based SMP Architectures
8.5.1.2 UMA Multiprocessors Using Crossbar Switches
8.5.1.3 NUMA Multiprocessors

Multi-core processing and multithreading in processors differ only slightly. This chapter
discusses the most important features of a processor's architecture, such as the memory
architecture and the core organization. In a multiprocessor, multiple separate processors work
together to complete a single task, while in a multi-core processor multiple cores on one chip
are connected to each other by on-chip communication links.

8.1 Overview

The IBM POWER4 processor, released in 2001, was the first general-purpose processor to
implement multiple cores on a single CMOS chip. Since then, adding cores has become the
standard way to improve the performance of high-end processors, and it is currently the only
practical way. Performance is improved by adding more cores per chip, or by supporting
multiple threads per core to mask long-latency operations. The clock-speed gains of the past
can no longer be relied upon, for several reasons: the main one is the unsustainable level of
energy consumption involved in higher clock frequencies, and wire delays are also important,
with wire delay rather than transistor switching becoming the dominant contribution to each
clock cycle. The design space of multi-core processors is much more diverse than that of
single-threaded processors. This chapter discusses the architectural principles of multi-core
architectures, presents some recent examples, and reports on critical issues related to
scalability.
8.2 Architectural Concepts

8.2.1 Multiple Cores

Using multiple cores is a simple concept in itself, but it raises several design trade-offs,
especially when scaling to many cores. The first question is whether the cores should be
homogeneous or heterogeneous. Most current multi-core processors are homogeneous: every
core is capable of executing the same binaries and, from a functional point of view, the cores
are indistinguishable. To save power or to increase single-threaded performance, today's
multi-core architectures allow system software to control the clock frequency of each core
individually. Homogeneous architectures usually also implement a global address space with
full cache coherence, so one core cannot be distinguished from another even if a process
migrates between cores during execution. In contrast, a heterogeneous architecture has at
least two types of cores, which differ in functionality, in performance, and possibly even in
instruction set architecture (ISA). The best-known example of a heterogeneous multi-core
architecture is the Cell architecture, developed jointly by Sony, Toshiba and IBM and used in
areas such as game consoles and high-performance computing. A coherent architecture with
globally shared memory is well suited to parallelizing a single program across all cores,
whereas a non-homogeneous architecture, in which the cores do not use the same instruction
set, suits applications that naturally partition into coarse tasks with regular communication
between the partitions, each specialized core being provided for a specific task.
The internal organization of the cores can also differ in many ways. Every modern core is
pipelined, with instruction decoding and execution overlapped to improve overall throughput,
even though the latency of each instruction stays the same or increases. All high-performance
cores also include hardware for speculative, dynamically scheduled execution of instructions.
These techniques increase the average number of instructions per clock cycle (IPC), but they
can only exploit the limited instruction-level parallelism (ILP) present in existing
applications, and because they are costly both in silicon real estate and in power consumption,
they are of less importance in the modern multi-core architectures used today. Indeed, some
advanced designs have gone back to simple in-order cores (in the case of Knights Corner,
complemented by powerful vector instructions), reducing silicon area and energy
consumption for each core. To the limited instruction-level parallelism of each core, most
modern cores add simultaneous multithreading (SMT), known commercially under Intel's
brand name Hyper-Threading. This is a hardware technique that makes better use of execution
resources by picking instructions from more than one thread at several points of the pipeline.
The advantage is that, for applications where single-thread ILP is limited, thread-level
parallelism can be exploited instead. Simultaneous multithreading is a useful property that is
comparatively inexpensive in terms of both silicon area and energy consumption.

8.2.2 Interconnection Networks


With multiple cores on one chip, an inter-core communication mechanism is necessary. In
classic shared-memory multiprocessors, each processor communicated with shared memory
over a common bus shared by all processors, and the first multi-core processors from vendors
such as AMD and Intel took the same approach. Because designers do not want to occupy the
bus with data traffic that can be served locally, each processor usually has one or two levels
of private cache between it and the bus. The common bus also makes it easier to implement
cache coherence, since a bus is a broadcast medium on which every memory operation can be
observed by all cores. Current designs have moved away from a shared bus for reasons of
both bandwidth and latency: a shared bus has long, heavily loaded wires, and with all the
cores and the memory subsystem acting as potential bus masters and slaves, the capacitive
loading makes it slower than before; in addition, because many devices contend for the bus at
the same time, the bandwidth available to each core is fundamentally reduced. Other popular
general-purpose multi-core processors instead use a crossbar interconnect between the core
modules (with their level 1 or 2 caches) and the last-level cache and memory interfaces (RAM
and main memory). Technologies such as multi-ring buses and on-chip switched networks are
emerging and gaining ground because they offer higher bandwidth, lower power
consumption, or both. As the number of cores increases, on-chip communication networks
must become more scalable to avoid becoming a performance limitation.

Figure 8.1: Network interconnection topologies

8.2.3 Memory Controllers
The memory interface is an important part of any high-performance processor, and it is even
more important for a multi-core processor, because it is an on-chip resource shared by all
cores. In the case of a parallel program one can expect synergy: the threads share work and
data, generally execute the same instructions, and work on data located close to the data of
other threads, so the chance that they access the same DRAM pages is fairly high. On the
other hand, when a multi-core processor is used as a multi-tasking machine running many
independent programs, the memory interface is shared only in the space- and time-multiplexed
manner managed by current operating systems.

8.2.4 Memory Consistency

A fundamental issue that must be considered in every multi-core design is providing a
consistent view of memory. Because the contents of a physical memory location may be
replicated several times, at different cache levels and in different cores, the processor must
provide a consistent and easily understood model of how competing loads and stores to the
same location are coordinated. Ideally, with more than one copy of the same memory
location in existence, a store should become visible to all cores in zero time, something that
cannot be achieved in practice. The effect of immediate propagation can be obtained,
however, if a global order is enforced on stores to the same memory location, and a load
returns the value of the store that is globally ordered immediately before it, with respect to all
cores in the system.
The strictest memory consistency model in use today is sequential consistency. Formally, a
multiprocessor is sequentially consistent if the result of any execution is the same as if the
memory operations of all threads were executed in some serial order, and the operations of
each individual thread appear in this sequence in the order specified by its program.
Intuitively, this means that, while the memory access order of each core is maintained,
accesses from different cores may be interleaved in any order. If programmers were surveyed,
this memory model would be their first choice, but it requires considerable effort to
implement in hardware. Many relaxed models with weaker consistency requirements have
therefore been introduced. Some examples are given below:

• Loads are allowed to bypass stores within a core if they address a different memory
location.

• A value stored by a core is first written to main memory before it becomes visible to every
processor.

• Consistency is enforced only through atomic sections used when accessing shared memory
areas (implementing mutual exclusion); all other loads and stores are performed without any
consistency enforcement, as they are considered local data.

• Processors based on speculative execution sometimes allow speculative violations of
memory-ordering rules: the core assumes that the values it reads or writes are not in conflict
with accesses from other cores. If this assumption turns out to be false, the operation is
cancelled and re-executed; but a significant performance advantage is gained in the common
case where no conflict occurs.

For programmers, memory consistency must be considered one of the most complex and
difficult issues when programming multi-core systems. Knowing the basics of how memory
access works is therefore essential, for example when hunting down concurrency bugs or
when implementing basic synchronization primitives on top of core operations. As we go
through this material, we will see that partitioning work into concurrent tasks and
synchronizing between them are central activities in designing software for many-core
systems. Synchronization is very difficult to achieve in software alone, so hardware
assistance is needed; purely hardware synchronization mechanisms, however, are generally
difficult to scale and have limited flexibility. Because of that, the most common solution
builds reliable software synchronization on top of basic hardware support, and programmers
concentrate on the primitives the hardware provides. Current processors provide support for
synchronization in the form of atomic read-modify-write (RMW) instructions or conditional
stores. The fundamental principle behind these is to give the smallest possible critical section
guaranteed conflict-free access to a specific memory location that contains the data required
for synchronization. The most commonly used RMW instructions are described below:

Test and Set (T&S) atomically reads a memory location, sets it to one, and returns the
original value in a core register; no other core can perform a store to this memory location
while the T&S is being carried out. A lock acquire/release based on this instruction looks like
the following sketch:
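The source text announces a listing here that is not reproduced; the following is only a minimal C11 sketch of such a lock (names are illustrative). atomic_flag_test_and_set is the standard C11 operation that compilers lower to the machine's test-and-set or an equivalent atomic read-modify-write instruction.

#include <stdatomic.h>

static atomic_flag lock_word = ATOMIC_FLAG_INIT;

static void acquire(void) {
    /* Atomically read the flag and set it to 1; spin while the old value
       was already 1, i.e. while another core holds the lock.              */
    while (atomic_flag_test_and_set_explicit(&lock_word, memory_order_acquire))
        ;                                     /* busy-wait (spin)          */
}

static void release(void) {
    /* Clear the flag so another core's test-and-set can succeed.          */
    atomic_flag_clear_explicit(&lock_word, memory_order_release);
}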

Compare and Swap (CAS) atomically compares the contents of a memory location with a
supplied value and, if they are equal, replaces the memory contents with a new value supplied
in a register. A lock implementation based on CAS is as follows (quite similar to the
implementation based on T&S):
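Again, the original listing is not reproduced in the text; the sketch below (illustrative names, C11 atomics) acquires the lock by atomically changing the lock word from 0 to 1 only if it is still 0.

#include <stdatomic.h>

static atomic_int cas_lock = 0;

static void cas_acquire(void) {
    int expected = 0;
    /* CAS: if cas_lock == expected (0), store 1 and succeed; otherwise the
       current value is copied into expected and the loop retries.          */
    while (!atomic_compare_exchange_weak_explicit(&cas_lock, &expected, 1,
                                                  memory_order_acquire,
                                                  memory_order_relaxed))
        expected = 0;                         /* reset and try again        */
}

static void cas_release(void) {
    atomic_store_explicit(&cas_lock, 0, memory_order_release);
}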

Load-linked and store-conditional (LL/SC): this is an unbundled version of T&S that is more
convenient to use. There are two instructions, which are linked together: a load-linked, and a
store-conditional to the same memory location that succeeds only if no other operation has
touched that location since the load-linked was executed (on failure, a register is set to 0 so
that the software can retry). Here is how a lock can be implemented with these mechanisms:
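The listing promised here is likewise missing from the text. C has no portable load-linked/store-conditional primitive, so the sketch below uses two hypothetical helpers, ll() and sc(), that emulate the idea with a compare-and-swap (ignoring the ABA subtlety); on ARM or RISC-V hardware a compiler would emit real LDREX/STREX or LR/SC instruction pairs for such a loop.

#include <stdatomic.h>

static atomic_int llsc_lock = 0;

static int ll(atomic_int *addr) {                    /* "load-linked"         */
    return atomic_load_explicit(addr, memory_order_acquire);
}

static int sc(atomic_int *addr, int old, int desired) { /* "store-conditional" */
    /* Succeeds only if the location still holds the value seen by ll().      */
    return atomic_compare_exchange_strong_explicit(addr, &old, desired,
                                                   memory_order_acq_rel,
                                                   memory_order_relaxed);
}

static void llsc_acquire(void) {
    for (;;) {
        int v = ll(&llsc_lock);                      /* linked load           */
        if (v == 0 && sc(&llsc_lock, 0, 1))          /* conditional store     */
            return;                                  /* store succeeded       */
        /* lock busy or store-conditional failed: retry                       */
    }
}

static void llsc_release(void) {
    atomic_store_explicit(&llsc_lock, 0, memory_order_release);
}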

The presence of these ISA constructs greatly simplifies the implementation of synchronization
mechanisms. Any one of these basic mechanisms is sufficient to build most types of
synchronization primitives, as well as software implementations of lock-free data structures.
8.2.5 Multi-threading hardware

Thread-level concurrent execution is termed multithreading. In a multithreaded architecture,
each processor can execute multiple contexts at a time: multiple threads can be created for
each process, and these threads are executed concurrently or in parallel. The term
multithreading implies that there are many threads to be controlled by a processor.
Multithreading offers an effective mechanism for hiding latency in building large-scale
multiprocessors. It demands that the processor be designed to handle multiple contexts
simultaneously on a context-switching basis. All recent operating systems (such as OS/2,
Windows NT or SunOS 5.0) support multithreading.

This observation led to the implementation of hardware multithreading, a mechanism in
which the processor holds multiple thread contexts in hardware (including program counters
and register sets, while sharing, for example, the caches) and can switch quickly between
hardware threads when a thread stalls on a very long latency operation. Many hardware
implementations of this concept exist; the most common fine-grained multithreading
technology in current cores is simultaneous multithreading. The technique has been adopted
by most major companies, such as Intel (as the Hyper-Threading Technology (HTT) concept),
IBM (with a thread-priority concept), and Oracle/Sun (with up to eight hardware threads on
each core). The widening gap between memory access time and core speed, together with the
levelling-off of core clock frequencies, has made such latency-hiding techniques increasingly
attractive for large programs.

8.2.6 Multi-processor Interconnect

Interconnects are among the most important architectural building blocks for multiprocessors.
The interconnect ties the multiple processors together so that they can act as a single logical
processing unit. Currently, two popular interconnect technologies are found in systems:

• HyperTransport: a packet-oriented link that provides point-to-point interconnection with
low latency. The latest HyperTransport generation can transfer up to 51.2 GB/s at 3.2 GHz.
This technology is used in chips by almost all vendors except Intel.

• Intel QuickPath Interconnect (QPI): this interconnect technology is used in most Intel
chips and is also a popular technology for connecting I/O devices.

8.3 Multitasking vs. Multithreading: Principles of Multithreading

8.3.1 Multitasking

Process-level concurrent execution is usually called multitasking. All currently available
operating systems support multitasking. Multitasking refers to the concurrent execution of
processes. It also allows parallel execution of two or more parts of a single program, so a
multitasked job requires less execution time. Multitasking can be achieved by adding code to
the original program in order to provide proper linkage and synchronization of the divided
tasks. Multitasking was introduced in operating systems in the mid-1960s, including, among
others, the IBM operating systems for the System/360 such as DOS/360, OS/MFT and
OS/MVT. Almost all operating systems provide this feature.

Trade-offs do exist between multitasking and not multitasking: multitasking should be
practised only when the overhead is short. Sometimes not all parts of a program can be
divided into parallel tasks; therefore, the multitasking trade-offs must be analysed before
implementation.

Single Core

A core can be considered as a processing unit. A processor with a single core can execute
only a single task at a time. In single-core processing, the tasks other than the currently
executing task have to wait for their turn, and this waiting time increases the overhead. In
these processors the performance can be improved by scheduling: through programming, an
appropriate time slot is given to each task in which it has to be executed.
In Figure 8.2 there are application requests such as word processing, e-mail, web browsing
and virus scanning by an antivirus program. The operating system handles these requests by
placing them in a task queue, and the application tasks are sent for execution one by one
because there is a single execution core.

Figure 8.2: Single core processing

Multi-core

A multi-core system has one CPU that is divided into more than one core, and each core
works as an independent microprocessor. Due to the multiple cores, the processor can
perform multiple operations at the same time. Resources such as the cache and the front-side
bus (FSB) that are needed for processing are shared, so the processor cores in multi-core
chips operate in a shared-memory mode. However, message passing, which works
independently of the physical locations of processes or threads, also provides a natural
software model to exploit the structural parallelism present in an application. A system with
multiple cores provides performance similar to a multiprocessor system with the advantage
of a much lower cost, because in a multi-core system a single CPU supports multiprocessing.
Another advantage of multi-core systems is that a multi-core system with hardware
multithreading also supports the natural parallelism that is always present between two or
more independent programs running on a system. Even two or more operating systems can
share a common hardware platform, in effect providing multiple virtual computing
environments to users. Such virtualization makes it possible for a system to support more
complex and composite workloads, resulting in better system utilization and return on
investment. Figure 8.3 shows a multi-core processor with 4 cores: each core has its
individual local memory, a system memory is shared by all the cores, and external devices
communicate with the processor over the system bus.

Figure 8.3 Multi core processing

8.3.2 Multithreading

A thread can be considered as a lightweight process. Through parallelism, threads improve
the performance of program execution. Threads are implemented in two ways:

User-level threads: these are managed in user space; the kernel does not have any
information about them. The user creates these threads for an application with the help of a
thread library, which contains the routines for thread creation, deletion and communication
by message passing between threads. User-level threads can run on any operating system and
are fast to create and manage.

Figure 8.4 User level threads

Kernel-level threads: the operating system manages these threads. The scheduling of each
thread is done by the operating system; the kernel performs thread creation, scheduling and
thread management, and it can schedule multiple threads simultaneously. A short sketch of
how an application uses a thread library is given below.
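As a concrete illustration of the thread-library interface described above (the worker function and the thread count are assumed for the example; pthread_create and pthread_join are standard POSIX calls), a program can create and wait for threads as follows:

#include <pthread.h>
#include <stdio.h>

static void *worker(void *arg) {
    int id = *(int *)arg;
    printf("thread %d running\n", id);
    return NULL;
}

int main(void) {
    pthread_t t[2];
    int id[2] = { 0, 1 };

    for (int i = 0; i < 2; i++)
        pthread_create(&t[i], NULL, worker, &id[i]);   /* thread creation   */
    for (int i = 0; i < 2; i++)
        pthread_join(t[i], NULL);                      /* wait for finish   */
    return 0;
}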

Some operating systems support a coordinated combination of both types of threads (user
level and kernel level). The developer can create as many user threads as necessary, and the
corresponding kernel-level threads can execute in parallel. The following thread models map
user-level threads onto kernel-level threads.

Many-to-many mapping: multiple user-level threads are multiplexed onto a smaller or equal
number of kernel threads. The number of kernel threads is specific to each application. Figure
8.5 shows the many-to-many mapping.

Figure 8.5 Many to many thread relationship

Many-to-one mapping: in this type of mapping, multiple user-level threads are mapped onto
a single kernel thread. The disadvantage of this mapping is that when a single thread blocks,
the whole execution is blocked, because there is only one kernel-level thread.

Figure 8.6 Many to one thread relationship

One-to-one mapping: this model provides more concurrency than the many-to-one model
and eliminates the blockage caused by a single blocking thread: another thread can run when
one thread is blocked. The main disadvantage of this mapping is that each user-level thread
requires a corresponding individual kernel-level thread.

Figure 8.6 One to one thread relationship

Conventional von Neumann machines are built with processors that execute a single context
at a time; in other words, each processor maintains a single thread of control with its
hardware resources. In a multithreaded architecture, each processor can execute multiple
contexts at the same time; the term multithreading implies that there are multiple threads of
control in each processor. Multithreading offers an effective mechanism for hiding long
latency in building large-scale multiprocessors and is today a mature technology. In
multithreaded processors, the operating system not only assigns a time slot to each
application in which the application has to be executed, but also assigns a time slot to each
thread of an application, since an application can be considered as a collection of multiple
threads.

The multithreading idea was pioneered by Burton Smith (1978) in the HEP system, which
extended the concept of scoreboarding of multiple functional units in the CDC 6600.

Subsequent multithreaded microprocessor projects were the Tera computer (Alverson, Smith
et al., 1990) and the MIT Alewife (Agarwal et al., 1989).

One possible multithreaded MPP system is modelled by a network of processor (P) and
memory (M) nodes, as depicted in Figure 8.7. The distributed memories form a global
address space.

(a) Architectural environment

(b) Multithread computation model

Figure 8.7 Multithread architecture and its computation model

Four parameters are defined to analyse the performance of a multithreaded processor:

The latency (L): the communication latency on a remote memory access. The value of L
includes the network delay, the cache-miss penalty, and delays caused by contention in split
transactions.

The number of threads (N): the number of threads that can be interleaved in each processor.
A thread is represented by a context consisting of a program counter, a register set, and the
required context status words.

The context-switching overhead (C): the cycles lost in performing a context switch in a
processor. This time depends on the switch mechanism and the amount of processor state
devoted to maintaining active threads.

The interval between switches (R): the cycles between switches triggered by remote
references. The inverse p = 1/R is called the rate of requests for remote accesses. It reflects a
combination of program behaviour and memory system design.
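These four parameters are commonly combined into a simple analytical model of multithreaded-processor efficiency; the formulas below are the standard saturation model (an addition here, not stated in the text), in which each thread computes for R cycles, then issues a remote reference of latency L, and every switch costs C cycles:

\[ E_{\text{linear}} = \frac{N R}{R + C + L} \ \text{(below saturation)}, \qquad
   E_{\text{sat}} = \frac{R}{R + C} \ \text{(at saturation)}, \qquad
   N_{\text{sat}} = \frac{R + C + L}{R + C}. \]

For example, with R = 16, L = 80 and C = 4 cycles, about N_sat = 100/20 = 5 threads are needed to keep the processor busy.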

8.4 Intel Xeon 5100


The Dual-Core Intel Xeon Processor 5100 Series consists of 64-bit server/workstation
processors made up of two cores based on Intel's micro-architecture. Combining high
performance with the power efficiency of a low-power micro-architecture on 65-nanometre
technology yields this dual-core Xeon processor; the number of cores on the chip is two. The
Dual-Core Intel Xeon Processor 5100 Series is also compatible with the IA (Intel
Architecture) instruction set. Key features include 32 KB of Level 1 cache and a 4 MB Level
2 cache with the Advanced Transfer Cache architecture; the processor fetches data from the
L1 cache first and falls back to the L2 cache on an L1 miss, which improves performance by
reducing bus-cycle penalties. The Front Side Bus (FSB) runs at 1333 MHz; it is a quad-pumped
bus running off a 333 MHz system clock, giving a 10.66 GB/s data transfer rate. Other,
lower-speed SKUs are available which drive the FSB at 1066 MHz.

Figure 8.8 Block diagram of Dual-Core Intel Xeon processor of 5100 Series

8.4.1 Thermal and power management capabilities


The processor implements thermal and power management capabilities including Thermal
Monitor 1 (TM1), Thermal Monitor 2 (TM2) and Enhanced Intel SpeedStep Technology.
These technologies are targeted at dual-processor enterprise environments. TM1 and TM2 are
used to obtain efficient and effective cooling in high-temperature situations, while Enhanced
Intel SpeedStep Technology is used to manage the power consumption of servers and
workstations.

Terminology
A '#' symbol indicates an active-low signal. The basic terms used here are explained below:
Dual-Core Intel® Xeon® Processor 5100 Series – Intel 64-bit microprocessors used in
dual-processor servers and workstations, based on Intel's 65-nanometre process and with
advanced power-management capability.
FC-LGA6 (Flip Chip Land Grid Array) Package – the package of the Dual-Core Intel
Xeon 5100 series processor is a land grid array comprising a processor die mounted on a
substrate with 771 lands, and it includes an integrated heat spreader (IHS).
LGA771 socket – the Dual-Core Intel Xeon 5100 processor interfaces to the baseboard via
this 771-land surface-mount socket. See the LGA771 socket design guidelines for details.

Processor core – the two processor cores on the die share the integrated L2 cache and the
system bus interface; each core has its own L1 cache. All AC signal timing specifications are
defined at the processor core buffers.
FSB (Front Side Bus) – the electrical interface used to connect the processor to the chipset,
also referred to as the system bus or processor system bus. Besides memory and I/O
transactions, interrupt messages are also passed between the processor and the chipset over
the FSB.
Dual Independent Bus (DIB) – a front-side-bus architecture with one bus per processor,
rather than a single FSB shared by two processor agents. The dual independent bus
architecture enhances performance through higher FSB speed and bandwidth.
Flexible Motherboard Guidelines (FMB) – estimates of the specifications of the Dual-Core
Intel Xeon 5100 series over a period of time. There may be differences between the estimates
and the actual values.
Functional operation – normal operating conditions in which all CPU specifications,
including DC, AC, FSB, mechanical and thermal, are satisfied.
Storage conditions – a non-operational state. The processor may reside in a platform, in a
tray, or in bulk; processors may be sealed in packaging or exposed to free air. Under these
conditions, the processor lands should not be connected to any supply voltages, have any
I/Os biased, or receive any clocks. Upon exposure to "free air" (that is, unsealed packaging or
a device removed from its packaging material), the processor must be handled in accordance
with the moisture sensitivity labelling (MSL) as indicated on the packaging material.
• Priority agent – the chipset, which acts as the bridge between the host processors and the
I/O subsystem.
• Symmetric agent – a processor that shares the same I/O subsystem, memory array and
operating system with another processor in the system. Systems with symmetric agents are
called symmetric multiprocessing (SMP) systems.
• Integrated Heat Spreader (IHS) – a component of the processor package used to improve
the thermal performance of the package. Component thermal solutions interface with the
processor at the IHS surface.
• Thermal Design Power (TDP) – the power dissipation target that thermal solutions should
be designed to meet. It is the highest sustained power measured while executing known
power-intensive real applications; TDP is not the maximum power that the CPU can
dissipate.
• Intel® Extended Memory 64 Technology (EM64T) – an extension of the IA-32 Intel
architecture that permits the processor to run operating systems and applications designed to
take advantage of the 64-bit extension technology. For further details see
http://developer.intel.com/.
• Enhanced Intel SpeedStep Technology (EIST) – used in servers and workstations; it
provides power-management capabilities.
• Platform Environment Control Interface (PECI) – a one-wire interface providing a
communication channel between the processor and Intel chipset components to external
thermal monitoring devices, used for controlling fan speed and for communicating the digital
temperature sensor outputs of the processor. PECI replaces the thermal diode available in
previous processors.
• Intel® Virtualization Technology – technology that, when used with Virtual Machine
Monitor software, allows multiple, robust, independent software environments to run on a
single platform.
• VRM (Voltage Regulator Module) – a DC-DC converter built on a module that interfaces
with a card-edge socket and provides the correct voltage and current to the processor based
on the logic state of the processor VID bits.
• EVRD (Enterprise Voltage Regulator Down) – a DC-DC converter built onto the system
board that provides the correct voltage and current to the processor based on the logic state
of the processor VID bits.
• VCC – processor core power supply.
• VSS – processor ground.
• VTT – FSB termination voltage.
8.4.2 Electrical specification

Front Side Bus and GTLREF


Assisted Gunning Transceiver Logic (AGTL+) signalling technology is used by most FSB
signals of the Dual-Core Intel Xeon 5100 series processors; it provides improved noise
margins and controlled ringing by using low voltage swings and controlled edge rates.
AGTL+ pads use an open-drain scheme and need pull-up resistors to provide the high logic
level and termination. AGTL+ differs from a GTL+ buffer in that the output buffers add a
PMOS pull-up transistor which "assists" the pull-up resistors during the first clock of a
low-to-high voltage transition. Platforms apply a termination voltage level for AGTL+
signals, defined as VTT. Separate VCC and VTT supplies are necessary, which is why
separate platform power planes are implemented for each. This configuration improves noise
tolerance as processor frequency increases, and signal integrity must be maintained on the
higher-speed data and address buses.
Power and ground lands
For clean on-chip power distribution to the processor core, the processor provides 223 VCC
(power) inputs and 273 VSS (ground) inputs. All VCC lands must be connected to the
processor power plane and all VSS lands to the system ground plane. The processor VCC
lands must be supplied with the voltage determined by the processor Voltage Identification
(VID) signals; see Table 2-3 for the VID definitions. About twenty-two lands are provided as
VTT, which supplies termination for the FSB and provides power to the I/O buffers. A
separate power supply that meets the VTT specification must be implemented for these lands.
Decoupling guidelines
Large average current swings between low-power and full-power states are generated
because of the large number of transistors and the high internal clock rates of the Dual-Core
Intel Xeon Processor 5100 series. Inadequate decoupling can cause the power planes to sag
below their minimum values when bulk decoupling is not sufficient. Bulk capacitance
(CBULK), such as electrolytic capacitors, supplies current during longer-lasting changes in
current demand by the component, for example when it emerges from an idle state.

8.5 Multiprocessor

A multiprocessor is a computer system in which two or more CPUs have full access to a
shared RAM. All CPUs may be equal, or some may be reserved for special-purpose tasks. A
program running on any CPU sees a normal virtual address space. The unusual property of
such a system is that a CPU may write a value into a memory word and, on reading that word
back, get a different value (because another CPU has changed it). When organized correctly,
this property forms the basis of inter-processor communication: one CPU writes data into
memory and another CPU reads it out.
Multiprocessor operating systems are, in essence, like ordinary operating systems: they
handle system calls, perform memory management, provide a file system and manage I/O
devices. Nevertheless, there are areas in which they have unique features, including
scheduling, resource management and process synchronization. Systems that treat all CPUs
equally are called symmetric multiprocessing (SMP) systems; other ways of organizing the
resources include asymmetric multiprocessing, non-uniform memory access and clustered
multiprocessing. Below is a brief overview of multiprocessor hardware, followed by
operating system issues.

8.5.1 Multiprocessor Hardware

All multiprocessors have the property that every CPU can address all of the memory. Some
multiprocessors have the additional property that every memory word can be read as fast as
every other memory word; these machines are called Uniform Memory Access (UMA)
multiprocessors, while the rest are called Non-Uniform Memory Access (NUMA)
multiprocessors because they do not have this property.

8.5.1.1 UMA Bus-Based SMP Architectures

The simplest multiprocessors use a single bus: one or more memory modules and two or
more CPUs all share the bus for communication. If a CPU wants to read a memory word, it
first checks whether the bus is idle. If the bus is free, the CPU puts the address of the word it
wants to read on the bus, asserts a few control signals, and waits until the memory puts the
requested word on the bus. If the bus is busy with some other communication when a CPU
wants to access memory, the CPU simply waits until the bus becomes idle and then continues
its operation.

Figure 8.9 Bus based multiprocessors

8.5.1.2 UMA Multiprocessors Using Crossbar Switches

The size of a bus-based UMA multiprocessor is limited to about 16 or 32 CPUs on a single
bus. To scale beyond that, a different kind of interconnection network is required. The
simplest circuit for connecting k CPUs to n memories is the crossbar switch, shown in Figure
8.10. Crossbar switches have long been used in telephone switching exchanges to connect a
group of incoming lines to a set of outgoing lines. At each intersection of a horizontal
(incoming) line and a vertical (outgoing) line is a crosspoint: a simple switch that can be
electrically opened or closed.

Figure 8.10 Crossbar switches in UMA multiprocessing

8.5.1.3 NUMA Multiprocessors

Single-bus UMA multiprocessors are limited to a few dozen CPUs, and crossbar-switched
multiprocessors require expensive hardware and are bulky. To scale further, the idea that all
memory modules have the same access time has to be given up, and this leads to the idea of
NUMA multiprocessors. Like UMA multiprocessors, they provide a single address space
across all the CPUs, but in NUMA machines access to remote memory modules is slower
than access to local memory modules. Programs written for UMA machines therefore run on
NUMA machines, but their performance is worse than on a UMA machine at the same clock
frequency. Three basic key characteristics of NUMA machines are given below:

1. There is a single address space visible to all CPUs.

2. Access to remote memory is through LOAD and STORE instructions.

3. Access to remote memory is slower than access to local memory.

Figure 8.11 NUMA multiprocessor systems

Q1) What is multitasking? How it is different from multi-threading?

Ans: Refer Section 8.3

Q2) What is the difference between single core and multicore processors?

Ans: Refer Section 8.3.1

Q3) What are the different interconnection network topologies used in computer architecture?

Ans: Refer Section 8.2.2

Q4) What is a multiprocessor?

Ans: Refer Section 8.5

Q5) Explain the different types of multiprocessor hardware.

Ans: Refer Section 8.5.1

Chapter 9
Superscalar Processors

Structure

9.0 Objectives

9.1 Introduction
9.1.1 Limitations of scalar pipelines
9.1.2 What is Superscalar?
9.2 Superscalar execution
9.3 Design issues
9.3.1 Parallel Decoding
9.3.2 Instruction Issue policies
9.3.2.1 Register renaming

9.3.3 Execution and issue of operations

9.4 Branch Prediction


9.4.1 Speculative Execution
9.5 Memory Disambiguation
9.6 Dynamic Instruction Scheduling
9.7 Multithreading
9.8 Example of Superscalar Architecture
Summary
Glossary

Objectives

After studying this chapter one will understand

 What a superscalar processor is
 The various shortcomings of scalar processors
 The working of a superscalar processor
 Instruction issue policies and design issues of superscalar processors
 How parallelism is achieved
 How to deal with various dependences among instructions
 How branch instructions are scheduled and executed in parallel with speculation
 How memory disambiguation is performed to enable parallel processing in a
superscalar processor
 What multithreading is and how it is implemented in superscalar processors
 The micro-architectural details and pipeline implementation of an example
superscalar processor

9.1 Introduction

Superscalar processors emerged in the late 1980s and came into the limelight with RISC
processors, though the superscalar concept has also been applied to non-RISC processors
such as the Pentium 4 and AMD processors. In today's market, desktop and server
applications use superscalar processors; a few examples are the Pentium, PowerPC,
UltraSPARC, AMD K5, HP PA7100 and DEC processors. The basic concept in superscalar
processor architecture is to fetch a few instructions at a time and execute them in parallel,
taking advantage of the higher-bandwidth memories that have become available with
advances in technology. A CISC or a RISC scalar processor can be enhanced with a
superscalar or vector architecture. Scalar processors are the ones that execute one instruction
per cycle: one instruction is issued per cycle, and only one instruction is expected to be
completed by the pipeline per cycle. Pipelining is an implementation method for increasing
the throughput of a processor. As this technique is implemented below the dynamic/static
interface, it does not need any special effort from the user, so speed-up can be attained for
existing sequential programs without any modification of the software. This method of
enhancing performance while maintaining code compatibility is very attractive; indeed, this
approach is one of the main reasons for Intel's present dominance of the microprocessor
market, since the pipelined i486 microprocessor was code compatible with the previous
generation of non-pipelined Intel microprocessors. Although pipelining has been established
as an extremely effective micro-architecture technique, scalar pipelines have a number of
limitations. As there is a never-ending push for better performance, these limitations must be
countered in order to keep providing further speed-up for existing programs. The solution is
superscalar pipelines, which are able to achieve performance levels higher than scalar
pipelines.
9.1.1 Limitations of scalar pipelines
Scalar pipelines are characterized by a single instruction pipeline of k stages. All instructions,
irrespective of type, travel through the same set of pipeline stages. At most one instruction
can be resident in each pipeline stage at any time, and the instructions advance through the
pipeline stages in lock-step fashion. Apart from the pipeline stages that are stalled, every
instruction remains in each pipeline stage for one cycle and moves to the next stage in the
following cycle. Such rigid scalar pipelines have three main limitations, listed below and
elaborated further:
 The maximum throughput of a scalar pipeline is bounded by one instruction per
cycle.
 The amalgamation of all instruction types into a single pipeline can result in an
inefficient design.
 The stalling of a lock-step, rigid scalar pipeline induces unnecessary pipeline
bubbles.
9.1.2 What is Superscalar?
A superscalar machine is designed to improve on the scalar processor by increasing the
execution rate of instructions. In a superscalar processor, multiple instructions that are
independent of one another are executed in parallel, unlike in a scalar processor, where an
instruction is executed only after the previous instruction has finished its cycle. A superscalar
processor uses more than one independent instruction pipeline, each with multiple stages;
therefore, multiple streams of instructions can be processed at a time, achieving a new level
of parallelism. A superscalar architecture is able to execute independent instructions in
different pipelines and hence improves overall performance. Instructions that can commonly
be executed independently and in parallel are load/store, arithmetic and conditional branch
instructions, as illustrated below.

Figure 9.1 General Superscalar processor organization

Figure 9.1 shows the organization of a general superscalar processor in which 5 instructions are executed in parallel, where each instruction has 4 pipeline stages. At the same time all 5 instructions, i.e. 2 integer operations, 1 memory operation (Load/Store) and 2 floating point operations, will be executed.

Figure 9.2 Superscalar approach
The superscalar approach is shown in Figure 9.2: a sequential stream of instructions of a program is fetched from memory, then multiple instructions are decoded and issued in parallel to multiple functional units (Execution Units), and the results are written to a register file. The superscalar central processing unit design focuses on improving the accuracy of instruction dispatch and on keeping the multiple EUs working all the time. This has become necessary as the number of units has increased considerably. While earlier superscalar central processing units would have two arithmetic logic units and a single FPU, a modern design such as the PowerPC 970 includes four arithmetic logic units, two floating point units, and two single instruction multiple data units. But, if the dispatcher

is ineffective at keeping all of these units supplied with instructions, the system's performance will be no better than that of a much simpler and cheaper design. A superscalar processor usually sustains an execution rate in excess of one instruction per machine cycle. But merely processing multiple instructions simultaneously does not make a processor superscalar, since a pipelined processor also does that, in separate pipeline stages.
In a superscalar central processing unit, the dispatcher reads instructions from memory and decides which ones can be executed in parallel, dispatching each instruction to one of the several EUs contained in a single central processing unit. Therefore, a superscalar processor can be said to have several parallel pipelines, each processing instructions simultaneously from a single instruction thread.
9.2 Superscalar execution
The execution of multiple instructions in a superscalar processor is compared with instruction execution in sequential, pipelined and super-pipelined processors in Figure 9.3. It shows how an instruction is executed in the different processors. In the case of a sequential processor, the execution of instructions takes place one after another, and each execution proceeds through a sequence of steps, i.e. fetch, decode, then execution by the functional unit, and finally the result is written to memory; to perform these actions a sequential processor takes at least 4 cycles per instruction, i.e. Cycles Per Instruction (CPI) = 4. In modern processors this is not the scenario; for example, in pipelined processors the CPI is reduced to 1 because the execution steps discussed for sequential processors are overlapped like an assembly line.
In pipelined processors, when one instruction is being executed by the functional unit, the next instruction is being decoded and the one after that is being fetched; thus in every clock cycle one instruction is completed. The pipeline stages are essentially combinational logic circuits, possibly involving register/cache memory accesses, with each pipeline stage separated by a latch; a clock signal common to all latches synchronizes the transfer of data between stages, as shown in Figure 9.4.
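To make these CPI figures concrete, a rough back-of-the-envelope sketch in C is given below (illustrative only; the instruction count, pipeline depth and issue width are assumed values, and hazards and memory stalls are ignored).

#include <stdio.h>

int main(void)
{
    long n = 1000;   /* assumed number of instructions          */
    int  k = 4;      /* assumed pipeline depth (fetch..write)   */
    int  w = 2;      /* assumed superscalar issue width         */

    long sequential  = 4 * n;            /* CPI = 4, as in the text          */
    long pipelined   = k + (n - 1);      /* fill the pipeline, then 1/cycle  */
    long superscalar = k + (n / w) - 1;  /* ideal case: w completions/cycle  */

    printf("sequential : %ld cycles\n", sequential);
    printf("pipelined  : %ld cycles\n", pipelined);
    printf("superscalar: %ld cycles (ideal, no hazards)\n", superscalar);
    return 0;
}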

Figure 9.3 Instruction execution in a sequential, pipelined, super-pipelined and superscalar processor
Each stage, as shown in Figure 9.4, may vary in length based on the type of instruction, so the processor's overall speed is limited by the longest stage. In super-pipelined processors, deeper pipelines are used: since the duration of a pipeline stage depends on its length, the longer stages are subdivided into smaller ones, resulting in a super-pipelined processor with a larger number of shorter stages, so that a higher clock speed can be achieved, i.e. more instructions are executed in less time and the effective CPI is lower.

Figure 9.4 Micro architecture of a pipelined processor

In superscalar processors the execution of multiple instructions in parallel is carried out as shown in Figure 9.3. With multiple pipelined instructions, each execute stage involves functional blocks which may be different and which perform their own tasks; taking this fact into consideration, multiple instructions can be executed in parallel where possible, thus achieving very high speed at the processor level.

Figure 9.5 Concept of superscalar execution

Figure 9.5 above illustrates the parallel execution method which is used in most superscalar processors. The instruction fetching process, along with branch prediction, is used to form a stream of dynamic instructions. These dynamic instructions are checked for various dependences, and artificial dependences are removed. Then the instructions are dispatched into the execution window. The instructions in the execution window are no longer held in sequential order, but are partially ordered by their true data dependences. Instructions are issued from the window in an order decided by the data dependences and the availability of hardware resources. Finally, after execution, instructions are brought back into program order as they retire sequentially, and their respective outcomes update the architected state of the process.
9.3 Design Issues
Since in superscalar processors multiple pipelined instructions which are independent of one another are executed in parallel, it is necessary to understand how this can be achieved; this is discussed under the design issues of superscalar processors. The design issues include what policies are used for issuing multiple instructions, how registers are used for multiple instructions, how parallelism at machine level is achieved, and how branch instructions are treated during multiple instruction execution. The superscalar approach is equally applicable to CISC and RISC, though it is more straightforward in the case of RISC machines. All the common instructions can be executed independently and in parallel, and usually the execution order is assisted by the compiler. The specific tasks of the superscalar processor covered here are shown in Figure 9.6.

Figure 9.6 Superscalar processor design tasks
The multiple instructions to be executed in parallel must be independent; therefore it is required to check the dependences among the instructions. There are three types of dependence, i.e. data dependence, control dependence and resource dependence. A data dependence occurs if data produced or modified by one instruction is used or modified by another instruction executing in parallel. A control dependence occurs when the control flow between segments cannot be identified before run time, so that the data dependence between the segments is variable. A resource dependence occurs when there are not sufficient processing resources (e.g. functional units); even if several instructions are independent, they then cannot be executed in parallel.

Data Dependence: It is further divided into four types as given below


 Flow dependence: Instruction 1 (I1) precedes Instruction 2 (I2), and at least one
output of I1 is input to I2.

Example:

I1: ADD r2 r1 r2  r2 + r1
I2: MUL r3 r2 r3  r3 * r2
I1 and I2 cannot be executed in parallel because the output of I1, i.e. r2, is used as an input of I2; therefore r2 must be read only after its value is updated by instruction I1. This means the read of r2 in I2 must occur only after the write of r2 in I1, hence I1 and I2 cannot be executed simultaneously. This dependence is also called a Read-After-Write (RAW) dependence. It is a true dependence which cannot be eliminated, because I2 is RAW dependent on I1.
 Anti-dependence: I1 precedes I2, and the output of I2 overlaps the input to I1.

Example:

I1: ADD r2 r3 r2  r2 + r3
I2: MUL r3 r1 r3  r3 * r1
Here I1's input and I2's output use the same register r3 for reading and writing respectively. This is a false dependence which can be eliminated through register renaming: the output register r3 is renamed by the compiler or the processor to some register other than r1, r2 and r3. This dependence is also called a Write-After-Read (WAR) dependence.
 Output dependence: I1 and I2 write to the same output variable.

Example:

I1: ADD r2 r3 r4 r2  r3 + r4
I2: MUL r2 r1 r5 r2  r1 * r5
Here I1 and I2 write to the same output register r2, making I1 and I2 output dependent and hence preventing parallel execution. This is also a false dependence, as it can be avoided by the compiler or the processor renaming the output register r2. This dependence is also called a Write-After-Write (WAW) dependence.
 I/O dependence: The same variable and/or the same file is referenced by two I/O
statements i.e. (read/write).
Control Dependence: When conditional instructions, branch instructions and loops are present in segments of a program, control dependence occurs, which prevents the control-dependent instructions from executing in parallel with other independent instructions.
Control-independent
Example:
for (j=0; j<n; j++)
{
b[j]=c[j];
if(b[j]<0)
b[j]=1;
}
Control-dependent
Example:
for(j=1; j<n; j++)
{
if(b[j-1]<0)
b[j]=1;
}
Compiler techniques, such as the if-conversion sketched below, are needed to get around control dependence limitations.
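A minimal sketch of one such technique, if-conversion, applied to the control-dependent loop above (illustrative C only; the function name is an assumption, not from the original text):

void clamp_following(int *b, int n)
{
    for (int j = 1; j < n; j++) {
        /* original, control-dependent form:
               if (b[j-1] < 0) b[j] = 1;                          */
        /* if-converted, branch-free form: the assignment becomes
           data dependent on the comparison and can be scheduled
           together with surrounding independent instructions.    */
        b[j] = (b[j - 1] < 0) ? 1 : b[j];
    }
}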
Resource Dependence: Data and control dependences are determined by the nature of the work to be done, whereas resource dependence is concerned with conflicts in the use of shared resources, such as registers, integer and floating point ALUs, etc. ALU conflicts are called ALU dependence. Memory (storage) conflicts are called storage dependence.
9.3.1 Parallel Decoding
The decoding in a scalar and a superscalar processor is shown in Figure 9.8. In a scalar processor, one instruction at a time is sent from the instruction buffer to the decode/issue unit, whereas in a superscalar processor more than one instruction is sent from the instruction buffer to the decode/issue unit. For example, if the superscalar processor is a 3-way issue processor, then 3 instructions will be sent to the decode/issue unit. The scalar processor takes one pipeline cycle for decode/issue, whereas the superscalar processor takes more than one pipeline cycle; to speed up this process, the principle of pre-decoding is introduced in superscalar processors, as shown in Figure 9.7.

Figure 9.7 Principle of pre-decoding

The concept of pre-decoding is that a part of the decoding is done at the loading phase, i.e. when the instructions are loaded from the second-level cache or memory into the Instruction Cache (I-Cache) by the pre-decode unit. As a result of this pre-decoding, the class of the instruction and the type of resources required for its execution are already known at the I-Cache level. In some processors, for example the UltraSparc, branch target address calculation is also done by the pre-decode unit. The pre-decoded instructions from the pre-decoder are saved in the I-Cache with some extra bits appended, as shown in Figure 9.7 (e.g. from 128 bits to 148 bits); these extra bits carry the pre-decoded information about the instruction. Pre-decoding results in a reduced cycle time for the superscalar decode/issue unit.

Figure 9.8 Decoding in (a) Scalar processor (b) 3-way Superscalar processor

9.3.2 Instruction Issue policies


Superscalar instruction issue concerns how and when to send instructions to the Execution Units. The key terms in instruction issue are the instruction issue policy and the instruction issue rate. The issue policy specifies how dependences are handled during the instruction issue process for parallel processing. The issue rate specifies the maximum number of instructions the superscalar processor can issue in each cycle. Instruction issue policies address the instruction fetch order, the instruction execution order, and the order in which instructions change memory and registers. Based on issue order and completion order, issue policies can be categorized as below.

In-Order Issue and In-Order Completion:


In this policy the instructions are issued, executed and completed in their order of occurrence. This is not a very efficient policy, because instructions must stall if there is any dependency.
Example:

It can be understood clearly from the example that instructions I1 to I6 are completed in the same order in which they are issued.
In-Order Issue and Out-of-Order Completion:
In this policy the instructions are issued in their order of occurrence, but their completion is not in the same order.
Example:

In the above example it can be observed that the instructions do not complete in the same order in which they left the decode unit; the possible data dependences must be checked in this case.
Out-of-Order Issue and Out-of-Order Completion:
In this policy neither the instruction issue nor the completion is in the order of occurrence. The possible data dependences must be checked in this case.
Example:

Issue policies for the different cases are shown in Figure 9.9. In the case of false data dependences and unresolved control dependences, the design options are either not to issue those instructions until the dependences are resolved, or to issue them by applying some technique. In the case of a true data dependence, the instruction is not issued until the dependence is resolved, whereas in the case of a false data dependence the dependence is avoided during instruction issue by using register renaming. In the case of conditional branch instructions, the processor either waits for the resolution of the condition or uses speculative branch processing to issue the instruction in parallel with other, independent instructions.

Figure 9.9 Instruction issue policies in a superscalar processor

In the case of issue blockages, the shelving concept can reduce blockages drastically; the final aspect in Figure 9.9 addresses how to handle instruction issue blockages.
9.3.2.1 Register renaming
Register renaming is the technique used to avoid conflicts among multiple instructions in execution that use the same registers. In this technique the processor uses multiple sets of registers in place of just one set. This avoids unnecessary pipeline stalls by allowing different instructions to execute simultaneously. In general, the CPU uses the data stored in registers for various computations. When the processor has a larger number of registers, many program variables can be held in them, which leads to a reduction in the number of memory accesses and hence in latency. There are often about 128 or more registers in a RISC processor chip. If the contents of a register do not reflect the program's correct order, anti-dependences and output dependences may occur, which may result in pipeline stalls. To cope with false dependences during the issue of instructions, the register renaming policy avoids the dependence by renaming the common registers of the instructions to be issued simultaneously. False dependences occur either between instructions which are in execution and instructions that are about to be issued, or among the instructions which are being issued. WAR and WAW dependences occur between register references, and they can be avoided by using register renaming. In current processors based on the x86 instruction set, false dependences among closely-grouped independent operations are eliminated by substituting physical registers dynamically. More physical registers are required internally for register renaming than the logical registers which are actually visible to the programmer.

There are two ways of implementing the register renaming technique, static and dynamic. In the static implementation, register renaming is performed by the compiler; this is used in pipelined processors. In the dynamic implementation, renaming is done by the processor during the execution of the instructions; this is used in superscalar processors and requires additional registers and logic. The renaming can be partial or full: in the partial case only certain types of instructions are dealt with by register renaming, and in full renaming all eligible instructions are dealt with by register renaming. A sketch of a dynamic renaming mechanism is given below.
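A minimal sketch of dynamic renaming with a map table and a free list (illustrative only; the register counts, names and the omission of a mechanism for freeing registers at retirement are simplifying assumptions, not a description of any particular processor):

#include <stdio.h>

#define NUM_ARCH 32
#define NUM_PHYS 64

static int map[NUM_ARCH];        /* architectural -> physical register */
static int free_list[NUM_PHYS];  /* stack of free physical registers   */
static int free_top;

static void init_rename(void)
{
    for (int a = 0; a < NUM_ARCH; a++)
        map[a] = a;                       /* identity mapping at reset     */
    free_top = 0;
    for (int p = NUM_PHYS - 1; p >= NUM_ARCH; p--)
        free_list[free_top++] = p;        /* remaining registers are free  */
}

/* Rename one instruction "dest <- src1 op src2": the sources read the
   current mapping, the destination gets a fresh physical register, which
   removes WAR and WAW dependences on dest. */
static void rename(int dest, int src1, int src2)
{
    int ps1 = map[src1], ps2 = map[src2];
    int pd  = free_list[--free_top];      /* allocate (no overflow check)  */
    map[dest] = pd;
    printf("p%d <- p%d op p%d\n", pd, ps1, ps2);
}

int main(void)
{
    init_rename();
    rename(3, 1, 2);   /* r3 <- r1 op r2                                    */
    rename(3, 4, 5);   /* r3 <- r4 op r5 : WAW on r3 removed by new p-reg   */
    return 0;
}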

Figure 9.10 Implementation of renaming buffers (the figure classifies rename buffers by the basic approach of implementation: merged architectural and rename register files, used e.g. by Power1 (1990), Power2 (1993), ES/9000 (1992), Nx586 (1994), PM1 (Sparc64, 1995), R10000 (1996) and Alpha 21264 (1998); stand-alone rename register files, used e.g. by PowerPC 603 (1993), PowerPC 604 (1995) and PowerPC 620 (1996); and holding renamed values in the ROB, used e.g. by Pentium Pro (1995), Am29000 sup (1995), K5 (1995) and M1 (1995). The approaches also differ in the method of fetching operands and the method of updating the program state.)

Figure 9.10 illustrates the implementation of register renaming in three different ways, i.e.
the register renaming in three different ways i.e.
 Merged architectural and rename register files
 Standalone register file
 Holding renamed values in the Reorder Buffer(ROB)
In the merged architectural and rename register file approach, the same physical register file is used for both architectural registers and rename buffers. Available physical registers are dynamically assigned as rename registers or architectural ones. In this approach separate merged register files are used for fixed point (FX: integer) and floating point (FP) data. Examples of processors which use this approach are the IBM ES/9000 mainframe family, the Sparc64, the Power line of processors and the R10000. In the stand-alone rename register file approach, exclusive rename register files are used; examples of this line are the PowerPC processors, i.e. the PowerPC 603 to the PowerPC 620. In the last approach, i.e. renaming with ROBs, sequential consistency of instruction execution is preserved in addition to renaming.

Example of Register Renaming:
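(The original worked example is a figure that is not reproduced here; the following sequence, written in the register-transfer style used earlier and renaming F6 and F8 to the rename registers S and T, is a representative reconstruction consistent with the discussion below.)

Before renaming:
I1: DIV F0, F2, F4     F0 ← F2 / F4
I2: ADD F6, F0, F8     F6 ← F0 + F8    (RAW on F0 with I1)
I3: ST  F6, 0(R1)      store F6        (RAW on F6 with I2)
I4: SUB F8, F10, F14   F8 ← F10 - F14  (WAR on F8 with I2)
I5: MUL F6, F10, F8    F6 ← F10 * F8   (WAW on F6 with I2, RAW on F8 with I4)

After renaming F6 to S and F8 to T:
I1: DIV F0, F2, F4
I2: ADD S, F0, F8
I3: ST  S, 0(R1)
I4: SUB T, F10, F14
I5: MUL F6, F10, T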

From the above example it can be observed how the false dependences are eliminated by renaming registers F6 and F8 to S and T. After register renaming only the RAW dependences remain, and these must be respected during scheduling for parallel execution.

9.3.3 Execution and issue of operations

It is in this phase that an execution tuple (an opcode together with register names or storage locations; physical registers are denoted in upper case, e.g. R1, R2, R3 and R4) is formed. As the execution tuples are made and buffered, the next step is to decide which tuples can be sent for execution. Instruction issue can be defined as the run-time checking for the availability of resources and data. It is the part of the processing pipeline that is at the centre of many superscalar executions; this is the part that holds the execution window. In the ideal case, an instruction is executed as soon as its input operands are available. But other limitations may be placed on instruction issue, most importantly the availability of physical resources such as EUs, register file ports and the interconnect. An example of a possible parallel execution mechanism is given below.

This schedule assumes that the hardware resources consist of two integer units, one branch unit and one path to memory. The horizontal direction shows the operations executed in a time step and the vertical direction shows the time steps. In this schedule it is assumed that the conditional branch ble has been predicted, and instructions from the predicted path are being executed speculatively. Only the renamed values for r3 are shown; in a real implementation, the other registers would be renamed too. Each of the values assigned to r3 is bound to a separate physical register. The following paragraphs give details of a number of ways of organizing the instruction issue buffers, in order of increasing complexity. Some of the basic organizations are illustrated.

Single queue method: In the case of just one queue and no out-of-order issuing, register renaming is unnecessary, and operand availability can be managed via simple reservation bits assigned to each register, as shown in Figure 9.11. A register is reserved when an instruction that updates the register issues, and the reservation is cleared when that instruction completes.

Figure 9.11 Single queue method


Multiple queue method: With multiple queues, instructions are issued from each queue in order, but the queues may not issue in order with respect to each other, as shown in Figure 9.12.
floating point instruction queue (IQ), a load store IQ and an integer IQ. Now, renaming is
used in a restricted form. Like, only the registers which are loaded from memory can be
renamed. This facilitates the load store IQ to “slip” ahead of the other IQs, fetching memory
data in advance even before it is needed.

Figure 9.12 Multiple queue method


Reservation stations: With reservation stations, the instructions need not issue in order; there is no strict first-in first-out ordering, as shown in Figure 9.13. The reservation stations supervise the availability of the source operands. The conventional way of doing this is to store the operand data in the reservation station: any already available operand values are read from the register file and placed in the reservation station when an instruction is dispatched to it. The operand designators of data which is not yet available are then compared with the result designators of completing instructions by the reservation station logic, and whenever there is a match the resulting value is pulled into the matching reservation station. The instruction may issue when all of its operands are ready in the reservation station. Reservation stations may be divided according to instruction type or combined into one large block. More recent reservation station implementations hold pointers to where the data can be found and do not store the actual data. A sketch of one reservation-station entry is given below.
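A minimal sketch of a single reservation-station entry and the tag-match ("capture") step described above (field names and widths are illustrative assumptions, not those of any particular processor):

#include <stdbool.h>
#include <stdint.h>

struct rs_entry {
    bool     busy;        /* entry holds a waiting instruction             */
    int      opcode;      /* operation to perform                          */
    bool     src1_ready;  /* operand 1 value present?                      */
    bool     src2_ready;  /* operand 2 value present?                      */
    int32_t  src1_value;  /* value, valid only when src1_ready             */
    int32_t  src2_value;
    int      src1_tag;    /* producer's tag, compared against completing
                             instructions' result tags when not ready      */
    int      src2_tag;
    int      dest_tag;    /* tag under which this result will broadcast    */
};

/* On a result broadcast (tag, value), every waiting entry compares its
   missing-operand tags and captures the value on a match; the instruction
   may issue once both operands are ready. */
static void capture(struct rs_entry *e, int tag, int32_t value)
{
    if (e->busy && !e->src1_ready && e->src1_tag == tag) {
        e->src1_value = value; e->src1_ready = true;
    }
    if (e->busy && !e->src2_ready && e->src2_tag == tag) {
        e->src2_value = value; e->src2_ready = true;
    }
}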

Figure 9.13 Reservation stations
9.4 Branch Prediction
Branch instructions modify the value of the Program Counter (PC), conditionally or unconditionally, and transfer the flow of control of the program. The major types of branches are shown in Figure 9.14. Unconditional branches are always taken, whereas a conditional branch is taken or not depending on whether the condition is met (true) or not (false).

Figure 9.14 Major types of branches

In simple unconditional branches the return address is not saved, but in branches to sub-routines the return address is saved by saving the PC, and the return from the sub-routine then transfers control to the saved return address. Loop-closing conditional branches, which are also called backward branches, are taken in all iterations except the last one. The processor performance depends on the schemes used for branch prediction. Prediction is mainly of two types, i.e. fixed prediction and true prediction.
Fixed Prediction: The guess here is one outcome out of taken or not-taken, the scheme
follows the tasks given below

 On detecting an unresolved conditional branch, guess "not-taken".
 Proceed along the sequential execution path, but be prepared for a wrong guess by also starting preparation of the "taken" path in parallel, i.e. calculating the Branch Target Address (BTA).
 Check the guess when the condition status becomes available.
 If the guess is correct, continue along the sequential execution path and discard the pre-processed BTA information.
 If the guess is wrong, discard the sequential execution information and continue with the pre-processed "taken" path.
The above steps describe the "always not taken" approach; if branches are in fact taken under this approach, the penalty for taken branches (TP: taken penalty) is higher than the not-taken penalty (NTP). In the opposite approach all the steps are the same except that processing starts with the "taken" guess, so the guessed and sequential paths are interchanged; in this case TP is usually less than NTP. Implementation of the not-taken approach is easier than that of the taken approach. The not-taken approach is used in pipelined microprocessors such as the SuperSPARC, Power1, Power2 and Alpha-series processors. The always-taken approach is used in the MC 68040 processor.
True Prediction: The guess here has two possible outcomes, taken and not-taken, and can further be categorized as static or dynamic, based on whether it uses the code itself or the execution history of the code. If the prediction is based simply on the code then it is a static prediction, and if the prediction is based on the history of the code's execution then it is a dynamic prediction. Static prediction is classified as op-code based, displacement based or compiler-directed. In op-code based prediction, the branch is always assumed taken for certain types of op-codes and always not-taken for others. In the displacement-based approach a displacement parameter (D) is defined and predictions are made based on the sign of D: for example, if D ≥ 0 the prediction is not-taken and if D < 0 the guess is taken. In compiler-directed prediction the guess is based on a hint given by the compiler; the compiler gives the hint according to the kind of construct being compiled, by setting or clearing a bit in the encoding of the conditional branch instruction. In dynamic prediction, the prediction is made based on the history of the branch. The basic assumption is that if a branch was taken in its recent occurrences, it will be taken in its next occurrence. The performance of dynamic prediction techniques is higher than that of static techniques, but dynamic prediction is more complex to implement and hence more costly. The history of branch instructions can be expressed in two different ways, i.e. the explicit dynamic technique and the implicit dynamic technique. In the first case, history bits explicitly record the history of the branches, and in the latter case the target access path of the predicted branch instruction is implied by the presence of an entry. The explicit technique is explained in detail below.

As shown in Figure 9.15, static prediction is made based on a particular attribute of the object code, such as the op-code, the branch displacement, or a compiler-directed hint bit.

1-bit dynamic prediction: In this approach one bit is used to record whether the branch was taken or not taken in its last occurrence. The two-state taken/not-taken history of the branches is described by the state transition diagram shown in Figure 9.16. From the state diagram it can be observed that the history of the branch is updated after the branch is evaluated. In this scheme the prediction is simply the outcome of the branch's last occurrence.

Figure 9.16 One-bit dynamic prediction state transition diagram

2-bit dynamic prediction: In this approach two bits are used to record whether the branch was taken or not taken in its recent occurrences. Its operation is a 4-state Finite State Machine (FSM), whose state transition diagram is given in Figure 9.17. The four possible states for the two-bit scheme are
00- Strongly not Taken
01 - Weakly not Taken
10 - Weakly Taken
11 - Strongly Taken

Usually the initial state in this approach is “strongly taken”, and as per the actual state of the
counter the prediction is made.
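As a concrete illustration, the four-state counter can be implemented as sketched below (a minimal sketch only; the table size, the indexing and the zero initial state are assumptions rather than the design of any particular processor, which, as noted, often starts in the strongly-taken state):

#include <stdint.h>
#include <stdbool.h>

#define TABLE_SIZE 1024                 /* assumed number of 2-bit counters */

/* 0,1 = (strongly/weakly) not taken; 2,3 = (weakly/strongly) taken.
   Static initialization leaves all counters at 0 (strongly not taken). */
static uint8_t counters[TABLE_SIZE];

static bool predict(uint32_t pc)
{
    return counters[(pc >> 2) % TABLE_SIZE] >= 2;   /* 10 or 11 -> taken */
}

/* After the branch resolves, move one step toward the actual outcome,
   saturating at 0 (strongly not taken) and 3 (strongly taken). */
static void update(uint32_t pc, bool taken)
{
    uint8_t *c = &counters[(pc >> 2) % TABLE_SIZE];
    if (taken  && *c < 3) (*c)++;
    if (!taken && *c > 0) (*c)--;
}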

3-bit dynamic prediction: In this approach the outcomes of the branch instruction for its last three occurrences are stored, as shown in Figure 9.18, and the prediction is made according to the majority of those outcomes. For example, if the branch was taken in two of its last three occurrences, then the prediction is taken; the entry in the table is then updated with the actual outcome in a FIFO manner. The implementation of 3-bit prediction is simpler than that of 2-bit prediction. The 3-bit prediction scheme is implemented in the PA 8000 processor.

Figure 9.18 Principle of 3-bit prediction

9.4.1 Speculative Execution
Speculation means an intelligent guess. When a program contains conditional branch instructions, executing them in parallel with other instructions requires a speculation about the outcome of the condition before that outcome is actually available, in order to utilize the time that would otherwise be spent waiting for the condition to be resolved. Speculation is therefore the execution of predicted instructions, and the prediction can be right or wrong. Instructions are fetched and executed even though it may not be known immediately whether they lie on the final execution path. Speculation allows an independent instruction to issue on a branch predicted to be taken, without any consequences (including exceptions) if the branch is not actually taken. It is often combined with dynamic scheduling.

Figure 9.19 Speculative branching
Figure 9.19 shows how speculative branching handles unresolved conditional branch instructions for parallel processing. Based on the branch prediction, execution of the speculated path continues, while the branch address from which the execution sequence changed is saved. If, after the condition is resolved, the speculation turns out to be correct, execution simply continues; otherwise the speculatively executed path is discarded and the sequential path is followed. The extent of speculativeness can be discussed at two levels, as given below.

The extent of speculativeness has two aspects. The level of speculativeness describes how many conditional branch instructions in succession can be executed speculatively. The degree of speculativeness describes, following a predicted conditional branch, how far the instructions other than conditional branches are processed speculatively: only fetched; fetched and decoded; fetched, decoded and dispatched; or fetched, decoded, dispatched and executed but not completed.

9.5 Memory Disambiguation

Static or dynamic scheduling is used in superscalar processors to achieve instruction level parallelism, but the ambiguity of dependences between memory instructions severely restricts the reordering of code for parallel execution, as can be observed from the example given below.
Example:

Memory disambiguation is therefore defined as the determination of aliasing between two memory references, i.e. stating whether two memory references are dependent or not. Store operations, which involve the external memory and the data caches, are performed in order so as to preserve the sequential order; this eliminates output and anti-dependences on memory locations. However, load operations can be issued out of order with respect to store operations, provided that such out-of-order loads are checked for data dependences against pending and previous store operations. While detecting memory dependences, the points to consider are given below.

 The effective addresses of both memory references must be computed.
 The effective addresses of the memory references may be affected by other instructions and by run-time data.
 Wide comparators are required for the address comparison.
Example:
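(The original example is a figure that is not reproduced here; the following pair of memory references, written in the register-transfer style used earlier and chosen purely for illustration, shows the ambiguity.)

I1: ST R1, 0(R2)    M[R2 + 0] ← R1
I2: LD R3, 4(R4)    R3 ← M[R4 + 4]

Whether I2 may be reordered ahead of I1 depends on the run-time values of R2 and R4: the two effective addresses must be computed and compared before the load can safely bypass the store.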

Various ways of performing memory disambiguation are explained below.


Total Order of Load and Store Instructions: In this approach all loads and stores are kept totally in order with respect to each other. With respect to other types of instructions, however, loads and stores can be executed out of order; consequently store instructions are held until all previous instructions have completed, and load instructions are held for previous store instructions. This approach also prevents stores on a wrongly predicted branch path, because all previous branch instructions are resolved before a store is performed.

Load Bypassing: If there is no aliasing, then load instructions can bypass store instructions. Separate address-generation and reservation stations are employed for loads and stores. The addresses of store instructions need to be computed first, so that the dependences of load instructions can be checked before the loads are issued. If the address of a store instruction cannot yet be determined, the dependences of the loads cannot be checked, and all subsequent load instructions must be held until the store address is valid. Store instructions are kept in the Reorder Buffer (ROB) until the previous instructions have completed. With load bypassing, loads execute out of order with respect to stores, which improves performance.
Example:

Load Forwarding: In this approach, the data that is about to be stored by a store instruction is forwarded directly to a subsequent load from the same address, so the load does not have to access memory at all.
Example:
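(The original example is a figure that is not reproduced here; the following pair, again purely illustrative, shows the idea.)

I1: ST R1, 0(R2)    M[R2 + 0] ← R1
I2: LD R3, 0(R2)    R3 ← M[R2 + 0]

Since both references use the same effective address, the value of R1 waiting in the store buffer can be forwarded directly to R3, and I2 need not access the memory.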

9.6 Dynamic Instruction Scheduling
Arranging or scheduling two or more instructions for parallel execution is called instruction scheduling, and if this job is done by the processor it is called dynamic instruction scheduling, also called Out-Of-Order (OOO) execution. In dynamically scheduled processors (superscalar processors) the instructions are executed out of order; this is the main advantage compared with statically scheduled processors (VLIW processors), in which instructions are executed in order. With in-order execution an instruction must wait until its operands are available and/or its dependences are resolved, thereby blocking all subsequent instructions from execution. This problem is overcome by dynamic scheduling, where eligible independent instructions can be scheduled for parallel execution without regard to program order. In dynamic scheduling, therefore, the only wait is for an instruction's input operands, and higher performance is achieved compared with statically scheduled processors.

Advantages of Dynamic Scheduling:


 Dynamic scheduling handles cases in which instructions involve memory references whose dependences are not known at compile time.
 As scheduling is done at the processor level, the compiler's job is simplified: no knowledge of the micro-architecture is needed by the compiler.
 Code compilation is independent of the pipeline, so code can be compiled with the same efficiency for different pipelines.
 Dynamic scheduling uses hardware speculation techniques for branch instructions, which increases performance significantly.
Disadvantages of Dynamic Scheduling:
 It results in more hardware complexity.
 It may create data dependency hazards like WAW and WAR.
 Dynamic scheduling complicates exception handling.

9.7 Multithreading
From the programmer's point of view a program can be defined as a set of ordered instructions, while from the OS point of view it is an executable file, which is termed a process rather than a program. The smaller chunks of code within a process are called threads, and in a process there may be many threads which share the same resources of the process. The instructions of a program can be divided into smaller threads, and parallelism can be achieved by executing these fine-grained threads simultaneously. Simultaneous Multithreading (SMT) is a method allowing numerous independent threads to issue instructions to a superscalar's several functional units in one single cycle; it issues multiple instructions from multiple threads in each cycle. The aim of SMT is to considerably increase processor utilization in the presence of both long memory latencies and limited available parallelism within one thread. It fully exploits both thread-level parallelism and ILP, and performs well both with parallelizable programs and with single-threaded programs. This type of multithreading is shown in Figure 9.20, which uses five utilized threads and one non-utilized thread.
Fine grain multithreading in Superscalar (FGMS) - Only one thread issues instructions in
one cycle, but it can utilize the whole issue width of the processor. This leads to hiding of
vertical wastes and showing of horizontal wastes. It is the only model that does not feature
simultaneous multithreading.

Full simultaneous issue- This is a simultaneous multithreaded superscalar which is highly


flexible. All the eight threads strive for each of the issue slots. This is the least realistic model
when it comes to hardware complexity.

Single issue and dual issue - These models limit the number of instructions each thread can issue into the scheduling window in one cycle. The single-issue type issues one instruction per thread per cycle and the dual-issue type issues two instructions per thread per cycle.

Figure 9.20 Multithreading in a superscalar processor (issue slots versus time, in processor cycles, for a plain superscalar, fine-grain multithreading (FGMT) and SMT)
The features of the SMT are given as below
 Instruction level parallelism and Thread level parallelism is exploited to the full extent
 Results in better performance due to
o Independent programs mix
o Parallelizable programs
o Program with single thread
SMT is built on top of out-of-order superscalar processor architectures, such as that of the MIPS R10000.
9.8 Example of Superscalar Architecture
The Pentium 4, introduced by Intel in November 2000, is a superscalar processor with a CISC architecture, as shown in Figure 9.21. The P4 processor has a clock speed that now surpasses 2 GHz, in contrast to the 1 GHz of the Pentium 3. Even though the concept of superscalar design is associated with the reduced instruction set computing architecture, the superscalar principle can be applied to complex instruction set computing machines too. The Pentium 4 processor implements a pipeline with 20 stages and has two separate EUs for integer and floating point operations. The operation of the Pentium 4 can be given as:
 The instructions are fetched from memory by the processor in the order of the static program.
 Every instruction is translated into multiple fixed-length reduced instruction set computing instructions (micro-operations).

Figure 9.21 Pentium 4 superscalar CISC machine

The pipeline stages for instruction execution are shown below.

Figure 9.22 Alternate view of Pentium 4 architecture
Pipeline stages 1 & 2 (Generation of micro-ops): The Branch Target Buffer and the Instruction Translation Lookaside Buffer are used for fetching instructions, which are accessed from the L2 cache 64 bytes at a time, as shown in the architecture block diagram. The instruction boundaries are determined, the instructions are decoded into micro-op codes, and the trace cache is used to store this µ-code, as shown in Figure 9.22.
Pipeline stage 3 (Trace cache next instruction pointer): The dynamically gathered history information is saved in the Trace Cache Branch Target Buffer (BTB); if the BTB does not have the target, then the following actions take place:
• if the branch is a return, predict it as taken, and otherwise as not taken (when the branch is not PC-relative);
• for PC-relative conditional branches, predict backward branches as taken and otherwise as not taken.
Pipeline stage 4 (Trace Cache fetch): The micro-ops are arranged in program order into sequences called traces, and these traces are fetched in order according to the branch prediction. Some instructions, such as complex CISC instructions, require many micro-ops; these are coded into the ROM and subsequently fetched from the ROM.
Pipeline stage 5 (Drive): This stage delivers instructions from the Trace Cache to the Rename/Allocator module for reordering.
Pipeline stages 6, 7 & 8 (Allocate: register renaming): Resources are allocated for execution, with 3 micro-ops arriving per clock cycle. If the resources are available, the micro-ops are dispatched in an "out of order" manner. Each micro-op gets an entry in one of the two scheduler queues, depending on whether it is a memory access or not. Micro-ops are retired from the ROB in order.
Pipeline stage 9 (Micro-op queuing): The micro-ops are loaded into one of the two queues under a FIFO policy, one queue for memory operations and the other for non-memory operations.

Pipeline stages 10, 11 & 12 (Micro-op scheduling): When all the operands of a micro-op are ready, the two schedulers retrieve it and, based on the availability of the execution units, micro-ops are dispatched at a rate of up to 6 per clock cycle.
Pipeline stages 13 & 14 (Dispatch): If two micro-ops need the same unit, they are dispatched in order.
Pipeline stages 15 & 16 (Register File): The register file is the source of the operands for pending integer and floating point operations.
Pipeline stage 17 & 18 (Execute Flags): Computation of the flag values.
Pipeline stage 19 (Branch check): Branch prediction results are compared after checking
flag values.
Pipeline stage 20 (Branch check results): If the branch prediction goes wrong then all the
micro-ops which are incorrect are flushed. The branch predictor is provided with the correct
branch destination. From the new target address the pipeline will start again.
Questions

(1) Explain briefly the following terms


a. True data dependency
b. Anti dependency
c. Resource conflict
d. Output dependency
e. False dependency
f. Control dependency
g. RAW dependency
h. WAR dependency
i. WAW dependency
(2) Write a short note on superscalar micro-architecture and its instruction execution.
(3) What are the design issues to be considered in superscalar processors?
(4) Explain the issue policies of superscalar processor.
(5) With examples show how register renaming can avoid dependencies to achieve
parallelism.
(6) Explain different approaches for branch prediction in superscalar processors.
(7) What is Memory Disambiguation? Explain the ways to overcome it.
(8) Explain speculative execution of a branch instruction in a superscalar processor.
(9) What is a thread? How is multithreading done in a superscalar processor?
(10) Explain the micro-architecture and instruction execution of the Pentium 4 superscalar processor.

Summary
This chapter discusses the superscalar approach to achieving parallelism and its limitations in the introductory part. The design issues and the superscalar execution of instructions are explained with examples. The instruction issue policies for parallel execution and the dependences among instructions are also covered with illustrative examples, and it is explained how parallelism in instruction execution can be achieved with register renaming. The crucial topics of superscalar execution, such as branch prediction, memory disambiguation, dynamic instruction scheduling, speculative execution and multithreading, are also covered in a simple, understandable manner. An example of a superscalar architecture, the Pentium 4 processor, has been explained in this chapter along with its micro-architecture and pipeline implementation.
Glossary
Pre-Decoding: A part of decoding is done at the loading phase i.e. when load the instructions
from second level cache or memory to the Instruction Cache (I-Cache) which is called as pre-
decoding.
Instruction Issue policy: It specifies during the instructions issue process for parallel
processing how the dependencies are handled.
Instruction Issue Rate: It specifies in each cycle the maximum number instructions the
superscalar processor can issue.
Register renaming: For multiple execution paths of multiple instructions which uses same
registers, to avoid conflicts among these instruction execution the technique used is called
register renaming.
Branch Prediction: The condition outcome guess of a conditional branch instruction is called
branch prediction.
Fixed Prediction: The conditional branch instruction’s outcome guess is one out of taken or
not-taken.
True Prediction: The guess here has two possible outcomes of taken and not-taken and
further can be categorized as static and dynamic based on the code and execution of the code.
Speculative Execution: Speculation allows an independent instruction to issue on branch
predicted to be taken without any consequences (including exceptions) if branch is not
actually taken.
Memory Disambiguation: It is defined as determination of aliasing between two memory
references or stating whether two memory references are dependent or not.
Dynamic Instruction Scheduling: Arranging or scheduling two or more instructions for
parallel execution is called instruction scheduling and if this job is done by processor then it
is called dynamic instruction scheduling also called as Out-Of-Order (OOO) execution.

Reservation station: A buffer holding instructions waiting to issue; recent implementations hold pointers to where the operand data can be found rather than storing the actual data.
Reorder buffer: It is a type of buffer that makes way for completing instructions only in the
series of program by allowing completion of instruction only if it has completed its execution
and the earlier instructions are completed too.
Process: Program can be defined in programmer’s point of view as set of ordered instructions
and in OS point of view it is executable file which is termed a process instead of program.
Thread: Within a process smaller code chunks are called threads and in a process there may
be many threads which share same resources of process.
Multithreading: The instructions of a program divided into smaller threads and parallelism
can be achieved by simultaneously executing these fine grained threads in parallel called as
multi-threading.
SMT (Simultaneous Multi Threading): It is a method allowing numerous self-regulating
threads to issue commands to a superscalar’s several functional units in one single cycle.

Chapter 10

VLIW and SIMD Processors

Structure
10.0 Objectives
10.1 Introduction

10.1.1 Instruction-level Parallelism (ILP)

10.1.2 Data-level Parallelism

10.2 VLIW Architecture and its instruction format

10.2.1 Static Scheduling of Instructions in VLIW

10.2.2 Pipelining in VLIW Processors

10.2.3 Implementation and advantages of VLIW

10.3 Example of VLIW processor

10.4 Flynn’s Taxonomy

10.5 SIMD Architecture

10.6 Interconnection Networks for SIMD

10.7 Data Parallelism in SIMD

10.8 SIMD Example: MasPar MP1

10.0 Objectives
After studying this chapter one will understand
 What are the VLIW and SIMD architectures?
 How is parallel processing done in VLIW and SIMD architectures?
 How are different parallel processing architectures classified?
 How to implement a long instruction word with multiple instructions.
 How to achieve data parallelism?
 What are different networks for SIMD architectures?
 A case study of VLIW and SIMD architectures.
10.1 Introduction

VLIW architectures are different from the traditional RISC and CISC architectures. It is important to distinguish the instruction-set architecture from the implementation. VLIW processors and superscalar processors share some characteristics, such as having multiple execution units and the potential to perform multiple operations at the same time. However, the methods the two use for achieving high performance are quite different: in the VLIW approach much of the burden is placed on the compiler rather than on the architecture, i.e. the processor architecture is kept simple, whereas the superscalar approach requires much more engineering at the processor architecture level. RISC architectures offer simpler and better-performing execution than CISC architectures; VLIW architectures are simpler still, owing to their hardware simplicity, but they need the help of compilers much more than the others do.

Encoding multiple operations per instruction leads to a very-long-instruction-word architecture, or VLIW. The term refers to processor architectures designed to exploit instruction level parallelism (ILP). Whereas conventional processors mostly allow programs to specify instructions that will be executed in sequence, a VLIW processor allows programs to explicitly specify instructions that will be executed at the same time, i.e. in parallel. The VLIW hardware is not responsible for finding opportunities to carry out multiple operations simultaneously; the explicit encoding in the longer instruction word therefore results in dramatically reduced hardware complexity in comparison with a superscalar implementation of a reduced instruction set computing or complex instruction set computing architecture. This type of processor architecture aims to deliver high performance without the inherent complexity of the other approaches.

When Intel introduced the IA-64 architecture, they also introduced the term EPIC (explicitly parallel instruction computer) for this architectural style. A VLIW processor has an internally parallel architecture, characteristically with various independent functional units, as shown in Figure 10.1. These processors are statically scheduled by the compiler. VLIW has the advantage of offering highly parallel execution that is a lot simpler and cheaper to build than equally parallel reduced instruction set computing or complex instruction set computing chips. It can attain good performance by making use of parallelism at the instruction and data levels.

10.1.1 Instruction-level Parallelism (ILP)

ILP is a measure of the number of operations in a computer program that can be executed simultaneously. The speed of programs can be increased by using ILP, i.e. by executing several RISC-like operations, such as additions, loads and stores, in parallel on various functional units. Every VLI (very long instruction) may contain an operation code for every FU (functional unit), and all the FUs get their operation codes at exactly the same time. Thus VLIW processors apply the same control discipline across all FUs. The register file and on-chip memory banks can be used by multiple functional units.

A better demonstration of ILP is shown with an example. Consider the computation


of an operation: y = a1p1 + a2p2 + a3p3 , on a sequential RISC processor which requires 11
cycles

1st Cycle - Load a1


2nd Cycle - Load p1
3rd Cycle - Load a2
4th Cycle - Load p2
5th Cycle - Multiply z1 a1 p1
6th Cycle - Multiply z2 a2 p2
7th Cycle - Add y z1 z2
8th Cycle -Load a3
9th Cycle - Load p3
10th Cycle - Multiply z1 a3 p3
11th Cycle - Add y y z1

On a VLIW processor which has two load/store units, one multiply unit and one add unit, the same code can be implemented in just 5 cycles:

1st Cycle - Load a1 | Load p1
2nd Cycle - Load a2 | Load p2 | Multiply z1 a1 p1
3rd Cycle - Load a3 | Load p3 | Multiply z2 a2 p2
4th Cycle - Multiply z3 a3 p3 | Add y z1 z2
5th Cycle - Add y y z3

Thus, the performance is almost twice as fast as that of a sequential reduced instruction set computing processor. If
this particular loop needs to be executed again and again, the free slots in cycles 3, 4, and 5
can be used by further overlapping the execution and loading for the next in line output
value to further enhance the performance.
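A sketch of how such overlapping might look at the source level, for a compiler that unrolls the loop so that the loads and multiplies of one output value fill the idle slots left by the previous one (C is used for illustration; the function name, the two-way unrolling and the requirement that p hold at least n + 2 elements are assumptions, not part of the original text):

void fir3(const int *p, int n, int a1, int a2, int a3, int *y)
{
    int i;
    /* two-way unrolled loop: the two output computations are independent,
       so a VLIW compiler can interleave their loads, multiplies and adds
       across the available functional units */
    for (i = 0; i + 1 < n; i += 2) {
        int y0 = a1 * p[i]     + a2 * p[i + 1] + a3 * p[i + 2];
        int y1 = a1 * p[i + 1] + a2 * p[i + 2] + a3 * p[i + 3];
        y[i]     = y0;
        y[i + 1] = y1;
    }
    for (; i < n; i++)   /* leftover iteration when n is odd */
        y[i] = a1 * p[i] + a2 * p[i + 1] + a3 * p[i + 2];
}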

10.1.2 Data-level Parallelism

Programs can be sped up by performing partitioned operations, in which a single arithmetic unit (AU) is divided so as to perform the same operation on several pieces of smaller-precision data; e.g. a 64-bit ALU is divided into eight 8-bit units to execute eight operations in parallel. Figure 10.2 illustrates an example of a partitioned add in which eight pairs of 8-bit pixels are added in parallel by a single instruction. This feature is known as a multimedia extension. By partitioning the arithmetic logic unit to execute the same operation on different data, it is possible to increase the performance by 2x, 4x or 8x depending upon the partition size.

Figure. 10.2 Basic Structure of VLIW Architecture

The performance improvement from data-level parallelism can be further explained by the example of adding two arrays p and q, each of 128 elements with each element 8 bits wide, into a result array r, as shown below.

/* each element of the array is 8 bits */
char p[128], q[128], r[128];

for (i = 0; i < 128; i++)
    r[i] = p[i] + q[i];

The same code can be executed using partitioned_add by packing eight 8-bit elements into each 64-bit word:

long p[16], q[16], r[16];

for (i = 0; i < 16; i++)
    r[i] = partitioned_add(p[i], q[i]);
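partitioned_add is not a standard C operator; on a machine without a multimedia extension its effect can be sketched in portable C as below (a software emulation for illustration only, assuming eight 8-bit lanes packed into a 64-bit word, whereas a real multimedia instruction would perform this in a single ALU operation):

#include <stdint.h>

/* Add the eight 8-bit lanes of x and y in parallel, discarding any carry
   that would otherwise propagate across a lane boundary. */
static inline uint64_t partitioned_add(uint64_t x, uint64_t y)
{
    const uint64_t high = 0x8080808080808080ULL;   /* MSB of each 8-bit lane */
    return ((x & ~high) + (y & ~high)) ^ ((x ^ y) & high);
}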

In this example, the use of data-level parallelism increases the performance by a factor of eight, as the number of loop iterations also decreases by a factor of eight. In addition, there is a performance improvement due to the reduction in branch overhead. For good data-level parallelism support, the functional units (e.g. the ALU and the multiplier) have to be designed in accordance with a well-designed multimedia instruction set.

As shown in Figure. 10.2 the VLIW type of architecture is derived from the two very
well known concepts: horizontal micro-coding and superscalar processing. Every word
contains fields to manage the routing of data to various register files and execution units. This
gives the compiler more control over data operations. However, the superscalar processor's
control unit must make instruction-issue decisions on the basis of little local information; the
VLIW machine has the ability of making these execution-order decisions at compile time,
thus allowing optimizations that lessen the number of hazards and stalls. This is a major
advantage in the implementation of straight-line code, but it is a disadvantage while dealing
with branch instructions, because the stalls are longer and more frequent. A typical VLIW machine has instruction words hundreds of bits in length. As shown in Figure 10.2, different FUs are utilized simultaneously in a VLIW processor. The instruction cache supplies multiple instructions per fetch, but the actual number of instructions issued to the different functional units may vary in each cycle; that number is constrained by data dependences. It is noted that the average ILP is almost 2 for code without loop unrolling.

10.2 VLIW Architecture and its instruction format

In the VLIW architecture all the functional units share a common large register file, as shown in Figure 10.3. The operations to be carried out simultaneously by the FUs are synchronized in a VLIW instruction (256 or 1024 bits), as in the Multiflow computer models. The concept of VLIW is basically taken from the horizontal micro-coding method. The various fields of the long instruction word carry the op-codes to be dispatched to the multiple FUs. Programs written in short instruction words must be bundled together to form the VLIW instructions. This code compaction has to be performed by a compiler that can predict branch outcomes using detailed run-time statistics.

Figure 10.3 VLIW architecture with single and separate register files for integer (fixed point) and floating point data
A VLIW processor is managed by very long instruction words, which comprise a control field for each of the EUs. The length of the instruction depends on the number of execution units (5-30 execution units) and on the code lengths required for controlling each execution unit (16-32 bits). The main aim is to speed up computation by using instruction-level parallelism. VLIW has the same hardware core as superscalar processors, which have multiple execution units (EUs) working in parallel. An instruction consists of multiple operations, as shown in Figure 10.4.

Figure 10.4 Very long instruction word format of the Trace 7/200 (Word 0: ALU0 early beat; Word 1: immediate constant (early); Word 2: ALU1 early beat; Word 3: FA/ALU adder control fields; Word 4: ALU0 late beat; Word 5: immediate constant; each 32-bit ALU word carries Opcode, Dest, Dest_bank, Branch, Test, Src_1, Src_2 and Imm fields)
Figure 10.4 shows the very long instruction word used in the Trace 7/200 processor, where the long instruction word is subdivided into 8 sub-words with early and late beats for execution, and where a single instruction includes multiple operations such as addition and multiplication. Typical word lengths range from 52 bits to 1 Kbit. All the operations present in a given instruction are executed in lock-step mode. One or more register files are required for the FX and FP data. The approach relies on a compiler to find the parallelism and to schedule the program code free of dependences. These processors contain several FUs and fetch, from the instruction cache, a very long instruction word containing several instructions; the whole VLIW is then dispatched for parallel execution. Such abilities are exploited by compilers which generate code in which independent instructions, executable in parallel, have been assembled together. These processors have simple control logic because they do not perform any dynamic scheduling (as contemporary superscalar processors do). VLIW has also been called a natural successor to reduced instruction set computing, as it transfers complexity from the hardware to the compiler, thereby allowing simple and fast processing.

The main aim of a very long instruction word is to remove the complex, time-consuming
instruction scheduling and parallel dispatch performed in modern microprocessors. In theory,
a VLIW processor should be faster and cheaper than a reduced instruction set computing chip.
The compiler must pack many operations into a single instruction word in such a way that the
various FUs are kept busy, which requires enough instruction-level parallelism (ILP) in the
code sequence to fill the available slots. The compiler uncovers this parallelism by
scheduling code across basic blocks, performing software pipelining and reducing the number
of operations executed.

10.2.1 Static Scheduling of Instructions in VLIW

The scheduling of instructions is performed wholly by the compiler in software. As the
hardware complexity is reduced:

1. The clock rate can be increased.

2. The degree of parallelism is increased.

3. The complexity of the software (compiler) is increased, because the compiler needs to be
aware of hardware details such as the number of execution units, their latencies, load-use
delays and memory behaviour.

4. The compiler has to account for worst-case delays and cache misses.

5. This hardware dependence limits the use of the same compiler across a line of very long
instruction word processors.

As a very long instruction word architecture reduces hardware complexity compared with a
superscalar architecture, a far more complicated compiler is required. Extracting maximum
performance from a superscalar reduced instruction set computing or CISC implementation
already requires sophisticated compiler techniques, but the level of sophistication needed in
a very long instruction word compiler is notably higher. A sketch of such compile-time
bundling follows.
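As an illustration of this compile-time bundling, the sketch below (in C) packs a stream of primitive operations into fixed-width VLIW bundles and pads unused slots with NOPs. The four-slot bundle width, the operation encoding and the dependence test are all invented for the example; they do not describe any real VLIW instruction set.

#include <stdio.h>
#include <string.h>

#define SLOTS 4            /* assumed issue width of the VLIW machine        */
#define NOP   0            /* opcode used to pad unused slots                */

/* One primitive operation: opcode plus destination and source registers.   */
typedef struct { int opcode, dest, src1, src2; } Op;

/* A VLIW instruction is simply a fixed array of operation slots.           */
typedef struct { Op slot[SLOTS]; } Bundle;

/* Two operations are independent if neither reads or writes a register     */
/* written by the other (a deliberately simplified dependence test).        */
static int independent(const Op *a, const Op *b)
{
    return b->src1 != a->dest && b->src2 != a->dest &&
           a->src1 != b->dest && a->src2 != b->dest && a->dest != b->dest;
}

/* Greedy static scheduler: place each operation in the current bundle if   */
/* it is independent of everything already there, otherwise start a new     */
/* bundle.  Slots left empty remain NOPs, which is exactly the code-size    */
/* cost discussed later under the disadvantages of VLIW.                    */
static int schedule(const Op *ops, int n, Bundle *out)
{
    int nbundles = 0, filled = 0;
    for (int i = 0; i < n; i++) {
        int ok = (filled < SLOTS);
        for (int s = 0; ok && s < filled; s++)
            ok = independent(&out[nbundles].slot[s], &ops[i]);
        if (!ok) { nbundles++; filled = 0; }
        out[nbundles].slot[filled++] = ops[i];
    }
    return nbundles + 1;
}

int main(void)
{
    /* r3 = r1 + r2 ; r6 = r4 * r5 ; r7 = r3 + r6 (third op depends on both) */
    Op prog[] = { {1, 3, 1, 2}, {2, 6, 4, 5}, {1, 7, 3, 6} };
    Bundle code[3];
    memset(code, 0, sizeof code);      /* pre-fill every slot with NOP       */
    int n = schedule(prog, 3, code);
    printf("%d bundles emitted for 3 operations\n", n);   /* expect 2 */
    return 0;
}

The first two operations are independent and share one bundle, while the third depends on both and must start a new bundle, leaving unused slots as NOPs.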

10.2.2 Pipelining in VLIW Processors

The pipelined execution of instructions in a VLIW machine is based on the instruction format
shown in Figure 10.4. Very long instruction word machines behave much like superscalar
processors, with three major differences. First, decoding of very long instruction word
instructions is much simpler than decoding superscalar instructions. Second, the code density
of the superscalar processor is better when the available ILP is less than that exploited by
the very long instruction word machine, because the fixed very long instruction word format
includes bits for non-executable operations, whereas the superscalar processor issues only
executable instructions. Third, a superscalar processor can be object-code compatible with a
large family of non-parallel machines, whereas a very long instruction word machine that
exploits different amounts of parallelism would need multiple instruction sets.
Instruction-level parallelism and data movement in a very long instruction word architecture
are specified at compile time, so run-time resource scheduling and synchronization are
completely eliminated. A VLIW processor can be viewed as an extreme case of a superscalar
processor in which all independent operations are compacted together beforehand, so the CPI
of a VLIW processor can be much lower than that of a superscalar processor. A comparison of
RISC, CISC and VLIW is summarised in Table 10.1.
Table 10.1 Architectural feature comparison of CISC, RISC and VLIW

Size of instruction: CISC - differs; RISC - one size (32 bits); VLIW - one size.

Format of instruction: CISC - placement of fields differs; RISC - consistent placement of fields; VLIW - regular, consistent placement of fields.

Registers: CISC - few (sometimes special-purpose registers); RISC - many general-purpose registers; VLIW - many general-purpose registers.

Memory access: CISC - bundled with operations in many different types of instructions; RISC - not bundled with operations; VLIW - not bundled with operations.

Hardware design: CISC - uses microcoded implementations; RISC - implementations with a single pipeline and no microcode; VLIW - implementations with multiple pipelines, no microcode and no complex dispatch logic.

10.2.3 Implementation and advantages of VLIW

A very long instruction word implementation achieves the same result as a superscalar
reduced instruction set computing or complex instruction set computing implementation, but it
does so without the two most complex parts of a high-performance superscalar design. Because
very long instruction word instructions explicitly specify several independent primitive
operations, there is no need for decode and dispatch hardware that tries to rediscover
parallelism in a serial instruction stream. By not making the hardware search for
parallelism, very long instruction word processors rely on the compiler that produces the
very long instruction word code to find it. Depending on a compiler has several advantages.
First, the compiler can examine much larger instruction windows than the hardware: in a
superscalar processor, a large hardware window implies a large amount of logic and hence more
chip area, whereas a software window can be arbitrarily large, so locating parallelism within
it is likely to give better results. Second, the compiler has knowledge of the source code of
the program, and a technique known as trace-driven compilation can be used to improve the
quality of the code produced by the compiler.

Third, by using a large number of registers, it is possible to emulate the function of the
reorder buffer of a superscalar design. The purpose of the reorder buffer is to allow a
superscalar processor to execute instructions speculatively and discard the results
immediately when necessary. Using many registers, a very long instruction word machine can
place the outcomes of speculatively executed instructions in temporary registers. The
compiler knows which instructions lie along the predicted path, so it uses the temporary
registers along that path and simply disregards the values in those registers if the branch
turns out to have been mis-predicted.

Advantages of VLIW

1. The compiler determines the dependencies and schedules operations according to the
functional unit latencies.

2. The compiler assigns each operation to a functional unit corresponding to its position
within the instruction packet.

3. Simpler instruction-issue logic in this.

4. Their simple instruction-issue logic also often allows very long instruction word
processors to fit more execution units onto a given chip area than superscalar processors.

5. Hardware complexity is considerably reduced.

6. Tasks similar to decoding, data dependency detection, and instruction issue etc. are
simplified.

7. Higher clock rate.

8. Instructions can therefore be executed with a shorter clock cycle than in superscalar
processors.

9. Power consumption is reduced.

Disadvantages of VLIW

1. Compiler's complexity increases manifold.

2. Compiler optimization requires considering technology dependent parameters such


as load-use time of cache and latencies.

3. Very long instruction word programs work well only when implemented on a
processor with exact same number of EUs and exact same instruction latencies as
the processor they were compiled for.

4. An increase in the number of execution units between generations leads the new processor
to combine operations from different instructions in each cycle, which can cause dependent
operations to be issued in the same cycle.

5. Unscheduled events such as 'cache miss’ can lead to the stalling of entire processor.

6. Varying instruction latencies among different generations of a processor line can cause
operations to be executed before their inputs are ready, or after their inputs have been
overwritten, resulting in erroneous behaviour.

7. Code expansion causes high power consumption.

8. If the compiler does not find enough number of parallel operations to fill all of the
slots in a particular instruction, it must use explicit NOP (no operation) operations
into the operation slots which causes Very long instruction word programs to take
much more memory than equivalent programs for the superscalar processors.

9. In case of unfilled opcodes, memory space and instruction bandwidth are wasted in
VLIW. Hence, low slot utilization.
10.3 Example of VLIW processor
Itanium (IA-64)
Itanium is a line of 64-bit Intel microprocessors that implements the Intel Itanium
Architecture (formerly called IA-64), as shown in Figure 10.6. Intel markets the processors
for enterprise servers and high-performance computing systems. IA-64 was the first
architecture to bring ILP (instruction-level parallel execution) attributes to
general-purpose microprocessors. It is based on EPIC (Explicitly Parallel Instruction
Computing), offers high computing speed and has a comparatively simple architecture.

IA-64 is an explicitly parallel architecture with a rich register set; the base data word
length is 64 bits, memory is byte addressable and the logical address space is 2^64 bytes.
The architecture also implements branch prediction and speculation. For parameter passing it
uses a mechanism called register renaming, which is also used to execute loops in parallel.
The compiler controls prediction, speculation and register renaming; to accommodate this
control, each instruction has an extra bit in its word, which is a distinguishing
characteristic of the IA-64 architecture. The architecture implements the following features:
 Integer Registers: 128
 Floating Point Registers: 128
 One-bit predicates: 64
 Branch Register: 8
 Length of the Floating Point Registers: 82 bits

Instruction execution in IA-64

One IA-64 very long instruction word is 128 bits long and contains 3 instructions, and the
fetch mechanism can read up to 2 instruction words from the L1 cache into the pipeline in a
single clock cycle. Therefore, in a single clock cycle, the processor can execute 6
instructions. IA-64 has a total of 30 functional execution units organised in 11 groups. Each
sub-word of the long instruction is executed by an execution unit in one clock cycle,
provided its data is available. The execution unit groups are given below:
 2 integer units, 6 general-purpose ALUs and 1 shift unit
 4 data cache units
 1 parallel multiply unit, 2 parallel shift units, 1 population count unit and 6
multimedia units
 2 SIMD FP MAC (floating-point multiply and accumulate) units
 2 82-bit FP MAC units
 3 branch units

IA-64 is rated at 3.2 giga floating-point operations per second (GFLOPS) at 800 MHz and 6.67
GFLOPS at 1.67 GHz.
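As a rough sanity check of these ratings, one can assume that the two FP MAC units each retire one multiply-accumulate (two floating-point operations) per cycle, giving four FP operations per clock; the tiny program below only multiplies this assumed figure by the clock rate and is not an official specification.

#include <stdio.h>

int main(void)
{
    /* Assumed peak: 2 FP MAC units x 2 FLOPs per MAC = 4 FLOPs per cycle. */
    const double flops_per_cycle = 4.0;

    printf("0.80 GHz -> %.2f GFLOPS\n", 0.80 * flops_per_cycle);  /* 3.20 */
    printf("1.67 GHz -> %.2f GFLOPS\n", 1.67 * flops_per_cycle);  /* 6.68 */
    return 0;
}

The results agree with the quoted 3.2 GFLOPS and (to rounding) 6.67 GFLOPS figures.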

10.4 Flynn’s Taxonomy

Parallel architectures are classified by Flynn's taxonomy, which distinguishes multiprocessor
architectures by their instruction and data streams as follows:

• Single instruction, single data stream (SISD)


• Single instruction, multiple data streams (SIMD)
• Multiple instructions, multiple data streams (MIMD)
• Multiple instructions, single data stream (MISD)

SISD refers to a computer architecture, shown in Figure 10.7, in which a single processor (a
uniprocessor) executes a single instruction stream operating on data stored in a single
memory.

Figure 10.7 SISD Architecture
In this architecture, instructions are executed in serial order; it is the conventional
sequential machine model. Example: the two instructions c = a + b and a = b*2 will be
executed as below:
1st Cycle: Load a
2nd Cycle: Load b
3rd Cycle: c=a+b
4th Cycle: Store c
5th Cycle: a = b *2
6th Cycle: Store a
From the above illustration it can be observed that only one instruction and data
stream is acted on during any one clock cycle.
SIMD architectures operate on a vector of data with a single instruction; in modern SIMD
architectures all the elements of the vector are processed simultaneously.
Figure 10.8 SIMD Architecture
Example:

Cycle        P1                    P2                    Pn
1st Cycle:   Previous instruction  Previous instruction  Previous instruction
2nd Cycle:   Load a(1)             Load a(2)             Load a(n)
3rd Cycle:   Load b(1)             Load b(2)             Load b(n)
4th Cycle:   c(1) = a(1)*b(1)      c(2) = a(2)*b(2)      c(n) = a(n)*b(n)
5th Cycle:   Store c(1)            Store c(2)            Store c(n)
6th Cycle:   Next instruction      Next instruction      Next instruction
From Figure 10.8 and the example above it can be observed that all the processing units, i.e.
P1, P2, ..., Pn, execute the same instruction at any given clock cycle, but each processing
unit operates on a different data element. This concept is used in vector computers with
scalar and vector hardware.

Figure 10.9 MIMD Architecture
MIMD achieves parallelism with an architecture, shown in Figure 10.9, that has a number of
processors functioning asynchronously and independently. At any time, different processors
may be executing different instructions on different pieces of data.

Example:

Cycle        P1                    P2                    Pn
1st Cycle:   Previous instruction  Previous instruction  Previous instruction
2nd Cycle:   Load a(1)             Call func z           do 15 i=1,N
3rd Cycle:   Load b(1)             y = x*z               k = m**4
4th Cycle:   c(1) = a(1)*b(1)      diff = y*2            q = c(i)
5th Cycle:   Store c(1)            call sub1(i,j)        15 continue
6th Cycle:   Next instruction      Next instruction      Next instruction

From the above example it can be observed that an MIMD machine executes different
instructions on different data elements in the same clock cycle. MIMD is currently the most
common type of parallel computer.

MISD is a type of parallel computing architecture, shown in Figure 10.10, in which different
operations are performed on the same data using many functional units. Since in pipelined
architectures the data is transformed after processing by each stage of the pipeline,
pipelines can be regarded as a form of MISD.

Figure 10.10 MISD Architecture

MISD is used in very few practical applications. One important application of the MISD
architecture is systolic arrays, where, for example, a single encoded message is cracked
using multiple cryptography algorithms.

10.5 SIMD Architecture

There exist two types of SIMD architectures, viz. True SIMD and Pipelined SIMD.

True SIMD architecture

True SIMD architectures are distinguished by their use of distributed or shared memory. The
SIMD implementations for distributed and shared memory are shown in Figures 10.11 and 10.12;
they differ in the placement of the processors and memory modules.

True SIMD architecture with Distributed Memory

The control unit of a true SIMD architecture with distributed memory interacts with every
processing element in the architecture, and each processing element has its own local memory,
as shown in Figure 10.11. The control unit provides instructions to the processing elements,
which are used as arithmetic units. When a processing element needs to communicate with the
memory of another element in the same architecture, both the fetching of information and its
transfer between processing elements are routed through the control unit. The main drawback
of this architecture is slow performance, since the control unit handles all the data
transfer activity.

Figure 10.11 True SIMD Architecture

True SIMD architecture with Shared Memory

The processing elements in this architecture have shared memory but do not have local
memories. The processing elements are connected to a network through which they can access
any memory module. As shown in Figure 10.12, the same network allows every processing element
to share its memory content with the others. The role of the control unit in this
architecture is only to send instructions for computation; it has no role in accessing
memory. The advantage of this architecture is lower processing time, since the control unit
takes no part in data transfer.

Figure 10.12 True SIMD Architecture with shared memory

Pipelined SIMD Architecture

In pipelined SIMD, the control unit sends an instruction to each processing element, and the
processing elements then perform the computation in multiple stages using a shared memory.
The architecture varies with the number of stages used for pipelining.

10.6 Interconnection Networks for SIMD

The interconnection network carries data between memory and processors, and the topology of a
network describes how the connection pattern between these communicating nodes is formed. Two
types of topology are used in general: 1) direct topology, which provides point-to-point
connections using a static network, and 2) indirect topology, which provides dynamic
connections using switches. The choice of interconnection network is based on the demands of
the application for which the SIMD architecture is designed. Some common topologies used in
SIMD architectures are:

Mesh Connected Array:

The most commonly used network topology for SIMD architectures, the 2-dimensional mesh, is
shown in Figure 10.13. It is a direct topology in which switches are arranged in a 2-D
lattice structure, and only communication between neighbouring switches is allowed. The main
feature of this network topology is that it supports close local connections, a feature that
is exploited in several applications. Its main drawback is a relatively large maximum
distance: with wrap-around connections at the edges, the maximum distance Dmax for N
processing elements is √N.

Figure 10.13 (a) 2-dimensional mesh (b) mesh with wrap-around connections at the edges
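A small sketch of this distance property, assuming an n x n mesh with wrap-around (torus) links: the hop count in each dimension can never exceed n/2, so for N = n*n processing elements the worst case is about √N. The routine below is illustrative only.

#include <stdio.h>
#include <stdlib.h>

/* Hop distance between two nodes of an n x n mesh with wrap-around edges. */
static int torus_distance(int n, int x1, int y1, int x2, int y2)
{
    int dx = abs(x1 - x2), dy = abs(y1 - y2);
    if (dx > n - dx) dx = n - dx;     /* take the shorter way around        */
    if (dy > n - dy) dy = n - dy;
    return dx + dy;
}

int main(void)
{
    int n = 8;                        /* N = 64 processing elements          */
    /* Worst case: n/2 hops in each dimension -> n hops = sqrt(N).          */
    printf("max distance = %d (sqrt(N) = %d)\n",
           torus_distance(n, 0, 0, n / 2, n / 2), n);
    return 0;
}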

Shuffle exchange networks, direct and indirect binary n-cubes:

Figure 10.14 (a) Shuffle exchange (b) Direct binary 3-cube (c) Indirect binary 3-cube
10.7 Data Parallelism in SIMD

Data-level parallelism can be exploited significantly in SIMD architectures, both for
matrix-oriented scientific computing and for media-oriented image and sound processing. Since
SIMD needs only one instruction to be fetched for a whole data operation, it leads to an
energy-efficient architecture that is attractive for mobile devices. SIMD operates on
multiple data elements, which can be viewed in space or in time: in the space view, multiple
data elements are processed at the same time under a single instruction by an array
processor; in the time view, multiple data elements are processed under a single instruction
in consecutive time steps by a vector processor.

Example:

Instruction stream:

LD B ← A[3:0]

ADD B ← B, 1

MUL B ← B, 2

The above example illustrates how multiple data elements are operated on by a single
instruction: an array processor applies each instruction to all the data elements in parallel
at the same time, whereas a vector processor applies it to the elements in consecutive time
steps.
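For illustration, the same LD/ADD/MUL sequence can be written as an ordinary C loop over the array; because every iteration is independent, an array processor can apply each operation to all elements at once, while a vector processor streams the elements through its pipeline. This scalar version is only a sketch of the data-parallel pattern.

#include <stdio.h>

#define N 4

int main(void)
{
    int a[N] = {1, 2, 3, 4};
    int b[N];

    /* Each iteration is independent, so a SIMD machine can execute the    */
    /* element operations of all iterations under a single instruction.    */
    for (int i = 0; i < N; i++) {
        b[i] = a[i];          /* LD  B <- A[3:0]  */
        b[i] = b[i] + 1;      /* ADD B <- B, 1    */
        b[i] = b[i] * 2;      /* MUL B <- B, 2    */
    }

    for (int i = 0; i < N; i++)
        printf("b[%d] = %d\n", i, b[i]);
    return 0;
}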

10.8 SIMD Example: MasPar MP1

MasPar stands for massively parallel machine, as it involves a huge number of processing
elements working in parallel. In this architecture an almost unlimited number of processing
elements can be used, since the design incorporates a distributed memory scheme; the only
practical limitation is the cost of the processing elements.

MP-1 Machine: In this architecture the processing elements (PEs) are connected in a 2-D
lattice structure, and the PE array is custom-designed for the MP-1. A front-end machine (a
VAX) drives the system, and very high speed I/O devices can be attached to the architecture.

Figure 10.15 MasPar system overview

In the MasPar architecture an instruction operating on multiple data elements is executed in
one go, as shown in Figure 10.15. The architecture has two main parts: the Front End and the
Data Parallel Unit (DPU), which is further divided into the Array Control Unit (ACU) and the
Processor Element Array (PEA).

Front End:

The MasPar computational engine does not have its own operating system; therefore a front-end
workstation (e.g. a VAX) is required to interface with MasPar and make it programmer
friendly.

DPU:

The parallel portions of a program are executed by the DPU, which consists of two parts, the
ACU and the PEA. The ACU performs two tasks: it executes the instructions that operate on
singular (scalar) data, and it feeds the instructions that operate on parallel data to the
PEs.

PEA:

Simple processors are connected in a 2-D mesh whose end connections wrap around. Each
processor is connected to its eight neighbouring PEs, which are capable of reading from and
writing to memory and of performing arithmetic operations, as shown in the figure. The PEs
can only execute; they do not perform instruction fetching and decoding.

Questions
Q1. What does VLIW stand for? Explain it with a suitable example.
Q2. Explain the instruction format of VLIW architecture.
Q3. Explain the architecture of VLIW and give an example.
Q4. Explain detailed architecture and features of IA 64 processor.
Q5. What is parallel processing? Give the classification of parallel architectures based
on instructions and data.
Q6. What does SIMD stand for? Explain its architecture in detail.
Q7. Explain the network topology used in SIMD architectures.
Q8. What is MASPAR? Explain in detail an example of SIMD architecture.

Summary

This chapter describes what a very long instruction word (VLIW) is, its importance in a
parallel processing environment, and how instruction-level and data-level parallelism are
achieved at the architecture level. The VLIW instruction format has been discussed in detail,
along with how instructions are scheduled for parallel processing. Pipelining and
implementation in VLIW architectures are discussed together with their advantages and
disadvantages. The chapter also covers the classification of parallel architectures based on
data and instruction streams, with architectural descriptions of each class, and explains
SIMD architecture implementation with its interconnection networks. Examples of VLIW and SIMD
architectures are also explained in a simple way.

Glossary

 VLIW: Very Long instruction Word, where the length of the instruction is around
256 bits to
1024 bits. The length depends on the number of execution units available in a
processor.
 RISC : Reduced Instruction set computing, in this instruction set architecture is
very simple
hence leads to increase in the performance of the processor.
 CISC : Complex instruction set computing, in this more than one instruction of
low level
are put in one main instruction.
 EPIC: Explicitly parallel instruction computing, in this architecture parallel
execution of the
instructions are done at the complier level.
 Register file: A processor’s set of registers arranged in array.
 Cache miss: When the data that a processor is trying to read or write is not found in the
cache memory, the result is a cache miss.
 SISD architecture: One instruction and one data will be executed by a processor
in one clock cycle.

 SIMD architecture: One instruction and more than one data element will be executed by a
processor in one clock cycle.
 MISD architecture : More than one instruction and one data will be executed by
a processor in
one clock cycle.
 MIMD architecture : More than one instruction and more than one data will be
executed by a
processor in one clock cycle.
 Distributed memory: Each Processor will have its own memory which acts as
local memory to that
processor, in multi processor environment.
 Shared Memory: Same memory is used by many processors through a network in
multi processor
environment.

Chapter 11

Advanced Memories

Structure

11.1 Objective
11.2 Cache Accessing
11.3 Latency during Cache access
11.4 Cache Miss Penalties
11.5 Non-Blocking Cache memory
11.6 Distributed Coherent Cache

11.1 Objective

The objective of this chapter is to define and discuss the working of advanced memories. A
variety of techniques and processors make use of these advanced memories to enhance overall
performance and instruction execution. Section 11.2 describes cache accessing mechanisms,
such as write-through and write-back, along with the bottlenecks associated with these
techniques. Section 11.3 discusses the latency issues related to cache access; latency
depends on several factors: cache size, hit ratio, miss ratio, miss penalty and page size.
Section 11.4 covers cache miss penalties, which are categorised into three main classes:
compulsory, capacity and conflict. Section 11.5 covers non-blocking cache memory: whenever a
miss occurs, the gap between processor access time and memory latency shows up and processor
utilisation decreases, and the non-blocking cache technique is used to deal with this miss
penalty effectively. Section 11.6 covers the distributed coherent cache, in which distributed
memory is used to solve the cache coherence problem.

Access Mechanism

A variety of memory types are available, with different access mechanisms; depending on the
mechanism, a memory is used to store or retrieve a particular type of information. These
mechanisms are:

Random Access Memory (RAM)

It is defined as memory in which each location, usually known as a word, has a unique
addressing mechanism. The address is physically wired in, and the location can be fetched in
one memory cycle. It generally takes a constant time to retrieve any given location. Main
memory and cache use this access mechanism.

Content Addressable Memory (CAM)

This memory is also known as associative memory; to access a location, a field of the data
word is used instead of an address. The concept of RAM is extended with bit-comparison logic
that is physically wired in alongside the addressing system. This logic circuit compares the
desired bit positions of all the words with a particular input key; the comparison is done for
all words, and the words for which a match occurs are then accessed.

Figure 11.1 General organization of memory subsystem

The general organization of the memory subsystem is detailed in Figure 11.1. The diagram
represents the main storage of a computer system. For economic reasons, designers have to
build large-capacity main storage whose speed is lower than that of the CPU. To resolve this
problem and reduce the speed mismatch between the CPU and main memory, a high-speed,
low-capacity cache memory is introduced in between. The typical memory hierarchy used in a
computer system is represented in Figure 11.2, and the access mechanism that makes the cache
act as a high-speed memory is elaborated in the following section.

Figure 11.2 Memory Hierarchy

11.2 Cache Accessing

When the processor needs to read from a location in main memory, it first checks whether a
copy of that data is available in the cache. If it exists, the processor immediately reads from
the cache. Block diagram in Figure 11.3 shows relation of CPU with cache and main memory
via common data and address bus.

Figure 11.3 Block diagram of CPU-Cache-Main Storage Relation

If a copy of the data to be accessed is available in the cache, the access is recognized as a
'hit'; if not, it is a 'miss'. The working of the cache memory can be elaborated with the
help of the flow chart in Figure 11.4. The address-translation mechanism of the cache is
referred to as address mapping (discussed in Chapter 6). In case of a hit, the read/write
operation is executed on the cache. If there is a miss, the data is brought from main memory
and the addressed word is made available to the CPU for processing. Two basic approaches are
applied for effective utilization of the cache.

 Write through

Write-through is a simple technique where all write operations are made to main memory as
well as to the cache. Since main memory always holds valid data, any CPU-cache module can
monitor the traffic to main memory to maintain consistency within its own cache. However, a
considerable amount of memory traffic is generated, which may lead to a bottleneck.

 Write back

Write-back is a technique that reduces memory writes to a minimum. With this approach,
updates are made only in the cache. When an update occurs, a flag bit F associated with the
page is set. Then, when the page is replaced, it is written back to main memory if and only
if its flag bit F is set. A key point with write-back is that the copy in main memory may be
stale, so any access by an input/output device must go through the cache. The price is more
complex circuitry, but the technique is well suited to situations where the memory traffic of
write-through would create a bottleneck.
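A minimal sketch of the two policies, assuming a toy one-line 'cache' held in a single variable with a dirty flag; real caches keep a dirty bit per line, but the control flow below shows where the main-memory write happens in each scheme.

#include <stdio.h>

#define MEM_WORDS 16

static int main_memory[MEM_WORDS];

/* One cached word plus the bookkeeping used by the two policies.       */
static int cached_addr = -1;   /* which address the cache line holds    */
static int cached_data;
static int dirty;              /* write-back only: line modified?       */

/* Write-through: every store updates the cache and main memory.        */
static void store_write_through(int addr, int value)
{
    cached_addr = addr;
    cached_data = value;
    main_memory[addr] = value;        /* memory traffic on every write   */
}

/* Write-back: stores update only the cache and set the dirty flag;     */
/* main memory is updated only when the dirty line is evicted.          */
static void store_write_back(int addr, int value)
{
    if (cached_addr != addr && cached_addr >= 0 && dirty)
        main_memory[cached_addr] = cached_data;   /* write back on evict */
    cached_addr = addr;
    cached_data = value;
    dirty = 1;
}

int main(void)
{
    store_write_through(3, 42);
    printf("write-through: mem[3] = %d\n", main_memory[3]);              /* 42 */

    store_write_back(5, 7);
    printf("write-back before eviction: mem[5] = %d\n", main_memory[5]); /* 0  */
    store_write_back(6, 9);                        /* evicts the dirty line 5  */
    printf("write-back after eviction:  mem[5] = %d\n", main_memory[5]); /* 7  */
    return 0;
}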

Figure 11.4 Flowchart representing the working principle of read/write operation with
cache memory

There are four basic modes of accessing:

 Read
 Write
 Execute
 Modify.

Read and Write modes are simply used for reading of data from memory and writing/ storing
data to memory. Execute access mode is used in case instructions are to be accessed from
memory. Modify access mode is used to coordinate simultaneously executing programs in
one or more processors. Accessing is further comprised of three phases. First phase is
translation, used for compilation of high level language program into a lower level
representation. Second phase is linking, used to combine several separately translated
programs to form a single larger program for execution. Third phase is Execution, running of
the linked program. Most modern CPUs have at least three independent caches: an I-cache
(instruction cache), a D-cache (data cache) and a TLB (translation lookaside buffer). The
I-cache speeds up access to executable instructions from the cache, the D-cache is
responsible for accessing and storing data, and the TLB is discussed later in this chapter.

An effective memory accessing design provides access to information in the computer when
the processor needs it. It should support object naming. For efficiency, it should not allow the
information to get too far from the processor, unless the information is accessed infrequently.
Thus to do its job efficiently, allocation mechanism may move information closer to or
farther from the processor depending upon how frequent this information is required. To
implement these moves and to keep track of the closest location of each object, the accessing
mechanism must be able to map names among name spaces as the object move among
memory modules. This mapping is performed in multiple stages.

11.2.1 Object Designation Options

To access an object, the object must be selected, either on the basis of its location i.e. location
addressing or its contents i.e. content addressing. Location addressing selects objects based
on location within a memory address space. Content addressing deals with a specific portion
of the object’s representation, known as key that is required to match to the selector value.
Key is defined as the component of object in memory and selector is defined as the input
value that describes the entry to be accessed.
11.2.2 Object Name Mapping Strategies

Different types of object names may exist within a computer system; these are collected into
separate name spaces. A name mapping transforms addresses between name spaces. Multiple
name mapping options are available. The translation from a name n1 in name space N1 to
corresponding name n2 in name space N2 can be represented by a mapping function
n2 = f1, 2(n1)
Mapping functions can be implemented on the basis of two representation schemes:

 Algorithmic representation
 Tabular representation

An algorithmic representation can be written as a procedure. It may be slow, depending on the
complexity of the procedure; simple algorithms reduce the mapper's execution time and are
therefore preferred.

Tabular representation specifies output name corresponding to each input name if possible.
Tabular mapping can also be represented as a binary relation. If two items I1 and I2 satisfy the
binary relation R, it is written as

Pair (I1, I2) belongs to the relation R.

Complete relation can be traversed by listing all pairs that belongs to this relation. Complete
listing will give a view of table. Here, two main keywords are used first is domain and second
is range. Domain is the set of input values for which mapping is defined. Range is set of
output values produced by the mapping.
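A toy illustration of such a tabular mapping, assuming small integer name spaces: every input name in the domain has an explicit table entry giving the corresponding output name, and names outside the domain are reported as unmapped. The table contents are invented for the example.

#include <stdio.h>

#define TABLE_SIZE 4
#define UNMAPPED  -1

/* One row of the mapping table: input name n1 -> output name n2.        */
typedef struct { int n1, n2; } MapEntry;

/* The complete listing of pairs (n1, n2) that belong to the relation.   */
static const MapEntry table[TABLE_SIZE] = {
    {0, 100}, {1, 104}, {2, 112}, {3, 116}
};

/* Tabular mapping: search the domain for the input name.                */
static int map_name(int n1)
{
    for (int i = 0; i < TABLE_SIZE; i++)
        if (table[i].n1 == n1)
            return table[i].n2;
    return UNMAPPED;               /* n1 lies outside the domain          */
}

int main(void)
{
    printf("f(2) = %d\n", map_name(2));   /* 112      */
    printf("f(9) = %d\n", map_name(9));   /* unmapped */
    return 0;
}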

11.2.3 Name Mapping Lifetimes

An important constraint on a name mapping is its lifetime, i.e. the time interval during
which the mapping remains valid. The mapping information can be discarded once it is certain
that it will never be required again. A fixed name mapping is the same for every execution of
a program, while a variable name mapping may change. If the same name mapping function is
used during the entire execution of a program, it is known as a static name mapping; one with
a variable mapping function is known as a dynamic name mapping. Static name mappings can be
performed during program compilation.

11.3 Latency during Cache Access

Memory is a device which responds to CPU’s read/write instructions by retrieving/storing
information from/to the addressed location provided by CPU. Memory speed is represented
by memory clock cycle Tm i.e. the time period elapsed between the moment when the address
is placed on MAR (Memory Address Register) and reading/writing of information from/in
the address location is completed. When the read operation is complete, the read information
is available in MDR (Memory Data Register). When the write operation is complete,
information from MDR is written onto addressed location. For a single module memory,
latency can be defined as the interval between the instance when a read instruction is sent by
the CPU to the memory and the moment when data is available with the CPU. This approach
is clearly diagrammatically explained in Figure 11.5.

Figure 11.5 Latency for a single module memory

To achieve minimum possible latency, data or program targeted by CPU should be available
in cache and directly accessible by CPU. Latency is dependent upon a number of factors
like:
 Cache size
 Hit Ratio
 Miss Ratio
 Miss Penalty
 Page size

Cache Size

The first factor, cache size, significantly affects the memory access time. The cache should
be small enough that the average cost per bit is close to that of main memory alone, yet
large enough that the average access time is close to that of the cache alone. Achieving an
optimum cache size is almost impossible: the larger the cache, the larger the number of gates
involved in addressing it, which makes it slightly slower than smaller caches, even when
built with the same IC technology.

Hit Ratio

Average memory access time can be reduced by pushing up the hit ratio at the lower-level
memory. The hit ratio (H) is defined as

H = Ni / (Ni + Ni+1)

where Ni is the number of times the word targeted by the CPU is available at the ith level
(cache), and Ni+1 is the number of times the word is not available and has to be brought from
the (i+1)th level (main memory) to the ith level (cache memory).

Miss Ratio

Miss Ratio (M) is defined by


M=1–H

Miss Penalty

Miss penalty is defined as the collaborative time to replace a page from (i +1)th level to ith
level and time to deliver the page to CPU from ith level.
Miss penalty (at the ith level) = time to replace a page from the (i+1)th level to the ith
level + time to deliver the page to the CPU from the ith level.

The time taken to replace a page from the (i+1)th level to the ith level consists of the
access time of the (i+1)th level memory followed by the page transfer time. So, the memory
access time (at the ith level) is

Memory access time (at ith level) = access time on a hit to level i memory X percentage
of hits (H) + miss penalty X percentage of misses (M).
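A brief numeric illustration of this expression, using assumed figures (a 1-cycle cache hit, a hit ratio of 0.95 and a 20-cycle miss penalty) chosen only for the arithmetic.

#include <stdio.h>

int main(void)
{
    double hit_time     = 1.0;    /* cycles, access time on a cache hit   */
    double hit_ratio    = 0.95;   /* H                                    */
    double miss_penalty = 20.0;   /* cycles to bring the page/block in    */

    double miss_ratio  = 1.0 - hit_ratio;                     /* M = 1 - H */
    double access_time = hit_time * hit_ratio + miss_penalty * miss_ratio;

    printf("average memory access time = %.2f cycles\n", access_time);  /* 1.95 */
    return 0;
}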

Page Size

Page size also affects the memory access time. The effect of page size on miss penalty and
miss rate is elaborated in Figure 11.6 (a) and (b). Memory latency is fixed for a given
memory type; therefore, as the page size increases, the miss penalty increases (Figure 11.6
(a)). With a larger page size, the hit ratio increases only up to a limit; beyond that limit
the hit ratio drops, because fewer of the larger pages fit in the memory (Figure 11.6 (b)).
The miss rate varies inversely with the hit ratio. Figure 11.6 (c) shows the variation of
average access time with page size.

Figure 11.6 Effect of Page Size (a) on Miss penalty (b) on Hit ratio (c) on Average access
time

To reduce this problem, a memory with multiple modules can be used instead of a single
module. The CPU can issue requests to the different modules in an interleaved manner, as
shown in Figure 11.7. As a result, a set of memory words (data/instructions) is made
available to the CPU in sequence. This technique reduces the wait time and improves the
bandwidth of data flow, because data/instructions are delivered one after the other without
waiting.

Figure 11.7 Latency reduced using multiple module memory
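As a sketch of how consecutive addresses can be spread over such modules, the fragment below assumes simple low-order interleaving, where the module number is the address modulo the number of modules; the module count is arbitrary.

#include <stdio.h>

#define MODULES 4   /* assumed number of memory modules */

int main(void)
{
    /* Low-order interleaving: consecutive words map to consecutive        */
    /* modules, so sequential accesses keep all the modules busy in turn.  */
    for (int addr = 0; addr < 8; addr++)
        printf("address %d -> module %d, word %d within the module\n",
               addr, addr % MODULES, addr / MODULES);
    return 0;
}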

11.4 Cache Miss Penalties

Cache memory plays a very important role in making a system efficient. Whenever we wish to
read or write any data/instruction, the cache is checked first. If the data is available, it
is sent to the processor immediately; this is known as a hit. For an efficient system with
good throughput there should ideally always be a hit, but this is not possible in practice.
When the data/instruction is not available in the cache, it is fetched from main storage into
the cache and then sent to the processor for execution. This event is known as a miss, and it
obviously carries an overhead, technically termed the miss penalty. The miss penalty is
therefore defined as the combined time to bring a page from main memory into the cache and
the time to deliver the page from the cache to the CPU.

Cache misses can be categorized under three different variants:


 Compulsory
 Capacity
 Conflict

Compulsory

A compulsory miss is the miss that occurs on the very first reference to a page that is still
residing only in main storage; in other words, the page is being used for the first time (or
is used very infrequently).

Capacity

It is defined as a miss when cache memory is full and there is a requirement to bring another
page from main storage to cache memory by replacing one of the existing pages. For this
replacement two popular schemes are there. One is FIFO (First in First Out) replacement
algorithm and other is LRU (Least Recently Used) replacement scheme.

Conflict

In some page mapping techniques like direct mapping and set-associative mapping, a set of
pages are mapped to same set of page frames. This may result in a miss known as conflict
miss. In case of full associative mapping with larger number of page frames per set reduces
conflict misses as there is high level of flexibility for placing a page in set.

11.4.1 Cache performance evaluation

To evaluate the contribution of the cache to overall program execution time, we first
evaluate the CPU execution time for a program. CPU time is calculated as the product of the
instruction count (IC), i.e. the number of instructions in the program, the clocks per
instruction (CPI), i.e. the number of clock cycles taken by an individual instruction, and
the clock cycle time (CCT). The CPI value also contains the delay to access data from the
memory subsystem, which includes the cache and main storage; thus the cache access delay is
included within the CPI value. However, as already stated, a datum may not reside in the
cache, giving rise to a miss. In that case the CPI value increases because of memory stall
clock cycles (MSC).

MSC = memory accesses per instruction X miss rate X miss penalty

When taking into account the effect of cache misses,

CPU time = IC X (CPI + MSC) X CCT
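A short numeric sketch of this relation with assumed parameters (one memory access per instruction, a 2% miss rate, a 50-cycle miss penalty and a base CPI of 1.5); the values are illustrative, not measurements.

#include <stdio.h>

int main(void)
{
    double ic           = 1.0e9;   /* instruction count                    */
    double cpi          = 1.5;     /* base clocks per instruction          */
    double cct          = 1.0e-9;  /* clock cycle time: 1 ns (1 GHz)       */
    double accesses     = 1.0;     /* memory accesses per instruction      */
    double miss_rate    = 0.02;
    double miss_penalty = 50.0;    /* cycles                               */

    /* MSC = memory accesses per instruction x miss rate x miss penalty    */
    double msc = accesses * miss_rate * miss_penalty;

    /* CPU time = IC x (CPI + MSC) x CCT                                   */
    double cpu_time = ic * (cpi + msc) * cct;

    printf("MSC = %.2f stall cycles per instruction\n", msc);        /* 1.00 */
    printf("CPU time = %.2f s (%.2f s with a perfect cache)\n",
           cpu_time, ic * cpi * cct);                         /* 2.50 vs 1.50 */
    return 0;
}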

Thus, it is concluded that:

 Memory stall cycles due to cache misses have a considerable effect in a system with a low
CPI value.
 A higher CPU clock rate results in a larger miss penalty.
 For systems with a low CPI value and a high clock rate, e.g. RISC architectures, cache
performance therefore plays a very important role.

11.4.2 Reducing miss penalty for read operation

Normally read operation is used more frequently as compared to write operation. So, in order
to improve efficiency read miss penalty must be reduced. Two main approaches are used to
reduce the same:
 While transferring a page from main memory, CPU operation need not be stalled till
the full page is transferred. CPU execution can be made to continue as soon as the
desired page is received.
 A large page may be divided into a number of sub-blocks, each sub-block having a
discrete valid bit. This will reduce miss penalty as only a sub-block is transferred
from main storage.
However, both the above said approaches need additional hardware for implementation.

11.4.3 Reducing miss penalty for writing operation

The miss penalty for write operations can be reduced by providing a write buffer of adequate
capacity. A suitable buffer size can be determined from simulations of benchmark programs.
However, a drawback is associated with this approach: if a read from a page in main storage
is required while updated data for it is still in the write buffer, an erroneous result could
be produced. This issue can be handled by approaches such as:
 Read miss operation should wait until the write buffer is empty. But, this will increase
read miss penalty.
 Check the content of write buffer on read miss if the targeted word is not available in
buffer, read miss action can continue.

11.4.4 Improving the speed of write operation

For a read operation, the tag comparison and the data read from the cache can proceed
simultaneously. For a write operation, writing proceeds only after a tag match, so a write
consumes more than one clock cycle. We can increase the speed of write operations by
pipelining the two steps, tag search and cache write: while the first pipeline stage searches
the tag for a write operation, the second stage executes the previous write. This ensures
that if a hit is identified in the first stage, the write is completed in the second stage,
giving an effective single-cycle write. In a write-through cache, a single-cycle cache write
can be achieved by skipping the tag match operation: a page is divided into a number of
sub-blocks, each equal to the size of a word and each with a discrete valid bit, and the
write is performed along with the setting of the valid bit in the cache in one clock
cycle.

11.4.5 Reducing miss penalties with two level caches

While designing a memory subsystem there must be an efficient trade-off between two
conflicting goals: higher speed and larger capacity. A compromise between them can be
achieved by providing two levels of cache memory, as shown in Figure 11.8. Here the
first-level cache (C1) has low capacity but high speed, while the second-level cache (C2) has
high capacity and lower speed. The speed of C1 is close to that of the CPU, and its capacity
should be enough to achieve the desired hit ratio.

Figure 11.8 Two-levels of cache memory

Average memory access time with C1 and C2 = hit time in C1 + miss rate in C1 X miss
penalty of C1

Miss penalty of C1 = hit time in C2 + miss rate in C2 X miss penalty of C2
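Plugging assumed numbers into these two formulas (chosen only to reflect the small/fast C1 versus large/slow C2 trade-off, not taken from Table 11.1) gives a feel for the benefit of the second level.

#include <stdio.h>

int main(void)
{
    /* Assumed level-1 cache: small and fast.                              */
    double c1_hit_time  = 1.0;      /* cycles */
    double c1_miss_rate = 0.05;

    /* Assumed level-2 cache: larger and slower.                           */
    double c2_hit_time     = 10.0;  /* cycles */
    double c2_miss_rate    = 0.02;
    double c2_miss_penalty = 100.0; /* cycles to main memory */

    /* Miss penalty of C1 = hit time in C2 + miss rate in C2 x miss penalty of C2 */
    double c1_miss_penalty = c2_hit_time + c2_miss_rate * c2_miss_penalty;

    /* Average access time = hit time in C1 + miss rate in C1 x miss penalty of C1 */
    double amat = c1_hit_time + c1_miss_rate * c1_miss_penalty;

    printf("miss penalty of C1 = %.1f cycles\n", c1_miss_penalty);  /* 12.0 */
    printf("average access time = %.1f cycles\n", amat);            /* 1.6  */
    return 0;
}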

The speed of C1 affects the CPU clock rate, while the speed of C2 affects the miss penalty of
C1. From a cost point of view, the size of the high-speed C1 must be limited, so the design
effort is aimed at a cost-effective second-level cache in terms of speed and size; increasing
its size beyond a certain limit brings no benefit in execution speed or miss rate. In
practice, C1 is usually synchronized with the C2 cache and the CPU. Table 11.1 gives typical
parameters of C1 and C2 caches for a cost-conscious design.

Table 11.1 Typical Parameters of C1 and C2 cache

11.5 Non Blocking cache memory

Whenever a miss occurs, the gap between processor access time and memory latency shows up.
This worsens the problem of cache miss penalties to a large extent and degrades the
processor, i.e. processor utilization decreases. Many techniques have therefore evolved to
reduce cache miss penalties, such as increasing the hit ratio by adding small buffers,
adopting two-level cache designs that reduce access time and are more cost-effective, and
improving the speed of write operations. Another direction is to extend processor utilization
by using write buffers, non-blocking caches or prefetching techniques to access data within
the same process.

Typically, a cache can handle only one request at a time. If there is a miss, i.e. the data
is not found in the cache, it has to be fetched from main memory; during this retrieval the
cache remains idle, or 'blocked', and does not handle any further request until the fetch is
complete. A 'non-blocking' cache addresses this problem: rather than sitting idle waiting for
the operation to complete, the cache accepts another request from the processor, provided
this request is independent of the previous one. A non-blocking cache is also known as a
lockup-free cache, since the failure or suspension of one request cannot cause the failure or
suspension of another. The non-blocking cache is a popular latency-hiding technique.

Non-blocking cache is used along with other latency-reducing techniques like prefetching,
consistency models, multithreading etc.

Non-blocking caches were first introduced by Kroft. The design of these caches was based on
three main features:
 Load operations are lock-up free
 Write operations are lock-up free.
 Cache can handle a number of cache miss requests simultaneously.

To handle multiple misses in a non-blocking manner, special registers known as Miss Status
Holding Registers (MSHRs) were introduced. These registers store information regarding
pending requests; an MSHR contains the following attributes (a structure sketch follows the
list).
1. Address of data block
2. Cache frame required for the block
3. Word within the block that caused the miss
4. Destination register
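A sketch of one such register entry as a plain C structure, using only the four attributes listed above; the field types, widths and the number of MSHRs are invented for illustration.

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define NUM_MSHRS 8   /* assumed number of outstanding misses supported */

/* One Miss Status Holding Register: bookkeeping for a pending cache miss. */
typedef struct {
    bool     valid;         /* entry in use?                               */
    uint32_t block_addr;    /* 1. address of the data block being fetched  */
    int      cache_frame;   /* 2. cache frame reserved for the block       */
    int      word_offset;   /* 3. word within the block that missed        */
    int      dest_reg;      /* 4. destination register awaiting the data   */
} MSHR;

static MSHR mshr[NUM_MSHRS];

/* Record a new outstanding miss; returns the entry index, or -1 if all    */
/* MSHRs are busy (in which case the cache would have to block after all). */
static int allocate_mshr(uint32_t block_addr, int frame, int word, int reg)
{
    for (int i = 0; i < NUM_MSHRS; i++) {
        if (!mshr[i].valid) {
            mshr[i] = (MSHR){true, block_addr, frame, word, reg};
            return i;
        }
    }
    return -1;
}

int main(void)
{
    int idx = allocate_mshr(0x1000, 3, 2, 7);
    printf("miss recorded in MSHR %d for block 0x%x\n",
           idx, (unsigned)mshr[idx].block_addr);
    return 0;
}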

Of the features listed above, non-blocking loads are initiated by the processor, and buffered
writes handle the non-blocking writes; to service several cache misses simultaneously, MSHRs
alone are not enough, and the available cache bandwidth must also be taken into account.
Non-blocking loads require additional support in the execution unit of the processor in
addition to the MSHRs. If the processor uses static instruction scheduling in its pipelines,
some form of register interlock is required to maintain data dependencies; for dynamic
instruction scheduling with out-of-order execution, a scoreboarding technique is preferred.
Non-blocking operations can generate interrupts, and interrupt-handling routines are required
for both scheduling approaches.

Write buffers play a significant role in removing stalls on the write operations. They allow
the processor to execute even when there are pending writes available. Write penalty can be
reduced by using write buffers along with write through caches. Write buffers are also used to
store the written value temporarily for write back caches until the data line is back. Multiple
writes on one line can also be combined to reduce the total number of writes to subsequent
level. This technique may also pose the problem of consistency as the last read may be
required before the previous buffered write is performed. In this scenario, an associative
check is executed to enable the correct value to the last read.

11.5.1 Performance evaluation

Non-blocking functions are concerned with utilizing post-miss overlap of executions and
memory accesses. Processor halt can be delayed by using non blocking loads until and unless
any data dependency occurs. These loads are especially beneficial for superscalar processors.
For static scheduling, the non blocking distance i.e. the combination of number of
instructions and memory accesses tend to be small. This distance can be enhanced by
appending code for this combination produced by the compilers. For dynamic instruction
scheduling, additional hardware is used to increase non blocking distance but its efficiency is
dependent on many factors such as branch prediction, lookahead window, and data
dependency. In contrast, non-blocking writes are more beneficial in minimizing the write miss
penalty, since the memory access time and the non-blocking distance are almost the same, and
no extra hardware is required for pending writes. This alone will not raise the data-access
throughput dramatically, because the write miss penalty is not the only source of
degradation. Other factors, such as the lookahead distance (the number of cycles by which a
prefetch request precedes the execution of the referencing instruction), can be managed with
the help of prefetching caches, but prefetching caches require higher implementation cost,
extra on-chip support units and more complex hardware than non-blocking caches.

11.6 Distributed Coherent Cache

The basic building block of the MIMD class resembles a uniprocessor: a single processor
attached to a single memory module. If this is to be extended to multiple processors attached
to multiple memory modules, two different mechanisms can be used. In the first, the
processor/memory pair is replicated and the pairs are connected by an interconnection
network. Each processor/memory pair works as one element, independent of the other pairs, and
communication among the pairs uses message passing; one element cannot access the memory of
another element directly. This class of extended architecture is known as the distributed
memory MIMD architecture. Distributed memory architectures do not suffer from the cache
coherency issue, since the message-passing approach handles multiple copies of the same data
in the form of messages. This architecture is depicted in Figure 11.9.

Figure 11.9 Structure of Distributed Memory MIMD Architecture.

Second mechanism suggests creating a set of processors and memory modules. Here, any
processor can access any memory module directly. Interconnection network is present in this
scheme too to make an interface between processor and memory. The set of memory modules
when combined makes a global address space that can be shared among the participating
processors. Thus, the architecture got its name, shared memory MIMD architectures. This
scheme is shown in figure 11.10.

Figure 11.10 Structure of Shared Memory MIMD Architectures

In both the architectures, the major design concern is to construct the interconnection network
with minimum message traffic (for distributed memory MIMD) and memory latency (for
shared memory MIMD). Distributed memory MIMD architectures use static interconnection
networks where connection of switching units is fixed and generally treated as point-to-point
connections. Shared memory MIMD architectures use dynamic interconnection networks
where links can be reconfigured every time according to active switching units participating.
The different characteristics of the interconnection networks in the two architectures create
a difference in their working too. In the distributed case, the network is mainly concerned
with transferring a complete message in one shot, no matter how long it is, so the focus is
on message-passing protocols. In the shared case, memory is accessed very frequently by short
accesses, so the major concern is to avoid contention in the network.

To reduce memory contention problems, shared memory systems are augmented with small memories
known as cache memories. Whenever a memory reference request is issued by a processor, the
cache is checked first for the required data. If the data is found, the memory reference can
be completed without using the interconnection network, so memory contention is reduced, at
least as long as hits occur; as the number of cache misses increases, the contention problem
grows proportionally. The logical shared memory architecture explained above can also be
implemented physically as a collection of local cache memories. This new architecture is
termed the distributed shared memory architecture. From a construction point of view it is
similar to the distributed memory architecture; the main distinction lies in the organization
of the memory address space. In a distributed shared memory system, the local memories are
part of a global address space and any local memory can be accessed by any processor, whereas
in a distributed memory architecture one processor cannot directly access the local memory of
another processor (as discussed above). Distributed shared memory architectures can be
further divided into three categories on the basis of how local memories are accessed:
 Non-uniform memory access (NUMA) machines
 Cache-only memory access (COMA) machines.
 Cache coherent Non-uniform memory access (CC-NUMA) machines

NUMA

The general architecture of NUMA machines is displayed in Figure 11.11. In NUMA machines the
shared memory is segmented into blocks, and each memory block is attached to a processor as
its local memory, so the number of blocks equals the number of processors participating in
the architecture. Whenever a processor addresses its local memory, access is much faster than
access to memory located at a remote site; careful programming and data distribution are
therefore required to achieve high performance. NUMA machines suffer from much the same
disadvantages as distributed memory systems; the difference lies in the programming
technique, which in distributed memory systems is based on message passing while in NUMA
machines it is based on the shared memory approach. Hardware solutions to the cache
consistency problem are not available in NUMA machines: these machines can cache read-only
data and local data, but not shared modifiable data. In this respect they are closer to
distributed memory architectures than to shared memory ones.

Figure 11.11 General Architecture of NUMA machines

COMA

Both the categories of distributed shared memory architectures, COMA and CC-NUMA use
coherent caches to remove the drawbacks of NUMA machines. COMA use single address
space and coherent caches to perform data partitioning and dynamic load balancing. Thus,
this architecture is better suited for multiprogramming and parallel compilers. In these
machines every memory block behaves as a cache memory. Because of applied cache
coherence technique, data easily migrate at run time to local caches of the particular
processors where it is actually needed. General architecture of COMA machines is shown in
Figure 11.12.

Figure 11.12 General Architecture of COMA machines

CC-NUMA

CC-NUMA machines represent an intermediate between NUMA and COMA architectures.


As in NUMA, the shared memory is constructed as a set of local memory blocks. To reduce the
traffic on the interconnection network, each processor is connected to a large cache memory.
Initially data is distributed statically, as in NUMA machines, but for dynamic load balancing
cache coherence protocols are used, as in COMA machines. The general architecture of CC-NUMA
is summarized in Figure 11.13. These machines behave very much like real shared memory
systems. The Stanford machine is a well-known example of a large-scale multiprocessor that
provides full hardware support for coherent caches.

Figure 11.13 General Architecture of CC-NUMA machines

Summary

A variety of memory types are available with different access mechanisms; depending on the
mechanism, memory is used to store or retrieve particular types of information. The first is
RAM, defined as memory in which each location, usually known as a word, has a unique
addressing mechanism. The second is CAM, also known as associative memory, in which a
location is accessed by content rather than by address.

When the processor needs to read from a location in main memory, it first checks whether a
copy of that data is available in the cache. If it is, this is termed a 'hit' and the processor reads
immediately from the cache. If it is not, this is termed a 'miss': the data is fetched from main
memory and sent to the cache, from where it is used by the processor. Two basic approaches
are applied for effective utilization of the cache. Write-through is a simple technique in which
all write operations are made to main memory as well as to the cache. Write-back is a
technique that reduces memory writes to a minimum; with this approach, updates are made
only in the cache. There are four basic access modes: Read, Write, Execute and Modify. Read
and Write are used for reading data from memory and for writing/storing data to memory,
while Execute is used when instructions are to be fetched from memory.

For a single-module memory, latency can be defined as the interval between the instant a
read instruction is sent by the CPU to the memory and the moment the data is available to the
CPU. Latency depends on a number of factors: cache size, hit ratio, miss ratio, miss penalty
and page size. Cache size significantly affects memory access time; the cache should be
small enough that the average cost per bit remains close to that of main memory alone.
Average memory access time can be reduced by raising the hit ratio of the faster level of the
hierarchy. The miss ratio M is defined by M = 1 - H, where H is the hit ratio.

Cache miss penalty is defined as the combined time to bring a block from main memory into
the cache and the time to deliver it from the cache to the CPU. Cache misses fall into three
categories: compulsory, capacity and conflict. To evaluate the contribution of the cache to
overall program execution time, we first evaluate the CPU execution time for a program.
CPU time is the product of the instruction count (IC), i.e. the number of instructions in the
program, the clocks per instruction (CPI), i.e. the number of clock cycles taken by an
individual instruction, and the clock cycle time (CCT): CPU time = IC × CPI × CCT. The
miss penalty for write operations can be reduced by providing a write buffer of adequate
capacity, and miss penalty in general can be reduced by using a two-level cache.

Typically, a cache can handle only one request at a time. On a miss, i.e. when the data is not
found in the cache, it has to be fetched from main memory. During this retrieval a blocking
cache remains idle and does not handle any further request until the fetch completes. A
non-blocking cache addresses this problem: rather than sitting idle waiting for the operation
to complete, it accepts further requests from the processor.

Distributed memory architectures do not suffer from the cache coherency issue, since the
message-passing approach handles multiple copies of the same data in the form of messages.
In shared memory systems, memory is accessed very frequently, so the major concern is to
avoid contention in the network. To reduce memory contention, shared memory systems are
augmented with small memories known as caches. Whenever a memory reference is issued
by a processor, the cache is checked first for the required data. If the data is found, the
reference can be served without using the interconnection network, so contention is reduced,
at least as long as hits dominate. As the number of cache misses increases, however, the
contention problem grows proportionally. The logical shared memory architecture described
above can also be implemented
implemented physically as a collection of local cache memories. This new architecture is
termed as distributed shared memory architecture. Distributed shared memory architectures
can be further categorized into three categories on the basis of accessing of local memories:
Non-uniform memory access (NUMA) machines, Cache-only memory access (COMA)
machines, and Cache coherent Non-uniform memory access (CC-NUMA) machines.

Exercise

Problem 11.1 – Specify the impact of a cache-main storage memory hierarchy on CPU
execution time, where the miss rate is 12%, memory is referenced three times per instruction,
and the miss penalty is 5 clock cycles. Assume the average CPI is 5 when memory stalls due
to misses are not taken into account.

Problem 11.2 – Based on the following data, determine the degree of associativity of the
level-2 cache that would lead to the best performance:
Direct-mapped (one-way set associative) hit time on the level-2 cache = 4 clock cycles
Miss penalty for the level-2 cache = 40 clock cycles
Local miss rate for the level-2 cache with one-way associativity = 30%
Local miss rate for the level-2 cache with two-way associativity = 25%
Local miss rate for the level-2 cache with four-way associativity = 20%

Problem 11.3 - The time taken by read/write operations in a cache-main storage hierarchical
memory is given by the following table, where
TCA = cycle time of the cache
TMS = cycle time of main storage
PD = probability that the page is dirty

Operation   Hit     Miss
Read        TCA     TCA + (1 + PD) × TMS
Write       TCA     2 × TCA + (1 + PD) × TMS

Problem 11.4 – A hierarchical cache-main storage memory subsystem has the following
specifications: (1) cache access time of 50 ns, (2) main storage access time of 500 ns, (3) 80%
of memory requests are reads, (4) a hit ratio of 0.9 for read accesses, (5) a write-through
scheme is applied. Estimate:
a) the average access time of the system considering only memory read cycles;
b) the average access time for both reads and writes under the write-through and write-back
schemes;
c) the hit ratio of the system under the write-back scheme.

Problem 11.5 – Which classes of distributed shared memory machines rely on coherent
caches, and how?

Problem 11.6 – The CC-NUMA architecture is derived from general NUMA machines.
Through which technique is traffic made more manageable in CC-NUMA than in NUMA
machines?

Problem 11.7 – How are multiple 'miss' situations handled by a non-blocking cache memory?

Chapter 12

Memory Management and Protection

Structure

12.1 Objective
12.2 Memory management
12.3 Memory Translation
12.4 Translation Look-aside Buffer
12.5 Paging
12.6 Segmentation
12.7 Memory Virtualization
12.8 Memory Synchronization
12.9 Memory Consistency
12.10 Memory Coherence Problem

12.1 Objective

The objective of this chapter is to discuss the various memory management techniques
available for computer systems, where memory is divided into two parts: one for the
operating system and one for the program currently in execution. Section 12.2 defines
memory management and the techniques related to it, such as swapping, memory allocation
and memory fragmentation. Section 12.3 discusses memory translation, in which memory is
distributed by assigning pages of virtual memory to the page frames of physical memory.
Section 12.4 discusses the translation look-aside buffer (TLB), which supports a fast search
mechanism while the number of entries remains small. Section 12.5 covers paging, one of the
most significant techniques for managing memory; paging permits the physical address space
of a process to be non-contiguous. Section 12.6 discusses segmentation, a memory
management technique that supports the user's perception of memory. Section 12.7 covers
memory virtualization; demand paging is the most popular technique for memory
virtualization, in which pages are loaded from the backing store into main memory only
when they are required. Memory synchronization is explained in Section 12.8:
synchronization problems occur because data objects are shared between processes, and
various protocols and policies are available to solve them. Section 12.9 covers memory
consistency; memory inconsistency arises from a mismatch between the order of memory
accesses and the order of process execution, and several consistency models exist, such as
the sequential and weak consistency models. Finally, Section 12.10 covers the memory
coherence problem and protocols such as snoopy bus and directory-based protocols that
solve it.

12.2 Memory Management

Memory is a very important part of every computer system. It holds a large collection of
words, and each word or byte has its own memory address. The CPU uses the program
counter value to fetch instructions from memory. In single-programming systems memory is
divided into two parts: one reserved for the operating system and the other used for the
program currently in execution. In multiprogramming systems, the user part of memory is
further subdivided to hold multiple processes. This division and subdivision of memory is
done at run time by the operating system in order to manage memory. This chapter discusses
the issues and techniques related to memory management.

12.2.1 Swapping

A process must be in memory before it can be executed. A process is swapped from memory
to the backing store, or from the backing store back into memory, whenever this is required
for execution. For example, in the round-robin CPU scheduling algorithm used in
multiprogramming systems, whenever the time quantum of a process expires the memory
manager identifies that process, swaps it out of main memory, and swaps in another process
to be executed in the memory space just vacated. Meanwhile, the CPU scheduler gives a time
slice to some other process already in memory. The time quantum should therefore be large
enough to accommodate swapping processes in and out of memory whenever required.
Figure 12.1 shows processes being swapped in and out of memory.

Figure 12.1 Swapping process in and out from the memory

Swapping is also used in priority-based scheduling algorithms. For example, if a process is
executing and a process of higher priority arrives, the memory manager swaps out the
lower-priority process and swaps the higher-priority process into memory for execution.
When the higher-priority process finishes, the memory manager can swap the lower-priority
process back into memory to continue execution.

Whenever a process is swapped out of memory and later swapped back in, it normally
reoccupies the memory location it held previously. This is a consequence of the address
binding method: if address binding is done at compile time or load time, it is not easy to
move the process to another memory location, but if binding is done at run time the process
can be swapped into a different location, because physical addresses are computed at run
time. Swapping requires a backing store, which must be large enough to hold the memory
images of all processes and must provide direct, fast access to these images. The system
maintains a ready queue of all processes whose memory images are ready for execution.
Whenever a process is to be executed, the CPU scheduler invokes the dispatcher, which
examines the ready queue. If the next process to be executed is not in memory and there is no
free memory region, the dispatcher swaps out one process from memory and swaps in the
process that needs to be executed.

There are some constraints on swapping. A process that is to be swapped must be completely
idle. For example, if a process has issued an I/O request, it must actually be idle, waiting for
that I/O to complete; if the process is not idle, it cannot be swapped out of memory.

12.2.2 Memory allocation

Memory is divided into two parts: a fixed-size part occupied by the operating system, and the
remainder, which serves the user processes. The simplest technique for allocating memory to
processes is to divide this remainder into partitions, each of fixed size and each able to hold
exactly one process. This becomes a limitation in multiprogramming systems, because a
process has to wait until a partition is free to serve it. In this scheme the operating system
maintains a table that keeps track of which memory is available and which is occupied.
Initially a single large block of memory is available for user processes and is known as a
hole. When a process arrives for execution, memory is allocated to it from a hole large
enough to hold it, and the remaining memory stays available for other processes. A more
efficient technique is the variable-sized partition approach, in which memory is allocated
according to the actual requirement of each process, so that wasted memory is reduced. As
processes come and go, however, plenty of small holes are left scattered through memory and
its utilization declines. A first-fit allocator for such variable-sized holes is sketched below.
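
The following sketch illustrates first-fit allocation over variable-sized holes, one common
policy for the variable-partition scheme described above; the hole table and the request size
are hypothetical.

/* First-fit allocation over a list of variable-sized holes: a minimal sketch.
 * Hole sizes and the request size are hypothetical. */
#include <stdio.h>

#define NHOLES 4

int main(void)
{
    int hole_base[NHOLES] = { 0, 300, 700, 1500 };   /* start addresses of free blocks */
    int hole_size[NHOLES] = { 200, 300, 600, 500 };  /* sizes of free blocks           */
    int request = 450;                               /* size requested by a process    */

    for (int i = 0; i < NHOLES; i++) {
        if (hole_size[i] >= request) {
            printf("Allocate %d words at address %d\n", request, hole_base[i]);
            /* Shrink the hole: the remainder stays available for later requests. */
            hole_base[i] += request;
            hole_size[i] -= request;
            return 0;
        }
    }
    printf("No single hole is large enough: external fragmentation\n");
    return 0;
}

If no single hole can satisfy the request even though the total free space is sufficient, the
failure path above is exactly the external fragmentation problem discussed in the next
subsection.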

12.2.3 Fragmentation

Memory becomes divided into small pieces as processes move in and out of it. For instance, a
request to execute a process may arrive when the total free memory is more than enough to
hold it, but that memory is not contiguous: the free storage is split into a large number of
small pieces, none of which alone can hold the process. This is the problem of external
fragmentation, and it can be severe; in the worst case there is a block of unusable memory
between every two processes.

One way to handle external fragmentation is compaction. In this technique the partially used
memory blocks are shuffled so that the free pieces are combined into a single larger block,
which can then be used to satisfy pending or upcoming requests. If address binding is done
statically, compaction is not possible; if binding is done dynamically, programs and data can
be moved and a new base address loaded into the relocation register. In its simplest form the
compaction algorithm moves all free space to one end of memory and all processes to the
other end, leaving a single large block available to serve pending or upcoming requests.

12.3 Memory translation

Both virtual memory and physical memory are divided into fixed-length pages (called page
frames in the case of physical memory). The idea of memory distribution is to assign pages of
virtual memory to page frames of physical memory; this mapping from virtual to physical
pages is address translation. Virtual addresses are mapped to physical addresses at run time
by a hardware device known as the Memory Management Unit (MMU). A very general
approach in this category is the base-register scheme, in which the base register is called a
relocation register. As shown in Figure 12.2, the value in the relocation register is added to
every address generated by a user process at the time the address is sent to memory. The user
program never deals with real physical addresses; it deals only with logical addresses.

Figure 12.2 Mapping of Logical address to Physical address
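
A minimal sketch of relocation-register translation is given below; the register value and the
logical address are hypothetical.

/* Logical-to-physical mapping with a relocation (base) register:
 * a minimal sketch with a hypothetical base value. */
#include <stdio.h>

int main(void)
{
    unsigned relocation_register = 14000;  /* loaded by the OS for this process */
    unsigned logical_address     = 346;    /* generated by the user program     */

    /* The MMU adds the relocation value to every address sent to memory. */
    unsigned physical_address = relocation_register + logical_address;

    printf("logical %u -> physical %u\n", logical_address, physical_address);
    return 0;
}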

12.4 Translation Look-aside Buffer

Every operating system has its own tools and techniques for storing page tables. Generally a
page table is allocated for each process, and a pointer to the page table is saved along with
the other register values for that process. When the dispatcher starts a process, it reloads the
user registers and loads the correct hardware page-table values from the stored page table.
There are several ways to implement the page table in hardware. The first is to implement it
as a set of dedicated registers built with extremely high-speed logic, so that the
paging-address translation is efficient. This technique is satisfactory provided the page table
is reasonably small.

The standard approach is to keep the page table in memory and use a small, fast-lookup
hardware cache known as the translation look-aside buffer (TLB). Each entry in the TLB
contains two fields: a key/tag and a value. When the associative memory is searched for an
item, the item is compared with all keys simultaneously; if a match is found, the
corresponding value field is returned. The TLB thus supports a fast search mechanism while
the number of entries remains small. The TLB holds only a few of the page-table entries.
When a logical address is generated by the CPU, its page number is presented to the TLB. If
the page number is found, its frame number is immediately available and is used to access
memory. If the page number is not present in the TLB, this is termed a TLB miss, and a
memory reference to the page table must be made. Once the frame number is obtained, it is
used to access memory, and the page number and frame number are added to the TLB so that
they will be found quickly on the next reference. If there is no free entry in the TLB, the
operating system replaces an existing one using one of the many available replacement
algorithms. Figure 12.3 describes this working of the TLB.

Figure 12.3 Working of TLB

Some TLBs store an Address Space Identifier (ASID) with each TLB entry. An ASID
uniquely identifies a process and is used to provide address-space protection for that process.
When the TLB attempts to resolve a virtual page number, it checks that the ASID of the
currently executing process matches the ASID associated with the entry; if they do not
match, the attempt is treated as a TLB miss.
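
The following sketch models a TLB lookup with ASID tags as described above. The entry
format, the number of entries and the example contents are assumptions made for
illustration; a real TLB compares all entries in parallel in hardware.

/* TLB lookup with an ASID tag: a minimal sketch. Entries, sizes and the
 * miss handling are illustrative assumptions. */
#include <stdio.h>

#define TLB_ENTRIES 4

struct tlb_entry {
    int valid;
    unsigned asid;   /* address-space identifier of the owning process */
    unsigned vpn;    /* virtual page number (the key/tag)              */
    unsigned frame;  /* frame number (the value)                       */
};

static struct tlb_entry tlb[TLB_ENTRIES] = {
    { 1, 7, 0x12, 0x3A },
    { 1, 7, 0x13, 0x1C },
    { 1, 9, 0x12, 0x05 },   /* same page number, different process */
    { 0, 0, 0x00, 0x00 },
};

/* Returns 1 on a TLB hit and stores the frame number; 0 means a TLB miss
 * and the page table in memory must be consulted instead. */
int tlb_lookup(unsigned asid, unsigned vpn, unsigned *frame)
{
    for (int i = 0; i < TLB_ENTRIES; i++) {
        /* In hardware all entries are compared simultaneously; this loop
         * only models that associative search. */
        if (tlb[i].valid && tlb[i].asid == asid && tlb[i].vpn == vpn) {
            *frame = tlb[i].frame;
            return 1;
        }
    }
    return 0;
}

int main(void)
{
    unsigned frame;
    if (tlb_lookup(7, 0x12, &frame))
        printf("TLB hit: frame 0x%X\n", frame);
    else
        printf("TLB miss: walk the page table\n");
    return 0;
}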

12.5 Paging

Paging is one of the most significant techniques for managing memory. Paging permits the
physical address space of a process to be non-contiguous. It also avoids the problem of fitting
memory chunks of varying sizes onto the backing store: when data in memory is swapped
out, space must be found for it on the backing store, and backing stores suffer from the same
fragmentation problem as main memory. Since they are much slower than main memory,
compaction on the backing store is practically impossible.

In paging, physical memory is subdivided into fixed-size blocks known as frames, and logical
memory is subdivided into blocks of the same size known as pages. The page size is defined
by the hardware and is generally a power of 2. When a request comes to execute a process, its
pages are loaded from the backing store into available frames. The backing store itself is also
divided into fixed-size blocks of the same size as the memory frames, as shown in
Figure 12.4.

Figure 12.4 Paging Process

A CPU-generated address contains two parts: the page number, which is used as an index into
the page table, and the page offset, which is combined with the base (frame) address found in
the page table. This combination gives the physical memory address that is forwarded to the
memory unit. Paging is a form of dynamic relocation: every logical address is bound by the
hardware to some physical address.
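
A minimal sketch of this translation is shown below, assuming a hypothetical 1 KB page size
and a four-entry page table.

/* Splitting a CPU-generated address into page number and offset and
 * translating it through a page table: a minimal sketch with a
 * hypothetical 1 KB page size and a tiny page table. */
#include <stdio.h>

#define PAGE_SIZE   1024u                 /* power of 2 */
#define OFFSET_MASK (PAGE_SIZE - 1)

int main(void)
{
    unsigned page_table[4] = { 5, 6, 1, 2 };   /* page -> frame */

    unsigned logical = 2 * PAGE_SIZE + 37;     /* page 2, offset 37      */
    unsigned page    = logical / PAGE_SIZE;    /* index into page table  */
    unsigned offset  = logical & OFFSET_MASK;  /* carried over unchanged */

    unsigned physical = page_table[page] * PAGE_SIZE + offset;
    printf("logical %u -> page %u, offset %u -> physical %u\n",
           logical, page, offset, physical);
    return 0;
}

Because the page size is a power of 2, the split into page number and offset is just a shift and
a mask, which is why hardware chooses such sizes.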

12.6 Segmentation

Another technique used to subdivide the addressable memory is segmentation. Paging
separates the user's perception of memory from the real physical memory: it provides a larger
address space to the programmer while remaining invisible to the programmer.
Segmentation, in contrast, is a memory management technique that supports the user's
perception of memory. It lets the user or programmer view memory as a collection of
segments, i.e. a collection of separate address spaces. Each segment has a segment name (or
number) and a segment length. Segments are dynamic and uneven in size, and different
segments are assigned to programs or data according to their requirements. Usage and access
rights can be assigned individually to each segment. A combination of segment number and
offset value is used to reference a location in memory.

To implement segmentation, a segment table is maintained. The segment table holds two
parameters for each segment: the segment base, which is the starting physical address of the
segment in memory, and the segment limit, which is the length of the segment. Figure 12.5
illustrates the segmentation hardware.

Figure 12.5 Hardware of Segmentation

A logical address consists of a segment number and an offset. The segment number is used as
an index into the segment table, while the offset must lie between 0 and the segment limit;
otherwise the reference is illegal.
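
The lookup and limit check can be sketched as follows; the base and limit values are
hypothetical.

/* Segment-table lookup with a limit check: a minimal sketch. The base and
 * limit values are hypothetical. */
#include <stdio.h>

struct segment { unsigned base, limit; };

int main(void)
{
    struct segment seg_table[3] = {
        { 1400, 1000 },   /* segment 0 */
        { 6300,  400 },   /* segment 1 */
        { 4300, 1100 },   /* segment 2 */
    };

    unsigned seg = 1, offset = 53;    /* logical address = (segment, offset) */

    if (offset >= seg_table[seg].limit) {
        printf("trap: offset beyond segment limit\n");   /* protection violation */
    } else {
        unsigned physical = seg_table[seg].base + offset;
        printf("(%u, %u) -> physical %u\n", seg, offset, physical);
    }
    return 0;
}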

Advantages of segmentation over paging


 Segmentation provides protection: a segment can hold a program or data, and the
programmer can assign access rights to each segment as appropriate.
 Segmentation permits programs or data to be recompiled individually and independently,
without requiring the whole set of programs or data to be reloaded into memory.
 Segmentation permits segments to be shared among processes in memory.

12.7 Memory Virtualization

In multiprogrammed (and multiprocessor) systems more than one program runs
simultaneously, so the running programs must reside in main memory for execution. Due to
the limited size of main memory, however, it is sometimes not possible to load all of the
programs into it. The concept of virtual memory was introduced to eliminate this problem.
Virtual memory separates the user's perception of logical memory from physical memory.
This gives the programmer a very large memory space even though the physical memory
actually available is much smaller. The logical view of a process stored in memory is its
virtual address space. Translation from virtual addresses to physical addresses is done at run
time with the help of mapping functions and translation tables. Virtual memory allows main
memory to be shared among the many software processes active at run time. It keeps only the
most active portions of code in main memory; the remaining code stays on disk waiting for
its turn to execute. Besides separating logical memory from physical memory, virtual memory
allows files and memory locations to be shared between two or more processes through page
sharing. Virtual memory is implemented through demand paging.

12.7.1 Demand Paging

When a program is to be executed, it must be moved from the backing store into main
memory. There are several options for loading the process into main memory for its
execution. The first is to move the whole program into main memory in one go, but often the
complete program is not needed initially. The alternative is to load from the backing store
only those pages that are required at that particular instant. This approach is termed demand
paging. A demand-paging system resembles a paging system combined with swapping, as
shown in Figure 12.6. Processes initially reside in secondary memory; with demand paging,
only the pages required during execution are loaded into main memory. Unlike plain paging,
virtual memory uses a lazy swapper (also known as a pager), which never brings a page into
memory unless it is required for execution.

Figure 12.6 Demand Paging including Swapping process

When a process is to be swapped in, the pager guesses which pages will be used before the
process is swapped out again. Rather than swapping in the complete process, the pager
brings only those pages into memory. It thus avoids reading memory pages that will never be
used, and reduces both the time spent on the swap operation and the amount of physical
memory required. This scheme needs some hardware support to distinguish the pages that
reside in memory from the pages located on disk, and the valid-invalid bit can be used for
this. When the bit is set to "valid", the corresponding page is in memory; when it is set to
"invalid", the page is on disk. The page-table entry for a page located in memory is marked
valid, while the entry for a page located on disk is either marked invalid or contains the
address of the page on disk. When a page is marked invalid, two cases arise. If the process
never tries to access that page, nothing happens; this is the common case. If the process does
try to access a page marked invalid, a page-fault trap occurs. A page fault is the situation in
which the referenced page is not present in main memory. It raises a trap that forces the
operating system to bring that page in; if the operating system cannot supply the desired page
(for example, because the reference was illegal), the trap cannot be serviced usefully and the
process is aborted.
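
A minimal sketch of the valid-invalid bit and a simplified page-fault path is given below; the
page-table contents and the free-frame handling are purely illustrative.

/* Demand paging with a valid-invalid bit: a minimal sketch. The page-fault
 * handler below simply pretends to read the page from the backing store
 * and picks a free frame; frame management is purely illustrative. */
#include <stdio.h>

#define NPAGES   8
#define NO_FRAME -1

static int valid[NPAGES]    = { 1, 0, 1, 0, 0, 0, 0, 0 };
static int frame_of[NPAGES] = { 3, NO_FRAME, 1, NO_FRAME,
                                NO_FRAME, NO_FRAME, NO_FRAME, NO_FRAME };
static int next_free_frame  = 4;

int access_page(int page)
{
    if (!valid[page]) {
        /* Page-fault trap: the OS brings the page in from the backing store. */
        printf("page fault on page %d -> loading into frame %d\n",
               page, next_free_frame);
        frame_of[page] = next_free_frame++;
        valid[page] = 1;                 /* the faulting access is then restarted */
    }
    return frame_of[page];
}

int main(void)
{
    printf("page 2 is in frame %d\n", access_page(2));  /* hit             */
    printf("page 5 is in frame %d\n", access_page(5));  /* miss, then load */
    printf("page 5 is in frame %d\n", access_page(5));  /* hit after load  */
    return 0;
}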

12.8 Memory Synchronization

The execution of a parallel program depends on efficient and effective synchronization,
which is the key parameter for its performance and correctness. Parallel operations need both
hardware and software techniques for synchronization. Synchronization problems occur
because data objects are shared between processes; if a writable data object is converted into
a read-only object, the synchronization problem vanishes. Synchronization means executing
the operations of an algorithm in order, observing all dependences on writable data. MIMD
architectures require dynamic, run-time synchronization management for shared writable
data objects, whereas in SIMD architectures synchronization is done at compile time, which
is much easier. Originally synchronization was done in hardware itself; memory units, CPUs,
buses and other components can all take part in providing effective synchronization.

12.8.1 Atomic Operations

Shared memory operations fall into two classes:
 independent reads and writes
 indivisible read-modify-write sequences
In the context of synchronization, a shared data object is called an atom, and an operation on
it is performed as a read-modify-write sequence. An operation performed on an atom is
known as an atomic operation; a minimal sketch of one, built on a compare-and-swap loop,
appears at the end of this subsection.

Atoms are further classified into two categories: hard atoms, whose access conflicts are
resolved by hardware, and soft atoms, whose access conflicts are resolved by software.
Atomicity can also be implemented explicitly in software. A program may be executed out of
order as long as the meaning of the code is preserved, which requires respecting program
dependences. Program dependences are of three types:
 data dependences, such as Write After Read (WAR), Write After Write (WAW) and Read
After Write (RAW)
 control dependences, arising from goto and if-then-else statements
 side-effect dependences, arising from traps, time-outs, input/output accesses, etc.
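
The sketch below shows an atomic read-modify-write on a shared counter implemented with
a C11 compare-and-swap loop; the counter and the increment are illustrative, and a real
program would run this from several threads at once.

/* Atomic read-modify-write on a shared counter (the "atom") using a
 * compare-and-swap loop: a minimal sketch of the idea. */
#include <stdatomic.h>
#include <stdio.h>

static atomic_int shared_counter = 0;   /* the shared data object (atom) */

void atomic_add(atomic_int *atom, int delta)
{
    int old = atomic_load(atom);
    /* Retry until no other processor has modified the atom between the
     * read and the write: the read-modify-write appears indivisible. */
    while (!atomic_compare_exchange_weak(atom, &old, old + delta))
        ;   /* 'old' is reloaded automatically on failure */
}

int main(void)
{
    atomic_add(&shared_counter, 5);
    printf("counter = %d\n", atomic_load(&shared_counter));
    return 0;
}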

230
12.8.2 Wait Protocols

Wait protocols are of two kinds. The first is busy wait: the waiting process remains loaded in
the processor's context registers and is allowed to retry continuously. As long as the process
occupies the processor it consumes processor cycles, but its response is fast as soon as the
shared object becomes available. The second is sleep wait: if the shared object is not
available, the process is removed from the processor and placed in a wait queue, and when
the shared object becomes available the waiting process is notified. Sleep wait is
considerably more complex to implement than busy wait in multiprocessor systems, so when
processes synchronize using locks, busy waiting is usually preferred. A busy-wait protocol
can be implemented in a self-service style, by polling across the network, or in a full-service
style, by sending the waiting process a notification across the network when the shared data
object becomes available. A spin lock built on busy waiting is sketched below.
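
A minimal busy-wait spin lock built on a C11 atomic flag is sketched below; a sleep-wait
implementation would instead enqueue and block the caller in the operating system.

/* Busy-wait (spin) lock built on an atomic flag: a minimal sketch of the
 * busy-wait protocol. */
#include <stdatomic.h>
#include <stdio.h>

static atomic_flag lock = ATOMIC_FLAG_INIT;

void spin_lock(void)
{
    /* Keep retrying (consuming processor cycles) until the lock is free. */
    while (atomic_flag_test_and_set(&lock))
        ;   /* busy wait */
}

void spin_unlock(void)
{
    atomic_flag_clear(&lock);
}

int main(void)
{
    spin_lock();
    printf("in critical section\n");
    spin_unlock();
    return 0;
}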

12.8.3 Fairness policies

Busy-wait protocols reduce synchronization delay, since the atom is grabbed as soon as it
becomes available, but their main drawback is that they continuously check the object's state
and so waste many processor cycles; this can create hot spots in memory access. Sleep-wait
protocols make better use of hardware resources, but at the cost of longer synchronization
delays. For the processes waiting for a shared data object in a wait queue, there must be a
fairness policy that decides which waiting process to wake next. Three kinds of fairness
policies are used:
 FIFO
 bounded
 livelock-free

12.8.4 Sole access protocols

Sole access protocols are used to sequence conflicting shared operations. Three
synchronization methods are distinguished on the basis of who updates the atom and whether
sole access is granted before or after the atomic operation completes.
 Lock synchronization - Sole access is granted before the atomic operation, and after the
operation the shared data object is updated by the process that requested sole access. It is
also known as pre-synchronization. This method is not needed for read-only accesses to
shared data.
 Optimistic synchronization - Sole access is granted after the atomic operation completes,
and the shared data object is then updated by the requesting process. It is also known as
post-synchronization. It is called optimistic because it assumes that no other process will
access the data object while a single atomic operation is in progress.
 Server synchronization - This approach uses a server process to update the atom.
Compared with lock and optimistic synchronization, server synchronization provides full
service: each atom has a unique update server, and any process that requests an atomic
operation sends its request to the atom's update server. The update server may be a
dedicated server processor (SP) associated with the atom's memory module. Common
examples of server synchronization are remote procedure calls and object-oriented
systems, where shared objects are encapsulated by server processes.

12.9 Memory consistency

Memory inconsistency occurs due to a mismatch between the order of memory accesses and
the order of process execution. When instructions run in parallel, their execution may finish
out of order even if they are dispatched in order, because shorter instructions take less time to
execute than longer ones. In a single-processor system the SISD model is followed, so
instructions execute sequentially one after another and memory accesses are consistent with
the order of instruction execution; this behaviour is characterized as sequential consistency.
In a shared memory multiprocessor, by contrast, multiple instruction streams run on multiple
processors. These MIMD instruction streams may execute differently and lead to a memory
access order different from the program order. The sequential consistency and event ordering
approaches are analysed diagrammatically in Figure 12.7.

Figure 12.7 Sequential Consistency and Event ordering approach

12.9.1 Memory consistency issues

The access behaviour that a shared memory system presents to the processors executing on it
is termed a memory model. Generally, the choice of memory model is a trade-off between a
strong model and a weak model. The major memory operations in a multiprocessor are
load/read, store/write, and synchronization operations such as swap or conditional store.

12.9.1.1 Event orderings

On a multiprocessor, multiple processes execute concurrently on different processors, each
running its own code segment. The order in which one process performs its shared memory
operations can be observed by, and can affect, the other processes. Particular memory events
are associated with each shared memory access, and consistency models specify the order in
which the events of one process should be observed by the other processes in the system.
Event ordering is used to decide whether a memory event is legal when multiple processes
use a common set of memory locations. Program order is the order in which memory
accesses occur in the execution of a single process, without any reordering.
operations are defined to state memory consistency models.
 A read by processor Pa is considered performed with respect to processor Pc at a point in
time when the issuing of a write to the same memory location by Pc can no longer affect
the value returned by the read.
 A write by processor Pa is considered performed with respect to processor Pc at a point in
time when a read issued to the same memory location by Pc returns the value written.
 A read is globally performed if it is performed with respect to all processors and if the
write that is the source of the returned value has been performed with respect to all
processors.
In a uniprocessor system, instructions may be executed out of order by the compiler or the
hardware to enhance performance, with special hardware interlock mechanisms checking the
data and control dependences between the executing instructions. In a multiprocessor
executing parallel instructions, local data and control dependence checking is still necessary,
but it is not sufficient to guarantee the expected result. Keeping execution results correct and
predictable in MIMD systems is a complex task for the following reasons:
 The instructions of a parallel program belong to different processes, and their relative
execution order is not fixed; in the absence of synchronization, a large number of different
instruction interleavings is possible.
 If, to boost performance, the execution order of instructions within a single process is
allowed to differ from the program order, an even larger number of interleavings is
possible.
 If accesses are not atomic, i.e. multiple copies of the same data exist at the same time,
different processors may observe different interleavings during the same execution, so the
number of possible execution instantiations of a program becomes larger still.

12.9.1.2 Atomicity

Behavior of multiprocessor memory accesses can be explained in three categories:


1. Program order is maintained and observation sequence remains consistent for all
processors.
2. Out-of-program order is allowed and observation sequence remains consistent for all
processors.
3. Out-of-program order is allowed and observation sequence is inconsistent for all
processors.
These three categories lead to two discrete classes of shared memory systems for
multiprocessors. First class belongs to atomic memory accesses and second belongs to non-

atomic memory accesses. A shared memory access is termed atomic if an update to memory
becomes visible to all processors at the same time. To make an atomic memory system
sequentially consistent, the necessary condition is that all memory accesses be performed so
as to preserve all the individual program orders. A shared memory access is termed
non-atomic if an invalidation signal does not reach all processors at the same time. With a
non-atomic memory system, the accesses of a multiprocessor cannot be strictly sequenced, so
weaker ordering is preferred; this leads to the division between strong and weak consistency
models.

12.9.2 Sequential consistency model

The sequential consistency memory model is a very popular model for multiprocessor
designs. In this model the load, store and swap operations of all processors execute as if in a
single global memory order that respects the program order of each individual processor. The
sequential consistency memory model is illustrated in Figure 12.8.

Figure 12.8 Sequential Consistency Memory model

As defined by Lamport, “a multiprocessor system is sequentially consistent if the result of
any execution is the same as if the operations of all the processors were executed in some
sequential order, and the operations of each individual processor appear in this sequence in
the order specified by its program”.
To achieve sequential consistency in a shared memory system, two conditions must hold:
 Before a read is allowed to perform with respect to any other processor, all previous read
accesses must be globally performed and all previous write accesses must be performed
with respect to all processors.
 Before a write is allowed to perform with respect to any other processor, all previous read
accesses must be globally performed and all previous write accesses must be performed
with respect to all processors.
Four rules characterize the sequential consistency memory model:
1. A read by a processor always returns the value written by the latest write to the same
memory location by any processor.
2. If two operations follow a particular program order, the same order holds in the memory
order.
3. A swap operation is atomic with respect to other write operations; no other write can
intervene between the read and write parts of a swap.
4. All writes and swaps must eventually terminate.
Many multiprocessors implement the sequential consistency memory model because of its
simplicity. However, the model imposes a strong ordering of memory events and may
therefore lead to poor memory performance, a drawback that becomes more significant as the
system grows large. To reduce this effect, another class of consistency models, known as
weak consistency models, was developed.
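
The store-buffering litmus test below illustrates the difference in practice. Under sequential
consistency the outcome r1 = 0 and r2 = 0 is impossible; weaker models such as TSO permit
it. The sketch uses C11 atomics and POSIX threads (compile with -pthread): with
memory_order_seq_cst the forbidden outcome never appears, while replacing it with
memory_order_relaxed allows it, although many repeated runs may be needed to observe
that on real hardware.

/* Store-buffering litmus test: a minimal sketch. Under sequential
 * consistency the outcome r1 == 0 && r2 == 0 is impossible; weaker
 * models such as TSO can produce it. */
#include <stdatomic.h>
#include <pthread.h>
#include <stdio.h>

static atomic_int X, Y;
static int r1, r2;

static void *p1(void *arg)
{
    (void)arg;
    atomic_store_explicit(&X, 1, memory_order_seq_cst);     /* write X */
    r1 = atomic_load_explicit(&Y, memory_order_seq_cst);    /* read Y  */
    return NULL;
}

static void *p2(void *arg)
{
    (void)arg;
    atomic_store_explicit(&Y, 1, memory_order_seq_cst);     /* write Y */
    r2 = atomic_load_explicit(&X, memory_order_seq_cst);    /* read X  */
    return NULL;
}

int main(void)
{
    pthread_t t1, t2;
    atomic_store(&X, 0);
    atomic_store(&Y, 0);
    pthread_create(&t1, NULL, p1, NULL);
    pthread_create(&t2, NULL, p2, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    /* With seq_cst ordering this never prints "r1=0 r2=0"; with
     * memory_order_relaxed that outcome becomes possible. */
    printf("r1=%d r2=%d\n", r1, r2);
    return 0;
}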

12.9.3 Weak Consistency model

The weak consistency model was introduced by Dubois, Scheurich and Briggs, and is
therefore also known as the DSB model. The DSB model is specified by three conditions:
1. All earlier synchronization accesses must be performed before a read or write access is
allowed to perform with respect to any other processor.
2. All earlier read and write accesses must be performed before a synchronization access is
allowed to perform with respect to any other processor.
3. Synchronization accesses are sequentially consistent with respect to one another.
These conditions allow a weak ordering of memory access events in a multiprocessor,
because ordering is enforced only at hardware-recognized synchronizing variables.

Another weak consistency model, TSO (Total Store Order), is shown in Figure 12.9. This
model is specified by the following rules:

Figure 12.9 Weak Consistency Model TSO

1. A read access always returns the value of the latest write to the same memory location by
any processor.
2. If two writes occur in a specific program order, their memory order is the same.
3. If a memory operation immediately follows a read in program order, it follows that read
in memory order as well.
4. All writes and swaps must eventually terminate.
5. A swap operation is atomic with respect to other write operations; no write can intervene
between the read and write parts of a swap.

12.10 Memory coherence problem

In a multiprocessor system, data inconsistency is a major issue both within the same level of
the memory hierarchy and between neighbouring levels. Within the same level, multiple
cache modules may hold different copies of the same memory block, because they are
accessed by multiple processors asynchronously and independently. Between neighbouring
levels, a cache and main storage may hold non-uniform copies of the same data object. Cache
coherence schemes overcome this problem by preserving a consistent state for each cached
block of data. Inconsistencies in the caches can arise in several ways:

 Inconsistency in data sharing

The cache inconsistency problem arises from sharing writable data among processors. Figure
12.10 illustrates the process. Suppose a multiprocessor has two processors, each with a
private cache, sharing main memory, and D is a shared data element used by both processors.
The figure shows three situations: before update, write-through, and write-back. Before the
update, the three copies of D are consistent. If processor PA writes new data D' into its
cache, the same copy is immediately written to shared memory under write-through; the
copies in the two caches are now inconsistent, since PA's cache holds D' while PB's cache
still holds D. Under write-back the result is again inconsistent, because main memory is
updated only later, when the modified data in the cache is replaced or invalidated.

Figure 12.10 Cache coherence problems by data sharing

 Process migration

Cache inconsistency can also occur during process migration from one processor to another.
This is illustrated in Figure 12.11 for three situations: before migration, write-through, and
write-back. The data object is consistent before migration. Inconsistency arises after the
process containing the shared variable D migrates from processor PA to processor PB under
write-back, and the same inconsistency arises when the process migrates from processor PB
to PA under write-through.

Figure 12.11 Cache coherence problems by process migration

 I/O operations

Another source of inconsistency is I/O operations that bypass the caches. This is illustrated in
Figure 12.12.

Figure 12.12 Cache inconsistencies after I/O operation

When the I/O processor loads new data D' into main memory, bypassing the caches under
write-through, the data in the first processor's cache and in shared memory become
inconsistent. When data is output directly from shared memory (again bypassing the caches)
under write-back, the result is likewise inconsistent data among the caches and memory.

12.10.1 Snoopy bus protocols

Early multiprocessors used bus-based memory systems to maintain cache coherence, because
every processor in the system could observe the memory transactions in progress on the bus.
If an inconsistency is detected, the cache controller can take the necessary action to invalidate
its copy. Each cache snoops on the transactions of the other caches, so protocols using this
technique are known as snoopy bus protocols. Snoopy protocols follow two basic
approaches:
 write-invalidate
 write-update (write broadcast)
In the write-invalidate approach there can be multiple readers but only one writer at a time. A
data object may be shared among several caches for reading; when one cache wants to write
to that data object, it first broadcasts a signal that invalidates the object in all the other caches,
after which the data object is exclusively available to the writing cache.

In the write-update protocol, several writers as well as several readers can exist. When a
processor wants to update a shared data object, the data word to be updated is broadcast to all
caches, and every cache containing that data object updates its copy.

Figure 12.13 shows the operation of a snoopy bus protocol in three stages. Part (a) shows the
original consistent copies of a data object in shared memory and in three processor caches;
part (b) shows the status of the data object after a write-invalidate operation by processor P1;
and part (c) shows its status after a write-update operation by processor P1.

238
Figure 12.13 (a) Consistent copies of block X in shared memory and processor caches, (b)
After a write-invalidate operation by P1, (c) After a write-update operation by P1.
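
The write-invalidate case can be simulated in a few lines, as sketched below; the number of
caches, the block values and the write-through update of memory are illustrative
assumptions.

/* Write-invalidate snooping: a minimal simulation. Three caches share one
 * block X; a write by one cache invalidates the copies held by the others.
 * The cache states and values are illustrative only. */
#include <stdio.h>

#define NCACHES 3

enum state { INVALID, VALID };

static enum state st[NCACHES] = { VALID, VALID, VALID };  /* all hold X */
static int cache_val[NCACHES] = { 10, 10, 10 };
static int memory_val = 10;

void write_invalidate(int writer, int new_val)
{
    /* The write is broadcast on the bus; every other cache snoops it and
     * invalidates its own copy of the block. */
    for (int c = 0; c < NCACHES; c++)
        if (c != writer)
            st[c] = INVALID;

    cache_val[writer] = new_val;
    st[writer] = VALID;
    memory_val = new_val;        /* write-through to shared memory */
}

int main(void)
{
    write_invalidate(0, 42);     /* processor P1 (cache 0) writes X = 42 */
    for (int c = 0; c < NCACHES; c++)
        printf("cache %d: %s, value %d\n", c,
               st[c] == VALID ? "VALID" : "INVALID", cache_val[c]);
    printf("memory: %d\n", memory_val);
    return 0;
}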

12.10.2 Directory based protocol

Scalable multiprocessor systems interconnect processors using short point-to-point links in
direct networks. The major advantage of this arrangement is that the bandwidth of the
network scales as more processors are added. In such systems the cache coherence problem
can be solved using directory-based protocols. Directory-based protocols gather and retain
information about where copies of each data object reside. A centralized controller, which is
part of the main memory controller, maintains a directory stored in main memory; this
directory holds global information about the contents of the various local cache modules.
When a local cache controller makes a request, the centralized controller checks the directory
and issues the commands needed to transfer data between main memory and that cache, or
between caches.

The centralized controller also keeps the state information up to date; in particular, it records
which processors hold a copy of which data object. Before a processor can modify its local
copy of a data object, it must request exclusive access to that object. In response, the
centralized controller sends a message to every processor holding a copy of the object, telling
it to invalidate its copy, and exclusive access is then granted to the requesting processor. If
another processor later tries to read a data object that has been exclusively granted, a miss
notification is sent to the controller. Directory-based protocols suffer from the drawbacks of a
central bottleneck and the extra communication between the centralized controller and the
many cache controllers.

Figure 12.14 Directory Based cache coherence protocol

Figure 12.14 explains the basic concept of a directory-based cache coherence protocol. Three
categories of directory-based protocols are available:
 Full-map directory: each directory entry contains one presence bit per processor and a
dirty bit. Each presence bit indicates whether the data object is present in the
corresponding processor's cache. If the dirty bit is set, exactly one processor holds the
block and only that processor may write to it (a minimal sketch of such an entry follows
this list).
 Limited directory: similar to the full-map directory, except that each directory entry holds
at most i pointers; special handling is needed when more than i caches request read copies
of a particular data object.
 Chained directory: the simplest technique for keeping track of shared copies of data, by
maintaining a chain of directory pointers implemented as a singly linked list. Assume
there are no shared copies of location M. If processor P1 reads location M, memory sends
a copy to cache C1 along with a chain-termination pointer, and memory also keeps a
pointer to cache C1. In the same way, when processor P2 reads location M, memory sends
a copy to cache C2 along with a pointer to cache C1. By repeating these steps, all caches
can hold a copy of location M.
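
A minimal sketch of a full-map directory entry and the invalidations sent before a write is
granted is given below; the number of processors and the initial presence bits are
hypothetical.

/* Full-map directory entry: one presence bit per processor plus a dirty
 * bit, as a minimal sketch. The write path below shows the invalidations
 * the centralized controller would send before granting exclusive access. */
#include <stdio.h>

#define NPROC 4

struct dir_entry {
    unsigned presence[NPROC];   /* 1 if processor i caches the block */
    unsigned dirty;             /* 1 => exactly one cache may write  */
};

static struct dir_entry dir = { { 1, 0, 1, 1 }, 0 };

void request_write(int p)
{
    /* Invalidate every other cached copy before granting exclusivity. */
    for (int i = 0; i < NPROC; i++) {
        if (i != p && dir.presence[i]) {
            printf("send invalidate to processor %d\n", i);
            dir.presence[i] = 0;
        }
    }
    dir.presence[p] = 1;
    dir.dirty = 1;
    printf("processor %d now has exclusive (dirty) access\n", p);
}

int main(void)
{
    request_write(2);
    return 0;
}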

Summary

Memory is a very important part of every computer system. It holds a large collection of
words, and each word or byte has its own memory address. The CPU uses the program
counter value to fetch instructions from memory.
A process must be in memory before it can be executed. A process is swapped from memory
to the backing store, or from the backing store back into memory, whenever this is required
for execution. Whenever the time quantum of a process expires, the memory manager
identifies that process, swaps it out of main memory, and swaps in another process to be
executed in the memory space just vacated. When a process is swapped out of memory and
later swapped back in, it normally reoccupies the memory location it held previously. This is
a consequence of the address binding method: if address binding is done at compile time or
load time, it is not easy to move the process to another memory location, but if binding is
done at run time the process can be swapped into a different location, because physical
addresses are computed at run time.
Memory is divided into two parts: a fixed-size part occupied by the operating system and
another partition that serves the user processes. The simplest technique for allocating memory
to processes is to divide this partition into fixed-size partitions, each able to hold exactly one
process. Memory becomes divided into small pieces as processes move in and out of it; when
a request to execute a process arrives and the total free memory is more than enough to hold
it, but that memory is not contiguous, the result is external fragmentation.
The standard approach to storing page tables is to use a small, fast-lookup hardware cache
known as the translation look-aside buffer (TLB). Each TLB entry contains two fields: a
key/tag and a value. When the associative memory is searched for an item, the item is
compared with all keys at the same time. The TLB supports a fast search mechanism while
the number of entries remains small.
Paging is one of the most significant techniques for managing memory. It permits the
physical address space of a process to be non-contiguous and avoids the problem of fitting
memory chunks of varying sizes onto the backing store.
Another technique used to subdivide the addressable memory is segmentation. Paging
separates the user's perception of memory from the real physical memory: it provides a larger
address space to the programmer while remaining invisible to the programmer.
Segmentation, in contrast, is a memory management technique that supports the user's
perception of memory.

In multiprogrammed (and multiprocessor) systems, more than one program runs
simultaneously, so the running programs must reside in main memory for execution. Due to
the limited size of main memory, it is sometimes not possible to load all of the programs into
it, and the concept of virtual memory was introduced to eliminate this problem. Virtual
memory separates the user's perception of logical memory from physical memory.
The execution of a parallel program depends on efficient and effective synchronization,
which is the key parameter for its performance and correctness. Parallel operations need both
hardware and software techniques for synchronization, and synchronization problems occur
because data objects are shared between processes.
Memory inconsistency occurs due to a mismatch between the order of memory accesses and
the order of process execution. When instructions run in parallel, their execution may finish
out of order even if they are dispatched in order.
In a multiprocessor system, data inconsistency is a major issue both within the same level of
the memory hierarchy and between neighbouring levels. Within the same level, multiple
cache modules may hold different copies of the same memory block, because they are
accessed by multiple processors asynchronously and independently. Between neighbouring
levels, a cache and main storage may hold non-uniform copies of the same data object.

Exercise

Problem 12.1 – Differentiate Internal fragmentation and External fragmentation.

Problem 12.2 – While a process is executing, the system may permit programs to allocate
more memory to their address space than initially required, for example data allocated on the
heap. State the requirements needed to support such dynamic memory allocation under the
following schemes:
a) Contiguous memory allocation
b) Pure segmentation
c) Pure paging

Problem 12.3 – Compare paging and segmentation with respect to the amount of memory
required by the address translation mechanism to map virtual addresses to physical addresses.

Problem 12.4 – Define the need for paging the page tables.

Problem 12.5 – What is demand paging? Discuss the hardware support required to perform
demand paging.

Problem 12.6 – A virtual memory has a page size of 2K words, 8 pages and 4 blocks. The
associative memory page table contains the following entries:

Page   Block
0      3
1      1
4      2
6      0

List all virtual addresses (pages) that will result in a page fault when referenced by the
processor.

Problem 12.7 – What are the various issues concerned with memory consistency? How can
they be resolved? Which technique is preferred over the others?

Problem 12.8 – What are the various issues concerned with memory coherence? How can
they be resolved? Which technique is preferred over the others?
