ANNA UNIVERSITY, CHENNAI-25


SYLLABUS COPY
REGULATION 2013

CS6303 COMPUTER ARCHITECTURE                                L T P C
                                                            3 0 0 3

UNIT I OVERVIEW & INSTRUCTIONS                              9
Eight ideas – Components of a computer system – Technology – Performance – Power wall – Uniprocessors to multiprocessors; Instructions – operations and operands – representing instructions – Logical operations – control operations – Addressing and addressing modes.

UNIT II ARITHMETIC OPERATIONS                               7
ALU – Addition and subtraction – Multiplication – Division – Floating point operations – Subword parallelism.

UNIT III PROCESSOR AND CONTROL UNIT                         11
Basic MIPS implementation – Building datapath – Control implementation scheme – Pipelining – Pipelined datapath and control – Handling data hazards & control hazards – Exceptions.

UNIT IV PARALLELISM                                         9
Instruction-level parallelism – Parallel processing challenges – Flynn's classification – Hardware multithreading – Multicore processors.

UNIT V MEMORY AND I/O SYSTEMS                               9
Memory hierarchy – Memory technologies – Cache basics – Measuring and improving cache performance – Virtual memory, TLBs – Input/output system, programmed I/O, DMA and interrupts, I/O processors.

TOTAL: 45 PERIODS

TEXT BOOK:
1. David A. Patterson and John L. Hennessy, "Computer Organization and Design", Fifth Edition, Morgan Kaufmann / Elsevier, 2014.

REFERENCES:
1. V. Carl Hamacher, Zvonko G. Vranesic and Safwat G. Zaky, "Computer Organisation", Sixth Edition, McGraw-Hill Inc, 2012.
2. William Stallings, "Computer Organization and Architecture", Seventh Edition, Pearson Education, 2006.
3. Vincent P. Heuring, Harry F. Jordan, "Computer System Architecture", Second Edition, Pearson Education, 2005.
4. Govindarajalu, "Computer Architecture and Organization, Design Principles and Applications", First Edition, Tata McGraw-Hill, New Delhi, 2005.
5. John P. Hayes, "Computer Architecture and Organization", Third Edition, Tata McGraw-Hill, 1998.
6. http://nptel.ac.in/


TABLE OF CONTENTS

S.No  TITLE                                                       Page No
a     Aim and Objective of the Subject                            1
b     Detailed Lesson Plan                                        2

UNIT 1 OVERVIEW & INSTRUCTIONS
a     Part A                                                      4
b     Part B                                                      8
1     Components of a computer system                             8
2     Performance of a computer system                            12
3     Eight great ideas in computer architecture                  14
4     Techniques to represent instructions in a computer system   17
5     Addressing modes                                            22
6     Power Wall                                                  26
7     Uniprocessor and Multiprocessor                             27
8     Operations and operands                                     28

UNIT 2 ARITHMETIC OPERATIONS
a     Part A                                                      29
b     Part B                                                      31
1     Addition and Subtraction                                    31
2     Carry look ahead adder                                      31
3     Multiplication                                              34
4     Fast Multiplication                                         37
5     Booth Algorithm                                             38
6     Division algorithm                                          39
7     Restoring & Non-Restoring Division algorithm                43
8     Floating point addition                                     46
9     Subword Parallelism                                         49

UNIT 3 PROCESSOR AND CONTROL UNIT
a     Part A                                                      50
b     Part B                                                      53
1     MIPS Implementation scheme                                  53
2     Data Path and Control                                       55
3     Pipelining                                                  60
4     Pipelined Data Path and Control                             62
5     Pipeline Hazards                                            68
6     Exception handling in MIPS Architecture                     74

UNIT 4 PARALLELISM
i     Part A                                                      77
j     Part B                                                      79
24    Instruction Level Parallelism                               79
25    Challenges of Parallel Processing                           81
26    Flynn's Classification                                      85
27    Hardware Multithreading                                     88
28    Multi-core Processors                                       94

UNIT 5 MEMORY AND I/O SYSTEMS
k     Part A                                                      95
l     Part B                                                      97
29    Memory Technologies                                         97
30    Memory Hierarchy                                            102
31    Cache Memory                                                103
32    Bus Arbitration                                             109
33    Improving Cache Performance                                 114
34    Virtual Memory                                              116
35    Direct Memory Access (DMA)                                  123
m     Industrial / Practical Connectivity of the Subject          125
n     University Questions                                        126

AIM AND OBJECTIVE OF THE SUBJECT

Aim:
To discuss the basic structure of a digital computer and to study in detail the organization of the control unit, the arithmetic and logic unit, the memory unit and the I/O unit.

Objectives:
 To make students understand the basic structure and operation of a digital computer.
 To understand the hardware-software interface.
 To familiarize the students with the arithmetic and logic unit and the implementation of fixed-point and floating-point arithmetic operations.
 To expose the students to the concept of pipelining.
 To familiarize the students with the hierarchical memory system, including cache memories and virtual memory.
 To expose the students to the different ways of communicating with I/O devices and standard I/O interfaces.

DETAILED LESSON PLAN

Text Books:
T1: David A. Patterson and John L. Hennessy, "Computer Organization and Design", Morgan Kaufmann / Elsevier, Fifth Edition, 2014 (Copies Available in Library: Yes)
Reference Books:
R1: V. Carl Hamacher, Zvonko G. Vranesic and Safwat G. Zaky, "Computer Organisation", Sixth Edition, McGraw-Hill Inc, 2012 (Copies Available in Library: Yes)
R2: William Stallings, "Computer Organization and Architecture", Seventh Edition, Pearson Education, 2006 (Copies Available in Library: Yes)
R3: Vincent P. Heuring, Harry F. Jordan, "Computer System Architecture", Second Edition, Pearson Education, 2005 (Copies Available in Library: Yes)
R4: John P. Hayes, "Computer Architecture and Organization", Third Edition, Tata McGraw-Hill, 1998 (Copies Available in Library: Yes)

S.No  Unit  Topic / Portions to be Covered                       Hours Req  Cumulative Hrs  Books Referred

UNIT I OVERVIEW AND INSTRUCTIONS
1     1     Eight Ideas                                          1          1               T1
2     1     Components of Computer System                        1          2               R1
3     1     Technology, Performance, Power wall                  1          3               R1
4     1     Uniprocessor to multiprocessors, Instructions        1          4               R1
5     1     Operations and operands, Representing instructions   1          5               T1
6     1     Logical operations                                   1          6               T1,R1
7     1     Control operations                                   1          7               T1,R1
8     1     Addressing and addressing modes                      1          8               T1,R1
9     1     Various addressing modes                             1          9               T1,R1

UNIT II ARITHMETIC OPERATIONS
10    2     ALU                                                  1          10              T1,R1
11    2     Addition                                             1          11              T1,R1
12    2     Subtraction                                          1          12              T1,R1
13    2     Multiplication                                       1          13              T1,R1
14    2     Division                                             1          14              T1
15    2     Floating point operations                            1          15              T1,R1
16    2     Subword parallelism                                  1          16              T1,R1

UNIT III PROCESSOR AND CONTROL UNIT
17    3     Basic MIPS implementation                            1          17              T1,R1
18    3     Building datapath                                    1          18              T1,R1
19    3     Control implementation scheme                        1          19              T1,R1
20    3     Pipelining                                           1          20              T1,R1
21    3     Pipelined datapath                                   1          21              T1,R1
22    3     Control path                                         1          22              R1
23    3     Handling data hazards                                1          23              R1
24    3     Hazard stalls                                        1          24              R1
25    3     Control hazards                                      1          25              R1
26    3     Exceptions                                           1          26              T1,R1
27    3     Exceptions in a pipelined implementation             1          27              T1,R1

UNIT IV PARALLELISM
28    4     Introduction to instruction-level parallelism        1          28              T1
29    4     Various instruction-level parallelism                1          29              T1
30    4     Parallel processing challenges                       1          30              T1
31    4     Flynn classification                                 1          31              T1
32    4     MISD & MIMD                                          1          32              T1
33    4     Hardware multithreading                              1          33              T1
34    4     Simultaneous multithreading                          1          34              T1
35    4     Multicore processors                                 1          35              T1
36    4     Shared memory multiprocessors                        1          36              T1

UNIT V MEMORY AND I/O SYSTEMS
37    5     Memory hierarchy                                     1          37              T1,R1
38    5     Memory technologies                                  1          38              T1,R1
39    5     Flash and disk memory technologies                   1          39              T1,R1
40    5     Cache basics                                         1          40              T1,R1
41    5     Measuring and improving cache performance            1          41              T1
42    5     Virtual memory, TLBs                                 1          42              T1
43    5     Input/output system                                  1          43              T1,R1
44    5     Programmed I/O                                       1          44              T1,R1
45    5     DMA and interrupts                                   1          45              T1,R1
46    5     I/O processors                                       1          46              T1,R1

UNIT I OVERVIEW & INSTRUCTIONS

PART A

1. State Amdahl's law. (Nov/Dec 2014)

Amdahl's law is used to find the execution time of a program after making an improvement. It can be represented as:
Execution time after improvement = (Execution time affected by improvement / Amount of improvement) + Execution time unaffected
Amdahl's law states that in parallelization, if P is the proportion of a system or program that can be made parallel, and 1 - P is the proportion that remains serial, then the maximum speedup that can be achieved using N processors is:
Speedup = 1 / ((1 - P) + (P / N))
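
As a quick check of the formula, here is a minimal Python sketch (the 90% parallel fraction and 8 processors are illustrative values, not from the question):

def amdahl_speedup(p, n):
    """Maximum speedup when fraction p of a program is parallelizable
    and it runs on n processors (Amdahl's law)."""
    return 1.0 / ((1.0 - p) + (p / n))

# Example: 90% parallel code on 8 processors.
print(amdahl_speedup(0.9, 8))   # ~4.71, far below the ideal 8x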

2. What is register indirect addressing mode? When is it used? (Nov/Dec 2013)

In this mode the instruction specifies a register in the CPU whose contents give the effective address of the operand in memory. In other words, the selected register contains the address of the operand rather than the operand itself. Before using a register indirect mode instruction, the programmer must ensure that the memory address of the operand is placed in the processor register by a previous instruction.
Uses:
The advantage of a register indirect mode instruction is that the address field of the instruction uses fewer bits to select a register than would have been required to specify a memory address directly.
Therefore EA = the address stored in the register R.
• The operand is in the memory cell pointed to by the contents of register R.
Example: Add (R2), R0

3. Write the CPU performance equation. (May/June 2014) (Nov/Dec 2015, 2016)

The classic CPU performance equation expresses CPU time in terms of the instruction count (the number of instructions executed by the program), the CPI, and the clock cycle time:

CPU time = Instruction count × CPI × Clock cycle time
or
CPU time = (Instruction count × CPI) / Clock rate

CPI:
The term clock cycles per instruction, which is the average number of clock cycles each instruction takes to execute, is abbreviated as CPI.
CPI = CPU clock cycles / Instruction count
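
A small Python illustration of the equation (the instruction count, CPI and clock rate used here are made-up values):

instruction_count = 2_000_000   # instructions executed (assumed)
cpi = 2.0                       # average clock cycles per instruction (assumed)
clock_rate = 1e9                # 1 GHz clock (assumed)

cpu_time = instruction_count * cpi / clock_rate
print(cpu_time)                 # 0.004 seconds = 4 ms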


4. What are the fields in a MIPS instruction? (April/May 2015)

The MIPS (R-format) fields are:

op     rs     rt     rd     shamt  funct
6 bits 5 bits 5 bits 5 bits 5 bits 6 bits

where,
op: Basic operation of the instruction, traditionally called the opcode.
rs: The first register source operand.
rt: The second register source operand.
rd: The register destination operand. It gets the result of the operation.
shamt: Shift amount.
funct: Function.

5. Define word length. (Nov/Dec 2011)
In computing, a word is the natural unit of data used by a particular processor design. A word is a fixed-sized piece of data handled as a unit by the instruction set or the hardware of the processor. The number of bits in a word (the word size, word width, or word length) is an important characteristic of any specific processor design or computer architecture.
The size of a word is reflected in many aspects of a computer's structure and operation: the majority of the registers in a processor are usually word sized, and the largest piece of data that can be transferred to and from the working memory in a single operation is a word in many (though not all) architectures.
6. What are the merits and demerits of single address instructions? (Nov/Dec 2011)
Programs are shorter, since each instruction specifies only one address. The machine will, however, usually be slower, although not in all cases. The registers may be used for temporary results that are not needed immediately, or for holding frequently used operands, e.g. the end count in a "for" loop.

7. What do you mean by relative addressing mode? (Nov/Dec 2014, May/June 2012)

In this mode the content of the program counter is added to the address part of the instruction in order to obtain the effective address. The effective address is defined as the memory address obtained from the computation dictated by the given addressing mode.
Example:
bne $s0, $s1, Exit   # go to Exit if $s0 != $s1
In this example the branch address is calculated by adding the PC value to the constant in the instruction.

8. Define clock rate and execution time.

CPU clock speed, or clock rate, is measured in hertz, generally in gigahertz (GHz). A CPU's clock rate is a measure of how many clock cycles the CPU can perform per second.
In computer science, run time, runtime or execution time is the time during which a program is running (executing), in contrast to other program lifecycle phases such as compile time, link time and load time.

9. Distinguish pipelining from parallelism. (April/May 2015)

In order to increase instruction throughput, high-performance processors make extensive use of a technique called pipelining. A pipelined processor doesn't wait until the result from a previous operation has been written back into the register file or main memory; it fetches and starts to execute the next instruction as soon as it has fetched the first one and dispatched it to the instruction register. So a pipelined processor will start fetching the next instruction from memory as soon as it has latched the current instruction in the instruction register.
Parallel computing is a type of computation in which many calculations are carried out simultaneously, operating on the principle that large problems can often be divided into smaller ones, which are then solved at the same time. There are several different forms of parallel computing: bit-level, instruction-level, data, and task parallelism.

10. Define auto increment and auto decrement addressing modes. (April/May 2016)

Auto increment mode:
The effective address of the operand is the contents of a register specified in the instruction. After accessing the operand, the contents of this register are automatically incremented to point to the next item in a list. The auto increment mode is written as (Ri)+. As a companion to the auto increment mode, another useful mode accesses the items of a list in the reverse order.
Auto decrement mode:
The contents of a register specified in the instruction are first automatically decremented and are then used as the effective address of the operand. We denote the auto decrement mode by putting the specified register in parentheses, preceded by a minus sign to indicate that the contents of the register are to be decremented before being used as the effective address. Thus, we write -(Ri).
11. List the eight great ideas invented by computer architects. (Apr/May 2015)
(i) Design for Moore's law
(ii) Use abstraction to simplify design
(iii) Make the common case fast
(iv) Performance via parallelism
(v) Performance via pipelining
(vi) Performance via prediction
(vii) Hierarchy of memories
(viii) Dependability via redundancy

12. What is instruction set architecture? (Nov/Dec 2015)

An instruction instructs the computer to perform one of five categories of operations: (1) arithmetic, (2) data transfer, (3) logical, (4) conditional branch, (5) unconditional branch/jump.
To perform each category of operation, many instructions are available, each in its own format. For example, to perform a data transfer operation, the following instructions may be used: load word, store word, load half, load half unsigned, store half, load byte. Each of these instructions has a different format. To perform the unconditional jump operation, the jump, jump register, and jump and link instructions may be used.
13. How are instructions represented in a computer system? (May/June 2016)
Instructions are stored inside the computer system as a series of high (1, one) and low (0, zero) electronic signals and may be represented as numbers. MIPS uses either registers or memory locations to store instructions. MIPS uses 32 registers for storing data or instructions temporarily and for fast access compared to storage in memory locations. The size of a register in the MIPS architecture is 32 bits.
To achieve the highest performance and conserve energy, an instruction set architecture must have a sufficient number of registers, and these registers must be used efficiently.

14. What is an instruction register? (Nov/Dec 2016)

An instruction register (IR) is the part of the CPU's control unit that holds the instruction currently being executed.

PART B

COMPONENTS OF A COMPUTER SYSTEM

1. Explain in detail the various components of a computer system with a neat diagram. (Nov/Dec 2014) (Nov/Dec 2015) (Nov/Dec 2016)

Components of a computer system

A computer consists of five functionally independent main parts:
1. Input
2. Memory
3. Arithmetic and logic
4. Output
5. Control unit

Basic functional units of a computer
The computer accepts programs and data through an input unit and stores them in the memory. The stored data are processed by the arithmetic and logic unit under program control. The processed data is delivered through the output unit. All of the above activities are directed by the control unit.

Figure 1.1 The basic components of a computer system

a. Input unit
The computer accepts coded information through the input unit. The input can come from human operators, from electromechanical devices such as keyboards, or from other computers over communication lines. Examples of input devices are keyboards, joysticks, trackballs and mice; the latter are used as graphic input devices in conjunction with a display.

Keyboard
 It is the most common input device.
 Whenever a key is pressed, the corresponding letter or digit is automatically translated into its corresponding binary code and transmitted over a cable to the memory of the computer.

b. Memory unit
The memory unit is used to store programs as well as data. Memory is classified into primary and secondary storage.

Primary storage
It is also called main memory. It operates at high speed and it is expensive. It is made up of a large number of semiconductor storage cells, each capable of storing one bit of information. These cells are grouped together in a fixed size called a word. This facilitates reading and writing the content of one word (n bits) in a single basic operation, instead of reading and writing one bit per operation.

Secondary storage
It is slower and cheaper than primary memory, and its capacity is high. It is used to store information that is not accessed frequently. Various secondary storage devices are magnetic tapes and disks, optical disks (CD-ROMs), floppy disks, etc.
disks, optical disks (CD-ROMs), floppy etc.
e
b. Arithmetic and logic unit
Arithmetic and logic unit (ALU) and control unit together form a processor.
Actual execution of most computer operations takes place in arithmetic and logic unit
of the processor. Example: Suppose two numbers located in the memory are to be
added. They are brought into the processor, and the actual addition is carried
out by the ALU.

Registers:
Registers are high speed storage elements available in the processor. Each
register can store one word of data. When operands are brought into the
processor for any operation, they are stored in the registers. Accessing data from
register is faster than that of the memory.
9

Downloaded From : www.EasyEngineering.ne


Downloaded From : www.EasyEngineering.ne

d. Output unit
The function of the output unit is to present processed results to the outside world in human-understandable form. Examples of output devices are graphical displays and printers such as inkjet, laser and dot matrix printers; among these, the laser printer is the fastest.

Figure 1.2 Computer Components
e. Control unit
The control unit coordinates the operation of the memory, arithmetic and logic, input, and output units in a proper way. The control unit issues control signals that cause the CPU (and other components of the computer) to fetch the instruction into the IR (instruction register) and then execute the actions dictated by the machine language instruction stored there. Control units are well-defined, physically separate units that interact with other parts of the machine. A set of control lines carries the signals used for timing and synchronization of events in all units. Example: Data transfers between the processor and the memory are controlled by the control unit through timing signals. Timing signals are the signals that determine when a given action is to take place.


Basic Operational Concepts

Computer components: top-level view
PC: the program counter contains the address of the assembly language instruction to be executed next.
IR: the instruction register contains the binary word corresponding to the machine language version of the instruction currently being executed.
MAR: the memory address register contains the address of the word in main memory that is being accessed. The word being addressed contains either data or a machine language instruction to be executed.
MBR: the memory buffer register (also called MDR, for memory data register) is the register used to communicate data to and from the memory.

The operation of a processor is characterized by a fetch-decode-execute cycle. In the first phase of the cycle, the processor fetches an instruction from memory. The address of the instruction to fetch is stored in an internal register named the program counter, or PC. As the processor is waiting for the memory to respond with the instruction, it increments the PC. This means the fetch phase of the next cycle will fetch the instruction in the next sequential location in memory.
In the decode phase the processor stores the information returned by the memory in another internal register, known as the instruction register, or IR. The IR now holds a single machine instruction, encoded as a binary number. The processor decodes the value in the IR in order to figure out which operations to perform in the next stage.
In the execution stage the processor actually carries out the instruction. This step often requires further memory operations; for example, the instruction may direct the processor to fetch two operands from memory, add them, and store the result in a third location (the addresses of the operands and the result are also encoded as part of the instruction). At the end of this phase the machine starts the cycle over again by entering the fetch phase for the next instruction.
The CPU exchanges data with memory. For this purpose, it typically makes use of two internal (to the CPU) registers:
 A memory address register (MAR), which specifies the address in memory for the next read or write, and
 A memory buffer register (MBR), which contains the data to be written into memory or receives the data read from memory.
An I/O address register (I/OAR) specifies a particular I/O device. An I/O buffer register (I/OBR) is used for the exchange of data between an I/O module and the CPU. A memory module consists of a set of locations, defined by sequentially numbered addresses. Each location contains a binary number that can be interpreted as either an instruction or data. An I/O module transfers data from external devices to the CPU and memory, and vice versa. It contains internal buffers for temporarily holding these data until they can be sent on.
Instructions can be classified as one of three major types: arithmetic/logic, data transfer, and control. Arithmetic and logic instructions apply primitive functions of one or two arguments, for example addition, multiplication, or logical AND.
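
The cycle described above can be sketched as a toy simulator in Python (this is a hypothetical machine invented for illustration, not the MIPS encoding; the tuple-based instruction format is ours):

# Toy fetch-decode-execute loop: each instruction is (opcode, src1, src2, dest).
memory = {0: ("add", 10, 11, 12), 1: ("halt",), 10: 7, 11: 5, 12: 0}
pc = 0                                    # program counter
while True:
    ir = memory[pc]                       # fetch into the instruction register
    pc += 1                               # increment PC while memory responds
    op = ir[0]                            # decode
    if op == "halt":
        break
    if op == "add":                       # execute: two operand fetches, one store
        memory[ir[3]] = memory[ir[1]] + memory[ir[2]]
print(memory[12])                         # 12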

PERFORMANCE OF A COMPUTER SYSTEM

(ii) State the CPU performance equation and discuss the factors that affect performance. (Nov/Dec 2014)

Response time: The time between the start and the completion of an event, also referred to as execution time.

Throughput: The total amount of work done in a given time. In comparing design alternatives, we often want to relate the performance of two different machines, say X and Y. The phrase "X is faster than Y" is used here to mean that the response time or execution time is lower on X than on Y for the given task. In particular, "X is n times faster than Y" will mean

Execution time of Y / Execution time of X = n

Since execution time is the reciprocal of performance, the following relationship holds:

n = Execution time of Y / Execution time of X = Performance of X / Performance of Y

Performance and execution time are reciprocals, so increasing performance decreases execution time. To help avoid confusion between the terms increasing and decreasing, we usually say "improve performance" or "improve execution time" when we mean increase performance and decrease execution time.
g .n
CPU performance equation:
All computers are constructed using a clock running at a constant rate. These discrete time events are called ticks, clock ticks, clock periods, clocks, cycles, or clock cycles. Computer designers refer to the time of a clock period by its duration (e.g., 1 ns) or by its rate (e.g., 1 GHz). CPU time for a program can then be expressed in two ways:

CPU time = CPU clock cycles for a program × Clock cycle time
or
CPU time = CPU clock cycles for a program / Clock rate

In addition to the number of clock cycles needed to execute a program, we can also count the number of instructions executed, the instruction path length or instruction count (IC). CPI is computed as:

CPI = CPU clock cycles for a program / Instruction count

By transposing the instruction count in the above formula, clock cycles can be defined as IC × CPI. This allows us to use CPI in the execution time formula to calculate the total CPU time as:

CPU time = IC × CPI × Clock cycle time
Basic performance equation

Let T be the time required for the processor to execute a program written in a high-level language. The compiler generates a machine language object program corresponding to the source program. Assume that complete execution of the program requires the execution of N machine language instructions, and that the average number of basic steps needed to execute one machine instruction is S, where each basic step is completed in one clock cycle. If the clock rate is R cycles per second, the program execution time is given by

T = (N × S) / R

This is often called the basic performance equation. To achieve high performance, the performance parameter T should be reduced. T can be reduced by reducing N and S, and by increasing R.
 The value of N is reduced if the source program is compiled into fewer machine instructions.
 The value of S is reduced if an instruction has a smaller number of basic steps to perform or if the execution of instructions is overlapped.
 The value of R can be increased by using a higher-frequency clock, i.e. the time required to complete a basic execution step is reduced.
 N, S and R are dependent factors; changing one may affect another.

EXAMPLE

1) Suppose we have made the following measurements:
Frequency of FP operations (other than FPSQR) = 25%
Average CPI of FP operations = 4.0
Average CPI of other instructions = 1.33
Frequency of FPSQR = 2%
CPI of FPSQR = 20
Assume that the two design alternatives are to decrease the CPI of FPSQR to 2 or to decrease the average CPI of all FP operations to 2.5. Compare these two design alternatives using the CPU performance equation.
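The worked solution that followed in the original notes can be reconstructed with the CPU performance equation; the Python sketch below assumes, as in the textbook version of this example, that the 25% FP mix includes the 2% FPSQR instructions:

# CPI contributions, assuming the 25% FP mix includes the 2% FPSQR instructions.
cpi_original = 0.25 * 4.0 + 0.75 * 1.33          # = 1.9975, ~2.0

# Alternative 1: cut the CPI of FPSQR from 20 to 2.
cpi_fpsqr = cpi_original - 0.02 * (20 - 2)       # = 1.6375, ~1.64

# Alternative 2: cut the average CPI of all FP operations to 2.5.
cpi_fp = 0.75 * 1.33 + 0.25 * 2.5                # = 1.6225, ~1.625

print(cpi_original, cpi_fpsqr, cpi_fp)
print(cpi_original / cpi_fp)                     # speedup ~1.23: alternative 2 wins
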
EIGHT GREAT IDEAS IN COMPUTER ARCHITECTURE

2. (i) Explain the basic concepts of the eight great ideas in computer architecture. (8)

a. DESIGN FOR MOORE'S LAW:

The one constant for computer designers is rapid change, which is driven largely by Moore's law. It states that integrated circuit resources double every 18-24 months. Moore's law resulted from a 1965 prediction of such growth in IC capacity made by Gordon Moore, one of the founders of Intel. As computer designs can take years, the resources available per chip can easily double or quadruple between the start and finish of the project. Computer architects must anticipate this rapid change.
Icon used: an "up and to the right" Moore's law graph represents designing for rapid change.

b. USE ABSTRACTION TO SIMPLIFY DESIGN:

Both computer architects and programmers had to invent techniques to make themselves more productive. A major productivity technique for hardware and software is to use abstractions to represent the design at different levels of representation; lower-level details are hidden to offer a simpler model at higher levels.
Icon used: an abstract painting.

c. MAKE THE COMMON CASE FAST:


Making the common case fast will tend to enhance performance better than optimizing the rare case. The common case is often simpler than the rare case, and it is often easier to enhance. Making the common case fast is only possible with careful experimentation and measurement.
Icon used: a sports car is the icon for making the common case fast (as the most common trip has one or two passengers, and it's surely easier to make a fast sports car than a fast minivan).


d. PERFORMANCE VIA PARALLELISM:


Computer architects have offered designs that get more performance by
performing operations in parallel. Icon Used: multiple jet engines of a
plane is the icon for parallel performance.

e. PERFORMANCE VIA PIPELINING:


Pipelining is an implementation technique in which multiple instructions are overlapped in execution. Pipelining improves performance by increasing instruction throughput. For example, before fire engines, a human chain could carry water to a fire much more quickly than individuals running back and forth with buckets.
Icon used: a pipeline icon, a sequence of pipes with each section representing one stage of the pipeline.
f. PERFORMANCE VIA PREDICTION:

Following the saying that it can be better to ask for forgiveness than to ask for permission, the next great idea is prediction. In some cases it can be faster on average to guess and start working rather than wait until you know for sure, provided that the mechanism to recover from a misprediction is not too expensive and the prediction is relatively accurate.
Icon used: a fortune-teller's crystal ball.

g. HIERARCHY OF MEMORIES:
Programmers want memory to be fast, large, and cheap. Memory speed often shapes performance, capacity limits the size of problems that can be solved, and the cost of memory today is often the majority of computer cost.
Architects have found that they can address these conflicting demands with a hierarchy of memories: the fastest, smallest, and most expensive memory per bit is at the top of the hierarchy, and the slowest, largest, and cheapest per bit is at the bottom. Caches give the programmer the illusion that main memory is nearly as fast as the top of the hierarchy and nearly as big and cheap as the bottom of the hierarchy.

Icon Used: a layered triangle icon represents the memory hierarchy.


The shape indicates speed, cost, and size: the closer to the top, the faster and more
expensive per bit the memory; the wider the base of the layer, the bigger the
memory.

h. DEPENDABILITY VIA REDUNDANCY:


Computers not only need to be fast; they need to be dependable. Since any physical device can fail, systems can be made dependable by including redundant components. These components can take over when a failure occurs and help detect failures.
Icon used: the tractor-trailer, since the dual tires on each side of its rear axles allow the truck to continue driving even when one tire fails.

TECHNIQUES TO REPRESENT INSTRUCTIONS IN A COMPUTER SYSTEM

(ii) Discuss the various techniques used to represent instructions in a computer system. (April/May 2015) (16)

INSTRUCTION:
Instructions are kept in the computer as a series of high and low electronic signals and may be represented as numbers. In fact, each piece of an instruction can be considered as an individual number, and placing these numbers side by side forms the instruction.
Since registers are referred to by almost all instructions, there must be a convention to map register names into numbers. In MIPS assembly language, registers $s0 to $s7 map onto registers 16 to 23, and registers $t0 to $t7 map onto registers 8 to 15. Hence, $s0 means register 16, $s1 means register 17, $s2 means register 18, . . . , $t0 means register 8, $t1 means register 9, and so on.

Translating a MIPS assembly instruction into a machine instruction
Let's do the next step in the refinement of the MIPS language as an example. The real MIPS language version of the instruction is represented symbolically, first as a combination of decimal numbers and then of binary numbers.


add $t0, $s1, $s2. The decimal representation is

0      17     18     8      0      32

Each of these segments of an instruction is called a field. The first and last fields (containing 0 and 32 in this case) in combination tell the MIPS computer that this instruction performs addition. The second field gives the number of the register that is the first source operand of the addition operation (17 = $s1), and the third field gives the other source operand for the addition (18 = $s2). The fourth field contains the number of the register that is to receive the sum (8 = $t0). The fifth field is unused in this instruction, so it is set to 0.
Thus, this instruction adds register $s1 to register $s2 and places the sum in register $t0. This instruction can also be represented as fields of binary numbers as opposed to decimal:

000000 10001  10010  01000  00000  100000
6 bits 5 bits 5 bits 5 bits 5 bits 6 bits

This layout of the instruction is called the instruction format. As you can see from counting the number of bits, this MIPS instruction takes exactly 32 bits, the same size as a data word. In keeping with our design principle that simplicity favors regularity, all MIPS instructions are 32 bits long.
To distinguish it from assembly language, we call the numeric version of instructions machine language, and a sequence of such instructions machine code. It would appear that you would now be reading and writing long, tedious strings of binary numbers. We avoid that tedium by using a higher base than binary that converts easily into binary. Since almost all computer data sizes are multiples of 4, hexadecimal (base 16) numbers are popular.

Fig. The hexadecimal-binary conversion table
Because we frequently deal with different number bases, to avoid confusion


we will subscript decimal numbers with ten, binary numbers with two, and
hexadecimal numbers with hex. (If there is no subscript, the default is base 10.)

18

Downloaded From : www.EasyEngineering.ne


Downloaded From : www.EasyEngineering.ne

BINARY TO HEXADECIMAL AND BACK:


Convert the following hexadecimal and binary numbers into the other base:
 eca8 6420hex
 0001 0011 0101 0111 1001 1011 1101 1111 two

ww
w.E
MIPS FIELDS:
MIPS fields are given names to make them easier

a
to discuss. The meaning of each name of the fields in MIPS
instructions:

sy E
op: Basic operation of the

ngi
instruction, traditionally called the opcode.
rs: The first register source operand.

nee
rt: The second register source operand.
rd: The register destination operand. It
gets the result of the operation.
shamt: Shift amount.
ri n
g .n
funct: Function. This
field selects the specific variant of the operation in the op field and is somet

e
imes called the function code.
Today’s computers are built on two key principles:
1. Instructions are represented as numbers.
2. Programs are stored in memory to be read or written, just like numbers.
These principles lead to the stored-program concept; its invention let the
computing genie out of its bottle. Specifically, memory can contain the
source code for an editor program, the corresponding compiled machine code, the
text that the compiled program is using, and even the compiler that generated the
machine code. One consequence of instructions as numbers is that programs are
often shipped as files of binary numbers. The commercial implication is that
computers can inherit ready-made software provided they are compatible with an
existing instruction set. Such “binary compatibility” often leads industry to align
around a small number of instruction set architectures.
The compromise chosen by the MIPS designers is to keep all instructions the
same length, thereby requiring different kinds of instruction formats for different kinds
of instructions. For example, the format above is called R-type (for register) or R-
format. A second type of instruction format is called I-type (for immediate) or I-format
and is used by the immediate and data transfer 19 instructions. The fields of I-format are


op rs rt Constant or address
6 bits 5 bits 5 bits 16 bits

The 16-bit address means a load word instruction can load any word within a region of ±2^15 or 32,768 bytes (±2^13 or 8192 words) of the address in the base register rs. Similarly, add immediate is limited to constants no larger than ±2^15. We see that more than 32 registers would be difficult in this format, as the rs and rt fields would each need another bit, making it harder to fit everything in one word.

Fig. MIPS instruction encoding
TRANSLATING MIPS ASSEMBLY LANGUAGE INTO MACHINE LANGUAGE:

We can now take an example all the way from what the programmer writes to what the computer executes. If $t1 has the base of the array A and $s2 corresponds to h, the assignment statement
A[300] = h + A[300];
is compiled into
lw  $t0, 1200($t1)   # Temporary reg $t0 gets A[300]
add $t0, $s2, $t0    # Temporary reg $t0 gets h + A[300]
sw  $t0, 1200($t1)   # Stores h + A[300] back into A[300]

The lw instruction is identified by 35 in the first field (op). The base register 9 ($t1) is specified in the second field (rs), and the destination register 8 ($t0) is specified in the third field (rt). The offset to select A[300] (1200 = 300 × 4) is found in the final field (address).
The add instruction that follows is specified with 0 in the first field (op) and 32 in the last field (funct). The three register operands (18, 8, and 8) are found in the second, third, and fourth fields and correspond to $s2, $t0, and $t0.
The sw instruction is identified with 43 in the first field. The rest of this final instruction is identical to the lw instruction. Since 1200 ten = 0000 0100 1011 0000 two, the binary equivalent to the decimal form is:

100011 01001 01000 0000 0100 1011 0000   (lw)
000000 10010 01000 01000 00000 100000    (add)
101011 01001 01000 0000 0100 1011 0000   (sw)
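
The field packing described above can be verified mechanically; a minimal Python sketch (the helper function name is ours, not part of MIPS):

def encode_r_type(op, rs, rt, rd, shamt, funct):
    """Pack the six R-format fields into one 32-bit MIPS word."""
    return (op << 26) | (rs << 21) | (rt << 16) | (rd << 11) | (shamt << 6) | funct

# add $t0, $s1, $s2  ->  fields 0, 17, 18, 8, 0, 32
word = encode_r_type(0, 17, 18, 8, 0, 32)
print(format(word, "032b"))   # 00000010001100100100000000100000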


1. Consider a computer with three instruction classes, whose CPI measurements are given below; the instruction counts for each instruction class for the same program from two different compilers are also given. (Nov/Dec 2014) (6)
(i) Which code sequence executes the greater number of instructions?
(ii) Which code sequence will be faster?
(iii) What is the CPI for each sequence?

CPI for each instruction class:
Instruction class   A   B   C
CPI                 1   2   3

Instruction counts for each instruction class:
Code sequence       A   B   C
1                   2   1   2
2                   4   1   1

SOLUTION:

(i) Code sequence 1 executes 2 + 1 + 2 = 5 instructions, while code sequence 2 executes 4 + 1 + 1 = 6 instructions. That is, code sequence 2 executes more instructions than code sequence 1.

(ii) The total number of clock cycles for each sequence can be found using the following equation:
CPU clock cycles = Σ (CPI_i × C_i)

CPU clock cycles 1 = (2×1) + (1×2) + (2×3) = 2 + 2 + 6 = 10 cycles.
CPU clock cycles 2 = (4×1) + (1×2) + (1×3) = 4 + 2 + 3 = 9 cycles.

It is clear that code sequence 2 is faster than code sequence 1 even though it executes one extra instruction.
(iii) The CPI values can be computed by:
CPI = CPU clock cycles / Instruction count

CPI 1 = 10/5 = 2
CPI 2 = 9/6 = 1.5
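
The same computation expressed as a small Python sketch of the Σ(CPI_i × C_i) formula:

cpi = {"A": 1, "B": 2, "C": 3}
seq1 = {"A": 2, "B": 1, "C": 2}
seq2 = {"A": 4, "B": 1, "C": 1}

for name, counts in (("sequence 1", seq1), ("sequence 2", seq2)):
    cycles = sum(cpi[c] * n for c, n in counts.items())
    instructions = sum(counts.values())
    print(name, cycles, "cycles, CPI =", cycles / instructions)
# sequence 1: 10 cycles, CPI = 2.0
# sequence 2: 9 cycles,  CPI = 1.5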

ADDRESSING AND ADDRESSING MODES

4. What is the need for addressing in a computer system? Explain the different addressing modes with suitable examples. (April/May 2015, 2016) (Nov/Dec 2015, 2016) (16) /
 Assume a two-address format specified as source and destination. Examine the following sequence of instructions and explain the addressing modes used and the operation done in every instruction. (Nov/Dec 2014) (16)

To perform any operation, the corresponding instruction has to be given to the microprocessor. In each instruction, the programmer has to specify three things:
 Operation to be performed.
 Address of the source of data.
 Address of the destination of the result.
The different ways in which the location of an operand is specified in an instruction are referred to as addressing modes. The method by which the address of the source of data or the address of the destination of the result is given in the instruction is called the addressing mode. Computers use addressing mode techniques to accommodate one or both of the following provisions:
 To give programming versatility to the user by providing such facilities as pointers to memory, counters for loop control, indexing of data, and program relocation.
 To reduce the number of bits in the addressing field of the instruction.
TYPES:
• Implied addressing mode
• Immediate addressing mode
• Direct addressing mode
• Indirect addressing mode
• Register addressing mode
• Register indirect addressing mode
• Auto increment or auto decrement addressing mode
• Relative addressing mode
• Indexed addressing mode
• Base register addressing mode
• Base register addressing mode

a. Implied addressing mode:

In this mode the operands are specified implicitly in the definition of the instruction. For example, the 'complement accumulator' instruction is an implied mode instruction because the operand in the accumulator register is implied in the definition of the instruction. Zero-address instructions in a stack-organized computer are implied mode instructions, since the operands are implied to be on the top of the stack.

b. Immediate addressing mode:

In this mode the operand is specified in the instruction itself. In other words, an immediate mode instruction has an operand field rather than an address field. The operand field contains the actual operand to be used in conjunction with the operation specified in the instruction. Immediate mode instructions are useful for initializing registers to a constant value.

Example: ADD 5
• Add 5 to the contents of the accumulator
• 5 is the operand

Advantages:
• No memory reference to fetch data
• Fast
• Limited range
a
c. Direct addressing mode:
In this mode the effective address is equal to the address part of the instruction. The operand resides in memory and its address is given directly by the address field of the instruction. In a branch-type instruction the address field specifies the actual branch address.
Example: LDA A
• Look in memory at address A for the operand; load the contents of A into the accumulator.

Advantages and disadvantages:
• Single memory reference to access data
• No additional calculations to work out the effective address
• Limited address space

d. Indirect addressing mode:

In this mode the address field of the instruction gives the address where the effective address is stored in memory or a register. Control fetches the instruction from memory and uses its address part to access memory again to read the effective address.
EA = address contained in the register/memory location
Example: ADD (M)
• Look in M, find the address contained in M, and look there for the operand
• Add the contents of the memory location pointed to by the contents of M to the accumulator

e. Register addressing mode:
In this mode the operands are in registers that reside within the CPU. The instruction selects the particular register, and the operand is taken from it directly, so EA = R.

Advantages and disadvantages:
o No memory access, so very fast execution.
o Very small address field needed.
o Shorter instructions
o Faster instruction fetch
o Limited number of registers.
o Multiple registers help performance
o Requires good assembly programming or compiler writing
f. Register indirect addressing mode:
In this mode the instruction specifies a register in the CPU whose contents give the effective address of the operand in memory. In other words, the selected register contains the address of the operand rather than the operand itself. Before using a register indirect mode instruction, the programmer must ensure that the memory address of the operand is placed in the processor register by a previous instruction. The advantage is that the address field of the instruction uses fewer bits to select a register than would have been required to specify a memory address directly.
Therefore EA = the address stored in the register R.
Example: Add (R2), R0

Advantages:
• Fewer bits are required to specify the register.
• One fewer memory access than indirect addressing.
Advantage:
• Less number of bits are required to specify the register.
• One fewer memory access than indirect addressing.

g. Auto increment or auto decrement addressing mode:

Auto increment mode:
The effective address of the operand is the contents of a register specified in the instruction. After accessing the operand, the contents of this register are automatically incremented to point to the next item in a list.
We denote the auto increment mode by putting the specified register in parentheses, to show that the contents of the register are used as the effective address, followed by a plus sign to indicate that these contents are to be incremented after the operand is accessed. Thus, the auto increment mode is written as (Ri)+.
Auto decrement mode:
The contents of a register specified in the instruction are first automatically decremented and are then used as the effective address of the operand. We denote the auto decrement mode by putting the specified register in parentheses, preceded by a minus sign to indicate that the contents of the register are to be decremented before being used as the effective address. Thus, we write -(Ri).
• These two modes are useful when we want to access a table of data.
ADD (R1)+ will increment the register R1.
LDA -(R1) will decrement the register R1.
h. Relative addressing mode:
In this mode the content of the program counter is added to the address part of the instruction in order to obtain the effective address. The effective address is defined as the memory address obtained from the computation dictated by the given addressing mode. The address part of the instruction is usually a signed number (in 2's complement representation) which can be either positive or negative. When this number is added to the content of the program counter, the result produces an effective address whose position in memory is relative to the address of the next instruction.
Relative addressing is often used with branch-type instructions when the branch address is in the area surrounding the instruction word itself. It results in a shorter address field in the instruction format, since the relative address can be specified with a smaller number of bits than would be required to designate the entire memory address.
EA = contents of PC + A
Example: The PC contains 825 and the address part of the instruction contains 24. After the instruction is read from location 825, the PC is incremented to 826, so EA = 826 + 24 = 850. The operand will be found at location 850, i.e. 24 memory locations forward from the address of the next instruction.

i. Indexed addressing mode:

In this mode the content of an index register is added to the address part of the instruction to obtain the effective address. The index register is a special CPU register that contains an index value. The address field of the instruction defines the beginning address of a data array in memory. Each operand in the array is stored in memory relative to the beginning address. The distance between the beginning address and the address of the operand is the index value stored in the index register. Any operand in the array can be accessed with the same instruction, provided that the index register contains the correct index value. The index register can be incremented to facilitate access to consecutive operands. Note that if an index-type instruction does not include an address field in its format, the instruction converts to the register indirect mode of operation.
Therefore EA = A + IR
• Example: MOV AL, DS:disp[SI]
Advantage:
• Good for accessing arrays.

j. Base register addressing mode:
In this mode the content of a base register is added to the address part of the instruction to obtain the effective address. This is similar to the indexed addressing mode, except that the register is now called a base register instead of an index register. The difference between the two modes is in the way they are used rather than in the way they are computed. An index register is assumed to hold an index number that is relative to the address part of the instruction. A base register is assumed to hold a base address, and the address field of the instruction gives a displacement relative to this base address.
The base register addressing mode is used in computers to facilitate the relocation of programs in memory. When programs and data are moved from one segment of memory to another, as required in multiprogramming systems, the address values of instructions must reflect this change of position. With a base register, the displacement values of instructions do not have to change.
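
To make the EA computations above concrete, here is a small Python sketch that evaluates a few of these modes against a toy memory and register file (all addresses and values are invented for illustration):

memory = {100: 55, 200: 100, 850: 77}
registers = {"R1": 200}
pc = 826

# Direct:            EA = A
print(memory[100])                    # 55
# Register indirect: EA = contents of R1
print(memory[registers["R1"]])        # operand at address 200 -> 100
# Indirect:          EA = value stored at address A
print(memory[memory[200]])            # memory[100] -> 55
# Relative:          EA = PC + offset (the example above: 826 + 24 = 850)
print(memory[pc + 24])                # 77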
POWER WALL

5. (i) Elaborate on the power wall with a neat sketch.

The dominant technology for integrated circuits is called CMOS (complementary metal oxide semiconductor). For CMOS, the primary source of energy consumption is so-called dynamic energy, that is, the energy consumed when transistors switch states from 0 to 1 and vice versa. The dynamic energy depends on the capacitive loading of each transistor and the voltage applied:

Energy ∝ Capacitive load × Voltage²
26

Downloaded From : www.EasyEngineering.ne


Downloaded From : www.EasyEngineering.ne

The power required per transistor is just the product of energy of a transition and the
frequency of transitions:

Frequency switched is a function of the clock rate. The capacitive load per transistor
is a function of both the number of transistors connected to an output (called the
fanout) and the technology, which determines the capacitance of both wires and
transistors.

UNIPROCESSOR:

(ii) Explain the relationship between uniprocessors and multiprocessors. (8)

Two disk drives are in front: the hard drive on the left and a DVD drive on the right. The hole in the middle is for the laptop battery. The small rectangles on the motherboard contain the devices that drive our advancing technology, called integrated circuits and nicknamed chips. The board is composed of three pieces: the piece connecting to the I/O devices mentioned earlier, the memory, and the processor. The memory is where the programs are kept when they are running; it also contains the data needed by the running programs. Figure 1.8 shows that memory is found on the two small boards, and each small memory board contains eight integrated circuits.
The processor is the active part of the board, following the instructions of a program to the letter. It adds numbers, tests numbers, signals I/O devices to activate, and so on. Occasionally, people call the processor the CPU, for the more bureaucratic-sounding central processor unit. Descending even lower into the hardware, the processor logically comprises two main components:
 Datapath and control,
the respective brawn and brain of the processor. The datapath performs the arithmetic operations, and control tells the datapath, memory, and I/O devices what to do according to the wishes of the instructions of the program. Descending into the depths of any component of the hardware reveals insights into the computer. Inside the processor is another type of memory: cache memory.

Cache Memory:
It consists of a small, fast memory that acts as a buffer for the DRAM memory. (The nontechnical definition of cache is a safe place for hiding things.) Cache is built using a different memory technology, static random access memory (SRAM). SRAM is faster but less dense, and hence more expensive, than DRAM. Notice a common theme in both the software and the hardware descriptions: the depths of hardware or software reveal more information or, conversely, lower-level details are hidden to offer a simpler model at higher levels.

Multiprocessor:
A multiprocessor is a type of architecture that is based on multiple computing units. Some of the operations are done in parallel and the results are joined afterwards. There are many classifications of multiprocessor architectures, the most commonly known being Flynn's taxonomy. MIPS (originally an acronym for Microprocessor without Interlocked Pipeline Stages) is a reduced instruction set computer (RISC) instruction set architecture (ISA) developed by MIPS Technologies.
To reduce confusion between the words processor and microprocessor, companies refer to processors as "cores," and such microprocessors are generically called multicore microprocessors.

OPERATIONS AND OPERANDS

(iii) Explain the concepts of logical operations and control operations.

Every computer must be able to perform arithmetic. The MIPS assembly language notation add a, b, c instructs a computer to add the two variables b and c and to put their sum in a. This notation is rigid in that each MIPS arithmetic instruction performs only one operation and must always have exactly three variables. Thus, it takes three instructions to sum four variables. The words to the right of the sharp symbol (#) on each line are comments for the human reader, so the computer ignores them. For example, suppose we want to place the sum of four variables b, c, d, and e into variable a. The following sequence of instructions adds the four variables:

add a, b, c   # The sum of b and c is placed in a
add a, a, d   # The sum of b, c, and d is now in a
add a, a, e   # The sum of b, c, d, and e is now in a

Logical Operations:
Although the first computers operated on full words, it soon became clear that
it was useful to operate on fields of bits within a word or even on individual bits.
Examining characters within a word, each of which is stored as 8 bits, is one
example of such an operation. It follows that operations were added to programming
languages and instruction set architectures to simplify, among other things, the
packing and unpacking of bits into words. These instructions are called logical
operations.
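
As an illustration, here is a minimal C sketch (assumed variable names, not from the
notes) of such packing and unpacking: shifts and masks extract one 8-bit character
from a 32-bit word and write a new one back. In MIPS, instructions such as sll, srl,
and, andi, or, and ori play this role.

    #include <stdint.h>
    #include <stdio.h>

    int main(void) {
        uint32_t word = 0x11223344;           /* four 8-bit fields packed in one word */
        /* unpack: shift the wanted field to the right end, then mask it off */
        uint32_t field = (word >> 8) & 0xFFu; /* extracts 0x33 */
        /* pack: clear the field with a mask, then OR in the new value */
        word = (word & ~(0xFFu << 8)) | (0x55u << 8);
        printf("field=0x%02X word=0x%08X\n", (unsigned)field, (unsigned)word);
        return 0;                             /* prints field=0x33 word=0x11225544 */
    }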


Control Operations:
What distinguishes a computer from a simple calculator is its ability to make
decisions. Based on the input data and the values created during computation,
different instructions execute. Decision making is commonly represented in
programming languages using the if statement, sometimes combined with go to
statements and labels. MIPS assembly language includes two decision-making
instructions, beq (branch if equal) and bne (branch if not equal), similar to an if
statement with a go to.
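
For instance, here is a minimal C sketch (not from the notes) of how a compiler
might lower an if-else onto such branches; the goto form mirrors what the machine-
level branch and jump instructions do:

    #include <stdio.h>

    int main(void) {
        int i = 3, j = 3, f = 0, g = 10, h = 4;
        /* High-level form: if (i == j) f = g + h; else f = g - h; */
        if (i != j) goto Else;  /* bne-style branch around the then-part */
        f = g + h;
        goto Exit;              /* unconditional jump (j in MIPS) */
    Else:
        f = g - h;
    Exit:
        printf("f=%d\n", f);    /* prints f=14, since i == j */
        return 0;
    }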


UNIT 2 ARITHMETIC OPERATIONS

PART A

1. How does overflow occur in subtraction? (April/May 2015) (N/D’15)


Overflow occurs in subtraction when we subtract a negative number from a
positive number and get a negative result, or when we subtract a positive
number from a negative number and get a positive result. It means a borrow occurred
from the sign bit.

2. What is meant by Little endian and Big endian? (NOV/DEC 14)

Big endian systems are those in which the most significant byte of the word is
stored in the smallest address and the least significant byte in the largest.
Little endian systems are those in which the least significant byte is stored in
the smallest address.

3. Define – Guard and Round. (May/June 2014) (N/D’16)

Guard is the first of two extra bits kept on the right during intermediate
calculations of floating-point numbers. It is used to improve rounding accuracy.
Round is a method to make the intermediate floating-point result fit the
floating-point format; the goal is typically to find the nearest number that can be
represented in the format. IEEE 754, therefore, always keeps two extra bits on the
right during intermediate additions, called guard and round, respectively.

4. Let X = 1010100 and Y = 1000011. Perform X – Y and Y – X using 2’s complement.

(1) 2’s complement of Y is 0111101. X + 0111101 = 10010001; discarding the end
carry leaves X – Y = 0010001.
(2) 2’s complement of X is 0101100. Y + 0101100 = 1101111; there is no end carry,
so the result is negative: Y – X = –0010001.

5. What is carry look ahead adder?


A carry-lookahead adder (CLA) or fast adder is a type of adder used in digital
logic. A carry-lookahead adder improves speed by reducing the amount of time
required to determine carry bits.

6. What are the techniques to speed up the multiplication operation?

There are two techniques for speeding up the multiplication operation. The
first technique guarantees that the maximum number of summands (versions of the
multiplicand) that must be added is n/2 for n-bit operands. The second technique
reduces the time needed to add the summands (the carry-save addition of
summands method).


7. List out the rules for addition/subtraction of floating-point numbers.

1. Choose the number with the smaller exponent and shift its mantissa right a
number of steps equal to the difference in exponents.
2. Set the exponent of the result equal to the larger exponent.
3. Perform addition/subtraction on the mantissas and determine the sign of the
result.
4. Normalize the resulting value, if necessary.

8. What is meant by sticky bit?

Sticky bit is a bit used in rounding, in addition to guard and round, that is set
whenever there are nonzero bits to the right of the round bit. This sticky bit allows
the computer to see the difference between 0.50 … 00ten and 0.50 … 01ten when
rounding.

9. What are the functions of ALU?

An arithmetic logic unit (ALU) is a digital circuit used to perform arithmetic and
logic operations. It represents the fundamental building block of the central
processing unit (CPU) of a computer. Modern CPUs contain very powerful and
complex ALUs. In addition to ALUs, modern CPUs contain a control unit (CU).
sy E
10. What do you mean by subword parallelism? (April/May 2015)

By partitioning the carry chains within a 128-bit adder, a processor could use
parallelism to perform simultaneous operations on short vectors of sixteen 8-bit
operands, eight 16-bit operands, four 32-bit operands, or two 64-bit operands. The
cost of such partitioned adders was small. This concept is called subword
parallelism.

PART B

ALU

1. Explain in detail about Addition and Subtraction operations. / Briefly explain the
carry-lookahead adder. (Nov/Dec 2014) (A/M’16) (N/D’16)

Most computer operations are executed in the arithmetic and logic unit
(ALU) of the processor. Consider a typical example: Suppose two numbers
located in the memory are to be added. They are brought into the processor, and
the actual addition is carried out by the ALU. The sum may then be stored in the
memory or retained in the processor for immediate use.

ADDITION AND SUBTRACTION:

The computer designer must therefore provide a way to ignore overflow in
some cases and to recognize it in others, by providing the appropriate arithmetic
instructions depending on the type of the operands. The overflow conditions for
addition and subtraction are given below.
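
For two's complement operands A and B, the standard conditions are:

Operation | Operand A | Operand B | Result indicating overflow
A + B     | >= 0      | >= 0      | < 0
A + B     | < 0       | < 0       | >= 0
A - B     | >= 0      | < 0       | < 0
A - B     | < 0       | >= 0      | >= 0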

The computer designer must decide how to handle arithmetic overflows.
Although some languages like C ignore integer overflow, languages like Ada and
Fortran require that the program be notified. The programmer or the programming
environment must then decide what to do when overflow occurs.
MIPS detects overflow with an exception, also called an interrupt on many
computers. An exception or interrupt is essentially an unscheduled procedure call.
The address of the instruction that overflowed is saved in a register, and the
computer jumps to a predefined address to invoke the appropriate routine for that
exception. The interrupted address is saved so that in some situations the program
can continue after corrective code is executed. MIPS includes a register called the
exception program counter (EPC) to contain the address of the instruction that
caused the exception. The instruction move from system control (mfc0) is used to
copy EPC into a general-purpose register, so that MIPS software has the option of
returning to the offending instruction via a jump register instruction.

ri n
Addition and Subtraction Example:

Adding 6 to 7 in binary and then subtracting 6 from 7 in binary:

  0000 0000 0000 0000 0000 0000 0000 0111 = 7
+ 0000 0000 0000 0000 0000 0000 0000 0110 = 6
= 0000 0000 0000 0000 0000 0000 0000 1101 = 13

Subtracting 6 from 7 can be done directly:

  0000 0000 0000 0000 0000 0000 0000 0111 = 7
– 0000 0000 0000 0000 0000 0000 0000 0110 = 6
= 0000 0000 0000 0000 0000 0000 0000 0001 = 1

or via addition using the two’s complement representation of –6:

  0000 0000 0000 0000 0000 0000 0000 0111 = 7
+ 1111 1111 1111 1111 1111 1111 1111 1010 = –6
= 0000 0000 0000 0000 0000 0000 0000 0001 = 1


Instructions available: add, subtract, add immediate, add unsigned, and subtract
unsigned.

Carry-Look Ahead Adder:

• Binary addition would seem to be dramatically slower for large registers:
consider 0111 + 0011, where carries propagate from the least significant bit
toward the most significant.
• So 64-bit ripple addition would be 8 times slower than 8-bit addition.
• It is possible to build a circuit called a “carry look-ahead adder” that speeds up
addition by eliminating the need to “ripple” carries through the word.
• Carry look-ahead is expensive.
• If n is the number of bits in a ripple adder, the circuit complexity (number of
gates) is O(n).
• For full carry look-ahead, the complexity is O(n³).
• Complexity can be reduced by rippling smaller look-aheads: e.g., each 16-bit
group is handled by four 4-bit adders and the 16-bit adders are rippled into a
64-bit adder.

Fig. Carry-look ahead adder

Downloaded From : www.EasyEngineering.ne

The advantage of the CLA scheme used in this circuit is its simplicity, because
each CLA block calculates the generate and propagate signals for two bits only. This
is much easier to understand than the more complex variants presented in other
textbooks, where combinatorial logic is used to calculate the G and P signals of four
or more bits, and the resulting adder structure is slightly faster but also less regular.
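
As a rough illustration, here is a minimal C sketch (assumed names, not from the
notes) of the generate/propagate computation for a 4-bit look-ahead stage: gi = ai
AND bi says bit i generates a carry, pi = ai OR bi says it propagates one, and each
carry is computed from those signals via c(i+1) = gi + pi·ci.

    #include <stdint.h>
    #include <stdio.h>

    int main(void) {
        uint8_t a = 0x7, b = 0x3, c0 = 0;  /* 0111 + 0011 */
        uint8_t g = a & b;                 /* gi = ai AND bi */
        uint8_t p = a | b;                 /* pi = ai OR bi  */
        uint8_t c[5];
        c[0] = c0;
        /* c(i+1) = gi | (pi & ci); in hardware these are expanded, e.g.
           c2 = g1 + p1*g0 + p1*p0*c0, and evaluated in parallel */
        for (int i = 0; i < 4; i++)
            c[i + 1] = ((g >> i) & 1) | (((p >> i) & 1) & c[i]);
        uint8_t carries = (uint8_t)((c[3] << 3) | (c[2] << 2) | (c[1] << 1) | c[0]);
        uint8_t sum = (a ^ b ^ carries) & 0xF;
        printf("sum=0x%X carry_out=%u\n", sum, c[4]); /* sum=0xA, carry_out=0 */
        return 0;
    }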

MULTIPLICATION

2.(i). Explain the sequential version of the multiplication algorithm and its
hardware. (or) Explain Booth’s algorithm for the multiplication of signed 2’s
complement numbers. (April/May 2015) (N/D’15 & 16)

The first operand is called the multiplicand and the second the multiplier. The
final result is called the product. As you may recall, the algorithm learned in grammar
school is to take the digits of the multiplier one at a time from right to left, multiplying
the multiplicand by the single digit of the multiplier and shifting the intermediate
product one digit to the left of the earlier intermediate products.
The first observation is that the number of digits in the product is considerably
larger than the number in either the multiplicand or the multiplier. In fact, if we ignore
the sign bits, the length of the multiplication of an n-bit multiplicand and an m-bit
multiplier is a product that is n + m bits long. That is, n + m bits are required to
represent all possible products. Hence, like add, multiply must cope with overflow
because we frequently want a 32-bit product as the result of multiplying two 32-bit
numbers.

g .n
Multiplying 1000ten by 1001ten:

Multiplicand      1000
Multiplier      x 1001
                ------
                  1000
                 0000
                0000
               1000
               -------
Product        1001000

In this example we restricted the decimal digits to 0 and 1. With only two
choices, each step of the multiplication is simple:
1. Just place a copy of the multiplicand in the proper place if the multiplier digit is a 1, or
2. Place 0 (the multiplicand times 0) in the proper place if the digit is 0.

Downloaded From : www.EasyEngineering.ne

Fig. First version of the multiplication hardware

Fig. The first multiplication algorithm
The multiplier is in the 32-bit Multiplier register, and the 64-bit Product
register is initialized to 0. Over 32 steps a 32-bit multiplicand would move 32 bits to
the left. Hence we need a 64-bit Multiplicand register, initialized with the 32-bit
multiplicand in the right half and 0 in the left half. This register is then shifted left 1 bit
each step to align the multiplicand with the sum being accumulated in the 64-bit
Product register.
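
A minimal C sketch of this first algorithm (assumed names; 8-bit operands for
brevity, not from the notes) makes the three steps per iteration concrete: test the
low multiplier bit, add the multiplicand if it is 1, then shift the multiplicand left and
the multiplier right.

    #include <stdint.h>
    #include <stdio.h>

    /* shift-and-add multiplication of 8-bit unsigned operands,
       producing a 16-bit product */
    uint16_t shift_add_multiply(uint8_t multiplicand, uint8_t multiplier) {
        uint16_t mcand   = multiplicand; /* wide register, value in right half */
        uint16_t product = 0;
        for (int step = 0; step < 8; step++) {
            if (multiplier & 1)          /* 1. test multiplier bit 0          */
                product += mcand;        /* 2. add multiplicand to product    */
            mcand <<= 1;                 /* 3a. shift multiplicand left       */
            multiplier >>= 1;            /* 3b. shift multiplier right        */
        }
        return product;
    }

    int main(void) {
        printf("%u\n", shift_add_multiply(8, 9)); /* 1000two x 1001two = 72 */
        return 0;
    }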
Moore’s Law has provided so many more resources that hardware designers
can now build much faster multiplication hardware. Whether the multiplicand is to be
added or not is known at the beginning of the multiplication by looking at each of the
32 multiplier bits. Faster multiplications are possible by essentially providing one
32-bit adder for each bit of the multiplier: one input is the multiplicand ANDed with a
multiplier bit, and the other is the output of a prior adder.

Downloaded From : www.EasyEngineering.ne

SIGNED MULTIPLICATION:
In the signed multiplication, convert the multiplier and multiplicand to positive
numbers and then remember the original signs. The algorithms should then be run
for 31 iterations, leaving the signs out of the calculation. The shifting steps would
need to extend the sign of the product for signed numbers. When the algorithm
completes, the lower word would have the 32-bit product.

FASTER MULTIPLICATION

Fig. Faster multiplier
Faster multiplications are possible by essentially providing one 32-bit adder
for each bit of the multiplier: one input is the multiplicand ANDed with a multiplier bit,
and the other is the output of a prior adder. Connect the outputs of adders on the
right to the inputs of adders on the left, making a stack of adders 32 high.
The above figure shows an alternative way to organize the 32 additions in a parallel
tree. Instead of waiting for 32 add times, we wait just log2(32), or five, 32-bit add
times. Multiply can go even faster than five add times because of the use of
carry-save adders. It is easy to pipeline such a design to be able to support many
multiplies simultaneously.

Multiply in MIPS:
MIPS provides a separate pair of 32-bit registers to contain the 64-bit product,
called Hi and Lo. To produce a properly signed or unsigned product, MIPS has two
instructions: multiply (mult) and multiply unsigned (multu). To fetch the integer 32-bit
product, the programmer uses move from lo (mflo). The MIPS assembler generates
a pseudoinstruction for multiply that specifies three general-purpose registers,
generating mflo and mfhi instructions to place the product into registers.
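
A minimal C sketch (assumed names, not from the notes) of what mult followed by
mflo/mfhi computes: the full 64-bit product is formed, then split into the two 32-bit
halves that MIPS keeps in Lo and Hi.

    #include <stdint.h>
    #include <stdio.h>

    int main(void) {
        int32_t a = 100000, b = 100000;
        int64_t product = (int64_t)a * (int64_t)b;         /* what mult produces */
        uint32_t lo = (uint32_t)product;                   /* what mflo fetches  */
        uint32_t hi = (uint32_t)((uint64_t)product >> 32); /* what mfhi fetches  */
        /* 10^10 does not fit in 32 bits, so hi is nonzero here */
        printf("hi=0x%08X lo=0x%08X\n", (unsigned)hi, (unsigned)lo);
        return 0;
    }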


2. (ii). Booth Algorithm: (Nov/Dec 2014)

Booth’s Algorithm Registers and Setup:
• 3 n-bit registers, plus a 1-bit register logically to the right of Q (denoted Q-1)
• Register setup:
• Q register <- multiplier
• Q-1 <- 0
• M register <- multiplicand
• A register <- 0
• Count <- n
• The product will be 2n bits, held in the A and Q registers

Booth’s Algorithm Control Logic:
• Bits of the multiplier are scanned one at a time (the current bit Q0)
• As each bit is examined, the bit to its right is considered also (the previous bit Q-1)

Fig. Booth algorithm
• Then:
00: Middle of a string of 0s, so no arithmetic operation.
01: End of a string of 1s, so add the multiplicand to the left half of the product (A).
10: Beginning of a string of 1s, so subtract the multiplicand from the left half of
the product (A).
11: Middle of a string of 1s, so no arithmetic operation.
• Then shift A, Q, and bit Q-1 right one bit using an arithmetic shift.
• In an arithmetic shift, the MSB remains unchanged.

Example of Booth’s Algorithm (7 × 3 = 21):
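
The step-by-step register trace is easiest to check against code. Below is a minimal
C sketch of Booth's algorithm for 8-bit operands (assumed function and variable
names, not from the notes); the comments mark the 10/01 cases and the arithmetic
right shift of A:Q:Q-1.

    #include <stdint.h>
    #include <stdio.h>

    /* Booth's algorithm for 8-bit signed operands; after 8 steps A:Q
       holds the 16-bit product. Relies on arithmetic right shift of
       signed ints, which mainstream compilers provide. */
    int16_t booth_multiply(int8_t m, int8_t multiplier) {
        int8_t  A = 0;
        uint8_t Q = (uint8_t)multiplier;
        int     Q_1 = 0;
        for (int count = 0; count < 8; count++) {
            int q0 = Q & 1;
            if (q0 == 1 && Q_1 == 0) A -= m;  /* 10: start of a run of 1s */
            if (q0 == 0 && Q_1 == 1) A += m;  /* 01: end of a run of 1s   */
            /* arithmetic shift right of A:Q:Q_1; the MSB of A is kept */
            Q_1 = q0;
            Q   = (uint8_t)((Q >> 1) | ((A & 1) << 7));
            A   = (int8_t)(A >> 1);
        }
        return (int16_t)(((uint16_t)(uint8_t)A << 8) | Q);
    }

    int main(void) {
        printf("%d\n", booth_multiply(7, 3));   /* prints 21  */
        printf("%d\n", booth_multiply(-7, 3));  /* prints -21 */
        return 0;
    }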


DIVISION ALGORITHM AND ITS HARDWARE

3. Explain the concepts of the division algorithm and its hardware. / Divide (12)10 by
(3)10 using the restoring and non-restoring division algorithms with step-by-step
intermediate results and explain. (Nov/Dec 2014, 15, 16) (A/M’16) (16)

The reciprocal operation of multiply is divide, an operation that is even less
frequent and even more quirky. It even offers the opportunity to perform a
mathematically invalid operation: dividing by 0. The example is dividing 1,001,010 by
1000. The two operands (dividend and divisor) and the result (quotient) of divide are
accompanied by a second result called the remainder. Here is another way to
express the relationship between the components:
Dividend = Quotient × Divisor + Remainder

where the remainder is smaller than the divisor. Infrequently, programs use the
divide instruction just to get the remainder, ignoring the quotient. The basic grammar
school division algorithm tries to see how big a number can be subtracted, creating
a digit of the quotient on each attempt. Binary numbers contain only 0 or 1, so binary
division is restricted to these two choices, thereby simplifying binary division.
Assume that both the dividend and divisor are positive, and hence the quotient and
the remainder are nonnegative. The division operands and both results are 32-bit
values.
A DIVISION ALGORITHM AND HARDWARE:
Initially, the 32-bit Quotient register is set to 0. Each iteration of the algorithm
needs to move the divisor to the right one digit, so we start with the divisor placed in
the left half of the 64-bit Divisor register and shift it right 1 bit each step to align it
with the dividend. The Remainder register is initialized with the dividend. The figure
shows three steps of the first division algorithm. Unlike a human, the computer isn’t
smart enough to know in advance whether the divisor is smaller than the dividend.
DIVISION ALGORITHM:
It must first subtract the divisor in step 1; remember that this is how we
performed the comparison in the set on less than instruction. If the result is positive,
the divisor was smaller than or equal to the dividend, so we generate a 1 in the
quotient (step 2a). If the result is negative, the next step is to restore the original
value by adding the divisor back to the remainder and generate a 0 in the quotient
(step 2b). The divisor is shifted right and then we iterate again. The remainder and
quotient will be found in their namesake registers after the iterations are complete.

a sy E
Using a 4-bit version of the algorithm to save pages, let’s try dividing 7ten by
2ten, or 0000 0111two by 0010two.

Fig. Values of register in division algorithm

The above figure shows the value of each register for each of the steps, with
the quotient being 3ten and the remainder 1ten. Notice that the test in step 2 of
whether the remainder is positive or negative simply tests whether the sign bit of the
Remainder register is a 0 or 1. The surprising requirement of this algorithm is that it
takes n + 1 steps to get the proper quotient and remainder.
This algorithm and hardware can be refined to be faster and cheaper. The
speedup comes from shifting the operands and the quotient simultaneously with the
subtraction. This refinement halves the width of the adder and registers by noticing
where there are unused portions of registers and adders.
SIGNED DIVISION:
The one complication of signed division is that we must also set the sign of
the remainder. Remember that the following equation must always hold:
Dividend = Quotient × Divisor + Remainder
To understand how to set the sign of the remainder, let’s look at the example of
dividing all the combinations of ±7ten by ±2ten.
Case:
+7 ÷ +2: Quotient = +3, Remainder = +1. Checking the results:
7 = 3 × 2 + (+1) = 6 + 1
If we change the sign of the dividend, the quotient must change as well:
–7 ÷ +2: Quotient = –3
Rewriting our basic formula to calculate the remainder:
Remainder = (Dividend – Quotient × Divisor) = –7 – (–3 × +2) = –7 – (–6) = –1


So,
–7 ÷ +2: Quotient = –3, Remainder = –1
Checking the results again:
–7 = –3 × 2 + (–1) = –6 – 1
The following figure shows the revised hardware.

Fig. Division hardware

The reason the answer isn’t a quotient of –4 and a remainder of +1, which
would also fit this formula, is that the absolute value of the quotient would then
change depending on the sign of the dividend and the divisor! Clearly, programming
would be an even greater challenge. This anomalous behavior is avoided by
following the rule that the dividend and remainder must have the same signs, no
matter what the signs of the divisor and quotient. We calculate the other
combinations by following the same rule:
–(x ÷ y) ≠ (–x) ÷ y
+7 ÷ –2: Quotient = –3, Remainder = +1
–7 ÷ –2: Quotient = +3, Remainder = –1
Thus the correctly signed division algorithm negates the quotient if the signs of
the operands are opposite and makes the sign of the nonzero remainder match the
dividend.

Faster Division:
The many adders used to speed up multiply cannot be used to do the same
trick for divide. The reason is that we need to know the sign of the difference before
performing the next step of the algorithm, whereas with multiply we could calculate
the 32 partial products immediately.
There are techniques to produce more than one bit of the quotient per step.
The SRT division technique tries to guess several quotient bits per step, using a
table lookup based on the upper bits of the dividend and remainder. It relies on
subsequent steps to correct wrong guesses.

RESTORING DIVISION ALGORITHM:


Division of a 7-bit dividend by a 4-bit divisor:
• Assume X is the register holding the k-bit dividend
• Assume Y holds the k-bit divisor
• Assume S is a sign bit
1. Start: Load 0 into the k-bit accumulator A; the dividend X is loaded into the k-bit
quotient register MQ.
2. Step A: Shift the 2k-bit register pair A-MQ left one bit.
3. Step B: Subtract the divisor Y from A.
4. Step C: If the sign of A (msb) = 1, then reset MQ0 (lsb) = 0, else set it to 1.
5. Step D: If MQ0 = 0, add Y back (restoring the effect of the earlier subtraction).
6. Steps A to D repeat until the total number of cyclic operations = k. At the end,
A has the remainder and MQ has the quotient.
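
A minimal C sketch of these steps for 4-bit operands (assumed names, not from the
notes); it divides 12 by 3 as the question asks, with A and MQ shifting left together
and step D restoring A when the trial subtraction goes negative.

    #include <stdint.h>
    #include <stdio.h>

    /* restoring division for 4-bit unsigned operands: after k = 4
       iterations A holds the remainder and MQ the quotient */
    void restoring_divide(uint8_t x, uint8_t y,
                          uint8_t *quotient, uint8_t *remainder) {
        int8_t a = 0;    /* accumulator A (signed so the sign test works) */
        uint8_t mq = x;  /* MQ register, loaded with the dividend */
        for (int step = 0; step < 4; step++) {
            a = (int8_t)((a << 1) | ((mq >> 3) & 1)); /* Step A: shift A-MQ left */
            mq = (uint8_t)((mq << 1) & 0xF);
            a -= y;                                   /* Step B: subtract Y      */
            if (a < 0)
                a += y;                               /* Step D: restore, bit 0  */
            else
                mq |= 1;                              /* Step C: quotient bit 1  */
        }
        *quotient = mq;
        *remainder = (uint8_t)a;
    }

    int main(void) {
        uint8_t q, r;
        restoring_divide(12, 3, &q, &r);
        printf("12 / 3 = %u remainder %u\n", q, r); /* 4 remainder 0 */
        return 0;
    }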

DIVISION USING NON-RESTORING ALGORITHM:


• Assume there is an accumulator A and an MQ register, each of k bits.
• MQ0 (the lsb of MQ) gives the quotient bit, which is saved after a subtraction or
addition.
• The total number of additions or subtractions is k only, and the total number of
shifts is k, plus one addition for restoring the remainder if needed.
• Assume that the X register has (2k−1) bits for the dividend and Y has the k-bit
divisor.
• Assume a sign bit S shows the sign.

Fig. Division using Non-restoring Algorithm

1. Load the upper half (k−1 bits) of the dividend X into the k-bit accumulator A,
and load the lower half of the dividend X into the lower k bits of the quotient
register MQ.
• Reset sign S = 0.
• Subtract the k-bit divisor Y from S-A (1 plus k bits) and assign MQ0 as per S.
2. If the sign of A, S = 0, shift the S plus 2k-bit register pair A-MQ left and
subtract the k-bit divisor Y from S-A (1 plus k bits); else if the sign of A, S = 1,
shift the S plus 2k-bit register pair A-MQ left and add the divisor Y into S-A
(1 plus k bits).
• Assign MQ0 as per S.
3. Repeat step 2 until the total number of operations = k.
4. If at the last step the sign of A in S = 1, then add Y into S-A to leave the
correct remainder in A and also assign MQ0 as per S; else do nothing.
5. A has the remainder and MQ has the quotient.

Downloaded From : www.EasyEngineering.ne

FLOATING POINT ADDITION

4. Explain how floating-point addition is carried out in a computer system. Give
an example for a binary floating-point addition. / Explain in detail about Floating
Point Operations. (April/May 2015) (16)

The scientific notation has a single digit to the left of the decimal point. A
number in scientific notation that has no leading 0s is called a normalized number,
which is the usual way to write it. Floating point is computer arithmetic that
represents numbers in which the binary point is not fixed. Floating-point numbers are
usually a multiple of the size of a word.
The representation of a MIPS floating-point number is shown below, where s
is the sign of the floating-point number (1 meaning negative), exponent is the value of
the 8-bit exponent field (including the sign of the exponent), and fraction is the 23-bit
number. This representation is called sign and magnitude, since the sign has a
separate bit from the rest of the number.
A standard scientific notation for reals in normalized form offers three
advantages:
 It simplifies exchange of data that includes floating-point numbers;
 It simplifies the floating-point arithmetic algorithms to know that numbers will
always be in this form;
 It increases the accuracy of the numbers that can be stored in a word, since
the unnecessary leading 0s are replaced by real digits to the right of the binary
point.

Fig. Scientific notation

FLOATING POINT ADDITION:
Example:
Let’s add numbers in scientific notation by hand to illustrate the problems in
floating-point addition: 9.999ten × 10^1 + 1.610ten × 10^-1. Assume that we can store
only four decimal digits of the significand and two decimal digits of the exponent.
Step 1.
To be able to add these numbers properly, we must align the decimal point of
the number that has the smaller exponent. Hence, we need a form of the smaller
number, 1.610ten × 10^-1, that matches the larger exponent. We obtain this by
observing that there are multiple representations of an unnormalized floating-point
number in scientific notation:
1.610ten × 10^-1 = 0.1610ten × 10^0 = 0.01610ten × 10^1
The number on the right is the version we desire, since its exponent matches
the exponent of the larger number, 9.999ten × 10^1. Thus, the first step shifts the
significand of the smaller number to the right until its corrected exponent matches
that of the larger number. But we can represent only four decimal digits, so, after
shifting, the number is really 0.016ten × 10^1.
Downloaded From : www.EasyEngineering.ne

Step 2. Next comes the addition of the significands:
9.999ten + 0.016ten = 10.015ten
so the sum is 10.015ten × 10^1.

Step 3. This sum is not in normalized scientific notation, so we need to adjust it:
10.015ten × 10^1 = 1.0015ten × 10^2
Thus, after the addition we may have to shift the sum to put it into normalized
form, adjusting the exponent appropriately. This example shows shifting to the right,
but if one number were positive and the other negative, it would be possible for the
sum to have many leading 0s, requiring left shifts. Whenever the exponent is
increased or decreased, we must check for overflow or underflow; that is, we must
make sure that the exponent still fits in its field.
Step 4. Since we assumed that the significand can be only four digits long
(excluding the sign), we must round the number. In our grammar school algorithm,
the rules truncate the number if the digit to the right of the desired point is between 0
and 4, and add 1 to the digit if the number to the right is between 5 and 9. The
number 1.0015ten × 10^2 is rounded to four digits in the significand to 1.002ten ×
10^2, since the fourth digit to the right of the decimal point was between 5 and 9.
Notice that if we have bad luck on rounding, such as adding 1 to a string of 9s, the
sum may no longer be normalized and we would need to perform step 3 again.

The algorithm for binary floating-point addition follows this decimal example:
adjust the significand of the number with the smaller exponent and then add the two
significands. Step 3 normalizes the result, forcing a check for overflow or underflow.
The test for overflow and underflow in step 3 depends on the precision of the
operands. Recall that the pattern of all 0 bits in the exponent is reserved and used
for the floating-point representation of zero. Moreover, the pattern of all 1 bits in the
exponent is reserved for indicating values and situations outside the scope of
normal floating-point numbers. Thus, for single precision, the maximum exponent is
127, and the minimum exponent is -126. The limits for double precision are 1023 and
-1022.
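
To make the representation concrete, here is a minimal C sketch (assumed names,
not from the notes) that pulls the sign, 8-bit biased exponent, and 23-bit fraction out
of a single-precision value:

    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    int main(void) {
        float f = -0.75f;
        uint32_t bits;
        memcpy(&bits, &f, sizeof bits);          /* reinterpret the bit pattern */
        uint32_t sign     = bits >> 31;
        uint32_t exponent = (bits >> 23) & 0xFF; /* biased by 127 */
        uint32_t fraction = bits & 0x7FFFFF;
        printf("sign=%u exponent=%u (unbiased %d) fraction=0x%06X\n",
               (unsigned)sign, (unsigned)exponent,
               (int)exponent - 127, (unsigned)fraction);
        /* -0.75 = -1.1two x 2^-1 -> sign=1, exponent=126, fraction=0x400000 */
        return 0;
    }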


5. Write short notes on Subword Parallelism.

A subword is a lower-precision unit of data contained within a word. In
subword parallelism, multiple subwords are packed into a word, and whole words
are then processed. With the appropriate subword boundaries this technique results
in parallel processing of subwords. Since the same instruction is applied to all
subwords within the word, this is a form of SIMD (Single Instruction Multiple Data)
processing.
It is possible to apply subword parallelism to noncontiguous subwords of
different sizes within a word. The practical implementation is simple if subwords are
the same size and contiguous within a word. The data-parallel programs that benefit
from subword parallelism tend to process data that are of the same size.
For example, if the word size is 64 bits, the subword sizes can be 8, 16, and
32 bits. Hence an instruction operates on eight 8-bit subwords, four 16-bit subwords,
two 32-bit subwords, or one 64-bit subword in parallel.
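
A minimal C sketch of the idea (assumed names; a software analogue of the
partitioned adder, not from the notes): eight 8-bit subwords packed in a 64-bit word
are added at once while carries are kept from crossing subword boundaries.

    #include <stdint.h>
    #include <stdio.h>

    /* add eight packed 8-bit subwords held in one 64-bit word */
    uint64_t add_packed_bytes(uint64_t a, uint64_t b) {
        const uint64_t H = 0x8080808080808080ULL; /* msb of each byte */
        /* add the low 7 bits of each byte, then patch in the msbs */
        uint64_t low = (a & ~H) + (b & ~H);
        return low ^ ((a ^ b) & H);
    }

    int main(void) {
        uint64_t a = 0x0102030405060708ULL;
        uint64_t b = 0x1010101010101010ULL;
        printf("%016llx\n", (unsigned long long)add_packed_bytes(a, b));
        return 0;                               /* prints 1112131415161718 */
    }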

Subword Parallelism:
Subword parallelism is an efficient and flexible solution for media processing,
because the algorithms exhibit a great deal of data parallelism on lower-precision
data. It is also useful for computations unrelated to multimedia that exhibit data
parallelism on lower-precision data. Graphics and audio applications can take
advantage of performing simultaneous operations on short vectors.
The term SIMD was originally defined in the 1960s as a category of
multiprocessor with one control unit and multiple processing elements; each
instruction is executed by all processing elements on different data streams, e.g.,
the Illiac IV. Today the term is used to describe partitionable ALUs in which multiple
operands can fit in a fixed-width register and are acted upon in parallel. (Other terms
include subword parallelism, microSIMD, short vector extensions, split-ALU, SLP /
superword-level parallelism, and SIGD / single-instruction-group[ed]-data.)
The structure of the arithmetic element can be altered under program control.
Each instruction specifies a particular form of machine in which to operate, ranging
from a full 36-bit computer to four 9-bit computers with many variations. Not only is
such a scheme able to make more efficient use of the memory in storing data of
various word lengths, but it also can be expected to result in greater over-all
machine speed because of the increased parallelism of operation.


UNIT III PROCESSOR AND CONTROL UNIT

PART-A

1. What are R-type instructions? (April/May 2015)

For R-type instructions, an additional 6 bits (bits 5-0), called the function field,
are used. Thus, the 6 bits of the opcode and the 6 bits of the function field together
specify the kind of instruction for R-type instructions.

OP     rs     rt     rd     shamt  funct
6 bits 5 bits 5 bits 5 bits 5 bits 6 bits

OP - basic operation of the instruction, called the opcode
rs - the first register source operand
rt - the second register source operand
rd - the register destination operand; it gets the result of the operation
shamt - shift amount
funct - function. This field, called the function code, selects the specific variant of
the operation in the op field.

2. What is branch prediction buffer and branch prediction? (April/May 2015) (N/D’15)

A branch prediction buffer, or branch history table, is a small memory indexed
by the lower portion of the address of the branch instruction. The memory contains a
bit that says whether the branch was recently taken or not.
Branch prediction is a method of resolving a branch hazard that assumes a
given outcome for the branch and proceeds from that assumption rather than
waiting to ascertain the actual outcome.
3. What is meant by speculation? (Nov/Dec 2014)


One of the most important methods for finding and exploiting more ILP is
speculation. It is an approach whereby the compiler or processor guesses the
outcome of an instruction to remove it as dependence in executing other instructions.
For example, we might speculate on the outcome of a branch, so that
instructions after the branch could be executed earlier.

4. What are exceptions and interrupts? (Nov/Dec 2014)


Exception, also called interrupt, is an unscheduled event that disrupts program
execution used to detect overflow. Eg. Arithmetic overflow, using an undefined
instruction. Interrupt is an exception that comes from outside of the processor. Eg.
I/O device request.
It is also used to detect an overflow condition. Events other than branches or
jumps that change the normal flow of instruction execution come under exception.
An exception is an unexpected event from within the processor. An interrupt is
an unexpected event from outside the processor.

5. Write the control signals for storing a word in memory. (May/June 2014)

 R1out, MARin
 R2out, MDRin, Write
 MDRoutE, WMFC

6. What is meant by data hazards and control hazards in pipelining? (May/June
2012), (N/D 2013, 2015)

DATA HAZARDS:
This is when reads and writes of data occur in a different order in the pipeline
than in the program code. There are three different types of data hazard, named
according to the order of operations that must be maintained:
 RAW: a Read After Write hazard
 WAR: a Write After Read hazard, the reverse of a RAW
 WAW: a Write After Write hazard

CONTROL HAZARDS:
This is when a decision needs to be made, but the information needed to
make the decision is not available yet. A control hazard is actually the same thing as
a RAW data hazard (see above), but is considered separately because different
techniques can be employed to resolve it; in effect, we make it less important by
trying to make good guesses as to what the decision is going to be.
7. What is meant by speculative execution? (May/June 2012) (N/D’14)

Speculative execution is an optimization technique where a computer system
performs some task that may not actually be needed. The main idea is to do work
before it is known whether that work will be needed at all, so as to prevent a delay
that would have to be incurred by doing the work after it is known whether it is
needed. If it turns out the work was not needed after all, any changes made by the
work are reverted and the results are ignored.
The objective is to provide more concurrency if extra resources are available.
This approach is employed in a variety of areas, including branch prediction in
pipelined processors, prefetching memory and files, and optimistic concurrency
control in database systems.

8. What is meant by pipelining and pipeline stall? (April/May 2010)

Pipelining is an implementation technique in which multiple instructions are
overlapped in execution. Pipelining improves performance by increasing instruction
throughput, as opposed to decreasing the execution time of an individual instruction.
A pipeline stall, also called a bubble, is a stall initiated in order to resolve a
hazard. Bubbles can be seen elsewhere in the pipeline.

9. What are the advantages of pipelining? (April/May 2010)

 The cycle time of the processor is reduced, thus increasing the instruction
issue rate in most cases.
 Some combinational circuits such as adders or multipliers can be made faster
by adding more circuitry. If pipelining is used instead, it can save circuitry
versus a more complex combinational circuit.
10. Define pipeline speedup. (Nov/Dec 2013)

The ideal speedup from a pipeline is equal to the number of stages in the
pipeline.
Speedup = Time per instruction on unpipelined machine / Number of pipe stages.
However, this only happens if the pipeline stages are all of equal length.
Splitting a 40 ns operation into 5 stages, each 8 ns long, will result in a 5x speedup.
Splitting the same operation into 5 stages, 4 of which are 7.5 ns long and one of
which is 10 ns long, will result in only a 4x speedup.
 If your starting point is a multiple clock cycle per instruction machine, then
pipelining decreases CPI.
 If your starting point is a single clock cycle per instruction machine, then
pipelining decreases cycle time.

nee
11. What are the disadvantages of increasing the number of stages in pipelined
processing? (April/May 2011)

 The design of a non-pipelined processor is simpler and cheaper to
manufacture; a non-pipelined processor executes only a single instruction at a
time.
 In a pipelined processor, the insertion of flip-flops between modules increases
the instruction latency compared to a non-pipelined processor.
 A non-pipelined processor will have a defined instruction throughput. The
performance of a pipelined processor is much harder to predict and may vary
widely for different programs.

12. What is the role of cache in pipelining? (Nov/Dec 2011)

A pipeline cache is a cache or storage area for a computer processor that is
designed to be read from or written to in a pipelined succession of four data
transfers, in which later bursts can start to flow or transfer before the first burst has
arrived at the processor. A pipeline burst cache is often used for the static RAM
(static random access memory) that serves as the L1 and L2 cache in a computer.

13. What is meant by data path element? (N/D’16)


A data path element is a unit used to operate on or hold data within a
processor. In the MIPS implementation, the data path elements include the
instruction and data memories, the register file, the ALU, and adders.

14. What are the problems faced in instruction pipelining, and how can data hazards
be prevented in pipelining?
 Resource conflicts
 Data dependency
 Branch difficulties
Data hazards in instruction pipelining can be prevented by the following techniques:
 Operand forwarding
 Software approach

PART B

MIPS IMPLEMENTATION SCHEME

1. Write short notes on the MIPS implementation scheme. (8) (N/D’15)

a. Instruction fetch cycle (IF):
IR = Mem[PC];
NPC = PC + 4;
Operation: Send out the PC and fetch the instruction from memory into the
instruction register (IR). Increment the PC by 4 to address the next sequential
instruction.
IR holds the instruction that will be needed on subsequent clock cycles.
Register NPC holds the next sequential PC.

b. Instruction decode/register fetch cycle (ID):
A = Regs[rs];
B = Regs[rt];
Imm = sign-extended immediate field of IR;

c. Execution/effective address cycle (EX):
i) Register-Register ALU instruction:
ALUOutput = A func B;
Operation:
a) The ALU performs the operation specified by the function code on the value in
register A and the value in register B.
b) The result is placed in the temporary register ALUOutput.
ii) Register-Immediate ALU instruction:
ALUOutput = A op Imm;
Operation:
a) The ALU performs the operation specified by the opcode on the value in
register A and on the value in register Imm.
b) The result is placed in the temporary register ALUOutput.
iii) Branch:
ALUOutput = NPC + (Imm << 2);
Cond = (A == 0);
Operation:
a) The ALU adds NPC to the sign-extended immediate value in Imm, which is
shifted left by 2 bits to create a word offset, to compute the address of the
branch target.
b) Register A, which has been read in the prior cycle, is checked to determine
whether the branch is taken.
c) Considering only one form of branch (BEQZ), the comparison is against 0.

d. Memory access/branch completion cycle (MEM):
The PC is updated for all instructions: PC = NPC;
i. Memory reference:
LMD = Mem[ALUOutput] or Mem[ALUOutput] = B;
Operation:
a) Access memory if needed.
b) If the instruction is a load, data returns from memory and is placed in LMD.
c) If the instruction is a store, data from the B register is written into memory.
ii. Branch:
if (cond) PC = ALUOutput;
Operation: If the instruction branches, the PC is replaced with the branch
destination address in the register ALUOutput.

e. Write-back cycle (WB):
* Register-Register ALU instruction:
Regs[rd] = ALUOutput;
* Register-Immediate ALU instruction:
Regs[rt] = ALUOutput;
* Load instruction:
Regs[rt] = LMD;
Operation: Write the result into the register file, depending on the effective opcode.
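
As a rough end-to-end illustration, here is a minimal C sketch (hypothetical array
and variable names, not from the notes) that walks one R-type add through all five
cycles in software:

    #include <stdint.h>
    #include <stdio.h>

    uint32_t Mem[1024];   /* instruction memory, indexed by PC/4 */
    uint32_t Regs[32];    /* register file */
    uint32_t PC = 0;

    void step_rtype_add(void) {
        /* IF: fetch the instruction and compute NPC */
        uint32_t IR  = Mem[PC / 4];
        uint32_t NPC = PC + 4;
        /* ID: decode the register fields and read the register file */
        uint32_t rs = (IR >> 21) & 0x1F;
        uint32_t rt = (IR >> 16) & 0x1F;
        uint32_t rd = (IR >> 11) & 0x1F;
        uint32_t A = Regs[rs], B = Regs[rt];
        /* EX: the ALU performs the operation (funct = add assumed) */
        uint32_t ALUOutput = A + B;
        /* MEM: nothing to access for an R-type; the PC is updated */
        PC = NPC;
        /* WB: write the result back into the register file */
        Regs[rd] = ALUOutput;
    }

    int main(void) {
        Regs[1] = 5; Regs[2] = 7;
        Mem[0] = (1u << 21) | (2u << 16) | (3u << 11) | 0x20; /* add $3,$1,$2 */
        step_rtype_add();
        printf("Regs[3] = %u\n", Regs[3]);  /* prints 12 */
        return 0;
    }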


BUILDING DATA PATH AND CONTROL IMPLEMENTATION SCHEME

2. Explain the datapath and its control in detail. (Nov/Dec 2014) (16)

CONTROL:
The component of the processor that commands the datapath, memory, and
I/O devices according to the instructions of the program.

Building a Datapath:
Elements that process data and addresses in the CPU: memories, registers,
ALUs. The MIPS datapath can be built incrementally by considering only a subset of
instructions. The 3 main elements are shown below.

Fig. Datapath

A memory unit is used to store instructions of a program and supply
instructions given an address. It needs to provide only read access (once the
program is loaded), so no control signal is needed. The PC (program counter, or
instruction address register) is a register that holds the address of the current
instruction. A new value is written to it every clock cycle; no control signal is required
to enable the write. An adder increments the PC to the address of the next
instruction; it is an ALU permanently wired to do only addition, so no extra control
signal is required.

Fig. Datapath portion for Instruction Fetch

Types of Elements in the Datapath:

State element:
A memory element, i.e., one that contains state, e.g., the program counter
and instruction memory.

Combinational element:
An element that operates on values, e.g., an adder or the ALU. The elements
required by the different classes of instructions are:
o Arithmetic and logical instructions
o Data transfer instructions
o Branch instructions

R-Format ALU Instructions:

E.g., add $t1, $t2, $t3
Perform an arithmetic/logical operation.
Read two register operands and write the register result.

Register file:
 A collection of the registers. Any register can be read or written by specifying
the number of the register. The register file contains the register state of the
computer.
Read from the register file:
 2 inputs to the register file specify the register numbers: 5-bit wide inputs for
the 32 registers.
 2 outputs from the register file supply the read values: 32 bits wide.
 Needed for all instructions; no control required.
Write to the register file:
 1 input to the register file specifies the register number (a 5-bit wide input for
the 32 registers), and 1 input supplies the value to be written (32 bits wide).
 Only for some instructions; governed by the RegWrite control signal.
ALU:
o Takes two 32-bit inputs and produces a 32-bit output; also sets a one-bit
signal if the result is 0.
o The operation done by the ALU is controlled by a 4-bit control signal input,
which is set according to the instruction.

CONTROL IMPLEMENTATION SCHEME:


The ALU Control:
The MIPS ALU defines the following 6 combinations of four control inputs:
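
For reference, the six combinations follow the standard MIPS ALU control encoding:

ALU control lines | Function
0000              | AND
0001              | OR
0010              | add
0110              | subtract
0111              | set on less than
1100              | NOR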

Depending on the instruction class, the ALU will need to perform one of the
first five functions. (NOR is needed for other parts of the MIPS instruction set not
found in the subset we are implementing.) For load word and store word instructions,
we use the ALU to compute the memory address by addition.
In this Figure we show how to set the ALU control inputs based on the 2 bit
ALUOp control and the 6 bit function code. Later in this chapter we will see how the
ALUOp bits are generated from the main control unit.

Designing the Main Control Unit:
To start this process, let’s identify the fields of an instruction and the control
lines that are needed for the datapath. To understand how to connect the fields of an
instruction to the datapath, it is useful to review the formats of the three instruction
classes: the R-type, branch, and load-store instructions.
The three instruction classes (R-type, load and store, and branch) use two
different instruction formats:
There are several major observations about these instruction formats that we
will rely on:
 The op field, also called the opcode, is always contained in bits 31:26. We will
refer to this field as Op[5:0].
 The two registers to be read are always specified by the rs and rt fields, at
positions 25:21 and 20:16. This is true for the R-type instructions, branch
equal, and store.
 The base register for load and store instructions is always in bit positions
25:21 (rs).
 The 16-bit offset for branch equal, load, and store is always in positions 15:0.


FIG: ALU - CONTROL PATH OPERATIONS

Finalizing Control:
INPUT/OUTPUT SIGNALS

Now that we have seen how the instructions operate in steps, let’s continue
with the control implementation. The outputs are the control lines, and the input is
the 6-bit opcode field, Op[5:0]. Thus, we can create a truth table for each of the
outputs based on the binary encoding of the opcodes.
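
For reference, the standard single-cycle settings of these control lines for the four
instruction classes are (X = don't care):

Instruction | RegDst | ALUSrc | MemtoReg | RegWrite | MemRead | MemWrite | Branch | ALUOp1 | ALUOp0
R-format    | 1      | 0      | 0        | 1        | 0       | 0        | 0      | 1      | 0
lw          | 0      | 1      | 1        | 1        | 1       | 0        | 0      | 0      | 0
sw          | X      | 1      | X        | 0        | 0       | 1        | 0      | 0      | 0
beq         | X      | 0      | X        | 0        | 0       | 0        | 1      | 0      | 1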
The figure shows the logic in the control unit as one large truth table that
combines all the outputs and that uses the opcode bits as inputs. It completely
specifies the control function, and we can implement it directly in gates in an
automated fashion.

Now that we have a single-cycle implementation of most of the MIPS core


instruction set, let’s add the jump instruction to show how the basic datapath and
control can be extended to handle other instructions in the instruction set.

SIGNAL EFFECTS

An additional multiplexor: ngi



FIG: ADDITIONAL MULTIPLEXOR

Downloaded From : www.EasyEngineering.ne

This multiplexor is controlled by the jump control signal. The jump target
address is obtained by shifting the lower 26 bits of the jump instruction left 2 bits,
effectively adding 00 as the low-order bits, and then concatenating the upper 4 bits of
PC + 4 as the high-order bits, thus yielding a 32-bit address. The clock cycle is
determined by the longest possible path in the processor. This path is almost
certainly a load instruction, which uses five functional units in series: the instruction
memory, the register file, the ALU, the data memory, and the register file.
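
A small C sketch (assumed variable names, not from the notes) of this jump-target
address computation:

    #include <stdint.h>
    #include <stdio.h>

    int main(void) {
        uint32_t pc    = 0x40001000;  /* address of the jump instruction       */
        uint32_t instr = 0x08100400;  /* j-format: 6-bit opcode, 26-bit target */
        /* lower 26 bits shifted left 2, concatenated with the upper 4 bits
           of PC + 4 */
        uint32_t target = ((pc + 4) & 0xF0000000u) | ((instr & 0x03FFFFFFu) << 2);
        printf("jump target = 0x%08X\n", (unsigned)target); /* 0x40401000 */
        return 0;
    }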

3.(i). Write short notes on Pipelining.

In computers, a pipeline is the continuous and somewhat overlapped
movement of instructions to the processor, or in the arithmetic steps taken by the
processor to perform an instruction. Pipelining is the use of a pipeline. Without a
pipeline, a computer processor gets the first instruction from memory, performs the
operation it calls for, and then goes to get the next instruction from memory, and so
forth. While fetching (getting) the instruction, the arithmetic part of the processor is
idle. It must wait until it gets the next instruction.
With pipelining, the computer architecture allows the next instructions to be
fetched while the processor is performing arithmetic operations, holding them in a
buffer close to the processor until each instruction operation can be performed. The
staging of instruction fetching is continuous. The result is an increase in the number
of instructions that can be performed during a given time period.

Fig. The laundry analogy for pipelining.

Downloaded From : www.EasyEngineering.ne

MIPS instructions classically take five steps:

1. Fetch the instruction from memory.
2. Read registers while decoding the instruction. The regular format of MIPS
instructions allows reading and decoding to occur simultaneously.
3. Execute the operation or calculate an address.
4. Access an operand in data memory.
5. Write the result into a register.
Hence, the MIPS pipeline we explore in this chapter has five stages. The following
example shows that pipelining speeds up instruction execution just as it speeds up
the laundry.

a. Instruction fetch cycle (IF):

Send the program counter (PC) to memory and fetch the current instruction from
memory. PC = PC + 4.

b. Instruction decode/register fetch cycle (ID):
Decode the instruction and read the registers. Do the equality test on the
registers as they are read, for a possible branch. Compute the possible branch
target address by adding the sign-extended offset to the incremented PC. Decoding
is done in parallel with reading registers, which is possible because the register
specifiers are at a fixed location in a RISC architecture, known as fixed-field
decoding.

c. Execution/effective address cycle (EX):

The ALU operates on the operands prepared in the prior cycle, performing
one of three functions depending on the instruction type.
 Memory reference: The ALU adds the base register and the offset to form the
effective address.
 Register-Register ALU instruction: The ALU performs the operation specified
by the ALU opcode on the values read from the register file.
 Register-Immediate ALU instruction: The ALU performs the operation
specified by the ALU opcode on the first value read from the register file and
the sign-extended immediate.

d. Memory access (MEM):
If the instruction is a load, memory does a read using the effective address
computed in the previous cycle. If it is a store, then the memory writes the data
from the second register read from the register file using the effective address.

e. Write-back cycle (WB):

Register-Register ALU instruction or load instruction: write the result into the
register file, whether it comes from the memory system (for a load) or from the ALU.

PIPELINED DATA PATH AND CONTROL

3. (ii)Explain about Pipelined Data Path and Control.


The Classic Five-Stage Pipeline for a RISC Processor:
Each of the clock cycles from the previous section becomes a pipe stage: a
cycle in the pipeline. Each instruction takes 5 clock cycles to complete; during each
clock cycle the hardware will initiate a new instruction and will be executing some
part of five different instructions.
3 observations:
1. Use separate instruction and data memories, which we implement with
separate instruction and data caches.
2. The register file is used in two stages: one for reading in ID and one for
writing in WB, so we need to perform 2 reads and one write every clock cycle.
3. To start a new instruction every clock, we must increment and store the PC
every clock, and this must be done during the IF stage in preparation for the
next instruction.
To ensure that instructions in different stages of the pipeline do not interfere
with one another, pipeline registers are introduced between successive stages of
the pipeline, so that at the end of a clock cycle all the results from a given stage are
stored into a register that is used as the input to the next stage on the next clock
cycle.
nee
Pipelined Datapath and Control:
The division of an instruction into five stages means a five stage pipeline,

ri n
which in turn means that up to
five instructions will be in execution during any single clock cycle. Thus, we must

g .n
separate the datapath into five pieces, with each piece named corresponding to a sta
ge of instruction

e
execution:
4. IF: Instruction fetch
5. ID: Instruction decode and register file read
6. EX: Execution or address calculation
7. MEM: Data memory access
8. WB: Write back

Downloaded From : www.EasyEngineering.ne

FIG: DATAPATH & CONTROL IN PIPELINE PROCESSING
In the figure, these five components correspond roughly to the way the
datapath is drawn; instructions and data move generally from left to right through
the five stages as they complete execution. Returning to our laundry analogy,
clothes get cleaner, drier, and more organized as they move through the line, and
they never move backward.


FIG: PROGRAM EXECUTION ORDER

There are, however, two exceptions to this left-to-right flow of instructions:


 The write-back stage, which places the result back into the register file in the
middle of the datapath
 The selection of the next value of the PC, choosing between the incremented
PC and the branch address from the MEM stage.

Downloaded From : www.EasyEngineering.ne

PIPELINED DATAPATH:

FIG: PIPELINED DATAPATH

It shows the pipelined datapath with the pipeline registers highlighted. All
instructions advance during each clock cycle from one pipeline register to the next.
The registers are named for the two stages separated by that register. For example,
the pipeline register between the IF and ID stages is called IF/ID.
The pipeline registers separate each pipeline stage. They are labeled by the
stages that they separate; for example, the first is labeled IF/ID because it separates
the instruction fetch and instruction decode stages. The registers must be wide
enough to store all the data corresponding to the lines that go through them.

FIVE STAGES OF LOAD INSTRUCTION:


a.Instruction fetch: The top portion of Figure shows the instruction being read
from memory using the address in the PC and then being placed in the IF/ID pipeline
register. The PC address is incremented by 4 and then written back into the PC to be
ready for the next clock cycle. This incremented address is also saved in the IF/ID
pipeline register in case it is needed later for an instruction, such as beq. The
computer cannot know which type of instruction is being fetched, so it must prepare
for any instruction, passing potentially needed information down the pipeline.
b.Instruction decode and register file read: The bottom portion of Figure shows
the instruction portion of the IF/ID pipeline register supplying the 16-bit immediate
field, which is sign-extended to 32 bits, and the register numbers to read the two
registers. All three values are stored in the ID/EX pipeline register, along with the
incremented PC address. We again transfer everything that might be needed by any
64

Downloaded From : www.EasyEngineering.ne


Downloaded From : www.EasyEngineering.ne

instruction during a later clock cycle.


c. Execute or address calculation: The figure shows that the load instruction reads the
contents of register 1 and the sign-extended immediate from the ID/EX pipeline
register and adds them using the ALU. That sum is placed in the EX/MEM pipeline
register.
d. Memory access: The top portion of Figure 4.38 shows the load instruction reading
the data memory using the address from the EX/MEM pipeline register and loading
the data into the MEM/WB pipeline register.
e. Write-back: The bottom portion of Figure 4.38 shows the final step: reading the
data from the MEM/WB pipeline register and writing it into the register file in the
middle of the figure.

FIVE STAGES OF STORE INSTRUCTION:

a. Instruction fetch: The instruction is read from memory using the address in the
PC and then is placed in the IF/ID pipeline register. This stage occurs before the
instruction is identified, so the top portion of the figure works for store as well as
load.
b. Instruction decode and register file read: The instruction in the IF/ID pipeline
register supplies the register numbers for reading two registers and extends the
sign of the 16-bit immediate. These three 32-bit values are all stored in the ID/EX
pipeline register. The bottom portion of the figure for load instructions also shows
the operations of the second stage for stores. These first two stages are executed
by all instructions, since it is too early to know the type of the instruction.
c. Execute and address calculation: The figure shows the third step; the effective
address is placed in the EX/MEM pipeline register.
d. Memory access: The top portion of the figure shows the data being written to
memory. Note that the register containing the data to be stored was read in an
earlier stage and stored in ID/EX. The only way to make the data available during
the MEM stage is to place the data into the EX/MEM pipeline register in the EX
stage, just as we stored the effective address into EX/MEM.
e. Write-back: The bottom portion of the figure shows the final step of the store. For
this instruction, nothing happens in the write-back stage. Since every instruction
behind the store is already in progress, we have no way to accelerate those
instructions. Hence, an instruction passes through a stage even if there is nothing
to do, because later instructions are already progressing at the maximum rate.

Graphically Representing Pipelines:


Pipelining can be difficult to understand, since many instructions are
simultaneously executing in a single datapath in every clock cycle. The multiple-
clock-cycle diagrams are simpler but do not contain all the details. For example,
consider the following five-instruction sequence:

lw $10, 20($1)
sub $11, $2, $3
add $12, $3, $4
lw $13, 24($1)
add $14, $5, $6

Multiple-clock-cycle pipeline diagram of five instructions:

Traditional multiple-clock-cycle pipeline diagram of five instructions:


Pipelined Control:
The first step is to label the control lines on the existing datapath. Figure
shows those lines. We borrow as much as we can from the control for the simple
datapath in Figure. In particular, we use the same ALU control logic, branch logic,
destination-register-number multiplexor, and control lines.

Fig. The pipelined datapath with the control signals identified.

To specify control for the pipeline, we need only set the control values during
each pipeline stage. Because each control line is associated with a component
active in only a single pipeline stage, we can divide the control lines into five
groups according to the pipeline stage.

Fig. The control lines for the final three stages.

a. Instruction fetch: The control signals to read instruction memory and to


write the PC are always asserted, so there is nothing special to control in this
pipeline stage.

b. Instruction decode/register file read: As in the previous stage, the same


thing happens at every clock cycle, so there are no optional control lines to
set.

c. Execution/address calculation: The signals to be set are RegDst, ALUOp,
and ALUSrc. The signals select the Result register, the ALU operation, and
either Read data 2 or a sign-extended immediate for the ALU.

d. Memory access: The control lines set in this stage are Branch, MemRead,
and MemWrite. These signals are set by the branch equal, load, and store
instructions, respectively. Recall that PCSrc in Figure selects the next
sequential address unless control asserts Branch and the ALU result was 0.

e. Write-back: The two control lines are MemtoReg, which decides between
sending the ALU result or the memory value to the register file, and RegWrite,
which writes the chosen value.
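As a rough illustration of this grouping, here is a minimal C sketch in which
each pipeline register carries only the control bits still needed downstream;
the struct and field names are assumptions for illustration, not taken from the
figure.

/* Control bits grouped by the stage that consumes them; each later
   pipeline register drops the group that has just been used. */
#include <stdbool.h>

struct ExControl  { bool RegDst, ALUSrc; unsigned ALUOp; };
struct MemControl { bool Branch, MemRead, MemWrite; };
struct WbControl  { bool MemtoReg, RegWrite; };

struct ID_EX_Ctrl  { struct ExControl ex; struct MemControl m; struct WbControl wb; };
struct EX_MEM_Ctrl { struct MemControl m; struct WbControl wb; };
struct MEM_WB_Ctrl { struct WbControl wb; };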

PIPELINE HAZARDS

4. (i).What is Hazard? Explain its types with suitable examples. (Nov/Dec 2014)

/Explain the different types of pipeline hazards with
suitable examples.(April/May 2015). (16)

There are situations in pipelining when the next instruction cannot execute in
the following clock cycle. These events are called hazards, and there are three
different types.

STRUCTURAL HAZARDS:
The first hazard is called a structural hazard. It
means that the hardware cannot support the combination of instructions that we
want to execute in the same clock cycle. A structural hazard in the laundry room
would occur if we used a washer-dryer combination instead of a separate washer
and dryer or if our roommate
was busy doing something else and wouldn’t put clothes away. Our carefully
scheduled pipeline plans would then be foiled.
Suppose, however, that we had a single memory instead of two memories. If
the pipeline in Figure had a fourth instruction, we would see that in the same
clock cycle the first instruction is accessing data from memory while the fourth
instruction is fetching an instruction from that same memory. Without two memories,
our pipeline could have a structural hazard.
The MIPS instruction set was designed to be pipelined, making it fairly easy for
designers to avoid structural hazards when designing a pipeline.


DATA HAZARDS:
This is when reads and writes of data occur in a different order in the pipeline
than in the program code. There are three different types of data hazard
(named according to the order of operations that must be maintained):

RAW:
A Read After Write hazard occurs when, in the code as written, one instruction
reads a location after an earlier instruction writes new data to it, but in the pipeline the
write occurs after the read (so the instruction doing the read gets stale data).

WAR :
A Write after Read hazard is the reverse of a RAW: in the code a write occurs
after a read, but the pipeline causes write to happen first.

WAW:
A Write after Write hazard is a situation in which two writes occur out of order.

For example, suppose we have an add instruction followed immediately by a
subtract instruction that uses the sum ($s0):

add $s0, $t0, $t1
sub $t2, $s0, $t3

FORWARDING WITH TWO INSTRUCTIONS:
Fig. Graphical representation of forwarding.
The figure shows the connection to forward the value in $s0 after the
execution stage of the add instruction as input to the execution stage of the sub
instruction. In this graphical representation of events, forwarding paths are valid only
if the destination stage is later in time than the source stage. For example, there
cannot be a valid forwarding path from the output of the memory access stage in the
first instruction to the input of the execution stage of the following, since that would
mean going backward in time.

LOAD USE DATA HAZARD:


It cannot prevent all pipeline stalls, however. For example, suppose the first
instruction was a load of $s0 instead of an add. The desired data would be
available only after the fourth stage of the first instruction in the dependence, which
is too late for the input of the third stage of sub. Hence, even with forwarding, we
would have to

stall one stage for a load-use data hazard, as Figure shows. This figure shows an
important pipeline concept, officially called a pipeline stall, but often given the
nickname bubble. We shall see stalls elsewhere in the pipeline.
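A minimal C sketch of the load-use check that triggers such a stall is shown
below; the pipeline-register field names (mem_read, rs, rt) are illustrative
assumptions, not taken from the text.

/* Stall (insert a bubble) when the instruction in EX is a load and
   the instruction in ID needs the loaded register in the next cycle. */
#include <stdbool.h>

struct IdEx { bool mem_read; int rt; };   /* the load's destination register */
struct IfId { int rs, rt; };              /* the next instruction's sources */

bool must_stall(struct IdEx id_ex, struct IfId if_id)
{
    return id_ex.mem_read &&
           (id_ex.rt == if_id.rs || id_ex.rt == if_id.rt);
}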

Reordering Code to Avoid Pipeline Stalls:

Consider the following code segment in C:

a = b + e;
c = b + f;

Here is the generated MIPS code for this segment, assuming all variables are in
memory and are addressable as offsets from $t0:

lw $t1, 0($t0)
lw $t2, 4($t0)
add $t3, $t1, $t2
sw $t3, 12($t0)
lw $t4, 8($t0)
add $t5, $t1, $t4
sw $t5, 16($t0)
Find the hazards in the preceding code segment and reorder the instructions to
avoid any pipeline stalls. Both add instructions have a hazard because of their
respective dependence on the immediately preceding lw instruction. Notice that
bypassing eliminates several other potential hazards, including the dependence of
the first add on the first lw and any hazards for store instructions. Moving up
the third lw instruction to become the third instruction eliminates both hazards:
lw $t1, 0($t0)
lw $t2, 4($t0)
lw $t4, 8($t0)
add $t3, $t1, $t2
sw $t3, 12($t0)
add $t5, $t1, $t4
sw $t5, 16($t0)
On a pipelined processor with forwarding, the reordered sequence will
complete in two fewer cycles than the original version.

In the laundry analogy, choosing a branch prediction is like choosing washer
settings for the uniforms of a football team: given how filthy the laundry is, we
need to determine whether the detergent and water temperature setting we select
is strong enough to get the uniforms clean but not so strong that the uniforms
wear out sooner.

Performance of "Stall on Branch":
Estimate the impact on the clock cycles per instruction (CPI) of stalling on
branches. Assume all other instructions have a CPI of 1. If we cannot resolve the
branch in the second stage, as is often the case for longer pipelines, then we'd
see an even larger slowdown if we stall on branches. The cost of this option is
too high for most computers to use and motivates a second solution to the control
hazard. Computers do indeed use prediction to handle branches. One simple
approach is to predict always that branches will be untaken. When you're right,
the pipeline proceeds at full speed.

(ii).Explain the mechanism to handle data hazards & control hazards in

pipelining.
Hazards:

Prevent the next instruction in the instruction stream from executing during its
designated clock cycle. Hazards reduce the performance from the ideal

speedup gained by pipelining.

Performance of Pipelines with Stalls:
A stall causes the pipeline performance to degrade from the ideal performance.

Speedup from pipelining =
Pipeline depth
-----------------------------------------
1 + Pipeline stall cycles per instruction
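For example, assuming a pipeline depth of 5 and an average of 0.25 stall cycles
per instruction (illustrative numbers only), the speedup is 5 / (1 + 0.25) = 4,
rather than the ideal 5.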

Structural Hazards:
 When a processor is pipelined, the overlapped execution of instructions
requires pipelining of functional units and duplication of resources to allow
all possible combinations of instructions in the pipeline.
 If some combination of instructions cannot be accommodated because of
resource conflicts, the processor is said to have a structural hazard.
Instances:
 When functional unit is not fully pipelined, then a sequence of instructions
using that unpipelined unit cannot proceed at the rate of one per clock cycle.
 when some resource has not been duplicated enough to
allow all combinations of instructions in the pipeline to execute.

To resolve this hazard:


 Stall the pipeline for 1 clock cycle when the data memory access occurs. A
stall is commonly called a pipeline bubble or just bubble, since it floats through
the pipeline taking space but carrying no useful work.

Data Hazards:
 A major effect of pipelining is to change the relative timing of instructions by

overlapping their execution. This overlap introduces data and control hazards.
 Data hazards occur when the pipeline changes the order of read/write
accesses to operands so that the order differs from the order seen by
sequentially executing instructions on an unpipelined processor.

Minimizing Data Hazard Stalls by Forwarding:


The problem solved with a simple hardware technique called forwarding (also
called bypassing and sometimes short-circuiting).

Forwarding works as follows:


The ALU result from both the EX/MEM and MEM/WB pipeline registers is
always fed back to the ALU inputs. If the forwarding hardware detects that the

previous ALU operation has written the register corresponding to a source for the
current ALU operation, control logic selects the forwarded result as the ALU input
rather than the value read from the register file.
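A minimal C sketch of this detection for the first ALU operand is shown below.
It expresses the classic comparison of the destination register in EX/MEM
against the source register in ID/EX; the field names are illustrative
assumptions.

/* Forward the EX/MEM ALU result to the ALU input instead of the
   register-file value when the previous instruction writes the
   register the current instruction reads ($zero never forwards). */
#include <stdbool.h>

struct ExMem { bool reg_write; int rd; };  /* previous ALU instruction */
struct IdEx  { int rs; };                  /* current ALU source */

bool forward_from_ex_mem(struct ExMem ex_mem, struct IdEx id_ex)
{
    return ex_mem.reg_write && ex_mem.rd != 0 && ex_mem.rd == id_ex.rs;
}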

w.E
Data Hazards Requiring Stalls:
 The load instruction has a delay or latency that cannot be eliminated by
forwarding alone. Instead, we need to add hardware, called a pipeline
interlock, to preserve the correct execution pattern.
 A pipeline interlock detects a hazard and stalls the pipeline until the hazard
is cleared.
 This pipeline interlock introduces a stall or bubble. The CPI for the stalled
instruction increases by the length of the stall.

Branch Hazards:
 Control hazards can cause a greater performance loss for our MIPS pipeline.
When a branch is executed, it may or may not change the PC to something
other than its current value plus 4.
 If a branch changes the PC to its target address, it is a taken branch; if it
falls through, it is not taken, or untaken.

Reducing Pipeline Branch Penalties:


 Simplest scheme to handle branches is to freeze or flush the pipeline, holding
or deleting any instructions after the branch until the branch destination is
known.
 A higher-performance, and only slightly more complex, scheme is to treat
every branch as not taken, simply allowing the hardware to continue as if the
branch were not executed.
 In simple five-stage pipeline, this predicted-not-taken or predicted untaken
scheme is implemented by continuing to fetch instructions as if the branch
were a normal instruction.
o The pipeline looks as if nothing out of the ordinary is happening.
o If the branch is taken, however, we need to turn the fetched instruction
into a no-op and restart the fetch at the target address.

Performance of Branch Schemes:


Pipeline speedup =
Pipeline depth
----------------------------------------
1+ Branch frequency × Branch penalty
The branch frequency and branch penalty can have a component from both
unconditional and conditional branches.
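For example, assuming a pipeline depth of 5, a branch frequency of 20%, and a
branch penalty of 1 cycle (illustrative numbers only), the speedup is
5 / (1 + 0.2 × 1) ≈ 4.2 rather than the ideal 5.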

Branch prediction:
A more sophisticated version of branch prediction would have some branches
predicted as taken and some as untaken. In our analogy, the dark or home uniforms
might take one formula while the light or road uniforms might take another. In the

case of programming, at the bottom of loops are branches that jump back to the
top of the loop. Since they are likely to be taken and they branch backward, we could
always predict taken for branches that jump to an earlier address.

FIG: INSTRUCTION EXECUTION ORDER
Predicting that branches are not taken as a solution to control hazard:
One popular approach to dynamic prediction of branches is keeping a history
for each branch as taken or untaken, and then using the recent past behavior
to predict the future.When the guess is wrong, the pipeline control must ensure that
the instructions following the wrongly guessed branch have no effect and must
restart the pipeline from the proper branch address. In our laundry analogy, we must
stop taking new loads so that we can restart the load that we incorrectly predicted.


Dynamic branch prediction:

FIG: DYNAMIC BRANCH PREDICTION
Assuming a branch is not taken is one simple form of branch prediction. In that
case, we predict that branches are untaken, flushing the pipeline when we are
wrong. For the simple five-stage pipeline, such an approach, possibly coupled
with compiler-based prediction, is probably adequate. With deeper pipelines, the
branch penalty increases when measured in clock cycles.

Ideally, the accuracy of the predictor would match the taken branch frequency
for these highly regular branches. To remedy this weakness, 2-bit prediction
schemes are often used. In a 2-bit scheme, a prediction must be wrong twice
before it is changed. Figure shows the finite-state machine for a 2-bit
prediction scheme.
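A minimal C sketch of the 2-bit saturating counter behind such a scheme is shown
below; the state encoding (0-1 predict untaken, 2-3 predict taken) is the
conventional one and is assumed here rather than taken from the figure.

/* 2-bit saturating counter: a prediction must be wrong twice in a
   row before the predicted direction flips. */
#include <stdbool.h>

static unsigned state = 0;            /* 0..3; start strongly untaken */

bool predict_taken(void)
{
    return state >= 2;                /* states 2 and 3 predict taken */
}

void update(bool actually_taken)
{
    if (actually_taken && state < 3)
        state++;                      /* move toward strongly taken */
    else if (!actually_taken && state > 0)
        state--;                      /* move toward strongly untaken */
}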
EXCEPTIONS
5.Explain in detail how exceptions are handled in MIPS
architecture.(April/May 2015).
Control is the most challenging aspect of processor design: it
is both the hardest part to get right and the hardest part to make fast. One
of the hardest parts of control is implementing exceptions and interrupts—events
other than branches or
jumps that change the normal flow of instruction execution. They were initially
created to handle unexpected events from within the processor, like arithmetic
overflow.
Many architectures and authors do not distinguish between interrupts and
exceptions, often using the older name interrupt to refer to both types of
events. For example, the Intel x86 uses interrupt. The MIPS convention, using the
term exception to refer to any unexpected change in control flow without
distinguishing whether the cause is internal or external; we use the term interrupt only
when the event is externally caused. Here are five examples showing whether the
situation is internally generated by the processor or externally generated:

How Exceptions Are Handled in the MIPS Architecture:


The two types of exceptions that our current implementation can generate are
execution of an undefined instruction and an arithmetic overflow. We’ll use arithmetic
overflow in the instruction add $1, $2, $1 as the example exception in the next few
pages. The basic action that the processor must perform when an exception occurs
is to save the address of the offending instruction in the exception program counter
(EPC) and then transfer control to the operating system at some specified address.
The operating system can then take the appropriate action, which may involve

providing some service to the user program, taking some predefined action in
response to an overflow, or stopping the execution of the program and reporting
an error. After performing whatever action is required because of the exception,
the operating system can terminate the program or may continue its execution,
using the EPC to determine where to restart the execution of the program. The
method used in the MIPS architecture is to include a status register (called the
Cause register), which holds a field that indicates the reason for the exception.

A second method is to use vectored interrupts. In a vectored interrupt, the
address to which control is transferred is determined by the cause of the
exception. For example, to accommodate the two exception types listed above, we
might define the following two exception vector addresses:
The operating system knows the reason for the exception by the address at
which it is initiated. The addresses are separated by 32 bytes or eight instructions,
and the operating system must record the reason for the exception and may perform
some limited processing in this sequence. When the exception is not vectored, a
single entry point for all exceptions can be used, and the operating system decodes
the status register to find the cause. We will need to add two additional registers to
the MIPS implementation:
EPC: A 32-bit register used to hold the address of the affected
instruction. (Such a register is needed even when exceptions are vectored.)
Cause: A register used to record the cause of the exception. In the
MIPS architecture, this register is 32 bits, although some bits are currently
unused.
Assume there is a five-bit field that encodes the two possible exception
sources mentioned above, with 10 representing an undefined instruction and
12 representing arithmetic overflow.
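As a rough illustration of the non-vectored, single-entry-point style, the
following C sketch decodes the five-bit field of the Cause register and
dispatches on it. The bit positions and the handler names are illustrative
assumptions; only the two codes mentioned above are modeled.

/* Single entry point: the OS reads the Cause register and branches
   on the exception code field (bit layout assumed for illustration). */
#include <stdint.h>

#define EXC_CODE(cause) (((cause) >> 2) & 0x1F)   /* assumed 5-bit field */

static void handle_undefined_instruction(uint32_t epc) { (void)epc; }
static void handle_arithmetic_overflow(uint32_t epc)   { (void)epc; }

void exception_entry(uint32_t cause, uint32_t epc)
{
    switch (EXC_CODE(cause)) {
    case 10: handle_undefined_instruction(epc); break;  /* per the notes */
    case 12: handle_arithmetic_overflow(epc);   break;  /* per the notes */
    default: break;            /* other causes not modeled in this sketch */
    }
}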
Exceptions in a Pipelined Implementation.
For example, suppose there is an arithmetic overflow in an add instruction.
Just as we did for the taken branch in the previous section, we must flush the

instructions that follow the add instruction from the pipeline and begin fetching
instructions from the new address.
A new control signal, called ID.Flush, is ORed with the stall signal from the
hazard Detection unit to flush during ID. To flush the instruction in the EX phase, we
use a new signal called EX. Flush to cause new multiplexors to zero the control lines.
Exception in a Pipelined Computer
Given this instruction sequence,

40hex  sub $11, $2, $4
44hex  and $12, $2, $5
48hex  or  $13, $2, $6
4Chex  add $1,  $2, $1
50hex  slt $15, $6, $7
54hex  lw  $16, 50($7)
...

show what happens in the pipeline if an overflow exception occurs in the add
instruction.

The difficulty of always associating the correct exception with the correct
instruction in pipelined computers has led some computer designers to relax this
requirement in noncritical cases. Such processors are said to have imprecise
interrupts or imprecise exceptions. In the example above, PC would normally have
58hex at the start of the clock cycle after the exception is detected, even
though the offending instruction is at address 4Chex. A processor with imprecise
exceptions might put 58hex into EPC and leave it up to the operating system to
determine which instruction caused the problem. MIPS and the vast majority of
computers today support precise interrupts or precise exceptions.


UNIT 4 PARALLELISM

PART A

1. Define speculation (NOV/DEC 14)


It is an approach that allows the compiler or the processor to “guess” about the
properties of an instruction, so as to enable execution to begin for other
instructions that may depend on the speculated instruction. For example, we might
speculate on the outcome of a branch, so that instructions after the branch could
be executed earlier.

2. Differentiate between strong scaling and weak scaling. (APR/MAY 15)

Strong scaling
Speedup achieved on a multiprocessor without increasing the size of the problem
is called strong scaling.

Weak scaling
Speedup achieved on a multiprocessor while increasing the size of the problem
proportionally to the increase in the number of processors is called weak
scaling.

ngi
3. What is Flynn's Classification? (Nov/Dec 2014)
Flynn uses the stream concept for describing a machine's structure. A stream
simply means a sequence of items (data or instructions). The classification of
computer architectures is based on the number of instruction streams and data
streams (Flynn's Taxonomy):
SISD: Single-Instruction stream, Single-Data stream
SIMD: Single-Instruction stream, Multiple-Data streams
MISD: Multiple-Instruction streams, Single-Data stream
MIMD: Multiple-Instruction streams, Multiple-Data streams

4. What is meant by hardware multithreading? (Nov/Dec 2014).


Hardware multithreading allows multiple threads to share the functional
units of a single processor in an overlapping fashion to try to utilize the
hardware resources efficiently. To permit this sharing, the processor must
duplicate the independent state of each thread. It Increases the utilization of a
processor.

5. What is ILP? (N/D’14 & 15 & 16)


Pipelining exploits the potential parallelism among instructions. This
parallelism is called instruction-level parallelism (ILP). There are two primary methods
for increasing the potential amount of instruction-level parallelism. The first
is increasing the depth of the pipeline to overlap more instructions.

6. Define loop unrolling


A technique to get more performance from loops that access arrays, in which
multiple copies of the loop body are made and instructions from different
iterations are scheduled together.


7. What is Anti-dependence?
It is an ordering forced by the reuse of a name, typically a register, rather than
by a true dependence that carries a value between two instructions.

8. Define – Static Multiple Issue, Issue Slots and Issue Packet.


Static multiple issue is an approach to implement a multiple-issue processor
where many decisions are made by the compiler before execution.
Issue slots are the positions from which instructions could be issued in a
given clock cycle. By analogy, these correspond to positions at the starting
blocks for a sprint.
Issue packet is the set of instructions that issues together in one clock cycle;

the packet may be determined statically by the compiler or dynamically by the
processor.

w.E
9. Define – Superscalar Processor and VLIW. (N/D’ 15)
Superscalar is an advanced pipelining technique that enables the processor to
execute more than one instruction per clock cycle by selecting them during
execution. Dynamic multiple-issue processors are also known as superscalar
processors, or simply superscalars.
Very Long Instruction Word (VLIW) is a style of instruction set architecture
that launches many operations that are defined to be independent in a single wide
instruction, typically with many separate opcode fields.

10. Compare UMA and NUMA multiprocessors. (APR/MAY 15)


Uniform Memory Access (UMA)
Latency to a word in memory does not depend on which processor asks for it. That
is, the time taken to access a word is uniform for all processors.
Non – Uniform Memory Access (NUMA)
 Some memory accesses are much faster than others depending on which
processor asks for which word.
 Here the main memory is divided and attached to different microprocessors or to
different memory controllers on the same chip.
 Programming challenges are harder for NUMA than UMA.
 NUMA machines can scale to larger size.
 NUMA s can have lower latency to nearby memory.


PART – B

INSTRUCTION LEVEL PARALLELISM

1. Explain in detail about Instruction Level Parallelism. (Nov/Dec 2014)


Instruction-level parallelism (ILP) is a measure of how many of the
operations in a computer program can be performed simultaneously. The potential
overlap among instructions is called instruction level parallelism.
Pipelining can overlap the execution
of instructions when they are
independent of one another. This potential overlap among instructions is called
instruction-level parallelism (ILP) since the instructions can be evaluated in parallel.

The amount of parallelism available within a basic block (a straight-line code
sequence with no branches in and out except for entry and exit) is quite small. The
average dynamic branch frequency in integer programs was measured to be about

15%, meaning that about 7 instructions execute between a pair of branches. Since
the instructions are likely to depend upon one another, the amount of overlap we
can exploit within a basic block is likely to be much less than 7. To obtain
substantial performance enhancements, we must exploit ILP across multiple basic
blocks.

The simplest and most common way to increase the amount of parallelism available
among instructions is to exploit parallelism among iterations of a loop. This
type of parallelism is often called loop-level parallelism.

Example 1:

for (i=1; i<=1000; i= i+1)
    x[i] = x[i] + y[i];

This is a parallel loop. Every iteration of the loop can overlap with any other
iteration, although within each loop iteration there is little opportunity for
overlap.

Example 2:

for (i=1; i<=100; i= i+1){
    a[i] = a[i] + b[i];    //s1
    b[i+1] = c[i] + d[i];  //s2
}

Is this loop parallel? If not, how can it be made parallel?
parallel? If not how to make it parallel?
Statement s1 uses the value assigned in the previous iteration by statement s2, so
there is a loop-carried dependency between s1 and s2. Despite this dependency, this
loop can be made parallel because the dependency is not circular:
 neither statement depends on itself;
 while s1 depends on s2, s2 does not depend on s1.
A loop is parallel unless there is a cycle in the dependencies, since the
absence of a cycle means that the dependencies give a partial ordering on the
statements. To expose the parallelism the loop must be transformed to conform to

the partial order. Two observations are critical to this transformation: There is no
dependency from s1 to s2. Then, interchanging the two statements will not affect the
execution of s2.
On the first iteration of the loop, statement s1 depends on the value of b[1]
computed prior to initiating the loop. This allows us to replace the loop above with the
following code sequence, which makes possible overlapping of the iterations of the
loop:
a[1] = a[1] + b[1];
for (i=1; i<=99; i= i+1){
    b[i+1] = c[i] + d[i];
    a[i+1] = a[i+1] + b[i+1];
}
b[101] = c[100] + d[100];

Example 3:

for (i=1; i<=100; i= i+1){
    a[i+1] = a[i] + c[i];    //S1
    b[i+1] = b[i] + a[i+1];  //S2
}
This loop is not parallel because it has cycles in the dependencies, namely
the statements S1 and S2 depend on themselves.
Parallelism and Advanced Instruction-Level Parallelism:
e
Pipelining exploits the potential parallelism among instructions. This
parallelism is called instruction-level parallelism (ILP). There are two
primary methods for increasing the
potential amount of instruction-level parallelism. The first is increasing the depth of
the pipeline to overlap more instructions. Using our laundry analogy and
assuming that the washer cycle was longer than the others were, we
could divide our washer into three machines that perform the wash, rinse, and spin
steps of a traditional washer.
We would then move from a four-stage to a six-stage pipeline. To get the full
speed-up, we need to rebalance the remaining steps so they are the same
length, in processors or in laundry. The amount of parallelism being exploited is
higher, since there are more operations being overlapped. Performance is potentially
greater since the clock cycle can be shorter.

a.Packaging instructions into issue slots: how does the processor


determine how many instructions and which instructions can be issued in a given
clock cycle? In most static issue processors, this process is at least partially handled
by the compiler; in dynamic issue designs, it is normally dealt with at runtime by the
processor, although the compiler will often have already tried to help improve the
issue rate by placing the instructions in a beneficial order.
b. Dealing with data and control hazards: in static issue processors, some or
all of the consequences of data and control hazards are handled statically by
the compiler; in dynamic issue processors, the hardware attempts to alleviate at
least some classes of hazards using hardware techniques operating at execution
time.

2. State the challenges of Parallel Processing.(Nov/Dec 2014)

STATIC MULTIPLE ISSUE:


Static multiple-issue processors all use the compiler to
assist with packaging
instructions and handling hazards. In a static issue processor, you can think of
the set of instructions issued in a given clock cycle, which is called an issue packet,
as one large instruction with multiple operations. This view is more than an analogy.
Since a static multiple-issue processor usually restricts what mix of instructions can
be initiated in a given clock cycle, it is useful to think of the issue packet as a single
instruction allowing several operations in certain predefined fields. This view led to
the original name for this approach: Very Long Instruction Word (VLIW).

Most static issue processors also rely on the compiler to take on some
responsibility for handling data and control hazards. The compiler’s
responsibilities may include static branch prediction and code scheduling to
reduce or prevent all hazards. Let’s look at a simple static issue version of a
MIPS processor, before we describe the use of these techniques in more
aggressive processors.
INSTRUCTION TYPE & PIPE STAGES
An Example: Static Multiple Issue with the MIPS ISA:
To give a flavor of static multiple issue, we consider a simple two-issue MIPS
processor, where one of the instructions can be an integer ALU operation or
branch and the other can be a load or store. Such a design is like that used in
some embedded MIPS processors.
Dynamic Multiple-Issue Processors:
Dynamic multiple-issue processors are also known as superscalar processors,
or simply superscalars. In the simplest superscalar processors, instructions
issue in order, and the processor decides whether zero, one, or more instructions can
issue in a given clock cycle. Obviously, achieving good performance on such a
processor still requires the compiler to try to schedule instructions to move
dependences apart and thereby improve the instruction issue rate. Even with such
compiler scheduling, there is an important difference between this simple superscalar
and a VLIW processor.


FIG: LOOP INSTRUCTION EXECUTION


Many superscalars extend the basic framework of dynamic issue decisions to
include dynamic pipeline scheduling. Dynamic pipeline scheduling chooses which
instructions to execute in a given clock cycle while trying to avoid hazards
and stalls.

Let’s start with a simple example of avoiding a data hazard. Consider the
following code sequence:

lw   $t0, 20($s2)
addu $t1, $t0, $t2
sub  $s4, $s4, $t3
slti $t5, $s4, 20

Even though the sub instruction is ready to execute, it must wait for the lw and
addu to complete first, which might take many clock cycles if memory is slow.
Dynamic pipeline scheduling allows such hazards to be avoided either fully or
partially.

Dynamic Pipeline Scheduling:


Dynamic pipeline scheduling chooses which instructions to
e
execute next, possibly reordering them to avoid stalls. In
such processors, the pipeline is divided into three major units: an instruction fetch
and issue unit, multiple functional units (a
dozen or more in high-end designs in 2008), and a commit unit. Figure shows
the model. The first unit fetches instructions, decodes them, and sends
each instruction to a corresponding functional unit for execution. Each functional unit
has buffers, called reservation stations, which hold the operands and the operation.
When the result is completed, it is sent to any reservation stations waiting for
this particular result as well as to the commit unit, which buffers the result until it is
safe to put the result into the register file or, for a store, into memory. The buffer in
the commit unit, often called the reorder buffer, is also used to supply operands, in
much the same way as forwarding logic does in a statically scheduled pipeline. Once
a result is committed to the register file, it can be fetched directly from there, just as in
a normal pipeline.

Fig. The three primary units of a dynamically scheduled pipeline
To make programs behave as if they were running on a simple in-order pipeline,
the instruction fetch and decode unit is required to issue instructions in
order, which allows dependences to be tracked, and the commit unit is required
to write results to registers and memory in program fetch order. This
conservative mode is called in-order commit.
There are at least two different kinds of parallelism in computing.
• Using multiple processors to work toward a given goal, with each processor
running its own program.
• Using only a single processor to run a single program, but allowing
instructions from that program to execute in parallel.
The latter is called instruction-level parallelism, or ILP.
LIMITATIONS OF ILP:
THE HARDWARE MODEL:
An ideal processor is one where all constraints on ILP are removed. The only
limits on ILP in such a processor are those imposed by the actual data flows through
either registers or memory. The assumptions made for an ideal or perfect processor
are as follows:

a.Register renaming:
There are an infinite number of virtual registers available, and hence all WAW
and WAR hazards are avoided and an unbounded number of instructions can begin
execution simultaneously.
b.Branch prediction:
Branch prediction is perfect. All conditional branches are predicted
exactly.
c.Jump prediction:
All jumps (including jump register used for return and computed jumps) are
perfectly predicted. When combined with perfect branch prediction, this is equivalent

to having a processor with perfect speculation and an unbounded buffer of


instructions available for execution.
d.Memory address alias analysis:
All memory addresses are known exactly, and a load can be moved before a
store provided that the addresses are not identical. Note that this implements perfect
address alias analysis.
e.Perfect caches:
All memory accesses take 1 clock cycle. In practice, superscalar processors
will typically consume large amounts of ILP hiding cache misses, making these
results highly optimistic. To measure the available parallelism, a set of programs was
compiled and optimized with the standard MIPS optimizing compilers. The
programs were instrumented and executed to produce a trace of the instruction and
data references. Every instruction in the trace is then scheduled as early as possible,
limited only by the data dependences. Since a trace is used, perfect branch

prediction and perfect alias analysis are easy to do.

LIMITATIONS ON THE WINDOW SIZE AND MAXIMUM ISSUE COUNT:
To build a processor that even comes close to perfect branch prediction and
perfect alias analysis requires extensive dynamic analysis, since static compile
time schemes cannot be perfect. Of course, most realistic dynamic schemes will
not be perfect, but the use of dynamic schemes will provide the ability to
uncover parallelism that cannot be analyzed by static compile time analysis.
Thus, a dynamic processor might be able to more closely match the amount of
parallelism uncovered by our ideal processor.

THE EFFECTS OF REALISTIC BRANCH AND JUMP PREDICTION:
Our ideal processor assumes that branches can be perfectly predicted: the
outcome of any branch in the program is known before the first instruction is
executed! Of course, no real processor can ever achieve this. We assume a
separate predictor is used for jumps. Jump predictors are important primarily
with the most accurate branch predictors, since the branch frequency is higher
and the accuracy of the branch predictors dominates.

a. Perfect : All branches and jumps are perfectly predicted at the start of
execution.
b. Tournament-based branch predictor: The prediction scheme uses a
correlating 2-bit predictor and a noncorrelating 2-bit predictor together with a
selector, which chooses the best predictor for each branch.

THE EFFECTS OF FINITE REGISTERS:


Our ideal processor eliminates all name dependences among register
references using an infinite set of virtual registers. To date, the IBM Power5 has
provided the largest numbers of virtual registers: 88 additional floating-point and 88
additional integer registers, in addition to the 64 registers available in the base
architecture. All 240 registers are shared by two threads when executing in


multithreading mode, and all are available to a single thread when in single-thread
mode.

THE EFFECTS OF IMPERFECT ALIAS ANALYSIS:


Our optimal model assumes that it can perfectly analyze all memory
dependences, as well as eliminate all register name dependences. Of course, perfect
alias analysis is not possible in practice: The analysis cannot be perfect at
compile time, and it requires a potentially unbounded number of comparisons at run
time (since the number of simultaneous memory references is unconstrained).

FLYNN'S CLASSIFICATION

3.Discuss about SISD,MIMD,SIMD,SPMD and Vector systems.

(April/May 2015).(N/D’ 15, 16)(16)
Flynn uses the stream concept for describing a machine's structure. A stream
simply means a sequence of items (data or instructions). The classification of
computer architectures is based on the number of instruction streams and data
streams (Flynn’s Taxonomy).

SISD (Single-Instruction stream, Single-Data stream):
SISD corresponds to the traditional mono-processor (von Neumann computer). A
single data stream is being processed by one instruction stream. A
single-processor computer (uni-processor) in which a single stream of
instructions is generated from the program.
SIMD (Single-Instruction stream, Multiple-Data streams):

Each instruction is executed on a different set of data by different processors


i.e multiple processing units of the same type process on multiple-data streams.
This group is dedicated to array processing machines. Sometimes, vector
processors can also be seen as a part of this group.
MISD (Multiple-Instruction streams, Single-Data stream):
Each processor executes a different sequence of instructions. In case of MISD
computers, multiple processing units operate on one single-data stream. In
practice, this kind of organization has never been used.

MIMD (Multiple-Instruction streams, Multiple-Data streams):
Each processor has a separate program. An instruction stream is generated from
each program. Each instruction operates on different data. This last machine
type builds the group for the traditional multi-processors. Several processing
units operate on multiple-data streams.

SISD, MIMD, SIMD, SPMD, and Vector:
Another categorization of parallel hardware proposed in the 1960s is still used
today. It was based on the number of instruction streams and the number of
data streams. Figure shows the categories. Thus, a conventional uniprocessor has a
single instruction stream and single data stream, and a conventional multiprocessor
has multiple instruction streams and multiple data streams. These two categories are
abbreviated SISD and MIMD, respectively.

While it is possible to write separate programs that run on different processors



on a MIMD computer and yet work together for a grander, coordinated goal,
programmers normally write a single program that runs on all processors of an MIMD
computer, relying on conditional statements when different processors should
execute different sections of code. This style is called Single Program Multiple Data
(SPMD), but it is just the normal way to program a MIMD computer.
While it is hard to provide examples of useful computers that would be
classified as multiple instruction streams and single data stream (MISD), the
inverse makes much more sense. SIMD computers operate on vectors of data. For
example, a single SIMD instruction might add 64 numbers by sending 64 data
streams to 64 ALUs to form 64 sums within a single clock cycle.
The original motivation behind SIMD was to amortize the cost of the control
unit over dozens of execution units. Another advantage is the reduced size of
program memory—SIMD needs only one copy of the code that is being
simultaneously executed, while message-passing MIMDs may need a copy in every

processor, and shared memory MIMD will need multiple instruction caches.
SIMD works best when dealing with arrays in for loops. Hence, for parallelism

to work in SIMD there must be a great deal of identically structured data, which
is called data-level parallelism. SIMD is at its weakest in case or switch
statements, where each execution unit must perform a different operation on its
data, depending on what data it has. Execution units with the wrong data are
disabled so that units with proper data may continue. Such situations
essentially run at 1/nth performance, where n is the number of cases.
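A minimal C sketch of the kind of loop SIMD handles best is shown below: every
iteration applies the same operation to identically structured data with no
dependence between iterations, so many elements can be processed per
instruction. The function and array names are illustrative.

/* Data-level parallelism: one operation over many elements, with no
   dependence between iterations, so a SIMD machine (or a vectorizing
   compiler) can process several elements at once. */
#define N 1024

void saxpy(float a, const float x[N], float y[N])
{
    for (int i = 0; i < N; i++)
        y[i] = a * x[i] + y[i];
}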

SIMD in x86: Multimedia Extensions:
The most widely used variation of SIMD is found in almost every microprocessor
today, and is the basis of the hundreds of MMX and SSE instructions of the x86
microprocessor (see Chapter 2). They were added to improve performance of
multimedia programs. These instructions allow the hardware to have many ALUs
operate simultaneously or, equivalently, to partition a single, wide ALU into
many parallel smaller ALUs that operate simultaneously.
Vector:
An older and more elegant interpretation of SIMD is called a vector architecture,
which has been closely identified with Cray Computers. It is again a great match to
problems with lots of data-level parallelism. Rather than having 64 ALUs perform 64
additions simultaneously, like the old array processors, the vector architectures
pipelined the ALU to get good performance at lower cost. A key feature of vector
architectures is a set of vector registers. Thus, vector architecture might have 32
vector registers, each with 64 64-bit elements.
Vector versus Scalar:
Vector instructions have several important properties compared to conventional
instruction set architectures, which are called scalar architectures in this context:
■ A single vector instruction specifies a great deal of work—it is equivalent to
executing an entire loop. The instruction fetch and decode bandwidth needed is
dramatically reduced.
■ By using a vector instruction, the compiler or programmer indicates that the
computation of each result in the vector is independent of the computation of other


results in the same vector, so hardware does not have to check for data hazards
within a vector instruction.
■Vector architectures and compilers have a reputation of making it much easier than
MIMD multiprocessors to write efficient applications when they contain data-level
parallelism.
■Hardware need only check for data hazards between two vector instructions once
per vector operand, not once for every element within the vectors. Reduced checking
can save power as well.
■ Vector instructions that access memory have a known access pattern. If the
vector’s elements are all adjacent, then fetching the vector from a set of heavily
interleaved memory banks works very well. Thus, the cost of the latency to main
memory is seen only once for the entire vector, rather than once for each word of the
vector.

■ Because an entire loop is replaced by a vector instruction whose behavior is
predetermined, control hazards that would normally arise from the loop branch
are nonexistent.
■ The savings in instruction bandwidth and hazard checking plus the efficient
use of memory bandwidth give vector architectures advantages in power and energy
versus scalar architectures.
ngi
HARDWARE MULTITHREADING

4. Explain about Hardware Multithreading. (Nov/Dec 2014) /
What is hardware multithreading? Compare and contrast fine-grained
multithreading and coarse-grained multithreading. (April/May 2015) (N/D 14)
Exploiting Thread-Level Parallelism within a Processor:

Multithreading allows multiple threads to share the functional units of a single
processor in an overlapping fashion. To permit this sharing, the processor
must duplicate the independent state of each thread. For example, a separate copy
of the register file, a separate PC, and a separate page table are required for each
thread.
Multithreading enables thread-level parallelism (TLP) by duplicating the
architectural state on each processor, while sharing only one set of processor
execution resources.
When scheduling threads, the operating system treats those distinct
architectural states as separate "logical" processors. Logical processors are the
hardware support for sharing the functional units of a single processor among
different threads. There are several different sharing mechanism for different
structures. The kind of state a structure stores decides what sharing mechanism the
structure needs.


Category      Resources
Replicated    Program counter (PC), architectural registers, register renaming logic
Partitioned   Re-order buffers, load/store buffers, various queues (like the
              scheduling queue, etc.)
Shared        Caches, physical registers, execution units

Table: Sharing Mechanisms

Replicated resources are the kind of resources that you just cannot get around
replicating if you want to maintain two fully independent contexts on each
logical processor. The most obvious of these is the program counter (PC), which
is the pointer that helps the processor keep track of its place in the
instruction stream by pointing to the next instruction to be fetched. We need a
separate PC for each thread to keep track of its instruction stream.

FINE-GRAINED MULTITHREADING (FGMT):

Fine-grained multithreading switches between threads on each instruction,
causing the execution of multiple threads to be interleaved.
g .n
As illustrated above, this cycle-by-cycle interleaving is often done in a round-
robin fashion, skipping any threads that are stalled due to branch mispredict or

e
cache miss or any other reason. But the thread scheduling policy is not limited to the
cycle-
by-cycle (round-robin) model; other scheduling policy can also be applied too.


Although FGMT can hide performance losses due to stalls caused by any reason,
there are two main drawbacks to the FGMT approach:

FIG: FGMT
FGMT sacrifices the performance of the individual threads. It needs a lot of
threads to hide the stalls, which also means a lot of registers.
COARSE-GRAINED MULTITHREADING (CGMT):
Coarse-grained multithreading won't switch out the executing thread until it
reaches a situation that triggers a switch. This situation occurs when the instruction
execution reaches either a long-latency operation or an explicit additional switch
operation.
CGMT was invented as an alternative to FGMT, so it won't repeat the primary
disadvantage of FGMT: severe limits on single-thread performance. CGMT makes the
most sense on an in-order processor that would normally stall the pipeline on a
cache miss (using the CGMT approach, rather than stall, the pipeline is filled
with ready instructions from an alternative thread).
Since instructions following the missing instruction may already be in the

pipeline, they need to be drained from the pipeline. Similarly, instructions
from the new thread will not reach the execution stage until they have traversed
the earlier pipeline stages. The cost of draining out and refilling the pipeline
is considered the thread-switch penalty, and it depends on the length of the
pipeline. So normally CGMT needs a short in-order pipeline for good performance.

There are two main approaches to multithreading.


1. Fine-grained multithreading switches between threads on each instruction,
causing the execution of multiple threads to be interleaved. This
interleaving is often done in a round-robin fashion, skipping any threads
that are stalled at that time.
2. Coarse-grained multithreading was invented as an alternative to fine-grained
multithreading. Coarse-grained multithreading switches threads only on costly
stalls, such as level-two cache misses. This change relieves the need to have
thread switching be essentially free and is much less likely to slow the
processor down, since instructions from other threads will only be issued when a
thread encounters a costly stall.

SIMULTANEOUS MULTITHREADING:
Converting Thread-Level Parallelism into Instruction-Level Parallelism.
Simultaneous multithreading (SMT) is a variation on multithreading that uses the
resources of a multiple-issue, dynamically scheduled processor to exploit TLP at
the same time it exploits ILP.

The key insight that motivates SMT is that modern multiple-issue processors
often have more functional unit parallelism available than a single thread can
effectively use. Furthermore, with register renaming and dynamic scheduling,
multiple instructions from independent threads can be issued without regard to
the dependences among them; the resolution of the dependences can be handled by
the dynamic scheduling capability.
The following figure illustrates the differences in a processor’s ability to
exploit the resources of a superscalar for the following processor
configurations:
o A superscalar with no multithreading support,
o A superscalar with coarse-grained Multithreading,
o A superscalar with fine-grained Multithreading, and
o A superscalar with Simultaneous multithreading.


FIG. ISSUE SLOTS

In the superscalar without multithreading support, the use of issue slots is


limited by a lack of ILP. In the coarse-grained multithreaded superscalar, the long stalls

are partially hidden by switching to another thread that uses the resources of
the processor.
In the fine-grained case, the interleaving of threads eliminates fully empty
slots, because only one thread issues instructions in a given clock cycle. In
the SMT case, thread-level parallelism (TLP) and instruction-level parallelism
(ILP) are exploited simultaneously, with multiple threads using the issue slots
in a single clock cycle.

MULTICORE PROCESSORS

5.Explain about Multicore Processors. (Nov/Dec 2014).

Fig. Classic organization of a shared memory multiprocessor.

A shared memory multiprocessor (SMP) is one that offers the programmer a


single physical address space across all processors, although a more accurate term
would have been shared-address multiprocessor. Note that such systems can still run
independent jobs in their own virtual address spaces, even if they all share a physical
address space. Processors communicate through shared variables in memory, with
all processors capable of accessing any memory location via loads and stores.
Figure shows the classic organization of an SMP. Single address space
multiprocessors come in two styles. The first takes about the same time to access

main memory no matter which processor requests it and no matter which word is
requested. Such machines are called uniform memory access (UMA)
multiprocessors.
In the second style, some memory accesses are much faster than others,
depending on which processor asks for which word. Such machines are called
nonuniform memory access (NUMA) multiprocessors. As you might expect, the
programming challenges are harder for a NUMA multiprocessor than for a
UMA multiprocessor, but NUMA machines can scale to larger sizes and NUMAs can
have lower latency to nearby memory.
As processors operating in parallel will normally share data, they also need to
coordinate when operating on shared data; otherwise, one processor could
start working on data before another is finished with it. This coordination is called
synchronization. When sharing is supported with a single address space, there must
be a separate mechanism for synchronization. One approach uses a lock for a
shared variable. Only one processor at a time can acquire the lock, and other
processors interested in shared data must wait until the original processor
unlocks the variable.
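A minimal C sketch of this lock discipline, using a POSIX mutex to serialize
updates to a shared variable, is shown below; the variable and function names
are illustrative.

/* Only one thread at a time holds the lock, so updates to the shared
   variable cannot interleave. */
#include <pthread.h>

static long shared_counter = 0;
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

void *worker(void *arg)
{
    (void)arg;
    for (int i = 0; i < 100000; i++) {
        pthread_mutex_lock(&lock);     /* acquire: others must wait */
        shared_counter++;              /* the protected shared data */
        pthread_mutex_unlock(&lock);   /* release: let a waiter in  */
    }
    return NULL;
}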

CLUSTERS AND OTHER MESSAGE-PASSING MULTIPROCESSORS:

The alternative approach to sharing an address space is for the processors to
each have their own private physical address space. Figure shows the classic
organization of a multiprocessor with multiple private address spaces. This
alternative multiprocessor must communicate via explicit message passing, which
traditionally is the name of such a style of computers. Provided the system has
routines to send and receive messages, coordination is built in with message
passing, since one processor knows when a message is sent, and the receiving
processor knows when a message arrives. If the sender needs confirmation that
the message has arrived, the receiving processor can then send an acknowledgment
message back to the sender.

FIG: Multiprocessor with Multiple address spaces

Some concurrent applications run well on parallel hardware, independent of


whether it offers shared addresses or message passing. In particular, job-level
parallelism and applications with little communication like Web search, mail servers,
and file servers do not require shared addressing to run well. There were several
attempts to build high-performance computers based on high performance message-
passing networks, and they did offer better absolute communication performance
than clusters built using local area networks.


The problem was that they were much more expensive. Few applications
could justify the higher communication performance, given the much higher
costs. Hence, clusters have become the most widespread example today of the
message- passing parallel computer.
Clusters are generally collections of commodity computers that are connected
to each other over their I/O interconnect via standard network switches and cables.
Each runs a distinct copy of the operating system. Virtually every Internet service
relies on clusters of commodity servers and switches.

LIMITATIONS:
a. One drawback of clusters has been that the cost of administering a cluster of n
machines is about the same as the cost of administering n independent machines,
while the cost of administering a shared memory multiprocessor with n processors
is about the same as administering a single machine. This weakness is one of the
reasons for the popularity of virtual machines, since VMs make clusters easier to
administer. For example, VMs make it possible to stop or start programs atomically,
which simplifies software upgrades. VMs can even migrate a program from one
computer in a cluster to another without stopping the program, allowing a program
to migrate away from failing hardware.
b. Another drawback of clusters is that the processors in a cluster are usually
connected using the I/O interconnect of each computer, whereas the cores in a
multiprocessor are usually connected on the memory interconnect of the computer.
The memory interconnect has higher bandwidth and lower latency, allowing much
better communication performance.
c. A final weakness is the overhead in the division of memory: a cluster of n
machines has n independent memories and n copies of the operating system, but a
shared memory multiprocessor allows a single program to use almost all the memory
in the computer, and it needs only a single copy of the OS.
PROPERTIES OF MULTI-CORE SYSTEMS:
o Cores will be shared with a wide range of other applications
dynamically. Load can no longer be considered symmetric
across the cores.
o Cores will likely not be symmetric, as accelerators become common
for scientific hardware.
o Source code will often be unavailable, preventing compilation
against the specific hardware configuration.

APPLICATIONS THAT BENEFIT FROM MULTI-CORE:


o Database servers
o Web servers

UNIT – 5 MEMORY AND IO SYSTEMS


PART A

1. What is the need to implement memory as a hierarchy?


(April/May 2015)
Computers use different types of memory units for different types of purposes.
Each memory unit has its own advantages and disadvantages. A structure that uses
multiple levels of memories is called hierarchy. A memory hierarchy consists of
multiple levels of memory with different speeds and sizes.
The faster memories are more expensive per bit than the slower memories
and thus are smaller. As the distance from the processor increases, the size of the
memories and the access time both increase.

2. Compare SRAM with DRAM. (Nov/Dec 2013)
SRAMs are simply integrated circuits that are memory arrays with a single access
port that can provide either a read or a write. SRAMs have a fixed access time to
any datum. SRAMs don't need to refresh, and so the access time is very close to the
cycle time. SRAMs typically use six to eight transistors per bit to prevent the
information from being disturbed when read. SRAM needs only minimal power to
retain the charge in standby mode.
In a dynamic RAM (DRAM), the value kept in a cell is stored as a charge
in a capacitor. A single transistor is then used to access this stored charge, either to
read the value or to overwrite the charge stored there. Because DRAMs use only a
single transistor per bit of storage, they are much denser and cheaper per bit than
SRAM. As DRAMs store the charge on a capacitor, it cannot be kept indefinitely and
must periodically be refreshed.
3. What is meant by interleaved memory? (May/June 2012)
In computing, interleaved memory is a design made to compensate for the
relatively slow speed of dynamic random-access memory (DRAM) or core memory,
e
by spreading memory addresses evenly across memory banks. That way, contiguous
memory reads and writes are using each memory bank in turn, resulting in
higher memory throughputs due to reduced waiting for memory banks to become
ready for desired operations.
It is different from multi-channel memory architectures, primarily because
interleaved memory does not add more channels between the main memory and the
memory controller. However, channel interleaving is also possible, for example in
Freescale i.MX6 processors, which allow interleaving to be done between two channels.
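
A quick sketch of the idea in C (the four-bank layout is an assumption chosen
only for illustration): with low-order interleaving, consecutive word addresses
fall in different banks, so sequential accesses can overlap.

#define NUM_BANKS 4   /* assumed number of banks, a power of two */

/* Low-order interleaving: consecutive word addresses rotate through banks. */
unsigned bank_of(unsigned word_addr)        { return word_addr % NUM_BANKS; }
unsigned offset_in_bank(unsigned word_addr) { return word_addr / NUM_BANKS; }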

4. What is the purpose of Dirty /Modified bit in cache memory? (Nov/Dec 2014)
A dirty bit or modified bit is a bit that is associated with a block of computer
memory and indicates whether or not the corresponding block of memory has been
modified. The dirty bit is set when the processor writes to (modifies) this memory.
The bit indicates that its associated block of memory has been modified and
has not yet been saved to storage. When a block of memory is to be
replaced, its corresponding dirty bit is checked to see if the block needs to be written back to
secondary memory before being replaced or if it can simply be removed. Dirty bits
are used by the CPU cache and in the page replacement algorithms of an operating
system.
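
A sketch of how a write-back cache might track this (field names are chosen for
illustration, not taken from any particular design): a write sets the dirty bit,
and eviction writes the block back only if the bit is set.

typedef struct {
    unsigned valid : 1;
    unsigned dirty : 1;       /* set when the block has been modified */
    unsigned tag;
    unsigned char data[64];
} CacheLine;

void write_byte(CacheLine *line, int offset, unsigned char byte) {
    line->data[offset] = byte;
    line->dirty = 1;          /* block no longer matches main memory */
}

void evict(CacheLine *line) {
    if (line->valid && line->dirty) {
        /* write the block back to memory here; clean blocks are just dropped */
    }
    line->valid = 0;
}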

5. Define cache hit and cache miss. (Nov/Dec 2013)


A cache hit is a state in which data requested for processing by a component
or application is found in the cache memory. It is a faster means of delivering data to
the processor, as the cache already contains the requested data.
Cache miss is a state where the data requested for processing by a
component or application is not found in the cache memory. It causes
execution delays by requiring the program or application to fetch the data from other
cache levels or the main memory

6. What are the temporal and spatial localities of references? (May/June 2014) (A/M'16)
o Temporal locality (locality in time): if an item is referenced,
it will tend to be referenced again soon.
o Spatial locality (locality in space): if an item is referenced,
items whose addresses are close by will tend to be referenced
soon.
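
A small loop makes both kinds of locality concrete (a sketch, not from the text):
the elements of a are accessed at consecutive addresses (spatial locality), while
sum and i are reused on every iteration (temporal locality).

int sum = 0;                 /* sum and i are reused: temporal locality */
for (int i = 0; i < 100; i++)
    sum += a[i];             /* a[0], a[1], ... are adjacent: spatial locality */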

7. What is DMA? Mention its advantages. (Nov/Dec 2013)
Direct memory access (DMA) is a feature of computer systems that allows
certain hardware subsystems to access main system memory (RAM) independently of
the central processing unit (CPU).
The need to handle more data and at higher rates means DMA is now an
important part of hardware and software design. A dedicated DMA controller, often
integrated in the processor, can be configured to move data between main
memory and a range of subsystems, including another part of main memory.

8. What is the use of TLB? (April/May 2015)


A translation lookaside buffer (TLB) is a memory cache that stores recent
translations of virtual memory to physical addresses for faster retrieval. It is used to
avoid an access to the page table.

9. Point out how DMA can improve I/O speed. (April/May 2015)
The CPU is still responsible for initiating each block transfer. The DMA
interface controller can then take over the control of, and responsibility for,
transferring data, so that data can be transferred without the intervention of the
CPU. The CPU and I/O controller interact with each other only when control of the
bus is required.


10. Difference between Programmed I/O and Interrupt I/O. (Nov/Dec 2014)

Programmed I/O:
o The program polls or checks some hardware item (e.g., a mouse) within a loop.
o Slow and inefficient.
o Easy to program and understand.

Interrupt I/O:
o The same mouse triggers a signal to the program to process the mouse event.
o Fast and efficient.
o Can be tricky to write if you are using a low-level language.

PART B

MEMORY TECHNOLOGIES

1. Draw different memory access layouts and brief about the technique used to
increase the average rate of fetching words from the main memory. (Nov/Dec 2014) (8)
(OR)
Elaborate on the various memory technologies and their relevance. (April/May 2015) (16)

MEMORY TECHNOLOGIES:
Memory latency is traditionally quoted using two measures: access time and
cycle time. Access time is the time between when a read is requested and when the
desired word arrives; cycle time is the minimum time between requests to memory.
One reason that cycle time is greater than access time is that the memory needs the
address lines to be stable between accesses.

DRAM TECHNOLOGY:
The solution was to multiplex the address lines, thereby cutting the number of
address pins in half. Figure shows the basic DRAM organization. One-half of the
address is sent first, called the row access strobe (RAS). The other half of the
address, sent during the column access strobe (CAS), follows it. These names come
from the internal chip organization, since the memory is organized as a rectangular
matrix addressed by rows and columns.


An additional requirement of DRAM derives from the property signified by its
first letter, D, for dynamic. To pack more bits per chip, DRAMs use only a single
transistor to store a bit. Reading that bit destroys the information, so it must be
restored. This is one reason the DRAM cycle time is much longer than the access
time. In addition, to prevent loss of information when a bit is not read or written, the
bit must be "refreshed" periodically. Fortunately, all the bits in a row can be
refreshed simultaneously just by reading that row. Hence, every DRAM in the
memory system must access every row within a certain time window, such as 8 ms.
Memory controllers include hardware to refresh the DRAMs periodically.

SRAM TECHNOLOGY:
The first letter of SRAM stands for static. The dynamic nature of the circuits in
DRAM requires data to be written back after being read, hence the difference
between the access time and the cycle time as well as the need to refresh. SRAMs
don't need to refresh, and so the access time is very close to the cycle time.
SRAMs typically use six transistors per bit to prevent the information from
being disturbed when read. SRAM needs only minimal power to retain the charge in
standby mode. SRAM designs are concerned with speed and capacity, while in
DRAM designs the emphasis is on cost per bit and capacity. For memories designed
in comparable technologies, the capacity of DRAMs is roughly 4–8 times that of
SRAMs. The cycle time of SRAMs is 8–16 times faster than DRAMs, but they are
also 8–16 times as expensive.

RAMBUS: This is an interface improvement using a pipelined bus interface,
sometimes called a split-transaction bus.
• The bus comprises row and column address lines plus 18 bits of data.
• Transactions (RAS/CAS/data) proceed on the bus simultaneously.
• High clock rate (400 MHz), with data transfers on both clock edges.

ROM: Memory is the third key component of a microprocessor-based system


(besides the CPU and I/O devices). More specifically, the primary storage directly
addressed by the CPU is referred to as main memory to distinguish it from other
“memory” structures such as CPU registers, caches, and disk drives. Main memory is

typically built from several discrete semiconductor memory devices. Most systems
contain two or more types of main memory.

All memory types can be categorized as ROM or RAM, and as volatile or non-volatile:
• Read-Only Memory (ROM) cannot be modified (written), as the name implies. A
ROM chip’s contents are set before the chip is placed in the system.
• Read-Write Memory is referred to as RAM (for Random-Access Memory). This
distinction is inaccurate, since ROMs are also random access, but we are stuck
with it for historical reasons.
• Volatile memories lose their contents when their power is turned off.
• Non-volatile memories do not.
The memory types currently in common usage are:
                 ROM                          RAM
Volatile         (nothing)                    Static RAM (SRAM)
                                              Dynamic RAM (DRAM)
Non-volatile     Mask ROM, PROM, EPROM        EEPROM, Flash memory, BBSRAM
ngi
Every system requires some non-volatile memory to store the instructions
that get executed when the system is powered up (the boot code) as well as some

(typically volatile) RAM to store program state while the system is running.

ri n
PROGRAMMABLE ROM (PROM):
• Replace the diode with a diode + fuse, one at every cell (a.k.a. "fusible-link" PROM)
• Initial contents all 1s; users program by blowing fuses to create 0s
• Plug the chip into a PROM programmer ("burner") device and download the data file
• One-time programmable

UV ERASABLE PROM (UV EPROM, OR JUST EPROM):


• Replace the PROM fuse with a pass transistor controlled by a "floating"
(electrically isolated) gate.
• Program by charging the gate, which switches the pass transistor; requires careful
application of high voltages to overcome the gate insulation (again using a special
"burner").
• Erase by discharging all gates using ultraviolet light: UV photons carry electrons
across the insulation; the chip has a window to let the light in.
• The insulation eventually breaks down, so there is a limited number of
erase/reprogram cycles (100s/1000s).

NON-VOLATILE RAM TYPES:

Three basic types: EEPROM, Flash, BBSRAM

ELECTRICALLY ERASABLE PROM (EEPROM, E2PROM):


• Similar to UV EPROM, but with on-chip circuitry to electrically charge/discharge
the floating gates (no UV needed).
• Writable by the CPU: it's RAM, not ROM (despite the name). Reads and writes work
much like generic RAM; on writes, internal circuitry transparently erases the
affected byte/word, then reprograms it to the new value.
• Write cycle time is on the order of a millisecond; software typically polls a
status pin to know when the write is done.
• High-voltage input (e.g. 12 V) often required for writing.
• Limited number of write cycles (e.g. 10,000).
• Selective erasing requires extra circuitry (an additional transistor) per memory
cell: lower density and higher cost than EPROM.

FLASH MEMORY:
• Again floating-gate technology, like EPROM and EEPROM.
• Electrically erasable like EEPROM, but only in large 8K-128K blocks (not a byte
at a time); moves the erase circuitry out of the cells to the periphery of the
memory array.
• Back to one transistor per cell: excellent density.
• Reads just like memory; writes like memory for locations in erased blocks
(typical write cycle time is a few microseconds: slower than volatile RAM, but
faster than EEPROM).
• To rewrite a location, software must explicitly erase the entire block, initiated
via control registers on the flash memory device; an erase can take several
seconds, after which erased blocks can be written (programmed) a byte at a time.
• Still has an erase/reprogram cycle limit (10K-100K cycles per block).

FLASH APPLICATIONS:
Flash technology has made rapid advances in the last few years: cell density
rivals DRAM, is better than EPROM, and is much better than EEPROM. Multiple gate
voltage levels can encode 2 bits per cell, and 64 Mbit devices are available. ROMs
and EPROMs are rapidly becoming obsolete, since flash is as cheap or cheaper and
allows field upgrades. Flash is replacing hard disks in some applications: it is
smaller, lighter, faster, and more reliable (no moving parts), and cost-effective up
to tens of megabytes; block erase is a good match for a file-system type interface.
BATTERY-BACKED STATIC RAM (BBSRAM):
Take a volatile static RAM device and add battery backup. Key advantage: write
performance, since the write cycle time is the same as the read cycle time. Needs
circuitry to switch to the battery on power-off, and one has to worry about the
battery running out. Effective for a small amount of storage when a battery is
needed anyway (e.g., a PC's built-in clock).

VOLATILE RAM TYPES:


Two basic types: static and dynamic.

STATIC RAM (SRAM):
• Each cell is basically a flip-flop
• Four or six transistors (4T/6T): relatively poor density
• Very simple interfacing; writes and reads at the same speed
• Very fast (access times under 10 ns available)

DYNAMIC RAM (DRAM):


• One transistor per cell (the drain acts as a capacitor)
• Highest density memory available
• Very small charges involved
• Bit lines must be precharged to detect bit values: cycle time > access time
• Reads are destructive; the chip internally does a writeback on each read
• Values must be refreshed (rewritten) periodically by touching each row of the
array, or the charge will leak away
• External row/column addressing saves pins and cost
• Row/column addressing + refresh means more complex interfacing

EMBEDDED PROCESSOR MEMORY TECHNOLOGY: ROM AND FLASH


Embedded computers usually have small memories, and most do not have a
disk to act as non-volatile storage. Two memory technologies are found in embedded
computers to address this problem.
The first is Read-Only Memory (ROM). ROM is programmed at time of
manufacture, needing only a single transistor per bit to represent a 1 or 0. ROM is
used for the embedded program and for constants, often included as part of a larger
chip. In addition to being non-volatile, ROM is also non-destructible; nothing the
computer can do can modify the contents of this memory. Hence, ROM also provides
a level of protection to the code of embedded computers. Since address-based
protection is often not enabled in embedded processors, ROM can fulfill an important
role.
The second memory technology offers non-volatility but allows the memory to
be modified. Flash memory allows the embedded device to alter non-volatile memory
after the system is manufactured, which can shorten product development.

IMPROVING MEMORY PERFORMANCE IN A STANDARD DRAM CHIP:


a. To improve bandwidth, there have been a variety of evolutionary innovations over
time. The first was timing signals that allow repeated accesses to the row buffer
without another row access time, typically called fast page mode.
b. The second major change is that conventional DRAMs have an asynchronous
interface to the memory controller, and hence every transfer involves overhead to
synchronize with the controller. Adding a clock to the interface removes this
overhead; the optimization is called Synchronous DRAM (SDRAM).
c. The third major DRAM innovation to increase bandwidth is to transfer data on
both the rising edge and falling edge of the DRAM clock signal, thereby doubling
the peak data rate. This optimization is called Double Data Rate (DDR).


THE MEMORY HIERARCHY PYRAMID

(ii). Explain about Memory Hierarchy in detail.


IC Memory Types and Cycles:

• ROM (Read-Only Memory, NEC): permanent storage (boot code, embedded code).
• SRAM (Static Random Access Memory): cache and high-speed access.
• DRAM (Dynamic Random Access Memory, Micron): main memory.
• EPROM (Electrically Programmable Read-Only Memory, Atmel): replaces ROM when
reprogramming is required.
• EEPROM (Electrically Erasable, Programmable Read-Only Memory, Atmel): alternative
to EPROM for limited but regular reprogramming; holds device configuration info
during power-down (USB memories).
• FLASH: an advancement on EEPROM technology allowing blocks of memory locations to
be written and cleared at one time (Samsung); found in thumb drives/memory sticks
or as solid-state hard disks.

Note: EEPROM and FLASH have lifetime write cycle limitations.

sy E
Note: EEPROM and FLASH have lifetime write cycle limitations
Average Memory Access Time (Registers and Main Memory):

ngi
The entire computer memory can be viewed as the hierarchy depicted in
Figure. The fastest access is to data held in processor registers. Therefore, if we

nee
consider the registers to be
part of the memory hierarchy, then the processor registers are at

ri n
the top in terms of the speed
of access. The registers provide only a minuscule portion of the required memory.

g .n
At the next level of the hierarchy is a relatively small amount of memory that
can be implemented directly on the processor chip. This
memory, called a processor cache, holds copies of instructions and data stored
in a much larger memory that is provided externally. There are often two levels of
caches.
e
A primary cache is always located on the processor chip. This cache is small
because it competes for space on the processor chip, which must implement
many other functions. The primary cache is referred to as level (L1) cache. A larger,
secondary cache is placed between the primary cache and the rest of the memory. It
is referred to as level 2 (L2) cache. It is usually implemented using SRAM chips. It is
possible to have both Ll and L2 caches on the processor chip.
The next level in the hierarchy is called the main memory. This rather large
memory is implemented using dynamic memory components, typically in the form of
SIMMs, DIMMs, or RIMMs. The main memory is much larger but significantly slower
than the cache memory. In a typical computer, the access time for the main memory
is about ten times longer than the access time for the L 1 cache.





FIG: MEMORY HIERARCHY

Disk devices provide a huge amount of inexpensive storage. They are very slow
compared to the semiconductor devices used to implement the main memory. A
hard disk drive (HDD; also hard drive, hard disk, magnetic disk, or disk drive) is a
device for storing and retrieving digital information, primarily computer data. It
consists of one or more rigid (hence "hard") rapidly rotating discs (often referred to
as platters), coated with magnetic material and with magnetic heads arranged to
write data to the surfaces and read it from them.

e
as platters), coated with magnetic material and with magnetic heads arranged to
write data to the surfaces and read it from them.

2. (i). Explain the features of cache memory and its accesses. (May/June 2014)

Basic Ideas:


The cache is a small mirror-image of a portion (several "lines") of main
memory. The cache is faster than main memory, so we maximize its utilization; the
cache is more expensive per byte than main memory, so it is much smaller.
Locality of reference:
The principle that the instruction currently being fetched/executed is very close
in memory to the instruction to be fetched/executed next. The same idea
applies to the data value currently being accessed (read/written) in memory.
If we keep the most active segments of program and data in the cache, overall
execution speed for the program will be optimized. Our strategy for cache utilization
should maximize the number of cache read/write operations, in comparison with the
number of main memory read/write operations.




Example:
A line is an adjacent series of bytes in main memory (that is, their addresses
are contiguous). Suppose a line is 16 bytes in size. For example, suppose we
have a 2^12 = 4K-byte cache with 2^8 = 256 16-byte lines; a 2^24 = 16M-byte main
memory, which is 2^12 = 4K times the size of the cache; and a 400-line program, which
will not all fit into the cache at once.

FIG: Cache memory access

Each active cache line is established as a copy of a corresponding memory line
during execution. Whenever a memory write takes place in the cache, the "Valid" bit
is reset (marking that line "Invalid"), which means that it is no longer an exact
image of its corresponding line in memory.
Cache Dynamics:
When a memory read (or fetch) is issued by the CPU:
g .n
1. If the line with that memory address is in the cache (this is called a cache
hit), the data is read from the cache to the MDR.
2. If the line with that memory address is not in the cache (this is called a miss),
the cache is updated by replacing one of its active lines by the
line with that memory address, and then the data is read from the cache to the
e
MDR.

When a memory write is issued by the CPU:


3. If the line with that memory address is in the cache, the data is written from the
MDR to the cache, and the line is marked "invalid" (since it no longer is an
image of the corresponding memory line).
4. If the line with that memory address is not in the cache, the cache is updated
by replacing one of its active lines by the line with that memory address. The
data is then written from the MDR to the cache and the line is marked "invalid."

Cache updation is done in the following way:


1.A candidate line is chosen for replacement using an algorithm that tries to
minimize the number of cache updates throughout the life of the program run.
Two algorithms have been popular in recent architectures: Choose the line
that has been least recently used - "LRU" for short (e.g., the PowerPC)

Choose the line randomly (e.g., the 68040)


1. If the candidate line is "invalid," write out a copy of that line to main memory
(thus bringing the memory up to date with all recent writes to that line in the
cache).
2. Replace the candidate line by the new line in the cache.

(ii). Explain mapping functions in cache memory to determine how memory
blocks are placed in cache. (Nov/Dec 2014).
As a working example, suppose the cache has 2^7 = 128 lines, each with 2^4 =
16 words. Suppose the memory has a 16-bit address, so that 2^16 = 64K words are
in the memory's address space.

FIG: Mapping cache and Main memory block

Direct Mapping:
Direct mapping of the cache for this model can be accomplished by using the
rightmost 3 bits of the memory address. For instance, the memory address 7A00 =
0111101000000 000 maps to cache address 000. Thus, the cache address of


any value in the array is just its memory address modulo 8.
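
In C, the mapping for this model can be written as below (a sketch under the
example's assumptions of an 8-line cache); note that 0x7A00 % 8 is 0, matching
the text.

/* Direct mapping for the 8-line cache of this example. */
unsigned cache_index(unsigned addr) { return addr % 8; }  /* rightmost 3 bits */
unsigned cache_tag(unsigned addr)   { return addr / 8; }  /* remaining bits   */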


Using this scheme, we see that the above calculation uses only cache words
000 and 100, since each entry in the first row of a has a memory address with either
000 or 100 as its rightmost 3 bits. The hit rate of a program is the number of cache
hits among its reads and writes divided by the total number of memory reads and
writes. There are 30 memory reads and writes for this program, and the following
diagram illustrates cache utilization for direct mapping throughout the life of these two
loops:

FIG : DIRECT MAPPING OF CACHE
Reading the sequence of events from left to right over the ranges of the
indexes i and j, it is easy to pick out the hits and misses. In fact, the first loop
has a series of 10 misses (no hits). The second loop contains a read and a write of
the same memory location on each repetition (i.e., a[0][i] = a[0][i]/Ave;), so that
the 10 writes are guaranteed to be hits. Moreover, the first two repetitions of the
second loop have hits in their read operations, since a[0][9] and a[0][8] are still
in the cache at the end of the first loop. Thus, the hit rate for direct mapping in
this algorithm is 12/30 = 40%.

Associative Mapping:
Associative mapping for this problem simply uses the entire address as the
cache tag. If we use the least recently used cache replacement strategy, the
sequence of events in the cache after the first loop completes is shown in the left-
half of the following diagram. The second loop happily finds all of a[0][9] through
a[0][2] already in the cache, so it will experience a series of 16 hits (2 for each
repetition) before missing on a[0][1] when i = 1. The last two steps of the second
loop therefore have 2 hits and 2 misses.


FIG: ASSOCIATIVE MAPPING

Set-Associative Mapping:

Set associative mapping tries to compromise these two. Suppose we divide
the cache into two sets, distinguished from each other by the rightmost bit of the

memory address, and assume the least recently used strategy for cache line
replacement. Cache utilization for our program can now be pictured as follows:

FIG: SET ASSOCIATIVE MAPPING

Here all the entries in a that are referenced in this algorithm have even-
numbered addresses (their rightmost bit = 0), so only the top half of the cache is
utilized. The hit rate is therefore slightly worse than associative mapping and slightly
better than direct. That is, set-associative cache mapping for this program yields 14
hits out of 30 read/writes for a hit rate of 46%.
Example:
Suppose we have an 8-word cache and a 16-bit memory address space,
where each memory "line" is a single word (so the memory address need not have a
"Word" field to distinguish individual words within a line). Suppose we also
have a 4x10 array a of numbers (one number per addressable memory word)
allocated in memory column-by-column, beginning at address 7A00.


That is, we have the following declaration and memory allocation picture for
the array a = new float [4][10];

FIG: EXAMPLE SET ASSOCIATIVE
Here is a simple equation that recalculates the elements of the first row of a.
The calculation could have been implemented directly in C/C++/Java as follows:

Sum = 0;
for (j = 0; j <= 9; j++)
    Sum = Sum + a[0][j];
Ave = Sum / 10;
for (i = 9; i >= 0; i--)
    a[0][i] = a[0][i] / Ave;

The emphasis here is on the underlined parts of this program which represent
memory read and write operations in the array a. Note that the 3rd and 6th lines
involve a memory read of a[0][j] and a[0][i], and the 6th line involves a memory write
of a[0][i]. So altogether, there are 20 memory reads and 10 memory writes during the
execution of this program. The following discussion focuses on those particular parts
of this program and their impact on the cache.

ONE PROCESSOR OR CONTROLLER FUNCTIONING AS BUS MASTER

2. (i) Explain in detail about any two standard input and output interfaces
required to connect the I/O device to the bus. (Nov/Dec 2014)
(ii) Explain in detail about the Bus Arbitration techniques in DMA. (Nov/Dec 2014)

Only one processor or controller can function as the bus master at a time. The
bus master is the controller that has access to the bus at a given instance.

FIG: BUS ARBITRATION IN DMA

Three bus arbitration processes:

a. Daisy Chain Method
b. Independent Bus Requests and Grant Method
c. Polling Method

Daisy Chain Method:


A method for a centralized bus arbitration process. The bus control passes
from one bus master to the next one, then to the next, and so on: bus control
passes from unit U0 to U1, then to U2, then U3, and so on. U0 has the highest
priority, U1 the next, and so on.

FIG: DAISY CHAIN


Timing diagram in daisy chaining method:

FIG: TIMING DIAGRAM

Step 1:
Bus Grant BGri:
This signal means that a unit has been granted bus access and can take
control. The bus grant signal passes from the ith unit to the (i+1)th unit in daisy
chaining when the ith unit does not need bus control. The arbitrator issues only BGr0.
ngi
when ith unit does not need bus control. The arbitrator issues only BGr0.

Step 2:
nee
ri n
Bus Request BRqi:
This signal means that i-th unit has requested for the grant of the
bus access and requests to take control of the bus.
g .n
Step 3:
• Busy:

This signal is to and from a bus master to enable all other units with the bus
e
to note that presently bus access is not possible as one of the units is busy using the
bus or has been granted control over the bus. The unit, which accepts the BGr,
issues the Busy.

(ii)Independent Bus-Request and Grant Method:


In the independent bus requests and grants method, bus control passes from
one bus master to another only through the centralized bus controller. Assume n
units can be granted bus-master status by a centralized processor acting as bus
controller after listening to their requests.


Independent request method:


The centralized controller listens to the requests of each device individually
and grants access to the bus. If a number of requests are pending, it grants the
bus by a priority resolution algorithm, which resolves the priority issue. Bus
control passes from one bus master to another only on the grant of the bus in
response to an independent request.

FIG: BUS REQUEST AND GRANT

Timing diagram in independent bus request and grant method:

FIG: TIMING DIAGRAM – BUS REQUEST & GRANT METHOD

Step 1:
Bus Request BRqi for i = 0 to n – 1. BRqi: this signal means that the ith unit has
requested for the grant of the bus access and requests to take control of the bus.

Step 2:
Bus Grant BGri for i = 0 to n – 1. BGri signal means that ith unit has been
granted bus access and can take control. Bus grant signal passes to any ith
unit from the centralized processor only after the unit sends ith BRqi.


Step 3:
• Busy:
This signal is from a bus master to enable all other units with the bus to note
that presently bus access is not possible as one of the units is busy using the bus or
has been granted control over the bus.

Polling Method:
The bus control passes from one processor (bus controller) to another only
through the centralized bus controller, but only when the controller sends poll count
bits, which correspond to the unit number. Assume n units can be granted bus
master status by a centralized processor.

FIG: POLLING METHOD

Step 1:
Bus Poll Count BPC (on three lines for Ui, where i = 0 to n – 1 and n = 8). A
count = c means that the (2c – 1)th unit is being polled for the grant of bus access
and can take control from the processor. Bus count lines connect to each unit from
the centralized processor.

Step 2:
Bus Request BRqi for i = 0 to n – 1. This signal means that the ith unit has
accepted the grant of the available bus access and requests to take control of
the bus.

Step 3:
• Busy:
This signal is from a bus master to enable all other units with the bus to note
that presently bus access is not possible, as one of the units is busy using the bus
or has been granted control over the bus.

Centralized Arbitration:
The processor is normally the bus master; it grants bus mastership to the DMA
controller. The DMA controller requests and acquires the bus and later releases it.
During its tenure as the bus master, it may perform one or more data transfer
operations, depending on whether it is operating in cycle stealing or block mode.
After it releases the bus, the processor resumes bus mastership.

Distributed Arbitration:
All devices waiting to use the bus take part in the arbitration process; there is
no central arbiter. Each device on the bus is assigned an identification number. One
or more devices request the bus by asserting the start-arbitration signal and place
their identification numbers on the four open lines. ARB0 through ARB3 are the four
open lines. One among the requesting devices is selected using these lines: the one
with the highest ID.

a sy E
ngi
nee
ri n
g .n
FIG: DISTRIBUTED ARBITRATION

Assume that two devices, A and B, having ID numbers 5 and 6, respectively, are
requesting the use of the bus. Device A transmits the pattern 0101, and device B
transmits the pattern 0110. The code seen by both devices is 0111. Each device
compares the pattern on the lines to its own ID, starting from the most significant
bit. If it detects a difference at any bit position, it disables its drivers at that
bit position and for all lower-order bits; it does so by placing a 0 at the input of
these drivers. In the case of our example, device A detects a difference on line
ARB1. Hence, it disables its drivers on lines ARB1 and ARB0.


MEASURING AND IMPROVING CACHE PERFORMANCE

3. (ii) Explain the techniques for measuring and improving cache performance.
(May/June 2014).

We then explore two different techniques for improving cache performance.


One focuses on reducing the miss rate by reducing the probability that two different
memory blocks will contend for the same cache location. The second technique
reduces the miss penalty by adding an additional level to the hierarchy. This
technique is called multilevel caching.
Thus, CPU time = (CPU execution clock cycles + Memory-stall clock cycles) ×
Clock cycle time. The memory-stall clock cycles come primarily from cache misses,
and we make that assumption here. We also restrict the discussion to a simplified

model of the memory system. In real processors, the stalls generated by reads and
writes can be quite complex, and accurate performance prediction usually requires
very detailed simulations of the processor and memory system.

Memory-stall clock cycles can be defined as the sum of the stall cycles coming
from reads plus those coming from writes:

Memory-stall clock cycles = Read-stall cycles + Write-stall cycles

The read-stall cycles can be defined in terms of the number of read accesses per
program, the miss penalty in clock cycles for a read, and the read miss rate:

Read-stall cycles = (Reads/Program) × Read miss rate × Read miss penalty

Writes are more complicated:

Write-stall cycles = (Writes/Program) × Write miss rate × Write miss penalty + Write buffer stalls
× Read miss penalty Writes are more complicated.
Write-stall cycles = (Writes Program × Write miss rate × Write miss

ri n
penalty)
+ Write buffer stalls.

g .n
Because the write buffer stalls depend on the proximity of writes, and not just
the frequency, it is not possible to give a simple equation to compute such stalls.
Fortunately, in systems with a reasonable write buffer depth (e.g., four or more
words) and a memory capable of accepting writes at a rate that significantly
exceeds the average write frequency in programs (e.g., by a factor of 2), the write
buffer stalls will be small, and we can safely ignore them.
If we assume that the write buffer stalls are negligible, we can combine the
reads and writes by using a single miss rate and the miss penalty:

Memory-stall clock cycles = (Memory accesses/Program) × Miss rate × Miss penalty

We can also factor this as:

Memory-stall clock cycles = (Instructions/Program) × (Misses/Instruction) × Miss penalty

Calculating Cache Performance:


Assume the miss rate of an instruction cache is 2% and the miss rate of the
data cache is 4%. If a processor has a CPI of 2 without any memory stalls and the
miss penalty is 100 cycles for all misses, determine how much faster a processor
would run with a perfect cache that never missed. Assume the frequency of all loads
and stores is 36%.
The number of memory miss cycles for instructions, in terms of the instruction
count (I), is

Instruction miss cycles = I × 2% × 100 = 2.00 × I

As the frequency of all loads and stores is 36%, we can find the number of memory
miss cycles for data references:

Data miss cycles = I × 36% × 4% × 100 = 1.44 × I

The total number of memory-stall cycles is 2.00 I + 1.44 I = 3.44 I. This is more
than three cycles of memory stall per instruction. Accordingly, the total CPI
including memory stalls is 2 + 3.44 = 5.44. Since there is no change in instruction
count or clock rate, the ratio of the CPU execution times is

CPU time with stalls / CPU time with perfect cache
= (I × CPIstall × Clock cycle) / (I × CPIperfect × Clock cycle)
= CPIstall / CPIperfect = 5.44 / 2 = 2.72

The processor with the perfect cache is therefore faster by a factor of 2.72.
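
The arithmetic above can be checked with a short C program (the parameter values
are simply the example's assumptions, not fixed constants of the method):

#include <stdio.h>

int main(void) {
    double base_cpi     = 2.0;
    double miss_penalty = 100.0;
    double i_miss_rate  = 0.02;           /* instruction cache miss rate */
    double d_miss_rate  = 0.04;           /* data cache miss rate */
    double mem_frac     = 0.36;           /* loads/stores per instruction */

    double stall_cpi = base_cpi
                     + i_miss_rate * miss_penalty              /* 2.00 */
                     + mem_frac * d_miss_rate * miss_penalty;  /* 1.44 */

    printf("CPI with stalls = %.2f, speedup of perfect cache = %.2f\n",
           stall_cpi, stall_cpi / base_cpi);   /* prints 5.44 and 2.72 */
    return 0;
}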

Calculating Average Memory Access Time:
Find the AMAT for a processor with a 1 ns clock cycle time, a miss penalty of
20 clock cycles, a miss rate of 0.05 misses per instruction, and a cache access time
(including hit detection) of 1 clock cycle. Assume that the read and write miss
penalties are the same and ignore other write stalls. The average memory access
time per instruction is

AMAT = Time for a hit + Miss rate × Miss penalty = 1 + 0.05 × 20 = 2 clock cycles, or 2 ns.
20= 2 clock cycles or 2 ns.

ri n
Reducing the Miss Penalty Using Multilevel Caches:
All modern computers make use of caches. To close the gap further between
the fast clock rates of modern processors and the increasingly long time required to
access DRAMs, most microprocessors support an additional level of caching. This
second-level cache is usually on the same chip and is accessed whenever a miss
occurs in the primary cache. If the
second-level cache contains the desired data, the miss penalty for the first-level
cache will be essentially the access time of the second-level cache, which will be
much less than the access time of main memory. If neither the primary nor the
secondary cache contains the data, a main memory access is required, and a larger
miss penalty is incurred.
Performance of Multilevel Caches:
Suppose we have a processor with a base
CPI of 1.0, assuming all references hit in the primary cache, and a clock rate of 4
GHz. Assume a main memory access time of 100 ns, including all the miss handling.
Suppose the miss rate per instruction at the primary cache is 2%. How much faster
will the processor be if we add a secondary cache that has a 5 ns access time for
either a hit or a miss and is large enough to reduce the miss rate to main memory to
0.5%? The miss penalty to main memory is 100 ns / (0.25 ns per clock cycle) = 400
clock cycles. With the secondary cache, a primary miss that hits in L2 costs only
5 ns / 0.25 ns = 20 clock cycles. The total CPI with the secondary cache is
1.0 + 2% × 20 + 0.5% × 400 = 3.4, versus 1.0 + 2% × 400 = 9.0 with the primary
cache alone, so the processor with the secondary cache is faster by 9.0 / 3.4 = 2.6.


REDUCING CACHE MISS RATE:


The average memory access time formula gave us a framework to present cache
optimizations for improving cache performance:

Average memory access time = Hit time + Miss rate × Miss Penalty
Hence, we organize six cache optimizations into three categories:
• Reducing the miss rate: larger block size, larger cache size, and higher associativity
• Reducing the miss penalty: multilevel caches and giving reads priority over writes
• Reducing the time to hit in the cache: avoiding address translation when indexing the cache

SPLIT CACHES:

The classical approach to improving cache behavior is to reduce miss rates,
and we present three techniques to do so. To gain better insights into the causes of
misses, we first start with a model that sorts all misses into three simple categories:
Compulsory: The very first access to a block cannot be in the cache, so the
block must be brought into the cache. These are also called cold-start misses or
first-reference misses.
Capacity: If the cache cannot contain all the blocks needed during execution of
a program, capacity misses (in addition to compulsory misses) will occur because of
blocks being discarded and later retrieved.
Conflict: If the block placement strategy is set associative or direct mapped,
conflict misses (in addition to compulsory and capacity misses) will occur because a
block may be discarded and later retrieved if too many blocks map to its set. These
misses are also called collision misses. The idea is that hits in a fully associative
cache that become misses in an n-way set-associative cache are due to more than n
requests on some popular sets.
also called collision misses. The idea is that hits in a fully associative cache that
become misses in an n-way set-associative cache are due to more than n requests
on some popular sets.

VIRTUAL MEMORY

4. What is virtual memory? Explain the steps involved in virtual memory


address translation. (April/May 2015) (N/D’15)
The main memory can act as a “cache” for the secondary storage, usually
implemented with magnetic disks. This technique is called virtual memory.
Historically, there were two major motivations for virtual memory: to allow efficient
and safe sharing of memory among multiple programs, and to remove the
programming burdens of a small, limited amount of main memory.

A virtual address is translated by a combination of hardware and software to a physical address, which


in turn can be used to access main memory. Figure shows the virtually addressed
memory with pages mapped to main memory. This process is called address
mapping or address translation.

FIG: ADDRESS TRANSLATION

Virtual memory also simplifies loading the program for execution by providing
relocation. Relocation maps the virtual addresses used by a program to different
physical addresses before the addresses are used to access memory. This relocation
allows us to load the program anywhere in main memory. Furthermore, all virtual
memory systems in use today relocate the program as a set of fixed-size blocks
(pages), thereby eliminating the need to find a contiguous block of memory to
allocate to a program; instead, the operating system need only find a sufficient
number of pages in main memory.

FIG: MAPPING FROM VIRTUAL TO A PHYSICAL ADDRESS





FIG: PAGE TABLE MAPPING

The physical main memory is not as large as the address space spanned by an
address issued by the processor. When a program does not completely fit into the
main memory, the parts of it not currently being executed are stored on secondary
storage devices, such as magnetic disks.





FIG: VIRTUAL MEMORY ORGANIZATION
When a new segment of a program is to be moved into a full memory, it must
replace another segment already in the memory. The operating system moves
programs and data automatically between the main memory and secondary storage.
This process is known as swapping. Thus, the application programmer does not need
to be aware of limitations imposed by the available main memory. The figure shows
a typical organization that implements virtual memory. A special hardware unit,
called the Memory Management Unit (MMU), translates virtual addresses into
physical addresses. When the desired data (or instructions) are in the main memory,
these data are fetched as described in our presentation of the cache mechanism. If
the data are not in the main memory, the MMU causes the operating system to bring
the data into the memory from the disk. The DMA scheme is used to perform the
data transfer between the disk and the main memory.





(ii).Explain the steps involved in virtual address translation. (April/May 2015).


The MMU must use the page table information for every read and write
access; so ideally, the page table should be situated within the MMU.

FIG: VIRTUAL ADDRESS TRANSLATION
Unfortunately, the page table may be rather large, and since the MMU is
normally implemented as part of the processor chip (along with the primary cache),
it is impossible to include a complete page table on this chip. Therefore, the page
table is kept in the main memory. However, a copy of a small portion of the page
table can be accommodated within the MMU.
The process of translating a virtual address into physical address is known as
address translation. It can be done with the help of MMU. A simple method for
translating virtual addresses into physical addresses is to assume that all
programs and data are composed of fixed-length units called pages, each of which
consists of a block of words that occupy contiguous locations in the main memory.
Pages commonly range from 2K to 16K bytes in length. They constitute the basic
unit of
information that is moved between the main memory and the disk whenever
the translation mechanism determines that a move is required.
The cache bridges the speed gap between the processor and the main
memory and is implemented in hardware. The virtual-memory mechanism bridges
the size and speed gaps between the main memory and secondary storage
and is
Downloaded From : www.EasyEngineering.ne
120
Downloaded From : www.EasyEngineering.ne

usually implemented in part by software techniques.

A virtual-memory address translation method is based on the concept of fixed-
length pages. Each virtual address generated by the processor, whether it is for an
instruction fetch or an operand fetch/store operation, is interpreted as a virtual page
number (high-order bits) followed by an offset (low-order bits) that specifies the
location of a particular byte (or word) within a page. Information about the main
memory location of each page is kept in a page table. This information includes the
main memory address where the page is stored and the current status of the page.
This bit allows the operating system to invalidate the page without actually
removing it. Another bit indicates whether the page has been modified during
its residency in the memory. As in cache memories, this information is needed to
determine whether the page should be written back to the disk before it is removed
from the main memory to make room for another page. Other control bits indicate

ww
various restrictions that may be imposed on accessing the page. For example, a
program may be given full read and write permission, or it may be restricted to

w.E
read accesses only.

a
TLBS- INPUT/OUTPUT SYSTEM:

sy E
This portion consists of the page table entries that correspond to the
most recently accessed pages. A small cache, usually called the

ngi
Translation Lookaside Buffer (TLB) is incorporated into the MMU for this
purpose. The operation of the TLB with respect to the page table in the

nee
main memory is essentially the same as the operation of cache memory; the
TLB must also include the virtual address of the

ri n
entry. Figure shows a possible organization of a TLB where the associative-mapping
technique is used. Set associative mapped TLBs
are also found in commercial products.
g .n




Downloaded From : www.EasyEngineering.ne

FIG: TRANSLATION LOOKASIDE BUFFER

An essential requirement is that the contents of the TLB be coherent with the
contents of page tables in the memory. When the operating system changes the
contents of page tables, it must simultaneously invalidate the corresponding entries
in the TLB. One of the control bits in the TLB is provided for this purpose. When an
entry is invalidated, the TLB will acquire the new information as part of the MMU's
normal response to access misses.

e
Given a virtual address, the MMU looks in the TLB for the referenced page. If
the page table entry for this page is found in the TLB, the physical address is
obtained immediately. If there is a miss in the TLB, then the required entry is
obtained from the page table in the main memory and the TLB is updated. When a
program generates an access request to a page that is not in the main memory, a
page fault is said to have occurred. The whole page must be
brought from the disk into the memory before access can proceed.
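
A sketch of the lookup order in C (structure and names are illustrative only):
try the TLB first; on a TLB miss, consult the page table and refill the TLB; if
the page itself is absent from memory, a page fault is raised.

typedef struct { unsigned valid; unsigned vpn; unsigned frame; } TlbEntry;
#define TLB_SIZE 16

/* Fully associative search; returns 1 on a TLB hit. */
int tlb_lookup(TlbEntry tlb[], unsigned vpn, unsigned *frame) {
    for (int i = 0; i < TLB_SIZE; i++)
        if (tlb[i].valid && tlb[i].vpn == vpn) {
            *frame = tlb[i].frame;   /* hit: physical frame known at once */
            return 1;
        }
    return 0;                        /* miss: go to the page table, refill TLB */
}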
A modified page has to be written back to the disk before it is removed
from the main memory. It is important to note that the write-through
protocol, which is useful in the framework of cache memories, is not suitable for
virtual memory.


DMA:

5. Discuss Direct Memory Access in detail. (May/June 2014).(N/D’15)(16)


A special control unit is provided to allow transfer of a block of data directly
between an external device and the main memory, without continuous intervention by
the processor. This approach is called direct memory access, or DMA.
DMA transfers are performed by a control circuit that is part of the I/O device
interface. We refer to this circuit as a DMA controller. The DMA controller performs
the functions that would normally be carried out by the processor when accessing the
main memory. For each word transferred, it provides the memory address and all the
bus signals that control data transfer.
Although a DMA controller can transfer data without intervention by the
processor, its operation must be under the control of a program executed by the
processor. To initiate the transfer of a block of words, the processor sends the
starting address, the number of words in the block, and the direction of the
transfer. On receiving this information, the DMA controller proceeds to perform the
requested operation. When the entire block has been transferred, the controller
informs the processor by raising an interrupt signal.

While a DMA transfer is taking place, the program that requested the transfer
cannot continue, and the processor can be used to execute another program. After
the DMA transfer is completed, the processor can return to the program that
requested the transfer. I/O operations are always performed by the operating
system of the computer in response to a request from an application program.
Two registers are used for storing the starting address and the word count.
The third register contains status and control flags. The R/W bit determines the
direction of the transfer. When this bit is set to 1 by a program instruction, the
controller performs a read operation, that is, it transfers data from the memory to
the I/O device. Otherwise, it performs a write operation.
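
A hypothetical register-level sketch of starting such a transfer in C (the
register layout and flag names are invented for illustration; real controllers
differ):

typedef struct {
    volatile unsigned start_addr;   /* starting memory address of the block */
    volatile unsigned word_count;   /* number of words to transfer */
    volatile unsigned status;       /* R/W, IE, Done, IRQ flag bits */
} DmaChannel;

#define DMA_RW   (1u << 0)   /* 1 = read: memory -> I/O device */
#define DMA_IE   (1u << 1)   /* interrupt-enable */
#define DMA_DONE (1u << 2)   /* set by the controller when finished */

void start_dma_read(DmaChannel *ch, unsigned addr, unsigned nwords) {
    ch->start_addr = addr;          /* where the block begins in memory */
    ch->word_count = nwords;        /* how many words to move */
    ch->status = DMA_RW | DMA_IE;   /* start the transfer, interrupt when done */
}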
A DMA controller connects a high-speed network to the computer bus.
The disk controller, which controls two disks, also has DMA capability and provides
two DMA channels. It can perform two independent DMA operations, as if each disk
had
its own DMA controller. The registers needed to store the memory address, the
word count, and so on are duplicated, so that one set can be used with each device.
To start a DMA transfer of a block of data from the main memory to one of the
disks, a program writes the address and word count information into the registers of
the corresponding channel of the disk controller. It also provides the disk
controller with information to identify the data for future retrieval. The DMA controller
proceeds independently to implement the specified operation.
When the DMA transfer is completed, this fact is recorded in the status and
control register of the DMA channel by setting the done bit. At the same time, if
the IE bit is set, the




FIG: DMA ORGANIZATION

controller sends an interrupt request to the processor and sets the IRQ bit. The
status register can also be used to record other information, such as whether the
transfer took place correctly or errors occurred.

Memory accesses by the processor and the DMA controllers are interwoven.
Requests by DMA devices for using the bus are always given higher priority than
processor requests. Among different DMA devices, top priority is given to high-speed
peripherals such as a disk, a high-speed network interface, or a graphics display
device.

Alternatively, the DMA controller may be given exclusive access to the main
memory to transfer a block of data without interruption. This is known as block or
burst mode. Most DMA controllers incorporate a data storage buffer. In the case of
the network interface, for example, the DMA controller reads a block of data from
the main memory and stores it into its input buffer. This transfer takes place using
burst mode at a speed appropriate to the memory and the computer bus. Then, the
data in the buffer are transmitted over the network at the speed of the network.




Downloaded From : www.EasyEngineering.ne

Industry Connectivity and Latest Developments

Industry Connectivity:

• The following companies (industries) are connected to computer architecture
design: Intel, National Instruments, IBM, Freescale Semiconductor
Latest Developments:

• Design for the evolution of the next generation of POWER and Z series
processors

Industrial Visit (Planned if any)


Date:

Industry:





Downloaded From : www.EasyEngineering.ne

University Questions

Question Paper Code: 80289

B.E./B.Tech. DEGREE EXAMINATION, NOV/DEC 2016


Sixth Semester
ELECTRONICS & COMMUNICATION ENGINEERING
CS 6303 - COMPUTER ARCHITECTURE
(Common to Information Technology)
(Regulation 2013)
Time: Three hours                              Maximum: 100 marks

Answer ALL questions

PART A - (10 x 2 = 20 marks)

1) What is an Instruction register?
2) Give the formula for CPU execution time for a program
3) What is a guard bit and what are the ways to truncate the guard bits?
4) What is an arithmetic overflow?
5) What is meant by pipeline bubble?
6) What is data path?
7) What is instruction level parallelism?
8) What is multithreading?
9) What is meant by address mapping?
10) What is Cache memory?

PART B – (5*13= 65 Marks)

11) A) Explain in detail the various components of a computer system with a neat diagram
OR
B) Explain the different types of addressing modes with suitable examples

12) A) Explain BOOTH'S Algorithm for the multiplication of signed 2's complement numbers
OR
B) Discuss in detail the division algorithm with a diagram and examples


13) A) Why is a branch prediction algorithm needed? Differentiate between the static and dynamic techniques
OR
B) Explain how the instruction pipeline works. What are the various situations where an instruction pipeline can stall?

14) A) Explain in detail about FLYNN'S classification of parallel hardware
OR
B) Discuss shared memory multiprocessor with a neat diagram

15) A) Discuss the DMA controller with a block diagram
OR
B) Discuss the steps involved in the address translation of virtual memory with a necessary block diagram

PART C (1*15 = 15 Marks)

16) A) What is the disadvantage of ripple carry addition and how is it overcome in a carry look-ahead adder? Draw the logic circuit of the CLA
OR
B) Design and explain a parallel priority interrupt hardware for a system with eight interrupt sources


Question Paper Code: 57241

B.E./B.Tech. DEGREE EXAMINATION, APR/MAY 2016


Sixth Semester
ELECTRONICS & COMMUNICATION ENGINEERING
CS 6303 - COMPUTER ARCHITECTURE
(Common to Information Technology)
(Regulation 2013)
Time: Three hours    Maximum: 100 marks

Answer ALL questions

PART A - (10 x 2 = 20 marks)

1) How are instructions represented in a computer system?
2) Distinguish between auto increment and auto decrement addressing modes
3) Define ALU
4) What is subword parallelism?
5) What are the advantages of pipelining?
6) What is an exception?
7) State the need for instruction level parallelism
8) What is fine-grained multithreading?
9) Define memory hierarchy
10) State the advantages of virtual memory

PART B – (5*16= 80 Marks)

11) A) Discuss the various components of a computer system
OR
B) Elaborate different types of addressing modes with suitable examples

12) A) Explain briefly about floating point addition and subtraction algorithms
OR
B) Define BOOTH multiplication algorithm with a suitable example

13) A) What is pipelining? Discuss pipelined datapath and control
OR
B) Explain briefly about the various categories of hazards with examples


14) A) Explain in detail about FLYNN'S classification
OR
B) Write short notes on
i) Hardware multithreading
ii) Multicore processors

15) A) Define Cache memory. Explain various mapping techniques associated with cache memories
OR
B) Explain about DMA controller with the help of a block diagram


Question Paper Code: 27162

B.E./B.Tech. DEGREE EXAMINATION, NOV/DEC 2015


Third Semester
Computer Science and Engineering
CS 6303 - COMPUTER ARCHITECTURE
(Common to Information Technology, VI SEM ECE)
(Regulation 2013)

Time: Three hours    Maximum: 100 marks

Answer ALL questions

PART A - (10 x 2 = 20 marks)

1) What is instruction set architecture?
2) How is CPU execution time for a program calculated?
3) What are the overflow/underflow conditions for addition and subtraction?
4) State the representation of a double precision floating point number
5) What is a hazard? What are its types?
6) What is meant by branch prediction?
7) What is ILP?
8) Define superscalar processor
9) What are the various memory technologies?
10) Define hit ratio

PART B – (5*16= 80 Marks)


11) A) Explain in detail the various components of a computer system with a neat diagram
OR
B) What is an addressing mode? Explain various addressing modes with suitable examples

12) A) Explain in detail about the multiplication algorithm with a suitable example and diagram
OR
B) Discuss in detail about the division algorithm with the diagram and examples


13) A) Explain the basic MIPS implementation with necessary multiplexers and control lines
OR
B) Explain how the instruction pipeline works. What are the various situations where an instruction pipeline can stall? Illustrate with an example.

14) A) Explain in detail about FLYNN'S classification of parallel hardware
OR
B) Explain in detail about hardware multithreading

15) A) What is virtual memory? Explain in detail how virtual memory is implemented with a neat diagram
OR
B) Draw the typical block diagram of a DMA controller and explain how it is used for direct data transfer between memory and peripherals


Question Paper Code: 11257

B.E./B.Tech. DEGREE EXAMINATION, APRIL/MAY 2011


Fourth Semester
Computer Science and Engineering
CS 2253 - COMPUTER ARCHITECTURE
(Common to Information Technology)
(Regulation 2008)

Time: Three hours    Maximum: 100 marks

Answer ALL questions

PART A - (10 x 2 = 20 marks)

1. What is an opcode? How many bits are needed to specify 32 distinct operations?

2. Write the logic equations of a binary half adder.

3. Write the difference between Horizontal and Vertical Microinstructions.

4. In what ways can the width and height of the control memory be reduced?

5. What hazard do the above two instructions create when executed concurrently?

6. What are the disadvantages of increasing the number of stages in pipelined processing?

7. What is the use of EEPROM?

8. State the hardware needed to implement the LRU replacement algorithm.

9. What is distributed arbitration?

10. How can interrupt requests from multiple devices be handled?

PART B - (5 x 16 = 80 marks)

11. (a) With examples explain the Data transfer, Logic and Program Control Instructions. (16)
Or
(b) Explain the working of a Carry-Look Ahead adder. (16)


12. (a) (i) Describe the control unit organization with separate Encoder and Decoder functions in a hardwired control. (8)
(ii) Generate the logic circuit for the following functions and explain. (8)
Or
(b) Write a brief note on nano programming. (16)

13. (a) What are the hazards of conditional branches in pipelines? How can they be resolved? (16)
Or
(b) Explain the superscalar operations with a neat diagram. (16)

14. (a) What is a mapping function? In what ways can the cache be mapped? (16)
Or
(b) Explain the organization and accessing of data on a Disk. (16)

15. (a) (i) How can data transfers be controlled using the handshaking technique? (8)
(ii) Explain the protocols of USB. (8)

Or

(b) How does the parallel port output interface circuit work? (16)


B.E./B.Tech. DEGREE EXAMINATION, MAY/JUNE 2012


Fourth Semester
Computer Science and Engineering
CS 2253/141403/CS 43/CS1252 A/10144 CS 404/080250011
COMPUTER ORGANISATION AND ARCHITECTURE
(Common to Information Technology)
(Regulation 2008)

Time: Three hours    Maximum: 100 marks

Answer all questions.

PART A - (10*2 = 20 marks)

1. What is SPEC? Specify the formula for SPEC rating.

2. What is relative addressing mode? When is it used?

3. Write the register transfer sequence for storing a word in memory.

4. What is hard-wired control? How is it different from micro-programmed control?

5. What is meant by data and control hazards in pipelining?

6. What is meant by speculative execution?

7. What is meant by an interleaved memory?

8. An address space is specified by 24 bits and the corresponding memory space by 16 bits. How many words are in the (a) virtual memory (b) main memory?

9. Specify the different I/O transfer mechanisms available.

10. What does an isochronous data stream mean?

PART B - (5*16 = 80 marks)

11. (a) (i) What are addressing modes? Explain the various addressing modes with examples. (8)
(ii) Derive and explain an algorithm for adding and subtracting 2 floating point binary numbers. (8)
Or
(b) (i) Explain instruction sequencing in detail. (10)
(ii) Differentiate RISC and CISC architectures. (6)


12. (a) (i) With a neat diagram explain the internal organization of a processor. (6)
(ii) Explain how control signals are generated using microprogrammed control. (10)
Or
(b) (i) Explain the use of multiple-bus organization for executing a three-operand instruction. (8)
(ii) Explain the design of hardwired control unit. (8)
13. (a) (i) Discuss the basic concepts of pipelining. (8)
(ii) Describe the data path and control considerations for pipelining. (8)
Or
(b) Describe the techniques for handling data and instruction hazards in pipelining. (16)
14. (a) (i) Explain synchronous DRAM technology in detail. (8)
(ii) In a cache-based memory system using FIFO for cache page replacement, it is found that the cache hit ratio H is low. The following proposals are made for increasing it:
(1) Increase the cache page size.
(2) Increase the cache storage capacity.
(3) Increase the main memory capacity.
(4) Replace the FIFO replacement policy by LRU.
Analyse each proposal to determine its probable impact. (8)

Or

(b) (i) Explain the various mapping techniques associated with cache memories. (10)
(ii) Explain a method of translating virtual address to physical address. (6)

15. (a) Explain the following:
(i) Interrupt priority schemes. (8)
(ii) DMA. (8)
Or
(b) Write an elaborated note on PCI, SCSI and USB bus standards. (16)
