Cao Notes
UNIT II ARITHMETIC OPERATIONS 7
ALU - Addition and subtraction – Multiplication – Division – Floating Point operations – Subword parallelism.

UNIT III PROCESSOR AND CONTROL UNIT 11
Basic MIPS implementation – Building datapath – Control Implementation scheme – Pipelining – Pipelined datapath and control – Handling Data hazards & Control hazards – Exceptions.

UNIT IV PARALLELISM 9
Instruction-level-parallelism – Parallel processing challenges – Flynn's classification – Hardware multithreading – Multicore processors

UNIT V MEMORY AND I/O SYSTEMS 9
Memory hierarchy - Memory technologies – Cache basics – Measuring and improving cache performance - Virtual memory, TLBs - Input/output system, programmed I/O, DMA and interrupts, I/O processors.

TOTAL: 45 PERIODS
TEXT BOOK:
1. David A. Patterson and John L. Hennessy, “Computer Organization and Design”, Fifth edition, Morgan Kaufmann / Elsevier, 2014.
REFERENCES:
1. V. Carl Hamacher, Zvonko G. Vranesic and Safwat G. Zaky, “Computer Organisation”, VI edition, McGraw-Hill Inc, 2012.
2. William Stallings, “Computer Organization and Architecture”, Seventh Edition, Pearson Education, 2006.
3. Vincent P. Heuring, Harry F. Jordan, “Computer System Architecture”, Second Edition, Pearson Education, 2005.
4. Govindarajalu, “Computer Architecture and Organization, Design Principles and Applications”, first edition, Tata McGraw Hill, New Delhi, 2005.
5. John P. Hayes, “Computer Architecture and Organization”, Third Edition, Tata McGraw Hill, 1998.
6. http://nptel.ac.in/
TABLE OF CONTENT
S.No  TITLE  Page no
a  Aim and Objective of the subject  1
b  Detailed Lesson Plan  2
5  Addressing modes  22
6  Power Wall  26
7  Uniprocessor and Multiprocessor  27
8  Operations and operand  28
UNIT 2 ARITHMETIC OPERATIONS
a.  Part A  29
b.  Part B  31
1  Addition and Subtraction  31
2  Carry look ahead adder  31
3  Multiplication  34
4  Fast Multiplication  37
5  Booth Algorithm  38
6  Division algorithm  39
7  Restoring & Non Restoring Division algorithm  43
8  Floating point addition  46
9  Sub Word Parallelism  49
UNIT 4 PARALLELISM
i  Part A  77
j  Part B  79
24  Instruction Level Parallelism  79
25  Challenges of Parallel Processing  81
26  Flynn's Classification  85
27  Hardware Multithreading  88
28  Multi-core Processors  94
UNIT 5 MEMORY AND I/O SYSTEMS
30  Memory Hierarchy  102
31  Cache Memory  103
32  Bus Arbitration  109
33  Improving Cache Performance  114
34  Virtual Memory  116
35  Direct Memory Access (DMA)  123
m  Industrial / Practical Connectivity of the subject  125
n  University Questions  126
Aim:
To discuss the basic structure of a digital computer and to study in detail the organization of the Control unit, the Arithmetic and Logical unit, the Memory unit and the I/O unit.
Objectives:
To expose the students to the concept of pipelining.
To familiarize the students with the hierarchical memory system, including cache memories and virtual memory.
To expose the students to different ways of communicating with I/O devices and standard I/O interfaces.
S.No  Unit  Topic / Portions to be Covered  Hours Req  Cumulative Hrs  Books Referred
UNIT I OVERVIEW AND INSTRUCTIONS
1  1  Eight Ideas  1  1  T1
2  1  Components of Computer System  1  2  R1
3  1  Technology, performance, Power wall  1  3  R1
4  1  Uniprocessor to multiprocessors, Instructions  1  4  R1
5  1  Operations and operands. Representing Instructions  1  5  T1
6  1  Logical operations  1  6  T1,R1
24  3  Hazard Stalls  1  24  R1
25  3  Control Hazards  1  25  R1
26  3  Exceptions  1  26  T1,R1
27  3  Exceptions in a Pipelined Implementation  1  27  T1,R1
UNIT IV PARALLELISM
28  4  Introduction to Instruction Level parallelism  1  28  T1
29  4  Various Instruction Level parallelism  1  29  T1
30  4  Parallel Processing challenges  1  30  T1
31  4  Flynn classification  1  31  T1
32  4  MISD & MIMD  1  32  T1
33  4  Hardware multithreading  1  33  T1
34  4  Simultaneous multithreading  1  34  T1
35  4  Multicore Processors  1  35  T1
36  4  Shared Memory Multiprocessors  1  36  T1
UNIT V MEMORY AND I/O SYSTEMS
37  5  Memory Hierarchy  1  37  T1,R1
38  5  Memory Technologies  1  38  T1,R1
39  5  Flash and Disk Memory Technologies  1  39  T1,R1
40  5  Cache Basics  1  40  T1,R1
41  5  Measuring and improving cache performance  1  41  T1
42  5  Virtual memory, TLBs  1  42  T1
43  5  Input/Output System  1  43  T1,R1
44  5  Programmed I/O  1  44  T1,R1
45  5  DMA and Interrupts  1  45  T1,R1
46  5  I/O Processors  1  46  T1,R1
PART A
2. What is register indirect addressing mode? When is it used? (Nov/Dec 2013)
In this mode the instruction specifies a register in the CPU whose contents give the effective address of the operand in memory. In other words, the selected register contains the address of the operand rather than the operand itself. Before using a register indirect mode instruction, the programmer must ensure that the memory address of the operand is placed in the processor register by a previous instruction.
Uses:
The advantage of a register indirect mode instruction is that the address field of the instruction uses fewer bits to select a register than would have been required to specify a memory address directly.
Therefore EA = the address stored in the register R
• Operand is in the memory cell pointed to by the contents of register R
Example: Add (R2), R0
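The example can be traced with a toy simulation (an illustration, not from the notes; the memory address and register contents are made up):

```python
# Hypothetical sketch: how "Add (R2), R0" resolves its operand under
# register indirect addressing on a toy machine.

memory = {0x2000: 7}                 # memory cell 0x2000 holds the operand 7
registers = {"R0": 3, "R2": 0x2000}  # R2 holds the *address* of the operand

# Effective address EA = contents of R2; the operand is memory[EA]
ea = registers["R2"]
operand = memory[ea]

# Add (R2), R0  ->  R0 = R0 + memory[contents of R2]
registers["R0"] += operand
print(registers["R0"])  # 10
```

Note that the instruction itself only names R2; the extra level of indirection through memory is what distinguishes this mode from plain register addressing.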
Where,
op: Basic operation of the instruction, traditionally called the opcode.
rs: The first register source operand.
rt: The second register source operand.
rd: The register destination operand. It gets the result of the operation.
shamt: Shift amount.
funct: Function.
2. Define word length. (Nov/Dec 2011)
In computing, word is a term for the natural unit of data used by a particular processor design. A word is a fixed-sized piece of data handled as a unit by the instruction set or the hardware of the processor. The number of bits in a word (the word size, word width, or word length) is an important characteristic of any specific processor design or computer architecture.
The size of a word is reflected in many aspects of a computer's structure and operation; the majority of the registers in a processor are usually word sized, and the largest piece of data that can be transferred to and from the working memory in a single operation is a word in many (not all) architectures.
3. What are the merits and demerits of single address instructions? (Nov/Dec 2011)
Merits: Programs are shorter. The registers may be used for temporary results that are not needed immediately, or for holding frequently used operands, e.g. the end count in a "for" loop.
Demerit: The machine will be slower in this case, though not in all cases.
soon as it has fetched the first one and dispatched it to the instruction register. So a pipelined processor will start fetching the next instruction from memory as soon as it has latched the current instruction in the instruction register.
Parallel computing is a type of computation in which many calculations are carried out simultaneously, operating on the principle that large problems can often be divided into smaller ones, which are then solved at the same time. There are several different forms of parallel computing: bit-level, instruction-level, data, and task parallelism.
3. Define auto increment and auto decrement addressing mode. (April/May 2016)
Auto Increment mode:
The effective address of the operand is the contents of a register specified in the instruction. After accessing the operand, the contents of this register are automatically incremented to point to the next item in a list. The Auto increment mode is written as (Ri)+. As a companion for the Auto increment mode, another useful mode accesses the items of a list in the reverse order.
Auto Decrement mode:
The contents of a register specified in the instruction are first automatically decremented and are then used as the effective address of the operand. We denote the Auto decrement mode by putting the specified register in parentheses, preceded by a minus sign to indicate that the contents of the register are to be decremented before being used as the effective address. Thus, we write -(Ri).
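The two modes can be sketched as follows (a toy illustration, not from the notes; the table values and function names are made up). Auto increment uses the register, then bumps it; auto decrement bumps the register first, then uses it:

```python
# Hypothetical sketch: walking a table of data with auto-increment (Ri)+
# and auto-decrement -(Ri) addressing.

memory = [10, 20, 30]   # a small table of data
ri = 0                  # register Ri holds the address of the first item

def load_autoincrement():
    """LOAD (Ri)+ : use Ri as the effective address, then increment Ri."""
    global ri
    value = memory[ri]
    ri += 1             # post-increment: Ri now points at the next item
    return value

def load_autodecrement():
    """LOAD -(Ri) : decrement Ri first, then use it as the effective address."""
    global ri
    ri -= 1             # pre-decrement before forming the effective address
    return memory[ri]

forward = [load_autoincrement() for _ in range(3)]   # [10, 20, 30]
backward = [load_autodecrement() for _ in range(3)]  # [30, 20, 10]
print(forward, backward)
```

This mirrors the note above: the companion decrement mode visits the same list in reverse order.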
4. List the eight great ideas invented by computer architects (Apr/May 2015)
(i) Design for Moore’s Law
(ii) Use of abstraction to simplify design
(iii) Make the common case fast
(iv) Performance via parallelism
(v) Performance via pipelining
(vi) Performance via prediction
(vii) Hierarchy of Memories
(viii) Dependability via Redundancy
Downloaded From : www.EasyEngineering.ne
registers for storing data or instructions temporarily and for fast access compared to storage in memory locations. The size of a register in the MIPS architecture is 32 bits.
To achieve highest performance and conserve energy, an instruction set architecture must have a sufficient number of registers, and these registers must be used efficiently.
An instruction register (IR) is part of the CPU's control unit that holds the instruction currently being executed.
PART B
The computer consists of five functional units:
1. Input
2. Memory
3. Arithmetic and logic unit
4. Output
5. Control unit
Basic functional units of a computer
The computer accepts programs and the data through an input unit and stores them in the memory. The stored data are processed by the arithmetic and logic unit under program control. The processed data is delivered through the output unit. All the above activities are directed by the control unit.
a. Input unit
The computer accepts coded information through the input unit. The input can be from human operators, electromechanical devices such as keyboards, or from other computers over communication lines. Examples of input devices are keyboards, joysticks, trackballs and mice, which are used as graphic input devices in conjunction with a display.
Keyboard
It is the most common input device. Whenever a key is pressed, the corresponding letter or digit is automatically translated into its corresponding binary code and transmitted over a cable to the memory of the computer.
b. Memory unit
The memory unit is used to store programs as well as data. Memory is classified into primary and secondary storage.
Primary storage
It is also called main memory. It operates at high speed and is expensive. It is made up of a large number of semiconductor storage cells, each capable of storing one bit of information. These cells are grouped together in a fixed size called a word. This facilitates reading and writing the content of one word (n bits) in a single basic operation instead of reading and writing one bit for each operation.
Secondary storage
It is slow in speed and cheaper than primary memory, and its capacity is high. It is used to store information that is not accessed frequently. Various secondary devices are magnetic tapes and disks, optical disks (CD-ROMs), floppies, etc.
c. Arithmetic and logic unit
The arithmetic and logic unit (ALU) and the control unit together form a processor. Actual execution of most computer operations takes place in the arithmetic and logic unit of the processor. Example: Suppose two numbers located in the memory are to be added. They are brought into the processor, and the actual addition is carried out by the ALU.
Registers:
Registers are high speed storage elements available in the processor. Each register can store one word of data. When operands are brought into the processor for any operation, they are stored in the registers. Accessing data from a register is faster than accessing it from memory.
d. Output unit
The function of the output unit is to present processed results to the outside world in human understandable form. Examples of output devices are graphical displays and printers such as inkjet, laser and dot matrix; the laser printer works fastest.

Figure 1.2 Computer Components
e. Control unit
The control unit coordinates the operation of the memory, arithmetic and logic unit, input unit, and output unit in a proper way. It issues control signals that cause the CPU (and other components of the computer) to fetch the instruction into the IR (Instruction Register) and then execute the actions dictated by the machine language instruction stored there. The control unit is a well defined, physically separate unit that interacts with the other parts of the machine. A set of control lines carries the signals used for timing and synchronization of events in all units. Example: Data transfers between the processor and the memory are controlled by the control unit through timing signals. Timing signals are the signals that determine when a given action is to take place.
Computer Components: Top-Level View
PC: the program counter contains the address of the assembly language instruction to be executed next.
IR: the instruction register contains the binary word corresponding to the machine language version of the instruction currently being executed.
MAR: the memory address register contains the address of the word in main memory that is being accessed. The word being addressed contains either data or a machine language instruction to be executed.
MBR: the memory buffer register (also called MDR, for memory data register) is the register used to communicate data to and from the memory.
The operation of a processor is characterized by a fetch-decode-execute cycle. In the first phase of the cycle, the processor fetches an instruction from memory. The address of the instruction to fetch is stored in an internal register named the program counter, or PC. As the processor is waiting for the memory to respond with the instruction, it increments the PC. This means the fetch phase of the next cycle will fetch the instruction in the next sequential location in memory.
In the decode phase the processor stores the information returned by the memory in another internal register, known as the instruction register, or IR. The IR now holds a single machine instruction, encoded as a binary number. The processor decodes the value in the IR in order to figure out which operations to perform in the next stage.
In the execution stage the processor actually carries out the instruction. This step often requires further memory operations; for example, the instruction may direct the processor to fetch two operands from memory, add them, and store the result in a third location (the addresses of the operands and the result are also encoded as part of the instruction). At the end of this phase the machine starts the cycle over again by entering the fetch phase for the next instruction.
The CPU exchanges data with memory. For this purpose, it typically makes use of two internal (to the CPU) registers:
A memory address register (MAR), which specifies the address in memory for the next read or write, and
A memory buffer register (MBR), which contains the data to be written into memory or receives the data read from memory.
An I/O address register (I/OAR) specifies a particular I/O device. An I/O buffer register (I/OBR) is used for the exchange of data between an I/O module and the CPU. A memory module consists of a set of locations, defined by sequentially numbered addresses. Each location contains a binary number that can be interpreted as either an instruction or data. An I/O module transfers data from external devices to the CPU and memory, and vice versa. It contains internal buffers for temporarily holding these data until they can be sent on.
Instructions can be classified as one of three major types: arithmetic/logic, data transfer, and control. Arithmetic and logic instructions apply primitive functions of one or two arguments, for example addition, multiplication, or logical AND.
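The fetch-decode-execute cycle described above can be sketched as a small loop (a toy illustration, not from the notes; the instruction tuples, opcodes and register names are invented, whereas a real machine encodes instructions as binary words):

```python
# Hypothetical sketch of the fetch-decode-execute cycle on a toy machine.

memory = {
    0: ("LOAD", "R0", 100),   # R0 <- mem[100]
    1: ("LOAD", "R1", 101),   # R1 <- mem[101]
    2: ("ADD", "R0", "R1"),   # R0 <- R0 + R1
    3: ("STORE", "R0", 102),  # mem[102] <- R0
    4: ("HALT", None, None),
    100: 6, 101: 7,           # the two operands
}
regs = {"R0": 0, "R1": 0}
pc = 0

while True:
    ir = memory[pc]          # fetch: the instruction addressed by PC goes to IR
    pc += 1                  # PC is incremented while memory responds
    op, a, b = ir            # decode: pull the fields out of the IR
    if op == "LOAD":         # execute: carry out the decoded operation
        regs[a] = memory[b]
    elif op == "ADD":
        regs[a] = regs[a] + regs[b]
    elif op == "STORE":
        memory[b] = regs[a]
    elif op == "HALT":
        break

print(memory[102])  # 13
```

The four-instruction program is exactly the example in the text: fetch two operands from memory, add them, and store the result in a third location.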
(ii) State the CPU performance equation and discuss the factors that affect the performance (NOV/DEC 14)
Response time: The time between the start and the completion of an event, also referred to as execution time.
Throughput: The total amount of work done in a given time. In comparing design alternatives, we often want to relate the performance of two different machines, say X and Y. The phrase "X is faster than Y" is used here to mean that the response time or execution time is lower on X than on Y for the given task. In particular, "X is n times faster than Y" will mean
Execution time of Y / Execution time of X = n
Since execution time is the reciprocal of performance, the following relationship holds:
n = Execution time of Y / Execution time of X = Performance of X / Performance of Y
Performance and execution time are reciprocals, so increasing performance decreases execution time. To help avoid confusion between the terms increasing and decreasing, we usually say "improve performance" or "improve execution time" when we mean increase performance and decrease execution time.
CPU performance equation:
All computers are constructed using a clock running at a constant rate. These discrete time events are called ticks, clock ticks, clock periods, clocks, cycles, or clock cycles. Computer designers refer to the time of a clock period by its duration (e.g., 1 ns) or by its rate (e.g., 1 GHz). CPU time for a program can then be expressed in two ways:
CPU time = CPU clock cycles for a program x Clock cycle time
or
CPU time = CPU clock cycles for a program / Clock rate
Basic performance equation
Let T be the time required for the processor to execute a program in a high level language. The compiler generates a machine language object program corresponding to the source program. Assume that complete execution of the program requires the execution of N machine language instructions. Assume that the average number of basic steps needed to execute one machine instruction is S, where each basic step is completed in one clock cycle. If the clock rate is R cycles per second, the program execution time is given by
T = (N x S) / R
This is often called the Basic performance equation.
To achieve high performance, the performance parameter T should be reduced. The value of T can be reduced by reducing N and S, and by increasing R.
The value of N is reduced if the source program is compiled into a smaller number of machine instructions.
The value of S is reduced if an instruction has a smaller number of basic steps to perform or if the execution of instructions is overlapped.
The value of R can be increased by using a higher frequency clock, i.e. the time required to complete a basic execution step is reduced.
N, S and R are dependent factors. Changing one may affect another.
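The equation T = (N x S) / R can be exercised with a short calculation (the numbers below are invented for illustration, not taken from the notes):

```python
# Worked numbers are made up; the formula T = (N x S) / R is the basic
# performance equation from the text.

N = 50_000_000      # machine language instructions executed
S = 4               # average basic steps (clock cycles) per instruction
R = 2_000_000_000   # clock rate in cycles per second (2 GHz)

T = (N * S) / R     # program execution time in seconds
print(T)            # 0.1

# Halving S (e.g. by overlapping instruction execution) halves T,
# illustrating why reducing S improves performance:
print((N * (S / 2)) / R)  # 0.05
```

This also shows the dependence noted in the text: changing any one of N, S or R changes T directly.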
EIGHT GREAT IDEAS IN COMPUTER ARCHITECTURE
The one constant for computer designers is rapid change, which is driven largely by Moore's Law. It states that integrated circuit resources double every 18–24 months. Moore's Law resulted from a 1965 prediction of such growth in IC capacity made by Gordon Moore, one of the founders of Intel. As computer designs can take years, the resources available per chip can easily double or quadruple between the start and finish of the project. Computer architects must anticipate this rapid change.
Icon used: an "up and to the right" Moore's Law graph represents designing for rapid change.
Both computer architects and programmers had to invent techniques to make themselves more productive. A major productivity technique for hardware and software is to use abstractions to represent the design at different levels of representation; lower-level details are hidden to offer a simpler model at higher levels.
Icon used: abstract painting icon.
Making the common case fast tends to enhance performance better than optimizing the rare case, because the common case is often simpler than the rare case and is often easier to enhance. Making the common case fast is only possible with careful experimentation and measurement.
Icon used: a sports car as the icon for making the common case fast (as the most common trip has one or two passengers, and it's surely easier to make a fast sports car than a fast minivan.)
Performance via pipelining: overlapping the execution of instructions, increasing instruction throughput. For example, before fire engines, a human chain could carry water to a fire much more quickly than individuals running back and forth with buckets.
Icon used: a pipeline icon, a sequence of pipes with each section representing one stage of the pipeline.
a. PERFORMANCE VIA PREDICTION:
Following the saying that it can be better to ask for forgiveness than to ask for permission, the next great idea is prediction. In some cases it can be faster on average to guess and start working rather than wait until you know for sure. This works when the mechanism to recover from a misprediction is not too expensive and the prediction is relatively accurate.
Icon used: a fortune-teller's crystal ball.
b. HIERARCHY OF MEMORIES:
Programmers want memory to be fast, large, and cheap: memory speed often shapes performance, capacity limits the size of problems that can be solved, and the cost of memory today is often the majority of computer cost. Architects have found that they can address these conflicting demands with a hierarchy of memories: the fastest, smallest, and most expensive memory per bit is at the top of the hierarchy, and the slowest, largest, and cheapest per bit is at the bottom. Caches give the programmer the illusion that main memory is nearly as fast as the top of the hierarchy and nearly as big and cheap as the bottom of the hierarchy.
c. DEPENDABILITY VIA REDUNDANCY:
Since any physical device can fail, systems can be made dependable by including redundant components. These components can take over when a failure occurs and help detect failures.
Icon used: the tractor-trailer, since the dual tires on each side of its rear axles allow the truck to continue driving even when one tire fails.
TECHNIQUES TO REPRESENT INSTRUCTIONS IN A COMPUTER SYSTEM
(ii) Discuss about the various techniques to represent instructions in a computer system. (April/May 2015) (16)
INSTRUCTION:
Instructions are kept in the computer as a series of high and low electronic signals and may be represented as numbers. In fact, each piece of an instruction can be considered as an individual number, and placing these numbers side by side forms the instruction.
Since registers are referred to by almost all instructions, there must be a convention to map register names into numbers. In MIPS assembly language, registers $s0 to $s7 map onto registers 16 to 23, and registers $t0 to $t7 map onto registers 8 to 15. Hence, $s0 means register 16, $s1 means register 17, $s2 means register 18, . . . , $t0 means register 8, $t1 means register 9, and so on.
Translating a MIPS Assembly Instruction into a Machine Instruction: Let's do the next step in the refinement of the MIPS language as an example. The real MIPS language version of the instruction add $t0,$s1,$s2 is first presented as a combination of decimal numbers:
0 17 18 8 0 32
Each of these segments of an instruction is called a field. The first and last
fields (containing 0 and 32 in this case) in combination tell the MIPS computer that
this instruction performs addition. The second field gives the number of the register
that is the first source operand of the addition operation (17 = $s1), and the third field
gives the other source operand for the addition (18 = $s2). The fourth field contains
the number of the register that is to receive the sum (8 = $t0). The fifth field is unused
in this instruction, so it is set to 0.
Thus, this instruction adds register $s1 to register $s2 and places the sum in
register $t0. This instruction can also be represented as fields of binary numbers as opposed to decimal:
000000 10001 10010 01000 00000 100000
6 bits 5 bits 5 bits 5 bits 5 bits 6 bits
This layout of the instruction is called the instruction format. As you can see from counting the number of bits, this MIPS instruction takes exactly 32 bits, the same size as a data word. In keeping with our design principle that simplicity favors regularity, all MIPS instructions are 32 bits long.
To distinguish it from assembly language, we call the numeric version of instructions machine language and a sequence of such instructions machine code. It would appear that you would now be reading and writing long, tedious strings of binary numbers. We avoid that tedium by using a higher base than binary that converts easily into binary. Since almost all computer data sizes are multiples of 4, hexadecimal (base 16) numbers are popular.

Fig. The hexadecimal-binary conversion table
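The field layout above can be reproduced by packing the decimal fields of add $t0,$s1,$s2 (0, 17, 18, 8, 0, 32) into one 32-bit word and printing it in both binary and hexadecimal (a sketch of my own, not from the notes):

```python
# Pack the six R-format fields of add $t0, $s1, $s2 into a 32-bit word.
op, rs, rt, rd, shamt, funct = 0, 17, 18, 8, 0, 32

# Field widths are 6, 5, 5, 5, 5 and 6 bits; shift each field into place.
word = (op << 26) | (rs << 21) | (rt << 16) | (rd << 11) | (shamt << 6) | funct

print(f"{word:032b}")  # 00000010001100100100000000100000
print(f"{word:08x}")   # 02324020
```

The hexadecimal form illustrates the point in the text: eight hex digits are far easier to read and write than the 32-bit binary string they encode.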
MIPS FIELDS:
MIPS fields are given names to make them easier to discuss. The meaning of each name of the fields in MIPS instructions:
op: Basic operation of the instruction, traditionally called the opcode.
rs: The first register source operand.
rt: The second register source operand.
rd: The register destination operand. It gets the result of the operation.
shamt: Shift amount.
funct: Function. This field selects the specific variant of the operation in the op field and is sometimes called the function code.
Today’s computers are built on two key principles:
1. Instructions are represented as numbers.
2. Programs are stored in memory to be read or written, just like numbers.
These principles lead to the stored-program concept; its invention let the computing genie out of its bottle. Specifically, memory can contain the source code for an editor program, the corresponding compiled machine code, the text that the compiled program is using, and even the compiler that generated the machine code. One consequence of instructions as numbers is that programs are often shipped as files of binary numbers. The commercial implication is that computers can inherit ready-made software provided they are compatible with an existing instruction set. Such “binary compatibility” often leads industry to align around a small number of instruction set architectures.
The compromise chosen by the MIPS designers is to keep all instructions the same length, thereby requiring different kinds of instruction formats for different kinds of instructions. For example, the format above is called R-type (for register) or R-format. A second type of instruction format is called I-type (for immediate) or I-format and is used by the immediate and data transfer instructions. The fields of I-format are
op rs rt Constant or address
6 bits 5 bits 5 bits 16 bits
The 16-bit address means a load word instruction can load any word within a region of ±2^15 or 32,768 bytes (±2^13 or 8192 words) of the address in the base register rs. Similarly, add immediate is limited to constants no larger than ±2^15. We see that more than 32 registers would be difficult in this format, as the rs and rt fields would each need another bit, making it harder to fit everything in one word.
TRANSLATING MIPS ASSEMBLY LANGUAGE INTO MACHINE LANGUAGE:
We can now take an example all the way from what the programmer writes to what the computer executes. If $t1 has the base of the array A and $s2 corresponds to h, the assignment statement
A[300] = h + A[300];
is compiled into
lw $t0,1200($t1) # Temporary reg $t0 gets A[300]
add $t0,$s2,$t0 # Temporary reg $t0 gets h + A[300]
sw $t0,1200($t1) # Stores h + A[300] back into A[300]
The lw instruction is identified by 35 in the first field (op). The base register 9 ($t1) is specified in the second field (rs), and the destination register 8 ($t0) is specified in the third field (rt). The offset to select A[300] (1200 = 300 × 4) is found in the final field (address).
The add instruction that follows is specified with 0 in the first field (op) and 32 in the last field (funct). The three register operands (18, 8, and 8) are found in the second, third, and fourth fields and correspond to $s2, $t0, and $t0.
The sw instruction is identified with 43 in the first field. The rest of this final instruction is identical to the lw instruction. Since 1200 (decimal) = 0000 0100 1011 0000 (binary), the binary equivalent of the decimal form is:
lw:  100011 01001 01000 0000 0100 1011 0000
add: 000000 10010 01000 01000 00000 100000
sw:  101011 01001 01000 0000 0100 1011 0000
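The three encodings can be checked mechanically (a sketch with helper names of my own; the field values are the ones stated in the text):

```python
# Pack the lw/add/sw example into 32-bit machine words.

def r_type(op, rs, rt, rd, shamt, funct):
    """Pack an R-format instruction: 6/5/5/5/5/6-bit fields."""
    return (op << 26) | (rs << 21) | (rt << 16) | (rd << 11) | (shamt << 6) | funct

def i_type(op, rs, rt, address):
    """Pack an I-format instruction: 6/5/5/16-bit fields."""
    return (op << 26) | (rs << 21) | (rt << 16) | (address & 0xFFFF)

lw  = i_type(35, 9, 8, 1200)      # lw  $t0,1200($t1): op=35, rs=9, rt=8
add = r_type(0, 18, 8, 8, 0, 32)  # add $t0,$s2,$t0:  op=0, funct=32
sw  = i_type(43, 9, 8, 1200)      # sw  $t0,1200($t1): op=43, rs=9, rt=8

for word in (lw, add, sw):
    print(f"{word:08x}")
```

Note how lw and sw differ only in the 6-bit op field, exactly as the text observes.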
Consider a compiler writer deciding between two code sequences built from three instruction classes A, B and C.
(i) Which sequence executes the most instructions?
(ii) Which code sequence will be faster?
(iii) What is the CPI for each sequence?

Instruction class    A  B  C
CPI                  1  2  3

Code sequence    Instruction counts per class
                 A  B  C
1                2  1  2
2                4  1  1

SOLUTION:
(i) Code sequence 1 executes 2 + 1 + 2 = 5 instructions.
Code sequence 2 executes 4 + 1 + 1 = 6 instructions.
That is, code sequence 2 executes more instructions than code sequence 1.
(ii) The total number of clock cycles for each sequence can be found using the following equation:
CPU clock cycles = Σ (CPI_i × C_i)
Code sequence 1: CPU clock cycles = (1 × 2) + (2 × 1) + (3 × 2) = 10 cycles.
Code sequence 2: CPU clock cycles = (1 × 4) + (2 × 1) + (3 × 1) = 9 cycles.
So code sequence 2 is faster than code sequence 1 even though it executes one extra instruction.
(iii) The CPI values can be computed by
CPI = CPU clock cycles / Instruction count
CPI 1 = 10/5 = 2
CPI 2 = 9/6 = 1.5
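The arithmetic above can be checked with a short script (a sketch; the variable and function names are my own):

```python
# Clock cycles and average CPI for the two code sequences in the example.

cpi = {"A": 1, "B": 2, "C": 3}       # cycles per instruction, by class
seq1 = {"A": 2, "B": 1, "C": 2}      # instruction counts, code sequence 1
seq2 = {"A": 4, "B": 1, "C": 1}      # instruction counts, code sequence 2

def clock_cycles(counts):
    """CPU clock cycles = sum of CPI_i * C_i over all instruction classes."""
    return sum(cpi[c] * n for c, n in counts.items())

def avg_cpi(counts):
    """CPI = CPU clock cycles / instruction count."""
    return clock_cycles(counts) / sum(counts.values())

print(clock_cycles(seq1), avg_cpi(seq1))  # 10 2.0
print(clock_cycles(seq2), avg_cpi(seq2))  # 9 1.5
```

The script reproduces the conclusion of the worked example: sequence 2 uses fewer cycles (9 versus 10) despite executing one more instruction, because its CPI is lower.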
ADDRESSING AND ADDRESSING MODES
To perform any operation, the corresponding instruction is to be given to the microprocessor. In each instruction, the programmer has to specify three things:
Operation to be performed.
Address of the source of data.
Address of the destination of the result.
The different ways in which the location of an operand is specified in an instruction are referred to as addressing modes. The method by which the address of the source of data or the address of the destination of the result is given in the instruction is called the Addressing Mode. Computers use addressing mode techniques for the purpose of accommodating one or both of the following provisions:
To give programming versatility to the user by providing such facilities as pointers to memory, counters for loop control, indexing of data, and program relocation.
To reduce the number of bits in the addressing field of the instruction.
TYPES:
• Implied addressing mode
• Immediate addressing mode
• Direct addressing mode
• Indirect addressing mode
• Register addressing mode
• Register Indirect addressing mode
• Auto increment or Auto decrement addressing mode
• Relative addressing mode
• Indexed addressing mode
• Base register addressing mode
Example: ADD 5
• Add 5 to the contents of the accumulator
• 5 is the operand
Advantages:
• No memory reference to fetch data
• Fast
Disadvantage:
• Limited range
c. Direct addressing mode:
In this mode the effective address is equal to the address part of the instruction. The operand resides in memory and its address is given directly by the address field of the instruction. In a branch type instruction the address field specifies the actual branch address.
Example: LDA A
Look in memory at address A for the operand; load the contents of A into the accumulator.
Advantages and disadvantages:
• Single memory reference to access data
• No additional calculations to work out the effective address
• Limited address space
e. Register addressing mode:
o Faster instruction fetch
o Limited number of registers
o Multiple registers helps performance
o Requires good assembly programming or compiler writing
a.Register indirect addressing mode:
In this mode the instruction specifies a register in the CPU whose contents
nee
give the effective address of the operand in the memory. In
other words, the selected register contains the address of the operand rather than
ri n
the operand itself. Before using a register indirect mode instruction, the
programmer must ensure that the memory address of the operand is placed in the
processor register with a previous instruction.
g .n
The advantage of a register indirect mode instruction is that the address field of the
instruction uses fewer bits to select a register than would have been required to
specify a memory address directly.
Therefore EA = the address stored in the register R
• Operand is in memory cell pointed to by contents of register R
e
Advantage:
• Less number of bits are required to specify the register.
• One fewer memory access than indirect addressing.
written as (Ri) +
Auto Decrement mode:
The contents of a register specified in the instruction are first automatically decremented and then used as the effective address of the operand. We denote the Auto decrement mode by putting the specified register in parentheses, preceded by a minus sign, to indicate that the contents of the register are to be decremented before being used as the effective address. Thus, we write -(Ri).
• These two modes are useful when we want to access a table of data.
ADD (R1)+ will increment the register R1.
LDA -(R1) will decrement the register R1.
h. Relative addressing mode:
In this mode the content of the program counter is added to the address part of the instruction in order to obtain the effective address. The effective address is defined as the memory address obtained from the computation dictated by the given addressing mode. The address part of the instruction is usually a signed number (in 2's complement representation) which can be either positive or negative. When this number is added to the content of the program counter, the result produces an effective address whose position in memory is relative to the address of the next instruction.
Relative addressing is often used with branch-type instructions when the branch address is in the area surrounding the instruction word itself. It results in a shorter address field in the instruction format, since the relative address can be specified with a smaller number of bits than are required to designate the entire memory address.
EA = A + contents of PC
Example: PC contains 825 and the address part of the instruction contains 24. After the instruction is read from location 825, the PC is incremented to 826. So EA = 826 + 24 = 850. The operand will be found at location 850, i.e. 24 memory locations forward from the address of the next instruction.
Indexed addressing mode: EA = A + content of the index register (IR)
• Example: MOV AL, DS:disp[SI]
Advantage:
• Good for accessing arrays.
Base register addressing mode:
In this mode the content of a base register is added to the address part of the instruction to obtain the effective address. The base register addressing mode is used in computers to facilitate the relocation of programs in memory. When programs and data are moved from one segment of memory to another, as required in multiprogramming systems, the address values of instructions must reflect this change of position. With a base register, the displacement values of instructions do not have to change.
POWER WALL
5.(i). Elaborate about power wall with neat sketch.
The dominant technology for integrated circuits is called CMOS (complementary metal oxide semiconductor). For CMOS, the primary source of energy consumption is so-called dynamic energy, that is, energy consumed when transistors switch states from 0 to 1 and vice versa. The dynamic energy depends on the capacitive loading of each transistor and the voltage applied:

Energy ∝ Capacitive load × Voltage²

The power required per transistor is just the product of the energy of a transition and the frequency of transitions:

Power ∝ 1/2 × Capacitive load × Voltage² × Frequency switched

Frequency switched is a function of the clock rate. The capacitive load per transistor is a function of both the number of transistors connected to an output (called the fanout) and the technology, which determines the capacitance of both wires and transistors.
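As an illustration, the energy/power relationships can be sketched in Python. The function names and the 85% figures below are made-up example values for this sketch, not data from the text.

```python
# Illustrative CMOS dynamic energy and power, using the proportionalities
# above (constants of proportionality are dropped, so only ratios matter).

def dynamic_energy(capacitive_load, voltage):
    """Energy of one transition, proportional to C * V^2."""
    return capacitive_load * voltage ** 2

def dynamic_power(capacitive_load, voltage, frequency):
    """Power = 1/2 * C * V^2 * f."""
    return 0.5 * capacitive_load * voltage ** 2 * frequency

# Suppose a new process shrinks capacitive load to 85% and voltage to 85%
# of the old values while the clock frequency stays the same:
old = dynamic_power(1.0, 1.0, 1.0)
new = dynamic_power(0.85, 0.85, 1.0)
print(new / old)  # 0.85 * 0.85**2 = 0.614125, about a 39% power reduction
```

Because voltage enters squared, lowering the supply voltage is the most effective lever, which is exactly why voltage scaling eventually hit the "power wall".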
UNIPROCESSOR:
The motherboard contains the devices that drive our advancing technology, called integrated circuits and nicknamed chips. The board is composed of three pieces: the piece connecting to the I/O devices mentioned earlier, the memory, and the processor. The memory is where the programs are kept when they are running; it also contains the data needed by the running programs. Figure 1.8 shows that memory is found on the two small boards, and each small memory board contains eight integrated circuits.
The processor is the active part of the board, following the instructions of a program to the letter. It adds numbers, tests numbers, signals I/O devices to activate, and so on. Occasionally, people call the processor the CPU, for the more bureaucratic-sounding central processor unit. Descending even lower into the hardware, the processor logically comprises two main components: the datapath and the control.
Cache Memory:
Cache memory consists of a small, fast memory that acts as a buffer for the DRAM memory. (The nontechnical definition of cache is a safe place for hiding things.) Cache is built using a different memory technology, static random access memory (SRAM). SRAM is faster but less dense, and hence more expensive, than DRAM. You may have noticed a common theme in both the software and the hardware descriptions: delving into the depths of hardware or software reveals more information.
Multiprocessor:
A type of architecture that is based on multiple computing units. Some of the operations are done in parallel and the results are joined afterwards. There are many types of classifications for multiprocessor architectures, the best known being Flynn's taxonomy. MIPS (originally an acronym for Microprocessor without Interlocked Pipeline Stages) is a reduced instruction set computer (RISC) instruction set architecture (ISA) developed by MIPS Technologies.
To reduce confusion between the words processor and microprocessor, companies refer to processors as "cores," and such microprocessors are generically called multicore microprocessors.
(iii). Explain about the concepts of logical operations and control operations.
Every computer must be able to perform arithmetic. The MIPS assembly language notation add a, b, c instructs a computer to add the two variables b and c and to put their sum in a. This notation is rigid in that each MIPS arithmetic instruction performs only one operation and must always have exactly three variables. The words to the right of the sharp symbol (#) on each line are comments for the human reader, so the computer ignores them.
For example, suppose we want to place the sum of four variables b, c, d, and e into variable a. The following sequence of instructions adds the four variables:
add a, b, c # The sum of b and c is placed in a
add a, a, d # The sum of b, c, and d is now in a
add a, a, e # The sum of b, c, d, and e is now in a
Thus, it takes three instructions to sum the four variables.
Logical Operations:
Although the first computers operated on full words, it soon became clear that
it was useful to operate on fields of bits within a word or even on individual bits.
Examining characters within a word, each of which is stored as 8 bits, is one
example of such an operation. It follows that operations were added to programming
languages and instruction set architectures to simplify, among other things, the
packing and unpacking of bits into words. These instructions are called logical
operations.
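The packing and unpacking described above can be sketched with shifts, AND, and OR. The function names here are mine, for illustration; the byte layout assumed is little-endian within a 32-bit word.

```python
# A sketch of how logical operations pack and unpack 8-bit characters in a
# 32-bit word: shifts position the bits, OR inserts them, AND masks them out.

def pack_bytes(b0, b1, b2, b3):
    """Pack four 8-bit values into one 32-bit word (b0 is the lowest byte)."""
    return (b3 << 24) | (b2 << 16) | (b1 << 8) | b0

def unpack_byte(word, index):
    """Extract byte `index` (0 = lowest) by shifting right and masking."""
    return (word >> (8 * index)) & 0xFF

w = pack_bytes(0x41, 0x42, 0x43, 0x44)   # 'A', 'B', 'C', 'D'
print(hex(w))                            # 0x44434241
print(chr(unpack_byte(w, 2)))            # 'C'
```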
Control Operations:
What distinguishes a computer from a simple calculator is its ability to make
decisions. Based on the input data and the values created during computation,
different instructions execute. Decision making is commonly represented in
programming languages using the if statement, sometimes combined with go to
statements and labels. MIPS assembly language includes two decision-making
instructions, similar to an if statement with a go to.
PART A
Little endian systems are those in which the least significant byte is stored in the smallest address.
3. Define – Guard and Round. (May/June 2014) (N/D' 16)
Guard is the first of two extra bits kept on the right during intermediate calculations of floating-point numbers. It is used to improve rounding accuracy.
Round is a method to make the intermediate floating-point result fit the floating-point format; the goal is typically to find the nearest number that can be represented in the format. IEEE 754 therefore always keeps two extra bits on the right during intermediate additions, called guard and round, respectively.
4. Let X = 1010100 and Y = 1000011. Perform X - Y and Y - X using 2's complement.
(1) The 2's complement of Y is 0111101. X + 0111101 = 10010001; discarding the end carry, X - Y = 0010001.
(2) The 2's complement of X is 0101100. Y + 0101100 = 1101111, so Y - X = 1101111 (a negative result in 2's complement form).
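The subtractions above can be checked with a short simulation. The helper name is mine; the 7-bit width matches the operands.

```python
# Checking the 7-bit two's-complement subtractions: X = 1010100 (84),
# Y = 1000011 (67). The carry out of the 7th bit is discarded.

BITS = 7
MASK = (1 << BITS) - 1  # 0b1111111

def twos_complement(value):
    """Two's complement of a 7-bit value: invert the bits and add 1."""
    return (~value + 1) & MASK

X = 0b1010100  # 84
Y = 0b1000011  # 67

x_minus_y = (X + twos_complement(Y)) & MASK  # carry out is discarded
y_minus_x = (Y + twos_complement(X)) & MASK

print(format(x_minus_y, "07b"))  # 0010001 = +17
print(format(y_minus_x, "07b"))  # 1101111 = -17 in two's complement
```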
6. What are the functions of ALU?
An arithmetic logic unit (ALU) is a digital circuit used to perform arithmetic and logic operations. It represents the fundamental building block of the central processing unit (CPU) of a computer. Modern CPUs contain very powerful and complex ALUs. In addition to ALUs, modern CPUs contain a control unit (CU).
7. What do you mean by subword parallelism? (April/May 2015)
By partitioning the carry chains within a 128-bit adder, a processor could use parallelism to perform simultaneous operations on short vectors of sixteen 8-bit operands, eight 16-bit operands, four 32-bit operands, or two 64-bit operands. The cost of such partitioned adders is small. This concept is called subword parallelism.
PART B
ALU
Most computer operations are executed in the arithmetic and logic unit (ALU) of the processor. Consider a typical example: suppose two numbers located in the memory are to be added. They are brought into the processor, and the actual addition is carried out by the ALU. The sum may then be stored in the memory or retained in the processor for immediate use.
An exception or interrupt is essentially an unscheduled procedure call. The address of the instruction that overflowed is saved in a register, and the computer jumps to a predefined address to invoke the appropriate routine for that exception. The interrupted address is saved so that in some situations the program can continue after corrective code is executed. MIPS includes a register called the exception program counter (EPC) to contain the address of the instruction that caused the exception. The instruction move from system control (mfc0) is used to copy EPC into a general-purpose register so that MIPS software has the option of returning to the offending instruction via a jump register instruction.
Addition and Subtraction Example:
Adding 6 to 7 in binary and then subtracting 6 from 7 in binary:
  0000 0000 0000 0000 0000 0000 0000 0111 = 7
+ 0000 0000 0000 0000 0000 0000 0000 0110 = 6
= 0000 0000 0000 0000 0000 0000 0000 1101 = 13
Subtracting 6 from 7 can be done directly:
  0000 0000 0000 0000 0000 0000 0000 0111 = 7
– 0000 0000 0000 0000 0000 0000 0000 0110 = 6
= 0000 0000 0000 0000 0000 0000 0000 0001 = 1
Instructions available: add, subtract, add immediate, add unsigned, subtract unsigned.
CARRY LOOKAHEAD ADDER
In a ripple-carry adder each carry bit must wait for the carry from the previous bit position, so 64-bit addition would be 8 times slower than 8-bit addition.
• It is possible to build a circuit called a "carry look-ahead adder" that speeds up addition by eliminating the need to "ripple" carries through the word.
• Carry look-ahead is expensive.
• If n is the number of bits in a ripple adder, the circuit complexity (number of gates) is O(n).
• For full carry look-ahead, the complexity is O(n³).
• Complexity can be reduced by rippling smaller look-aheads: e.g., each 16-bit group is handled by four 4-bit adders, and the 16-bit adders are rippled into a 64-bit adder.
The advantage of the CLA scheme used in this circuit is its simplicity, because
each CLA block calculates the generate and propagate signals for two bits only. This
is much easier to understand than the more complex variants presented in other
textbooks, where combinatorial logic is used to calculate the G and P signals of four
or more bits, and the resulting adder structure is slightly faster but also less regular.
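The generate/propagate idea behind the CLA blocks can be sketched in software. This is a minimal per-bit sketch (the function name and structure are mine), not the exact two-bit block arrangement of the figure.

```python
# A sketch of carry-lookahead logic: each bit position computes generate
# (g = a AND b) and propagate (p = a XOR b), and carries come from the
# recurrence c[i+1] = g[i] OR (p[i] AND c[i]). Expanding this recurrence
# expresses every carry in terms of g, p, and carry_in alone, which is what
# lets hardware form all carries in parallel instead of rippling them.

def cla_add(a, b, n=4, carry_in=0):
    """Add two n-bit numbers using generate/propagate signals."""
    g = [((a >> i) & 1) & ((b >> i) & 1) for i in range(n)]
    p = [((a >> i) & 1) ^ ((b >> i) & 1) for i in range(n)]
    c = [carry_in]
    for i in range(n):
        c.append(g[i] | (p[i] & c[i]))
    s = sum((p[i] ^ c[i]) << i for i in range(n))
    return s, c[n]  # n-bit sum and carry-out

print(cla_add(0b1011, 0b0110))  # (1, 1): 11 + 6 = 17 = 1_0001
```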
MULTIPLICATION
The approach taught in grammar school is to take the digits of the multiplier one at a time from right to left, multiplying the multiplicand by the single digit of the multiplier and shifting the intermediate product one digit to the left of the earlier intermediate products.
The first observation is that the number of digits in the product is considerably larger than the number in either the multiplicand or the multiplier. In fact, if we ignore the sign bits, the length of the multiplication of an n-bit multiplicand and an m-bit multiplier is a product that is n + m bits long. That is, n + m bits are required to represent all possible products. Hence, like add, multiply must cope with overflow because we frequently want a 32-bit product as the result of multiplying two 32-bit numbers.
Multiplying 1000ten by 1001ten:
Multiplicand      1000
Multiplier      × 1001
                  1000
                 0000
                0000
               1000
Product        1001000
In this example we restricted the decimal digits to 0 and 1. With only two choices, each step of the multiplication is simple:
1. Just place a copy of the multiplicand in the proper place if the multiplier digit is a 1, or
2. Place 0 (0 × multiplicand) in the proper place if the digit is 0.
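The two rules above can be simulated as a shift-and-add loop. This models the first multiplication hardware in software; the function name is mine, and Python's unbounded integers stand in for the 64-bit registers.

```python
# A sketch of the first multiplication algorithm: each step tests the low
# bit of the multiplier, conditionally adds the (shifted) multiplicand into
# the product, then shifts multiplicand left and multiplier right.

def shift_add_multiply(multiplicand, multiplier, bits=32):
    """Unsigned shift-and-add multiplication."""
    product = 0
    for _ in range(bits):
        if multiplier & 1:      # step 1: test multiplier bit 0
            product += multiplicand
        multiplicand <<= 1      # step 2: shift multiplicand left
        multiplier >>= 1        # step 3: shift multiplier right
    return product

print(shift_add_multiply(0b1000, 0b1001, bits=4))  # 72 = 0b1001000, i.e. 8 * 9
```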
Fig. First version of the multiplication hardware
Fig. The first multiplication algorithm
The multiplier is in the 32-bit Multiplier register and that the 64-bit Product
e
register is initialized to 0. Over 32 steps a 32-bit multiplicand would move 32 bits to
the left. Hence we need a 64-bit Multiplicand register, initialized with the 32-bit
multiplicand in the right half and 0 in the left half. This register is then shifted left 1 bit
each step to align the multiplicand with the sum being accumulated in the 64-bit
Product register.
Moore’s Law has provided so much more in resources that hardware
designers can now build much faster multiplication hardware. Whether the
multiplicand is to be added or not is known at the beginning of the multiplication by
looking at each of the 32 multiplier bits. Faster multiplications are possible by
essentially providing one 32-bit adder for each bit of the multiplier: one input is the
multiplicand ANDed with a multiplier bit and the other is the output of a prior adder.
SIGNED MULTIPLICATION:
In the signed multiplication, convert the multiplier and multiplicand to positive
numbers and then remember the original signs. The algorithms should then be run
for 31 iterations, leaving the signs out of the calculation. The shifting steps would
need to extend the sign of the product for signed numbers. When the algorithm
completes, the lower word would have the 32-bit product.
FASTER MULTIPLICATION
Fig. Faster multiplier
Faster multiplications are possible by essentially providing one 32-bit adder for each bit of the multiplier: one input is the multiplicand ANDed with a multiplier bit, and the other is the output of a prior adder. Connect the outputs of adders on the right to the inputs of adders on the left, making a stack of adders 32 high.
The above figure shows an alternative way to organize the 32 additions in a parallel tree. Instead of waiting for 32 add times, we wait just log2(32), or five, 32-bit add times. Multiply can go even faster than five add times because of the use of carry save adders, and it is easy to pipeline such a design to be able to support many multiplies simultaneously.
Multiply in MIPS:
MIPS provide a separate pair of 32-bit registers to contain the 64-bit product,
called Hi and Lo. To produce a properly signed or unsigned product, MIPS have two
instructions: multiply (mult) and multiply unsigned (multu). To fetch the integer 32-bit
product, the programmer uses move from lo (mflo). The MIPS assembler generates a
pseudo instruction for multiply that specifies three general purpose registers,
generating mflo and mfhi instructions to place the product into registers.
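The Hi/Lo split can be illustrated in software. This sketch assumes unsigned 32-bit operands (i.e. it models multu followed by mfhi/mflo); the function name is mine.

```python
# A sketch of how a 64-bit product is split across the Hi and Lo registers;
# mfhi/mflo then copy one half into a general-purpose register.

MASK32 = (1 << 32) - 1

def multu(a, b):
    """Return (hi, lo) halves of the 64-bit product of two unsigned 32-bit values."""
    product = (a & MASK32) * (b & MASK32)
    return (product >> 32) & MASK32, product & MASK32

hi, lo = multu(0xFFFFFFFF, 2)        # 2 * (2**32 - 1) = 0x1_FFFFFFFE
print(hex(hi), hex(lo))              # 0x1 0xfffffffe
```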
Fig. Booth algorithm
• Booth's algorithm examines the current multiplier bit Q0 together with the bit previously shifted out, Q-1. Then:
00: Middle of a string of 0s, so no arithmetic operation.
01: End of a string of 1s, so add the multiplicand to the left half of the product (A).
10: Beginning of a string of 1s, so subtract the multiplicand from the left half of the product (A).
11: Middle of a string of 1s, so no arithmetic operation.
• Then shift A, Q, and bit Q-1 right one bit using an arithmetic shift.
• In an arithmetic shift, the MSB remains unchanged.
Example of Booth's Algorithm (7 × 3 = 21)
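The worked example can be simulated with a 4-bit sketch of Booth's algorithm. Register names A, Q, and Q-1 follow the text; the function name and the masking details are mine.

```python
# A sketch of Booth's algorithm: inspect the bit pair (Q0, Q-1), add or
# subtract the multiplicand M into A per the 00/01/10/11 rules above, then
# arithmetic-shift the A,Q,Q-1 register group right one bit.

def booth_multiply(multiplicand, multiplier, bits=4):
    mask = (1 << bits) - 1
    a, q, q_1 = 0, multiplier & mask, 0
    m = multiplicand & mask
    for _ in range(bits):
        pair = ((q & 1) << 1) | q_1
        if pair == 0b01:            # end of a string of 1s: A = A + M
            a = (a + m) & mask
        elif pair == 0b10:          # start of a string of 1s: A = A - M
            a = (a - m) & mask
        # arithmetic right shift of A,Q,Q-1: the MSB of A is preserved
        q_1 = q & 1
        q = ((q >> 1) | ((a & 1) << (bits - 1))) & mask
        a = (a >> 1) | (a & (1 << (bits - 1)))
    result = (a << bits) | q
    if result & (1 << (2 * bits - 1)):   # interpret as signed 2*bits value
        result -= 1 << (2 * bits)
    return result

print(booth_multiply(7, 3))    # 21
print(booth_multiply(7, -3))   # -21
```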
3. Explain the concepts of the division algorithm and its hardware. / Divide (12)10 by (3)10 using restoring and non-restoring division algorithms with step-by-step intermediate results and explain. (Nov/Dec 2014, 15, 16) (A/M' 16) (16)
The reciprocal operation of multiply is divide, an operation that is even less frequent and even more quirky. It even offers the opportunity to perform a mathematically invalid operation: dividing by 0. The example is dividing 1,001,010 by 1000. The two operands (dividend and divisor) and the result (quotient) of divide are accompanied by a second result called the remainder. Here is another way to express the relationship between the components:
Dividend = Quotient × Divisor + Remainder
where the remainder is smaller than the divisor. Infrequently, programs use the divide instruction just to get the remainder, ignoring the quotient. The basic grammar school division algorithm tries to see how big a number can be subtracted, creating a digit of the quotient on each attempt. Binary numbers contain only 0 or 1, so binary division is restricted to these two choices, thereby simplifying binary division. Assume that both the dividend and divisor are positive, and hence that the quotient and the remainder are nonnegative. The division operands and both results are 32-bit values.
A DIVISION ALGORITHM AND HARDWARE:
Initially, the 32-bit Quotient register set to 0. Each iteration of the algorithm
needs to move the divisor to the right one digit, start with the divisor placed in the left
half of the 64-bit Divisor register and shift it right 1 bit each step to align it with the
dividend. The Remainder register is initialized with the dividend. Figure shows three
steps of the first division algorithm. Unlike a human, the computer isn’t smart enough
to know in advance whether the divisor is smaller than the dividend. It must first
subtract the divisor in step 1. If the result is positive, the divisor was smaller than or equal to the dividend, so generate a 1 in the quotient (step 2a). If the result is negative, the
next step is to restore the original value by adding the divisor back to the remainder
and generate a 0 in the quotient (step 2b). The divisor is shifted right and then iterate
again. The remainder and quotient will be found in their namesake registers after the
iterations are complete.
The following figure shows three steps of the first division algorithm.
DIVISION ALGORITHM:
It must first subtract the divisor in step 1; remember that this is how we performed the comparison in the set on less than instruction. If the result is positive, the divisor was smaller than or equal to the dividend, so we generate a 1 in the quotient (step 2a). If the result is negative, the next step is to restore the original value by adding the divisor back to the remainder and generate a 0 in the quotient (step 2b). The divisor is shifted right and then we iterate again. The remainder and quotient will be found in their namesake registers after the iterations are complete.
Using a 4-bit version of the algorithm to save pages, let's try dividing 7ten by 2ten, or 0000 0111two by 0010two.
Fig. Values of register in division algorithm
The above figure shows the value of each register for each of the steps, with the quotient being 3ten and the remainder 1ten. Notice that the test in step 2 of whether the remainder is positive or negative simply tests whether the sign bit of the Remainder register is a 0 or a 1. The surprising requirement of this algorithm is that it takes n + 1 steps to get the proper quotient and remainder.
This algorithm and hardware can be refined to be faster and cheaper. The speedup comes from shifting the operands and the quotient simultaneously with the subtraction. This refinement halves the width of the adder and registers by noticing where there are unused portions of registers and adders.
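The restoring steps (trial subtract, restore on negative, shift) can be simulated directly. For brevity this sketch aligns the divisor by shifting an ordinary integer instead of modeling a 64-bit Divisor register; the function name is mine.

```python
# A sketch of restoring division for unsigned values: subtract the aligned
# divisor, restore on a negative result, and record one quotient bit per step.

def restoring_divide(dividend, divisor, bits=4):
    """Return (quotient, remainder) for unsigned operands."""
    if divisor == 0:
        raise ZeroDivisionError("division by zero")
    remainder, quotient = dividend, 0
    for i in range(bits - 1, -1, -1):
        shifted = divisor << i          # divisor aligned with quotient bit i
        remainder -= shifted            # step 1: trial subtraction
        if remainder >= 0:
            quotient |= 1 << i          # step 2a: positive -> quotient bit 1
        else:
            remainder += shifted        # step 2b: restore, quotient bit 0
    return quotient, remainder

print(restoring_divide(7, 2))    # (3, 1)
print(restoring_divide(12, 3))   # (4, 0)
```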
SIGNED DIVISION:
The one complication of signed division is that we must also set the sign of the remainder. Remember that the following equation must always hold:
Dividend = Quotient × Divisor + Remainder
To understand how to set the sign of the remainder, let's look at the example of dividing all the combinations of ±7ten by ±2ten.
Case:
+7 ÷ +2: Quotient = +3, Remainder = +1. Checking the results:
7 = 3 × 2 + (+1) = 6 + 1
If we change the sign of the dividend, the quotient must change as well:
–7 ÷ +2: Quotient = –3
Rewriting our basic formula to calculate the remainder:
Remainder = (Dividend – Quotient × Divisor) = –7 – (–3 × +2) = –7 – (–6) = –1
So, –7 ÷ +2: Quotient = –3, Remainder = –1. Checking the results again:
–7 = –3 × 2 + (–1) = –6 – 1
The following figure shows the revised hardware.
Fig. Division hardware
The reason the answer isn't a quotient of –4 and a remainder of +1, which would also fit this formula, is that the absolute value of the quotient would then change depending on the sign of the dividend and the divisor; clearly, programming would be an even greater challenge. This anomalous behavior is avoided by following the rule that the dividend and remainder must have the same signs, no matter what the signs of the divisor and quotient. We calculate the other combinations by following the same rule:
–(x ÷ y) ≠ (–x) ÷ y
+7 ÷ –2: Quotient = –3, Remainder = +1
–7 ÷ –2: Quotient = +3, Remainder = –1
Thus the correctly signed division algorithm negates the quotient if the signs of the operands are opposite and makes the sign of the nonzero remainder match the dividend.
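The rule can be encoded in a few lines. Note that this deliberately differs from Python's own // and %, which round toward negative infinity; the function name is mine.

```python
# A sketch of signed division following the rule above: negate the quotient
# if the operand signs differ, and give the nonzero remainder the sign of
# the dividend, so that Dividend = Quotient * Divisor + Remainder holds.

def signed_divide(dividend, divisor):
    q = abs(dividend) // abs(divisor)
    r = abs(dividend) % abs(divisor)
    if (dividend < 0) != (divisor < 0):
        q = -q
    if dividend < 0:
        r = -r
    return q, r

for a, b in [(7, 2), (-7, 2), (7, -2), (-7, -2)]:
    print(a, b, signed_divide(a, b))
# (7, 2) -> (3, 1);   (-7, 2) -> (-3, -1)
# (7, -2) -> (-3, 1); (-7, -2) -> (3, -1)
```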
Faster Division:
The many adders used to speed up multiply cannot be used to do the same trick for divide. The reason is that we need to know the sign of the difference before performing the next step of the algorithm, whereas with multiply we could calculate the 32 partial products immediately.
There are techniques to produce more than one bit of the quotient per step.
The SRT division technique tries to guess several quotient bits per step, using
a table lookup based on the upper bits of the dividend and remainder. It relies on
subsequent steps to correct wrong guesses.
RESTORING DIVISION:
• Assume X is the k-bit dividend register.
• Assume Y is the k-bit divisor.
• Assume S is a sign bit.
1. Start: Load 0 into the k-bit accumulator A; the dividend X is loaded into the k-bit quotient register MQ.
2. Step A: Shift the 2k-bit register pair A-MQ left.
3. Step B: Subtract the divisor Y from A.
4. Step C: If the sign of A (msb) = 1, then reset MQ0 (lsb) = 0, else set it to 1.
5. Step D: If MQ0 = 0, add Y (restoring the effect of the earlier subtraction).
6. Steps A to D repeat until the total number of cycles = k. At the end, A has the remainder and MQ has the quotient.
Fig. Division using Non-restoring Algorithm
1. Load (the upper half, k − 1 bits, of the dividend X) into the k-bit accumulator A, and load the dividend X (lower half bits) into the lower k bits of the quotient register MQ.
• Reset the sign S = 0.
• Subtract the k-bit divisor Y from S-A (1 plus k bits) and assign MQ0 as per S.
2. If the sign of A, S = 0, shift the S plus 2k-bit register pair A-MQ left and subtract the k-bit divisor Y from S-A (1 plus k bits); else if the sign of A, S = 1, shift the S plus 2k-bit register pair A-MQ left and add the divisor Y into S-A (1 plus k bits).
• Assign MQ0 as per S.
3. Repeat step 2 until the total number of operations = k.
4. If at the last step the sign of A in S = 1, then add Y into S-A to leave the correct remainder in A, and also assign MQ0 as per S; else do nothing.
5. A has the remainder and MQ has the quotient.
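The non-restoring steps can be simulated as well: shift A-MQ left, then subtract or add the divisor depending on the current sign of A, with at most one final restoring add. The function name and the use of Python integers for the registers are my assumptions.

```python
# A sketch of non-restoring division: unlike the restoring version, a
# negative partial remainder is not corrected immediately; the next step
# adds the divisor instead of subtracting it.

def nonrestoring_divide(dividend, divisor, k=4):
    a, mq = 0, dividend
    for _ in range(k):
        # shift the A-MQ register pair left one bit
        a = (a << 1) | ((mq >> (k - 1)) & 1)
        mq = (mq << 1) & ((1 << k) - 1)
        a = a - divisor if a >= 0 else a + divisor
        mq |= 1 if a >= 0 else 0    # set MQ0 from the sign of A
    if a < 0:                       # single final correction step
        a += divisor
    return mq, a                    # quotient, remainder

print(nonrestoring_divide(12, 3))  # (4, 0)
print(nonrestoring_divide(7, 2))   # (3, 1)
```

The saving over restoring division is that each cycle does exactly one add or subtract, never two.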
1. Explain how floating-point addition is carried out in a computer system. Give an example for a binary floating-point addition. / Explain in detail about floating-point operations. (April/May 2015) (16)
Scientific notation has a single digit to the left of the decimal point. A number in scientific notation that has no leading 0s is called a normalized number, which is the usual way to write it. Floating point is computer arithmetic that represents numbers in which the binary point is not fixed. Floating-point numbers are usually a multiple of the size of a word.
The representation of a MIPS floating-point number is shown below, where s is the sign of the floating-point number (1 meaning negative), exponent is the value of the 8-bit exponent field (including the sign of the exponent), and fraction is the 23-bit number. This representation is called sign and magnitude, since the sign has a separate bit from the rest of the number.
A standard scientific notation for reals in normalized form offers three advantages:
It simplifies exchange of data that includes floating-point numbers;
It simplifies the floating-point arithmetic algorithms to know that numbers will always be in this form;
It increases the accuracy of the numbers that can be stored in a word, since the unnecessary leading 0s are replaced by real digits to the right of the binary point.
Fig. Scientific notation
FLOATING POINT ADDITION:
Example:
Let's add numbers in scientific notation by hand to illustrate the problems in floating-point addition: 9.999ten × 10^1 + 1.610ten × 10^-1. Assume that we can store only four decimal digits of the significand and two decimal digits of the exponent.
Step 1.
To be able to add these numbers properly, we must align the decimal point of the number that has the smaller exponent. Hence, we need a form of the smaller number, 1.610ten × 10^-1, that matches the larger exponent. We obtain this by observing that there are multiple representations of an unnormalized floating-point number in scientific notation:
1.610ten × 10^-1 = 0.1610ten × 10^0 = 0.01610ten × 10^1
The number on the right is the version we desire, since its exponent matches the exponent of the larger number, 9.999ten × 10^1. Thus, the first step shifts the significand of the smaller number to the right until its corrected exponent matches that of the larger number. But we can represent only four decimal digits, so, after shifting, the number is really 0.016ten × 10^1.
Step 2. Next comes the addition of the significands: 9.999ten + 0.016ten = 10.015ten, so the sum is 10.015ten × 10^1.
Step 3. This sum is not in normalized scientific notation, so we adjust it: 10.015ten × 10^1 = 1.0015ten × 10^2.
Step 4. Since we assumed that the significand can be only four digits long (excluding the sign), we must round the number. In our grammar school algorithm, the rules truncate the number if the digit to the right of the desired point is between 0 and 4 and add 1 to the digit if the number to the right is between 5 and 9. The number 1.0015ten × 10^2 is rounded to four digits in the significand to 1.002ten × 10^2, since the fourth digit to the right of the decimal point was between 5 and 9. Notice that if we have bad luck on rounding, such as adding 1 to a string of 9s, the sum may no longer be normalized and we would need to perform step 3 again.
The algorithm for binary floating-point addition follows this decimal example. Adjust the significand of the number with the smaller exponent and then add the two significands. Step 3 normalizes the results, forcing a check for overflow
or underflow. The test for overflow and underflow in step 3 depends on the precision
of the operands. Recall that the pattern of all 0 bits in the exponent is reserved and
used for the floating-point representation of zero. Moreover, the pattern of all 1 bits in
the exponent is reserved for indicating values and situations outside the scope of
normal floating-point numbers. Thus, for single precision, the maximum exponent is
127, and the minimum exponent is -126. The limits for double precision are 1023 and
-1022.
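The decimal example above can be reproduced with Python's decimal module. This sketch collapses the align-and-add steps into exact decimal arithmetic and shows only the final rounding to four significant digits; the function name is mine.

```python
# A decimal sketch of the 4-digit floating-point addition worked above:
# 9.999 x 10^1 + 1.610 x 10^-1 rounds to 1.002 x 10^2.

from decimal import Decimal, ROUND_HALF_UP

def fp_add_4digit(a, b):
    """Add two Decimals, keeping only 4 significant decimal digits."""
    total = a + b                              # exact sum: 100.1510
    exponent = total.adjusted()                # position of the leading digit
    quantum = Decimal(1).scaleb(exponent - 3)  # keep 4 significant digits
    return total.quantize(quantum, rounding=ROUND_HALF_UP)

result = fp_add_4digit(Decimal("9.999E+1"), Decimal("1.610E-1"))
print(result)   # 100.2, i.e. 1.002 x 10^2
```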
Subword Parallelism:
Subword parallelism is an efficient and flexible solution for media processing, because these algorithms exhibit a great deal of data parallelism on lower-precision data. It is also useful for computations unrelated to multimedia that exhibit data parallelism on lower-precision data. Graphics and audio applications can take advantage of performing simultaneous operations on short vectors.
The term SIMD was originally defined in the 1960s as a category of multiprocessor with one control unit and multiple processing elements, where each instruction is executed by all processing elements on different data streams, e.g., the Illiac IV. Today the term is also used to describe partitionable ALUs in which multiple operands can fit in a fixed-width register and are acted upon in parallel. (Other terms include subword parallelism, microSIMD, short vector extensions, split-ALU, SLP / superword-level parallelism, and SIGD / single-instruction-group[ed]-data.)
The structure of the arithmetic element can be altered under program control. Each instruction specifies a particular form of machine in which to operate, ranging from a full 36-bit computer to four 9-bit computers with many variations. Not only is such a scheme able to make more efficient use of the memory in storing data of various word lengths, but it can also be expected to result in greater overall machine speed because of the increased parallelism of operation.
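The partitioned-adder idea can be sketched in software by blocking the carry between subwords. This sketch assumes wraparound (modular) arithmetic within each 8-bit lane; real ISAs also provide saturating variants, and the function name is mine.

```python
# A sketch of subword parallelism: one 32-bit addition treated as four
# independent 8-bit additions by masking off the carry between lanes.

def packed_add_8x4(x, y):
    """Add four packed 8-bit lanes of two 32-bit words, no carry across lanes."""
    result = 0
    for lane in range(4):
        shift = 8 * lane
        a = (x >> shift) & 0xFF
        b = (y >> shift) & 0xFF
        result |= ((a + b) & 0xFF) << shift   # wraparound within the lane
    return result

x = 0x01_02_FF_10
y = 0x01_01_01_20
print(hex(packed_add_8x4(x, y)))  # 0x2030030: note 0xFF + 0x01 wraps to 0x00
```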
PART-A
OP Rs Rt Rd shamt Funct
6 bits 5 bits 5 bits 5 bits 5 bits 6 bits
rs - The first register source operand.
rt - The second register source operand.
rd - The register destination operand; it gets the result of the operation.
shamt - Shift amount.
funct - Function. This field, called the function code, selects the specific variant of the operation in the op field.
1. What is branch prediction buffer and branch prediction? (April/May 2015) (N/D'15)
A branch prediction buffer or branch history table is a small memory indexed by the lower portion of the address of the branch instruction. The memory contains a bit that says whether the branch was recently taken or not.
Branch prediction is a method of resolving a branch hazard that assumes a given outcome for the branch and proceeds from that assumption rather than waiting to ascertain the actual outcome.
There are three different types of data hazard, named according to the order of operations that must be maintained:
RAW: A Read After Write hazard.
WAR: A Write After Read hazard, the reverse of a RAW.
WAW: A Write After Write hazard.
CONTROL HAZARDS:
This is when a decision needs to be made, but the information needed to make the decision is not available yet. A control hazard is actually the same thing as a RAW data hazard (see above), but is considered separately because different techniques can be employed to resolve it; in effect, we make it less important by trying to make good guesses as to what the decision is going to be.
3. What is meant by speculative execution? (May/June 2012) (N/D'14)
Speculative execution is an optimization technique where a computer system performs some task that may not be actually needed. The main idea is to do work before it is known
whether that work will be needed at all, so as to prevent a delay that would have to
be incurred by doing the work after it is known whether it is needed. If it turns out the
work was not needed after all, any changes made by the work are reverted and the
results are ignored.
The objective is to provide more concurrency if extra resources are available.
This approach is employed in a variety of areas, including branch prediction in
pipelined processors, prefetching memory and files, and optimistic concurrency
control in database systems.
speedup. Splitting the same operation into 5 stages, 4 of which are 7.5 ns long and
one of which is 10 ns long will result in only a 4x speedup.
If your starting point is a multiple-clock-cycle-per-instruction machine, then
pipelining decreases CPI.
If your starting point is a single-clock-cycle-per-instruction machine, then
pipelining decreases cycle time.
3.What are the disadvantages of increasing the number of stages in
pipelined processing? (April/May 2011)
The design of a non-pipelined processor is simpler and cheaper to
manufacture; a non-pipelined processor executes only a single instruction at a
time.
In a pipelined processor, insertion of flip-flops between modules increases the
instruction latency compared to a non-pipelined processor.
A non-pipelined processor will have a defined instruction throughput. The
performance of a pipelined processor is much harder to predict and may vary
widely for different programs.
1.What are the problems faced in Instruction Pipeline and How data hazard can
be prevented in pipelining?
Resources Conflicts
Data Dependency
Branch Difficulties
Data hazards in instruction pipelining can be prevented by the following
techniques:
Operand Forwarding
Software Approach
PART B
1. Write short notes on MIPS Implementation scheme. (8) (N/D’15)
a. Instruction fetch cycle (IF):
IR = Mem[PC];
NPC = PC + 4;
Operation: Send out the PC and fetch the instruction from memory into the
instruction register (IR). Increment the PC by 4 to address the next sequential
instruction.
IR holds the instruction that will be needed on subsequent clock cycles.
Register NPC holds the next sequential PC.
a) The ALU adds NPC to the sign-extended immediate value in Imm, which is
shifted left by 2 bits to create a word offset, to compute the address of the
branch target.
b) Register A, which has been read in the prior cycle, is checked to determine
whether the branch is taken.
c) Considering only one form of branch (BEQZ), the comparison is against 0.
d. Memory access/branch completion cycle (MEM):
PC is updated for all instructions: PC = NPC;
i. Memory reference:
LMD = Mem[ALUOutput]; or
Mem[ALUOutput] = B;
Operation:
a) Access memory if needed.
b) If the instruction is a load, data returns from memory and is placed in the LMD register.
c) If the instruction is a store, data from the B register is written into memory.
Branch:
if (cond) PC = ALUOutput
Operation: If the instruction is a branch, the PC is replaced with the branch
destination address in register ALUOutput.
e. Write-back cycle (WB):
* Register-Register ALU instruction:
Regs[rd] = ALUOutput;
* Register-Immediate ALU instruction:
Regs[rt] = ALUOutput;
* Load instruction:
Regs[rt] = LMD;
Operation: Write the result into register file, depending on the effective opcode.
(16) CONTROL:
Components of the processor that command the datapath, memory, and I/O
devices according to the instructions in memory.
Building a Datapath:
Elements that process data and addresses in the CPU - memories, registers,
ALUs. The MIPS datapath can be built incrementally by considering only a subset
of instructions. The three main elements are:
Fig. Datapath
A memory unit is used to store instructions of a program and supply
instructions given an address.
Needs to provide only read access (once the program is loaded), so no control
signal is needed.
PC (Program Counter, or instruction address register) is a register that holds
the address of the current instruction. A new value is written to it every clock
cycle, and no control signal is required to enable the write.
An adder increments the PC to the address of the next instruction. It is an ALU
permanently wired to do only addition, so no extra control signal is required.
Combinational element:
Elements that operate on values, e.g. adder, ALU. Different elements are
required by the different classes of instructions:
o Arithmetic and logical instructions
o Data transfer instructions
o Branch instructions
Register file:
A collection of registers. Any register can be read or written by specifying the
number of the register. The register file contains the register state of the
computer.
Read from register:
2 inputs to the register file specifying the register numbers (5 bits wide for
the 32 registers).
2 outputs from the register file with the read values (32 bits wide).
Reads happen for all instructions; no control signal is required.
Write to register file:
1 input to the register file specifying the register number (5 bits wide for the
32 registers).
1 input to the register file with the value to be written (32 bits wide).
Writes happen only for some instructions; a RegWrite control signal is required.
ALU:
o Takes two 32-bit inputs and produces a 32-bit output. It also sets a one-bit
signal if the result is 0.
o The operation done by the ALU is controlled by a 4-bit control signal
input, which is set according to the instruction.
Depending on the instruction class, the ALU will need to perform one of these
first five functions. (NOR is needed for other parts of the MIPS instruction set not
found in the subset we are implementing.) For load word and store word instructions,
we use the ALU to compute the memory address by addition.
In this Figure we show how to set the ALU control inputs based on the 2 bit
ALUOp control and the 6 bit function code. Later in this chapter we will see how the
ALUOp bits are generated from the main control unit.
Designing the Main Control Unit:
To start this process, let’s identify the fields of an instruction and the
control lines that are needed for the datapath. To understand how to connect the
fields of an instruction to the datapath, it is useful to review the formats of
the three instruction classes: the R-type, branch, and load-store instructions.
The three instruction classes (R-type, load and store, and branch)
use two different instruction formats:
There are several major observations about this instruction format that we will
rely on:
The op field, also called the opcode, is always contained in bits 31:26.
We will refer to this field as Op[5:0].
The two registers to be read are always specified by the rs and rt fields,
at positions 25:21 and 20:16. This is true for the R-type instructions,
branch equal, and store.
The base register for load and store instructions is always in bit
positions 25:21 (rs).
The 16‑bit offset for branch equal, load, and store is always in positions
15:0.
FIG: ALU - CONTROL PATH OPERATIONS
Finalizing Control:
INPUT/OUTPUT SIGNALS
Now that we have seen how the instructions operate in steps, let’s continue
with the control implementation. The outputs are the control lines, and the input is the
6 bit opcode field, Op [5:0]. Thus, we can create a truth table for each of the outputs
based on the binary encoding of the opcodes.
Figure shows the logic in the control unit as one large truth table that
combines all the outputs and that uses the opcode bits as inputs. It completely
specifies the control function, and we can implement it directly in gates in an
automated fashion.
SIGNAL EFFECTS
This multiplexor is controlled by the jump control signal. The jump target
address is obtained by shifting the lower 26 bits of the jump instruction left 2 bits,
effectively adding 00 as the low-order bits, and then concatenating the upper 4 bits of
PC + 4 as the high-order bits, thus yielding a 32-bit address. The clock cycle is
determined by the longest possible path in the processor. This path is almost
certainly a load instruction, which uses five functional units in series: the instruction
memory, the register file, the ALU, the data memory, and the register file.
PIPELINING:
In computers, a pipeline is the continuous and somewhat overlapped movement of
instructions to the processor, or of the steps taken by the processor to perform
an instruction. Pipelining is the use of a pipeline. Without a
pipeline, a computer processor gets the first instruction from memory, performs the
operation it calls for, and then goes to get the next instruction from memory, and so
forth. While fetching (getting) the instruction, the arithmetic part of the
processor is idle. It must wait until it gets the next instruction.
With pipelining, the computer architecture allows the next instructions to be
fetched while the processor is performing arithmetic operations, holding them
in a buffer close to the processor until each instruction operation can
be performed. The staging of instruction fetching is continuous. The result is an
increase in the number of instructions that can be performed during a given time
period.
memory. PC=PC+4.
Decode the instruction and read the registers. Do the equality test on the
registers as they are read, for a possible branch. Compute the possible branch
target address by adding the sign-extended offset to the incremented PC.
Decoding is done in parallel with reading registers, which is possible because
the register specifiers are at a fixed location in a RISC architecture, known as
fixed-field decoding.
Memory reference: The ALU adds the base register and the offset to form
the effective address.
Register-Register ALU instruction: The
ALU performs the operation specified by the ALU opcode on
the values read from the register file.
Register-Immediate ALU instruction:
The ALU performs the operation specified by the ALU opcode on the first
value read from the register file and the sign-extended immediate.
d. Memory access (MEM):
If the instruction is a load, memory does a read using the effective address
computed in the previous cycle. If it is a store, then the memory writes the data
from the second register read from the register file using the effective address.
3. Does not deal with the PC. To start a new instruction every clock, we must
increment and store the PC every clock, and this must be done during the IF
stage in preparation for the next instruction.
To ensure that instructions in different stages of the pipeline do not interfere
with one another, pipeline registers are introduced between successive stages of
the pipeline, so that at the end of a clock cycle all the results from a given
stage are stored into a register that is used as the input to the next stage on
the next clock cycle.
Pipelined Datapath and Control:
The division of an instruction into five stages means a five-stage pipeline,
which in turn means that up to five instructions will be in execution during any
single clock cycle. Thus, we must separate the datapath into five pieces, with
each piece named corresponding to a stage of instruction execution:
1. IF: Instruction fetch
2. ID: Instruction decode and register file read
3. EX: Execution or address calculation
4. MEM: Data memory access
5. WB: Write back
FIG: DATAPATH & CONTROL IN PIPELINE PROCESSING
In Figure, these five components correspond roughly to the way the
datapath is drawn; instructions and data move generally from left to
right through the five
stages as they complete execution. Returning to our laundry
analogy, clothes get cleaner,
drier, and more organized as they move through the line, and they never
move backward.
PIPELINED DATAPATH:
FIG: PIPELINED DATAPATH
It shows the pipelined datapath with the pipeline registers highlighted. All
instructions advance during each clock cycle from one pipeline register to the
next. The registers are named for the two stages separated by that register. For
example, the pipeline register between the IF and ID stages is called IF/ID.
The pipeline registers separate each pipeline stage. They are labeled by the
stages that they separate; for example, the first is labeled IF/ID because it
separates the instruction fetch and instruction decodes stages. The registers must
be wide enough to store all the data corresponding to the lines that go through them.
a. Instruction fetch: The instruction is read from memory using the address in the
PC and then is placed in the IF/ID pipeline register. This stage occurs before the
instruction is identified, so the top portion of Figure works for store as well as load.
b. Instruction decode and register file read: The instruction in the IF/ID
pipeline register supplies the register numbers for reading two
registers and extends the sign of the 16-bit immediate. These three 32-bit
values are all stored in the ID/EX pipeline register. The bottom portion of Figure
for load instructions also shows the operations of the second stage for
stores. These first two stages are executed by all instructions, since it is too early
to know the type of the instruction.
c. Execute and address calculation: Figure shows the
third step; the effective address is placed in the EX/MEM pipeline register.
d. Memory access: The top portion of Figure shows the data being written to
memory. Note that the register containing the data to be stored was
read in an earlier stage and stored in ID/EX. The only way to make
the data available during the MEM stage is to place the
data into the EX/MEM pipeline register in the EX stage, just as we stored the
effective address into EX/MEM.
e. Write-back: The bottom portion of Figure shows the final step of the store. For
this instruction, nothing happens in the write-back stage. Since every instruction
behind the store is already in progress, we have no way to accelerate those
instructions. Hence, an instruction passes through a stage even if there is nothing
to do, because later instructions are already progressing at the maximum rate.
lw $10, 20($1)
sub $11, $2, $3
add $12, $3, $4
lw $13, 24($1)
add $14, $5, $6
Multiple-clock-cycle pipeline diagram of five instructions:
Traditional multiple-clock-cycle pipeline diagram of five instructions:
Pipelined Control:
The first step is to label the control lines on the existing datapath. Figure
shows those lines. We borrow as much as we can from the control for the simple
datapath in Figure .In particular, we use the same ALU control logic, branch logic,
destination-register-number multiplexor, and control lines.
Fig.The pipelined datapath with the control signals identified.
To specify control for the pipeline, we need only set the
control values during each pipeline stage. Because each control line
is associated with a component active in only a single pipeline stage, we can divide
the control lines into five groups according to the pipeline stage.
c. Memory access: The control lines set in this stage are Branch, MemRead,
and MemWrite. These signals are set by the branch equal, load, and store
instructions, respectively. Recall that PCSrc in Figure selects the next
sequential address unless control asserts Branch and the ALU result was 0.
d. Write-back: The two control lines are MemtoReg, which decides between
sending the ALU result or the memory value to the register file, and RegWrite,
which writes the chosen value.
PIPELINE HAZARDS
4. (i).What is Hazard? Explain its types with suitable examples. (Nov/Dec 2014)
/Explain the different types of pipeline hazards with
suitable examples.(April/May 2015). (16)
There are situations in pipelining when the next instruction cannot execute in
the following clock cycle. These events are called hazards, and there are three
different types.
STRUCTURAL HAZARDS:
The first hazard is called a structural hazard. It
means that the hardware cannot support the combination of instructions that we
want to execute in the same clock cycle. A structural hazard in the laundry room
would occur if we used a washer-dryer combination instead of a separate washer
and dryer, or if our roommate
was busy doing something else and wouldn’t put clothes away. Our carefully
scheduled pipeline plans would then be foiled.
Suppose, however, that we had a single memory instead of two memories. If
the pipeline in Figure had a fourth instruction, we would see that in the same
clock cycle the first instruction is accessing data from memory while the fourth
instruction is fetching an instruction from that same memory. Without two memories,
our pipeline could have a structural hazard.
The MIPS instruction set was designed to be pipelined, making it fairly easy for
designers to avoid structural hazards when designing a pipeline.
DATA HAZARDS:
This is when reads and writes of data occur in a different order in the pipeline
than in the program code. There are three different types of data hazard
(named according to the order of operations that must be maintained):
RAW:
A Read After Write hazard occurs when, in the code as written, one instruction
reads a location after an earlier instruction writes new data to it, but in the pipeline the
write occurs after the read (so the instruction doing the read gets stale data).
WAR:
A Write After Read hazard is the reverse of a RAW: in the code a write occurs
after a read, but the pipeline causes the write to happen first.
WAW:
A Write after Write hazard is a situation in which two writes occur out of order.
For example, suppose we have an add instruction followed immediately by a
subtract instruction that uses the sum ($s0):
add $s0, $t0, $t1
sub $t2, $s0, $t3
FORWARDING WITH TWO INSTRUCTIONS:
Fig. Graphical representation of forwarding.
The figure shows the connection to forward the value in $s0 after the
execution stage of the add instruction as input to the execution stage of the sub
instruction. In this graphical representation of events, forwarding paths are valid only
if the destination stage is later in time than the source stage. For example, there
cannot be a valid forwarding path from the output of the memory access stage in the
first instruction to the input of the execution stage of the following, since that would
mean going backward in time.
stall one stage for a load-use data hazard, as Figure shows. This figure shows an
important pipeline concept, officially called a pipeline stall, but often given the
nickname bubble. We shall see stalls elsewhere in the pipeline.
Consider the following code segment in C
a = b + e;
c = b + f;
Here is the generated MIPS code for
this segment, assuming all variables are in memory and are addressable as
offsets from $t0:
lw $t1, 0($t0)
lw $t2, 4($t0)
add $t3, $t1, $t2
sw $t3, 12($t0)
lw $t4, 8($t0)
add $t5, $t1, $t4
sw $t5, 16($t0)
Find the hazards in the preceding code segment and reorder
the instructions to avoid any pipeline stalls. Both add instructions have
a hazard because of their
respective dependence on the immediately preceding lw instruction.
Notice that bypassing eliminates several other potential hazards, including the
dependence of the first add on the first lw and any hazards for
store instructions. Moving up the third lw instruction to become the third instruction
eliminates both hazards:
lw $t1, 0($t0)
lw $t2, 4($t0)
lw $t4, 8($t0)
add $t3, $t1, $t2
sw $t3, 12($t0)
add $t5, $t1, $t4
sw $t5, 16($t0)
On a pipelined processor with forwarding, the reordered sequence will
complete in two fewer cycles than the original version.
football team. Given how filthy the laundry is, we need to determine whether the
detergent and water temperature setting we select is strong enough to get the
uniforms clean but not so strong that the uniforms wear out sooner.
Performance of “Stall on Branch”:
Estimate the impact on the clock cycles per
instruction (CPI) of stalling on branches. Assume all other instructions have a CPI of
1. If we cannot resolve the branch in the second stage, as is often the case for
longer
pipelines, then we’d see an even larger slowdown if we stall on branches. The
cost of this option is too high for most computers to use and motivates a second
solution to the control hazard. Computers do indeed use prediction to handle
branches. One simple approach is to predict always that branches will be untaken.
When you’re right, the pipeline proceeds at full speed.
pipelining.
Hazards:
Prevent the next instruction in the instruction stream from executing during its
designated clock cycle. Hazards reduce the performance from the ideal
speedup gained by pipelining.
Performance of Pipelines with Stalls:
A stall causes the pipeline performance to degrade from the ideal performance.
Speedup from pipelining = Pipeline depth / (1 + pipeline stall cycles per instruction)
Structural Hazards:
When a processor is pipelined, the overlapped execution of instructions requires
pipelining of functional units and duplication of resources to allow all
possible combinations of instructions in the pipeline. If some combination of
instructions cannot be accommodated because of resource conflicts, the processor
is said to have a structural hazard.
Instances:
When functional unit is not fully pipelined, then a sequence of instructions
using that unpipelined unit cannot proceed at the rate of one per clock cycle.
When some resource has not been duplicated enough to allow all combinations of
instructions in the pipeline to execute.
Data Hazards:
A major effect of pipelining is to change the relative timing of instructions by
overlapping their execution. This overlap introduces data and control hazards.
Data hazards occur when the pipeline changes the order of read/write
accesses to operands so that the order differs from the order seen by
sequentially executing instructions on an unpipelined processor.
previous ALU operation has written the register corresponding to a source for the
current ALU operation, control logic selects the forwarded result as the ALU input
rather than the value read from the register file.
Data Hazards Requiring Stalls:
The load instruction has a delay or latency that
cannot be eliminated by forwarding alone. Instead, we need to
add hardware, called a pipeline interlock, to preserve the correct execution
pattern.
A pipeline interlock detects a hazard and stalls the pipeline until the hazard is
cleared.
This pipeline interlock introduces a stall or bubble. The CPI for the
stalled instruction increases by the length of the stall.
Branch Hazards:
Control hazards can cause a greater performance loss for our MIPS pipeline than
do data hazards. When a branch is executed, it may or may not change the PC to
something
other than its current value plus 4.
If a branch changes the PC to its target address, it is a taken branch; if it falls
through, it is not taken, or untaken.
Branch prediction:
A more sophisticated version of branch prediction would have some branches
predicted as taken and some as untaken. In our analogy, the dark or home uniforms
might take one formula while the light or road uniforms might take another. In the
case of programming, at the bottom of loops are branches that jump back to the
top of the loop. Since they are likely to be taken and they branch backward, we could
always predict taken for branches that jump to an earlier address.
FIG: INSTRUCTION EXECUTION ORDER
Predicting that branches are not taken as a solution to control hazard:
One popular approach to dynamic prediction of branches is keeping a history
for each branch as taken or untaken, and then using the recent past behavior
to predict the future.When the guess is wrong, the pipeline control must ensure that
the instructions following the wrongly guessed branch have no effect and must
restart the pipeline from the proper branch address. In our laundry analogy, we must
stop taking new loads so that we can restart the load that we incorrectly predicted.
FIG: DYNAMIC BRANCH PREDICTION
Assuming a branch is not taken is one simple form of branch prediction. In that
case, we predict that branches are untaken, flushing the pipeline when we are
wrong. For the simple five-stage pipeline, such an approach, possibly coupled with
compiler-based prediction, is probably adequate. With deeper
pipelines, the branch penalty
increases when measured in clock cycles.
Ideally, the accuracy of the predictor would match the taken branch frequency
for these highly regular branches. To remedy this weakness, 2-bit
prediction schemes are often used. In a 2-bit scheme, a prediction must be wrong
twice before
it is changed. Figure shows the finite-state machine for a 2-bit prediction
scheme.
EXCEPTIONS
5.Explain in detail how exceptions are handled in MIPS
architecture.(April/May 2015).
Control is the most challenging aspect of processor design: it
is both the hardest part to get right and the hardest part to make fast. One
of the hardest parts of control is implementing exceptions and interrupts—events
other than branches or
jumps that change the normal flow of instruction execution. They were initially
created to handle unexpected events from within the processor, like arithmetic
overflow.
Many architectures and authors do not distinguish between interrupts and
exceptions, often using the older name interrupt to refer to both types of
events. For example, the Intel x86 uses interrupt. The MIPS convention uses the
term exception to refer to any unexpected change in control flow without
distinguishing whether the cause is internal or external; we use the term interrupt only
when the event is externally caused. Here are five examples showing whether the
situation is internally generated by the processor or externally generated:
providing some service to the user program, taking some predefined action in
response to an overflow, or stopping the
execution of the program and reporting an error. After performing whatever
action is required because of the exception,
the operating system can terminate the program or may continue its
execution, using the EPC to determine where to
restart the execution of the program. The method used in the MIPS architecture is to
include a status register (called the Cause register), which holds a field
that indicates the reason for the exception.
A second method is to use vectored interrupts. In
a vectored interrupt, the address to which control is transferred is determined
by the
cause of the exception. For example, to accommodate the two exception types listed
above, we might define
the following two exception vector addresses:
The operating system knows the reason for the exception by the address at
which it is initiated. The addresses are separated by 32 bytes or eight instructions,
and the operating system must record the reason for the exception and may perform
some limited processing in this sequence. When the exception is not vectored, a
single entry point for all exceptions can be used, and the operating system decodes
the status register to find the cause. We will need to add two additional registers to
the MIPS implementation:
EPC: A 32-bit register used to hold the address of the affected
instruction. (Such a register is needed even when exceptions are vectored.)
Cause: A register used to record the cause of the exception. In the
MIPS architecture, this register is 32 bits, although some bits are currently
unused.
Assume there is a five-bit field that encodes the two possible exception
sources mentioned above, with 10 representing an undefined instruction and
12 representing arithmetic overflow.
Exceptions in a Pipelined Implementation.
For example, suppose there is an arithmetic overflow in an add instruction.
Just as we did for the taken branch in the previous section, we must flush the
instructions that follow the add instruction from the pipeline and begin fetching
instructions from the new address.
A new control signal, called ID.Flush, is ORed with the stall signal from the
hazard Detection unit to flush during ID. To flush the instruction in the EX phase, we
use a new signal called EX. Flush to cause new multiplexors to zero the control lines.
Exception in a Pipelined Computer
Given this instruction sequence,
40hex sub $11, $2, $4
44hex and $12, $2, $5
48hex or $13, $2, $6
4Chex add $1, $2, $1
50hex slt $15, ...
Show what happens in the pipeline if an overflow exception occurs in the add
instruction.
The difficulty of always associating the correct exception with the
correct instruction in pipelined computers has led some computer designers to relax
this requirement in noncritical cases. Such processors are said to have imprecise
interrupts or imprecise exceptions. In the
example above, PC would normally have 58hex at the start of the clock cycle
after the exception is detected, even though the offending instruction is at
address 4Chex. A
processor with imprecise exceptions might put 58hex into EPC and leave it up to
the operating system to determine which instruction caused the problem. MIPS and
the vast majority of computers today support precise interrupts or precise exceptions.
UNIT 4 PARALLELISM
PART A
Weak scaling:
Speedup achieved on a multiprocessor while increasing the size of the
problem proportionally to the increase in the number of processors is called
weak scaling.
3. What is Flynn’s Classification? (Nov/Dec 2014).
Flynn uses the stream concept for describing a machine's structure. A stream
simply means a sequence of items (data or instructions). The classification of
computer architectures based on
the number of instruction streams and data streams (Flynn’s taxonomy):
SISD: Single-Instruction stream, Single-Data stream
SIMD: Single-Instruction stream, Multiple-Data streams
MISD: Multiple-Instruction streams, Single-Data stream
MIMD: Multiple-Instruction streams, Multiple-Data streams
1. What is Anti-dependence?
It is an ordering forced by the reuse of a name, typically a register, rather than
by a true dependence that carries a value between two instructions.
the packet may be determined statically by the compiler or dynamically by the
processor.
3. Define – Superscalar Processor and VLIW. (N/D’ 15)
Superscalar is an advanced pipelining technique that enables the processor
to execute more than one instruction per clock cycle by selecting them during
execution. Dynamic multiple-issue processors are also known
as superscalar processors, or simply superscalars.
Very Long Instruction Word (VLIW) is a style of instruction set architecture
that launches many operations that are defined to be independent in a single wide
instruction, typically with many separate opcode fields.
PART – B
The amount of parallelism available within a basic block ( a straight-line code
sequence with no branches in and out except for entry and exit) is quite small. The
average dynamic branch frequency in integer programs was measured to be about
15%, meaning that about 7 instructions execute between a pair of branches.
Since the instructions are likely to
depend upon one another, the amount of overlap we can exploit within a basic block
is likely to be much less than 7. To obtain
substantial performance enhancements, we must exploit ILP across multiple
basic blocks.
The simplest and most common way to increase the
amount of parallelism available among instructions is to
exploit parallelism among iterations of a loop. This
type of parallelism is often called loop-level parallelism.
Example 1
for (i=1; i<=1000; i= i+1)
    x[i] = x[i] + y[i];
This is a parallel loop. Every iteration of the loop can overlap with any other
iteration, although within each loop iteration there is little opportunity for
overlap.
Example 2
for (i=1; i<=100; i= i+1){
    a[i] = a[i] + b[i];    //s1
    b[i+1] = c[i] + d[i];  //s2
}
Is this loop parallel? If not, how to make it parallel?
Statement s1 uses the value assigned in the previous iteration by statement s2, so
there is a loop-carried dependency between s1 and s2. Despite this dependency, this
loop can be made parallel because the dependency is not circular:
neither statement depends on itself;
while s1 depends on s2, s2 does not depend on s1.
A loop is parallel unless there is a cycle in the dependencies, since the
absence of a cycle means that the dependencies give a partial ordering on the
statements. To expose the parallelism, the loop must be transformed to conform to the partial order. Two observations are critical to this transformation: first, there is no dependency from s1 to s2, so interchanging the two statements will not affect the execution of s2.
On the first iteration of the loop, statement s1 depends on the value of b[1]
computed prior to initiating the loop. This allows us to replace the loop above with the
following code sequence, which makes possible overlapping of the iterations of the
loop:
a[1] = a[1] + b[1];
for (i=1; i<=99; i= i+1){
    b[i+1] = c[i] + d[i];
    a[i+1] = a[i+1] + b[i+1];
}
b[101] = c[100] + d[100];
Example 3
for (i=1; i<=100; i= i+1){
    a[i+1] = a[i] + c[i];    //S1
    b[i+1] = b[i] + a[i+1];  //S2
}
This loop is not parallel because it has cycles in the dependencies, namely the statements S1 and S2 depend on themselves.
Parallelism and Advanced Instruction-Level Parallelism:
Pipelining exploits the potential parallelism among instructions. This
parallelism is called instruction-level parallelism (ILP). There are two
primary methods for increasing the
potential amount of instruction-level parallelism. The first is increasing the depth of
the pipeline to overlap more instructions. Using our laundry analogy and
assuming that the washer cycle was longer than the others were, we
could divide our washer into three machines that perform the wash, rinse, and spin
steps of a traditional washer.
We would then move from a four-stage to a six-stage pipeline. To get the full
speed-up, we need to rebalance the remaining steps so they are the same
length, in processors or in laundry. The amount of parallelism being exploited is
higher, since there are more operations being overlapped. Performance is potentially
greater since the clock cycle can be shorter.
Most static issue processors also rely on the compiler to take on some responsibility for handling data and control hazards. The compiler's responsibilities may include static branch prediction and code scheduling to reduce or prevent all hazards. Let's look at a simple static issue version of a MIPS processor, before we describe the use of these techniques in more aggressive processors.
INSTRUCTION TYPE & PIPE STAGES
An Example: Static Multiple Issue with the MIPS ISA:
To give a flavor of static multiple issue, we consider a simple two-issue MIPS processor, where one of the instructions can be an integer ALU operation or branch and the other can be a load or store. Such a design is like that used in some embedded MIPS processors.
Dynamic Multiple-Issue Processors:
Dynamic multiple-issue processors are also known as superscalar processors,
or simply Superscalar. In the simplest superscalar processors, instructions
issue in order, and the processor decides whether zero, one, or more instructions can
issue in a given clock cycle. Obviously, achieving good performance on such a
processor still requires the compiler to try to schedule instructions to move
dependences apart and thereby improve the instruction issue rate. Even with such
compiler scheduling, there is an important difference between this simple superscalar
and a VLIW processor.
Let's start with a simple example of avoiding a data hazard. Consider the following code sequence:
lw   $t0, 20($s2)
addu $t1, $t0, $t2
sub  $s4, $s4, $t3
slti $t5, $s4, 20
Even though the sub instruction is ready to execute, it must wait for the lw and addu to complete first, which might take many clock cycles if memory is slow. Dynamic pipeline scheduling allows such hazards to be avoided either fully or partially.
Fig. The three primary units of a dynamically scheduled pipeline
To make programs behave as if they were running on a simple in-order pipeline, the instruction fetch and decode unit is required to issue instructions in order, which allows dependences to be tracked, and the commit unit is required to write results to registers and memory in program fetch order. This conservative mode is called in-order commit.
There are at least two different kinds of parallelism in computing.
• Using multiple processors to work toward a given goal, with each processor running its own program.
• Using only a single processor to run a single program, but allowing instructions from that program to execute in parallel.
The latter is called instruction-level parallelism, or ILP.
LIMITATIONS OF ILP:
THE HARDWARE MODEL:
An ideal processor is one where all constraints on ILP are removed. The only
limits on ILP in such a processor are those imposed by the actual data flows through
either registers or memory. The assumptions made for an ideal or perfect processor
are as follows:
a. Register renaming:
There are an infinite number of virtual registers available, and hence all WAW
and WAR hazards are avoided and an unbounded number of instructions can begin
execution simultaneously.
b. Branch prediction:
Branch prediction is perfect. All conditional branches are predicted
exactly.
c. Jump prediction:
All jumps (including jump register used for return and computed jumps) are
perfectly predicted. When combined with perfect branch prediction, this is equivalent
prediction and perfect alias analysis are easy to do.
LIMITATIONS ON THE WINDOW SIZE AND MAXIMUM ISSUE COUNT:
To build a processor that even comes close to perfect branch prediction and perfect alias analysis requires extensive dynamic analysis, since static compile time schemes cannot be perfect. Of course, most realistic dynamic schemes will not be perfect, but the use of dynamic schemes will provide the ability to uncover parallelism that cannot be analyzed by static compile time analysis. Thus, a dynamic processor might be able to more closely match the amount of parallelism uncovered by our ideal processor.
THE EFFECTS OF REALISTIC BRANCH AND JUMP PREDICTION:
Our ideal processor assumes that branches can be perfectly predicted: the outcome of any branch in the program is known before the first instruction is executed! Of course, no real processor can ever achieve this. We assume a separate predictor is used for jumps. Jump predictors are important primarily with the most accurate branch predictors, since the branch frequency is higher and the accuracy of the branch predictors dominates.
a. Perfect : All branches and jumps are perfectly predicted at the start of
execution.
b. Tournament-based branch predictor: The prediction scheme uses a
correlating 2-bit predictor and a noncorrelating 2-bit predictor together with a
selector, which chooses the best predictor for each branch.
multithreading mode, and all are available to a single thread when in single-thread
mode.
FLYNN'S CLASSIFICATION
(April/May 2015) (N/D' 15, 16) (16)
Flynn uses the stream concept for describing a machine's structure. A stream simply means a sequence of items (data or instructions). The classification of computer architectures is based on the number of instruction streams and data streams (Flynn's Taxonomy).
SISD (Single-Instruction stream, Single-Data stream):
SISD corresponds to the traditional mono-processor (von Neumann computer). A single data stream is being processed by one instruction stream. A single-processor computer (uni-processor) in which a single stream of instructions is generated from the program.
SIMD (Single-Instruction stream, Multiple-Data streams):
A single instruction stream is applied to multiple data streams: several processing units execute the same instruction, each on different data.
MIMD (Multiple-Instruction streams, Multiple-Data streams):
Each processor has a separate program. An instruction stream is generated from each program. Each instruction operates on different data. This last machine type builds the group for the traditional multi-processors. Several processing units operate on multiple-data streams.
SISD, MIMD, SIMD, SPMD, and Vector:
Another categorization of parallel hardware proposed in the 1960s is still used
today. It was based on the number of instruction streams and the number of
data streams. Figure shows the categories. Thus, a conventional uniprocessor has a
single instruction stream and single data stream, and a conventional multiprocessor
has multiple instruction streams and multiple data streams. These two categories are
abbreviated SISD and MIMD, respectively.
Although a programmer could write separate programs that run on the different processors of a MIMD computer and yet work together for a grander, coordinated goal, programmers normally write a single program that runs on all processors of an MIMD computer, relying on conditional statements when different processors should execute different sections of code. This style is called Single Program Multiple Data (SPMD), but it is just the normal way to program a MIMD computer.
While it is hard to provide examples of useful computers that would be
classified as multiple instruction streams and single data stream (MISD), the
inverse makes much more sense. SIMD computers operate on vectors of data. For
example, a single SIMD instruction might add 64 numbers by sending 64 data
streams to 64 ALUs to form 64 sums within a single clock cycle.
The original motivation behind SIMD was to amortize the cost of the control
unit over dozens of execution units. Another advantage is the reduced size of
program memory—SIMD needs only one copy of the code that is being
simultaneously executed, while message-passing MIMDs may need a copy in every
processor, and shared memory MIMD will need multiple instruction caches.
SIMD works best when dealing with arrays in for loops. Hence, for parallelism to work in SIMD there must be a great deal of identically structured data, which is called data-level parallelism. SIMD is at its weakest in case or switch statements, where each execution unit must perform a different operation on its data, depending on what data it has. Execution units with the wrong data are disabled so that units with proper data may continue. Such situations essentially run at 1/nth performance, where n is the number of cases.
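A short C sketch of this contrast (the array names and sizes are illustrative, not from the source): the first loop is pure data-level parallelism, while the second forces the per-element case split that disables SIMD lanes.

```c
#include <assert.h>

#define N 8

/* SIMD-friendly: every iteration applies the same operation, so a
   compiler can map the loop onto wide SIMD lanes (data-level
   parallelism). */
void add_arrays(const int x[N], const int y[N], int out[N]) {
    for (int i = 0; i < N; i++)
        out[i] = x[i] + y[i];
}

/* SIMD-hostile: a per-element case split; lanes holding the "wrong"
   case must be masked off, so each case runs at roughly 1/n of peak. */
void classify(const int x[N], int out[N]) {
    for (int i = 0; i < N; i++) {
        switch (x[i] % 2) {
        case 0:  out[i] = x[i] * 2; break;
        default: out[i] = x[i] - 1; break;
        }
    }
}
```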
SIMD in x86: Multimedia Extensions:
The most widely used variation of SIMD is found in almost every microprocessor today, and is the basis of the hundreds of MMX and SSE instructions of the x86 microprocessor (see Chapter 2). They were added to improve the performance of multimedia programs. These instructions allow the hardware to have many ALUs operate simultaneously or, equivalently, to partition a single, wide ALU into many parallel smaller ALUs that operate simultaneously.
Vector:
An older and more elegant interpretation of SIMD is called a vector architecture,
which has been closely identified with Cray Computers. It is again a great match to
problems with lots of data-level parallelism. Rather than having 64 ALUs perform 64
additions simultaneously, like the old array processors, the vector architectures
pipelined the ALU to get good performance at lower cost. A key feature of vector
architectures is a set of vector registers. Thus, a vector architecture might have 32 vector registers, each with 64 64-bit elements.
Vector versus Scalar:
Vector instructions have several important properties compared to conventional
instruction set architectures, which are called scalar architectures in this context:
■ A single vector instruction specifies a great deal of work—it is equivalent to
executing an entire loop. The instruction fetch and decode bandwidth needed is
dramatically reduced.
■ By using a vector instruction, the compiler or programmer indicates that the computation of each result in the vector is independent of the computation of other results in the same vector, so hardware does not have to check for data hazards within a vector instruction.
■ Because an entire loop is replaced by a vector instruction whose behavior is predetermined, control hazards that would normally arise from the loop branch are nonexistent.
■ The savings in instruction bandwidth and hazard checking plus the efficient use of memory bandwidth give vector architectures advantages in power and energy versus scalar architectures.
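As an illustration of the first property, the scalar loop below is the work that roughly one vector load per operand, one vector add, and one vector store would replace on a machine with 64-element vector registers (the function and the vector length are assumptions for this sketch, not source material):

```c
#include <assert.h>

#define VLEN 64  /* hypothetical vector register length */

/* Scalar version of c = a + b over 64 elements: the processor fetches
   and decodes ~64 add instructions plus loop-branch overhead. A vector
   machine expresses the same work in a handful of vector instructions,
   with no loop branch and no per-element hazard checks. */
void vec_add(const double a[VLEN], const double b[VLEN], double c[VLEN]) {
    for (int i = 0; i < VLEN; i++)
        c[i] = a[i] + b[i];
}
```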
HARDWARE MULTITHREADING
4. Explain about Hardware Multithreading. (Nov/Dec 2014) /
What is hardware multithreading? Compare and contrast fine-grained multithreading and coarse-grained multithreading. (April/May 2015) (N/D 14)
Exploiting Thread-Level Parallelism within a Processor:
Multithreading allows multiple threads to share the functional units of a single
processor in an overlapping fashion. To permit this sharing, the processor
must duplicate the independent state of each thread. For example, a separate copy
of the register file, a separate PC, and a separate page table are required for each
thread.
Multithreading enables the thread-level parallelism (TLP) by duplicating the
architectural state on each processor, while sharing only one set of processor
execution resources.
When scheduling threads, the operating system treats those distinct
architectural states as separate "logical" processors. Logical processors are the
hardware support for sharing the functional units of a single processor among
different threads. There are several different sharing mechanism for different
structures. The kind of state a structure stores decides what sharing mechanism the
structure needs.
Category     Resources
Replicated   Program counter (PC); architectural registers; register renaming logic
Partitioned  Re-order buffers; load/store buffers; various queues, like the scheduling queue, etc.
Shared       Caches; physical registers; execution units
Table: Sharing Mechanisms
Replicated resources are the kind of resources that you just cannot get around replicating if you want to maintain two fully independent contexts on each logical processor. The most obvious of these is the program counter (PC), which is the pointer that helps the processor keep track of its place in the instruction stream by pointing to the next instruction to be fetched. We need a separate PC for each thread to keep track of its instruction stream.
Fine-grained multithreading switches between threads on each clock cycle, causing the execution of multiple threads to be interleaved. (Fig.: three different approaches use the issue slots of a superscalar processor.)
As illustrated above, this cycle-by-cycle interleaving is often done in a round-robin fashion, skipping any threads that are stalled due to a branch mispredict, a cache miss, or any other reason. But the thread scheduling policy is not limited to the cycle-by-cycle (round-robin) model; other scheduling policies can also be applied.
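The round-robin, skip-stalled policy described above can be sketched as a small selection function (the bitmask encoding of stalled threads is an assumption of this sketch, not hardware from the source):

```c
#include <assert.h>

/* Round-robin thread selection for fine-grained multithreading:
   each cycle, pick the next thread after `current`, skipping any whose
   bit is set in `stalled` (e.g. waiting on a cache miss). Returns -1
   when every thread is stalled this cycle. */
int next_thread(int current, unsigned stalled, int nthreads) {
    for (int step = 1; step <= nthreads; step++) {
        int t = (current + step) % nthreads;
        if (!(stalled & (1u << t)))
            return t;
    }
    return -1; /* all threads stalled */
}
```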
Although FGMT can hide performance loses due to stalls caused by any
reason, there are two main drawbacks for FGMT approach:
FIG: FGMT
FGMT sacrifices the performance of the individual threads. It needs a lot of threads to hide the stalls, which also means a lot of register files.
COARSE-GRAINED MULTITHREADING (CGMT):
Coarse-grained multithreading won't switch out the executing thread until it
reaches a situation that triggers a switch. This situation occurs when the instruction
execution reaches either a long-latency operation or an explicit additional switch
operation.
CGMT was invented as an alternative to FGMT, so it won't repeat the primary disadvantage of FGMT: severely limiting single-thread performance. CGMT makes the most sense on an in-order processor that would normally stall the pipeline on a cache miss (using the CGMT approach, rather than stalling, the pipeline is filled with ready instructions from an alternative thread).
Since instructions following the missing instruction may already be in the pipeline, they need to be drained from the pipeline. Similarly, instructions from the new thread will not reach the execution stage until they have traversed the earlier pipeline stages. The cost of draining and refilling the pipeline is the thread-switch penalty, and it depends on the length of the pipeline. So normally CGMT needs a short in-order pipeline for good performance.
Instructions from other threads will only be issued when a thread encounters a costly stall.
SIMULTANEOUS MULTITHREADING:
Converting thread-level parallelism into instruction-level parallelism: simultaneous multithreading (SMT) is a variation on multithreading that uses the resources of a multiple-issue, dynamically scheduled processor to exploit TLP at the same time it exploits ILP.
The key insight that motivates SMT is that modern multiple-issue processors often have more functional unit parallelism available than a single thread can effectively use. Furthermore, with register renaming and dynamic scheduling, multiple instructions from independent threads can be issued without regard to the dependences among them; the resolution of the dependences can be handled by the dynamic scheduling capability.
The following figure illustrates the differences in a processor’s ability to
exploit the resources of a superscalar for the following processor configurations:
o A superscalar with no multithreading support,
o A superscalar with coarse-grained Multithreading,
o A superscalar with fine-grained Multithreading, and
o A superscalar with Simultaneous multithreading.
In the coarse-grained case, the long stalls are partially hidden by switching to another thread that uses the resources of the processor.
In the fine-grained case, the interleaving of threads eliminates fully empty slots. Because only one thread issues instructions in a given clock cycle, however, ILP limitations still lead to a significant number of idle slots within individual clock cycles. In the SMT case, thread-level parallelism (TLP) and instruction-level parallelism (ILP) are exploited simultaneously, with multiple threads using the issue slots in a single clock cycle.
MULTICORE PROCESSORS
5. Explain about Multicore Processors. (Nov/Dec 2014)
Fig. Classic organization of a shared memory multiprocessor.
In the first style, the time to access main memory is the same no matter which processor requests it and no matter which word is requested. Such machines are called uniform memory access (UMA) multiprocessors.
In the second style, some memory accesses are much faster than others,
depending on which processor asks for which word. Such machines are called
nonuniform memory access (NUMA) multiprocessors. As you might expect, the
programming challenges are harder for a NUMA multiprocessor than for a
UMA multiprocessor, but NUMA machines can scale to larger sizes and NUMAs can
have lower latency to nearby memory.
As processors operating in parallel will normally share data, they also need to
coordinate when operating on shared data; otherwise, one processor could
start working on data before another is finished with it. This coordination is called
synchronization. When sharing is supported with a single address space, there must
be a separate mechanism for synchronization. One approach uses a lock for a
shared variable. Only one processor at a time can acquire the lock, and other
processors interested in shared data must wait until the original processor unlocks
the variable.
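A minimal sketch of the lock idea using C11 atomics (a spinlock is one possible implementation of the lock, not the only mechanism): only the processor that acquires the flag may touch the shared variable.

```c
#include <assert.h>
#include <stdatomic.h>

/* Lock-based synchronization sketch: only one processor at a time can
   acquire the lock; the others spin until the holder releases it. */
atomic_flag lock = ATOMIC_FLAG_INIT;
int shared_counter = 0;

void acquire(void) {
    while (atomic_flag_test_and_set(&lock))
        ;  /* spin until the previous holder releases the lock */
}

void release(void) {
    atomic_flag_clear(&lock);
}

void add_to_shared(int v) {
    acquire();
    shared_counter += v;  /* critical section: exclusive access */
    release();
}
```

The test-and-set must be atomic: if two processors could both see the flag clear and both set it, the mutual exclusion would be lost.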
CLUSTERS AND OTHER MESSAGE-PASSING MULTIPROCESSORS:
The alternative approach to sharing an address space is for the processors to each have their own private physical address space. Figure shows the classic organization of a multiprocessor with multiple private address spaces. This alternative multiprocessor must communicate via explicit message passing, which traditionally is the name of such a style of computer. Provided the system has routines to send and receive messages, coordination is built in with message passing, since one processor knows when a message is sent, and the receiving processor knows when a message arrives. If the sender needs confirmation that the message has arrived, the receiving processor can then send an acknowledgment message back to the sender.
LIMITATIONS:
a. One drawback of clusters has been that the cost of administering a cluster of n machines is about the same as the cost of administering n independent machines, while the cost of administering a shared memory multiprocessor with n processors is about the same as administering a single machine. This weakness is one of the reasons for the popularity of virtual machines, since VMs make clusters easier to administer.
For example, VMs make it possible to stop or start programs atomically, which simplifies software upgrades. VMs can even migrate a program from one computer in a cluster to another without stopping the program, allowing a program to migrate from failing hardware.
b. Another drawback to clusters is that the processors in a cluster are usually connected using the I/O interconnect of each computer, whereas the cores in a multiprocessor are usually connected on the memory interconnect of the computer. The memory interconnect has higher bandwidth and lower latency, allowing much better communication performance.
c. A final weakness is the overhead in the division of memory: a cluster of n machines has n independent memories and n copies of the operating system, but a shared memory multiprocessor allows a single program to use almost all the memory in the computer, and it needs only a single copy of the OS.
PROPERTIES OF MULTI-CORE SYSTEMS:
o Cores will be shared with a wide range of other applications
dynamically. Load can no longer be considered symmetric
across the cores.
o Cores will likely be asymmetric as accelerators become common for scientific hardware.
o Source code will often be unavailable, preventing compilation
against the specific hardware configuration.
2. Compare SRAM with DRAM. (Nov/Dec 2013)
SRAMs are simply integrated circuits that are memory arrays with a single access port that can provide either a read or a write. SRAMs have a fixed access time to any datum. SRAMs don't need to refresh, so the access time is very close to the cycle time. SRAMs typically use six to eight transistors per bit to prevent the information from being disturbed when read. SRAM needs only minimal power to retain the charge in standby mode.
In a dynamic RAM (DRAM), the value kept in a cell is stored as a charge in a capacitor. A single transistor is then used to access this stored charge, either to read the value or to overwrite the charge stored there. Because DRAMs use only a single transistor per bit of storage, they are much denser and cheaper per bit than SRAM. As DRAMs store the charge on a capacitor, it cannot be kept indefinitely and must periodically be refreshed.
3. What is meant by interleaved memory? (May/June 2012)
In computing, interleaved memory is a design made to compensate for the
relatively slow speed of dynamic random-access memory (DRAM) or core memory,
e
by spreading memory addresses evenly across memory banks. That way, contiguous
memory reads and writes are using each memory bank in turn, resulting in
higher memory throughputs due to reduced waiting for memory banks to become
ready for desired operations.
It is different from multi-channel memory architectures, primarily as interleaved memory is not adding more channels between the main memory and the memory controller. However, channel interleaving is also possible, for example in Freescale i.MX6 processors, which allow interleaving to be done between two channels.
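A small sketch of low-order interleaving (the four-bank, four-byte-word layout is an assumed example, not from the source): consecutive word addresses map to consecutive banks, which is what lets sequential accesses keep each bank busy in turn.

```c
#include <assert.h>

/* Low-order interleaving across NBANKS banks: the word index modulo
   the bank count picks the bank, and the quotient picks the location
   within that bank, so addresses 0, 4, 8, 12 hit banks 0, 1, 2, 3. */
#define NBANKS 4
#define WORD   4  /* bytes per word */

unsigned bank_of(unsigned addr)        { return (addr / WORD) % NBANKS; }
unsigned offset_in_bank(unsigned addr) { return (addr / WORD) / NBANKS; }
```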
4. What is the purpose of Dirty /Modified bit in cache memory? (Nov/Dec 2014)
A dirty bit or modified bit is a bit that is associated with a block of computer
memory and indicates whether or not the corresponding block of memory has been
modified. The dirty bit is set when the processor writes to (modifies) this memory.
The bit indicates that its associated block of memory has been modified and
has not yet been saved to storage. When a block of memory is to be
replaced, its
corresponding dirty bit is checked to see if the block needs to be written back to
secondary memory before being replaced or if it can simply be removed. Dirty bits
are used by the CPU cache and in the page replacement algorithms of an operating
system.
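The dirty-bit bookkeeping described above can be sketched as follows (the block layout is a simplified assumption, not a real cache's format):

```c
#include <assert.h>
#include <stdbool.h>

/* Dirty-bit sketch for a write-back cache block: a write sets the bit,
   and on replacement the block is written back only if the bit is set. */
struct cache_block {
    unsigned tag;
    bool     valid;
    bool     dirty;
    unsigned data;
};

void cache_write(struct cache_block *b, unsigned value) {
    b->data  = value;
    b->dirty = true;   /* the copy in memory is now stale */
}

/* True if eviction must first write the block back to memory. */
bool needs_writeback(const struct cache_block *b) {
    return b->valid && b->dirty;
}
```

A clean block can simply be dropped on replacement, which is the saving the dirty bit buys.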
o Temporal locality (locality in time): if an item is referenced, it will tend to be referenced again soon.
o Spatial locality (locality in space): if an item is referenced, items whose addresses are close by will tend to be referenced soon.
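Both kinds of locality show up in even the simplest loop (a hypothetical summation, added here for illustration):

```c
#include <assert.h>

/* Locality sketch: `sum` is re-used on every iteration (temporal
   locality), while a[0], a[1], ... are touched at adjacent addresses
   (spatial locality) - exactly the pattern caches reward. */
int sum_array(const int *a, int n) {
    int sum = 0;                  /* temporal: reused each iteration */
    for (int i = 0; i < n; i++)
        sum += a[i];              /* spatial: sequential addresses   */
    return sum;
}
```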
3. What is DMA? Mention its advantages. (Nov/Dec 2013)
Direct memory access (DMA) is a feature of computer systems that allows certain hardware subsystems to access main system memory (RAM) independently of the central processing unit (CPU).
The need to handle more data and at higher rates means DMA is now an important part of hardware and software design. A dedicated DMA controller, often integrated in the processor, can be configured to move data between main memory and a range of subsystems, including another part of main memory.
9. Point out how DMA can improve I/O speed. (April/May 2015)
The CPU is still responsible for initiating each block transfer. The DMA interface controller then takes control and responsibility for transferring the data, so that data can be transferred without the intervention of the CPU. The CPU and I/O controller interact with each other only when control of the bus is required.
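A simplified software model of that division of labor (the descriptor layout is hypothetical; a real controller exposes device-specific registers): the CPU only fills in a descriptor and starts the transfer, and the copy then proceeds without per-byte CPU involvement.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical DMA descriptor: the CPU programs source, destination,
   and count, then hands the transfer to the controller. */
struct dma_desc {
    const unsigned char *src;
    unsigned char       *dst;
    size_t               count;
};

/* Stands in for the controller hardware draining the descriptor;
   a real device would raise an interrupt when the count reaches 0. */
void dma_start(struct dma_desc *d) {
    for (size_t i = 0; i < d->count; i++)
        d->dst[i] = d->src[i];
    d->count = 0;  /* transfer complete */
}
```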
PART B
MEMORY TECHNOLOGIES
1. Draw different memory access layouts and brief about the technique used to increase the average rate of fetching words from the main memory. (Nov/Dec 2014) (8)
(OR)
Elaborate on the various memory technologies and their relevance. (April/May 2015) (16)
MEMORY TECHNOLOGIES:
Memory latency is traditionally quoted using two measures: access time and cycle time. Access time is the time between when a read is requested and when the desired word arrives; cycle time is the minimum time between requests to memory. One reason that cycle time is greater than access time is that the memory needs the address lines to be stable between accesses.
DRAM TECHNOLOGY:
The solution was to multiplex the address lines, thereby cutting the number of
address pins in half. Figure shows the basic DRAM organization. One-half of the
address is sent first, called the row access strobe (RAS). The other half of the
address, sent during the column access strobe (CAS), follows it. These names come
from the internal chip organization, since the memory is organized as a rectangular
matrix addressed by rows and columns.
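The row/column split can be sketched as follows, assuming a square 1024x1024 bit matrix so that the low 10 bits select the column (the sizes are illustrative):

```c
#include <assert.h>

/* Multiplexed DRAM addressing sketch: a 20-bit cell address is sent in
   two halves over 10 shared pins - the row half with RAS, then the
   column half with CAS. */
#define COL_BITS 10

unsigned dram_row(unsigned addr) { return addr >> COL_BITS;              }
unsigned dram_col(unsigned addr) { return addr & ((1u << COL_BITS) - 1); }
```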
A DRAM uses only a single transistor to store a bit. Reading that bit destroys the information, so it must be restored. This is one reason the DRAM cycle time is much longer than the access time. In addition, to prevent loss of information when a bit is not read or written, the bit must be "refreshed" periodically. Fortunately, all the bits in a row can be refreshed simultaneously just by reading that row. Hence, every DRAM in the memory system must access every row within a certain time window, such as 8 ms. Memory controllers include hardware to refresh the DRAMs periodically.
SRAMs typically use six transistors per bit to
prevent the information from being disturbed when read. SRAM needs only minimal
power to retain the charge in standby mode. SRAM
designs are concerned with speed and capacity, while in DRAM designs the
emphasis is on cost per bit and capacity. For memories designed in comparable
technologies, the capacity of DRAMs is roughly 4–8 times that of SRAMs. The cycle
time of SRAMs is 8–16 times faster than DRAMs, but they are also 8–16 times as
expensive.
Main memory is typically built from several discrete semiconductor memory devices. Most systems contain two or more types of main memory.
All memory types can be categorized as ROM or RAM, and as volatile or non-volatile:
• Read-Only Memory (ROM) cannot be modified (written), as the name implies. A
ROM chip’s contents are set before the chip is placed in the system.
• Read-Write Memory is referred to as RAM (for Random-Access Memory). This
distinction is inaccurate, since ROMs are also random access, but we are stuck
with it for historical reasons.
• Volatile memories lose their contents when their power is turned off.
• Non-volatile memories do not.
The memory types currently in common usage are:
ROM: PROM, EPROM, EEPROM, flash memory
RAM: DRAM, SRAM, BBSRAM
Every system requires some non-volatile memory to store the instructions that get executed when the system is powered up (the boot code) as well as some (typically volatile) RAM to store program state while the system is running.
PROGRAMMABLE ROM (PROM):
• Replace diode with diode + fuse, put one at every cell (a.k.a. “fusible-link”
g .n
PROM)
• Initial contents all 1s; users program by blowing fuses to create 0s
e
• Plug chip into PROM programmer (“burner”) device, download data file
• One-time programmable (OTP)
ELECTRICALLY ERASABLE PROM (EEPROM):
Reads and writes much like generic RAM. On writes, internal circuitry transparently erases the affected byte/word, then reprograms it to the new value. Write cycle time is on the order of a millisecond; software typically polls a status pin to know when the write is done.
A high-voltage input (e.g. 12V) is often required for writing.
Limited number of write cycles (e.g. 10,000).
Selective erasing requires extra circuitry (an additional transistor) per memory cell, giving lower density and higher cost than EPROM.
FLASH MEMORY:
Again uses floating-gate technology like EPROM and EEPROM. Electrically erasable like EEPROM, but only in large 8K-128K blocks (not a byte at a time); this moves the erase circuitry out of the cells to the periphery of the memory array. Back to one transistor per cell, giving excellent density. Reads just like memory. Writes like memory for locations in erased blocks; typical write cycle time is a few microseconds, slower than volatile RAM but faster than EEPROM. To rewrite a location, software must explicitly erase the entire block, initiated via control registers on the flash memory device; an erase can take several seconds. Erased blocks can be written (programmed) a byte at a time. Still has an erase/reprogram cycle limit (10K-100K cycles per block).
FLASH APPLICATIONS:
Flash technology has made rapid advances in the last few years. Cell density rivals DRAM, is better than EPROM, and much better than EEPROM. Multiple gate voltage levels can encode 2 bits per cell; 64 Mbit devices are available. ROMs and EPROMs are rapidly becoming obsolete, since flash is as cheap or cheaper and allows field upgrades. Flash is replacing hard disks in some applications: smaller, lighter, faster, more reliable (no moving parts), and cost-effective up to tens of megabytes; block erase is a good match for a file-system type interface.
BATTERY-BACKED STATIC RAM (BBSRAM):
Take a volatile static RAM device and add battery backup. Key advantage: write performance, since the write cycle time is the same as the read cycle time. Needs circuitry to switch to the battery on power-off, and one has to worry about the battery running out. Effective for small amounts of storage when a battery is needed anyway (e.g. a PC's built-in clock).
Embedded computers usually lack a disk to act as non-volatile storage. Two memory technologies are found in embedded computers to address this problem.
The first is read-only memory (ROM). ROM is programmed at time of manufacture, needing only a single transistor per bit to represent 1 or 0. ROM is used for the embedded program and for constants, often included as part of a larger chip. In addition to being non-volatile, ROM is also non-destructible; nothing the computer can do can modify the contents of this memory. Hence, ROM also provides a level of protection to the code of embedded computers. Since address-based protection is often not enabled in embedded processors, ROM can fulfill an important role.
The second memory technology offers non-volatility but allows the memory to be modified. Flash memory allows the embedded device to alter non-volatile memory after the system is manufactured, which can shorten product development.
FLASH is an advancement on EEPROM technology that allows blocks of memory locations to be erased and rewritten; it is used in memory sticks and as solid-state hard disks.
Note: EEPROM and FLASH have lifetime write-cycle limitations.
Average Memory Access Time (Registers and Main Memory):
The entire computer memory can be viewed as the hierarchy depicted in the figure. The fastest access is to data held in processor registers. Therefore, if we consider the registers to be part of the memory hierarchy, then the processor registers are at the top in terms of speed of access. The registers provide only a minuscule portion of the required memory.
At the next level of the hierarchy is a relatively small amount of memory that
can be implemented directly on the processor chip. This
memory, called a processor cache, holds copies of instructions and data stored
in a much larger memory that is provided externally. There are often two levels of
caches.
A primary cache is always located on the processor chip. This cache is small because it competes for space on the processor chip, which must implement many other functions. The primary cache is referred to as the level 1 (L1) cache. A larger, secondary cache is placed between the primary cache and the rest of the memory. It is referred to as the level 2 (L2) cache. It is usually implemented using SRAM chips. It is possible to have both L1 and L2 caches on the processor chip.
The next level in the hierarchy is called the main memory. This rather large
memory is implemented using dynamic memory components, typically in the form of
SIMMs, DIMMs, or RIMMs. The main memory is much larger but significantly slower
than the cache memory. In a typical computer, the access time for the main memory is about ten times longer than the access time for the L1 cache.
FIG: MEMORY HIERARCHY
Disk devices provide a huge amount of inexpensive storage. They are very slow compared to the semiconductor devices used to implement the main memory. A hard disk drive (HDD; also hard drive, hard disk, magnetic disk or disk drive) is a device for storing and retrieving digital information, primarily computer data. It consists of one or more rigid (hence "hard") rapidly rotating discs (often referred to as platters), coated with magnetic material and with magnetic heads arranged to write data to the surfaces and read it from them.
Example:
A line is an adjacent series of bytes in main memory (that is, their addresses are contiguous). Suppose a line is 16 bytes in size. For example, suppose we have a 2^12 = 4K-byte cache with 2^8 = 256 16-byte lines; a 2^24 = 16M-byte main memory, which is 2^12 = 4K times the size of the cache; and a 400-line program, which will not all fit into the cache at once.
FIG: Cache memory access
Each active cache line is established as a copy of a corresponding memory line during execution. Whenever a memory write takes place in the cache, the "Valid" bit is reset (marking that line "Invalid"), which means that it is no longer an exact image of its corresponding line in memory.
Cache Dynamics:
When a memory read (or fetch) is issued by the CPU:
1. If the line with that memory address is in the cache (this is called a cache hit), the data is read from the cache to the MDR.
2. If the line with that memory address is not in the cache (this is called a miss), the cache is updated by replacing one of its active lines by the line with that memory address, and then the data is read from the cache to the MDR.
(ii) Explain mapping functions in cache memory to determine how memory blocks are placed in cache. (Nov/Dec 2014).
As a working example, suppose the cache has 2^7 = 128 lines, each with 2^4 = 16 words. Suppose the memory has a 16-bit address, so that 2^16 = 64K words are in the memory's address space.
Direct Mapping:
Direct mapping of the cache for this model can be accomplished by using the rightmost 3 bits of the memory address. For instance, the memory address 7A00 = 0111 1010 0000 0000 maps to cache address 000; the cache address of a word is simply the low-order bits of its memory address.
FIG: DIRECT MAPPING OF CACHE
Reading the sequence of events from left to right over the ranges of the indexes i and j, it is easy to pick out the hits and misses. In fact, the first loop has a series of 10 misses (no hits). The second loop contains a read and a write of the same memory location on each repetition (i.e., a[0][i] = a[0][i]/Ave;), so that the 10 writes are guaranteed to be hits. Moreover, the first two repetitions of the second loop have hits in their read operations, since a[0][9] and a[0][8] are still in the cache at the end of the first loop. Thus, the hit rate for direct mapping in this algorithm is 12/30 = 40%.
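The 40% figure can be checked with a few lines of code. The sketch below simulates the 8-word direct-mapped cache from the example (index = rightmost 3 bits of the word address) over the program's access sequence; the array layout (a[0][j] at word address 0x7A00 + 4*j, from the column-by-column allocation of the 4x10 array) follows the example, while the function names are just illustrative.

```c
#define LINES 8

static unsigned cache_tag[LINES];
static int cache_valid[LINES];

/* Access one word; returns 1 on a hit, 0 on a miss (the line is
   loaded on a miss, modeling the replacement described above). */
static int access_word(unsigned addr) {
    unsigned idx = addr & (LINES - 1);   /* rightmost 3 bits       */
    unsigned tag = addr >> 3;            /* remaining address bits */
    if (cache_valid[idx] && cache_tag[idx] == tag)
        return 1;
    cache_valid[idx] = 1;
    cache_tag[idx] = tag;
    return 0;
}

/* Replays the 30 accesses of the example program: 10 reads of
   a[0][j], then a read and a write of a[0][i] for i = 9..0. */
int direct_mapped_hits(void) {
    int hits = 0, i, j;
    for (i = 0; i < LINES; i++)
        cache_valid[i] = 0;
    for (j = 0; j <= 9; j++)                 /* Sum = Sum + a[0][j]   */
        hits += access_word(0x7A00u + 4u * (unsigned)j);
    for (i = 9; i >= 0; i--) {               /* a[0][i] = a[0][i]/Ave */
        hits += access_word(0x7A00u + 4u * (unsigned)i);  /* read  */
        hits += access_word(0x7A00u + 4u * (unsigned)i);  /* write */
    }
    return hits;   /* 12 hits out of 30 accesses = 40% */
}
```

Running it yields 12 hits: 10 write hits plus the 2 read hits on a[0][9] and a[0][8], matching the 12/30 = 40% above.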
Associative Mapping:
Associative mapping for this problem simply uses the entire address as the cache tag. If we use the least recently used (LRU) cache replacement strategy, the sequence of events in the cache after the first loop completes is shown in the left half of the following diagram. The second loop happily finds all of a[0][9] through a[0][2] already in the cache, so it will experience a series of 16 hits (2 for each repetition) before missing on a[0][1] when i=1. The last two steps of the second loop therefore have 2 hits and 2 misses.
Set-Associative Mapping:
Set-associative mapping tries to compromise between these two. Suppose we divide the cache into two sets, distinguished from each other by the rightmost bit of the memory address, and assume the least recently used strategy for cache line replacement. Cache utilization for our program can now be pictured as follows:
FIG: SET ASSOCIATIVE MAPPING
Here all the entries that are referenced in this algorithm have even-numbered addresses (their rightmost bit = 0), so only the top half of the cache is utilized. The hit rate is therefore slightly worse than associative mapping and slightly better than direct mapping. That is, set-associative cache mapping for this program yields 14 hits out of 30 read/writes, for a hit rate of about 47%.
Example:
Suppose we have an 8-word cache and a 16-bit memory address space, where each memory "line" is a single word (so the memory address need not have a "Word" field to distinguish individual words within a line). Suppose we also have a 4x10 array a of numbers (one number per addressable memory word) allocated in memory column-by-column, beginning at address 7A00.
That is, we have the following declaration and memory allocation picture for the array: a = new float[4][10];
FIG: EXAMPLE SET ASSOCIATIVE
Here is a simple calculation that recalculates the elements of the first row of a. It could have been implemented directly in C/C++/Java as follows:
    Sum = 0;
    for (j = 0; j <= 9; j++)
        Sum = Sum + a[0][j];
    Ave = Sum / 10;
    for (i = 9; i >= 0; i--)
        a[0][i] = a[0][i] / Ave;
The emphasis here is on the underlined parts of this program which represent
memory read and write operations in the array a. Note that the 3rd and 6th lines
involve a memory read of a[0][j] and a[0][i], and the 6th line involves a memory write
of a[0][i]. So altogether, there are 20 memory reads and 10 memory writes during the
execution of this program. The following discussion focuses on those particular parts
of this program and their impact on the cache.
2. (i) Explain in detail about any two standard input and output interfaces required to connect the I/O device to the bus. (Nov/Dec 2014)
(ii) Explain in detail about the Bus Arbitration techniques in DMA. (Nov/Dec 2014)
In bus arbitration, a processor or a DMA controller functions as bus master. Only one processor or controller can be bus master at a time. The bus master is the controller that has access to the bus at a given instant; any one controller or processor can be the bus master at that instant.
FIG: BUS ARBITRATION IN DMA
There are three bus arbitration methods:
a. Daisy Chain Method
b. Independent Bus Requests and Grant Method
c. Polling Method
Daisy Chain Method:
Bus control passes from unit U0 to U1, then to U2, then U3, and so on. U0 has the highest priority, U1 next, and so on.
Step 1:
Bus Grant (BGri):
This signal means that a unit has been granted bus access and can take control. The bus grant signal passes from the i-th unit to the (i+1)-th unit in daisy chaining when the i-th unit does not need bus control. The arbitrator issues only BGr0.
Step 2:
Bus Request (BRqi):
This signal means that the i-th unit has requested the grant of bus access and wants to take control of the bus.
Step 3:
• Busy:
This signal is to and from a bus master, to enable all other units on the bus to note that bus access is presently not possible, as one of the units is busy using the bus or has been granted control over the bus. The unit which accepts the BGr issues the Busy signal.
FIG: BUS REQUEST AND GRANT
FIG: TIMING DIAGRAM – BUS REQUEST & GRANT METHOD
Independent Bus Requests and Grant Method:
Step 1:
Bus Request BRqi, for i = 0 to n-1: this signal means that the i-th unit has requested the grant of bus access and wants to take control of the bus.
Step 2:
Bus Grant BGri, for i = 0 to n-1: this signal means that the i-th unit has been granted bus access and can take control. The bus grant signal passes to the i-th unit from the centralized processor only after that unit sends BRqi.
Step 3:
• Busy:
This signal is from a bus master, to enable all other units on the bus to note that bus access is presently not possible, as one of the units is busy using the bus or has been granted control over the bus.
Polling Method:
The bus control passes from one processor (bus controller) to another only
through the centralized bus controller, but only when the controller sends poll count
bits, which correspond to the unit number. Assume n units can be granted bus
master status by a centralized processor.
FIG: POLLING METHOD
Step 1:
Bus Poll Count BPC (on three lines for Ui, where i = 0 to n-1 and n = 8). A count = c means that the (2c - 1)th unit is being polled for the grant of bus access and can take control from the processor. The bus count lines connect to each unit from the centralized processor.
Step 2:
Bus Request BRqi, for i = 0 to n-1: this signal means that the i-th unit has accepted the grant of the available bus access and requests to take control of the bus.
Step 3:
• Busy:
This signal is from a bus master, to enable all other units on the bus to note that bus access is presently not possible, as one of the units is busy using the bus or has been granted control over the bus.
Centralized Arbitration:
The processor is normally the bus master, and it grants bus mastership to a DMA controller. The DMA controller requests and acquires the bus, and later releases it. During its tenure as bus master, it may perform one or more data transfer operations, depending on whether it is operating in cycle-stealing or block mode. After it releases the bus, mastership returns to the processor.
Distributed Arbitration:
All devices waiting to use the bus participate in the arbitration process; there is no central arbiter. Each device on the bus is assigned an identification number. One or more devices request the bus by asserting the start-arbitration signal and placing their identification numbers on the four open-collector lines, ARB0 through ARB3. The device with the highest ID number is selected using these lines.
FIG: DISTRIBUTED ARBITRATION
3. (ii) Explain the techniques for measuring and improving cache performance.
(May/June 2014).
The following discussion uses a simplified model of the memory system. In real processors, the stalls generated by reads and writes can be quite complex, and accurate performance prediction usually requires very detailed simulations of the processor and memory system.
Memory-stall clock cycles can be defined as the sum of the stall cycles coming from reads plus those coming from writes:

    Memory-stall clock cycles = Read-stall cycles + Write-stall cycles

The read-stall cycles can be defined in terms of the number of read accesses per program, the miss penalty in clock cycles for a read, and the read miss rate:

    Read-stall cycles = (Reads / Program) × Read miss rate × Read miss penalty

Writes are more complicated:

    Write-stall cycles = ((Writes / Program) × Write miss rate × Write miss penalty) + Write buffer stalls
Because the write buffer stalls depend on the proximity of writes, and not just the frequency, it is not possible to give a simple equation to compute such stalls. Fortunately, if the memory can accept writes at a rate that significantly exceeds the average write frequency in programs (e.g., by a factor of 2), the write buffer stalls will be small, and we can safely ignore them.
If we assume that the write buffer stalls are negligible, we can combine the reads and writes by using a single miss rate and miss penalty:

    Memory-stall clock cycles = (Memory accesses / Program) × Miss rate × Miss penalty

We can also factor this as the ratio of the CPU time with stalls to the CPU time with a perfect cache:

    (I × CPIstall × Clock cycle) / (I × CPIperfect × Clock cycle) = CPIstall / CPIperfect = 5.44

The performance with the perfect cache is thus better by a factor of 5.44.
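As a sketch, the CPI including memory stalls can be computed directly from these formulas. The parameter values used in the usage note below (base CPI of 2, a 2% instruction-cache miss rate, a 4% data-cache miss rate, 36% of instructions being loads/stores, and a 100-cycle miss penalty) are the classic Patterson-Hennessy example figures that produce the 5.44 quoted above; they are assumptions for illustration, not stated in this text.

```c
/* CPI including memory-stall cycles: every instruction makes one
   instruction fetch, and a fraction of instructions make one data
   access.  Each kind of access adds (miss rate x miss penalty)
   stall cycles on average. */
double cpi_with_stalls(double base_cpi,
                       double instr_miss_rate,
                       double data_miss_rate,
                       double data_refs_per_instr,
                       double miss_penalty) {
    double instr_stalls = 1.0 * instr_miss_rate * miss_penalty;
    double data_stalls  = data_refs_per_instr * data_miss_rate * miss_penalty;
    return base_cpi + instr_stalls + data_stalls;
}
```

With the assumed figures, cpi_with_stalls(2.0, 0.02, 0.04, 0.36, 100.0) gives 2 + 2 + 1.44 = 5.44.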
Calculating Average Memory Access Time:
Find the AMAT for a processor with a 1 ns clock cycle time, a miss penalty of 20 clock cycles, a miss rate of 0.05 misses per instruction, and a cache access time (including hit detection) of 1 clock cycle. Assume that the read and write miss penalties are the same and ignore other write stalls. The average memory access time per instruction is

    AMAT = Time for a hit + Miss rate × Miss penalty = 1 + 0.05 × 20 = 2 clock cycles, or 2 ns.
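The AMAT formula translates directly into code; this helper (the name is illustrative) reproduces the 2-cycle result for the parameters in the example.

```c
/* Average memory access time in clock cycles:
   AMAT = hit time + miss rate x miss penalty. */
double amat_cycles(double hit_time, double miss_rate, double miss_penalty) {
    return hit_time + miss_rate * miss_penalty;
}
```

Here amat_cycles(1.0, 0.05, 20.0) evaluates to 1 + 0.05 × 20 = 2 clock cycles; at a 1 ns clock that is 2 ns.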
Reducing the Miss Penalty Using Multilevel Caches:
All modern computers make use of caches. To close the gap further between the fast clock rates of modern processors and the increasingly long time required to access DRAMs, most microprocessors support an additional level of caching. This second-level cache is usually on the same chip and is accessed whenever a miss occurs in the primary cache. If the second-level cache contains the desired data, the miss penalty for the first-level cache will be essentially the access time of the second-level cache, which will be much less than the access time of main memory. If neither the primary nor the secondary cache contains the data, a main memory access is required, and a larger miss penalty is incurred.
Performance of Multilevel Caches: Suppose we have a processor with a base CPI of 1.0, assuming all references hit in the primary cache, and a clock rate of 4 GHz. Assume a main memory access time of 100 ns, including all the miss handling. Suppose the miss rate per instruction at the primary cache is 2%. How much faster will the processor be if we add a secondary cache that has a 5 ns access time for either a hit or a miss and is large enough to reduce the miss rate to main memory to 0.5%? The miss penalty to main memory is 100 ns / 0.25 ns per cycle = 400 clock cycles.

    Average memory access time = Hit time + Miss rate × Miss Penalty
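A sketch of the worked solution to the two-level cache question above (the function name is illustrative; the arithmetic follows the figures given in the question):

```c
/* Two-level cache speedup for the example: 4 GHz clock (0.25 ns
   cycle), 100 ns main-memory access, 5 ns L2 access, 2% primary
   miss rate, 0.5% global miss rate to main memory. */
double two_level_speedup(void) {
    double cycle_ns       = 0.25;                 /* 1 / 4 GHz        */
    double main_penalty   = 100.0 / cycle_ns;     /* 400 clock cycles */
    double l2_penalty     = 5.0 / cycle_ns;       /* 20 clock cycles  */

    double cpi_one_level  = 1.0 + 0.02 * main_penalty;   /* 1 + 8 = 9.0 */
    double cpi_two_levels = 1.0 + 0.02 * l2_penalty      /* + 0.4       */
                                + 0.005 * main_penalty;  /* + 2.0 = 3.4 */
    return cpi_one_level / cpi_two_levels;
}
```

The processor with the secondary cache is faster by a factor of 9.0 / 3.4, roughly 2.6.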
Hence, we organize six cache optimizations into three categories:
Reducing the miss rate: larger block size, larger cache size, and higher associativity.
Reducing the miss penalty: multilevel caches, and giving reads priority over writes.
Reducing the time to hit in the cache: avoiding address translation when indexing the cache.
SPLIT CACHES:
The classical approach to improving cache behavior is to reduce miss rates, and we present three techniques to do so. To gain better insights into the causes of misses, we first start with a model that sorts all misses into three simple categories:
Compulsory: The very first access to a block cannot be in the cache, so the block must be brought into the cache. These are also called cold-start misses or first-reference misses.
Capacity: If the cache cannot contain all the blocks needed during execution of a program, capacity misses (in addition to compulsory misses) will occur because of blocks being discarded and later retrieved.
Conflict: If the block placement strategy is set-associative or direct-mapped, conflict misses (in addition to compulsory and capacity misses) will occur because a block may be discarded and later retrieved if too many blocks map to its set. These misses are also called collision misses. The idea is that hits in a fully associative cache that become misses in an n-way set-associative cache are due to more than n requests on some popular sets.
VIRTUAL MEMORY
FIG: ADDRESS TRANSLATION
Virtual memory also simplifies loading the program for execution by providing relocation. Relocation maps the virtual addresses used by a program to different physical addresses before the addresses are used to access memory. This relocation allows us to load the program anywhere in main memory. Furthermore, all virtual memory systems in use today relocate the program as a set of fixed-size blocks (pages), thereby eliminating the need to find a contiguous block of memory to allocate to a program; instead, the operating system need only find a sufficient number of pages in main memory.
FIG: PAGE TABLE MAPPING
The physical main memory is not as large as the address space spanned by an address issued by the processor. When a program does not completely fit into the main memory, the parts of it not currently being executed are stored on secondary storage devices, such as magnetic disks.
FIG: VIRTUAL MEMORY ORGANIZATION
When a new segment of a program is to be moved into a full memory, it must replace another segment already in the memory. The operating system moves programs and data automatically between the main memory and secondary storage. This process is known as swapping. Thus, the application programmer does not need to be aware of limitations imposed by the available main memory. The figure shows a typical organization that implements virtual memory. A special hardware unit, called the Memory Management Unit (MMU), translates virtual addresses into physical addresses.
When the desired data (or instructions) are in the main memory, these data are fetched as described in our presentation of the cache mechanism. If the data are not in the main memory, the MMU causes the operating system to bring the data into the memory from the disk. The DMA scheme is used to perform the data transfer between the disk and the main memory.
FIG: VIRTUAL ADDRESS TRANSLATION
Unfortunately, the page table may be rather large, and since the MMU is normally implemented as part of the processor chip (along with the primary cache), it is impossible to include a complete page table on this chip. Therefore, the page table is kept in the main memory. However, a copy of a small portion of the page table can be accommodated within the MMU.
The process of translating a virtual address into a physical address is known as address translation. It is done with the help of the MMU. A simple method for translating virtual addresses into physical addresses is to assume that all programs and data are composed of fixed-length units called pages, each of which consists of a block of words that occupy contiguous locations in the main memory. Pages commonly range from 2K to 16K bytes in length. They constitute the basic unit of information that is moved between the main memory and the disk whenever the translation mechanism determines that a move is required.
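The fixed-length-page scheme makes translation a simple split of the address. The sketch below assumes 4K-byte pages and a flat array standing in for the page table; both are illustrative choices for this example, not details from the text.

```c
#include <stdint.h>

#define PAGE_SIZE 4096u   /* assumed page size for this sketch */

/* page_table[vpn] holds the physical frame number for virtual
   page vpn; a real MMU walks a structure held in main memory. */
uint32_t translate(const uint32_t page_table[], uint32_t virtual_addr) {
    uint32_t vpn    = virtual_addr / PAGE_SIZE;   /* virtual page number */
    uint32_t offset = virtual_addr % PAGE_SIZE;   /* offset within page  */
    return page_table[vpn] * PAGE_SIZE + offset;  /* physical address    */
}
```

Only the page number is translated; the offset within the page is carried through unchanged.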
The cache bridges the speed gap between the processor and the main memory and is implemented in hardware. The virtual-memory mechanism bridges the size and speed gaps between the main memory and secondary storage and is usually implemented in part by software techniques.
Each page table entry also records various restrictions that may be imposed on accessing the page. For example, a program may be given full read and write permission, or it may be restricted to read accesses only.
TLBs - INPUT/OUTPUT SYSTEM:
This portion consists of the page table entries that correspond to the most recently accessed pages. A small cache, usually called the Translation Lookaside Buffer (TLB), is incorporated into the MMU for this purpose. The operation of the TLB with respect to the page table in the main memory is essentially the same as the operation of cache memory; the TLB must also include the virtual address of the entry. The figure shows a possible organization of a TLB where the associative-mapping technique is used. Set-associative mapped TLBs are also found in commercial products.
FIG: TRANSLATION LOOKASIDE BUFFER
An essential requirement is that the contents of the TLB be coherent with the contents of the page tables in memory. When the operating system changes the contents of page tables, it must simultaneously invalidate the corresponding entries in the TLB. One of the control bits in the TLB is provided for this purpose. When an entry is invalidated, the TLB will acquire the new information as part of the MMU's normal response to access misses.
Given a virtual address, the MMU looks in the TLB for the referenced page. If the page table entry for this page is found in the TLB, the physical address is obtained immediately. If there is a miss in the TLB, then the required entry is obtained from the page table in the main memory and the TLB is updated. When a program generates an access request to a page that is not in the main memory, a page fault is said to have occurred. The whole page must be brought from the disk into the memory before the access can proceed.
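The lookup just described can be sketched as a fully associative search. The entry layout and the 8-entry size here are illustrative assumptions, not the organization of any particular MMU.

```c
#include <stdint.h>

#define TLB_ENTRIES 8   /* assumed size for this sketch */

typedef struct {
    int      valid;
    uint32_t vpn;    /* virtual page number of the cached entry */
    uint32_t frame;  /* corresponding physical frame number     */
} TlbEntry;

/* Fully associative lookup: compare the VPN against every valid
   entry.  Returns 1 and sets *frame on a hit; returns 0 on a miss,
   after which the MMU would fetch the entry from the page table in
   main memory and update the TLB. */
int tlb_lookup(const TlbEntry tlb[TLB_ENTRIES], uint32_t vpn, uint32_t *frame) {
    for (int i = 0; i < TLB_ENTRIES; i++) {
        if (tlb[i].valid && tlb[i].vpn == vpn) {
            *frame = tlb[i].frame;
            return 1;
        }
    }
    return 0;
}
```

A real TLB performs all the comparisons in parallel in hardware; the loop is the software equivalent.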
A modified page has to be written back to the disk before it is removed
from the main memory. It is important to note that the write-through
protocol, which is useful in the framework of cache memories, is not suitable for
virtual memory.
DMA:
To initiate a DMA transfer, the processor sends the starting address, the number of words in the block, and the direction of the transfer. On receiving this information, the DMA controller proceeds to perform the requested operation. When the entire block has been transferred, the controller informs the processor by raising an interrupt signal.
While a DMA transfer is taking place, the program that requested the transfer cannot continue, and the processor can be used to execute another program. After the DMA transfer is completed, the processor can return to the program that requested the transfer. I/O operations are always performed by the operating system of the computer in response to a request from an application program.
Two registers are used for storing the starting address and the word count. The third register contains status and control flags. The R/W bit determines the direction of the transfer. When this bit is set to 1 by a program instruction, the controller performs a read operation, that is, it transfers data from the memory to the I/O device. Otherwise, it performs a write operation.
A DMA controller connects a high-speed network to the computer bus.
The disk controller, which controls two disks, also has DMA capability and provides
two DMA channels. It can perform two independent DMA operations, as if each disk
had
its own DMA controller. The registers needed to store the memory address, the
word count, and so on are duplicated, so that one set can be used with each device.
To start a DMA transfer of a block of data from the main memory to one of the
disks, a program writes the address and word count information into the registers of
the corresponding channel of the disk controller. It also provides the disk
controller with information to identify the data for future retrieval. The DMA controller
proceeds independently to implement the specified operation.
When the DMA transfer is completed, this fact is recorded in the status and control register of the DMA channel by setting the Done bit. At the same time, if the IE (interrupt-enable) bit is set, the controller sends an interrupt request to the processor and sets the IRQ bit.
The status register can also be used to record other information, such as whether the transfer took place correctly or errors occurred.
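The register set just described can be sketched as a struct. The bit positions and names chosen here are illustrative assumptions; real controllers define their own layouts.

```c
#include <stdint.h>

/* One DMA channel: starting address, word count, and a
   status/control register holding the flags described above. */
typedef struct {
    uint32_t start_address;
    uint32_t word_count;
    uint32_t status_control;
} DmaChannel;

#define DMA_RW   (1u << 0)  /* 1 = read: transfer memory -> I/O device */
#define DMA_DONE (1u << 1)  /* set by the controller when finished     */
#define DMA_IE   (1u << 2)  /* interrupt enable                        */
#define DMA_IRQ  (1u << 3)  /* set along with the interrupt request    */

/* Program a channel for a read transfer, as a driver might. */
void dma_start_read(DmaChannel *ch, uint32_t addr, uint32_t count) {
    ch->start_address  = addr;
    ch->word_count     = count;
    ch->status_control = DMA_RW | DMA_IE;  /* request read, enable interrupts */
}
```

After the transfer completes, the controller would set DMA_DONE and, because DMA_IE is set, raise the interrupt and set DMA_IRQ.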
Memory accesses by the processor and the DMA controllers are interwoven. Requests by DMA devices for using the bus are always given higher priority than processor requests. Among different DMA devices, top priority is given to high-speed peripherals such as a disk, a high-speed network interface, or a graphics display device.
Alternatively, the DMA controller may be given exclusive access to the main memory to transfer a block of data without interruption. This is known as block or burst mode. Most DMA controllers incorporate a data storage buffer. In the case of the network interface, for example, the DMA controller reads a block of data from the main memory and stores it into its input buffer. This transfer takes place using burst mode at a speed appropriate to the memory and the computer bus. Then, the data in the buffer are transmitted over the network at the speed of the network.
Industry Connectivity:
Design for the evolution of the next generation of POWER and Z series processors.
University Questions
100 marks
1) What is an Instruction register?
2) Give the formula for CPU execution time for a
ngi
program
3) What is a guard bit and what are the ways to
nee
truncate the guard bits?
4) What is an arithmetic overflow?
ri n
5) What is meant by pipeline bubble?
6) What is data path?
g .n
7) What is instruction level parallelism?
8) What is multithreading?
e
9) What is meant by address mapping?
10) What is Cache memory?
11)A) Explain in detail the various components of a computer system with a neat
diagram
OR
12)A) Explain BOOTH’S Algorithm for the multiplication of signed 2’s complement
numbers
OR
B) Discuss in detail about division algorithm with a
diagram and examples
necessary block diagram
4) A) What is the disadvantage of ripple carry addition and how is it overcome in a carry look-ahead adder? Draw the logic circuit of the CLA.
OR
A) Design and explain a parallel priority interrupt hardware for a system with eight interrupt sources.
1) How are instructions represented in a computer system?
2) Distinguish between auto-increment and auto-decrement addressing modes.
3) Define ALU
4) What is sub word parallelism?
5) What are the advantages of pipelining?
6) What is exception?
7) State the need for instruction level parallelism
8) What is fine-grained multithreading?
9) Define memory hierarchy
10)State the advantages of virtual memory
12)A) Explain briefly about floating point addition and subtraction algorithms
OR
B) Define BOOTH multiplication algorithm with suitable
example
1) What is instruction set architecture?
2) How is CPU execution time for a program calculated?
3) What are the overflow/ underflow conditions for
addition and subtraction?
4) State the representation of a double-precision floating point number.
5) What is hazard? What are its types?
6) What is meant by branch prediction?
7) What is ILP?
8) Define superscalar processor
9) What are the various memory technologies?
10)Define hit ratio
11) A) Explain in detail about the multiplication algorithm with a suitable example
and diagram
OR
A) Discuss in detail about the division algorithm with the diagram and examples
implemented with a neat diagram
OR
A) Draw the typical block diagram of a DMA controller and explain how it is used for direct data transfer between memory and peripherals.
1. What is an opcode? How many bits are needed to specify 32 distinct operations?
2. Write the logic equations of a binary half adder.
4. In what ways can the width and height of the control memory be reduced?
5. What hazard do the above two instructions create when executed concurrently?
6. What are the disadvantages of increasing the number of stages in pipelined processing?
PART B - (5 x 16 = 80 marks)
11.(a) With examples explain the Data transfer, Logic and Program Control
Instructions?
(16)
Or
12. (a) (i) Describe the control unit organization with separate Encoder and Decoder functions in a hardwired control. (8)
(ii) Generate the logic circuit for the following functions and explain. (8)
Or
(b) Write a brief note on nano-programming. (16)
13. (a) What are the hazards of conditional branches in pipelines? How can they be resolved? (16)
Or
(b) Explain the superscalar operations with a neat diagram. (16)
mapped? (16)
Or
(b) Explain the organization and accessing of data on a disk. (16)
15. (a) (i) How can data transfers be controlled using the handshaking technique? (8)
(ii) Explain the protocols of USB. (8)
Or
Answer all questions.
PART A - (10 x 2 = 20 marks)
1. What is SPEC? Specify the formula for SPEC rating.
sy E
2. What is relative addressing mode? When is it used?
3. Write the register transfer sequence for storing a word in
memory.
4. What is hard-wired control? How is it different from micro-programmed control?
5. What is meant by data and control hazards in
pipelining?
6. What is meant by speculative execution?
7. What is meant by an interleaved memory?
8. An address space is specified by 24 bits and the corresponding memory space by 16 bits. How many words are in the (a) virtual memory (b) main memory?
9. Specify the different I/O transfer mechanisms available.
10. What does an isochronous data stream mean?
PART B - (5 x 16 = 80 marks)
11. (a) (i) What are addressing modes? Explain the various addressing modes with examples. (8)
(ii) Derive and explain an algorithm for adding and subtracting 2 floating point binary numbers. (8)
Or
(b) (i) Explain instruction sequencing in detail. (10)
(ii) Differentiate RISC and CISC architectures. (6)
12. (a) (i) With a neat diagram explain the internal organization of a
processor.
(6)
(ii) Explain the design of hardwired control unit.
(8)
13. (a) (i) Discuss the basic concepts of pipelining. (8)
(ii) Describe the data path and control considerations for pipelining. (8)
Or
(b) Describe the techniques for handling data and instruction hazards
in pipelining. (16)
14. (a) (i) Explain synchronous DRAM technology in detail. (8)
(ii) In a cache-based memory system using FIFO for cache page replacement, it is found that the cache hit ratio H is low. The following proposals are made for increasing it:
(1) Increase the cache page size.
(2) Increase the cache storage capacity.
(3) Increase the main memory capacity.
(4) Replace the FIFO replacement policy by LRU.
Analyse each proposal to determine its probable impact. (8)
Or
(b) (i) Explain the various mapping techniques associated with cache memories. (10)
(ii) Explain a method of translating a virtual address to a physical address. (6)
15. (a) Explain the following:
(i) Interrupt priority schemes. (8)
(ii) DMA. (8)
Or
(b) Write an elaborate note on PCI, SCSI and USB bus standards. (16)